catalog Documentation - Complete Reference Guide

Command Line Reference

Complete guide to all catalog command-line options and parameters

Basic Syntax

catalog [OPTIONS]

Generate llms.txt, llms-full.txt, and related files from Markdown/HTML directories.

Core Options

--input, -i <path>

Source directory of Markdown/HTML files (default: current directory)

catalog --input docs --output build

--output, -o <path>

Destination directory for generated files (default: current directory)

catalog --input docs --output build

--base-url <url>

Base URL for generating absolute links in output files

catalog --base-url https://docs.example.com

--silent

Suppress non-error output for automation

catalog --input docs --output build --silent

Content Selection

--include <pattern>

Include files matching glob pattern (can be used multiple times)

catalog --include "*.md" --include "guides/*.html"

--exclude <pattern>

Exclude files matching glob pattern (can be used multiple times)

catalog --exclude "**/*draft*" --exclude "temp/*"

--optional <pattern>

Mark files matching glob pattern as optional (can be used multiple times)

catalog --optional "drafts/**/*" --optional "**/CHANGELOG.md"
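
How the three pattern options might combine can be sketched in Python. This is an illustration under stated assumptions, not catalog's implementation: the function name `classify` and the precedence shown (exclude wins over include, optional is checked last) are hypothetical, and Python's `fnmatchcase` lets `*` match `/`, so results may differ from catalog's glob matching in edge cases.

```python
from fnmatch import fnmatchcase


def classify(path, include=("*",), exclude=(), optional=()):
    """Return 'skipped', 'optional', or 'core' for a relative POSIX path.

    Hypothetical helper: the precedence below is an assumption, not
    documented catalog behavior.
    """
    if not any(fnmatchcase(path, pattern) for pattern in include):
        return "skipped"  # matched no --include pattern
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return "skipped"  # an --exclude pattern wins over --include
    if any(fnmatchcase(path, pattern) for pattern in optional):
        return "optional"  # kept, but marked as supplementary
    return "core"
```

For example, `classify("docs/CHANGELOG.md", optional=("**/CHANGELOG.md",))` classifies the changelog as optional, while a file that matches no optional or exclude pattern stays core.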

Output Generation

--validate

Validate that the generated llms.txt complies with the llms.txt standard

catalog --input docs --output build --validate

--index

Generate index.json files for directory navigation and metadata

catalog --input docs --output build --index

--sitemap

Generate XML sitemap for search engines (requires --base-url)

catalog --sitemap --base-url https://docs.example.com

--sitemap-no-extensions

Generate sitemap URLs without file extensions for clean URLs

catalog --sitemap --sitemap-no-extensions --base-url https://docs.example.com

Complete Example Workflows

🤖 AI Training Pipeline

catalog --input docs --output ai-training \
  --optional "examples/**/*" \
  --optional "appendix/**/*" \
  --validate

Creates AI-optimized documentation with essential content prioritized and supplementary material marked as optional.

🌐 Documentation Website

catalog --input docs --output build \
  --base-url https://docs.example.com \
  --sitemap --sitemap-no-extensions \
  --index --validate

Complete documentation site preparation with SEO optimization, navigation metadata, and standards compliance.

🔄 CI/CD Integration

catalog --input docs --output dist \
  --validate \
  --silent

Automated documentation processing with validation and silent operation for continuous integration pipelines.

llms.txt Standard Compliance

Understanding the llms.txt format and catalog's compliance features

🏆 Complete Standard Implementation

catalog provides full compliance with the llms.txt standard for AI-ready documentation indexing:

✅ H1 → Blockquote → Sections Format: Proper structure with title, description, and organized sections
✅ Section Hierarchy Validation: Ensures correct H2 section organization and ordering
✅ Markdown Link Syntax Compliance: Validates proper link formatting and descriptions
✅ Intelligent Document Ordering: Lists index files and introductory documentation before other content
✅ Path-Based Section Generation: Automatic organization using directory structure
✅ Optional Content Categorization: Separates core and supplementary content

📋 Standard Format Structure

The llms.txt format follows a specific structure, which catalog implements:

Required Format Elements

# Project Title
> Brief project description

## Section Name
- [file.md](file.md) - Optional file description
- [another-file.md](another-file.md) - Another description

## Another Section  
- [section/file.md](section/file.md) - Organized by directory
- [section/other.md](section/other.md) - Maintains structure

## Optional
- [drafts/future.md](drafts/future.md) - Supplementary content
- [archive/old.md](archive/old.md) - Historical documentation
                            

Format Rules

H1 Title: Single project title at the top
Blockquote Description: Brief project description following the title
H2 Sections: Organized content sections with meaningful names
Markdown Links: Proper link syntax with optional descriptions
Optional Section: Separate section for supplementary content
Consistent Structure: Maintains organization and readability
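
The format rules above can be checked mechanically. The following is a minimal sketch, not catalog's validator: the function name `check_structure` and the problem messages are hypothetical, and the rule that the Optional section must come last is an assumption drawn from the example layout.

```python
def check_structure(text):
    """Return a list of structural problems; an empty list means none found.

    Hypothetical helper illustrating the format rules, not catalog's code.
    """
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    problems = []
    if not lines or not lines[0].startswith("# "):
        problems.append("missing H1 title on the first line")
    if len(lines) < 2 or not lines[1].startswith("> "):
        problems.append("missing blockquote description after the title")
    if sum(ln.startswith("# ") for ln in lines) > 1:
        problems.append("more than one H1 title")
    sections = [ln[3:].strip() for ln in lines if ln.startswith("## ")]
    if not sections:
        problems.append("no H2 sections")
    # Assumption: supplementary content belongs in a final "Optional" section.
    if "Optional" in sections[:-1]:
        problems.append("Optional section is not last")
    return problems
```

A file that follows the layout shown under Required Format Elements produces an empty list.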

✅ Validation Features

catalog includes comprehensive validation to ensure your output meets the standard:

Structure Validation

# Validates proper H1 → blockquote → sections format
catalog --validate

# Example validation output:
✅ H1 title found: "Documentation Project"
✅ Blockquote description found
✅ Proper section hierarchy (H2 sections)
✅ Valid Markdown link syntax
✅ Appropriate content organization
                                

Link Format Checking

# Ensures all links use proper Markdown syntax
❌ Error: Invalid link format found
   Line 15: [file.md] - Missing parentheses
   Should be: [file.md](file.md) - Description

✅ Suggestion: Use proper Markdown link syntax
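
A link check of this kind can be approximated with a regular expression. The sketch below is an assumption about how such a check could work, not catalog's code: `find_bad_links` and `LINK_ITEM` are hypothetical names, and the pattern only accepts the `- [label](target) - description` shape used in this guide.

```python
import re

# Expected shape of a list item: - [label](target) - optional description
LINK_ITEM = re.compile(r"^- \[[^\]]+\]\([^)\s]+\)(?: - .+)?$")


def find_bad_links(text):
    """Return (line_number, line) pairs for list items that are not valid links.

    Hypothetical helper; line numbers are 1-based to match error reports.
    """
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("- ") and not LINK_ITEM.match(stripped):
            bad.append((number, stripped))
    return bad
```

Each returned pair carries the line number, so a caller can print a message similar to the error shown above.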
                                

URL Validation

# When using --base-url, validates absolute URLs
catalog --base-url https://docs.example.com --validate

✅ All URLs properly formatted with base URL
✅ No broken or malformed links detected
✅ Consistent URL structure maintained
                                

File Processing and Content Extraction

How catalog processes different file types and extracts metadata

📝 Supported File Types

📄 Markdown Files (.md, .mdx)

Full support for Markdown with YAML frontmatter extraction and content processing.

YAML Frontmatter, Title Extraction, Content Processing

🌐 HTML Files (.html)

Automatic conversion to Markdown with meta tag extraction and content cleaning.

HTML to Markdown, Meta Tag Extraction, Content Cleaning

🧠 Intelligent Document Ordering

catalog orders documents by the following priorities:

1. Index/Root Files: Prioritizes index.md, readme.md, and home.md files, in that order (index.md > readme.md > home.md).
2. Important Documentation: Files containing keywords like catalog, tutorial, intro, getting-started. Tutorial content gets higher priority.
3. Path-Based Sections: Automatic organization by directory structure (e.g., api/, guides/). Maintains logical content grouping.
4. Alphabetical Fallback: Within sections, files are sorted alphabetically for consistent organization. Ensures predictable structure.
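
One way to express these priorities is a sort key. The sketch below is an interpretation, not catalog's actual ordering code: `sort_key`, `ROOT_NAMES`, and `KEYWORDS` are hypothetical names, keywords are matched against the file name only, and sorting the remaining files by path stands in for both the directory grouping and the alphabetical fallback.

```python
from pathlib import PurePosixPath

ROOT_NAMES = ["index", "readme", "home"]  # priority 1, in this order
KEYWORDS = ["catalog", "tutorial", "intro", "getting-started"]  # priority 2


def sort_key(path):
    """Sort key for a relative POSIX path (hypothetical helper).

    Tuple layout: (tier, rank within tier, directory depth, lowercase path).
    """
    stem = PurePosixPath(path).stem.lower()
    depth = path.count("/")
    if stem in ROOT_NAMES:
        return (0, ROOT_NAMES.index(stem), depth, path.lower())
    if any(word in stem for word in KEYWORDS):
        return (1, 0, depth, path.lower())
    # Everything else: grouped by directory, then alphabetical, via the path.
    return (2, 0, depth, path.lower())
```

Calling `sorted(paths, key=sort_key)` on a list of relative paths places index files first, keyword matches next, and the remaining files grouped by directory.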

🔍 Metadata Extraction

catalog automatically extracts metadata from multiple sources:

📋 YAML Frontmatter

---
title: "Getting Started Guide"
description: "Complete setup instructions"
date: 2024-01-15
author: "Documentation Team"
---

# Getting Started

Content here...
                                

Extracts title, description, and other metadata while preserving content structure.
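
Frontmatter extraction of this kind can be sketched without a YAML library. The helper below is a simplified illustration, not catalog's parser: `split_frontmatter` is a hypothetical name, and it only understands flat `key: value` pairs, so nested YAML, lists, and multi-line values are out of scope.

```python
def split_frontmatter(text):
    """Return (metadata, body) for a Markdown document.

    Hypothetical helper: handles only flat `key: value` frontmatter lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block
    try:
        end = next(i for i, ln in enumerate(lines[1:], start=1) if ln.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated block: treat everything as content
    metadata = {}
    for ln in lines[1:end]:
        key, sep, value = ln.partition(":")
        if sep:
            # Drop surrounding whitespace and simple quotes from the value.
            metadata[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return metadata, body
```

Applied to the example above, the metadata would contain the title, description, date, and author as strings, and the body would start at the `# Getting Started` heading.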

🌐 HTML Meta Tags

<head>
  <title>API Documentation</title>
  <meta name="description" content="Complete API reference">
  <meta name="author" content="API Team">
</head>

Processes HTML meta tags for title, description, and SEO information.
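
Meta tag extraction can be illustrated with Python's standard `html.parser` module. This is a sketch of the general technique rather than catalog's implementation; the class `HeadMetadata` and the function `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Collects <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks, so accumulate it.
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html):
    """Return a dict of title and named meta tags found in the HTML string."""
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

For the `<head>` example above, the result would hold the title plus the `description` and `author` values.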

📂 Site Metadata Detection

# Automatically detects from root index files
# Uses directory name as fallback
# Extracts site-wide description and title

Intelligent detection of site-wide metadata from root documentation files.

🧹 Content Processing Pipeline

catalog processes files in a four-stage pipeline:

📂 Discovery & Scanning: Recursive directory traversal with pattern matching and security validation
→ 🔍 Content Extraction: YAML frontmatter stripping, HTML processing, and metadata extraction
→ 📊 Organization: Intelligent ordering, section generation, and content categorization
→ 📋 Generation: Multiple output formats with validation and optimization

Output Formats and Features

Comprehensive guide to all output formats generated by catalog

📋 llms.txt (Structured Index)

Standard-compliant structured index in H1 → blockquote → sections format, clearly organized for use in AI context windows.

# Documentation Project
> Complete API and user guide documentation

## Core Documentation
- [index.md](index.md) - Project overview and introduction
- [getting-started.md](getting-started.md) - Quick start guide

## API Reference
- [api/authentication.md](api/authentication.md) - Authentication methods
- [api/endpoints.md](api/endpoints.md) - API endpoints reference

## Optional
- [drafts/future-plans.md](drafts/future-plans.md) - Future development plans
                            

Standard Compliant, AI Optimized, Hierarchical

📚 llms-full.txt (Complete Content)

Complete concatenated content with clear separators for comprehensive AI analysis and training data preparation.

# Documentation Project
> Complete API and user guide documentation

## index.md
# Welcome to Our Documentation
[Complete content with frontmatter stripped]

---
## getting-started.md
# Getting Started Guide
[Full content continues...]

---
[Content continues for all files...]
                            

Full Content, Training Data, Separated
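
The concatenated layout shown above can be reproduced with a few lines of Python. This is a sketch based only on the example output, not on catalog's source: `build_llms_full` is a hypothetical name, and it assumes frontmatter has already been stripped from each body.

```python
def build_llms_full(title, description, documents):
    """Concatenate documents into the llms-full.txt layout.

    Hypothetical helper. `documents` is a list of (relative_path, body)
    pairs whose bodies no longer contain frontmatter.
    """
    header = [f"# {title}", f"> {description}", ""]
    sections = [f"## {path}\n{body.strip()}" for path, body in documents]
    # A blank line and a `---` rule separate consecutive files.
    return "\n".join(header + ["\n\n---\n".join(sections)]) + "\n"
```

Each file becomes an H2 heading named after its path, followed by its content.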

🎯 llms-ctx.txt (Context-Only)

Structured index without optional sections, optimized for AI systems with limited context windows.

# Documentation Project
> Complete API and user guide documentation

## Core Documentation
- [index.md](index.md) - Project overview and introduction
- [getting-started.md](getting-started.md) - Quick start guide

## API Reference
- [api/authentication.md](api/authentication.md) - Authentication methods
- [api/endpoints.md](api/endpoints.md) - API endpoints reference

# Note: Optional sections excluded for context optimization
                            

Context Limited, Essential Only, Optimized
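
Conceptually, the context-only file is the structured index with its Optional section removed. The sketch below shows that transformation on an llms.txt string; `drop_optional` is a hypothetical name, and catalog may well build the file differently, for example directly from the file list.

```python
def drop_optional(llms_txt):
    """Return the index text without its `## Optional` section.

    Hypothetical helper operating on already generated llms.txt content.
    """
    kept, skipping = [], False
    for line in llms_txt.splitlines():
        if line.startswith("## "):
            # Start skipping at the Optional heading, stop at the next H2.
            skipping = line[3:].strip().lower() == "optional"
        if not skipping:
            kept.append(line)
    return "\n".join(kept).rstrip() + "\n"
```

All other sections pass through unchanged, so the title, description, and core links are preserved.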

🗺️ sitemap.xml (SEO Optimization)

XML sitemap with intelligent priority assignment and change frequency detection for search engine optimization.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/</loc>
    <lastmod>2024-01-15T10:30:00Z</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
                            

SEO Optimized, Smart Priorities, Clean URLs
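
The effect of --base-url and --sitemap-no-extensions on sitemap URLs can be sketched as follows. This is an assumption-laden illustration, not catalog's generator: `page_url` and `build_sitemap` are hypothetical names, mapping index files to their directory URL is a guess based on the `<loc>` in the example, and `lastmod`, `changefreq`, and `priority` are left out.

```python
import xml.etree.ElementTree as ET
from posixpath import splitext

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def page_url(base_url, path, strip_extensions=False):
    """Build an absolute URL for a relative file path (hypothetical helper)."""
    if strip_extensions:
        path = splitext(path)[0]
        # Assumption: index pages are served at their directory URL.
        if path == "index" or path.endswith("/index"):
            path = path[: -len("index")]
    return base_url.rstrip("/") + "/" + path


def build_sitemap(base_url, paths, strip_extensions=False):
    """Return a minimal sitemap document containing only <loc> entries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in paths:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page_url(base_url, path, strip_extensions)
    return ET.tostring(urlset, encoding="unicode")
```

With extension stripping enabled, `api/endpoints.md` would map to `https://docs.example.com/api/endpoints` under these assumptions.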

📊 index.json (Navigation Metadata)

Comprehensive directory and file metadata for programmatic navigation and content management.

{
  "directory": ".",
  "generated": "2024-01-15T10:30:00Z",
  "files": [
    {
      "name": "index.md",
      "path": "index.md", 
      "size": 1234,
      "modified": "2024-01-15T10:30:00Z",
      "type": "md",
      "isMarkdown": true
    }
  ],
  "summary": {
    "totalFiles": 5,
    "markdownFiles": 3,
    "totalSize": 12543
  }
}
                            

Metadata Rich, Navigation, Statistics
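
A structure like the one above can be assembled from a directory listing. The following sketch mirrors the field names in the example but is not catalog's code: `build_index` is a hypothetical name, only files directly inside the directory are listed, and treating `.md` and `.mdx` as Markdown follows the supported file types described earlier.

```python
from datetime import datetime, timezone
from pathlib import Path


def _iso(timestamp):
    """Format a POSIX timestamp as UTC in the style used by the example."""
    return datetime.fromtimestamp(timestamp, timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_index(directory):
    """Return index metadata for the files directly inside `directory`.

    Hypothetical helper; field names follow the index.json example.
    """
    files = []
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue
        stat = entry.stat()
        ext = entry.suffix.lstrip(".").lower()
        files.append({
            "name": entry.name,
            "path": entry.name,
            "size": stat.st_size,
            "modified": _iso(stat.st_mtime),
            "type": ext,
            "isMarkdown": ext in {"md", "mdx"},
        })
    return {
        "directory": ".",
        "generated": _iso(datetime.now(timezone.utc).timestamp()),
        "files": files,
        "summary": {
            "totalFiles": len(files),
            "markdownFiles": sum(f["isMarkdown"] for f in files),
            "totalSize": sum(f["size"] for f in files),
        },
    }
```

The returned dictionary can be serialized with `json.dumps(..., indent=2)` to produce a file in the same shape as the example.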

Enterprise Features

Advanced capabilities for large-scale documentation processing

🛡️ Security and Validation

Path Traversal Prevention

Blocks ../ sequences and validates all file paths to prevent directory traversal attacks.
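
A common way to implement this kind of check is to resolve each candidate path and confirm it still lies under the input directory. The sketch below shows that general technique with Python's `pathlib`; `resolve_inside` is a hypothetical name and the code is not taken from catalog.

```python
from pathlib import Path


def resolve_inside(root, candidate):
    """Resolve `candidate` relative to `root`, rejecting paths that escape it.

    Hypothetical helper illustrating a containment check.
    """
    root = Path(root).resolve()
    target = (root / candidate).resolve()  # collapses ../ and follows symlinks
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes input directory: {candidate}")
    return target
```

Resolving before comparing matters: a plain string check for `../` would miss absolute paths and symlinks that point outside the directory.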

Input Sanitization

All user inputs are validated and sanitized to prevent injection attacks and ensure safe processing.

Content Scanning

Detects malicious patterns, suspicious URLs, and potentially harmful content during processing.

File Size Limits

Configurable limits prevent processing of extremely large files that could cause memory issues.

📊 Performance Monitoring

Real-Time Performance Tracking

📊 Performance Report:
  Total Time: 147ms
  Memory Usage:
    Heap Used: 12.45MB
    RSS: 89.23MB
  Memory Delta:
    Heap: +2.1MB
    RSS: +5.7MB
  Operations:
    file_scanning: 23ms
    content_processing: 89ms
    sitemap_generation: 12ms
    files_processed: 42
    total_file_size: 2.3MB
                                

Detailed timing for all major operations
Memory usage monitoring and optimization
Processing statistics and bottleneck identification
Concurrent processing utilities for large document sets

🔧 Error Handling and Recovery

Actionable Error Messages

❌ Error in file processing: Permission denied: /protected/file.md

Details:
  EACCES: permission denied

Suggestions:
  → Check file permissions
  → Ensure the directory is not locked by another process
  → Try running with appropriate permissions