catalog Documentation

Complete reference guide for AI-ready documentation indexing
Learn how to use catalog to generate llms.txt standard-compliant indexes, SEO-optimized sitemaps, and enterprise-grade documentation output.

Command Line Reference

Complete guide to all catalog command-line options and parameters

Basic Syntax

catalog [OPTIONS]

Generate llms.txt, llms-full.txt, and related files from Markdown/HTML directories.

Core Options

--input, -i <path>

Source directory of Markdown/HTML files (default: current directory)

catalog --input docs --output build

--output, -o <path>

Destination directory for generated files (default: current directory)

catalog --input docs --output build

--base-url <url>

Base URL for generating absolute links in output files

catalog --base-url https://docs.example.com

--silent

Suppress non-error output for automation

catalog --input docs --output build --silent

Content Selection

--include <pattern>

Include files matching glob pattern (can be used multiple times)

catalog --include "*.md" --include "guides/*.html"

--exclude <pattern>

Exclude files matching glob pattern (can be used multiple times)

catalog --exclude "**/*draft*" --exclude "temp/*"

--optional <pattern>

Mark files matching glob pattern as optional (can be used multiple times)

catalog --optional "drafts/**/*" --optional "**/CHANGELOG.md"
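To make the interaction of the three pattern options concrete, here is a minimal Python sketch of how a path could be classified. It is an illustration, not catalog's implementation: the function name `classify` and the precedence (include first, then exclude, then optional) are assumptions, and Python's `fnmatch` lets `*` cross directory separators, which may differ from catalog's own glob matcher.

```python
from fnmatch import fnmatch


def classify(path, include=(), exclude=(), optional=()):
    """Return 'excluded', 'optional', or 'core' for a relative path.

    Hypothetical helper: the precedence order below is an assumption.
    """
    # If include patterns are given, a file must match at least one of them.
    if include and not any(fnmatch(path, pattern) for pattern in include):
        return "excluded"
    # Exclude patterns win over optional patterns in this sketch.
    if any(fnmatch(path, pattern) for pattern in exclude):
        return "excluded"
    if any(fnmatch(path, pattern) for pattern in optional):
        return "optional"
    return "core"
```

For example, `classify("drafts/v2/plan.md", optional=["drafts/**/*"])` returns `"optional"` under these assumptions.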

Output Generation

--validate

Validate that the generated llms.txt complies with the llms.txt standard

catalog --input docs --output build --validate

--index

Generate index.json files for directory navigation and metadata

catalog --input docs --output build --index

--sitemap

Generate XML sitemap for search engines (requires --base-url)

catalog --sitemap --base-url https://docs.example.com

--sitemap-no-extensions

Generate sitemap URLs without file extensions for clean URLs

catalog --sitemap --sitemap-no-extensions --base-url https://docs.example.com
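The effect of combining --base-url with --sitemap-no-extensions can be pictured with a short Python sketch. The helper `sitemap_url` is hypothetical and only shows the general idea of joining a base URL with a relative path and optionally dropping the file extension; how catalog maps index files to directory URLs is not covered here.

```python
from pathlib import PurePosixPath


def sitemap_url(base_url, rel_path, strip_extension=False):
    """Build an absolute URL for a documentation file (illustrative only)."""
    path = PurePosixPath(rel_path)
    if strip_extension:
        # Drop the final suffix, e.g. guides/setup.md -> guides/setup
        path = path.with_suffix("")
    return base_url.rstrip("/") + "/" + str(path)
```

With these assumptions, `guides/setup.md` becomes `https://docs.example.com/guides/setup` when extensions are stripped.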

Complete Example Workflows

🤖 AI Training Pipeline

catalog --input docs --output ai-training \
  --optional "examples/**/*" \
  --optional "appendix/**/*" \
  --validate

Creates AI-optimized documentation with essential content prioritized and supplementary material marked as optional.

🌐 Documentation Website

catalog --input docs --output build \
  --base-url https://docs.example.com \
  --sitemap --sitemap-no-extensions \
  --index --validate

Complete documentation site preparation with SEO optimization, navigation metadata, and standards compliance.

🔄 CI/CD Integration

catalog --input docs --output dist \
  --validate \
  --silent

Automated documentation processing with validation and silent operation for continuous integration pipelines.
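In a pipeline driven from Python, the command above could be wrapped as in the sketch below. The helper name `run_docs_step` is hypothetical, and the sketch assumes the usual convention that exit code 0 means success and any other value means failure; check catalog's actual exit codes before relying on it.

```python
import subprocess
import sys


def run_docs_step(command):
    """Run a command and return (succeeded, stderr_text).

    Assumes exit code 0 means success; this is a convention,
    not something confirmed for catalog specifically.
    """
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.stderr.strip()


# Possible usage in a build script (requires catalog on PATH):
# ok, errors = run_docs_step(
#     ["catalog", "--input", "docs", "--output", "dist", "--validate", "--silent"]
# )
# if not ok:
#     sys.exit(errors or 1)
```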

llms.txt Standard Compliance

Understanding the llms.txt format and catalog's enterprise-grade compliance features

🏆 Complete Standard Implementation

catalog provides full compliance with the llms.txt standard for AI-ready documentation indexing:

H1 → Blockquote → Sections Format

Proper structure with title, description, and organized sections

Section Hierarchy Validation

Ensures correct H2 section organization and ordering

Markdown Link Syntax Compliance

Validates proper link formatting and descriptions

Intelligent Document Ordering

Prioritizes important documentation with smart organization

Path-Based Section Generation

Automatic organization using directory structure

Optional Content Categorization

Separates core and supplementary content appropriately

📋 Standard Format Structure

The llms.txt format follows a specific structure, which catalog's output adheres to:

Required Format Elements

# Project Title

> Brief project description

## Section Name

- [file.md](file.md) - Optional file description
- [another-file.md](another-file.md) - Another description

## Another Section

- [section/file.md](section/file.md) - Organized by directory
- [section/other.md](section/other.md) - Maintains structure

## Optional

- [drafts/future.md](drafts/future.md) - Supplementary content
- [archive/old.md](archive/old.md) - Historical documentation

Format Rules

  • H1 Title: Single project title at the top
  • Blockquote Description: Brief project description following the title
  • H2 Sections: Organized content sections with meaningful names
  • Markdown Links: Proper link syntax with optional descriptions
  • Optional Section: Separate section for supplementary content
  • Consistent Structure: Maintains organization and readability
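The rules above can be sketched as a small Python renderer. This is an illustration under stated assumptions, not catalog's code: `render_llms_txt` is a hypothetical name, and the ` - ` separator before descriptions simply mirrors the example shown in this guide.

```python
def render_llms_txt(title, description, sections):
    """Render an llms.txt-style index.

    `sections` maps a section name to a list of (path, description) pairs;
    an empty description is omitted from the link line.
    """
    out = [f"# {title}", "", f"> {description}"]
    for name, entries in sections.items():
        out += ["", f"## {name}", ""]
        for path, desc in entries:
            suffix = f" - {desc}" if desc else ""
            out.append(f"- [{path}]({path}){suffix}")
    return "\n".join(out) + "\n"
```

Because Python dictionaries preserve insertion order, placing the "Optional" section last in `sections` keeps it at the end of the output.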

✅ Validation Features

catalog includes comprehensive validation to ensure your output meets the standard:

Structure Validation

# Validates proper H1 → blockquote → sections format
catalog --validate

# Example validation output:
✅ H1 title found: "Documentation Project"
✅ Blockquote description found
✅ Proper section hierarchy (H2 sections)
✅ Valid Markdown link syntax
✅ Appropriate content organization

Link Format Checking

# Ensures all links use proper Markdown syntax
❌ Error: Invalid link format found
   Line 15: [file.md] - Missing parentheses
   Should be: [file.md](file.md) - Description
✅ Suggestion: Use proper Markdown link syntax

URL Validation

# When using --base-url, validates absolute URLs
catalog --base-url https://docs.example.com --validate

✅ All URLs properly formatted with base URL
✅ No broken or malformed links detected
✅ Consistent URL structure maintained
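As a rough picture of what structure and link checks like these involve, the following Python sketch tests an llms.txt string for an H1 title, a blockquote description, at least one H2 section, and well-formed list links. The function name, the regular expression, and the message wording are all assumptions made for illustration; catalog's validator may check more or differently.

```python
import re

# A list item must look like: - [text](target) with an optional " - description"
LINK_RE = re.compile(r"^- \[[^\]]+\]\([^)\s]+\)(?: - .+)?$")


def check_structure(text):
    """Return a list of problems found in an llms.txt string (empty if none)."""
    problems = []
    content = [line.rstrip() for line in text.splitlines() if line.strip()]
    if not content or not content[0].startswith("# "):
        problems.append("missing H1 title on first line")
    if len(content) < 2 or not content[1].startswith("> "):
        problems.append("missing blockquote description after title")
    if not any(line.startswith("## ") for line in content):
        problems.append("no H2 sections found")
    # Report link problems with 1-based line numbers from the original text.
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("- ") and not LINK_RE.match(line.rstrip()):
            problems.append(f"line {number}: invalid link format")
    return problems
```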

File Processing and Content Extraction

How catalog intelligently processes different file types and extracts metadata

📝 Supported File Types

📄 Markdown Files (.md, .mdx)

Full support for Markdown with YAML frontmatter extraction and content processing.

YAML Frontmatter · Title Extraction · Content Processing

🌐 HTML Files (.html)

Automatic conversion to Markdown with meta tag extraction and content cleaning.

HTML to Markdown · Meta Tag Extraction · Content Cleaning

🧠 Intelligent Document Ordering

catalog orders documents using the following rules, applied in priority order:

1. Index/Root Files

Prioritizes index.md, readme.md, home.md files

index.md > readme.md > home.md

2. Important Documentation

Files containing keywords like catalog, tutorial, intro, getting-started

Tutorial content gets higher priority

3. Path-Based Sections

Automatic organization by directory structure (e.g., api/, guides/)

Maintains logical content grouping

4. Alphabetical Fallback

Within sections, files are sorted alphabetically for consistent organization

Ensures predictable structure
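The four ordering rules can be approximated with a single sort key, as in the Python sketch below. It is a simplified illustration: `order_key` is a hypothetical name, and it assumes keywords are matched against the file name (not the file contents) and that all keywords share one priority tier.

```python
from pathlib import PurePosixPath

ROOT_NAMES = ["index", "readme", "home"]  # rule 1, in priority order
KEYWORDS = ["catalog", "tutorial", "intro", "getting-started"]  # rule 2


def order_key(rel_path):
    """Sort key: root files, then keyword files, then directory, then name."""
    path = PurePosixPath(rel_path)
    stem = path.stem.lower()
    if stem in ROOT_NAMES:
        tier = (0, ROOT_NAMES.index(stem))
    elif any(keyword in stem for keyword in KEYWORDS):
        tier = (1, 0)
    else:
        tier = (2, 0)
    # Rules 3 and 4: group by directory, then sort alphabetically by file name.
    return (tier, str(path.parent), path.name.lower())
```

Calling `sorted(paths, key=order_key)` then yields index files first, keyword matches next, and the remaining files grouped by directory.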

🔍 Metadata Extraction

catalog automatically extracts metadata from multiple sources, including YAML frontmatter in Markdown files and meta tags in HTML files.

🧹 Content Processing Pipeline

catalog processes content in four stages:

📂 Discovery & Scanning

Recursive directory traversal with pattern matching and security validation

🔍 Content Extraction

YAML frontmatter stripping, HTML processing, and metadata extraction

📊 Organization

Intelligent ordering, section generation, and content categorization

📋 Generation

Multiple output formats with validation and optimization
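As one concrete piece of the content extraction stage, frontmatter stripping can be sketched in a few lines of Python. This is a minimal illustration assuming the common convention that frontmatter is delimited by `---` lines at the very top of the file; `strip_frontmatter` is a hypothetical name and the YAML itself is not parsed here.

```python
def strip_frontmatter(text):
    """Remove a leading '---' delimited frontmatter block, if present."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return text  # no frontmatter at the top of the file
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Keep everything after the closing delimiter.
            return "".join(lines[i + 1:]).lstrip("\n")
    return text  # unclosed block: leave the text untouched
```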

Output Formats and Features

Comprehensive guide to all output formats generated by catalog

📋 llms.txt (Structured Index)

Standard-compliant structured index with H1 → blockquote → sections format, perfect for AI context windows with clear organization.

# Documentation Project

> Complete API and user guide documentation

## Core Documentation

- [index.md](index.md) - Project overview and introduction
- [getting-started.md](getting-started.md) - Quick start guide

## API Reference

- [api/authentication.md](api/authentication.md) - Authentication methods
- [api/endpoints.md](api/endpoints.md) - API endpoints reference

## Optional

- [drafts/future-plans.md](drafts/future-plans.md) - Future development plans

Standard Compliant · AI Optimized · Hierarchical
📚 llms-full.txt (Complete Content)

Complete concatenated content with clear separators for comprehensive AI analysis and training data preparation.

# Documentation Project

> Complete API and user guide documentation

## index.md

# Welcome to Our Documentation

[Complete content with frontmatter stripped]

---

## getting-started.md

# Getting Started Guide

[Full content continues...]

---

[Content continues for all files...]

Full Content · Training Data · Separated
🎯 llms-ctx.txt (Context-Only)

Structured index without optional sections, optimized for AI systems with limited context windows.

# Documentation Project

> Complete API and user guide documentation

## Core Documentation

- [index.md](index.md) - Project overview and introduction
- [getting-started.md](getting-started.md) - Quick start guide

## API Reference

- [api/authentication.md](api/authentication.md) - Authentication methods
- [api/endpoints.md](api/endpoints.md) - API endpoints reference

# Note: Optional sections excluded for context optimization

Context Limited · Essential Only · Optimized
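The relationship between llms.txt and llms-ctx.txt can be illustrated with a Python sketch that drops the "Optional" section from an index. This is an assumption about how the two files relate, based only on the description above; `drop_optional` is a hypothetical helper, not part of catalog.

```python
def drop_optional(llms_txt):
    """Return the index text without its '## Optional' section."""
    kept, skipping = [], False
    for line in llms_txt.splitlines():
        if line.startswith("## "):
            # Start skipping at the Optional heading; stop at the next H2.
            skipping = line[3:].strip().lower() == "optional"
        if not skipping:
            kept.append(line)
    return "\n".join(kept).rstrip() + "\n"
```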
🗺️ sitemap.xml (SEO Optimization)

XML sitemap with intelligent priority assignment and change frequency detection for search engine optimization.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/</loc>
    <lastmod>2024-01-15T10:30:00Z</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>

SEO Optimized · Smart Priorities · Clean URLs
📊 index.json (Navigation Metadata)

Comprehensive directory and file metadata for programmatic navigation and content management.

{
  "directory": ".",
  "generated": "2024-01-15T10:30:00Z",
  "files": [
    {
      "name": "index.md",
      "path": "index.md",
      "size": 1234,
      "modified": "2024-01-15T10:30:00Z",
      "type": "md",
      "isMarkdown": true
    }
  ],
  "summary": {
    "totalFiles": 5,
    "markdownFiles": 3,
    "totalSize": 12543
  }
}

Metadata Rich · Navigation · Statistics
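A consumer of index.json might read it as in the Python sketch below, which recomputes summary figures from the `files` array. The field names are taken from the example above; the helper `summarize_index` is hypothetical, and it assumes every file entry carries `size` and `isMarkdown`.

```python
import json


def summarize_index(index_json):
    """Recompute summary statistics from the 'files' array of an index.json string."""
    data = json.loads(index_json)
    files = data["files"]
    return {
        "totalFiles": len(files),
        "markdownFiles": sum(1 for entry in files if entry.get("isMarkdown")),
        "totalSize": sum(entry["size"] for entry in files),
    }
```

Comparing the result with the `summary` object in the file is one simple way to sanity-check a generated index.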

Enterprise Features

Advanced capabilities for large-scale documentation processing

🛡️ Security and Validation

Path Traversal Prevention

Blocks ../ sequences and validates all file paths to prevent directory traversal attacks.
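A common way to implement this kind of check is to resolve the candidate path and confirm it still lies under the input root. The Python sketch below shows that general technique; it is not catalog's code, and `is_inside` is a hypothetical name.

```python
from pathlib import Path


def is_inside(root, candidate):
    """Return True if `candidate`, resolved against `root`, stays within `root`."""
    root_path = Path(root).resolve()
    # resolve() collapses ../ segments and follows symlinks where they exist.
    target = (root_path / candidate).resolve()
    try:
        target.relative_to(root_path)
    except ValueError:
        return False
    return True
```

Resolving before comparing matters: a plain string check for `../` would miss absolute paths and symlinks that point outside the root.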

Input Sanitization

All user inputs are validated and sanitized to prevent injection attacks and ensure safe processing.

Content Scanning

Detects malicious patterns, suspicious URLs, and potentially harmful content during processing.

File Size Limits

Configurable limits prevent processing of extremely large files that could cause memory issues.

📊 Performance Monitoring

Real-Time Performance Tracking

📊 Performance Report:
Total Time: 147ms
Memory Usage:
  Heap Used: 12.45MB
  RSS: 89.23MB
Memory Delta:
  Heap: +2.1MB
  RSS: +5.7MB
Operations:
  file_scanning: 23ms
  content_processing: 89ms
  sitemap_generation: 12ms
  files_processed: 42
  total_file_size: 2.3MB

  • Detailed timing for all major operations
  • Memory usage monitoring and optimization
  • Processing statistics and bottleneck identification
  • Concurrent processing utilities for large document sets

🔧 Error Handling and Recovery

Actionable Error Messages

❌ Error in file processing: Permission denied: /protected/file.md
Details: EACCES: permission denied
Suggestions:
  → Check file permissions
  → Ensure the directory is not locked by another process
  → Try running with appropriate permissions

  • Graceful degradation when individual files fail
  • Comprehensive error categorization with recovery suggestions
  • Standard exit codes for reliable automation
  • Detailed logging with security-focused error handling
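Graceful degradation of the kind listed above usually means catching per-file errors and continuing with the rest of the batch. The Python sketch below illustrates that pattern; `process_files` and its return shape are assumptions for illustration and do not describe catalog's internals.

```python
def process_files(paths, process_one):
    """Process each path, collecting failures instead of stopping at the first one."""
    processed, failures = {}, []
    for path in paths:
        try:
            processed[path] = process_one(path)
        except OSError as error:
            # Covers permission and I/O errors such as EACCES; record and continue.
            failures.append((path, str(error)))
    return processed, failures
```

A caller can then report the collected failures together and decide on the exit code once, after all files have been attempted.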

Integration Architecture

How catalog fits into enterprise documentation workflows

📥 Content Sources: Markdown, HTML, Git repos

inform: Web content extraction

catalog: AI-ready indexing

🤖 AI Systems: Training, RAG, Context

Integration Benefits

🔄 Seamless Workflow

Works with inform for web content extraction and site generation pipelines

📊 Multiple Outputs

Single source generates formats for AI training, context windows, and SEO optimization

⚙️ CI/CD Ready

Standard exit codes and silent operation for automated documentation pipelines

🛡️ Enterprise Security

Comprehensive security validation and monitoring for production environments
