catalog Documentation - Complete Reference Guide

Complete reference guide for AI-ready documentation indexing
Master catalog's features for generating llms.txt standard-compliant indexes, SEO-optimized sitemaps, and enterprise-grade documentation processing.

Command Line Reference

Complete guide to all catalog command-line options and parameters

Basic Syntax

catalog [OPTIONS]

Generate llms.txt, llms-full.txt, and related files from Markdown/HTML directories.

Core Options

--input, -i <path>

Source directory of Markdown/HTML files (default: current directory)

catalog --input docs --output build

--output, -o <path>

Destination directory for generated files (default: current directory)

catalog --input docs --output build

--base-url <url>

Base URL for generating absolute links in output files

catalog --base-url https://docs.example.com

--silent

Suppress non-error output for automation

catalog --input docs --output build --silent

Content Selection

--include <pattern>

Include files matching glob pattern (can be used multiple times)

catalog --include "*.md" --include "guides/*.html"

--exclude <pattern>

Exclude files matching glob pattern (can be used multiple times)

catalog --exclude "**/*draft*" --exclude "temp/*"

--optional <pattern>

Mark files matching glob pattern as optional (can be used multiple times)

catalog --optional "drafts/**/*" --optional "**/CHANGELOG.md"
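
To make the interaction of the three pattern options concrete, here is a minimal Python sketch. It is not catalog's implementation: the function name `select_files` is hypothetical, and it uses the standard library's `fnmatchcase`, whose `*` also matches `/`, so real glob semantics may differ.

```python
from fnmatch import fnmatchcase


def select_files(paths, include=None, exclude=None, optional=None):
    """Split relative paths into (required, optional) lists using glob patterns.

    Assumed precedence (not confirmed by the documentation): include first,
    then exclude, then optional.
    """
    include = include or ["*"]
    exclude = exclude or []
    optional = optional or []
    required, extra = [], []
    for path in paths:
        if not any(fnmatchcase(path, pattern) for pattern in include):
            continue  # not selected at all
        if any(fnmatchcase(path, pattern) for pattern in exclude):
            continue  # explicitly excluded
        if any(fnmatchcase(path, pattern) for pattern in optional):
            extra.append(path)  # kept, but marked as optional
        else:
            required.append(path)
    return required, extra


paths = ["index.md", "guides/setup.html", "drafts/ideas/plan.md", "temp/scratch.md"]
required, extra = select_files(
    paths,
    include=["*.md", "guides/*.html"],
    exclude=["temp/*"],
    optional=["drafts/**/*"],
)
print(required, extra)
```

In this sketch an excluded file is dropped entirely, while an optional file is still processed and only categorized differently.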

Output Generation

--validate

Validate generated llms.txt compliance with standard

catalog --input docs --output build --validate

--index

Generate index.json files for directory navigation and metadata

catalog --input docs --output build --index

--sitemap

Generate XML sitemap for search engines (requires --base-url)

catalog --sitemap --base-url https://docs.example.com

--sitemap-no-extensions

Generate sitemap URLs without file extensions for clean URLs

catalog --sitemap --sitemap-no-extensions --base-url https://docs.example.com
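
The effect of --base-url and --sitemap-no-extensions on a single sitemap URL can be sketched in Python as follows. The function `sitemap_url` is hypothetical, and collapsing `index` pages to their directory URL is an assumption rather than documented behavior.

```python
from posixpath import splitext


def sitemap_url(base_url, rel_path, strip_extension=False):
    """Join a base URL and a relative file path into an absolute URL."""
    path = rel_path.lstrip("/")
    if strip_extension:
        path, _ext = splitext(path)  # "guides/setup.md" -> "guides/setup"
        # Assumption: an index page maps to its directory URL.
        if path.rsplit("/", 1)[-1] == "index":
            path = path[: -len("index")]
    return base_url.rstrip("/") + "/" + path


print(sitemap_url("https://docs.example.com", "guides/setup.md"))
print(sitemap_url("https://docs.example.com", "guides/setup.md", strip_extension=True))
print(sitemap_url("https://docs.example.com", "index.md", strip_extension=True))
```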

Complete Example Workflows

🤖 AI Training Pipeline

catalog --input docs --output ai-training \
  --optional "examples/**/*" \
  --optional "appendix/**/*" \
  --validate

Creates AI-optimized documentation with essential content prioritized and supplementary material marked as optional.

🌐 Documentation Website

catalog --input docs --output build \
  --base-url https://docs.example.com \
  --sitemap --sitemap-no-extensions \
  --index --validate

Complete documentation site preparation with SEO optimization, navigation metadata, and standards compliance.

🔄 CI/CD Integration

catalog --input docs --output dist \
  --validate \
  --silent

Automated documentation processing with validation and silent operation for continuous integration pipelines.

llms.txt Standard Compliance

Understanding the llms.txt format and catalog's enterprise-grade compliance features

🏆 Complete Standard Implementation

catalog provides full compliance with the llms.txt standard for AI-ready documentation indexing:

  • ✅ H1 → Blockquote → Sections Format: Proper structure with title, description, and organized sections
  • ✅ Section Hierarchy Validation: Ensures correct H2 section organization and ordering
  • ✅ Markdown Link Syntax Compliance: Validates proper link formatting and descriptions
  • ✅ Intelligent Document Ordering: Prioritizes important documentation with smart organization
  • ✅ Path-Based Section Generation: Automatic organization using directory structure
  • ✅ Optional Content Categorization: Separates core and supplementary content appropriately

📋 Standard Format Structure

The llms.txt format follows a specific structure, which catalog generates as follows:

Required Format Elements

    # Project Title

    > Brief project description

    ## Section Name

    - [file.md](file.md) - Optional file description
    - [another-file.md](another-file.md) - Another description

    ## Another Section

    - [section/file.md](section/file.md) - Organized by directory
    - [section/other.md](section/other.md) - Maintains structure

    ## Optional

    - [drafts/future.md](drafts/future.md) - Supplementary content
    - [archive/old.md](archive/old.md) - Historical documentation

Format Rules

  • H1 Title: Single project title at the top
  • Blockquote Description: Brief project description following the title
  • H2 Sections: Organized content sections with meaningful names
  • Markdown Links: Proper link syntax with optional descriptions
  • Optional Section: Separate section for supplementary content
  • Consistent Structure: Maintains organization and readability
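
The format rules above can be approximated with a small structural check. This is a sketch in Python, not catalog's validator: the function `check_structure` and its messages are hypothetical, and the rule that the Optional section comes last is an assumption drawn from the sample layout.

```python
def check_structure(text):
    """Return a list of structural problems; an empty list means all checks passed."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    problems = []
    if not lines or not lines[0].startswith("# "):
        problems.append("first line is not an H1 title")
    if len(lines) < 2 or not lines[1].startswith("> "):
        problems.append("title is not followed by a blockquote description")
    if sum(line.startswith("# ") for line in lines) > 1:
        problems.append("more than one H1 title")
    headings = [line[3:].strip() for line in lines if line.startswith("## ")]
    if not headings:
        problems.append("no H2 sections found")
    # Assumption: supplementary content belongs in a final "Optional" section.
    if "Optional" in headings and headings[-1] != "Optional":
        problems.append("Optional section is not last")
    return problems


good = """# Project Title

> Brief project description

## Section Name

- [file.md](file.md) - Optional file description

## Optional

- [drafts/future.md](drafts/future.md) - Supplementary content
"""
bad = "Project\n## Optional\n- [a.md](a.md)\n## Core\n- [b.md](b.md)\n"

print(check_structure(good))
print(check_structure(bad))
```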

✅ Validation Features

catalog includes comprehensive validation to ensure your output meets the standard:

Structure Validation

    # Validates proper H1 → blockquote → sections format
    catalog --validate

    # Example validation output:
    ✅ H1 title found: "Documentation Project"
    ✅ Blockquote description found
    ✅ Proper section hierarchy (H2 sections)
    ✅ Valid Markdown link syntax
    ✅ Appropriate content organization

Link Format Checking

    # Ensures all links use proper Markdown syntax
    ❌ Error: Invalid link format found
       Line 15: [file.md] - Missing parentheses
       Should be: [file.md](file.md) - Description
    ✅ Suggestion: Use proper Markdown link syntax
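
A link check of this kind can be sketched with two regular expressions. The Python below is illustrative only: `check_links` and both patterns are hypothetical and cover just the "missing parentheses" case shown in the example output.

```python
import re

# "- [label](target)" with an optional " - description" suffix.
VALID_LINK = re.compile(r"^- \[([^\]]+)\]\(([^)\s]+)\)(?: - .+)?$")
# "- [label]" that is not followed by "(", i.e. the target is missing.
BARE_LINK = re.compile(r"^- \[([^\]]+)\](?!\()")


def check_links(text):
    """Return (line_number, message) pairs for list items that are not valid links."""
    issues = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.rstrip()
        if not line.startswith("- ") or VALID_LINK.match(line):
            continue
        bare = BARE_LINK.match(line)
        if bare:
            label = bare.group(1)
            issues.append((number, f"missing parentheses; should be [{label}]({label})"))
        else:
            issues.append((number, "invalid link format"))
    return issues


sample = "## Docs\n- [a.md](a.md) - Intro\n- [file.md] - Missing target\n- plain text\n"
for number, message in check_links(sample):
    print(f"Line {number}: {message}")
```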

URL Validation

    # When using --base-url, validates absolute URLs
    catalog --base-url https://docs.example.com --validate

    ✅ All URLs properly formatted with base URL
    ✅ No broken or malformed links detected
    ✅ Consistent URL structure maintained
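
One way to picture the URL check is to compare each generated URL against the base URL. The sketch below uses Python's `urllib.parse`; the function `check_urls` and its messages are hypothetical, and it checks only formatting, not whether a link actually resolves.

```python
from urllib.parse import urlparse


def check_urls(urls, base_url):
    """Return (url, problem) pairs for URLs that do not sit under base_url."""
    base = urlparse(base_url)
    base_path = base.path.rstrip("/") + "/"
    problems = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append((url, "not an absolute http(s) URL"))
        elif (parsed.scheme, parsed.netloc) != (base.scheme, base.netloc):
            problems.append((url, "does not match the base URL host"))
        elif not (parsed.path or "/").startswith(base_path):
            problems.append((url, "outside the base URL path"))
    return problems


urls = [
    "https://docs.example.com/api/authentication.md",
    "api/endpoints.md",
    "https://other.example.com/index.md",
]
print(check_urls(urls, "https://docs.example.com"))
```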

File Processing and Content Extraction

How catalog intelligently processes different file types and extracts metadata

📁 Supported File Types

📄 Markdown Files (.md, .mdx)

Full support for Markdown with YAML frontmatter extraction and content processing.

Features: YAML Frontmatter, Title Extraction, Content Processing

🌐 HTML Files (.html)

Automatic conversion to Markdown with meta tag extraction and content cleaning.

Features: HTML to Markdown, Meta Tag Extraction, Content Cleaning
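
For the Markdown case, frontmatter and title extraction might look like the Python sketch below. It is an illustration under stated assumptions: `split_frontmatter` and `extract_title` are hypothetical names, only flat `key: value` pairs are parsed (real YAML needs a proper parser), and the fallback order of frontmatter title, first H1, then file name is a guess.

```python
def split_frontmatter(text):
    """Return (metadata, body); metadata is {} when no frontmatter block is present."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # unterminated block: treat everything as content
    metadata = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return metadata, body


def extract_title(text, fallback):
    """Prefer the frontmatter title, then the first H1, then the fallback."""
    metadata, body = split_frontmatter(text)
    if metadata.get("title"):
        return metadata["title"]
    for line in body.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return fallback


doc = '---\ntitle: "Getting Started"\ndescription: Quick start\n---\n\n# Welcome\nBody\n'
print(split_frontmatter(doc))
print(extract_title(doc, "getting-started.md"))
```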

🧠 Intelligent Document Ordering

catalog orders documents by the following rules, applied in priority order:

  1. Index/Root Files: Prioritizes index.md, readme.md, home.md files (index.md > readme.md > home.md)
  2. Important Documentation: Files containing keywords like catalog, tutorial, intro, getting-started (tutorial content gets higher priority)
  3. Path-Based Sections: Automatic organization by directory structure, e.g., api/, guides/ (maintains logical content grouping)
  4. Alphabetical Fallback: Within sections, files are sorted alphabetically for consistent organization (ensures predictable structure)
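
The four ordering rules can be expressed as a single sort key. The Python sketch below is an interpretation, not catalog's code: `sort_key` is hypothetical, the keyword list is copied from the description above, and applying the index/readme/home rule in every directory (not only the root) is an assumption.

```python
ROOT_NAMES = ["index", "readme", "home"]  # earlier names sort first
KEYWORDS = ["catalog", "tutorial", "intro", "getting-started"]


def sort_key(path):
    """Build a tuple so that sorted() applies the four rules in priority order."""
    lowered = path.lower()
    directory, _sep, filename = lowered.rpartition("/")
    stem = filename.rsplit(".", 1)[0]
    if stem in ROOT_NAMES:
        tier, rank = 0, ROOT_NAMES.index(stem)  # rule 1: index/root files
    elif any(word in filename for word in KEYWORDS):
        tier, rank = 1, 0  # rule 2: important documentation
    else:
        tier, rank = 2, 0  # everything else
    # Rule 3 groups by directory; rule 4 falls back to alphabetical order.
    return (tier, rank, directory, lowered)


paths = [
    "guides/zeta.md",
    "api/auth.md",
    "readme.md",
    "tutorial.md",
    "index.md",
    "guides/alpha.md",
]
ordered = sorted(paths, key=sort_key)
print(ordered)
```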

🔍 Metadata Extraction

catalog automatically extracts metadata from multiple sources, including YAML frontmatter in Markdown files and meta tags in HTML files.

🧹 Content Processing Pipeline

catalog processes content in a four-stage pipeline:

  1. 📂 Discovery & Scanning: Recursive directory traversal with pattern matching and security validation
  2. 🔍 Content Extraction: YAML frontmatter stripping, HTML processing, and metadata extraction
  3. 📊 Organization: Intelligent ordering, section generation, and content categorization
  4. 📋 Generation: Multiple output formats with validation and optimization

Output Formats and Features

Comprehensive guide to all output formats generated by catalog

📋 llms.txt (Structured Index)

Standard-compliant structured index with H1 → blockquote → sections format, perfect for AI context windows with clear organization.

    # Documentation Project

    > Complete API and user guide documentation

    ## Core Documentation

    - [index.md](index.md) - Project overview and introduction
    - [getting-started.md](getting-started.md) - Quick start guide

    ## API Reference

    - [api/authentication.md](api/authentication.md) - Authentication methods
    - [api/endpoints.md](api/endpoints.md) - API endpoints reference

    ## Optional

    - [drafts/future-plans.md](drafts/future-plans.md) - Future development plans

Tags: Standard Compliant, AI Optimized, Hierarchical
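
To show how such an index could be assembled from a file list, here is a minimal Python sketch. The function `build_llms_txt` is hypothetical, and deriving section names by title-casing the top-level directory is an assumption; it will not reproduce names such as "API Reference" from the sample.

```python
def build_llms_txt(title, description, docs, optional=()):
    """Render an llms.txt index.

    docs and optional are iterables of (relative_path, description) tuples.
    """
    sections = {}
    for path, desc in docs:
        directory = path.rpartition("/")[0]
        if directory:
            # Assumption: the section name comes from the top-level directory.
            name = directory.split("/")[0].replace("-", " ").title()
        else:
            name = "Core Documentation"
        sections.setdefault(name, []).append((path, desc))
    if optional:
        sections["Optional"] = list(optional)  # always rendered last
    out = [f"# {title}", "", f"> {description}"]
    for name, entries in sections.items():
        out += ["", f"## {name}", ""]
        out += [f"- [{p}]({p}) - {d}" if d else f"- [{p}]({p})" for p, d in entries]
    return "\n".join(out) + "\n"


index_text = build_llms_txt(
    "Documentation Project",
    "Complete API and user guide documentation",
    docs=[
        ("index.md", "Project overview and introduction"),
        ("api/authentication.md", "Authentication methods"),
    ],
    optional=[("drafts/future-plans.md", "Future development plans")],
)
print(index_text)
```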
📚 llms-full.txt (Complete Content)

Complete concatenated content with clear separators for comprehensive AI analysis and training data preparation.

    # Documentation Project

    > Complete API and user guide documentation

    ## index.md

    # Welcome to Our Documentation

    [Complete content with frontmatter stripped]

    ---

    ## getting-started.md

    # Getting Started Guide

    [Full content continues...]

    ---

    [Content continues for all files...]

Tags: Full Content, Training Data, Separated
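
The concatenation step can be sketched in Python as below. Both functions are hypothetical, and the exact separator and spacing are inferred from the sample rather than confirmed; note that a document containing its own `---` rule would look like a separator in this naive version.

```python
def strip_frontmatter(text):
    """Drop a leading '---' ... '---' block if one is present."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                return "\n".join(lines[i + 1:])
    return text


def build_llms_full(title, description, files):
    """files: iterable of (relative_path, raw_text) pairs, already in output order."""
    header = f"# {title}\n\n> {description}"
    parts = [f"## {path}\n\n{strip_frontmatter(raw).strip()}" for path, raw in files]
    return header + "\n\n" + "\n\n---\n\n".join(parts) + "\n"


full_text = build_llms_full(
    "Documentation Project",
    "Complete API and user guide documentation",
    [
        ("index.md", "---\ntitle: Home\n---\n# Welcome to Our Documentation\n"),
        ("getting-started.md", "# Getting Started Guide\n"),
    ],
)
print(full_text)
```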
🎯 llms-ctx.txt (Context-Only)

Structured index without optional sections, optimized for AI systems with limited context windows.

    # Documentation Project

    > Complete API and user guide documentation

    ## Core Documentation

    - [index.md](index.md) - Project overview and introduction
    - [getting-started.md](getting-started.md) - Quick start guide

    ## API Reference

    - [api/authentication.md](api/authentication.md) - Authentication methods
    - [api/endpoints.md](api/endpoints.md) - API endpoints reference

    # Note: Optional sections excluded for context optimization

Tags: Context Limited, Essential Only, Optimized
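
Conceptually, the context-only file is the structured index minus its Optional section. A minimal Python sketch of that filtering, with a hypothetical function name and no claim about how catalog derives the file internally:

```python
def strip_optional(llms_txt):
    """Return the index text without the '## Optional' section and its entries."""
    kept, skipping = [], False
    for line in llms_txt.splitlines():
        if line.startswith("## "):
            # Start skipping at the Optional heading, stop at the next H2.
            skipping = line[3:].strip().lower() == "optional"
        if not skipping:
            kept.append(line)
    return "\n".join(kept).rstrip() + "\n"


full_index = (
    "# Documentation Project\n\n> Complete API and user guide documentation\n\n"
    "## Core Documentation\n- [index.md](index.md) - Project overview\n\n"
    "## Optional\n- [drafts/future-plans.md](drafts/future-plans.md) - Future plans\n"
)
context_index = strip_optional(full_index)
print(context_index)
```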
🗺️ sitemap.xml (SEO Optimization)

XML sitemap with intelligent priority assignment and change frequency detection for search engine optimization.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://docs.example.com/</loc>
        <lastmod>2024-01-15T10:30:00Z</lastmod>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
    </urlset>

Tags: SEO Optimized, Smart Priorities, Clean URLs
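
A sitemap with this shape can be produced with Python's standard `xml.etree.ElementTree`. The sketch below is illustrative: `build_sitemap` is a hypothetical helper, and the priority and change-frequency values are passed in by the caller rather than detected.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """entries: iterable of dicts with loc and optional lastmod, changefreq, priority."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap([
    {
        "loc": "https://docs.example.com/",
        "lastmod": "2024-01-15T10:30:00Z",
        "changefreq": "weekly",
        "priority": "1.0",
    }
])
print(sitemap_xml)
```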
📊 index.json (Navigation Metadata)

Comprehensive directory and file metadata for programmatic navigation and content management.

    {
      "directory": ".",
      "generated": "2024-01-15T10:30:00Z",
      "files": [
        {
          "name": "index.md",
          "path": "index.md",
          "size": 1234,
          "modified": "2024-01-15T10:30:00Z",
          "type": "md",
          "isMarkdown": true
        }
      ],
      "summary": {
        "totalFiles": 5,
        "markdownFiles": 3,
        "totalSize": 12543
      }
    }

Tags: Metadata Rich, Navigation, Statistics
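
The sample above suggests a per-directory scan. Here is a hedged Python sketch that builds a dictionary with the same field names for one directory; `build_index` and `iso` are hypothetical helpers, and the demo writes two small files to a temporary directory so it runs anywhere.

```python
import json
import os
import tempfile
import time
from datetime import datetime, timezone


def iso(timestamp):
    """Format a POSIX timestamp as UTC, e.g. 2024-01-15T10:30:00Z."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_index(directory):
    """Collect metadata for the regular files directly inside one directory."""
    files = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if not os.path.isfile(full):
            continue
        stat = os.stat(full)
        ext = os.path.splitext(name)[1].lstrip(".").lower()
        files.append({
            "name": name,
            "path": name,
            "size": stat.st_size,
            "modified": iso(stat.st_mtime),
            "type": ext,
            "isMarkdown": ext in ("md", "mdx"),
        })
    return {
        "directory": ".",
        "generated": iso(time.time()),
        "files": files,
        "summary": {
            "totalFiles": len(files),
            "markdownFiles": sum(f["isMarkdown"] for f in files),
            "totalSize": sum(f["size"] for f in files),
        },
    }


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "index.md"), "wb") as fh:
        fh.write(b"# Home\n")
    with open(os.path.join(tmp, "page.html"), "wb") as fh:
        fh.write(b"<h1>Page</h1>\n")
    index = build_index(tmp)

print(json.dumps(index, indent=2))
```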

Enterprise Features

Advanced capabilities for large-scale documentation processing

🛡️ Security and Validation

Path Traversal Prevention

Blocks ../ sequences and validates all file paths to prevent directory traversal attacks.
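
A common way to implement this kind of check is to resolve each candidate path and confirm it still lies under the input directory. The Python sketch below shows that general technique; `is_inside` is a hypothetical name, and nothing here is taken from catalog's source.

```python
import os
import tempfile


def is_inside(root, candidate):
    """Return True if candidate, resolved relative to root, stays within root."""
    root_real = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root_real, candidate))
    try:
        return os.path.commonpath([root_real, target]) == root_real
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


with tempfile.TemporaryDirectory() as root:
    inside = is_inside(root, "guides/setup.md")
    outside = is_inside(root, "../secret.txt")
    absolute = is_inside(root, "/etc/passwd")

print(inside, outside, absolute)
```

Resolving with `realpath` before comparing means symlinks and embedded `../` segments are handled by the same check, instead of by string matching on the raw input.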

Input Sanitization

All user inputs are validated and sanitized to prevent injection attacks and ensure safe processing.

Content Scanning

Detects malicious patterns, suspicious URLs, and potentially harmful content during processing.

File Size Limits

Configurable limits prevent processing of extremely large files that could cause memory issues.

📊 Performance Monitoring

Real-Time Performance Tracking

    📊 Performance Report:
    Total Time: 147ms
    Memory Usage:
      Heap Used: 12.45MB
      RSS: 89.23MB
    Memory Delta:
      Heap: +2.1MB
      RSS: +5.7MB
    Operations:
      file_scanning: 23ms
      content_processing: 89ms
      sitemap_generation: 12ms
      files_processed: 42
      total_file_size: 2.3MB
  • Detailed timing for all major operations
  • Memory usage monitoring and optimization
  • Processing statistics and bottleneck identification
  • Concurrent processing utilities for large document sets
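
Per-operation timing like that in the report can be collected with a small context manager. This Python sketch covers timing only (no memory figures); the class `PerformanceReport` and its output layout are hypothetical and merely echo the sample report.

```python
import time
from contextlib import contextmanager


class PerformanceReport:
    """Accumulate wall-clock time per named operation."""

    def __init__(self):
        self.operations = {}

    @contextmanager
    def track(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.operations[name] = self.operations.get(name, 0.0) + elapsed_ms

    def render(self):
        total = sum(self.operations.values())
        lines = ["Performance Report:", f"Total Time: {total:.0f}ms", "Operations:"]
        lines += [f"  {name}: {ms:.0f}ms" for name, ms in self.operations.items()]
        return "\n".join(lines)


report = PerformanceReport()
with report.track("file_scanning"):
    sum(range(10_000))  # stand-in for real work
with report.track("content_processing"):
    sorted(range(10_000), reverse=True)

print(report.render())
```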

🔧 Error Handling and Recovery

Actionable Error Messages

    ❌ Error in file processing: Permission denied: /protected/file.md
    Details: EACCES: permission denied
    Suggestions:
      → Check file permissions
      → Ensure the directory is not locked by another process
      → Try running with appropriate permissions
  • Graceful degradation when individual files fail
  • Comprehensive error categorization with recovery suggestions
  • Standard exit codes for reliable automation
  • Detailed logging with security-focused error handling
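
Graceful degradation and exit codes can be sketched as follows in Python. Everything here is hypothetical: the helper names, the suggestion table, and the choice of 0 for success and 1 for any failure are assumptions, since the documentation does not list the actual exit codes.

```python
import errno

# Hypothetical mapping from OS error codes to recovery hints.
SUGGESTIONS = {
    errno.EACCES: ["Check file permissions", "Try running with appropriate permissions"],
    errno.ENOENT: ["Check that the path exists"],
}


def process_all(paths, process_one):
    """Process every path, collecting failures instead of stopping at the first one."""
    results, failures = {}, []
    for path in paths:
        try:
            results[path] = process_one(path)
        except OSError as exc:
            failures.append((path, exc))
    return results, failures


def describe(path, exc):
    """Format one failure with any matching suggestions."""
    lines = [f"Error in file processing: {path}", f"Details: {exc}"]
    hints = SUGGESTIONS.get(getattr(exc, "errno", None), [])
    if hints:
        lines.append("Suggestions:")
        lines += [f"  -> {hint}" for hint in hints]
    return "\n".join(lines)


def exit_code(failures):
    """Assumed convention: 0 when everything succeeded, 1 otherwise."""
    return 1 if failures else 0


def fake_read(path):
    """Stand-in for real file processing, used only for this demo."""
    if path.startswith("/protected/"):
        raise PermissionError(errno.EACCES, "permission denied", path)
    return f"processed {path}"


results, failures = process_all(["index.md", "/protected/file.md"], fake_read)
for path, exc in failures:
    print(describe(path, exc))
print("exit code:", exit_code(failures))
```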

Integration Architecture

How catalog fits into enterprise documentation workflows

  1. 📥 Content Sources: Markdown, HTML, Git repos
  2. inform: Web content extraction
  3. catalog: AI-ready indexing
  4. 🤖 AI Systems: Training, RAG, Context

Integration Benefits

  • 🔄 Seamless Workflow: Works with inform for web content extraction and unify for site generation
  • 📊 Multiple Outputs: Single source generates formats for AI training, context windows, and SEO optimization
  • ⚙️ CI/CD Ready: Standard exit codes and silent operation for automated documentation pipelines
  • 🛡️ Enterprise Security: Comprehensive security validation and monitoring for production environments
