AI Training and RAG Workflows

Prepare documentation for AI systems with optimized indexing and content organization

🤖 AI Training Data Pipeline

Intermediate

Prepare comprehensive documentation for AI model training with proper content categorization and quality control.

Step 1: Organize Training Content

# Organize documentation by importance
mkdir -p ai-training/{core,supplementary,examples}

# Move essential documentation to core
cp -r docs/api ai-training/core/
cp -r docs/guides ai-training/core/
cp -r docs/tutorials ai-training/core/

# Move supplementary content
cp -r docs/examples ai-training/supplementary/
cp -r docs/appendix ai-training/supplementary/
cp -r docs/archive ai-training/supplementary/

Step 2: Generate AI-Optimized Indexes

# Create comprehensive training dataset
catalog --input ai-training \
  --output ai-ready \
  --optional "supplementary/**/*" \
  --optional "examples/**/*" \
  --validate

# Results:
# - llms.txt: Structured index focusing on core content
# - llms-full.txt: Complete content for training
# - llms-ctx.txt: Essential content only for context windows

Step 3: Quality Validation and Processing

# Validate content quality
catalog --input ai-training --output validated \
  --validate \
  --silent

# Check validation results
if [ $? -eq 0 ]; then
  echo "✅ All content validated successfully"
  echo "📊 Ready for AI training pipeline"
else
  echo "❌ Validation failed - check content structure"
  exit 1
fi

# Generate statistics
find ai-ready -name "*.txt" -exec wc -w {} \; > content-stats.txt
echo "📈 Content statistics generated"

Expected Results

  • llms.txt with core documentation prioritized
  • llms-full.txt containing complete training content
  • llms-ctx.txt optimized for context-limited AI systems
  • Validated content structure ensuring training quality
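
For reference, a generated llms.txt follows the llms.txt convention: an H1 title, a short blockquote summary, link sections for core content, and an Optional section for lower-priority material. The titles, URLs, and descriptions below are illustrative, not actual catalog output:

# Project Documentation

> Concise summary of the project and its documentation.

## Core

- [API Reference](https://docs.company.com/api/): Endpoints, parameters, and response formats
- [Guides](https://docs.company.com/guides/): Task-oriented walkthroughs

## Optional

- [Examples](https://docs.company.com/examples/): Supplementary sample code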

🧠 RAG System Content Preparation

Advanced

Optimize documentation for Retrieval-Augmented Generation (RAG) systems with structured indexing and semantic organization.

Multi-Source Content Aggregation

# Aggregate documentation from multiple sources
sources=(
  "product-docs"
  "api-documentation"
  "user-guides"
  "troubleshooting"
  "faqs"
)

# Create unified structure for RAG
mkdir -p rag-content/{knowledge-base,context-chunks,embeddings}

# Process each source with appropriate categorization
for source in "${sources[@]}"; do
  echo "Processing: $source"
  catalog --input "$source" \
    --output "rag-content/knowledge-base/$source" \
    --index \
    --validate
done

Context-Optimized Index Generation

# Generate comprehensive RAG indexes
catalog --input rag-content/knowledge-base \
  --output rag-content/context-chunks \
  --optional "faqs/**/*" \
  --optional "troubleshooting/legacy/**/*" \
  --base-url https://docs.company.com \
  --validate \
  --index

# Create semantic categorization
echo "🧠 Generating semantic categories..."

# Core knowledge (highest priority for RAG)
catalog --input rag-content/knowledge-base \
  --output rag-content/embeddings/core \
  --include "*/api/*" \
  --include "*/guides/*" \
  --validate

# Contextual knowledge (secondary priority)
catalog --input rag-content/knowledge-base \
  --output rag-content/embeddings/context \
  --include "*/examples/*" \
  --include "*/tutorials/*" \
  --validate

RAG System Integration

# Prepare for vector database ingestion
echo "📋 RAG Content Summary:"
echo "Core Documents: $(find rag-content/embeddings/core -name "*.md" | wc -l)"
echo "Context Documents: $(find rag-content/embeddings/context -name "*.md" | wc -l)"
echo "Total llms.txt files: $(find rag-content -name "llms*.txt" | wc -l)"

# Generate metadata for vector database
catalog --input rag-content/context-chunks \
  --output rag-ready \
  --index \
  --base-url https://docs.company.com

echo "🚀 RAG content preparation complete!"
echo "📂 Upload rag-ready/ to your vector database system"

RAG Optimization Benefits

  • Hierarchical content organization for relevance ranking
  • Context-optimized chunks for retrieval efficiency
  • Metadata-rich indexes for semantic search
  • Structured format compatible with vector databases
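
As a sketch of the hand-off to a vector database (the record shape and output path below are assumptions, not catalog output), each core document can be flattened into a JSONL record ready for embedding:

# Hypothetical export: one JSON record per core document (requires jq 1.6+)
OUT="rag-ready/chunks.jsonl"
: > "$OUT"
find rag-content/embeddings/core -name "*.md" | while read -r file; do
  jq -n --arg id "$(basename "$file" .md)" \
        --arg source "$file" \
        --rawfile text "$file" \
        '{id: $id, source: $source, text: $text}' >> "$OUT"
done
echo "Wrote $(wc -l < "$OUT") chunk records to $OUT"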

Documentation Website Workflows

Generate SEO-optimized documentation sites with comprehensive indexing

🌐 Static Site Generation

Beginner

Documentation Site Pipeline

# Generate comprehensive documentation site
catalog --input docs \
  --output site-build \
  --base-url https://docs.company.com \
  --sitemap \
  --sitemap-no-extensions \
  --index \
  --validate

# Results:
# - llms.txt: Structured documentation index
# - sitemap.xml: SEO-optimized sitemap
# - index.json: Navigation metadata
# - Validated compliance with standards
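
The generated sitemap.xml follows the standard sitemap protocol; roughly like the snippet below (URLs are illustrative, and extension-less because of --sitemap-no-extensions):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.company.com/guides/getting-started</loc>
  </url>
  <url>
    <loc>https://docs.company.com/api/authentication</loc>
  </url>
</urlset>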

Integration with Static Site Generators

# Hugo integration
catalog --input content \
  --output static/llms \
  --base-url https://docs.example.com \
  --sitemap \
  --index

# Jekyll integration
catalog --input _docs \
  --output _site/generated \
  --base-url https://docs.example.com \
  --sitemap-no-extensions
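
In practice the indexing step runs just before the site build so the generated files are copied into the published output (Hugo copies static/ into public/); a minimal ordering sketch using the Hugo paths above:

# Generate indexes first, then build the site so static/llms/ ships with it
catalog --input content \
  --output static/llms \
  --base-url https://docs.example.com \
  --sitemap \
  --index
hugo --minify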

🔍 Knowledge Base Creation

Intermediate

Enterprise Knowledge Base Setup

# Organize knowledge base content
mkdir -p knowledge-base/{public,internal,archived}

# Process public documentation
catalog --input public-docs \
  --output knowledge-base/public \
  --base-url https://kb.company.com \
  --sitemap \
  --validate

# Process internal documentation (marked as optional)
catalog --input internal-docs \
  --output knowledge-base/internal \
  --optional "**/*" \
  --base-url https://internal.company.com \
  --validate

# Combine all knowledge sources
catalog --input knowledge-base \
  --output unified-kb \
  --optional "internal/**/*" \
  --optional "archived/**/*" \
  --index \
  --sitemap \
  --base-url https://kb.company.com

Search Integration

# Generate search-optimized content
catalog --input knowledge-base \
  --output search-ready \
  --index \
  --validate

# Extract metadata for search indexing
find search-ready -name "index.json" | while read file; do
  echo "Processing search metadata: $file"
  # Send to search index (Elasticsearch, Algolia, etc.)
done

echo "🔍 Knowledge base ready for search integration"
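
The placeholder inside the loop above can be filled with whatever your search backend expects; a minimal sketch for a local Elasticsearch instance (the index name, and the assumption that each index.json can be posted as a single document, are illustrative):

# Hypothetical Elasticsearch ingestion for each generated index.json
curl -s -X POST "http://localhost:9200/knowledge-base/_doc" \
  -H 'Content-Type: application/json' \
  --data-binary "@$file"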

Automation and CI/CD Examples

Integrate catalog into automated documentation pipelines and workflows

🔄 GitHub Actions Workflow

Automated documentation processing on every commit

# .github/workflows/docs.yml
name: Documentation Processing

on:
  push:
    paths:
      - 'docs/**'
      - '.github/workflows/docs.yml'
  pull_request:
    paths:
      - 'docs/**'

jobs:
  process-docs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install catalog
        run: |
          curl -fsSL https://raw.githubusercontent.com/fwdslsh/catalog/main/install.sh | bash
          catalog --version

      - name: Process documentation
        run: |
          catalog --input docs \
            --output dist \
            --base-url ${{ secrets.DOCS_BASE_URL }} \
            --sitemap \
            --validate \
            --index

      - name: Validate output
        run: |
          if [ ! -f "dist/llms.txt" ]; then
            echo "❌ llms.txt not generated"
            exit 1
          fi
          if [ ! -f "dist/sitemap.xml" ]; then
            echo "❌ sitemap.xml not generated"
            exit 1
          fi
          echo "✅ All outputs generated successfully"

      - name: Deploy to GitHub Pages
        if: github.ref == 'refs/heads/main'
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./dist

📅 Scheduled Documentation Sync

Automatically sync documentation from multiple sources

#!/bin/bash
# sync-docs.sh - Scheduled documentation synchronization

# Configuration
DOC_SOURCES=(
  "https://api-docs.company.com"
  "https://user-guides.company.com"
  "https://developer.company.com"
)
OUTPUT_DIR="/var/www/unified-docs"
TEMP_DIR="/tmp/doc-sync-$(date +%Y%m%d-%H%M%S)"
LOG_FILE="/var/log/doc-sync.log"

echo "$(date): Starting documentation sync" >> "$LOG_FILE"

# Create temporary directory
mkdir -p "$TEMP_DIR"

# Extract from each source using inform
for source in "${DOC_SOURCES[@]}"; do
  domain=$(echo "$source" | sed 's/https:\/\///' | sed 's/\/.*$//' | sed 's/\./-/g')
  echo "$(date): Processing $source" >> "$LOG_FILE"

  inform "$source" \
    --output-dir "$TEMP_DIR/$domain" \
    --max-pages 200 \
    --delay 1000
done

# Generate unified index with catalog
echo "$(date): Generating unified documentation index" >> "$LOG_FILE"
catalog --input "$TEMP_DIR" \
  --output "$OUTPUT_DIR" \
  --base-url https://docs.company.com \
  --sitemap \
  --index \
  --validate

# Check if generation was successful
if [ $? -eq 0 ]; then
  echo "$(date): Documentation sync completed successfully" >> "$LOG_FILE"

  # Send notification to Slack
  MESSAGE="📚 Documentation updated at $(date)"
  curl -X POST -H 'Content-Type: application/json' \
    -d "{\"text\":\"$MESSAGE\"}" \
    "$SLACK_WEBHOOK_URL"
else
  echo "$(date): Documentation sync failed" >> "$LOG_FILE"
  exit 1
fi

# Cleanup
rm -rf "$TEMP_DIR"
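
To actually schedule the script, a cron entry is usually enough; an example added via crontab -e (the install path and run time are assumptions):

# Run the documentation sync every night at 02:30
30 2 * * * /usr/local/bin/sync-docs.sh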

🧪 Content Quality Validation

Automated quality checks for documentation

#!/bin/bash
# validate-docs.sh - Comprehensive documentation validation

DOC_DIR="$1"
REPORT_FILE="validation-report.json"

if [ -z "$DOC_DIR" ]; then
  echo "Usage: $0 <documentation-directory>"
  exit 1
fi

echo "🔍 Starting documentation validation for: $DOC_DIR"

# Run catalog validation
echo "📋 Running llms.txt standard validation..."
catalog --input "$DOC_DIR" \
  --output validation-output \
  --validate \
  --index
VALIDATION_RESULT=$?

# Generate detailed report
echo "📊 Generating validation report..."
cat > $REPORT_FILE << EOF
{
  "validation_date": "$(date -Iseconds)",
  "directory": "$DOC_DIR",
  "llms_validation": {
    "passed": $([ $VALIDATION_RESULT -eq 0 ] && echo "true" || echo "false"),
    "exit_code": $VALIDATION_RESULT
  },
  "content_stats": {
    "markdown_files": $(find "$DOC_DIR" -name "*.md" | wc -l),
    "html_files": $(find "$DOC_DIR" -name "*.html" | wc -l),
    "total_size": "$(du -sh "$DOC_DIR" | cut -f1)"
  },
  "generated_files": {
    "llms_txt": $([ -f "validation-output/llms.txt" ] && echo "true" || echo "false"),
    "llms_full_txt": $([ -f "validation-output/llms-full.txt" ] && echo "true" || echo "false"),
    "llms_ctx_txt": $([ -f "validation-output/llms-ctx.txt" ] && echo "true" || echo "false"),
    "index_json": $([ -f "validation-output/index.json" ] && echo "true" || echo "false")
  }
}
EOF

echo "📋 Validation report generated: $REPORT_FILE"

# Print summary
if [ $VALIDATION_RESULT -eq 0 ]; then
  echo "✅ All validations passed!"
  echo "📚 Documentation is ready for production"
else
  echo "❌ Validation failed"
  echo "📋 Check the validation output for details"
  exit 1
fi

Integration Patterns

Combine catalog with other tools for powerful documentation workflows

🔄 Complete fwdslsh Ecosystem Workflow

End-to-end documentation pipeline using all fwdslsh tools

Content Extraction

# Extract from multiple documentation sources
inform https://legacy-docs.company.com \
  --output-dir extracted/legacy \
  --max-pages 200

inform https://api.company.com/docs \
  --output-dir extracted/api \
  --max-pages 100

Content Indexing

# Combine and index all content
mkdir -p combined-docs
cp -r extracted/*/* combined-docs/

catalog --input combined-docs \
  --output indexed-docs \
  --base-url https://docs.company.com \
  --sitemap --index --validate

Complete Pipeline Benefits

  • Automated content migration from multiple sources
  • AI-ready indexing with llms.txt standard compliance
  • SEO-optimized site generation with navigation
  • Professional version control with AI-generated messages

🤖 AI-Enhanced Documentation Workflow

Leverage AI throughout the documentation lifecycle

Multi-Stage AI Integration

# Stage 1: Extract content with inform
inform https://docs.example.com \
  --output-dir raw-content \
  --max-pages 300

# Stage 2: Generate AI-optimized indexes
catalog --input raw-content \
  --output ai-enhanced \
  --optional "examples/**/*" \
  --optional "archived/**/*" \
  --validate

# Stage 3: Use llms.txt for AI-powered content enhancement
# (Custom AI processing using generated llms-full.txt)
python enhance-content.py \
  --input ai-enhanced/llms-full.txt \
  --output enhanced-content/

# Stage 4: Re-index enhanced content
catalog --input enhanced-content \
  --output final-output \
  --base-url https://docs.example.com \
  --sitemap --index --validate

Continuous AI Enhancement

# Set up automated AI enhancement pipeline
echo "🤖 Setting up AI-enhanced documentation pipeline..."

# Monitor for content changes
while inotifywait -e modify,create,delete raw-content/; do
  echo "📝 Content changed, re-processing..."

  # Re-generate indexes
  catalog --input raw-content \
    --output updated-ai \
    --validate \
    --silent

  # AI enhancement (custom processing)
  python ai-enhance.py updated-ai/

  # Update live documentation
  rsync -av updated-ai/ /var/www/docs/

  echo "✅ Documentation updated with AI enhancements"
done
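
The watcher loop needs to run as a long-lived process; one way to keep it alive (a sketch with assumed unit name and script path) is a small systemd service that restarts it automatically:

# /etc/systemd/system/docs-ai-enhance.service (hypothetical unit)
[Unit]
Description=AI-enhanced documentation pipeline
After=network.target

[Service]
ExecStart=/usr/local/bin/ai-enhance-watch.sh
Restart=always

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now docs-ai-enhance.service.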

Specialized Use Cases

Advanced patterns for specific documentation scenarios

📚 Multi-Language Documentation

Advanced

Language-Specific Processing

# Process documentation in multiple languages
languages=("en" "es" "fr" "de" "ja")

for lang in "${languages[@]}"; do
  echo "Processing documentation for: $lang"

  catalog --input "docs/$lang" \
    --output "localized/$lang" \
    --base-url "https://docs.example.com/$lang" \
    --sitemap \
    --validate

  # Generate language-specific metadata
  echo "Language: $lang" > "localized/$lang/language.txt"
  echo "Generated: $(date)" >> "localized/$lang/language.txt"
done

# Create unified multilingual index
catalog --input localized \
  --output unified-multilang \
  --index \
  --base-url https://docs.example.com

🏢 Enterprise Documentation Hub

Advanced

Department-Specific Organization

# Organize enterprise documentation by department
departments=("engineering" "product" "support" "sales" "legal")
mkdir -p enterprise-hub/{public,internal,confidential}

for dept in "${departments[@]}"; do
  echo "Processing $dept documentation..."

  # Public documentation
  catalog --input "departments/$dept/public" \
    --output "enterprise-hub/public/$dept" \
    --base-url "https://docs.company.com/$dept" \
    --sitemap \
    --validate

  # Internal documentation (marked as optional)
  catalog --input "departments/$dept/internal" \
    --output "enterprise-hub/internal/$dept" \
    --optional "**/*" \
    --base-url "https://internal.company.com/$dept" \
    --validate
done

# Generate master enterprise index
catalog --input enterprise-hub \
  --output master-docs \
  --optional "internal/**/*" \
  --optional "confidential/**/*" \
  --index \
  --sitemap \
  --base-url https://docs.company.com

🔬 Research and Academic Papers

Intermediate

Academic Content Organization

# Process research papers and academic content
mkdir -p research-index/{papers,datasets,methodologies,appendices}

# Organize papers by topic
topics=("ai-ml" "computer-vision" "nlp" "robotics" "theory")

for topic in "${topics[@]}"; do
  catalog --input "research/$topic" \
    --output "research-index/papers/$topic" \
    --optional "appendices/**/*" \
    --optional "raw-data/**/*" \
    --validate
done

# Create comprehensive research index
catalog --input research-index \
  --output academic-kb \
  --optional "appendices/**/*" \
  --optional "datasets/**/*" \
  --index \
  --base-url https://research.university.edu

# Generate citation metadata
find academic-kb -name "*.md" | while read file; do
  echo "Processing citations for: $file"
  # Extract and process academic citations
done
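
The citation step is left open in the loop above; one rough approach (the DOI regex and output file are assumptions) is to collect DOI references across the indexed papers:

# Hypothetical citation pass: collect unique DOIs from the indexed papers
grep -rhoE '10\.[0-9]{4,}/[^[:space:]]+' academic-kb --include="*.md" \
  | sort -u > academic-kb/dois.txt
echo "Found $(wc -l < academic-kb/dois.txt) unique DOI references"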

Troubleshooting Examples

Solutions for common catalog challenges and optimization techniques

🔧 Large Document Set Optimization

# Handle very large documentation sets efficiently
# Split processing into batches for memory management
BATCH_SIZE=500
DOC_DIR="large-docs"
OUTPUT_DIR="processed-docs"

# Count total files
total_files=$(find "$DOC_DIR" -name "*.md" -o -name "*.html" | wc -l)
batches=$(( (total_files + BATCH_SIZE - 1) / BATCH_SIZE ))

echo "Processing $total_files files in $batches batches..."

# Process in batches
for ((i=1; i<=batches; i++)); do
  echo "Processing batch $i of $batches..."

  # Create batch directory
  batch_dir="batch-$i"
  mkdir -p "$batch_dir"

  # Copy files for this batch
  find "$DOC_DIR" -name "*.md" -o -name "*.html" | \
    head -n $((i * BATCH_SIZE)) | \
    tail -n +$(((i-1) * BATCH_SIZE + 1)) | \
    xargs -I {} cp {} "$batch_dir/"

  # Process batch
  catalog --input "$batch_dir" \
    --output "$OUTPUT_DIR/batch-$i" \
    --validate \
    --silent

  # Cleanup batch directory
  rm -rf "$batch_dir"
done

# Combine all batches
echo "Combining all batches..."
mkdir -p combined-output
cp -r processed-docs/batch-*/* combined-output/

# Generate final index
catalog --input combined-output \
  --output final-large-docs \
  --index \
  --validate
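
If memory allows several catalog processes at once, the batch step can also be parallelized; a variation (sketch, assuming the batch-N directories are kept until the end rather than removed inside the loop):

# Hypothetical parallel variant: process existing batch-N directories 4 at a time
seq 1 "$batches" | xargs -P 4 -I {} \
  catalog --input "batch-{}" --output "$OUTPUT_DIR/batch-{}" --validate --silent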

📊 Content Quality Assessment

#!/bin/bash
# Comprehensive content quality assessment

DOC_DIR="$1"
QUALITY_REPORT="quality-assessment.md"

echo "# Documentation Quality Assessment" > $QUALITY_REPORT
echo "Generated: $(date)" >> $QUALITY_REPORT
echo "" >> $QUALITY_REPORT

# Run catalog with validation
echo "## Validation Results" >> $QUALITY_REPORT
catalog --input "$DOC_DIR" --validate --silent
validation_result=$?

if [ $validation_result -eq 0 ]; then
  echo "✅ **PASSED**: llms.txt standard compliance" >> $QUALITY_REPORT
else
  echo "❌ **FAILED**: llms.txt standard compliance" >> $QUALITY_REPORT
fi

# Content statistics
echo "" >> $QUALITY_REPORT
echo "## Content Statistics" >> $QUALITY_REPORT
echo "- Markdown files: $(find "$DOC_DIR" -name "*.md" | wc -l)" >> $QUALITY_REPORT
echo "- HTML files: $(find "$DOC_DIR" -name "*.html" | wc -l)" >> $QUALITY_REPORT
echo "- Total size: $(du -sh "$DOC_DIR" | cut -f1)" >> $QUALITY_REPORT

# File size analysis
echo "" >> $QUALITY_REPORT
echo "## File Size Analysis" >> $QUALITY_REPORT
echo "Large files (>100KB):" >> $QUALITY_REPORT
find "$DOC_DIR" \( -name "*.md" -o -name "*.html" \) -print | \
  xargs ls -la | awk '$5 > 102400 {print $9 ": " $5/1024 "KB"}' >> $QUALITY_REPORT

# Content issues
echo "" >> $QUALITY_REPORT
echo "## Potential Issues" >> $QUALITY_REPORT

# Check for empty files
empty_files=$(find "$DOC_DIR" \( -name "*.md" -o -name "*.html" \) -empty | wc -l)
if [ $empty_files -gt 0 ]; then
  echo "⚠️ Found $empty_files empty files" >> $QUALITY_REPORT
fi

# Check for very short files
short_files=$(find "$DOC_DIR" \( -name "*.md" -o -name "*.html" \) -exec wc -w {} \; | awk '$1 < 10 {count++} END {print count+0}')
if [ $short_files -gt 0 ]; then
  echo "⚠️ Found $short_files files with fewer than 10 words" >> $QUALITY_REPORT
fi

echo "📋 Quality assessment complete: $QUALITY_REPORT"

🛠️ Custom Pattern Debugging

#!/bin/bash
# Debug include/exclude patterns

DOC_DIR="$1"
TEST_PATTERNS=(
  "*.md"
  "docs/*.md"
  "**/*.html"
  "guides/*"
  "api/**/*"
)

echo "🔍 Testing glob patterns against: $DOC_DIR"
echo ""

for pattern in "${TEST_PATTERNS[@]}"; do
  echo "Testing pattern: $pattern"

  # Create test output
  test_output="pattern-test-$(echo $pattern | sed 's/[^a-zA-Z0-9]//g')"

  catalog --input "$DOC_DIR" \
    --output "$test_output" \
    --include "$pattern" \
    --silent

  file_count=$(find "$test_output" -name "*.md" -o -name "*.html" | wc -l)
  echo "  → Matched $file_count files"

  if [ $file_count -eq 0 ]; then
    echo "  ⚠️ No files matched - check pattern syntax"
  elif [ $file_count -gt 100 ]; then
    echo "  ⚠️ Many files matched - pattern might be too broad"
  else
    echo "  ✅ Reasonable match count"
  fi

  # Cleanup
  rm -rf "$test_output"
  echo ""
done

echo "🏁 Pattern testing complete"
