inform Examples - Real-World Web Content Extraction

Real-world patterns for web content extraction
Practical examples, proven workflows, and integration patterns for extracting, converting, and organizing web content with inform.

Documentation Migration Patterns

Proven workflows for migrating documentation between platforms

📖 Legacy Platform Migration

Intermediate

Complete workflow for migrating documentation from an old platform to a modern static site generator.

Step 1: Analysis and Planning

# Analyze the current site structure
inform https://old-docs.company.com \
  --max-pages 5 \
  --verbose \
  --output-dir analysis

# Review extracted content structure
ls analysis/
head -20 analysis/index.md

Step 2: Full Content Extraction

# Extract all documentation with conservative settings
inform https://old-docs.company.com \
  --output-dir migrated-docs \
  --max-pages 200 \
  --delay 2000 \
  --concurrency 2 \
  --include "*/docs/*" \
  --include "*/guide/*" \
  --exclude "*/internal/*" \
  --exclude "*/admin/*"

Step 3: Content Organization

# Organize content for new platform
mkdir -p new-docs-site/content/{guides,api,tutorials}

# Move content to appropriate sections
mv migrated-docs/guide/* new-docs-site/content/guides/
mv migrated-docs/api/* new-docs-site/content/api/
mv migrated-docs/tutorial/* new-docs-site/content/tutorials/

# Generate index files
catalog --input new-docs-site/content \
  --output new-docs-site/indexed \
  --sitemap \
  --base-url https://docs.company.com

Step 4: Build and Deploy

# Build with unify
cd new-docs-site
unify build --input indexed --output dist

# Generate professional commit message
giv message
# "docs: migrate legacy documentation to modern platform"

# Deploy
git add . && git commit -m "$(giv message)"
git push origin main

Expected Results

  • Organized content structure preserving original hierarchy
  • Clean Markdown files with proper metadata
  • Generated sitemap and navigation structure
  • Professional deployment with AI-generated commit messages
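
A quick way to sanity-check these results before building (a sketch only; the --- frontmatter delimiter is an assumption about the extracted files, so adjust it to what your version of inform actually emits):

# Count migrated pages and list any files that appear to lack a metadata block
find migrated-docs -name "*.md" | wc -l
grep -rL --include="*.md" -e '^---' migrated-docs | head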

🔄 Multi-Source Consolidation

Advanced

Combine documentation from multiple sources into a unified knowledge base.

Extract from Multiple Sources

# Extract from primary documentation
inform https://docs.mainproduct.com \
  --output-dir sources/main-docs \
  --max-pages 150 \
  --delay 1000

# Extract API documentation
inform https://api.mainproduct.com/docs \
  --output-dir sources/api-docs \
  --max-pages 50 \
  --include "*/reference/*" \
  --include "*/endpoints/*"

# Extract community tutorials
inform https://community.mainproduct.com \
  --output-dir sources/community \
  --max-pages 100 \
  --include "*/tutorial/*" \
  --include "*/howto/*"

# Extract GitHub repository docs
inform https://github.com/company/product/tree/main/docs \
  --output-dir sources/github-docs \
  --include "*.md"

Consolidate and Structure

# Create unified structure
mkdir -p consolidated/{core,api,tutorials,community}

# Organize by content type
cp -r sources/main-docs/* consolidated/core/
cp -r sources/api-docs/* consolidated/api/
cp -r sources/community/* consolidated/community/
cp -r sources/github-docs/* consolidated/core/

# Generate comprehensive index
catalog --input consolidated \
  --output unified-knowledge-base \
  --index \
  --sitemap \
  --validate \
  --base-url https://knowledge.company.com

Create Cross-References

# Generate relationship mapping
find consolidated -name "*.md" | while read -r file; do
  echo "Processing: $file"
  # Add cross-reference metadata
  # (Custom script to identify related content)
done

# Build final knowledge base
unify build --input unified-knowledge-base --output kb-site
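
The loop above leaves the actual cross-referencing to a custom script. A minimal stand-in for that placeholder, which treats each file's name as its topic and appends links to up to three other files that mention it (assumes GNU grep; the heuristic is purely illustrative and not part of inform or catalog):

# Hypothetical cross-reference pass: append a "Related content" list to each page
find consolidated -name "*.md" | while read -r file; do
  topic=$(basename "$file" .md | tr '-' ' ')
  related=$(grep -ril --include="*.md" "$topic" consolidated | grep -Fxv "$file" | head -3)
  if [ -n "$related" ]; then
    {
      echo ""
      echo "## Related content"
      echo "$related" | sed 's/^/- /'
    } >> "$file"
  fi
done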

Benefits

  • Comprehensive documentation coverage
  • Consistent formatting across sources
  • Unified search and navigation
  • Automated cross-referencing

Content Research Patterns

Intelligence gathering and competitive analysis workflows

🔍 Competitive Analysis

Intermediate

Multi-Competitor Intelligence

#!/bin/bash
# competitive-research.sh

competitors=(
  "https://competitor1.com/docs"
  "https://competitor2.com/help"
  "https://competitor3.com/guides"
  "https://competitor4.com/api"
)

# Extract from each competitor
for url in "${competitors[@]}"; do
  domain=$(echo $url | sed 's/https:\/\///' | sed 's/\/.*$//')
  echo "Analyzing $domain..."

  inform "$url" \
    --output-dir "research/$domain" \
    --max-pages 30 \
    --delay 2000 \
    --include "*/docs/*" \
    --include "*/guide/*" \
    --include "*/api/*"

  echo "Completed $domain analysis"
  sleep 60  # Respectful delay between competitors
done

# Generate comparative analysis
catalog --input research \
  --output competitive-analysis \
  --index \
  --validate

echo "Competitive research complete!"
echo "Review results in: competitive-analysis/"

Analysis and Insights

# Generate content comparison
find research -name "*.md" | head -20 | xargs wc -w > word-counts.txt

# Create feature comparison matrix
# (Custom analysis script)
python analyze-features.py research/ > feature-matrix.csv

# Generate summary report
echo "# Competitive Analysis Report" > analysis-report.md
echo "" >> analysis-report.md
echo "## Content Volume Analysis" >> analysis-report.md
cat word-counts.txt >> analysis-report.md
echo "" >> analysis-report.md
echo "## Feature Coverage Matrix" >> analysis-report.md
cat feature-matrix.csv >> analysis-report.md
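
analyze-features.py above is a placeholder for whatever analysis you run. A rough shell-only version of the same idea, counting how many pages per competitor mention each feature keyword (the keyword list is illustrative and GNU grep is assumed):

# Hypothetical stand-in for analyze-features.py: keyword hits per competitor
features="sso webhooks audit-log sandbox rate-limit"
{
  echo "competitor,$(echo $features | tr ' ' ',')"
  for dir in research/*/; do
    row=$(basename "$dir")
    for f in $features; do
      hits=$(grep -ril --include="*.md" "$f" "$dir" | wc -l)
      row="$row,$hits"
    done
    echo "$row"
  done
} > feature-matrix.csv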

Research Insights

  • Content structure and organization comparison
  • Feature coverage and documentation depth
  • User experience and information architecture
  • Messaging and positioning analysis

📊 Industry Knowledge Mining

Advanced

Domain-Specific Content Extraction

# Industry-specific sources
industry_sources=(
  "https://techblog.company1.com"
  "https://engineering.company2.com"
  "https://blog.company3.com"
  "https://medium.com/@industry-expert"
)

# Extract industry insights
for source in "${industry_sources[@]}"; do
  domain=$(echo $source | sed 's/https:\/\///' | sed 's/\/.*$//')

  inform "$source" \
    --output-dir "industry-insights/$domain" \
    --max-pages 50 \
    --delay 1500 \
    --include "*technical*" \
    --include "*engineering*" \
    --include "*architecture*" \
    --exclude "*job*" \
    --exclude "*hiring*"
done

# Process and categorize content
catalog --input industry-insights \
  --output processed-insights \
  --optional "drafts/**/*" \
  --validate

Content Processing and Analysis

# Generate topic clustering
# (Requires additional analysis tools)
python cluster-topics.py processed-insights/ > topic-clusters.json

# Extract technical patterns
grep -r "architecture\|design pattern\|best practice" processed-insights/ > technical-patterns.txt

# Create trending analysis
python analyze-trends.py processed-insights/ > trend-analysis.json

# Generate final report
python generate-industry-report.py \
  --topics topic-clusters.json \
  --patterns technical-patterns.txt \
  --trends trend-analysis.json \
  --output industry-knowledge-report.md

Integration Workflows

Combining inform with other fwdslsh tools for powerful pipelines

🔄 Complete Documentation Pipeline

End-to-end workflow from web crawling to deployment

Step 1: Content Extraction (inform)

# Extract from multiple documentation sources
inform https://old-docs.example.com \
  --output-dir raw-content \
  --max-pages 100 \
  --delay 1000

inform https://api-docs.example.com \
  --output-dir raw-content/api \
  --max-pages 50 \
  --include "*/reference/*"

Step 2: Content Indexing (catalog)

# Generate structured indexes and navigation
catalog --input raw-content \
  --output structured-content \
  --sitemap \
  --index \
  --base-url https://new-docs.example.com \
  --validate

Step 3: Site Generation (unify)

# Build modern static site with navigation
unify build \
  --input structured-content \
  --output production-site \
  --optimize

Step 4: Version Control (giv)

# Professional commit with AI assistance
git add .
giv message
# "docs: migrate and modernize documentation platform"
git commit -m "$(giv message)"
git push origin main
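
The four steps above can also be chained into a single script so the whole pipeline is repeatable; a minimal sketch reusing the same placeholder URLs and directories (error handling beyond set -e is left out):

#!/bin/bash
# docs-pipeline.sh - end-to-end sketch of the pipeline above
set -euo pipefail

# 1. Extract content
inform https://old-docs.example.com \
  --output-dir raw-content \
  --max-pages 100 \
  --delay 1000

# 2. Generate indexes and navigation
catalog --input raw-content \
  --output structured-content \
  --sitemap \
  --index \
  --base-url https://new-docs.example.com

# 3. Build the static site
unify build --input structured-content --output production-site

# 4. Commit and push
git add .
git commit -m "$(giv message)"
git push origin main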

Pipeline Benefits

  • Automated content migration and structuring
  • SEO-optimized site generation with navigation
  • Professional version control and documentation
  • Repeatable and scalable process

🤖 AI-Ready Content Pipeline

Prepare content for AI training and RAG applications

High-Quality Content Extraction

# Extract comprehensive, high-quality content
inform https://comprehensive-docs.example.com \
  --output-dir ai-training-content \
  --max-pages 500 \
  --delay 800 \
  --include "*/docs/*" \
  --include "*/guide/*" \
  --include "*/tutorial/*" \
  --include "*/reference/*" \
  --exclude "*/blog/*" \
  --exclude "*/news/*"

Content Structuring and Validation

# Generate AI-optimized indexes with validation
catalog --input ai-training-content \
  --output ai-ready-content \
  --validate \
  --optional "examples/**/*" \
  --optional "advanced/**/*"

# Quality check the content
find ai-ready-content -name "*.md" | xargs wc -w | tail -1
echo "Content validation complete"

AI Integration Preparation

# Create training data sets
mkdir -p ai-datasets/{training,validation,context}

# Split content for different AI purposes
cp ai-ready-content/llms.txt ai-datasets/context/
cp ai-ready-content/llms-full.txt ai-datasets/training/
cp ai-ready-content/llms-ctx.txt ai-datasets/validation/

# Generate metadata for AI systems
echo "AI-ready content prepared with structured indexes"
ls -la ai-datasets/*/
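
If the content is destined for a RAG index, the consolidated file usually needs to be chunked before embedding. One minimal approach with GNU split (the 4000-character chunk size is an arbitrary example, and llms-full.txt is the catalog output copied above):

# Split the consolidated text into ~4000-character, line-aligned chunks
mkdir -p ai-datasets/chunks
split -C 4000 -d --additional-suffix=.txt \
  ai-datasets/training/llms-full.txt \
  ai-datasets/chunks/chunk-
ls ai-datasets/chunks | wc -l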

AI-Optimized Features

  • Clean, structured content without noise
  • Consistent formatting for AI processing
  • Hierarchical organization for context
  • Quality validation and completeness checks

Specialized Extraction Patterns

Advanced techniques for specific content types and platforms

📱 GitHub Repository Documentation

Beginner

Repository Documentation Extraction

# Extract documentation from GitHub repositories
inform https://github.com/facebook/react/tree/main/docs \
  --output-dir react-docs \
  --include "*.md" \
  --include "*.mdx"

# Extract from multiple open source projects
projects=(
  "https://github.com/vuejs/vue/tree/dev/docs"
  "https://github.com/angular/angular/tree/main/docs"
  "https://github.com/sveltejs/svelte/tree/master/site/content/docs"
)

for project in "${projects[@]}"; do
  project_name=$(echo $project | sed 's/.*github.com\/\([^\/]*\)\/\([^\/]*\)\/.*/\1-\2/')

  inform "$project" \
    --output-dir "oss-docs/$project_name" \
    --include "*.md" \
    --include "*.mdx"
done
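
Crawling the GitHub web UI works, but for repositories a shallow clone is often faster and avoids rate limits; a sketch that yields the same Markdown files (the docs/ path mirrors the URL above and may differ per repository):

# Alternative: shallow-clone the repo and copy its Markdown docs directly
git clone --depth 1 https://github.com/facebook/react.git tmp-react
mkdir -p react-docs
cp -r tmp-react/docs/. react-docs/
rm -rf tmp-react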

Documentation Comparison

# Generate comparative documentation analysis
catalog --input oss-docs \
  --output framework-comparison \
  --index \
  --sitemap \
  --base-url https://framework-docs-comparison.dev

# Build comparison site
unify build \
  --input framework-comparison \
  --output comparison-site

🛒 E-commerce Content Mining

Advanced

Product Information Extraction

# Extract product documentation and guides
inform https://help.shopify.com \
  --output-dir ecommerce-content/shopify \
  --max-pages 200 \
  --include "*/manual/*" \
  --include "*/themes/*" \
  --exclude "*/billing/*"

inform https://docs.woocommerce.com \
  --output-dir ecommerce-content/woocommerce \
  --max-pages 150 \
  --include "*/document/*" \
  --include "*/tutorial/*"

# Extract best practices and guides
inform https://ecommerce-platforms.com \
  --output-dir ecommerce-content/best-practices \
  --max-pages 100 \
  --include "*/guide/*" \
  --include "*/best-practice/*"

Content Organization

# Organize by platform and topic (create the per-platform targets before copying)
mkdir -p ecommerce-knowledge/{platforms/{shopify,woocommerce},guides,tutorials}

# Categorize content
cp -r ecommerce-content/shopify/* ecommerce-knowledge/platforms/shopify/
cp -r ecommerce-content/woocommerce/* ecommerce-knowledge/platforms/woocommerce/
cp -r ecommerce-content/best-practices/* ecommerce-knowledge/guides/

# Generate comprehensive e-commerce knowledge base
catalog --input ecommerce-knowledge \
  --output ecommerce-kb \
  --index \
  --sitemap \
  --validate \
  --base-url https://ecommerce-knowledge.dev

🔬 Technical Blog Aggregation

Intermediate

Engineering Blog Extraction

# Extract from major tech company blogs
tech_blogs=(
  "https://engineering.fb.com"
  "https://blog.google/technology"
  "https://eng.uber.com"
  "https://medium.engineering"
  "https://netflixtechblog.com"
)

for blog in "${tech_blogs[@]}"; do
  domain=$(echo $blog | sed 's/https:\/\///' | sed 's/\/.*$//' | sed 's/\./-/g')

  inform "$blog" \
    --output-dir "tech-blogs/$domain" \
    --max-pages 50 \
    --delay 2000 \
    --include "*engineering*" \
    --include "*technical*" \
    --include "*architecture*" \
    --exclude "*job*" \
    --exclude "*career*"
done

Content Aggregation and Analysis

# Create unified tech blog archive
catalog --input tech-blogs \
  --output tech-insights \
  --index \
  --validate

# Generate trend analysis
find tech-blogs -name "*.md" -exec grep -l "microservices\|kubernetes\|serverless" {} \; > trending-topics.txt

# Create searchable archive
unify build \
  --input tech-insights \
  --output tech-blog-archive

echo "Tech blog aggregation complete!"
echo "Archive available in: tech-blog-archive/"
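
To turn trending-topics.txt into rough per-keyword counts, the same search can be run once per term (keywords reused from the command above; purely illustrative):

# Count how many posts mention each topic
for kw in microservices kubernetes serverless; do
  echo "$kw: $(grep -rl --include='*.md' "$kw" tech-blogs | wc -l) posts"
done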

Automation Scripts

Ready-to-use scripts for common inform workflows

📅 Scheduled Content Monitoring

Monitor websites for content changes and updates

#!/bin/bash
# content-monitor.sh - Monitor websites for changes

# Configuration
SITES=(
  "https://docs.example.com"
  "https://api.example.com/docs"
  "https://help.example.com"
)
BACKUP_DIR="/backup/content-monitoring"
DATE=$(date +%Y-%m-%d)

# Create daily backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Monitor each site
for site in "${SITES[@]}"; do
  site_name=$(echo $site | sed 's/https:\/\///' | sed 's/\/.*$//' | sed 's/\./-/g')
  echo "Monitoring: $site"

  # Extract current content
  inform "$site" \
    --output-dir "$BACKUP_DIR/$DATE/$site_name" \
    --max-pages 50 \
    --delay 1000

  # Compare with previous day if exists
  previous_date=$(date -d "yesterday" +%Y-%m-%d)
  if [ -d "$BACKUP_DIR/$previous_date/$site_name" ]; then
    echo "Comparing with previous extraction..."
    diff -r "$BACKUP_DIR/$previous_date/$site_name" "$BACKUP_DIR/$DATE/$site_name" > "$BACKUP_DIR/$DATE/$site_name-changes.txt"

    if [ -s "$BACKUP_DIR/$DATE/$site_name-changes.txt" ]; then
      echo "Changes detected in $site_name!"
      # Send notification (email, Slack, etc.)
      # notify-send "Content Changes" "Changes detected in $site_name"
    fi
  fi
done

echo "Content monitoring complete for $DATE"
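
To make the monitoring actually scheduled, the script can be run from cron; an example crontab entry that runs it nightly at 02:00 (the install path and log location are placeholders):

# crontab -e
0 2 * * * /usr/local/bin/content-monitor.sh >> /var/log/content-monitor.log 2>&1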

🔄 Automated Documentation Sync

Keep local documentation in sync with web sources

#!/bin/bash
# doc-sync.sh - Automated documentation synchronization

# Configuration
SOURCE_URL="https://docs.upstream.com"
LOCAL_DOCS="./docs"
BACKUP_DIR="./docs-backup"
SYNC_LOG="./sync.log"

echo "$(date): Starting documentation sync" >> $SYNC_LOG

# Create backup of current docs
if [ -d "$LOCAL_DOCS" ]; then
  echo "Creating backup..."
  cp -r "$LOCAL_DOCS" "$BACKUP_DIR-$(date +%Y%m%d-%H%M%S)"
fi

# Extract latest documentation
echo "Extracting latest documentation..."
inform "$SOURCE_URL" \
  --output-dir "$LOCAL_DOCS-new" \
  --max-pages 200 \
  --delay 1000 \
  --include "*/docs/*" \
  --include "*/guide/*"

# Check if extraction was successful
if [ $? -eq 0 ]; then
  echo "Extraction successful, updating local docs..."

  # Replace old docs with new
  rm -rf "$LOCAL_DOCS"
  mv "$LOCAL_DOCS-new" "$LOCAL_DOCS"

  # Generate index
  catalog --input "$LOCAL_DOCS" \
    --output "$LOCAL_DOCS-indexed" \
    --sitemap \
    --index

  # Build site
  unify build \
    --input "$LOCAL_DOCS-indexed" \
    --output "./public"

  # Commit changes with giv
  git add .
  commit_message=$(giv message)
  git commit -m "$commit_message"

  echo "$(date): Sync completed successfully" >> $SYNC_LOG
else
  echo "$(date): Sync failed during extraction" >> $SYNC_LOG
  exit 1
fi
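
If doc-sync.sh is also triggered from cron, it is worth guarding against overlapping runs; one common pattern using flock from util-linux (the lock file path is a placeholder):

# Skip this run if a previous sync is still in progress
flock -n /tmp/doc-sync.lock ./doc-sync.sh || echo "Sync already running, skipping"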

Troubleshooting Examples

Solutions for common inform challenges and edge cases

🚫 Handling Rate-Limited Sites

# Conservative crawling for rate-sensitive sites
inform https://rate-limited-site.com \
  --delay 5000 \
  --concurrency 1 \
  --max-pages 25 \
  --verbose

# Multi-session approach for large sites
sessions=(
  "*/docs/*"
  "*/api/*"
  "*/guide/*"
)

for session in "${sessions[@]}"; do
  echo "Processing session: $session"

  inform https://large-site.com \
    --include "$session" \
    --output-dir "./content/$(echo "$session" | sed 's/[^a-zA-Z0-9]//g')" \
    --delay 3000 \
    --max-pages 50

  echo "Waiting 10 minutes before next session..."
  sleep 600
done
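
When a site intermittently rejects requests even at low concurrency, wrapping the crawl in a simple retry-with-backoff loop helps; a sketch reusing the conservative settings above (retry counts and sleep times are arbitrary examples, and it assumes inform exits non-zero on failure):

# Retry the conservative crawl up to three times with growing pauses
for attempt in 1 2 3; do
  if inform https://rate-limited-site.com \
       --delay 5000 \
       --concurrency 1 \
       --max-pages 25; then
    break
  fi
  echo "Attempt $attempt failed; backing off..."
  sleep $((attempt * 300))
done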

🔧 Custom Content Detection

# When standard content detection fails
# Test with a smaller sample first
inform https://unusual-site.com/sample-page \
  --max-pages 1 \
  --verbose

# Review extracted content
head -50 crawled-pages/sample-page.md

# Adjust strategy based on results
# (Future: Custom selector support)
inform https://unusual-site.com \
  --output-dir custom-extraction \
  --max-pages 20 \
  --delay 2000

📦 Large-Scale Content Processing

#!/bin/bash
# large-scale-crawl.sh - handling very large sites efficiently

BASE_URL="https://massive-docs.com"
BATCH_SIZE=50
MAX_BATCHES=20

for ((i=1; i<=MAX_BATCHES; i++)); do
  echo "Processing batch $i of $MAX_BATCHES"

  inform "$BASE_URL" \
    --max-pages $BATCH_SIZE \
    --output-dir "batches/batch-$i" \
    --delay 1000 \
    --concurrency 3

  # Process batch immediately
  catalog --input "batches/batch-$i" \
    --output "processed/batch-$i" \
    --validate

  echo "Batch $i complete. Waiting 2 minutes..."
  sleep 120
done

# Combine all batches
echo "Combining all batches..."
mkdir -p final-output
cp -r processed/batch-*/* final-output/

# Generate final index
catalog --input final-output \
  --output complete-site \
  --sitemap \
  --index \
  --base-url https://docs.example.com
