inform Examples - Real-World Web Content Extraction

Real-world patterns for web content extraction
Practical examples, proven workflows, and integration patterns for extracting, converting, and organizing web content with inform.

Documentation Migration Patterns

Proven workflows for migrating documentation between platforms

📖 Legacy Platform Migration

Intermediate

Complete workflow for migrating documentation from an old platform to a modern static site generator.

Step 1: Analysis and Planning

# Analyze the current site structure
inform https://old-docs.company.com \
  --max-pages 5 \
  --verbose \
  --output-dir analysis

# Review extracted content structure
ls analysis/
head -20 analysis/index.md

Step 2: Full Content Extraction

# Extract all documentation with conservative settings
inform https://old-docs.company.com \
  --output-dir migrated-docs \
  --max-pages 200 \
  --delay 2000 \
  --concurrency 2 \
  --include "*/docs/*" \
  --include "*/guide/*" \
  --exclude "*/internal/*" \
  --exclude "*/admin/*"

Step 3: Content Organization

# Organize content for new platform
mkdir -p new-docs-site/content/{guides,api,tutorials}

# Move content to appropriate sections
mv migrated-docs/guide/* new-docs-site/content/guides/
mv migrated-docs/api/* new-docs-site/content/api/
mv migrated-docs/tutorial/* new-docs-site/content/tutorials/

# Generate index files
catalog --input new-docs-site/content \
  --output new-docs-site/indexed \
  --sitemap \
  --base-url https://docs.company.com

Step 4: Build and Deploy

# Build with unify
cd new-docs-site
unify build --input indexed --output dist

# Generate professional commit message
giv message
# "docs: migrate legacy documentation to modern platform"

# Deploy
git add . && git commit -m "$(giv message)"
git push origin main

Expected Results

  • Organized content structure preserving original hierarchy
  • Clean Markdown files with proper metadata
  • Generated sitemap and navigation structure
  • Professional deployment with AI-generated commit messages
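
A quick way to sanity-check these results before building (a sketch only; the --- frontmatter delimiter is an assumption about the extracted files, so adjust it to what your version of inform actually emits):

# Count migrated pages and list any files that appear to lack a metadata block
find migrated-docs -name "*.md" | wc -l
grep -rL --include="*.md" -e '^---' migrated-docs | head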

🔄 Multi-Source Consolidation

Advanced

Combine documentation from multiple sources into a unified knowledge base.

Extract from Multiple Sources

# Extract from primary documentation
inform https://docs.mainproduct.com \
  --output-dir sources/main-docs \
  --max-pages 150 \
  --delay 1000

# Extract API documentation
inform https://api.mainproduct.com/docs \
  --output-dir sources/api-docs \
  --max-pages 50 \
  --include "*/reference/*" \
  --include "*/endpoints/*"

# Extract community tutorials
inform https://community.mainproduct.com \
  --output-dir sources/community \
  --max-pages 100 \
  --include "*/tutorial/*" \
  --include "*/howto/*"

# Extract GitHub repository docs
inform https://github.com/company/product/tree/main/docs \
  --output-dir sources/github-docs \
  --include "*.md"

Consolidate and Structure

# Create unified structure
mkdir -p consolidated/{core,api,tutorials,community}

# Organize by content type
cp -r sources/main-docs/* consolidated/core/
cp -r sources/api-docs/* consolidated/api/
cp -r sources/community/* consolidated/community/
cp -r sources/github-docs/* consolidated/core/

# Generate comprehensive index
catalog --input consolidated \
  --output unified-knowledge-base \
  --index \
  --sitemap \
  --validate \
  --base-url https://knowledge.company.com

Create Cross-References

# Generate relationship mapping
find consolidated -name "*.md" | while read -r file; do
  echo "Processing: $file"
  # Add cross-reference metadata
  # (Custom script to identify related content)
done

# Build final knowledge base
unify build --input unified-knowledge-base --output kb-site
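
The loop above leaves the actual cross-referencing to a custom script. A minimal stand-in for that placeholder, which treats each file's name as its topic and appends links to up to three other files that mention it (assumes GNU grep; the heuristic is purely illustrative and not part of inform or catalog):

# Hypothetical cross-reference pass: append a "Related content" list to each page
find consolidated -name "*.md" | while read -r file; do
  topic=$(basename "$file" .md | tr '-' ' ')
  related=$(grep -ril --include="*.md" "$topic" consolidated | grep -Fxv "$file" | head -3)
  if [ -n "$related" ]; then
    {
      echo ""
      echo "## Related content"
      echo "$related" | sed 's/^/- /'
    } >> "$file"
  fi
done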

Benefits

  • Comprehensive documentation coverage
  • Consistent formatting across sources
  • Unified search and navigation
  • Automated cross-referencing

Content Research Patterns

Intelligence gathering and competitive analysis workflows

🔍 Competitive Analysis

Intermediate

Multi-Competitor Intelligence

#!/bin/bash
# competitive-research.sh

competitors=(
  "https://competitor1.com/docs"
  "https://competitor2.com/help"
  "https://competitor3.com/guides"
  "https://competitor4.com/api"
)

# Extract from each competitor
for url in "${competitors[@]}"; do
  domain=$(echo $url | sed 's/https:\/\///' | sed 's/\/.*$//')
  echo "Analyzing $domain..."

  inform "$url" \
    --output-dir "research/$domain" \
    --max-pages 30 \
    --delay 2000 \
    --include "*/docs/*" \
    --include "*/guide/*" \
    --include "*/api/*"

  echo "Completed $domain analysis"
  sleep 60  # Respectful delay between competitors
done

# Generate comparative analysis
catalog --input research \
  --output competitive-analysis \
  --index \
  --validate

echo "Competitive research complete!"
echo "Review results in: competitive-analysis/"

Analysis and Insights

# Generate content comparison
find research -name "*.md" | head -20 | xargs wc -w > word-counts.txt

# Create feature comparison matrix
# (Custom analysis script)
python analyze-features.py research/ > feature-matrix.csv

# Generate summary report
echo "# Competitive Analysis Report" > analysis-report.md
echo "" >> analysis-report.md
echo "## Content Volume Analysis" >> analysis-report.md
cat word-counts.txt >> analysis-report.md
echo "" >> analysis-report.md
echo "## Feature Coverage Matrix" >> analysis-report.md
cat feature-matrix.csv >> analysis-report.md
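
analyze-features.py above is a placeholder for whatever analysis you run. A rough shell-only version of the same idea, counting how many pages per competitor mention each feature keyword (the keyword list is illustrative and GNU grep is assumed):

# Hypothetical stand-in for analyze-features.py: keyword hits per competitor
features="sso webhooks audit-log sandbox rate-limit"
{
  echo "competitor,$(echo $features | tr ' ' ',')"
  for dir in research/*/; do
    row=$(basename "$dir")
    for f in $features; do
      hits=$(grep -ril --include="*.md" "$f" "$dir" | wc -l)
      row="$row,$hits"
    done
    echo "$row"
  done
} > feature-matrix.csv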

Research Insights

  • Content structure and organization comparison
  • Feature coverage and documentation depth
  • User experience and information architecture
  • Messaging and positioning analysis

📊 Industry Knowledge Mining

Advanced

Domain-Specific Content Extraction

# Industry-specific sources
industry_sources=(
  "https://techblog.company1.com"
  "https://engineering.company2.com"
  "https://blog.company3.com"
  "https://medium.com/@industry-expert"
)

# Extract industry insights
for source in "${industry_sources[@]}"; do
  domain=$(echo $source | sed 's/https:\/\///' | sed 's/\/.*$//')

  inform "$source" \
    --output-dir "industry-insights/$domain" \
    --max-pages 50 \
    --delay 1500 \
    --include "*technical*" \
    --include "*engineering*" \
    --include "*architecture*" \
    --exclude "*job*" \
    --exclude "*hiring*"
done

# Process and categorize content
catalog --input industry-insights \
  --output processed-insights \
  --optional "drafts/**/*" \
  --validate

Content Processing and Analysis

# Generate topic clustering
# (Requires additional analysis tools)
python cluster-topics.py processed-insights/ > topic-clusters.json

# Extract technical patterns
grep -r "architecture\|design pattern\|best practice" processed-insights/ > technical-patterns.txt

# Create trending analysis
python analyze-trends.py processed-insights/ > trend-analysis.json

# Generate final report
python generate-industry-report.py \
  --topics topic-clusters.json \
  --patterns technical-patterns.txt \
  --trends trend-analysis.json \
  --output industry-knowledge-report.md

Integration Workflows

Combining inform with other fwdslsh tools for powerful pipelines

🔄 Complete Documentation Pipeline

End-to-end workflow from web crawling to deployment

Step 1: Content Extraction (inform)

# Extract from multiple documentation sources
inform https://old-docs.example.com \
  --output-dir raw-content \
  --max-pages 100 \
  --delay 1000

inform https://api-docs.example.com \
  --output-dir raw-content/api \
  --max-pages 50 \
  --include "*/reference/*"

Step 2: Content Indexing (catalog)

# Generate structured indexes and navigation
catalog --input raw-content \
  --output structured-content \
  --sitemap \
  --index \
  --base-url https://new-docs.example.com \
  --validate

Step 3: Site Generation (unify)

# Build modern static site with navigation
unify build \
  --input structured-content \
  --output production-site \
  --optimize

Step 4: Version Control (giv)

# Professional commit with AI assistance
git add .
giv message
# "docs: migrate and modernize documentation platform"
git commit -m "$(giv message)"
git push origin main
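
The four steps above can also be chained into a single script so the whole pipeline is repeatable; a minimal sketch reusing the same placeholder URLs and directories (error handling beyond set -e is left out):

#!/bin/bash
# docs-pipeline.sh - end-to-end sketch of the pipeline above
set -euo pipefail

# 1. Extract content
inform https://old-docs.example.com \
  --output-dir raw-content \
  --max-pages 100 \
  --delay 1000

# 2. Generate indexes and navigation
catalog --input raw-content \
  --output structured-content \
  --sitemap \
  --index \
  --base-url https://new-docs.example.com

# 3. Build the static site
unify build --input structured-content --output production-site

# 4. Commit and push
git add .
git commit -m "$(giv message)"
git push origin main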

Pipeline Benefits

  • Automated content migration and structuring
  • SEO-optimized site generation with navigation
  • Professional version control and documentation
  • Repeatable and scalable process

🤖 AI-Ready Content Pipeline

Prepare content for AI training and RAG applications

High-Quality Content Extraction

# Extract comprehensive, high-quality content
inform https://comprehensive-docs.example.com \
  --output-dir ai-training-content \
  --max-pages 500 \
  --delay 800 \
  --include "*/docs/*" \
  --include "*/guide/*" \
  --include "*/tutorial/*" \
  --include "*/reference/*" \
  --exclude "*/blog/*" \
  --exclude "*/news/*"

Content Structuring and Validation

# Generate AI-optimized indexes with validation
catalog --input ai-training-content \
  --output ai-ready-content \
  --validate \
  --optional "examples/**/*" \
  --optional "advanced/**/*"

# Quality check the content
find ai-ready-content -name "*.md" | xargs wc -w | tail -1
echo "Content validation complete"

AI Integration Preparation

# Create training data sets
mkdir -p ai-datasets/{training,validation,context}

# Split content for different AI purposes
cp ai-ready-content/llms.txt ai-datasets/context/
cp ai-ready-content/llms-full.txt ai-datasets/training/
cp ai-ready-content/llms-ctx.txt ai-datasets/validation/

# Generate metadata for AI systems
echo "AI-ready content prepared with structured indexes"
ls -la ai-datasets/*/
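
If the content is destined for a RAG index, the consolidated file usually needs to be chunked before embedding. One minimal approach with GNU split (the 4000-character chunk size is an arbitrary example, and llms-full.txt is the catalog output copied above):

# Split the consolidated text into ~4000-character, line-aligned chunks
mkdir -p ai-datasets/chunks
split -C 4000 -d --additional-suffix=.txt \
  ai-datasets/training/llms-full.txt \
  ai-datasets/chunks/chunk-
ls ai-datasets/chunks | wc -l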

AI-Optimized Features

  • Clean, structured content without noise
  • Consistent formatting for AI processing
  • Hierarchical organization for context
  • Quality validation and completeness checks

Specialized Extraction Patterns

Advanced techniques for specific content types and platforms

📱 GitHub Repository Documentation

Beginner

Repository Documentation Extraction

# Extract documentation from GitHub repositories
inform https://github.com/facebook/react/tree/main/docs \
  --output-dir react-docs \
  --include "*.md" \
  --include "*.mdx"

# Extract from multiple open source projects
projects=(
  "https://github.com/vuejs/vue/tree/dev/docs"
  "https://github.com/angular/angular/tree/main/docs"
  "https://github.com/sveltejs/svelte/tree/master/site/content/docs"
)

for project in "${projects[@]}"; do
  project_name=$(echo $project | sed 's/.*github.com\/\([^\/]*\)\/\([^\/]*\)\/.*/\1-\2/')

  inform "$project" \
    --output-dir "oss-docs/$project_name" \
    --include "*.md" \
    --include "*.mdx"
done
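
Crawling the GitHub web UI works, but for repositories a shallow clone is often faster and avoids rate limits; a sketch that yields the same Markdown files (the docs/ path mirrors the URL above and may differ per repository):

# Alternative: shallow-clone the repo and copy its Markdown docs directly
git clone --depth 1 https://github.com/facebook/react.git tmp-react
mkdir -p react-docs
cp -r tmp-react/docs/. react-docs/
rm -rf tmp-react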

Documentation Comparison

# Generate comparative documentation analysis
catalog --input oss-docs \
  --output framework-comparison \
  --index \
  --sitemap \
  --base-url https://framework-docs-comparison.dev

# Build comparison site
unify build \
  --input framework-comparison \
  --output comparison-site

🛒 E-commerce Content Mining

Advanced

Product Information Extraction

# Extract product documentation and guides
inform https://help.shopify.com \
  --output-dir ecommerce-content/shopify \
  --max-pages 200 \
  --include "*/manual/*" \
  --include "*/themes/*" \
  --exclude "*/billing/*"

inform https://docs.woocommerce.com \
  --output-dir ecommerce-content/woocommerce \
  --max-pages 150 \
  --include "*/document/*" \
  --include "*/tutorial/*"

# Extract best practices and guides
inform https://ecommerce-platforms.com \
  --output-dir ecommerce-content/best-practices \
  --max-pages 100 \
  --include "*/guide/*" \
  --include "*/best-practice/*"

Content Organization

# Organize by platform and topic (create the per-platform targets before copying)
mkdir -p ecommerce-knowledge/{platforms/{shopify,woocommerce},guides,tutorials}

# Categorize content
cp -r ecommerce-content/shopify/* ecommerce-knowledge/platforms/shopify/
cp -r ecommerce-content/woocommerce/* ecommerce-knowledge/platforms/woocommerce/
cp -r ecommerce-content/best-practices/* ecommerce-knowledge/guides/

# Generate comprehensive e-commerce knowledge base
catalog --input ecommerce-knowledge \
  --output ecommerce-kb \
  --index \
  --sitemap \
  --validate \
  --base-url https://ecommerce-knowledge.dev

🔬 Technical Blog Aggregation

Intermediate

Engineering Blog Extraction

# Extract from major tech company blogs
tech_blogs=(
  "https://engineering.fb.com"
  "https://blog.google/technology"
  "https://eng.uber.com"
  "https://medium.engineering"
  "https://netflixtechblog.com"
)

for blog in "${tech_blogs[@]}"; do
  domain=$(echo $blog | sed 's/https:\/\///' | sed 's/\/.*$//' | sed 's/\./-/g')

  inform "$blog" \
    --output-dir "tech-blogs/$domain" \
    --max-pages 50 \
    --delay 2000 \
    --include "*engineering*" \
    --include "*technical*" \
    --include "*architecture*" \
    --exclude "*job*" \
    --exclude "*career*"
done

Content Aggregation and Analysis

# Create unified tech blog archive
catalog --input tech-blogs \
  --output tech-insights \
  --index \
  --validate

# Generate trend analysis
find tech-blogs -name "*.md" -exec grep -l "microservices\|kubernetes\|serverless" {} \; > trending-topics.txt

# Create searchable archive
unify build \
  --input tech-insights \
  --output tech-blog-archive

echo "Tech blog aggregation complete!"
echo "Archive available in: tech-blog-archive/"
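
To turn trending-topics.txt into rough per-keyword counts, the same search can be run once per term (keywords reused from the command above; purely illustrative):

# Count how many posts mention each topic
for kw in microservices kubernetes serverless; do
  echo "$kw: $(grep -rl --include='*.md' "$kw" tech-blogs | wc -l) posts"
done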

Automation Scripts

Ready-to-use scripts for common inform workflows

📅 Scheduled Content Monitoring

Monitor websites for content changes and updates

#!/bin/bash
# content-monitor.sh - Monitor websites for changes

# Configuration
SITES=(
  "https://docs.example.com"
  "https://api.example.com/docs"
  "https://help.example.com"
)
BACKUP_DIR="/backup/content-monitoring"
DATE=$(date +%Y-%m-%d)

# Create daily backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Monitor each site
for site in "${SITES[@]}"; do
  site_name=$(echo $site | sed 's/https:\/\///' | sed 's/\/.*$//' | sed 's/\./-/g')
  echo "Monitoring: $site"

  # Extract current content
  inform "$site" \
    --output-dir "$BACKUP_DIR/$DATE/$site_name" \
    --max-pages 50 \
    --delay 1000

  # Compare with previous day if exists
  previous_date=$(date -d "yesterday" +%Y-%m-%d)
  if [ -d "$BACKUP_DIR/$previous_date/$site_name" ]; then
    echo "Comparing with previous extraction..."
    diff -r "$BACKUP_DIR/$previous_date/$site_name" "$BACKUP_DIR/$DATE/$site_name" > "$BACKUP_DIR/$DATE/$site_name-changes.txt"

    if [ -s "$BACKUP_DIR/$DATE/$site_name-changes.txt" ]; then
      echo "Changes detected in $site_name!"
      # Send notification (email, Slack, etc.)
      # notify-send "Content Changes" "Changes detected in $site_name"
    fi
  fi
done

echo "Content monitoring complete for $DATE"
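
To make the monitoring actually scheduled, the script can be run from cron; an example crontab entry that runs it nightly at 02:00 (the install path and log location are placeholders):

# crontab -e
0 2 * * * /usr/local/bin/content-monitor.sh >> /var/log/content-monitor.log 2>&1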

🔄 Automated Documentation Sync

Keep local documentation in sync with web sources

#!/bin/bash
# doc-sync.sh - Automated documentation synchronization

# Configuration
SOURCE_URL="https://docs.upstream.com"
LOCAL_DOCS="./docs"
BACKUP_DIR="./docs-backup"
SYNC_LOG="./sync.log"

echo "$(date): Starting documentation sync" >> $SYNC_LOG

# Create backup of current docs
if [ -d "$LOCAL_DOCS" ]; then
  echo "Creating backup..."
  cp -r "$LOCAL_DOCS" "$BACKUP_DIR-$(date +%Y%m%d-%H%M%S)"
fi

# Extract latest documentation
echo "Extracting latest documentation..."
inform "$SOURCE_URL" \
  --output-dir "$LOCAL_DOCS-new" \
  --max-pages 200 \
  --delay 1000 \
  --include "*/docs/*" \
  --include "*/guide/*"

# Check if extraction was successful
if [ $? -eq 0 ]; then
  echo "Extraction successful, updating local docs..."

  # Replace old docs with new
  rm -rf "$LOCAL_DOCS"
  mv "$LOCAL_DOCS-new" "$LOCAL_DOCS"

  # Generate index
  catalog --input "$LOCAL_DOCS" \
    --output "$LOCAL_DOCS-indexed" \
    --sitemap \
    --index

  # Build site
  unify build \
    --input "$LOCAL_DOCS-indexed" \
    --output "./public"

  # Commit changes with giv
  git add .
  commit_message=$(giv message)
  git commit -m "$commit_message"

  echo "$(date): Sync completed successfully" >> $SYNC_LOG
else
  echo "$(date): Sync failed during extraction" >> $SYNC_LOG
  exit 1
fi
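
If doc-sync.sh is also triggered from cron, it is worth guarding against overlapping runs; one common pattern using flock from util-linux (the lock file path is a placeholder):

# Skip this run if a previous sync is still in progress
flock -n /tmp/doc-sync.lock ./doc-sync.sh || echo "Sync already running, skipping"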

Troubleshooting Examples

Solutions for common inform challenges and edge cases

🚫 Handling Rate-Limited Sites

# Conservative crawling for rate-sensitive sites
inform https://rate-limited-site.com \
  --delay 5000 \
  --concurrency 1 \
  --max-pages 25 \
  --verbose

# Multi-session approach for large sites
sessions=(
  "*/docs/*"
  "*/api/*"
  "*/guide/*"
)

for session in "${sessions[@]}"; do
  echo "Processing session: $session"

  inform https://large-site.com \
    --include "$session" \
    --output-dir "./content/$(echo "$session" | sed 's/[^a-zA-Z0-9]//g')" \
    --delay 3000 \
    --max-pages 50

  echo "Waiting 10 minutes before next session..."
  sleep 600
done
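
When a site intermittently rejects requests even at low concurrency, wrapping the crawl in a simple retry-with-backoff loop helps; a sketch reusing the conservative settings above (retry counts and sleep times are arbitrary examples, and it assumes inform exits non-zero on failure):

# Retry the conservative crawl up to three times with growing pauses
for attempt in 1 2 3; do
  if inform https://rate-limited-site.com \
       --delay 5000 \
       --concurrency 1 \
       --max-pages 25; then
    break
  fi
  echo "Attempt $attempt failed; backing off..."
  sleep $((attempt * 300))
done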

🔧 Custom Content Detection

# When standard content detection fails
# Test with a smaller sample first
inform https://unusual-site.com/sample-page \
  --max-pages 1 \
  --verbose

# Review extracted content
head -50 crawled-pages/sample-page.md

# Adjust strategy based on results
# (Future: Custom selector support)
inform https://unusual-site.com \
  --output-dir custom-extraction \
  --max-pages 20 \
  --delay 2000

📦 Large-Scale Content Processing

#!/bin/bash
# large-scale-crawl.sh - handling very large sites efficiently

BASE_URL="https://massive-docs.com"
BATCH_SIZE=50
MAX_BATCHES=20

for ((i=1; i<=MAX_BATCHES; i++)); do
  echo "Processing batch $i of $MAX_BATCHES"

  inform "$BASE_URL" \
    --max-pages $BATCH_SIZE \
    --output-dir "batches/batch-$i" \
    --delay 1000 \
    --concurrency 3

  # Process batch immediately
  catalog --input "batches/batch-$i" \
    --output "processed/batch-$i" \
    --validate

  echo "Batch $i complete. Waiting 2 minutes..."
  sleep 120
done

# Combine all batches
echo "Combining all batches..."
mkdir -p final-output
cp -r processed/batch-*/* final-output/

# Generate final index
catalog --input final-output \
  --output complete-site \
  --sitemap \
  --index \
  --base-url https://docs.example.com
