inform Documentation - Complete Reference Guide

Complete reference guide for high-performance web content extraction
Master inform's features, from basic crawling to advanced workflows and integrations. Everything you need to extract, convert, and organize web content efficiently.

Command Line Reference

Complete guide to all inform command-line options and parameters

Basic Syntax

inform <URL> [OPTIONS]

Extract content from a website starting at the specified URL.

Essential Options

--output-dir <directory>

Specify the output directory for extracted content. Creates directory if it doesn't exist.

inform https://docs.example.com --output-dir ./content

--max-pages <number>

Limit the maximum number of pages to crawl. Useful for large sites or testing.

inform https://blog.example.com --max-pages 50

--delay <milliseconds>

Add delay between requests to be respectful to target servers.

inform https://api-docs.com --delay 1000

Content Processing Options

--selector <css-selector>

Target specific content using CSS selectors. Perfect for extracting main content areas.

inform https://news.site --selector "article.main-content"

--include-links

Preserve internal links in the extracted Markdown content.

inform https://wiki.example.com --include-links

--follow-external

Follow links to external domains (use with caution).

inform https://linkfarm.site --follow-external --max-pages 10

--exclude-patterns <patterns>

Exclude URLs matching specified patterns (comma-separated).

inform https://docs.site --exclude-patterns "/admin,/login,*.pdf"

Performance and Rate Limiting

--concurrent <number>

Number of concurrent requests (default: 3). Higher values crawl faster but place more load on the target server.

inform https://fast-site.com --concurrent 10

--timeout <seconds>

Request timeout in seconds. Useful for slow or unreliable sites.

inform https://slow-site.com --timeout 30

--user-agent <string>

Custom User-Agent string for requests.

inform https://api.site --user-agent "MyBot/1.0"

Advanced Features

Powerful capabilities for complex content extraction workflows

🎯 Content Quality Enhancement

inform automatically cleans and processes content for optimal readability and consistency.

Smart Content Detection

Automatically identifies main content areas, removing navigation, ads, and boilerplate.

Markdown Conversion

Converts HTML to clean, standards-compliant Markdown with proper formatting.
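
As an illustration (a hypothetical fragment, not inform's literal output), the conversion maps HTML structure onto the equivalent Markdown:

```text
<!-- Input HTML -->
<h2>Getting Started</h2>
<p>Install the <a href="/docs/cli">CLI</a> first.</p>

## Getting Started

Install the [CLI](/docs/cli) first.
```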

Link Processing

Intelligently handles internal and external links, with options for link preservation.

Image Handling

Downloads images and updates references in Markdown output (optional).
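
These behaviors combine with the options documented above. For example (flags as documented in this guide; the URL is hypothetical):

```shell
# Target the article body, keep internal links, and cap the crawl size
inform https://docs.example.com \
  --selector "article.main-content" \
  --include-links \
  --max-pages 100 \
  --output-dir ./content
```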

🏗️ Site Structure Analysis

Advanced crawling strategies that understand website architecture.

Sitemap Detection

Automatically discovers and uses XML sitemaps for comprehensive coverage.
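
Sitemap discovery follows the standard sitemap protocol. A minimal `sitemap.xml` that a crawler like inform can consume looks like this (URLs hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/getting-started</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://docs.example.com/cli-reference</loc>
  </url>
</urlset>
```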

Robots.txt Compliance

Respects robots.txt rules and crawl-delay directives.
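
For example, given a `robots.txt` like the following, a compliant crawler skips everything under `/admin/` and waits at least 10 seconds between requests:

```text
User-agent: *
Disallow: /admin/
Crawl-delay: 10
```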

URL Pattern Recognition

Intelligently follows URL patterns to discover content systematically.

Duplicate Detection

Avoids crawling duplicate content and similar pages.

Integration and Workflows

Connect inform with other tools and systems for powerful content workflows

🔄 fwdslsh Ecosystem

Seamlessly integrates with catalog, unify, and giv for complete documentation workflows from extraction to publication.

inform https://docs.site --output-dir content/
catalog --input content/ --output indexed/
unify build --input indexed/ --output site/

Performance Optimization

Optimize crawling for large sites with concurrent requests, rate limiting, and intelligent content targeting.

inform https://large-site.com \
  --max-pages 1000 \
  --concurrent 5 \
  --delay 500
🛠️ Troubleshooting

Handle JavaScript-heavy sites, rate limiting, and authentication with advanced configuration options.

inform https://spa-site.com \
  --selector "main, .content" \
  --timeout 30
