inform - High-Performance Web Content Crawler


Crawl websites, extract main content, and convert to clean Markdown. Powered by Bun for maximum performance with concurrent processing and zero dependencies.

Example crawl:
https://docs.example.com/             12.3KB
https://docs.example.com/api/auth      8.7KB
https://docs.example.com/guides/setup 15.2KB
https://docs.example.com/tutorials/   crawling...
42 pages crawled · 2.3MB content extracted · 847ms total time · 5 concurrent

Built for Speed and Reliability

Modern web crawling without the complexity

🚀

Powered by Bun

Significantly faster than Node.js with built-in optimizations. Native DOM parsing and zero-dependency HTML processing for maximum performance.

Concurrent Crawling

Process multiple pages simultaneously with configurable concurrency limits. Intelligent rate limiting and backoff strategies.
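Both limits can be tuned per run with the --concurrency and --delay flags shown in the configuration examples below; the values here are illustrative, not defaults:

# Crawl with 5 concurrent requests and a 250ms delay between them (illustrative values)
inform https://docs.example.com --concurrency 5 --delay 250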

🎯

Smart Content Extraction

Intelligently identifies main content by removing navigation, ads, and other non-content elements. Preserves structure and formatting.

📝

Clean Markdown Output

Converts HTML to properly formatted Markdown. Code examples become code blocks, and heading hierarchy and link structure are preserved.

🗂️

Structure Preservation

Maintains original folder structure from URLs. /docs/api/ becomes docs/api.md with meaningful filenames based on content.
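As a rough sketch of the resulting layout (the exact filenames depend on the pages crawled; the tree in the comments is illustrative, not documented output):

# Crawl a docs site into a local directory
inform https://docs.example.com --output-dir docs
# Illustrative result, mirroring the URL structure:
#   docs/index.md
#   docs/api/auth.md
#   docs/guides/setup.md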

🔧

Flexible Configuration

Configurable delays, concurrency limits, include/exclude patterns, and output formats. Works with any website structure.
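A combined invocation, stitching together the flags shown elsewhere on this page (the particular combination is illustrative):

inform https://docs.example.com \
  --include "*.md" --exclude "temp/*" \
  --concurrency 10 --delay 100 \
  --output-dir docs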

Multiple Content Sources

Crawl from websites, Git repositories, and more

🌐

Website Crawling

inform https://docs.example.com

Crawl any website with automatic link discovery and same-domain restriction

📂

Git Repository Downloads

inform github.com/owner/repo/tree/main/docs

Download specific directories from GitHub repositories without cloning the entire repo

🎯

Pattern-Based Filtering

inform site.com --include "*.md" --exclude "temp/*"

Use glob patterns to include or exclude specific content during crawling

🔧

Custom Configuration

inform site.com --concurrency 10 --delay 100

Fine-tune performance with custom concurrency limits and request delays

Installation

Get inform running in seconds

Quick Install

# One-line install script
curl -fsSL https://raw.githubusercontent.com/fwdslsh/inform/main/install.sh | sh

# Start crawling immediately
inform https://docs.example.com

Automatically downloads the right binary for your platform

📦

Bun Package

# Install globally with Bun
bun install -g @fwdslsh/inform

# Or add to project
bun add @fwdslsh/inform

Takes advantage of Bun's performance and built-in features

🐳

Docker

# Pull latest image
docker pull fwdslsh/inform:latest

# Run crawler
docker run --rm fwdslsh/inform https://docs.example.com

Containerized for consistent environments and CI/CD
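Note that the container writes output to its own filesystem, so a bind mount is typically needed to keep the crawled files on the host; the /output path below is an assumption, not a documented default:

# Mount a host directory so the crawled Markdown persists (the /output path is an assumption)
docker run --rm -v "$PWD/docs:/output" fwdslsh/inform https://docs.example.com --output-dir /output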

Performance Comparison

Why inform is the fastest choice for web crawling

3x faster than Node.js crawlers
0 external dependencies
95% content extraction accuracy
5MB binary size
Feature                 inform       Puppeteer    Scrapy       wget
Content Extraction      ✓ Smart      ✓ Full       ✓ Custom     ✗ Raw Only
Markdown Output         ✓ Built-in   ✗ Manual     ✗ Manual     ✗ None
Concurrent Processing   ✓ Native     ✓ Heavy      ✓ Yes        ✗ Sequential
Memory Usage            ✓ Low        ✗ High       ✓ Medium     ✓ Low
Setup Complexity        ✓ Zero       ✗ Complex    ✗ Complex    ✓ Simple

Integration Workflow

Perfect for documentation pipelines and content management

🌐

Crawl Content

Extract content from websites or repositories

📋

Generate Index

Use catalog to create llms.txt files

🏗️

Build Site

Create static sites with unify

🚀

Deploy

Publish to your hosting platform

# Complete documentation workflow
inform https://docs.example.com --output-dir docs
catalog --input docs --output build --sitemap --base-url https://docs.example.com
unify build --input build --output dist
giv message # AI-powered commit message for the updates

Ready to Start Crawling?

Transform web content into clean, usable Markdown