gather
High-performance content ingestion
Turn the web into a dataset. Crawl sites, download repos, and ingest feeds with a single tool. Automatic HTML-to-Markdown conversion makes everything AI-ready.
Three Modes, One Tool
Unified content ingestion from multiple sources
Web Crawler
Concurrent crawling with respect for robots.txt. Converts HTML to clean Markdown using Turndown. Smart content extraction removes navigation and ads.
Git Downloader
Download specific subdirectories from GitHub repositories without cloning entire history. Supports sparse checkout for minimal data transfer.
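Sparse checkout is a standard git feature; as a point of reference, the manual equivalent of what this mode automates looks roughly like the following (the repository URL and `docs/` path are placeholders):

```shell
# Shallow, blob-less clone: file contents are not fetched up front
git clone --depth 1 --filter=blob:none --sparse https://github.com/owner/repo
cd repo
# Materialize only the docs/ subtree
git sparse-checkout set docs
```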
Feed Ingestion
Monitor RSS/Atom feeds, YouTube channels, Bluesky profiles, and X/Twitter feeds for new content. Automatic Markdown conversion.
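As a sketch, the mode is selected by the kind of URL you pass. The `--output-dir` flag appears in the pipeline example later on this page; the feed URL here is a hypothetical placeholder:

```shell
# Crawl a documentation site
gather https://docs.example.com --output-dir docs-content

# Download from a GitHub repository via sparse checkout
gather https://github.com/owner/repo

# Ingest an RSS/Atom feed
gather https://example.com/feed.xml
```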
Advanced Features
Production-ready crawling with enterprise capabilities
Powered by Bun
Runs on Bun, which is typically faster than Node.js thanks to built-in optimizations. Native DOM parsing and zero-dependency HTML processing maximize throughput.
robots.txt Support
Automatically fetches and honors robots.txt directives, including Disallow rules, Crawl-delay, and wildcard patterns.
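For reference, a robots.txt that gather would honor might look like this: `Crawl-delay` throttles request spacing, and `Disallow` excludes paths (wildcards per the widely adopted extension to the original standard):

```
User-agent: *
Crawl-delay: 2
Disallow: /admin/
Disallow: /*.pdf$
```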
YAML Configuration
Use config files for complex crawling setups. Define targets, patterns, and feed settings in gather.yaml.
Smart Content Extraction
Intelligently identifies main content using semantic HTML (main, article). Preserves structure and removes non-content elements.
Structure Preservation
Maintains original folder structure from URLs. Code examples properly converted to markdown code blocks with language detection.
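For example, URL paths map onto the output directory roughly as follows (the exact file-naming scheme, such as `index.md` for directory-style URLs, is an assumption):

```
https://docs.example.com/guide/intro      ->  crawled-content/guide/intro.md
https://docs.example.com/guide/advanced/  ->  crawled-content/guide/advanced/index.md
```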
Flexible Filtering
Include/exclude patterns with glob syntax. Filter by file types, directories, and custom patterns for precise control.
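A minimal sketch of the pattern options, using the same `include`/`exclude` keys as the configuration example on this page (glob semantics assumed to follow the common `*`/`**` conventions):

```yaml
include: ["docs/**/*.md", "*.txt"]        # Markdown under docs/ plus top-level text files
exclude: ["node_modules/**", "**/*.min.js"]
```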
Performance Metrics
Real-world crawling statistics
Installation
Get gather running in seconds
Quick Install
Automatically downloads the right binary for your platform
Bun Package
Native Bun package for superior performance
Docker
Containerized for consistent environments
Configuration Example
Use gather.yaml for complex setups
maxPages: 100
delay: 1000
concurrency: 3
outputDir: "./crawled-content"

# Target-specific overrides
targets:
  - url: "https://docs.example.com"
    maxPages: 50
    delay: 500
  - url: "https://github.com/owner/repo"
    include: ["*.md", "*.txt"]
    exclude: ["node_modules/**"]

# Feed mode settings
feed:
  limit: 100
  ytLang: "en"
  noYtTranscript: false
Integration with catalog
Complete documentation pipeline with fwdslsh tools
# Step 1: Crawl documentation site with Gather
gather https://docs.example.com --output-dir docs-content

# Step 2: Generate LLMS.txt files with @fwdslsh/catalog
catalog --input docs-content --output build --base-url https://docs.example.com \
  --sitemap --validate --index --toc
Benefits of this approach:
- Separation of concerns: Gather focuses on high-quality web crawling and Markdown conversion
- Flexibility: Use @fwdslsh/catalog's advanced LLMS.txt generation with any Markdown content
- Maintainability: Each tool optimized for its specific purpose
- Reusability: Generated Markdown can be used for multiple purposes beyond LLMS.txt generation
Ready to Gather Content?
Transform web content into clean, usable Markdown