gather

High-performance content ingestion

Turn the web into a dataset: crawl sites, download repos, and ingest feeds with a single tool. Automatic HTML-to-Markdown conversion makes everything AI-ready.

$ gather https://docs.example.com
> Crawling docs.example.com...
> Discovered 45 pages within same domain
> Processing with concurrency: 5
> Converted 45 pages to Markdown
> Saved to ./content/docs.example.com/
$ gather https://github.com/owner/repo/tree/main/docs
> Downloading via sparse checkout...
> Processing docs/*.md files...
> Retrieved 12 files
> Saved to ./content/repo-docs/
$ gather --feed https://example.com/feed.xml
> Detected RSS feed
> Processing 50 recent items...
> Extracting main content from articles...
> Converted 50 articles to Markdown
> Saved to ./content/feed-articles/

Three Modes, One Tool

Unified content ingestion from multiple sources

🕷️

Web Crawler

Concurrent crawling with respect for robots.txt. Converts HTML to clean Markdown using Turndown. Smart content extraction removes navigation and ads.

⬇️

Git Downloader

Download specific subdirectories from GitHub repositories without cloning entire history. Supports sparse checkout for minimal data transfer.
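
Conceptually, this is the sparse-checkout workflow that gather automates. Done by hand with plain git (2.25 or newer), it looks roughly like this sketch:

# Partial clone: skip blob download up front, start with a sparse working tree
git clone --filter=blob:none --sparse https://github.com/owner/repo
cd repo

# Materialize only the docs/ subdirectory
git sparse-checkout set docs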

📡

Feed Ingestion

Monitor RSS/Atom feeds, YouTube channels, Bluesky profiles, and X/Twitter feeds for new content. Automatic Markdown conversion.
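
The demo above shows the RSS invocation. Assuming the same --feed flag accepts channel and profile URLs directly (an assumption, not confirmed here), the other sources would look like:

# RSS/Atom feed (confirmed by the demo above)
gather --feed https://example.com/feed.xml

# YouTube channel (assumed: same flag, channel URL)
gather --feed https://www.youtube.com/@examplechannel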

Advanced Features

Production-ready crawling with enterprise capabilities

Powered by Bun

Significantly faster than Node.js with built-in optimizations. Native DOM parsing and zero-dependency HTML processing for maximum performance.

🤖

robots.txt Support

Automatically fetches and honors robots.txt directives, including Disallow rules, Crawl-delay, and wildcard patterns.
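
For example, given a robots.txt like the one below, the crawler would skip everything under /admin/, apply the wildcard rule to PDF URLs, and pause between requests per the Crawl-delay directive:

User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
Crawl-delay: 2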

📝

YAML Configuration

Use config files for complex crawling setups. Define targets, patterns, and feed settings in gather.yaml.

🎯

Smart Content Extraction

Intelligently identifies main content using semantic HTML (main, article). Preserves structure and removes non-content elements.
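
As an illustration, in a page shaped like the sketch below, only the article subtree survives extraction; the nav and aside chrome is dropped:

<body>
  <nav>site menu (removed)</nav>
  <article>
    <h1>Page Title (kept)</h1>
    <p>Main content (kept)</p>
  </article>
  <aside>ads and related links (removed)</aside>
</body>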

🗂️

Structure Preservation

Maintains the original folder structure from URLs. Code examples are properly converted to Markdown code blocks with language detection.
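
For instance, with the default ./content layout from the demo above (the exact file naming is an assumption), crawled URLs map to output paths like:

https://docs.example.com/guide/install  →  ./content/docs.example.com/guide/install.md
https://docs.example.com/api/reference  →  ./content/docs.example.com/api/reference.md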

🔧

Flexible Filtering

Include/exclude patterns with glob syntax. Filter by file types, directories, and custom patterns for precise control.
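
A few illustrative globs (the include/exclude keys appear in the Configuration Example below; these specific patterns are examples, not defaults):

include:
  - "docs/**/*.md"     # Markdown files anywhere under docs/
  - "*.txt"            # top-level text files
exclude:
  - "node_modules/**"  # skip dependency trees
  - "**/*.test.*"      # skip test fixtures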

Performance Metrics

Real-world crawling statistics

  • 3x faster than Node.js crawlers
  • 0 external dependencies
  • 95% content extraction accuracy
  • 5MB binary size

Installation

Get gather running in seconds

Quick Install

# One-line install script
curl -fsSL https://raw.githubusercontent.com/fwdslsh/gather/main/install.sh | sh

# Start crawling
gather https://docs.example.com

Automatically downloads the right binary for your platform

📦

Bun Package

# Install globally
bun install -g @fwdslsh/gather

# Or add to project
bun add @fwdslsh/gather

Native Bun package for superior performance

🐳

Docker

# Pull latest image
docker pull fwdslsh/gather:latest

# Run crawler
docker run --rm fwdslsh/gather https://docs.example.com

Containerized for consistent environments

Configuration Example

Use gather.yaml for complex setups

# Global settings
maxPages: 100
delay: 1000
concurrency: 3
outputDir: "./crawled-content"

# Target-specific overrides
targets:
  - url: "https://docs.example.com"
    maxPages: 50
    delay: 500
  - url: "https://github.com/owner/repo"
    include: ["*.md", "*.txt"]
    exclude: ["node_modules/**"]

# Feed mode settings
feed:
  limit: 100
  ytLang: "en"
  noYtTranscript: false

Integration with catalog

Complete documentation pipeline with fwdslsh tools

# Step 1: Crawl documentation site with Gather
gather https://docs.example.com --output-dir docs-content

# Step 2: Generate LLMS.txt files with @fwdslsh/catalog
catalog --input docs-content --output build --base-url https://docs.example.com \
  --sitemap --validate --index --toc

Benefits of this approach:

  • Separation of concerns: Gather focuses on high-quality web crawling and Markdown conversion
  • Flexibility: Use @fwdslsh/catalog's advanced LLMS.txt generation with any Markdown content
  • Maintainability: Each tool is optimized for its specific purpose
  • Reusability: Generated Markdown can be used for multiple purposes beyond LLMS.txt generation

Ready to Gather Content?

Transform web content into clean, usable Markdown