gather

High-performance content ingestion

Turn the web into a dataset: crawl sites, download repos, and ingest feeds with a single tool. Automatic HTML-to-Markdown conversion makes everything AI-ready.

$ gather https://docs.example.com
> Crawling docs.example.com...
> Discovered 45 pages within same domain
> Processing with concurrency: 5
> Converted 45 pages to Markdown
> Saved to ./content/docs.example.com/
$ gather https://github.com/owner/repo/tree/main/docs
> Downloading via sparse checkout...
> Processing docs/*.md files...
> Retrieved 12 files
> Saved to ./content/repo-docs/
$ gather --feed https://example.com/feed.xml
> Detected RSS feed
> Processing 50 recent items...
> Extracting main content from articles...
> Converted 50 articles to Markdown
> Saved to ./content/feed-articles/

Three Modes, One Tool

Unified content ingestion from multiple sources

🕷️

Web Crawler

Concurrent crawling with respect for robots.txt. Converts HTML to clean Markdown using Turndown. Smart content extraction removes navigation and ads.

⬇️

Git Downloader

Download specific subdirectories from GitHub repositories without cloning entire history. Supports sparse checkout for minimal data transfer.
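
Conceptually, this is the sparse-checkout workflow that gather automates. Done by hand with plain git (2.25 or newer), it looks roughly like this sketch:

# Partial clone: skip blob download up front, start with a sparse working tree
git clone --filter=blob:none --sparse https://github.com/owner/repo
cd repo

# Materialize only the docs/ subdirectory
git sparse-checkout set docs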

📡

Feed Ingestion

Monitor RSS/Atom feeds, YouTube channels, Bluesky profiles, and X/Twitter feeds for new content. Automatic Markdown conversion.
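
The demo above shows the RSS invocation. Assuming the same --feed flag accepts channel and profile URLs directly (an assumption, not confirmed here), the other sources would look like:

# RSS/Atom feed (confirmed by the demo above)
gather --feed https://example.com/feed.xml

# YouTube channel (assumed: same flag, channel URL)
gather --feed https://www.youtube.com/@examplechannel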

Advanced Features

Production-ready crawling with enterprise capabilities

Powered by Bun

Significantly faster than Node.js with built-in optimizations. Native DOM parsing and zero-dependency HTML processing for maximum performance.

🤖

robots.txt Support

Automatically fetches and honors robots.txt directives, including Disallow rules, Crawl-delay, and wildcard patterns.
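
For example, given a robots.txt like the one below, the crawler would skip everything under /admin/, apply the wildcard rule to PDF URLs, and pause between requests per the Crawl-delay directive:

User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
Crawl-delay: 2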

📝

YAML Configuration

Use config files for complex crawling setups. Define targets, patterns, and feed settings in gather.yaml.

🎯

Smart Content Extraction

Intelligently identifies main content using semantic HTML (main, article). Preserves structure and removes non-content elements.
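
As an illustration, in a page shaped like the sketch below, only the article subtree survives extraction; the nav and aside chrome is dropped:

<body>
  <nav>site menu (removed)</nav>
  <article>
    <h1>Page Title (kept)</h1>
    <p>Main content (kept)</p>
  </article>
  <aside>ads and related links (removed)</aside>
</body>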

🗂️

Structure Preservation

Maintains the original folder structure from URLs. Code examples are properly converted to Markdown code blocks with language detection.
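
For instance, with the default ./content layout from the demo above (the exact file naming is an assumption), crawled URLs map to output paths like:

https://docs.example.com/guide/install  →  ./content/docs.example.com/guide/install.md
https://docs.example.com/api/reference  →  ./content/docs.example.com/api/reference.md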

🔧

Flexible Filtering

Include/exclude patterns with glob syntax. Filter by file types, directories, and custom patterns for precise control.
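
A few illustrative globs (the include/exclude keys appear in the Configuration Example below; these specific patterns are examples, not defaults):

include:
  - "docs/**/*.md"     # Markdown files anywhere under docs/
  - "*.txt"            # top-level text files
exclude:
  - "node_modules/**"  # skip dependency trees
  - "**/*.test.*"      # skip test fixtures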

Performance Metrics

Real-world crawling statistics

  • 3x faster than Node.js crawlers
  • 0 external dependencies
  • 95% content extraction accuracy
  • 5MB binary size

Installation

Get gather running in seconds

Quick Install

# One-line install script
curl -fsSL https://raw.githubusercontent.com/fwdslsh/gather/main/install.sh | sh

# Start crawling
gather https://docs.example.com

Automatically downloads the right binary for your platform

📦

Bun Package

# Install globally
bun install -g @fwdslsh/gather

# Or add to project
bun add @fwdslsh/gather

Native Bun package for superior performance

🐳

Docker

# Pull latest image
docker pull fwdslsh/gather:latest

# Run crawler
docker run --rm fwdslsh/gather https://docs.example.com

Containerized for consistent environments

Configuration Example

Use gather.yaml for complex setups

# Global settings
maxPages: 100
delay: 1000
concurrency: 3
outputDir: "./crawled-content"

# Target-specific overrides
targets:
  - url: "https://docs.example.com"
    maxPages: 50
    delay: 500
  - url: "https://github.com/owner/repo"
    include: ["*.md", "*.txt"]
    exclude: ["node_modules/**"]

# Feed mode settings
feed:
  limit: 100
  ytLang: "en"
  noYtTranscript: false

Integration with catalog

Complete documentation pipeline with fwdslsh tools

# Step 1: Crawl documentation site with Gather
gather https://docs.example.com --output-dir docs-content

# Step 2: Generate LLMS.txt files with @fwdslsh/catalog
catalog --input docs-content --output build --base-url https://docs.example.com \
  --sitemap --validate --index --toc

Benefits of this approach:

  • Separation of concerns: Gather focuses on high-quality web crawling and Markdown conversion
  • Flexibility: Use @fwdslsh/catalog's advanced LLMS.txt generation with any Markdown content
  • Maintainability: Each tool is optimized for its specific purpose
  • Reusability: Generated Markdown can be used for multiple purposes beyond LLMS.txt generation

Ready to Gather Content?

Transform web content into clean, usable Markdown