Getting Started with gather
Installation and basic usage guide
Installation
Quick Install Script
```shell
# One-line install script
curl -fsSL https://raw.githubusercontent.com/fwdslsh/gather/main/install.sh | sh

# Verify installation
gather --version
```
This automatically downloads the correct binary for your platform (Linux, macOS, Windows) and installs it to your PATH.
Bun Package
```shell
# Install globally with Bun
bun install -g @fwdslsh/gather

# Or add to a specific project
bun add @fwdslsh/gather
```
Docker
```shell
# Pull latest image
docker pull fwdslsh/gather:latest

# Run crawler
docker run --rm fwdslsh/gather https://docs.example.com
```
Build from Source
```shell
# Clone repository
git clone https://github.com/fwdslsh/gather.git
cd gather

# Install dependencies
bun install

# Build for current platform
bun run build

# Run from source
bun src/cli.js https://docs.example.com
```
Basic Usage
Web Crawling
Crawl any website with automatic link discovery:
```shell
# Basic crawl
gather https://docs.example.com

# With options
gather https://docs.example.com \
  --max-pages 100 \
  --delay 500 \
  --concurrency 5 \
  --output-dir ./docs
```
Git Repository Downloads
Download specific directories from GitHub repositories without cloning the entire history:
```shell
# Download entire repository
gather https://github.com/owner/repo

# Download specific directory
gather https://github.com/owner/repo/tree/main/docs

# With filtering
gather https://github.com/owner/repo \
  --include "*.md" \
  --exclude "node_modules/**"
```
GitHub Authentication
For unauthenticated requests, GitHub limits you to 60 requests per hour. With authentication, this increases to 5,000 requests per hour:
```shell
# Set GitHub token
export GITHUB_TOKEN="your_github_personal_access_token"

# Crawl with authenticated access
gather https://github.com/owner/repo
```
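As a back-of-envelope check, those hourly limits translate into a minimum average spacing between requests, which is useful when choosing a `--delay` value. This is plain shell arithmetic over the 60 and 5,000 per-hour figures above, not output from gather itself:

```shell
# Milliseconds in an hour divided by the hourly quota gives the
# minimum average spacing between requests to stay under the limit.
echo $(( 3600000 / 60 ))     # unauthenticated: 60000 ms between requests
echo $(( 3600000 / 5000 ))   # authenticated: 720 ms between requests
```

So with a token set, a `--delay` of roughly 720 ms or more keeps a single crawl within the authenticated limit; without one, GitHub crawls of any size are impractical.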
Feed Ingestion
Ingest content from RSS/Atom feeds, YouTube channels, Bluesky profiles, and X/Twitter feeds:
```shell
# RSS/Atom feeds (auto-detected)
gather --feed https://example.com/feed.xml
gather --feed https://blog.example.com/rss

# YouTube channels and playlists
gather --feed https://www.youtube.com/c/ExampleChannel
gather --feed https://www.youtube.com/playlist?list=PLExample

# Bluesky profiles
gather --feed https://bsky.app/profile/example.bsky.social

# X/Twitter profiles (requires Bearer token)
export X_BEARER_TOKEN="your_x_bearer_token"
gather --feed https://x.com/example_user
```
Configuration
Command Line Options
```text
# Web & Git Crawling Options
-o, --output-dir     Output directory (default: crawled-pages)
--max-pages          Maximum pages to crawl (default: 100)
--delay              Delay between requests in ms (default: 1000)
--concurrency        Number of concurrent requests (default: 3)
--max-queue-size     Maximum URLs in queue (default: 10000)
--max-retries        Maximum retry attempts (default: 3)
--ignore-robots      Ignore robots.txt directives (use with caution)
--raw                Output raw HTML without Markdown conversion
--include            Include files matching glob pattern
--exclude            Exclude files matching glob pattern
--ignore-errors      Exit with code 0 even if some pages fail
--verbose            Enable verbose logging
--quiet              Suppress non-essential output

# Feed Mode Options
--feed               Enable feed ingestion mode
--limit              Maximum items to ingest (default: 50)
--yt-lang            YouTube transcript language (default: 'en')
--no-yt-transcript   Skip YouTube transcript extraction
--x-rss-template     Custom X RSS template URL
--bsky-api-base      Custom Bluesky API endpoint

# Configuration File
--config             Path to YAML configuration file
--init               Generate sample configuration
```
YAML Configuration File
Create a gather.yaml file in your project root for complex setups:
```yaml
# gather.yaml

# Global settings
maxPages: 100
delay: 1000
concurrency: 3
outputDir: "./crawled-content"

# Target-specific overrides
targets:
  - url: "https://docs.example.com"
    maxPages: 50
    delay: 500
  - url: "https://github.com/owner/repo"
    include: ["*.md", "*.txt"]
    exclude: ["node_modules/**"]

# Feed mode settings
feed:
  limit: 100
  ytLang: "en"
  noYtTranscript: false
```
Understanding Output
Folder Structure
Gather maintains the original website structure in the output directory:
```text
crawled-content/
├── docs.example.com/
│   ├── index.md                  # Root page
│   ├── api/
│   │   ├── authentication.md     # /docs/api/authentication
│   │   └── endpoints.md          # /docs/api/endpoints
│   ├── guides/
│   │   └── setup.md              # /docs/guides/setup
│   └── tutorials/
│       └── index.md              # /docs/tutorials
└── github.com/
    └── owner/
        └── repo/
            └── main/
                └── docs/
                    ├── README.md
                    ├── api.md
                    └── guide.md
```
File Naming
- Root pages become `index.md`
- Query parameters are included in filenames (e.g., `page?id=123.md`)
- Markdown files end in `.md` or `.mdx`
- Raw HTML files end in `.html` when using `--raw`
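The naming rules above can be sketched in plain POSIX shell. This is an illustrative approximation of the scheme, not gather's actual implementation; the `url_to_path` function is made up for this example:

```shell
# Approximate how a crawled URL maps to an output file path.
url_to_path() {
  u="${1#*://}"          # strip the scheme, leaving host/path?query
  u="${u%/}"             # drop a trailing slash
  host="${u%%/*}"        # everything before the first slash is the host
  rest="${u#"$host"}"    # the remainder is the page path
  rest="${rest#/}"
  if [ -z "$rest" ]; then
    printf '%s/index.md\n' "$host"        # root pages become index.md
  else
    printf '%s/%s.md\n' "$host" "$rest"   # query strings stay in the filename
  fi
}

url_to_path "https://docs.example.com/"              # docs.example.com/index.md
url_to_path "https://docs.example.com/api/endpoints" # docs.example.com/api/endpoints.md
url_to_path "https://docs.example.com/page?id=123"   # docs.example.com/page?id=123.md
```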