Getting Started with gather

Installation and basic usage guide

Installation

Quick Install Script

# One-line install script
curl -fsSL https://raw.githubusercontent.com/fwdslsh/gather/main/install.sh | sh

# Verify installation
gather --version

The script detects your platform (Linux, macOS, or Windows), downloads the matching binary, and installs it to a directory on your PATH.

Bun Package

# Install globally with Bun
bun install -g @fwdslsh/gather

# Or add to specific project
bun add @fwdslsh/gather

Docker

# Pull latest image
docker pull fwdslsh/gather:latest

# Run crawler
docker run --rm fwdslsh/gather https://docs.example.com
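Files written inside the container disappear when it exits, so for real runs you will usually want to mount a host directory. The sketch below assumes the image accepts the documented `--output-dir` flag and can write to an arbitrary mount point (`/output` here is an assumption, not something the image mandates):

```shell
# Create a host directory and mount it into the container so the
# crawled files persist after the run. The /output path inside the
# container is an assumed mount point; adjust to taste.
mkdir -p crawled
docker run --rm \
  -v "$(pwd)/crawled:/output" \
  fwdslsh/gather https://docs.example.com --output-dir /output
```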

Build from Source

# Clone repository
git clone https://github.com/fwdslsh/gather.git
cd gather

# Install dependencies
bun install

# Build for current platform
bun run build

# Run from source
bun src/cli.js https://docs.example.com

Basic Usage

Web Crawling

Crawl any website with automatic link discovery:

# Basic crawl
gather https://docs.example.com

# With options
gather https://docs.example.com \
  --max-pages 100 \
  --delay 500 \
  --concurrency 5 \
  --output-dir ./docs

Git Repository Downloads

Download specific directories from GitHub repositories without cloning the entire history:

# Download entire repository
gather https://github.com/owner/repo

# Download specific directory
gather https://github.com/owner/repo/tree/main/docs

# With filtering
gather https://github.com/owner/repo \
  --include "*.md" \
  --exclude "node_modules/**"

GitHub Authentication

For unauthenticated requests, GitHub limits you to 60 requests per hour. With authentication, this increases to 5,000 requests per hour:

# Set GitHub token
export GITHUB_TOKEN="your_github_personal_access_token"

# Crawl with authenticated access
gather https://github.com/owner/repo
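Long crawls are painful to restart after a rate-limit stall, so it can be worth checking the token before you begin. This is plain POSIX shell around the `GITHUB_TOKEN` variable shown above, not a gather feature:

```shell
# Warn up front if GITHUB_TOKEN is unset: unauthenticated GitHub
# requests are capped at 60/hour versus 5,000/hour with a token.
if [ -z "${GITHUB_TOKEN:-}" ]; then
  echo "warning: GITHUB_TOKEN not set (rate limit: 60 requests/hour)" >&2
else
  echo "GITHUB_TOKEN set (rate limit: 5,000 requests/hour)"
fi
```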

Feed Ingestion

Ingest content from RSS/Atom feeds, YouTube channels, Bluesky profiles, and X/Twitter feeds:

# RSS/Atom feeds (auto-detected)
gather --feed https://example.com/feed.xml
gather --feed https://blog.example.com/rss

# YouTube channels and playlists
gather --feed https://www.youtube.com/c/ExampleChannel
gather --feed https://www.youtube.com/playlist?list=PLExample

# Bluesky profiles
gather --feed https://bsky.app/profile/example.bsky.social

# X/Twitter profiles (requires Bearer token)
export X_BEARER_TOKEN="your_x_bearer_token"
gather --feed https://x.com/example_user
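To pull several sources in one pass, a small wrapper loop over the documented `--feed` and `--limit` flags works; the URLs below are placeholders, so swap in your own feeds:

```shell
# Ingest a list of feeds one after another; --limit caps each source.
# (URLs are placeholders; the flags are the documented feed-mode options.)
for feed in \
  "https://example.com/feed.xml" \
  "https://bsky.app/profile/example.bsky.social"
do
  gather --feed "$feed" --limit 25
done
```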

Configuration

Command Line Options

# Web & Git Crawling Options
-o, --output-dir     Output directory (default: crawled-pages)
--max-pages          Maximum pages to crawl (default: 100)
--delay              Delay between requests in ms (default: 1000)
--concurrency        Number of concurrent requests (default: 3)
--max-queue-size     Maximum URLs in queue (default: 10000)
--max-retries        Maximum retry attempts (default: 3)
--ignore-robots      Ignore robots.txt directives (use with caution)
--raw                Output raw HTML without Markdown conversion
--include            Include files matching glob pattern
--exclude            Exclude files matching glob pattern
--ignore-errors      Exit with code 0 even if some pages fail
--verbose            Enable verbose logging
--quiet              Suppress non-essential output

# Feed Mode Options
--feed               Enable feed ingestion mode
--limit              Maximum items to ingest (default: 50)
--yt-lang            YouTube transcript language (default: 'en')
--no-yt-transcript   Skip YouTube transcript extraction
--x-rss-template     Custom X RSS template URL
--bsky-api-base      Custom Bluesky API endpoint

# Configuration File
--config             Path to YAML configuration file
--init               Generate sample configuration

YAML Configuration File

Create a gather.yaml file in your project root for complex setups:

# gather.yaml
# Global settings
maxPages: 100
delay: 1000
concurrency: 3
outputDir: "./crawled-content"

# Target-specific overrides
targets:
  - url: "https://docs.example.com"
    maxPages: 50
    delay: 500

  - url: "https://github.com/owner/repo"
    include: ["*.md", "*.txt"]
    exclude: ["node_modules/**"]

# Feed mode settings
feed:
  limit: 100
  ytLang: "en"
  noYtTranscript: false
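The documented `--init` and `--config` flags tie this together. A typical flow is to generate a starter file, edit it, and then point gather at it; the guard below simply skips the run if no gather.yaml was produced:

```shell
# Generate a sample config, edit it, then crawl with it.
gather --init
if [ -f gather.yaml ]; then
  gather --config gather.yaml
fi
```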

Understanding Output

Folder Structure

Gather maintains the original website structure in the output directory:

crawled-content/
├── docs.example.com/
│   ├── index.md                    # Root page
│   ├── api/
│   │   ├── authentication.md       # /api/authentication
│   │   └── endpoints.md            # /api/endpoints
│   ├── guides/
│   │   └── setup.md                # /guides/setup
│   └── tutorials/
│       └── index.md                # /tutorials
└── github.com/
    └── owner/
        └── repo/
            └── main/
                └── docs/
                    ├── README.md
                    ├── api.md
                    └── guide.md
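A quick way to sanity-check a finished crawl is to list the converted Markdown files under the output directory. The `mkdir`/`touch` lines below just mock part of the sample layout above so the snippet can run anywhere; after a real crawl, only the `find` line is needed:

```shell
# Mock part of the sample layout (stand-in for a real crawl), then
# list every converted Markdown file under the output directory.
mkdir -p crawled-content/docs.example.com/api
touch crawled-content/docs.example.com/index.md
touch crawled-content/docs.example.com/api/authentication.md
find crawled-content -type f -name '*.md' | sort
```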

File Naming

Next Steps