Command Reference

Complete CLI options and advanced features

Basic Syntax

gather [URL] [OPTIONS]
gather --feed [FEED_URL] [OPTIONS]

Web & Git Crawling Options

| Option | Short | Default | Description |
|---|---|---|---|
| --output-dir | -o | crawled-pages | Output directory for saved files |
| --max-pages | | 100 | Maximum number of pages to crawl |
| --delay | | 1000 | Delay between requests in milliseconds |
| --concurrency | | 3 | Number of concurrent requests |
| --max-queue-size | | 10000 | Maximum URLs in queue before skipping new links |
| --max-retries | | 3 | Maximum retry attempts for failed requests |
| --ignore-robots | | false | Ignore robots.txt directives (use with caution) |
| --raw | | false | Output raw HTML without Markdown conversion |
| --include | | * | Include files matching glob pattern (can be used multiple times) |
| --exclude | | | Exclude files matching glob pattern (can be used multiple times) |
| --ignore-errors | | false | Exit with code 0 even if some pages fail |
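
For example, a polite crawl of a large documentation site might combine several of these options (the URL and values below are illustrative):

# Crawl up to 500 pages, two seconds apart, with two parallel requests
gather https://docs.example.com \
  --output-dir ./site-docs \
  --max-pages 500 \
  --delay 2000 \
  --concurrency 2 \
  --exclude "**/changelog/**"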

Feed Mode Options

Feed ingestion is enabled with the --feed flag and supports RSS, Atom, YouTube, Bluesky, and X/Twitter sources.

| Option | Default | Description |
|---|---|---|
| --feed | false | Enable feed ingestion mode |
| --limit | 50 | Maximum number of items to ingest |
| --yt-lang | 'en' | YouTube transcript language code |
| --no-yt-transcript | false | Skip YouTube transcript extraction |
| --x-rss-template | default | Custom X/Twitter RSS template URL |
| --bsky-api-base | default | Custom Bluesky API endpoint |
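
A typical invocation points --feed at a feed URL and caps the number of items. The feed URLs below are illustrative; check that they match the feed forms your sources actually expose:

# Ingest the 20 most recent items from an RSS feed
gather --feed https://example.com/blog/rss.xml --limit 20

# Ingest a YouTube channel feed with Spanish transcripts
gather --feed "https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID" --yt-lang es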

General Options

| Option | Short | Description |
|---|---|---|
| --config | | Path to YAML configuration file (defaults to gather.yaml) |
| --init | | Generate sample configuration file to stdout |
| --verbose | | Enable verbose logging (detailed output including retries, skipped files, and queue status) |
| --quiet | -q | Enable quiet mode (errors only, no progress messages) |
| --help | -h | Show help message |
| --version | -v | Show version information |
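
For example, you might turn on verbose logging while troubleshooting a crawl and quiet mode in scheduled jobs (the URL is illustrative):

# Detailed output, including retries and skipped files
gather https://docs.example.com --verbose

# Errors only, suitable for cron jobs and CI
gather https://docs.example.com --quiet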

Advanced Features

Production-ready crawling capabilities

robots.txt Support

Gather automatically respects robots.txt files, following web crawling best practices.

Note: Use --ignore-robots only if you have explicit permission from the site owner. Ignoring robots.txt may breach a site's terms of service.
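
If you do have permission to bypass it, the override is a single flag (the URL is illustrative):

# Only with the site owner's explicit permission
gather https://staging.example.com --ignore-robots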

Content Extraction Strategy

Gather intelligently extracts main content from web pages using semantic HTML:

  1. Searches for <main> element (HTML5 semantic element)
  2. Checks for [role="main"] ARIA attribute
  3. Looks for common content class names (e.g., .main-content, .content, .article)
  4. Finds <article> elements for blog posts
  5. Detects Bootstrap-style containers
  6. Falls back to <body> content if nothing found

Content Cleanup

Non-content boilerplate elements are automatically removed during extraction.

Configuration Files

Use YAML configuration files for complex crawling setups. Create gather.yaml in your project root:

# gather.yaml
# Global settings
maxPages: 100
delay: 1000
concurrency: 3
outputDir: "./crawled-content"

# Target-specific overrides
targets:
  - url: "https://docs.example.com"
    maxPages: 50
    delay: 500

  - url: "https://github.com/owner/repo"
    include: ["*.md", "*.txt"]
    exclude: ["node_modules/**", "**/test/**"]

# Feed mode settings
feed:
  limit: 100
  ytLang: "en"
  noYtTranscript: false

Using Configuration Files

# Auto-detect gather.yaml in current directory
gather

# Specify custom configuration file
gather --config production-crawl.yaml

# Generate sample configuration
gather --init > gather.yaml

Markdown Conversion

Gather converts HTML to clean Markdown using Turndown.
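
By default, each crawled page is saved as Markdown; pass --raw if you want the original HTML instead (the URL is illustrative):

# Default: pages are converted to Markdown
gather https://docs.example.com

# Skip conversion and keep the raw HTML
gather https://docs.example.com --raw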

Integration Patterns

Combine gather with other fwdslsh tools

Complete Documentation Pipeline

Use gather with catalog for AI-ready documentation:

# Step 1: Crawl documentation site with Gather
gather https://docs.example.com --output-dir ./docs

# Step 2: Generate llms.txt files with catalog
catalog ./docs --output build --base-url https://docs.example.com \
  --sitemap --validate --index --toc

# Step 3: Use with AI tools
# Now feed build/llms.txt to your LLM for context

CI/CD Integration

#!/bin/bash
# docs-sync.sh, invoked from a CI workflow (e.g. GitHub Actions)

# Crawl latest documentation
gather https://docs.example.com \
  --max-pages 1000 \
  --output-dir ./docs

# Generate AI-ready artifacts
catalog ./docs \
  --output ./build \
  --base-url https://docs.example.com \
  --sitemap --validate

# Upload to deployment location
# ...

Multi-Source Crawling

# gather.yaml configuration for multiple sources
maxPages: 50
delay: 1000

targets:
  # Main documentation site
  - url: "https://docs.example.com"
    outputDir: "./docs/official"

  # Community wiki
  - url: "https://wiki.example.com"
    outputDir: "./docs/community"

  # GitHub repository docs
  - url: "https://github.com/owner/repo/tree/main/docs"
    include: ["*.md"]
    exclude: ["**/draft/**"]
    outputDir: "./docs/github"

GitHub Integration

Download repositories and specific directories

Authentication

For authenticated access to private repositories or higher rate limits:

# Set GitHub token (unauthenticated: 60 req/hour, authenticated: 5,000 req/hour)
export GITHUB_TOKEN="your_github_personal_access_token"

# Crawl with authenticated access
gather https://github.com/owner/repo

Sparse Checkout

Download only specific directories using Git sparse checkout:

# Download entire repository
gather https://github.com/owner/repo

# Download specific directory
gather https://github.com/owner/repo/tree/main/docs

# Download specific file path
gather https://github.com/owner/repo/blob/main/README.md

File Filtering

# Include only markdown files
gather https://github.com/owner/repo --include "*.md"

# Exclude node_modules and test directories
gather https://github.com/owner/repo \
  --exclude "node_modules/**" \
  --exclude "**/*.test.md" \
  --exclude "**/__tests__/**"

# Multiple include patterns
gather https://github.com/owner/repo \
  --include "*.md" \
  --include "*.txt" \
  --include "*.json"

Environment Variables

Configure behavior via environment variables

| Variable | Description |
|---|---|
| GITHUB_TOKEN | GitHub Personal Access Token for API authentication |
| X_BEARER_TOKEN | X/Twitter Bearer token for feed ingestion |
| HTTP_PROXY | HTTP proxy URL for web requests |
| HTTPS_PROXY | HTTPS proxy URL for web requests |
| NO_PROXY | Comma-separated list of hosts to bypass proxy |
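
For example, to crawl from behind a corporate proxy while using an authenticated GitHub token (the proxy host and token values are illustrative):

export GITHUB_TOKEN="your_github_personal_access_token"
export HTTPS_PROXY="http://proxy.internal:8080"
export NO_PROXY="localhost,127.0.0.1"

gather https://github.com/owner/repo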

Exit Codes

CLI exit codes for automation

| Code | Meaning |
|---|---|
| 0 | Success - all pages/files crawled successfully |
| 1 | General failure - unrecoverable error |
| 2 | Usage error - invalid arguments or options |
| 3 | Network error - connection failed, timeout, or DNS issue |
| 4 | Partial success - some pages failed but operation completed |
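
In automation you can branch on these codes, for example to treat partial success as a warning rather than a hard failure (a sketch; the URL and handling policy are illustrative):

gather https://docs.example.com --output-dir ./docs
status=$?

if [ "$status" -eq 0 ]; then
  echo "Crawl completed successfully"
elif [ "$status" -eq 4 ]; then
  echo "Warning: some pages failed, continuing anyway"
else
  echo "Crawl failed with exit code $status" >&2
  exit "$status"
fi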