Command Reference

Complete CLI options and advanced features

Basic Syntax

gather [URL] [OPTIONS]
gather --feed [FEED_URL] [OPTIONS]

Web & Git Crawling Options

| Option | Short | Default | Description |
|---|---|---|---|
| --output-dir | -o | crawled-pages | Output directory for saved files |
| --max-pages | | 100 | Maximum number of pages to crawl |
| --delay | | 1000 | Delay between requests in milliseconds |
| --concurrency | | 3 | Number of concurrent requests |
| --max-queue-size | | 10000 | Maximum URLs in queue before skipping new links |
| --max-retries | | 3 | Maximum retry attempts for failed requests |
| --ignore-robots | | false | Ignore robots.txt directives (use with caution) |
| --raw | | false | Output raw HTML without Markdown conversion |
| --include | | * | Include files matching glob pattern (can be used multiple times) |
| --exclude | | | Exclude files matching glob pattern (can be used multiple times) |
| --ignore-errors | | false | Exit with code 0 even if some pages fail |
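
For example, a polite crawl of a large documentation site might combine several of these options (the URL and values below are illustrative):

# Crawl up to 500 pages, two seconds apart, with two parallel requests
gather https://docs.example.com \
  --output-dir ./site-docs \
  --max-pages 500 \
  --delay 2000 \
  --concurrency 2 \
  --exclude "**/changelog/**"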

Feed Mode Options

Feed ingestion is enabled with the --feed flag and supports RSS, Atom, YouTube, Bluesky, and X/Twitter sources.

| Option | Default | Description |
|---|---|---|
| --feed | false | Enable feed ingestion mode |
| --limit | 50 | Maximum number of items to ingest |
| --yt-lang | 'en' | YouTube transcript language code |
| --no-yt-transcript | false | Skip YouTube transcript extraction |
| --x-rss-template | default | Custom X/Twitter RSS template URL |
| --bsky-api-base | default | Custom Bluesky API endpoint |
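
A typical invocation points --feed at a feed URL and caps the number of items. The feed URLs below are illustrative; check that they match the feed forms your sources actually expose:

# Ingest the 20 most recent items from an RSS feed
gather --feed https://example.com/blog/rss.xml --limit 20

# Ingest a YouTube channel feed with Spanish transcripts
gather --feed "https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID" --yt-lang es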

General Options

| Option | Short | Description |
|---|---|---|
| --config | | Path to YAML configuration file (defaults to gather.yaml) |
| --init | | Generate sample configuration file to stdout |
| --verbose | | Enable verbose logging (detailed output including retries, skipped files, and queue status) |
| --quiet | -q | Enable quiet mode (errors only, no progress messages) |
| --help | -h | Show help message |
| --version | -v | Show version information |
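
For example, you might turn on verbose logging while troubleshooting a crawl and quiet mode in scheduled jobs (the URL is illustrative):

# Detailed output, including retries and skipped files
gather https://docs.example.com --verbose

# Errors only, suitable for cron jobs and CI
gather https://docs.example.com --quiet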

Advanced Features

Production-ready crawling capabilities

robots.txt Support

Gather automatically respects robots.txt files, following web crawling best practices.

Note: Use --ignore-robots only if you have explicit permission from the site owner. Ignoring robots.txt may breach a site's terms of service.
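
If you do have permission to bypass it, the override is a single flag (the URL is illustrative):

# Only with the site owner's explicit permission
gather https://staging.example.com --ignore-robots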

Content Extraction Strategy

Gather intelligently extracts main content from web pages using semantic HTML:

  1. Searches for <main> element (HTML5 semantic element)
  2. Checks for [role="main"] ARIA attribute
  3. Looks for common content class names (e.g., .main-content, .content, .article)
  4. Finds <article> elements for blog posts
  5. Detects Bootstrap-style containers
  6. Falls back to <body> content if nothing found

Content Cleanup

Non-content boilerplate elements are automatically removed during extraction.

Configuration Files

Use YAML configuration files for complex crawling setups. Create gather.yaml in your project root:

# gather.yaml
# Global settings
maxPages: 100
delay: 1000
concurrency: 3
outputDir: "./crawled-content"

# Target-specific overrides
targets:
  - url: "https://docs.example.com"
    maxPages: 50
    delay: 500

  - url: "https://github.com/owner/repo"
    include: ["*.md", "*.txt"]
    exclude: ["node_modules/**", "**/test/**"]

# Feed mode settings
feed:
  limit: 100
  ytLang: "en"
  noYtTranscript: false

Using Configuration Files

# Auto-detect gather.yaml in current directory
gather

# Specify custom configuration file
gather --config production-crawl.yaml

# Generate sample configuration
gather --init > gather.yaml

Markdown Conversion

Gather converts HTML to clean Markdown using Turndown.
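
By default, each crawled page is saved as Markdown; pass --raw if you want the original HTML instead (the URL is illustrative):

# Default: pages are converted to Markdown
gather https://docs.example.com

# Skip conversion and keep the raw HTML
gather https://docs.example.com --raw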

Integration Patterns

Combine gather with other fwdslsh tools

Complete Documentation Pipeline

Use gather with catalog for AI-ready documentation:

# Step 1: Crawl documentation site with Gather
gather https://docs.example.com --output-dir ./docs

# Step 2: Generate llms.txt files with catalog
catalog ./docs --output build --base-url https://docs.example.com \
  --sitemap --validate --index --toc

# Step 3: Use with AI tools
# Now feed build/llms.txt to your LLM for context

CI/CD Integration

#!/bin/bash
# docs-sync.sh, invoked from a CI workflow (e.g. GitHub Actions)

# Crawl latest documentation
gather https://docs.example.com \
  --max-pages 1000 \
  --output-dir ./docs

# Generate AI-ready artifacts
catalog ./docs \
  --output ./build \
  --base-url https://docs.example.com \
  --sitemap --validate

# Upload to deployment location
# ...

Multi-Source Crawling

# gather.yaml configuration for multiple sources
maxPages: 50
delay: 1000

targets:
  # Main documentation site
  - url: "https://docs.example.com"
    outputDir: "./docs/official"

  # Community wiki
  - url: "https://wiki.example.com"
    outputDir: "./docs/community"

  # GitHub repository docs
  - url: "https://github.com/owner/repo/tree/main/docs"
    include: ["*.md"]
    exclude: ["**/draft/**"]
    outputDir: "./docs/github"

GitHub Integration

Download repositories and specific directories

Authentication

For authenticated access to private repositories or higher rate limits:

# Set GitHub token (unauthenticated: 60 req/hour, authenticated: 5,000 req/hour)
export GITHUB_TOKEN="your_github_personal_access_token"

# Crawl with authenticated access
gather https://github.com/owner/repo

Sparse Checkout

Download only specific directories using Git sparse checkout:

# Download entire repository
gather https://github.com/owner/repo

# Download specific directory
gather https://github.com/owner/repo/tree/main/docs

# Download specific file path
gather https://github.com/owner/repo/blob/main/README.md

File Filtering

# Include only markdown files
gather https://github.com/owner/repo --include "*.md"

# Exclude node_modules and test directories
gather https://github.com/owner/repo \
  --exclude "node_modules/**" \
  --exclude "**/*.test.md" \
  --exclude "**/__tests__/**"

# Multiple include patterns
gather https://github.com/owner/repo \
  --include "*.md" \
  --include "*.txt" \
  --include "*.json"

Environment Variables

Configure behavior via environment variables

| Variable | Description |
|---|---|
| GITHUB_TOKEN | GitHub Personal Access Token for API authentication |
| X_BEARER_TOKEN | X/Twitter Bearer token for feed ingestion |
| HTTP_PROXY | HTTP proxy URL for web requests |
| HTTPS_PROXY | HTTPS proxy URL for web requests |
| NO_PROXY | Comma-separated list of hosts to bypass proxy |
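
For example, to crawl from behind a corporate proxy while using an authenticated GitHub token (the proxy host and token values are illustrative):

export GITHUB_TOKEN="your_github_personal_access_token"
export HTTPS_PROXY="http://proxy.internal:8080"
export NO_PROXY="localhost,127.0.0.1"

gather https://github.com/owner/repo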

Exit Codes

CLI exit codes for automation

| Code | Meaning |
|---|---|
| 0 | Success - all pages/files crawled successfully |
| 1 | General failure - unrecoverable error |
| 2 | Usage error - invalid arguments or options |
| 3 | Network error - connection failed, timeout, or DNS issue |
| 4 | Partial success - some pages failed but operation completed |
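
In automation you can branch on these codes, for example to treat partial success as a warning rather than a hard failure (a sketch; the URL and handling policy are illustrative):

gather https://docs.example.com --output-dir ./docs
status=$?

if [ "$status" -eq 0 ]; then
  echo "Crawl completed successfully"
elif [ "$status" -eq 4 ]; then
  echo "Warning: some pages failed, continuing anyway"
else
  echo "Crawl failed with exit code $status" >&2
  exit "$status"
fi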