Command Reference
Complete CLI options and advanced features
Basic Syntax
```bash
gather [URL] [OPTIONS]
gather --feed [FEED_URL] [OPTIONS]
```
Web & Git Crawling Options
| Option | Short | Default | Description |
|---|---|---|---|
| `--output-dir` | `-o` | `crawled-pages` | Output directory for saved files |
| `--max-pages` | | `100` | Maximum number of pages to crawl |
| `--delay` | | `1000` | Delay between requests in milliseconds |
| `--concurrency` | | `3` | Number of concurrent requests |
| `--max-queue-size` | | `10000` | Maximum URLs in queue before skipping new links |
| `--max-retries` | | `3` | Maximum retry attempts for failed requests |
| `--ignore-robots` | | `false` | Ignore robots.txt directives (use with caution) |
| `--raw` | | `false` | Output raw HTML without Markdown conversion |
| `--include` | | `*` | Include files matching glob pattern (can be used multiple times) |
| `--exclude` | | | Exclude files matching glob pattern (can be used multiple times) |
| `--ignore-errors` | | `false` | Exit with code 0 even if some pages fail |
Feed Mode Options
Feed ingestion is enabled with the --feed flag. It supports RSS, Atom, YouTube, Bluesky, and X/Twitter sources.
| Option | Default | Description |
|---|---|---|
| `--feed` | `false` | Enable feed ingestion mode |
| `--limit` | `50` | Maximum number of items to ingest |
| `--yt-lang` | `en` | YouTube transcript language code |
| `--no-yt-transcript` | `false` | Skip YouTube transcript extraction |
| `--x-rss-template` | default | Custom X/Twitter RSS template URL |
| `--bsky-api-base` | default | Custom Bluesky API endpoint |
General Options
| Option | Short | Description |
|---|---|---|
| `--config` | | Path to YAML configuration file (defaults to `gather.yaml`) |
| `--init` | | Generate sample configuration file to stdout |
| `--verbose` | | Enable verbose logging (detailed output including retries, skipped files, and queue status) |
| `--quiet` | `-q` | Enable quiet mode (errors only, no progress messages) |
| `--help` | `-h` | Show help message |
| `--version` | `-v` | Show version information |
Advanced Features
Production-ready crawling capabilities
robots.txt Support
Gather automatically respects robots.txt files to follow web crawling best practices:
- Automatic fetching: Fetches and parses robots.txt from target sites
- User agent compliance: Identifies as "Gather/1.0" user agent
- Directive respect: Honors Disallow, Allow, and Crawl-delay directives
- Wildcard support: Follows "*" wildcard rules for all user agents
- Logging: Logs blocked URLs with reason for transparency
Use --ignore-robots only if you have explicit permission from the site owner; ignoring robots.txt may breach a site's terms of service.
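As an illustration (the paths and delay are made up, not from any real site), a robots.txt like the one below would cause Gather to skip everything under /private/ except /private/public-notes/, and to wait 2 seconds between requests:

```
User-agent: *
Disallow: /private/
Allow: /private/public-notes/
Crawl-delay: 2
```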
Content Extraction Strategy
Gather intelligently extracts main content from web pages using semantic HTML:
- Searches for a `<main>` element (HTML5 semantic element)
- Checks for the `[role="main"]` ARIA attribute
- Looks for common content class names (e.g., `.main-content`, `.content`, `.article`)
- Finds `<article>` elements for blog posts
- Detects Bootstrap-style containers
- Falls back to `<body>` content if nothing is found
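The fallback idea can be sketched in a few lines of shell. This is not Gather's actual implementation (which works on a parsed DOM, not with `grep`); it only shows the ordering: prefer `<main>`, then `<article>`, then fall back to `<body>`.

```shell
# Toy HTML input for the sketch
html='<body><main>Main content</main></body>'

# Try selectors from most to least specific
if printf '%s' "$html" | grep -q '<main>'; then
  selector='main'        # HTML5 semantic element wins
elif printf '%s' "$html" | grep -q '<article>'; then
  selector='article'     # blog-post style content
else
  selector='body'        # last resort: the whole page body
fi

echo "$selector"
```

Running this prints `main`, since the toy input contains a `<main>` element.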
Content Cleanup
These elements are automatically removed during extraction:
- Navigation elements (`nav`, `.menu`, `.navigation`)
- Headers and footers
- Advertisement blocks (`.ad`, `.advertisement`)
- Social sharing buttons
- Comments sections
- Scripts and styles
Configuration Files
Use YAML configuration files for complex crawling setups. Create gather.yaml in your project root:
```yaml
# gather.yaml
# Global settings
maxPages: 100
delay: 1000
concurrency: 3
outputDir: "./crawled-content"

# Target-specific overrides
targets:
  - url: "https://docs.example.com"
    maxPages: 50
    delay: 500
  - url: "https://github.com/owner/repo"
    include: ["*.md", "*.txt"]
    exclude: ["node_modules/**", "**/test/**"]

# Feed mode settings
feed:
  limit: 100
  ytLang: "en"
  noYtTranscript: false
```
Using Configuration Files
```bash
# Auto-detect gather.yaml in current directory
gather

# Specify custom configuration file
gather --config production-crawl.yaml

# Generate sample configuration
gather --init > gather.yaml
```
Markdown Conversion
Gather converts HTML to clean Markdown using Turndown:
- Code blocks: Properly converted from `<pre>` and `<code>` elements with language detection
- Links: Converted to Markdown link syntax with relative URL preservation
- Tables: Converted to Markdown table format
- Headers: Preserved with correct nesting (h1-h6)
- Lists: Ordered and unordered lists maintained
- Images: Converted to Markdown image syntax
Integration Patterns
Combine gather with other fwdslsh tools
Complete Documentation Pipeline
Use gather with catalog for AI-ready documentation:
```bash
# Step 1: Crawl documentation site with Gather
gather https://docs.example.com --output-dir ./docs

# Step 2: Generate llms.txt files with catalog
catalog ./docs --output build --base-url https://docs.example.com \
  --sitemap --validate --index --toc

# Step 3: Use with AI tools
# Now feed build/llms.txt to your LLM for context
```
CI/CD Integration
```bash
#!/bin/bash
# Invoked from .github/workflows/docs-sync.yml

# Crawl latest documentation
gather https://docs.example.com \
  --max-pages 1000 \
  --output-dir ./docs

# Generate AI-ready artifacts
catalog ./docs \
  --output ./build \
  --base-url https://docs.example.com \
  --sitemap --validate

# Upload to deployment location
# ...
```
Multi-Source Crawling
```yaml
# gather.yaml configuration for multiple sources
maxPages: 50
delay: 1000

targets:
  # Main documentation site
  - url: "https://docs.example.com"
    outputDir: "./docs/official"

  # Community wiki
  - url: "https://wiki.example.com"
    outputDir: "./docs/community"

  # GitHub repository docs
  - url: "https://github.com/owner/repo/tree/main/docs"
    include: ["*.md"]
    exclude: ["**/draft/**"]
    outputDir: "./docs/github"
```
GitHub Integration
Download repositories and specific directories
Authentication
For authenticated access to private repositories or higher rate limits:
```bash
# Set GitHub token (unauthenticated: 60 req/hour, authenticated: 5,000 req/hour)
export GITHUB_TOKEN="your_github_personal_access_token"

# Crawl with authenticated access
gather https://github.com/owner/repo
```
Sparse Checkout
Download only specific directories using Git sparse checkout:
```bash
# Download entire repository
gather https://github.com/owner/repo

# Download specific directory
gather https://github.com/owner/repo/tree/main/docs

# Download specific file path
gather https://github.com/owner/repo/blob/main/README.md
```
File Filtering
```bash
# Include only markdown files
gather https://github.com/owner/repo --include "*.md"

# Exclude node_modules and test directories
gather https://github.com/owner/repo \
  --exclude "node_modules/**" \
  --exclude "**/*.test.md" \
  --exclude "**/__tests__/**"

# Multiple include patterns
gather https://github.com/owner/repo \
  --include "*.md" \
  --include "*.txt" \
  --include "*.json"
```
Environment Variables
Configure behavior via environment variables
| Variable | Description |
|---|---|
| `GITHUB_TOKEN` | GitHub Personal Access Token for API authentication |
| `X_BEARER_TOKEN` | X/Twitter Bearer token for feed ingestion |
| `HTTP_PROXY` | HTTP proxy URL for web requests |
| `HTTPS_PROXY` | HTTPS proxy URL for web requests |
| `NO_PROXY` | Comma-separated list of hosts to bypass proxy |
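For example, a corporate-network setup might route crawls through a proxy while bypassing internal hosts. The proxy URL and host names below are placeholders, not real defaults:

```shell
# Route HTTP and HTTPS requests through a proxy (placeholder address)
export HTTP_PROXY="http://proxy.example.internal:3128"
export HTTPS_PROXY="http://proxy.example.internal:3128"

# Bypass the proxy for local and internal hosts
export NO_PROXY="localhost,127.0.0.1,.internal.example.com"

echo "$NO_PROXY"
```

Any `gather` invocation in the same shell session then picks these up from the environment.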
Exit Codes
CLI exit codes for automation
| Code | Meaning |
|---|---|
| `0` | Success - all pages/files crawled successfully |
| `1` | General failure - unrecoverable error |
| `2` | Usage error - invalid arguments or options |
| `3` | Network error - connection failed, timeout, or DNS issue |
| `4` | Partial success - some pages failed but operation completed |
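A script can branch on these codes to treat partial success differently from hard failures. In this sketch, `(exit 4)` stands in for a real `gather` invocation that ended in partial success; replace it with your actual command:

```shell
# Stand-in for: gather https://docs.example.com
(exit 4)
status=$?

# Map the documented exit codes to an outcome
case $status in
  0) result="success" ;;
  2) result="usage error" ;;
  3) result="network error" ;;
  4) result="partial success" ;;
  *) result="failure" ;;
esac

echo "$result"
```

Here the script prints `partial success`; in CI you might let code 4 pass with a warning while codes 1-3 fail the job.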