
Category: Unknown | Author: Unknown | Version: 1.0.0 | Updated: Unknown

Firecrawl CLI

Website: https://www.firecrawl.dev | CLI Tool: firecrawl | Authentication: Firecrawl API key

Description

Firecrawl is a web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. The CLI provides command-line access to Firecrawl's web extraction capabilities, and is essential for AI agents that need to process web content, build knowledge bases, or automate data collection.

Commands

Authentication

Set API Key

export FIRECRAWL_API_KEY=your_api_key
firecrawl config set-key <api-key>

Configure the Firecrawl API key, either as an environment variable or via the CLI config.

Scraping

Scrape Single Page

firecrawl scrape <url>
firecrawl scrape https://example.com
firecrawl scrape <url> --format markdown
firecrawl scrape <url> --format html
firecrawl scrape <url> --format json

Scrape a single webpage and return its content.

Scrape with Options

firecrawl scrape <url> --include-tags article,main
firecrawl scrape <url> --exclude-tags nav,footer
firecrawl scrape <url> --wait-for <selector>
firecrawl scrape <url> --screenshot

Scrape with specific extraction options.

Save to File

firecrawl scrape <url> --output file.md
firecrawl scrape <url> --output-dir ./scraped

Save scraped content to a file or directory.

Crawling

Crawl Website

firecrawl crawl <url>
firecrawl crawl https://example.com
firecrawl crawl <url> --max-depth 3
firecrawl crawl <url> --limit 100

Crawl an entire website or sitemap.

Crawl with Filters

firecrawl crawl <url> --include-paths /blog/*
firecrawl crawl <url> --exclude-paths /admin/*,/login
firecrawl crawl <url> --allow-external-links

Crawl with URL pattern filters.
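The path filters behave like shell globs matched against each URL. A Python sketch of that matching logic; the precise precedence (exclude winning over include, matching on the path only) is an assumption for illustration, not documented CLI behavior:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, include=None, exclude=None):
    """Decide whether a URL passes include/exclude path globs.
    Assumed semantics: patterns match the URL path only, exclude
    wins over include, and no include list means allow everything."""
    path = urlparse(url).path or "/"
    if exclude and any(fnmatch(path, pat) for pat in exclude):
        return False
    if include:
        return any(fnmatch(path, pat) for pat in include)
    return True

print(url_allowed("https://example.com/blog/post-1", include=["/blog/*"]))   # True
print(url_allowed("https://example.com/admin/users", exclude=["/admin/*"]))  # False
```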

Crawl and Save

firecrawl crawl <url> --output-dir ./crawled
firecrawl crawl <url> --output-format markdown
firecrawl crawl <url> --output-format json

Save crawled pages to a directory.

Maps

Generate Sitemap

firecrawl map <url>
firecrawl map https://example.com
firecrawl map <url> --search <query>

Generate a list of all URLs on a website.

Map with Filters

firecrawl map <url> --ignore-sitemap
firecrawl map <url> --include-subdomains
firecrawl map <url> --limit 1000

Generate filtered URL map.

Batch Operations

Batch Scrape

firecrawl batch scrape --urls <file>
firecrawl batch scrape --urls urls.txt

Scrape multiple URLs listed in a file.
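The URL file is one URL per line. A sketch of how such a file might be parsed before batching; the blank-line and `#`-comment handling is an assumed convention, not a documented format:

```python
def load_url_list(text):
    """Parse a urls.txt-style listing: one URL per line; blank lines,
    '#' comments, and duplicates are dropped (order preserved).
    The comment/blank-line handling is an assumed convention."""
    seen, urls = set(), []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line not in seen:
            seen.add(line)
            urls.append(line)
    return urls

sample = """\
# pages to scrape
https://example.com/page1

https://example.com/page2
https://example.com/page1
"""
urls = load_url_list(sample)
print(urls)  # ['https://example.com/page1', 'https://example.com/page2']
```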

Batch Crawl

firecrawl batch crawl --urls <file>
firecrawl batch crawl --urls sites.txt --max-depth 2

Crawl multiple websites listed in a file.

Extraction

Extract Structured Data

firecrawl extract <url> --schema schema.json
firecrawl extract <url> --prompt "Extract product information"

Extract structured data using a schema or a prompt.

Extract with LLM

firecrawl extract <url> --prompt "Summarize this article"
firecrawl extract <url> --prompt "List all prices mentioned"

Use an LLM to extract specific information.

Status and Jobs

Check Job Status

firecrawl status <job-id>
firecrawl status abc-123-def

Check the status of a crawl job.

Cancel Job

firecrawl cancel <job-id>

Cancel a running crawl job.

List Jobs

firecrawl jobs list
firecrawl jobs list --status completed
firecrawl jobs list --limit 10

List recent crawl jobs.

Configuration

View Config

firecrawl config show

Display current configuration.

Set Default Options

firecrawl config set-default format markdown
firecrawl config set-default max-depth 2

Set default options for commands.

Examples

Basic Scraping

# Scrape single page
firecrawl scrape https://example.com/article

# Get as markdown
firecrawl scrape https://example.com/docs --format markdown

# Save to file
firecrawl scrape https://example.com/page --output page.md

# Include specific elements
firecrawl scrape https://example.com --include-tags article,main,section

Website Crawling

# Crawl entire site
firecrawl crawl https://example.com

# Limit to 2 levels deep
firecrawl crawl https://example.com --max-depth 2

# Crawl specific paths
firecrawl crawl https://example.com --include-paths /blog/*,/docs/*

# Exclude admin pages
firecrawl crawl https://example.com --exclude-paths /admin/*

# Save all pages
firecrawl crawl https://example.com --output-dir ./site-crawl

URL Mapping

# Get all URLs
firecrawl map https://example.com

# Search for specific pages
firecrawl map https://example.com --search "pricing"

# Include subdomains
firecrawl map https://example.com --include-subdomains

# Ignore sitemap.xml
firecrawl map https://example.com --ignore-sitemap

Data Extraction

# Extract with schema
cat > schema.json << 'EOF'
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "price": {"type": "number"},
    "description": {"type": "string"}
  }
}
EOF

firecrawl extract https://example.com/product --schema schema.json

# Extract with prompt
firecrawl extract https://news.example.com/article \
  --prompt "Extract the article headline, author, date, and summary"

# Extract from multiple pages
firecrawl crawl https://example.com/products \
  --extract \
  --schema schema.json
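For downstream pipelines it helps to validate extracted records against the same schema. A minimal, stdlib-only sketch; the result dicts shown are hypothetical, the real response shape depends on the API, and a full validator such as `jsonschema` covers far more of the spec:

```python
# Map the schema's type names to Python types (flat schemas only).
TYPE_MAP = {"string": str, "number": (int, float), "object": dict}

def matches_schema(data, schema):
    """Minimal type check of a result dict against a flat JSON schema
    like the one above (no required fields, no nesting)."""
    if not isinstance(data, TYPE_MAP[schema["type"]]):
        return False
    return all(
        isinstance(data[key], TYPE_MAP[spec["type"]])
        for key, spec in schema.get("properties", {}).items()
        if key in data
    )

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
    },
}
print(matches_schema({"title": "Widget", "price": 9.99}, schema))   # True
print(matches_schema({"title": "Widget", "price": "9.99"}, schema)) # False
```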

Batch Operations

# Create URL list
cat > urls.txt << 'EOF'
https://example.com/page1
https://example.com/page2
https://example.com/page3
EOF

# Batch scrape
firecrawl batch scrape --urls urls.txt --output-dir ./batch-results

# Batch with options
firecrawl batch scrape \
  --urls urls.txt \
  --format markdown \
  --include-tags article

Advanced Scraping

# Wait for dynamic content
firecrawl scrape https://spa.example.com \
  --wait-for "#content-loaded" \
  --timeout 10000

# Take screenshot
firecrawl scrape https://example.com \
  --screenshot \
  --output screenshot.png

# Custom headers
firecrawl scrape https://api.example.com \
  --headers '{"Authorization": "Bearer token"}'

# JavaScript execution
firecrawl scrape https://example.com \
  --execute-javascript "window.scrollTo(0, document.body.scrollHeight)"

Building Knowledge Base

# Crawl documentation site
firecrawl crawl https://docs.example.com \
  --format markdown \
  --output-dir ./knowledge-base \
  --max-depth 5

# Process for RAG
for file in ./knowledge-base/*.md; do
  # Chunk and embed for vector database
  echo "Processing $file"
done
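The chunking step in the loop above can be sketched as a character-window splitter that prefers paragraph boundaries. Sizes and overlap here are illustrative; production RAG pipelines usually split on tokens rather than characters:

```python
def chunk_markdown(text, max_chars=800, overlap=100):
    """Split text into overlapping chunks, breaking at paragraph
    boundaries where possible; a rough stand-in for the 'chunk'
    step in the loop above."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry a little context forward
        current = (current + "\n\n" + para).strip() if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "\n\n".join(f"Paragraph {i}. " + "text " * 40 for i in range(5))
pieces = chunk_markdown(doc)
print(len(pieces), all(len(p) <= 800 for p in pieces))  # 2 True
```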

Monitoring and Jobs

# Start large crawl
JOB_ID=$(firecrawl crawl https://large-site.com --async)

# Check status
firecrawl status $JOB_ID

# Wait for completion
while [ "$(firecrawl status "$JOB_ID" --format json | jq -r '.status')" = "running" ]; do
  echo "Still crawling..."
  sleep 10
done

# Get results
firecrawl get-results $JOB_ID --output-dir ./results
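The polling loop generalizes to any language. A Python sketch with exponential backoff, demonstrated against a fake status source; the `get_status` wrapper that shells out to the CLI is left to the reader:

```python
import time

def wait_for_job(get_status, timeout=600, base_delay=2.0, max_delay=30.0):
    """Poll a status callable until it leaves 'running', with exponential
    backoff. `get_status` is any zero-argument function returning the
    current status string, e.g. a wrapper around
    `firecrawl status <job-id> --format json`."""
    deadline = time.monotonic() + timeout
    delay = base_delay
    while time.monotonic() < deadline:
        status = get_status()
        if status != "running":
            return status
        time.sleep(delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError("job did not finish within timeout")

# Demo against a fake status source; no CLI or network needed.
states = iter(["running", "running", "completed"])
print(wait_for_job(lambda: next(states), base_delay=0.01))  # completed
```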

Content Extraction Pipeline

# Extract product data from e-commerce site
firecrawl crawl https://shop.example.com/products \
  --include-paths /products/* \
  --extract \
  --prompt "Extract product name, price, description, and rating" \
  --output-format json \
  --output-dir ./products

# Process extracted data
cat ./products/*.json | jq -s '.' > all-products.json
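The jq aggregation step can equally be done in Python. A sketch, assuming each output file holds either one record or a list of records (the per-file layout is an assumption about the crawl output):

```python
import json

def merge_json_records(docs):
    """Flatten per-page JSON documents (each a dict or a list of dicts)
    into one list, roughly what the `jq -s '.'` pipeline above builds."""
    merged = []
    for doc in docs:
        records = json.loads(doc)
        merged.extend(records if isinstance(records, list) else [records])
    return merged

pages = [
    '{"name": "Widget", "price": 9.99}',
    '[{"name": "Gadget", "price": 4.5}, {"name": "Gizmo", "price": 2.0}]',
]
merged = merge_json_records(pages)
print([p["name"] for p in merged])  # ['Widget', 'Gadget', 'Gizmo']
```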

API Integration

# Use with curl
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer $FIRECRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html"]
  }'

# Use CLI wrapper
# Pipe scraped markdown to your processing tool
firecrawl scrape https://example.com --format markdown | your-text-processor
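The same request can be assembled programmatically. This sketch only builds the HTTP pieces, mirroring the curl call above; actually sending them (urllib, requests, or an official SDK) is left out so nothing here touches the network:

```python
import json
import os

API_URL = "https://api.firecrawl.dev/v1/scrape"  # endpoint from the curl example above

def build_scrape_request(url, formats=("markdown",), api_key=None):
    """Assemble endpoint, headers, and JSON body for a /v1/scrape call,
    mirroring the curl example; sending the request is up to the caller."""
    key = api_key or os.environ.get("FIRECRAWL_API_KEY", "")
    headers = {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": url, "formats": list(formats)})
    return API_URL, headers, body

endpoint, headers, body = build_scrape_request("https://example.com", ("markdown", "html"))
print(body)  # {"url": "https://example.com", "formats": ["markdown", "html"]}
```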

Scheduled Scraping

# Cron job for daily scraping
# 0 2 * * * /usr/local/bin/firecrawl crawl https://news.example.com --output-dir /data/news/$(date +\%Y-\%m-\%d)

# With notification
firecrawl crawl https://competitor.com \
  --output-dir ./competitor-data && \
  echo "Crawl complete" | mail -s "Scraping Done" admin@example.com

Notes

  • Authentication: Requires Firecrawl API key
  • Formats: Markdown, HTML, JSON, structured data
  • Crawling: Respects robots.txt by default
  • Rate Limiting: API enforces rate limits
  • JavaScript: Renders JavaScript-heavy sites
  • Screenshots: Can capture page screenshots
  • Async: Large crawls run asynchronously
  • Webhooks: Callback on job completion
  • Filtering: Include/exclude URL patterns
  • Depth: Control crawl depth
  • Limits: Set maximum pages per crawl
  • Extraction: LLM-powered data extraction
  • Schemas: JSON schema for structured extraction
  • Caching: Results cached for performance
  • Proxies: Automatic proxy rotation
  • Blocking: Handles CAPTCHAs and blocks
  • Mobile: Can emulate mobile browsers
  • Metadata: Extracts page metadata
  • Links: Extracts internal and external links
  • Clean: Removes ads, navigation, footers
  • LLM-Ready: Optimized for AI processing
  • Vector DB: Compatible with RAG systems
  • Pricing: Based on credits and pages
  • API: RESTful API available
  • SDKs: Python, Node.js, Go SDKs
  • Best Practices:
    • Respect robots.txt and ToS
    • Use appropriate rate limiting
    • Filter URLs to avoid unnecessary pages
    • Use markdown format for LLMs
    • Set reasonable depth limits
    • Save results incrementally
    • Monitor API usage
    • Handle errors gracefully
    • Use batch operations for multiple URLs
    • Extract structured data when possible
    • Cache results when appropriate
    • Set timeouts for slow pages
    • Use webhooks for long crawls
    • Validate extracted data
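The rate-limiting practice can also be enforced client-side. A minimal sketch; the requests-per-second figure is illustrative, since actual limits depend on your Firecrawl plan:

```python
import time

class RateLimiter:
    """Simple client-side limiter: allow at most `rate` calls per `per`
    seconds by spacing calls at a fixed interval."""
    def __init__(self, rate, per=1.0, clock=time.monotonic):
        self.interval = per / rate
        self.clock = clock
        self.next_allowed = clock()

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self.clock()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.interval

limiter = RateLimiter(rate=50, per=1.0)   # at most ~50 calls/second (illustrative)
start = time.monotonic()
for _ in range(3):
    limiter.wait()                        # would precede each scrape/crawl call
elapsed = time.monotonic() - start
print(elapsed >= 0.03)  # True: at least two enforced gaps of ~0.02s each
```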

Get this worldbook via CLI

worldbook get firecrawl
