Firecrawl CLI
Website: https://www.firecrawl.dev
CLI Tool: firecrawl
Authentication: Firecrawl API key
Description
Firecrawl is a web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. The CLI provides command-line access to Firecrawl's web extraction capabilities, making it useful for AI agents that need to process web content, build knowledge bases, or automate data collection.
Commands
Authentication
Set API Key
export FIRECRAWL_API_KEY=your_api_key
firecrawl config set-key <api-key>
Configure the Firecrawl API key, either via environment variable or with the config command.
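To persist the key across shell sessions, append the export to your shell profile (bash shown; adjust for your shell):
# Persist the API key for future sessions
echo 'export FIRECRAWL_API_KEY=your_api_key' >> ~/.bashrc
source ~/.bashrc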
Scraping
Scrape Single Page
firecrawl scrape <url>
firecrawl scrape https://example.com
firecrawl scrape <url> --format markdown
firecrawl scrape <url> --format html
firecrawl scrape <url> --format json
Scrape single webpage and return content.
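Assuming the CLI writes content to stdout (as the pipe example later on this page suggests), plain shell redirection works as an alternative to --output:
# Redirect scraped markdown to a file and sanity-check it isn't empty
firecrawl scrape https://example.com --format markdown > page.md
test -s page.md && echo "scraped $(wc -c < page.md) bytes" || echo "empty result"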
Scrape with Options
firecrawl scrape <url> --include-tags article,main
firecrawl scrape <url> --exclude-tags nav,footer
firecrawl scrape <url> --wait-for <selector>
firecrawl scrape <url> --screenshot
Scrape with specific extraction options.
Save to File
firecrawl scrape <url> --output file.md
firecrawl scrape <url> --output-dir ./scraped
Save scraped content to file.
Crawling
Crawl Website
firecrawl crawl <url>
firecrawl crawl https://example.com
firecrawl crawl <url> --max-depth 3
firecrawl crawl <url> --limit 100
Crawl entire website or sitemap.
Crawl with Filters
firecrawl crawl <url> --include-paths /blog/*
firecrawl crawl <url> --exclude-paths /admin/*,/login
firecrawl crawl <url> --allow-external-links
Crawl with URL pattern filters.
Crawl and Save
firecrawl crawl <url> --output-dir ./crawled
firecrawl crawl <url> --output-format markdown
firecrawl crawl <url> --output-format json
Save crawled pages to directory.
Maps
Generate Sitemap
firecrawl map <url>
firecrawl map https://example.com
firecrawl map <url> --search <query>
Generate list of all URLs on website.
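For comparison, the same capability is exposed by the REST API at POST /v1/map; as of the v1 API the response carries a links array:
# Sketch: map a site via the REST endpoint and print one URL per line
curl -s -X POST https://api.firecrawl.dev/v1/map \
  -H "Authorization: Bearer $FIRECRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' \
  | jq -r '.links[]'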
Map with Filters
firecrawl map <url> --ignore-sitemap
firecrawl map <url> --include-subdomains
firecrawl map <url> --limit 1000
Generate filtered URL map.
Batch Operations
Batch Scrape
firecrawl batch scrape --urls <file>
firecrawl batch scrape --urls urls.txt
Scrape multiple URLs from file.
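map and batch scrape compose naturally. A sketch, assuming map prints one URL per line (filenames here are illustrative):
# Map a site, keep only docs pages, then batch-scrape the survivors
firecrawl map https://example.com > all-urls.txt
grep '/docs/' all-urls.txt > docs-urls.txt
firecrawl batch scrape --urls docs-urls.txt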
Batch Crawl
firecrawl batch crawl --urls <file>
firecrawl batch crawl --urls sites.txt --max-depth 2
Crawl multiple websites.
Extraction
Extract Structured Data
firecrawl extract <url> --schema schema.json
firecrawl extract <url> --prompt "Extract product information"
Extract structured data using schema or prompt.
Extract with LLM
firecrawl extract <url> --prompt "Summarize this article"
firecrawl extract <url> --prompt "List all prices mentioned"
Use LLM to extract specific information.
Status and Jobs
Check Job Status
firecrawl status <job-id>
firecrawl status abc-123-def
Check status of crawl job.
Cancel Job
firecrawl cancel <job-id>
Cancel running crawl job.
List Jobs
firecrawl jobs list
firecrawl jobs list --status completed
firecrawl jobs list --limit 10
List recent crawl jobs.
Configuration
View Config
firecrawl config show
Display current configuration.
Set Default Options
firecrawl config set-default format markdown
firecrawl config set-default max-depth 2
Set default options for commands.
Examples
Basic Scraping
# Scrape single page
firecrawl scrape https://example.com/article
# Get as markdown
firecrawl scrape https://example.com/docs --format markdown
# Save to file
firecrawl scrape https://example.com/page --output page.md
# Include specific elements
firecrawl scrape https://example.com --include-tags article,main,section
Website Crawling
# Crawl entire site
firecrawl crawl https://example.com
# Limit to 2 levels deep
firecrawl crawl https://example.com --max-depth 2
# Crawl specific paths
firecrawl crawl https://example.com --include-paths /blog/*,/docs/*
# Exclude admin pages
firecrawl crawl https://example.com --exclude-paths /admin/*
# Save all pages
firecrawl crawl https://example.com --output-dir ./site-crawl
URL Mapping
# Get all URLs
firecrawl map https://example.com
# Search for specific pages
firecrawl map https://example.com --search "pricing"
# Include subdomains
firecrawl map https://example.com --include-subdomains
# Ignore sitemap.xml
firecrawl map https://example.com --ignore-sitemap
Data Extraction
# Extract with schema
cat > schema.json << 'EOF'
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "price": {"type": "number"},
    "description": {"type": "string"}
  }
}
EOF
firecrawl extract https://example.com/product --schema schema.json
# Extract with prompt
firecrawl extract https://news.example.com/article \
  --prompt "Extract the article headline, author, date, and summary"
# Extract from multiple pages
firecrawl crawl https://example.com/products \
  --extract \
  --schema schema.json
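Before launching an extract job, it's cheap to confirm the schema file parses; jq exits non-zero on invalid JSON:
# Fail fast if schema.json is not valid JSON
jq empty schema.json || { echo "invalid schema.json" >&2; exit 1; }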
Batch Operations
# Create URL list
cat > urls.txt << 'EOF'
https://example.com/page1
https://example.com/page2
https://example.com/page3
EOF
# Batch scrape
firecrawl batch scrape --urls urls.txt --output-dir ./batch-results
# Batch with options
firecrawl batch scrape \
  --urls urls.txt \
  --format markdown \
  --include-tags article
Advanced Scraping
# Wait for dynamic content
firecrawl scrape https://spa.example.com \
  --wait-for "#content-loaded" \
  --timeout 10000
# Take screenshot
firecrawl scrape https://example.com \
  --screenshot \
  --output screenshot.png
# Custom headers
firecrawl scrape https://api.example.com \
  --headers '{"Authorization": "Bearer token"}'
# JavaScript execution
firecrawl scrape https://example.com \
  --execute-javascript "window.scrollTo(0, document.body.scrollHeight)"
Building Knowledge Base
# Crawl documentation site
firecrawl crawl https://docs.example.com \
  --format markdown \
  --output-dir ./knowledge-base \
  --max-depth 5
# Process for RAG
for file in ./knowledge-base/*.md; do
  # Chunk and embed for vector database
  echo "Processing $file"
done
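As a minimal sketch of the chunking step (fixed-size only; real pipelines usually split on headings or tokens), GNU split can break each page into roughly 2 KB pieces before embedding:
# Naive chunking with GNU split: each output file holds whole lines up to
# ~2000 bytes, named like page.md.chunk-aa, page.md.chunk-ab, ...
mkdir -p ./chunks
for file in ./knowledge-base/*.md; do
  split -C 2000 "$file" "./chunks/$(basename "$file").chunk-"
done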
Monitoring and Jobs
# Start large crawl
JOB_ID=$(firecrawl crawl https://large-site.com --async)
# Check status
firecrawl status $JOB_ID
# Wait for completion
while [ "$(firecrawl status "$JOB_ID" --format json | jq -r '.status')" = "running" ]; do
  echo "Still crawling..."
  sleep 10
done
# Get results
firecrawl get-results $JOB_ID --output-dir ./results
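If you'd rather poll the REST API directly, the v1 crawl status endpoint is GET /v1/crawl/<job-id>:
# Poll the crawl status endpoint directly; as of the v1 API the
# response's "status" field takes values like "scraping" or "completed"
curl -s "https://api.firecrawl.dev/v1/crawl/$JOB_ID" \
  -H "Authorization: Bearer $FIRECRAWL_API_KEY" \
  | jq -r '.status'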
Content Extraction Pipeline
# Extract product data from e-commerce site
firecrawl crawl https://shop.example.com/products \
  --include-paths /products/* \
  --extract \
  --prompt "Extract product name, price, description, and rating" \
  --output-format json \
  --output-dir ./products
# Process extracted data
cat ./products/*.json | jq -s '.' > all-products.json
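With everything merged into one array, downstream validation is a one-liner; the price field here is hypothetical, following the extraction prompt above:
# Count extracted records that came back without a price
# ('price' is a hypothetical field named in the prompt above)
jq '[.[] | select(.price == null)] | length' all-products.json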
API Integration
# Use with curl
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer $FIRECRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html"]
  }'
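As of the v1 API, the scrape response is a JSON envelope with the content under a data key, so jq can pull out individual fields:
# Grab just the markdown from the response envelope
curl -s -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer $FIRECRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}' \
  | jq -r '.data.markdown'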
# Use CLI wrapper
# Pipe scraped markdown to your processing tool
firecrawl scrape https://example.com --format markdown | your-text-processor
Scheduled Scraping
# Cron job for daily scraping
# 0 2 * * * /usr/local/bin/firecrawl crawl https://news.example.com --output-dir /data/news/$(date +\%Y-\%m-\%d)
# With notification
firecrawl crawl https://competitor.com \
  --output-dir ./competitor-data && \
  echo "Crawl complete" | mail -s "Scraping Done" admin@example.com
Notes
- Authentication: Requires Firecrawl API key
- Formats: Markdown, HTML, JSON, structured data
- Crawling: Respects robots.txt by default
- Rate Limiting: API enforces rate limits
- JavaScript: Renders JavaScript-heavy sites
- Screenshots: Can capture page screenshots
- Async: Large crawls run asynchronously
- Webhooks: Callback on job completion
- Filtering: Include/exclude URL patterns
- Depth: Control crawl depth
- Limits: Set maximum pages per crawl
- Extraction: LLM-powered data extraction
- Schemas: JSON schema for structured extraction
- Caching: Results cached for performance
- Proxies: Automatic proxy rotation
- Blocking: Handles CAPTCHAs and anti-bot blocks
- Mobile: Can emulate mobile browsers
- Metadata: Extracts page metadata
- Links: Extracts internal and external links
- Clean: Removes ads, navigation, footers
- LLM-Ready: Optimized for AI processing
- Vector DB: Compatible with RAG systems
- Pricing: Based on credits and pages
- API: RESTful API available
- SDKs: Python, Node.js, Go SDKs
- Best Practices:
  - Respect robots.txt and ToS
  - Use appropriate rate limiting
  - Filter URLs to avoid unnecessary pages
  - Use markdown format for LLMs
  - Set reasonable depth limits
  - Save results incrementally
  - Monitor API usage
  - Handle errors gracefully
  - Use batch operations for multiple URLs
  - Extract structured data when possible
  - Cache results when appropriate
  - Set timeouts for slow pages
  - Use webhooks for long crawls
  - Validate extracted data