web-data-extractor-suite
A comprehensive suite offering advanced web content acquisition, site traversal, and transformation into structured data formats, primarily serving AI application needs with clean, parsed outputs.
Author

xiyuefox
🔥 Firecrawl: AI-Powered Web Data Acquisition
This tool empowers your artificial intelligence applications by providing robust capabilities for scraping, site-wide crawling, and precise data extraction, converting diverse web assets into standardized, machine-readable formats (like clean Markdown or JSON).
Status Note: The repository is actively evolving; self-hosting deployment capabilities are currently pending full integration within the monorepo structure, though local execution is supported.
Core Functionality
Firecrawl operates as an API utility that accepts a target URL, systematically traverses its accessible internal links, and furnishes clean content outputs for each page discovered. A sitemap is not a prerequisite for this deep traversal.
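The traversal described above can be pictured as a breadth-first walk over same-host links. The following is a minimal sketch of that idea, not Firecrawl's actual implementation: `crawl_internal` and `fetch_links` are hypothetical names, and a small in-memory dictionary stands in for real HTTP responses.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_internal(root, fetch_links, limit=10):
    """Breadth-first traversal of same-host links starting at `root`.

    `fetch_links(url)` is a hypothetical callable returning the hrefs found
    on a page; a real crawler would fetch and parse HTML here.
    """
    host = urlparse(root).netloc
    seen = {root}
    queue = deque([root])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links against the current page and drop fragments.
            absolute = urljoin(url, href).split("#")[0]
            # Keep only unseen links on the same host (internal links).
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Toy in-memory "site" standing in for real pages (assumption for the sketch).
SITE = {
    "https://example.com/": ["/docs", "/blog", "https://other.com/x"],
    "https://example.com/docs": ["/docs/api", "/"],
    "https://example.com/blog": [],
    "https://example.com/docs/api": [],
}

pages = crawl_internal("https://example.com/", lambda u: SITE.get(u, []))
print(pages)
```

No sitemap is consulted: every page is discovered by following links from the root, and the external link to other.com is skipped.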
Explore our comprehensive documentation for detailed usage guides.
Getting Started with the API
Access is available through our managed cloud service, which includes a live playground and full documentation. Running the backend locally is also an option.
Key Resources:
- API Reference: Documentation Link
- Software Development Kits (SDKs): Python, Node.js, Go, Rust
- Integration with LLM Frameworks: Langchain, Llama Index, Crew.ai, and others.
- Ecosystem Support: Low-code platforms such as Dify and automation tools such as Zapier.
To use the service, you need an API key, which you obtain by registering at Firecrawl.
Feature Set Overview
Firecrawl specializes in several distinct operations:
- Scrape: Retrieves content from a single URL in formats optimized for Large Language Models (LLMs), such as Markdown or structured JSON via LLM Extract, and can also return screenshots and raw HTML.
- Crawl: Initiates a recursive traversal of a domain starting from a root URL, returning processed content for all reachable internal paths.
- Map (Alpha): Rapidly inventories all reachable URLs within a specified domain.
- Extract: Leverages generative models to synthesize specific structured data from single or multiple web pages (or entire domains using wildcard matching) based on provided natural language prompts and optional JSON schemas.
Advanced Handling:
- Format Diversity: Outputs include LLM-optimized text, structured data, parsed documents (PDFs, DOCX), screenshots, and metadata.
- Technical Complexity Management: Handles proxies, anti-bot defenses, JavaScript-rendered content, output parsing, and process orchestration.
- Fine-Tuning: Allows customization such as excluding specific HTML elements, accessing protected sites with custom HTTP headers, and setting a maximum traversal depth.
- Interactive Capabilities (Cloud): Supports sequenced browser actions (e.g., clicking, scrolling, form input, timed waits) before data capture.
- Asynchronous Processing (New): New endpoints enable parallel processing of thousands of URLs via job queuing.
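As an illustration of the fine-tuning and interactive options listed above, here is a hedged sketch of how such a request body might be assembled. The helper `build_scrape_body` is hypothetical, and the field names (`excludeTags`, `headers`, `actions`) and action shapes are assumptions for this sketch; check the API reference for the exact names before relying on them.

```python
import json


def build_scrape_body(url, exclude_tags=(), headers=None, actions=()):
    """Assemble a scrape request body as a plain dict.

    Field names below are assumptions for illustration, not a confirmed API.
    """
    body = {"url": url, "formats": ["markdown"]}
    if exclude_tags:
        body["excludeTags"] = list(exclude_tags)  # HTML elements to drop
    if headers:
        body["headers"] = dict(headers)  # custom HTTP headers for protected pages
    if actions:
        body["actions"] = list(actions)  # browser steps run before capture
    return body


body = build_scrape_body(
    "https://example.com/pricing",
    exclude_tags=["nav", "footer"],
    headers={"Cookie": "session=PLACEHOLDER"},
    actions=[
        {"type": "wait", "milliseconds": 1000},
        {"type": "click", "selector": "#load-more"},
    ],
)
print(json.dumps(body, indent=2))
```

Optional keys are only added when set, so a plain request stays minimal.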
Example: Asynchronous Web Traversal (Crawl)
Submitting a crawl request returns a job identifier for status polling:
curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "limit": 10,
      "scrapeOptions": {
        "formats": ["markdown", "html"]
      }
    }'
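Once a job identifier is returned, the client polls until the job finishes. The loop below is a minimal sketch of that pattern under stated assumptions: `wait_for_crawl` and `get_status` are hypothetical names, and the `status` field with the values `scraping`, `completed`, and `failed` is assumed rather than taken from this document. Responses are simulated so the sketch runs offline.

```python
import time


def wait_for_crawl(get_status, job_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll `get_status(job_id)` until the job reports a terminal state.

    `get_status` is a hypothetical callable that would wrap an HTTP GET on the
    job's status endpoint and return the decoded JSON body as a dict.
    """
    waited = 0.0
    while True:
        body = get_status(job_id)
        if body.get("status") in ("completed", "failed"):
            return body
        if waited >= timeout:
            raise TimeoutError(f"crawl {job_id} still running after {timeout}s")
        sleep(interval)
        waited += interval


# Simulated status responses (assumed shape) in place of real HTTP calls.
_responses = iter([
    {"status": "scraping"},
    {"status": "scraping"},
    {"status": "completed", "data": [{"markdown": "# Hello"}]},
])

result = wait_for_crawl(lambda job_id: next(_responses), "job-123", sleep=lambda s: None)
print(result["status"])
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in practice the default `time.sleep` would be used.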
Example: Structured Data Extraction (Extract)
This capability uses a prompt and an explicit JSON schema to mandate the output structure across one or more targets, supporting domain-wide recursive extraction via https://domain.com/*:
curl -X POST https://api.firecrawl.dev/v1/extract \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR_API_KEY' \
-d '{ "urls": ["https://firecrawl.dev/*"], "prompt": "Summarize the product'\''s core offering.", "schema": { ... } }'
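When a request body contains apostrophes or nested quotes, assembling it in code can be easier than writing it inline in a shell string. The sketch below builds an equivalent body in Python. The schema shown is hypothetical and exists only for illustration, since the original leaves the schema elided; no request is sent.

```python
import json

# Hypothetical JSON schema for illustration only; the source elides the real one.
schema = {
    "type": "object",
    "properties": {
        "core_offering": {"type": "string"},
    },
    "required": ["core_offering"],
}

payload = {
    "urls": ["https://firecrawl.dev/*"],  # trailing /* requests domain-wide extraction
    "prompt": "Summarize the product's core offering.",
    "schema": schema,
}

# json.dumps handles all quoting and escaping for the request body.
body = json.dumps(payload)
print(body)
```

The resulting string can be passed as the body of a POST to the extract endpoint by any HTTP client.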
SDK Implementations
Python Usage Example
Installation: pip install firecrawl-py
To retrieve structured data using Pydantic definitions:
from firecrawl.firecrawl import FirecrawlApp
from pydantic import BaseModel

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")


class CompanyProfile(BaseModel):
    mission: str
    is_open_source: bool


# Extracts data into the specified schema
data = app.scrape_url('https://firecrawl.dev', {
    'formats': ['json'],
    'jsonOptions': {
        'schema': CompanyProfile.model_json_schema()
    }
})
print(data["json"])
Node.js Usage Example
Installation: npm install @mendable/firecrawl-js
Utilizing Zod for schema definition in Node:
import FirecrawlApp from "@mendable/firecrawl-js";
import { z } from "zod";

const app = new FirecrawlApp({ apiKey: "fc-YOUR_API_KEY" });

const schema = z.object({ /* ... schema definition ... */ });

const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
  jsonOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data["json"]);
Licensing & Responsibility
Firecrawl is principally offered under the AGPL-3.0 open-source license, with specific SDK components and UI elements released under the permissive MIT License. Users must ensure adherence to the target websites' terms of service and robots.txt directives when employing scraping or crawling functions. The cloud service expands functionality beyond the OSS version.
