How to crawl websites for LLMs - using Firecrawl

Many AI applications built on modern LLMs (Large Language Models) require access to high-quality data from the public web. Think of customer service applications, research applications or general-purpose chatbots. However, the web is a vast and messy source of data, made up of HTML pages filled with unstructured text. That's where this blog post comes into play.

We'll explore how to use Firecrawl to crawl websites and prepare data for your LLM projects. We'll also demonstrate how to integrate Firecrawl with LangChain, a popular framework for developing applications with LLMs.

After reading this post, you'll have a good understanding and a step-by-step guide on how to efficiently get data from any public website.

What is Firecrawl?

Firecrawl is a powerful web crawling and scraping tool designed to simplify the process of converting entire websites into clean, structured data, particularly in formats that are ready for large language models.

Key Features are:

Dynamic Content Handling: Firecrawl excels at scraping websites that render their content with JavaScript, making it highly effective for modern, dynamic sites.

Automated Content Transformation: The tool automatically converts scraped data into markdown or other structured formats, making it easier to feed into machine learning models or content pipelines.

Caching and Rate Limiting: To prevent overloading websites and to comply with rate limits, Firecrawl features intelligent caching and rate-limiting capabilities. This ensures ethical and efficient data scraping without impacting server performance.

Scalability: Firecrawl is built to scale, making it suitable for both small projects and large-scale data operations. Whether you’re scraping a single website or thousands, Firecrawl adapts to your needs.

Open-Source and Extensible: Firecrawl is open-source, allowing users to self-host and customize the platform.

User-Friendly Interface: Despite its advanced capabilities, Firecrawl is designed with ease of use in mind. Its intuitive interface and clear documentation make it accessible even to those who are new to web scraping.

Support for structured data extraction: Firecrawl lets you define Pydantic models to extract data from pages in a structured way.

Crawling websites with Firecrawl

Let's see the tool in action. First, we're going to use the managed cloud version of Firecrawl - as it makes getting started easy. If you want to self-host, jump straight to the last chapter in this article.

  1. Navigate to Firecrawl and sign up for an account.

  2. Head over to your Firecrawl keys section.

  3. Create a new API key, or use the default one. Copy the key.

  4. Install the Firecrawl Python client:

    pip install firecrawl-py
  5. Use the following code snippet to scrape a single URL without any sub-pages:

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

    scrape_result = app.scrape_url('pondhouse-data.com', params={'formats': ['markdown', 'html']})
    print(scrape_result)

    This will output the pondhouse-data website in markdown and html format.

    {
      "markdown": "# Custom AI and Data engineering for your business\n\nWe\\'re a boutique agency specializing in cutting-edge AI and data solutions. From self-h....",
      "html": "<!DOCTYPE html><html lang=\"en\" class=\"text-base antialiased\"><!--$!--><!--/$--><body><div class=\"relative flex min-h-screen flex-col bg-background...."
    }
  6. If you want to crawl a website with multiple sub-pages, you can use the following code snippet:

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

    crawl_status = app.crawl_url(
        'https://pondhouse-data.com',
        params={
            'limit': 300,
            'scrapeOptions': {'formats': ['markdown']}
        },
        poll_interval=10
    )
    print(crawl_status)

    This will crawl all pages of the site, poll for the status every 10 seconds and create a markdown representation of each page. The output will be similar to:

    {
      "status": "completed",
      "completed": 36,
      "total": 36,
      "creditsUsed": 36,
      "expiresAt": "2024-09-07T16:34:46.000Z",
      "data": [
        {
          "markdown": "##### Pondhouse Data Blog\n\nPreviousNext\n\nShowing1to10of29results",
          "metadata": {
            "title": "Pondhouse AI Blog",
            "ogImage": "https://www.pondhouse-data.com/pondhouse-data-header.png",
            "ogTitle": "Pondhouse AI Blog",
            "language": "en",
            "sourceURL": "https://www.pondhouse-data.com/blog",
            "description": "Blogs and articles about AI applications and technology.",
            "ogDescription": "Blogs and articles about AI applications and technology.",
            "ogLocaleAlternate": [],
            "statusCode": 200
          }
        },
        {
          "markdown": "# Advanced RAG: Recursive Retrieval with llamaindex\n\nWhen it comes to [Retrieval Augmented Generation (RAG)](/blog/integrating-knowledge-and-llms), the quality of both the created document index and the retrieval process is crucial for getting good and consistent answers based on your documentation. One especially challenging aspect is how to model relationships between text chunks of your documents. ...",
          "metadata": {
            ...
          }
        },
        ...
      ]
    }
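
Each entry in the data array contains the page's markdown plus its metadata. As a minimal sketch of how you might turn the crawl result into a set of documents for an LLM pipeline (the filtering and the dictionary layout are just examples, not part of the Firecrawl API):

    # Collect the markdown of every successfully crawled page, keyed by its source URL
    documents = []
    for page in crawl_status['data']:
        metadata = page.get('metadata', {})
        if metadata.get('statusCode') == 200:
            documents.append({
                'source': metadata.get('sourceURL'),
                'title': metadata.get('title'),
                'content': page['markdown'],
            })

    print(f"Collected {len(documents)} documents")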

Asynchronous Crawling with Firecrawl

While the above crawling examples already demonstrate how easy Firecrawl is to use, they run synchronously. Especially when crawling large sites with many sub-pages, this might block your application for many minutes. Alternatively, you can run the crawl job asynchronously:

  1. Create a crawl job:

    crawl_status = app.async_crawl_url(
        'https://pondhouse-data.com',
        params={
            'limit': 300,
            'scrapeOptions': {'formats': ['markdown']}
        }
    )
    print(crawl_status)

    This will create a crawl job and return the job id which can be used to check for the status of the job.

  2. Check the status of the job and retrieve the data:

    import asyncio

    async def check_status():
        while True:
            crawl_data = app.check_crawl_status(crawl_status['id'])
            if crawl_data['status'] == 'completed':
                print(crawl_data)
                break
            await asyncio.sleep(10)  # Wait for 10 seconds before checking again

    asyncio.run(check_status())

    Once done, this will output the same result as for the synchronous case.
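
If your application doesn't use asyncio anyway, a plain polling loop works just as well. Below is a small sketch under the same assumptions as above, with an added timeout so the loop can't run forever (the 10-minute limit is an arbitrary example):

    import time

    # Poll the crawl job every 10 seconds, but give up after 10 minutes
    deadline = time.time() + 600
    while time.time() < deadline:
        crawl_data = app.check_crawl_status(crawl_status['id'])
        if crawl_data['status'] == 'completed':
            print(crawl_data)
            break
        time.sleep(10)
    else:
        print("Crawl did not finish within the timeout")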

Integrating Firecrawl with LangChain

Another way to use Firecrawl is together with LangChain. LangChain is a powerful framework for developing applications with LLMs. Among other things, it provides a vast collection of document loaders: easy-to-use Python classes that load data from a huge variety of sources in a standardized format. One of these loaders is the Firecrawl loader.

  1. In addition to firecrawl, install the LangChain community package:

    pip install langchain_community
  2. Then, use the following code snippet to load the data from the pondhouse-data website:

    from langchain_community.document_loaders import FireCrawlLoader
    import os

    os.environ["FIRECRAWL_API_KEY"] = "fc-YOUR_API_KEY"

    loader = FireCrawlLoader(
        url="https://pondhouse-data.com",
        mode="crawl"
    )

    data = loader.load()
    # or use: data = await loader.aload()

For more details, refer to the LangChain documentation.
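
The loader returns a list of LangChain Document objects, so the crawled pages plug directly into the rest of the LangChain ecosystem. As a minimal sketch, here is how you might chunk the loaded pages for a retrieval pipeline (this assumes the langchain-text-splitters package is installed; the chunk sizes are just example values):

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Split the crawled pages into overlapping chunks, ready for embedding and indexing
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(data)

    print(f"Split {len(data)} pages into {len(chunks)} chunks")
    print(chunks[0].page_content[:200])
    print(chunks[0].metadata)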

Parsing specific content from websites using schema-based parsing

In addition to providing formatted versions of the full page content, Firecrawl can use an LLM to extract specific information from a website.

This is done by providing a Pydantic schema that defines exactly what to extract.

  1. Define the schema:

    from pydantic import BaseModel

    class ExtractSchema(BaseModel):
        company_name: str
        company_sector: str
        is_ai_company: bool

    Make sure the field names are descriptive, as the LLM will use them to decide what to extract (a more explicit variant using field descriptions is sketched below, after these steps).

  2. Next, we can use this model as part of the Firecrawl invocation:

    from firecrawl import FirecrawlApp
    import json

    app = FirecrawlApp(api_key='your_api_key')

    data = app.scrape_url('https://pondhouse-data.com', {
        'formats': ['extract'],
        'extract': {
            'schema': ExtractSchema.model_json_schema(),
        }
    })
    print(json.dumps(data["extract"], indent=2))

    The output is a JSON object with the extracted data:

    {
      "company_name": "Pondhouse Data OG",
      "company_sector": "AI and Data Solutions",
      "is_ai_company": true
    }

This is actually a very powerful feature - extracting structured information from the mess of a website with just 5 lines of code is quite remarkable.
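
If you want to give the model more guidance than the field names alone, you can attach descriptions to the schema fields; Pydantic includes them in the JSON schema generated by model_json_schema(). A sketch of a more descriptive variant of the schema above (the description texts are just examples):

    from pydantic import BaseModel, Field

    class ExtractSchema(BaseModel):
        company_name: str = Field(description="Official name of the company behind the website")
        company_sector: str = Field(description="Industry or sector the company operates in")
        is_ai_company: bool = Field(description="Whether the company offers AI products or services")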

Running your self-hosted version of Firecrawl

Now that we understand how to use Firecrawl, let's see how to run a fully self-hosted version of the tool. This is particularly useful if you want to run Firecrawl on your own infrastructure or if you have specific needs in terms of data privacy or data security.

  1. Install Docker, as outlined here.

  2. Clone the Firecrawl github repository:

    git clone https://github.com/mendableai/firecrawl.git

    Navigate into the repository:

    cd firecrawl
  3. Create a .env file with the following content:

    NUM_WORKERS_PER_QUEUE=8
    PORT=3002
    HOST=0.0.0.0
    REDIS_URL=redis://redis:6379
    REDIS_RATE_LIMIT_URL=redis://redis:6379

    Note that there are advanced features available, such as authentication, AI features, JS block support, or PDF parsing, which require additional environment variables. Read more here.

  4. Build and run the Docker containers:

    docker compose up --build

    Your Firecrawl instance will now be available at http://localhost:3002. To use your self-hosted Firecrawl API, instantiate your FirecrawlApp as follows:

    app = FirecrawlApp(api_key="something", api_url='http://localhost:3002')

    Note: You need to set the api_key parameter even though we haven't enabled Firecrawl authentication in this example. The Python SDK will error out if the api_key is not set, so just set it to any string.
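
From here on, the self-hosted instance behaves like the cloud version. As a quick sanity check, you can reuse the scrape example from earlier against the local API (the target URL is just an example):

    from firecrawl import FirecrawlApp

    # Point the SDK at the self-hosted instance; the key just needs to be a non-empty string
    app = FirecrawlApp(api_key="self-hosted", api_url='http://localhost:3002')

    result = app.scrape_url('https://pondhouse-data.com', params={'formats': ['markdown']})
    print(result['markdown'][:500])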

Conclusion

To conclude: in this post we've seen how to use Firecrawl to quickly and easily scrape and crawl websites for the LLM age. We've also seen how to extract structured data from websites using schema-based parsing. Finally, we demonstrated how to integrate Firecrawl with LangChain for even more convenience.

And as a bonus lesson, we learned how to run Firecrawl on our own infrastructure - providing a good starting point for creating a self-hosted scraping environment.

Further reading

------------------

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages, like:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use