How to crawl websites for LLMs - using Firecrawl
Many AI applications built on modern LLMs (Large Language Models) require access to high-quality data from the public web. Think of customer service applications, research assistants, or general-purpose chatbots. However, the web is a vast and messy source of data, composed of HTML pages filled with unstructured text. That's where this blog post comes into play.
We'll explore how to use Firecrawl to crawl websites and prepare data for your LLM projects. We'll also demonstrate how to integrate Firecrawl with LangChain, a popular framework for developing applications with LLMs.
After reading this post, you'll have a solid understanding of Firecrawl and a step-by-step guide for efficiently getting data from any public website.
What is Firecrawl?
Firecrawl is a powerful web crawling and scraping tool designed to simplify the process of converting entire websites into clean, structured data, particularly in formats that are ready for large language models.
Key features:
Dynamic Content Handling: Firecrawl can scrape websites that render content with JavaScript, making it highly effective for modern, dynamic sites.
Automated Content Transformation: The tool automatically converts scraped data into markdown or other structured formats, making it easier to feed into machine learning models or content pipelines.
Caching and Rate Limiting: To prevent overloading websites and to comply with rate limits, Firecrawl features intelligent caching and rate-limiting capabilities. This ensures ethical and efficient data scraping without impacting server performance.
Scalability: Firecrawl is built to scale, making it suitable for both small projects and large-scale data operations. Whether you’re scraping a single website or thousands, Firecrawl adapts to your needs.
Open-Source and Extensible: Firecrawl is open-source, allowing users to self-host and customize the platform.
User-Friendly Interface: Despite its advanced capabilities, Firecrawl is designed with ease of use in mind. Its intuitive interface and clear documentation make it accessible even to those who are new to web scraping.
Support for structured data extraction: Firecrawl lets you define Pydantic models to extract data from pages in a structured way.
Crawling websites with Firecrawl
Let's see the tool in action. First, we're going to use the managed cloud version of Firecrawl, as it makes getting started easy. If you want to self-host, jump straight to the last section of this article.
- Navigate to Firecrawl and sign up for an account.
- Head over to your Firecrawl keys section.
- Create a new API key, or use the default one. Copy the key.
- Install the Firecrawl Python client:
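```bash
pip install firecrawl-py
```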
- Use the following code snippet to scrape a single URL without any sub-pages:
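A minimal sketch using the Python SDK; the pondhouse-data.com URL is illustrative, and the exact parameter shape may vary between SDK versions:

```python
from firecrawl import FirecrawlApp

# Create a client using the API key from your Firecrawl dashboard
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a single URL - no sub-pages are followed
result = app.scrape_url(
    "https://www.pondhouse-data.com",
    params={"formats": ["markdown", "html"]},
)

print(result["markdown"])
print(result["html"])
```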
This will output the pondhouse-data website in markdown and HTML format.
- If you want to crawl a website with multiple sub-pages, use the following code snippet:
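A sketch of a synchronous crawl, again assuming a v1-style firecrawl-py SDK; the page limit is an illustrative safety cap:

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Crawl the site including sub-pages, polling for completion every 10 seconds
crawl_result = app.crawl_url(
    "https://www.pondhouse-data.com",
    params={
        "limit": 100,  # illustrative cap on the number of crawled pages
        "scrapeOptions": {"formats": ["markdown"]},
    },
    poll_interval=10,
)

# One entry per crawled page
for page in crawl_result["data"]:
    print(page["metadata"]["sourceURL"])
```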
This will crawl all pages of the site, poll for the crawl status every 10 seconds, and create a markdown representation of each page. The output is a list of documents, one per crawled page, each containing the page's markdown content and metadata.
Asynchronous Crawling with Firecrawl
While the above crawling examples already demonstrate Firecrawl's ease of use, they run synchronously. Especially when crawling large sites with many sub-pages, this might block your application for many minutes. Alternatively, you can run the crawl job asynchronously:
- Create a crawl job:
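A sketch, assuming the same SDK as above:

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Start the crawl without blocking; returns immediately with a job id
job = app.async_crawl_url(
    "https://www.pondhouse-data.com",
    params={"limit": 100, "scrapeOptions": {"formats": ["markdown"]}},
)

print(job["id"])  # keep this id to query the job status later
```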
This will create a crawl job and return the job id, which can be used to check the status of the job.
- Check the status of the job and retrieve the data:
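Continuing the sketch above; the status field names may differ between SDK versions:

```python
# Poll the job; once completed, the response carries the crawled documents
status = app.check_crawl_status(job["id"])

if status["status"] == "completed":
    for page in status["data"]:
        print(page["markdown"][:200])
else:
    print(f"Crawl still running: {status['status']}")
```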
Once done, this will output the same result as for the synchronous case.
Integrating Firecrawl with LangChain
Another way to use Firecrawl is together with LangChain. LangChain is a powerful framework for developing applications with LLMs. For example, it provides a vast collection of document loaders: easy-to-use Python classes that load data from a huge variety of sources in a standardized format. One of these loaders is the FireCrawl document loader.
- In addition to firecrawl-py, install the LangChain community package:
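```bash
pip install langchain-community
```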
- Then, use the following code snippet to load data from the pondhouse-data website:
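A minimal sketch using LangChain's FireCrawlLoader; the mode and URL are illustrative:

```python
from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    api_key="fc-YOUR_API_KEY",
    url="https://www.pondhouse-data.com",
    mode="scrape",  # "scrape" for a single page, "crawl" to include sub-pages
)

docs = loader.load()

# Standard LangChain documents: page content plus metadata
print(docs[0].page_content[:200])
print(docs[0].metadata)
```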
For more details, refer to the LangChain documentation.
Parsing specific content from websites using schema-based parsing
In addition to providing formatted versions of the full page content, Firecrawl can use an LLM to extract specific information from a website.
This is done by providing a Pydantic schema that defines exactly what to extract.
- Define the schema:
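For example, a hypothetical schema for pulling company details off a landing page - the fields are purely illustrative:

```python
from pydantic import BaseModel, Field

class CompanyInfo(BaseModel):
    """Describes what the LLM should extract from the page."""

    company_name: str = Field(description="The name of the company")
    company_mission: str = Field(description="The mission statement of the company")
    main_products: list[str] = Field(description="The main products or services offered")
```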
Make sure the field names are descriptive, as the LLM uses them to decide what to extract.
- Next, use this model as part of the Firecrawl invocation:
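A sketch using the schema defined above, assuming a v1-style SDK where extraction is requested via the extract format (older SDK versions use an extractorOptions parameter instead):

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Pass the JSON schema of the Pydantic model to the LLM-based extractor
result = app.scrape_url(
    "https://www.pondhouse-data.com",
    params={
        "formats": ["extract"],
        "extract": {"schema": CompanyInfo.model_json_schema()},
    },
)

print(result["extract"])
```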
The output is a JSON object with the extracted data, matching the schema defined above.
This is actually a very powerful feature - extracting structured information from the mess of a website with just 5 lines of code is quite remarkable.
Running your own self-hosted version of Firecrawl
Now that we understand how to use Firecrawl, let's see how to run a fully self-hosted version of the tool. This is particularly useful if you want to run Firecrawl on your own infrastructure or if you have specific needs in terms of data privacy or data security.
- Install Docker, as outlined here.
- Clone the Firecrawl GitHub repository and navigate into it:
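```bash
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
```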
- Create a `.env` file with the content shown below. Note that there are advanced features available, like authentication, AI features, JS block support, or PDF parsing, which require additional environment variables. Read more here.
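A minimal example for a local, unauthenticated instance - an assumption based on the self-hosting guide; consult the repository's .env.example for the authoritative list of variables:

```bash
# Minimal settings for a local, unauthenticated instance (see .env.example)
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
USE_DB_AUTHENTICATION=false
```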
- Build and run the Docker containers:
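```bash
docker compose build
docker compose up -d
```

Depending on your Docker installation, the command may be `docker-compose` instead of `docker compose`.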
Your Firecrawl instance will now be available at `http://localhost:3002`. To use your self-hosted Firecrawl API, instantiate your `FirecrawlApp` as follows. Note: You need to set the `api_key` parameter even though we've not enabled Firecrawl authentication in this example. The Python SDK will error out if the `api_key` is not set, so set it to any string.
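A sketch; the api_url parameter points the SDK at the local instance, and the key is an arbitrary placeholder:

```python
from firecrawl import FirecrawlApp

# Any non-empty string works as api_key when authentication is disabled
app = FirecrawlApp(
    api_key="self-hosted-placeholder",
    api_url="http://localhost:3002",
)

result = app.scrape_url(
    "https://www.pondhouse-data.com",
    params={"formats": ["markdown"]},
)
print(result["markdown"][:200])
```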
Conclusion
In this post, we've seen how to use Firecrawl to quickly and easily scrape and crawl websites for the LLM age. We've also seen how to extract structured data from websites using schema-based parsing. Finally, we demonstrated how to integrate Firecrawl with LangChain for even more convenience.
And as a bonus lesson, we learned how to run Firecrawl on our own infrastructure - providing a good starting point for creating a self-hosted scraping environment.
Further reading
- Using AI directly from your PostgreSQL database
- Use Vector Search in BigQuery
- How to host your own LLM - including HTTPS?
Interested in how to train your very own Large Language Model?
We've prepared a well-researched guide on how to use the latest advancements in open source technology to fine-tune your own LLM. This has many advantages, like:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use