Crawl4AI Tutorial: Build a Powerful Web Crawler for AI Applications Using Docker
Building modern LLM applications often requires more than just connecting to OpenAI's API. Real-world use cases demand current, specific information that goes beyond the model's training data.
Consider a competitive analysis bot that needs to track competitor websites, a research assistant that requires access to academic papers, or an e-commerce AI that needs to monitor pricing across multiple stores. The challenge isn't just accessing this data - it's getting it in a format that LLMs can effectively process.
Modern websites rarely serve their content in simple HTML anymore. Instead, they use JavaScript to load content dynamically after the initial page load. This means basic HTTP requests aren't sufficient - you need a crawler that can execute JavaScript and wait for content to load, similar to how a real browser works. For example, when crawling an e-commerce site, product prices might only appear after JavaScript runs, or when scraping a news site, articles might load as you scroll down the page.
Crawl4AI, the tool we're going to discuss today, offers two key advantages that are particularly relevant for AI applications. First, it can handle many requests simultaneously through asynchronous processing. When your application needs to analyze hundreds of pages quickly, this concurrent processing capability becomes highly important for performance.
Second, Crawl4AI is designed to work seamlessly with LLMs, both in how it processes output and how it can use AI to improve extraction. Instead of wrestling with complex HTML cleaning or writing brittle parsing rules, you get clean, structured data that's ready for an LLM to process.
Being open-source, Crawl4AI offers these capabilities without the limitations of proprietary solutions. You can inspect the code, modify it for your needs, and scale your infrastructure as required. Let's look at how to set this up using Docker.
In this guide, we'll walk you through setting up Crawl4AI using Docker. Using the dockerized version of Crawl4AI allows us to run a web server that can handle multiple crawl requests simultaneously. It also provides a convenient crawling REST API, so it's very easy to integrate into virtually any application.
Preparation
As of the time of this writing, Crawl4AI is available in version 0.3.73, which only recently introduced the full API and Docker support. Official Docker images were also introduced - however, only for arm64-based CPUs.
As you most probably want to run this on an x86-based server, we have to build the image ourselves.
Note: It's worth checking their official Docker Hub for new images. I assume they will provide x86 images soon - then you can skip this whole preparation step.
Let's clone the repository to our machine:
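At the time of writing, the repository lives on GitHub under the unclecode organization:

```bash
git clone https://github.com/unclecode/crawl4ai.git
```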
Next, navigate into the cloned repository:
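Assuming the default directory name created by the clone:

```bash
cd crawl4ai
```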
Open the file `main.py` and remove the line that mounts the Crawl4AI documentation and serves it as part of the web server. First, we don't need the documentation here, and secondly, including it would require a second build step, which we want to avoid for now.
Note: This might change in the future. So if you can't find this line in the file, don't worry, just continue with the next steps.
Now execute the following command to build the Docker image:
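A sketch of the build command we used; the image tag `crawl4ai:full` is our own choice, and the exact build arguments may differ between releases:

```bash
# Build the image with full (AI-powered) extraction support,
# see the note below for the "basic" alternative.
docker build -t crawl4ai:full --build-arg INSTALL_TYPE=full .
```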
Note: Instead of setting `INSTALL_TYPE` to `full`, you can also set it to `basic`. The full version is required for AI-powered extraction; if you just want web crawling without AI extraction, the basic version is sufficient.
How to use the Crawl4AI API
Now that we have our Docker image, let's start it up and see how we can use it.
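Starting the container looks roughly as follows. The host port (11235) and the image tag are assumptions based on the defaults of the version we used; adjust them if your setup differs. We pass two environment variables, explained below:

```bash
docker run -d \
  --name crawl4ai \
  -p 11235:11235 \
  -e OPENAI_API_KEY="sk-..." \
  -e MAX_CONCURRENT_TASKS=5 \
  crawl4ai:full
```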
- `OPENAI_API_KEY`: Set this to your OpenAI API key. We need it for the AI extraction tasks later on.
- `MAX_CONCURRENT_TASKS`: The maximum number of concurrent tasks the crawler will handle. Adjust this to your server's capabilities.
That's all we need to run Crawl4AI. Next up: using the API. In the examples below we'll use simple `curl` commands, but naturally you can use any REST client available to you.
Simple Web Crawl
First up, let's do a simple web crawl:
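A hedged example of submitting a crawl job, assuming the server runs locally on port 11235 and exposes the `/crawl` endpoint as in the version we used (the target URL is just a placeholder):

```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": "https://example.com"}'
```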
The result looks as follows:
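In the version we used, the response simply contains the id of the newly created crawling task, along the lines of:

```json
{"task_id": "c2a8e4a0-..."}
```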
We don't get the result directly, as the crawling is done asynchronously (which, by the way, is one of the great features of Crawl4AI - it processes multiple requests asynchronously). What we get back is the ID of the crawling task.
As you might guess, we now have to wait for the crawling to finish. We can do this by repeatedly polling the following endpoint:
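A sketch of the polling call, assuming the `/task/<task_id>` endpoint layout of the version we used:

```bash
# Replace $TASK_ID with the task_id returned by the /crawl request above
curl http://localhost:11235/task/$TASK_ID
```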
The result is a rather large JSON document, but what we are interested in is the `status` field of the response.
If we continue our simple curl/bash journey, we can use `jq` to extract the status, like so:
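For example (same assumptions as above):

```bash
curl -s http://localhost:11235/task/$TASK_ID | jq -r '.status'
```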
If the status is `completed`, the task is done. The completed task response already contains the crawling results, so there is no need to call another API endpoint.
The result is then populated in the `result` field. There are four interesting properties in the result (see the `jq` examples after the following list):
- Cleaned HTML: A cleaned-up HTML version of the page, available in the `result.cleaned_html` field.
- Markdown: A Markdown version of the page, available in the `result.markdown` field. For most AI use cases this is the preferable format, as you usually don't need the HTML tags included in the option above.
- Metadata: The header metadata of the page, such as the title, page description, or `og:image`.
- Links: And finally, quite interesting: Crawl4AI provides quite significant features when it comes to links. First and foremost, it extracts all internal and external links from the page and classifies them as such. While link analysis is beyond the scope of this post, Crawl4AI offers a good tutorial on that in their docs.
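Sticking with our curl/jq approach, the individual properties can be pulled out roughly like this (assuming the task endpoint shown earlier; the exact field names may differ between versions):

```bash
# Cleaned HTML version of the page
curl -s http://localhost:11235/task/$TASK_ID | jq -r '.result.cleaned_html'

# Markdown version of the page
curl -s http://localhost:11235/task/$TASK_ID | jq -r '.result.markdown'

# Header metadata (title, description, og:image, ...)
curl -s http://localhost:11235/task/$TASK_ID | jq '.result.metadata'

# Internal and external links
curl -s http://localhost:11235/task/$TASK_ID | jq '.result.links'
```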
Using Python with the Crawl4AI API
Just for the sake of completeness, here is how you'd make the requests we did above in Python, using the requests library:
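The following is a minimal sketch, assuming the same host, port, and endpoint paths as in the curl examples above:

```python
import time

import requests

BASE_URL = "http://localhost:11235"  # assumed host and port, see above

# Submit the crawl job
response = requests.post(f"{BASE_URL}/crawl", json={"urls": "https://example.com"})
task_id = response.json()["task_id"]

# Poll the task endpoint until the crawl is completed
while True:
    task = requests.get(f"{BASE_URL}/task/{task_id}").json()
    if task["status"] == "completed":
        break
    time.sleep(2)

# The results are part of the completed task response
print(task["result"]["markdown"])
```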
Advanced parameters for the Crawl4AI API
As shown above, it's quite easy to start a simple crawling job. However, there are a number of parameters you might want to consider:
While most of the parameters are self-explanatory, the last two might need some additional explanation: session management in Crawl4AI allows you to maintain state across multiple requests and to handle complex multi-page crawling tasks, which is particularly useful for dynamic websites.
Please have a look at these two documentation sites for more on that:
Using the Crawl4AI LLM Extraction
In the section above we demonstrated a simple crawling job. Now let's step up the game a notch and use the AI extraction capabilities of Crawl4AI.
Let's say we want to crawl websites and find very specific information, like who the owners of certain companies are.
We can use the following request to automatically invoke an LLM to extract specific information:
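A hedged sketch of such a request, based on the extraction config format of the version we used - the provider name and instruction are illustrative, and the exact field names may differ in newer releases. We rely on the `OPENAI_API_KEY` we passed to the container earlier for authentication:

```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": "https://example.com",
    "extraction_config": {
      "type": "llm",
      "params": {
        "provider": "openai/gpt-4o-mini",
        "instruction": "Extract the names of the company owners mentioned on this page."
      }
    }
  }'
```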
To get the extracted results, look at the `result.extracted_content` field. It will be JSON-encoded, therefore we'll use jq's `fromjson` filter to decode it:
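Assuming the same task endpoint as before:

```bash
curl -s http://localhost:11235/task/$TASK_ID | jq '.result.extracted_content | fromjson'
```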
Quite amazing, isn't it? It's very simple to get this whole pipeline working.
Note: As of the time of this writing, if you create a crawling job for the same URL multiple times, the parameters seem to be cached - meaning if you crawl the same URL twice with different extraction instructions, the cached extraction from the first run is returned. Make sure to keep this in mind.
Using different LLM providers with Crawl4AI
If you want to use a different LLM provider, like Azure OpenAI for example, just refer to the LiteLLM providers manual - Crawl4AI uses LiteLLM under the hood.
Note that the Azure OpenAI provider needs to be authenticated with a Microsoft Entra ID token, not an Azure OpenAI API key (so set the `api_token` parameter to an Entra ID token that authenticates against your Azure OpenAI resource).
An example request using Azure OpenAI looks as follows:
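A hedged sketch of what this could look like. The deployment name follows LiteLLM's `azure/<deployment>` naming convention, `<entra-id-token>` stands for a token you obtained yourself (for example via the Azure CLI), and we assume the Azure endpoint and API version are supplied to LiteLLM via its `AZURE_API_BASE` and `AZURE_API_VERSION` environment variables on the container:

```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": "https://example.com",
    "extraction_config": {
      "type": "llm",
      "params": {
        "provider": "azure/my-gpt-4o-deployment",
        "api_token": "<entra-id-token>",
        "instruction": "Extract the names of the company owners mentioned on this page."
      }
    }
  }'
```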