Crawl4AI Tutorial: Build a Powerful Web Crawler for AI Applications Using Docker

Building modern LLM applications often requires more than just connecting to OpenAI's API. Real-world use cases demand current, specific information that goes beyond the model's training data.

Consider a competitive analysis bot that needs to track competitor websites, a research assistant that requires access to academic papers, or an e-commerce AI that needs to monitor pricing across multiple stores. The challenge isn't just accessing this data - it's getting it in a format that LLMs can effectively process.

Modern websites rarely serve their content in simple HTML anymore. Instead, they use JavaScript to load content dynamically after the initial page load. This means basic HTTP requests aren't sufficient - you need a crawler that can execute JavaScript and wait for content to load, similar to how a real browser works. For example, when crawling an e-commerce site, product prices might only appear after JavaScript runs, or when scraping a news site, articles might load as you scroll down the page.

Crawl4AI, the tool we're going to discuss today, offers two key advantages that are particularly relevant for AI applications. First, it can handle many requests simultaneously through asynchronous processing. When your application needs to analyze hundreds of pages quickly, this concurrent processing capability becomes highly important for performance.

Second, Crawl4AI is designed to work seamlessly with LLMs, both in how it processes output and how it can use AI to improve extraction. Instead of wrestling with complex HTML cleaning or writing brittle parsing rules, you get clean, structured data that's ready for an LLM to process.

Being open-source, Crawl4AI offers these capabilities without the limitations of proprietary solutions. You can inspect the code, modify it for your needs, and scale your infrastructure as required. Let's look at how to set this up using Docker.

In this guide, we'll walk you through setting up Crawl4AI using Docker. Using the dockerized version of Crawl4AI allows us to run a web server that can handle multiple crawl requests simultaneously. It also provides a nice crawling REST API, making it very easy to integrate into virtually any application.

Preparation

As of this writing, Crawl4AI is available in version 0.3.73. It only recently introduced the full API and Docker support. They also introduced official Docker containers - however, only for arm64-based CPUs.

As you most probably want to run this on an x86-based server, we have to build the image ourselves.

Note: It's worth checking their official Docker Hub for new images. They will likely provide x86 images soon - then you can skip this whole preparation step.

  1. Let's clone the repository to our machine:

    git clone https://github.com/unclecode/crawl4ai.git
  2. Next, navigate into the cloned repository:

    cd crawl4ai
  3. Open the file main.py and remove the following line from the file.

    app.mount("/mkdocs", StaticFiles(directory="site", html=True), name="mkdocs")

    This line would mount the Crawl4AI documentation and provide it as part of the web server. However, first, we don't need it, and second, it would require an additional build step, which we want to avoid for now.

    Note: This might change in the future. So if you can't find this line in the file, don't worry, just continue with the next steps.

  4. Now execute the following command to build the docker image:

    docker build -f Dockerfile -t pondhouse/crawl4ai:beta-1 --build-arg INSTALL_TYPE=all .

    Note: Instead of setting INSTALL_TYPE to all, you can also set it to basic. The all version is required for AI-powered extraction, whereas the basic version is sufficient if you just want web crawling without AI extraction.

How to use the Crawl4AI API

Now that we have our Docker image, let's start it up and see how we can use it.

docker run -p 11235:11235 \
  -e OPENAI_API_KEY=<your-openai-api-key> \
  -e MAX_CONCURRENT_TASKS=5 \
  pondhouse/crawl4ai:beta-1

  • OPENAI_API_KEY: Set this to your OpenAI API key. We need this for AI extraction tasks later on.

    Note: Crawl4AI uses litellm for connecting to LLMs. So you can use most of the providers offered by liteLLM. Just change the provider parameter in the requests described below.

  • MAX_CONCURRENT_TASKS: This sets the maximum number of concurrent tasks the crawler will handle. You can adjust this to your server's capabilities.
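
Before sending any requests, it can help to wait until the container has finished starting up. Here's a minimal readiness check in Python - it only assumes that the server answers HTTP on port 11235, not any particular endpoint:

import time

import requests

BASE_URL = "http://127.0.0.1:11235"

def wait_for_server(timeout: float = 60.0) -> None:
    """Poll the server until it answers HTTP requests or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # Any HTTP response (even a 404) means the web server is up.
            requests.get(BASE_URL, timeout=2)
            return
        except requests.exceptions.RequestException:
            time.sleep(1)
    raise TimeoutError("Crawl4AI server did not become reachable in time")

wait_for_server()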

That's all we need to run Crawl4AI. Next up: using the API. In the examples below we'll use simple curl commands, but naturally you can use any REST client available to you.

Simple Web Crawl

First up, let's do a simple web crawl:

curl -X POST 'http://127.0.0.1:11235/crawl' \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": "https://pondhouse-data.com"
  }'

The result looks as follows:

{ "task_id": "377fd003-2cd2-49bf-9f18-ee843ca21179" }

We don't get the result directly, as the crawling is done asynchronously. (Which, by the way, is one of the great features of Crawl4AI - it offers asynchronous processing for multiple requests.) What we get back is the ID of the crawling task.

As you might guess, we now have to wait for the crawling to finish. We can do this by repeatedly calling the following endpoint:

curl 'http://127.0.0.1:11235/task/<task_id_from_above>'

The result is a rather large JSON document, but what we are interested in is the status field of the response.

If we continue our simple curl/bash journey, we can use jq to extract the status, like so:

curl 'http://127.0.0.1:11235/task/377fd003-2cd2-49bf-9f18-ee843ca21179' | jq .status

If the status is completed, the task is done. In that case the crawling results are already included in the task response - so there's no need to call another API endpoint.

The result is then populated in the result field. There are 4 interesting properties in the result:

  1. Cleaned HTML: Provides a cleaned HTML version of the page. Can be accessed via the result.cleaned_html field:

    curl 'http://127.0.0.1:11235/task/<task-id>' | jq .result.cleaned_html
  2. Markdown: Provides a markdown version of the page. Can be accessed via the result.markdown field. For most AI use cases this is probably the preferable format, as you usually don't need the HTML tags provided by the option above.

    curl 'http://127.0.0.1:11235/task/<task-id>' | jq .result.markdown
  3. Metadata: Extracts the header metadata of the page, like title, page description or og:image.

    curl 'http://127.0.0.1:11235/task/<task-id>' | jq .result.metadata

    The result looks like this:

    {
      "title": "Pondhouse Data - Customized AI Solutions for your business",
      "description": "Find relevant information faster than ever!",
      "keywords": null,
      "author": null,
      "og:title": "Pondhouse AI - Customized AI Solutions for your business",
      "og:description": "Find relevant information faster than ever!",
      "og:image": "https://www.pondhouse-data.com/pondhouse-data-header.png",
      "twitter:card": "summary_large_image",
      "twitter:site": "@techscienceandy",
      "twitter:title": "Pondhouse AI - Customized AI Solutions for your business",
      "twitter:description": "Find relevant information faster than ever!",
      "twitter:image": "https://www.pondhouse-data.com/pondhouse-data-header.png"
    }
  4. Links: And finally, quite interesting: Crawl4AI offers substantial features when it comes to links. First and foremost, it extracts all internal and external links from the page and classifies them as such. While link analysis is beyond the scope of this post, Crawl4AI offers a good tutorial on the topic in their docs.

    curl 'http://127.0.0.1:11235/task/<task-id>' | jq .result.links
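
If you're feeding the crawl output into an LLM, the markdown field is usually the one you want. Here's a small sketch in Python that fetches a finished task and stores the markdown, then prints the metadata (it assumes the task from above has already reached the completed status):

import json

import requests

task_id = "<task-id>"  # the ID returned by the /crawl endpoint
task = requests.get(f"http://127.0.0.1:11235/task/{task_id}").json()

if task["status"] == "completed":
    result = task["result"]
    # Save the markdown version of the page - a convenient format for LLM prompts.
    with open("page.md", "w", encoding="utf-8") as f:
        f.write(result["markdown"])
    # The metadata field contains title, description and the og:/twitter: tags.
    print(json.dumps(result["metadata"], indent=2))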

Using Python with the Crawl4AI API

Just for the sake of completeness, here is how you'd do the requests from above in Python, using the requests library:

import time

import requests

BASE_URL = "http://127.0.0.1:11235"
timeout = 60  # seconds to wait for the crawl to finish

request_data = {
    "urls": "https://pondhouse-data.com"
}
response = requests.post(f"{BASE_URL}/crawl", json=request_data)
task_id = response.json()["task_id"]

start_time = time.time()
while True:
    # Check for timeout
    if time.time() - start_time > timeout:
        raise TimeoutError(f"Task {task_id} timed out")

    result = requests.get(f"{BASE_URL}/task/{task_id}")
    status = result.json()

    if status["status"] == "completed":
        break

    time.sleep(2)

# status now holds the full task response, including status["result"]

Advanced parameters for the Crawl4AI API

As shown above, it's quite easy to start a simple crawling job. However, there are a number of parameters you might want to consider:

1 "urls": "https://example.com",
2 "crawler_params": {
3 # Browser Configuration
4 "headless": True, # Run in headless mode
5 "browser_type": "chromium", # chromium/firefox/webkit
6 "user_agent": "custom-agent", # Custom user agent
7 "proxy": "http://proxy:8080", # Proxy configuration
8
9 # Performance & Behavior
10 "page_timeout": 30000, # Page load timeout (ms)
11 "verbose": True, # Enable detailed logging
12
13 # Anti-Detection Features
14 "simulate_user": True, # Simulate human behavior
15 "magic": True, # Advanced anti-detection
16 "override_navigator": True, # Override navigator properties
17
18 # Session Management
19 "user_data_dir": "./browser-data", # Browser profile location
20 "use_managed_browser": True, # Use persistent browser
21 }

While most of these parameters are self-explanatory, the last two might need additional explanation: session management in Crawl4AI allows you to maintain state across multiple requests and handle complex multi-page crawling tasks, which is particularly useful for dynamic websites.

Please have a look at these two documentation sites for more on that:

  • Session management: here
  • Dynamic content handling: here
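
To tie this together, here is a short sketch of how you could pass some of these crawler parameters from Python. The parameter values are purely illustrative - adjust them to your target site:

import requests

request_data = {
    "urls": "https://example.com",
    "crawler_params": {
        "headless": True,            # run the browser without a visible window
        "browser_type": "chromium",  # chromium, firefox or webkit
        "user_agent": "custom-agent",
        "page_timeout": 30000,       # page load timeout in milliseconds
        "simulate_user": True,       # simulate human behavior
    },
}

response = requests.post("http://127.0.0.1:11235/crawl", json=request_data)
print(response.json()["task_id"])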

Using the Crawl4AI LLM Extraction

In the section above we demonstrated a simple crawling job. Now let's step up the game a notch and use the AI extraction capabilities of Crawl4AI.

Let's say we want to crawl websites and extract very specific information, such as who the owners of a company are.


We can use the following request to automatically invoke an LLM and extract this information:

curl -X POST 'http://127.0.0.1:11235/crawl' \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": "https://pondhouse-data.com",
    "extraction_config": {
      "type": "llm",
      "params": {
        "provider": "openai/gpt-4o",
        "api_token": "<your-openai-api-token>",
        "instruction": "From the provided homepage data, extract the owners of the company."
      }
    }
  }'

To get the extracted results, look at the result.extracted_content field. It is JSON-encoded, so we'll use jq's fromjson filter to decode it:

curl 'http://127.0.0.1:11235/task/<task-id>' | jq '.result.extracted_content | fromjson'

[
  {
    "index": 0,
    "tags": ["owners"],
    "content": ["Sascha Gstir", "Andreas Nigg"],
    "error": false
  }
]

Quite amazing, isn't it? It's very simple to get this whole pipeline working.

Note: As of this writing, if you create a crawling job for the same URL multiple times, the parameters seem to be cached - meaning if you crawl the same URL twice with different extraction instructions, you will get the first, cached extraction back. Keep this in mind.
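
For completeness, here is the same LLM extraction job in Python. This is a sketch that reuses the earlier polling idea (the polling loop is omitted for brevity) and assumes the server from above is running locally:

import json

import requests

request_data = {
    "urls": "https://pondhouse-data.com",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4o",
            "api_token": "<your-openai-api-token>",
            "instruction": "From the provided homepage data, extract the owners of the company.",
        },
    },
}

response = requests.post("http://127.0.0.1:11235/crawl", json=request_data)
task_id = response.json()["task_id"]

# Wait for the task to finish (see the polling loop earlier), then decode the result.
task = requests.get(f"http://127.0.0.1:11235/task/{task_id}").json()
if task["status"] == "completed":
    extracted = json.loads(task["result"]["extracted_content"])
    print(extracted)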

Using different LLM providers with Crawl4AI

If you want to use a different LLM provider, Azure OpenAI for example, just refer to the liteLLM providers manual, as Crawl4AI uses liteLLM under the hood. Note that the Azure OpenAI provider needs to be authenticated with an EntraID token rather than an Azure OpenAI API key (so set api_token to an EntraID token that authenticates against your Azure OpenAI resource).

An example request using Azure OpenAI looks as follows:

curl -X POST 'http://127.0.0.1:11235/crawl' \
  -H 'Content-Type: application/json' \
  -d '{
    "urls": "https://pondhouse-data.com",
    "extraction_config": {
      "type": "llm",
      "params": {
        "provider": "azure/gpt-4o",
        "api_token": "<your-entraid-api-token>",
        "instruction": "From the provided homepage data, extract the owners of the company."
      }
    }
  }'
