Parsing PDF, Word and Excel documents with GPT-4o


Extracting data from "human readable" documents like PDFs, Word documents and Excel sheets is an important problem for LLM applications: LLMs can't directly read the information provided in these documents.

For decades, people have been working on solutions to this problem. The most common approaches were rule-based data extraction and machine learning based solutions, both of which have their own limitations. Rule-based solutions are very fast, but limited in the variety of documents they can handle. Machine learning based solutions are more flexible, but much slower - and even they struggle with the sheer variety of document formats and layouts.

In our article about using LLMs to extract data, we discussed how and why LLMs might finally solve this problem. While it looks quite ridiculous to use a billion-parameter model to extract text, the results were miles and miles better than anything we had so far. If you want more background on how LLMs can help with document parsing, we highly recommend reading that article.

In this article we extend upon what we already discussed and show you how to use GPT-4o - arguably one of the best LLMs currently available with vision input capabilities - to extract data. And we'll also show how to extract data not only from PDFs, but also from Word and Excel documents.

Strategy for parsing documents using LLMs

For a detailed explanation of the strategy, please visit our introductory article on document extraction with LLMs. Summarized, the strategy is as follows:

  1. Convert the document into an image.
  2. Send the image to an LLM with vision capabilities.
  3. Prompt the LLM to extract the text from the image.

While it sounds a little ridiculous at first - creating screenshots of documents, sending them to LLMs and "asking" them to kindly return the text - it makes sense on second thought. We simply ask LLMs to "read" our documents by providing them with a visual representation of the document. It's actually quite similar to how we mere humans extract text from documents: we read them.

This strategy proved to be very successful in practice.

  • LLMs are very good at understanding document context and are able to identify layouts, tables and other document structures.

  • By using LLMs, we can use advanced prompting techniques to define the output. E.g. we can ask the model not to extract headers, footers or other irrelevant document parts, or to format headlines in a specific way (e.g. as markdown) - see the prompt sketch after this list.

  • As we simply screenshot the whole document, LLMs can also extract information from images embedded in the document - and even charts! See our article on image analysis with LLMs for more information.
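To make the prompting point concrete, here is a sketch of what such an opinionated system prompt could look like. The wording is purely illustrative - we'll use a slightly different prompt in the actual code below:

    system_prompt = """You are a system to extract text from documents.
    Format all headlines as markdown headings (#, ##, ...).
    Reproduce tables as markdown tables.
    Skip page headers, footers and page numbers entirely.
    Answer only with the extracted text, without any additional explanations."""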

Setting up our system for document parsing

Enough talk, let's dig right into it. We'll use good old Python for this task.

  1. Create and activate a venv or conda environment.

  2. Install the following pip packages:

    pip install openai pillow pypdfium2 backoff

Parsing PDF files using GPT-4o

We'll start with what is, surprisingly, the easiest document format: PDFs.

(Why surprisingly? Historically, PDFs are a pain to work with. They have no standardized layout, no concept of tables (tables in PDFs are just lines and text!) and they can contain images, charts and visualizations of all kinds. This made conventional data extraction systems struggle with PDFs.)

  1. First, let's import our packages and initialize our GPT-4o API client.

    from openai import AsyncOpenAI, RateLimitError
    from io import BytesIO
    import asyncio
    import base64
    import pypdfium2 as pdfium
    import backoff

    client = AsyncOpenAI(api_key="your-api-key")

    Note: We use the async version of the OpenAI SDK so that we can send multiple requests to the API concurrently.

  2. Next, we need to load the PDF file:

    pdf_file = "mypdf.pdf"
    pdf = pdfium.PdfDocument(pdf_file)
  3. To create the images we send to GPT-4o, we simply loop through all the pages and convert them to images using pypdfium2.

    One important thing to note: GPT-4o requires the images to be base64 encoded.

    images = []
    for i in range(len(pdf)):
        page = pdf[i]
        # Render the page at 4x scale for a crisp screenshot
        image = page.render(scale=4).to_pil()
        # Encode the rendered page as a base64 JPEG string
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        img_byte = buffered.getvalue()
        img_base64 = base64.b64encode(img_byte).decode("utf-8")
        images.append(img_base64)
  4. All that's left is to send the images to GPT-4o and wait for the extracted text to return. We'll create a parse_page_with_gpt function that takes a single base64 encoded page and returns the extracted text. We then use asyncio to send all the pages of our document concurrently, meaning they are processed more or less in parallel.

    Please note that we use the detail: "low" parameter in the API call. OpenAI allows for two different image qualities: low and high. low means that OpenAI's image pipeline will downsample the image to 512x512 pixels. high also downsamples the image to 512x512 pixels, but subsequently creates detailed crops of the image, 512x512 pixels each. The model first sees the low fidelity image to get an overview, followed by the detailed crops. The problem however: if the details you want to extract are larger than 512 pixels, the detailed crops are of little use. For document extraction you therefore almost always end up with better results (and lower cost) by using the low quality setting - see the token cost sketch at the end of this chapter.

    @backoff.on_exception(backoff.expo, RateLimitError)
    async def parse_page_with_gpt(base64_image: str) -> str:
        messages = [
            {
                "role": "system",
                "content": """You are a system to extract knowledge from documents.
    We want to add this knowledge to a wiki afterwards.
    Please extract the knowledge of this document.
    Please only extract the information given in the document.
    Do not answer with any additional explanations or text.
    I want to take your answer 1 to 1 and put it in the wiki.
    Please do not reference graphical elements or visualizations in your answer. Just answer with the extracted text.
    Make sure that text at the end of the page as well as text at the beginning of the page are also at the end and beginning of your extraction - as this might be continuations of the previous and next page.
    """
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract knowledge from this document"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "low"
                        },
                    },
                ],
            }
        ]

        response = await client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=messages,
            max_tokens=4096,
        )

        return response.choices[0].message.content or ""

    # Parse all pages concurrently (in a plain script, wrap this in an
    # async main() and run it with asyncio.run())
    text_of_pages = await asyncio.gather(*[parse_page_with_gpt(image) for image in images])

    Note that we use the remarkable backoff library to handle rate limiting. This is quite important, as the OpenAI API has quite strict limits.

  5. As a final step, we simply join the parsed per-page texts into a single document text.

    document_text = "\n".join(text_of_pages)

That's it. With just a few lines of code, we were able to extract text from a pdf file using GPT-4o. And from our experience, the parsing quality is just not comparable to conventional methods - in the most positive way imaginable.
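As referenced above, here is a rough sketch of the image input token cost at both detail levels. The tiling rules follow OpenAI's published pricing formula for GPT-4o (85 base tokens, plus 170 tokens per 512x512 tile in high detail); this function is an approximation for illustration, not part of the parsing pipeline:

    from math import ceil

    def image_input_tokens(width: int, height: int, detail: str = "low") -> int:
        """Approximate the input token cost of one image for GPT-4o
        (based on OpenAI's published pricing rules, which may change)."""
        if detail == "low":
            return 85  # flat fee, regardless of the image size
        # high detail: the image is scaled to fit into 2048x2048,
        # then the shortest side is scaled down to 768 pixels,
        # then it is billed at 170 tokens per 512x512 tile + 85 base tokens
        scale = min(1.0, 2048 / max(width, height))
        width, height = width * scale, height * scale
        scale = min(1.0, 768 / min(width, height))
        width, height = width * scale, height * scale
        tiles = ceil(width / 512) * ceil(height / 512)
        return 85 + 170 * tiles

    print(image_input_tokens(2480, 3508, "low"))   # A4 page at scale=4 -> 85 tokens
    print(image_input_tokens(2480, 3508, "high"))  # same page -> 1105 tokens

For a full A4 page, high detail costs roughly 13x more input tokens than low detail - without improving extraction quality for typical documents.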

How much does extracting a PDF cost?

While it's arguably quite easy to use LLMs for document parsing, what are the costs of using OpenAI's models for this?

As usual, you'll pay for both the input tokens and the generated output tokens.

The output tokens are simple to calculate: they are the number of tokens extracted from the document. The current rate is $10 per 1 million output tokens.

From our experience, a very full page has around 600 tokens, while most pages have closer to 300 tokens. So one page costs about 0.3 to 0.6 cents in output tokens.

Additionally, there is the cost for the input tokens - which is surprisingly cheap when using low-res images: OpenAI charges 85 input tokens per image, meaning an additional 0.0213 cents per page.

Overall, you'll pay about 0.32 to 0.62 cents per page.
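As a sanity check, here is the same arithmetic in a few lines of Python (prices at the time of writing, derived from the numbers above; OpenAI may change them):

    INPUT_PRICE = 2.50 / 1_000_000    # $ per input token (GPT-4o, at time of writing)
    OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

    def cost_per_page(output_tokens: int) -> float:
        input_tokens = 85  # one low-detail image
        return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

    print(f"{cost_per_page(300) * 100:.2f} cents")  # sparse page: ~0.32 cents
    print(f"{cost_per_page(600) * 100:.2f} cents")  # dense page:  ~0.62 cents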

Why not send the whole document at once?

You might ask yourself why we go through the hassle of dividing the document into pages and sending them one by one to the API. Simply because of the resolution limitations of the GPT-4o vision model: as outlined above, our image will be downsampled to 512x512 pixels - meaning we need to make sure that the text is still somewhat readable at this resolution. A single A4 page usually is; a whole multi-page document is not.
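A rough plausibility check (assuming an A4 page and an 11pt body font; the numbers are illustrative):

    # A4 is 8.27 x 11.69 inches. After the "low" detail downsample, the longest
    # side is at most 512 px, so a single portrait page ends up at roughly
    # 512 / 11.69 ≈ 44 DPI. An 11pt font is 11/72 inch tall, i.e. ~6-7 px -
    # small, but still legible for the model. Stacking two pages into one
    # image would halve that again.
    effective_dpi = 512 / 11.69
    font_height_px = 11 / 72 * effective_dpi
    print(f"{effective_dpi:.1f} DPI, {font_height_px:.1f} px per line of 11pt text")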

Parsing DOCX Word documents using GPT-4o

Let's move on to the next task: parsing Word documents in the docx format. Why is this different from parsing PDFs? Well, because Microsoft - in a move which can only be described as absolutely genius (sarcasm...) - decided not to include any kind of page break information in the file format. Meaning, the rendering software decides where to add page breaks. (That's, by the way, one of the reasons why each docx rendering software renders the same document a little differently.) For us, this means that we can't simply create screenshots of the pages, because ... well ... we don't know where a page starts and ends.

We did a lot of experiments and found that the best way is to first transform the Word document to PDF and then use the strategy outlined in the chapter above.

Yes, dear reader, the best way to parse docx word documents seems to be:

  1. Convert the docx to pdf
  2. Convert the document into an image.
  3. Send the image to an LLM with vision capabilities.
  4. Prompt the LLM to extract the text from the image.

(I can't believe that I'm writing this ... but believe me, the output quality is indeed very, very good!)

As we already discussed steps 2, 3 and 4 in the PDF chapter, let's focus on how to convert a docx to PDF. Again, there are multiple ways; the one we found works best is to use LibreOffice: render the docx in headless mode and save it as a PDF.

  1. Install LibreOffice. On Ubuntu, you can simply use apt install libreoffice. For other platforms, please see the LibreOffice download page.

  2. Let's then create a small helper that returns the LibreOffice executable path for the current platform.

    import sys

    def libreoffice():
        if sys.platform == 'darwin':
            return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
        if sys.platform == 'win32':
            return 'C:\\Program Files\\LibreOffice\\program\\soffice.exe'
        return 'libreoffice'
  3. Next, let's create our conversion method. The method takes a BytesIO object as input and returns a BytesIO object which contains the PDF file. If you are not familiar with BytesIO: it's a binary stream backed by an in-memory bytes buffer.

    • we first write the BytesIO input to a temporary file (as LibreOffice can only work with files, not streams)
    • then, we run the LibreOffice command to convert the file to PDF
    • after checking whether the conversion succeeded, we read the generated PDF file and return it as a BytesIO
    import os
    import re
    import subprocess
    import tempfile

    def convert_to_pdf(source: BytesIO, timeout=20) -> BytesIO:
        # Create a temp directory to store the input and output file
        with tempfile.TemporaryDirectory() as temp_dir:
            temp_input_path = os.path.join(temp_dir, "input_file")
            with open(temp_input_path, "wb") as temp_file:
                temp_file.write(source.getvalue())

            # Construct the LibreOffice command
            args = [
                libreoffice(),
                '--headless',
                '--convert-to',
                'pdf',
                '--outdir',
                temp_dir,
                temp_input_path
            ]

            process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)

            # Check for the output filename in the LibreOffice output.
            # The output is something like -> <filename> using filter
            filename_match = re.search('-> (.*?) using filter', process.stdout.decode())

            if filename_match is None:
                raise Exception(process.stdout.decode())

            output_filename = filename_match.group(1)
            output_path = os.path.join(temp_dir, output_filename)

            # Read the generated PDF file
            with open(output_path, "rb") as pdf_file:
                pdf_content = BytesIO(pdf_file.read())

            return pdf_content

Now all that's left is to put everything together:

    docx_path = "mydoc.docx"
    with open(docx_path, "rb") as docx_file:
        docx_content = BytesIO(docx_file.read())

    pdf = convert_to_pdf(docx_content)

    # ... continue with the pdf parsing as outlined in the chapter above ...

Parsing XLSX Excel documents using GPT-4o

Now, for the sake of completeness: we have a similar dilemma with Excel files. There, it's also not clear where a page starts and ends. Once again, we use LibreOffice to convert the Excel file to PDF and then continue as usual.

The process is exactly the same as for docx files, so we can re-use our methods from above.

    xlsx_path = "mydoc.xlsx"
    with open(xlsx_path, "rb") as xlsx_file:
        xlsx_content = BytesIO(xlsx_file.read())

    pdf = convert_to_pdf(xlsx_content)

    # ... continue with the pdf parsing as outlined in the chapter above ...

Note: LibreOffice will use the print settings stored in your file. So if you set the page layout to landscape, LibreOffice will print landscape. If you set it to fit all columns on one page, LibreOffice will do so. If you set the print area to a specific range, LibreOffice will only print this range. This can be quite useful to control the output of the document - and you can even set these options programmatically, as sketched below.
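For example, here is a sketch of how such print settings could be set with openpyxl before converting (an additional dependency not used elsewhere in this article; the cell range is purely illustrative):

    from openpyxl import load_workbook
    from openpyxl.worksheet.properties import PageSetupProperties

    wb = load_workbook("mydoc.xlsx")
    ws = wb.active

    ws.print_area = "A1:H40"                 # illustrative range; only this gets "printed"
    ws.page_setup.orientation = "landscape"  # landscape usually fits more columns
    # Fit all columns on one page width, but allow any number of pages vertically
    ws.sheet_properties.pageSetUpPr = PageSetupProperties(fitToPage=True)
    ws.page_setup.fitToWidth = 1
    ws.page_setup.fitToHeight = 0

    wb.save("mydoc_print_ready.xlsx")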

Conclusion

In this article, we've demonstrated how to use GPT-4o, one of the most advanced language models with vision capabilities, to tackle the long-standing challenge of extracting data from complex documents such as PDFs, Word files, and Excel spreadsheets.

We've shown that by converting these documents into images and leveraging GPT-4o's visual understanding capabilities, we can achieve remarkable levels of accuracy in text extraction and document parsing. This approach proves particularly effective for handling multi-column layouts, tables, and even embedded images or charts - elements that have traditionally posed significant challenges for conventional parsing methods.

To be fair though: the approach may seem a little odd at first glance, considering that we first open a file, convert it to PDF, convert it to an image, send the image to a multi-billion-parameter model, ask it nicely to extract the text and then finally receive the text. However, the results speak for themselves - they are miles ahead of what was there previously.

Key takeaways from our exploration include:

  • Simplicity and Effectiveness: Despite seeming counter-intuitive, the strategy of converting documents to images for LLM processing yields remarkably accurate results.

  • Versatility: This method works across various document types, including PDFs, Word documents, and Excel spreadsheets, providing a unified approach to document parsing.

  • Cost-Effectiveness: While using GPT-4o for document parsing does incur costs, they are relatively modest, ranging from 0.32 to 0.62 cents per page for most documents.

  • Adaptability: By using prompts, we can customize the extraction process to focus on specific elements or formats, enhancing the flexibility of this approach.

  • Overcoming Format Limitations: For formats like DOCX and XLSX that don't have inherent page definitions, converting to PDF first proves to be an effective workaround.

Further Reading

------------------

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in Open Source technology to fine-tune your own LLM. This has many advantages, like:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use