Retrieval Augmented Generation - Extract text and images from a PDF document

Despite their multi-modal capabilities, LLMs work best when fed with text.

For humans, the standard digital document is a PDF with text, tables and images. It is therefore necessary to extract the content of a PDF and then provide it to an LLM.

PDF being what it is (horribly complicated), most modern techniques combine direct text extraction, page images and optical character recognition (OCR).
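To make that mix concrete, here is a minimal sketch of how a pipeline might decide, page by page, whether the embedded text layer is usable or whether it should fall back to OCR. The function name `needs_ocr` and both thresholds are assumptions chosen for illustration. Neither tool presented in this article is known to work this way.

```python
import unicodedata


def needs_ocr(page_text: str, min_chars: int = 50, max_bad_ratio: float = 0.2) -> bool:
    """Heuristic: return True when a page's embedded text layer looks unusable."""
    stripped = page_text.strip()
    # Scanned pages usually have an empty or near-empty text layer.
    if len(stripped) < min_chars:
        return True
    # Replacement characters, private-use glyphs and control characters
    # often signal a broken font encoding. Line breaks and tabs are ignored.
    bad = sum(
        1
        for ch in stripped
        if ch not in "\n\r\t"
        and (ch == "\ufffd" or unicodedata.category(ch) in {"Co", "Cc", "Cn"})
    )
    return bad / len(stripped) > max_bad_ratio
```

The thresholds are arbitrary starting points and would need tuning on real documents.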

In this article, we will study two solutions for extracting text and images from a PDF document:

a cloud solution: Reducto

a self-hosted solution: Marker

Extract and split PDF with Reducto

Reducto offers a SaaS API to ingest PDF documents and extract text and images.

For each page, we can retrieve the raw text content, along with generated descriptions of the images and tables.

We can also retrieve the list of elements of each page (text, image, table, etc.) with their content and position on the page.
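To give an idea of what can be done with such a list of elements, here is a small sketch. The `PageElement` structure, its field names and the sample values are hypothetical and made up for illustration. They are not Reducto's actual response schema, which should be checked in its documentation.

```python
from dataclasses import dataclass


@dataclass
class PageElement:
    kind: str      # "text", "image", "table", ...
    content: str   # extracted text or a generated description
    page: int      # 1-based page number
    bbox: tuple[float, float, float, float]  # (left, top, width, height) as page fractions


def in_reading_order(elements):
    """Sort elements by page, then top-to-bottom, then left-to-right."""
    return sorted(elements, key=lambda e: (e.page, e.bbox[1], e.bbox[0]))


def of_kind(elements, kind):
    """Keep only the elements of one kind, for example every table."""
    return [e for e in elements if e.kind == kind]


# Made-up sample data, only to exercise the helpers.
elements = [
    PageElement("table", "sample table", 1, (0.1, 0.6, 0.8, 0.2)),
    PageElement("text", "sample paragraph", 1, (0.1, 0.1, 0.8, 0.3)),
    PageElement("image", "sample figure", 2, (0.2, 0.2, 0.6, 0.4)),
]
ordered = in_reading_order(elements)
print([e.kind for e in ordered])
```

Sorting by position is a simplification: multi-column layouts need a smarter ordering.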

See an example of an analyzed document

Reducto can also handle the chunking, which makes the extracted data easier to use in a RAG system.
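For readers unfamiliar with chunking, here is a deliberately naive sketch: fixed-size character windows with an overlap. It is not Reducto's method, which is presumably layout-aware. The function name and default sizes are assumptions chosen for illustration.

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Each chunk repeats the last `overlap` characters of the previous one,
    so a sentence cut at a boundary still appears whole in one of them.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk reached the end of the text
        start += step
    return chunks
```

In practice, splitting on structural boundaries (titles, paragraphs, tables) usually gives better retrieval results than cutting at arbitrary character offsets.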

Pros:

Easy-to-use SaaS API

Extraction, description and positioning of PDF elements

Chunking for RAG

Cons:

No pay-as-you-go (minimum $300/month, $0.02 per page)

SaaS API (data sovereignty issue)

Extracted images hosted by them

Self-hosting on the enterprise plan only (>$2000/month)
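To see what the pricing above means in practice, here is a quick calculation. It assumes that the $300 monthly minimum acts as a floor on per-page usage, which is an interpretation of the figures above and not a confirmed billing rule. Amounts are handled in cents to avoid floating-point rounding.

```python
PER_PAGE_CENTS = 2        # $0.02 per page (figure quoted above)
MINIMUM_CENTS = 30_000    # $300 monthly minimum (figure quoted above)


def monthly_cost_dollars(pages: int) -> float:
    """Estimated monthly bill, ASSUMING the minimum is a floor on usage."""
    return max(MINIMUM_CENTS, pages * PER_PAGE_CENTS) / 100


# Number of pages per month at which usage alone reaches the minimum.
break_even_pages = MINIMUM_CENTS // PER_PAGE_CENTS

for pages in (1_000, 15_000, 50_000):
    print(pages, monthly_cost_dollars(pages))
```

Under that assumption, the minimum only pays for itself from about 15,000 pages per month.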

Extracting Text and Images from PDF with Marker

Marker is a tool for recognizing and extracting information from a PDF document.

It uses its own PDF recognition models to identify the different elements of a PDF (tables, images, text, etc.), along with an OCR model (Surya).

License

Marker's code is licensed under the GPL, but the model weights are licensed under CC BY-NC-SA 4.0, which excludes commercial use. For commercial use on your own servers, you must purchase an On Premise license from Datalab, the company that publishes Marker.

However, Marker is free for personal use, for research, or if your company has less than $5 million in revenue or has raised less than $5 million in funding.

Usage

To use Marker, you must first install Python 3 and PyTorch (a machine learning framework).

For a Mac (M3):

# Create working dir
mkdir marker-pdf && cd marker-pdf

# Create virtual env
python3 -m venv venv

# Activate virtual env
source venv/bin/activate

# Install pytorch
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

# Install marker pdf
pip3 install marker-pdf
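Once the installation has finished, a quick check like the one below can confirm which device PyTorch sees on your machine. This is only a sketch: `pick_device` is a helper name introduced here, and it does not configure Marker itself.

```python
def pick_device() -> str:
    """Report which device the local PyTorch build can use."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return "torch-not-installed"
    # Apple Silicon GPUs are exposed through the MPS backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device())
```

If this prints `cpu` on a recent Mac, the conversion should still work, only more slowly.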

Here is an example script to convert the PDF to markdown format and save the images.

We will use this example PDF from a textbook.

The first time the code runs, it will download the weights of all the models used (~2 GB in total).

import os

# Note: these imports match marker-pdf releases before 1.0;
# later releases reorganized the Python API.
from marker.convert import convert_single_pdf
from marker.models import load_all_models

fpath = "./somatosensory.pdf"
output_dir = "doc"

# Load the recognition and OCR models (weights are downloaded on first run)
model_lst = load_all_models()

# Convert the PDF: returns the markdown text, a dict of images and metadata
full_text, images, metadata = convert_single_pdf(fpath, model_lst)

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Save each image next to the markdown file,
# so that relative image links in the markdown can resolve
for image_name, image_obj in images.items():
    image_obj.save(os.path.join(output_dir, image_name))

# Write the markdown file
with open(os.path.join(output_dir, "result.md"), "w", encoding="utf-8") as f:
    f.write(full_text + "\n\n")

Here is the result of the generated markdown and the extracted images:

It is then possible to apply additional processing to the markdown:

description of images using a Vision model

splitting the text into chapters using titles
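As a rough sketch of the second step, the markdown can be split at its titles with a few lines of Python. The function name and the `max_level` parameter are choices made for this example. Headings are assumed to use the `#` syntax, and fenced code blocks containing `#` lines are not handled.

```python
import re

# A markdown title: 1 to 6 "#" characters, a space, then the title text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str, max_level: int = 2):
    """Split markdown into (title, body) pairs at headings of level <= max_level.

    Deeper headings stay inside the body of their parent section.
    Text before the first heading gets an empty title.
    """
    sections = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            # Close the section in progress before starting a new one.
            if title or any(ln.strip() for ln in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2), []
        else:
            lines.append(line)
    if title or any(ln.strip() for ln in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections
```

Each `(title, body)` pair can then be chunked further if a chapter is too long for the embedding model.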

This data is then ready to be used by the RAG ingestion pipeline.
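Before ingestion, the image description step listed above could be wired in roughly as follows. `describe` is a hypothetical callable standing in for a call to a Vision model, and `fake_describe` is a placeholder so the example runs without one. The regular expression assumes the generated file uses the standard markdown image syntax `![alt](path)`.

```python
import re

# Standard markdown image: ![alt text](path)
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def add_image_descriptions(markdown: str, describe) -> str:
    """Replace each image's alt text with describe(path), keeping the link."""
    def _replace(match):
        path = match.group(2)
        # Keep the description on one line and free of square brackets,
        # so that the image syntax stays valid.
        description = describe(path).replace("\n", " ")
        description = description.replace("[", "(").replace("]", ")")
        return f"![{description}]({path})"
    return IMAGE_RE.sub(_replace, markdown)


def fake_describe(path: str) -> str:
    # Placeholder: a real implementation would send the image to a Vision model.
    return f"description of {path}"


md = "Intro\n\n![0_image_0.png](0_image_0.png)\n\nMore text"
print(add_image_descriptions(md, fake_describe))
```

Putting the description in the alt text keeps it next to the surrounding paragraph, so it ends up in the same chunk as the text that refers to the image.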

Pros:

Advanced element extraction with specialized models

Available in cloud or self-hosted version

Data sovereignty in self-hosted version

Cons:

No built-in chunking

Only one paid plan at $5000 ($0.0025 per page)

Potentially expensive self-hosted license (>$5000)

Conclusion

Whatever the solution chosen, the results will not be perfect and some information may be extracted incorrectly.

PDF extraction and chunking is a fast-moving field, so it is worth experimenting with several tools before making a significant investment, such as paying for and deploying a self-hosted version of one.