Building a FastAPI-Powered PDF Search Engine: Harnessing the Power of OpenAI, langchain and Detectron2 for Advanced Document Processing

Posted on Apr 10, 2023

This blog assumes that you have Docker and docker-compose installed on your machine.

Getting started

Create a Dockerfile

# Use the official Python 3.10 image
FROM python:3.10

# Set the working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && \
    apt-get install -y libgl1-mesa-glx poppler-utils && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && \
    python -m pip install --no-cache-dir 'git+https://github.com/facebookresearch/detectron2.git' && \
    pip install --no-cache-dir "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI" && \
    pip install --no-cache-dir layoutparser[layoutmodels,tesseract]

# Download NLTK packages
RUN python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"

# Copy the rest of the application
COPY . .

# Expose port 8000 and run the application
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Create docker-compose.yml

version: "3.9"

services:
  app:
    build: .
    ports:
      - "8000:8000"

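Create a small placeholder text file. main.py loads its PDF version (load.pdf) at startup and builds a throwaway index from it, so the file has to exist before the app runs: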

echo Hello world > load.txt

Now convert this load.txt to load.pdf

One of the easiest and most common methods is to use the command-line tool called pandoc to convert it to pdf.

For Ubuntu or Debian-based distributions:

sudo apt-get install pandoc

For Fedora:

sudo dnf install pandoc

For Arch Linux:

sudo pacman -S pandoc

Once you have pandoc installed, you can convert a text file to a PDF file using the following command:

pandoc load.txt -o load.pdf

Running into this error? "pdflatex not found. Please select a different --pdf-engine or install pdflatex"

I have got you covered

The error you’re encountering is because pandoc requires a PDF engine to create PDF files. The default engine is pdflatex, which is part of the TeX Live distribution. To resolve this issue, you need to install pdflatex on your system. Here’s how to do it for some popular distributions:

For Ubuntu or Debian-based distributions:

sudo apt-get install texlive-latex-base

For Fedora:

sudo dnf install texlive-latex

For Arch Linux:

sudo pacman -S texlive-core

Then rerun the conversion command:

pandoc load.txt -o load.pdf
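
Before moving on, it can be worth confirming that pandoc really produced a PDF. The following is a minimal sketch, not part of the original project: the helper name looks_like_pdf is made up for illustration, and it only checks that the file starts with the PDF signature bytes "%PDF-", which is a quick sanity check rather than a full validation.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF signature '%PDF-'.

    Hypothetical helper for illustration; it does not validate the rest of the file.
    """
    candidate = Path(path)
    if not candidate.is_file():
        return False
    with candidate.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


# Usage (run from the project folder after the pandoc step):
#   looks_like_pdf("load.pdf")
```

If the check fails, the pandoc step most likely did not finish, and the startup index in main.py would be skipped.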

The Main application

Create requirements.txt and add

fastapi
uvicorn
python-multipart
torch
torchvision
layoutparser
langchain
unstructured
openai
unstructured[local-inference]
pybind11
chromadb
Cython
pytesseract
Pillow
tiktoken
aiofiles

Now create main.py file and add

import os
import aiofiles
import hashlib
import asyncio
from chromadb.errors import NotEnoughElementsException
from fastapi import Body, FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from detectron2.config import get_cfg

app = FastAPI()
pdf_indexes = {}

# Set the OpenAI API key environment variable
os.environ['OPENAI_API_KEY'] = 'sk---your-openai-key-here'  # Replace with your own key and never commit it.

try:
    loader = UnstructuredPDFLoader('./load.pdf')
    index = VectorstoreIndexCreator().from_loaders([loader])
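except Exception as exc:
    # Startup indexing is best-effort: report the failure instead of hiding it
    print(f"Suppressed startup indexing error: {exc}")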

# Set up the Detectron2 configuration
cfg = get_cfg()
cfg.MODEL.DEVICE = 'cpu'  # Use 'cuda' for better performance if a GPU is available

async def remove_file(filename: str):
    await asyncio.sleep(0.1)  # Replace with an async file removal library
    os.remove(filename)

@app.get("/")
async def root():
    return {"message": "Hello World"}

# Endpoint to upload a PDF file and generate a PDF index
@app.post("/upload_pdf/")
async def upload_pdf(pdf: UploadFile = File(...)):
    content = await pdf.read()
    content_hash = hashlib.md5(content).hexdigest()

    async with aiofiles.open(f"pdf_{content_hash}.pdf", "wb") as f:
        await f.write(content)

    # Create a new PDF index if not already in pdf_indexes
    if content_hash not in pdf_indexes:
        loader = UnstructuredPDFLoader(f"pdf_{content_hash}.pdf")
        index = VectorstoreIndexCreator().from_loaders([loader])
        pdf_indexes[content_hash] = index
    else:
        index = pdf_indexes[content_hash]

    # Schedule file removal after processing
    asyncio.create_task(remove_file(f"pdf_{content_hash}.pdf"))

    return {"pdf_id": content_hash}

# Endpoint to query a specific PDF using its ID
@app.post("/query/{pdf_id}")
async def query_pdf(pdf_id: str, query: str = Body(...)):
    try:
        # Return an error if the PDF ID is not found
        if pdf_id not in pdf_indexes:
            return JSONResponse(content={"error": "PDF ID not found"}, status_code=404)

        index = pdf_indexes[pdf_id]
        response = index.query(query)

        return {"response": response}
    except NotEnoughElementsException:
        # Return an error if there are not enough elements in the index
        return JSONResponse(content={"error": "Not enough elements in index"}, status_code=404)
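
The comment inside remove_file hints at replacing the sleep-then-delete approach with something properly asynchronous. One possible sketch, assuming Python 3.9 or newer (the Dockerfile above uses 3.10), is to push the blocking os.remove call onto a worker thread with asyncio.to_thread. Ignoring a missing file is my own assumption about the desired behaviour, not something the original code does.

```python
import asyncio
import os


async def remove_file(filename: str) -> None:
    """Delete `filename` in a worker thread so the event loop is not blocked."""
    try:
        await asyncio.to_thread(os.remove, filename)
    except FileNotFoundError:
        # Assumption: a file that is already gone is not treated as an error
        pass


# Usage inside an endpoint, same as in main.py:
#   asyncio.create_task(remove_file(f"pdf_{content_hash}.pdf"))
```

This keeps the same function name and call site, so it could be dropped in without touching the upload endpoint.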

Your folder structure should look like this

.
├── docker-compose.yml
├── Dockerfile
├── load.pdf
├── load.txt
├── main.py
└── requirements.txt

Running the application

docker-compose up --build

This should start the application on port 8000.

Testing the application

Route 1: http://127.0.0.1:8000/upload_pdf -> POST
Route 2: http://127.0.0.1:8000/query/3e02b425ca46a1210a21c8f01e8f0e0a -> POST

To test the root endpoint:

curl http://localhost:8000/

To upload a PDF file and generate a PDF index:

curl -X POST -F "pdf=@path/to/your/file.pdf" http://localhost:8000/upload_pdf/

Make sure to replace path/to/your/file.pdf with the actual path to the PDF file you want to upload.

After running this command, you should receive a response containing the pdf_id. You will need this ID to query the specific PDF.
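
Since upload_pdf derives the pdf_id from the MD5 hash of the uploaded bytes, the same ID can be computed locally without calling the API. Here is a small sketch of that idea; local_pdf_id is a hypothetical helper name, not something from main.py.

```python
import hashlib


def local_pdf_id(path: str) -> str:
    """Return the MD5 hex digest of a file, matching how main.py builds pdf_id."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


# Usage:
#   local_pdf_id("path/to/your/file.pdf")
```

This is handy for checking that the ID returned by the server belongs to the file you think you uploaded.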

To query a specific PDF using its ID:

curl -X POST -H "Content-Type: application/json" -d '"your search query"' http://localhost:8000/query/<pdf_id>

Replace <pdf_id> with the pdf_id you received in the previous step and replace your search query with the text you want to search for in the PDF.

You should receive a JSON response containing the search results for your query.
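
The same query can be sent from Python. Because query_pdf declares a single Body(...) parameter of type str, FastAPI reads the request body itself as the JSON string, so the sketch below sends json.dumps(query) rather than an object with a "query" key. It uses only the standard library; build_query_request is a hypothetical helper name and the base URL is an assumption matching the docker-compose port mapping.

```python
import json
import urllib.request


def build_query_request(base_url: str, pdf_id: str, query: str) -> urllib.request.Request:
    """Build a POST request for /query/{pdf_id} with the query as a JSON string body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/query/{pdf_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (requires the app to be running):
#   req = build_query_request("http://localhost:8000", "<pdf_id>", "your search query")
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```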

Overall

This code defines a FastAPI application that serves as a PDF search engine. It allows users to upload PDF files, generates indexes for the uploaded files, and provides an endpoint to query the PDFs using text-based search queries. The code uses the UnstructuredPDFLoader and VectorstoreIndexCreator from the langchain library, OpenAI models through the configured API key (main.py does not select a specific model, so langchain's defaults apply), and Detectron2 for its functionality.

Here is a breakdown of the code:

Imports required libraries and modules, including FastAPI, aiofiles, hashlib, asyncio, and others.
Initializes the FastAPI application instance as app.
Sets the OpenAI API key environment variable using os.environ.
Tries to load a PDF file and create an index from it, suppressing any errors.
Sets up the Detectron2 configuration with the chosen device (CPU or GPU).
Defines an async function remove_file to remove a file after a certain time, using asyncio.
Defines the root endpoint ("/") which returns a “Hello World” message.
Defines the “/upload_pdf/” endpoint, which accepts a PDF file as an input, calculates its MD5 hash, saves the file with the hash as its name, creates a new PDF index using UnstructuredPDFLoader and VectorstoreIndexCreator, stores the index in the pdf_indexes dictionary, schedules file removal after processing, and returns the pdf_id (the content hash).
Defines the “/query/{pdf_id}” endpoint, which takes a PDF ID and a query as input, checks if the PDF ID exists in the pdf_indexes dictionary, and if so, queries the index and returns the response. If the PDF ID is not found or there are not enough elements in the index, appropriate error messages are returned.
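
The indexing logic in upload_pdf boils down to a cache keyed by content hash: hash the bytes, build the index only on a miss, and return the hash as the ID. The sketch below isolates that pattern with a pluggable build function so it can be tried without OpenAI or langchain. The names get_or_build_index and build are hypothetical; in main.py the build step is VectorstoreIndexCreator().from_loaders([loader]).

```python
import hashlib
from typing import Any, Callable, Dict


def get_or_build_index(
    content: bytes,
    cache: Dict[str, Any],
    build: Callable[[bytes], Any],
) -> str:
    """Return the content hash, building and caching an index only on a cache miss."""
    content_hash = hashlib.md5(content).hexdigest()
    if content_hash not in cache:
        # Expensive step: only runs the first time this exact content is seen
        cache[content_hash] = build(content)
    return content_hash


# Usage with a stand-in builder:
#   indexes = {}
#   pdf_id = get_or_build_index(b"some pdf bytes", indexes, lambda data: len(data))
```

Uploading the same PDF twice therefore returns the same pdf_id and skips the second indexing run. Keep in mind that the cache lives in process memory, so it is lost whenever the container restarts.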

Overall, this code provides a simple API for uploading, indexing, and querying PDF files using FastAPI, OpenAI, Detectron2 and, most importantly, langchain.

*Do not forget to keep an eye on the credit card bill @ https://platform.openai.com/
**Do not expose your API key. It should be kept secret.
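
One way to keep the key out of main.py is to read it from the environment and fail early when it is missing. This is a minimal sketch under that assumption; require_openai_key is a hypothetical helper, and how the variable gets into the container (for example through an environment entry in docker-compose.yml) is up to you.

```python
import os
from typing import Mapping


def require_openai_key(env: Mapping[str, str] = os.environ) -> str:
    """Return OPENAI_API_KEY from the environment or raise if it is not set."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; pass it via the environment")
    return key


# Usage at the top of main.py, instead of hardcoding the key:
#   require_openai_key()
```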

References

https://www.youtube.com/watch?v=bOS929yCkGE