Building a FastAPI-Powered PDF Search Engine: Harnessing the Power of OpenAI, LangChain, and Detectron2 for Advanced Document Processing
This blog assumes that you have docker and docker-compose installed on your machine.
Getting started
Create a Dockerfile
# Use the official Python 3.10 image
FROM python:3.10

# Set the working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && \
    apt-get install -y libgl1-mesa-glx poppler-utils && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && \
    python -m pip install --no-cache-dir 'git+https://github.com/facebookresearch/detectron2.git' && \
    pip install --no-cache-dir "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI" && \
    pip install --no-cache-dir "layoutparser[layoutmodels,tesseract]"

# Download NLTK packages
RUN python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"

# Copy the rest of the application
COPY . .

# Expose the application on port 8000 and run it
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Create docker-compose.yml
version: "3.9"
services:
  app:
    build: .
    ports:
      - "8000:8000"
Create a placeholder text file (we will explain why later, for now just do it):
echo Hello world > load.txt
Now convert this load.txt to load.pdf. One of the easiest and most common methods is to use the command-line tool pandoc to convert it to PDF.
For Ubuntu or Debian-based distributions:
sudo apt-get install pandoc
For Fedora:
sudo dnf install pandoc
For Arch Linux:
sudo pacman -S pandoc
Once you have pandoc installed, you can convert a text file to a PDF file using the following command:
pandoc load.txt -o load.pdf
Running into this error? “pdflatex not found. Please select a different --pdf-engine or install pdflatex”
I have got you covered. The error you’re encountering is because pandoc requires a PDF engine to create PDF files. The default engine is pdflatex, which is part of the TeX Live distribution. To resolve this issue, you need to install pdflatex on your system. Here’s how to do it for some popular distributions:
For Ubuntu or Debian-based distributions:
sudo apt-get install texlive-latex-base
For Fedora:
sudo dnf install texlive-latex
For Arch Linux:
sudo pacman -S texlive-core
Then rerun the conversion command:
pandoc load.txt -o load.pdf
The Main application
Create requirements.txt and add:
fastapi
uvicorn
python-multipart
torch
torchvision
layoutparser
langchain
unstructured
openai
unstructured[local-inference]
pybind11
chromadb
Cython
pytesseract
Pillow
tiktoken
aiofiles
Now create a main.py file and add:
import os
import aiofiles
import hashlib
import asyncio
from chromadb.errors import NotEnoughElementsException
from fastapi import Body, FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from detectron2.config import get_cfg

app = FastAPI()
pdf_indexes = {}

# Set the OpenAI API key environment variable
os.environ['OPENAI_API_KEY'] = 'sk---your-openapi-key-here'  # Do not forget to add this.

# Load the placeholder PDF once at startup (this is why we created load.pdf)
try:
    loader = UnstructuredPDFLoader('./load.pdf')
    index = VectorstoreIndexCreator().from_loaders([loader])
except Exception:
    print("Suppress error")

# Set up the Detectron2 configuration
cfg = get_cfg()
cfg.MODEL.DEVICE = 'cpu'  # Use 'cuda' for better performance if a GPU is available

async def remove_file(filename: str):
    await asyncio.sleep(0.1)  # Replace with an async file removal library
    os.remove(filename)

@app.get("/")
async def root():
    return {"message": "Hello World"}

# Endpoint to upload a PDF file and generate a PDF index
@app.post("/upload_pdf/")
async def upload_pdf(pdf: UploadFile = File(...)):
    content = await pdf.read()
    content_hash = hashlib.md5(content).hexdigest()
    async with aiofiles.open(f"pdf_{content_hash}.pdf", "wb") as f:
        await f.write(content)
    # Create a new PDF index if not already in pdf_indexes
    if content_hash not in pdf_indexes:
        loader = UnstructuredPDFLoader(f"pdf_{content_hash}.pdf")
        index = VectorstoreIndexCreator().from_loaders([loader])
        pdf_indexes[content_hash] = index
    else:
        index = pdf_indexes[content_hash]
    # Schedule file removal after processing
    asyncio.create_task(remove_file(f"pdf_{content_hash}.pdf"))
    return {"pdf_id": content_hash}

# Endpoint to query a specific PDF using its ID
@app.post("/query/{pdf_id}")
async def query_pdf(pdf_id: str, query: str = Body(..., embed=True)):
    # embed=True so the request body is JSON of the form {"query": "..."}
    try:
        # Return an error if the PDF ID is not found
        if pdf_id not in pdf_indexes:
            return JSONResponse(content={"error": "PDF ID not found"}, status_code=404)
        index = pdf_indexes[pdf_id]
        response = index.query(query)
        return {"response": response}
    except NotEnoughElementsException:
        # Return an error if there are not enough elements in the index
        return JSONResponse(content={"error": "Not enough elements in index"}, status_code=404)
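The remove_file helper above is a fire-and-forget cleanup: the upload handler schedules deletion with asyncio.create_task and returns immediately, while the removal runs on the event loop a moment later. A self-contained sketch of the same pattern (the temporary file here just stands in for the saved PDF):

```python
import asyncio
import os
import tempfile

async def remove_file(filename: str, delay: float = 0.1):
    # Wait briefly so any in-flight processing can finish, then delete.
    await asyncio.sleep(delay)
    os.remove(filename)

async def handler():
    # Create a throwaway file to stand in for the uploaded PDF.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    # Schedule the removal without blocking the handler...
    task = asyncio.create_task(remove_file(path))
    # ...and await it here only so the sketch can verify the result.
    await task
    return path

path = asyncio.run(handler())
print(os.path.exists(path))  # False: the file was removed
```

In the real endpoint the task is not awaited, so the response goes out before the file is deleted.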
Your folder structure should look like this
.
├── docker-compose.yml
├── Dockerfile
├── load.pdf
├── load.txt
├── main.py
└── requirements.txt
Running the application
docker-compose up --build
This should start the application on port 8000.
Testing the application
Route 1: http://127.0.0.1:8000/upload_pdf/ -> POST
Route 2: http://127.0.0.1:8000/query/3e02b425ca46a1210a21c8f01e8f0e0a -> POST
To test the root endpoint:
curl http://localhost:8000/
To upload a PDF file and generate a PDF index:
curl -X POST -F "pdf=@path/to/your/file.pdf" http://localhost:8000/upload_pdf/

(The endpoint expects a multipart form upload with the field name pdf, which is what -F sends.)
Make sure to replace path/to/your/file.pdf with the actual path to the PDF file you want to upload. After running this command, you should receive a response containing the pdf_id. You will need this ID to query the specific PDF.
To query a specific PDF using its ID:
curl -X POST -H "Content-Type: application/json" -d '{"query": "your search query"}' http://localhost:8000/query/<pdf_id>
Replace <pdf_id> with the pdf_id you received in the previous step, and replace your search query with the text you want to search for in the PDF.
You should receive a JSON response containing the search results for your query.
Overall
This code defines a FastAPI application that serves as a PDF search engine. It allows users to upload PDF files, generates indexes for the uploaded files, and provides an endpoint to query the PDFs using text-based search queries. The code uses the UnstructuredPDFLoader and VectorstoreIndexCreator from the LangChain library, an OpenAI language model, and Detectron2 for its functionality.
Here is a breakdown of the code:
- Imports required libraries and modules, including FastAPI, aiofiles, hashlib, asyncio, and others.
- Initializes the FastAPI application instance as app.
- Sets the OpenAI API key environment variable using os.environ.
- Tries to load a PDF file and create an index from it, suppressing any errors.
- Sets up the Detectron2 configuration with the chosen device (CPU or GPU).
- Defines an async function remove_file to remove a file after a short delay, using asyncio.
- Defines the root endpoint ("/"), which returns a “Hello World” message.
- Defines the “/upload_pdf/” endpoint, which accepts a PDF file as input, calculates its MD5 hash, saves the file with the hash as its name, creates a new PDF index using UnstructuredPDFLoader and VectorstoreIndexCreator, stores the index in the pdf_indexes dictionary, schedules file removal after processing, and returns the pdf_id (the content hash).
- Defines the “/query/{pdf_id}” endpoint, which takes a PDF ID and a query as input, checks if the PDF ID exists in the pdf_indexes dictionary, and if so, queries the index and returns the response. If the PDF ID is not found or there are not enough elements in the index, an appropriate error message is returned.
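As the breakdown notes, the pdf_id is simply the MD5 hex digest of the uploaded file's bytes, so uploading the same file twice yields the same ID and reuses the cached index instead of rebuilding it. A minimal illustration:

```python
import hashlib

def content_id(data: bytes) -> str:
    # Identical bytes always produce the same 32-character hex digest.
    return hashlib.md5(data).hexdigest()

a = content_id(b"Hello world\n")
b = content_id(b"Hello world\n")
c = content_id(b"A different document")

print(a == b)   # True: same content, same pdf_id, index is reused
print(a == c)   # False: different content, different pdf_id
print(len(a))   # 32
```

Note that MD5 is fine here as a cache key, since it is being used for deduplication rather than security.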
Overall, this code provides a simple API for uploading, indexing, and querying PDF files using FastAPI, OpenAI, Detectron2, and most importantly LangChain.
*Do not forget to keep an eye on your credit card bill at https://platform.openai.com/. **Do not expose your API key; it should be kept secret.
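One way to honor that last warning is to keep the key out of main.py entirely: read it from the environment at runtime and let docker-compose forward it from the host. A sketch of that approach (it assumes the key is already exported as OPENAI_API_KEY in your shell):

```yaml
version: "3.9"
services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      # Forwarded from the host shell; never committed to source control.
      - OPENAI_API_KEY=${OPENAI_API_KEY}
```

With this in place, the hardcoded os.environ['OPENAI_API_KEY'] = '...' line in main.py can be removed, since LangChain's OpenAI integration reads the variable from the environment.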