Streamline Your Hiring Process: Build a Simple Resume Parser with FastAPI
Getting started
Create a requirements.txt file with the following contents.
torch
transformers
fastapi
uvicorn
pydantic
pypdf2
python-multipart
Install the packages by running:
pip install -r requirements.txt
Here we are going to use the has-abi/extended_distilBERT-finetuned-resumes-sections model from Hugging Face to classify resume sections.
Create a main.py and add the following code.
# Import necessary libraries and models
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel
from typing import List
import torch
import PyPDF2
import re
import io
# Load the pre-trained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("has-abi/extended_distilBERT-finetuned-resumes-sections")
model = AutoModelForSequenceClassification.from_pretrained("has-abi/extended_distilBERT-finetuned-resumes-sections")
# Define a function to preprocess resume text
def preprocess_resume_text(resume_text):
    # Remove carriage returns and new lines
    resume_text = resume_text.replace('\r', '').replace('\n', ' ')
    # Split the text into sections using regular expression pattern matching
    sections = re.split(r'\s{2,}', resume_text)
    # Return the sections
    return sections
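# For example (illustrative), a resume whose blocks are separated by blank lines, such as
# "John Doe\nSoftware Engineer\n\nSkills\nPython", becomes
# "John Doe Software Engineer  Skills Python" after the replacements, and the double space
# left by the blank line then splits it into ["John Doe Software Engineer", "Skills Python"].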
# Define a function that classifies each section of a resume and returns the result as a dictionary
def classify_and_convert_to_json(resume_text):
    # Preprocess the resume text and store the resulting sections
    sections = preprocess_resume_text(resume_text)
    # Create a dictionary with an empty value for each possible resume section
    classified_resume = {
        "awards": "",
        "certificates": "",
        "contact_name_title": "",
        "education": "",
        "interests": "",
        "languages": "",
        "para": "",
        "professional_experiences": "",
        "projects": "",
        "skills": "",
        "soft_skills": "",
        "summary": ""
    }
    # Classify each section of the resume and add it to the corresponding key in classified_resume
    for section in sections:
        section_label = classify_resume_section(section)
        if section_label == "contact/name/title":
            # Map the model label to the matching dictionary key
            section_label = "contact_name_title"
        if classified_resume[section_label]:
            classified_resume[section_label] += "\n" + section
        else:
            classified_resume[section_label] = section
    # Return the classified resume
    return classified_resume
# Define a function to classify a resume section based on the given text
def classify_resume_section(text):
    # Tokenize the input text and convert it into PyTorch tensors
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Pass the tensors through the pre-trained model and get the outputs
    outputs = model(**inputs)
    # Calculate the probabilities of the output labels using the softmax function
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # Define the list of possible output labels
    labels = [
        "awards",
        "certificates",
        "contact/name/title",
        "education",
        "interests",
        "languages",
        "para",
        "professional_experiences",
        "projects",
        "skills",
        "soft_skills",
        "summary"
    ]
    # Find the index of the label with the maximum probability
    max_index = probabilities.argmax().item()
    # Return the label with the maximum probability
    return labels[max_index]
resume_text = """
John Doe
Software Engineer
Experience
Developed a full-stack web application using React, Node.js, and MongoDB.
Implemented RESTful APIs in Node.js for a mobile application.
Education
B.S. in Computer Science, XYZ University, 2015-2019
Skills
JavaScript, Python, React, Node.js, MongoDB, Git, Agile
"""
# Classify the resume text and convert it to JSON format
classified_resume = classify_and_convert_to_json(resume_text)
# Print the classified resume
print(classified_resume)
# Initialize FastAPI
app = FastAPI()
# Define a function to read the text of an uploaded PDF file
def read_pdf(file: UploadFile):
    # Read the raw bytes of the uploaded file
    content = file.file.read()
    # Use the io module to create a bytes buffer from the contents of the file
    with io.BytesIO(content) as bytes_io:
        # Create a PDF reader object to read the contents of the buffer
        pdf_reader = PyPDF2.PdfReader(bytes_io)
        # Get the number of pages in the PDF
        num_pages = len(pdf_reader.pages)
        # Initialize an empty string to hold the text from the PDF
        text = ""
        # Loop through each page of the PDF
        for page_num in range(num_pages):
            # Extract text from the current page of the PDF
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    # Return the extracted text
    return text
# Define a Pydantic model for a single classified resume section
class ResumeSection(BaseModel):
    text: str
    label: str

# Define a Pydantic model for the aggregated resume sections
class AggregatedResumeSections(BaseModel):
    awards: str
    certificates: str
    contact_name_title: str
    education: str
    interests: str
    languages: str
    para: str
    professional_experiences: str
    projects: str
    skills: str
    soft_skills: str
    summary: str
# Define the endpoint for classifying a resume
@app.post("/classify_resume", response_model=List[ResumeSection])
async def classify_resume(pdf_file: UploadFile = File(...)):
    # Read the contents of the uploaded PDF file
    resume_text = read_pdf(pdf_file)
    # Classify the resume and convert it to a dictionary
    classified_resume = classify_and_convert_to_json(resume_text)
    # Aggregate the classified resume sections into the Pydantic model
    aggregated_resume_sections = AggregatedResumeSections(**classified_resume)
    # Convert the AggregatedResumeSections object back to a list of ResumeSection objects,
    # skipping sections that came back empty
    resume_sections = []
    for label, text in aggregated_resume_sections.dict().items():
        if text:
            resume_sections.append(ResumeSection(text=text, label=label))
    # Return the list of ResumeSection objects
    return resume_sections
Save the file. Your folder should now look like this:
.
├── main.py
└── requirements.txt
Run the application
uvicorn main:app --reload
Test the application
curl -X POST -F "pdf_file=@/path/to/your/resume.pdf" http://localhost:8000/classify_resume
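If you prefer to test from Python, a minimal client could look like the sketch below. It assumes the requests library is installed (it is not part of requirements.txt) and that a resume.pdf file exists in the current directory; the response mirrors the ResumeSection model, a list of objects with text and label fields.
# Minimal test client (assumes `requests` is installed and resume.pdf exists locally)
import requests

with open("resume.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/classify_resume",
        files={"pdf_file": ("resume.pdf", f, "application/pdf")},
    )

# Each returned item has a "text" and a "label" field, matching the ResumeSection model
print(response.status_code)
print(response.json())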
Overall
This code defines a FastAPI endpoint to classify and extract different sections of a resume in PDF format. The endpoint accepts a PDF file as input, reads its contents, preprocesses the text, classifies each section of the resume, aggregates the classified sections into a dictionary, and returns a list of ResumeSection objects that contain the classified text and corresponding labels for each section.
The code first imports the necessary libraries, including transformers, PyPDF2, io, and FastAPI. It loads a pre-trained tokenizer and a pre-trained sequence-classification model based on the extended DistilBERT architecture. It also defines a function to preprocess the resume text and split it into sections, and another function to classify a resume section based on its text.
The code then defines two Pydantic models: ResumeSection and AggregatedResumeSections. The ResumeSection model contains a text field to hold the text of a classified resume section and a label field to hold the section’s label. The AggregatedResumeSections model contains a field for each section of the resume that the classifier can recognize.
Finally, the code defines a FastAPI endpoint /classify_resume that accepts a PDF file as input and returns a list of ResumeSection objects containing the classified text and corresponding labels for each section. The endpoint reads the contents of the uploaded PDF file, preprocesses the text, classifies each section of the resume, aggregates the classified sections into a dictionary, and converts it to the AggregatedResumeSections object. The endpoint then converts the AggregatedResumeSections object back to a list of ResumeSection objects and returns it.
Improving the code
Error handling: The code currently does not have any error handling mechanisms. Adding error handling for cases such as invalid input files, corrupted PDFs, or unexpected errors during classification would make the endpoint more robust; a sketch is shown after this list.
Unit tests: Writing unit tests for each application component can help ensure everything works as expected; a minimal test sketch is also included below.
Data validation: The code assumes that the input file is a PDF file and that the resume sections are well-formed. Adding data validation checks can improve the robustness of the application.
Better use of Pydantic: Pydantic is a powerful library that can help with data validation and parsing. The current implementation of the code could be improved by making better use of Pydantic’s features.
Fine-tuning the model: Fine-tuning the pre-trained model on more relevant data can help improve the accuracy of the classification.
Use of asynchronous programming: Using asynchronous programming with FastAPI can help improve the application’s scalability.
Refactoring: Refactoring the code to improve its structure, readability, and maintainability can make it easier to modify and extend in the future.
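As a starting point for the first three items, here is a hedged sketch of what validation and error handling could look like. It reuses the existing read_pdf and classify_and_convert_to_json functions together with FastAPI's HTTPException; the route name /classify_resume_safe, the status codes, and the messages are illustrative choices, not part of the original code, and in practice you would fold the same checks into /classify_resume.
from fastapi import HTTPException

@app.post("/classify_resume_safe", response_model=List[ResumeSection])
async def classify_resume_safe(pdf_file: UploadFile = File(...)):
    # Data validation: reject anything that does not look like a PDF
    if pdf_file.content_type != "application/pdf":
        raise HTTPException(status_code=400, detail="Please upload a PDF file.")
    # Error handling: corrupted or unreadable PDFs
    try:
        resume_text = read_pdf(pdf_file)
    except Exception:
        raise HTTPException(status_code=422, detail="Could not read the PDF file.")
    if not resume_text.strip():
        raise HTTPException(status_code=422, detail="The PDF contains no extractable text.")
    # Error handling: unexpected failures during classification
    try:
        classified_resume = classify_and_convert_to_json(resume_text)
    except Exception:
        raise HTTPException(status_code=500, detail="Classification failed.")
    aggregated = AggregatedResumeSections(**classified_resume)
    return [
        ResumeSection(text=text, label=label)
        for label, text in aggregated.dict().items()
        if text
    ]
For unit tests, FastAPI ships a TestClient (it requires the httpx package, which is not in requirements.txt). A minimal test might look like this, assuming a small sample.pdf fixture that you add next to the test file:
from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_classify_resume():
    # sample.pdf is a hypothetical fixture file you would provide yourself
    with open("sample.pdf", "rb") as f:
        response = client.post(
            "/classify_resume",
            files={"pdf_file": ("sample.pdf", f, "application/pdf")},
        )
    assert response.status_code == 200
    # Every returned item should carry a text and a label field
    for section in response.json():
        assert "text" in section and "label" in section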
Additional
You can also use has-abi/bert-finetuned-resumes-sections or has-abi/distilBERT-finetuned-resumes-sections with the same code; it should work without changes.
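For example, to swap checkpoints, only the two from_pretrained calls in main.py need to change (the name below is one of the alternatives listed above):
# Swap in a different fine-tuned checkpoint; the rest of main.py stays the same
MODEL_NAME = "has-abi/bert-finetuned-resumes-sections"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)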