
Streamline Your Hiring Process: Build a Simple Resume Parser with FastAPI

Posted on Apr 8, 2023

Getting started

Create a requirements.txt file with the following contents:

torch
transformers
fastapi
uvicorn
pydantic
pypdf2
python-multipart

Install the packages by running:

pip install -r requirements.txt

Here we are going to use the has-abi/extended_distilBERT-finetuned-resumes-sections model from the Hugging Face Hub, a DistilBERT checkpoint fine-tuned to classify resume sections.

Create a main.py file and add the following:

# Import necessary libraries and models
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel
from typing import List
import torch
import PyPDF2
import re
import io

# Load the pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("has-abi/extended_distilBERT-finetuned-resumes-sections")
model = AutoModelForSequenceClassification.from_pretrained("has-abi/extended_distilBERT-finetuned-resumes-sections")

# Define a function to preprocess resume text
def preprocess_resume_text(resume_text):
    # Remove carriage returns and new lines
    resume_text = resume_text.replace('\r', '').replace('\n', ' ')
    # Split the text into sections on runs of two or more whitespace characters
    sections = re.split(r'\s{2,}', resume_text)
    # Return the sections
    return sections


# Define a function that classifies each resume section and aggregates the results
def classify_and_convert_to_json(resume_text):
    # Preprocess the resume text and store in sections variable
    sections = preprocess_resume_text(resume_text)
    # Create a dictionary with a key for each section label the model can predict
    classified_resume = {
        "awards": "",
        "certificates": "",
        "contact_name_title": "", 
        "education": "",
        "interests": "",
        "languages": "",
        "para": "",
        "professional_experiences": "",
        "projects": "",
        "skills": "",
        "soft_skills": "",
        "summary": ""
    }

    # Classify each section of the resume and add it to the corresponding key in classified_resume
    for section in sections:
        section_label = classify_resume_section(section)
        if section_label == "contact/name/title":
            section_label = "contact_name_title"  # Update this key

        if classified_resume[section_label]:
            classified_resume[section_label] += "\n" + section
        else:
            classified_resume[section_label] = section

    # Return the dictionary of classified sections
    return classified_resume

# Define a function to classify a resume section based on the given text
def classify_resume_section(text):
    # Tokenize the input text and convert it into PyTorch tensors
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Pass the tensors through the pre-trained model to get the outputs
    outputs = model(**inputs)
    # Calculate the probabilities of the output labels using the softmax function
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # Define the list of possible output labels
    labels = [
        "awards",
        "certificates",
        "contact/name/title",
        "education",
        "interests",
        "languages",
        "para",
        "professional_experiences",
        "projects",
        "skills",
        "soft_skills",
        "summary"
    ]
    # Find the index of the label with the maximum probability
    max_index = probabilities.argmax().item()
    # Return the label with the maximum probability
    return labels[max_index]

resume_text = """
John Doe
Software Engineer

Experience
Developed a full-stack web application using React, Node.js, and MongoDB.
Implemented RESTful APIs in Node.js for a mobile application.

Education
B.S. in Computer Science, XYZ University, 2015-2019

Skills
JavaScript, Python, React, Node.js, MongoDB, Git, Agile
"""

# Classify the sample resume and print the result, but only when this file is
# run directly (python main.py) rather than imported by uvicorn
if __name__ == "__main__":
    classified_resume = classify_and_convert_to_json(resume_text)
    print(classified_resume)

# Initialize FastAPI
app = FastAPI()

# Define function to read PDF file
def read_pdf(file: UploadFile):
    # Read PDF file contents
    content = file.file.read()
    # Use io module to create a bytes buffer from contents of file
    with io.BytesIO(content) as bytes_io:
        # Create a PDF reader object to read the contents of the buffer
        pdf_reader = PyPDF2.PdfReader(bytes_io)
        # Initialize an empty string to hold the text from the PDF
        text = ""
        # Loop through each page of the PDF and extract its text
        for page in pdf_reader.pages:
            # extract_text() can return an empty string for image-only pages
            text += page.extract_text() or ""
    # Return the extracted text
    return text

# Define Pydantic model for each section of the resume
class ResumeSection(BaseModel):
    text: str
    label: str

# Define Pydantic model for the aggregated resume sections
class AggregatedResumeSections(BaseModel):
    awards: str
    certificates: str
    contact_name_title: str
    education: str
    interests: str
    languages: str
    para: str
    professional_experiences: str
    projects: str
    skills: str
    soft_skills: str
    summary: str

# Define endpoint for classifying the resume
@app.post("/classify_resume", response_model=List[ResumeSection])
async def classify_resume(pdf_file: UploadFile = File(...)):
    # Read the contents of the uploaded PDF file
    resume_text = read_pdf(pdf_file)
    # Classify the resume and convert to JSON
    classified_resume = classify_and_convert_to_json(resume_text)
    # Aggregate the classified resume sections
    aggregated_resume_sections = AggregatedResumeSections(**classified_resume)

    # Convert the AggregatedResumeSections object back to a list of ResumeSection objects
    resume_sections = []
    for label, text in aggregated_resume_sections.dict().items():
        if text:
            resume_sections.append(ResumeSection(text=text, label=label))

    # Return the list of ResumeSection objects
    return resume_sections

Save the file; your project folder should now look like this:

.
├── main.py
└── requirements.txt

Run the application

uvicorn main:app --reload

Test the application

curl -X POST -F "pdf_file=@/path/to/your/resume.pdf" http://localhost:8000/classify_resume
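
If you prefer testing from Python, the same request can be made with the requests library (assuming the server is running locally on port 8000 and requests is installed):

import requests

# Upload a local PDF to the classifier endpoint and print the JSON response
with open("/path/to/your/resume.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/classify_resume",
        files={"pdf_file": ("resume.pdf", f, "application/pdf")},
    )
print(response.json())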

Overall

This code defines a FastAPI endpoint to classify and extract different sections of a resume in PDF format. The endpoint accepts a PDF file as input, reads its contents, preprocesses the text, classifies each section of the resume, aggregates the classified sections into a dictionary, and returns a list of ResumeSection objects that contain the classified text and corresponding labels for each section.

The code first imports the necessary libraries, including transformers, PyPDF2, io, and FastAPI. It loads a pre-trained tokenizer and a pre-trained model for sequence classification based on an extended DistilBERT architecture. It also defines a function to preprocess resume text and split it into sections, and another function to classify a resume section based on its text.

The code then defines two Pydantic models: ResumeSection and AggregatedResumeSections. The ResumeSection model contains a text field to hold the text of a classified resume section and a label field to hold the section’s label. The AggregatedResumeSections model contains a field for each section of the resume that the classifier can recognize.

Finally, the code defines a FastAPI endpoint /classify_resume that accepts a PDF file as input and returns a list of ResumeSection objects containing the classified text and corresponding labels for each section. The endpoint reads the contents of the uploaded PDF file, preprocesses the text, classifies each section of the resume, aggregates the classified sections into a dictionary, and converts it to the AggregatedResumeSections object. The endpoint then converts the AggregatedResumeSections object back to a list of ResumeSection objects and returns it.

Improving the code

  1. Error handling: The code currently has no error handling. Adding handling for cases such as invalid input files, corrupted PDFs, or unexpected errors during classification would make the API far more robust; see the sketch after this list.

  2. Unit tests: Writing unit tests for each application component can help ensure everything works as expected.

  3. Data validation: The code assumes that the input file is a PDF file and that the resume sections are well-formed. Adding data validation checks (also shown in the sketch after this list) can improve the robustness of the application.

  4. Better use of Pydantic: Pydantic is a powerful library that can help with data validation and parsing. The current implementation of the code could be improved by making better use of Pydantic’s features.

  5. Fine-tuning the model: Fine-tuning the pre-trained model on more relevant data can help improve the accuracy of the classification.

  6. Use of asynchronous programming: The endpoint is declared async, but the model inference is blocking and CPU-bound. Offloading it to a worker thread, as in the sketch after this list, keeps the event loop responsive and improves the application’s scalability.

  7. Refactoring: Refactoring the code to improve its structure, readability, and maintainability can make it easier to modify and extend in the future.
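
As an illustration of points 1, 3, and 6, here is a minimal sketch of a hardened endpoint. It reuses the read_pdf, classify_and_convert_to_json, ResumeSection, and AggregatedResumeSections definitions from main.py; the route name, status codes, and error messages are illustrative choices, not part of the original code.

from fastapi import File, HTTPException, UploadFile
from fastapi.concurrency import run_in_threadpool
from PyPDF2.errors import PdfReadError

@app.post("/classify_resume_safe", response_model=List[ResumeSection])
async def classify_resume_safe(pdf_file: UploadFile = File(...)):
    # Data validation: reject uploads that do not declare a PDF content type
    if pdf_file.content_type != "application/pdf":
        raise HTTPException(status_code=415, detail="Only PDF uploads are accepted")
    # Error handling: a corrupted or unreadable PDF raises PdfReadError
    try:
        resume_text = read_pdf(pdf_file)
    except PdfReadError:
        raise HTTPException(status_code=400, detail="Could not parse the PDF file")
    if not resume_text.strip():
        raise HTTPException(status_code=422, detail="No extractable text found in the PDF")
    # Asynchronous programming: model inference is CPU-bound and blocking, so
    # run it in a worker thread instead of blocking the event loop
    classified_resume = await run_in_threadpool(classify_and_convert_to_json, resume_text)
    sections = AggregatedResumeSections(**classified_resume)
    return [
        ResumeSection(text=text, label=label)
        for label, text in sections.dict().items()
        if text
    ]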

Additional

You can also use has-abi/bert-finetuned-resumes-sections or has-abi/distilBERT-finetuned-resumes-sections with the same code; both should work unchanged.
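
Swapping models is just a matter of changing the checkpoint name passed to from_pretrained, for example:

# Any of the compatible checkpoints can be substituted here
checkpoint = "has-abi/bert-finetuned-resumes-sections"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)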