Extracting Text from a PDF to a Word Document using PyPDF2 & pdfplumber

1. Importing Required Libraries

First, we need to import the necessary libraries that will help us manipulate PDF files and create a Word document. The libraries we are using are:

PyPDF2: Reads PDF files. In this script it is only used to count the pages so that the requested range can be validated.
pdfplumber: Extracts the text from each page.
python-docx: Allows us to create and manipulate Word documents (.docx files).

Here's the code to import these libraries:


# ================== Importing Required Libraries ==================
# Import necessary libraries for PDF processing and Word document creation
import PyPDF2
import pdfplumber
from docx import Document
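
Before running the imports above, it can help to confirm that the three packages are installed. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED mapping are illustrative and not part of the original script. Note that python-docx is installed under that name but imported as docx.

```python
from importlib.util import find_spec

# Maps each top-level import name to the package name used with pip.
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
    "docx": "python-docx",  # installed as python-docx, imported as docx
}

def missing_packages(required):
    """Return the pip names of the modules that cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Install the missing packages with: pip install " + " ".join(missing))
```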

2. Defining File Paths and Page Range

Next, the user needs to define the file paths for the input PDF and output Word document, as well as the range of pages they wish to extract. This step sets the variables that will be used throughout the process:

file_name: The full path to the PDF file.
output_word_file: The full path for the output Word document, including the .docx extension.
page_range: The range of pages to extract, specified as a string (e.g., '1-12' for pages 1 to 12).

Here's the code for this section:


# ================== Define File Paths and Page Range ==================
# Define the path to the PDF file, the output Word file, and the range of pages to extract
# Raw strings (r'...') keep backslashes literal, so each separator is written once
file_name = r'C:\Users\omidm\OneDrive\Desktop\Clifton_Agreement.pdf'
output_word_file = r'C:\Users\omidm\OneDrive\Desktop\extracted_Clifton_Agreement2.docx'
page_range = '1-12'  # Specify the page range to extract, e.g., '1-12' for pages 1 to 12
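
As an alternative to typing both paths by hand, the output path could be derived from the input path. This is a sketch under assumptions: the helper default_output_path and the "extracted_" naming scheme are hypothetical, and PureWindowsPath is used only so the example behaves the same on any operating system. In a real script pathlib.Path would normally be used instead.

```python
from pathlib import PurePath, PureWindowsPath

def default_output_path(pdf_path: PurePath) -> PurePath:
    """Return a .docx path in the same folder, named 'extracted_<stem>.docx'."""
    return pdf_path.with_name(f"extracted_{pdf_path.stem}.docx")

pdf_path = PureWindowsPath(r"C:\Users\omidm\OneDrive\Desktop\Clifton_Agreement.pdf")
print(default_output_path(pdf_path))
```

The result can be converted with str() and passed wherever output_word_file is expected.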

3. Function to Extract PDF Pages and Save as Word

This is the main function of the script. It performs the following tasks:

Opening the PDF File: It opens the PDF file in binary read mode.
Parsing the Page Range: The page range is parsed to handle both single pages and ranges of pages.
Adjusting for Zero-Indexing: PyPDF2 and pdfplumber both number pages from zero, so 1 is subtracted from each page number the user supplies.
Validating the Page Range: It checks if the specified page range is valid within the document.
Creating a Word Document: Initializes a new Word document where the extracted text will be stored.
Extracting and Adding Text: Uses pdfplumber to extract the text from each page in the range and adds it to the Word document as a paragraph. Pages with no extractable text get a placeholder line instead.
Saving the Word Document: Finally, it saves the Word document to the specified location.

Here’s the code for this function:


# ================== Function to Extract PDF Pages and Save as Word ==================
def extract_pdf_to_word(file_name, page_range, output_word_file):
    # Open the PDF file in read-binary mode
    with open(file_name, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        
        # Parse the page range input
        if '-' in page_range:
            start_page, end_page = map(int, page_range.split('-'))  # Handle a range of pages
        else:
            start_page = end_page = int(page_range)  # Handle a single page

        # Adjust for the zero-based page indexing used by PyPDF2 and pdfplumber (subtract 1 from each page number)
        start_page -= 1
        end_page -= 1

        # Validate the page range
        if start_page < 0 or end_page >= len(pdf_reader.pages) or start_page > end_page:
            print("Invalid page range.")  # Print an error message if the range is invalid
            return  # Exit the function if the range is not valid

        # ================== Create a Word Document ==================
        doc = Document()  # Initialize a new Word document

        # ================== Extract and Add Text from Each Page ==================
        with pdfplumber.open(file_name) as pdf:  # Open the PDF with pdfplumber
            for page_num in range(start_page, end_page + 1):
                page = pdf.pages[page_num]  # Access the page
                text = page.extract_text()  # Extract text from the page
                if text:
                    doc.add_paragraph(text)  # Add the extracted text to the Word document
                else:
                    doc.add_paragraph("No text found on page.")  # Handle pages with no text

        # ================== Save the Word Document ==================
        doc.save(output_word_file)  # Save the Word document to the specified file
        print(f"Pages {start_page + 1} to {end_page + 1} have been extracted to {output_word_file}")  # Confirmation message
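
The parsing and validation steps in the function above could also be moved into a small helper that can be tested without opening a PDF. The sketch below is one possible version. The name parse_page_range and the choice to raise ValueError rather than print a message are assumptions, not part of the original script. It also rejects malformed input such as '1-2-3' or 'a-b', which the original code would pass to int() or to tuple unpacking unchecked.

```python
def parse_page_range(page_range: str, total_pages: int) -> range:
    """Convert '3' or '2-5' (1-based, inclusive) into a range of zero-based page indices."""
    parts = page_range.strip().split('-')
    if len(parts) not in (1, 2):
        raise ValueError(f"Page range must look like '3' or '2-5', got {page_range!r}")
    try:
        start = int(parts[0])
        end = int(parts[-1])  # same as start when only one page is given
    except ValueError:
        raise ValueError(f"Page range must look like '3' or '2-5', got {page_range!r}") from None
    if start < 1 or end > total_pages or start > end:
        raise ValueError(f"Pages {start}-{end} are outside the document (1-{total_pages})")
    # Shift the start to zero-based; the end stays as-is because range() excludes it
    return range(start - 1, end)

print(list(parse_page_range('2-5', total_pages=12)))
```

Inside extract_pdf_to_word, the loop could then iterate over parse_page_range(page_range, len(pdf.pages)) instead of adjusting start_page and end_page by hand.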

4. Calling the Function

Finally, the script calls the extract_pdf_to_word function with the predefined variables. This triggers the entire process of extracting text from the PDF and saving it into a Word document.

Here’s the code to call the function:


# ================== Call the Function with Predefined Variables ==================
extract_pdf_to_word(file_name, page_range, output_word_file)  # Call the function with the specified inputs
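
If the script is meant to be reused with different files, the hard-coded variables could be replaced by command-line arguments. This is a minimal sketch using the standard argparse module. The function name build_parser and the --pages option are assumptions made for illustration, and the argument list is passed explicitly here so the example runs on its own.

```python
import argparse

def build_parser():
    """Build a parser for the three inputs that extract_pdf_to_word expects."""
    parser = argparse.ArgumentParser(
        description="Extract a range of PDF pages into a Word document.")
    parser.add_argument("file_name", help="path to the input PDF")
    parser.add_argument("output_word_file", help="path for the output .docx file")
    parser.add_argument("--pages", default="1",
                        help="page or page range, e.g. '3' or '1-12'")
    return parser

# Parse an explicit list for demonstration; a real script would call
# parse_args() with no argument so that sys.argv is read instead.
args = build_parser().parse_args(
    ["Clifton_Agreement.pdf", "extracted.docx", "--pages", "1-12"])
print(args.file_name, args.pages, args.output_word_file)
```

The parsed values would then be passed on as extract_pdf_to_word(args.file_name, args.pages, args.output_word_file).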

Summary

Importing Required Libraries: The first step involves importing the necessary libraries.

Defining File Paths and Page Range: The user specifies the input/output file paths and the page range.

Main Function: This function handles the core tasks of extracting the specified pages and saving them to a Word document.

Function Call: The script concludes by calling the function with the predefined settings, executing the extraction process.

This structured explanation and corresponding code sections make it easier to understand the flow of the script and the purpose of each part.
