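First, import the libraries that read the PDF and build the Word document. The script uses three:
PyPDF2: opens the PDF and reports how many pages it contains.
pdfplumber: extracts the text from each page.
python-docx (imported as docx): creates and saves the Word document.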
Here's the code to import these libraries:
# ================== Importing Required Libraries ==================
# Import necessary libraries for PDF processing and Word document creation
import PyPDF2
import pdfplumber
from docx import Document
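Before running the script, it can help to confirm that all three libraries are installed. The following is a minimal sketch, not part of the original script: the helper name missing_packages and the REQUIRED mapping are hypothetical, and the pip names assume the standard PyPI packages (note that the docx module is distributed as python-docx).

```python
import importlib.util

# Maps each import name used by the script to the name usually passed to pip.
# Assumption: the standard PyPI package names.
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
    "docx": "python-docx",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of modules that cannot be found (hypothetical helper)."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

If missing_packages() returns a non-empty list, installing those names with pip should make the imports above succeed.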
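Next, define the file paths for the input PDF and the output Word document, along with the range of pages to extract. This step sets three variables that the rest of the script uses:
file_name: the path to the source PDF.
output_word_file: the path where the Word document will be saved.
page_range: the pages to extract, written as a 1-based range such as '1-12' or as a single page number such as '5'.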
Here's the code for this section:
# ================== Define File Paths and Page Range ==================
# Define the path to the PDF file, the output Word file, and the range of pages to extract
file_name = r'C:\Users\omidm\OneDrive\Desktop\Clifton_Agreement.pdf'  # Raw string: use single backslashes
output_word_file = r'C:\Users\omidm\OneDrive\Desktop\extracted_Clifton_Agreement2.docx'
page_range = '1-12' # Specify the page range to extract, e.g., '1-12' for pages 1 to 12
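As an alternative to hard-coded path strings, the paths could be built with pathlib, which avoids backslash escaping altogether. This is only a sketch: build_paths and its suffix parameter are hypothetical names, and the folder passed in is whatever location holds your PDF.

```python
from pathlib import Path


def build_paths(folder, pdf_name, suffix="_extracted"):
    """Return (pdf_path, docx_path), placing the .docx next to the PDF.

    Hypothetical helper: the output name is the PDF's stem plus `suffix`.
    """
    pdf_path = Path(folder) / pdf_name
    docx_path = pdf_path.with_name(pdf_path.stem + suffix + ".docx")
    return pdf_path, docx_path


# Example (assumes the PDF sits in a Desktop folder under the home directory):
# file_name, output_word_file = build_paths(Path.home() / "Desktop", "Clifton_Agreement.pdf")
```

Both PyPDF2's file handling via open() and pdfplumber.open() accept a path, so the returned Path objects can be passed in place of the strings; python-docx's save() may need str(output_word_file) on older versions.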
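The extract_pdf_to_word function does the main work of the script. It performs the following tasks:
Open the PDF: reads the file with PyPDF2 to find out how many pages it has.
Parse the page range: accepts either a range ('1-12') or a single page ('5') and converts it to zero-based indices.
Validate the range: prints an error and returns if the range falls outside the document or the start page comes after the end page.
Extract the text: opens the PDF with pdfplumber and adds each page's text to a new Word document as a paragraph, with a placeholder for pages that have no extractable text.
Save the document: writes the Word file and prints a confirmation message.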
Here’s the code for this function:
# ================== Function to Extract PDF Pages and Save as Word ==================
def extract_pdf_to_word(file_name, page_range, output_word_file):
    # Open the PDF file in read-binary mode
    with open(file_name, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Parse the page range input
        if '-' in page_range:
            start_page, end_page = map(int, page_range.split('-'))  # Handle a range of pages
        else:
            start_page = end_page = int(page_range)  # Handle a single page

        # Adjust for zero-indexing used by PyPDF2 (subtract 1 from each page number)
        start_page -= 1
        end_page -= 1

        # Validate the page range
        if start_page < 0 or end_page >= len(pdf_reader.pages) or start_page > end_page:
            print("Invalid page range.")  # Print an error message if the range is invalid
            return  # Exit the function if the range is not valid

        # ================== Create a Word Document ==================
        doc = Document()  # Initialize a new Word document

        # ================== Extract and Add Text from Each Page ==================
        with pdfplumber.open(file_name) as pdf:  # Open the PDF with pdfplumber
            for page_num in range(start_page, end_page + 1):
                page = pdf.pages[page_num]  # Access the page
                text = page.extract_text()  # Extract text from the page
                if text:
                    doc.add_paragraph(text)  # Add the extracted text to the Word document
                else:
                    doc.add_paragraph("No text found on page.")  # Handle pages with no text

        # ================== Save the Word Document ==================
        doc.save(output_word_file)  # Save the Word document to the specified file
        print(f"Pages {start_page + 1} to {end_page + 1} have been extracted to {output_word_file}")  # Confirmation message
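The range parsing and validation inside the function can also be isolated into a small helper, which makes that logic easy to test without opening a PDF. The sketch below is one possible variant, not the script's own code: parse_page_range is a hypothetical name, and it raises ValueError instead of printing a message, which is a design choice of this example.

```python
def parse_page_range(page_range, total_pages):
    """Convert '3-7' or '5' (1-based, inclusive) to zero-based (start, end).

    Hypothetical helper. Raises ValueError for malformed input or for
    pages outside 1..total_pages.
    """
    parts = page_range.strip().split("-")
    if len(parts) not in (1, 2):
        raise ValueError(f"Malformed page range: {page_range!r}")
    try:
        numbers = [int(part) for part in parts]
    except ValueError:
        raise ValueError(f"Page numbers must be integers: {page_range!r}") from None

    # A single page such as '5' yields start == end.
    start, end = numbers[0], numbers[-1]
    if not (1 <= start <= end <= total_pages):
        raise ValueError(
            f"Page range {page_range!r} is outside 1-{total_pages} or reversed"
        )
    return start - 1, end - 1


# Example: a 12-page document, as in the page range used above.
start_index, end_index = parse_page_range("1-12", total_pages=12)
```

Raising an exception lets the caller decide how to report the problem, and it also catches inputs such as 'a-b' or '1-2-3', which the inline version would pass to int() or to tuple unpacking unguarded.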
Finally, the script calls the extract_pdf_to_word function with the predefined variables. This triggers the entire process of extracting text from the PDF and saving it into a Word document.
Here’s the code to call the function:
# ================== Call the Function with Predefined Variables ==================
extract_pdf_to_word(file_name, page_range, output_word_file) # Call the function with the specified inputs
Importing Required Libraries: The script imports PyPDF2, pdfplumber, and python-docx.
Defining File Paths and Page Range: The user specifies the input/output file paths and the page range.
Main Function: This function parses and validates the page range, extracts the text of each requested page, and saves that text to a Word document.
Function Call: The script concludes by calling the function with the predefined settings, executing the extraction process.
This structured explanation and corresponding code sections make it easier to understand the flow of the script and the purpose of each part.