Extract/Read

Extract/Read (4)

Perform PDF OCR with Python (Extract Text from Scanned PDF)

2025-07-18 06:44:09 Written by zaki zou

OCR PDF and Extract Text Using Python In daily work, extracting text from PDF files is a common task. For standard digital documents—such as those exported from Word to PDF—this process is usually straightforward. However, things get tricky when dealing with scanned PDFs, which are essentially images of printed documents. In such cases, traditional text extraction methods fail, and OCR (Optical Character Recognition) becomes necessary to recognize and convert the text within images into editable content.

In this article, we’ll walk through how to perform PDF OCR using Python to automate this workflow and significantly reduce manual effort.

Why OCR is Needed for PDF Text Extraction
Best Python OCR Libraries for PDF Processing
Convert PDF Pages to Images Using Python
Scan and Extract Text from Images Using Spire.OCR
Conclusion

Why OCR is Needed for PDF Text Extraction

When it comes to extracting text from PDF files, one important factor that determines your approach is the type of PDF. Generally, PDFs fall into two categories: scanned (image-based) PDFs and searchable PDFs. Each requires a different strategy for text extraction.

Scanned PDFs are typically created by digitizing physical documents such as books, invoices, contracts, or magazines. While the text appears readable to the human eye, it's actually embedded as an image—making it inaccessible to traditional text extraction tools. Older digital files or password-protected PDFs may also lack an actual text layer.
Searchable PDFs, on the other hand, contain a hidden text layer that allows computers to search, copy, or parse the content. These files are usually generated directly from applications like Microsoft Word or PDF editors and are much easier to process programmatically.

This distinction highlights the importance of OCR (Optical Character Recognition) when working with scanned PDFs. With tools like Python PDF OCR, we can convert these image-based PDFs into images, run OCR to recognize the text, and extract it for further use—all in an automated way.

Best Python OCR Libraries for PDF Processing

Before diving into the implementation, let’s take a quick look at the tools we’ll be using in this tutorial. To simplify the process, we’ll use Spire.PDF for Python and Spire.OCR for Python to perform PDF OCR in Python.

Spire.PDF will handle the conversion from PDF to images.
Spire.OCR, a powerful OCR tool for PDF files, will recognize the text in those images and extract it as editable content.

You can install Spire.PDF using the following pip command:

pip install spire.pdf

and install Spire.OCR with:

pip install spire.ocr

Alternatively, you can download and install them manually by visiting the official Spire.PDF and Spire.OCR pages.

Convert PDF Pages to Images Using Python

Before we dive into Python PDF OCR, it's crucial to understand a foundational step: OCR technology doesn't directly process PDF files. Especially with image-based PDFs (like those created from scanned documents), we first need to convert them into individual image files.

Converting PDFs to images using the Spire.PDF library is straightforward. You simply load your target PDF document and then iterate through each page. For every page, call the PdfDocument.SaveAsImage() method to save it as a separate image file. Once this step is complete, your images are ready for the subsequent OCR process.

Here's a code example showing how to convert PDF to PNG:

from spire.pdf import *

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("/AI-Generated Art.pdf")

# Loop through pages and save as images
for i in range(pdf.Pages.Count):
    # Convert each page to image
    with pdf.SaveAsImage(i) as image:
        
        # Save in different formats as needed
        image.Save(f"/output/pdftoimage/ToImage_{i}.png")
        # image.Save(f"Output/ToImage_{i}.jpg")
        # image.Save(f"Output/ToImage_{i}.bmp")

# Close the PDF document
pdf.Close()

Conversion result preview: Convert PDF to PNG in Python

Scan and Extract Text from Images Using Spire.OCR

After converting the scanned PDF into images, we can now move on to OCR PDF with Python and to extract text from the PDF. With OcrScanner.Scan() from Spire.OCR, recognizing text in images becomes straightforward. It supports multiple languages such as English, Chinese, French, and German. Once the text is extracted, you can easily save it to a .txt file or generate a Word document.

The code example below shows how to OCR the first PDF page and export to text in Python:

from spire.ocr import *

# Create OCR scanner instance
scanner = OcrScanner()

# Configure OCR model path and language
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'E:/DownloadsNew/win-x64/'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)

# Perform OCR on the image
scanner.Scan(r'/output/pdftoimage/ToImage_0.png')

# Save extracted text to file
text = scanner.Text.ToString()
with open('/output/scannedpdfoutput.txt', 'a', encoding='utf-8') as file:
    file.write(text + '\n')

Result preview: OCR the First PDF Image and Extract Text Using Python

The Conclusion

In this article, we covered how to perform PDF OCR with Python—from converting PDFs to images, to recognizing text with OCR, and finally saving the extracted content as a plain text file. With this streamlined approach, extracting text from scanned PDFs becomes effortless. If you're looking to automate your PDF processing workflows, feel free to reach out and request a 30-day free trial. It’s time to simplify your document management.

Published in Extract/Read

Tagged under

pdf Python Extract Read

Python: Get Coordinates of the Specified Text or Image in PDF

2024-05-21 01:58:08 Written by Koohji

Retrieving the coordinates of text or images within a PDF document can quickly locate specific elements, which is valuable for extracting content from PDFs. This capability also enables adding annotations, marks, or stamps to the desired locations in a PDF, allowing for more advanced document processing and manipulation.

In this article, you will learn how to get coordinates of the specified text or image in a PDF document using Spire.PDF for Python.

Get Coordinates of the Specified Text in PDF in Python
Get Coordinates of the Specified Image in PDF in Python

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. They can be easily installed in your Windows through the following pip command.

Package Manager

pip install Spire.PDF

If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Coordinate System in Spire.PDF

When using Spire.PDF to process an existing PDF document, the origin of the coordinate system is located at the top left corner of the page. The X-axis extends horizontally from the origin to the right, and the Y-axis extends vertically downward from the origin (shown as below).

Python: Get Coordinates of the Specified Text or Image in PDF

Get Coordinates of the Specified Text in PDF in Python

To find the coordinates of a specific piece of text within a PDF document, you must first use the PdfTextFinder.Find() method to locate all instances of the target text on a particular page. Once you have found these instances, you can then access the PdfTextFragment.Positions property to retrieve the precise (X, Y) coordinates for each instance of the text.

The steps to get coordinates of the specified text in PDF are as follows.

Create a PdfDocument object.
Load a PDF document from a specified path.
Get a specific page from the document.
Create a PdfTextFinder object.
Specify find options through PdfTextFinder.Options property.
Search for a string within the page using PdfTextFinder.Find() method.
Get a specific instance of the search results.
Get X and Y coordinates of the text through PdfTextFragment.Positions[0].X and PdfTextFragment.Positions[0].Y properties.

Python

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Privacy Policy.pdf")

# Get a specific page
page = doc.Pages[0]

# Create a PdfTextFinder object
textFinder = PdfTextFinder(page)

# Specify find options
findOptions = PdfTextFindOptions()
findOptions.Parameter = TextFindParameter.IgnoreCase
findOptions.Parameter = TextFindParameter.WholeWord
textFinder.Options = findOptions
 
# Search for the string "PRIVACY POLICY" within the page
findResults = textFinder.Find("PRIVACY POLICY") 

# Get the first instance of the results
result = findResults[0]

# Get X/Y coordinates of the found text
x = int(result.Positions[0].X)
y = int(result.Positions[0].Y)
print("The coordinates of the first instance of the found text are:", (x, y))

# Dispose resources
doc.Dispose()

Python: Get Coordinates of the Specified Text or Image in PDF

Get Coordinates of the Specified Image in PDF in Python

Spire.PDF for Python provides the PdfImageHelper class, which allows users to extract image details from a specific page within a PDF file. By doing so, you can leverage the PdfImageInfo.Bounds property to retrieve the (X, Y) coordinates of an individual image.

The steps to get coordinates of the specified image in PDF are as follows.

Create a PdfDocument object.
Load a PDF document from a specified path.
Get a specific page from the document.
Create a PdfImageHelper object.
Get the image information from the page using PdfImageHelper.GetImagesInfo() method.
Get X and Y coordinates of a specific image through PdfImageInfo.Bounds property.

Python

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Privacy Policy.pdf")

# Get a specific page 
page = doc.Pages[0]

# Create a PdfImageHelper object
imageHelper = PdfImageHelper()

# Get image information from the page
imageInformation = imageHelper.GetImagesInfo(page)

# Get X/Y coordinates of a specific image
x = int(imageInformation[0].Bounds.X)
y = int(imageInformation[0].Bounds.Y)
print("The coordinates of the specified image are:", (x, y))

# Dispose resources
doc.Dispose()

Python: Get Coordinates of the Specified Text or Image in PDF

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Extract/Read

Tagged under

pdf Python Extract Read

Extract Text from PDF in Python: A Complete Guide with Practical Code Samples

2023-10-09 01:26:32 Written by Koohji

Precisely Extract Text from PDF using Python

PDF files are everywhere—from contracts and research papers to eBooks and invoices. While they preserve formatting perfectly, extracting text from PDFs can be challenging, especially with large or complex documents. Manual copying is not only slow but often inaccurate.

Whether you’re a developer automating workflows, a data analyst processing content, or simply someone needing quick text extraction, programmatic methods can save you valuable time and effort.

In this comprehensive guide, you’ll learn how to extract text from PDF files in Python using Spire.PDF for Python — a powerful and easy-to-use PDF processing library. We’ll cover extracting all text, targeting specific pages or areas, ignoring hidden text, and capturing layout details such as text position and size.

Why Extract Text from PDF Files
Install Spire.PDF for Python: Powerful PDF Parser Library
Extract Text from PDF (Basic Example)
Advanced Text Extraction Features
Conclusion
FAQs

Why Extract Text from PDF Files

Text extraction from PDFs is essential for many use cases, including:

Automating data entry and document processing
Enabling full-text search and indexing
Performing data analysis on reports and surveys
Extracting content for machine learning and NLP
Converting PDFs to other editable formats

Install Spire.PDF for Python: Powerful PDF Parser Library

Spire.PDF for Python is a comprehensive and easy-to-use PDF processing library that simplifies all your PDF manipulation needs. It offers advanced text extraction capabilities that work seamlessly with both simple and complex PDF documents.

Installation

The library can be installed easily via pip. Open your terminal and run the following command:

pip install spire.pdf

Need help with the installation? Follow this step-by-step guide: How to Install Spire.PDF for Python on Windows

Extract Text from PDF (Basic Example)

If you just want to quickly read all the text from a PDF, this simple example shows how to do it. It iterates over each page, extracts the full text using PdfTextExtractor, and saves it to a text file with spacing and line breaks preserved.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Prepare a variable to hold the extracted text
all_text = ""

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()
# Extract all text including whitespaces
extractOptions.IsExtractAllText = True

# Loop through all pages and extract text
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    textExtractor = PdfTextExtractor(page)
    text = textExtractor.ExtractText(extractOptions)
    # Append text from each page
    all_text += text + "\n"

# Write all extracted text to a file
with open('output/TextOfAllPages.txt', 'w', encoding='utf-8') as file:
    file.write(all_text)

Advanced Text Extraction Features

For greater control over what and how text is extracted, Spire.PDF for Python offers advanced options. You can selectively extract content from specific pages or regions, or even with layout details, such as text position and size, to better suit your specific data processing needs.

Retrieve Text from Selected Pages

Instead of processing an entire PDF, you can target specific pages for text extraction. This is especially useful for large documents where only certain sections are relevant for your task.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Create a PdfTextExtractOptions object and enable full text extraction
extractOptions = PdfTextExtractOptions()
# Extract all text including whitespaces
extractOptions.IsExtractAllText = True

# Get a specific page (e.g., page 2)
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Extract text from the page
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a file using UTF-8 encoding
with open('output/TextOfPage.txt', 'w', encoding='utf-8') as file:
    file.write(text)

Retrieve Text from Selected Pages Output

Get Text from Defined Area

When dealing with structured documents like forms or invoices, extracting text from a specific region can be more efficient. You can define a rectangular area and extract only the text within that boundary on the page.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Get a specific page (e.g., page 2)
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Define the rectangular area to extract text from
# RectangleF(left, top, width, height)
extractOptions.ExtractArea = RectangleF(0.0, 100.0, 890.0, 80.0)

# Extract text from the specified area, keeping white spaces
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a file using UTF-8 encoding
with open('output/TextOfRectangle.txt', 'w', encoding='utf-8') as file:
    file.write(text)

Retrieve Text from Defined Area Output

Ignore Hidden Text During Extraction

Some PDFs contain hidden or invisible text, often used for accessibility or OCR layers. You can choose to ignore such content during extraction to focus only on what is actually visible to users.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()
# Ignore hidden text during extraction
extractOptions.IsShowHiddenText = False

# Get a specific page (e.g., page 2)
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Extract text from the page
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a file using UTF-8 encoding
with open('output/ExcludeHiddenText.txt', 'w', encoding='utf-8') as file:
    file.write(text)

Retrieve Text with Position (Coordinates) and Size Information

For layout-sensitive applications—such as converting PDF content into editable formats or reconstructing page structure—you can extract text along with its position and size. This provides precise control over how content is interpreted and used.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Loop through all pages of the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]

    # Create a PdfTextFinder object for the current page
    finder = PdfTextFinder(page)

    # Find all text fragments on the page
    fragments = finder.FindAllText()

    print(f"Page {i + 1}:")

    # Loop through all text fragments
    for fragment in fragments:
        # Extract text content from the current text fragment
        text = fragment.Text

        # Get bounding rectangles with position and size
        rects = fragment.Bounds

        print(f'Text: "{text}"')

        # Iterate through all rectangles
        for rect in rects:
            # Print the position and size information of the current rectangle
            print(f"Position: ({rect.X}, {rect.Y}), Size: ({rect.Width} x {rect.Height})")

        print()

Conclusion

Extracting text from PDF files in Python becomes efficient and flexible with Spire.PDF for Python. Whether you need to process entire documents or extract text from specific pages or regions, Spire.PDF provides a robust set of tools to meet your needs. By automating text extraction, you can streamline workflows, power intelligent search systems, or prepare data for analysis and machine learning.

FAQs

Q1: Can text be extracted from password-protected PDFs?

A1: Yes, Spire.PDF for Python can open and extract text from secured files by providing the correct password when loading the PDF document.

Q2: Is batch text extraction from multiple PDFs supported?

A2: Yes, you can programmatically iterate through a directory of PDF files and apply text extraction to each file efficiently using Spire.PDF for Python.

Q3: Is it possible to extract images or tables from PDFs?

A3: While this guide focuses on text extraction, Spire.PDF for Python also supports image extraction and table extraction.

Q4: Can text be extracted from scanned (image-based) PDFs?

A4: Extracting text from scanned PDFs requires OCR (Optical Character Recognition). Spire.PDF for Python does not include built-in OCR, but you can combine it with an OCR library like Spire.OCR for image-to-text conversion.

Get a Free License

To fully experience the capabilities of Spire.PDF for Python without any evaluation limitations, you can request a free 30-day trial license.

Published in Extract/Read

Tagged under

pdf Python Extract Read

Extract Images from PDF in Python – A Complete Guide

2024-07-19 01:08:00 Written by Koohji

PDF files often contain critical embedded images (e.g., charts, diagrams, scanned documents). For developers, knowing how to extract images from PDF in Python allows them to repurpose graphical content for automated report generation or feed these visuals into machine learning models for analysis and OCR tasks.

Visual guide for Extract Images from PDF Python

This article explores how to leverage the Spire.PDF for Python library to extract images from PDF files via Python, covering the following aspects:

Installation & Environment Setup
How to Extract Images from PDFs using Python
- Example 1: Extract Images from a PDF Page
- Example 2: Extract All Images from a PDF File
Handle Different Image Formats While Extraction
Frequently Asked Questions
Conclusion (Extract Text and More)

Installation & Environment Setup

Before you start using Spire.PDF for Python to extract images from PDF, make sure you have the following in place:

Python Environment: Ensure that you have Python installed on your system. It is recommended to use the latest stable version for the best compatibility and performance.
Spire.PDF for Python Library: You need to install the Python PDF SDK, and the easiest way is using pip, the Python package installer.

Open your command prompt or terminal and run the following command:

pip install Spire.PDF

How to Extract Images from PDFs using Python

Example 1: Extract Images from a PDF Page

Here’s a complete Python script to extract and save images from a specified page in PDF:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF file
pdf.LoadFromFile("template1.pdf")

# Get the first page
page = pdf.Pages[0]

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()

# Get the image information on the page
imageInfo = imageHelper.GetImagesInfo(page)

# Iterate through the image information
for i in range(0, len(imageInfo)):
    # Save images to file
    imageInfo[i].Image.Save("PageImage\\Image" + str(i) + ".png")

# Release resources
pdf.Dispose()

Key Steps Explained:

Load the PDF: Use the LoadFromFile() method to load a PDF file.
Access a Page: Access a specified PDF page by index.
Extract Image information:
- Create a PdfImageHelper instance to facilitate image extraction.
- Use the GetImagesInfo() method to retrieve image information from the specified page, and return a list of PdfImageInfo objects.
Save Images to Files:
- Loops through all detected images on the page
- Use the PdfImageInfo[].Image.Save() method to save the image to disk.

Output:

Extract all images from the first page in a PDF

Example 2: Extract All Images from a PDF File

Building on the single-page extraction method, you can iterate through all pages of the PDF document to extract every embedded image.

Python code example:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF file
pdf.LoadFromFile("template1.pdf")

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()

# Iterate through the pages in the document
for i in range(0, pdf.Pages.Count):
    # Get the current page
    page = pdf.Pages[i]
    # Get the image information on the page
    imageInfo = imageHelper.GetImagesInfo(page)
    # Iterate through the image information items
    for j in range(0, len(imageInfo)):
        # Save the current image to file
        imageInfo[j].Image.Save(f"Images\\Image{i}_{j}.png")

# Release resources
pdf.Close()

Output:

Extract all images from an entire PDF file.

Handle Different Image Formats While Extraction

Spire.PDF for Python supports extracting images in various formats such as PNG, JPG/JPEG, BMP, etc. When saving the extracted images, you can choose the appropriate format based on your needs.

Common Image Formats:

Format	Best Use Cases	PDF Extraction Notes
JPG/JPEG	Photos, scanned documents	Common in PDFs; quality loss on re-compress
PNG	Web graphics, diagrams, screenshots	Preserves transparency; larger file sizes
BMP	Windows applications, temp storage	Rare in modern PDFs; avoid for web use
TIFF	Archiving, print, OCR input	Ideal for document preservation; multi-page
EMF	Windows vector editing	Editable in Illustrator/Inkscape

Frequently Asked Questions

Q1: Is Spire.PDF for Python a free library?

Spire.PDF for Python offers both free and commercial versions. The free version has limitations, such as a maximum of 10 pages per PDF. For commercial use or to remove these restrictions, you can request a trial license here.

Q2: Can I extract images from a specified page range only?

Yes. Instead of iterating through all pages, specify the page indices you want. For example, to extract images from the pages 2 to 5:

# Extract images from pages 2 to 5
for i in range(1, 4): # Pages are zero-indexed
    page = pdf.Pages[i]
    # Process images as before

Q3: Is it possible to extract text from images?

Yes. For scanned PDF files, after extracting the images, you can extract the text in the images in conjunction with the Spire.OCR for Python library.

A step-by-step guide: How to Extract Text from Image Using Python (OCR Code Examples)

Conclusion (Extract Text and More)

Spire.PDF simplifies image extraction from PDF in Python with minimal code. By following this guide, you can:

Extract images from single pages or entire PDF documents.
Save images from PDF in various formats (PNG, JPG, BMP or TIFF).

As a PDF document can contain different elements, the Python PDF library is also capable of:

Published in Extract/Read

Tagged under

pdf Python Extract Read

News Category

Extract/Read (4)

Perform PDF OCR with Python (Extract Text from Scanned PDF)

Why OCR is Needed for PDF Text Extraction

Best Python OCR Libraries for PDF Processing

Convert PDF Pages to Images Using Python

Scan and Extract Text from Images Using Spire.OCR

The Conclusion

Python: Get Coordinates of the Specified Text or Image in PDF

Install Spire.PDF for Python

Coordinate System in Spire.PDF

Get Coordinates of the Specified Text in PDF in Python

Get Coordinates of the Specified Image in PDF in Python

Apply for a Temporary License

Extract Text from PDF in Python: A Complete Guide with Practical Code Samples

Table of Contents

Why Extract Text from PDF Files

Install Spire.PDF for Python: Powerful PDF Parser Library

Installation

Extract Text from PDF (Basic Example)

Advanced Text Extraction Features

Retrieve Text from Selected Pages

Get Text from Defined Area

Ignore Hidden Text During Extraction

Retrieve Text with Position (Coordinates) and Size Information

Conclusion

FAQs

Q1: Can text be extracted from password-protected PDFs?

Q2: Is batch text extraction from multiple PDFs supported?

Q3: Is it possible to extract images or tables from PDFs?

Q4: Can text be extracted from scanned (image-based) PDFs?

Get a Free License

Extract Images from PDF in Python – A Complete Guide

Installation & Environment Setup

How to Extract Images from PDFs using Python

Example 1: Extract Images from a PDF Page

Example 2: Extract All Images from a PDF File

Handle Different Image Formats While Extraction

Frequently Asked Questions

Q1: Is Spire.PDF for Python a free library?

Q2: Can I extract images from a specified page range only?

Q3: Is it possible to extract text from images?

Conclusion (Extract Text and More)