Spire.PDF for Python (87)

Perform PDF OCR with Python (Extract Text from Scanned PDF)

2025-07-18 06:44:09 Written by zaki zou

OCR PDF and Extract Text Using Python

In daily work, extracting text from PDF files is a common task. For standard digital documents, such as those exported from Word to PDF, this process is usually straightforward. However, things get tricky when dealing with scanned PDFs, which are essentially images of printed documents. In such cases, traditional text extraction methods fail, and OCR (Optical Character Recognition) becomes necessary to recognize and convert the text within images into editable content.

In this article, we’ll walk through how to perform PDF OCR using Python to automate this workflow and significantly reduce manual effort.

Why OCR is Needed for PDF Text Extraction
Best Python OCR Libraries for PDF Processing
Convert PDF Pages to Images Using Python
Scan and Extract Text from Images Using Spire.OCR
Conclusion

Why OCR is Needed for PDF Text Extraction

When it comes to extracting text from PDF files, one important factor that determines your approach is the type of PDF. Generally, PDFs fall into two categories: scanned (image-based) PDFs and searchable PDFs. Each requires a different strategy for text extraction.

Scanned PDFs are typically created by digitizing physical documents such as books, invoices, contracts, or magazines. While the text appears readable to the human eye, it is actually stored as an image, which makes it inaccessible to traditional text extraction tools. Some older digital files may also lack a text layer, and password-protected PDFs can block extraction even when a text layer exists.
Searchable PDFs, on the other hand, contain a hidden text layer that allows computers to search, copy, or parse the content. These files are usually generated directly from applications like Microsoft Word or PDF editors and are much easier to process programmatically.

This distinction highlights the importance of OCR (Optical Character Recognition) when working with scanned PDFs. With Python OCR tools, we can render these image-based PDFs to images, run OCR to recognize the text, and extract it for further use, all in an automated way.
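One way to decide which of the two categories a page falls into is to try normal text extraction first and fall back to OCR only when almost nothing comes back. The sketch below shows that decision as a small helper. It assumes you already have the extracted text for each page as a string; the function name `needs_ocr` and the 20-character threshold are illustrative choices, not part of any Spire API.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when a page yields too little text to be a real text layer."""
    # Count only non-whitespace characters, so a page of blank lines counts as empty.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


# Hypothetical per-page extraction results: two empty pages and one searchable page.
pages = ["", "  \n ", "Invoice 2025-001\nTotal due: 1,250.00 USD"]
flags = [needs_ocr(text) for text in pages]
print(flags)
```

The threshold is a heuristic: a scanned page with a stamped page number in its text layer could slip past it, so tune it against your own documents.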

Best Python OCR Libraries for PDF Processing

Before diving into the implementation, let’s take a quick look at the tools we’ll be using in this tutorial. To simplify the process, we’ll use Spire.PDF for Python and Spire.OCR for Python to perform PDF OCR in Python.

Spire.PDF will handle the conversion from PDF to images.
Spire.OCR, a powerful OCR tool for PDF files, will recognize the text in those images and extract it as editable content.

You can install Spire.PDF using the following pip command:

pip install spire.pdf

and install Spire.OCR with:

pip install spire.ocr

Alternatively, you can download and install them manually by visiting the official Spire.PDF and Spire.OCR pages.

Convert PDF Pages to Images Using Python

Before we dive into Python PDF OCR, it's crucial to understand a foundational step: OCR technology doesn't directly process PDF files. Especially with image-based PDFs (like those created from scanned documents), we first need to convert them into individual image files.

Converting PDFs to images using the Spire.PDF library is straightforward. You simply load your target PDF document and then iterate through each page. For every page, call the PdfDocument.SaveAsImage() method to save it as a separate image file. Once this step is complete, your images are ready for the subsequent OCR process.

Here's a code example showing how to convert PDF to PNG:

from spire.pdf import *

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("/AI-Generated Art.pdf")

# Loop through pages and save as images
for i in range(pdf.Pages.Count):
    # Convert each page to image
    with pdf.SaveAsImage(i) as image:
        
        # Save in different formats as needed
        image.Save(f"/output/pdftoimage/ToImage_{i}.png")
        # image.Save(f"Output/ToImage_{i}.jpg")
        # image.Save(f"Output/ToImage_{i}.bmp")

# Close the PDF document
pdf.Close()

Conversion result preview: Convert PDF to PNG in Python
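The conversion example writes into an output folder that has to exist already, and names like ToImage_10.png sort before ToImage_2.png in a plain alphabetical listing. A small helper can take care of both. This is a sketch using only the standard library; `page_image_path` is a hypothetical name, and the four-digit padding is an assumption you can adjust.

```python
from pathlib import Path


def page_image_path(output_dir: str, page_index: int, ext: str = "png") -> Path:
    """Create output_dir if needed and return a zero-padded image path for a page."""
    folder = Path(output_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Zero-padding keeps ToImage_0002 before ToImage_0010 in a plain sort.
    return folder / f"ToImage_{page_index:04d}.{ext}"
```

Inside the page loop, the result can be turned into a string with str(page_image_path("output/pdftoimage", i)) and passed to image.Save().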

Scan and Extract Text from Images Using Spire.OCR

After converting the scanned PDF into images, we can run OCR on them to extract the text. The OcrScanner.Scan() method from Spire.OCR recognizes the text in an image, and it supports multiple languages such as English, Chinese, French, and German. Once the text is extracted, you can save it to a .txt file or generate a Word document.

The code example below shows how to OCR the first PDF page and export to text in Python:

from spire.ocr import *

# Create OCR scanner instance
scanner = OcrScanner()

# Configure OCR model path and language
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'E:/DownloadsNew/win-x64/'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)

# Perform OCR on the image
scanner.Scan(r'/output/pdftoimage/ToImage_0.png')

# Save extracted text to file
text = scanner.Text.ToString()
with open('/output/scannedpdfoutput.txt', 'a', encoding='utf-8') as file:
    file.write(text + '\n')

Result preview: OCR the First PDF Image and Extract Text Using Python
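The example above handles only the first page. To cover a whole document, the same scan can be repeated for every page image in page order and the results joined. The sketch below keeps the OCR call itself out of the helper: `scan` is a hypothetical callable that you would write to wrap scanner.Scan(path) and return scanner.Text.ToString(), as in the example above. The names `page_number` and `ocr_pages` are illustrative.

```python
import re
from typing import Callable, Iterable


def page_number(file_name: str) -> int:
    """Pull the trailing page index out of names like 'ToImage_12.png'."""
    match = re.search(r"(\d+)(?=\.[^.]+$)", file_name)
    return int(match.group(1)) if match else -1


def ocr_pages(image_paths: Iterable[str], scan: Callable[[str], str]) -> str:
    """Run `scan` on each image in numeric page order and join the results."""
    ordered = sorted(image_paths, key=page_number)
    return "\n".join(scan(path) for path in ordered)
```

Sorting by the numeric index rather than by name avoids page 10 being processed before page 2.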

Conclusion

In this article, we covered how to perform PDF OCR with Python—from converting PDFs to images, to recognizing text with OCR, and finally saving the extracted content as a plain text file. With this streamlined approach, extracting text from scanned PDFs becomes effortless. If you're looking to automate your PDF processing workflows, feel free to reach out and request a 30-day free trial. It’s time to simplify your document management.

Published in Extract/Read

Tagged under

pdf Python Extract Read

Convert PDF to Markdown in Python – Single & Batch Conversion

2025-07-17 02:36:46 Written by zaki zou

Visual guide of PDF to Markdown in Python

PDFs are ubiquitous in digital document management, but their rigid formatting often makes them less than ideal for content that needs to be easily edited, updated, or integrated into modern workflows. Markdown (.md), on the other hand, offers a lightweight, human-readable syntax perfect for web publishing, documentation, and version control. In this guide, we'll explore how to leverage the Spire.PDF for Python library to perform single or batch conversions from PDF to Markdown in Python efficiently.

Why Convert PDFs to Markdown?
Python PDF Converter Library – Installation
Convert PDF to Markdown in Python
Batch Convert Multiple PDFs to Markdown in Python
Frequently Asked Questions
Conclusion

Why Convert PDFs to Markdown?

Markdown offers several advantages over PDF for content creation and management:

Version control friendly: Easily track changes in Git
Lightweight and readable: Plain text format with simple syntax
Editability: Simple to modify without specialized software
Web integration: Natively supported by platforms like GitHub, GitLab, and static site generators (e.g., Jekyll, Hugo).

Spire.PDF for Python provides a robust solution for extracting text and structure from PDFs while preserving essential formatting elements like tables, lists, and basic styling.

Python PDF Converter Library - Installation

To use Spire.PDF for Python in your projects, you need to install the library via PyPI (Python Package Index) using pip. Open your terminal/command prompt and run:

pip install Spire.PDF

To upgrade an existing installation to the latest version:

pip install --upgrade spire.pdf

Convert PDF to Markdown in Python

Here’s a basic example that demonstrates how to use Python to convert a PDF file to a Markdown (.md) file.

from spire.pdf.common import *
from spire.pdf import *

# Create an instance of PdfDocument class
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile("TestFile.pdf")

# Convert the PDF to a Markdown file
pdf.SaveToFile("PDFToMarkdown.md", FileFormat.Markdown) 
pdf.Close()

This Python script loads a PDF file and then uses the SaveToFile() method to convert it to Markdown format. The FileFormat.Markdown parameter specifies the output format.

How Conversion Works

The library extracts text, images, tables, and basic formatting from the PDF and converts them into Markdown syntax.

Text: Preserved with paragraphs/line breaks.
Images: Images in the PDF are converted to base64-encoded PNG format and embedded directly in the Markdown.
Tables: Tabular data is converted to Markdown table syntax (rows/columns with pipes |).
Styling: Basic formatting (bold, italic) is retained using Markdown syntax.
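Because images are embedded as base64 text, the generated Markdown can become large and hard to diff. If that matters, a post-processing step can move the images into separate files. The sketch below assumes the embedded images use the common Markdown data-URI form, ![alt](data:image/png;base64,...); the exact output syntax of the converter is not confirmed here, so check a real output file before relying on the pattern. `split_embedded_images` is a hypothetical helper name.

```python
import base64
import re

# Assumed shape of an embedded image: ![alt](data:image/png;base64,AAAA...)
DATA_URI_IMAGE = re.compile(r"!\[([^\]]*)\]\(data:image/png;base64,([A-Za-z0-9+/=\s]+)\)")


def split_embedded_images(markdown: str, name_pattern: str = "image_{}.png"):
    """Replace inline base64 PNGs with file links and return (markdown, {name: bytes})."""
    images = {}

    def replace(match):
        # Number the images in the order they appear in the document.
        name = name_pattern.format(len(images) + 1)
        images[name] = base64.b64decode(match.group(2))
        return f"![{match.group(1)}]({name})"

    return DATA_URI_IMAGE.sub(replace, markdown), images
```

The returned dictionary maps each new file name to its decoded bytes, which you can then write next to the .md file.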

Output: Convert a PDF file to a Markdown file.

Batch Convert Multiple PDFs to Markdown in Python

This Python script uses a loop to convert all PDF files in a specified directory to Markdown format.

import os
from spire.pdf import *

# Configure paths
input_folder = "pdf_folder/"
output_folder = "markdown_output/"

# Create output directory
os.makedirs(output_folder, exist_ok=True)

# Process all PDFs in folder
for file_name in sorted(os.listdir(input_folder)):  # sorted: os.listdir() order is arbitrary
    if file_name.lower().endswith(".pdf"):  # also match .PDF
        # Initialize document
        pdf = PdfDocument()
        pdf.LoadFromFile(os.path.join(input_folder, file_name))
        
        # Generate output path
        md_name = os.path.splitext(file_name)[0] + ".md"
        output_path = os.path.join(output_folder, md_name)
        
        # Convert to Markdown
        pdf.SaveToFile(output_path, FileFormat.Markdown)
        pdf.Close()

Key Characteristics

Batch Processing: Automatically processes all PDFs in the input folder, improving efficiency for bulk operations.
1:1 Conversion: Each PDF generates a corresponding Markdown file.
Sequential Execution: Files are processed one at a time, in the order the loop visits them.
Resource Management: Each PDF is closed immediately after conversion.
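In a large batch, one damaged PDF should not stop the remaining files from being converted. The sketch below wraps the loop in a reusable function that records failures and carries on. The conversion itself is passed in as `convert`, a hypothetical callable that would contain the LoadFromFile() / SaveToFile() / Close() calls from the script above; `convert_folder` is an illustrative name.

```python
import os
from typing import Callable, List, Tuple


def convert_folder(input_folder: str, output_folder: str,
                   convert: Callable[[str, str], None]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Convert every .pdf in input_folder; return (converted, failures)."""
    os.makedirs(output_folder, exist_ok=True)
    converted, failures = [], []
    for file_name in sorted(os.listdir(input_folder)):
        if not file_name.lower().endswith(".pdf"):
            continue
        source = os.path.join(input_folder, file_name)
        target = os.path.join(output_folder, os.path.splitext(file_name)[0] + ".md")
        try:
            convert(source, target)
            converted.append(target)
        except Exception as error:  # keep going; report the file afterwards
            failures.append((source, str(error)))
    return converted, failures
```

After the run, the failures list tells you which files to inspect by hand.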

Output:

Batch convert multiple PDF files to Markdown files.

Need to convert Markdown to PDF? Refer to: Convert Markdown to PDF in Python

Frequently Asked Questions (FAQs)

Q1: Is Spire.PDF for Python free?

A: Spire.PDF offers a free version with limitations (e.g., maximum 3 pages per conversion). For unlimited use, request a 30-day free trial for evaluation.

Q2: Can I convert password-protected PDFs to Markdown?

A: Yes. Use the LoadFromFile method with the password as a second parameter:

pdf.LoadFromFile("ProtectedFile.pdf", "your_password")

Q3: Can Spire.PDF convert scanned/image-based PDFs to Markdown?

A: No. The library extracts text-based content only. For scanned PDFs, use OCR tools (like Spire.OCR for Python) to create searchable PDFs first.

Conclusion

Spire.PDF for Python simplifies PDF to Markdown conversion for both single file and batch processing.

Its advantages include:

Simple API with minimal code
Preservation of document structure
Batch processing capabilities
Cross-platform compatibility

Whether you're migrating documentation, processing research papers, or building content pipelines, the examples in this guide show how to transform static PDF documents into flexible, editable Markdown content, streamlining workflows and improving collaboration.

Published in Conversion

Tagged under

pdf Python Conversion

How to Add Text to PDF in Python (Create & Edit with Examples)

2025-07-10 08:51:40 Written by zaki zou

Illustration showing Python code adding text to a PDF file

Adding text to a PDF is a common task in Python — whether you're generating reports, adding annotations, filling templates, or labeling documents. This guide will walk you through how to write text in a PDF file using Python, including both creating new PDF documents and updating existing ones.

We’ll be using a dedicated Python PDF library - Spire.PDF for Python, which allows precise control over text placement, font styling, and batch processing. The examples are concise, beginner-friendly, and ready for real-world projects.

Sections Covered

Setup: Install the PDF Library in Python
Add Text to a New PDF
Add Text to an Existing PDF
Control Text Style, Position, Transparency, and Rotation
Common Pitfalls and Cross-Platform Tips
Conclusion
FAQ

Setup: Install the PDF Library in Python

To get started, install Spire.PDF for Python, a flexible and cross-platform PDF library.

pip install Spire.PDF

Or use Free Spire.PDF for Python:

pip install spire.pdf.free

Why use this library?

Works without Adobe Acrobat or Microsoft Office
Add and format text at exact positions
Supports both new and existing PDF editing
Runs on Windows, macOS, and Linux

Add Text to a New PDF Using Python

If you want to create a PDF from text using Python, the example below shows how to insert a line of text into a blank PDF page using custom font and position settings.

Example: Create and write text to a blank PDF

from spire.pdf import PdfDocument, PdfTrueTypeFont, PdfFontStyle, PdfSolidBrush, PdfRGBColor, PointF, RectangleF, \
    PdfStringFormat, PdfTextAlignment, PdfVerticalAlignment

# Create a new PDF document and add a new page
pdf = PdfDocument()
page = pdf.Pages.Add()

text = ("This report summarizes the sales performance of various products in the first quarter of 2025. " +
        "Below is a breakdown of the total sales by product category, " +
        "followed by a comparison of sales in different regions.")

# Set the font, brush, and point
font = PdfTrueTypeFont("Arial", 14.0, PdfFontStyle.Regular, True)
brush = PdfSolidBrush(PdfRGBColor(0, 0, 0))  # black
point = PointF(50.0, 100.0)

# Set the layout area and string format
layoutArea = RectangleF(50.0, 50.0, page.GetClientSize().Width - 100.0, page.GetClientSize().Height)
stringFormat = PdfStringFormat(PdfTextAlignment.Left, PdfVerticalAlignment.Top)

page.Canvas.DrawString(text, font, brush, layoutArea, stringFormat, False)

pdf.SaveToFile("output/new.pdf")
pdf.Close()

Technical Notes

PdfTrueTypeFont() loads a TrueType font from the system with customizable size and style (e.g., regular, bold). It ensures consistent text rendering in the PDF.
PdfSolidBrush() defines the fill color for text or shapes using RGB values. In this example, it's set to black ((0, 0, 0)).
RectangleF(x, y, width, height) specifies a rectangular layout area for drawing text. It enables automatic line wrapping and precise control of text boundaries.
PdfStringFormat() controls the alignment of the text inside the rectangle. Here, text is aligned to the top-left (Left and Top).
DrawString() draws the specified text within the defined layout area without affecting existing content on the page.

Example output PDF showing wrapped black text starting at coordinates (50, 50).


Tip: To display multiple paragraphs or line breaks, consider adjusting the Y-coordinate dynamically or using multiple DrawString() calls with updated positions.
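The Y-coordinate adjustment mentioned in the tip is simple arithmetic: each paragraph starts where the previous one ended, plus some spacing. The sketch below assumes the height of each paragraph is already known (for example, from a measurement step); `stack_paragraphs`, the 50-point top margin, and the 12-point spacing are illustrative values rather than library defaults.

```python
def stack_paragraphs(heights, top=50.0, spacing=12.0):
    """Return the Y-coordinate for each paragraph, stacked top to bottom."""
    positions = []
    y = top
    for height in heights:
        positions.append(y)
        # The next paragraph starts below this one, plus the gap.
        y += height + spacing
    return positions


print(stack_paragraphs([40.0, 60.0, 20.0]))  # [50.0, 102.0, 174.0]
```

Each returned value can serve as the Y component of the layout rectangle or point for the matching DrawString() call.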

If you want to learn how to convert TXT files to PDF directly using Python, please check: How to Convert Text Files to PDF Using Python.

Add Text to an Existing PDF in Python

Need to add text to an existing PDF using Python? This method lets you load a PDF, access a page, and write new text anywhere on the canvas.

This is helpful for:

Adding comments or annotations
Labeling document versions
Filling pre-designed templates

Example: Open an existing PDF and insert text

from spire.pdf import PdfDocument, PdfFontStyle, PdfSolidBrush, PdfRGBColor, PointF, PdfFont, PdfFontFamily

pdf = PdfDocument()
pdf.LoadFromFile("input.pdf")
page = pdf.Pages[0]

font = PdfFont(PdfFontFamily.TimesRoman, 12.0, PdfFontStyle.Bold)
brush = PdfSolidBrush(PdfRGBColor(255, 0, 0))  # red
location = PointF(150.0, 110.0)

page.Canvas.DrawString("This document is approved.", font, brush, location)

pdf.SaveToFile("output/modified.pdf")
pdf.Close()

Technical Notes

LoadFromFile() loads an existing PDF into memory.
You can access specific pages via pdf.Pages[index].
New content is drawn on top of the existing layout, non-destructively.
The text position is again controlled via PointF(x, y).

Modified PDF page with newly added red text annotation on the first page.


Use different x, y coordinates to insert content at custom positions.

Related article: Replace Text in PDF with Python

Control Text Style, Positioning, Transparency, and Rotation

When adding text to a PDF, you often need more than just plain content—you may want to customize the font, color, placement, rotation, and transparency, especially for annotations or watermarks.

Spire.PDF for Python offers fine-grained control for these visual elements, whether you’re building structured reports or stamping dynamic text overlays.

Set Font Style and Color

# Create PdfTrueTypeFont
font = PdfTrueTypeFont("Calibri", 16.0, PdfFontStyle.Italic, True)

# Create PdfFont
font = PdfFont(PdfFontFamily.TimesRoman, 16.0, PdfFontStyle.Italic)

# Create PdfBrush to specify text drawing color
brush = PdfSolidBrush(PdfRGBColor(34, 139, 34))  # forest green

PdfTrueTypeFont will embed the font into the PDF file. To reduce file size, you may use PdfFont, which uses system fonts without embedding them.

Apply Transparency and Rotation

You can adjust transparency and rotation when drawing text to achieve effects like watermarks or angled labels.

# Save the current canvas state
state = page.Canvas.Save()

# Set semi-transparency (0.0 = fully transparent, 1.0 = fully opaque)
page.Canvas.SetTransparency(0.4)

# Move the origin to the center of the page
page.Canvas.TranslateTransform(page.Size.Width / 2, page.Size.Height / 2)

# Rotate the canvas -45 degrees (counterclockwise)
page.Canvas.RotateTransform(-45)

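# Draw text at new origin
page.Canvas.DrawString("DRAFT", font, brush, PointF(-50, -20))

# Restore the saved state so later drawing is not rotated or transparent
page.Canvas.Restore(state)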

Example: Add a Diagonal Watermark to the Center of the Page

The following example demonstrates how to draw a centered, rotated, semi-transparent watermark using all the style controls above:

from spire.pdf import PdfDocument, PdfTrueTypeFont, PdfFontStyle, PdfSolidBrush, PdfRGBColor, PointF
from spire.pdf.common import Color

pdf = PdfDocument()
pdf.LoadFromFile("input1.pdf")
page = pdf.Pages[0]

text = "Confidential"
font = PdfTrueTypeFont("Arial", 40.0, PdfFontStyle.Bold, True)
brush = PdfSolidBrush(PdfRGBColor(Color.get_DarkBlue()))  # dark blue

# Measure text size to calculate center
size = font.MeasureString(text)
x = (page.Canvas.ClientSize.Width - size.Width) / 2
y = (page.Canvas.ClientSize.Height - size.Height) / 2

state = page.Canvas.Save()
page.Canvas.SetTransparency(0.3)
page.Canvas.TranslateTransform(x + size.Width / 2, y + size.Height / 2)
page.Canvas.RotateTransform(-45.0)
page.Canvas.DrawString(text, font, brush, PointF(-size.Width / 2, -size.Height / 2))
page.Canvas.Restore(state)

pdf.SaveToFile("output/with_watermark.pdf")
pdf.Close()

PDF page displaying a centered, rotated, semi-transparent watermark text.


This approach works well for dynamic watermarking, diagonal stamps like "VOID", "COPY", or "ARCHIVED", and supports full automation.
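The centering arithmetic in the watermark example can be checked on its own, without a PDF. The sketch below repeats the same calculation with plain numbers: it returns the point passed to TranslateTransform() and the offset passed to DrawString(). The function name and the sample sizes (a 595 x 842 point page and a 200 x 40 point text box) are assumptions for illustration.

```python
def centered_watermark(page_width, page_height, text_width, text_height):
    """Return (translate_x, translate_y, draw_x, draw_y) for a centered, rotated stamp."""
    # Top-left corner that would center the unrotated text on the page.
    x = (page_width - text_width) / 2
    y = (page_height - text_height) / 2
    # Move the origin to the middle of the text box, which is the page center.
    translate_x = x + text_width / 2
    translate_y = y + text_height / 2
    # Draw relative to the new origin so the text rotates around its own center.
    return translate_x, translate_y, -text_width / 2, -text_height / 2


print(centered_watermark(595.0, 842.0, 200.0, 40.0))  # (297.5, 421.0, -100.0, -20.0)
```

Note that the translation point is always the page center, whatever the text size; only the draw offset depends on the measured text.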

Make sure all files are closed and not in use to avoid PermissionError.

For more details on inserting watermarks into PDF with Python, please refer to: How to Insert Text Watermarks into PDFs Using Python.

Common Pitfalls and Cross-Platform Considerations

Even with the right API, issues can arise when deploying PDF text operations across different environments or font configurations. Here are some common problems and how to resolve them:

Issue	Cause	Recommended Fix
Text appears in wrong position	Hardcoded coordinates not accounting for page size	Use ClientSize and MeasureString() for dynamic layout
Font not rendered	Font lacks glyphs or isn't supported	Use PdfTrueTypeFont to embed supported fonts like Arial Unicode
Unicode text not displayed	Font does not support full Unicode range	Use universal fonts (e.g., Arial Unicode, Noto Sans)
Text overlaps existing content	Positioning too close to body text	Adjust Y-offsets or add padding with MeasureString()
Watermark text appears on output	You are using the paid version without a license	Use the free version or apply for a temporary license
Font file too large	Embedded font increases PDF size	Use PdfFont for system fonts (non-embedded), if portability is not a concern
Inconsistent results on macOS/Linux	Fonts not available or different metrics	Ship fonts with your application, or use built-in cross-platform fonts

Conclusion

With Spire.PDF for Python, adding text to PDFs—whether creating new files, updating existing ones, or automating batch edits—can be done easily and precisely. From annotations to watermarks, the library gives you full control over layout and styling.

You can start with the free version right away, or apply for a temporary license to unlock full features.

FAQ

How to add text to a PDF using Python?

Use a PDF library such as Spire.PDF to insert text via the DrawString() method. You can define font, position, and styling.

Can I write text into an existing PDF file with Python?

Yes. Load the file with LoadFromFile(), then use DrawString() to add text at a specific location.

How do I generate a PDF from text using Python?

Create a new document and use drawing methods to write content line by line with precise positioning.

Can I add the same text to many PDFs automatically?

Yes. Use a loop to process multiple files and insert text programmatically using a template script.

Published in Text

Tagged under

pdf Python Text

How to Read PDF Files in Python – Text, Tables, Images, and More

2025-06-06 08:07:20 Written by zaki zou

Read PDF files in Python using Spire.PDF library

Reading PDF files using Python is essential for tasks like document automation, content analysis, and data scraping. Whether you're working with contracts, reports, invoices, or scientific papers, being able to programmatically access PDF content saves time and enables powerful workflows.

To read PDF content in Python, including text, tables, images, and metadata, you need a reliable Python PDF reader. In this guide, we’ll show you how to read PDFs in Python using Spire.PDF for Python, a professional and easy-to-use library that supports full-featured PDF reading without relying on any third-party tools.

Here's what's covered:

Preparing Your Environment
Load a PDF File in Python
Read Text from PDF Pages in Python
Read Table Data from PDFs in Python
Read Images from PDFs in Python
Read PDF Metadata (Title, Author, etc.)
Common Questions on Reading PDFs

Environment Setup for Reading PDFs in Python

Spire.PDF for Python is a powerful Python PDF reader that allows users to read PDF content with simple Python code, including text, tables, images, and metadata. It offers a developer-friendly interface and supports a wide range of PDF reading operations:

Read PDF files from disk or memory
Access text, tables, metadata, and images
No need for third-party tools
High accuracy for structured data reading
Free version available

It’s suitable for developers who want to read and process PDFs with minimal setup.

You can install Spire.PDF for Python via pip:

pip install spire.pdf

Or install the free version, Free Spire.PDF for Python, for small tasks:

pip install spire.pdf.free

Load a PDF File in Python

Before accessing content, the first step is to load the PDF into memory. Spire.PDF lets you read PDF files from a path on disk or directly from in-memory byte streams — ideal for reading from web uploads or APIs.

Read PDF from File Path

To begin reading a PDF in Python, load the file using PdfDocument.LoadFromFile(). This creates a document object you can use to access content.

from spire.pdf import PdfDocument

# Create a PdfDocument instance
pdf = PdfDocument()
# Load a PDF document
pdf.LoadFromFile("sample.pdf")

Read PDF from Bytes (In-Memory)

To read a PDF file from memory without saving it to disk, you can first load its byte content and then initialize a PdfDocument using a Stream object. This method is especially useful when handling PDF files received from web uploads, APIs, or temporary in-memory data.

from spire.pdf import PdfDocument, Stream

# Read the PDF file to a byte array
with open("sample.pdf", "rb") as f:
    byte_data = f.read()
    
# Create a stream using the byte array
pdfStream = Stream(byte_data)
# Create a PdfDocument using the stream
pdf = PdfDocument(pdfStream)

To go further, check out this guide: Loading and Saving PDFs via Byte Streams in Python
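When the bytes come from an upload or an API response, it can help to confirm that they look like a PDF before handing them to the library. A PDF file begins with the signature %PDF-, so a cheap check is possible with the standard library alone. The sketch below is a sanity check under that assumption, not a full validation; `looks_like_pdf` and the 1024-byte window are illustrative choices.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Check for the '%PDF-' signature near the start of the byte data."""
    # Some producers emit a few stray bytes first, so scan a small window.
    return b"%PDF-" in data[:1024]


print(looks_like_pdf(b"%PDF-1.7\n..."))    # True
print(looks_like_pdf(b"<html></html>"))    # False
```

If the check fails, you can reject the upload early instead of waiting for the load step to raise an error.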

Read Text from PDF Pages in Python

Reading text from a PDF file is one of the most common use cases in document automation. With Spire.PDF, you can easily retrieve all visible text from the entire PDF or from individual pages using simple methods.

Read All Text from PDF

To extract all text from a PDF, loop through each page and call PdfTextExtractor.ExtractText() to collect visible text content.

from spire.pdf import PdfDocument, PdfTextExtractor, PdfTextExtractOptions

# Create a PdfDocument instance
pdf = PdfDocument()
# Load a PDF document
pdf.LoadFromFile("sample.pdf")

all_text = ""
# Loop through each page
for pageIndex in range(pdf.Pages.Count):
    page = pdf.Pages.get_Item(pageIndex)
    # Create a PdfTextExtractor instance
    text_extractor = PdfTextExtractor(page)
    # Configure extracting options
    options = PdfTextExtractOptions()
    options.IsExtractAllText = True
    options.IsSimpleExtraction = True
    # Extract text from the current page
    all_text += text_extractor.ExtractText(options)
print(all_text)

Sample text content retrieved:

Python code to extract all text from a PDF using Spire.PDF

Read Text from Specific Area of a Page

You can also read text from a defined region of a page using a bounding box. This is useful when only a portion of the layout contains relevant information.

from spire.pdf import RectangleF, PdfDocument, PdfTextExtractor, PdfTextExtractOptions

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# Get the first page
page = pdf.Pages.get_Item(0)
# Create a PdfTextExtractor instance
textExtractor = PdfTextExtractor(page)
# Set the area to extract text by configuring the PdfTextExtractOptions
options = PdfTextExtractOptions()
area = RectangleF.FromLTRB(0, 200, page.Size.Width, 270)  # left, top, right, bottom
options.ExtractArea = area
options.IsSimpleExtraction = True

# Extract text from the area
text = textExtractor.ExtractText(options)
print(text)

The text read from the PDF page area:

Python example of reading text from a defined area in a PDF
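RectangleF.FromLTRB() takes the left, top, right, and bottom edges, while many layout tools report a region as x, y, width, and height. Mixing the two conventions selects the wrong area. The sketch below converts from one form to the other; `xywh_to_ltrb` is a hypothetical helper name, and the 595-point page width is a sample value.

```python
def xywh_to_ltrb(x, y, width, height):
    """Convert an (x, y, width, height) box to (left, top, right, bottom)."""
    return x, y, x + width, y + height


# A full-width band that starts 200 points from the top and is 70 points tall.
print(xywh_to_ltrb(0, 200, 595, 70))  # (0, 200, 595, 270)
```

The four returned values can be passed to RectangleF.FromLTRB() in the same order.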

Read Table Data from PDFs in Python

PDF tables are often used in reports, invoices, and statements. With Spire.PDF, you can read PDF tables in Python by extracting structured tabular content using its layout-aware table extractor, making it ideal for financial and business documents. Use PdfTableExtractor.ExtractTable() to detect tables page by page and output each row and cell as structured text.

from spire.pdf import PdfDocument, PdfTableExtractor

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# Create a PdfTableExtractor instance
table_extractor = PdfTableExtractor(pdf)
# Extract the table from the first page
tables = table_extractor.ExtractTable(0)
for table in tables:
    # Get the number of rows and columns
    row_count = table.GetRowCount()
    column_count = table.GetColumnCount()
    # Iterate all rows
    for i in range(row_count):
        table_row = []
        # Iterate all columns
        for j in range(column_count):
            # Get the cell
            cell_text = table.GetText(i, j)
            table_row.append(cell_text)
        print(table_row)

Table content extracted using the code above:

Read table data from a PDF in Python using Spire.PDF
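The rows printed above are plain lists of cell text. For further processing, it is often more convenient to treat the first row as a header and turn each remaining row into a dictionary. The sketch below does that with the standard library only; it assumes the first row really is a header row, which will not hold for every table, and `rows_to_records` is an illustrative name.

```python
def rows_to_records(rows):
    """Treat the first row as column names and map each later row to a dict."""
    if not rows:
        return []
    header = [name.strip() for name in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


# Hypothetical rows, shaped like the table_row lists built in the loop above.
rows = [["Name ", "Qty"], ["Pen", "3"], ["Ink", "1"]]
print(rows_to_records(rows))
```

Records in this shape can be fed to csv.DictWriter or serialized as JSON without further reshaping.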

Want to extract text from scanned PDFs using OCR? Read this guide on OCR with Python

Read Images from PDF in Python

PDF files often contain logos, scanned pages, or embedded images. Spire.PDF allows you to read and export these images, which is helpful for working with digitized documents or preserving visual content. Use PdfImageHelper.GetImagesInfo() on each page to retrieve and save all embedded images.

from spire.pdf import PdfDocument, PdfImageHelper

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# Get the first page
page = pdf.Pages.get_Item(0)

# Create a PdfImageHelper object
image_helper = PdfImageHelper()
# Get the image information from the page
images_info = image_helper.GetImagesInfo(page)
# Save the images from the page as image files
for i in range(len(images_info)):
    images_info[i].Image.Save("output/Images/image" + str(i) + ".png")

The image read from the PDF file:

Extract images from PDF pages using Spire.PDF and Python

Read PDF Metadata (Title, Author, etc.)

Sometimes you may want to access document metadata like author, subject, and title. This can be helpful for indexing or organizing files. Use the DocumentInformation property to read metadata fields.

from spire.pdf import PdfDocument

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# Get the document properties
properties = pdf.DocumentInformation
print("Title: " + properties.Title)
print("Author: " + properties.Author)
print("Subject: " + properties.Subject)
print("Keywords: " + properties.Keywords)

The metadata read from the PDF document:

Get PDF metadata such as author and title using Python

Common Questions on Reading PDFs

Can Python parse a PDF file?

Yes. Libraries like Spire.PDF for Python allow you to read PDF text, extract tables, and access embedded images or metadata. It supports methods like PdfTextExtractor.ExtractText() and PdfTableExtractor.ExtractTable() for structured content parsing.

How do I read a PDF in Jupyter?

Spire.PDF works seamlessly in Jupyter Notebooks. Just install it via pip and use its API to read PDF files, extract text, or parse tables and images directly in your notebook environment.

How to read text from a PDF file?

Use the PdfTextExtractor.ExtractText() method on each page after loading the PDF with Spire.PDF. This lets you read PDF file to text in Python and retrieve visible content for processing or analysis.

Can I read a PDF file without saving it to disk?

Yes. Read the PDF content as bytes, wrap it in a Stream, and load it directly from memory, either by passing the stream to PdfDocument() as shown earlier or with LoadFromStream(). This is useful for processing PDFs received from web APIs or file uploads.

Conclusion

With Spire.PDF for Python, you can easily read a PDF in Python — including reading PDF text, tables, images, and metadata — and even read a PDF file to text for further processing or automation. This makes it an ideal solution for document automation, data ingestion, and content parsing in Python.

Need to process large PDF files or unlock all features? Request a free license and take full advantage of Spire.PDF for Python today!

Published in Document Operation

Tagged under

pdf Python Document Operation

How to Convert PDF to CSV in Python (Fast & Accurate Table Extraction)

2025-05-19 03:43:16 Written by Administrator

Comprehensive Guide for Converting PDF to CSV by Extracting Tables Using Python

Working with PDFs that contain tables, reports, or invoice data? Manually copying that information into spreadsheets is slow, error-prone, and just plain frustrating. Fortunately, there's a smarter way: you can convert PDF to CSV in Python automatically — making your data easy to analyze, import, or automate.

In this guide, you’ll learn how to use Python for PDF to CSV conversion by directly extracting tables with Spire.PDF for Python, a Python library that doesn’t require any external tools.

✅ No Adobe or third-party tools required

✅ High-accuracy table recognition

✅ Ideal for structured data workflows

In this guide, we’ll cover:

Convert PDF to CSV in Python Using Table Extraction
Related Use Cases
Why Choose Spire.PDF for Python?
Frequently Asked Questions

Convert PDF to CSV in Python Using Table Extraction

The best way to convert PDF to CSV using Python is by extracting tables directly — no need for intermediate formats like Excel. This method is fast, clean, and highly effective for documents with structured data such as invoices, bank statements, or reports. It gives you usable CSV output with minimal code and high accuracy, making it ideal for automation and data analysis workflows.

Step 1: Install Spire.PDF for Python

Before writing code, make sure to install the required library. You can install Spire.PDF for Python via pip:

pip install spire.pdf

You can also install Free Spire.PDF for Python if you're working on smaller tasks:

pip install spire.pdf.free

Step 2: Python Code — Extract Table from PDF and Save as CSV

Python

from spire.pdf import PdfDocument, PdfTableExtractor
import csv
import os

# Load the PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a table extractor
extractor = PdfTableExtractor(pdf)

# Ensure output directory exists
os.makedirs("output/Tables", exist_ok=True)

# Loop through each page in the PDF
for page_index in range(pdf.Pages.Count):
    # Extract tables on the current page
    tables = extractor.ExtractTable(page_index)
    for table_index, table in enumerate(tables):
        table_data = []

        # Extract all rows and columns
        for row in range(table.GetRowCount()):
            row_data = []
            for col in range(table.GetColumnCount()):
                # Get cleaned cell text
                cell_text = table.GetText(row, col).replace("\n", "").strip()
                row_data.append(cell_text)
            table_data.append(row_data)

        # Write the table to a CSV file
        output_path = os.path.join("output", "Tables", f"Page{page_index + 1}-Table{table_index + 1}.csv")
        with open(output_path, "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerows(table_data)

# Release PDF resources
pdf.Dispose()

The conversion result:

The Result of Converting PDF to CSV with Python Using Spire.PDF

What is PdfTableExtractor?

PdfTableExtractor is a utility class provided by Spire.PDF for Python that detects and extracts table structures from PDF pages. Unlike plain text extraction, it maintains the row-column alignment of tabular data, making it ideal for converting PDF tables to CSV with clean structure.

Best for:

PDFs with structured tabular data
Automated Python PDF to CSV conversion
Fast Python-based data workflows
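
Extracted tables are not always perfectly rectangular, so it can help to normalize the rows before writing them. The sketch below assumes the table has already been read into a list of lists, as in Step 2; normalize_rows and rows_to_csv_text are hypothetical helper names. Unlike the Step 2 code, which deletes line breaks outright, this version replaces them with a single space so that wrapped words do not run together.

```python
import csv
import io


def normalize_rows(rows):
    """Collapse whitespace in each cell and pad short rows to a common width."""
    cleaned = [[" ".join(str(cell).split()) for cell in row] for row in rows]
    width = max((len(row) for row in cleaned), default=0)
    return [row + [""] * (width - len(row)) for row in cleaned]


def rows_to_csv_text(rows):
    """Serialize rows to CSV text; the csv module quotes cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(normalize_rows(rows))
    return buffer.getvalue()


# A ragged sample table with a wrapped cell and stray whitespace
sample = [["Item", "Price"], ["Widget,\nlarge", " 9.50 "], ["Total"]]
print(rows_to_csv_text(sample))
```

Padding every row to the same width keeps the columns aligned when the CSV is later loaded into a spreadsheet or a data frame.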

Related Article: How to Convert PDFs to Excel XLSX Files with Python

Related Use Cases

If your PDF doesn't contain traditional tables — such as when it's formatted as paragraphs, key-value pairs, or scanned as an image — the following approaches can help you convert such PDFs to CSV using Python effectively:

Extract non-tabular PDF text and save to CSV

Useful when data is in paragraph or report form — format it into table-like CSV using Python logic.

Use OCR to extract tables from scanned PDF in Python

Perfect for image-based PDFs — use OCR to detect and export tables to CSV.
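
For the key-value case, one possible approach is to parse "Key: Value" lines out of text that has already been extracted from a page. The sketch below works on a plain string and uses only the standard library; key_value_rows and the sample text are hypothetical, and real documents will likely need a pattern tuned to their layout.

```python
import re

# A label, a colon, then the value; the value may itself contain colons
KEY_VALUE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def key_value_rows(text):
    """Turn 'Key: Value' lines into [key, value] rows, skipping other lines."""
    rows = []
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            rows.append([match.group(1), match.group(2)])
    return rows


page_text = "Invoice No: 1042\nDate: 2025-05-19\nThank you for your business\nTotal: 318.00"
print(key_value_rows(page_text))
```

The resulting rows can be passed straight to csv.writer(...).writerows(), the same call used in the table example.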

Why Choose Spire.PDF for Python?

Spire.PDF for Python is a robust PDF SDK tailored for developers. Whether you're building automated reports, analytics tools, or ETL pipelines — it just works.

Key Benefits:

Accurate Table Recognition

Smartly extracts structured data from tables

Pure Python, No Adobe Needed

Lightweight and dependency-free

Multi-Format Support

Also supports conversion to text, images, Excel, and more

Frequently Asked Questions

Can I convert PDF to CSV using Python?

Yes, you can convert PDF to CSV in Python using Spire.PDF. It supports both direct table extraction to CSV and an optional workflow that converts PDFs to Excel first. No Adobe Acrobat or third-party tools are required.

What's the best way to extract tables from PDFs in Python?

The most efficient way is using Spire.PDF’s PdfTableExtractor class. It automatically detects tables on each page and lets you export structured data to CSV with just a few lines of Python code — ideal for invoices, reports, and automated processing.

Why would I convert PDF to Excel before CSV?

You might convert PDF to Excel first if the layout is complex or needs manual review. This gives you more control over formatting and cleanup before saving as CSV, but it's slower than direct extraction and not recommended for automation workflows.

Does Spire.PDF work without Adobe Acrobat?

Yes. Spire.PDF for Python is a 100% standalone library that doesn’t rely on Adobe Acrobat or any external software. It's a pure Python solution for converting, extracting, and manipulating PDF content programmatically.

Conclusion

Converting PDF to CSV in Python doesn’t have to be a hassle. With Spire.PDF for Python, you can:

Automatically extract structured tables to CSV
Build seamless, automated workflows in Python
Handle both native PDFs and scanned ones (with OCR)

Get a Free License

Spire.PDF for Python offers a free edition suitable for basic tasks. If you need access to more features, you can also apply for a free license for evaluation use. Simply submit a request, and a license key will be sent to your email after approval.

Published in Conversion

Tagged under

pdf Python Conversion

Edit PDF Using Python: A Practical Guide to PDF Modification

2025-05-06 03:42:04 Written by Administrator

PDFs are widely used in reports, invoices, and digital forms due to their consistent formatting across platforms. However, their fixed layout makes editing difficult without specialized tools. For developers looking to edit PDF using Python, Spire.PDF for Python provides a comprehensive and easy-to-use solution. This Python PDF editor enables you to modify PDF files programmatically—changing text, replacing images, adding annotations, handling forms, and securing files—without relying on Adobe Acrobat or any external software.

In this article, we will explore how to use Spire.PDF for Python to programmatically edit PDFs in Python applications.

Why Use Python and Spire.PDF to Edit PDF Documents?
Getting Started with Spire.PDF for Python
How to Edit an Existing PDF Using Spire.PDF for Python
Frequently Asked Questions

Why Use Python and Spire.PDF to Edit PDF Documents?

Python is a highly versatile programming language that provides an excellent platform for automating and managing PDF documents. For PDF editing tasks in Python, Spire.PDF for Python stands out as a comprehensive and easy-to-use solution for all your PDF manipulation needs.

Benefits of Using Python for PDF Editing

Automation and Batch Processing: Streamline repetitive PDF editing tasks efficiently.
Cost-Effective: Reduce manual work, saving time and resources when you edit PDF files with Python.
Integration: Seamlessly incorporate PDF editing into existing Python-based systems and workflows.

Advantages of Spire.PDF for Python

Spire.PDF for Python is a standalone library that enables developers to create, read, edit, convert, and save PDF files without relying on external software. As a trusted Python PDF editor, it offers powerful features such as:

Text and Image Editing
Annotations and Bookmark Management
Form Field Handling
Security Settings (Encryption and Permissions)
Conversion to Word, Excel, HTML, and Images

To learn more about these specific features, visit the Spire.PDF for Python tutorials.

With its intuitive API design, Spire.PDF makes it easier than ever to edit PDF files in Python quickly and effectively, ensuring a smooth development experience.

Getting Started with Spire.PDF for Python

Installation:

To install Spire.PDF for Python, simply run the following pip command:

Package Manager

pip install spire.pdf

Alternatively, you can install Free Spire.PDF for Python, a free version suitable for small projects, by running:

Package Manager

pip install spire.pdf.free

You can also download the library manually from the links.

Basic Setup Example:

The following example demonstrates how to create a simple PDF using Spire.PDF for Python:

Python

from spire.pdf import PdfDocument, PdfFont, PdfBrushes, PdfFontFamily, PdfFontStyle

# Create a new PDF document
pdf = PdfDocument()
# Add a new page to the document
page = pdf.Pages.Add()
# Create a font
font = PdfFont(PdfFontFamily.TimesRoman, 28.0, PdfFontStyle.Bold)
# Create a brush
brush = PdfBrushes.get_Black()
# Draw the string using the font and brush
page.Canvas.DrawString("Hello, World", font, brush, 100.0, 100.0)
# Save the document
pdf.SaveToFile("output/NewPDF.pdf")
pdf.Close()

Result: The generated PDF displays the text "Hello, World" using Times Roman Bold.

PDF created using Spire.PDF for Python showing Hello World text

With Spire.PDF installed, you're now ready to edit PDFs using Python. The sections below explain how to manipulate structure, content, security, and metadata.

How to Edit an Existing PDF Using Spire.PDF for Python

Spire.PDF for Python provides a simple yet powerful way to edit PDF using Python. With its intuitive API, developers can automate a wide range of PDF editing tasks including modifying document structure, page content, security settings, and properties. This section outlines the core categories of editing and their typical use cases.

Edit PDF Pages and Structure with Python

Structure editing lets you manipulate PDF page order, merge files, or insert/delete pages—ideal for document assembly workflows.

Insert or Delete Pages

Use the Pages.Insert() and Pages.RemoveAt() methods of the PdfDocument class to insert or delete pages at specific positions.

Code Example

Python

from spire.pdf import PdfDocument, PdfPageSize, PdfMargins, PdfPageRotateAngle

# Load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Insert and delete pages
# Insert at beginning
pdf.Pages.Insert(0, PdfPageSize.A4(), PdfMargins(50.0, 60.0), PdfPageRotateAngle.RotateAngle90)
# Delete second page
pdf.Pages.RemoveAt(1)

# Save the document
pdf.SaveToFile("output/InsertDeletePage.pdf")
pdf.Close()

Result:

PDF pages inserted and deleted with Python code using Spire.PDF

Merge Two PDF Files

The AppendPage() method allows you to combine PDFs by appending the pages of one document to another.

Code Example

Python

import os
from spire.pdf import PdfDocument

# Specify the PDF file path
pdfPath = "PDFs/"
# Read the PDF file names from the path and add them to a list
files = [pdfPath + file for file in os.listdir(pdfPath) if file.endswith(".pdf")]

# Load the first PDF file
pdf = PdfDocument()
pdf.LoadFromFile(files[0])
# Iterate through the other PDF files
for i in range(1, len(files)):
    # Load the current PDF file
    pdf2 = PdfDocument()
    pdf2.LoadFromFile(files[i])
    # Append the pages from the current PDF file to the first PDF file
    pdf.AppendPage(pdf2)

# Save the merged PDF file
pdf.SaveToFile("output/MergePDFs.pdf")
pdf.Close()

Result:

Merged PDF documents using Python and Spire.PDF
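
One detail worth noting: os.listdir() returns file names in arbitrary order, so the merge order in the example above is not guaranteed. A minimal sketch of a deterministic listing is shown below; list_pdfs is a hypothetical helper, and it also accepts upper-case .PDF extensions, which the endswith(".pdf") check would skip.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF paths in a folder, sorted by file name (case-insensitive)."""
    paths = [p for p in Path(folder).iterdir() if p.is_file() and p.suffix.lower() == ".pdf"]
    return [str(p) for p in sorted(paths, key=lambda p: p.name.lower())]


# Demonstrate with throwaway files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "A.PDF", "notes.txt", "c.pdf"):
        (Path(tmp) / name).touch()
    print([Path(p).name for p in list_pdfs(tmp)])
```

The returned list can replace the files list in the merge example, so the documents are appended in a predictable order.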

You may also like: Splitting PDF Files with Python Code

Edit PDF Content with Python

As a Python PDF editor, Spire.PDF supports a variety of content-level operations, including modifying text, images, annotations, and interactive forms.

Replace Text in a PDF

The PdfTextReplacer class can be used to find and replace text on a page. Note that precise replacement may require case-aware and layout-aware handling.

Code Example

Python

from spire.pdf import PdfDocument, PdfTextReplacer, ReplaceActionType, Color

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Iterate through the pages
for i in range(pdf.Pages.Count):
    page = pdf.Pages.get_Item(i)
    # Create a PdfTextReplacer object
    replacer = PdfTextReplacer(page)
    # Set the replacement options
    replacer.Options.ReplaceType = ReplaceActionType.IgnoreCase
    # Replace the text
    replacer.ReplaceAllText("drones", "ROBOTS", Color.get_Aqua()) # Setting the color is optional

# Save the PDF file
pdf.SaveToFile("output/ReplaceText.pdf")
pdf.Close()

Result:

Text replaced in a PDF file using Python with Spire.PDF

Replace Images in a PDF

Spire.PDF for Python provides the PdfImageHelper class to help you replace images in a PDF file with ease. By retrieving image information from a specific page, you can use the ReplaceImage() method to directly substitute the original image with a new one.

Code Example

Python

from spire.pdf import PdfDocument, PdfImageHelper, PdfImage

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Get a page
page = pdf.Pages.get_Item(0)

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()
# Get the image info of the first image on the page
imageInfo = imageHelper.GetImagesInfo(page)[0]
# Load a new image
newImage = PdfImage.FromFile("Image.png")
# Replace the image
imageHelper.ReplaceImage(imageInfo, newImage)

# Save the PDF file
pdf.SaveToFile("output/ReplaceImage.pdf")
pdf.Close()

Result:

Image replacement in a PDF document using Spire.PDF for Python

Add Comments or Notes

To add comments or notes with Python, use the PdfTextMarkupAnnotation class and add it to the page’s AnnotationsWidget collection.

Code Example

Python

from spire.pdf import PdfDocument, PdfTextFinder, PdfTextMarkupAnnotation, PdfRGBColor, Color

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Get a page
page = pdf.Pages.get_Item(0)

# Create a PdfTextFinder instance and set the options
finder = PdfTextFinder(page)
finder.Options.Parameter.IgnoreCase = False
finder.Options.Parameter.WholeWord = True
# Find the text to comment
text = finder.Find("redefining entire industries")[0]

# Get the bound of the text
bound = text.Bounds[0]

# Add comment
commentText = ("This is a powerful expression, but a bit vague. "
                "You might consider specifying which industries are "
                "being redefined and how, to make the claim more "
                "concrete and credible.")
comment = PdfTextMarkupAnnotation("Commenter", commentText, bound)
comment.TextMarkupColor = PdfRGBColor(Color.get_Yellow())
page.AnnotationsWidget.Add(comment)

# Save the PDF file
pdf.SaveToFile("output/CommentNote.pdf")
pdf.Close()

Result:

Comment added to PDF using Python Spire.PDF annotations

Edit or Read Form Fields

Spire.PDF for Python allows you to programmatically fill out and read form fields in a PDF document. By accessing the FieldsWidget property of a PdfFormWidget object, you can iterate through all interactive form elements, such as text boxes, combo boxes, and checkboxes, and update or extract their values.

Code Example

Python

from spire.pdf import PdfDocument, PdfFormWidget, PdfComboBoxWidgetFieldWidget, PdfCheckBoxWidgetFieldWidget, PdfTextBoxFieldWidget

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("EmployeeInformationForm.pdf")

forms = pdf.Form
formWidgets = PdfFormWidget(forms).FieldsWidget

# Fill the forms
for i in range(formWidgets.Count):
    formField = formWidgets.get_Item(i)
    if formField.Name == "FullName":
        textBox = PdfTextBoxFieldWidget(formField)
        textBox.Text = "Amanda Ray Thompson"
    elif formField.Name == "DateOfBirth":
        textBox = PdfTextBoxFieldWidget(formField)
        textBox.Text = "01/01/1980"
    elif formField.Name == "Gender":
        comboBox = PdfComboBoxWidgetFieldWidget(formField)
        comboBox.SelectedIndex = [1]
    elif formField.Name == "Department":
        formField.Value = "Human Resources"
    elif formField.Name == "AgreeTerms":
        checkBox = PdfCheckBoxWidgetFieldWidget(formField)
        checkBox.Checked = True

# Read the forms
formValues = []

for i in range(formWidgets.Count):
    formField = formWidgets.get_Item(i)
    if isinstance(formField, PdfTextBoxFieldWidget):
        formValues.append(formField.Name + ": " + formField.Text)
    elif isinstance(formField, PdfComboBoxWidgetFieldWidget):
        formValues.append(formField.Name + ": " + formField.SelectedValue)
    elif isinstance(formField, PdfCheckBoxWidgetFieldWidget):
        formValues.append(formField.Name + ": " + str(formField.Checked))

# Write the form values to a file
with open("output/FormValues.txt", "w") as file:
    file.write("\n".join(formValues))

# Save the PDF file
pdf.SaveToFile("output/FilledForm.pdf")
pdf.Close()

Result:

PDF form fields filled and retrieved programmatically with Python and Spire.PDF

Explore more: How to Insert Page Numbers to PDF Using Python

Manage PDF Security with Python

PDF security editing is essential when dealing with sensitive documents. Spire.PDF supports encryption, password protection, digital signature handling, and permission settings.

Add a Password and Set Permissions

The Encrypt() method lets you secure a PDF with user/owner passwords and define allowed actions like printing or copying.

Code Example

Python

from spire.pdf import PdfDocument, PdfEncryptionAlgorithm, PdfDocumentPrivilege, PdfPasswordSecurityPolicy

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("EmployeeInformationForm.pdf")

# Create a PdfPasswordSecurityPolicy object and set the passwords and encryption algorithm
securityPolicy = PdfPasswordSecurityPolicy("userPSD", "ownerPSD")
securityPolicy.EncryptionAlgorithm = PdfEncryptionAlgorithm.AES_128

# Set the document privileges
pdfPrivileges = PdfDocumentPrivilege.ForbidAll()
pdfPrivileges.AllowPrint = True
pdfPrivileges.AllowFillFormFields = True
# Apply the document privileges
securityPolicy.DocumentPrivilege = pdfPrivileges

# Encrypt the PDF with the security policy
pdf.Encrypt(securityPolicy)

# Save the PDF file
pdf.SaveToFile("output/EncryptedForm.pdf")
pdf.Close()

Result:

Encrypted PDF file with password using Spire.PDF for Python
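
The example above hard-codes both passwords for brevity. In a real script you may prefer to keep them out of the source file. The sketch below reads them from environment variables; the variable names PDF_USER_PASSWORD and PDF_OWNER_PASSWORD and the helper load_pdf_passwords are hypothetical choices for illustration.

```python
import os


def load_pdf_passwords(env=os.environ):
    """Read the user and owner passwords from (hypothetical) environment variables."""
    user = env.get("PDF_USER_PASSWORD", "")
    owner = env.get("PDF_OWNER_PASSWORD", "")
    if not owner:
        raise ValueError("PDF_OWNER_PASSWORD must be set")
    if user == owner:
        raise ValueError("user and owner passwords should differ")
    return user, owner
```

The returned pair can be passed to PdfPasswordSecurityPolicy() in place of the literal strings.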

Remove the Password from a PDF

To remove protection from a file, provide its password when calling LoadFromFile() (the example below uses the owner password), call Decrypt() to decrypt the document, and save it again unprotected.

Code Example

Python

from spire.pdf import PdfDocument

# Load the encrypted PDF file with the owner password
pdf = PdfDocument()
pdf.LoadFromFile("output/EncryptedForm.pdf", "ownerPSD")

# Decrypt the PDF file
pdf.Decrypt()

# Save the PDF file
pdf.SaveToFile("output/DecryptedForm.pdf")
pdf.Close()

Edit PDF Properties with Python

Use Spire.PDF to read and edit PDF metadata and viewer preferences—key features for document presentation and organization.

Update Document Metadata

Update metadata such as title, author, or subject via the DocumentInformation property of the PDF document.

Code Example

Python

from spire.pdf import PdfDocument

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("EmployeeInformationForm.pdf")

# Set document metadata
pdf.DocumentInformation.Author = "John Doe"
pdf.DocumentInformation.Title = "Employee Information Form"
pdf.DocumentInformation.Producer = "Spire.PDF"

# Save the PDF file
pdf.SaveToFile("output/EditProperties.pdf")
pdf.Close()

Result:

PDF metadata edited using Python Spire.PDF API

Set View Preferences

The ViewerPreferences property allows you to customize how a PDF is displayed when opened (e.g., the page layout and whether the toolbar and menu bar are hidden).

Code Example

Python

from spire.pdf import PdfDocument, PdfPageLayout, PrintScalingMode

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("EmployeeInformationForm.pdf")

# Set the viewer preferences
pdf.ViewerPreferences.DisplayTitle = True
pdf.ViewerPreferences.HideToolbar = True
pdf.ViewerPreferences.HideWindowUI = True
pdf.ViewerPreferences.FitWindow = False
pdf.ViewerPreferences.HideMenubar = True
pdf.ViewerPreferences.PrintScaling = PrintScalingMode.AppDefault
pdf.ViewerPreferences.PageLayout = PdfPageLayout.OneColumn

# Save the PDF file
pdf.SaveToFile("output/EditViewerPreference.pdf")
pdf.Close()

Result:

PDF viewer preferences set using Spire.PDF for Python

Conclusion

Editing PDFs using Python is both practical and efficient with Spire.PDF for Python. Whether you're building automation tools, editing digital forms, or securing sensitive reports, Spire.PDF equips you with a comprehensive suite of editing features—all accessible via clean and simple Python code.

With capabilities that span content editing, form interaction, document structuring, and security control, this Python PDF editor is a go-to solution for developers and organizations aiming to streamline their PDF workflows.

Frequently Asked Questions

Q: Can I edit a PDF using Python?

A: Yes, Python offers powerful libraries like Spire.PDF for Python that enable you to edit text, images, forms, annotations, and even security settings in a PDF file.

Q: How to edit a PDF using coding?

A: By using libraries such as Spire.PDF for Python, you can load an existing PDF, modify its content or structure programmatically, and save the changes with just a few lines of code.

Q: What is the Python library for PDF editor?

A: Spire.PDF for Python is a popular choice. It offers comprehensive functionalities for creating, reading, editing, converting, and securing PDF documents without the need for additional software.

Q: Can I modify a PDF for free?

A: Yes, you can use the free edition of Spire.PDF for Python to edit PDF files, although it comes with some limitations, such as processing up to 10 pages per document. Additionally, you can apply for a 30-day temporary license that removes all limitations and watermarks for full functionality testing.

Published in Document Operation

Tagged under

pdf Python Conversion

Detect and Remove Blank Pages from PDF Files in Python

2025-04-27 02:26:16 Written by Administrator

PDF documents may occasionally include blank pages. These pages can affect the reading experience, increase the file size and lead to paper waste during printing. To improve the professionalism and usability of a PDF document, detecting and removing blank pages is an essential step.

This article shows how to accurately detect and remove blank pages—including those that appear empty but actually contain invisible elements—using Python, Spire.PDF for Python, and Pillow.

Install Required Libraries

This tutorial requires two Python libraries:

Spire.PDF for Python: Used for loading PDFs and detecting/removing blank pages.
Pillow: A library for image processing that helps detect visually blank pages, which may contain invisible content.

You can easily install both libraries using pip:

Package Manager

pip install Spire.PDF Pillow

Need help installing Spire.PDF? Refer to this guide:

How to Install Spire.PDF for Python on Windows

How to Effectively Detect and Remove Blank Pages from PDF Files in Python

Spire.PDF provides a method called PdfPageBase.IsBlank() to check if a page is completely empty. However, some pages may appear blank but actually contain hidden content like white text, watermarks, or background images. These cannot be reliably detected using the PdfPageBase.IsBlank() method alone.

To ensure accuracy, this tutorial adopts a two-step detection strategy:

Use the PdfPageBase.IsBlank() method to identify and remove fully blank pages.
Convert non-blank pages to images and analyze them using Pillow to determine if they are visually blank.

⚠️ Important:

If you don’t use a valid license during the PDF-to-image conversion, an evaluation watermark will appear on the image, potentially affecting the blank page detection.

Contact the E-iceblue sales team to request a temporary license for proper functionality.

Steps to Detect and Remove Blank Pages from PDF in Python

Follow these steps to implement blank page detection and removal in Python:

1. Define a custom is_blank_image() Method

This custom function uses Pillow to check whether the converted image of a PDF page is blank (i.e., if all pixels are white).

2. Load the PDF Document

Load the PDF using the PdfDocument.LoadFromFile() method.

3. Iterate Through Pages

Loop through each page to check if it’s blank using two methods:

If the PdfPageBase.IsBlank() method returns True, remove the page directly.
If not, convert the page to an image using the PdfDocument.SaveAsImage() method and analyze it with the custom is_blank_image() method.

4. Save the Result PDF

Finally, save the PDF with blank pages removed using the PdfDocument.SaveToFile() method.

Code Example

Python

import io
from spire.pdf.common import *
from spire.pdf import *
from PIL import Image

# Apply the License Key
License.SetLicenseKey("License-Key")

# Custom function: Check if the image is blank (whether all pixels are white)
def is_blank_image(image):
    # Convert to RGB mode so every pixel is an (R, G, B) tuple
    img = image.convert("RGB")
    # Get all pixel points and check if they are all white
    white_pixel = (255, 255, 255)
    return all(pixel == white_pixel for pixel in img.getdata())

# Load the PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample1111.pdf")

# Iterate through each page in reverse order to avoid index issues when deleting
for i in range(pdf.Pages.Count - 1, -1, -1):
    page = pdf.Pages[i]
    # Check if the current page is completely blank
    if page.IsBlank():
        # If it's completely blank, remove it directly from the document
        pdf.Pages.RemoveAt(i)
    else:
        # Convert the current page to an image
        with pdf.SaveAsImage(i) as image_data:
            image_bytes = image_data.ToArray()
            pil_image = Image.open(io.BytesIO(image_bytes))
            # Check if the image is blank
            if is_blank_image(pil_image):
                # If it's a blank image, remove the corresponding page from the document
                pdf.Pages.RemoveAt(i)

# Save the resulting PDF
pdf.SaveToFile("RemoveBlankPages.pdf")
pdf.Close()
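
The is_blank_image() check above requires every pixel to be pure white, which may be too strict for scanned pages with a few specks of noise or a slightly off-white background. A tolerant variant is sketched below. It works on any iterable of RGB tuples, and the function name and both threshold defaults are hypothetical values that would need tuning for real documents.

```python
def is_nearly_blank(pixels, white_threshold=245, max_dark_ratio=0.001):
    """Return True when almost every RGB pixel is at or above white_threshold."""
    total = 0
    dark = 0
    for r, g, b in pixels:
        total += 1
        # A pixel counts as "dark" if any channel falls below the threshold
        if min(r, g, b) < white_threshold:
            dark += 1
    return total > 0 and dark / total <= max_dark_ratio


# A 100x100 "page" with three specks of scanner noise
noisy_page = [(255, 255, 255)] * 9997 + [(120, 120, 120)] * 3
print(is_nearly_blank(noisy_page))
```

With Pillow, the pixels argument could be supplied as pil_image.convert("RGB").getdata(), mirroring the call used in is_blank_image().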

Frequently Asked Questions (FAQs)

Q1: What is considered a blank page in a PDF file?

A: A blank page may be truly empty or contain hidden elements such as white text, watermarks, or transparent objects. This solution detects both types using a dual-check strategy.

Q2: Can I use this method without a Spire.PDF license?

A: Yes, you can run it without a license. However, during PDF-to-image conversion, an evaluation watermark will be added to the output images, which may affect the accuracy of blank page detection. It's best to request a free temporary license for testing.

Q3: What versions of Python are compatible with Spire.PDF?

A: Spire.PDF for Python supports Python 3.7 and above. Ensure that Pillow is also installed to perform image-based blank page detection.

Q4: Can I modify the script to only detect blank pages without deleting them?

A: Absolutely. Replace both pdf.Pages.RemoveAt(i) calls with print() or logging statements that record the page index, so detected blank pages are listed for review instead of deleted.

Conclusion

Removing unnecessary blank pages from PDF files is an important step in optimizing documents for readability, file size, and professional presentation. With the combined power of Spire.PDF for Python and Pillow, developers can precisely identify both completely blank pages and pages that appear empty but contain invisible content. Whether you're generating reports, cleaning scanned files, or preparing documents for print, this Python-based solution ensures clean and efficient PDFs.

Get a Free License

To fully experience the capabilities of Spire.PDF for Python without any evaluation limitations, you can request a free 30-day trial license.

Published in Page Setting

Tagged under

pdf Python Page Setting

Add a Timestamped Digital Signature to PDF in Python

2025-04-18 07:51:17 Written by Koohji

Introduction

Digital signatures help verify the authenticity and integrity of PDF documents. However, if a signing certificate expires or is revoked, the signature alone may no longer be considered valid. To solve this, a timestamp can be added to the digital signature, proving that the document was signed at a specific point in time, as validated by a trusted Time Stamp Authority (TSA).

In this tutorial, we will introduce how to use the Spire.PDF for Python library to digitally sign a PDF document with a timestamp in Python.

Prerequisites

To follow this tutorial, ensure you have the following:

Spire.PDF for Python library

You can install Spire.PDF for Python via pip:

Package Manager

pip install Spire.PDF

If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

A valid digital certificate (.pfx file)

This certificate is used to create the digital signature.

A sample PDF file

This is the document you want to sign.

An image to display as the signature appearance (optional)

For visual representation of the signer.

A reliable Time Stamp Authority (TSA) URL

This provides the timestamp token during signing.

How to Digitally Sign a PDF with a Timestamp in Python

In Spire.PDF for Python, the Security_PdfSignature class is used to create a digital signature, and the ConfigureTimestamp(tsaUrl) method in this class is used to embed a timestamp into the signature. The tsaUrl parameter specifies the address of the TSA server.

Steps to Add a Timestamped Digital Signature

Follow these steps to add a timestamped digital signature to a PDF in Python using Spire.PDF for Python:

Create a PdfDocument instance and use the LoadFromFile() method to load the PDF you want to sign.
Create a Security_PdfSignature object, specifying the target page, certificate file path, certificate password, and signature name.
Configure the signature's appearance, including its position, size, display labels, and signature image.
Embed a timestamp by calling the ConfigureTimestamp(tsaUrl) method with a valid Time Stamp Authority (TSA) URL.
Save the signed PDF using the SaveToFile() method.

Code Example

Python

from spire.pdf import *

inputFile = "Sample.pdf"
inputFile_pfx = "gary.pfx"
inputImage = "E-iceblueLogo.png"
outputFile = "SignWithTimestamp.pdf"

# Create a PdfDocument instance and load the PDF file to be signed
doc = PdfDocument()
doc.LoadFromFile(inputFile)

# Create a digital signature object by specifying the document, target page, certificate file path, certificate password, and signature name
signature = Security_PdfSignature(doc, doc.Pages.get_Item(0), inputFile_pfx, "e-iceblue", "signature")

# Define the position and size of the signature on the page (unit: point)
signature.Bounds = RectangleF(PointF(90.0, 600.0), SizeF(180.0, 90.0))

# Set the labels and content for the signature details
signature.NameLabel = "Digitally signed by: "
signature.Name = "Gary"
signature.LocationInfoLabel = "Location: "
signature.LocationInfo = "CN"
signature.ReasonLabel = "Reason: "
signature.Reason = "Ensure authenticity"
signature.ContactInfoLabel = "Contact Number: "
signature.ContactInfo = "028-81705109"

# Set document permissions: allow form filling, forbid further changes
signature.DocumentPermissions = PdfCertificationFlags.AllowFormFill.value | PdfCertificationFlags.ForbidChanges.value

# Set the graphic mode to include both image and signature details,
# and set the signature image
signature.GraphicsMode = Security_GraphicMode.SignImageAndSignDetail
signature.SignImageSource = PdfImage.FromFile(inputImage)

# Embed a timestamp into the signature using a Time Stamp Authority (TSA) server
url = "http://tsa.cesnet.cz:3161/tsa"
signature.ConfigureTimestamp(url)

# Save the signed PDF and close the document
doc.SaveToFile(outputFile)
doc.Close()
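
Because the timestamp depends on an external server, it can be useful to sanity-check the TSA address before signing. The sketch below only validates the shape of the URL with the standard library and does not contact the server; is_valid_tsa_url is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_valid_tsa_url(url):
    """Return True for an absolute http(s) URL that includes a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.hostname)


print(is_valid_tsa_url("http://tsa.cesnet.cz:3161/tsa"))
print(is_valid_tsa_url("tsa.cesnet.cz:3161/tsa"))
```

A check like this catches a missing scheme or host early, before the value reaches ConfigureTimestamp().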

View the Timestamp in PDF

When you open the signed PDF in a viewer like Adobe Acrobat, you can click the Signature Panel to view both the digital signature and the timestamp, which confirm the document’s validity and the signing time:

Add a Timestamped Digital Signature to PDF in Python

Get a Free License

To fully experience the capabilities of Spire.PDF for Python without any evaluation limitations, you can request a free 30-day trial license.

Conclusion

Timestamping enhances the reliability of digital signatures by proving when a PDF was signed, even after the certificate has expired. With Spire.PDF for Python, implementing a timestamped digital signature is a straightforward process. Whether you're handling contracts, invoices, or confidential records, this approach ensures long-term document validity and compliance.

Published in Security

Tagged under

pdf Python Security

Python: Convert PDF to Postscript or PCL

2025-04-18 01:34:56 Written by Koohji

PostScript, developed by Adobe, is a page description language known for its high-quality graphics and text rendering capabilities. By converting PDF to PostScript, you gain precise control over complex graphics, fonts, and colors when printing brochures, magazines, advertisements, or other materials.

PCL, on the other hand, is a printer control language developed by Hewlett-Packard. It is designed to be efficient and easy for printers to interpret. Converting PDF to PCL ensures compatibility with a large number of printers and also optimizes printing speed for text-heavy documents such as academic reports, letters, or contracts.

This article will demonstrate how to convert PDF to PS or PDF to PCL in Python using Spire.PDF for Python.

Convert PDF to PostScript in Python
Convert PDF to PCL in Python

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python. It can be installed on Windows with the following pip command.

Package Manager

pip install Spire.PDF

If you are unsure how to install it, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Convert PDF to PostScript in Python

Converting PDF to PS can improve the quality of the printed output. Spire.PDF for Python allows you to load a PDF file and then convert it to PS format using the PdfDocument.SaveToFile(filename: string, FileFormat.POSTSCRIPT) method. The following are the detailed steps.

Create a PdfDocument instance.
Load a PDF file using PdfDocument.LoadFromFile() method.
Save the PDF file to PostScript format using PdfDocument.SaveToFile(filename: string, FileFormat.POSTSCRIPT) method.

Python

from spire.pdf.common import *
from spire.pdf import *

# Specify the input and output file paths
inputFile = "input1.pdf"
outputFile = "PdfToPostScript.ps"

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile(inputFile)

# Convert the PDF to a PostScript file
pdf.SaveToFile(outputFile, FileFormat.POSTSCRIPT)
pdf.Close()

Convert a PDF file to a PostScript file
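If a whole folder of PDFs needs converting, the three steps above can be wrapped in a small loop. The following is only a sketch, not part of Spire.PDF: batch_convert and spire_to_postscript are hypothetical helper names, and the converter is passed in as a callable so the folder handling stays independent of the library. It assumes PdfDocument and FileFormat can be imported by name from spire.pdf, as the star-imports above suggest.

```python
from pathlib import Path
from typing import Callable, List


def batch_convert(input_dir: str, output_dir: str, extension: str,
                  convert: Callable[[str, str], None]) -> List[Path]:
    """Call convert(src, dst) for every *.pdf in input_dir; return the output paths."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
    written = []
    # sorted() gives a stable, predictable processing order
    for src in sorted(Path(input_dir).glob("*.pdf")):
        dst = out_dir / (src.stem + extension)
        convert(str(src), str(dst))
        written.append(dst)
    return written


def spire_to_postscript(src: str, dst: str) -> None:
    """Hypothetical wrapper around the load / save / close steps shown above."""
    # Imported lazily so the folder logic can be used without Spire.PDF installed
    from spire.pdf import PdfDocument, FileFormat
    pdf = PdfDocument()
    pdf.LoadFromFile(src)
    pdf.SaveToFile(dst, FileFormat.POSTSCRIPT)
    pdf.Close()
```

A call such as batch_convert("pdfs", "ps_out", ".ps", spire_to_postscript) would then write one .ps file per input PDF; the folder names here are placeholders.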

Convert PDF to PCL in Python

Converting PDF to PCL can ensure faster printing speed. By using the PdfDocument.SaveToFile(filename: string, FileFormat.PCL) method, you can save a loaded PDF file as a PCL file. The following are the detailed steps.

Create a PdfDocument instance.
Load a PDF file using PdfDocument.LoadFromFile() method.
Save the PDF file to PCL format using PdfDocument.SaveToFile(filename: string, FileFormat.PCL) method.

Python

from spire.pdf.common import *
from spire.pdf import *

# Specify the input and output file paths
inputFile = "input1.pdf"
outputFile = "ToPCL\\PdfToPcl.pcl"

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile(inputFile)

# Convert the PDF to a PCL file
pdf.SaveToFile(outputFile, FileFormat.PCL)
pdf.Close()

A PCL file converted from a PDF file
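The output path in the example uses a Windows-style separator and points into a ToPCL folder. This article does not say whether SaveToFile creates missing folders, so a cautious option is to create the folder first. The sketch below shows one way to do that with the standard library; prepare_output_path is a hypothetical helper name, not a Spire.PDF function.

```python
from pathlib import Path


def prepare_output_path(folder: str, filename: str) -> str:
    """Return folder/filename as a string, creating the folder if it is missing."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    # Path joins with the separator of the current operating system
    return str(out_dir / filename)
```

The result of prepare_output_path("ToPCL", "PdfToPcl.pcl") can then be used as outputFile in the code above.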

Get a Free License

To fully experience the capabilities of Spire.PDF for Python without any evaluation limitations, you can request a free 30-day trial license.

Published in Conversion

Tagged under

pdf Python Conversion

Python: Check if a PDF is Password Protected and Determine the Correct Password

2025-03-19 01:06:07 Written by Koohji

When working with PDF files, you may encounter documents that are password protected. This means that you cannot view or edit the content without entering the correct password. Knowing how to check whether a PDF is password protected and how to determine the correct password is essential for accessing important information. In this guide, we will show how to do both using Python and the Spire.PDF for Python library.

Check if a PDF is Password Protected
Determine the Correct Password for a PDF

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. They can be installed on Windows with the following pip command.

Package Manager

pip install Spire.PDF

If you are unsure how to install it, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Check if a PDF is Password Protected

Spire.PDF for Python offers the PdfDocument.IsPasswordProtected(fileName: str) method to check if a PDF file is password protected. The detailed steps are as follows.

Specify the input and output file paths.
Check if the PDF file is password protected or not using the PdfDocument.IsPasswordProtected() method.
Save the result to a text file.

Python

from spire.pdf import *

# Specify the input and output file paths
inputFile = "Secured.pdf"
outputFile = "CheckPasswordProtection.txt"

# Check if the input PDF file is password protected
isProtected = PdfDocument.IsPasswordProtected(inputFile)

# Write the result into a text file
with open(outputFile, "w") as fp:
    fp.write("The PDF is " + ("password protected!" if isProtected else "not password protected!"))

Check if a PDF is Password Protected
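The same check extends naturally to several files. As a rough sketch, the hypothetical helpers protection_report and format_report below collect one result per path and render them in the wording used above. The checking function is passed in as an argument, so PdfDocument.IsPasswordProtected can be supplied when Spire.PDF is installed.

```python
from typing import Callable, Dict, Iterable


def protection_report(paths: Iterable[str],
                      is_protected: Callable[[str], bool]) -> Dict[str, bool]:
    """Map each path to True (password protected) or False."""
    return {path: bool(is_protected(path)) for path in paths}


def format_report(report: Dict[str, bool]) -> str:
    """Render one line per file, mirroring the wording of the example above."""
    lines = []
    for path, protected in report.items():
        status = "password protected!" if protected else "not password protected!"
        lines.append(f"{path} is {status}")
    return "\n".join(lines)
```

For example, protection_report(["Secured.pdf", "Other.pdf"], PdfDocument.IsPasswordProtected) would check both files, where Other.pdf is a placeholder name.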

Determine the Correct Password for a PDF

While Spire.PDF for Python does not provide a direct method to check if a password is correct, you can achieve this by attempting to load the PDF with the password and catching exceptions. If the password is incorrect, an exception will be thrown. The detailed steps are as follows.

Specify the input and output file paths.
Create a list of potential passwords to test.
Iterate through the list and load the PDF with each password using the PdfDocument.LoadFromFile(filename: str, password: str) method.
If no exception is thrown, the password is correct. Otherwise, the password is incorrect.
Save the results to a text file.

Python

from spire.pdf import *

# Specify the input and output file paths
inputFile = "Secured.pdf"
outputFile = "DetermineCorrectPassword.txt"

# Create a list of potential passwords to test
passwords = ["password1", "password2", "password3", "test", "sample"]

# Create a text file to store the results
with open(outputFile, "w") as fp:
    for value in passwords:
        try:
            # Load the PDF with the current password
            doc = PdfDocument()
            doc.LoadFromFile(inputFile, value)
            # If successful, write that the password is correct
            fp.write(f'Password "{value}" is correct\n')
            # Release the document once the check has succeeded
            doc.Close()
        except SpireException:
            # If an exception occurs, write that the password is not correct
            fp.write(f'Password "{value}" is not correct\n')

Determine the Correct Password for a PDF
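When only the working password matters, the loop can stop at the first success instead of reporting every candidate. The sketch below illustrates that variation; find_password and spire_try_open are hypothetical names, and the opener is injected so the search logic does not depend on Spire.PDF. It catches Exception broadly for simplicity, which means other failures (such as a missing file) are also treated as a wrong password. Catching SpireException as in the example above would be narrower.

```python
from typing import Callable, Iterable, Optional


def find_password(candidates: Iterable[str],
                  try_open: Callable[[str], None]) -> Optional[str]:
    """Return the first candidate for which try_open() raises nothing, else None."""
    for candidate in candidates:
        try:
            try_open(candidate)
        except Exception:
            # Treated as "wrong password"; move on to the next candidate
            continue
        return candidate
    return None


def spire_try_open(path: str) -> Callable[[str], None]:
    """Hypothetical adapter: build a try_open function for one PDF file."""
    def _open(password: str) -> None:
        # Imported lazily so find_password can be used without Spire.PDF installed
        from spire.pdf import PdfDocument
        doc = PdfDocument()
        doc.LoadFromFile(path, password)  # raises if the password is wrong, per the article
        doc.Close()
    return _open
```

With the list from the example above, find_password(passwords, spire_try_open("Secured.pdf")) would return the matching password, or None if no candidate opens the file.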

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to lift the function limitations, you can request a 30-day trial license.

Published in Security

Tagged under

pdf Python Security
