Spire.Office Knowledgebase Page 45 | E-iceblue

Compress PDF in Python using Spire.PDF

Large PDF files can slow down email delivery, break upload limits, and consume unnecessary storage. This is especially common in PDFs that include high-resolution scans, images, or embedded fonts. If you're working with Python and need to automate PDF compression without compromising quality, this guide will help you get started.

In this tutorial, you’ll learn how to compress PDF files in Python using the Spire.PDF for Python library. We'll cover several effective techniques, including image recompression, font optimization, metadata removal, and batch compression—perfect for web, backend, or desktop applications.

Table of Contents

Common Scenarios Requiring PDF Compression

Reducing the size of PDF documents is often essential in the following situations:

Use Case Benefit
Email Attachments Avoid size limits and improve deliverability
Web Uploads Reduce upload time and server storage
Mobile Access Faster loading and less data consumption
Cloud Archiving Lower storage cost for backups
App Submissions Meet strict file size limits

Prerequisites

Before you begin compressing PDFs with Python, make sure the following requirements are met:

  • Python 3.7 or above
    Ensure that Python (version 3.7 or later) is installed on your system. You can download it from the official Python website.
  • Spire.PDF for Python
    This is a powerful PDF library that allows you to programmatically create, manipulate, and compress PDF documents—without relying on external software like Adobe Acrobat.

To install Spire.PDF for Python, run the following command in your terminal or command prompt:

pip install spire.pdf

Need help with the installation? See our step-by-step guide: How to Install Spire.PDF for Python on Windows_

Practical PDF Compression Techniques in Python

In this section, you'll explore five practical techniques for reducing PDF file size:

  • Font compression and unembedding
  • Image compression
  • Full-document compression
  • Metadata and attachment removal
  • Batch compressing multiple PDFs

Font Compression and Unembedding

Fonts embedded in a PDF—especially those from large font libraries or multilingual character sets—can significantly increase the file size. Spire.PDF allows you to:

  • Compress embedded fonts to minimize space usage
  • Unembed fonts that are not essential for rendering
from spire.pdf import *

# Create a PdfCompressor object and load the PDF file
compressor = PdfCompressor("C:/Users/Administrator/Documents/Example.pdf")

# Get the OptimizationOptions object
compression_options = compressor.OptimizationOptions

# Enable font compression
compression_options.SetIsCompressFonts(True)

# Optional: unembed fonts to further reduce size
# compression_options.SetIsUnembedFonts(True)

# Compress the PDF and save the result
compressor.CompressToFile("CompressFonts.pdf")

Image Compression

Spire.PDF lets you reduce the size of all images in a PDF by creating a PdfCompressor instance, enabling the image resizing and compression options, and specifying the image quality level. This approach applies compression uniformly across the entire document.

from spire.pdf import *

# Create a PdfCompressor object and load the PDF file
compressor = PdfCompressor("C:/Users/Administrator/Documents/Example.pdf")

# Get the OptimizationOptions object
compression_options = compressor.OptimizationOptions

# Enable image resizing
compression_options.SetResizeImages(True)

# Enable image compression
compression_options.SetIsCompressImage(True)

# Set image quality (available options: Low, Medium, High)
compression_options.SetImageQuality(ImageQuality.Medium)

# Compress and save the PDF file
compressor.CompressToFile("Compressed.pdf")

Compress PDF Images in Python with Spire.PDF

Full Document Compression

Beyond optimizing individual elements, Spire.PDF also supports full-document compression. By adjusting the document's CompressionLevel and disabling incremental updates, you can apply comprehensive optimization to reduce overall file size.

from spire.pdf import *

# Create a PdfDocument object
pdf = PdfDocument()

# Load the PDF file
pdf.LoadFromFile("C:/Users/Administrator/Documents/Example.pdf")

# Disable incremental update
pdf.FileInfo.IncrementalUpdate = False

# Set the compression level to the highest
pdf.CompressionLevel = PdfCompressionLevel.Best

# Save the optimized PDF
pdf.SaveToFile("OptimizeDocumentContent.pdf")
pdf.Close()

Removing Metadata and Attachments

Cleaning up metadata and removing embedded attachments is a quick way to reduce PDF size. Spire.PDF lets you remove unnecessary information like author/title fields and attached files:

from spire.pdf import *

# Load the PDF
pdf = PdfDocument()
pdf.LoadFromFile("Example.pdf")

# Disable the incremental update
pdf.FileInfo.IncrementalUpdate = False

# Remove metadata
pdf.DocumentInformation.Author = ""
pdf.DocumentInformation.Title = ""

# Remove attachments
pdf.Attachments.Clear()

# Save the optimized PDF
pdf.SaveToFile("Cleaned.pdf")
pdf.Close()

Batch Compressing Multiple PDFs

You can compress multiple PDFs at once by looping through files in a folder and applying the same optimization settings:

import os
from spire.pdf import *

# Folder containing the PDF files to compress
input_folder = "C:/PDFs/" 

# Loop through all files in the input folder
for file in os.listdir(input_folder):
    # Process only PDF files
    if file.endswith(".pdf"):  
        # Create a PdfCompressor instance and load the file
        compressor = PdfCompressor(os.path.join(input_folder, file))
        
        # Access compression options
        opt = compressor.OptimizationOptions        
        # Enable image resizing
        opt.SetResizeImages(True)        
        # Enable image compression
        opt.SetIsCompressImage(True)        
        # Set image quality to medium (options: Low, Medium, High)
        opt.SetImageQuality(ImageQuality.Medium)
        
        # Define output file path with "compressed_" prefix
        output_path = os.path.join(input_folder, "compressed_" + file)        
        # Perform compression and save the result
        compressor.CompressToFile(output_path)

Summary

Reducing the size of PDF files is a practical step toward faster workflows, especially when dealing with email sharing, web uploads, and large-scale archiving. With Spire.PDF for Python, developers can implement smart compression techniques—ranging from optimizing images and fonts to stripping unnecessary elements like metadata and attachments.

Whether you're building automation scripts, integrating PDF handling into backend services, or preparing documents for long-term storage, these tools give you the flexibility to control file size without losing visual quality. By combining multiple strategies—like full-document compression and batch processing—you can keep your PDFs lightweight, efficient, and ready for distribution across platforms.

Want to explore more ways to work with PDFs in Python? Explore the full range of Spire.PDF for Python tutorials to learn how to merge/split PDFs, convert PDF to PDF/A, add password protection, and more.

Frequently Asked Questions

Q1: Can I use Spire.PDF for Python on Linux or macOS?

A1: Yes. Spire.PDF for Python is compatible with Windows, Linux, and macOS.

Q2: Is Spire.PDF for Python free?

A2: Spire.PDF for Python offers a free version suitable for small-scale and non-commercial use. For full functionality, including unrestricted use in commercial applications, a commercial version is available. You can request a free 30-day trial license to explore all its premium features.

Q3: Will compressing the PDF reduce the visual quality?

A3: Not necessarily. Spire.PDF’s compression methods are designed to preserve visual fidelity while optimizing file size. You can fine-tune image quality or leave it to the default settings.

Word documents often contain metadata known as document properties, which include information like title, author, subject, and keywords. Manipulating these properties is invaluable for maintaining organized documentation, enhancing searchability, and ensuring proper attribution in collaborative environments. With Spire.Doc for Python, developers can automate the tasks of adding, reading, and removing document properties in Word documents to streamline document management workflows and enable the integration of these processes into larger automated systems. This article provides detailed steps and code examples that demonstrate how to utilize Spire.Doc for Python to effectively manage document properties within Word files.

Install Spire.Doc for Python

This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. They can be easily installed in your Windows through the following pip commands.

pip install Spire.Doc

If you are unsure how to install, please refer to: How to Install Spire.Doc for Python on Windows

Add Built-in Document Properties to Word Documents with Python

Spire.Doc for Python provides developers with the Document.BuiltinDocumentProperties property to access the built-in properties of Word documents. The value of these properties can be set using the corresponding properties under the BuiltinDocumentProperties class.

The following steps show how to add the main built-in properties in Word documents:

  • Create an object of Document class.
  • Load a Word document using Document.LoadFromFile() method.
  • Get the built-in properties through Document.BuiltinDocumentProperties property.
  • Add values to the properties with properties under BuiltinDocumentProperties property.
  • Save the document using Document.SaveToFile() method.
  • Python
from spire.doc import *
from spire.doc.common import *

# Create an object of Document
doc = Document()

# Load a Word document
doc.LoadFromFile("Sample.docx")

# Set the built-in property
builtinProperty = doc.BuiltinDocumentProperties
builtinProperty.Title = "Revolutionizing Artificial Intelligence"
builtinProperty.Subject = "Advanced Applications and Future Directions of Neural Networks in Artificial Intelligence"
builtinProperty.Author = "Simon"
builtinProperty.Manager = "Arie"
builtinProperty.Company = "AI Research Lab"
builtinProperty.Category = "Research"
builtinProperty.Keywords = "Machine Learning, Neural Network, Artificial Intelligence"
builtinProperty.Comments = "This paper is about the state of the art of artificial intelligence."
builtinProperty.HyperLinkBase = "www.e-iceblue.com"

# Save the document
doc.SaveToFile("output/AddPropertyWord.docx", FileFormat.Docx2019)
doc.Close()

Python: Add, Read, and Remove Built-in Document Properties in Word Documents

Read Built-in Document Properties from Word Documents with Python

Besides adding values, the properties under the BuiltinDocumentProperties class also empower developers to read existing built-in properties of Word documents. This enables various functionalities like document search, information extraction, and document analysis.

The detailed steps for reading document built-in properties using Spire.Doc for Python are as follows:

  • Create an object of Document class.
  • Load a Word document using Document.LoadFromFile() method.
  • Get the built-in properties of Document using Document.BuiltinDocumentProperties property.
  • Get the value of the properties using properties under BuiltinDocumentProperties class.
  • Output the built-in properties of the document.
  • Python
from spire.doc import *
from spire.doc.common import *

# Create an object of Document
doc = Document()

# Load a Word document
doc.LoadFromFile("output/AddPropertyWord.docx")

# Get the built-in properties of the document
builtinProperties = doc.BuiltinDocumentProperties

# Get the value of the built-in properties
properties = [
    "Author: " + builtinProperties.Author,
    "Company: " + builtinProperties.Company,
    "Title: " + builtinProperties.Title,
    "Subject: " + builtinProperties.Subject,
    "Keywords: " + builtinProperties.Keywords,
    "Category: " + builtinProperties.Category,
    "Manager: " + builtinProperties.Manager,
    "Comments: " + builtinProperties.Comments,
    "Hyperlink Base: " + builtinProperties.HyperLinkBase,
    "Word Count: " + str(builtinProperties.WordCount),
    "Page Count: " + str(builtinProperties.PageCount),
]

# Output the built-in properties
for i in range(0, len(properties)):
    print(properties[i])

doc.Close()

Python: Add, Read, and Remove Built-in Document Properties in Word Documents

Remove Built-in Document Properties from Word Documents with Python

The built-in document properties of a Word document that contain specific content can be removed by setting them to null values. This protects private information while retaining necessary details.

The detailed steps for removing specific built-in document properties from Word documents are as follows:

  • Create an object of Document class.
  • Load a Word document using Document.LoadFromFile() method.
  • Get the built-in properties of the document through Document.BuiltinDocumentProperties property.
  • Set the value of some properties to none to remove the properties with properties under BuiltinDocumentProperties class.
  • Save the document using Document.SaveToFile() method.
  • Python
from spire.doc import *
from spire.doc.common import *

# Create an instance of the Document class
doc = Document()

# Load the Word document
doc.LoadFromFile("output/AddPropertyWord.docx")

# Get the document's built-in properties
builtinProperties = doc.BuiltinDocumentProperties

# Remove the built-in properties by setting them to None
builtinProperties.Author = None
builtinProperties.Company = None
builtinProperties.Title = None
builtinProperties.Subject = None
builtinProperties.Keywords = None
builtinProperties.Comments = None
builtinProperties.Category = None
builtinProperties.Manager = None

# Save the document
doc.SaveToFile("output/RemovePropertyWord.docx", FileFormat.Docx)
doc.Close()

Python: Add, Read, and Remove Built-in Document Properties in Word Documents

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Extract Tables from PDF Using Python – Spire.PDF

Extracting tables from PDF using Python typically involves understanding how content is visually laid out in rows and columns. Many PDF tables are defined using cell borders, making them easier to detect programmatically. In such cases, a layout-aware library that reads content positioning—rather than just raw text—is essential for accurate PDF table extraction in Python.

In this tutorial, you’ll learn a reliable method to extract tables from PDF using Python, no OCR or machine learning required. Whether your PDF contains clean grids or complex layouts, we'll show how to turn table data into structured formats like Excel or pandas DataFrames for further analysis.

Table of Contents


Handling Table Extraction from PDF in Python

Unlike Excel or CSV files, PDF documents don’t store tables as structured data. To extract tables from PDF files using Python, you need a library that can analyze the layout and detect tabular structures.

Spire.PDF for Python simplifies this process by providing built-in methods to extract tables page by page. It works best with clearly formatted tables and helps developers convert PDF content into usable data formats like Excel or CSV.

You can install the library with:

pip install Spire.PDF

Or install the free version for smaller PDF table extraction tasks:

pip install spire.pdf.free

Extracting Tables from PDF – Step-by-Step

To extract tables from a PDF file using Python, we start by loading the document and analyzing each page individually. With Spire.PDF for Python, you can detect tables based on their layout structure and extract them programmatically—even from multi-page documents.

Load PDF and Extract Tables

Here's a basic example that shows how to read tables from a PDF using Python. This method uses Spire.PDF to extract each table from the document page by page, making it ideal for developers who want to programmatically extract tabular data from PDFs.

from spire.pdf import PdfDocument, PdfTableExtractor

# Load PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a PdfTableExtractor object
table_extractor = PdfTableExtractor(pdf)

# Extract tables from each page
for i in range(pdf.Pages.Count):
    tables = table_extractor.ExtractTable(i)
    for table_index, table in enumerate(tables):
        print(f"Table {table_index + 1} on page {i + 1}:")
        for row in range(table.GetRowCount()):
            row_data = []
            for col in range(table.GetColumnCount()):
                text = table.GetText(row, col).replace("\n", " ")
                row_data.append(text.strip())
            print("\t".join(row_data))

This method works reliably for bordered tables. However, for tables without visible borders—especially those with multi-line cells or unmarked headers—the extractor may fail to detect the tabular structure.

The result of extracting table data from a PDF using Python and Spire.PDF is shown below:

Python Code to Read Tables from PDF with Spire.PDF

Export Tables to Excel and CSV

If you want to analyze or store the extracted PDF tables, you can convert them to Excel and CSV formats using Python. In this example, we use Spire.XLS for Python to create a spreadsheet for each table, allowing easy data processing or sharing. You can install the library from pip: pip install spire.xls.

from spire.pdf import PdfDocument, PdfTableExtractor
from spire.xls import Workbook, FileFormat

# Load PDF document
pdf = PdfDocument()
pdf.LoadFromFile("G:/Documents/Sample101.pdf")

# Set up extractor and Excel workbook
extractor = PdfTableExtractor(pdf)
workbook = Workbook()
workbook.Worksheets.Clear()

# Extract tables page by page
for page_index in range(pdf.Pages.Count):
    tables = extractor.ExtractTable(page_index)
    for t_index, table in enumerate(tables):
        sheet = workbook.Worksheets.Add(f"Page{page_index+1}_Table{t_index+1}")
        for row in range(table.GetRowCount()):
            for col in range(table.GetColumnCount()):
                text = table.GetText(row, col).replace("\n", " ").strip()
                sheet.Range.get_Item(row + 1, col + 1).Value = text
                sheet.AutoFitColumn(col + 1)

# Save all tables to one Excel file
workbook.SaveToFile("output/Sample.xlsx", FileFormat.Version2016)

As shown below, the extracted PDF tables are converted to Excel and CSV using Spire.XLS for Python.

Export PDF Tables to Excel and CSV Using Python

You may also like: How to Insert Data into Excel Files in Python


Tips to Improve PDF Table Extraction Accuracy in Python

Extracting tables from PDFs can sometimes yield imperfect results—especially when dealing with complex layouts, page breaks, or inconsistent formatting. Below are a few practical techniques to help improve table extraction accuracy in Python and get cleaner, more structured output.

1. Merging Multi-Page Tables

Spire.PDF extracts tables on a per-page basis. If a table spans multiple pages, you can combine them manually by appending the rows:

Example:

# Extract and combine tables
combined_rows = []
for i in range(start_page, end_page + 1):
    tables = table_extractor.ExtractTable(i)
    if tables:
        table = tables[0]  # Assuming one table per page
        for row in range(table.GetRowCount()):
            cells = [table.GetText(row, col).strip().replace("\n", " ") for col in range(table.GetColumnCount())]
            combined_rows.append(cells)

You can then convert combined_rows into Excel or CSV if you prefer analysis via these formats.

2. Filtering Out Empty or Invalid Rows

Tables may contain empty rows or columns, or the extractor may return blank rows depending on layout. You can filter them out before exporting.

Example:

# Step 1: Filter out empty rows
filtered_rows = []
for row in range(table.GetRowCount()):
    row_data = [table.GetText(row, col).strip().replace("\n", " ") for col in range(table.GetColumnCount())]
    if any(cell for cell in row_data):  # Skip completely empty rows
        filtered_rows.append(row_data)

# Step 2: Transpose and filter out empty columns
transposed = list(zip(*filtered_rows))
filtered_columns = [col for col in transposed if any(cell.strip() for cell in col)]

# Step 3: Transpose back to original row-column format
filtered_data = list(zip(*filtered_columns))

This helps improve accuracy when working with noisy or inconsistent layouts.


Common Questions (FAQ)

Q: Can I extract both text and tables from a PDF?

Yes, use PdfTextExtractor to retrieve the full page text and PdfTableExtractor to extract structured tables.

Q: Why aren't my tables detected?

Make sure the PDF is text-based (not scanned images) and that the layout follows a logical row-column format. Spire.PDF for Python detects only bordered tables; unbordered tables are often not recognized.

If you are handling an image-based PDF document, you can use Spire.OCR for Python to extract table data. Please refer to: How to Extract Text from Images Using Python.

Q: How to extract tables without borders from PDF documents?

Spire.PDF may have difficulty extracting tables without visible borders. If the tables are not extracted correctly, consider the following approaches:

  • Using PdfTextExtractor to extract raw text and then writing custom logic to identify rows and columns.
  • Using a large language model API (e.g., GPT) to interpret the structure from extracted plain text and return only structured table data.
  • Consider adding visible borders to tables in the original document before generating the PDF, as this makes it easier to extract them using Python code.

Q: How do I convert extracted tables to a pandas DataFrame?

While Spire.PDF doesn’t provide native DataFrame output, you can collect cell values into a list of lists and then convert:

import pandas as pd
df = pd.DataFrame(table_data)

This lets you convert PDF tables into pandas DataFrames using Python for data analysis.

Q: Is Spire.PDF for Python free to use?

Yes, there are two options available:


Conclusion

Whether you're working with structured reports, financial data, or standardized forms, extracting tables from PDFs in Python can streamline your workflow. With a layout-aware parser like Spire.PDF for Python, you can reliably detect and export tables—no OCR or manual formatting needed. By converting tables to Excel, CSV, or DataFrame, you unlock their full potential for automation and analysis.

In summary, extracting tables from PDFs in Python becomes much easier with Spire.PDF, especially when converting them into structured formats like Excel and CSV for analysis.

page 45

Coupon Code Copied!

Christmas Sale

Celebrate the season with exclusive savings

Save 10% Sitewide

Use Code:

View Campaign Details