Extract Tables from PDF Using Python - Easy Table Parsing Guide

Extract Tables from PDF Using Python – Spire.PDF

Extracting tables from PDF using Python typically involves understanding how content is visually laid out in rows and columns. Many PDF tables are defined using cell borders, making them easier to detect programmatically. In such cases, a layout-aware library that reads content positioning—rather than just raw text—is essential for accurate PDF table extraction in Python.

In this tutorial, you’ll learn a reliable method to extract tables from PDF using Python, no OCR or machine learning required. Whether your PDF contains clean grids or complex layouts, we'll show how to turn table data into structured formats like Excel or pandas DataFrames for further analysis.

Table of Contents


Handling Table Extraction from PDF in Python

Unlike Excel or CSV files, PDF documents don’t store tables as structured data. To extract tables from PDF files using Python, you need a library that can analyze the layout and detect tabular structures.

Spire.PDF for Python simplifies this process by providing built-in methods to extract tables page by page. It works best with clearly formatted tables and helps developers convert PDF content into usable data formats like Excel or CSV.

You can install the library with:

pip install Spire.PDF

Or install the free version for smaller PDF table extraction tasks:

pip install spire.pdf.free

Extracting Tables from PDF – Step-by-Step

To extract tables from a PDF file using Python, we start by loading the document and analyzing each page individually. With Spire.PDF for Python, you can detect tables based on their layout structure and extract them programmatically—even from multi-page documents.

Load PDF and Extract Tables

Here's a basic example that shows how to read tables from a PDF using Python. This method uses Spire.PDF to extract each table from the document page by page, making it ideal for developers who want to programmatically extract tabular data from PDFs.

from spire.pdf import PdfDocument, PdfTableExtractor

# Load PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a PdfTableExtractor object
table_extractor = PdfTableExtractor(pdf)

# Extract tables from each page
for i in range(pdf.Pages.Count):
    tables = table_extractor.ExtractTable(i)
    for table_index, table in enumerate(tables):
        print(f"Table {table_index + 1} on page {i + 1}:")
        for row in range(table.GetRowCount()):
            row_data = []
            for col in range(table.GetColumnCount()):
                text = table.GetText(row, col).replace("\n", " ")
                row_data.append(text.strip())
            print("\t".join(row_data))

This method works reliably for bordered tables. However, for tables without visible borders—especially those with multi-line cells or unmarked headers—the extractor may fail to detect the tabular structure.

The result of extracting table data from a PDF using Python and Spire.PDF is shown below:

Python Code to Read Tables from PDF with Spire.PDF

Export Tables to Excel and CSV

If you want to analyze or store the extracted PDF tables, you can convert them to Excel and CSV formats using Python. In this example, we use Spire.XLS for Python to create a spreadsheet for each table, allowing easy data processing or sharing. You can install the library from pip: pip install spire.xls.

from spire.pdf import PdfDocument, PdfTableExtractor
from spire.xls import Workbook, FileFormat

# Load PDF document
pdf = PdfDocument()
pdf.LoadFromFile("G:/Documents/Sample101.pdf")

# Set up extractor and Excel workbook
extractor = PdfTableExtractor(pdf)
workbook = Workbook()
workbook.Worksheets.Clear()

# Extract tables page by page
for page_index in range(pdf.Pages.Count):
    tables = extractor.ExtractTable(page_index)
    for t_index, table in enumerate(tables):
        sheet = workbook.Worksheets.Add(f"Page{page_index+1}_Table{t_index+1}")
        for row in range(table.GetRowCount()):
            for col in range(table.GetColumnCount()):
                text = table.GetText(row, col).replace("\n", " ").strip()
                sheet.Range.get_Item(row + 1, col + 1).Value = text
                sheet.AutoFitColumn(col + 1)

# Save all tables to one Excel file
workbook.SaveToFile("output/Sample.xlsx", FileFormat.Version2016)

As shown below, the extracted PDF tables are converted to Excel and CSV using Spire.XLS for Python.

Export PDF Tables to Excel and CSV Using Python

You may also like: How to Insert Data into Excel Files in Python


Tips to Improve PDF Table Extraction Accuracy in Python

Extracting tables from PDFs can sometimes yield imperfect results—especially when dealing with complex layouts, page breaks, or inconsistent formatting. Below are a few practical techniques to help improve table extraction accuracy in Python and get cleaner, more structured output.

1. Merging Multi-Page Tables

Spire.PDF extracts tables on a per-page basis. If a table spans multiple pages, you can combine them manually by appending the rows:

Example:

# Extract and combine tables
combined_rows = []
for i in range(start_page, end_page + 1):
    tables = table_extractor.ExtractTable(i)
    if tables:
        table = tables[0]  # Assuming one table per page
        for row in range(table.GetRowCount()):
            cells = [table.GetText(row, col).strip().replace("\n", " ") for col in range(table.GetColumnCount())]
            combined_rows.append(cells)

You can then convert combined_rows into Excel or CSV if you prefer analysis via these formats.

2. Filtering Out Empty or Invalid Rows

Tables may contain empty rows or columns, or the extractor may return blank rows depending on layout. You can filter them out before exporting.

Example:

# Step 1: Filter out empty rows
filtered_rows = []
for row in range(table.GetRowCount()):
    row_data = [table.GetText(row, col).strip().replace("\n", " ") for col in range(table.GetColumnCount())]
    if any(cell for cell in row_data):  # Skip completely empty rows
        filtered_rows.append(row_data)

# Step 2: Transpose and filter out empty columns
transposed = list(zip(*filtered_rows))
filtered_columns = [col for col in transposed if any(cell.strip() for cell in col)]

# Step 3: Transpose back to original row-column format
filtered_data = list(zip(*filtered_columns))

This helps improve accuracy when working with noisy or inconsistent layouts.


Common Questions (FAQ)

Q: Can I extract both text and tables from a PDF?

Yes, use PdfTextExtractor to retrieve the full page text and PdfTableExtractor to extract structured tables.

Q: Why aren't my tables detected?

Make sure the PDF is text-based (not scanned images) and that the layout follows a logical row-column format. Spire.PDF for Python detects only bordered tables; unbordered tables are often not recognized.

If you are handling an image-based PDF document, you can use Spire.OCR for Python to extract table data. Please refer to: How to Extract Text from Images Using Python.

Q: How to extract tables without borders from PDF documents?

Spire.PDF may have difficulty extracting tables without visible borders. If the tables are not extracted correctly, consider the following approaches:

  • Using PdfTextExtractor to extract raw text and then writing custom logic to identify rows and columns.
  • Using a large language model API (e.g., GPT) to interpret the structure from extracted plain text and return only structured table data.
  • Consider adding visible borders to tables in the original document before generating the PDF, as this makes it easier to extract them using Python code.

Q: How do I convert extracted tables to a pandas DataFrame?

While Spire.PDF doesn’t provide native DataFrame output, you can collect cell values into a list of lists and then convert:

import pandas as pd
df = pd.DataFrame(table_data)

This lets you convert PDF tables into pandas DataFrames using Python for data analysis.

Q: Is Spire.PDF for Python free to use?

Yes, there are two options available:


Conclusion

Whether you're working with structured reports, financial data, or standardized forms, extracting tables from PDFs in Python can streamline your workflow. With a layout-aware parser like Spire.PDF for Python, you can reliably detect and export tables—no OCR or manual formatting needed. By converting tables to Excel, CSV, or DataFrame, you unlock their full potential for automation and analysis.

In summary, extracting tables from PDFs in Python becomes much easier with Spire.PDF, especially when converting them into structured formats like Excel and CSV for analysis.