How to Convert PDF to Excel with Formatting in Python (Step-by-Step Guide)

2023-09-07 01:11:29 Written by Koohji

Automate PDF to Excel in Python with Spire.PDF

Converting PDF files to Excel spreadsheets in Python is an effective way to extract structured data for analysis, reporting, and automation. While PDFs are excellent for preserving layout across platforms, their static format often makes data extraction challenging.

Excel, on the other hand, provides robust features for sorting, filtering, calculating, and visualizing data. By using Python along with the Spire.PDF for Python library, you can automate the entire PDF to Excel conversion process — from basic one-page documents to complex, multi-page PDFs.

Whether you're automating data extraction from PDFs or integrating PDF content into Excel workflows, this tutorial will walk you through both quick-start and advanced methods for reliable Python PDF to Excel conversion.

Why Convert PDF to Excel Programmatically in Python
Setting Up Your Development Environment
Quick Start: Convert PDF to Excel in Python
Advanced PDF to Excel Conversion with Layout Control and Formatting Options
Conclusion

Why Convert PDF to Excel Programmatically in Python

PDFs are ideal for sharing documents with consistent formatting, but their fixed structure makes them difficult to analyze or reuse, especially if they contain tables.

Converting PDF to Excel allows you to:

Extract tabular data for analysis or visualization
Automate monthly or recurring report extraction
Enable downstream processing in Excel
Save hours of manual copy-pasting

Using Python for this task adds automation, flexibility, and scalability — ideal for integration into data pipelines or backend services.

Setting Up Your Development Environment

Before you start converting PDF files to Excel using Python, it’s essential to set up your development environment properly. This ensures you have all the necessary tools and libraries installed to follow the tutorial smoothly.

Install Python

If you haven’t already installed Python on your system, download and install the latest version from the official website.

Make sure to add Python to your system PATH during installation to run Python commands from the terminal or command prompt easily.

Install Spire.PDF for Python

Spire.PDF for Python is the core library used in this tutorial to load, manipulate, and convert PDF documents.

To install it, open your terminal and run:

pip install Spire.PDF

This command downloads and installs Spire.PDF along with any required dependencies.
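To confirm that the installation worked before running the examples, a quick check along these lines may help. It is a minimal sketch that relies only on the standard library; the helper name is_importable is illustrative and not part of Spire.PDF.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "spire" is the top-level package behind the tutorial's imports
# (from spire.pdf import *).
print("Python version:", sys.version.split()[0])
print("spire package found:", is_importable("spire"))
```

If the second line prints False, the package was probably installed into a different Python environment than the one running the script.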

If you encounter any issues or need detailed installation help, please refer to our step-by-step guide: How to Install Spire.PDF for Python on Windows

Quick Start: Convert PDF to Excel in Python

If your PDF has a clean and simple layout without complex formatting or multiple page structures, you can convert it directly to Excel with just a few lines of code using Spire.PDF for Python.

Steps to Quickly Export PDF to Excel

Follow these straightforward steps to export your PDF file to an Excel spreadsheet in Python:

Import the required classes.
Create a PdfDocument object.
Load your PDF file with the LoadFromFile method.
Export the PDF to Excel (.xlsx) format using the SaveToFile method and specify FileFormat.XLSX as the output format.

Code Example

from spire.pdf import *

# Create a PdfDocument object
pdf = PdfDocument()

# Load your PDF file
pdf.LoadFromFile("Sample.pdf")

# Convert and save the PDF to Excel
pdf.SaveToFile("output.xlsx", FileFormat.XLSX)

# Close the document
pdf.Close()
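For recurring jobs, the same calls can be wrapped in a small batch helper. The sketch below is one possible arrangement, not an official pattern: the function names and folder layout are assumptions made for illustration, and only the Spire.PDF calls shown in the example above are used.

```python
from pathlib import Path


def xlsx_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map e.g. reports/jan.pdf to <out_dir>/jan.xlsx."""
    return out_dir / (pdf_path.stem + ".xlsx")


def convert_pdf_to_xlsx(pdf_path: Path, xlsx_path: Path) -> None:
    """Convert one PDF with the same calls as the quick-start example."""
    # Imported lazily so the path helper works without Spire.PDF installed.
    from spire.pdf import PdfDocument, FileFormat

    pdf = PdfDocument()
    try:
        pdf.LoadFromFile(str(pdf_path))
        pdf.SaveToFile(str(xlsx_path), FileFormat.XLSX)
    finally:
        # Close the document even if loading or saving fails
        pdf.Close()


def convert_folder(in_dir: Path, out_dir: Path) -> list:
    """Convert every .pdf file in in_dir and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(in_dir.glob("*.pdf")):
        target = xlsx_path_for(pdf_path, out_dir)
        convert_pdf_to_xlsx(pdf_path, target)
        written.append(target)
    return written
```

Calling convert_folder(Path("reports"), Path("excel_out")) would then produce one workbook per PDF, with the folder names being placeholders.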

Advanced PDF to Excel Conversion with Layout Control and Formatting Options

For more complex PDF documents, such as those containing multiple pages, rotated text, table cells with multiple lines of text, or overlapping content, you can use the XlsxLineLayoutOptions class to gain precise control over the conversion process.

This allows you to preserve the original structure and formatting of your PDF more accurately when exporting to Excel.

Layout Options You Can Configure

The XlsxLineLayoutOptions class in Spire.PDF provides several properties that give you granular control over how PDF content is exported to Excel. Below is a breakdown of each option and its behavior:

Option	Description
convertToMultipleSheet	Determines whether to convert each PDF page into a separate worksheet. The default value is true.
rotatedText	Specifies whether to preserve the original rotation of angled text. The default value is true.
splitCell	Determines whether to split a PDF table cell with multiple lines of text into separate rows in the Excel output. The default value is true.
wrapText	Determines whether to enable word wrap inside Excel cells. The default value is true.
overlapText	Specifies whether text overlapping in the original PDF should be preserved in the Excel output. The default value is false.

Code Example

from spire.pdf import *

# Create a PdfDocument object
pdf = PdfDocument()

# Load your PDF file
pdf.LoadFromFile("Sample.pdf")

# Define layout options
# Parameters: convertToMultipleSheet, rotatedText, splitCell, wrapText, overlapText
layout_options = XlsxLineLayoutOptions(True, True, False, True, False)

# Apply layout options
pdf.ConvertOptions.SetPdfToXlsxOptions(layout_options)

# Convert and save the PDF to Excel
pdf.SaveToFile("advanced_output.xlsx", FileFormat.XLSX)

# Close the document
pdf.Close()

Advanced PDF to Excel conversion example in Python with Spire.PDF
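Five positional booleans are easy to mix up. One way to keep them readable is a small named wrapper that emits the values in the order listed in the code comment above. This is only a sketch: LayoutFlags and build_layout_options are hypothetical helpers, and the defaults are copied from the option table in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutFlags:
    # Defaults follow the option table in this article.
    convert_to_multiple_sheet: bool = True
    rotated_text: bool = True
    split_cell: bool = True
    wrap_text: bool = True
    overlap_text: bool = False

    def as_args(self) -> tuple:
        """Return the flags in the positional order used by the example above."""
        return (
            self.convert_to_multiple_sheet,
            self.rotated_text,
            self.split_cell,
            self.wrap_text,
            self.overlap_text,
        )


def build_layout_options(flags: LayoutFlags):
    """Create XlsxLineLayoutOptions from named flags."""
    # Imported lazily so LayoutFlags can be used without Spire.PDF installed.
    from spire.pdf import XlsxLineLayoutOptions

    return XlsxLineLayoutOptions(*flags.as_args())
```

With this in place, build_layout_options(LayoutFlags(split_cell=False)) is equivalent to the XlsxLineLayoutOptions(True, True, False, True, False) call shown earlier.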

Conclusion

Converting PDF files to Excel in Python is an efficient way to automate data extraction and processing tasks. Whether you need a quick conversion or fine-grained layout control, Spire.PDF for Python offers flexible options that scale from simple to complex scenarios.

Ready to automate your PDF to Excel conversions?
Get a free trial license for Spire.PDF for Python and explore the full Spire.PDF Documentation to get started today!

FAQs

Q1: Can I convert each PDF page into a separate Excel worksheet?

A1: Yes. Use the convertToMultipleSheet=True option in the XlsxLineLayoutOptions class to export each page to its own sheet.

Q2: What Excel format does Spire.PDF export to?

A2: Spire.PDF converts PDFs to .xlsx, the modern Excel format supported by Excel 2007 and later.

Q3: Can I convert a PDF to Excel in Python without losing formatting?

A3: Yes. Using Spire.PDF for Python, you can retain the original formatting, including merged cells, cell background colors, and other format settings when saving PDFs to Excel.

Q4: Can I extract only a specific table from a PDF to Excel instead of converting the whole document?

A4: Yes, Spire.PDF for Python supports extracting specific tables from PDF files. You can then write the extracted table data to Excel using our Excel processing library - Spire.XLS for Python.
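As a rough illustration of that workflow, the sketch below reads one table and writes it to a CSV file with Python's standard library instead of Spire.XLS, so it stays dependency-light. The class and method names used for extraction (PdfTableExtractor, ExtractTable, GetRowCount, GetColumnCount, GetText) do not appear in this article and are assumptions here; check them against the Spire.PDF documentation before relying on them.

```python
import csv


def table_to_rows(table) -> list:
    """Read every cell of a table object into a list of rows.

    Assumes the table exposes GetRowCount(), GetColumnCount() and
    GetText(row, column).
    """
    return [
        [table.GetText(r, c) for c in range(table.GetColumnCount())]
        for r in range(table.GetRowCount())
    ]


def write_rows_to_csv(rows, csv_path: str) -> None:
    """Write the rows to a CSV file that Excel can open."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def extract_first_table(pdf_path: str, page_index: int = 0) -> list:
    """Return the first table found on the given page, or an empty list."""
    # Assumption: PdfTableExtractor and ExtractTable(page_index) are available.
    from spire.pdf import PdfDocument, PdfTableExtractor

    pdf = PdfDocument()
    try:
        pdf.LoadFromFile(pdf_path)
        tables = PdfTableExtractor(pdf).ExtractTable(page_index)
        return table_to_rows(tables[0]) if tables else []
    finally:
        pdf.Close()
```

The rows returned by extract_first_table can be passed straight to write_rows_to_csv, or handed to an Excel library of your choice.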

Published in Conversion

Tagged under

pdf Python Conversion

Merge PDF Files in Python - Complete Tutorial

2023-09-06 01:09:40 Written by Koohji

Merging PDF files is a common task in many applications, from combining report sections to creating comprehensive document collections. For developers, using Python to merge PDF files programmatically can significantly streamline the process and help build automated workflows.

This article explores how to merge PDFs in Python using Spire.PDF for Python - a robust library designed for efficient PDF manipulation.

Visual guide of Python Merge PDF

Table of Contents:

5 Reasons Why You Should Use Python to Combine PDFs
Step-by-Step: Merge PDF Files in Python
- Install Spire.PDF for Python
- Merge Multiple PDF Files into One
Advanced: Merge Selected Pages from PDFs in Python
Batch Processing: Merge Multiple PDF Files in a Folder
Frequently Asked Questions
Conclusion

5 Reasons Why You Should Use Python to Combine PDFs

While GUI tools like Adobe Acrobat offer PDF merging capabilities, Python provides distinct advantages for developers and enterprises. Merging PDFs with Python shines when you need to:

Process documents in bulk
Schedule scripts to run automatically (e.g., daily report merging)
Integrate with data workflows
Implement business-specific logic
Deploy in server/cloud environments

Step-by-Step: Merge PDF Files in Python

Step 1: Install Spire.PDF for Python

Before you can start combining PDFs with Spire.PDF for Python, you need to install the library. You can do this using pip, the Python package manager. Open your terminal and run the following command:

pip install Spire.PDF

Step 2: Merge Multiple PDF Files into One

Now, let's dive into the Python code for merging multiple PDF files into a single PDF.

1. Import the Required Classes

First, import the necessary classes from the Spire.PDF library:

from spire.pdf.common import *
from spire.pdf import *

2. Define Paths of PDFs to Merge

Define three PDF file paths and store them in a list. You can modify these paths or adjust the number of files according to your needs.

inputFile1 = "Sample1.pdf"
inputFile2 = "Sample2.pdf"
inputFile3 = "Sample3.pdf"
files = [inputFile1, inputFile2, inputFile3]

3. Merge PDF Files

The MergeFiles() method combines all PDFs in the list into a new PDF document object.

pdf = PdfDocument.MergeFiles(files)

4. Save the Merged PDF

Finally, save the combined PDF to a specified output path.

pdf.Save("output/MergePDF.pdf", FileFormat.PDF)

Full Python Code to Combine PDFs:

from spire.pdf.common import *
from spire.pdf import *

# Create a list of the PDF file paths
inputFile1 = "Sample1.pdf"
inputFile2 = "Sample2.pdf"
inputFile3 = "Sample3.pdf"
files = [inputFile1, inputFile2, inputFile3]

# Merge the PDF documents
pdf = PdfDocument.MergeFiles(files)

# Save the result document
pdf.Save("output/MergePDF.pdf", FileFormat.PDF)
pdf.Close()

Result: The three PDF files (6 pages in total) are combined into one PDF file.

Merge multiple PDF files into a single PDF
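In an automated workflow it can be worth validating the input list before merging, so that a missing file fails with a clear message. The following is a minimal sketch under that assumption; check_merge_inputs and merge_pdfs are hypothetical helper names, and the merge itself uses only the MergeFiles() and Save() calls shown above.

```python
from pathlib import Path


def check_merge_inputs(paths) -> list:
    """Return the paths as a list if every one is an existing .pdf file."""
    missing = [str(p) for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))
    not_pdf = [str(p) for p in paths if Path(p).suffix.lower() != ".pdf"]
    if not_pdf:
        raise ValueError("Not PDF files: " + ", ".join(not_pdf))
    return list(paths)


def merge_pdfs(paths, output_path: str) -> None:
    """Validate the inputs, then merge them as in the example above."""
    # Imported lazily so the validation helper works without Spire.PDF installed.
    from spire.pdf import PdfDocument, FileFormat

    files = check_merge_inputs(paths)
    # Make sure the output folder exists before saving
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    pdf = PdfDocument.MergeFiles(files)
    pdf.Save(output_path, FileFormat.PDF)
    pdf.Close()
```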

Advanced: Merge Selected Pages from PDFs in Python

In some cases, you may only want to merge specific pages of multiple PDFs. Spire.PDF for Python makes this easy by allowing you to select pages from different PDF documents and insert them into a new PDF file.

from spire.pdf import *
from spire.pdf.common import *

# Create a list of the PDF file paths
file1 = "Sample1.pdf"
file2 = "Sample2.pdf"
file3 = "Sample3.pdf"
files = [file1, file2, file3]

# Load each PDF file as a PdfDocument object and add them to a list
pdfs = []
for file in files:
    pdfs.append(PdfDocument(file))

# Create an object of PdfDocument class
newPdf = PdfDocument()

# Insert the selected pages from the loaded PDF documents into the new document
newPdf.InsertPage(pdfs[0], 0)
newPdf.InsertPage(pdfs[1], 1)
newPdf.InsertPageRange(pdfs[2], 0, 1)

# Save the new PDF document
newPdf.SaveToFile("output/SelectedPages.pdf")

# Release resources
newPdf.Close()
for pdf in pdfs:
    pdf.Close()

Explanation:

PdfDocument(): Initializes a new PDF document object.
InsertPage(): Inserts a specified page into the new PDF (page indexes start at 0).
InsertPageRange(): Inserts a range of pages into the new PDF.
SaveToFile(): Saves the combined PDF to the specified output path.

Result: Combine selected pages from three separate PDF files into a new PDF.

Combine pages from different PDFs to a new PDF
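Because page indexes start at 0 while people usually think in page numbers starting at 1, a small translation helper can reduce off-by-one mistakes. The sketch below is illustrative only: parse_page_spec and insert_pages are hypothetical names, and it assumes that InsertPageRange() treats both the start and end index as inclusive, which this article does not state explicitly.

```python
def parse_page_spec(spec: str) -> list:
    """Turn a 1-based spec such as "1,3-4" into zero-based (start, end) pairs."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last.strip() else start
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        ranges.append((start - 1, end - 1))
    return ranges


def insert_pages(new_pdf, source_pdf, spec: str) -> None:
    """Insert the pages named in spec from source_pdf into new_pdf."""
    for start, end in parse_page_spec(spec):
        if start == end:
            new_pdf.InsertPage(source_pdf, start)
        else:
            # Assumption: both indexes are inclusive
            new_pdf.InsertPageRange(source_pdf, start, end)
```

For example, insert_pages(newPdf, pdfs[2], "1-2") would issue the same InsertPageRange(pdfs[2], 0, 1) call used in the code above.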

Batch Processing: Merge Multiple PDF Files in a Folder

The following Python script loops through each PDF in a specified folder and appends all of its pages to a new PDF file.

import os 
from spire.pdf.common import *
from spire.pdf import *

# Specify the directory where the source PDFs are stored
folder = "pdf_folder/"  

# Create a new PDF to hold the combined content.
merged_pdf = PdfDocument()  

# Loop through each source PDF
for file in os.listdir(folder):  
    if file.endswith(".pdf"):  
        pdf = PdfDocument(os.path.join(folder, file))  
        # Appends all pages from each source PDF to the new PDF
        merged_pdf.AppendPage(pdf)  
        pdf.Close()  # Close source PDF

# Save the merged PDF after processing all files
merged_pdf.SaveToFile("BatchCombinePDFs.pdf")
merged_pdf.Close()  # Release resources
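Note that os.listdir() returns file names in arbitrary order, so the page order of the merged file is not guaranteed. If order matters, one option is to sort the names first. The sketch below uses a natural sort so that part2.pdf comes before part10.pdf; natural_key and list_pdfs are illustrative helpers built on the standard library only.

```python
import os
import re


def natural_key(name: str) -> list:
    """Sort key that orders 'part2.pdf' before 'part10.pdf'."""
    # re.split with a capture group alternates text, digits, text, ...
    # so odd positions always hold digit runs.
    tokens = re.split(r"(\d+)", name)
    return [int(tok) if i % 2 else tok.lower() for i, tok in enumerate(tokens)]


def list_pdfs(folder: str) -> list:
    """Return the PDF paths in a folder in natural, case-insensitive order."""
    names = [n for n in os.listdir(folder) if n.lower().endswith(".pdf")]
    return [os.path.join(folder, n) for n in sorted(names, key=natural_key)]
```

The loop in the script above could then iterate over list_pdfs(folder) instead of os.listdir(folder); this variant also picks up files with an upper-case .PDF extension.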

Frequently Asked Questions

Q1: Is Spire.PDF for Python free?

A: Spire.PDF for Python offers a 30-day free trial with full features. There’s also a free version available but with page limits.

Q2: Can I merge scanned/image-based PDFs?

A: Yes, Spire.PDF handles image-only PDFs. However, OCR/text extraction requires the Spire.OCR for Python library.

Q3: How to add page numbers to the merged PDF?

A: Refer to this comprehensive guide: Add Page Numbers to PDF in Python

Q4: How to reduce the size of the merged PDF?

A: You can compress the high-resolution images and fonts contained in the merged PDF file. A related tutorial: Compress PDF Documents in Python.

Conclusion

Merging PDFs with Python doesn't have to be a complex task. With Spire.PDF for Python, you can efficiently combine multiple PDF files into a single document with just a few lines of code. Whether you need to merge entire documents, specific pages, or a batch merge, this guide outlines step-by-step instructions to help you automate the PDF merging process.

Explore Spire.PDF's online documentation for more PDF processing features with Python.

Published in Document Operation

Tagged under

pdf Python Document Operation

Python: Create, Read, or Update a Word Document

2023-09-05 01:40:47 Written by Administrator

Creating, reading, and updating Word documents is a common need for many developers working with the Python programming language. Whether it's generating reports, manipulating existing documents, or automating document creation processes, having the ability to work with Word documents programmatically can greatly enhance productivity and efficiency. In this article, you will learn how to create, read, or update Word documents in Python using Spire.Doc for Python.

Create a Word Document from Scratch in Python
Read Text of a Word Document in Python
Update a Word Document in Python

Install Spire.Doc for Python

This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. They can be easily installed on Windows with the following pip command.

Package Manager

pip install Spire.Doc

If you are unsure how to install, please refer to this tutorial: How to Install Spire.Doc for Python on Windows

Create a Word Document from Scratch in Python

Spire.Doc for Python offers the Document class to represent a Word document model. A document must contain at least one section (represented by the Section class) and each section is a container for various elements such as paragraphs, tables, charts, and images. This example shows you how to create a simple Word document containing several paragraphs using Spire.Doc for Python.

Create a Document object.
Add a section using Document.AddSection() method.
Set the page margins through the Section.PageSetup.Margins property.
Add several paragraphs to the section using Section.AddParagraph() method.
Add text to the paragraphs using Paragraph.AppendText() method.
Create a ParagraphStyle object, and apply it to a specific paragraph using Paragraph.ApplyStyle() method.
Save the document to a Word file using Document.SaveToFile() method.

Python

from spire.doc import *	
from spire.doc.common import *

# Create a Document object
doc = Document()

# Add a section
section = doc.AddSection()

# Set the page margins
section.PageSetup.Margins.All = 40

# Add a title
titleParagraph = section.AddParagraph()
titleParagraph.AppendText("Introduction of Spire.Doc for Python")

# Add two paragraphs
bodyParagraph_1 = section.AddParagraph()
bodyParagraph_1.AppendText("Spire.Doc for Python is a professional Python library designed for developers to " +
                           "create, read, write, convert, compare and print Word documents in any Python application " +
                           "with fast and high-quality performance.")

bodyParagraph_2 = section.AddParagraph()
bodyParagraph_2.AppendText("As an independent Word Python API, Spire.Doc for Python doesn't need Microsoft Word to " +
                           "be installed on either the development or target systems. However, it can incorporate Microsoft Word " +
                           "document creation capabilities into any developers' Python applications.")

# Apply heading1 to the title
titleParagraph.ApplyStyle(BuiltinStyle.Heading1)

# Create a style for the paragraphs
style2 = ParagraphStyle(doc)
style2.Name = "paraStyle"
style2.CharacterFormat.FontName = "Arial"
style2.CharacterFormat.FontSize = 13
doc.Styles.Add(style2)
bodyParagraph_1.ApplyStyle("paraStyle")
bodyParagraph_2.ApplyStyle("paraStyle")

# Set the horizontal alignment of the paragraphs
titleParagraph.Format.HorizontalAlignment = HorizontalAlignment.Center
bodyParagraph_1.Format.HorizontalAlignment = HorizontalAlignment.Left
bodyParagraph_2.Format.HorizontalAlignment = HorizontalAlignment.Left

# Set the spacing after the title and the first body paragraph
titleParagraph.Format.AfterSpacing = 10
bodyParagraph_1.Format.AfterSpacing = 10

# Save to file
doc.SaveToFile("output/WordDocument.docx", FileFormat.Docx2019)

Read Text of a Word Document in Python

To get the text of an entire Word document, you can use the Document.GetText() method. The following are the detailed steps.

Create a Document object.
Load a Word document using Document.LoadFromFile() method.
Get text from the entire document using Document.GetText() method.

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()

# Load a Word file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\WordDocument.docx")

# Get text from the entire document
text = doc.GetText()

# Print text
print(text)
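The returned string can then be processed like any other Python text. As a small, hedged illustration, the helpers below drop blank lines and count words; they use only the standard library, and the function names are made up for this example.

```python
def non_empty_lines(text: str) -> list:
    """Split text into stripped lines and drop the blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def word_count(text: str) -> int:
    """Count whitespace-separated words across the whole text."""
    return sum(len(line.split()) for line in non_empty_lines(text))


def save_text(text: str, path: str) -> None:
    """Write the cleaned lines to a UTF-8 text file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(non_empty_lines(text)))
```

For instance, save_text(doc.GetText(), "WordDocument.txt") would store the document text in a plain text file, with the file name being a placeholder.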

Update a Word Document in Python

To access a specific paragraph, you can use the Section.Paragraphs[index] property. If you want to modify the text of the paragraph, you can reassign text to the paragraph through the Paragraph.Text property. The following are the detailed steps.

Create a Document object.
Load a Word document using Document.LoadFromFile() method.
Get a specific section through Document.Sections[index] property.
Get a specific paragraph through Section.Paragraphs[index] property.
Change the text of the paragraph through Paragraph.Text property.
Save the document to another Word file using Document.SaveToFile() method.

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()

# Load a Word file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\WordDocument.docx")

# Get a specific section
section = doc.Sections.get_Item(0)

# Get a specific paragraph (indexes are zero-based, so 1 is the second paragraph)
paragraph = section.Paragraphs.get_Item(1)

# Change the text of the paragraph
paragraph.Text = "The title has been changed"

# Save to file
doc.SaveToFile("output/Updated.docx", FileFormat.Docx2019)
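The same approach can be extended to update every paragraph that contains a given phrase. The sketch below is written against the get_Item() and Text members used above; it additionally assumes that the paragraph collection exposes a Count property and that Paragraph.Text can be read as well as assigned, neither of which is shown in this article. The helper name is hypothetical.

```python
def replace_in_paragraphs(section, old: str, new: str) -> int:
    """Replace old with new in every paragraph of a section.

    Returns the number of paragraphs that were changed.
    """
    changed = 0
    paragraphs = section.Paragraphs
    # Assumption: the collection exposes Count
    for i in range(paragraphs.Count):
        paragraph = paragraphs.get_Item(i)
        current = paragraph.Text
        if old in current:
            paragraph.Text = current.replace(old, new)
            changed += 1
    return changed
```

A call such as replace_in_paragraphs(doc.Sections.get_Item(0), "Spire.Doc", "Spire.Doc for Python") would run before doc.SaveToFile(); reassigning Text may not preserve per-run formatting, so it is worth checking the output.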

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Document Operation

Tagged under

doc Python Document Operation
