Set different text alignment in Word with Python

In the world of document automation, proper text alignment is crucial for creating professional, readable, and visually appealing documents. For developers and data professionals building reports, drafting letters, or designing invoices, mastering text alignment in Python is essential to producing polished, consistent documents without manual editing.

This guide delivers a step-by-step walkthrough on how to align text in Python using Spire.Doc for Python, a library that enables effortless control over Word document formatting.


Why Choose Spire.Doc for Python to Align Text?​

Before diving into code, let’s clarify why Spire.Doc is a top choice for text alignment tasks:​

  • Full Alignment Support: Natively supports all standard alignment types (Left, Right, Center, Justify) for paragraphs.​
  • No Microsoft Word Dependency: Runs independently - no need to install Word on your machine.​
  • High Compatibility: Works with .docx, .doc, and other Word formats, ensuring your aligned documents open correctly across devices.​
  • Fine-Grained Control: Adjust alignment for entire paragraphs or table cells.​

Core Text Alignment Types in Spire.Doc​

Spire.Doc uses the HorizontalAlignment enum to define text alignment. The most common values are:​

  • HorizontalAlignment.Left: Aligns text to the left margin (default).​
  • HorizontalAlignment.Right: Aligns text to the right margin.​
  • HorizontalAlignment.Center: Centers text horizontally between margins.​
  • HorizontalAlignment.Justify: Adjusts text spacing so both left and right edges align with margins.​
  • HorizontalAlignment.Distribute: Adjusts character spacing (adds space between letters) and word spacing to fill the line.

Below, we’ll cover how to programmatically set paragraph alignment (left, right, center, justified, and distributed) in Word using Python


Step-by-Step: Align Text in Word in Python

Here are the actionable steps to generate a Word document with 5 paragraphs, each using a different alignment style.

Step 1: Install Spire.Doc for Python

Open your terminal/command prompt​, and then run the following command to install the latest version:

pip install Spire.Doc

Step 2: Import Required Modules

Import the core classes from Spire.Doc. These modules let you create documents, sections, paragraphs, and configure formatting:

from spire.doc import *
from spire.doc.common import *

Step 3: Create a New Word Document

Initialize a Document instance that represents your empty Word file:

# Create a Document instance
doc = Document()

Step 4: Add a Section to the Document

Word documents organize content into sections (each section can have its own margins, page size, etc.). We’ll add one section to hold our paragraphs:

# Add a section to the document
section = doc.AddSection()

Step 5: Add Paragraphs with Different Alignments

A section contains paragraphs, and each paragraph’s alignment is controlled via the HorizontalAlignment enum. We’ll create 5 paragraphs, one for each alignment type.

1. Left Align in Python

Left alignment is the default for most text (text aligns to the left margin).

# Left aligned text
paragraph1 = section.AddParagraph()
paragraph1.AppendText("This is left-aligned text.")
paragraph1.Format.HorizontalAlignment = HorizontalAlignment.Left

2. Right Align Text in Python

Right alignment is useful for dates, signatures, or page numbers (text aligns to the right margin).

# Right aligned text
paragraph2 = section.AddParagraph()
paragraph2.AppendText("This is right-aligned text.")
paragraph2.Format.HorizontalAlignment = HorizontalAlignment.Right

3. Center Text in Python

Center alignment works well for titles or headings (text centers between left and right margins). Use to center text in Python:

# Center aligned text
paragraph3 = section.AddParagraph()
paragraph3.AppendText("This is center-aligned text.")
paragraph3.Format.HorizontalAlignment = HorizontalAlignment.Center

4. Justify Text in Python

Justified text aligns both left and right margins (spaces between words are adjusted for consistency). Ideal for formal documents like essays or reports.

# Justified
paragraph4 = section.AddParagraph()
paragraph4.AppendText("This is justified text.")
paragraph4.Format.HorizontalAlignment = HorizontalAlignment.Justify

Note: Justified alignment is more visible with longer text - short phrases may not show the spacing adjustment.

5. Distribute Text in Python

Distributed alignment is similar to justified, but evenly distributes single-line text (e.g., unevenly spaced words or short phrases).

# Distributed
Paragraph5 = section.AddParagraph()
Paragraph5.AppendText("This is evenly distributed text.")
Paragraph5.Format.HorizontalAlignment = HorizontalAlignment.Distribute

Step 6: Save and Close the Document

Finally, save the document to a specified path and close the Document instance to free resources:

# Save the document
document.SaveToFile("TextAlignment.docx", FileFormat.Docx2016)
# Close the document to release memory
document.Close()

Output:

Align paragraph text in Word in Python

Pro Tip: Spire.Doc for Python also provides interfaces to align tables in Word or align text in table cells.


FAQs About Python Text Alignment

Q1: Is Spire.Doc for Python free?

A: Spire.Doc offers a free version with limitations. For full functionality, you can request a 30-day trial license here.

Q2: Can I set text alignment for existing Word documents

A: Yes. Spire.Doc lets you load existing documents and modify text alignment for specific paragraphs. Here’s a quick example:

from spire.doc import *

# Load an existing document
doc = Document()
doc.LoadFromFile("ExistingDocument.docx")

# Get the first section and first paragraph
section = doc.Sections[0]
paragraph = section.Paragraphs[0]

# Change alignment to center
paragraph.Format.HorizontalAlignment = HorizontalAlignment.Center

# Save the modified document
doc.SaveToFile("UpdatedDocument.docx", FileFormat.Docx2016)
doc.Close()

Q3: Can I apply different alignments to different parts of the same paragraph?

A: No. Text alignment is a paragraph-level setting in Word, not a character-level setting. This means all text within a single paragraph must share the same alignment (left, right, center, etc.).

If you need mixed alignment in the same line, you’ll need to use a table with invisible borders.

Q4: Can Spire.Doc for Python handle other text formatting?

A: Absolutely! Spire.Doc lets you combine alignment with other formatting like fonts, line spacing, bullet points, and more.


Conclusion

Automating Word text alignment with Python and Spire.Doc saves time, reduces human error, and ensures consistency across documents. The code example provided offers a clear template for implementing left, right, center, justified, and distributed alignment, and adapting it to your needs is as simple as modifying the text or adding more formatting rules.

Try experimenting with different alignment combinations, and explore Spire.Doc’s online documentation to unlock more formatting possibilities.

C# Guide to Read Word Document Content

Word documents (.doc and .docx) are widely used in business, education, and professional workflows for reports, contracts, manuals, and other essential content. As a C# developer, you may find the need to programmatically read these files to extract information, analyze content, and integrate document data into applications.

In this complete guide, we will delve into the process of reading Word documents in C#. We will explore various scenarios, including:

  • Extracting text, paragraphs, and formatting details
  • Retrieving images and structured table data
  • Accessing comments and document metadata
  • Reading headers and footers for comprehensive document analysis

By the end of this guide, you will have a solid understanding of how to efficiently parse Word documents in C#, allowing your applications to access and utilize document content with accuracy and ease.

Table of Contents

Set Up Your Development Environment for Reading Word Documents in C#

Before you can read Word documents in C#, it’s crucial to ensure that your development environment is properly set up. This section outlines the necessary prerequisites and step-by-step installation instructions to get you ready for seamless Word document handling.

Prerequisites

Install Spire.Doc

To incorporate Spire.Doc into your C# project, follow these steps to install it via NuGet:

  1. Open your project in Visual Studio.
  2. Right-click on your project in the Solution Explorer and select Manage NuGet Packages.
  3. In the Browse tab, search for "Spire.Doc" and click Install.

Alternatively, you can use the Package Manager Console with the following command:

PM> Install-Package Spire.Doc

This installation adds the necessary references, enabling you to programmatically work with Word documents.

Load Word Document (.doc/.docx) in C#

To begin, you need to load a Word document into your project. The following example demonstrates how to load a .docx or .doc file in C#:

using Spire.Doc;
using Spire.Doc.Documents;
using System;

namespace LoadWordExample
{
    class Program
    {
        static void Main(string[] args)
        {
            // Specify the path of the Word document
            string filePath = @"C:\Documents\Sample.docx";

            // Create a Document object
            using (Document document = new Document())
            {
                // Load the Word .docx or .doc document
                document.LoadFromFile(filePath);
            }
        }
    }
}

This code loads a Word file from the specified path into a Document object, which is the entry point for accessing all document elements.

Read and Extract Content from Word Document in C#

After loading the Word document into a Document object, you can access its contents programmatically. This section covers various methods for extracting different types of content effectively.

Extract Text

Extracting text is often the first step in reading Word documents. You can retrieve all text content using the built-in GetText() method:

using (StreamWriter writer = new StreamWriter("ExtractedText.txt", false, Encoding.UTF8))
{
    // Get all text from the document
    string allText = document.GetText();
    
    // Write the entire text to a file
    writer.Write(allText);
}

This method extracts all text, disregarding formatting and non-text elements like images.

C# Example to Extract All Text from Word Document

Read Paragraphs and Formatting Information

When working with Word documents, it is often useful not only to access the text content of paragraphs but also to understand how each paragraph is formatted. This includes details such as alignment and spacing after the paragraph, which can affect layout and readability.

The following example demonstrates how to iterate through all paragraphs in a Word document and retrieve their text content and paragraph-level formatting in C#:

using (StreamWriter writer = new StreamWriter("Paragraphs.txt", false, Encoding.UTF8))
{
    // Loop through all sections
    foreach (Section section in document.Sections)
    {
        // Loop through all paragraphs in the section
        foreach (Paragraph paragraph in section.Paragraphs)
        {
            // Get paragraph alignment
            HorizontalAlignment alignment = paragraph.Format.HorizontalAlignment;

            // Get spacing after paragraph
            float afterSpacing = paragraph.Format.AfterSpacing;

            // Write paragraph formatting and text to the file
            writer.WriteLine($"[Alignment: {alignment}, AfterSpacing: {afterSpacing}]");
            writer.WriteLine(paragraph.Text);
            writer.WriteLine(); // Add empty line between paragraphs
        }
    }
}

This approach allows you to extract both the text and key paragraph formatting attributes, which can be useful for tasks such as document analysis, conditional processing, or preserving layout when exporting content.

Extract Images

Images embedded within Word documents play a vital role in conveying information. To extract these images, you will examine each paragraph's content, identify images (typically represented as DocPicture objects), and save them for further use:

// Create the folder if it does not exist
string imageFolder = "ExtractedImages";
if (!Directory.Exists(imageFolder))
    Directory.CreateDirectory(imageFolder);

int imageIndex = 1;

// Loop through sections and paragraphs to find images
foreach (Section section in document.Sections)
{
    foreach (Paragraph paragraph in section.Paragraphs)
    {
        foreach (DocumentObject obj in paragraph.ChildObjects)
        {
            if (obj is DocPicture picture)
            {
                // Save each image as a separate PNG file
                string fileName = Path.Combine(imageFolder, $"Image_{imageIndex}.png");
                picture.Image.Save(fileName, System.Drawing.Imaging.ImageFormat.Png);
                imageIndex++;
            }
        }
    }
}

This code saves all images in the document as separate PNG files, with options to choose other formats like JPEG or BMP.

C# Example to Extract Images from Word Document

Extract Table Data

Tables are commonly used to organize structured data, such as financial reports or survey results. To access this data, iterate through the tables in each section and retrieve the content of individual cells:

// Create a folder to store tables
string tableDir = "Tables";
if (!Directory.Exists(tableDir))
    Directory.CreateDirectory(tableDir);

// Loop through each section
for (int sectionIndex = 0; sectionIndex < document.Sections.Count; sectionIndex++)
{
    Section section = document.Sections[sectionIndex];
    TableCollection tables = section.Tables;

    // Loop through all tables in the section
    for (int tableIndex = 0; tableIndex < tables.Count; tableIndex++)
    {
        ITable table = tables[tableIndex];
        string fileName = Path.Combine(tableDir, $"Section{sectionIndex + 1}_Table{tableIndex + 1}.txt");

        using (StreamWriter writer = new StreamWriter(fileName, false, Encoding.UTF8))
        {
            // Loop through each row
            for (int rowIndex = 0; rowIndex < table.Rows.Count; rowIndex++)
            {
                TableRow row = table.Rows[rowIndex];

                // Loop through each cell
                for (int cellIndex = 0; cellIndex < row.Cells.Count; cellIndex++)
                {
                    TableCell cell = row.Cells[cellIndex];

                    // Loop through each paragraph in the cell
                    for (int paraIndex = 0; paraIndex < cell.Paragraphs.Count; paraIndex++)
                    {
                        writer.Write(cell.Paragraphs[paraIndex].Text.Trim() + " ");
                    }

                    // Add tab between cells
                    if (cellIndex < row.Cells.Count - 1) writer.Write("\t");
                }

                // Add newline after each row
                writer.WriteLine();
            }
        }
    }
}

This method allows efficient extraction of structured data, making it ideal for generating reports or integrating content into databases.

C# Example to Extract Table Data from Word Document

Read Comments

Comments are valuable for collaboration and feedback within documents. Extracting them is crucial for auditing and understanding the document's revision history.

The Document object provides a Comments collection, which allows you to access all comments in a Word document. Each comment contains one or more paragraphs, and you can extract their text for further processing or save them into a file.

using (StreamWriter writer = new StreamWriter("Comments.txt", false, Encoding.UTF8))
{
    // Loop through all comments in the document
    foreach (Comment comment in document.Comments)
    {
        // Loop through each paragraph in the comment
        foreach (Paragraph p in comment.Body.Paragraphs)
        {
            writer.WriteLine(p.Text);
        }
        // Add empty line to separate different comments
        writer.WriteLine();
    }
}

This code retrieves the content of all comments and outputs it into a single text file.

Retrieve Document Metadata

Word documents contain metadata such as the title, author, and subject. These metadata items are stored as document properties, which can be accessed through the BuiltinDocumentProperties property of the Document object:

using (StreamWriter writer = new StreamWriter("Metadata.txt", false, Encoding.UTF8))
{
    // Write built-in document properties to file
    writer.WriteLine("Title: " + document.BuiltinDocumentProperties.Title);
    writer.WriteLine("Author: " + document.BuiltinDocumentProperties.Author);
    writer.WriteLine("Subject: " + document.BuiltinDocumentProperties.Subject);
}

Read Headers and Footers

Headers and footers frequently contain essential content like page numbers and titles. To programmatically access this information, iterate through each section's header and footer paragraphs and retrieve the text of each paragraph:

using (StreamWriter writer = new StreamWriter("HeadersFooters.txt", false, Encoding.UTF8))
{
    // Loop through all sections
    foreach (Section section in document.Sections)
    {
        // Write header paragraphs
        foreach (Paragraph headerParagraph in section.HeadersFooters.Header.Paragraphs)
        {
            writer.WriteLine("Header: " + headerParagraph.Text);
        }

        // Write footer paragraphs
        foreach (Paragraph footerParagraph in section.HeadersFooters.Footer.Paragraphs)
        {
            writer.WriteLine("Footer: " + footerParagraph.Text);
        }
    }
}

This method ensures that all recurring content is accurately captured during document processing.

Advanced Tips and Best Practices for Reading Word Documents in C#

To get the most out of programmatically reading Word documents, following these tips can help improve efficiency, reliability, and code maintainability:

  • Use using Statements: Always wrap Document objects in using to ensure proper memory management.
  • Check for Null or Empty Sections: Prevent errors by verifying sections, paragraphs, tables, or images exist before accessing them.
  • Batch Reading Multiple Documents: Loop through a folder of Word files and apply the same extraction logic to each file. This helps automate workflows and consolidate extracted content efficiently.

Conclusion

Efficiently reading Word documents programmatically in C# involves handling various content types. With the techniques outlined in this guide, developers can:

  • Load Word documents (.doc and .docx) with ease.
  • Extract text, paragraphs, and formatting details for thorough analysis.
  • Retrieve images, structured table data, and comments.
  • Access headers, footers, and document metadata for complete insights.

FAQs

Q1: Can I read Word documents without installing Microsoft Word?

A1: Yes, libraries like Spire.Doc enable you to read and process Word files without requiring Microsoft Word installation.

Q2: Does this support both .doc and .docx formats?

A2: Absolutely, all methods discussed in this guide work seamlessly with both legacy (.doc) and modern (.docx) Word files.

Q3: Can I extract only specific sections of a document?

A3: Yes, by iterating through sections and paragraphs, you can selectively filter and extract the desired content.

Convert an HTML File to PDF in Python

Converting HTML to PDF in Python is a common need when you want to generate printable reports, preserve web content, or create offline documentation with consistent formatting. In this tutorial, you’ll learn how to convert HTML to PDF in Python— whether you're working with a local HTML file or a HTML string. If you're looking for a simple and reliable way to generate PDF files from HTML in Python, this guide is for you.

Install Spire.Doc to Convert HTML to PDF Easily

To convert HTML to PDF in Python, you’ll need a reliable library that supports HTML parsing and PDF rendering. Spire.Doc for Python is a powerful and easy-to-use HTML to PDF converter library that lets you generate PDF documents from HTML content — without relying on a browser, headless engine, or third-party tools.

Install via pip

You can install the library quickly with pip:

pip install spire.doc

Alternative: Manual Installation

You can also download the Spire.Doc package and perform a custom installation if you need more control over the environment.

Tip: Spire.Doc offers a free version suitable for small projects or evaluation purposes.

Once installed, you're ready to convert HTML to PDF in Python in just a few lines of code.

Convert HTML Files to PDF in Python

Spire.Doc for Python makes it easy to convert HTML files to PDF. The Document.LoadFromFile() method supports loading various file formats, including .html, .doc, and .docx. After loading an HTML file, you can convert it to PDF by calling Document.SaveToFile() method. Follow the steps below to convert an HTML file to PDF in Python using Spire.Doc.

Steps to convert an HTML file to PDF in Python:

  • Create a Document object.
  • Load an HTML file using Document.LoadFromFile() method.
  • Convert it to PDF using Document.SaveToFile() method.

The following code shows how to convert an HTML file directly to PDF in Python:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Load an HTML file 
document.LoadFromFile("Sample.html", FileFormat.Html, XHTMLValidationType.none)

# Save the HTML file to a pdf file
document.SaveToFile("output/ToPdf.pdf", FileFormat.PDF)
document.Close()

Convert an HTML File to PDF in Python

Convert an HTML String to PDF in Python

If you want to convert an HTML string to PDF in Python, Spire.Doc for Python provides a straightforward solution. For simple HTML content like paragraphs, text styles, and basic formatting, you can use the Paragraph.AppendHTML() method to insert the HTML into a Word document. Once added, you can save the document as a PDF using the Document.SaveToFile() method.

Here are the steps to convert an HTML string to a PDF file in Python.

  • Create a Document object.
  • Add a section using Document.AddSection() method and insert a paragraph using Section.AddParagraph() method.
  • Specify the HTML string and add it to the paragraph using Paragraph.AppendHTML() method.
  • Save the document as a PDF file using Document.SaveToFile() method.

Here's the complete Python code that shows how to convert an HTML string to a PDF:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Add a section to the document
sec = document.AddSection()

# Add a paragraph to the section
paragraph = sec.AddParagraph()

# Specify the HTML string
htmlString = """
<html>
<head>
    <title>HTML to Word Example</title>
    <style>
        body {
            font-family: Arial, sans-serif;
        }
        h1 {
            color: #FF5733;
            font-size: 24px;
            margin-bottom: 20px;
        }
        p {
            color: #333333;
            font-size: 16px;
            margin-bottom: 10px;
        }
        ul {
            list-style-type: disc;
            margin-left: 20px;
            margin-bottom: 15px;
        }
        li {
            font-size: 14px;
            margin-bottom: 5px;
        }
        table {
            border-collapse: collapse;
            width: 100%;
            margin-bottom: 20px;
        }
        th, td {
            border: 1px solid #CCCCCC;
            padding: 8px;
            text-align: left;
        }
        th {
            background-color: #F2F2F2;
            font-weight: bold;
        }
        td {
            color: #0000FF;
        }
    </style>
</head>
<body>
    <h1>This is a Heading</h1>
    <p>This is a paragraph.</p>
    <p>Here's an unordered list:</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
    <p>And here's a table:</p>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>Gender</th>
        </tr>
        <tr>
            <td>John Smith</td>
            <td>35</td>
            <td>Male</td>
        </tr>
        <tr>
            <td>Jenny Garcia</td>
            <td>27</td>
            <td>Female</td>
        </tr>
    </table>
</body>
</html>
"""

# Append the HTML string to the paragraph
paragraph.AppendHTML(htmlString)

# Save the document as a pdf file
document.SaveToFile("output/HtmlStringToPdf.pdf", FileFormat.PDF)
document.Close()

Convert an HTML File to PDF in Python

Customize the Conversion from HTML to PDF in Python

While converting HTML to PDF in Python is often straightforward, there are times when you need more control over the output. For example, you may want to set a password to protect the PDF document, or embed fonts to ensure consistent formatting across different devices. In this section, you’ll learn how to customize the HTML to PDF conversion using Spire.Doc for Python.

1. Set a Password to Protect the PDF

To prevent unauthorized viewing or editing, you can encrypt the PDF by specifying a user password and an owner password.

# Create a ToPdfParameterList object
toPdf = ToPdfParameterList()

# Set PDF encryption passwords
userPassword = "viewer"
ownerPassword = "E-iceblue"
toPdf.PdfSecurity.Encrypt(userPassword, ownerPassword, PdfPermissionsFlags.Default, PdfEncryptionKeySize.Key128Bit)

# Save as PDF with password protection
document.SaveToFile("/HtmlToPdfWithPassword.pdf", toPdf)

2. Embed Fonts to Preserve Formatting

To ensure the PDF displays correctly across all devices, you can embed all fonts used in the document.

# Create a ToPdfParameterList object
ppl = ToPdfParameterList()
ppl.IsEmbeddedAllFonts = True 

# Save as PDF with embedded fonts
document.SaveToFile("/HtmlToPdfWithEmbeddedFonts.pdf", ppl)

These options give you finer control when you convert HTML to PDF in Python, especially for professional document sharing or long-term storage scenarios.

The Conclusion

Converting HTML to PDF in Python becomes simple and flexible with Spire.Doc for Python. Whether you're handling static HTML files or dynamic HTML strings, or need to secure and customize your PDFs, this library provides everything you need — all in just a few lines of code. Get a free 30-day license and start converting HTML to high-quality PDF documents in Python today!

FAQs

Q1: Can I convert an HTML file to PDF in Python? Yes. Using Spire.Doc for Python, you can convert a local HTML file to PDF with just a few lines of code.

Q2: How do I convert HTML to PDF in Chrome? While Chrome allows manual "Save as PDF", it’s not suitable for batch or automated workflows. If you're working in Python, Spire.Doc provides a better solution for programmatically converting HTML to PDF.

Q3: How do I convert HTML to PDF without losing formatting? To preserve formatting:

  • Use embedded or inline CSS (not external files).
  • Use absolute URLs for images and resources.
  • Embed fonts using Spire.Doc options like IsEmbeddedAllFonts(True).

Converting PDF files to editable text is a common need for researchers, analysts, and professionals who deal with large volumes of documents. Manual copying wastes time—Python offers a faster, more flexible solution. In this guide, you’ll learn how to convert PDF to text in Python efficiently, whether you want to keep the layout or extract specific content.

Convert PDF to text without layout

Getting Started: Why Choose Spire.PDF for PDF to Text in Python

To convert PDF files to text using Python, you’ll need a reliable PDF processing library. Spire.PDF for Python is a powerful and developer-friendly API that allows you to read, edit, and convert PDF documents in Python applications — no need for Adobe Acrobat or other third-party software.
This library is ideal for automating PDF workflows such as extracting text, adding annotations, or merging and splitting files. It supports a wide range of PDF features and works seamlessly in both desktop and server environments. You can donwload it to install mannually or quickly install Spire.PDF via PyPI using the following command:

pip install Spire.PDF

For smaller or personal projects, a free version is available with basic functionality. If you need advanced features such as PDF signing or form filling, you can upgrade to the commercial edition at any time.

General Workflow for PDF to Text in Python

Converting a PDF to text becomes simple and efficient with the help of Spire.PDF for Python. You can easily complete the task by reusing the sample code provided in the following sections and customizing it to fit your needs. But before diving into the code, let’s take a quick look at the general workflow behind this process.

  • Create an object of PdfDocument class and load a PDF file using LoadFromFile() method.
  • Create an object of PdfTextExtractOptions class and set the text extracting options, including extracting all text, showing hidden text, only extracting text in a specified area, and simple extraction.
  • Get a page in the document using PdfDocument.Pages.get_Item() method and create PdfTextExtractor objects based on each page to extract the text from the page using Extract() method with specified options.
  • Save the extracted text as a text file and close the object.

How to Convert PDF to Text in Python Without Layout

If you only need the plain text content from a PDF and don’t care about preserving the original layout, you can use a simple method to extract text. This approach is faster and easier, especially when working with scanned documents or large batches of files. In this section, we’ll show you how to convert PDF to text in Python without preserving the layout.

To extract text without preserving layout, follow these simplified steps:

  • Create an instance of PdfDocument and load the PDF file.
  • Create a PdfTextExtractOptions object and configure the text extraction options.
  • Set IsSimpleExtraction = True to ignore the layout and extract raw text.
  • Loop through all pages of the PDF.
  • Extract text from each page and write it to a .txt file.
from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a string object to store the text
extracted_text = ""

# Create an object of PdfExtractor
extract_options = PdfTextExtractOptions()
# Set to use simple extraction method
extract_options.IsSimpleExtraction = True

# Loop through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Create an object of PdfTextExtractor passing the page as paramter
    text_extractor = PdfTextExtractor(page)
    # Extract the text from the page
    text = text_extractor.ExtractText(extract_options)
    # Add the extracted text to the string object
    extracted_text += text

# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w") as file:
    file.write(extracted_text)
pdf.Close()

Convert PDF to text without layout

How to Convert PDF to Text in Python With Layout

To convert PDF to text in Python with layout, Spire.PDF preserves formatting like tables and paragraphs by default. The steps are similar to the general overview, but you still need to loop through each page for full-text extraction.

from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a string object to store the text
extracted_text = ""

# Create an object of PdfExtractor
extract_options = PdfTextExtractOptions()

# Loop through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Create an object of PdfTextExtractor passing the page as paramter
    text_extractor = PdfTextExtractor(page)
    # Extract the text from the page
    text = text_extractor.ExtractText(extract_options)
    # Add the extracted text to the string object
    extracted_text += text

# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w") as file:
    file.write(extracted_text)
pdf.Close()

Convert PDF to text without layout

Convert a Specific PDF Page to Text in Python

Need to extract text from only one page of a PDF instead of the entire document? With Spire.PDF, the PDF to Text converter in Python, you can easily target and convert a specific PDF page to text. The steps are the same as shown in the general overview. If you're already familiar with them, just copy the code below into any Python editor and automate your PDF to text conversion!

from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor
from spire.pdf import RectangleF

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create an object of PdfExtractor
extract_options = PdfTextExtractOptions()

# Set to extract specific page area
extract_options.ExtractArea = RectangleF(50.0, 220.0, 700.0, 230.0)

# Get a page
page = pdf.Pages.get_Item(0)

# Create an object of PdfTextExtractor passing the page as paramter
text_extractor = PdfTextExtractor(page)

# Extract the text from the page
extracted_text = text_extractor.ExtractText(extract_options)

# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w") as file:
    file.write(extracted_text)
pdf.Close()

Convert PDF to text without layout

To Wrap Up

In this post, we covered how to convert PDF to text using Python and Spire.PDF, with clear steps and code examples for fast, efficient conversion. We also highlighted the benefits and pointed to OCR tools for image-based PDFs. For any issues or support, feel free to contact us.

FAQs about Converting PDF to Text

Q1: How do I convert a PDF to readable and editable text in Python?
A: You can convert a PDF to text in Python using the Spire.PDF library. It allows you to extract text from PDF files while optionally keeping the original layout. You don’t need Adobe Acrobat, and both visible and image-based PDFs are supported.

Q2: Is there a free tool to convert PDF to text?
A: Yes. Spire.PDF for Python provides a free edition that allows you to convert PDF to text without relying on Adobe Acrobat or other software. Online tools are also available, but they’re more suitable for occasional use or small files.

Q3: Can Python extract data from PDF? A: Yes, Python can extract data from PDF files. Using Spire.PDF, you can easily extract not only text but also other elements such as images, annotations, bookmarks, and even attachments. This makes it a versatile tool for working with PDF content in Python.

SEE ALSO:

Combine Excel Workbooks or Worksheets into One using Python

Merging Excel files is a common task for data analysts and financial teams working with large datasets. While Microsoft Excel supports manual merging, it becomes inefficient and error-prone when dealing with large volumes of files.

In this step-by-step guide, you will learn how to merge multiple Excel files (.xls and .xlsx) using Python and Spire.XLS for Python library. Whether you're combining workbooks, merging worksheets, or automating bulk Excel file processing, this guide will help you save time and streamline your workflow with practical solutions.

Table of Contents

Why Merge Excel Files with Python?

Using Python to merge Excel files brings several key advantages:

  • Automation: Save time and eliminate repetitive manual work by automating the merging process.
  • No Excel Dependency: Merge files without installing Microsoft Excel—ideal for headless, server-side, or cloud environments.
  • Flexible Merging: Customize merging by selecting specific sheets, ranges, columns, or rows.
  • Scalability: Handle hundreds or even thousands of Excel files with consistent performance.
  • Error Reduction: Reduce manual errors and ensure data accuracy with automated scripts.

Whether you’re consolidating monthly reports or merging large datasets, Python helps streamline the process efficiently.

Getting Started with Spire.XLS for Python

Spire.XLS for Python is a standalone library that allows developers to create, read, edit, and save Excel files without the need for Microsoft Excel installation.

Key Features Include:

  • Supports Multiple Formats: .xls, .xlsx, and more.
  • Worksheet Operations: Copy, rename, delete, and merge worksheets seamlessly across workbooks.
  • Formula & Formatting Preservation: Retain formulas and formatting during editing or merging.
  • Advanced Features: Includes chart creation, conditional formatting, pivot tables, and more.
  • File Conversion: Convert Excel files to PDF, HTML, CSV, and more.

Installation

Run the following pip command in your terminal or command prompt to install Spire.XLS from PyPI:

pip install spire.xls

How to Merge Multiple Excel Files into One Workbook using Python

When working with multiple Excel files, consolidating all worksheets into a single workbook can simplify data management and reporting. This approach preserves each original worksheet separately, making it easy to organize and review data from different sources such as department budgets, regional reports, or monthly summaries.

Steps

To merge multiple Excel files into a single workbook using Python, follow these steps:

  • Loop through the files.
  • Load each Excel file using LoadFromFile().
  • For the first file, assign it as the base workbook.
  • For subsequent files, copy all worksheets into the base workbook using AddCopy().
  • Save the final combined workbook to a new file.

Code Example

import os
from spire.xls import *

# Folder containing Excel files to merge
input_folder = './sample_files'   
# Output file name for the merged workbook       
output_file = 'merged_workbook.xlsx'    

# Initialize merged workbook as None
merged_workbook = None  

# Iterate over all files in the input folder
for filename in os.listdir(input_folder):
    # Process only Excel files with .xls or .xlsx extensions
    if filename.endswith('.xlsx') or filename.endswith('.xls'):
        file_path = os.path.join(input_folder, filename)
        
        # Load the current Excel file into a Workbook object
        source_workbook = Workbook()
        source_workbook.LoadFromFile(file_path)

        if merged_workbook is None:
            # For the first file, assign it as the base merged workbook
            merged_workbook = source_workbook
        else:
            # For subsequent files, copy each worksheet into the merged workbook
            for i in range(source_workbook.Worksheets.Count):
                sheet = source_workbook.Worksheets.get_Item(i)
                merged_workbook.Worksheets.AddCopy(sheet, WorksheetCopyType.CopyAll)

# Save the combined workbook to the specified output file
merged_workbook.SaveToFile(output_file, ExcelVersion.Version2016)

Consolidate Excel Files into One using Python

How to Combine Multiple Excel Worksheets into a Single Worksheet using Python

Merging data from multiple Excel worksheets into one worksheet allows you to aggregate information efficiently, especially when working with data such as sales logs, survey responses, or performance reports.

Steps

To combine worksheet data from multiple Excel files into a single worksheet using Python, follow these steps:

  • Create a new workbook and select its first worksheet as the destination.
  • Loop through the files.
  • Load each Excel file using LoadFromFile().
  • Get the desired worksheet that you want to merge from the current file.
  • Copy the used cell range from the desired worksheet to the destination worksheet, placing data consecutively below the previously copied content.
  • Save the combined data into a new Excel file.

Code Example

import os
from spire.xls import *

# Folder containing Excel files to merge
input_folder = './excel_worksheets'
# Output file name for the merged workbook
output_file = 'merged_into_one_sheet.xlsx'

# Create a new workbook to hold merged data
merged_workbook = Workbook()
# Use the first worksheet in the new workbook as the merge target
merged_sheet = merged_workbook.Worksheets[0]

# Initialize the starting row for copying data
current_row = 1

# Loop through all files in the input folder
for filename in os.listdir(input_folder):
    # Process only Excel files (.xls or .xlsx)
    if filename.endswith('.xlsx') or filename.endswith('.xls'):
        file_path = os.path.join(input_folder, filename)

        # Load the current Excel file
        workbook = Workbook()
        workbook.LoadFromFile(file_path)

        # Get the first worksheet from the current workbook
        sheet = workbook.Worksheets[0]

        # Get the used range from the first worksheet
        source_range = sheet.Range

        # Set the destination range in the merged worksheet starting at current_row
        dest_range = merged_sheet.Range[current_row, 1]

        # Copy data from the used range to the destination range
        source_range.Copy(dest_range)

        # Update current_row to the row after the last copied row to prevent overlap
        current_row += sheet.LastRow

# Save the merged workbook to the specified output file in Excel 2016 format
merged_workbook.SaveToFile(output_file, ExcelVersion.Version2016)

Merge Excel Worksheets into One using Python

Conclusion

When merging multiple Excel files into a single document—whether by appending sheets or combining data row by row—using a Python library like Spire.XLS enables automation and improves accuracy. This approach can help streamline workflows, especially in enterprise scenarios that require handling large datasets without relying on Microsoft Excel.

FAQs: Merge Excel Files with Python

Q1: Can I merge .xls and .xlsx files together?

A1: Yes. Spire.XLS handles both formats without needing conversion.

Q2: Do I need Excel installed on my machine to use Spire.XLS?

A2: No. Spire.XLS is standalone and works without Microsoft Office installed.

Q3: Can I merge only specific sheets from each workbook?

A3: Yes. You can customize your code to merge sheets by name or index. For example:

sheet = source_workbook.Worksheets["Summary"]

Q4: How do I avoid copying header rows multiple times?

A4: Add logic like:

if current_row > 1:
    start_row = 2 # Skip header

else:
    start_row = 1

Q5: Can I keep track of which file each row came from?

A5: Yes. Add a new column in the merged sheet containing the source file name for each row.

Q6: Is there a file size or row limit when using Spire.XLS?

A6: Spire.XLS follows the same row and column limits as Excel: .xlsx supports up to 1,048,576 rows × 16,384 columns, and .xls supports up to 65,536 rows × 256 columns.

Q7: Can I preserve formulas and formatting while merging?

A7: Yes. When merging Excel files, formatting and formulas are preserved.

Thursday, 19 October 2023 01:08

Adding Watermarks to PDF Files Using Python

Python Add watermarks to PDF

Watermarking is a critical technique for securing documents, indicating ownership, and preventing unauthorized copying. Whether you're distributing drafts or branding final deliverables, applying watermarks helps protect your content effectively. In this tutorial, you’ll learn how to add watermarks to a PDF in Python using the powerful and easy-to-use Spire.PDF for Python library.

We'll walk through how to insert both text and image watermarks , handle transparency and positioning, and resolve common issues — all with clean, well-documented code examples.

Table of Contents:

Python Library for Watermarking PDFs

Spire.PDF for Python is a robust library that provides comprehensive PDF manipulation capabilities. For watermarking specifically, it offers:

  • High precision in watermark placement and rotation.
  • Flexible transparency controls.
  • Support for both text and image watermarks.
  • Ability to apply watermarks to specific pages or entire documents.
  • Preservation of original PDF quality.

Before proceeding, ensure you have Spire.PDF installed in your Python environment:

pip install spire.pdf

Adding a Text Watermark to a PDF

This code snippet demonstrates how to add a diagonal "DO NOT COPY" watermark to each page of a PDF file. It manages the size, color, positioning, rotation, and transparency of the watermark for a professional result.

from spire.pdf import *
from spire.pdf.common import *
import math

# Create an object of PdfDocument class
doc = PdfDocument()

# Load a PDF document from the specified path
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf")

# Create an object of PdfTrueTypeFont class for the watermark font
font = PdfTrueTypeFont("Times New Roman", 48.0, 0, True)

# Specify the watermark text
text = "DO NOT COPY"

# Measure the dimensions of the text to ensure proper positioning
text_width = font.MeasureString(text).Width
text_height = font.MeasureString(text).Height

# Loop through each page in the document
for i in range(doc.Pages.Count):

    # Get the current page
    page = doc.Pages.get_Item(i)
    
    # Save the current canvas state
    state = page.Canvas.Save()
 
    # Calculate the center coordinates of the page
    x = page.Canvas.Size.Width  / 2
    y = page.Canvas.Size.Height / 2

    # Translate the coodinate system to the center so that the center of the page becomes the origin (0, 0)
    page.Canvas.TranslateTransform(x, y)
    
    # Rotate the canvas 45 degrees counterclockwise for the watermark
    page.Canvas.RotateTransform(-45.0)

    # Set the transparency of the watermark
    page.Canvas.SetTransparency(0.7)
    
    # Draw the watermark text at the centered position using negative offsets 
    page.Canvas.DrawString(text, font, PdfBrushes.get_Blue(), PointF(-text_width / 2, -text_height / 2))
    
    # Restore the canvas state to prevent transformations from affecting subsequent drawings
    page.Canvas.Restore(state)

# Save the modified document to a new PDF file
doc.SaveToFile("output/TextWatermark.pdf")

# Dispose resources
doc.Dispose()

Breakdown of the Code :

  1. Load the PDF Document : The script loads an input PDF file from a specified path using the PdfDocument class.
  2. Configure Watermark Text : A watermark text ("DO NOT COPY") is set with a specific font (Times New Roman, 48pt) and measured for accurate positioning.
  3. Apply Transformations : For each page, the script:
    • Centers the coordinate system.
    • Rotates the canvas by 45 degrees counterclockwise.
    • Sets transparency (70%) for the watermark.
  4. Draw the Watermark : The text is drawn at (-text_width / 2, -text_height / 2), which aligns the text perfectly around the center point of the page, regardless of the rotation applied.
  5. Save the Document : The modified document is saved to a new PDF file.

Output:

Add a text watermark to PDF

Adding an Image Watermark to a PDF

This code snippet adds a semi-transparent image watermark to each page of a PDF, ensuring proper positioning and a professional appearance.

from spire.pdf import *
from spire.pdf.common import *

# Create an object of PdfDocument class
doc = PdfDocument()

# Load a PDF document from the specified path
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf")

# Load the watermark image from the specified path
image = PdfImage.FromFile("C:\\Users\\Administrator\\Desktop\\logo.png")

# Get the width and height of the loaded image for positioning
imageWidth = float(image.Width)
imageHeight = float(image.Height)

# Loop through each page in the document to apply the watermark
for i in range(doc.Pages.Count):
    # Get the current page
    page = doc.Pages.get_Item(i)

    # Set the transparency of the watermark to 50%
    page.Canvas.SetTransparency(0.5)

    # Get the dimensions of the current page
    pageWidth = page.ActualSize.Width
    pageHeight = page.ActualSize.Height

    # Calculate the x and y coordinates to center the image on the page
    x = (pageWidth - imageWidth) / 2
    y = (pageHeight - imageHeight) / 2

    # Draw the image at the calculated center position on the page
    page.Canvas.DrawImage(image, x, y, imageWidth, imageHeight)

# Save the modified document to a new PDF file
doc.SaveToFile("output/ImageWatermark.pdf")

# Dispose resources
doc.Dispose()

Breakdown of the Code :

  1. Load the PDF Document : The script loads an input PDFfile from a specified path using the PdfDocument class.
  2. Configure Watermark Image : The watermark image is loaded from a specified path, and its dimensions are retrieved for accurate positioning.
  3. Apply Transformations : For each page, the script:
    • Sets the watermark transparency (50%).
    • Calculates the center position of the page for the watermark.
  4. Draw the Watermark : The image is drawn at the calculated center coordinates, ensuring it is centered on each page.
  5. Save the Document : The modified document is saved to a new PDF file.

Output:

Add an image watermark to PDF

Apart from watermarks, you can also add stamps to PDFs. Unlike watermarks, which are fixed in place, stamps can be freely moved or deleted, offering greater flexibility in document annotation.

Troubleshooting Common Issues

  1. Watermark Not Appearing :
    • Verify file paths are correct.
    • Check transparency isn't set to 0 (fully transparent).
    • Ensure coordinates place the watermark within page bounds.
  2. Quality Issues :
    • For text, use higher-quality fonts.
    • For images, ensure adequate resolution.
  3. Rotation Problems :
    • Remember that rotation occurs around the current origin point.
    • The order of transformations matters (translate then rotate).

Wrapping Up

With Spire.PDF for Python, adding watermarks to PDF documents becomes a simple and powerful process. Whether you need bold "Confidential" text across every page or subtle branding with logos, the library handles it all efficiently. By combining coordinate transformations, transparency settings, and drawing commands, you can create highly customized watermarking workflows tailored to your document's purpose.

FAQs

Q1. Can I add both text and image watermarks to the same PDF?

Yes, you can combine both approaches in a single loop over the PDF pages.

Q2. How can I rotate image watermarks?

Use Canvas.RotateTransform(angle) before drawing the image, similar to the text watermark example.

Q3. Does Spire.PDF support transparent PNGs for watermarks?

Yes, Spire.PDF preserves the transparency of PNG images when used as watermarks.

Q4. Can I apply different watermarks to different pages?

Absolutely. You can implement conditional logic within your page loop to apply different watermarks based on page number or other criteria.

Get a Free License

To fully experience the capabilities of Spire.PDF for Python without any evaluation limitations, you can request a free 30-day trial license.

Wednesday, 08 March 2023 01:26

C++: Create Tables in Word Documents

A table is a powerful tool for organizing and presenting data. It arranges data into rows and columns, making it easier for authors to illustrate the relationships between different data categories and for readers to understand and analyze complex data. In this article, you will learn how to programmatically create tables in Word documents in C++ using Spire.Doc for C++.

Install Spire.Doc for C++

There are two ways to integrate Spire.Doc for C++ into your application. One way is to install it through NuGet, and the other way is to download the package from our website and copy the libraries into your program. Installation via NuGet is simpler and more recommended. You can find more details by visiting the following link.

Integrate Spire.Doc for C++ in a C++ Application

Create a Table in Word in C++

Spire.Doc for C++ offers the Section->AddTable() method to add a table to a section of a Word document. The detailed steps are as follows:

  • Initialize an instance of the Document class.
  • Add a section to the document using Document->AddSection() method.
  • Define the data for the header row and remaining rows, storing them in a one-dimensional vector and a two-dimensional vector respectively.
  • Add a table to the section using Section->AddTable() method.
  • Specify the number of rows and columns in the table using Table->ResetCells(int, int) method.
  • Add data in the one-dimensional vector to the header row and set formatting.
  • Add data in the two-dimensional vector to the remaining rows and set formatting.
  • Save the result document using Document->SaveToFile() method.
  • C++
#include "Spire.Doc.o.h"

using namespace Spire::Doc;
using namespace std;

int main()
{
	//Initialize an instance of the Document class
	intrusive_ptr<Document> doc = new Document();

	//Add a section to the document
	intrusive_ptr<Section> section = doc->AddSection();

	//Set page margins for the section
	section->GetPageSetup()->GetMargins()->SetAll(72);

	//Define the data for the header row
	vector<wstring> header = { L"Name", L"Capital", L"Continent", L"Area", L"Population" };
	//Define the data for the remaining rows
	vector<vector<wstring>> data =
	{
		{L"Argentina", L"Buenos Aires", L"South America", L"2777815", L"32300003"},
		{L"Bolivia", L"La Paz", L"South America", L"1098575", L"7300000"},
		{L"Brazil", L"Brasilia", L"South America", L"8511196", L"150400000"},
		{L"Canada", L"Ottawa", L"North America", L"9976147", L"26500000"},
		{L"Chile", L"Santiago", L"South America", L"756943", L"13200000"},
		{L"Colombia", L"Bogota", L"South America", L"1138907", L"33000000"},
		{L"Cuba", L"Havana", L"North America", L"114524", L"10600000"},
		{L"Ecuador", L"Quito", L"South America", L"455502", L"10600000"},
		{L"El Salvador", L"San Salvador", L"North America", L"20865", L"5300000"},
		{L"Guyana", L"Georgetown", L"South America", L"214969", L"800000"},
		{L"Jamaica", L"Kingston", L"North America", L"11424", L"2500000"},
		{L"Mexico", L"Mexico City", L"North America", L"1967180", L"88600000"},
		{L"Nicaragua", L"Managua", L"North America", L"139000", L"3900000"},
		{L"Paraguay", L"Asuncion", L"South America", L"406576", L"4660000"},
		{L"Peru", L"Lima", L"South America", L"1285215", L"21600000"},
		{L"United States", L"Washington", L"North America", L"9363130", L"249200000"},
		{L"Uruguay", L"Montevideo", L"South America", L"176140", L"3002000"},
		{L"Venezuela", L"Caracas", L"South America", L"912047", L"19700000"}
	};

	//Add a table to the section
	intrusive_ptr<Table> table = section->AddTable(true);
	//Specify the number of rows and columns for the table
	table->ResetCells(data.size() + 1, header.size());

	//Set the first row as the header row
	intrusive_ptr<TableRow> row = table->GetRows()->GetItemInRowCollection(0);
	row->SetIsHeader(true);

	//Set height and background color for the header row
	row->SetHeight(20);
	row->SetHeightType(TableRowHeightType::Exactly);
	for (int i = 0; i < row->GetCells()->GetCount(); i++)
	{
		row->GetCells()->GetItemInCellCollection(i)->GetCellFormat()->GetShading()->SetBackgroundPatternColor(Color::FromArgb(142, 170, 219));
	}

	//Add data to the header row and set formatting
	for (size_t i = 0; i < header.size(); i++)
	{
		//Add a paragraph
		intrusive_ptr<Paragraph> p1 = row->GetCells()->GetItemInCellCollection(i)->AddParagraph();
		//Set alignment
		p1->GetFormat()->SetHorizontalAlignment(HorizontalAlignment::Center);
		row->GetCells()->GetItemInCellCollection(i)->GetCellFormat()->SetVerticalAlignment(VerticalAlignment::Middle);
		//Add data
		intrusive_ptr<TextRange> tR1 = p1->AppendText(header[i].c_str());
		//Set data formatting
		tR1->GetCharacterFormat()->SetFontName(L"Calibri");
		tR1->GetCharacterFormat()->SetFontSize(12);
		tR1->GetCharacterFormat()->SetBold(true);
	}

	//Add data to the remaining rows and set formatting
	for (size_t r = 0; r < data.size(); r++)
	{
		//Set height for the remaining rows
		intrusive_ptr<TableRow> dataRow = table->GetRows()->GetItemInRowCollection(r + 1);
		dataRow->SetHeight(20);
		dataRow->SetHeightType(TableRowHeightType::Exactly);

		for (size_t c = 0; c < data[r].size(); c++)
		{
			//Add a paragraph
			intrusive_ptr<Paragraph> p2 = dataRow->GetCells()->GetItemInCellCollection(c)->AddParagraph();
			//Set alignment
			dataRow->GetCells()->GetItemInCellCollection(c)->GetCellFormat()->SetVerticalAlignment(VerticalAlignment::Middle);
			//Add data
			intrusive_ptr<TextRange> tR2 = p2->AppendText(data[r][c].c_str());
			//Set data formatting
			tR2->GetCharacterFormat()->SetFontName(L"Calibri");
			tR2->GetCharacterFormat()->SetFontSize(11);
		}
	}

	//Save the result document
	doc->SaveToFile(L"CreateTable.docx", FileFormat::Docx2013);
	doc->Close();
}

C++: Create Tables in Word Documents

Create a Nested Table in Word in C++

Spire.Doc for C++ offers the TableCell->AddTable() method to add a nested table to a specific table cell. The detailed steps are as follows:

  • Initialize an instance of the Document class.
  • Add a section to the document using Document->AddSection() method.
  • Add a table to the section using Section.AddTable() method.
  • Specify the number of rows and columns in the table using Table->ResetCells(int, int) method.
  • Get the rows of the table and add data to the cells of each row.
  • Add a nested table to a specific table cell using TableCell->AddTable() method.
  • Specify the number of rows and columns in the nested table.
  • Get the rows of the nested table and add data to the cells of each row.
  • Save the result document using Document->SaveToFile() method.
  • C++
#include "Spire.Doc.o.h"

using namespace Spire::Doc;
using namespace std;

int main()
{
	//Initialize an instance of the Document class
	intrusive_ptr<Document> doc = new Document();
	//Add a section to the document
	intrusive_ptr<Section> section = doc->AddSection();

	//Set page margins for the section
	section->GetPageSetup()->GetMargins()->SetAll(72);

	//Add a table to the section
	intrusive_ptr<Table> table = section->AddTable(true);
	//Set the number of rows and columns in the table
	table->ResetCells(2, 2);

	//Autofit the table width to window
	table->AutoFit(AutoFitBehaviorType::AutoFitToWindow);

	//Get the table rows
	intrusive_ptr<TableRow> row1 = table->GetRows()->GetItemInRowCollection(0);
	intrusive_ptr<TableRow> row2 = table->GetRows()->GetItemInRowCollection(1);

	//Add data to cells of the table
	intrusive_ptr<TableCell> cell1 = row1->GetCells()->GetItemInCellCollection(0);
	intrusive_ptr<TextRange> tR = cell1->AddParagraph()->AppendText(L"Product");
	tR->GetCharacterFormat()->SetFontSize(13);
	tR->GetCharacterFormat()->SetBold(true);
	intrusive_ptr<TableCell> cell2 = row1->GetCells()->GetItemInCellCollection(1);
	tR = cell2->AddParagraph()->AppendText(L"Description");
	tR->GetCharacterFormat()->SetFontSize(13);
	tR->GetCharacterFormat()->SetBold(true);
	intrusive_ptr<TableCell> cell3 = row2->GetCells()->GetItemInCellCollection(0);
	cell3->AddParagraph()->AppendText(L"Spire.Doc for C++");
	intrusive_ptr<TableCell> cell4 = row2->GetCells()->GetItemInCellCollection(1);
	cell4->AddParagraph()->AppendText(L"Spire.Doc for C++ is a professional Word "
		L"library specifically designed for developers to create, "
		L"read, write and convert Word documents in C++ "
		L"applications with fast and high-quality performance.");

	//Add a nested table to the fourth cell
	intrusive_ptr<Table> nestedTable = cell4->AddTable(true);
	//Set the number of rows and columns in the nested table
	nestedTable->ResetCells(3, 2);

	//Autofit the table width to content
	nestedTable->AutoFit(AutoFitBehaviorType::AutoFitToContents);

	//Get table rows
	intrusive_ptr<TableRow> nestedRow1 = nestedTable->GetRows()->GetItemInRowCollection(0);
	intrusive_ptr<TableRow> nestedRow2 = nestedTable->GetRows()->GetItemInRowCollection(1);
	intrusive_ptr<TableRow> nestedRow3 = nestedTable->GetRows()->GetItemInRowCollection(2);

	//Add data to cells of the nested table
	intrusive_ptr<TableCell> nestedCell1 = nestedRow1->GetCells()->GetItemInCellCollection(0);
	tR = nestedCell1->AddParagraph()->AppendText(L"Item");
	tR->GetCharacterFormat()->SetBold(true);
	intrusive_ptr<TableCell> nestedCell2 = nestedRow1->GetCells()->GetItemInCellCollection(1);
	tR = nestedCell2->AddParagraph()->AppendText(L"Price");
	tR->GetCharacterFormat()->SetBold(true);
	intrusive_ptr<TableCell> nestedCell3 = nestedRow2->GetCells()->GetItemInCellCollection(0);
	nestedCell3->AddParagraph()->AppendText(L"Developer Subscription");
	intrusive_ptr<TableCell> nestedCell4 = nestedRow2->GetCells()->GetItemInCellCollection(1);
	nestedCell4->AddParagraph()->AppendText(L"$999");
	intrusive_ptr<TableCell> nestedCell5 = nestedRow3->GetCells()->GetItemInCellCollection(0);
	nestedCell5->AddParagraph()->AppendText(L"Developer OEM Subscription");
	intrusive_ptr<TableCell> nestedCell6 = nestedRow3->GetCells()->GetItemInCellCollection(1);
	nestedCell6->AddParagraph()->AppendText(L"$2999");

	//Save the result document
	doc->SaveToFile(L"CreateNestedTable.docx", FileFormat::Docx2013);
	doc->Close();
}

C++: Create Tables in Word Documents

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Visual guide of java write to excel

Looking to automate Excel data entry in Java? Manually inputting data into Excel worksheets is time-consuming and error-prone, especially when dealing with large datasets. The good news is that with the right Java Excel library, you can streamline this process. This comprehensive guide explores three efficient methods to write data to Excel in Java using the powerful Spire.XLS for Java library, covering basic cell-by-cell entries, bulk array inserts, and DataTable exports.

Prerequisites: Setup & Installation

Before you start, you’ll need to add Spire.XLS for Java to your project. Here’s how to do it quickly:

Option 1: Download the JAR File

Option 2: Use Maven

If you’re using Maven, add the following repository and dependency to your pom.xml file. This automatically downloads and integrates the library:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependency>
    <groupId>e-iceblue</groupId>
    <artifactId>spire.xls</artifactId>
    <version>15.7.7</version>
</dependency>

3 Ways to Write Data to Excel using Java

Spire.XLS for Java offers flexible methods to write data, tailored to different scenarios. Let’s explore each with complete code samples, explanations, and use cases.

1. Write Text or Numbers to Excel Cells

Need to populate individual cells with text or numbers? Spire.XLS lets you directly target a specific cell using row/column indices (e.g., (2,1) for row 2, column 1) or Excel-style references (e.g., "A1", "B3"):

How It Works:

  • Use the Worksheet.get(int row, int column) or Worksheet.get(String name) method to access a specific Excel cell.
  • Use the setValue() method to write a text value to the cell.
  • Use the setNumberValue() method to write a numeric value to the cell.

**Java code to write data to Excel: **

import com.spire.xls.*;

public class WriteToCells {

    public static void main(String[] args) {

        // Create a Workbook object
        Workbook workbook = new Workbook();

        // Get the first worksheet
        Worksheet worksheet = workbook.getWorksheets().get(0);

        // Write data to specific cells
        worksheet.get("A1").setValue("Name");
        worksheet.get("B1").setValue("Age");
        worksheet.get("C1").setValue("Department");
        worksheet.get("D1").setValue("Hiredate");
        worksheet.get(2,1).setValue("Hazel");
        worksheet.get(2,2).setNumberValue(29);
        worksheet.get(2,3).setValue("Marketing");
        worksheet.get(2,4).setValue("2019-07-01");
        worksheet.get(3,1).setValue("Tina");
        worksheet.get(3,2).setNumberValue(31);
        worksheet.get(3,3).setValue("Technical Support");
        worksheet.get(3,4).setValue("2015-04-27");

        // Autofit column widths
        worksheet.getAllocatedRange().autoFitColumns();

        // Apply a style to the first row
        CellStyle style = workbook.getStyles().addStyle("newStyle");
        style.getFont().isBold(true);
        worksheet.getRange().get(1,1,1,4).setStyle(style);

        // Save to an Excel file
        workbook.saveToFile("output/WriteToCells.xlsx", ExcelVersion.Version2016);
    }
}

When to use this: Small datasets where you need precise control over cell placement (e.g., adding a title, single-row entries).

Write data to specific cells in Excel.

2. Write Arrays to Excel Worksheets

For bulk data, writing arrays (1D or 2D) is far more efficient than updating cells one by one. Spire.XLS for Java allows inserting arrays into a contiguous cell range.

insertArray() Method Explained:

The insertArray() method handles 1D arrays (single rows) and 2D arrays (multiple rows/columns) effortlessly. Its parameters are:

  • Object[] array/ Object[][] array: The 1D or 2D array containing data to insert.
  • int firstRow: The starting row index (1-based).
  • int firstColumn: The starting column index (1-based).
  • boolean isVertical: A boolean indicating the insertion direction:
    • false: Insert horizontally (left to right).
    • true: Insert vertically (top to bottom).

**Java code to insert arrays into Excel: **

import com.spire.xls.*;

public class WriteArrayToWorksheet {

    public static void main(String[] args) {

        // Create a Workbook instance
        Workbook workbook = new Workbook();

        // Get the first worksheet
        Worksheet worksheet = workbook.getWorksheets().get(0);

        // Create a one-dimensional array
        Object[] oneDimensionalArray = {"January", "February", "March", "April","May", "June"};

        // Write the array to the first row of the worksheet
        worksheet.insertArray(oneDimensionalArray, 1, 1, false);

        // Create a two-dimensional array
        Object[][] twoDimensionalArray = {
                {"Name", "Age", "Sex", "Dept.", "Tel."},
                {"John", "25", "Male", "Development","654214"},
                {"Albert", "24", "Male", "Support","624847"},
                {"Amy", "26", "Female", "Sales","624758"}
        };

        // Write the array to the worksheet starting from the cell A3
        worksheet.insertArray(twoDimensionalArray, 3, 1);

        // Autofit column width in the located range
        worksheet.getAllocatedRange().autoFitColumns();

        // Apply a style to the first and the third row
        CellStyle style = workbook.getStyles().addStyle("newStyle");
        style.getFont().isBold(true);
        worksheet.getRange().get(1,1,1,6).setStyle(style);
        worksheet.getRange().get(3,1,3,6).setStyle(style);

        // Save to an Excel file
        workbook.saveToFile("WriteArrays.xlsx", ExcelVersion.Version2016);
    }
}

When to use this: Sequential data (e.g., inventory logs, user lists) that needs bulk insertion.

Insert 1D and 2D arrays into an Excel sheet.

3. Write DataTable to Excel

If your data is stored in a DataTable (e.g., from a database), Spire.XLS lets you directly export it to Excel with insertDataTable(), preserving structure and column headers.

insertDataTable() Method Explained:

The insertDataTable() method is a sophisticated bulk-insert operation designed specifically for transferring structured data collections into Excel. Its parameters are:

  • DataTable dataTable: The DataTable object containing the data to insert.
  • boolean columnHeaders: A boolean indicating whether to include column names from the DataTable as headers in Excel.
    • true: Inserts column names as the first row.
    • false: Skips column names; data starts from the first row.
  • int firstRow: The starting row index (1-based).
  • int firstColumn: The starting column index (1-based).
  • boolean transTypes: A boolean indicating whether to preserve data types.

Java code to export DataTable to Excel:

import com.spire.xls.*;
import com.spire.xls.data.table.DataRow;
import com.spire.xls.data.table.DataTable;

public class WriteDataTableToWorksheet {

    public static void main(String[] args) throws Exception {

        // Create a Workbook instance
        Workbook workbook = new Workbook();

        // Get the first worksheet
        Worksheet worksheet = workbook.getWorksheets().get(0);

        // Create a DataTable object
        DataTable dataTable = new DataTable();
        dataTable.getColumns().add("SKU", Integer.class);
        dataTable.getColumns().add("NAME", String.class);
        dataTable.getColumns().add("PRICE", String.class);

        // Create rows and add data
        DataRow dr = dataTable.newRow();
        dr.setInt(0, 512900512);
        dr.setString(1,"Wireless Mouse M200");
        dr.setString(2,"$85");
        dataTable.getRows().add(dr);
        dr = dataTable.newRow();
        dr.setInt(0,512900637);
        dr.setString(1,"B100 Cored Mouse ");
        dr.setString(2,"$99");
        dataTable.getRows().add(dr);
        dr = dataTable.newRow();
        dr.setInt(0,512901829);
        dr.setString(1,"Gaming Mouse");
        dr.setString(2,"$125");
        dataTable.getRows().add(dr);
        dr = dataTable.newRow();
        dr.setInt(0,512900386);
        dr.setString(1,"ZM Optical Mouse");
        dr.setString(2,"$89");
        dataTable.getRows().add(dr);

        // Write datatable to the worksheet
        worksheet.insertDataTable(dataTable,true,1,1,true);

        // Autofit column width in the located range
        worksheet.getAllocatedRange().autoFitColumns();

        // Apply a style to the first row
        CellStyle style = workbook.getStyles().addStyle("newStyle");
        style.getFont().isBold(true);
        worksheet.getRange().get(1,1,1,3).setStyle(style);

        // Save to an Excel file
        workbook.saveToFile("output/WriteDataTable.xlsx", ExcelVersion.Version2016);
    }
}

When to use this: Database exports, CRM data, or any structured data stored in a DataTable (e.g., SQL query results, CSV imports).

Export a Datatable to an Excel worksheet.

Performance Tips for Large Datasets

  • Use bulk operations (insertArray()/insertDataTable()) instead of writing cells one by one.
  • Disable auto-fit columns or styling during data insertion, then apply them once after all data is written.
  • For datasets with 100,000+ rows, consider streaming mode to reduce memory usage.

Frequently Asked Questions

Q1: What Excel formats does Spire.XLS support for writing data?

A: Spire.XLS for Java supports all major Excel formats, including:

  • Legacy formats: XLS (Excel 97-2003)
  • Modern formats: XLSX, XLSM (macro-enabled), XLSB, and more.

You can specify the output format when saving Excel with the saveToFile() method.

Q2: How do I format cells (colors, fonts, borders) when writing data?

A: Spire.XLS offers robust styling options. Check these guides:

Q3: How do I avoid the "Evaluation Warning" in output files?

A: To remove the evaluation sheets, get a 30-day free trial license here and then apply the license key in your code before creating the Workbook object:

com.spire.xls.license.LicenseProvider.setLicenseKey("Key");

Workbook workbook = new Workbook();

Final Thoughts

Mastering Excel export functionality is crucial for Java developers in data-driven applications. The Spire.XLS for Java library provides three efficient approaches to write data to Excel in Java:

  • Precision control with cell-by-cell writing
  • High-performance bulk inserts using arrays
  • Database-style exporting with DataTables

Each method serves distinct use cases - from simple reports to complex enterprise data exports. By following the examples in the article, developers can easily create and write to Excel files in Java applications.

Extract tables from PDF files in C#/.NET Extracting tables from PDF files is a common requirement in data processing, reporting, and automation tasks. PDFs are widely used for sharing structured data, but extracting tables programmatically can be challenging due to their complex layout. Fortunately, with the right tools, this process becomes straightforward. In this guide, we’ll explore how to extract tables from PDF in C# using the Spire.PDF for .NET library, and export the results to TXT and CSV formats for easy reuse.

Table of Contents:


Prerequisites for Reading PDF Tables in C#

Spire.PDF for .NET is a powerful library for processing PDF files in C# and VB.NET. It supports a wide range of PDF operations, including table extraction, text extraction, image extraction, and more.

The easiest way to add the Spire.PDF library is via NuGet Package Manager.​

1. Open Visual Studio and create a new C# project. (Here we create a Console App)

2. In Visual Studio, right-click your project > Manage NuGet Packages.

3. Search for “Spire.PDF” and install the latest version.


Understanding PDF Table Structure

Before coding, let’s clarify how PDFs store tables. Unlike Excel (which explicitly defines rows/columns), PDFs use:

  • Text Blocks: Individual text elements positioned with coordinates.
  • Borders/Lines: Visual cues (horizontal/vertical lines) that humans interpret as table edges.
  • Spacing: Consistent gaps between text blocks to indicate cells.

The Spire.PDF library infers table structure by analyzing these visual cues, matching text blocks to rows/columns based on proximity and alignment.


How to Extract Tables from PDF in C#

If you need a quick way to preview table data (e.g., debugging or verifying extraction), printing it to the console is a great starting point.

Key methods to extract data from a PDF table:

  • PdfDocument: Represents a PDF file.
  • LoadFromFile: Loads the PDF file for processing.
  • PdfTableExtractor: Analyzes the PDF to detect tables using visual cues (borders, spacing).
  • ExtractTable(pageIndex): Returns an array of PdfTable objects for the specified page.
  • GetRowCount()/GetColumnCount(): Retrieve the dimensions of each table.
  • GetText(rowIndex, columnIndex): Extracts text from the cell at the specified row and column.
using Spire.Pdf;
using Spire.Pdf.Utilities;

namespace ExtractPdfTable
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument pdf = new PdfDocument();

            // Load a PDF file
            pdf.LoadFromFile("invoice.pdf");

            // Initialize an instance of PdfTableExtractor class
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);


            // Loop through the pages 
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                // Extract tables from a specific page
                PdfTable[] tableList = extractor.ExtractTable(pageIndex);

                // Determine if the table list is null
                if (tableList != null && tableList.Length > 0)
                {
                    int tableNumber = 1;
                    // Loop through the table in the list
                    foreach (PdfTable table in tableList)
                    {
                        Console.WriteLine($"\nTable {tableNumber} on Page {pageIndex + 1}:");
                        Console.WriteLine("-----------------------------------");

                        // Get row number and column number of a certain table
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();

                        // Loop through rows and columns 
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Get text from the specific cell
                                string text = table.GetText(i, j);

                                // Print cell text to console with a separator
                                Console.Write($"{text}\t");
                            }
                            // New line after each row
                            Console.WriteLine();
                        }
                        tableNumber++;
                    }
                }
            }

            // Close the document
            pdf.Close();
        }
    }
}

When to Use This Method

  • Quick debugging or validation of extracted data.
  • Small datasets where you don’t need persistent storage.

Output: Retrieve PDF table data and output to the console

Extract data from a PDF table

Extract PDF Tables to a Text File in C#

For lightweight, human-readable storage, saving tables to a text file is ideal. This method uses StringBuilder to efficiently compile table data, preserving row breaks for readability.

Key features of extracting PDF tables and exporting to TXT:

  • Efficiency: StringBuilder minimizes memory overhead compared to string concatenation.
  • Persistent Storage: Saves data to a text file for later review or sharing.
  • Row Preservation: Uses \r\n to maintain row structure, making the text file easy to scan.
using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.Text;

namespace ExtractTableToTxt
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument pdf = new PdfDocument();

            // Load a PDF file
            pdf.LoadFromFile("invoice.pdf");

            // Create a StringBuilder object
            StringBuilder builder = new StringBuilder();

            // Initialize an instance of PdfTableExtractor class
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);

            // Declare a PdfTable array 
            PdfTable[] tableList = null;

            // Loop through the pages 
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                // Extract tables from a specific page
                tableList = extractor.ExtractTable(pageIndex);

                // Determine if the table list is null
                if (tableList != null && tableList.Length > 0)
                {
                    // Loop through the table in the list
                    foreach (PdfTable table in tableList)
                    {
                        // Get row number and column number of a certain table
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();

                        // Loop through the rows and columns 
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Get text from the specific cell
                                string text = table.GetText(i, j);

                                // Add text to the string builder
                                builder.Append(text + " ");
                            }
                            builder.Append("\r\n");
                        }
                    }
                }
            }

            // Write to a .txt file
            File.WriteAllText("ExtractPDFTable.txt", builder.ToString());
        }
    }
}

When to Use This Method

  • Archiving table data in a lightweight, universally accessible format.
  • Sharing with teams that need to scan data without spreadsheet tools.
  • Using as input for basic scripts (e.g., PowerShell) to extract specific values.

Output: Extract PDF table data and save to a text file.

Extract table data from PDF to a TXT file

Pro Tip: For VB.NET demos, convert the above code using our C# ⇆ VB.NET Converter.

Export PDF Tables to CSV in C#

CSV (Comma-Separated Values) is the industry standard for tabular data, compatible with Excel, Google Sheets, and databases. This method formats the extracted tables into a valid CSV file by quoting cells and handling special characters.

Key features of extracting tables from PDF to CSV:

  • StreamWriter: Writes data incrementally to the CSV file, reducing memory usage for large PDFs.
  • Quoted Cells: Cells are wrapped in double quotes (" ") to avoid misinterpreting commas within text as column separators.
  • UTF-8 Encoding: Supports special characters in cell text.
  • Spreadsheet Ready: Directly opens in Excel, Google Sheets, or spreadsheet tools for analysis.
using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.Text;

namespace ExtractTableToCsv
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument pdf = new PdfDocument();

            // Load a PDF file
            pdf.LoadFromFile("invoice.pdf");

            // Create a StreamWriter object for efficient CSV writing
            using (StreamWriter csvWriter = new StreamWriter("PDFtable.csv", false, Encoding.UTF8))
            {
                // Create a PdfTableExtractor object
                PdfTableExtractor extractor = new PdfTableExtractor(pdf);

                // Loop through the pages 
                for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
                {
                    // Extract tables from a specific page
                    PdfTable[] tableList = extractor.ExtractTable(pageIndex);

                    // Determine if the table list is null
                    if (tableList != null && tableList.Length > 0)
                    {
                        // Loop through the table in the list
                        foreach (PdfTable table in tableList)
                        {
                            // Get row number and column number of a certain table
                            int row = table.GetRowCount();
                            int column = table.GetColumnCount();

                            // Loop through the rows
                            for (int i = 0; i < row; i++)
                            {
                                // Creates a list to store data 
                                List<string> rowData = new List<string>();
                                // Loop through the columns
                                for (int j = 0; j < column; j++)
                                {
                                    // Retrieve text from table cells
                                    string cellText = table.GetText(i, j).Replace("\"", "\"\"");
                                    // Add the cell text to the list and wrap in double quotes
                                    rowData.Add($"\"{cellText}\"");
                                }
                                // Join cells with commas and write to CSV
                                csvWriter.WriteLine(string.Join(",", rowData));
                            }
                        }
                    }
                }
            }
        }
    }
}

When to Use This Method

  • Data analysis (import into Excel for calculations).
  • Migrating PDF tables to databases (e.g., SQL Server, PostgreSQL, MySQL).
  • Collaborating with teams that rely on spreadsheets.

Output: Parse PDF table data and export to a CSV file.

Extract table data from PDF to a CSV file

Recommendation: Integrate with Spire.XLS for .NET to extract tables from PDF to Excel directly.


Conclusion

This guide has outlined three efficient methods for extracting tables from PDFs in C#. By leveraging the Spire.PDF for .NET library, you can automate the PDF table extraction process and export results to console, TXT, or CSV for further analysis. Whether you’re building a data pipeline, report generator, or business tool, these approaches streamline workflows, save time, and minimize human error.

Refer to the online documentation and obtain a free trial license here to explore more advanced PDF operations.


FAQs

Q1: Why use Spire.PDF for .NET to extract tables?

A: Spire.PDF provides a dedicated PdfTableExtractor class that detects tables based on visual cues (borders, spacing, and text alignment), simplifying the process of parsing structured data from PDFs.

Q2: Can Spire.PDF extract tables from scanned (image-based) PDFs?

A: No. The .NET PDF library works only with text-based PDFs (where text is selectable). For scanned PDFs, use Spire.OCR to extract text before parsing tables.

Q3: Can I extract tables from multiple PDFs at once?

A: Yes. To batch-process multiple PDFs, use Directory.GetFiles() to list all PDF files in a folder, then loop through each file and run the extraction logic. For example:

string[] pdfFiles = Directory.GetFiles(@"C:\Invoices\", "*.pdf");
foreach (string file in pdfFiles)
{
// Run extraction code for each file  
}

Q4: How can I improve performance when extracting tables from large PDFs?

A: For large PDFs (100+ pages), optimize performance by:

  • Processing pages in batches instead of loading the entire PDF at once.
  • Disposing of unused PdfTable or PdfDocument objects with the using statements to free memory.
  • Skipping pages with no tables early (using if (tableList == null || tableList.Length == 0)).
Friday, 08 April 2022 07:34

Java: Convert Images to PDF

Converting images to PDF is beneficial for many reasons. For one reason, it allows you to convert images into a format that is more readable and easier to share. For another reason, it dramatically reduces the size of the file while preserving the quality of images. In this article, you will learn how to convert images to PDF in Java using Spire.PDF for Java.

There is no straightforward method provided by Spire.PDF to convert images to PDF. You could, however, create a new PDF document and draw images at the specified locations. Depending on whether the page size of the generated PDF matches the image, this topic can be divided into two subtopics.

Install Spire.PDF for Java

First, you're required to add the Spire.Pdf.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file in your application by adding the following code to your project's pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>12.5.1</version>
    </dependency>
</dependencies>

Additionally, the imgscalr library is used in the first code example to resize images. It is not necessary to install it if you do not need to adjust the image’s size.

Add an Image to PDF at a Specified Location

The following are the steps to add an image to PDF at a specified location using Spire.PDF for Java.

  • Create a PdfDocument object.
  • Set the page margins using PdfDocument.getPageSettings().setMargins() method.
  • Add a page using PdfDocument.getPages().add() method
  • Load an image using ImageIO.read() method, and get the image width and height.
  • If the image width is larger than the page (the content area) width, resize the image to make it to fit to the page width using the imgscalr library.
  • Create a PdfImage object based on the scaled image or the original image.
  • Draw the PdfImage object on the first page at (0, 0) using PdfPageBase.getCanvas().drawImage() method.
  • Save the document to a PDF file using PdfDocument.saveToFile() method.
  • Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.graphics.PdfImage;
import org.imgscalr.Scalr;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.FileInputStream;
import java.io.IOException;

public class AddImageToPdf {

    public static void main(String[] args) throws IOException {

        //Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        //Set the margins
        doc.getPageSettings().setMargins(20);

        //Add a page
        PdfPageBase page = doc.getPages().add();

        //Load an image
        BufferedImage image = ImageIO.read(new FileInputStream("C:\\Users\\Administrator\\Desktop\\announcement.jpg"));

        //Get the image width and height
        int width = image.getWidth();
        int height = image.getHeight();

        //Declare a PdfImage variable
        PdfImage pdfImage;

        //If the image width is larger than page width
        if (width > page.getCanvas().getClientSize().getWidth())
        {
            //Resize the image to make it to fit to the page width
            int widthFitRate =  width / (int)page.getCanvas().getClientSize().getWidth();
            int targetWidth = width / widthFitRate;
            int targetHeight = height / widthFitRate;
            BufferedImage scaledImage = Scalr.resize(image,Scalr.Method.QUALITY,targetWidth,targetHeight);

            //Load the scaled image to the PdfImage object
            pdfImage = PdfImage.fromImage(scaledImage);

        } else
        {
            //Load the original image to the PdfImage object
            pdfImage = PdfImage.fromImage(image);
        }

        //Draw image at (0, 0)
        page.getCanvas().drawImage(pdfImage, 0, 0, pdfImage.getWidth(), pdfImage.getHeight());

        //Save to file
        doc.saveToFile("output/AddImage.pdf");
    }
}

Java: Convert Images to PDF

Convert an Image to PDF with the Same Width and Height

The following are the steps to convert an image to a PDF with the same page size as the image using Spire.PDF for Java.

  • Create a PdfDocument object.
  • Set the page margins to zero using PdfDocument.getPageSettings().setMargins() method.
  • Load an image using ImageIO.read() method, and get the image width and height.
  • Add a page to PDF based on the size of the image using PdfDocument.getPages().add() method.
  • Create a PdfImage object based on the image.
  • Draw the PdfImage object on the first page from the coordinate (0, 0) using PdfPageBase.getCanvas().drawImage() method.
  • Save the document to a PDF file using PdfDocument.saveToFile() method.
  • Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.graphics.PdfImage;

import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.FileInputStream;
import java.io.IOException;

public class ConvertImageToPdfWithSameSize {

    public static void main(String[] args) throws IOException {

        //Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        //Set the margins to 0
        doc.getPageSettings().setMargins(0);

        //Load an image
        BufferedImage image = ImageIO.read(new FileInputStream("C:\\Users\\Administrator\\Desktop\\announcement.jpg"));

        //Get the image width and height
        int width = image.getWidth();
        int height = image.getHeight();

        //Add a page of the same size as the image
        PdfPageBase page = doc.getPages().add(new Dimension(width, height));

        //Create a PdfImage object based on the image
        PdfImage pdfImage = PdfImage.fromImage(image);

        //Draw image at (0, 0) of the page
        page.getCanvas().drawImage(pdfImage, 0, 0, pdfImage.getWidth(), pdfImage.getHeight());

        //Save to file
        doc.saveToFile("output/ConvertPdfWithSameSize.pdf");
    }
}

Java: Convert Images to PDF

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.