hayes Liu

Thursday, 28 December 2023 01:58

How to Convert PDF to Text in Python (Free & Easy Guide)

Why Choose Spire.PDF for PDF to Text
General Workflow for PDF to Text in Python
Convert PDF to Text in Python Without Layout
Convert PDF to Text in Python With Layout
Convert a Specific PDF Page to Text
To Wrap Up
FAQs

Getting Started: Why Choose Spire.PDF for PDF to Text in Python

To convert PDF files to text using Python, you’ll need a reliable PDF processing library. Spire.PDF for Python is a powerful and developer-friendly API that allows you to read, edit, and convert PDF documents in Python applications — no need for Adobe Acrobat or other third-party software.
This library is ideal for automating PDF workflows such as extracting text, adding annotations, or merging and splitting files. It supports a wide range of PDF features and works in both desktop and server environments. You can download the package and install it manually, or install Spire.PDF from PyPI with the following command:

pip install Spire.PDF

For smaller or personal projects, a free version is available with basic functionality. If you need advanced features such as PDF signing or form filling, you can upgrade to the commercial edition at any time.

General Workflow for PDF to Text in Python

Converting a PDF to text becomes simple and efficient with the help of Spire.PDF for Python. You can easily complete the task by reusing the sample code provided in the following sections and customizing it to fit your needs. But before diving into the code, let’s take a quick look at the general workflow behind this process.

Create a PdfDocument object and load a PDF file using the LoadFromFile() method.
Create a PdfTextExtractOptions object and set the extraction options, such as extracting all text, showing hidden text, extracting only the text in a specified area, or using simple extraction.
Get each page using the PdfDocument.Pages.get_Item() method, create a PdfTextExtractor object for that page, and call its ExtractText() method with the specified options.
Save the extracted text to a text file and close the document.

How to Convert PDF to Text in Python Without Layout

If you only need the plain text content of a PDF and don't care about preserving the original layout, you can use simple extraction. This approach is faster and easier, especially when processing large batches of files. In this section, we'll show you how to convert PDF to text in Python without preserving the layout.

To extract text without preserving layout, follow these simplified steps:

Create an instance of PdfDocument and load the PDF file.
Create a PdfTextExtractOptions object and configure the text extraction options.
Set IsSimpleExtraction = True to ignore the layout and extract raw text.
Loop through all pages of the PDF.
Extract text from each page and write it to a .txt file.

from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a string object to store the text
extracted_text = ""

# Create an object of PdfTextExtractOptions
extract_options = PdfTextExtractOptions()
# Set to use simple extraction method
extract_options.IsSimpleExtraction = True

# Loop through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Create an object of PdfTextExtractor, passing the page as a parameter
    text_extractor = PdfTextExtractor(page)
    # Extract the text from the page
    text = text_extractor.ExtractText(extract_options)
    # Add the extracted text to the string object
    extracted_text += text

# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w") as file:
    file.write(extracted_text)
pdf.Close()

Convert PDF to text without layout
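The sample above writes to output/ExtractedText.txt, which fails with FileNotFoundError if the output folder does not exist, and it relies on the platform's default text encoding. Here is a minimal sketch of a more defensive write step that uses only the standard library. The helper name save_text and the newline separator between pages are illustrative assumptions, not part of Spire.PDF.

```python
from pathlib import Path


def save_text(pages, path, separator="\n"):
    """Join per-page strings and write them as UTF-8, creating the folder if needed."""
    target = Path(path)
    # Create "output/" (or any missing parent folder) before writing
    target.parent.mkdir(parents=True, exist_ok=True)
    # Joining a list once avoids repeated string concatenation in the page loop
    target.write_text(separator.join(pages), encoding="utf-8")
    return target
```

To use it with the loop above, append each page's text to a list instead of concatenating strings, then call save_text(pages, "output/ExtractedText.txt") after the loop.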

How to Convert PDF to Text in Python With Layout

Spire.PDF preserves layout such as tables and paragraphs by default, so converting PDF to text with layout needs no extra option: simply leave IsSimpleExtraction unset. The steps follow the general workflow, and you still need to loop through each page to extract the full text.

from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a string object to store the text
extracted_text = ""

# Create an object of PdfTextExtractOptions (default options keep the layout)
extract_options = PdfTextExtractOptions()

# Loop through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Create an object of PdfTextExtractor, passing the page as a parameter
    text_extractor = PdfTextExtractor(page)
    # Extract the text from the page
    text = text_extractor.ExtractText(extract_options)
    # Add the extracted text to the string object
    extracted_text += text

# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w") as file:
    file.write(extracted_text)
pdf.Close()

Convert PDF to text with layout

Convert a Specific PDF Page to Text in Python

Need to extract text from only one page of a PDF instead of the entire document? With Spire.PDF for Python, you can target a single page and convert it to text. The steps follow the general workflow, except that you get one page by index instead of looping through all pages. The sample below reads the first page (index 0) and also sets ExtractArea, which limits extraction to a rectangular region of that page; remove that line to extract the whole page.

from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor
from spire.pdf import RectangleF

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create an object of PdfTextExtractOptions
extract_options = PdfTextExtractOptions()

# Set to extract specific page area
extract_options.ExtractArea = RectangleF(50.0, 220.0, 700.0, 230.0)

# Get a page
page = pdf.Pages.get_Item(0)

# Create an object of PdfTextExtractor, passing the page as a parameter
text_extractor = PdfTextExtractor(page)

# Extract the text from the page
extracted_text = text_extractor.ExtractText(extract_options)

# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w") as file:
    file.write(extracted_text)
pdf.Close()

Convert a specific PDF page to text
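The samples index pages from 0 (the loops run over range(pdf.Pages.Count)), while people usually refer to pages starting from 1. A small sketch of a conversion helper with a bounds check is shown below; page_index is a hypothetical name introduced here for illustration, not a Spire.PDF function.

```python
def page_index(page_number, page_count):
    """Convert a 1-based page number to the 0-based index expected by get_Item()."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1
```

In the sample above, this could be used as pdf.Pages.get_Item(page_index(3, pdf.Pages.Count)) to read the third page, failing early with a clear message when the page does not exist.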

To Wrap Up

In this post, we covered how to convert PDF to text using Python and Spire.PDF, with clear steps and code examples for fast, efficient conversion. We also highlighted the benefits and pointed to OCR tools for image-based PDFs. For any issues or support, feel free to contact us.

FAQs about Converting PDF to Text

Q1: How do I convert a PDF to readable and editable text in Python?
A: You can convert a PDF to text in Python using the Spire.PDF library. It allows you to extract text from PDF files while optionally keeping the original layout. You don’t need Adobe Acrobat, and both visible and image-based PDFs are supported.

Q2: Is there a free tool to convert PDF to text?
A: Yes. Spire.PDF for Python provides a free edition that allows you to convert PDF to text without relying on Adobe Acrobat or other software. Online tools are also available, but they’re more suitable for occasional use or small files.

Q3: Can Python extract data from PDF?
A: Yes, Python can extract data from PDF files. Using Spire.PDF, you can easily extract not only text but also other elements such as images, annotations, bookmarks, and even attachments. This makes it a versatile tool for working with PDF content in Python.

Why Merge Excel Files with Python?

Using Python to merge Excel files brings several key advantages:

Automation: Save time and eliminate repetitive manual work by automating the merging process.
No Excel Dependency: Merge files without installing Microsoft Excel—ideal for headless, server-side, or cloud environments.
Flexible Merging: Customize merging by selecting specific sheets, ranges, columns, or rows.
Scalability: Handle hundreds or even thousands of Excel files with consistent performance.
Error Reduction: Reduce manual errors and ensure data accuracy with automated scripts.

Whether you’re consolidating monthly reports or merging large datasets, Python helps streamline the process efficiently.

Getting Started with Spire.XLS for Python

Spire.XLS for Python is a standalone library that allows developers to create, read, edit, and save Excel files without the need for Microsoft Excel installation.

Key Features Include:

Supports Multiple Formats: .xls, .xlsx, and more.
Worksheet Operations: Copy, rename, delete, and merge worksheets seamlessly across workbooks.
Formula & Formatting Preservation: Retain formulas and formatting during editing or merging.
Advanced Features: Includes chart creation, conditional formatting, pivot tables, and more.
File Conversion: Convert Excel files to PDF, HTML, CSV, and more.

Installation

Run the following pip command in your terminal or command prompt to install Spire.XLS from PyPI:

pip install spire.xls

How to Merge Multiple Excel Files into One Workbook using Python

When working with multiple Excel files, consolidating all worksheets into a single workbook can simplify data management and reporting. This approach preserves each original worksheet separately, making it easy to organize and review data from different sources such as department budgets, regional reports, or monthly summaries.

Steps

To merge multiple Excel files into a single workbook using Python, follow these steps:

Loop through the files.
Load each Excel file using LoadFromFile().
For the first file, assign it as the base workbook.
For subsequent files, copy all worksheets into the base workbook using AddCopy().
Save the final combined workbook to a new file.

Code Example

import os
from spire.xls import *

# Folder containing Excel files to merge
input_folder = './sample_files'   
# Output file name for the merged workbook       
output_file = 'merged_workbook.xlsx'    

# Initialize merged workbook as None
merged_workbook = None  

# Iterate over all files in the input folder
for filename in os.listdir(input_folder):
    # Process only Excel files with .xls or .xlsx extensions
    if filename.endswith('.xlsx') or filename.endswith('.xls'):
        file_path = os.path.join(input_folder, filename)
        
        # Load the current Excel file into a Workbook object
        source_workbook = Workbook()
        source_workbook.LoadFromFile(file_path)

        if merged_workbook is None:
            # For the first file, assign it as the base merged workbook
            merged_workbook = source_workbook
        else:
            # For subsequent files, copy each worksheet into the merged workbook
            for i in range(source_workbook.Worksheets.Count):
                sheet = source_workbook.Worksheets.get_Item(i)
                merged_workbook.Worksheets.AddCopy(sheet, WorksheetCopyType.CopyAll)

# Save the combined workbook, provided at least one Excel file was found
if merged_workbook is not None:
    merged_workbook.SaveToFile(output_file, ExcelVersion.Version2016)

Consolidate Excel Files into One using Python
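One detail worth handling before the merge loop: os.listdir() returns names in arbitrary order, endswith() is case-sensitive, and Excel leaves temporary lock files prefixed with "~$" next to workbooks that are open. The following sketch shows one way to build a predictable file list with the standard library only; the helper name list_excel_files is an assumption made for this example.

```python
import os


def list_excel_files(folder):
    """Return full paths of .xls/.xlsx files in a folder, sorted by name."""
    names = sorted(os.listdir(folder), key=str.lower)
    return [
        os.path.join(folder, name)
        for name in names
        # Match extensions case-insensitively and skip Excel's "~$" lock files
        if name.lower().endswith((".xlsx", ".xls")) and not name.startswith("~$")
    ]
```

Sorting makes the first file, which becomes the base workbook, the same on every run.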

How to Combine Multiple Excel Worksheets into a Single Worksheet using Python

Merging data from multiple Excel worksheets into one worksheet allows you to aggregate information efficiently, especially when working with data such as sales logs, survey responses, or performance reports.

Steps

To combine worksheet data from multiple Excel files into a single worksheet using Python, follow these steps:

Create a new workbook and select its first worksheet as the destination.
Loop through the files.
Load each Excel file using LoadFromFile().
Get the desired worksheet that you want to merge from the current file.
Copy the used cell range from the desired worksheet to the destination worksheet, placing data consecutively below the previously copied content.
Save the combined data into a new Excel file.

Code Example

import os
from spire.xls import *

# Folder containing Excel files to merge
input_folder = './excel_worksheets'
# Output file name for the merged workbook
output_file = 'merged_into_one_sheet.xlsx'

# Create a new workbook to hold merged data
merged_workbook = Workbook()
# Use the first worksheet in the new workbook as the merge target
merged_sheet = merged_workbook.Worksheets[0]

# Initialize the starting row for copying data
current_row = 1

# Loop through all files in the input folder
for filename in os.listdir(input_folder):
    # Process only Excel files (.xls or .xlsx)
    if filename.endswith('.xlsx') or filename.endswith('.xls'):
        file_path = os.path.join(input_folder, filename)

        # Load the current Excel file
        workbook = Workbook()
        workbook.LoadFromFile(file_path)

        # Get the first worksheet from the current workbook
        sheet = workbook.Worksheets[0]

        # Get the used range from the first worksheet
        source_range = sheet.Range

        # Set the destination range in the merged worksheet starting at current_row
        dest_range = merged_sheet.Range[current_row, 1]

        # Copy data from the used range to the destination range
        source_range.Copy(dest_range)

        # Update current_row to the row after the last copied row to prevent overlap
        current_row += sheet.LastRow

# Save the merged workbook to the specified output file in Excel 2016 format
merged_workbook.SaveToFile(output_file, ExcelVersion.Version2016)

Merge Excel Worksheets into One using Python

Conclusion

When merging multiple Excel files into a single document—whether by appending sheets or combining data row by row—using a Python library like Spire.XLS enables automation and improves accuracy. This approach can help streamline workflows, especially in enterprise scenarios that require handling large datasets without relying on Microsoft Excel.

FAQs: Merge Excel Files with Python

Q1: Can I merge .xls and .xlsx files together?

A1: Yes. Spire.XLS handles both formats without needing conversion.

Q2: Do I need Excel installed on my machine to use Spire.XLS?

A2: No. Spire.XLS is standalone and works without Microsoft Office installed.

Q3: Can I merge only specific sheets from each workbook?

A3: Yes. You can customize your code to merge sheets by name or index. For example:

sheet = source_workbook.Worksheets["Summary"]

Q4: How do I avoid copying header rows multiple times?

A4: Add logic like:

if current_row > 1:
    start_row = 2  # Skip the header row
else:
    start_row = 1
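The snippet above only picks the starting row. When the header is skipped, the destination row counter should advance by the number of rows actually copied rather than by the sheet's last row, otherwise a blank row appears between blocks. Below is a plain-Python sketch of that bookkeeping, assuming each source sheet starts at row 1 with a single header row; plan_copy and the sample row counts are hypothetical.

```python
def plan_copy(current_row, last_row):
    """Plan one copy step using 1-based row numbers.

    Returns (start_row, rows_copied, next_row).
    """
    # Keep the header only for the first sheet written to the destination
    start_row = 2 if current_row > 1 else 1
    rows_copied = max(last_row - start_row + 1, 0)
    return start_row, rows_copied, current_row + rows_copied


current_row = 1
plan = []
for last_row in (4, 3, 5):  # last used row of three hypothetical sheets
    start_row, rows_copied, next_row = plan_copy(current_row, last_row)
    # Record (source start row, destination row, number of rows copied)
    plan.append((start_row, current_row, rows_copied))
    current_row = next_row

print(plan)         # [(1, 1, 4), (2, 5, 2), (2, 7, 4)]
print(current_row)  # 11
```

The ten merged rows are the first sheet's header and three data rows, followed by two and four data rows from the other sheets.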

Q5: Can I keep track of which file each row came from?

A5: Yes. Add a new column in the merged sheet containing the source file name for each row.
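To illustrate the idea independently of any Excel library, the sketch below merges tables held as plain Python lists and appends the source file name to every data row. The function name merge_with_source, the "Source File" column title, and the sample data are assumptions for this example; with Spire.XLS you would write the same value into an extra column of the merged sheet.

```python
def merge_with_source(tables):
    """Merge (filename, rows) pairs; rows[0] of each table is its header row."""
    merged = []
    for filename, rows in tables:
        if not rows:
            continue  # Nothing to copy from an empty sheet
        if not merged:
            # Reuse the first header and add a column for the origin
            merged.append(list(rows[0]) + ["Source File"])
        for row in rows[1:]:
            merged.append(list(row) + [filename])
    return merged


tables = [
    ("jan.xlsx", [["Item", "Qty"], ["A", 1]]),
    ("feb.xlsx", [["Item", "Qty"], ["B", 2], ["C", 3]]),
]
for row in merge_with_source(tables):
    print(row)
```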

Q6: Is there a file size or row limit when using Spire.XLS?

A6: Spire.XLS follows the same row and column limits as Excel: .xlsx supports up to 1,048,576 rows × 16,384 columns, and .xls supports up to 65,536 rows × 256 columns.
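When stacking many sheets into one worksheet, it can help to check the expected row total against these limits before saving. This is a small sketch using only the figures quoted above; fits_row_limit is a hypothetical helper name.

```python
import os

# Maximum rows per worksheet for each output format
ROW_LIMITS = {".xlsx": 1_048_576, ".xls": 65_536}


def fits_row_limit(total_rows, output_file):
    """Return True if total_rows fits in one worksheet of the output format."""
    ext = os.path.splitext(output_file)[1].lower()
    if ext not in ROW_LIMITS:
        raise ValueError(f"unsupported extension: {ext!r}")
    return total_rows <= ROW_LIMITS[ext]
```

For example, 70,000 merged rows fit in an .xlsx output but not in an .xls output.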

Q7: Can I preserve formulas and formatting while merging?

A7: Yes. When merging Excel files, formatting and formulas are preserved.

Thursday, 19 October 2023 01:08

Adding Watermarks to PDF Files Using Python

Python Add watermarks to PDF

Watermarking is a critical technique for securing documents, indicating ownership, and preventing unauthorized copying. Whether you're distributing drafts or branding final deliverables, applying watermarks helps protect your content effectively. In this tutorial, you’ll learn how to add watermarks to a PDF in Python using the powerful and easy-to-use Spire.PDF for Python library.

We'll walk through how to insert both text and image watermarks, handle transparency and positioning, and resolve common issues, all with clean, well-documented code examples.

Table of Contents:

Python Library for Watermarking PDFs
Adding a Text Watermark to a PDF
Adding an Image Watermark to a PDF
Troubleshooting Common Issues
Wrapping Up
FAQs

Python Library for Watermarking PDFs

Spire.PDF for Python is a robust library that provides comprehensive PDF manipulation capabilities. For watermarking specifically, it offers:

High precision in watermark placement and rotation.
Flexible transparency controls.
Support for both text and image watermarks.
Ability to apply watermarks to specific pages or entire documents.
Preservation of original PDF quality.

Before proceeding, ensure you have Spire.PDF installed in your Python environment:

pip install spire.pdf

Adding a Text Watermark to a PDF

This code snippet demonstrates how to add a diagonal "DO NOT COPY" watermark to each page of a PDF file. It manages the size, color, positioning, rotation, and transparency of the watermark for a professional result.

from spire.pdf import *
from spire.pdf.common import *
import math

# Create an object of PdfDocument class
doc = PdfDocument()

# Load a PDF document from the specified path
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf")

# Create an object of PdfTrueTypeFont class for the watermark font
font = PdfTrueTypeFont("Times New Roman", 48.0, 0, True)

# Specify the watermark text
text = "DO NOT COPY"

# Measure the dimensions of the text to ensure proper positioning
text_width = font.MeasureString(text).Width
text_height = font.MeasureString(text).Height

# Loop through each page in the document
for i in range(doc.Pages.Count):

    # Get the current page
    page = doc.Pages.get_Item(i)
    
    # Save the current canvas state
    state = page.Canvas.Save()
 
    # Calculate the center coordinates of the page
    x = page.Canvas.Size.Width  / 2
    y = page.Canvas.Size.Height / 2

    # Translate the coordinate system to the center so that the center of the page becomes the origin (0, 0)
    page.Canvas.TranslateTransform(x, y)
    
    # Rotate the canvas 45 degrees counterclockwise for the watermark
    page.Canvas.RotateTransform(-45.0)

    # Set the transparency of the watermark
    page.Canvas.SetTransparency(0.7)
    
    # Draw the watermark text at the centered position using negative offsets 
    page.Canvas.DrawString(text, font, PdfBrushes.get_Blue(), PointF(-text_width / 2, -text_height / 2))
    
    # Restore the canvas state to prevent transformations from affecting subsequent drawings
    page.Canvas.Restore(state)

# Save the modified document to a new PDF file
doc.SaveToFile("output/TextWatermark.pdf")

# Dispose resources
doc.Dispose()

Breakdown of the Code:

Load the PDF Document: The script loads an input PDF file from a specified path using the PdfDocument class.
Configure Watermark Text: The watermark text ("DO NOT COPY") is set with a specific font (Times New Roman, 48pt) and measured for accurate positioning.
Apply Transformations: For each page, the script:
- Moves the origin of the coordinate system to the center of the page.
- Rotates the canvas by 45 degrees counterclockwise.
- Sets the watermark transparency value to 0.7.
Draw the Watermark: The text is drawn at (-text_width / 2, -text_height / 2), which places the middle of the text on the center point of the page, regardless of the rotation applied.
Save the Document: The modified document is saved to a new PDF file.
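The claim that the offsets (-text_width / 2, -text_height / 2) keep the text centered for any rotation angle can be checked with plain 2D geometry. The sketch below maps the four corners of the text box from the translated and rotated coordinate system back to page coordinates and averages them. It uses only the math module; the page size (595 x 842 points) and the text size are made-up values, and the function names are illustrative.

```python
import math


def to_page(point, origin, angle_deg):
    """Map a point from the translated and rotated system to page coordinates."""
    a = math.radians(angle_deg)
    x, y = point
    return (
        origin[0] + x * math.cos(a) - y * math.sin(a),
        origin[1] + x * math.sin(a) + y * math.cos(a),
    )


def text_box_center(text_w, text_h, origin, angle_deg):
    """Average the four mapped corners of a box drawn at (-w/2, -h/2)."""
    hw, hh = text_w / 2, text_h / 2
    corners = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
    mapped = [to_page(c, origin, angle_deg) for c in corners]
    return (sum(p[0] for p in mapped) / 4, sum(p[1] for p in mapped) / 4)


page_center = (595 / 2, 842 / 2)  # hypothetical A4-sized page, in points
for angle in (0.0, -45.0, 30.0):
    print(angle, text_box_center(300.0, 55.0, page_center, angle))
```

For every angle, the printed center matches the page center up to floating-point rounding, because opposite corners cancel out around the origin.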

Output:

Add a text watermark to PDF

Adding an Image Watermark to a PDF

This code snippet adds a semi-transparent image watermark to each page of a PDF, ensuring proper positioning and a professional appearance.

from spire.pdf import *
from spire.pdf.common import *

# Create an object of PdfDocument class
doc = PdfDocument()

# Load a PDF document from the specified path
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf")

# Load the watermark image from the specified path
image = PdfImage.FromFile("C:\\Users\\Administrator\\Desktop\\logo.png")

# Get the width and height of the loaded image for positioning
imageWidth = float(image.Width)
imageHeight = float(image.Height)

# Loop through each page in the document to apply the watermark
for i in range(doc.Pages.Count):
    # Get the current page
    page = doc.Pages.get_Item(i)

    # Set the transparency of the watermark to 50%
    page.Canvas.SetTransparency(0.5)

    # Get the dimensions of the current page
    pageWidth = page.ActualSize.Width
    pageHeight = page.ActualSize.Height

    # Calculate the x and y coordinates to center the image on the page
    x = (pageWidth - imageWidth) / 2
    y = (pageHeight - imageHeight) / 2

    # Draw the image at the calculated center position on the page
    page.Canvas.DrawImage(image, x, y, imageWidth, imageHeight)

# Save the modified document to a new PDF file
doc.SaveToFile("output/ImageWatermark.pdf")

# Dispose resources
doc.Dispose()

Breakdown of the Code:

Load the PDF Document: The script loads an input PDF file from a specified path using the PdfDocument class.
Configure Watermark Image: The watermark image is loaded from a specified path, and its dimensions are retrieved for accurate positioning.
Apply Transformations: For each page, the script:
- Sets the watermark transparency (50%).
- Calculates the center position of the page for the watermark.
Draw the Watermark: The image is drawn at the calculated coordinates, so it is centered on each page.
Save the Document: The modified document is saved to a new PDF file.
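The centering formula works as long as the image is smaller than the page; a large logo would otherwise get negative coordinates and spill over the edges. As a hedged sketch, the helper below shrinks the image so it covers at most a chosen fraction of the page before centering it. The name fit_and_center and the max_fraction parameter are assumptions introduced for this example.

```python
def fit_and_center(page_w, page_h, img_w, img_h, max_fraction=0.5):
    """Return (x, y, width, height) for an image scaled down to fit and centered."""
    # Never enlarge the image; only shrink it when it exceeds the allowed area
    scale = min(1.0, page_w * max_fraction / img_w, page_h * max_fraction / img_h)
    width, height = img_w * scale, img_h * scale
    return ((page_w - width) / 2, (page_h - height) / 2, width, height)


# A small logo keeps its size; a large one is scaled to half the page width
print(fit_and_center(595.0, 842.0, 100.0, 50.0))
print(fit_and_center(595.0, 842.0, 1200.0, 600.0))
```

The four returned values follow the argument order of the DrawImage(image, x, y, imageWidth, imageHeight) call used in the sample.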

Output:

Add an image watermark to PDF

Apart from watermarks, you can also add stamps to PDFs. Unlike watermarks, which are fixed in place, stamps can be freely moved or deleted, offering greater flexibility in document annotation.

Troubleshooting Common Issues

Watermark Not Appearing:
- Verify file paths are correct.
- Check transparency isn't set to 0 (fully transparent).
- Ensure coordinates place the watermark within page bounds.
Quality Issues:
- For text, use higher-quality fonts.
- For images, ensure adequate resolution.
Rotation Problems:
- Remember that rotation occurs around the current origin point.
- The order of transformations matters (translate then rotate).
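To see why the order matters, the sketch below tracks where the drawing origin ends up on the page for both call orders. It assumes the usual canvas convention, as in the PDF content-stream model, where each new transform is applied to drawn points before the transforms set earlier; that convention is an assumption here rather than something stated in the Spire.PDF samples, and all function names are illustrative.

```python
import math


def translate(dx, dy):
    """Return a function that shifts a point by (dx, dy)."""
    return lambda p: (p[0] + dx, p[1] + dy)


def rotate(angle_deg):
    """Return a function that rotates a point around (0, 0)."""
    a = math.radians(angle_deg)
    return lambda p: (
        p[0] * math.cos(a) - p[1] * math.sin(a),
        p[0] * math.sin(a) + p[1] * math.cos(a),
    )


def origin_on_page(calls):
    """Locate the drawing origin; the last transform called acts on points first."""
    point = (0.0, 0.0)
    for transform in reversed(calls):
        point = transform(point)
    return point


cx, cy = 297.5, 421.0  # center of a hypothetical 595 x 842 page
print(origin_on_page([translate(cx, cy), rotate(-45.0)]))  # (297.5, 421.0)
print(origin_on_page([rotate(-45.0), translate(cx, cy)]))  # lands away from the center
```

With translate then rotate, the origin sits exactly on the page center, so the watermark rotates in place. With the calls reversed, the translation itself is rotated and the watermark lands away from the center.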

Wrapping Up

With Spire.PDF for Python, adding watermarks to PDF documents becomes a simple and powerful process. Whether you need bold "Confidential" text across every page or subtle branding with logos, the library handles it all efficiently. By combining coordinate transformations, transparency settings, and drawing commands, you can create highly customized watermarking workflows tailored to your document's purpose.

FAQs

Q1. Can I add both text and image watermarks to the same PDF?

Yes, you can combine both approaches in a single loop over the PDF pages.

Q2. How can I rotate image watermarks?

Use Canvas.RotateTransform(angle) before drawing the image, similar to the text watermark example.

Q3. Does Spire.PDF support transparent PNGs for watermarks?

Yes, Spire.PDF preserves the transparency of PNG images when used as watermarks.

Q4. Can I apply different watermarks to different pages?

Absolutely. You can implement conditional logic within your page loop to apply different watermarks based on page number or other criteria.

Get a Free License

To fully experience the capabilities of Spire.PDF for Python without any evaluation limitations, you can request a free 30-day trial license.

Wednesday, 08 March 2023 01:26

C++: Create Tables in Word Documents

A table is a powerful tool for organizing and presenting data. It arranges data into rows and columns, making it easier for authors to illustrate the relationships between different data categories and for readers to understand and analyze complex data. In this article, you will learn how to programmatically create tables in Word documents in C++ using Spire.Doc for C++.

Create a Table in Word in C++
Create a Nested Table in Word in C++

Install Spire.Doc for C++

There are two ways to integrate Spire.Doc for C++ into your application. One way is to install it through NuGet, and the other is to download the package from our website and copy the libraries into your program. Installation via NuGet is simpler and is the recommended approach. You can find more details by visiting the following link.

Integrate Spire.Doc for C++ in a C++ Application

Create a Table in Word in C++

Spire.Doc for C++ offers the Section->AddTable() method to add a table to a section of a Word document. The detailed steps are as follows:

Initialize an instance of the Document class.
Add a section to the document using Document->AddSection() method.
Define the data for the header row and remaining rows, storing them in a one-dimensional vector and a two-dimensional vector respectively.
Add a table to the section using Section->AddTable() method.
Specify the number of rows and columns in the table using Table->ResetCells(int, int) method.
Add data in the one-dimensional vector to the header row and set formatting.
Add data in the two-dimensional vector to the remaining rows and set formatting.
Save the result document using Document->SaveToFile() method.

#include "Spire.Doc.o.h"

using namespace Spire::Doc;
using namespace std;

int main()
{
	//Initialize an instance of the Document class
	intrusive_ptr<Document> doc = new Document();

	//Add a section to the document
	intrusive_ptr<Section> section = doc->AddSection();

	//Set page margins for the section
	section->GetPageSetup()->GetMargins()->SetAll(72);

	//Define the data for the header row
	vector<wstring> header = { L"Name", L"Capital", L"Continent", L"Area", L"Population" };
	//Define the data for the remaining rows
	vector<vector<wstring>> data =
	{
		{L"Argentina", L"Buenos Aires", L"South America", L"2777815", L"32300003"},
		{L"Bolivia", L"La Paz", L"South America", L"1098575", L"7300000"},
		{L"Brazil", L"Brasilia", L"South America", L"8511196", L"150400000"},
		{L"Canada", L"Ottawa", L"North America", L"9976147", L"26500000"},
		{L"Chile", L"Santiago", L"South America", L"756943", L"13200000"},
		{L"Colombia", L"Bogota", L"South America", L"1138907", L"33000000"},
		{L"Cuba", L"Havana", L"North America", L"114524", L"10600000"},
		{L"Ecuador", L"Quito", L"South America", L"455502", L"10600000"},
		{L"El Salvador", L"San Salvador", L"North America", L"20865", L"5300000"},
		{L"Guyana", L"Georgetown", L"South America", L"214969", L"800000"},
		{L"Jamaica", L"Kingston", L"North America", L"11424", L"2500000"},
		{L"Mexico", L"Mexico City", L"North America", L"1967180", L"88600000"},
		{L"Nicaragua", L"Managua", L"North America", L"139000", L"3900000"},
		{L"Paraguay", L"Asuncion", L"South America", L"406576", L"4660000"},
		{L"Peru", L"Lima", L"South America", L"1285215", L"21600000"},
		{L"United States", L"Washington", L"North America", L"9363130", L"249200000"},
		{L"Uruguay", L"Montevideo", L"South America", L"176140", L"3002000"},
		{L"Venezuela", L"Caracas", L"South America", L"912047", L"19700000"}
	};

	//Add a table to the section
	intrusive_ptr<Table> table = section->AddTable(true);
	//Specify the number of rows and columns for the table
	table->ResetCells(data.size() + 1, header.size());

	//Set the first row as the header row
	intrusive_ptr<TableRow> row = table->GetRows()->GetItemInRowCollection(0);
	row->SetIsHeader(true);

	//Set height and background color for the header row
	row->SetHeight(20);
	row->SetHeightType(TableRowHeightType::Exactly);
	for (int i = 0; i < row->GetCells()->GetCount(); i++)
	{
		row->GetCells()->GetItemInCellCollection(i)->GetCellFormat()->GetShading()->SetBackgroundPatternColor(Color::FromArgb(142, 170, 219));
	}

	//Add data to the header row and set formatting
	for (size_t i = 0; i < header.size(); i++)
	{
		//Add a paragraph
		intrusive_ptr<Paragraph> p1 = row->GetCells()->GetItemInCellCollection(i)->AddParagraph();
		//Set alignment
		p1->GetFormat()->SetHorizontalAlignment(HorizontalAlignment::Center);
		row->GetCells()->GetItemInCellCollection(i)->GetCellFormat()->SetVerticalAlignment(VerticalAlignment::Middle);
		//Add data
		intrusive_ptr<TextRange> tR1 = p1->AppendText(header[i].c_str());
		//Set data formatting
		tR1->GetCharacterFormat()->SetFontName(L"Calibri");
		tR1->GetCharacterFormat()->SetFontSize(12);
		tR1->GetCharacterFormat()->SetBold(true);
	}

	//Add data to the remaining rows and set formatting
	for (size_t r = 0; r < data.size(); r++)
	{
		//Set height for the remaining rows
		intrusive_ptr<TableRow> dataRow = table->GetRows()->GetItemInRowCollection(r + 1);
		dataRow->SetHeight(20);
		dataRow->SetHeightType(TableRowHeightType::Exactly);

		for (size_t c = 0; c < data[r].size(); c++)
		{
			//Add a paragraph
			intrusive_ptr<Paragraph> p2 = dataRow->GetCells()->GetItemInCellCollection(c)->AddParagraph();
			//Set alignment
			dataRow->GetCells()->GetItemInCellCollection(c)->GetCellFormat()->SetVerticalAlignment(VerticalAlignment::Middle);
			//Add data
			intrusive_ptr<TextRange> tR2 = p2->AppendText(data[r][c].c_str());
			//Set data formatting
			tR2->GetCharacterFormat()->SetFontName(L"Calibri");
			tR2->GetCharacterFormat()->SetFontSize(11);
		}
	}

	//Save the result document
	doc->SaveToFile(L"CreateTable.docx", FileFormat::Docx2013);
	doc->Close();
}

Create a Nested Table in Word in C++

Spire.Doc for C++ offers the TableCell->AddTable() method to add a nested table to a specific table cell. The detailed steps are as follows:

Initialize an instance of the Document class.
Add a section to the document using Document->AddSection() method.
Add a table to the section using Section->AddTable() method.
Specify the number of rows and columns in the table using Table->ResetCells(int, int) method.
Get the rows of the table and add data to the cells of each row.
Add a nested table to a specific table cell using TableCell->AddTable() method.
Specify the number of rows and columns in the nested table.
Get the rows of the nested table and add data to the cells of each row.
Save the result document using Document->SaveToFile() method.

#include "Spire.Doc.o.h"

using namespace Spire::Doc;
using namespace std;

int main()
{
	//Initialize an instance of the Document class
	intrusive_ptr<Document> doc = new Document();
	//Add a section to the document
	intrusive_ptr<Section> section = doc->AddSection();

	//Set page margins for the section
	section->GetPageSetup()->GetMargins()->SetAll(72);

	//Add a table to the section
	intrusive_ptr<Table> table = section->AddTable(true);
	//Set the number of rows and columns in the table
	table->ResetCells(2, 2);

	//Autofit the table width to window
	table->AutoFit(AutoFitBehaviorType::AutoFitToWindow);

	//Get the table rows
	intrusive_ptr<TableRow> row1 = table->GetRows()->GetItemInRowCollection(0);
	intrusive_ptr<TableRow> row2 = table->GetRows()->GetItemInRowCollection(1);

	//Add data to cells of the table
	intrusive_ptr<TableCell> cell1 = row1->GetCells()->GetItemInCellCollection(0);
	intrusive_ptr<TextRange> tR = cell1->AddParagraph()->AppendText(L"Product");
	tR->GetCharacterFormat()->SetFontSize(13);
	tR->GetCharacterFormat()->SetBold(true);
	intrusive_ptr<TableCell> cell2 = row1->GetCells()->GetItemInCellCollection(1);
	tR = cell2->AddParagraph()->AppendText(L"Description");
	tR->GetCharacterFormat()->SetFontSize(13);
	tR->GetCharacterFormat()->SetBold(true);
	intrusive_ptr<TableCell> cell3 = row2->GetCells()->GetItemInCellCollection(0);
	cell3->AddParagraph()->AppendText(L"Spire.Doc for C++");
	intrusive_ptr<TableCell> cell4 = row2->GetCells()->GetItemInCellCollection(1);
	cell4->AddParagraph()->AppendText(L"Spire.Doc for C++ is a professional Word "
		L"library specifically designed for developers to create, "
		L"read, write and convert Word documents in C++ "
		L"applications with fast and high-quality performance.");

	//Add a nested table to the fourth cell
	intrusive_ptr<Table> nestedTable = cell4->AddTable(true);
	//Set the number of rows and columns in the nested table
	nestedTable->ResetCells(3, 2);

	//Autofit the table width to content
	nestedTable->AutoFit(AutoFitBehaviorType::AutoFitToContents);

	//Get table rows
	intrusive_ptr<TableRow> nestedRow1 = nestedTable->GetRows()->GetItemInRowCollection(0);
	intrusive_ptr<TableRow> nestedRow2 = nestedTable->GetRows()->GetItemInRowCollection(1);
	intrusive_ptr<TableRow> nestedRow3 = nestedTable->GetRows()->GetItemInRowCollection(2);

	//Add data to cells of the nested table
	intrusive_ptr<TableCell> nestedCell1 = nestedRow1->GetCells()->GetItemInCellCollection(0);
	tR = nestedCell1->AddParagraph()->AppendText(L"Item");
	tR->GetCharacterFormat()->SetBold(true);
	intrusive_ptr<TableCell> nestedCell2 = nestedRow1->GetCells()->GetItemInCellCollection(1);
	tR = nestedCell2->AddParagraph()->AppendText(L"Price");
	tR->GetCharacterFormat()->SetBold(true);
	intrusive_ptr<TableCell> nestedCell3 = nestedRow2->GetCells()->GetItemInCellCollection(0);
	nestedCell3->AddParagraph()->AppendText(L"Developer Subscription");
	intrusive_ptr<TableCell> nestedCell4 = nestedRow2->GetCells()->GetItemInCellCollection(1);
	nestedCell4->AddParagraph()->AppendText(L"$999");
	intrusive_ptr<TableCell> nestedCell5 = nestedRow3->GetCells()->GetItemInCellCollection(0);
	nestedCell5->AddParagraph()->AppendText(L"Developer OEM Subscription");
	intrusive_ptr<TableCell> nestedCell6 = nestedRow3->GetCells()->GetItemInCellCollection(1);
	nestedCell6->AddParagraph()->AppendText(L"$2999");

	//Save the result document
	doc->SaveToFile(L"CreateNestedTable.docx", FileFormat::Docx2013);
	doc->Close();
}

C++: Create Tables in Word Documents

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Table


Friday, 12 August 2022 08:09

3 Efficient Methods to Write Data to Excel in Java

Visual guide of java write to excel

Looking to automate Excel data entry in Java? Manually inputting data into Excel worksheets is time-consuming and error-prone, especially when dealing with large datasets. The good news is that with the right Java Excel library, you can streamline this process. This comprehensive guide explores three efficient methods to write data to Excel in Java using the powerful Spire.XLS for Java library, covering basic cell-by-cell entries, bulk array inserts, and DataTable exports.

Prerequisites: Setup & Installation
3 Ways to Write Data to Excel using Java
Performance Tips for Large Datasets
Frequently Asked Questions
Final Thoughts

Prerequisites: Setup & Installation

Before you start, you’ll need to add Spire.XLS for Java to your project. Here’s how to do it quickly:

Option 1: Download the JAR File

Visit the Spire.XLS for Java download page.
Download the latest JAR file.
Add the JAR to your project’s build path.

Option 2: Use Maven

If you’re using Maven, add the following repository and dependency to your pom.xml file; the dependency element belongs inside your project's dependencies section. Maven then downloads and integrates the library automatically:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependency>
    <groupId>e-iceblue</groupId>
    <artifactId>spire.xls</artifactId>
    <version>15.7.7</version>
</dependency>

3 Ways to Write Data to Excel using Java

Spire.XLS for Java offers flexible methods to write data, tailored to different scenarios. Let’s explore each with complete code samples, explanations, and use cases.

1. Write Text or Numbers to Excel Cells

Need to populate individual cells with text or numbers? Spire.XLS lets you directly target a specific cell using row/column indices (e.g., (2,1) for row 2, column 1) or Excel-style references (e.g., "A1", "B3"):

How It Works:

Use the Worksheet.get(int row, int column) or Worksheet.get(String name) method to access a specific Excel cell.
Use the setValue() method to write a text value to the cell.
Use the setNumberValue() method to write a numeric value to the cell.
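The two addressing styles refer to the same cells: "B3" is row 3, column 2. The following is a minimal sketch of that mapping in plain Java. The CellRef class and its parse() method are hypothetical helpers written for illustration; they are not part of Spire.XLS, which resolves references internally.

```java
// Hypothetical helper, not part of Spire.XLS: maps an A1-style
// reference such as "B3" to 1-based {row, column} indices.
final class CellRef {

    static int[] parse(String ref) {
        int i = 0;
        int column = 0;
        // Column letters form a base-26 number where A = 1 ... Z = 26.
        while (i < ref.length()) {
            char c = Character.toUpperCase(ref.charAt(i));
            if (c < 'A' || c > 'Z') {
                break;
            }
            column = column * 26 + (c - 'A' + 1);
            i++;
        }
        if (i == 0 || i == ref.length()) {
            throw new IllegalArgumentException("Not an A1-style reference: " + ref);
        }
        // The remaining characters are the 1-based row number.
        int row = Integer.parseInt(ref.substring(i));
        return new int[] {row, column};
    }

    public static void main(String[] args) {
        int[] b3 = parse("B3");
        System.out.println("B3 -> row " + b3[0] + ", column " + b3[1]);
    }
}
```

Under this mapping, worksheet.get("B3") and worksheet.get(3, 2) would address the same cell.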

Java code to write data to Excel:

import com.spire.xls.*;

public class WriteToCells {

    public static void main(String[] args) {

        // Create a Workbook object
        Workbook workbook = new Workbook();

        // Get the first worksheet
        Worksheet worksheet = workbook.getWorksheets().get(0);

        // Write data to specific cells
        worksheet.get("A1").setValue("Name");
        worksheet.get("B1").setValue("Age");
        worksheet.get("C1").setValue("Department");
        worksheet.get("D1").setValue("Hiredate");
        worksheet.get(2,1).setValue("Hazel");
        worksheet.get(2,2).setNumberValue(29);
        worksheet.get(2,3).setValue("Marketing");
        worksheet.get(2,4).setValue("2019-07-01");
        worksheet.get(3,1).setValue("Tina");
        worksheet.get(3,2).setNumberValue(31);
        worksheet.get(3,3).setValue("Technical Support");
        worksheet.get(3,4).setValue("2015-04-27");

        // Autofit column widths
        worksheet.getAllocatedRange().autoFitColumns();

        // Apply a style to the first row
        CellStyle style = workbook.getStyles().addStyle("newStyle");
        style.getFont().isBold(true);
        worksheet.getRange().get(1,1,1,4).setStyle(style);

        // Save to an Excel file
        workbook.saveToFile("output/WriteToCells.xlsx", ExcelVersion.Version2016);
    }
}

When to use this: Small datasets where you need precise control over cell placement (e.g., adding a title, single-row entries).

Write data to specific cells in Excel.

2. Write Arrays to Excel Worksheets

For bulk data, writing arrays (1D or 2D) is far more efficient than updating cells one by one. Spire.XLS for Java allows inserting arrays into a contiguous cell range.

insertArray() Method Explained:

The insertArray() method accepts both 1D arrays (a single row or column) and 2D arrays (multiple rows and columns). Its parameters are:

Object[] array/ Object[][] array: The 1D or 2D array containing data to insert.
int firstRow: The starting row index (1-based).
int firstColumn: The starting column index (1-based).
boolean isVertical: For 1D arrays, a boolean indicating the insertion direction (the 2D call in the sample code omits this argument):
- false: Insert horizontally (left to right).
- true: Insert vertically (top to bottom).
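To illustrate what the direction flag means, here is a small sketch in plain Java that computes where each element of a 1D array would land. The ArrayPlacement class and its targets() method are hypothetical and written for this illustration; they only model the description above and do not call Spire.XLS.

```java
// Illustration only: models where the elements of a 1D array would be
// placed for a given start cell and direction. Not part of Spire.XLS.
final class ArrayPlacement {

    // Returns 1-based {row, column} pairs for `length` elements
    // starting at (firstRow, firstColumn).
    static int[][] targets(int length, int firstRow, int firstColumn, boolean isVertical) {
        int[][] cells = new int[length][2];
        for (int i = 0; i < length; i++) {
            // Vertical insertion advances the row; horizontal advances the column.
            cells[i][0] = isVertical ? firstRow + i : firstRow;
            cells[i][1] = isVertical ? firstColumn : firstColumn + i;
        }
        return cells;
    }

    public static void main(String[] args) {
        String[] months = {"January", "February", "March"};
        int[][] horizontal = targets(months.length, 1, 1, false);
        int[][] vertical = targets(months.length, 1, 1, true);
        for (int i = 0; i < months.length; i++) {
            System.out.println(months[i]
                    + ": horizontal -> (" + horizontal[i][0] + "," + horizontal[i][1] + ")"
                    + ", vertical -> (" + vertical[i][0] + "," + vertical[i][1] + ")");
        }
    }
}
```

With a start cell of (1, 1), the third element goes to row 1, column 3 when inserted horizontally and to row 3, column 1 when inserted vertically.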

Java code to insert arrays into Excel:

import com.spire.xls.*;

public class WriteArrayToWorksheet {

    public static void main(String[] args) {

        // Create a Workbook instance
        Workbook workbook = new Workbook();

        // Get the first worksheet
        Worksheet worksheet = workbook.getWorksheets().get(0);

        // Create a one-dimensional array
        Object[] oneDimensionalArray = {"January", "February", "March", "April","May", "June"};

        // Write the array to the first row of the worksheet
        worksheet.insertArray(oneDimensionalArray, 1, 1, false);

        // Create a two-dimensional array
        Object[][] twoDimensionalArray = {
                {"Name", "Age", "Sex", "Dept.", "Tel."},
                {"John", "25", "Male", "Development","654214"},
                {"Albert", "24", "Male", "Support","624847"},
                {"Amy", "26", "Female", "Sales","624758"}
        };

        // Write the array to the worksheet starting from the cell A3
        worksheet.insertArray(twoDimensionalArray, 3, 1);

        // Autofit column widths in the allocated range
        worksheet.getAllocatedRange().autoFitColumns();

        // Apply a style to the first and the third row
        CellStyle style = workbook.getStyles().addStyle("newStyle");
        style.getFont().isBold(true);
        worksheet.getRange().get(1,1,1,6).setStyle(style);
        worksheet.getRange().get(3,1,3,6).setStyle(style);

        // Save to an Excel file
        workbook.saveToFile("WriteArrays.xlsx", ExcelVersion.Version2016);
    }
}

When to use this: Sequential data (e.g., inventory logs, user lists) that needs bulk insertion.

Insert 1D and 2D arrays into an Excel sheet.

3. Write DataTable to Excel

If your data is stored in a DataTable (e.g., from a database), Spire.XLS lets you directly export it to Excel with insertDataTable(), preserving structure and column headers.

insertDataTable() Method Explained:

The insertDataTable() method writes an entire DataTable to a worksheet in a single call. Its parameters are:

DataTable dataTable: The DataTable object containing the data to insert.
boolean columnHeaders: A boolean indicating whether to include column names from the DataTable as headers in Excel.
- true: Inserts column names as the first row.
- false: Skips column names; data starts from the first row.
int firstRow: The starting row index (1-based).
int firstColumn: The starting column index (1-based).
boolean transTypes: A boolean indicating whether to preserve data types.

Java code to export DataTable to Excel:

import com.spire.xls.*;
import com.spire.xls.data.table.DataRow;
import com.spire.xls.data.table.DataTable;

public class WriteDataTableToWorksheet {

    public static void main(String[] args) throws Exception {

        // Create a Workbook instance
        Workbook workbook = new Workbook();

        // Get the first worksheet
        Worksheet worksheet = workbook.getWorksheets().get(0);

        // Create a DataTable object
        DataTable dataTable = new DataTable();
        dataTable.getColumns().add("SKU", Integer.class);
        dataTable.getColumns().add("NAME", String.class);
        dataTable.getColumns().add("PRICE", String.class);

        // Create rows and add data
        DataRow dr = dataTable.newRow();
        dr.setInt(0, 512900512);
        dr.setString(1,"Wireless Mouse M200");
        dr.setString(2,"$85");
        dataTable.getRows().add(dr);
        dr = dataTable.newRow();
        dr.setInt(0,512900637);
        dr.setString(1,"B100 Corded Mouse");
        dr.setString(2,"$99");
        dataTable.getRows().add(dr);
        dr = dataTable.newRow();
        dr.setInt(0,512901829);
        dr.setString(1,"Gaming Mouse");
        dr.setString(2,"$125");
        dataTable.getRows().add(dr);
        dr = dataTable.newRow();
        dr.setInt(0,512900386);
        dr.setString(1,"ZM Optical Mouse");
        dr.setString(2,"$89");
        dataTable.getRows().add(dr);

        // Write datatable to the worksheet
        worksheet.insertDataTable(dataTable,true,1,1,true);

        // Autofit column widths in the allocated range
        worksheet.getAllocatedRange().autoFitColumns();

        // Apply a style to the first row
        CellStyle style = workbook.getStyles().addStyle("newStyle");
        style.getFont().isBold(true);
        worksheet.getRange().get(1,1,1,3).setStyle(style);

        // Save to an Excel file
        workbook.saveToFile("output/WriteDataTable.xlsx", ExcelVersion.Version2016);
    }
}

When to use this: Database exports, CRM data, or any structured data stored in a DataTable (e.g., SQL query results, CSV imports).
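If your records are not already in a DataTable, one alternative is to flatten them into a 2D array with a header row and hand that to insertArray() from method 2. The sketch below shows only the flattening step in plain Java. The RowsToArray class, its toArray() method, and the choice of a list of maps as the input shape are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: flattens rows (column name -> value) into a 2D array
// whose first row holds the column names.
final class RowsToArray {

    static Object[][] toArray(List<String> columns, List<Map<String, Object>> rows) {
        Object[][] grid = new Object[rows.size() + 1][columns.size()];
        for (int c = 0; c < columns.size(); c++) {
            grid[0][c] = columns.get(c);
        }
        for (int r = 0; r < rows.size(); r++) {
            for (int c = 0; c < columns.size(); c++) {
                // Missing values become empty strings so no cell receives null.
                Object value = rows.get(r).get(columns.get(c));
                grid[r + 1][c] = value == null ? "" : value;
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("SKU", "NAME", "PRICE");
        List<Map<String, Object>> rows = new ArrayList<>();
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("SKU", 512900512);
        row.put("NAME", "Wireless Mouse M200");
        row.put("PRICE", "$85");
        rows.add(row);

        Object[][] grid = toArray(columns, rows);
        System.out.println(Arrays.deepToString(grid));
        // The grid could then be written with worksheet.insertArray(grid, 1, 1).
    }
}
```

This keeps the header row and the data in one bulk insert, at the cost of handling column types yourself.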

Export a DataTable to an Excel worksheet.

Performance Tips for Large Datasets

Use bulk operations (insertArray()/insertDataTable()) instead of writing cells one by one.
Disable auto-fit columns or styling during data insertion, then apply them once after all data is written.
For datasets with 100,000+ rows, consider streaming mode to reduce memory usage.

Frequently Asked Questions

Q1: What Excel formats does Spire.XLS support for writing data?

A: Spire.XLS for Java supports all major Excel formats, including:

Legacy formats: XLS (Excel 97-2003)
Modern formats: XLSX, XLSM (macro-enabled), XLSB, and more.

You can specify the output format when saving Excel with the saveToFile() method.

Q2: How do I format cells (colors, fonts, borders) when writing data?

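A: Spire.XLS exposes styling through CellStyle objects, as in the samples above, where a bold style is created with workbook.getStyles().addStyle() and applied to a range with setStyle(). The library's formatting guides cover colors, fonts, and borders in more detail.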

Q3: How do I avoid the "Evaluation Warning" in output files?

A: To remove the evaluation sheets, request a 30-day free trial license, then apply the license key in your code before creating the Workbook object:

com.spire.xls.license.LicenseProvider.setLicenseKey("Key");

Workbook workbook = new Workbook();

Final Thoughts

Mastering Excel export functionality is crucial for Java developers in data-driven applications. The Spire.XLS for Java library provides three efficient approaches to write data to Excel in Java:

Precision control with cell-by-cell writing
High-performance bulk inserts using arrays
Database-style exporting with DataTables

Each method serves distinct use cases - from simple reports to complex enterprise data exports. By following the examples in the article, developers can easily create and write to Excel files in Java applications.

Published in Data Import/Export


Wednesday, 20 October 2021 02:03

Extract Tables from PDFs in C# - Export to TXT & CSV

Extract tables from PDF files in C#/.NET

Extracting tables from PDF files is a common requirement in data processing, reporting, and automation tasks. PDFs are widely used for sharing structured data, but extracting tables programmatically can be challenging due to their complex layout. Fortunately, with the right tools, this process becomes straightforward. In this guide, we’ll explore how to extract tables from PDF in C# using the Spire.PDF for .NET library, and export the results to TXT and CSV formats for easy reuse.

Table of Contents:

Prerequisites for Reading PDF Tables in C#
Understanding PDF Table Structure
How to Extract Tables from PDF in C#
Extract PDF Tables to a Text File in C#
Export PDF Tables to CSV in C#
Conclusion
FAQs

Prerequisites for Reading PDF Tables in C#

Spire.PDF for .NET is a powerful library for processing PDF files in C# and VB.NET. It supports a wide range of PDF operations, including table extraction, text extraction, image extraction, and more.

The easiest way to add the Spire.PDF library is via NuGet Package Manager.

1. Open Visual Studio and create a new C# project. (Here we create a Console App)

2. In Visual Studio, right-click your project > Manage NuGet Packages.

3. Search for “Spire.PDF” and install the latest version.

Understanding PDF Table Structure

Before coding, let’s clarify how PDFs store tables. Unlike Excel (which explicitly defines rows/columns), PDFs use:

Text Blocks: Individual text elements positioned with coordinates.
Borders/Lines: Visual cues (horizontal/vertical lines) that humans interpret as table edges.
Spacing: Consistent gaps between text blocks to indicate cells.

The Spire.PDF library infers table structure by analyzing these visual cues, matching text blocks to rows/columns based on proximity and alignment.
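To make the idea of inferring rows from positions concrete, here is a deliberately simplified sketch. It is written in Java purely for illustration (the logic ports directly to C#), and it is not Spire.PDF's actual algorithm. The RowGrouping and TextBlock types, the coordinates, and the tolerance value are all assumptions: blocks whose y-coordinates lie within a tolerance of the first block in a row are treated as one row, and each row is then ordered left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified illustration of grouping positioned text into table rows.
// Not the algorithm used by Spire.PDF.
final class RowGrouping {

    // A positioned piece of text; a hypothetical stand-in for a PDF text block.
    static final class TextBlock {
        final String text;
        final double x;
        final double y;

        TextBlock(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    static List<List<String>> groupIntoRows(List<TextBlock> blocks, double tolerance) {
        // Sort top to bottom so blocks of the same row become neighbours.
        List<TextBlock> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator.comparingDouble((TextBlock b) -> b.y));

        List<List<TextBlock>> rows = new ArrayList<>();
        double rowY = 0;
        for (TextBlock block : sorted) {
            // Start a new row when the block is too far below the current row's first block.
            if (rows.isEmpty() || Math.abs(block.y - rowY) > tolerance) {
                rows.add(new ArrayList<TextBlock>());
                rowY = block.y;
            }
            rows.get(rows.size() - 1).add(block);
        }

        // Order each row left to right and keep only the text.
        List<List<String>> result = new ArrayList<>();
        for (List<TextBlock> row : rows) {
            row.sort(Comparator.comparingDouble((TextBlock b) -> b.x));
            List<String> cells = new ArrayList<>();
            for (TextBlock b : row) {
                cells.add(b.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        List<TextBlock> blocks = Arrays.asList(
                new TextBlock("Price", 200, 100.4),
                new TextBlock("Item", 50, 100),
                new TextBlock("$999", 200, 120.2),
                new TextBlock("Subscription", 50, 120));
        System.out.println(groupIntoRows(blocks, 2.0));
    }
}
```

A real extractor also has to handle ruling lines, merged cells, and wrapped text, which is why a dedicated class such as PdfTableExtractor is preferable to hand-rolled grouping.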

How to Extract Tables from PDF in C#

If you need a quick way to preview table data (e.g., debugging or verifying extraction), printing it to the console is a great starting point.

Key classes and methods for extracting data from a PDF table:

PdfDocument: Represents a PDF file.
LoadFromFile: Loads the PDF file for processing.
PdfTableExtractor: Analyzes the PDF to detect tables using visual cues (borders, spacing).
ExtractTable(pageIndex): Returns an array of PdfTable objects for the specified page.
GetRowCount()/GetColumnCount(): Retrieve the dimensions of each table.
GetText(rowIndex, columnIndex): Extracts text from the cell at the specified row and column.

using Spire.Pdf;
using Spire.Pdf.Utilities;

namespace ExtractPdfTable
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument pdf = new PdfDocument();

            // Load a PDF file
            pdf.LoadFromFile("invoice.pdf");

            // Initialize an instance of PdfTableExtractor class
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);


            // Loop through the pages 
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                // Extract tables from a specific page
                PdfTable[] tableList = extractor.ExtractTable(pageIndex);

                // Determine if the table list is null
                if (tableList != null && tableList.Length > 0)
                {
                    int tableNumber = 1;
                    // Loop through the table in the list
                    foreach (PdfTable table in tableList)
                    {
                        Console.WriteLine($"\nTable {tableNumber} on Page {pageIndex + 1}:");
                        Console.WriteLine("-----------------------------------");

                        // Get row number and column number of a certain table
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();

                        // Loop through rows and columns 
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Get text from the specific cell
                                string text = table.GetText(i, j);

                                // Print cell text to console with a separator
                                Console.Write($"{text}\t");
                            }
                            // New line after each row
                            Console.WriteLine();
                        }
                        tableNumber++;
                    }
                }
            }

            // Close the document
            pdf.Close();
        }
    }
}

When to Use This Method

Quick debugging or validation of extracted data.
Small datasets where you don’t need persistent storage.

Output: Retrieve PDF table data and output to the console

Extract data from a PDF table

Extract PDF Tables to a Text File in C#

For lightweight, human-readable storage, saving tables to a text file is ideal. This method uses StringBuilder to efficiently compile table data, preserving row breaks for readability.

Key features of extracting PDF tables and exporting to TXT:

Efficiency: StringBuilder minimizes memory overhead compared to string concatenation.
Persistent Storage: Saves data to a text file for later review or sharing.
Row Preservation: Uses \r\n to maintain row structure, making the text file easy to scan.

using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.IO;
using System.Text;

namespace ExtractTableToTxt
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument pdf = new PdfDocument();

            // Load a PDF file
            pdf.LoadFromFile("invoice.pdf");

            // Create a StringBuilder object
            StringBuilder builder = new StringBuilder();

            // Initialize an instance of PdfTableExtractor class
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);

            // Declare a PdfTable array 
            PdfTable[] tableList = null;

            // Loop through the pages 
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                // Extract tables from a specific page
                tableList = extractor.ExtractTable(pageIndex);

                // Determine if the table list is null
                if (tableList != null && tableList.Length > 0)
                {
                    // Loop through the table in the list
                    foreach (PdfTable table in tableList)
                    {
                        // Get row number and column number of a certain table
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();

                        // Loop through the rows and columns 
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Get text from the specific cell
                                string text = table.GetText(i, j);

                                // Add text to the string builder
                                builder.Append(text + " ");
                            }
                            builder.Append("\r\n");
                        }
                    }
                }
            }

            // Write to a .txt file
            File.WriteAllText("ExtractPDFTable.txt", builder.ToString());
        }
    }
}

When to Use This Method

Archiving table data in a lightweight, universally accessible format.
Sharing with teams that need to scan data without spreadsheet tools.
Using as input for basic scripts (e.g., PowerShell) to extract specific values.

Output: Extract PDF table data and save to a text file.

Extract table data from PDF to a TXT file

Pro Tip: For VB.NET demos, convert the above code using our C# ⇆ VB.NET Converter.

Export PDF Tables to CSV in C#

CSV (Comma-Separated Values) is a widely supported format for tabular data, compatible with Excel, Google Sheets, and databases. This method formats the extracted tables into a valid CSV file by wrapping each cell in double quotes and doubling any quotes inside the cell text.

Key features of extracting tables from PDF to CSV:

StreamWriter: Writes data incrementally to the CSV file, reducing memory usage for large PDFs.
Quoted Cells: Cells are wrapped in double quotes (" ") to avoid misinterpreting commas within text as column separators.
UTF-8 Encoding: Supports special characters in cell text.
Spreadsheet Ready: Directly opens in Excel, Google Sheets, or spreadsheet tools for analysis.
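The quoting rule is small enough to show in isolation. The sketch below is written in Java for illustration only (the logic maps one-to-one to C#). The CsvQuote class and its method names are hypothetical: each cell is wrapped in double quotes, and any double quote inside the cell is doubled, so commas and quotes in the text cannot break the column layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that illustrates CSV cell quoting.
final class CsvQuote {

    // Wraps a cell in double quotes and doubles any embedded double quote.
    static String quote(String cell) {
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }

    // Quotes every cell and joins the cells with commas to form one CSV line.
    static String toCsvLine(List<String> cells) {
        List<String> quoted = new ArrayList<>();
        for (String cell : cells) {
            quoted.add(quote(cell));
        }
        return String.join(",", quoted);
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(Arrays.asList("Widget, large", "5\" screen", "$999")));
    }
}
```

Quoting every cell is slightly more verbose than quoting only when needed, but it keeps the writer simple and the output unambiguous.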

using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace ExtractTableToCsv
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument pdf = new PdfDocument();

            // Load a PDF file
            pdf.LoadFromFile("invoice.pdf");

            // Create a StreamWriter object for efficient CSV writing
            using (StreamWriter csvWriter = new StreamWriter("PDFtable.csv", false, Encoding.UTF8))
            {
                // Create a PdfTableExtractor object
                PdfTableExtractor extractor = new PdfTableExtractor(pdf);

                // Loop through the pages 
                for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
                {
                    // Extract tables from a specific page
                    PdfTable[] tableList = extractor.ExtractTable(pageIndex);

                    // Determine if the table list is null
                    if (tableList != null && tableList.Length > 0)
                    {
                        // Loop through the table in the list
                        foreach (PdfTable table in tableList)
                        {
                            // Get row number and column number of a certain table
                            int row = table.GetRowCount();
                            int column = table.GetColumnCount();

                            // Loop through the rows
                            for (int i = 0; i < row; i++)
                            {
                                // Creates a list to store data 
                                List<string> rowData = new List<string>();
                                // Loop through the columns
                                for (int j = 0; j < column; j++)
                                {
                                    // Retrieve text from table cells
                                    string cellText = table.GetText(i, j).Replace("\"", "\"\"");
                                    // Add the cell text to the list and wrap in double quotes
                                    rowData.Add($"\"{cellText}\"");
                                }
                                // Join cells with commas and write to CSV
                                csvWriter.WriteLine(string.Join(",", rowData));
                            }
                        }
                    }
                }
            }
        }
    }
}

When to Use This Method

Data analysis (import into Excel for calculations).
Migrating PDF tables to databases (e.g., SQL Server, PostgreSQL, MySQL).
Collaborating with teams that rely on spreadsheets.

Output: Parse PDF table data and export to a CSV file.

Extract table data from PDF to a CSV file

Recommendation: Integrate with Spire.XLS for .NET to extract tables from PDF to Excel directly.

Conclusion

This guide has outlined three efficient methods for extracting tables from PDFs in C#. By leveraging the Spire.PDF for .NET library, you can automate the PDF table extraction process and export results to console, TXT, or CSV for further analysis. Whether you’re building a data pipeline, report generator, or business tool, these approaches streamline workflows, save time, and minimize human error.

Refer to the online documentation to explore more advanced PDF operations, and request a free trial license to evaluate them.

FAQs

Q1: Why use Spire.PDF for .NET to extract tables?

A: Spire.PDF provides a dedicated PdfTableExtractor class that detects tables based on visual cues (borders, spacing, and text alignment), simplifying the process of parsing structured data from PDFs.

Q2: Can Spire.PDF extract tables from scanned (image-based) PDFs?

A: No. The .NET PDF library works only with text-based PDFs (where text is selectable). For scanned PDFs, use Spire.OCR to extract text before parsing tables.

Q3: Can I extract tables from multiple PDFs at once?

A: Yes. To batch-process multiple PDFs, use Directory.GetFiles() to list all PDF files in a folder, then loop through each file and run the extraction logic. For example:

string[] pdfFiles = Directory.GetFiles(@"C:\Invoices\", "*.pdf");
foreach (string file in pdfFiles)
{
    // Run extraction code for each file
}

Q4: How can I improve performance when extracting tables from large PDFs?

A: For large PDFs (100+ pages), optimize performance by:

Processing pages in batches instead of loading the entire PDF at once.
Disposing of unused PdfTable or PdfDocument objects with using statements to free memory.
Skipping pages with no tables early (using if (tableList == null || tableList.Length == 0)).

Published in Table

Tagged under

pdf net Table

Friday, 08 April 2022 07:34

Java: Convert Images to PDF

Converting images to PDF is useful for several reasons. First, it puts images into a format that is easy to read and share. Second, it can reduce file size while preserving image quality, depending on how the images are compressed. In this article, you will learn how to convert images to PDF in Java using Spire.PDF for Java.

There is no straightforward method provided by Spire.PDF to convert images to PDF. You could, however, create a new PDF document and draw images at the specified locations. Depending on whether the page size of the generated PDF matches the image, this topic can be divided into two subtopics.

Add an Image to PDF at a Specified Location
Convert an Image to PDF with the Same Width and Height

Install Spire.PDF for Java

First, add the Spire.Pdf.jar file as a dependency in your Java program. The JAR file can be downloaded from the Spire.PDF for Java download page. If you use Maven, you can import the JAR file into your application by adding the following code to your project's pom.xml file.

Package Manager

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.10.3</version>
    </dependency>
</dependencies>

Additionally, the imgscalr library is used in the first code example to resize images. It is not necessary to install it if you do not need to adjust the image’s size.

Add an Image to PDF at a Specified Location

The following are the steps to add an image to PDF at a specified location using Spire.PDF for Java.

Create a PdfDocument object.
Set the page margins using PdfDocument.getPageSettings().setMargins() method.
Add a page using PdfDocument.getPages().add() method.
Load an image using ImageIO.read() method, and get the image width and height.
If the image width is larger than the page (content area) width, resize the image to fit the page width using the imgscalr library.
Create a PdfImage object based on the scaled image or the original image.
Draw the PdfImage object on the first page at (0, 0) using PdfPageBase.getCanvas().drawImage() method.
Save the document to a PDF file using PdfDocument.saveToFile() method.

Java

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.graphics.PdfImage;
import org.imgscalr.Scalr;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.FileInputStream;
import java.io.IOException;

public class AddImageToPdf {

    public static void main(String[] args) throws IOException {

        //Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        //Set the margins
        doc.getPageSettings().setMargins(20);

        //Add a page
        PdfPageBase page = doc.getPages().add();

        //Load an image
        BufferedImage image = ImageIO.read(new FileInputStream("C:\\Users\\Administrator\\Desktop\\announcement.jpg"));

        //Get the image width and height
        int width = image.getWidth();
        int height = image.getHeight();

        //Declare a PdfImage variable
        PdfImage pdfImage;

        //If the image width is larger than page width
        if (width > page.getCanvas().getClientSize().getWidth())
        {
            //Scale the image down to the page width, keeping the aspect ratio
            //(floating-point math: integer division would truncate the ratio)
            double scale = page.getCanvas().getClientSize().getWidth() / width;
            int targetWidth = (int) Math.floor(width * scale);
            int targetHeight = (int) Math.floor(height * scale);
            BufferedImage scaledImage = Scalr.resize(image,Scalr.Method.QUALITY,targetWidth,targetHeight);

            //Load the scaled image to the PdfImage object
            pdfImage = PdfImage.fromImage(scaledImage);

        } else
        {
            //Load the original image to the PdfImage object
            pdfImage = PdfImage.fromImage(image);
        }

        //Draw image at (0, 0)
        page.getCanvas().drawImage(pdfImage, 0, 0, pdfImage.getWidth(), pdfImage.getHeight());

        //Save to file
        doc.saveToFile("output/AddImage.pdf");
    }
}
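The resize step comes down to a single scale factor, and it can be checked without any PDF library. The following is a minimal sketch in plain Java. The FitToWidth class and its fit() method are hypothetical helpers for illustration. Computing the ratio in floating point matters: in integer arithmetic, 1000 / 555 evaluates to 1, which would leave a 1000-pixel-wide image unscaled.

```java
// Hypothetical helper: computes a target size that fits a maximum width
// while preserving the aspect ratio. Images that already fit are unchanged.
final class FitToWidth {

    // Returns {width, height}; never scales up.
    static int[] fit(int width, int height, double maxWidth) {
        if (width <= maxWidth) {
            return new int[] {width, height};
        }
        // One factor for both dimensions keeps the proportions intact.
        double scale = maxWidth / width;
        // Rounding down guarantees the result does not exceed maxWidth.
        return new int[] {(int) Math.floor(width * scale), (int) Math.floor(height * scale)};
    }

    public static void main(String[] args) {
        int[] size = fit(1200, 800, 600.0);
        System.out.println(size[0] + " x " + size[1]);
    }
}
```

For a 1200 x 800 image and a 600-unit content width, the factor is 0.5 and the target size is 600 x 400.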


Convert an Image to PDF with the Same Width and Height

The following are the steps to convert an image to a PDF with the same page size as the image using Spire.PDF for Java.

Create a PdfDocument object.
Set the page margins to zero using PdfDocument.getPageSettings().setMargins() method.
Load an image using ImageIO.read() method, and get the image width and height.
Add a page to PDF based on the size of the image using PdfDocument.getPages().add() method.
Create a PdfImage object based on the image.
Draw the PdfImage object on the first page from the coordinate (0, 0) using PdfPageBase.getCanvas().drawImage() method.
Save the document to a PDF file using PdfDocument.saveToFile() method.

Java

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.graphics.PdfImage;

import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.FileInputStream;
import java.io.IOException;

public class ConvertImageToPdfWithSameSize {

    public static void main(String[] args) throws IOException {

        //Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        //Set the margins to 0
        doc.getPageSettings().setMargins(0);

        //Load an image
        BufferedImage image = ImageIO.read(new FileInputStream("C:\\Users\\Administrator\\Desktop\\announcement.jpg"));

        //Get the image width and height
        int width = image.getWidth();
        int height = image.getHeight();

        //Add a page of the same size as the image
        PdfPageBase page = doc.getPages().add(new Dimension(width, height));

        //Create a PdfImage object based on the image
        PdfImage pdfImage = PdfImage.fromImage(image);

        //Draw image at (0, 0) of the page
        page.getCanvas().drawImage(pdfImage, 0, 0, pdfImage.getWidth(), pdfImage.getHeight());

        //Save to file
        doc.saveToFile("output/ConvertPdfWithSameSize.pdf");
    }
}


Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Conversion

Tagged under

pdf java Conversion

Tuesday, 17 May 2022 09:05

How to Add a Digital Signature to a PDF Using Java

Digitally sign PDF with Java

Digital signatures play a crucial role in ensuring the authenticity and integrity of PDF documents. Whether you need to sign contracts, legal documents, or financial reports, adding a digital signature helps verify the signer's identity and prevents unauthorized modifications.

In this tutorial, we will explore how to add invisible and visible digital signatures to PDFs using Spire.PDF for Java. We will also cover how to create a signature field for later signing.

Java Library to Digitally Sign PDF Documents
Adding an Invisible Digital Signature to a PDF
Adding a Visible Digital Signature to a PDF
Creating a Signature Field in a PDF
Wrap Up
FAQs

Java Library to Digitally Sign PDF Documents

To work with digital signatures in PDFs, we will use Spire.PDF for Java, a powerful library that allows developers to create, edit, and sign PDF documents programmatically.

Key Features

Supports PFX certificates for digital signing.
Allows invisible and visible signatures.
Enables customization of signature appearance (image, text, details).
Works with existing PDF forms or creates new signature fields.

Prerequisites

Before you start, ensure you have:

Java Development Kit (JDK) installed.
Spire.PDF for Java added to your project.
A PFX certificate (for signing) and a sample PDF file.

Installation

Download Spire.PDF for Java from our website, and manually import the JAR file into your Java project. If you’re using Maven, add the following code to your project's pom.xml.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.7.5</version>
    </dependency>
</dependencies>

Adding an Invisible Digital Signature to a PDF

An invisible digital signature embeds cryptographic authentication without displaying a visual element. This is useful for internal verification while keeping the document clean. Below are the steps to add an invisible signature to a PDF using Spire.PDF.

Step-by-Step Guide

Initialize a PdfDocument object.
Load the PDF file that you want to sign.
Use PdfCertificate to load the PFX certificate with password.
Initialize a PdfOrdinarySignatureMaker object to manage the signing process.
Use the makeSignature method to embed the signature without visual elements.
Save the signed PDF to a new file.

Code Example

import com.spire.pdf.PdfDocument;
import com.spire.pdf.interactive.digitalsignatures.PdfCertificate;
import com.spire.pdf.interactive.digitalsignatures.PdfOrdinarySignatureMaker;

public class AddInvisibleSignature {

    public static void main(String[] args) {

        // Create a new PDF document object
        PdfDocument doc = new PdfDocument();

        // Load the input PDF file that needs to be signed
        doc.loadFromFile("C:/Users/Administrator/Desktop/Input.pdf");

        // Specify the path to the PFX certificate and its password
        String filePath = "C:/Users/Administrator/Desktop/certificate.pfx";
        String password = "e-iceblue";

        // Load the digital certificate (PFX format) with the given password
        PdfCertificate certificate = new PdfCertificate(filePath, password);

        // Create a signature maker object to apply the digital signature
        PdfOrdinarySignatureMaker signatureMaker = new PdfOrdinarySignatureMaker(doc, certificate);

        // Apply an invisible digital signature with the name "signature 1"
        signatureMaker.makeSignature("signature 1");

        // Save the signed PDF to a new file
        doc.saveToFile("Signed.pdf");

        // Release resources
        doc.dispose();
    }
}

Output:

A PDF file containing an invisible digital signature.

You might also be interested in: How to Verify Signatures in PDF in Java

Adding a Visible Digital Signature to a PDF

A visible digital signature displays signer details (name, reason, image) at a specified location. This is ideal for contracts where visual confirmation is needed. Here’s how to add a visible signature using Spire.PDF.

Step-by-Step Guide

Create a PdfDocument object and load your target PDF file.
Load the PFX certificate by initializing the PdfCertificate object.
Create a PdfOrdinarySignatureMaker instance.
Define the signer’s name, contact info, and reason for signing.
Design signature appearance by adding an image, labels, and setting the layout (SignImageAndSignDetail mode).
Use the makeSignature method to place the signature at the desired coordinates on the PDF.
Save the signed PDF to a new file.

Code Example

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.graphics.PdfImage;
import com.spire.pdf.interactive.digitalsignatures.*;

public class AddVisibleSignature {

    public static void main(String[] args) {

        // Create a new PDF document object
        PdfDocument doc = new PdfDocument();

        // Load the input PDF file that needs to be signed
        doc.loadFromFile("C:/Users/Administrator/Desktop/Input.pdf");

        // Specify the path to the PFX certificate and its password
        String filePath = "C:/Users/Administrator/Desktop/certificate.pfx";
        String password = "e-iceblue";

        // Load the digital certificate (PFX format) with the given password
        PdfCertificate certificate = new PdfCertificate(filePath, password);

        // Create a signature maker object to apply the digital signature
        PdfOrdinarySignatureMaker signatureMaker = new PdfOrdinarySignatureMaker(doc, certificate);

        // Get the pdf signature and set the sign details
        PdfSignature signature = signatureMaker.getSignature();
        signature.setName("Gary");
        signature.setContactInfo("112554");
        signature.setLocation("U.S.");
        signature.setReason("This is the final version.");

        // Create a signature appearance
        PdfSignatureAppearance appearance = new PdfSignatureAppearance(signature);

        // Set labels for the signature
        appearance.setNameLabel("Signer: ");
        appearance.setContactInfoLabel("Phone: ");
        appearance.setLocationLabel("Location: ");
        appearance.setReasonLabel("Reason: ");

        // Load an image
        PdfImage image = PdfImage.fromFile("C:/Users/Administrator/Desktop/signature.png");

        // Set the image as the signature image
        appearance.setSignatureImage(image);

        // Set the graphic mode as SignImageAndSignDetail
        appearance.setGraphicMode(GraphicMode.SignImageAndSignDetail);

        // Get the last page
        PdfPageBase page = doc.getPages().get(doc.getPages().getCount() - 1);

        // Add the signature to a specified location of the page
        signatureMaker.makeSignature("signature 1", page, 54.0f,  470.0f, 280.0f, 90.0f, appearance);

        // Save the signed PDF to a new file
        doc.saveToFile("Signed.pdf");

        // Release resources
        doc.dispose();
    }
}

Output:

A PDF file containing a visible digital signature that consists of an image, signer’s name, reason, etc.

Creating a Signature Field in a PDF

A signature field reserves a space in the PDF for later signing. This is useful for forms or documents that require user signatures. To create a signature field in a PDF, follow these steps:

Step-by-Step Guide

Create a PdfDocument object and load your PDF file.
Get a specific page (usually the last one) where the signature field will be placed.
Create a PdfSignatureField object for the selected page.
Customize the field’s appearance by setting border style and color, and field bounds.
Add the signature field to the document's form.
Save the updated document to a new PDF file.

Code Example

import com.spire.pdf.PdfDocument;
import com.spire.pdf.fields.PdfBorderStyle;
import com.spire.pdf.fields.PdfHighlightMode;
import com.spire.pdf.fields.PdfSignatureField;
import com.spire.pdf.graphics.PdfRGBColor;
import com.spire.pdf.PdfPageBase;
import java.awt.Rectangle;

public class AddDigitalSignatureField {

    public static void main(String[] args) {

        // Initialize a new PdfDocument object
        PdfDocument doc = new PdfDocument();

        // Load the existing PDF from the specified path
        doc.loadFromFile("C:/Users/Administrator/Desktop/Input.pdf");

        // Retrieve the last page of the document
        PdfPageBase page = doc.getPages().get(doc.getPages().getCount() - 1);

        // Create a signature field on the specified page
        PdfSignatureField signatureField = new PdfSignatureField(page, "signature");

        // Customize the appearance of the signature field
        signatureField.setBorderWidth(1.0f);
        signatureField.setBorderStyle(PdfBorderStyle.Solid);
        signatureField.setBorderColor(new PdfRGBColor(java.awt.Color.BLACK));
        signatureField.setHighlightMode(PdfHighlightMode.Outline);
        signatureField.setBounds(new Rectangle(54, 470, 200, 100));

        // Enable form creation if none exists in the document
        doc.setAllowCreateForm(doc.getForm() == null);

        // Add the signature field to the document's form
        doc.getForm().getFields().add(signatureField);

        // Save the modified document to a new file
        doc.saveToFile("SignatureField.pdf");

        // Release resources
        doc.dispose();
    }
}

Output:

A PDF file containing an unsigned signature field.

Wrap Up

In this tutorial, we explored how to add digital signatures to PDF documents in Java using the Spire.PDF library. We covered the steps for adding both invisible and visible signatures, as well as creating interactive signature fields. With these skills, you can enhance document security and ensure the integrity of your digital communications.

FAQs

Q1. What is a digital signature?

A digital signature is an electronic signature that uses cryptographic techniques to provide proof of the authenticity and integrity of a digital message or document.

Q2. Do I need a special certificate for signing PDFs?

Yes, a valid digital certificate (usually in PFX format) is required to sign PDF documents digitally.
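
Before passing a PFX file to a signing routine, it can help to confirm that the file and password are usable. The sketch below uses only the JDK's `java.security.KeyStore` API; `PfxCheck` and `listAliases` are hypothetical names, and the in-memory empty store exists only so the example runs without a real certificate. In practice you would pass a stream opened on your own `.pfx` file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.security.KeyStore;
import java.util.Collections;
import java.util.List;

public class PfxCheck {

    // Loads a PKCS#12 (PFX) store and returns the aliases of its entries.
    // Loading throws an exception if the data is not a readable PKCS#12 store.
    public static List<String> listAliases(InputStream pfx, char[] password) throws Exception {
        KeyStore store = KeyStore.getInstance("PKCS12");
        store.load(pfx, password);
        return Collections.list(store.aliases());
    }

    public static void main(String[] args) throws Exception {
        // Assumption for the demo: build an empty PKCS#12 store in memory
        // instead of reading a real certificate file from disk.
        char[] password = "changeit".toCharArray();
        KeyStore empty = KeyStore.getInstance("PKCS12");
        empty.load(null, null);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        empty.store(out, password);

        List<String> aliases = listAliases(new ByteArrayInputStream(out.toByteArray()), password);
        System.out.println("Entries found: " + aliases.size());
    }
}
```

A certificate file intended for signing would normally contain at least one entry, so an empty alias list is a hint that the wrong file was supplied.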

Q3. How do I verify a signed PDF?

You can verify a signed PDF using Adobe Reader or by using Spire.PDF’s PdfSignature.verifySignature() method.

Q4. How can I customize the appearance of my visible digital signature?

With Spire.PDF for Java, you can fully customize visible signatures by:

Setting text properties (font, color, labels for signer info).
Adding a signature image (e.g., company logo or scanned handwritten signature).
Choosing layout modes (SignImageOnly, SignDetail, or SignImageAndSignDetail).
Adjusting position and dimensions on the page.

Q5. Can I add a timestamp when digitally signing a PDF document?

Yes, you can. Refer to the code:

PdfPKCS7Formatter formatter = new PdfPKCS7Formatter(certificate, false);
formatter.setTimestampService(new TSAHttpService("http://tsa.cesnet.cz:3161/tsa"));
PdfOrdinarySignatureMaker signatureMaker = new PdfOrdinarySignatureMaker(doc, formatter);
signatureMaker.makeSignature("signature 1");

Get a Free License

To fully experience the capabilities of Spire.PDF for Java without any evaluation limitations, you can request a free 30-day trial license.

Published in Security

Tagged under

pdf java Security

Tuesday, 27 June 2023 08:23

Master PDF Compression in Java: Reduce PDF File Size Efficiently

Java PDF Compression Guide: Optimize File Size and Performance

Handling large PDF files is a common challenge for Java developers. PDFs with high-resolution images, embedded fonts, and multimedia content can quickly become heavy, slowing down applications, increasing storage costs, and creating a poor user experience—especially on mobile devices.

Mastering PDF compression in Java is essential to reduce file size efficiently while maintaining document quality. This step-by-step guide demonstrates how to compress and optimize PDF files in Java. You’ll learn how to compress document content, optimize images, fonts, and metadata, ensuring faster file transfers, improved performance, and a smoother user experience in your Java applications.

What You Will Learn

Setting Up Your Development Environment
1. Prerequisites
2. Adding Dependencies
Reduce PDF File Size by Compressing Document Content in Java
Reduce PDF File Size by Optimizing Specific Elements in Java
Full Java Example that Combines All PDF Compressing Techniques
Best Practices for PDF Compression
Conclusion
FAQs

1. Setting Up Your Development Environment

Before implementing PDF compression in Java, ensure your development environment is properly configured.

1.1. Prerequisites

Java Development Kit (JDK): Ensure you have JDK 1.8 or later installed.
Build Tool: Maven or Gradle is recommended for dependency management.
Integrated Development Environment (IDE): IntelliJ IDEA or Eclipse is suitable.

1.2. Adding Dependencies

To programmatically compress PDF files, you need a PDF library that supports compression features. Spire.PDF for Java provides APIs for loading, reading, editing, and compressing PDF documents. You can include it via Maven or Gradle.

Maven (pom.xml):
Add the following repository and dependency to your project's pom.xml file within the <repositories> and <dependencies> tags, respectively:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.10.3</version>
    </dependency>
</dependencies>

Gradle (build.gradle):
For Gradle users, add the repository and dependency as follows:

repositories {
    mavenCentral()
    maven {
        url "https://repo.e-iceblue.com/nexus/content/groups/public/"
    }
}

dependencies {
    implementation 'e-iceblue:spire.pdf:11.10.3'
}

After adding the dependency, refresh your Maven or Gradle project to download the necessary JAR files.

2. Reduce PDF File Size by Compressing Document Content in Java

One of the most straightforward techniques for reducing PDF file size is to apply document content compression. This approach automatically compresses the internal content streams of the PDF, such as text and graphics data, without requiring any manual fine-tuning. It is especially useful when you want a quick and effective solution that minimizes file size while maintaining document integrity.

The following example demonstrates how to enable and apply content compression in a PDF file using Java.

import com.spire.pdf.conversion.compression.PdfCompressor;

public class CompressContent {
    public static void main(String[] args){
        // Create a compressor
        PdfCompressor compressor = new PdfCompressor("test.pdf");

        // Enable document content compression
        compressor.getOptions().setCompressContents(true);

        // Compress and save
        compressor.compressToFile("ContentCompression.pdf");
    }
}

Key Points:

setCompressContents(true) enables document content compression.
Original PDFs remain unchanged; compressed files are saved separately.

3. Reduce PDF File Size by Optimizing Specific Elements in Java

Beyond compressing content streams, developers can also optimize individual elements of the PDF, such as images, fonts, and metadata. This allows for granular control over file size optimization.

3.1. Image Compression

Images are frequently the primary reason for large files. By lowering the image quality, you can significantly minimize the size of image-heavy PDF files.

import com.spire.pdf.conversion.compression.ImageCompressionOptions;
import com.spire.pdf.conversion.compression.ImageQuality;
import com.spire.pdf.conversion.compression.PdfCompressor;

public class CompressImages {
    public static void main(String[] args){
        // Load the PDF document
        PdfCompressor compressor = new PdfCompressor("test.pdf");

        // Get image compression options
        ImageCompressionOptions imageCompression = compressor.getOptions().getImageCompressionOptions();

        // Compress images and set quality
        imageCompression.setCompressImage(true);          // Enable image compression
        imageCompression.setImageQuality(ImageQuality.Low); // Set image quality (Low, Medium, High)
        imageCompression.setResizeImages(true);           // Resize images to reduce size

        // Save the compressed PDF
        compressor.compressToFile("ImageCompression.pdf");
    }
}

Key Points:

setCompressImage(true) enables image compression.
setImageQuality(...) adjusts the output image quality; the lower the quality, the smaller the image size.
setResizeImages(true) enables image resizing.

3.2. Font Compression or Unembedding

When a PDF uses custom fonts, the entire font file might be embedded, even if only a few characters are used. Font compression or unembedding is a technique that reduces the size of embedded fonts by compressing them or removing them entirely from the PDF.

import com.spire.pdf.conversion.compression.PdfCompressor;
import com.spire.pdf.conversion.compression.TextCompressionOptions;

public class CompressPDFWithOptions {
    public static void main(String[] args){
        // Load the PDF document
        PdfCompressor compressor = new PdfCompressor("test.pdf");

        // Get text compression options
        TextCompressionOptions textCompression = compressor.getOptions().getTextCompressionOptions();

        // Compress fonts
        textCompression.setCompressFonts(true);

        // Optional: unembed fonts to reduce size
        // textCompression.setUnembedFonts(true);

        // Save the compressed PDF
        compressor.compressToFile("FontOptimization.pdf");
    }
}

Key Points:

setCompressFonts(true) compresses embedded fonts while preserving document appearance.
setUnembedFonts(true) removes embedded fonts entirely, which may reduce file size but could affect text rendering if the fonts are not available on the system.

3.3. Metadata Removal

PDFs often store metadata such as author details, timestamps, and editing history that aren’t needed for viewing. Removing metadata reduces file size and protects sensitive information.

import com.spire.pdf.conversion.compression.PdfCompressor;

public class RemoveMetadata {
    public static void main(String[] args){
        // Load the PDF document
        PdfCompressor compressor = new PdfCompressor("test.pdf");

        // Remove metadata
        compressor.getOptions().setRemoveMetadata(true);

        // Save the compressed PDF
        compressor.compressToFile("MetadataRemoval.pdf");
    }
}

4. Full Java Example that Combines All PDF Compressing Techniques

After covering document content compression and element-specific optimizations (images, fonts, and metadata), let’s apply all these techniques together in one workflow.

import com.spire.pdf.conversion.compression.ImageQuality;
import com.spire.pdf.conversion.compression.OptimizationOptions;
import com.spire.pdf.conversion.compression.PdfCompressor;

public class CompressPDFWithAllTechniques {
    public static void main(String[] args){
        // Initialize compressor
        PdfCompressor compressor = new PdfCompressor("test.pdf");

        // Enable document content compression
        OptimizationOptions options = compressor.getOptions();
        options.setCompressContents(true);

        // Optimize images (downsampling and compression)
        options.getImageCompressionOptions().setCompressImage(true);
        options.getImageCompressionOptions().setImageQuality(ImageQuality.Low);
        options.getImageCompressionOptions().setResizeImages(true);

        // Optimize fonts (compression or unembedding)
        // Compress fonts
        options.getTextCompressionOptions().setCompressFonts(true);
        // Optional: unembed fonts to reduce size
        // options.getTextCompressionOptions().setUnembedFonts(true);
        
        // Remove unnecessary metadata
        options.setRemoveMetadata(true);

        // Save the compressed PDF
        compressor.compressToFile("CompressPDFWithAllTechniques.pdf");
    }
}

Reviewing the Compression Effect:

After running the code, the original sample PDF of 3.09 MB was reduced to 742 KB, a size reduction of approximately 76%.
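
The percentage quoted above can be reproduced with a short calculation. This is a minimal sketch assuming 1 MB = 1024 KB; `CompressionReport` and `reductionPercent` are hypothetical names used only for illustration.

```java
public class CompressionReport {

    // Percentage by which the file shrank: (1 - compressed / original) * 100
    public static double reductionPercent(double originalSize, double compressedSize) {
        if (originalSize <= 0) {
            throw new IllegalArgumentException("originalSize must be positive");
        }
        return (1.0 - compressedSize / originalSize) * 100.0;
    }

    public static void main(String[] args) {
        double originalKb = 3.09 * 1024; // 3.09 MB expressed in KB
        double compressedKb = 742;       // size after compression, in KB
        System.out.printf("Reduced by %.1f%%%n", reductionPercent(originalKb, compressedKb));
    }
}
```

In a real project the two sizes could come from `java.nio.file.Files.size()` called on the input and output files.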

Java Code Example to Compress PDF

5. Best Practices for PDF Compression

When applying PDF compression in Java, it’s important to follow some practical guidelines to ensure the file size is reduced effectively without sacrificing usability or compatibility.

Choose methods based on content: PDF compression depends heavily on the type of content. Text-based files may only require content and font optimization, while image-heavy documents benefit more from image compression. In many cases, combining multiple techniques yields the best results.
Balance quality with file size: Over-compression can degrade the document's readability, so it’s important to maintain a balance.
Test across PDF readers: Ensure compatibility with Adobe Acrobat, browser viewers, and mobile apps.

6. Conclusion

Compressing PDFs in Java is not just about saving disk space; it directly impacts performance, user experience, and system efficiency. Using libraries like Spire.PDF for Java, developers can implement fine-grained compression techniques, from compressing content and optimizing images and fonts to cleaning up unused metadata.

By applying the right strategies, you can minimize PDF size in Java significantly without sacrificing quality. This leads to faster file transfers, lower storage costs, and smoother rendering across platforms. Mastering these compression methods ensures your Java applications remain responsive and efficient, even when handling complex, resource-heavy PDFs.

7. FAQs

Q1: Can I reduce PDF file size in Java without losing quality?

A1: Yes. Spire.PDF allows selective compression of images, fonts, and other objects while maintaining readability and layout.

Q2: Will compressed PDFs remain compatible with popular PDF readers?

A2: Yes. Compressed PDFs remain compatible with Adobe Acrobat, browser viewers, mobile apps, and other standard PDF readers.

Q3: What’s the difference between image compression and font compression?

A3: Image compression reduces the size of embedded images, while font compression reduces embedded font data or removes unused fonts. Both techniques together optimize file size effectively.

Q4: How do I choose the best compression strategy?

A4: Consider the PDF content. Use image compression for image-heavy PDFs and font compression for text-heavy PDFs. Often, combining both techniques yields the best results without affecting readability.

Q5: Can I automate PDF compression for multiple files in Java?

A5: Yes. You can write a Java program that loops over a folder and batch compresses multiple PDFs, applying the same compression settings consistently across all files.
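
One way to structure such a batch job is sketched below using only the JDK. The class name `BatchCompress`, the `-compressed` naming rule, and the folder names are assumptions for illustration. The compression step is passed in as a callback so the sketch runs without the library; in a real job it would wrap the `PdfCompressor` calls shown in the earlier examples.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BatchCompress {

    // Lists the PDF files directly inside a folder, sorted by path
    public static List<Path> findPdfs(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Hypothetical naming rule: report.pdf -> <outDir>/report-compressed.pdf
    public static Path outputFor(Path input, Path outDir) {
        String name = input.getFileName().toString();
        String base = name.substring(0, name.length() - ".pdf".length());
        return outDir.resolve(base + "-compressed.pdf");
    }

    // Applies the same step to every PDF in inDir and returns the file count
    public static int compressAll(Path inDir, Path outDir,
                                  BiConsumer<Path, Path> compressStep) throws IOException {
        Files.createDirectories(outDir);
        List<Path> pdfs = findPdfs(inDir);
        for (Path pdf : pdfs) {
            compressStep.accept(pdf, outputFor(pdf, outDir));
        }
        return pdfs.size();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder step that only prints the plan. Replace it with the
        // compressor calls from the examples above to do the actual work.
        BiConsumer<Path, Path> step = (in, out) -> System.out.println(in + " -> " + out);
        int count = compressAll(Paths.get("."), Paths.get("compressed"), step);
        System.out.println(count + " PDF file(s) processed");
    }
}
```

Keeping the compression step separate from the file discovery makes it easy to reuse one set of options for every file.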

Published in Document Operation

Tagged under

pdf java Operation

Thursday, 09 June 2022 07:47

Merge PDF Files in Java: Full, Partial, and Stream-Based Merging

Merge two or multiple PDF files into a single file in Java

Merging PDFs in Java is a critical requirement for document-intensive applications, from consolidating financial reports to automating archival systems. However, developers face significant challenges in preserving formatting integrity or managing resource efficiency across diverse PDF sources. Spire.PDF for Java provides a robust and straightforward solution to streamline the PDF merging task.

This guide shows how to combine PDFs in Java, with practical examples for merging entire files, merging selected pages, and merging from streams.

Setting Up the Java PDF Merge Library
Merge Multiple PDF Files in Java
Merge Specific Pages from Multiple PDFs in Java
Merge PDF Files by Streams in Java
Conclusion
FAQs

Setting Up the Java PDF Merge Library

Why Choose Spire.PDF for Java?

No External Dependencies: Pure Java implementation.
Rich Features: Merge, split, encrypt, and annotate PDFs.
Cross-Platform: Works on Windows, Linux, and macOS.

Installation

Before using Spire.PDF for Java, you need to add it to your project.

Option 1: Maven

Add the repository and dependency to pom.xml:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.10.3</version>
    </dependency>
</dependencies>

Option 2: Manual JAR

Download the JAR from the E-iceblue website and add it to your project's build path.

Merge Multiple PDF Files in Java

This example is ideal when you want to merge two or more PDF documents entirely. It’s simple, straightforward, and perfect for batch processing.

How It Works:

Define File Paths: Create an array of strings containing the full paths to the source PDFs.
Merge Files: The mergeFiles() method takes the array of paths, combines the PDFs, and returns a PdfDocumentBase object representing the merged file.
Save the Result: The merged PDF is saved to a new file using the save() method.

Java code to combine PDFs:

import com.spire.pdf.FileFormat;
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfDocumentBase;

public class MergePdfs {
    public static void main(String[] args) {

        // Get the paths of the PDF documents to be merged 
        String[] files = new String[] {"sample-1.pdf", "sample-2.pdf", "sample-3.pdf"};

        // Merge these PDF documents
        PdfDocumentBase pdf = PdfDocument.mergeFiles(files);

        // Save the merged PDF file
        pdf.save("MergePDF.pdf", FileFormat.PDF);
    }
}

Best For:

Merging entire PDFs stored locally.
Simple batch operations where no page selection is needed.

Result: Combine three PDF files (a total of 10 pages) into one PDF file.

Merge multiple PDF files into a single PDF

Merging PDFs often results in large file sizes. To reduce the size, refer to: Compress PDF Files in Java.

Merge Specific Pages from Multiple PDFs in Java

Sometimes, you may only want to merge specific pages from different PDFs (e.g., pages 1-3 from File A and pages 2-5 from File B). This example gives you granular control over which pages to include from each source PDF.

How It Works:

Load PDFs: Load each source PDF into a PdfDocument object and store them in an array.
Create a New PDF: A blank PDF document is initialized to serve as the container for merged pages.
Insert Specific Pages:
- insertPage(): Inserts a specified page into the new PDF.
- insertPageRange(): Inserts a range of pages into the new PDF.
Save the Result: The merged PDF is saved using the saveToFile() method.

Java code to combine selected PDF pages:

import com.spire.pdf.PdfDocument;

public class MergeSelectedPages {

    public static void main(String[] args) {

        // Get the paths of the PDF documents to be merged
        String[] files = new String[] {"sample-1.pdf", "sample-2.pdf", "sample-3.pdf"};

        // Create an array of PdfDocument
        PdfDocument[] pdfs = new PdfDocument[files.length];

        // Loop through the documents
        for (int i = 0; i < files.length; i++)
        {
            // Load a specific document
            pdfs[i] = new PdfDocument(files[i]);
        }

        // Create a new PDF document
        PdfDocument pdf = new PdfDocument();

        // Insert the selected pages from different PDFs to the new PDF
        pdf.insertPage(pdfs[0], 0);
        pdf.insertPageRange(pdfs[1], 1, 3);
        pdf.insertPage(pdfs[2], 0);

        // Save the merged PDF
        pdf.saveToFile("MergePdfPages.pdf");
    }
}
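
The sample passes page indexes such as 0, which suggests the indexes are 0-based. Under that reading, insertPageRange(pdfs[1], 1, 3) would select the second through fourth pages; it is worth confirming this against the API documentation. Assuming 0-based indexes, a small helper can translate the 1-based page numbers people usually quote. `PageRanges` and `toZeroBased` are hypothetical names, not part of Spire.PDF.

```java
public class PageRanges {

    // Converts a 1-based inclusive page range (as a reader would quote it)
    // into 0-based {startIndex, endIndex}, validating it against pageCount.
    public static int[] toZeroBased(int firstPage, int lastPage, int pageCount) {
        if (firstPage < 1 || lastPage < firstPage || lastPage > pageCount) {
            throw new IllegalArgumentException(
                    "Invalid range " + firstPage + "-" + lastPage + " for " + pageCount + " pages");
        }
        return new int[] {firstPage - 1, lastPage - 1};
    }

    public static void main(String[] args) {
        // Pages 2-4 of a 10-page document
        int[] range = toZeroBased(2, 4, 10);
        System.out.println("start index " + range[0] + ", end index " + range[1]);
    }
}
```

Validating the range up front gives a clearer error than letting an out-of-range index reach the PDF library.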

Best For:

Creating custom PDFs with selected pages (e.g., extracting key sections from reports).
Scenarios where you need to exclude irrelevant pages from source documents.

Result: Combine selected pages from three separate PDF files into a new PDF

Combine specified pages from different PDFs into a new PDF file

Merge PDF Files by Streams in Java

In applications where PDFs are stored as streams (e.g., PDFs from network streams, in-memory data, or temporary files), Spire.PDF supports merging without saving files to disk.

How It Works:

Create Input Streams: The FileInputStream objects read the raw byte data of each PDF file.
Merge Streams: The mergeFiles() method accepts an array of streams, merges them, and returns a PdfDocumentBase object.
Save and Clean Up: The merged PDF is saved, and all streams and documents are closed to free system resources (critical for preventing leaks).

Java code to merge PDFs via streams:

import com.spire.pdf.*;
import java.io.*;

public class MergePdfsByStream {
    public static void main(String[] args) throws IOException {
        // Create FileInputStream objects for each PDF document file
        FileInputStream stream1 = new FileInputStream(new File("Template_1.pdf"));
        FileInputStream stream2 = new FileInputStream(new File("Template_2.pdf"));
        FileInputStream stream3 = new FileInputStream(new File("Template_3.pdf"));

        // Initialize an array of InputStream objects containing the file input streams
        InputStream[] streams = new FileInputStream[]{stream1, stream2, stream3};

        // Merge the input streams into a single PdfDocumentBase object
        PdfDocumentBase pdf = PdfDocument.mergeFiles(streams);

        // Save the merged PDF file
        pdf.save("MergePdfsByStream.pdf", FileFormat.PDF);

        // Releases system resources used by the merged document
        pdf.close();
        pdf.dispose();

        // Closes all input streams to free up resources
        stream1.close();
        stream2.close();
        stream3.close();
    }
}
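
In the example above, the close() calls run only if merging and saving succeed. A try-with-resources block closes the streams even when an exception is thrown. The sketch below shows the pattern with in-memory streams; `StreamCleanup`, `run`, and `consume` are hypothetical names, and `consume` is only a stand-in for the merge call so the example runs without the library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamCleanup {

    // Stand-in for the merge call: it only adds up the bytes available
    public static int consume(InputStream[] streams) throws IOException {
        int total = 0;
        for (InputStream stream : streams) {
            total += stream.available();
        }
        return total;
    }

    // Both streams are closed when the try block exits, normally or not
    public static int run(InputStream first, InputStream second) throws IOException {
        try (InputStream a = first; InputStream b = second) {
            return consume(new InputStream[] {a, b});
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory data stands in for PDF files read from disk or network
        int total = run(new ByteArrayInputStream(new byte[3]),
                        new ByteArrayInputStream(new byte[2]));
        System.out.println("Bytes seen: " + total);
    }
}
```

Resources declared in the try header are closed in reverse order of declaration, so no explicit close() calls are needed at the end of the method.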

Best For:

Merging PDFs from non-file sources (e.g., network downloads, in-memory generation).
Environments where direct file path access is restricted.

Conclusion

Spire.PDF for Java simplifies complex PDF merging tasks through its intuitive, user-friendly API. Whether you need to merge entire documents, create custom page sequences, or combine PDFs from stream sources, these examples enable efficient PDF merging in Java to address diverse document processing requirements.

To explore more features (e.g., encrypting merged PDFs, adding bookmarks), refer to the official documentation.

Frequently Asked Questions (FAQs)

Q1: Why do merged PDFs show "Evaluation Warning" watermarks?

A: The commercial version adds an evaluation watermark when no valid license is applied. Solutions:

Request a 30-day trial license to test without any restrictions.
Use the free version for documents ≤10 pages.

Q2: How do I control the order of pages in the merged PDF?

A: The order of pages in the merged PDF is determined by the order of input files (or streams) and the pages you select. For example:

In full-document merging, files in the input array are merged in the order they appear.
In selective page merging, use insertPage() or insertPageRange() in the sequence you want pages to appear.
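
Because the merge order follows the array order, one way to keep it predictable is to sort the input file names before merging. The following is a minimal sketch, assuming the files should be merged alphabetically by name. MergeOrder and inMergeOrder are hypothetical names, not Spire.PDF API, and the commented-out call assumes the mergeFiles() overload that accepts an array of file paths:

```java
import java.util.Arrays;

// Hypothetical helper (not part of Spire.PDF): fixes the merge order up front
final class MergeOrder {

    // Returns a sorted copy of the file names (case-insensitive); the input array is left unchanged
    static String[] inMergeOrder(String... fileNames) {
        String[] ordered = fileNames.clone();
        Arrays.sort(ordered, String.CASE_INSENSITIVE_ORDER);
        return ordered;
    }

    public static void main(String[] args) {
        String[] ordered = inMergeOrder("Template_2.pdf", "Template_3.pdf", "Template_1.pdf");
        System.out.println(String.join(", ", ordered));

        // Assumption: with Spire.PDF on the classpath, the sorted array is passed on as-is:
        // PdfDocumentBase pdf = PdfDocument.mergeFiles(ordered);
    }
}
```

This is a plain lexicographic sort, so "file10.pdf" comes before "file2.pdf". Zero-pad the numbers in file names or supply a custom comparator if numeric order matters.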

Q3: Can I merge password-protected PDFs?

A: Yes. Spire.PDF for Java supports merging encrypted PDFs, but you must provide the password when loading the file. Use the overloaded loadFromFile() method with the password parameter:

PdfDocument pdf = new PdfDocument();
pdf.loadFromFile("sample.pdf", "userPassword"); // Decrypt with password

Q4: How do I merge scanned/image-based PDFs?

A: Spire.PDF handles image-based PDFs like regular PDFs, but file sizes may increase significantly.

Published in Document Operation

Tagged under

pdf java Operation

