Administrator

Wednesday, 03 September 2025 01:16

Convert HTML to Text in Python | Simple Plain Text Output

Python Convert HTML Text Quickly and Easily

HTML (HyperText Markup Language) is a markup language used to create web pages, allowing developers to build rich and visually appealing layouts. However, HTML files often contain a large number of tags, which makes them difficult to read if you only need the main content. By using Python to convert HTML to text, this problem can be easily solved. Unlike raw HTML, the converted text file strips away all unnecessary markup, leaving only clean and readable content that is easier to store, analyze, or process further.

Install HTML to Text Converter in Python
Python Convert HTML File to Text
Python Convert HTML String to Text
The Conclusion
FAQs

Install HTML to Text Converter in Python

To simplify the task, we recommend using Spire.Doc for Python. This Python Word library allows you to quickly remove HTML markup and extract clean plain text with ease. It not only works as an HTML-to-text converter, but also offers a wide range of features—covering almost everything you can do in Microsoft Word.

To install it, you can run the following pip command:

pip install spire.doc

Alternatively, you can download the Spire.Doc package and install it manually.

Python Convert HTML Files to Text in 3 Steps

After preparing the necessary tools, let's dive into today's main topic: how to convert HTML to plain text using Python. With the help of Spire.Doc, this task can be accomplished in just three simple steps: create a new document object, load the HTML file, and save it as a text file. It’s straightforward and efficient, even for beginners. Let’s take a closer look at how this process can be implemented in code!

Code Example – Converting an HTML File to a Text File:

from spire.doc import *
from spire.doc.common import *

# Open an html file
document = Document()
document.LoadFromFile("/input/htmlsample.html", FileFormat.Html, XHTMLValidationType.none)
# Save it as a Text document.
document.SaveToFile("/output/HtmlFileTotext.txt", FileFormat.Txt)

document.Close()

The following is a preview comparison between the source document (.html) and the output document (.txt):

Python Convert an HTML File to a Text Document

Note that if the HTML file contains tables, the output text file will only retain the values within the tables and cannot preserve the original table formatting. If you want to keep certain styles while removing markup, it is recommended to convert HTML to a Word document . This way, you can retain headings, tables, and other formatting, making the content easier to edit and use.

How to Convert an HTML String to Text in Python

Sometimes, we don’t need the entire content of a web page and only want to extract specific parts. In such cases, you can convert an HTML string directly to text. This approach allows you to precisely control the information you need without further editing. Using Python to convert an HTML string to a text file is also straightforward. Here’s a detailed step-by-step guide:

Steps to convert an HTML string to a text document using Spire.Doc:

Input the HTML string directly or read it from a local file.
Create a Document object and add sections and paragraphs.
Use Paragraph.AppendHTML() method to insert the HTML string into a paragraph.
Save the document as a .txt file using Document.SaveToFile() method.

The following code demonstrates how to convert an HTML string to a text file using Python:

from spire.doc import *
from spire.doc.common import *

#Get html string.
#with open(inputFile) as fp:
    #HTML = fp.read()

# Load HTML from string
html = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>HTML to Text Example</title>
  <style>
    body { font-family: Arial, sans-serif; margin: 20px; }
    header { background: #f4f4f4; padding: 10px; }
    nav a { margin: 0 10px; text-decoration: none; color: #333; }
    main { margin-top: 20px; }
  </style>
</head>
<body>
  <header>
    <h1>My Demo Page</h1>
    <nav>
      <a href="#">Home</a>
      <a href="#">About</a>
      <a href="#">Contact</a>
    </nav>
  </header>
  
  <main>
    <h2>Convert HTML to Text</h2>
    <p>This is a simple demo showing how HTML content can be displayed before converting it to plain text.</p>
  </main>
</body>
</html>
"""

# Create a new document
document = Document()
section = document.AddSection()
section.AddParagraph().AppendHTML(html)

# Save directly as TXT
document.SaveToFile("/output/HtmlStringTotext.txt", FileFormat.Txt)
document.Close()

Here's the preview of the converted .txt file: Python Convert an HTML String to a Text Document

The Conclusion

In today’s tutorial, we focused on how to use Python to convert HTML to a text file. With the help of Spire.Doc, you can handle both HTML files and HTML strings in just a few lines of code, easily generating clean plain text files. If you’re interested in the other powerful features of the Python Word library, you can request a 30-day free trial license and explore its full capabilities for yourself.

FAQs about Converting HTML to Text in Python

Q1: How can I convert HTML to plain text using Python?

A: Use Spire.Doc to load an HTML file or string, insert it into a Document object with AppendHTML(), and save it as a .txt file.

Q2: Can I keep some formatting when converting HTML to text?

A: To retain styles like headings or tables, convert HTML to a Word document first, then export to text if needed.

Q3: Is it possible to convert only part of an HTML page to text?

A: Yes, extract the specific HTML segment as a string and convert it to text using Python for precise control.

Published in Conversion

Tagged under

doc Python Conversion

Saturday, 21 June 2025 07:01

Convert PPT or PPTX to Images with Python and Online Tools

Why You Should Convert PowerPoint Slides to Images
Install PPT to Image Library for Python
Convert PPT to PNG, JPG, BMP (Raster Images) in Python
- Save Slides to Images at Original Size
- Customize the Output Image Size
Convert PPT to SVG (Scalable Vector Graphics) in Python
- Save Slides as SVG
- Include Speaker Notes in SVG Output
Export Shapes as Images in Python
Convert PowerPoint to Images Online for Free(No Code)
Conclusion
FAQs

Install with pip

pip install spire.presentation.free

Why You Should Convert PowerPoint Slides to Images?

Converting slides to images provides several benefits for both technical and general use cases:

Universal Compatibility: Images can be viewed on any device without PowerPoint installed.
Easy Embedding: Ideal for websites, mobile apps, social media, and documentation.
Content Protection: Prevents unauthorized editing by distributing non-editable content.
Better for Printing: High-resolution image output improves print fidelity.
Archiving & Backup: Storing slides as images ensures long-term accessibility.

Install PPT to Image Library for Python

To convert PowerPoint to image formats in Python, install Free Spire.Presentation for Python, a free and feature-rich presentation processing library that supports high-fidelity exports to both raster and vector image formats.

Installation

Before getting started, install the library using the following pip command:

Package Manager

pip install spire.presentation.free

Once installed, you can import it into your project and begin converting slides with just a few lines of code.

Convert PPT to PNG, JPG, BMP (Raster Images) in Python

Raster images like PNG, JPG, and BMP are composed of pixels, making them suitable for digital sharing and printing. Free Spire.Presentation offers two methods for exporting slides as raster images: preserving original size or specifying custom dimensions.

Save Slides as Images at Original Size

This example demonstrates how to convert each slide in a PowerPoint presentation to a PNG image while preserving the original dimensions.

Python

from Free Spire.Presentation import *

# Load a PowerPoint presentation
ppt = Presentation()
ppt.LoadFromFile("Sample.pptx")

# Loop through each slide and export it as a PNG image at the original size
# You can change the image extension to .jpg or .bmp as needed
for i in range(ppt.Slides.Count):
    image = ppt.Slides[i].SaveAsImage()
    image.Save(f"RasterImages/ToImage_{i}.png")

ppt.Dispose()

Python example to save PPT slide to JPG, PNG and BMP)

Customize the Output Image Size

In some use cases, such as generating thumbnail previews or preparing high-resolution exports for print, you may need to customize the width and height of the output image. Here is how to achieve this:

Python

from Free Spire.Presentation import *

# Load a PowerPoint presentation
ppt = Presentation()
ppt.LoadFromFile("Sample.pptx")

# Loop through each slide and export it as a PNG image with a custom dimension 700 x 400
for i in range(ppt.Slides.Count):
    image = ppt.Slides[i].SaveAsImageByWH(700, 400)
    image.Save(f"RasterImages/ToImage_{i}.png")

ppt.Dispose()

Python example to save PPT slide to custom-sized image

Convert PowerPoint to SVG (Scalable Vector Graphics) in Python

Unlike raster images, SVG (Scalable Vector Graphics) preserves infinite scalability and clarity, making it the ideal format for responsive designs, technical diagrams, and printing at any size.

Save Slides as SVG

The code below demonstrates how to convert each slide to a standalone SVG file. The output files will contain scalable vector representations of slide content, including text, shapes, images, and more.

Python

from Free Spire.Presentation import *

# Load a PowerPoint presentation
ppt = Presentation()
ppt.LoadFromFile("Sample.pptx")

# Loop through each slide and export it as an SVG image
for i in range(ppt.Slides.Count):
    image = ppt.Slides[i].SaveToSVG()
    image.Save(f"VectorImages/ToSVG_{i}.svg")

ppt.Dispose()

Python example to export PPT slide to SVG

Include Speaker Notes in SVG Output

Some presentations include speaker notes that provide context or instructions during a talk. If these notes are relevant to your image export, you can configure the SVG output to include them by adding the following code before conversion:

Python

# Enable the IsNoteRetained property to retain notes when converting the presentation to SVG files
ppt.IsNoteRetained = True

Export Shapes as Images in Python

You can also extract individual shapes from slides and save them as images. This is especially useful for exporting diagrams, logos, or annotated graphics separately.

Python

from spire.presentation import *

# Load a PowerPoint presentation
ppt = Presentation()
ppt.LoadFromFile("Sample.pptx")

# Get the 3rd slide
slide = ppt.Slides[3]

# Loop through each shape on the slide and export it as an image
for i in range(slide.Shapes.Count):
    image = slide.Shapes.SaveAsImage(i, 96, 96)
    image.Save(f"Shapes/ShapeToImage{i}.png")

ppt.Dispose()

For detailed guidance on shapes to image conversion, please refer to our tutorial: Python: Save Shapes as Image Files in PowerPoint Presentations.

Convert PowerPoint to Images Online for Free (No Code)

For users who prefer not to install libraries or write code, online conversion tools offer a quick and convenient way to convert PowerPoint slides to image formats. One of the most reliable options is Cloudxdocs, a free web-based file conversion service that supports multiple output formats.

Use Cloudxdocs to Convert PPT or PPTX to Image

Cloudxdocs provides a user-friendly interface to convert a wide range of file formats, such as Word, Excel, PDF, and PowerPoint presentations, to various image formats in just a few steps. No account or software installation is required.

Key Benefits:

Supports PPT and PPTX files
Fast online conversion with downloadable results
Works on any browser (Windows, macOS, Linux, mobile)
No need to install Microsoft PowerPoint or Python

How to Use:

Go to Cloudxdocs's PPT to Image Converter.

Online PPT to Image Conversion Tool

Click "+" and upload your PowerPoint file to start the conversion.
Once finished, download your converted image files.

Tip: Online tools are best for quick, one-time conversions. For batch processing, or custom sizing, use the Python code-based method.

Conclusion

Converting PowerPoint slides and shapes to image formats expands the versatility of your presentation content. Whether you’re developing a web application, preparing materials for print, or simply looking to share slides in a more accessible format, this article provides two effective approaches:

Use Free Spire.Presentation for Python to programmatically export slides to JPG, PNG, BMP, or SVG formats with high-quality output and custom sizing options.
Try Cloudxdocs, a free online converter, for quick and hassle-free conversions—no installation or coding required.

Both methods help you repurpose PowerPoint content beyond the confines of the original format—making it easier to integrate, distribute, and preserve across platforms.

FAQs

Q1: How do I convert a PPTX file to PNG using Python?

A1: You can use the Free Spire.Presentation library to load the PPTX file and export each slide to a PNG image using the SaveAsImage() method. It supports high-quality raster image output.

Q2: Can I convert PowerPoint slides to JPG and BMP formats in Python?

A2: Yes. You can easily save each slide as a JPG or BMP file by changing the image format when saving the output image.

Q3: Can I extract and save individual shapes from a slide as images?

A3: Absolutely. You can extract specific shapes and save them independently as PNG or other image formats. See our guide to exporting shapes for details.

Q4: Can I customize the size of the output image when converting a slide?

A4: Yes. The SaveAsImageByWH(width, height) method lets you define custom dimensions for the output image in pixels, useful for thumbnails or print layouts.

Q5: Does the conversion require Microsoft PowerPoint to be installed?

A5: No. Free Spire.Presentation is a standalone library and does not depend on Microsoft Office or PowerPoint to perform the conversion.

Q6: Is there a way to convert PowerPoint to images without coding?

A6: Yes. You can use Cloudxdocs, a free online tool that converts PPT or PPTX files to images directly in your browser—no software installation needed.

Published in Tools

Thursday, 19 June 2025 06:58

How to Read Barcodes from PDF in C# – Easy Methods with Code

read barcode from PDF in C# overview image

Reading barcodes from PDF in C# is a common requirement in document processing workflows, especially when dealing with scanned forms or digital PDFs. In industries like logistics, finance, healthcare, and manufacturing, PDFs often contain barcodes—either embedded as images or rendered as vector graphics. Automating this process can reduce manual work and improve accuracy.

This guide shows how to read barcode from PDF with C# using two practical methods: extracting images embedded in PDF pages and scanning them, or rendering entire pages as images and detecting barcodes from the result. Both techniques support reliable recognition of 1D and 2D barcodes in different types of PDF documents.

Table of Contents

Getting Started: Tools and Setup
Step-by-Step: Read Barcodes from PDF in C#
- Method 1: Extract Embedded Images and Detect Barcodes
- Method 2: Render Page as Image and Scan
Which Method Should You Use?
Real-World Use Cases
FAQ

Getting Started: Tools and Setup

To extract or recognize barcodes from PDF documents using C#, make sure your environment is set up correctly.

Here’s what you need:

Any C# project that supports NuGet package installation (such as .NET Framework, .NET Core, or .NET).
The following libraries, Spire.Barcode for .NET for barcode recognition and Spire.PDF for .NET for PDF processing, can be installed via NuGet Package Manager:

Install-Package Spire.Barcode
Install-Package Spire.PDF

Step-by-Step: Read Barcodes from PDF in C#

There are two ways to extract barcode data from PDF files. Choose one based on how the barcode is stored in the document.

Method 1: Extract Embedded Images and Detect Barcodes

This method is suitable for scanned PDF documents, where each page often contains a raster image with one or more barcodes. The BarcodeScanner.ScanOne() method can read one barcode from one image.

Code Example: Extract and Scan

using Spire.Barcode;
using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.Drawing;

namespace ReadPDFBarcodeByExtracting
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load a PDF file
            PdfDocument pdf = new PdfDocument();
            pdf.LoadFromFile("Sample.pdf");

            // Get a page and the image information on the page
            PdfPageBase page = pdf.Pages[0];
            PdfImageHelper imageHelper = new PdfImageHelper();
            PdfImageInfo[] imagesInfo = imageHelper.GetImagesInfo(page);

            // Loop through the image information
            int index = 0;
            foreach (PdfImageInfo imageInfo in imagesInfo)
            {
                // Get the image as an Image object
                Image image = imageInfo.Image;
                // Scan the barcode and output the result
                string scanResult = BarcodeScanner.ScanOne((Bitmap)image);
                Console.WriteLine($"Scan result of image {index + 1}:\n" + scanResult + "\n");
                index++;
            }
        }
    }
}

The following image shows a scanned PDF page and the barcode recognition result using Method 1 (extracting embedded images):

Extract barcode from embedded image in PDF using C#

When to use: If the PDF is a scan or contains images with embedded barcodes.

You may also like: Generate Barcodes in C# (QR Code Example)

Method 2: Render Page as Image and Scan

When barcodes are drawn using vector elements (not embedded images), you can render each PDF page as a bitmap and perform barcode scanning on it. The BarcodeScanner.Scan() method can read multiple barcodes from one image.

Code Example: Render and Scan

using Spire.Barcode;
using Spire.Pdf;
using System.Drawing;

namespace ReadPDFBarcodeByExtracting
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load a PDF file
            PdfDocument pdf = new PdfDocument();
            pdf.LoadFromFile("Sample.pdf");

            // Save each page as an image
            for (int i = 0; i < pdf.Pages.Count; i++)
            {
                Image image = pdf.SaveAsImage(i);
                // Read the barcodes on the image
                string[] scanResults = BarcodeScanner.Scan((Bitmap)image);
                // Output the results
                for (int j = 0; j < scanResults.Length; j++)
                {
                    Console.WriteLine($"Scan result of barcode {j + 1} on page {i + 1}:\n" + scanResults[j] + "\n");
                }
            }
        }
    }
}

Below is the result of applying Method 2 (rendering full PDF page) to detect vector barcodes on the page:

Scan barcode from vector-rendered PDF page using C#

When to use: When barcodes are drawn on the page directly, not embedded as image elements.

Related article: Convert PDF Pages to Images in C#

Which Method Should You Use?

Use Case	Recommended Method
Scanned pages or scanned barcodes	Extract embedded images
Digital PDFs with vector barcodes	Render full page as image
Hybrid or unknown structure	Try both methods optionally

You can even combine both methods for maximum reliability when handling unpredictable document structures.

Real-World Use Cases

Here are some typical scenarios where barcode recognition from PDFs in C# proves useful:

Logistics automation Extract tracking numbers and shipping IDs from scanned labels, dispatch forms, or signed delivery receipts in bulk.
Invoice and billing systems Read barcode-based document IDs or payment references from digital or scanned invoices in batch processing tasks.
Healthcare document digitization Automatically scan patient barcodes from lab reports, prescriptions, or admission forms in PDF format.
Manufacturing and supply chain Recognize barcodes from packaging reports, quality control sheets, or equipment inspection PDFs.
Educational institutions Process barcoded student IDs on scanned test forms or attendance sheets submitted as PDFs.

Tip: In many of these use cases, PDFs come from scanners or online systems, which may embed barcodes as images or page content—both cases are supported with the two methods introduced above.

Conclusion

Reading barcodes from PDF files in C# can be achieved reliably using either image extraction or full-page rendering. Whether you need to extract a barcode from a scanned document or recognize one embedded in PDF content, both methods provide flexible solutions for barcode recognition in C#.

FAQ

Q: Does this work with multi-page PDFs?
Yes. You can loop through all pages in the PDF and scan each one individually.

Q: Can I extract multiple barcodes per page?
Yes. The BarcodeScanner.Scan() method can detect and return all barcodes found on each image.

Q: Can I improve recognition accuracy by increasing resolution?
Yes. When rendering a PDF page to an image, you can set a higher DPI using PdfDocument.SaveAsImage(pageIndex: int, PdfImageType.Bitmap: PdfImageType, dpiX: int, dpiY: int). For example, 300 DPI is ideal for small or low-quality barcodes.

Q: Can I read barcodes from PDF using C# for free?
Yes. You can use Free Spire.Barcode for .NET and Free Spire.PDF for .NET to read barcodes from PDF files in C#. However, the free editions have feature limitations, such as page count or supported barcode types. If you need full functionality without restrictions, you can request a free temporary license to evaluate the commercial editions.

Published in Create and Scan Barcode

Tagged under

barcode net

Thursday, 19 June 2025 01:49

Spire.PDF for Java 11.6.2 enhances the conversion form PDF to PDFA3B and PDFA1A

We are delighted to announce the release of Spire.PDF for Java 11.6.2. This version enhances the conversion from PDF to PDFA3B and PDFA1A as well as OFD to PDF. Besides, some known issues are fixed successfully in this version, such as the issue that the font was incorrect when replacing text. More details are listed below.

Here is a list of changes made in this release

Category	ID	Description
Bug	SPIREPDF-7485	Fixed the issue that spaces were lost after converting PDF to PDFA3B.
Bug	SPIREPDF-7497	Fixed the issue that signatures were lost after converting PDF to PDFA1A.
Bug	SPIREPDF-7506	Fixed the issue that the program threw NullPointerException after converting OFD to PDF.
Bug	SPIREPDF-7524	Fixed the issue that fonts were incorrect after replacing text.
Bug	SPIREPDF-7530	Fixed the issue that table layouts were incorrect after creating booklets using PdfBookletCreator.

Click the link below to download Spire.PDF for Java 11.6.2:

https://www.e-iceblue.com/Download/pdf-for-java.html

Published in Spire.PDF for Java

Friday, 13 June 2025 08:30

How to Read Excel Files in Java (XLS/XLSX) – Complete Guide

Cover image for tutorial on how to read Excel file in Java

Reading Excel files using Java is a common requirement in enterprise applications, especially when dealing with reports, financial data, user records, or third-party integrations. Whether you're building a data import feature, performing spreadsheet analysis, or integrating Excel parsing into a web application, learning how to read Excel files in Java efficiently is essential.

In this tutorial, you’ll discover how to read .xls and .xlsx Excel files using Java. We’ll use practical Java code examples which also cover how to handle large files, read Excel files from InputStream, and extract specific content line by line.

Table of Contents

1. Set Up Your Java Project
2. How to Read XLSX and XLS Files in Java
3. Best Practices for Large Excel Files
4. Full Example: Java Program to Read Excel File
5. Summary
6. FAQ

1. Set Up Your Java Project

To read Excel files using Java, you need a library that supports spreadsheet file formats. Spire.XLS for Java offers support for both .xls (legacy) and .xlsx (modern XML-based) files and provides a high-level API that makes Excel file reading straightforward.

Add Spire.XLS to Your Project

If you're using Maven, add the following to your pom.xml:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.xls</artifactId>
        <version>15.12.15</version>
    </dependency>
</dependencies>

If you're not using Maven, you can manually download the JAR from the official Spire.XLS website and add it to your classpath.

For smaller Excel processing tasks, you can also choose Free Spire.XLS for Java.

2. How to Read XLSX and XLS Files in Java

Java programs can easily read Excel files by loading the workbook and iterating through worksheets, rows, and cells. The .xlsx format is commonly used in modern Excel, while .xls is its older binary counterpart. Fortunately, Spire.XLS supports both formats seamlessly with the same code.

Load and Read Excel File (XLSX or XLS)

Here’s a basic example that loads an Excel file and prints its content:

import com.spire.xls.*;

public class ReadExcel {
    public static void main(String[] args) {
        // Create a workbook object and load the Excel file
        Workbook workbook = new Workbook();
        workbook.loadFromFile("data.xlsx"); // or "data.xls"

        // Get the first worksheet
        Worksheet sheet = workbook.getWorksheets().get(0);
        // Loop through each used row and column
        for (int i = 1; i <= sheet.getLastRow(); i++) {
            for (int j = 1; j <= sheet.getLastColumn(); j++) {
                // Get cell text of a cell range
                String cellText = sheet.getCellRange(i, j).getValue();
                System.out.print(cellText + "\t");
            }
            System.out.println();
        }
    }
}

You can replace the file path with an .xls file and the code remains unchanged. This makes it simple to read Excel files using Java regardless of format.

The Excel file being read and the output result shown in the console.

Java example reading xlsx or xls file

Read Excel File Line by Line with Row Objects

In scenarios like user input validation or applying business rules, processing each row as a data record is often more intuitive. In such cases, you can read the Excel file line by line using row objects via the getRows() method.

for (int i = 0; i < sheet.getRows().length; i++) {
    // Get a row
    CellRange row = sheet.getRows()[i];
    if (row != null && !row.isBlank()) {
        for (int j = 0; j < row.getColumns().length; j++) {
            String text = row.getColumns()[j].getText();
            System.out.print((text != null ? text : "") + "\t");
        }
        System.out.println();
    }
}

This technique works particularly well when reading Excel files in Java for batch operations or when you only need to process rows individually.

Read Excel File from InputStream

In web applications or cloud services, Excel files are often received as streams. Here’s how to read Excel files from an InputStream in Java:

import com.spire.xls.*;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

public class ReadExcel {
    public static void main(String[] args) throws FileNotFoundException {
        // Create a InputStream
        InputStream stream = new FileInputStream("data.xlsx");
        // Load the Excel file from the stream
        Workbook workbook = new Workbook();
        workbook.loadFromStream(stream);
        System.out.println("Load Excel file successfully.");
    }
}

This is useful when handling file uploads, email attachments, or reading Excel files stored in remote storage.

Read Excel Cell Values in Different Formats

Once you load an Excel file and get access to individual cells, Spire.XLS allows you to read the contents in various formats—formatted text, raw values, formulas, and more.

Here's a breakdown of what each method does:

CellRange cell = sheet.getRange().get(2, 1); // B2

// Formatted text (what user sees in Excel)
String text = cell.getText();

// Raw string value
String value = cell.getValue();

// Generic object (number, boolean, date, etc.)
Object rawValue = cell.getValue2();

// Formula (if exists)
String formula = cell.getFormula();

// Evaluated result of the formula
String result = cell.getEnvalutedValue();

// If it's a number cell
double number = cell.getNumberValue();

// If it's a date cell
java.util.Date date = cell.getDateTimeValue();

// If it's a boolean cell
boolean bool = cell.getBooleanValue();

Tip: Use getValue2() for flexible handling, as it returns the actual underlying object. Use getText() when you want to match Excel's visible content.

You May Also Like: How to Write Data into Excel Files in Java

3. Best Practices for Reading Large Excel Files in Java

When your Excel file contains tens of thousands of rows or multiple sheets, performance can become a concern. To ensure your Java application reads large Excel files efficiently:

Load only required sheets
Access only relevant columns or rows
Avoid storing entire worksheets in memory
Use row-by-row reading patterns

Here’s an efficient pattern for reading only non-empty rows:

for (int i = 1; i <= sheet.getRows().length; i++) {
    Row row = sheet.getRows()[i];
    if (row != null && !row.isBlank()) {
        // Process only rows with data
    }
}

Even though Spire.XLS handles memory efficiently, following these practices helps scale your Java Excel reading logic smoothly.

4. Full Example: Java Program to Read Excel File

Here’s a full working Java example that reads an Excel file (users.xlsx) with extended columns such as name, email, age, department, and status. The code extracts only the original three columns (name, email, and age) and filters the output for users aged 30 or older.

import com.spire.xls.*;

public class ExcelReader {
    public static void main(String[] args) {
        Workbook workbook = new Workbook();
        workbook.loadFromFile("users.xlsx");

        Worksheet sheet = workbook.getWorksheets().get(0);
        System.out.println("Name\tEmail\tAge");

        for (int i = 2; i <= sheet.getLastRow(); i++) {
            String name = sheet.getCellRange(i, 1).getValue();
            String email = sheet.getCellRange(i, 2).getValue();
            String ageText = sheet.getCellRange(i, 3).getValue();

            int age = 0;
            try {
                age = Integer.parseInt(ageText);
            } catch (NumberFormatException e) {
                continue;  // Skip rows with invalid age data
            }

            if (age >= 30) {
                System.out.println(name + "\t" + email + "\t" + age);
            }
        }
    }
}

Result of Java program reading the Excel file and printing its contents. Java program extracting and filtering Excel data based on age

This code demonstrates how to read specific cells from an Excel file in Java and output meaningful tabular data, including applying filters on data such as age.

5. Summary

To summarize, this article showed you how to read Excel files in Java using Spire.XLS, including both .xls and .xlsx formats. You learned how to:

Set up your Java project with Excel-reading capabilities
Read Excel files using Java in row-by-row or stream-based fashion
Handle legacy and modern Excel formats with the same API
Apply best practices when working with large Excel files

Whether you're reading from an uploaded spreadsheet, a static report, or a stream-based file, the examples provided here will help you build robust Excel processing features in your Java applications.

If you want to unlock all limitations and experience the full power of Excel processing, you can apply for a free temporary license.

6. FAQ

Q1: How to read an Excel file dynamically in Java?

To read an Excel file dynamically in Java—especially when the number of rows or columns is unknown—you can use getLastRow() and getLastColumn() methods to determine the data range at runtime. This ensures that your program can adapt to various spreadsheet sizes without hardcoded limits.

Q2: How to extract data from Excel file in Java?

To extract data from Excel files in Java, load the workbook and iterate through the cells using nested loops. You can retrieve values with getCellRange(row, column).getValue(). Libraries like Spire.XLS simplify this process and support both .xls and .xlsx formats.

Q3: How to read a CSV Excel file in Java?

If your Excel data is saved as a CSV file, you can read it using Java’s BufferedReader or file streams. Alternatively, Spire.XLS supports CSV parsing directly—you can load a CSV file by specifying the separator, such as Workbook.loadFromFile("data.csv", ","). This lets you handle CSV files along with Excel formats using the same API.

Q4: How to read Excel file in Java using InputStream?

Reading Excel files from InputStream in Java is useful in server-side applications, such as handling file uploads. With Spire.XLS, simply call workbook.loadFromStream(inputStream) and process it as you would with any file-based Excel workbook.

Published in Document Operation

Tagged under

xls java Operation

Friday, 13 June 2025 08:21

Spire.PDF for Python 11.6.1 supports converting PDF to Markdown

We are pleased to announce the release of Spire.PDF for Python 11.6.1. This version introduces support for the Linux Aarch64 platform and adds the highly anticipated PDF to Markdown conversion feature. Additionally, several known issues have been resolved, including a critical bug encountered during OFD to PDF conversion. More details are listed below.

Here is a list of changes made in this release

Category	ID	Description
New feature	SPIREPDF-6436	Adds support for Linux Aarch64 platform.
New feature	SPIREPDF-7508	Supports converting PDF to Markdown.
Bug	SPIREPDF-6746	Fixes the issue where converting OFD to PDF throws an "Arg_NullReferenceException" error.
Bug	SPIREPDF-6999	Fixes the issue that causes incorrect retrieval of signature fields in PDF documents.
Bug	SPIREPDF-7047	Fixes the issue where extracting text from PDF results in incomplete content.
Bug	SPIREPDF-7052	Fixes the issue where retrieving bold formatting in searched text throws an exception.
Bug	SPIREPDF-7103	Fixes the issue where converting PDF to Linearized PDF throws an "AttributeError".

Click the link to download Spire.PDF for Python 11.6.1：

https://www.e-iceblue.com/Download/Spire-PDF-Python.html

Published in Spire.PDF for Python

Thursday, 12 June 2025 07:07

How to Read PDFs in Java: Extract Text, Images, and More

read pdf in java

In today's data-driven landscape, reading PDF files effectively is essential for Java developers. Whether you're handling scanned invoices, structured reports, or image-rich documents, the ability to read PDFs in Java can enhance workflows and reveal critical insights.

This guide will walk you through practical implementations using Spire.PDF for Java to master PDF reading in Java. You will learn to extract searchable text, retrieve embedded images, read tabular data, and perform OCR on scanned PDF documents.

Table of Contents:

Java Library for Reading PDF Content
Extract Text from Searchable PDFs
Retrieve Images from PDFs
Read Table Data from PDF Files
Convert Scanned PDFs to Text via OCR
Conclusion
FAQs

Java Library for Reading PDF Content

When it comes to reading PDF in Java, choosing the right library is half the battle. Spire.PDF stands out as a robust, feature-rich solution for developers. It supports text extraction, image retrieval, table parsing, and even OCR integration. Its intuitive API and comprehensive documentation make it ideal for both beginners and experts.

To start extracting PDF content, download Spire.PDF for Java from our website and add it as a dependency in your project. If you’re using Maven, include the following in your pom.xml:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.5.2</version>
    </dependency>
</dependencies>

Below, we’ll explore how to leverage Spire.PDF for various PDF reading tasks.

Extract Text from Searchable PDFs in Java

Searchable PDFs store text in a machine-readable format, allowing for efficient content extraction. The PdfTextExtractor class in Spire.PDF provides a straightforward way to access page content, while PdfTextExtractOptions allows for flexible extraction settings, including options for handling special text layouts and specifying areas for extraction.

Step-by-Step Guide

Initialize a new instance of PdfDocument to work with your PDF file.
Use the loadFromFile method to load the desired PDF document.
Loop through each page of the PDF using a for loop.
For each page, create an instance of PdfTextExtractor to facilitate text extraction.
Create a PdfTextExtractOptions object to specify how text should be extracted, including any special strategies.
Call the extract method on the PdfTextExtractor instance to retrieve the text from the page.
Write the extracted text to a text file.

The example below shows how to retrieve text from every page of a PDF and output it to individual text files.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.texts.PdfTextExtractOptions;
import com.spire.pdf.texts.PdfTextExtractor;
import com.spire.pdf.texts.PdfTextStrategy;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTextFromSearchablePdf {

    public static void main(String[] args) throws IOException {

        // Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        // Load a PDF file
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf");

        // Iterate through all pages
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            // Get the current page
            PdfPageBase page = doc.getPages().get(i);

            // Create a PdfTextExtractor object
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            // Create a PdfTextExtractOptions object
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

            // Specify extract option
            extractOptions.setStrategy(PdfTextStrategy.None);

            // Extract text from the page
            String text = textExtractor.extract(extractOptions);

            // Define the output file path
            Path outputPath = Paths.get("output/Extracted_Page_" + (i + 1) + ".txt");

            // Write to a txt file
            Files.write(outputPath, text.getBytes());
        }

        // Close the document
        doc.close();
    }
}

Result:

Input PDF document and output txt file with text extracted from the PDF.

Retrieve Images from PDFs in Java

The PdfImageHelper class in Spire.PDF enables efficient extraction of embedded images from PDF documents. It identifies images using PdfImageInfo objects, allowing for easy saving as standard image files.

Step-by-Step Guide

Initialize a new instance of PdfDocument to work with your PDF file.
Use the loadFromFile method to load the desired PDF.
Instantiate PdfImageHelper to assist with image extraction.
Loop through each page of the PDF.
For each page, retrieve all image information using the getImagesInfo method.
Loop through the retrieved image information, extract each image, and save it as a PNG file.

The following example extracts all embedded images from a PDF document and saves them as individual PNG files.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.utilities.PdfImageHelper;
import com.spire.pdf.utilities.PdfImageInfo;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class ExtractAllImages {

    public static void main(String[] args) throws IOException {

        // Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        // Load a PDF document
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf");

        // Create a PdfImageHelper object
        PdfImageHelper imageHelper = new PdfImageHelper();

        // Declare an int variable
        int m = 0;

        // Iterate through the pages
        for (int i = 0; i < doc.getPages().getCount(); i++) {

            // Get a specific page
            PdfPageBase page = doc.getPages().get(i);

            // Get all image information from the page
            PdfImageInfo[] imageInfos = imageHelper.getImagesInfo(page);

            // Iterate through the image information
            for (int j = 0; j < imageInfos.length; j++)
            {
                // Get a specific image information
                PdfImageInfo imageInfo = imageInfos[j];

                // Get the image
                BufferedImage image = imageInfo.getImage();
                File file = new File(String.format("output/Image-%d.png",m));
                m++;

                // Save the image file in PNG format
                ImageIO.write(image, "PNG", file);
            }
        }

        // Clear up resources
        doc.dispose();
    }
}

Result:

Input PDF document and image file extracted from the PDF.

Read Table Data from PDF Files in Java

For PDF tables that need conversion to structured data, PdfTableExtractor intelligently recognizes cell boundaries and relationships. The resulting PdfTable objects maintain the original table organization, allowing for cell-level data export.

Step-by-Step Guide

Initialize an instance of PdfDocument to handle your PDF file.
Use the loadFromFile method to open the desired PDF.
Instantiate PdfTableExtractor to facilitate table extraction.
Iterate through each page of the PDF to extract tables.
For each page, retrieve tables into a PdfTable array using the extractTable method.
For each table, iterate through its rows and columns to extract data.
Write the extracted data to individual text files.

This Java code extracts table data from a PDF document and saves each table as a separate text file.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;

import java.io.FileWriter;

public class ExtractTableData {
    public static void main(String[] args) throws Exception {

        // Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        // Load a PDF document
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf");

        // Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(doc);

        // Initialize a table counter
        int tableCounter = 1;

        // Loop through the pages in the PDF
        for (int pageIndex = 0; pageIndex < doc.getPages().getCount(); pageIndex++) {

            // Extract tables from the current page into a PdfTable array
            PdfTable[] tableLists = extractor.extractTable(pageIndex);

            // If any tables are found
            if (tableLists != null && tableLists.length > 0) {

                // Loop through the tables in the array
                for (PdfTable table : tableLists) {

                    // Create a StringBuilder for the current table
                    StringBuilder builder = new StringBuilder();

                    // Loop through the rows in the current table
                    for (int i = 0; i < table.getRowCount(); i++) {

                        // Loop through the columns in the current table
                        for (int j = 0; j < table.getColumnCount(); j++) {

                            // Extract data from the current table cell and append to the StringBuilder 
                            String text = table.getText(i, j);
                            builder.append(text).append(" | ");
                        }
                        builder.append("\r\n");
                    }

                    // Write data into a separate .txt document for each table
                    FileWriter fw = new FileWriter("output/Table_" + tableCounter + ".txt");
                    fw.write(builder.toString());
                    fw.flush();
                    fw.close();

                    // Increment the table counter
                    tableCounter++;
                }
            }
        }

        // Clear up resources
        doc.dispose();
    }
}

Result:

Input PDF document and txt file containing table data extracted from the PDF.

Convert Scanned PDFs to Text via OCR

Scanned PDFs require special handling through OCR engine such as Spire.OCR for Java. The solution first converts pages to images using Spire.PDF's rendering engine, then applies Spire.OCR's recognition capabilities via the OcrScanner class. This two-step approach effectively transforms physical document scans into editable text while supporting multiple languages.

Step 1. Install Spire.OCR and Configure the Environment

Download Spire.OCR for Java and add the Jar file as a dependency in your project.
Download the model that fits in with your operating system from one of the following links, and unzip the package somewhere on your disk.
Configure the model in your code.

OcrScanner scanner = new OcrScanner();
configureOptions.setModelPath("D:\\win-x64");// model path

For detailed steps, refer to: Extract Text from Images Using the New Model of Spire.OCR for Java

Step 2. Convert a Scanned PDF to Text

This code example converts each page of a scanned PDF into an image, applies OCR to extract text, and saves the results in a text file.

import com.spire.ocr.OcrException;
import com.spire.ocr.OcrScanner;
import com.spire.ocr.ConfigureOptions;
import com.spire.pdf.PdfDocument;
import com.spire.pdf.graphics.PdfImageType;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractTextFromScannedPdf {

    public static void main(String[] args) throws IOException, OcrException {

        // Create an instance of the OcrScanner class
        OcrScanner scanner = new OcrScanner();

        // Configure the scanner
        ConfigureOptions configureOptions = new ConfigureOptions();
        configureOptions.setModelPath("D:\\win-x64"); // Set model path
        configureOptions.setLanguage("English"); // Set language

        // Apply the configuration options
        scanner.ConfigureDependencies(configureOptions);

        // Load a PDF document
        PdfDocument doc = new PdfDocument();
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf");

        // Prepare temporary directory
        String tempDirPath = "temp";
        new File(tempDirPath).mkdirs(); // Create temp directory

        StringBuilder allText = new StringBuilder();

        // Iterate through all pages
        for (int i = 0; i < doc.getPages().getCount(); i++) {

            // Convert page to image
            BufferedImage bufferedImage = doc.saveAsImage(i, PdfImageType.Bitmap);
            String imagePath = tempDirPath + File.separator + String.format("page_%d.png", i);
            ImageIO.write(bufferedImage, "PNG", new File(imagePath));

            // Perform OCR
            scanner.scan(imagePath);
            String pageText = scanner.getText().toString();
            allText.append(String.format("\n--- PAGE %d ---\n%s\n", i + 1, pageText));

            // Clean up temp image
            new File(imagePath).delete();
        }

        // Save all extracted text to a file
        Path outputTxtPath = Paths.get("output", "extracted_text.txt");
        Files.write(outputTxtPath, allText.toString().getBytes());

        // Close the document
        doc.close();
        System.out.println("Text extracted to " + outputTxtPath);
    }
}

Conclusion

Mastering how to read PDF in Java opens up a world of possibilities for data extraction and document automation. Whether you’re dealing with searchable text, images, tables, or scanned documents, the right tools and techniques can simplify the process.

By leveraging libraries like Spire.PDF and integrating OCR for scanned files, you can build robust solutions tailored to your needs. Start experimenting with the code snippets provided and unlock the full potential of PDF processing in Java!

FAQs

Q1: Can I extract text from scanned PDFs using Java?

Yes, by combining Spire.PDF with Spire.OCR. Convert PDF pages to images and perform OCR to extract text.

Q2: What’s the best library for reading PDFs in Java?

Spire.PDF is highly recommended for its versatility and ease of use. It supports extraction of text, images, tables, and OCR integration.

Q3: Does Spire.PDF support extraction of PDF elements like metadata, attachments, and hyperlinks?

Yes, Spire.PDF provides comprehensive support for extracting:

Metadata (title, author, keywords)
Attachments (embedded files)
Hyperlinks (URLs and document links)

The library offers dedicated classes like PdfDocumentInformation for metadata and methods to retrieve embedded files ( PdfAttachmentCollection ) and hyperlinks ( PdfUriAnnotation ).

Q4: How to parse tables from PDFs into CSV/Excel programmatically?

Using Spire.PDF for Java, you can extract table data from PDFs, then seamlessly export it to Excel (XLSX) or CSV format with Spire.XLS for Java. For a step-by-step guide, refer to our tutorial: Export Table Data from PDF to Excel in Java.

Get a Free License

To fully experience the capabilities of Spire.PDF for Java without any evaluation limitations, you can request a free 30-day trial license.

Published in Document Operation

Tagged under

pdf java Operation

Thursday, 12 June 2025 07:00

Python Tutorial: Delete Text Boxes in PowerPoint Automatically

Text boxes are one of the most common elements used to display content in PowerPoint. However, as slides get frequently edited, you may end up with a clutter of unnecessary text boxes. Manually deleting them can be time-consuming. This guide will show you how to delete text boxes in PowerPoint using Python. Whether you want to delete all text boxes, remove a specific one, or clean up only the empty ones, you'll learn how to do it in just a few lines of code — saving time and making your workflow much more efficient. Visual guide for removing auto filters in Excel using Spire.XLS for .NET and C#

Install the Python Library
Delete All Text Boxes
Delete a Specific Text Box
Delete Empty Text Boxes
Compare All Three Methods
FAQs

Install the Python Library for PowerPoint Automation

To make this task easier, installing the right Python library is essential. In this guide, we’ll use Spire.Presentation for Python to demonstrate how to automate the removal of text boxes in a PowerPoint file. As a standalone third-party component, Spire.Presentation doesn’t require Microsoft Office to be installed on your machine. Its API is simple and beginner-friendly, and installation is straightforward — just run:

pip install spire.presentation

Alternatively, you can download the package for custom installation. A free version is also available, which is great for small projects and testing purposes.

How to Delete All Text Boxes in PowerPoint

Let’s start by looking at how to delete all text boxes — a common need when you're cleaning up a PowerPoint template. Instead of adjusting each text box and its content manually, it's often easier to remove them all and then re-add only what you need. With the help of Spire.Presentation, you can use the IAutoShape.Remove() method to remove text boxes in just a few lines of code. Let’s see how it works in practice. Steps to delete all text boxes in a PowerPoint presentation with Python:

Create an instance of Presentation class, and load a sample PowerPoint file.
Loop through all slides and all shapes on slides, and check if the shape is IAutoShape and if it is a text box.
Remove text boxes in the PowerPoint presentation through IAutoShape.Remove() method.
Save the modified PowerPoint file.

The following is a complete code example for deleting all text boxes in a PowerPoint presentation:

from spire.presentation import *

# Create a Presentation object and load a PowerPoint file
presentation = Presentation()
presentation.LoadFromFile("E:/Administrator/Python1/input/pre1.pptx")

# Loop through all slides 
for slide in presentation.Slides:
    # Loop through all shapes in the slide
    for i in range(slide.Shapes.Count - 1, -1, -1):
        shape = slide.Shapes[i]
        # Check if the shape is IAutoShape and is a text box
        if isinstance(shape, IAutoShape) and shape.IsTextBox:
            # Remove the shape
            slide.Shapes.Remove(shape)

# Save the modified presentation
presentation.SaveToFile("E:/Administrator/Python1/output/RemoveAllTextBoxes.pptx", FileFormat.Pptx2013)
presentation.Dispose()

Using Python to Delete All Text Boxes in PowerPoint

Warm Tip: When looping through shapes, use reverse order to avoid skipping any elements after deletion.

How to Delete a Specific Text Box in PowerPoint

If you only need to remove a few specific text boxes — for example, the first text box on the second slide — this method is perfect for you. In Python, you can first locate the target slide by its index, then identify the text box by its content, and finally remove it. This approach gives you precise control when you know exactly which text box needs to be deleted. Let’s walk through how to do this in practice. Steps to delete a specific text box in PowerPoint using Python:

Create an object of Presentation class and read a PowerPoint document.
Get a slide using Presentation.Slides[] property.
Loop through each shape on the slide and check if it is the target text box.
Remove the text box through IAutoShape.Remove() method.
Save the modified PowerPoint presentation.

The following code demonstrates how to delete a text box with the content "Text Box 1" on the second slide of the presentation:

from spire.presentation import *

# Create a new Presentation object and load a PowerPoint file 
presentation = Presentation()
presentation.LoadFromFile("E:/Administrator/Python1/input/pre1.pptx")

# Get the second slide
slide = presentation.Slides[1]

# Loop through all shapes on the slide
for i in range(slide.Shapes.Count - 1, -1, -1):
    shape = slide.Shapes[i]
    # Check if the shape is a text box and its text is "Text Box 1"
    if isinstance(shape, IAutoShape) and shape.IsTextBox:
        if shape.TextFrame.Text.strip() == "Text Box 1":
            slide.Shapes.Remove(shape)

# Save the modified presentation
presentation.SaveToFile("E:/Administrator/Python1/output/RemoveSpecificTextbox.pptx", FileFormat.Pptx2013)
presentation.Dispose()

Remove a Specific Text Box in a PowerPoint Slide Using Python

How to Delete Empty Text Boxes in PowerPoint

Another common scenario is removing all empty text boxes from a PowerPoint file — especially when you're cleaning up slides exported from other tools or merging multiple presentations and want to get rid of unused placeholders. Instead of checking each slide manually, automating the process with Python allows you to quickly remove all blank text boxes and keep only the meaningful content. It’s a far more efficient approach. Steps to delete empty text boxes in PowerPoint file using Python:

Create an object of Presentation class, and load a PowerPoint file.
Loop through all slides and all shapes on slides.
Check if the shape is a text box and is empty.
Remove text boxes in the PowerPoint presentation through IAutoShape.Remove() method.
Save the modified PowerPoint file.

Here's the code example that shows how to delete empty text boxes in a PowerPoint presentation:

from spire.presentation import *

# Create a Presentation instance and load a sample file
presentation = Presentation()
presentation.LoadFromFile("E:/Administrator/Python1/input/pre1.pptx")

# Loop through each slide
for slide in presentation.Slides:
    # Iterate through shapes 
    for i in range(slide.Shapes.Count - 1, -1, -1):
        shape = slide.Shapes[i]
        # Check if the shape is a textbox and its text is empty
        if isinstance(shape, IAutoShape) and shape.IsTextBox:
            text = shape.TextFrame.Text.strip()
            # Remove the shape if it is empty
            if not text:
                slide.Shapes.Remove(shape)

# Save the result file 
presentation.SaveToFile("E:/Administrator/Python1/output/RemoveEmptyTextBoxes.pptx", FileFormat.Pptx2013)
presentation.Dispose()

Delete All Empty Text Boxes from a PowerPoint Presentation with Python

Compare All Three Methods: Which One Should You Use?

Each of the three methods we've discussed has its own ideal use case. If you're still unsure which one fits your needs after reading through them, the table below will help you compare them at a glance — so you can quickly pick the most suitable solution.

Method	Best For	Keeps Valid Content?
Delete All Text Boxes	Cleaning up entire templates or resetting slides	❌ No
Delete Specified Text Box	When you know exactly which text box to remove (e.g., slide 2, shape 1)	✅ Yes
Delete Empty Text Boxes	Cleaning up imported or merged presentations	✅ Yes

Conclusion and Best Practice

Whether you're refreshing templates, fine-tuning individual slides, or cleaning up empty placeholders, automating PowerPoint with Python can save you hours of manual work. Choose the method that fits your workflow best — and start making your presentations cleaner and more efficient today.

FAQs about Deleting Text Boxes in PowerPoint

Q1: Why can't I delete a text box in PowerPoint?
A: One common reason is that the text box is placed inside the Slide Master layout. In this case, it can’t be selected or deleted directly from the normal slide view. You’ll need to go to the View → Slide Master tab, locate the layout, and delete it from there.

Q2: How can I delete a specific text box using Python?
A: You can locate the specific text box by accessing the slide and then searching for the shape based on its index or text content. Once identified, use the IAutoShape.Remove() method to delete it. This is useful when you know exactly which text box needs to be removed.

Q3: Is it possible to remove a text box without deleting the content?
A: If you want to keep the content but remove the text box formatting (like borders or background), you can extract the text before deleting the shape and reinsert it elsewhere — for example, as a plain paragraph. However, PowerPoint doesn’t natively support detaching text from its container without removing the shape.

Published in Image and Shapes

Tagged under

ppt Python Image and Shapes

Wednesday, 11 June 2025 01:58

How to Extract Text from Image Using Python (OCR Code Examples)

Python extracting text from image with OCR visualization

Extracting text from images using Python is a widely used technique in OCR-driven workflows such as document digitization, form recognition, and invoice processing. Many important documents still exist only as scanned images or photos, making it essential to convert visual information into machine-readable text.

With the help of powerful Python libraries, you can easily perform text extraction from image files with Python — even for multilingual documents or layout-sensitive content. In this article, you’ll learn how to use Python to extract text from an image, through practical OCR examples, useful tips, and proven methods to improve recognition accuracy.

The guide is structured as follows:

Powerful Python Library to Extract Text from Image
Step-by-Step: Python Code to Extract Text from Image
- Basic OCR Text Extraction (Image to Plain Text)
- Extract Text from Image with Coordinates
Real-World Use Cases for Text Extraction from Images
Supported Languages and Image Formats
How to Improve OCR Accuracy (Best Practices)
FAQ

Powerful Python Library to Extract Text from Image

Spire.OCR for Python is a powerful OCR library for Python, especially suited for applications requiring structured layout extraction and multilingual support. This Python OCR engine supports:

Text recognition with layout and position information
Multilingual support (English, Chinese, French, etc.)
Supports multiple image formats including JPG, PNG, BMP, GIF, and TIFF

Setup: Install Dependencies and OCR Models

Before extracting text from images using Python, you need to install the spire.ocr library and download the OCR model files compatible with your operating system.

1. Install the Spire.OCR Python Package

Use pip to install the Spire.OCR for Python package:

pip install spire.ocr

2. Download the OCR Model Package

Download the OCR model files based on your OS:

Windows: win-x64.zip
Linux: linux.zip
macOS: mac.zip
linux_aarch: linux_aarch.zip

After downloading, extract the files and set the model path in your Python script when configuring the OCR engine.

Step-by-Step: Python Code to Extract Text from Image

In this section, we’ll walk through different ways to extract text from images using Python — starting with a simple plain-text extraction, and then moving to more advanced structured recognition.

Basic OCR Text Extraction (Image to Plain Text)

Here’s how to extract plain text from an image using Python:

from spire.ocr import *

# Create OCR scanner instance
scanner = OcrScanner()

# Configure OCR model path and language
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'D:\OCR\win-x64'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)

# Perform OCR on the image
scanner.Scan(r'Sample.png')

# Save extracted text to file
text = scanner.Text.ToString()
with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(text + '\n')

Optional: Clean and Preprocess Extracted Text (Post-OCR)

After OCR, the output may contain empty lines or noise. This snippet shows how to clean the text:

# Clean extracted text: remove empty or short lines
clean_lines = [line.strip() for line in text.split('\n') if len(line.strip()) > 2]
cleaned_text = '\n'.join(clean_lines)

# Save to a clean version
with open('output_clean.txt', 'w', encoding='utf-8') as file:
    file.write(cleaned_text)

Use Case: Useful for post-processing OCR output before feeding into NLP tasks or database storage.

Here’s an example of plain-text OCR output using Spire.OCR:

Python code extracting plain text from image using Spire.OCR

Extract Text from Image with Coordinates

In forms or invoices, you may need both text content and layout. The code below outputs each block’s bounding box info:

from spire.ocr import *

scanner = OcrScanner()

configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'D:\OCR\win-x64'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)

scanner.Scan(r'sample.png')
text = scanner.Text

# Extract block-level text with position
block_text = ""
for block in text.Blocks:
    rectangle = block.Box
    block_info = f'{block.Text} -> x: {rectangle.X}, y: {rectangle.Y}, w: {rectangle.Width}, h: {rectangle.Height}'
    block_text += block_info + '\n'

with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(block_text + '\n')

Extract Text from Multiple Images in a Folder

You can also batch process a folder of images:

import os
from spire.ocr import *

def extract_text_from_folder(folder_path, model_path):
    scanner = OcrScanner()
    config = ConfigureOptions()
    config.ModelPath = model_path
    config.Language = 'English'
    scanner.ConfigureDependencies(config)

    for filename in os.listdir(folder_path):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(folder_path, filename)
            scanner.Scan(image_path)
            text = scanner.Text.ToString()

            # Save each result as a separate file
            output_file = os.path.splitext(filename)[0] + '_output.txt'
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(text)

# Example usage
extract_text_from_folder(r'D:\images', r'D:\OCR\win-x64')

The recognized text blocks with position information are shown below:

OCR text extraction with coordinates and layout blocks in Python

Real-World Use Cases for Text Extraction from Images

Python-based OCR can be applied in:

✅ Invoice and receipt scanning
✅ Identity document OCR (passport, license)
✅ Business card digitization
✅ Form and survey data extraction
✅ Multilingual document indexing

Tip: For text extraction from PDF documents instead of images, you might also want to explore this tutorial on extracting text from PDF using Python.

Supported Languages and Image Formats

Spire.OCR supports multiple languages and a wide range of image formats for broader application scenarios.

Supported Languages:

English
Simplified / Traditional Chinese
French
German
Japanese
Korean

You can set the language using configureOptions.Language.

Supported Image Formats:

JPG / JPEG
PNG
BMP
GIF
TIFF

How to Improve OCR Accuracy (Best Practices)

For better OCR text extraction from images using Python, follow these tips:

Use high-resolution images (≥300 DPI)
Preprocess with grayscale, thresholding, or denoising
Avoid skewed or noisy scans
Match the OCR language with the image content

FAQ

How to extract text from an image in Python code?

To extract text from an image using Python, you can use an OCR library like Spire.OCR for Python. With just a few lines of Python code, you can recognize text in scanned documents or photos and convert it into editable, searchable content.

What is the best Python library to extract text from image?

Spire.OCR for Python is a powerful Python OCR library that offers high-accuracy recognition, multilingual support, and layout-aware output. It also works seamlessly with Spire.Office components, allowing full automation — such as saving extracted text to Excel, Word, or searchable PDFs. You can also explore open-source tools to build your Python text extraction from image projects, depending on your specific needs and preferences.

How to extract data (including position) from image in Python?

When performing text extraction from image using Python, Spire.OCR provides not just the recognized text, but also bounding box coordinates for each block — ideal for processing structured content like tables, forms, or receipts.

How to extract text using Python from scanned PDF files?

To perform text extraction from scanned PDF files using Python, you can first convert each PDF page into an image, then apply OCR using Spire.OCR for Python. For this, we recommend using Spire.PDF for Python — it allows you to save PDF pages as images or directly extract embedded images from scanned PDFs, making it easy to integrate with your OCR pipeline.

Conclusion: Efficient Text Extraction from Images with Python

Thanks to powerful libraries like Spire.OCR, text extraction from images in Python is both fast and reliable. Whether you're processing receipts or building an intelligent OCR pipeline, this approach gives you precise control over both content and layout.

If you want to remove usage limitations of Spire.OCR for Python, you can apply for a free temporary license.

Published in Recognize Text

Tagged under

OCR Python

Tuesday, 10 June 2025 08:16

Spire.PDF 11.6.0 enhances the conversion from PDF to images

We're pleased to announce the release of Spire.PDF 11.6.0. This version successfully fixes some known issues that occurred when converting PDF to images, loading and previewing PDF files. More details are listed below.

Here is a list of changes made in this release

Category	ID	Description
Bug	SPIREPDF-7302	Fixes the issue that converting PDF to images produced incorrect results.
Bug	SPIREPDF-7491	Fixed the issue where the program threw an "ArgumentOutOfRangeException" error when loading a PDF file.
Bug	SPIREPDF-7525	Fixed the issue where the program threw a "NullReferenceException" when calling the PdfDocument.Preview() method.

Click the link to download Spire.PDF 11.6.0:

https://www.e-iceblue.com/Download/download-pdf-for-net-now.html

More information of Spire.PDF new release or hotfix:

https://www.e-iceblue.com/forum/spire-pdf-new-release-or-hotfix-t4704.html

Published in Spire.PDF for .NET

Start «1 234 5 6 7 8 9 10 »End

Page 3 of 46

Install HTML to Text Converter in Python

Python Convert HTML Files to Text in 3 Steps

How to Convert an HTML String to Text in Python

The Conclusion

FAQs about Converting HTML to Text in Python

Table of Contents

Install with pip

Related Links

Table of Contents

Why You Should Convert PowerPoint Slides to Images?

Install PPT to Image Library for Python

Convert PPT to PNG, JPG, BMP (Raster Images) in Python

Save Slides as Images at Original Size

Customize the Output Image Size

Convert PowerPoint to SVG (Scalable Vector Graphics) in Python

Save Slides as SVG

Include Speaker Notes in SVG Output

Export Shapes as Images in Python

Convert PowerPoint to Images Online for Free (No Code)

Use Cloudxdocs to Convert PPT or PPTX to Image

Conclusion

FAQs

Q1: How do I convert a PPTX file to PNG using Python?

Q2: Can I convert PowerPoint slides to JPG and BMP formats in Python?

Q3: Can I extract and save individual shapes from a slide as images?

Q4: Can I customize the size of the output image when converting a slide?

Q5: Does the conversion require Microsoft PowerPoint to be installed?

Q6: Is there a way to convert PowerPoint to images without coding?

Getting Started: Tools and Setup

Step-by-Step: Read Barcodes from PDF in C#

Method 1: Extract Embedded Images and Detect Barcodes

Method 2: Render Page as Image and Scan

Which Method Should You Use?

Real-World Use Cases

Conclusion

FAQ

1. Set Up Your Java Project

Add Spire.XLS to Your Project

2. How to Read XLSX and XLS Files in Java

Load and Read Excel File (XLSX or XLS)

Read Excel File Line by Line with Row Objects

Read Excel File from InputStream

Read Excel Cell Values in Different Formats

3. Best Practices for Reading Large Excel Files in Java

4. Full Example: Java Program to Read Excel File

5. Summary

6. FAQ

Q1: How to read an Excel file dynamically in Java?

Q2: How to extract data from Excel file in Java?

Q3: How to read a CSV Excel file in Java?

Q4: How to read Excel file in Java using InputStream?

Java Library for Reading PDF Content

Extract Text from Searchable PDFs in Java

Retrieve Images from PDFs in Java

Read Table Data from PDF Files in Java

Convert Scanned PDFs to Text via OCR

Step 1. Install Spire.OCR and Configure the Environment

Step 2. Convert a Scanned PDF to Text

Conclusion

FAQs

Q1: Can I extract text from scanned PDFs using Java?

Q2: What’s the best library for reading PDFs in Java?

Q3: Does Spire.PDF support extraction of PDF elements like metadata, attachments, and hyperlinks?

Q4: How to parse tables from PDFs into CSV/Excel programmatically?

Get a Free License

Install the Python Library for PowerPoint Automation

How to Delete All Text Boxes in PowerPoint

How to Delete a Specific Text Box in PowerPoint

How to Delete Empty Text Boxes in PowerPoint

Compare All Three Methods: Which One Should You Use?

Conclusion and Best Practice

FAQs about Deleting Text Boxes in PowerPoint

Powerful Python Library to Extract Text from Image

Setup: Install Dependencies and OCR Models

Step-by-Step: Python Code to Extract Text from Image

Basic OCR Text Extraction (Image to Plain Text)

Extract Text from Image with Coordinates

Real-World Use Cases for Text Extraction from Images

Supported Languages and Image Formats

Supported Languages: