Spire.Office Knowledgebase Page 5 | E-iceblue

Python: Convert HTML to Text Quickly and Easily

HTML (HyperText Markup Language) is a markup language used to create web pages, allowing developers to build rich and visually appealing layouts. However, HTML files often contain a large number of tags, which makes them difficult to read if you only need the main content. By using Python to convert HTML to text, this problem can be easily solved. Unlike raw HTML, the converted text file strips away all unnecessary markup, leaving only clean and readable content that is easier to store, analyze, or process further.

Install HTML to Text Converter in Python

To simplify the task, we recommend using Spire.Doc for Python. This Python Word library allows you to quickly remove HTML markup and extract clean plain text with ease. It not only works as an HTML-to-text converter, but also offers a wide range of features—covering almost everything you can do in Microsoft Word.

To install it, you can run the following pip command:

pip install spire.doc

Alternatively, you can download the Spire.Doc package and install it manually.

Python Convert HTML Files to Text in 3 Steps

After preparing the necessary tools, let's dive into today's main topic: how to convert HTML to plain text using Python. With the help of Spire.Doc, this task can be accomplished in just three simple steps: create a new document object, load the HTML file, and save it as a text file. It’s straightforward and efficient, even for beginners. Let’s take a closer look at how this process can be implemented in code!

Code Example – Converting an HTML File to a Text File:

from spire.doc import *
from spire.doc.common import *

# Open an html file
document = Document()
document.LoadFromFile("/input/htmlsample.html", FileFormat.Html, XHTMLValidationType.none)
# Save it as a Text document.
document.SaveToFile("/output/HtmlFileTotext.txt", FileFormat.Txt)

document.Close()

The following is a preview comparison between the source document (.html) and the output document (.txt):

Python Convert an HTML File to a Text Document

Note that if the HTML file contains tables, the output text file retains only the cell values and cannot preserve the original table layout. If you want to keep certain styles while removing markup, convert the HTML to a Word document instead. This way, you can retain headings, tables, and other formatting, making the content easier to edit and use.
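If you only need the row and column structure of a table rather than its styling, one option outside Spire.Doc is to pull the cell values out yourself. The following is a minimal sketch using only Python's standard html.parser module. The names TableTextParser and table_to_tsv are hypothetical helpers introduced for illustration, and the sketch assumes simple, non-nested tables whose cells are explicitly closed.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect table cell text row by row (assumes no nested tables)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_tsv(html):
    """Return the table cells as tab-separated lines, one line per row."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


sample = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Pen</td><td> 3 </td></tr></table>"
print(table_to_tsv(sample))
```

Tab-separated output keeps each row on its own line, so the columns remain recognizable in a plain text file even though fonts, borders, and merged cells are lost.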

How to Convert an HTML String to Text in Python

Sometimes, we don’t need the entire content of a web page and only want to extract specific parts. In such cases, you can convert an HTML string directly to text. This approach allows you to precisely control the information you need without further editing. Using Python to convert an HTML string to a text file is also straightforward. Here’s a detailed step-by-step guide:

Steps to convert an HTML string to a text document using Spire.Doc:

  • Input the HTML string directly or read it from a local file.
  • Create a Document object and add sections and paragraphs.
  • Use the Paragraph.AppendHTML() method to insert the HTML string into a paragraph.
  • Save the document as a .txt file using the Document.SaveToFile() method.

The following code demonstrates how to convert an HTML string to a text file using Python:

from spire.doc import *
from spire.doc.common import *

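# Option 1: read the HTML string from a local file
# (inputFile is a placeholder for your own path)
# with open(inputFile, encoding="utf-8") as fp:
#     html = fp.read()

# Option 2: define the HTML string inline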
html = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>HTML to Text Example</title>
  <style>
    body { font-family: Arial, sans-serif; margin: 20px; }
    header { background: #f4f4f4; padding: 10px; }
    nav a { margin: 0 10px; text-decoration: none; color: #333; }
    main { margin-top: 20px; }
  </style>
</head>
<body>
  <header>
    <h1>My Demo Page</h1>
    <nav>
      <a href="#">Home</a>
      <a href="#">About</a>
      <a href="#">Contact</a>
    </nav>
  </header>
  
  <main>
    <h2>Convert HTML to Text</h2>
    <p>This is a simple demo showing how HTML content can be displayed before converting it to plain text.</p>
  </main>
</body>
</html>
"""

# Create a new document
document = Document()
section = document.AddSection()
section.AddParagraph().AppendHTML(html)

# Save directly as TXT
document.SaveToFile("/output/HtmlStringTotext.txt", FileFormat.Txt)
document.Close()

Here's the preview of the converted .txt file:

Python Convert an HTML String to a Text Document
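When the part you want is one element inside a larger page, you first need that element's markup as a string. The following is a minimal sketch using Python's standard html.parser module, not a Spire.Doc feature. FragmentExtractor and extract_fragment are hypothetical names introduced here, and the sketch returns only the first element with the given tag name.

```python
from html.parser import HTMLParser


class FragmentExtractor(HTMLParser):
    """Collect the raw markup of the first element with a given tag name."""

    def __init__(self, tag):
        # Keep entities such as &amp; untouched so the fragment stays valid HTML
        super().__init__(convert_charrefs=False)
        self.tag = tag
        self.depth = 0      # nesting depth inside the target element
        self.done = False   # True once the first match has been closed
        self.parts = []

    def _inside(self):
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == self.tag:
            self.depth += 1
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as written
        if self._inside():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._inside():
            return
        self.parts.append(f"</{tag}>")
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self._inside():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._inside():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._inside():
            self.parts.append(f"&#{name};")


def extract_fragment(page_html, tag):
    """Return the markup of the first <tag> element, or "" if none is found."""
    parser = FragmentExtractor(tag)
    parser.feed(page_html)
    parser.close()
    return "".join(parser.parts)
```

With the demo page from the example above, extract_fragment(html, "main") would return only the main block, which can then be passed to Paragraph.AppendHTML() in place of the full page so that the header and navigation links never reach the text file.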

Conclusion

In today’s tutorial, we focused on how to use Python to convert HTML to a text file. With the help of Spire.Doc, you can handle both HTML files and HTML strings in just a few lines of code, easily generating clean plain text files. If you’re interested in the other powerful features of the Python Word library, you can request a 30-day free trial license and explore its full capabilities for yourself.

FAQs about Converting HTML to Text in Python

Q1: How can I convert HTML to plain text using Python?

A: With Spire.Doc, load an HTML file using Document.LoadFromFile(), or insert an HTML string into a paragraph using Paragraph.AppendHTML(), then save the document as a .txt file using Document.SaveToFile().

Q2: Can I keep some formatting when converting HTML to text?

A: To retain styles like headings or tables, convert HTML to a Word document first, then export to text if needed.

Q3: Is it possible to convert only part of an HTML page to text?

A: Yes, extract the specific HTML segment as a string and convert it to text using Python for precise control.

Convert PDF to CSV in Java – extract tables and save as CSV

When working with reports, invoices, or datasets stored in PDF format, developers often need a way to reuse the tabular data in spreadsheets, databases, or analytical tools. A common solution is to convert PDF to CSV using Java, since CSV is lightweight, structured, and compatible with almost every platform.

Unlike text or image export, a PDF-to-CSV conversion is mainly about extracting tables from PDF and saving them as CSV. With the help of Spire.PDF for Java, you can detect table structures in PDFs and export them programmatically with just a few lines of code.

In this article, you’ll learn step by step how to perform a PDF to CSV conversion in Java—from setting up the environment, to extracting tables, and even handling more complex scenarios like multi-page documents or multiple tables per page.


Environment Setup for PDF to CSV Conversion in Java

Before extracting tables and converting PDF to CSV using Java, you need to set up the development environment. This involves choosing a suitable library and adding it to your project.

Why Choose Spire.PDF for Java

Since PDF files do not provide a built-in export to CSV, extracting tables programmatically is the practical approach. Spire.PDF for Java offers APIs to detect table structures in PDF documents and save them directly as CSV files, making the conversion process simple and efficient.

Install Spire.PDF for Java

Add Spire.PDF for Java to your project using Maven:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.10.3</version>
    </dependency>
</dependencies>

If you are not using Maven, you can download the Spire.PDF for Java package and add the JAR files to your project’s classpath.


Extract Tables from PDF and Save as CSV

The most practical way to perform PDF to CSV conversion is by extracting tables. With Spire.PDF for Java, this can be done with just a few steps:

  1. Load the PDF document.
  2. Use PdfTableExtractor to find tables on each page.
  3. Collect cell values row by row.
  4. Write the output into a CSV file.

Here is a Java example that shows the process from start to finish:

Java Code Example for PDF to CSV Conversion

import com.spire.pdf.*;
import com.spire.pdf.utilities.*;

import java.io.*;

public class PdfToCsvExample {
    public static void main(String[] args) throws Exception {
        // Load the PDF document
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("Sample.pdf");

        // Create a StringBuilder to store the CSV text
        StringBuilder sb = new StringBuilder();

        // Create the table extractor once and reuse it for every page
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        // Iterate through each page (page indexes are zero-based)
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            PdfTable[] tableLists = extractor.extractTable(i);

            if (tableLists != null) {
                for (PdfTable table : tableLists) {
                    for (int row = 0; row < table.getRowCount(); row++) {
                        for (int col = 0; col < table.getColumnCount(); col++) {
                            // Escape the cell text safely
                            String cellText = escapeCsvField(table.getText(row, col));
                            sb.append(cellText);

                            if (col < table.getColumnCount() - 1) {
                                sb.append(",");
                            }
                        }
                        sb.append("\n");
                    }
                }
            }
        }

        // Write the output to a CSV file; try-with-resources closes the writer
        // even if the write fails (the "output" folder must already exist)
        try (FileWriter writer = new FileWriter("output/PDFTable.csv")) {
            writer.write(sb.toString());
        }

        pdf.close();
        System.out.println("PDF tables successfully exported to CSV.");
    }

    // Utility method to escape CSV fields
    private static String escapeCsvField(String text) {
        if (text == null) return "";

        // Remove line breaks so each table row stays on one CSV line
        text = text.replaceAll("[\\n\\r]", "");

        // Quote the field if it contains a delimiter or a double quote
        if (text.contains(",") || text.contains(";") || text.contains("\"")) {
            text = text.replace("\"", "\"\"");  // Escape double quotes
            text = "\"" + text + "\"";          // Wrap with quotes
        }

        return text;
    }
}

Code Walkthrough

  • PdfDocument loads the PDF file into memory.
  • PdfTableExtractor checks each page for tables.
  • PdfTable provides access to rows and columns.
  • escapeCsvField() removes line breaks and safely quotes/escapes text if needed.
  • StringBuilder accumulates cell text, separated by commas.
  • The result is written into output/PDFTable.csv, which you can open in Excel or any editor.

CSV file generated from a PDF table after running the Java code.

CSV output example from PDF table in Java


Handling Complex PDF-to-CSV Conversion Cases

In practice, PDFs often contain multiple tables, span multiple pages, or have irregular structures. Let’s see how to extend the solution to handle these scenarios.

1. Multiple Tables per Page

The PdfTable[] returned by extractTable(i) contains all tables detected on a page. You can process each one separately. For example, to save each table as a different CSV file:

for (int i = 0; i < pdf.getPages().getCount(); i++) {
    PdfTableExtractor extractor = new PdfTableExtractor(pdf);
    PdfTable[] tableLists = extractor.extractTable(i);

    if (tableLists != null) {
        for (int t = 0; t < tableLists.length; t++) {
            PdfTable table = tableLists[t];
            StringBuilder tableContent = new StringBuilder();

            for (int row = 0; row < table.getRowCount(); row++) {
                for (int col = 0; col < table.getColumnCount(); col++) {
                    tableContent.append(escapeCsvField(table.getText(row, col)));
                    if (col < table.getColumnCount() - 1) {
                        tableContent.append(",");
                    }
                }
                tableContent.append("\n");
            }

            // Page and table indexes in the file name are zero-based
            try (FileWriter writer = new FileWriter("Table_Page" + i + "_Index" + t + ".csv")) {
                writer.write(tableContent.toString());
            }
        }
    }
}

Example of multiple tables in one PDF page exported into separate CSV files.

Export multiple tables from one PDF page to CSV in Java

This way, every table is saved as an independent CSV file for better organization.

2. Multi-page or Large Tables

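If a table spans multiple pages, iterating page by page collects every part of it. The key is to append each page's rows to the same buffer instead of overwriting it. Note that a header row repeated on every page is appended each time, so you may need to drop the duplicates afterwards: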

StringBuilder sb = new StringBuilder();

for (int i = 0; i < pdf.getPages().getCount(); i++) {
    PdfTableExtractor extractor = new PdfTableExtractor(pdf);
    PdfTable[] tables = extractor.extractTable(i);

    if (tables != null) {
        for (PdfTable table : tables) {
            for (int row = 0; row < table.getRowCount(); row++) {
                for (int col = 0; col < table.getColumnCount(); col++) {
                    sb.append(escapeCsvField(table.getText(row, col)));
                    if (col < table.getColumnCount() - 1) sb.append(",");
                }
                sb.append("\n");
            }
        }
    }
}

// try-with-resources closes the writer even if the write fails
try (FileWriter writer = new FileWriter("MergedTables.csv")) {
    writer.write(sb.toString());
}

Example of a large table across multiple PDF pages merged into one CSV file.

Merge multi-page PDF table into one CSV file in Java

Here, all tables across pages are merged into one CSV file, useful when dealing with continuous reports.

3. Limitations with Formatting

CSV only stores plain text values. Elements like merged cells, fonts, or images are discarded. If preserving styling is critical, exporting to Excel (.xlsx) is a better alternative, which the same library also supports. See How to Export PDF Table to Excel in Java for more details.

4. CSV Special Characters Handling

When writing tables to CSV, certain characters such as commas, semicolons, double quotes, or line breaks can break the file structure if not handled properly. In the Java examples above, the escapeCsvField method removes line breaks and wraps a field in double quotes (doubling any embedded quotes) when it contains one of these characters.

For more advanced scenarios, you can also use Spire.XLS for Java to write data into worksheets and then save as CSV, which automatically handles special characters and ensures correct CSV formatting without manual processing.

Alternatively, for open-source options, libraries like OpenCSV or Apache Commons CSV also automatically handle special characters and CSV formatting, reducing potential issues and simplifying code.
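Whichever writer you use, it can be worth reading the generated file back with an independent CSV parser to confirm that the quoting survived. The following is a small language-neutral check written in Python with the standard csv module. It is a sketch, not part of Spire.PDF, and the helper names parse_csv_text and irregular_rows are hypothetical.

```python
import csv
import io


def parse_csv_text(csv_text):
    """Parse CSV text with the standard csv module and return the rows."""
    # newline="" lets the csv module handle line endings inside quoted fields
    return list(csv.reader(io.StringIO(csv_text, newline="")))


def irregular_rows(rows):
    """Return (row_number, column_count) for rows whose width differs from row 1."""
    if not rows:
        return []
    expected = len(rows[0])
    return [(n, len(r)) for n, r in enumerate(rows, start=1) if len(r) != expected]


sample = 'Name,Note\n"Smith, J.","said ""hi"""\nbroken,row,extra\n'
rows = parse_csv_text(sample)
print(rows[1])
print(irregular_rows(rows))
```

A correctly quoted field such as "Smith, J." parses back as a single value, while a row with an unescaped comma shows up with an extra column. To check a real file, open it with newline="" and pass the file object to csv.reader in the same way.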


Conclusion

Converting PDF to CSV in Java essentially means extracting tables and saving them in a structured format. CSV is widely supported, lightweight, and ideal for storing and analyzing tabular data. By setting up Spire.PDF for Java and following the code example, you can automate this process, saving time and reducing manual effort.

If you want to explore more advanced features of Spire.PDF for Java, please apply for a free trial license. You can also use Free Spire.PDF for Java for small projects.


FAQ

Q: Can I turn a PDF into a CSV file? A: Yes. While images and styled text cannot be exported, you can extract tables and save them as CSV files using Java.

Q: How to extract data from a PDF file in Java? A: Use a PDF library like Spire.PDF for Java to parse the document, detect tables, and export them to CSV or Excel.

Q: What is the best PDF to CSV converter? A: For Java developers, programmatic solutions such as Spire.PDF for Java offer more flexibility and automation than manual converters.

Q: How to convert PDF to Excel using Java code? A: The process is similar to CSV export. Instead of writing data as comma-separated text, you can export tables into Excel format for richer features.

C# PDF & bytes workflow overview

Working with PDFs as byte arrays is common in C# development. Developers often need to store PDF documents in a database, transfer them through an API, or process them entirely in memory without touching the file system. In such cases, converting between PDF and bytes using C# becomes essential.

This tutorial explains how to perform these operations step by step using Spire.PDF for .NET. You will learn how to convert a byte array to PDF, convert a PDF back into a byte array, and even edit a PDF directly from memory with C# code.

Why Work with Byte Arrays and PDFs in C#?

Using byte[] as the transport format lets you avoid temporary files and makes your code friendlier to cloud and container environments.

  • Database storage (BLOB): Persist PDFs as raw bytes; hydrate only when needed.
  • Web APIs: Send/receive PDFs over HTTP without touching disk.
  • In-memory processing: Transform or watermark PDFs entirely in streams.
  • Security & isolation: Limit file I/O, reduce temp-file risks.

Getting set up: before running the examples, add the NuGet package of Spire.PDF for .NET so the API surface is available in your project.

Install-Package Spire.PDF

Once installed, you can load from byte[] or Stream, edit pages, and write outputs back to memory or disk—no extra converters required.

Convert Byte Array to PDF in C#

When an upstream service (e.g., an API or message queue) hands you a byte[] that represents a PDF, you often need to materialize it as a document for further processing or for a one-time save to disk. With Spire.PDF for .NET, this is a direct load operation—no intermediate temp file.

Scenario & approach: we’ll accept a byte[] (from DB/API), construct a PdfDocument in memory, optionally validate basic metadata, and then save the document.

using Spire.Pdf;
using System.IO;

class Program
{
    static void Main()
    {
        // Example source: byte[] retrieved from DB/API
        byte[] pdfBytes = File.ReadAllBytes("Sample.pdf"); // substitute with your source

        // 1) Load PDF from raw bytes (in memory)
        PdfDocument doc = new PdfDocument();
        doc.LoadFromBytes(pdfBytes);

        // 2) (Optional) inspect basic info before saving or further processing
        // int pageCount = doc.Pages.Count;

        // 3) Save to a file
        doc.SaveToFile("Output.pdf");
        doc.Close();
    }
}

The diagram below illustrates the byte[] to PDF conversion workflow:

bytes loaded into PdfDocument and saved as PDF in C# with Spire.PDF

What the code is doing & why it matters:

  • LoadFromBytes(byte[]) initializes the PDF entirely in memory—perfect for services without write access.
  • You can branch after loading: validate pages, redact, stamp, or route elsewhere.
  • SaveToFile(string) saves the document to disk for downstream processing or storing.

Convert PDF to Byte Array in C#

In the reverse direction, converting a PDF to a byte[] enables database writes, caching, or streaming the file through an HTTP response. Spire.PDF for .NET writes directly to a MemoryStream, which you can convert to a byte array with ToArray().

Scenario & approach: load an existing PDF, push the document into a MemoryStream, then extract the byte[]. This pattern is especially useful when returning PDFs from APIs or persisting them to databases.

using Spire.Pdf;
using System.IO;

class Program
{
    static void Main()
    {
        // 1) Load a PDF from disk, network share, or embedded resource
        PdfDocument doc = new PdfDocument();
        doc.LoadFromFile("Input.pdf");

        // 2) Save to a MemoryStream for fileless output
        byte[] pdfBytes;
        using (var ms = new MemoryStream())
        {
            doc.SaveToStream(ms);
            pdfBytes = ms.ToArray();
        }

        doc.Close();

        // pdfBytes now contains the full document (ready for DB/API)
        // e.g., return File(pdfBytes, "application/pdf");
    }
}

The diagram below shows the PDF to byte[] conversion workflow:

PDF loaded into PdfDocument, saved to MemoryStream, then bytes in C#

Key takeaways after the code:

  • SaveToStream → ToArray is the standard way to obtain a PDF as bytes in C# without creating temp files.
  • This approach holds the whole document in memory, so for very large PDFs the available memory is the practical limit.
  • Great for ASP.NET: return the byte array directly in your controller or minimal API endpoint.

If you want to learn more about working with streams, check out our guide on loading and saving PDF documents via streams in C#.

Create and Edit PDF Directly from a Byte Array

The real power comes from editing PDFs fully in memory. You can load from byte[], add text or images, stamp a watermark, fill form fields, and save the edited result back into a new byte[]. This enables fileless pipelines and is well-suited for microservices.

Scenario & approach: we’ll load a PDF from bytes, draw a short text marker on page 1 (a stand-in for any edit operation), and emit the edited document as a fresh byte array.

using Spire.Pdf;
using Spire.Pdf.Graphics;
using System.Drawing;
using System.IO;

class Program
{
    static void Main()
    {
        // Source could be DB, API, or file — represented as byte[]
        byte[] inputBytes = File.ReadAllBytes("Input.pdf");

        // 1) Load in memory
        var doc = new PdfDocument();
        doc.LoadFromBytes(inputBytes);

        // 2) Edit: write a small marker on the first page
        PdfPageBase page = doc.Pages[0];
        page.Canvas.DrawString(
            "Edited in memory",
            new PdfFont(PdfFontFamily.Helvetica, 12f),
            PdfBrushes.DarkBlue,
            new PointF(100, page.Size.Height - 100)
        );

        // 3) Save the edited PDF back to byte[]
        byte[] editedBytes;
        using (var ms = new MemoryStream())
        {
            doc.SaveToStream(ms);
            editedBytes = ms.ToArray();
        }

        doc.Close();

        // editedBytes can now be persisted or returned by an API
    }
}

The image below shows the edited PDF page:

Edited PDF page with inserted text using C# in bytes

After-code insights:

  • The same pattern works for text, images, watermarks, annotations, and form fields.
  • Keep edits idempotent (e.g., check if you already stamped a page) for safe reprocessing.
  • For ASP.NET, this is ideal for on-the-fly stamping or conditional redaction before returning the response.

For a step-by-step tutorial on building a PDF from scratch, see our article on creating PDF documents in C#.

Advantages of Using Spire.PDF for .NET

A concise view of why this API pairs well with byte-array workflows:

Concern | What you get with Spire.PDF for .NET
I/O flexibility | Load/save from file path, Stream, or byte[] with the same PdfDocument API.
In-memory editing | Draw text/images, manage annotations/forms, watermark, and more, with no temp files.
Service-friendly | Clean integration with ASP.NET endpoints and background workers.
Scales to real docs | Handles multi-page PDFs; you control memory via streams.
Straightforward code | Minimal boilerplate; avoids manual byte fiddling and fragile interop.

Conclusion

You’ve seen how to convert byte array to PDF in C#, how to convert PDF to byte array, and how to edit a PDF directly from memory—all with concise code. Keeping everything in streams and byte[] simplifies API design, accelerates response times, and plays nicely with databases and cloud hosting. Spire.PDF for .NET gives you a consistent, fileless workflow that’s easy to extend from quick conversions to full in-memory document processing.

If you want to try these features without limitations, you can request a free 30-day temporary license. Alternatively, you can explore Free Spire.PDF for .NET for lightweight PDF tasks.

FAQ

Can I create a PDF from a byte array in C# without saving to disk?

Yes. Load from byte[] with LoadFromBytes, then either save to a MemoryStream or return it directly from an API—no disk required.

How do I convert PDF to byte array in C# for database storage?

Use SaveToStream on PdfDocument and call ToArray() on the MemoryStream. Store that byte[] as a BLOB (or forward it to another service).

Can I edit a PDF that only exists as a byte array?

Absolutely. Load from bytes, apply edits (text, images, watermarks, annotations, form fill), then save the result back to a new byte[].

Any tips for performance and reliability?

Dispose streams promptly, reuse buffers when appropriate, and create a new PdfDocument per operation/thread. For large files, stream I/O keeps memory usage predictable.
