We’re pleased to announce the release of Spire.Doc for Java 13.7.6. The latest version supports the "Two Lines in One" function, which enhances the conversion from Word to PDF. Furthermore, some known bugs are fixed successfully in the new version, such as the issue where accepting revisions did not affect the content in content controls. More details are listed below.

Here is a list of changes made in this release:

New feature SPIREDOC-11113 SPIREDOC-11320 SPIREDOC-11338 Supports the "Two Lines in One" function.
Bug SPIREDOC-11276 Fixes the issue where accepting revisions did not affect the content in content controls.
Bug SPIREDOC-11314 Fixes the issue where converting Word to PDF caused a "NullPointerException" to be thrown.
Bug SPIREDOC-11325 Fixes the issue where retrieving Word document properties was incorrect.
Bug SPIREDOC-11333 Fixes the issue where converting Word to Markdown resulted in disorganized bullet points.
Bug SPIREDOC-11360 Fixes the issue where converting Word to PDF caused vertically oriented text in tables to be incorrect.
Bug SPIREDOC-11364 Fixes the issue where replacing bookmark content caused an "IllegalArgumentException" to be thrown.
Bug SPIREDOC-11389 Fixes the issue where loading a Word document caused an "IllegalArgumentException: List level must be less than 8 and greater than 0" to be thrown.
Bug SPIREDOC-11390 Fixes the issue where accepting revisions did not produce the correct effect.
Bug SPIREDOC-11398 Fixes the issue where using "pictureWatermark.setPicture(bufferedImage)" caused a "java.lang.NullPointerException" to be thrown.
Click the link below to download Spire.Doc for Java 13.7.6:

We’re excited to announce the release of Spire.OCR for Java 2.1.1. This version introduces support for the Linux-ARM platform and enables text output that matches the original image layout. In addition, this update includes several bug fixes. More details are provided below.

Category ID Description
New feature - Added support for the Linux-ARM platform.
New feature SPIREOCR-84 Added support for automatically rotating images when necessary.
ConfigureOptions configureOptions = new ConfigureOptions();
configureOptions.setAutoRotate(true);
New feature SPIREOCR-107 Added support for preserving the original image layout in text output.
VisualTextAligner visualTextAligner = new VisualTextAligner(scanner.getText());
String scannedText = visualTextAligner.toString();
Bug SPIREOCR-103 Fixed the issue where the cleanup of the temporary folder "temp" was not functioning correctly.
Bug SPIREOCR-104 Fixed the issue where an "Error occurred during ConfigureDependencies" message appeared when the path contained Chinese characters.
Bug SPIREOCR-108 Fixed the issue where the content extraction order was incorrect.
Click the link to download Spire.OCR for Java 2.1.1:

Java Guide to Convert HTML to Word while Preserving Formatting

Converting HTML to Word in Java is essential for developers building reporting tools, content management systems, and enterprise applications. While HTML powers web content, Word documents offer professional formatting, offline accessibility, and easy editing, making them ideal for reports, invoices, contracts, and formal submissions.

This guide demonstrates how to use Java and Spire.Doc for Java to convert HTML to Word. It covers converting HTML files and strings, batch processing multiple files, and preserving formatting and images.

Why Convert HTML to Word in Java?

Converting HTML to Word offers several advantages:

  • Flexible editing – Add comments, track changes, and review content easily.
  • Consistent formatting – Preserve layouts, fonts, and styles across documents.
  • Professional appearance – DOCX files look polished and ready to share.
  • Offline access – Word files can be opened without an internet connection.
  • Integration – Word is widely supported across tools and industries.

Common use cases: exporting HTML reports from web apps, archiving dynamic content in editable formats, and generating formal reports, invoices, or contracts.

Set Up Spire.Doc for Java

Spire.Doc for Java is a robust library that enables developers to create Word documents, edit existing Word documents, and read and convert Word documents in Java without requiring Microsoft Word to be installed.

Before you can convert HTML content into Word documents, it’s essential to properly install and configure Spire.Doc for Java in your development environment.

1. Java Version Requirement

Ensure that your development environment is running Java 6 (JDK 1.6) or a higher version.

2. Installation

Option 1: Using Maven

For projects managed with Maven, you can add the repository and dependency to your pom.xml:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>13.11.2</version>
    </dependency>
</dependencies>

For a step-by-step guide on Maven installation and configuration, refer to our article: How to Install Spire Series Products for Java from Maven Repository.

Option 2: Manual JAR Installation

For projects without Maven, you can manually add the library:

  • Download Spire.Doc.jar from the official website.
  • Add it to your project classpath.

Convert HTML File to Word in Java

If you already have an existing HTML file, converting it into a Word document is straightforward and efficient. This method is ideal for situations where HTML reports, templates, or web content need to be transformed into professionally formatted, editable Word files.

By using Spire.Doc for Java, you can preserve the original layout, text formatting, tables, lists, images, and hyperlinks, ensuring that the converted document remains faithful to the source. The process is simple, requiring only a few lines of code while giving you full control over page settings and document structure.

Conversion Steps:

  • Create a new Document object.
  • Load the HTML file with loadFromFile().
  • Adjust settings like page margins.
  • Save the output as a Word document with saveToFile().

Example:

import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import com.spire.doc.Section;
import com.spire.doc.documents.XHTMLValidationType;

public class ConvertHtmlFileToWord {
    public static void main(String[] args) {
        // Create a Document object
        Document document = new Document();

        // Load an HTML file
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.html",
                FileFormat.Html,
                XHTMLValidationType.None);

        // Adjust margins
        Section section = document.getSections().get(0);
        section.getPageSetup().getMargins().setAll(2);

        // Save as Word file
        document.saveToFile("output/FromHtmlFile.docx", FileFormat.Docx);

        // Release resources
        document.dispose();

        System.out.println("HTML file successfully converted to Word!");
    }
}

Convert HTML file to Word in Java using Spire.Doc for Java

You may also be interested in: Java: Convert Word to HTML

Convert HTML String to Word in Java

In many real-world applications, HTML content is generated dynamically - whether it comes from user input, database records, or template engines. Converting these HTML strings directly into Word documents allows developers to create professional, editable reports, invoices, or documents on the fly without relying on pre-existing HTML files.

Using Spire.Doc for Java, you can render rich HTML content, including headings, lists, tables, images, hyperlinks, and more, directly into a Word document while preserving formatting and layout.

Conversion Steps:

  • Create a new Document object.
  • Add a section and adjust settings like page margins.
  • Add a paragraph.
  • Add the HTML string to the paragraph using appendHTML().
  • Save the output as a Word document with saveToFile().

Example:

import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import com.spire.doc.Section;
import com.spire.doc.documents.Paragraph;

public class ConvertHtmlStringToWord {
    public static void main(String[] args) {
        // Sample HTML string
        String htmlString = "<h1>Java HTML to Word Conversion</h1>" +
                "<p><b>Spire.Doc</b> allows you to convert HTML content into Word documents seamlessly. " +
                "This includes support for headings, paragraphs, lists, tables, links, and images.</p>" +
                "<h2>Features</h2>" +
                "<ul>" +
                "<li>Preserve text formatting such as <i>italic</i>, <u>underline</u>, and <b>bold</b></li>" +
                "<li>Support for ordered and unordered lists</li>" +
                "<li>Insert tables with multiple rows and columns</li>" +
                "<li>Add hyperlinks and bookmarks</li>" +
                "<li>Embed images from URLs or base64 strings</li>" +
                "</ul>" +
                "<h2>Example Table</h2>" +
                "<table border='1' style='border-collapse:collapse;'>" +
                "<tr><th>Item</th><th>Description</th><th>Quantity</th></tr>" +
                "<tr><td>Notebook</td><td>Spire.Doc Java Guide</td><td>10</td></tr>" +
                "<tr><td>Pen</td><td>Blue Ink</td><td>20</td></tr>" +
                "<tr><td>Marker</td><td>Permanent Marker</td><td>5</td></tr>" +
                "</table>" +
                "<h2>Links and Images</h2>" +
                "<p>Visit <a href='https://www.e-iceblue.com/'>E-iceblue Official Site</a> for more resources.</p>" +
                "<p>Sample Image:</p>" +
                "<img src='https://www.e-iceblue.com/images/intro_pic/Product_Logo/doc-j.png' alt='Product Logo' width='150' height='150'/>" +
                "<h2>Conclusion</h2>" +
                "<p>Using Spire.Doc, Java developers can easily generate Word documents from rich HTML content while preserving formatting and layout.</p>";

        // Create a Document
        Document document = new Document();

        // Add section and paragraph
        Section section = document.addSection();
        section.getPageSetup().getMargins().setAll(72);

        Paragraph paragraph = section.addParagraph();

        // Render HTML string
        paragraph.appendHTML(htmlString);

        // Save as Word
        document.saveToFile("output/FromHtmlString.docx", FileFormat.Docx);

        document.dispose();

        System.out.println("HTML string successfully converted to Word!");
    }
}

Convert HTML String to Word in Java using Spire.Doc for Java

Batch Conversion of Multiple HTML Files to Word in Java

Sometimes you may need to convert hundreds of HTML files into Word documents. Here’s how to batch process them in Java.

import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import com.spire.doc.documents.XHTMLValidationType;
import java.io.File;

public class BatchConvertHtmlToWord {
    public static void main(String[] args) {
        File folder = new File("C:\\Users\\Administrator\\Desktop\\HtmlFiles");

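        // listFiles() returns null when the folder is missing or unreadable
        File[] files = folder.listFiles();
        if (files == null) {
            System.out.println("Cannot read folder: " + folder.getAbsolutePath());
            return;
        }

        for (File file : files) {
            String name = file.getName();
            if (name.endsWith(".html") || name.endsWith(".htm")) {
                Document document = new Document();
                document.loadFromFile(file.getAbsolutePath(), FileFormat.Html, XHTMLValidationType.None);

                // Replace the .html or .htm extension with .docx
                String outputPath = "output/" + name.substring(0, name.lastIndexOf('.')) + ".docx";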
                document.saveToFile(outputPath, FileFormat.Docx);
                document.dispose();

                System.out.println(file.getName() + " converted to Word!");
            }
        }
    }
}

This approach is great for reporting systems where multiple HTML reports are generated daily.
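
The batch loop relies on two small pieces of plain-Java logic: recognizing HTML files and deriving an output name for each one. The sketch below isolates them using only the standard library. The class and method names (HtmlBatchNames, isHtmlFile, toDocxName) are illustrative and not part of Spire.Doc, and the case-insensitive extension check is an assumption about the desired behavior.

```java
import java.util.Locale;

public class HtmlBatchNames {
    // True for names ending in .html or .htm, ignoring case
    public static boolean isHtmlFile(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    // Replace the final extension with .docx; a name without an extension just gets .docx appended
    public static String toDocxName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = (dot > 0) ? fileName.substring(0, dot) : fileName;
        return base + ".docx";
    }

    public static void main(String[] args) {
        String[] names = {"report.html", "index.HTM", "notes.txt"};
        for (String name : names) {
            if (isHtmlFile(name)) {
                System.out.println(name + " -> " + toDocxName(name));
            }
        }
    }
}
```

Stripping the extension by position rather than replacing the literal ".html" keeps ".htm" inputs from being saved under their original name.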

Best Practices for HTML to Word Conversion

  • Use Inline CSS for Reliable Styling
    Inline CSS ensures that fonts, colors, and spacing are preserved during conversion. External stylesheets may not always render correctly, especially if they are not accessible at runtime.
  • Validate HTML Structure
    Well-formed HTML with proper nesting and closed tags helps render tables, lists, and headings accurately.
  • Optimize Images
    Use absolute URLs or embed images as base64. Resize large images to fit Word layouts and reduce file size.
  • Manage Resources in Batch Conversion
    When processing multiple files, convert them one by one and call dispose() after each document to prevent memory issues.
  • Preserve Page Layouts
    Set page margins, orientation, and paper size to ensure the Word document looks professional, especially for reports and formal documents.
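
As a rough illustration of the "embed images as base64" tip, the sketch below builds an img tag whose src is a data URI, using only the Java standard library (java.util.Base64, available from Java 8). The InlineImage class and toImgTag method are hypothetical helpers, not Spire.Doc APIs, and the alt text is assumed to contain no quote characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class InlineImage {
    // Build an <img> tag whose src is a base64 data URI, so no network fetch is needed at conversion time
    public static String toImgTag(byte[] imageBytes, String mimeType, String altText) {
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src='data:" + mimeType + ";base64," + encoded + "' alt='" + altText + "'/>";
    }

    public static void main(String[] args) {
        // Placeholder bytes stand in for real image data, e.g. from Files.readAllBytes(...)
        byte[] placeholder = "demo".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toImgTag(placeholder, "image/png", "Demo"));
    }
}
```

The resulting tag can be concatenated into an HTML string before it is handed to the converter.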

Conclusion

Converting HTML to Word in Java is an essential feature for many enterprise applications. Using Spire.Doc for Java, you can:

  • Convert HTML files into Word documents.
  • Render HTML strings directly into DOCX.
  • Handle batch processing for multiple files.
  • Preserve images, tables, and styles with ease.

By following the examples and best practices above, you can integrate HTML to Word conversion seamlessly into your Java applications.

FAQs (Frequently Asked Questions)

Q1. Can Java convert multiple HTML files into one Word document?

A1: Yes. Instead of saving each file separately, you can load multiple HTML contents into the same Document and then save it once.

Q2. How to preserve CSS styles during HTML to Word conversion?

A2: Inline CSS will be preserved; external stylesheets can also be applied if they’re accessible at run time.

Q3. Can I generate a Word document directly from a web page?

A3: Yes. You can fetch the HTML using an HTTP client in Java, then pass it into the conversion method.
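
One possible way to do the fetch is with java.net.http.HttpClient, sketched below. This assumes Java 11 or later, which is newer than the minimum version the library itself requires. The FetchHtml class, its method names, and the 30-second timeout are illustrative choices rather than anything prescribed by Spire.Doc.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class FetchHtml {
    // Build a GET request with a timeout; kept separate so it can be inspected without network access
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    // Download the page body as a string; throws if the server does not answer with HTTP 200
    public static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String html = fetch("https://www.e-iceblue.com/");
        System.out.println("Fetched " + html.length() + " characters of HTML");
        // The string could then be passed to paragraph.appendHTML(html), as in the HTML string example
    }
}
```

Pages that reference images or stylesheets with relative URLs may lose them, since the fetched string carries no base address.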

Q4. What Word formats are supported for saving the converted document?

A4: You can save as DOCX, DOC, or other Word-compatible formats supported by Spire.Doc. DOCX is recommended for modern applications due to its compatibility and smaller file size.

As the Chinese New Year approaches, our office will be closed from 28/01/2025 to 04/02/2025 (GMT+8:00).

During the holiday, your emails will be received as usual and urgent issues will be handled as soon as possible by the staff on-duty. Please note that standard support may be limited during this time, so we kindly ask for your understanding and patience if you do not receive an immediate response.

Note: Our purchase system is available 24/7 and will automatically send out license files once you have completed the online order and payment.

To get a temporary license to evaluate our product, please click "Request a Temporary License" on the download page. If there are any problems with the request, we will make it available when we return to work on February 05, 2025.

We apologize for any inconvenience this may cause and really appreciate your understanding and support.


Please feel free to contact us via the following emails

Thursday, 23 January 2025 06:00

Spire.Office 10.1.0 is released

We are excited to announce the release of Spire.Office 10.1.0. In this version, Spire.Doc supports checking and modifying hyperlinks for images and shapes; Spire.XLS supports the CSCH, RANDARRAY, COTH, SEQUENCE, EXPAND functions; Spire.Presentation supports obtaining the file name of embedded OLE objects; Spire.PDF enhances the conversion from XPS to PDF and PDF to PNG, HTML, SVG, OFD, XPS, and Excel. More details are listed below.

In this version, the most recent versions of Spire.Doc, Spire.XLS, Spire.Presentation, and Spire.PDF are included.

DLL Versions:

  • Spire.Doc 13.1.4
  • Spire.XLS 15.1.3
  • Spire.Presentation 10.1.1
  • Spire.PDF 11.1.0
  • Spire.PDF 11.1.5
Click the link to get the version Spire.Office 10.1.0:
More information on Spire.Office new releases or hotfixes:

Here is a list of changes made in this release:

Spire.Doc

Category ID Description
New feature SPIREDOC-10532 SPIREDOC-11019 Supports checking and modifying hyperlinks for images and shapes.
foreach (Section section in doc.Sections)
{
    foreach (Paragraph paragraph in section.Paragraphs)
    {
        foreach (DocumentObject documentObject in paragraph.ChildObjects)
        {
            if (documentObject is DocPicture)
            {
                DocPicture pic=documentObject as DocPicture;

                if (pic.HasHyperlink)
                {
                    pic.HRef = "";
                }
            }
            if (documentObject is ShapeObject)
            {
                ShapeObject shape = documentObject as ShapeObject;

                if (shape.HasHyperlink)
                {
                    shape.HRef = "";
                }
            }
        }
    }
}
Bug SPIREDOC-10551 Fixes the issue that the program threw “The given key ‘5’ was not present in the dictionary” exception when converting HTML documents to Word documents.
Bug SPIREDOC-11022 Fixes the issue that the obtained ListText of paragraphs was incorrect.

Spire.XLS

Category ID Description
New feature SPIREXLS-5542 Supports the CSCH function.
New feature SPIREXLS-5548 Supports the RANDARRAY function.
New feature SPIREXLS-5621 Supports the COTH function.
New feature SPIREXLS-5622 Supports the SEQUENCE function.
New feature SPIREXLS-5627 Supports the EXPAND function.
New feature SPIREXLS-5638 Supports the CHOOSECOLS function.
New feature SPIREXLS-5639 Supports the CHOOSEROWS function.
New feature SPIREXLS-5642 Supports the DROP function.
New feature SPIREXLS-5656 Supports setting HyLink for XlsPrstGeomShape.
PrstGeomShapeCollection prstGeomShapeType = worksheet.PrstGeomShapes;
for (int i = 0; i < prstGeomShapeType.Count; i++)
{
    XlsPrstGeomShape shape = (XlsPrstGeomShape)prstGeomShapeType[i];
    shape.HyLink.Address = "https://www.baidu.com/";
}
Bug SPIREXLS-5570 Fixes the issue that the charts were lost when converting XLSM to PDF.
Bug SPIREXLS-5608 Fixes the issue that the content was lost when converting Excel to PDF.
Bug SPIREXLS-5611 Fixes the issue that setting ShowLeaderLines did not take effect.
Bug SPIREXLS-5612 Fixes the issue that the data bar colors were incorrect when converting Excel to PDF.
Bug SPIREXLS-5625 SPIREXLS-5647 Fixes the issue that the values were incorrect after calling the CalculateAllValue() method to calculate formula values.
Bug SPIREXLS-5635 Fixes the issue that setting the worksheet tab color to Color.Empty resulted in black.
Bug SPIREXLS-5640 Fixes the issue that the images were extracted incorrectly.
Bug SPIREXLS-5657 Fixes the issue that it failed to delete pivot fields in pivot tables.
Bug SPIREXLS-5659 Fixes the issue that the text orientation in shapes was reversed when converting Excel to PDF.

Spire.Presentation

Category ID Description
New feature SPIREPPT-2658 Supports obtaining the file name of embedded OLE objects.
IOleObject oleObject = shape as IOleObject;
string fileName = oleObject.EmbeddedFileName;
Bug SPIREPPT-2652 Fixes the issue that the program threw an exception "object reference not set to object instance" when loading PPTX documents.
Bug SPIREPPT-2657 Fixes the issue that underlines were discontinuous when converting PPTX to SVG.
Bug SPIREPPT-2690 Fixes the issue that content was lost when converting PPTX to PDF.
Bug SPIREPPT-2692 Fixes the issue that checkboxes were missed when converting PPTX to PDF.
Bug SPIREPPT-2702 Fixes the issue that the program threw an "Object reference not set to an instance of an object" exception when obtaining font names.
Bug SPIREPPT-2703 Fixes the issue that setting "Shrink text on overflow" resulted in incorrect format.
Bug SPIREPPT-2705 Fixes the issue that setting "Resize shape to fit text" did not take effect.

Spire.PDF

Category ID Description
Bug SPIREPDF-7162 Fixes the issue that errors occurred during multi-threaded PDF text extraction.
Bug SPIREPDF-7201 Fixes the issue that there were incorrect links when converting XPS to PDF.
Bug SPIREPDF-7235 Fixes the issue that incorrect content was extracted from PDF tables.
Bug SPIREPDF-7246 Fixes the issue that some content turned black when converting PDF to PNG.
Bug SPIREPDF-7248 Fixes the issue that HTML documents were too large when converting PDF to HTML.
Bug SPIREPDF-7250 Fixes the issue that annotation content wasn't displayed when converting PDF to XPS.
Bug SPIREPDF-7264 Fixes the issue that content was lost when converting PDF to images.
Bug SPIREPDF-7279 SPIREPDF-7301 Fixes the issue that there were incorrect fonts when converting PDF to SVG.
Bug SPIREPDF-7280 Fixes the issue that character spaces were missed when converting XPS to PDF.
Bug SPIREPDF-7286 Fixes the issue that the program threw an "Object reference not set to an instance of an object." exception when loading PDF documents.
Bug SPIREPDF-7288 Fixes the issue that seal content was cut when converting PDF to OFD.
Bug SPIREPDF-7289 Fixes the issue that the program threw an "Object reference not set to an instance of an object." exception when converting PDF to grayscale PDF.
Bug SPIREPDF-7051 Fixes the issue that the content was incorrect when printing PDF.
Bug SPIREPDF-7159 SPIREPDF-7294 Fixes the issue that replacing text caused some text to be lost.
Bug SPIREPDF-7211 Fixes the issue that the program suspended when saving PDF.
Bug SPIREPDF-7221 Fixes the issue that modifying the value of text fields consumed a long time.
Bug SPIREPDF-7249 Fixes the issue that some Chinese characters were garbled after converting PDF to XPS.
Bug SPIREPDF-7275 Fixes the issue that converting PDF to grayscale PDF consumed a long time.
Bug SPIREPDF-7278 Fixes the issue that the result was incorrect when converting PDF to Excel.
Bug SPIREPDF-7312 Fixes the issue that the value of the field disappeared when the mouse entered the field after filling the text box field.

Spire.OCR for Java offers developers a new model for extracting text from images. In this article, we will demonstrate how to extract text from images in Java using the new model of Spire.OCR for Java.

The detailed steps are as follows.

Step 1: Create a Java Project in IntelliJ IDEA.

Extract Text from Images Using the New Model of Spire.OCR for Java

Step 2: Add Spire.OCR.jar to Your Project.

Option 1: Install Spire.OCR for Java via Maven.

If you're using Maven, you can install Spire.OCR for Java by adding the following code to your project's pom.xml file:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.ocr</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>

Option 2: Manually Import Spire.OCR.jar.

First, download Spire.OCR for Java from the following link and extract it to a specific directory:

https://www.e-iceblue.com/Download/ocr-for-java.html

Next, in IntelliJ IDEA, go to File > Project Structure > Modules > Dependencies. In the Dependencies pane, click the "+" button and select JARs or Directories. Navigate to the directory where Spire.OCR for Java is located, open the lib folder and select the Spire.OCR.jar file, then click OK to add it as the project’s dependency.

Step 3: Download the New Model of Spire.OCR for Java.

Download the model that fits in with your operating system from one of the following links.

Windows x64

Linux x64 (CentOS 8, Ubuntu 18 and above versions are required)

macOS 10.15 and later

Linux aarch

Then extract the package and save it to a specific directory on your computer. In this example, we saved the package to "D:\".

Step 4: Implement Text Extraction from Images Using the New Model of Spire.OCR for Java.

Use the following code to extract text from images with the new OCR model of Spire.OCR for Java:

import com.spire.ocr.*;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            // Create an instance of the OcrScanner class
            OcrScanner scanner = new OcrScanner();

            // Create an instance of the ConfigureOptions class to set up the scanner configurations
            ConfigureOptions configureOptions = new ConfigureOptions();

            // Set the path to the new model
            configureOptions.setModelPath("D:\\win-x64");

            // Set the language for text recognition. The default is English.
            // Supported languages include English, Chinese, Chinesetraditional, French, German, Japanese, and Korean.
            configureOptions.setLanguage("English");

            // Apply the configuration options to the scanner
            scanner.ConfigureDependencies(configureOptions);

            // Extract text from an image
            scanner.scan("Sample.png");

            // Save the extracted text to a text file
            saveTextToFile(scanner, "output.txt");

        } catch (OcrException e) {
            e.printStackTrace();
        }
    }

    private static void saveTextToFile(OcrScanner scanner, String filePath) {
        try {
            String text = scanner.getText().toString();
            try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
                writer.write(text);
            }
        } catch (IOException | OcrException e) {
            e.printStackTrace();
        }
    }
}
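
The example above hard-codes the Windows model folder. If the same code has to run on several of the platforms listed in Step 3, the folder could be chosen at run time, as in the sketch below. Only "win-x64" appears in this article; the other folder names, along with the ModelPathSelector class and modelFolderFor method, are placeholders to be replaced with the names of the packages you actually extracted.

```java
import java.util.Locale;

public class ModelPathSelector {
    // Map an OS name and CPU architecture to a model folder name.
    // Only "win-x64" comes from the article; the other names are hypothetical placeholders.
    public static String modelFolderFor(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            return "win-x64";
        }
        if (os.contains("mac")) {
            return "mac"; // hypothetical folder name
        }
        if (os.contains("linux")) {
            boolean arm = arch.contains("aarch64") || arch.contains("arm");
            return arm ? "linux-aarch" : "linux-x64"; // hypothetical folder names
        }
        throw new IllegalArgumentException("No model listed for " + osName + " / " + osArch);
    }

    public static void main(String[] args) {
        String folder = modelFolderFor(System.getProperty("os.name"), System.getProperty("os.arch"));
        System.out.println("Model folder for this machine: " + folder);
        // The result would be joined with your extraction directory and passed to setModelPath(...)
    }
}
```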

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Set different text alignment in Word with Python

In the world of document automation, proper text alignment is crucial for creating professional, readable, and visually appealing documents. For developers and data professionals building reports, drafting letters, or designing invoices, mastering text alignment in Python is essential to producing polished, consistent documents without manual editing.

This guide delivers a step-by-step walkthrough on how to align text in Python using Spire.Doc for Python, a library that enables effortless control over Word document formatting.


Why Choose Spire.Doc for Python to Align Text?

Before diving into code, let’s clarify why Spire.Doc is a top choice for text alignment tasks:

  • Full Alignment Support: Natively supports all standard alignment types (Left, Right, Center, Justify) for paragraphs.
  • No Microsoft Word Dependency: Runs independently - no need to install Word on your machine.
  • High Compatibility: Works with .docx, .doc, and other Word formats, ensuring your aligned documents open correctly across devices.
  • Fine-Grained Control: Adjust alignment for entire paragraphs or table cells.

Core Text Alignment Types in Spire.Doc

Spire.Doc uses the HorizontalAlignment enum to define text alignment. The most common values are:

  • HorizontalAlignment.Left: Aligns text to the left margin (default).
  • HorizontalAlignment.Right: Aligns text to the right margin.
  • HorizontalAlignment.Center: Centers text horizontally between margins.
  • HorizontalAlignment.Justify: Adjusts text spacing so both left and right edges align with margins.
  • HorizontalAlignment.Distribute: Adjusts character spacing (adds space between letters) and word spacing to fill the line.

Below, we’ll cover how to programmatically set paragraph alignment (left, right, center, justified, and distributed) in Word using Python.


Step-by-Step: Align Text in Word in Python

Here are the actionable steps to generate a Word document with 5 paragraphs, each using a different alignment style.

Step 1: Install Spire.Doc for Python

Open your terminal or command prompt and run the following command to install the latest version:

pip install Spire.Doc

Step 2: Import Required Modules

Import the core classes from Spire.Doc. These modules let you create documents, sections, paragraphs, and configure formatting:

from spire.doc import *
from spire.doc.common import *

Step 3: Create a New Word Document

Initialize a Document instance that represents your empty Word file:

# Create a Document instance
doc = Document()

Step 4: Add a Section to the Document

Word documents organize content into sections (each section can have its own margins, page size, etc.). We’ll add one section to hold our paragraphs:

# Add a section to the document
section = doc.AddSection()

Step 5: Add Paragraphs with Different Alignments

A section contains paragraphs, and each paragraph’s alignment is controlled via the HorizontalAlignment enum. We’ll create 5 paragraphs, one for each alignment type.

1. Left Align Text in Python

Left alignment is the default for most text (text aligns to the left margin).

# Left aligned text
paragraph1 = section.AddParagraph()
paragraph1.AppendText("This is left-aligned text.")
paragraph1.Format.HorizontalAlignment = HorizontalAlignment.Left

2. Right Align Text in Python

Right alignment is useful for dates, signatures, or page numbers (text aligns to the right margin).

# Right aligned text
paragraph2 = section.AddParagraph()
paragraph2.AppendText("This is right-aligned text.")
paragraph2.Format.HorizontalAlignment = HorizontalAlignment.Right

3. Center Text in Python

Center alignment works well for titles or headings (text centers between left and right margins). Use the following code to center text in Python:

# Center aligned text
paragraph3 = section.AddParagraph()
paragraph3.AppendText("This is center-aligned text.")
paragraph3.Format.HorizontalAlignment = HorizontalAlignment.Center

4. Justify Text in Python

Justified text aligns both left and right margins (spaces between words are adjusted for consistency). Ideal for formal documents like essays or reports.

# Justified
paragraph4 = section.AddParagraph()
paragraph4.AppendText("This is justified text.")
paragraph4.Format.HorizontalAlignment = HorizontalAlignment.Justify

Note: Justified alignment is more visible with longer text - short phrases may not show the spacing adjustment.

5. Distribute Text in Python

Distributed alignment is similar to justified alignment, but it adjusts character spacing as well as word spacing, so even a single line or short phrase is spread across the full line width.

# Distributed
paragraph5 = section.AddParagraph()
paragraph5.AppendText("This is evenly distributed text.")
paragraph5.Format.HorizontalAlignment = HorizontalAlignment.Distribute

Step 6: Save and Close the Document

Finally, save the document to a specified path and close the Document instance to free resources:

# Save the document
doc.SaveToFile("TextAlignment.docx", FileFormat.Docx2016)
# Close the document to release memory
doc.Close()

Output:

Align paragraph text in Word in Python

Pro Tip: Spire.Doc for Python also provides interfaces to align tables in Word or align text in table cells.


FAQs About Python Text Alignment

Q1: Is Spire.Doc for Python free?

A: Spire.Doc offers a free version with limitations. For full functionality, you can request a 30-day trial license here.

Q2: Can I set text alignment for existing Word documents?

A: Yes. Spire.Doc lets you load existing documents and modify text alignment for specific paragraphs. Here’s a quick example:

from spire.doc import *

# Load an existing document
doc = Document()
doc.LoadFromFile("ExistingDocument.docx")

# Get the first section and first paragraph
section = doc.Sections[0]
paragraph = section.Paragraphs[0]

# Change alignment to center
paragraph.Format.HorizontalAlignment = HorizontalAlignment.Center

# Save the modified document
doc.SaveToFile("UpdatedDocument.docx", FileFormat.Docx2016)
doc.Close()

Q3: Can I apply different alignments to different parts of the same paragraph?

A: No. Text alignment is a paragraph-level setting in Word, not a character-level setting. This means all text within a single paragraph must share the same alignment (left, right, center, etc.).

If you need mixed alignment in the same line, you’ll need to use a table with invisible borders.

Q4: Can Spire.Doc for Python handle other text formatting?

A: Absolutely! Spire.Doc lets you combine alignment with other formatting like fonts, line spacing, bullet points, and more.


Conclusion

Automating Word text alignment with Python and Spire.Doc saves time, reduces human error, and ensures consistency across documents. The code example provided offers a clear template for implementing left, right, center, justified, and distributed alignment, and adapting it to your needs is as simple as modifying the text or adding more formatting rules.
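The template described above can also be made data-driven. The following is a minimal sketch rather than part of the original tutorial: add_aligned_paragraphs is a hypothetical helper name, and it relies only on the AddParagraph(), AppendText(), and Format.HorizontalAlignment members used in the earlier steps. The section is passed in as an argument, so the helper itself does not import Spire.Doc.

```python
def add_aligned_paragraphs(section, samples):
    """Add one paragraph per (text, alignment) pair and return the paragraphs.

    Assumptions: `section` behaves like a Spire.Doc Section, and each
    `alignment` is a HorizontalAlignment member such as those used above.
    """
    paragraphs = []
    for text, alignment in samples:
        # Same three calls as in the step-by-step examples
        paragraph = section.AddParagraph()
        paragraph.AppendText(text)
        paragraph.Format.HorizontalAlignment = alignment
        paragraphs.append(paragraph)
    return paragraphs
```

With Spire.Doc installed, it could be called as add_aligned_paragraphs(section, [("Centered title", HorizontalAlignment.Center), ("Body text", HorizontalAlignment.Justify)]).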

Try experimenting with different alignment combinations, and explore Spire.Doc’s online documentation to unlock more formatting possibilities.

C# Guide to Read Word Document Content

Word documents (.doc and .docx) are widely used in business, education, and professional workflows for reports, contracts, manuals, and other essential content. As a C# developer, you may find the need to programmatically read these files to extract information, analyze content, and integrate document data into applications.

In this complete guide, we will delve into the process of reading Word documents in C#. We will explore various scenarios, including:

  • Extracting text, paragraphs, and formatting details
  • Retrieving images and structured table data
  • Accessing comments and document metadata
  • Reading headers and footers for comprehensive document analysis

By the end of this guide, you will have a solid understanding of how to efficiently parse Word documents in C#, allowing your applications to access and utilize document content with accuracy and ease.

Set Up Your Development Environment for Reading Word Documents in C#

Before you can read Word documents in C#, it’s crucial to ensure that your development environment is properly set up. This section outlines the necessary prerequisites and step-by-step installation instructions to get you ready for seamless Word document handling.

Prerequisites

Install Spire.Doc

To incorporate Spire.Doc into your C# project, follow these steps to install it via NuGet:

  1. Open your project in Visual Studio.
  2. Right-click on your project in the Solution Explorer and select Manage NuGet Packages.
  3. In the Browse tab, search for "Spire.Doc" and click Install.

Alternatively, you can use the Package Manager Console with the following command:

PM> Install-Package Spire.Doc

This installation adds the necessary references, enabling you to programmatically work with Word documents.

Load Word Document (.doc/.docx) in C#

To begin, you need to load a Word document into your project. The following example demonstrates how to load a .docx or .doc file in C#:

using Spire.Doc;
using Spire.Doc.Documents;
using System;

namespace LoadWordExample
{
    class Program
    {
        static void Main(string[] args)
        {
            // Specify the path of the Word document
            string filePath = @"C:\Documents\Sample.docx";

            // Create a Document object
            using (Document document = new Document())
            {
                // Load the Word .docx or .doc document
                document.LoadFromFile(filePath);
            }
        }
    }
}

This code loads a Word file from the specified path into a Document object, which is the entry point for accessing all document elements.

Read and Extract Content from Word Document in C#

After loading the Word document into a Document object, you can access its contents programmatically. This section covers various methods for extracting different types of content effectively.

Extract Text

Extracting text is often the first step in reading Word documents. You can retrieve all text content using the built-in GetText() method:

// Requires: using System.IO; using System.Text;
using (StreamWriter writer = new StreamWriter("ExtractedText.txt", false, Encoding.UTF8))
{
    // Get all text from the document
    string allText = document.GetText();
    
    // Write the entire text to a file
    writer.Write(allText);
}

This method extracts all text, disregarding formatting and non-text elements like images.

C# Example to Extract All Text from Word Document

Read Paragraphs and Formatting Information

When working with Word documents, it is often useful not only to access the text content of paragraphs but also to understand how each paragraph is formatted. This includes details such as alignment and spacing after the paragraph, which can affect layout and readability.

The following example demonstrates how to iterate through all paragraphs in a Word document and retrieve their text content and paragraph-level formatting in C#:

using (StreamWriter writer = new StreamWriter("Paragraphs.txt", false, Encoding.UTF8))
{
    // Loop through all sections
    foreach (Section section in document.Sections)
    {
        // Loop through all paragraphs in the section
        foreach (Paragraph paragraph in section.Paragraphs)
        {
            // Get paragraph alignment
            HorizontalAlignment alignment = paragraph.Format.HorizontalAlignment;

            // Get spacing after paragraph
            float afterSpacing = paragraph.Format.AfterSpacing;

            // Write paragraph formatting and text to the file
            writer.WriteLine($"[Alignment: {alignment}, AfterSpacing: {afterSpacing}]");
            writer.WriteLine(paragraph.Text);
            writer.WriteLine(); // Add empty line between paragraphs
        }
    }
}

This approach allows you to extract both the text and key paragraph formatting attributes, which can be useful for tasks such as document analysis, conditional processing, or preserving layout when exporting content.

Extract Images

Images embedded within Word documents play a vital role in conveying information. To extract these images, you will examine each paragraph's content, identify images (typically represented as DocPicture objects), and save them for further use:

// Create the folder if it does not exist
string imageFolder = "ExtractedImages";
if (!Directory.Exists(imageFolder))
    Directory.CreateDirectory(imageFolder);

int imageIndex = 1;

// Loop through sections and paragraphs to find images
foreach (Section section in document.Sections)
{
    foreach (Paragraph paragraph in section.Paragraphs)
    {
        foreach (DocumentObject obj in paragraph.ChildObjects)
        {
            if (obj is DocPicture picture)
            {
                // Save each image as a separate PNG file
                string fileName = Path.Combine(imageFolder, $"Image_{imageIndex}.png");
                picture.Image.Save(fileName, System.Drawing.Imaging.ImageFormat.Png);
                imageIndex++;
            }
        }
    }
}

This code saves all images in the document as separate PNG files, with options to choose other formats like JPEG or BMP.

C# Example to Extract Images from Word Document

Extract Table Data

Tables are commonly used to organize structured data, such as financial reports or survey results. To access this data, iterate through the tables in each section and retrieve the content of individual cells:

// Create a folder to store tables
string tableDir = "Tables";
if (!Directory.Exists(tableDir))
    Directory.CreateDirectory(tableDir);

// Loop through each section
for (int sectionIndex = 0; sectionIndex < document.Sections.Count; sectionIndex++)
{
    Section section = document.Sections[sectionIndex];
    TableCollection tables = section.Tables;

    // Loop through all tables in the section
    for (int tableIndex = 0; tableIndex < tables.Count; tableIndex++)
    {
        ITable table = tables[tableIndex];
        string fileName = Path.Combine(tableDir, $"Section{sectionIndex + 1}_Table{tableIndex + 1}.txt");

        using (StreamWriter writer = new StreamWriter(fileName, false, Encoding.UTF8))
        {
            // Loop through each row
            for (int rowIndex = 0; rowIndex < table.Rows.Count; rowIndex++)
            {
                TableRow row = table.Rows[rowIndex];

                // Loop through each cell
                for (int cellIndex = 0; cellIndex < row.Cells.Count; cellIndex++)
                {
                    TableCell cell = row.Cells[cellIndex];

                    // Loop through each paragraph in the cell
                    for (int paraIndex = 0; paraIndex < cell.Paragraphs.Count; paraIndex++)
                    {
                        writer.Write(cell.Paragraphs[paraIndex].Text.Trim() + " ");
                    }

                    // Add tab between cells
                    if (cellIndex < row.Cells.Count - 1) writer.Write("\t");
                }

                // Add newline after each row
                writer.WriteLine();
            }
        }
    }
}

This method allows efficient extraction of structured data, making it ideal for generating reports or integrating content into databases.

C# Example to Extract Table Data from Word Document

Read Comments

Comments are valuable for collaboration and feedback within documents. Extracting them is crucial for auditing and understanding the document's revision history.

The Document object provides a Comments collection, which allows you to access all comments in a Word document. Each comment contains one or more paragraphs, and you can extract their text for further processing or save them into a file.

using (StreamWriter writer = new StreamWriter("Comments.txt", false, Encoding.UTF8))
{
    // Loop through all comments in the document
    foreach (Comment comment in document.Comments)
    {
        // Loop through each paragraph in the comment
        foreach (Paragraph p in comment.Body.Paragraphs)
        {
            writer.WriteLine(p.Text);
        }
        // Add empty line to separate different comments
        writer.WriteLine();
    }
}

This code retrieves the content of all comments and outputs it into a single text file.

Retrieve Document Metadata

Word documents contain metadata such as the title, author, and subject. These metadata items are stored as document properties, which can be accessed through the BuiltinDocumentProperties property of the Document object:

using (StreamWriter writer = new StreamWriter("Metadata.txt", false, Encoding.UTF8))
{
    // Write built-in document properties to file
    writer.WriteLine("Title: " + document.BuiltinDocumentProperties.Title);
    writer.WriteLine("Author: " + document.BuiltinDocumentProperties.Author);
    writer.WriteLine("Subject: " + document.BuiltinDocumentProperties.Subject);
}

Read Headers and Footers

Headers and footers frequently contain essential content like page numbers and titles. To programmatically access this information, iterate through each section's header and footer paragraphs and retrieve the text of each paragraph:

using (StreamWriter writer = new StreamWriter("HeadersFooters.txt", false, Encoding.UTF8))
{
    // Loop through all sections
    foreach (Section section in document.Sections)
    {
        // Write header paragraphs
        foreach (Paragraph headerParagraph in section.HeadersFooters.Header.Paragraphs)
        {
            writer.WriteLine("Header: " + headerParagraph.Text);
        }

        // Write footer paragraphs
        foreach (Paragraph footerParagraph in section.HeadersFooters.Footer.Paragraphs)
        {
            writer.WriteLine("Footer: " + footerParagraph.Text);
        }
    }
}

This method ensures that all recurring content is accurately captured during document processing.

Advanced Tips and Best Practices for Reading Word Documents in C#

To get the most out of programmatically reading Word documents, following these tips can help improve efficiency, reliability, and code maintainability:

  • Use using Statements: Always wrap Document objects in using to ensure proper memory management.
  • Check for Null or Empty Sections: Prevent errors by verifying sections, paragraphs, tables, or images exist before accessing them.
  • Batch Reading Multiple Documents: Loop through a folder of Word files and apply the same extraction logic to each file. This helps automate workflows and consolidate extracted content efficiently.

Conclusion

Efficiently reading Word documents programmatically in C# involves handling various content types. With the techniques outlined in this guide, developers can:

  • Load Word documents (.doc and .docx) with ease.
  • Extract text, paragraphs, and formatting details for thorough analysis.
  • Retrieve images, structured table data, and comments.
  • Access headers, footers, and document metadata for complete insights.

FAQs

Q1: Can I read Word documents without installing Microsoft Word?

A1: Yes, libraries like Spire.Doc enable you to read and process Word files without requiring Microsoft Word installation.

Q2: Does this support both .doc and .docx formats?

A2: Absolutely, all methods discussed in this guide work seamlessly with both legacy (.doc) and modern (.docx) Word files.

Q3: Can I extract only specific sections of a document?

A3: Yes, by iterating through sections and paragraphs, you can selectively filter and extract the desired content.

Convert an HTML File to PDF in Python

Converting HTML to PDF in Python is a common need when you want to generate printable reports, preserve web content, or create offline documentation with consistent formatting. In this tutorial, you’ll learn how to convert HTML to PDF in Python, whether you're working with a local HTML file or an HTML string. If you're looking for a simple and reliable way to generate PDF files from HTML in Python, this guide is for you.

Install Spire.Doc to Convert HTML to PDF Easily

To convert HTML to PDF in Python, you’ll need a reliable library that supports HTML parsing and PDF rendering. Spire.Doc for Python is a powerful and easy-to-use HTML to PDF converter library that lets you generate PDF documents from HTML content — without relying on a browser, headless engine, or third-party tools.

Install via pip

You can install the library quickly with pip:

pip install spire.doc

Alternative: Manual Installation

You can also download the Spire.Doc package and perform a custom installation if you need more control over the environment.

Tip: Spire.Doc offers a free version suitable for small projects or evaluation purposes.

Once installed, you're ready to convert HTML to PDF in Python in just a few lines of code.

Convert HTML Files to PDF in Python

Spire.Doc for Python makes it easy to convert HTML files to PDF. The Document.LoadFromFile() method supports loading various file formats, including .html, .doc, and .docx. After loading an HTML file, you can convert it to PDF by calling the Document.SaveToFile() method. Follow the steps below to convert an HTML file to PDF in Python using Spire.Doc.

Steps to convert an HTML file to PDF in Python:

  • Create a Document object.
  • Load an HTML file using Document.LoadFromFile() method.
  • Convert it to PDF using Document.SaveToFile() method.

The following code shows how to convert an HTML file directly to PDF in Python:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Load an HTML file 
document.LoadFromFile("Sample.html", FileFormat.Html, XHTMLValidationType.none)

# Save the HTML file to a pdf file
document.SaveToFile("output/ToPdf.pdf", FileFormat.PDF)
document.Close()

Convert an HTML File to PDF in Python
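When several HTML files need converting, the same load-and-save steps can be wrapped in a loop. The following is a minimal sketch under stated assumptions: convert_folder is a hypothetical helper, and the actual conversion is delegated to a caller-supplied convert(html_path, pdf_path) function, so the loop itself uses only the Python standard library.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, convert):
    """Call convert(html_path, pdf_path) for every .html file in src_dir.

    `convert` is supplied by the caller (assumption). Returns the list of
    PDF paths that were requested, in sorted file-name order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    results = []
    for html_path in sorted(Path(src_dir).glob("*.html")):
        pdf_path = out / (html_path.stem + ".pdf")
        convert(str(html_path), str(pdf_path))
        results.append(pdf_path)
    return results


# Example converter built from the calls shown above (requires spire.doc):
# def spire_convert(html_path, pdf_path):
#     document = Document()
#     document.LoadFromFile(html_path, FileFormat.Html, XHTMLValidationType.none)
#     document.SaveToFile(pdf_path, FileFormat.PDF)
#     document.Close()
```

Passing the converter as a parameter keeps the folder logic independent of the library, which also makes it easy to try out with a stand-in function.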

Convert an HTML String to PDF in Python

If you want to convert an HTML string to PDF in Python, Spire.Doc for Python provides a straightforward solution. For simple HTML content like paragraphs, text styles, and basic formatting, you can use the Paragraph.AppendHTML() method to insert the HTML into a Word document. Once added, you can save the document as a PDF using the Document.SaveToFile() method.

Here are the steps to convert an HTML string to a PDF file in Python.

  • Create a Document object.
  • Add a section using Document.AddSection() method and insert a paragraph using Section.AddParagraph() method.
  • Specify the HTML string and add it to the paragraph using Paragraph.AppendHTML() method.
  • Save the document as a PDF file using Document.SaveToFile() method.

Here's the complete Python code that shows how to convert an HTML string to a PDF:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Add a section to the document
sec = document.AddSection()

# Add a paragraph to the section
paragraph = sec.AddParagraph()

# Specify the HTML string
htmlString = """
<html>
<head>
    <title>HTML to Word Example</title>
    <style>
        body {
            font-family: Arial, sans-serif;
        }
        h1 {
            color: #FF5733;
            font-size: 24px;
            margin-bottom: 20px;
        }
        p {
            color: #333333;
            font-size: 16px;
            margin-bottom: 10px;
        }
        ul {
            list-style-type: disc;
            margin-left: 20px;
            margin-bottom: 15px;
        }
        li {
            font-size: 14px;
            margin-bottom: 5px;
        }
        table {
            border-collapse: collapse;
            width: 100%;
            margin-bottom: 20px;
        }
        th, td {
            border: 1px solid #CCCCCC;
            padding: 8px;
            text-align: left;
        }
        th {
            background-color: #F2F2F2;
            font-weight: bold;
        }
        td {
            color: #0000FF;
        }
    </style>
</head>
<body>
    <h1>This is a Heading</h1>
    <p>This is a paragraph.</p>
    <p>Here's an unordered list:</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
    <p>And here's a table:</p>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>Gender</th>
        </tr>
        <tr>
            <td>John Smith</td>
            <td>35</td>
            <td>Male</td>
        </tr>
        <tr>
            <td>Jenny Garcia</td>
            <td>27</td>
            <td>Female</td>
        </tr>
    </table>
</body>
</html>
"""

# Append the HTML string to the paragraph
paragraph.AppendHTML(htmlString)

# Save the document as a pdf file
document.SaveToFile("output/HtmlStringToPdf.pdf", FileFormat.PDF)
document.Close()

Convert an HTML String to PDF in Python

Customize the Conversion from HTML to PDF in Python

While converting HTML to PDF in Python is often straightforward, there are times when you need more control over the output. For example, you may want to set a password to protect the PDF document, or embed fonts to ensure consistent formatting across different devices. In this section, you’ll learn how to customize the HTML to PDF conversion using Spire.Doc for Python.

1. Set a Password to Protect the PDF

To prevent unauthorized viewing or editing, you can encrypt the PDF by specifying a user password and an owner password.

# Create a ToPdfParameterList object
toPdf = ToPdfParameterList()

# Set PDF encryption passwords
userPassword = "viewer"
ownerPassword = "E-iceblue"
toPdf.PdfSecurity.Encrypt(userPassword, ownerPassword, PdfPermissionsFlags.Default, PdfEncryptionKeySize.Key128Bit)

# Save as PDF with password protection
document.SaveToFile("output/HtmlToPdfWithPassword.pdf", toPdf)
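
The passwords in the snippet above are hard-coded for illustration. One possible alternative, sketched here as an assumption rather than a Spire.Doc feature, is to read them from environment variables. The function name and the variable names PDF_USER_PASSWORD and PDF_OWNER_PASSWORD are hypothetical.

```python
import os


def read_pdf_passwords(env=None):
    """Return (user_password, owner_password) read from environment variables.

    PDF_USER_PASSWORD and PDF_OWNER_PASSWORD are hypothetical variable names.
    """
    env = os.environ if env is None else env
    user = env.get("PDF_USER_PASSWORD")
    owner = env.get("PDF_OWNER_PASSWORD")
    if not user or not owner:
        raise ValueError("Both PDF_USER_PASSWORD and PDF_OWNER_PASSWORD must be set")
    return user, owner
```

The returned pair can then be passed to the Encrypt() call in place of the literal strings.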

2. Embed Fonts to Preserve Formatting

To ensure the PDF displays correctly across all devices, you can embed all fonts used in the document.

# Create a ToPdfParameterList object
ppl = ToPdfParameterList()
ppl.IsEmbeddedAllFonts = True

# Save as PDF with embedded fonts
document.SaveToFile("output/HtmlToPdfWithEmbeddedFonts.pdf", ppl)

These options give you finer control when you convert HTML to PDF in Python, especially for professional document sharing or long-term storage scenarios.

Conclusion

Converting HTML to PDF in Python becomes simple and flexible with Spire.Doc for Python. Whether you're handling static HTML files or dynamic HTML strings, or need to secure and customize your PDFs, this library provides everything you need — all in just a few lines of code. Get a free 30-day license and start converting HTML to high-quality PDF documents in Python today!

FAQs

Q1: Can I convert an HTML file to PDF in Python?

A: Yes. Using Spire.Doc for Python, you can convert a local HTML file to PDF with just a few lines of code.

Q2: How do I convert HTML to PDF in Chrome?

A: While Chrome allows manual "Save as PDF", it’s not suitable for batch or automated workflows. If you're working in Python, Spire.Doc provides a better solution for programmatically converting HTML to PDF.

Q3: How do I convert HTML to PDF without losing formatting?

A: To preserve formatting:

  • Use embedded or inline CSS (not external files).
  • Use absolute URLs for images and resources.
  • Embed fonts by setting the IsEmbeddedAllFonts option of ToPdfParameterList to True.
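
The advice about absolute URLs can be applied automatically before the HTML is loaded. The following is a minimal sketch using only the Python standard library. absolutize_urls is a hypothetical helper, the base URL is an assumption supplied by the caller, and the regular expression is a simplification that handles quoted src and href attributes rather than a full HTML parser.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href='...' with either quote style (simplified)
_URL_ATTR = re.compile(
    r'\b(?P<attr>src|href)\s*=\s*(?P<quote>["\'])(?P<url>.*?)(?P=quote)',
    re.IGNORECASE,
)


def absolutize_urls(html, base_url):
    """Rewrite relative src/href values in html against base_url."""
    def replace(match):
        url = match.group("url")
        # Leave empty values and in-page anchors untouched
        if not url or url.startswith("#"):
            return match.group(0)
        quote = match.group("quote")
        return f'{match.group("attr")}={quote}{urljoin(base_url, url)}{quote}'

    return _URL_ATTR.sub(replace, html)
```

URLs that are already absolute pass through unchanged, because urljoin() returns them as they are.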

Converting PDF files to editable text is a common need for researchers, analysts, and professionals who deal with large volumes of documents. Manual copying wastes time—Python offers a faster, more flexible solution. In this guide, you’ll learn how to convert PDF to text in Python efficiently, whether you want to keep the layout or extract specific content.

Convert PDF to text without layout

Getting Started: Why Choose Spire.PDF for PDF to Text in Python

To convert PDF files to text using Python, you’ll need a reliable PDF processing library. Spire.PDF for Python is a powerful and developer-friendly API that allows you to read, edit, and convert PDF documents in Python applications — no need for Adobe Acrobat or other third-party software.

This library is ideal for automating PDF workflows such as extracting text, adding annotations, or merging and splitting files. It supports a wide range of PDF features and works seamlessly in both desktop and server environments. You can download it and install it manually, or quickly install Spire.PDF from PyPI using the following command:

pip install Spire.PDF

For smaller or personal projects, a free version is available with basic functionality. If you need advanced features such as PDF signing or form filling, you can upgrade to the commercial edition at any time.

General Workflow for PDF to Text in Python

Converting a PDF to text becomes simple and efficient with the help of Spire.PDF for Python. You can easily complete the task by reusing the sample code provided in the following sections and customizing it to fit your needs. But before diving into the code, let’s take a quick look at the general workflow behind this process.

  • Create an object of PdfDocument class and load a PDF file using LoadFromFile() method.
  • Create an object of PdfTextExtractOptions class and set the text extracting options, including extracting all text, showing hidden text, only extracting text in a specified area, and simple extraction.
  • Get each page in the document using the PdfDocument.Pages.get_Item() method, create a PdfTextExtractor object for that page, and extract its text using the ExtractText() method with the specified options.
  • Save the extracted text as a text file and close the object.

How to Convert PDF to Text in Python Without Layout

If you only need the plain text content from a PDF and don’t care about preserving the original layout, you can use a simple method to extract text. This approach is faster and easier, especially when working with large batches of files. In this section, we’ll show you how to convert PDF to text in Python without preserving the layout.

To extract text without preserving layout, follow these simplified steps:

  • Create an instance of PdfDocument and load the PDF file.
  • Create a PdfTextExtractOptions object and configure the text extraction options.
  • Set IsSimpleExtraction = True to ignore the layout and extract raw text.
  • Loop through all pages of the PDF.
  • Extract text from each page and write it to a .txt file.

from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a string object to store the text
extracted_text = ""

# Create a PdfTextExtractOptions object
extract_options = PdfTextExtractOptions()
# Use the simple extraction method, which ignores the layout
extract_options.IsSimpleExtraction = True

# Loop through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Create a PdfTextExtractor object, passing the page as a parameter
    text_extractor = PdfTextExtractor(page)
    # Extract the text from the page
    text = text_extractor.ExtractText(extract_options)
    # Add the extracted text to the string object
    extracted_text += text

# Write the extracted text to a UTF-8 text file (the "output" folder must already exist)
with open("output/ExtractedText.txt", "w", encoding="utf-8") as file:
    file.write(extracted_text)
pdf.Close()

Convert PDF to text without layout
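
The example above writes to output/ExtractedText.txt, which assumes that the output folder already exists. A small helper can create the folder first and write the file as UTF-8 explicitly. This is a minimal sketch using only the standard library, and save_text is a hypothetical name.

```python
from pathlib import Path


def save_text(path, text):
    """Write text as UTF-8, creating the parent folder if it is missing."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

It could replace the final with open(...) block, for example save_text("output/ExtractedText.txt", extracted_text).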

How to Convert PDF to Text in Python With Layout

Spire.PDF preserves formatting such as tables and paragraphs by default, so converting PDF to text with layout only requires leaving IsSimpleExtraction unset. The steps are similar to the general overview, and you still need to loop through each page for full-text extraction.

from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a string object to store the text
extracted_text = ""

# Create a PdfTextExtractOptions object (default options keep the layout)
extract_options = PdfTextExtractOptions()

# Loop through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Create a PdfTextExtractor object, passing the page as a parameter
    text_extractor = PdfTextExtractor(page)
    # Extract the text from the page
    text = text_extractor.ExtractText(extract_options)
    # Add the extracted text to the string object
    extracted_text += text

# Write the extracted text to a UTF-8 text file (the "output" folder must already exist)
with open("output/ExtractedText.txt", "w", encoding="utf-8") as file:
    file.write(extracted_text)
pdf.Close()

Convert PDF to text with layout
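
In the loop above, the text of each page is appended directly to one string, so page boundaries are not visible in the output. If they matter, the per-page strings can be collected in a list and joined with a marker line. The following is a minimal sketch. join_pages and the marker format are hypothetical and not part of Spire.PDF.

```python
def join_pages(page_texts, marker="--- Page {n} ---"):
    """Join per-page text, putting a numbered marker line before each page."""
    parts = []
    for n, text in enumerate(page_texts, start=1):
        parts.append(marker.format(n=n))
        # Drop trailing newlines so every page ends with exactly one
        parts.append(text.rstrip("\n"))
    if not parts:
        return ""
    return "\n".join(parts) + "\n"
```

To use it, append each page's text to a list inside the loop and pass that list to join_pages() before writing the file.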

Convert a Specific PDF Page to Text in Python

Need to extract text from only one page of a PDF instead of the entire document? With Spire.PDF, the PDF to Text converter in Python, you can easily target and convert a specific PDF page to text. The steps are the same as shown in the general overview. If you're already familiar with them, just copy the code below into any Python editor and automate your PDF to text conversion!

from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor
from spire.pdf import RectangleF

# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Create a PdfTextExtractOptions object
extract_options = PdfTextExtractOptions()

# Restrict extraction to a rectangular area of the page
extract_options.ExtractArea = RectangleF(50.0, 220.0, 700.0, 230.0)

# Get the first page
page = pdf.Pages.get_Item(0)

# Create a PdfTextExtractor object, passing the page as a parameter
text_extractor = PdfTextExtractor(page)

# Extract the text from the page
extracted_text = text_extractor.ExtractText(extract_options)

# Write the extracted text to a UTF-8 text file (the "output" folder must already exist)
with open("output/ExtractedText.txt", "w", encoding="utf-8") as file:
    file.write(extracted_text)
pdf.Close()

Convert a specific PDF page to text

To Wrap Up

In this post, we covered how to convert PDF to text using Python and Spire.PDF, with clear steps and code examples for fast, efficient conversion. We also highlighted the benefits and pointed to OCR tools for image-based PDFs. For any issues or support, feel free to contact us.

FAQs about Converting PDF to Text

Q1: How do I convert a PDF to readable and editable text in Python?
A: You can convert a PDF to text in Python using the Spire.PDF library. It allows you to extract text from PDF files while optionally keeping the original layout, and you don’t need Adobe Acrobat. Image-based (scanned) PDFs contain no text layer, so they need an OCR tool instead.

Q2: Is there a free tool to convert PDF to text?
A: Yes. Spire.PDF for Python provides a free edition that allows you to convert PDF to text without relying on Adobe Acrobat or other software. Online tools are also available, but they’re more suitable for occasional use or small files.

Q3: Can Python extract data from PDF?
A: Yes, Python can extract data from PDF files. Using Spire.PDF, you can easily extract not only text but also other elements such as images, annotations, bookmarks, and even attachments. This makes it a versatile tool for working with PDF content in Python.
