Spire.Doc for Java 13.7.6 supports the "Two Lines in One" function
We’re pleased to announce the release of Spire.Doc for Java 13.7.6. The latest version supports the "Two Lines in One" function, which enhances the conversion from Word to PDF. Furthermore, some known bugs are fixed successfully in the new version, such as the issue where accepting revisions did not affect the content in content controls. More details are listed below.
Here is a list of changes made in this release
| Category | ID | Description |
| New feature | SPIREDOC-11113 SPIREDOC-11320 SPIREDOC-11338 | Supports the "Two Lines in One" function. |
| Bug | SPIREDOC-11276 | Fixes the issue where accepting revisions did not affect the content in content controls. |
| Bug | SPIREDOC-11314 | Fixes the issue where converting Word to PDF caused a "NullPointerException" to be thrown. |
| Bug | SPIREDOC-11325 | Fixes the issue where retrieving Word document properties was incorrect. |
| Bug | SPIREDOC-11333 | Fixes the issue where converting Word to Markdown resulted in disorganized bullet points. |
| Bug | SPIREDOC-11360 | Fixes the issue where converting Word to PDF caused vertically oriented text in tables to be incorrect. |
| Bug | SPIREDOC-11364 | Fixes the issue where replacing bookmark content caused an "IllegalArgumentException" to be thrown. |
| Bug | SPIREDOC-11389 | Fixes the issue where loading a Word document caused an "IllegalArgumentException: List level must be less than 8 and greater than 0" to be thrown. |
| Bug | SPIREDOC-11390 | Fixes the issue where accepting revisions did not produce the correct effect. |
| Bug | SPIREDOC-11398 | Fixes the issue where using "pictureWatermark.setPicture(bufferedImage)" caused a "java.lang.NullPointerException" to be thrown. |
Spire.OCR for Java 2.1.1 adds support for the Linux-ARM platform
We’re excited to announce the release of Spire.OCR for Java 2.1.1. This version introduces support for the Linux-ARM platform and enables text output that matches the original image layout. In addition, this update includes several bug fixes. More details are provided below.
| Category | ID | Description |
| New feature | - | Added support for the Linux-ARM platform. |
| New feature | SPIREOCR-84 | Added support for automatically rotating images when necessary. Usage: ConfigureOptions configureOptions = new ConfigureOptions(); configureOptions.setAutoRotate(true); |
| New feature | SPIREOCR-107 | Added support for preserving the original image layout in text output. Usage: VisualTextAligner visualTextAligner = new VisualTextAligner(scanner.getText()); String scannedText = visualTextAligner.toString(); |
| Bug | SPIREOCR-103 | Fixed the issue where the cleanup of the temporary folder "temp" was not functioning correctly. |
| Bug | SPIREOCR-104 | Fixed the issue where an "Error occurred during ConfigureDependencies" message appeared when the path contained Chinese characters. |
| Bug | SPIREOCR-108 | Fixed the issue where the content extraction order was incorrect. |
How to Convert HTML to Word in Java (Complete Guide)

Converting HTML to Word in Java is essential for developers building reporting tools, content management systems, and enterprise applications. While HTML powers web content, Word documents offer professional formatting, offline accessibility, and easy editing, making them ideal for reports, invoices, contracts, and formal submissions.
This guide demonstrates how to use Java and Spire.Doc for Java to convert HTML to Word. It covers converting HTML files and strings, batch processing multiple files, and preserving formatting and images.
Table of Contents
- Why Convert HTML to Word in Java
- Set Up Spire.Doc for Java
- Convert HTML File to Word in Java
- Convert HTML String to Word in Java
- Batch Conversion of Multiple HTML Files to Word in Java
- Best Practices for HTML to Word Conversion
- Conclusion
- FAQs
Why Convert HTML to Word in Java?
Converting HTML to Word offers several advantages:
- Flexible editing – Add comments, track changes, and review content easily.
- Consistent formatting – Preserve layouts, fonts, and styles across documents.
- Professional appearance – DOCX files look polished and ready to share.
- Offline access – Word files can be opened without an internet connection.
- Integration – Word is widely supported across tools and industries.
Common use cases: exporting HTML reports from web apps, archiving dynamic content in editable formats, and generating formal reports, invoices, or contracts.
Set Up Spire.Doc for Java
Spire.Doc for Java is a robust library that enables developers to create Word documents, edit existing Word documents, and read and convert Word documents in Java without requiring Microsoft Word to be installed.
Before you can convert HTML content into Word documents, it’s essential to properly install and configure Spire.Doc for Java in your development environment.
1. Java Version Requirement
Ensure that your development environment is running Java 6 (JDK 1.6) or a higher version.
2. Installation
Option 1: Using Maven
For projects managed with Maven, you can add the repository and dependency to your pom.xml:
<repositories>
<repository>
<id>com.e-iceblue</id>
<name>e-iceblue</name>
<url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>e-iceblue</groupId>
<artifactId>spire.doc</artifactId>
<version>13.11.2</version>
</dependency>
</dependencies>
For a step-by-step guide on Maven installation and configuration, refer to our article: How to Install Spire Series Products for Java from Maven Repository.
Option 2: Manual JAR Installation
For projects without Maven, you can manually add the library:
- Download Spire.Doc.jar from the official website.
- Add it to your project classpath.
Convert HTML File to Word in Java
If you already have an existing HTML file, converting it into a Word document is straightforward and efficient. This method is ideal for situations where HTML reports, templates, or web content need to be transformed into professionally formatted, editable Word files.
By using Spire.Doc for Java, you can preserve the original layout, text formatting, tables, lists, images, and hyperlinks, ensuring that the converted document remains faithful to the source. The process is simple, requiring only a few lines of code while giving you full control over page settings and document structure.
Conversion Steps:
- Create a new Document object.
- Load the HTML file with loadFromFile().
- Adjust settings like page margins.
- Save the output as a Word document with saveToFile().
Example:
import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import com.spire.doc.Section;
import com.spire.doc.documents.XHTMLValidationType;
public class ConvertHtmlFileToWord {
public static void main(String[] args) {
// Create a Document object
Document document = new Document();
// Load an HTML file
document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.html",
FileFormat.Html,
XHTMLValidationType.None);
// Adjust margins
Section section = document.getSections().get(0);
section.getPageSetup().getMargins().setAll(2);
// Save as Word file
document.saveToFile("output/FromHtmlFile.docx", FileFormat.Docx);
// Release resources
document.dispose();
System.out.println("HTML file successfully converted to Word!");
}
}
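The examples in this guide pass different numbers to setAll() (2 here, 72 in the HTML string example). Assuming the margin values are measured in points (1 point = 1/72 inch), the small helper below, which is our own illustration and not part of Spire.Doc, converts inches or centimeters to points so the intended margin is explicit. The class and method names are hypothetical.

```java
public class MarginUnits {
    // Assumption: page margins are expressed in points, 72 points per inch.
    private static final float POINTS_PER_INCH = 72f;
    private static final float CM_PER_INCH = 2.54f;

    /** Converts a length in inches to points. */
    public static float inchesToPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    /** Converts a length in centimeters to points. */
    public static float cmToPoints(float cm) {
        return cm / CM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // A one-inch margin expressed in points
        System.out.println(inchesToPoints(1f)); // prints 72.0
        // A two-centimeter margin expressed in points
        System.out.println(cmToPoints(2f));
    }
}
```

Under that assumption, a call such as setAll(MarginUnits.inchesToPoints(1f)) would request one-inch margins on every side.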

You may also be interested in: Java: Convert Word to HTML
Convert HTML String to Word in Java
In many real-world applications, HTML content is generated dynamically - whether it comes from user input, database records, or template engines. Converting these HTML strings directly into Word documents allows developers to create professional, editable reports, invoices, or documents on the fly without relying on pre-existing HTML files.
Using Spire.Doc for Java, you can render rich HTML content, including headings, lists, tables, images, hyperlinks, and more, directly into a Word document while preserving formatting and layout.
Conversion Steps:
- Create a new Document object.
- Add a section and adjust settings like page margins.
- Add a paragraph.
- Add the HTML string to the paragraph using appendHTML().
- Save the output as a Word document with saveToFile().
Example:
import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import com.spire.doc.Section;
import com.spire.doc.documents.Paragraph;
public class ConvertHtmlStringToWord {
public static void main(String[] args) {
// Sample HTML string
String htmlString = "<h1>Java HTML to Word Conversion</h1>" +
"<p><b>Spire.Doc</b> allows you to convert HTML content into Word documents seamlessly. " +
"This includes support for headings, paragraphs, lists, tables, links, and images.</p>" +
"<h2>Features</h2>" +
"<ul>" +
"<li>Preserve text formatting such as <i>italic</i>, <u>underline</u>, and <b>bold</b></li>" +
"<li>Support for ordered and unordered lists</li>" +
"<li>Insert tables with multiple rows and columns</li>" +
"<li>Add hyperlinks and bookmarks</li>" +
"<li>Embed images from URLs or base64 strings</li>" +
"</ul>" +
"<h2>Example Table</h2>" +
"<table border='1' style='border-collapse:collapse;'>" +
"<tr><th>Item</th><th>Description</th><th>Quantity</th></tr>" +
"<tr><td>Notebook</td><td>Spire.Doc Java Guide</td><td>10</td></tr>" +
"<tr><td>Pen</td><td>Blue Ink</td><td>20</td></tr>" +
"<tr><td>Marker</td><td>Permanent Marker</td><td>5</td></tr>" +
"</table>" +
"<h2>Links and Images</h2>" +
"<p>Visit <a href='https://www.e-iceblue.com/'>E-iceblue Official Site</a> for more resources.</p>" +
"<p>Sample Image:</p>" +
"<img src='https://www.e-iceblue.com/images/intro_pic/Product_Logo/doc-j.png' alt='Product Logo' width='150' height='150'/>" +
"<h2>Conclusion</h2>" +
"<p>Using Spire.Doc, Java developers can easily generate Word documents from rich HTML content while preserving formatting and layout.</p>";
// Create a Document
Document document = new Document();
// Add section and paragraph
Section section = document.addSection();
section.getPageSetup().getMargins().setAll(72);
Paragraph paragraph = section.addParagraph();
// Render HTML string
paragraph.appendHTML(htmlString);
// Save as Word
document.saveToFile("output/FromHtmlString.docx", FileFormat.Docx);
document.dispose();
System.out.println("HTML string successfully converted to Word!");
}
}

Batch Conversion of Multiple HTML Files to Word in Java
Sometimes you may need to convert hundreds of HTML files into Word documents. Here’s how to batch process them in Java.
import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import com.spire.doc.documents.XHTMLValidationType;
import java.io.File;
public class BatchConvertHtmlToWord {
public static void main(String[] args) {
File folder = new File("C:\\Users\\Administrator\\Desktop\\HtmlFiles");
for (File file : folder.listFiles()) {
if (file.getName().endsWith(".html") || file.getName().endsWith(".htm")) {
Document document = new Document();
document.loadFromFile(file.getAbsolutePath(), FileFormat.Html, XHTMLValidationType.None);
// Swap the trailing .html or .htm extension for .docx
String outputPath = "output/" + file.getName().replaceAll("(?i)\\.html?$", ".docx");
document.saveToFile(outputPath, FileFormat.Docx);
document.dispose();
System.out.println(file.getName() + " converted to Word!");
}
}
}
}
This approach is great for reporting systems where multiple HTML reports are generated daily.
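Two details tend to cause trouble in batch jobs like this: a `.htm` or upper-case extension can slip past a literal `.html` match when the output name is built, and `File.listFiles()` returns null when the folder is missing or unreadable. The sketch below uses only the JDK and shows one way to handle both; the class and method names are hypothetical helpers, not part of Spire.Doc.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class HtmlBatchHelper {
    /** Returns true for names ending in .html or .htm, ignoring case. */
    public static boolean isHtmlFile(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    /** Replaces a trailing .html or .htm extension (any case) with .docx. */
    public static String toDocxName(String htmlName) {
        return htmlName.replaceAll("(?i)\\.html?$", ".docx");
    }

    /** Lists the HTML files in a folder; returns an empty list if the folder cannot be read. */
    public static List<File> listHtmlFiles(File folder) {
        List<File> result = new ArrayList<>();
        File[] entries = folder.listFiles(); // null if not a directory or on I/O error
        if (entries == null) {
            return result;
        }
        for (File entry : entries) {
            if (entry.isFile() && isHtmlFile(entry.getName())) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toDocxName("report.html")); // prints report.docx
        System.out.println(toDocxName("Summary.HTM")); // prints Summary.docx
    }
}
```

Each file returned by listHtmlFiles() can then be loaded and saved exactly as in the loop above, with toDocxName() supplying the output file name.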
Best Practices for HTML to Word Conversion
- Use Inline CSS for Reliable Styling
Inline CSS ensures that fonts, colors, and spacing are preserved during conversion. External stylesheets may not always render correctly, especially if they are not accessible at runtime.
- Validate HTML Structure
Well-formed HTML with proper nesting and closed tags helps render tables, lists, and headings accurately.
- Optimize Images
Use absolute URLs or embed images as base64. Resize large images to fit Word layouts and reduce file size.
- Manage Resources in Batch Conversion
When processing multiple files, convert them one by one and call dispose() after each document to prevent memory issues.
- Preserve Page Layouts
Set page margins, orientation, and paper size to ensure the Word document looks professional, especially for reports and formal documents.
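To illustrate the image tip, the following sketch builds a base64 data URI with the JDK's java.util.Base64 and wraps it in an img tag, so the HTML no longer depends on a reachable image URL. The class and method names are hypothetical, and the two demo bytes are a stand-in rather than a real image.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ImageDataUri {
    /** Builds a data URI of the form data:<mimeType>;base64,<encoded bytes>. */
    public static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    /** Wraps the data URI in an img tag; altText is inserted as-is, so it must not contain quotes. */
    public static String toImgTag(byte[] imageBytes, String mimeType, String altText) {
        return "<img src='" + toDataUri(imageBytes, mimeType) + "' alt='" + altText + "'/>";
    }

    public static void main(String[] args) {
        // Stand-in bytes for demonstration only; real code would read an image file
        byte[] demo = "hi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toImgTag(demo, "image/png", "demo"));
        // prints <img src='data:image/png;base64,aGk=' alt='demo'/>
    }
}
```

In practice the bytes would come from something like Files.readAllBytes(), and the resulting tag would be placed in the HTML string before conversion.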
Conclusion
Converting HTML to Word in Java is an essential feature for many enterprise applications. Using Spire.Doc for Java, you can:
- Convert HTML files into Word documents.
- Render HTML strings directly into DOCX.
- Handle batch processing for multiple files.
- Preserve images, tables, and styles with ease.
By following the examples and best practices above, you can integrate HTML to Word conversion seamlessly into your Java applications.
FAQs (Frequently Asked Questions)
Q1. Can Java convert multiple HTML files into one Word document?
A1: Yes. Instead of saving each file separately, you can load multiple HTML contents into the same Document and then save it once.
Q2. How to preserve CSS styles during HTML to Word conversion?
A2: Inline CSS will be preserved; external stylesheets can also be applied if they’re accessible at run time.
Q3. Can I generate a Word document directly from a web page?
A3: Yes. You can fetch the HTML using an HTTP client in Java, then pass it into the conversion method.
Q4. What Word formats are supported for saving the converted document?
A4: You can save as DOCX, DOC, or other Word-compatible formats supported by Spire.Doc. DOCX is recommended for modern applications due to its compatibility and smaller file size.
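For Q3, a minimal sketch of the fetching step is shown below. It assumes Java 11 or later, because it relies on the JDK's java.net.http.HttpClient, and it covers only the download; the HtmlFetcher class and fetchHtml method are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class HtmlFetcher {
    /** Downloads the page at the given URL and returns its body as a string. */
    public static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Treat anything other than 200 OK as a failure rather than converting an error page
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: java HtmlFetcher <url>");
            return;
        }
        String html = fetchHtml(args[0]);
        System.out.println("Fetched " + html.length() + " characters");
    }
}
```

The returned string can then be passed to appendHTML() as in the HTML string example. Images and stylesheets referenced with relative URLs may not resolve after download, so absolute URLs or embedded resources are the safer choice.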
E-iceblue has an 8-Day Spring Festival Holiday during 28/01/2025-04/02/2025
As the Chinese New Year approaches, our office will be closed from 28/01/2025 to 04/02/2025 (GMT+8:00).
During the holiday, your emails will be received as usual and urgent issues will be handled as soon as possible by the staff on-duty. Please note that standard support may be limited during this time, so we kindly ask for your understanding and patience if you do not receive an immediate response.
Note: Our purchase system is available 24/7 and will automatically send out license files once you have completed the online order and payment.
To get a temporary license to evaluate our product, please click "Request a Temporary License" on the download page. If there are any problems with the request, we will make it available when we return to work on February 05, 2025.
We apologize for any inconvenience this may cause and really appreciate your understanding and support.
Please feel free to contact us via the following emails:
- Support Team: support@e-iceblue.com
- Sales Team: sales@e-iceblue.com
Spire.Office 10.1.0 is released
We are excited to announce the release of Spire.Office 10.1.0. In this version, Spire.Doc supports checking and modifying hyperlinks for images and shapes; Spire.XLS supports the CSCH, RANDARRAY, COTH, SEQUENCE, EXPAND functions; Spire.Presentation supports obtaining the file name of embedded OLE objects; Spire.PDF enhances the conversion from XPS to PDF and PDF to PNG, HTML, SVG, OFD, XPS, and Excel. More details are listed below.
In this version, the most recent versions of Spire.Doc, Spire.XLS, Spire.Presentation, and Spire.PDF are included.
DLL Versions:
- Spire.Doc 13.1.4
- Spire.XLS 15.1.3
- Spire.Presentation 10.1.1
- Spire.PDF 11.1.0
- Spire.PDF 11.1.5
Here is a list of changes made in this release
Spire.Doc
| Category | ID | Description |
| New feature | SPIREDOC-10532 SPIREDOC-11019 |
Supports checking and modifying hyperlinks for images and shapes.
foreach (Section section in doc.Sections)
{
foreach (Paragraph paragraph in section.Paragraphs)
{
foreach (DocumentObject documentObject in paragraph.ChildObjects)
{
if (documentObject is DocPicture)
{
DocPicture pic=documentObject as DocPicture;
if (pic.HasHyperlink)
{
pic.HRef = "";
}
}
if (documentObject is ShapeObject)
{
ShapeObject shape = documentObject as ShapeObject;
if (shape.HasHyperlink)
{
shape.HRef = "";
}
}
}
}
}
|
| Bug | SPIREDOC-10551 | Fixes the issue that the program threw “The given key ‘5’ was not present in the dictionary” exception when converting HTML documents to Word documents. |
| Bug | SPIREDOC-11022 | Fixes the issue that the obtained ListText of paragraphs was incorrect. |
Spire.XLS
| Category | ID | Description |
| New feature | SPIREXLS-5542 | Supports the CSCH function. |
| New feature | SPIREXLS-5548 | Supports the RANDARRAY function. |
| New feature | SPIREXLS-5621 | Supports the COTH function. |
| New feature | SPIREXLS-5622 | Supports the SEQUENCE function. |
| New feature | SPIREXLS-5627 | Supports the EXPAND function. |
| New feature | SPIREXLS-5638 | Supports the CHOOSECOLS function. |
| New feature | SPIREXLS-5639 | Supports the CHOOSEROWS function. |
| New feature | SPIREXLS-5642 | Supports the DROP function. |
| New feature | SPIREXLS-5656 | Supports setting HyLink for XlsPrstGeomShape.
PrstGeomShapeCollection prstGeomShapeType = worksheet.PrstGeomShapes;
for (int i = 0; i < prstGeomShapeType.Count; i++)
{
XlsPrstGeomShape shape = (XlsPrstGeomShape)prstGeomShapeType[i];
shape.HyLink.Address = "https://www.baidu.com/";
}
|
| Bug | SPIREXLS-5570 | Fixes the issue that the charts were lost when converting XLSM to PDF. |
| Bug | SPIREXLS-5608 | Fixes the issue that the content was lost when converting Excel to PDF. |
| Bug | SPIREXLS-5611 | Fixes the issue that setting ShowLeaderLines did not take effect. |
| Bug | SPIREXLS-5612 | Fixes the issue that the data bar colors were incorrect when converting Excel to PDF. |
| Bug | SPIREXLS-5625 SPIREXLS-5647 |
Fixes the issue that the values were incorrect after calling the CalculateAllValue() method to calculate formula values. |
| Bug | SPIREXLS-5635 | Fixes the issue that setting the worksheet tab color to Color.Empty resulted in black. |
| Bug | SPIREXLS-5640 | Fixes the issue that the images were extracted incorrectly. |
| Bug | SPIREXLS-5657 | Fixes the issue that it failed to delete pivot fields in pivot tables. |
| Bug | SPIREXLS-5659 | Fixes the issue that the text orientation in shapes was reversed when converting Excel to PDF. |
Spire.Presentation
| Category | ID | Description |
| New feature | SPIREPPT-2658 | Supports obtaining the file name of embedded OLE objects.
IOleObject oleObject = shape as IOleObject; oleObject.EmbeddedFileName |
| Bug | SPIREPPT-2652 | Fixes the issue that the program threw an exception "object reference not set to object instance" when loading PPTX documents. |
| Bug | SPIREPPT-2657 | Fixes the issue that underlines were discontinuous when converting PPTX to SVG. |
| Bug | SPIREPPT-2690 | Fixes the issue that content was lost when converting PPTX to PDF. |
| Bug | SPIREPPT-2692 | Fixes the issue that checkboxes were missed when converting PPTX to PDF. |
| Bug | SPIREPPT-2702 | Fixes the issue that the program threw an "Object reference not set to an instance of an object" exception when obtaining font names. |
| Bug | SPIREPPT-2703 | Fixes the issue that setting "Shrink text on overflow" resulted in an incorrect format. |
| Bug | SPIREPPT-2705 | Fixes the issue that setting "Resize shape to fit text" did not take effect. |
Spire.PDF
| Category | ID | Description |
| Bug | SPIREPDF-7162 | Fixes the issue that an error occurred during multi-threaded PDF text extraction. |
| Bug | SPIREPDF-7201 | Fixes the issue that there were incorrect links when converting XPS to PDF. |
| Bug | SPIREPDF-7235 | Fixes the issue that incorrect content was extracted from PDF tables. |
| Bug | SPIREPDF-7246 | Fixes the issue that some content turned black when converting PDF to PNG. |
| Bug | SPIREPDF-7248 | Fixes the issue that HTML documents were too large when converting PDF to HTML. |
| Bug | SPIREPDF-7250 | Fixes the issue that annotation content wasn't displayed when converting PDF to XPS. |
| Bug | SPIREPDF-7264 | Fixes the issue that content was lost when converting PDF to images. |
| Bug | SPIREPDF-7279 SPIREPDF-7301 |
Fixes the issue that there were incorrect fonts when converting PDF to SVG. |
| Bug | SPIREPDF-7280 | Fixes the issue that character spaces were missed when converting XPS to PDF. |
| Bug | SPIREPDF-7286 | Fixes the issue that the program threw an "Object reference not set to an instance of an object." exception when loading PDF documents. |
| Bug | SPIREPDF-7288 | Fixes the issue that seal content was cut when converting PDF to OFD. |
| Bug | SPIREPDF-7289 | Fixes the issue that the program threw an "Object reference not set to an instance of an object." exception when converting PDF to grayscale PDF. |
| Bug | SPIREPDF-7051 | Fixes the issue that the content was incorrect when printing PDF. |
| Bug | SPIREPDF-7159 SPIREPDF-7294 |
Fixes the issue that replacing text caused some text to be lost. |
| Bug | SPIREPDF-7211 | Fixes the issue that the program suspended when saving PDF. |
| Bug | SPIREPDF-7221 | Fixes the issue that modifying the value of text fields consumed a long time. |
| Bug | SPIREPDF-7249 | Fixes the issue that some Chinese characters were garbled after converting PDF to XPS. |
| Bug | SPIREPDF-7275 | Fixes the issue that converting PDF to grayscale PDF consumed a long time. |
| Bug | SPIREPDF-7278 | Fixes the issue that the result was incorrect when converting PDF to Excel. |
| Bug | SPIREPDF-7312 | Fixes the issue that the value of the field disappeared when the mouse entered the field after filling the text box field. |
Java: Extract Text from Images Using the New Model of Spire.OCR for Java
Spire.OCR for Java offers developers a new model for extracting text from images. In this article, we will demonstrate how to extract text from images in Java using the new model of Spire.OCR for Java.
The detailed steps are as follows.
Step 1: Create a Java Project in IntelliJ IDEA.

Step 2: Add Spire.OCR.jar to Your Project.
Option 1: Install Spire.OCR for Java via Maven.
If you're using Maven, you can install Spire.OCR for Java by adding the following code to your project's pom.xml file:
<repositories>
<repository>
<id>com.e-iceblue</id>
<name>e-iceblue</name>
<url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>e-iceblue</groupId>
<artifactId>spire.ocr</artifactId>
<version>2.1.1</version>
</dependency>
</dependencies>
Option 2: Manually Import Spire.OCR.jar.
First, download Spire.OCR for Java from the following link and extract it to a specific directory:
https://www.e-iceblue.com/Download/ocr-for-java.html
Next, in IntelliJ IDEA, go to File > Project Structure > Modules > Dependencies. In the Dependencies pane, click the "+" button and select JARs or Directories. Navigate to the directory where Spire.OCR for Java is located, open the lib folder and select the Spire.OCR.jar file, then click OK to add it as the project’s dependency.

Step 3: Download the New Model of Spire.OCR for Java.
Download the model that matches your operating system from one of the following links.
Linux x64 (CentOS 8, Ubuntu 18 and above versions are required)
Then extract the package and save it to a specific directory on your computer. In this example, we saved the package to "D:\".

Step 4: Implement Text Extraction from Images Using the New Model of Spire.OCR for Java.
Use the following code to extract text from images with the new OCR model of Spire.OCR for Java:
import com.spire.ocr.*;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
public class Main {
public static void main(String[] args) {
try {
// Create an instance of the OcrScanner class
OcrScanner scanner = new OcrScanner();
// Create an instance of the ConfigureOptions class to set up the scanner configurations
ConfigureOptions configureOptions = new ConfigureOptions();
// Set the path to the new model
configureOptions.setModelPath("D:\\win-x64");
// Set the language for text recognition. The default is English.
// Supported languages include English, Chinese, Chinesetraditional, French, German, Japanese, and Korean.
configureOptions.setLanguage("English");
// Apply the configuration options to the scanner
scanner.ConfigureDependencies(configureOptions);
// Extract text from an image
scanner.scan("Sample.png");
// Save the extracted text to a text file
saveTextToFile(scanner, "output.txt");
} catch (OcrException e) {
e.printStackTrace();
}
}
private static void saveTextToFile(OcrScanner scanner, String filePath) {
try {
String text = scanner.getText().toString();
try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
writer.write(text);
}
} catch (IOException | OcrException e) {
e.printStackTrace();
}
}
}
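One caveat about saving the result: on Java versions before 18, FileWriter created without a charset uses the platform default encoding, which can garble recognized Chinese, Japanese, or Korean text on some systems. The sketch below is a JDK-only alternative that always writes UTF-8; the Utf8TextWriter class is a hypothetical helper, and the sample string stands in for the value returned by scanner.getText().toString().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8TextWriter {
    /** Writes text to the target file as UTF-8, creating parent folders if needed. */
    public static void writeUtf8(String text, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the recognized text; includes two Chinese characters
        String recognized = "Sample \u6587\u672c text";
        Path out = Paths.get("output.txt");
        writeUtf8(recognized, out);
        // Read the file back as UTF-8 to confirm the round trip
        String readBack = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(recognized)); // prints true
    }
}
```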
Apply for a Temporary License
If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.
Text Alignment in Python | Left, Right, Center Align & More

In the world of document automation, proper text alignment is crucial for creating professional, readable, and visually appealing documents. For developers and data professionals building reports, drafting letters, or designing invoices, mastering text alignment in Python is essential to producing polished, consistent documents without manual editing.
This guide delivers a step-by-step walkthrough on how to align text in Python using Spire.Doc for Python, a library that enables effortless control over Word document formatting.
- Why Choose Spire.Doc for Python to Align Text?
- Core Text Alignment Types in Spire.Doc
- Step-by-Step: Align Text in Word in Python
- FAQs About Python Text Alignment
- Conclusion
Why Choose Spire.Doc for Python to Align Text?
Before diving into code, let’s clarify why Spire.Doc is a top choice for text alignment tasks:
- Full Alignment Support: Natively supports all standard alignment types (Left, Right, Center, Justify) for paragraphs.
- No Microsoft Word Dependency: Runs independently - no need to install Word on your machine.
- High Compatibility: Works with .docx, .doc, and other Word formats, ensuring your aligned documents open correctly across devices.
- Fine-Grained Control: Adjust alignment for entire paragraphs or table cells.
Core Text Alignment Types in Spire.Doc
Spire.Doc uses the HorizontalAlignment enum to define text alignment. The most common values are:
- HorizontalAlignment.Left: Aligns text to the left margin (default).
- HorizontalAlignment.Right: Aligns text to the right margin.
- HorizontalAlignment.Center: Centers text horizontally between margins.
- HorizontalAlignment.Justify: Adjusts text spacing so both left and right edges align with margins.
- HorizontalAlignment.Distribute: Adjusts character spacing (adds space between letters) and word spacing to fill the line.
Below, we’ll cover how to programmatically set paragraph alignment (left, right, center, justified, and distributed) in Word using Python.
Step-by-Step: Align Text in Word in Python
Here are the actionable steps to generate a Word document with 5 paragraphs, each using a different alignment style.
Step 1: Install Spire.Doc for Python
Open your terminal/command prompt, and then run the following command to install the latest version:
pip install Spire.Doc
Step 2: Import Required Modules
Import the core classes from Spire.Doc. These modules let you create documents, sections, paragraphs, and configure formatting:
from spire.doc import *
from spire.doc.common import *
Step 3: Create a New Word Document
Initialize a Document instance that represents your empty Word file:
# Create a Document instance
doc = Document()
Step 4: Add a Section to the Document
Word documents organize content into sections (each section can have its own margins, page size, etc.). We’ll add one section to hold our paragraphs:
# Add a section to the document
section = doc.AddSection()
Step 5: Add Paragraphs with Different Alignments
A section contains paragraphs, and each paragraph’s alignment is controlled via the HorizontalAlignment enum. We’ll create 5 paragraphs, one for each alignment type.
Left alignment is the default for most text (text aligns to the left margin).
# Left aligned text
paragraph1 = section.AddParagraph()
paragraph1.AppendText("This is left-aligned text.")
paragraph1.Format.HorizontalAlignment = HorizontalAlignment.Left
Right alignment is useful for dates, signatures, or page numbers (text aligns to the right margin).
# Right aligned text
paragraph2 = section.AddParagraph()
paragraph2.AppendText("This is right-aligned text.")
paragraph2.Format.HorizontalAlignment = HorizontalAlignment.Right
Center alignment works well for titles or headings (text centers between left and right margins). Use the following code to center text in Python:
# Center aligned text
paragraph3 = section.AddParagraph()
paragraph3.AppendText("This is center-aligned text.")
paragraph3.Format.HorizontalAlignment = HorizontalAlignment.Center
Justified text aligns with both the left and right margins (spaces between words are adjusted for consistency). It is ideal for formal documents like essays or reports.
# Justified
paragraph4 = section.AddParagraph()
paragraph4.AppendText("This is justified text.")
paragraph4.Format.HorizontalAlignment = HorizontalAlignment.Justify
Note: Justified alignment is more visible with longer text - short phrases may not show the spacing adjustment.
Distributed alignment is similar to justified, but evenly distributes single-line text (e.g., unevenly spaced words or short phrases).
# Distributed
paragraph5 = section.AddParagraph()
paragraph5.AppendText("This is evenly distributed text.")
paragraph5.Format.HorizontalAlignment = HorizontalAlignment.Distribute
Step 6: Save and Close the Document
Finally, save the document to a specified path and close the Document instance to free resources:
# Save the document
doc.SaveToFile("TextAlignment.docx", FileFormat.Docx2016)
# Close the document to release memory
doc.Close()
Output:

Pro Tip: Spire.Doc for Python also provides interfaces to align tables in Word or align text in table cells.
FAQs About Python Text Alignment
Q1: Is Spire.Doc for Python free?
A: Spire.Doc offers a free version with limitations. For full functionality, you can request a 30-day trial license here.
Q2: Can I set text alignment for existing Word documents?
A: Yes. Spire.Doc lets you load existing documents and modify text alignment for specific paragraphs. Here’s a quick example:
from spire.doc import *
# Load an existing document
doc = Document()
doc.LoadFromFile("ExistingDocument.docx")
# Get the first section and first paragraph
section = doc.Sections[0]
paragraph = section.Paragraphs[0]
# Change alignment to center
paragraph.Format.HorizontalAlignment = HorizontalAlignment.Center
# Save the modified document
doc.SaveToFile("UpdatedDocument.docx", FileFormat.Docx2016)
doc.Close()
Q3: Can I apply different alignments to different parts of the same paragraph?
A: No. Text alignment is a paragraph-level setting in Word, not a character-level setting. This means all text within a single paragraph must share the same alignment (left, right, center, etc.).
If you need mixed alignment in the same line, you’ll need to use a table with invisible borders.
Q4: Can Spire.Doc for Python handle other text formatting?
A: Absolutely! Spire.Doc lets you combine alignment with other formatting like fonts, line spacing, bullet points, and more.
Conclusion
Automating Word text alignment with Python and Spire.Doc saves time, reduces human error, and ensures consistency across documents. The code example provided offers a clear template for implementing left, right, center, justified, and distributed alignment, and adapting it to your needs is as simple as modifying the text or adding more formatting rules.
Try experimenting with different alignment combinations, and explore Spire.Doc’s online documentation to unlock more formatting possibilities.
Read Word Document in C# .NET: Extract Text, Tables, Images

Word documents (.doc and .docx) are widely used in business, education, and professional workflows for reports, contracts, manuals, and other essential content. As a C# developer, you may find the need to programmatically read these files to extract information, analyze content, and integrate document data into applications.
In this complete guide, we will delve into the process of reading Word documents in C#. We will explore various scenarios, including:
- Extracting text, paragraphs, and formatting details
- Retrieving images and structured table data
- Accessing comments and document metadata
- Reading headers and footers for comprehensive document analysis
By the end of this guide, you will have a solid understanding of how to efficiently parse Word documents in C#, allowing your applications to access and utilize document content with accuracy and ease.
Table of Contents
- Set Up Your Development Environment for Reading Word Documents in C#
- Load Word Document (.doc/.docx) in C#
- Read and Extract Content from Word Document in C#
- Advanced Tips and Best Practices for Reading Word Documents in C#
- Conclusion
- FAQs
Set Up Your Development Environment for Reading Word Documents in C#
Before you can read Word documents in C#, it’s crucial to ensure that your development environment is properly set up. This section outlines the necessary prerequisites and step-by-step installation instructions to get you ready for seamless Word document handling.
Prerequisites
- Development Environment: Ensure you have Visual Studio or another compatible C# IDE installed.
- .NET Requirement: Ensure you have .NET Framework or .NET Core installed.
- Library Requirement: Spire.Doc for .NET, a versatile library that allows developers to:
  - Create Word documents from scratch
  - Edit and format existing Word documents
  - Read and extract text, tables, images, and other content programmatically
  - Convert Word documents to PDF, HTML, and other formats
  - Work independently without requiring Microsoft Word installation
Install Spire.Doc
To incorporate Spire.Doc into your C# project, follow these steps to install it via NuGet:
- Open your project in Visual Studio.
- Right-click on your project in the Solution Explorer and select Manage NuGet Packages.
- In the Browse tab, search for "Spire.Doc" and click Install.
Alternatively, you can use the Package Manager Console with the following command:
PM> Install-Package Spire.Doc
This installation adds the necessary references, enabling you to programmatically work with Word documents.
Load Word Document (.doc/.docx) in C#
To begin, you need to load a Word document into your project. The following example demonstrates how to load a .docx or .doc file in C#:
using Spire.Doc;
using Spire.Doc.Documents;
using System;
namespace LoadWordExample
{
    class Program
    {
        static void Main(string[] args)
        {
            // Specify the path of the Word document
            string filePath = @"C:\Documents\Sample.docx";
            // Create a Document object
            using (Document document = new Document())
            {
                // Load the Word .docx or .doc document
                document.LoadFromFile(filePath);
            }
        }
    }
}
This code loads a Word file from the specified path into a Document object, which is the entry point for accessing all document elements.
Read and Extract Content from Word Document in C#
After loading the Word document into a Document object, you can access its contents programmatically. This section covers various methods for extracting different types of content effectively.
Extract Text
Extracting text is often the first step in reading Word documents. You can retrieve all text content using the built-in GetText() method:
// StreamWriter and Encoding require: using System.IO; using System.Text;
using (StreamWriter writer = new StreamWriter("ExtractedText.txt", false, Encoding.UTF8))
{
    // Get all text from the document
    string allText = document.GetText();
    // Write the entire text to a file
    writer.Write(allText);
}
This method extracts all text, disregarding formatting and non-text elements like images.

Read Paragraphs and Formatting Information
When working with Word documents, it is often useful not only to access the text content of paragraphs but also to understand how each paragraph is formatted. This includes details such as alignment and spacing after the paragraph, which can affect layout and readability.
The following example demonstrates how to iterate through all paragraphs in a Word document and retrieve their text content and paragraph-level formatting in C#:
using (StreamWriter writer = new StreamWriter("Paragraphs.txt", false, Encoding.UTF8))
{
    // Loop through all sections
    foreach (Section section in document.Sections)
    {
        // Loop through all paragraphs in the section
        foreach (Paragraph paragraph in section.Paragraphs)
        {
            // Get paragraph alignment
            HorizontalAlignment alignment = paragraph.Format.HorizontalAlignment;
            // Get spacing after paragraph
            float afterSpacing = paragraph.Format.AfterSpacing;
            // Write paragraph formatting and text to the file
            writer.WriteLine($"[Alignment: {alignment}, AfterSpacing: {afterSpacing}]");
            writer.WriteLine(paragraph.Text);
            writer.WriteLine(); // Add empty line between paragraphs
        }
    }
}
This approach allows you to extract both the text and key paragraph formatting attributes, which can be useful for tasks such as document analysis, conditional processing, or preserving layout when exporting content.
Extract Images
Images embedded within Word documents play a vital role in conveying information. To extract these images, you will examine each paragraph's content, identify images (typically represented as DocPicture objects), and save them for further use:
// Create the folder if it does not exist
string imageFolder = "ExtractedImages";
if (!Directory.Exists(imageFolder))
    Directory.CreateDirectory(imageFolder);
int imageIndex = 1;
// Loop through sections and paragraphs to find images
foreach (Section section in document.Sections)
{
    foreach (Paragraph paragraph in section.Paragraphs)
    {
        foreach (DocumentObject obj in paragraph.ChildObjects)
        {
            if (obj is DocPicture picture)
            {
                // Save each image as a separate PNG file
                string fileName = Path.Combine(imageFolder, $"Image_{imageIndex}.png");
                picture.Image.Save(fileName, System.Drawing.Imaging.ImageFormat.Png);
                imageIndex++;
            }
        }
    }
}
This code saves all images in the document as separate PNG files, with options to choose other formats like JPEG or BMP.

Extract Table Data
Tables are commonly used to organize structured data, such as financial reports or survey results. To access this data, iterate through the tables in each section and retrieve the content of individual cells:
// Create a folder to store tables
string tableDir = "Tables";
if (!Directory.Exists(tableDir))
    Directory.CreateDirectory(tableDir);
// Loop through each section
for (int sectionIndex = 0; sectionIndex < document.Sections.Count; sectionIndex++)
{
    Section section = document.Sections[sectionIndex];
    TableCollection tables = section.Tables;
    // Loop through all tables in the section
    for (int tableIndex = 0; tableIndex < tables.Count; tableIndex++)
    {
        ITable table = tables[tableIndex];
        string fileName = Path.Combine(tableDir, $"Section{sectionIndex + 1}_Table{tableIndex + 1}.txt");
        using (StreamWriter writer = new StreamWriter(fileName, false, Encoding.UTF8))
        {
            // Loop through each row
            for (int rowIndex = 0; rowIndex < table.Rows.Count; rowIndex++)
            {
                TableRow row = table.Rows[rowIndex];
                // Loop through each cell
                for (int cellIndex = 0; cellIndex < row.Cells.Count; cellIndex++)
                {
                    TableCell cell = row.Cells[cellIndex];
                    // Loop through each paragraph in the cell
                    for (int paraIndex = 0; paraIndex < cell.Paragraphs.Count; paraIndex++)
                    {
                        writer.Write(cell.Paragraphs[paraIndex].Text.Trim() + " ");
                    }
                    // Add tab between cells
                    if (cellIndex < row.Cells.Count - 1) writer.Write("\t");
                }
                // Add newline after each row
                writer.WriteLine();
            }
        }
    }
}
This method allows efficient extraction of structured data, making it ideal for generating reports or integrating content into databases.

Read Comments
Comments are valuable for collaboration and feedback within documents. Extracting them is crucial for auditing and understanding the document's revision history.
The Document object provides a Comments collection, which allows you to access all comments in a Word document. Each comment contains one or more paragraphs, and you can extract their text for further processing or save them into a file.
using (StreamWriter writer = new StreamWriter("Comments.txt", false, Encoding.UTF8))
{
    // Loop through all comments in the document
    foreach (Comment comment in document.Comments)
    {
        // Loop through each paragraph in the comment
        foreach (Paragraph p in comment.Body.Paragraphs)
        {
            writer.WriteLine(p.Text);
        }
        // Add empty line to separate different comments
        writer.WriteLine();
    }
}
This code retrieves the content of all comments and outputs it into a single text file.
Retrieve Document Metadata
Word documents contain metadata such as the title, author, and subject. These metadata items are stored as document properties, which can be accessed through the BuiltinDocumentProperties property of the Document object:
using (StreamWriter writer = new StreamWriter("Metadata.txt", false, Encoding.UTF8))
{
    // Write built-in document properties to file
    writer.WriteLine("Title: " + document.BuiltinDocumentProperties.Title);
    writer.WriteLine("Author: " + document.BuiltinDocumentProperties.Author);
    writer.WriteLine("Subject: " + document.BuiltinDocumentProperties.Subject);
}
Read Headers and Footers
Headers and footers frequently contain essential content like page numbers and titles. To programmatically access this information, iterate through each section's header and footer paragraphs and retrieve the text of each paragraph:
using (StreamWriter writer = new StreamWriter("HeadersFooters.txt", false, Encoding.UTF8))
{
    // Loop through all sections
    foreach (Section section in document.Sections)
    {
        // Write header paragraphs
        foreach (Paragraph headerParagraph in section.HeadersFooters.Header.Paragraphs)
        {
            writer.WriteLine("Header: " + headerParagraph.Text);
        }
        // Write footer paragraphs
        foreach (Paragraph footerParagraph in section.HeadersFooters.Footer.Paragraphs)
        {
            writer.WriteLine("Footer: " + footerParagraph.Text);
        }
    }
}
This method ensures that all recurring content is accurately captured during document processing.
Advanced Tips and Best Practices for Reading Word Documents in C#
To get the most out of programmatically reading Word documents, following these tips can help improve efficiency, reliability, and code maintainability:
- Use using Statements: Always wrap Document objects in a using statement so their resources are released as soon as processing finishes.
- Check for Null or Empty Sections: Prevent errors by verifying sections, paragraphs, tables, or images exist before accessing them.
- Batch Reading Multiple Documents: Loop through a folder of Word files and apply the same extraction logic to each file. This helps automate workflows and consolidate extracted content efficiently.
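The batch-reading tip can be sketched as a small folder loop. The version below is written in Python for brevity and is only an illustration: `batch_extract` and its `extract` callback are hypothetical names, and the callback stands in for whichever extraction logic from this guide you want to reuse. In C#, `Directory.GetFiles` would play the role of the folder scan.

```python
from pathlib import Path
from typing import Callable, Dict

# File extensions treated as Word documents in this sketch
WORD_EXTENSIONS = {".doc", ".docx"}

def batch_extract(folder: str, extract: Callable[[Path], str]) -> Dict[str, str]:
    """Apply `extract` to every Word file in `folder`.

    Returns a mapping of file name to extracted text, or to an error note
    when a single file fails, so one bad document does not stop the batch.
    """
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip sub-folders and files that are not .doc/.docx
        if not path.is_file() or path.suffix.lower() not in WORD_EXTENSIONS:
            continue
        try:
            results[path.name] = extract(path)
        except Exception as exc:
            results[path.name] = f"ERROR: {exc}"
    return results
```

Recording failures per file, rather than letting the first exception propagate, keeps long batch runs useful even when a few documents are corrupt.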
Conclusion
Efficiently reading Word documents programmatically in C# involves handling various content types. With the techniques outlined in this guide, developers can:
- Load Word documents (.doc and .docx) with ease.
- Extract text, paragraphs, and formatting details for thorough analysis.
- Retrieve images, structured table data, and comments.
- Access headers, footers, and document metadata for complete insights.
FAQs
Q1: Can I read Word documents without installing Microsoft Word?
A1: Yes, libraries like Spire.Doc enable you to read and process Word files without requiring Microsoft Word installation.
Q2: Does this support both .doc and .docx formats?
A2: Absolutely, all methods discussed in this guide work seamlessly with both legacy (.doc) and modern (.docx) Word files.
Q3: Can I extract only specific sections of a document?
A3: Yes, by iterating through sections and paragraphs, you can selectively filter and extract the desired content.
Convert HTML to PDF in Python: Complete Guide with Code

Converting HTML to PDF in Python is a common need when you want to generate printable reports, preserve web content, or create offline documentation with consistent formatting. In this tutorial, you’ll learn how to convert HTML to PDF in Python, whether you're working with a local HTML file or an HTML string. If you're looking for a simple and reliable way to generate PDF files from HTML in Python, this guide is for you.
Install Spire.Doc to Convert HTML to PDF Easily
To convert HTML to PDF in Python, you’ll need a reliable library that supports HTML parsing and PDF rendering. Spire.Doc for Python is a powerful and easy-to-use HTML to PDF converter library that lets you generate PDF documents from HTML content — without relying on a browser, headless engine, or third-party tools.
Install via pip
You can install the library quickly with pip:
pip install spire.doc
Alternative: Manual Installation
You can also download the Spire.Doc package and perform a custom installation if you need more control over the environment.
Tip: Spire.Doc offers a free version suitable for small projects or evaluation purposes.
Once installed, you're ready to convert HTML to PDF in Python in just a few lines of code.
Convert HTML Files to PDF in Python
Spire.Doc for Python makes it easy to convert HTML files to PDF. The Document.LoadFromFile() method supports loading various file formats, including .html, .doc, and .docx. After loading an HTML file, you can convert it to PDF by calling Document.SaveToFile() method. Follow the steps below to convert an HTML file to PDF in Python using Spire.Doc.
Steps to convert an HTML file to PDF in Python:
- Create a Document object.
- Load an HTML file using Document.LoadFromFile() method.
- Convert it to PDF using Document.SaveToFile() method.
The following code shows how to convert an HTML file directly to PDF in Python:
from spire.doc import *
from spire.doc.common import *
# Create a Document object
document = Document()
# Load an HTML file
document.LoadFromFile("Sample.html", FileFormat.Html, XHTMLValidationType.none)
# Save the HTML file to a pdf file
document.SaveToFile("output/ToPdf.pdf", FileFormat.PDF)
document.Close()
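If you convert several files, it can help to wrap the steps above in a function that also creates the output folder and always closes the document. The following is a minimal sketch, not an official recipe: `pdf_path_for` and `convert_html_file` are hypothetical helper names, and the Spire.Doc calls are the same ones used in the example above.

```python
# Spire.Doc is imported the same way as in the example above; the guard lets
# the pure path helper be tried even where the library is not installed.
try:
    from spire.doc import *
    from spire.doc.common import *
except ImportError:
    pass

from pathlib import Path

def pdf_path_for(html_path: str, out_dir: str = "output") -> Path:
    """Return out_dir/<html file name without extension>.pdf (no file access)."""
    return Path(out_dir) / (Path(html_path).stem + ".pdf")

def convert_html_file(html_path: str, out_dir: str = "output") -> Path:
    """Convert one HTML file to PDF and return the path of the PDF (needs Spire.Doc)."""
    target = pdf_path_for(html_path, out_dir)
    # Create the output folder if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    document = Document()
    try:
        document.LoadFromFile(html_path, FileFormat.Html, XHTMLValidationType.none)
        document.SaveToFile(str(target), FileFormat.PDF)
    finally:
        # Close the document even if loading or saving fails
        document.Close()
    return target
```

For example, `convert_html_file("Sample.html")` would be expected to write `output/Sample.pdf`.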

Convert an HTML String to PDF in Python
If you want to convert an HTML string to PDF in Python, Spire.Doc for Python provides a straightforward solution. For simple HTML content like paragraphs, text styles, and basic formatting, you can use the Paragraph.AppendHTML() method to insert the HTML into a Word document. Once added, you can save the document as a PDF using the Document.SaveToFile() method.
Here are the steps to convert an HTML string to a PDF file in Python.
- Create a Document object.
- Add a section using Document.AddSection() method and insert a paragraph using Section.AddParagraph() method.
- Specify the HTML string and add it to the paragraph using Paragraph.AppendHTML() method.
- Save the document as a PDF file using Document.SaveToFile() method.
Here's the complete Python code that shows how to convert an HTML string to a PDF:
from spire.doc import *
from spire.doc.common import *
# Create a Document object
document = Document()
# Add a section to the document
sec = document.AddSection()
# Add a paragraph to the section
paragraph = sec.AddParagraph()
# Specify the HTML string
htmlString = """
<html>
<head>
<title>HTML to Word Example</title>
<style>
body {
font-family: Arial, sans-serif;
}
h1 {
color: #FF5733;
font-size: 24px;
margin-bottom: 20px;
}
p {
color: #333333;
font-size: 16px;
margin-bottom: 10px;
}
ul {
list-style-type: disc;
margin-left: 20px;
margin-bottom: 15px;
}
li {
font-size: 14px;
margin-bottom: 5px;
}
table {
border-collapse: collapse;
width: 100%;
margin-bottom: 20px;
}
th, td {
border: 1px solid #CCCCCC;
padding: 8px;
text-align: left;
}
th {
background-color: #F2F2F2;
font-weight: bold;
}
td {
color: #0000FF;
}
</style>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
<p>Here's an unordered list:</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
<p>And here's a table:</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>Gender</th>
</tr>
<tr>
<td>John Smith</td>
<td>35</td>
<td>Male</td>
</tr>
<tr>
<td>Jenny Garcia</td>
<td>27</td>
<td>Female</td>
</tr>
</table>
</body>
</html>
"""
# Append the HTML string to the paragraph
paragraph.AppendHTML(htmlString)
# Save the document as a pdf file
document.SaveToFile("output/HtmlStringToPdf.pdf", FileFormat.PDF)
document.Close()

Customize the Conversion from HTML to PDF in Python
While converting HTML to PDF in Python is often straightforward, there are times when you need more control over the output. For example, you may want to set a password to protect the PDF document, or embed fonts to ensure consistent formatting across different devices. In this section, you’ll learn how to customize the HTML to PDF conversion using Spire.Doc for Python.
1. Set a Password to Protect the PDF
To prevent unauthorized viewing or editing, you can encrypt the PDF by specifying a user password and an owner password.
# Create a ToPdfParameterList object
toPdf = ToPdfParameterList()
# Set PDF encryption passwords
userPassword = "viewer"
ownerPassword = "E-iceblue"
toPdf.PdfSecurity.Encrypt(userPassword, ownerPassword, PdfPermissionsFlags.Default, PdfEncryptionKeySize.Key128Bit)
# Save as PDF with password protection
document.SaveToFile("output/HtmlToPdfWithPassword.pdf", toPdf)
2. Embed Fonts to Preserve Formatting
To ensure the PDF displays correctly across all devices, you can embed all fonts used in the document.
# Create a ToPdfParameterList object
ppl = ToPdfParameterList()
ppl.IsEmbeddedAllFonts = True
# Save as PDF with embedded fonts
document.SaveToFile("output/HtmlToPdfWithEmbeddedFonts.pdf", ppl)
These options give you finer control when you convert HTML to PDF in Python, especially for professional document sharing or long-term storage scenarios.
Conclusion
Converting HTML to PDF in Python becomes simple and flexible with Spire.Doc for Python. Whether you're handling static HTML files or dynamic HTML strings, or need to secure and customize your PDFs, this library provides everything you need — all in just a few lines of code. Get a free 30-day license and start converting HTML to high-quality PDF documents in Python today!
FAQs
Q1: Can I convert an HTML file to PDF in Python?
A: Yes. Using Spire.Doc for Python, you can convert a local HTML file to PDF with just a few lines of code.
Q2: How do I convert HTML to PDF in Chrome?
A: While Chrome allows manual "Save as PDF", it’s not suitable for batch or automated workflows. If you're working in Python, Spire.Doc provides a better solution for programmatically converting HTML to PDF.
Q3: How do I convert HTML to PDF without losing formatting?
A: To preserve formatting:
- Use embedded or inline CSS (not external files).
- Use absolute URLs for images and resources.
- Embed fonts by setting the IsEmbeddedAllFonts option of ToPdfParameterList to True.
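The advice about absolute URLs can be applied automatically before conversion. The snippet below is a rough sketch that uses only the Python standard library: `absolutize_urls` is a hypothetical helper, and its regular expression handles only quoted `src` and `href` attributes, so treat it as a starting point rather than a full HTML rewriter.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href='...' (quoted attribute values only)
_URL_ATTR = re.compile(r'(\b(?:src|href)\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)

def absolutize_urls(html: str, base_url: str) -> str:
    """Rewrite relative src/href values in `html` against `base_url`."""
    def replace(match: re.Match) -> str:
        prefix, quote, url = match.group(1), match.group(2), match.group(3)
        # urljoin leaves URLs that are already absolute unchanged
        return f"{prefix}{quote}{urljoin(base_url, url)}{quote}"
    return _URL_ATTR.sub(replace, html)

html = '<img src="img/logo.png">'
print(absolutize_urls(html, "https://example.com/docs/"))
# prints: <img src="https://example.com/docs/img/logo.png">
```

The rewritten string can then be passed to Paragraph.AppendHTML() or saved to a file and loaded as shown earlier.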
How to Convert PDF to Text in Python (Free & Easy Guide)
Converting PDF files to editable text is a common need for researchers, analysts, and professionals who deal with large volumes of documents. Manual copying wastes time—Python offers a faster, more flexible solution. In this guide, you’ll learn how to convert PDF to text in Python efficiently, whether you want to keep the layout or extract specific content.

- Why Choose Spire.PDF for PDF to Text
- General Workflow for PDF to Text in Python
- Convert PDF to Text in Python Without Layout
- Convert PDF to Text in Python With Layout
- Convert a Specific PDF Page to Text
- To Wrap Up
- FAQs
Getting Started: Why Choose Spire.PDF for PDF to Text in Python
To convert PDF files to text using Python, you’ll need a reliable PDF processing library. Spire.PDF for Python is a powerful and developer-friendly API that allows you to read, edit, and convert PDF documents in Python applications — no need for Adobe Acrobat or other third-party software.
This library is ideal for automating PDF workflows such as extracting text, adding annotations, or merging and splitting files. It supports a wide range of PDF features and works seamlessly in both desktop and server environments. You can download it to install manually or quickly install Spire.PDF via PyPI using the following command:
pip install Spire.PDF
For smaller or personal projects, a free version is available with basic functionality. If you need advanced features such as PDF signing or form filling, you can upgrade to the commercial edition at any time.
General Workflow for PDF to Text in Python
Converting a PDF to text becomes simple and efficient with the help of Spire.PDF for Python. You can easily complete the task by reusing the sample code provided in the following sections and customizing it to fit your needs. But before diving into the code, let’s take a quick look at the general workflow behind this process.
- Create an object of PdfDocument class and load a PDF file using LoadFromFile() method.
- Create an object of PdfTextExtractOptions class and set the text extracting options, including extracting all text, showing hidden text, only extracting text in a specified area, and simple extraction.
- Get each page in the document using the PdfDocument.Pages.get_Item() method, create a PdfTextExtractor object for that page, and extract its text using the ExtractText() method with the specified options.
- Save the extracted text as a text file and close the object.
How to Convert PDF to Text in Python Without Layout
If you only need the plain text content from a PDF and don’t care about preserving the original layout, you can use a simple method to extract text. This approach is faster and easier, especially when working with large batches of text-based files. In this section, we’ll show you how to convert PDF to text in Python without preserving the layout.
To extract text without preserving layout, follow these simplified steps:
- Create an instance of PdfDocument and load the PDF file.
- Create a PdfTextExtractOptions object and configure the text extraction options.
- Set IsSimpleExtraction = True to ignore the layout and extract raw text.
- Loop through all pages of the PDF.
- Extract text from each page and write it to a .txt file.
from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor
# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")
# Create a string object to store the text
extracted_text = ""
# Create an object of PdfTextExtractOptions
extract_options = PdfTextExtractOptions()
# Set to use simple extraction method
extract_options.IsSimpleExtraction = True
# Loop through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Create an object of PdfTextExtractor, passing the page as a parameter
    text_extractor = PdfTextExtractor(page)
    # Extract the text from the page
    text = text_extractor.ExtractText(extract_options)
    # Add the extracted text to the string object
    extracted_text += text
# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w", encoding="utf-8") as file:
    file.write(extracted_text)
pdf.Close()
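The loop above appends each page's text directly to one string, so page boundaries are lost in the output file. As an optional refinement, you could collect the per-page strings in a list and join them with a marker. The sketch below uses only the standard library; `join_pages` and `save_text` are hypothetical helper names, and the marker format is an arbitrary choice.

```python
from pathlib import Path
from typing import Iterable

def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page text, putting a visible marker line before each page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}\n")
    return "\n".join(parts)

def save_text(text: str, target: str) -> None:
    """Write `text` as UTF-8, creating the parent folder if needed."""
    path = Path(target)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

In the extraction loop, you would append each `text` to a list instead of concatenating, then call `save_text(join_pages(pages), "output/ExtractedText.txt")`. Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the PDF contains non-ASCII characters.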

How to Convert PDF to Text in Python With Layout
To convert PDF to text in Python with layout, Spire.PDF preserves formatting like tables and paragraphs by default. The steps are similar to the general overview, but you still need to loop through each page for full-text extraction.
from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor
# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")
# Create a string object to store the text
extracted_text = ""
# Create an object of PdfTextExtractOptions (default options keep the layout)
extract_options = PdfTextExtractOptions()
# Loop through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Create an object of PdfTextExtractor, passing the page as a parameter
    text_extractor = PdfTextExtractor(page)
    # Extract the text from the page
    text = text_extractor.ExtractText(extract_options)
    # Add the extracted text to the string object
    extracted_text += text
# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w", encoding="utf-8") as file:
    file.write(extracted_text)
pdf.Close()

Convert a Specific PDF Page to Text in Python
Need to extract text from only one page of a PDF instead of the entire document? With Spire.PDF, the PDF to Text converter in Python, you can easily target and convert a specific PDF page to text. The steps are the same as shown in the general overview. The code below reads the first page (index 0) and uses ExtractArea to limit extraction to a rectangular region of that page; remove that setting to extract the whole page.
from spire.pdf import PdfDocument
from spire.pdf import PdfTextExtractOptions
from spire.pdf import PdfTextExtractor
from spire.pdf import RectangleF
# Create an object of PdfDocument class and load a PDF file
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")
# Create an object of PdfTextExtractOptions
extract_options = PdfTextExtractOptions()
# Set to extract specific page area
extract_options.ExtractArea = RectangleF(50.0, 220.0, 700.0, 230.0)
# Get a page
page = pdf.Pages.get_Item(0)
# Create an object of PdfTextExtractor, passing the page as a parameter
text_extractor = PdfTextExtractor(page)
# Extract the text from the page
extracted_text = text_extractor.ExtractText(extract_options)
# Write the extracted text to a text file
with open("output/ExtractedText.txt", "w", encoding="utf-8") as file:
    file.write(extracted_text)
pdf.Close()
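The examples in this guide pass zero-based indexes to get_Item(), so the first page is index 0. If page numbers come from users, who usually count from 1, a small guard avoids off-by-one mistakes. This is a hypothetical helper, not part of Spire.PDF:

```python
def to_page_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number into a zero-based index, with bounds checking."""
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page {page_number} is out of range (document has {page_count} pages)"
        )
    return page_number - 1
```

With the document loaded as above, `pdf.Pages.get_Item(to_page_index(3, pdf.Pages.Count))` would select the third page.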

To Wrap Up
In this post, we covered how to convert PDF to text using Python and Spire.PDF, with clear steps and code examples for fast, efficient conversion. We also highlighted the benefits and pointed to OCR tools for image-based PDFs. For any issues or support, feel free to contact us.
FAQs about Converting PDF to Text
Q1: How do I convert a PDF to readable and editable text in Python?
A: You can convert a PDF to text in Python using the Spire.PDF library. It allows you to extract text from PDF files while optionally keeping the original layout, and you don’t need Adobe Acrobat. Image-based (scanned) PDFs have no text layer, so run them through an OCR tool first.
Q2: Is there a free tool to convert PDF to text?
A: Yes. Spire.PDF for Python provides a free edition that allows you to convert PDF to text without relying on Adobe Acrobat or other software. Online tools are also available, but they’re more suitable for occasional use or small files.
Q3: Can Python extract data from PDF?
A: Yes, Python can extract data from PDF files. Using Spire.PDF, you can easily extract not only text but also other elements such as images, annotations, bookmarks, and even attachments. This makes it a versatile tool for working with PDF content in Python.