Parse HTML in Python: Read Strings, Files & Web URLs

2025-09-24 02:04:45 Written by zaki zou

Parse HTML from Strings, Files, and URLs using Python

When it comes to working with web content and documents, the ability to parse HTML in Python is an essential skill for developers across various domains. HTML parsing involves extracting meaningful information from HTML documents, manipulating content, and processing web data efficiently. Whether you're working on web scraping projects, data extraction tasks, content analysis, or document processing, mastering HTML parsing techniques in Python can significantly enhance your productivity and capabilities.

In this guide, we'll explore how to effectively read HTML in Python using Spire.Doc for Python. You'll learn practical techniques for processing HTML content from strings, local files, and URLs, and implementing best practices for HTML parsing in your projects.

Why Parse HTML in Python?
Getting Started: Install HTML Parser in Python
How Spire.Doc Parses HTML: Core Concepts
Best Practices for Effective HTML Parsing
Conclusion

Why Parse HTML in Python?

HTML (HyperText Markup Language) is the backbone of the web, used to structure and present content on websites. Parsing HTML enables you to:

Extract specific data (text, images, tables, hyperlinks) from web pages or local files.
Analyze content structure for trends, keywords, or patterns.
Automate data collection for research, reporting, or content management.
Clean and process messy HTML into structured data.

While libraries like BeautifulSoup excel at lightweight parsing, Spire.Doc for Python shines when you need to integrate HTML parsing with document creation or conversion. It offers a robust framework to parse and interact with HTML content as a structured document object model (DOM).

Getting Started: Install HTML Parser in Python

Before diving into parsing, you’ll need to install Spire.Doc for Python. The library is available via PyPI, making installation straightforward:

pip install Spire.Doc

This command installs the latest version of the library, along with its dependencies. Once installed, you’re ready to start parsing HTML.
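A quick way to confirm the installation is to ask Python whether the package can be found. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and `spire.doc` is the import name used by the examples in this guide.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located on the import path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "spire") is itself missing
        return False


if __name__ == "__main__":
    if is_installed("spire.doc"):
        print("Spire.Doc is available.")
    else:
        print("Spire.Doc not found. Run: pip install Spire.Doc")
```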

How Spire.Doc Parses HTML: Core Concepts

At its core, Spire.Doc parses HTML by translating HTML’s tag-based structure into a hierarchical document model. This model is composed of objects that represent sections, paragraphs, and other elements, mirroring the original HTML’s organization. Let’s explore how this works in practice.

1. Parsing HTML Strings in Python

If you have a small HTML snippet (e.g., from an API response or user input), parse it directly from a string. This is great for testing or working with short, static HTML.

from spire.doc import *
from spire.doc.common import *

# Define HTML content as a string
html_string = """
<html>
    <head>
        <title>Sample HTML</title>
    </head>
    <body>
        <h1>Main Heading</h1>
        <p>This is a paragraph with <strong>bold text</strong>.</p>
        <div>
            <p>A nested paragraph inside a div.</p>
        </div>
        <ul>
          <li>List item 1</li>
          <li>List item 2</li>
          <li>List item 3</li>
        </ul>
    </body>
</html>
"""

# Initialize a new Document object
doc = Document()

# Add a section and paragraph to the document
section = doc.AddSection()
paragraph = section.AddParagraph()

# Load HTML content from the string
paragraph.AppendHTML(html_string)

print("Parsed HTML Text:")
print("-----------------------------")

# Extract text content from the parsed HTML
parsed_text = doc.GetText()

# Print the result
print(parsed_text)

# Close the document
doc.Close()

How It Works:

HTML String: We define a sample HTML snippet with common elements (headings, paragraphs, lists).
Document Setup: Spire.Doc uses a Word-like structure (sections → paragraphs) to organize parsed HTML.
Parse HTML: AppendHTML() converts the string into structured Word elements (e.g., <h1> becomes a "Heading 1" style, <ul> becomes a list).
Extract Text: GetText() pulls clean, plain text from the parsed document (no HTML tags).

Output:

Parse an HTML string using Python

Spire.Doc supports exporting parsed HTML content to other formats, such as TXT and Word, via the SaveToFile() method.
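As a rough sketch of such an export, the helper below picks a FileFormat member from the output file's extension and saves the parsed HTML string accordingly. The names `FORMAT_BY_EXTENSION`, `format_name_for`, and `export_html_string` are hypothetical. FileFormat.Txt, FileFormat.Html, and FileFormat.Markdown appear elsewhere in this guide; FileFormat.Docx is an assumption about the library's enum and should be checked against the installed version.

```python
import os

# Map output extensions to FileFormat member names.
# Txt, Html and Markdown appear in this guide; Docx is an assumption.
FORMAT_BY_EXTENSION = {
    ".txt": "Txt",
    ".docx": "Docx",
    ".md": "Markdown",
    ".html": "Html",
}


def format_name_for(path: str) -> str:
    """Pick a FileFormat member name from the output file's extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in FORMAT_BY_EXTENSION:
        raise ValueError(f"Unsupported output extension: {ext!r}")
    return FORMAT_BY_EXTENSION[ext]


def export_html_string(html: str, out_path: str) -> None:
    """Parse an HTML string and save it in the format implied by out_path."""
    # Imported lazily so format_name_for() works without Spire.Doc installed
    from spire.doc import Document, FileFormat

    doc = Document()
    try:
        doc.AddSection().AddParagraph().AppendHTML(html)
        doc.SaveToFile(out_path, getattr(FileFormat, format_name_for(out_path)))
    finally:
        # Release resources even if parsing or saving fails
        doc.Close()
```

Calling `export_html_string(html_string, "parsed.txt")` would then write a plain-text copy of the parsed content.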

2. Parsing HTML Files in Python

For local HTML files, Spire.Doc can load and parse them with a single method. This is useful for offline content (e.g., downloaded web pages, static reports).

from spire.doc import *
from spire.doc.common import *

# Define the path to your local HTML file
html_file_path = "example.html"

# Create a Document instance
doc = Document()

# Load and parse the HTML file
doc.LoadFromFile(html_file_path, FileFormat.Html)

# Analyze document structure
print(f"Document contains {doc.Sections.Count} section(s)")
print("-"*40)

# Process each section
for section_idx in range(doc.Sections.Count):
    section = doc.Sections[section_idx]
    print(f"SECTION {section_idx + 1}")
    print(f"Section has {section.Body.Paragraphs.Count} paragraph(s)")
    print("-"*40)
    
    # Traverse through paragraphs in the current section
    for para_idx in range(section.Paragraphs.Count):
        para = section.Paragraphs[para_idx]
        # Get paragraph style name and text content
        style_name = para.StyleName
        para_text = para.Text
        
        # Print paragraph information if content exists
        if para_text.strip():
            print(f"[{style_name}] {para_text}\n")
            
    # Add spacing between sections
    print()

# Close the document
doc.Close()

Key Features:

Load Local Files: LoadFromFile() reads the HTML file and auto-parses it into a Word structure.
Structure Analysis: Check the number of sections/paragraphs and their styles (critical for auditing content).
Style Filtering: Identify headings (e.g., "Heading 1") or lists (e.g., "List Paragraph") to organize content.
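Style filtering can be sketched without the library itself once the style names and text are collected. The example below assumes you have gathered `(para.StyleName, para.Text)` pairs in a loop like the one above and that heading styles are named "Heading 1", "Heading 2", and so on; `build_outline` is a hypothetical helper.

```python
import re


def build_outline(paragraphs):
    """Turn (style_name, text) pairs into an indented outline of headings."""
    lines = []
    for style_name, text in paragraphs:
        match = re.fullmatch(r"Heading (\d+)", style_name.strip())
        text = text.strip()
        if match and text:
            level = int(match.group(1))
            # Indent two spaces per heading level below the first
            lines.append("  " * (level - 1) + text)
    return lines


# Sample data standing in for values read from a parsed document
sample = [
    ("Heading 1", "Main Heading"),
    ("Normal", "This is a paragraph."),
    ("Heading 2", "Details"),
    ("List Paragraph", "List item 1"),
]
print("\n".join(build_outline(sample)))
```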

Output:

Parse a local HTML file with Python

After loading the HTML file into the Document object, you can use Spire.Doc to extract specific elements, such as tables and hyperlinks, from the HTML.

3. Parsing a URL in Python

To parse HTML directly from a live web page, first fetch the HTML content from the URL using a library like requests, then pass the content to Spire.Doc for parsing. This is core for web scraping and real-time data extraction.

Install the Requests library via pip:

pip install requests

Python code to parse web page:

from spire.doc import *
from spire.doc.common import *
import requests 

# Fetch html content from a URL
def fetch_html_from_url(url):
    """Fetch HTML from a URL and handle errors (e.g., 404, network issues)"""
    # Mimic a browser with User-Agent (avoids being blocked by websites)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    try:
        # A timeout keeps the script from hanging on slow or unresponsive hosts
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise exception for HTTP errors
        return response.text  # Return raw HTML content
    except requests.exceptions.RequestException as e:
        raise RuntimeError(f"Error fetching HTML: {e}") from e

# Specify the target URL
url = "https://www.e-iceblue.com/privacypolicy.html"
print(f"Fetching HTML from: {url}")
     
# Get HTML content
html_content = fetch_html_from_url(url)
     
# Create document and insert HTML content into it
doc = Document()
section = doc.AddSection()
paragraph = section.AddParagraph()
paragraph.AppendHTML(html_content)
     
# Extract and display summary information
print("\nParsed Content Summary:")
print(f"Sections: {doc.Sections.Count}")
print("-------------------------------------------")
     
# Extract and display headings
print("Headings found:")
for para_idx in range(section.Paragraphs.Count):
    para = section.Paragraphs[para_idx]

    if isinstance(para, Paragraph) and para.StyleName.startswith("Heading"):
        print(f"- {para.Text.strip()}")

# Close the document
doc.Close()

Steps Explained:

Use requests.get() to fetch the HTML content from the URL.
Pass the raw HTML text to Spire.Doc for parsing.
Extract specific content (e.g., headings) from live pages for SEO audits or content aggregation.
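Before the fetch step, it can help to reject input that is not an absolute http or https URL. The sketch below uses only the standard library; `validate_url` is a hypothetical helper, not part of Requests or Spire.Doc.

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Return the trimmed URL if it is an absolute http(s) URL, else raise."""
    url = url.strip()
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {parsed.scheme or '(none)'}")
    if not parsed.netloc:
        raise ValueError("URL has no host name")
    return url
```

The result can be passed straight to the fetch function, for example `fetch_html_from_url(validate_url(url))`.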

Output:

Parse HTML from a web URL using Python

Best Practices for Effective HTML Parsing

To optimize your HTML parsing workflow with Spire.Doc, follow these best practices:

Validate Input Sources: Before parsing, check that HTML content (strings or files) is accessible and not corrupted. This reduces parsing errors:

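import os
from spire.doc import *

html_file = "data.html"
doc = Document()

# Only attempt to parse when the file actually exists
if os.path.exists(html_file):
    doc.LoadFromFile(html_file, FileFormat.Html)
else:
    print(f"Error: File '{html_file}' not found.")

doc.Close()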

Handle Exceptions: Wrap parsing operations in try-except blocks to catch errors (e.g., missing files, invalid HTML):

try:
    doc.LoadFromFile("sample.html", FileFormat.Html)
except Exception as e:
    print(f"Error loading HTML: {e}")

Optimize for Large Files: For large HTML files, consider loading content in chunks or disabling non-essential parsing features to improve performance.
Clean Extracted Data: Use Python’s string methods (e.g., strip(), replace()) to remove extra whitespace or unwanted characters from extracted text.
Keep the Library Updated: Regularly update Spire.Doc with pip install --upgrade Spire.Doc to benefit from improved parsing logic and bug fixes.
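The data-cleaning step can be sketched with the standard library alone. The example assumes the input is text returned by a call such as GetText(); `clean_text` is a hypothetical helper, and the exact whitespace the library emits may differ from the sample here.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize whitespace in text extracted from a parsed document."""
    # Unify line endings and replace non-breaking spaces
    text = raw.replace("\r\n", "\n").replace("\r", "\n").replace("\u00a0", " ")
    # Collapse runs of spaces/tabs and trim each line
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Allow at most one blank line between blocks of text
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip()
```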

Conclusion

Python makes HTML parsing accessible for all skill levels. Whether you’re working with HTML strings, local files, or remote URLs, the combination of Requests (for fetching) and Spire.Doc (for structuring) simplifies complex tasks like web scraping and content extraction.

By following the examples and best practices in this guide, you’ll turn unstructured HTML into actionable, organized data in minutes. To unlock the full potential of Spire.Doc for Python, you can request a 30-day trial license here.

Published in Document Operation


How to Convert HTML to Markdown in Python: Step-by-Step Guide

2025-09-05 03:49:00 Written by Administrator

Python Guide to Export HTML to Markdown

Converting HTML to Markdown using Python is a common task for developers managing web content, documentation, or API data. While HTML provides powerful formatting and structure, it can be verbose and harder to maintain for tasks like technical writing or static site generation. Markdown, by contrast, is lightweight, human-readable, and compatible with platforms such as GitHub, GitLab, Jekyll, and Hugo.

Automating HTML to Markdown conversion with Python streamlines workflows, reduces errors, and ensures consistent output. This guide covers everything from converting HTML files and strings to batch processing multiple files, along with best practices to ensure accurate Markdown results.

What You Will Learn

Why Convert HTML to Markdown
Install HTML to Markdown Library for Python
Convert an HTML File to Markdown in Python
Convert an HTML String to Markdown in Python
Batch Conversion of Multiple HTML Files
Best Practices for HTML to Markdown Conversion
Conclusion
FAQs

Why Convert HTML to Markdown?

Before diving into the code, let’s look at why developers often prefer Markdown over raw HTML in many workflows:

Simplicity and Readability
Markdown is easier to read and edit than verbose HTML tags.
Portability Across Tools
Markdown is supported by GitHub, GitLab, Bitbucket, Obsidian, Notion, and static site generators like Hugo and Jekyll.
Better for Version Control
Being plain text, Markdown makes it easier to track changes with Git, review diffs, and collaborate.
Faster Content Creation
Writing Markdown is quicker than remembering HTML tag structures.
Integration with Static Site Generators
Popular frameworks rely on Markdown as the main content format. Converting from HTML ensures smooth migration.
Cleaner Documentation Workflows
Many documentation systems and wikis use Markdown as their primary format.

In short, converting HTML to Markdown improves maintainability, reduces clutter, and fits seamlessly into modern developer workflows.

Install HTML to Markdown Library for Python

Before converting HTML content to Markdown in Python, you’ll need a library that can handle both formats effectively. Spire.Doc for Python is a reliable choice that allows you to transform HTML files or HTML strings into Markdown while keeping headings, lists, images, and links intact.

You can install it from PyPI using pip:

pip install spire.doc

Once installed, you can automate the HTML to Markdown conversion in your Python scripts. The same library also supports broader scenarios. For example, when you need editable documents, you can rely on its HTML to Word conversion feature to transform web pages into Word files. And for distribution or archiving, HTML to PDF conversion is especially useful for generating standardized, platform-independent documents.

Convert an HTML File to Markdown in Python

One of the most common use cases is converting an existing .html file into a .md file. This is especially useful when migrating old websites, technical documentation, or blog posts into Markdown-based workflows, such as static site generators (Jekyll, Hugo) or Git-based documentation platforms (GitHub, GitLab, Read the Docs).

Steps

Create a new Document instance.
Load the .html file into the document using LoadFromFile().
Save the document as a .md file using SaveToFile() with FileFormat.Markdown.
Close the document to release resources.

Code Example

from spire.doc import *

# Create a Document instance
doc = Document()

# Load an existing HTML file
doc.LoadFromFile("input.html", FileFormat.Html)

# Save as Markdown file
doc.SaveToFile("output.md", FileFormat.Markdown)

# Close the document
doc.Close()

This converts input.html into output.md, preserving structural elements such as headings, paragraphs, lists, links, and images.

Python Example to Convert HTML File to Markdown

If you’re also interested in the reverse process, check out our guide on converting Markdown to HTML in Python.

Convert an HTML String to Markdown in Python

Sometimes, HTML content is not stored in a file but is dynamically generated—for example, when retrieving web content from an API or scraping. In these scenarios, you can convert directly from a string without needing to create a temporary HTML file.

Steps

Create a new Document instance.
Add a Section to the document.
Add a Paragraph to the section.
Append the HTML string to the paragraph using AppendHTML().
Save the document as a Markdown file using SaveToFile().
Close the document to release resources.

Code Example

from spire.doc import *

# Sample HTML string
html_content = """
<h1>Welcome</h1>
<p>This is a <strong>sample</strong> paragraph with <em>emphasis</em>.</p>
<ul>
  <li>First item</li>
  <li>Second item</li>
</ul>
"""

# Create a Document instance
doc = Document()

# Add a section
section = doc.AddSection()

# Add a paragraph and append the HTML string
paragraph = section.AddParagraph()
paragraph.AppendHTML(html_content)

# Save the document as Markdown
doc.SaveToFile("string_output.md", FileFormat.Markdown)

# close the document to release resources
doc.Close()

The resulting Markdown will look like this:

Python Example to Convert HTML String to Markdown

Batch Conversion of Multiple HTML Files

For larger projects, you may need to convert multiple .html files in bulk. A simple loop can automate the process.

import os
from spire.doc import *

# Define the folder containing HTML files to convert
input_folder = "html_files"

# Define the folder where converted Markdown files will be saved
output_folder = "markdown_files"

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Loop through all files in the input folder
for filename in os.listdir(input_folder):
    # Process only files with .html extension
    if filename.endswith(".html"):
        # Create a new Document object
        doc = Document()

        # Load the HTML file into the Document object
        doc.LoadFromFile(os.path.join(input_folder, filename), FileFormat.Html)

        # Generate the output file path by swapping the extension for .md
        # (splitext only touches the extension, not ".html" elsewhere in the name)
        output_file = os.path.join(output_folder, os.path.splitext(filename)[0] + ".md")

        # Save the Document as a Markdown file
        doc.SaveToFile(output_file, FileFormat.Markdown)

        # Close the Document to release resources
        doc.Close()

This script processes all .html files inside html_files/ and saves the Markdown results into markdown_files/.

Best Practices for HTML to Markdown Conversion

Turning HTML to Markdown makes content easier to read, manage, and version-control. To ensure accurate and clean conversion, follow these best practices:

Validate HTML Before Conversion
Ensure your HTML is properly structured. Invalid tags can cause incomplete or broken Markdown output.
Understand Markdown Limitations
Markdown does not support advanced CSS styling or custom HTML tags. Some formatting might get lost.
Choose File Encoding Carefully
Always be aware of character encoding. Open and save your files with a specific encoding (like UTF-8) to prevent issues with special characters.
Batch Processing
If converting multiple files, create a robust script that includes error handling (try-except blocks) and logging, and that skips problematic files instead of halting the entire process.
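A batch script along those lines might look like the sketch below. To keep it independent of any particular library, the per-file conversion is passed in as `convert_one`, a hypothetical callable taking a source and a target path; in practice it would wrap the LoadFromFile() and SaveToFile() calls shown earlier.

```python
import logging
import os

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


def batch_convert(input_folder, output_folder, convert_one):
    """Convert every .html file; log and skip failures instead of stopping."""
    os.makedirs(output_folder, exist_ok=True)
    converted, failed = [], []
    for filename in sorted(os.listdir(input_folder)):
        stem, ext = os.path.splitext(filename)
        if ext.lower() != ".html":
            continue
        src = os.path.join(input_folder, filename)
        dst = os.path.join(output_folder, stem + ".md")
        try:
            convert_one(src, dst)
            converted.append(filename)
        except Exception as exc:  # keep going on any per-file error
            logging.warning("Skipping %s: %s", filename, exc)
            failed.append(filename)
    logging.info("%d converted, %d failed", len(converted), len(failed))
    return converted, failed
```

Returning both lists makes it easy to retry or report the failed files after the run.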

Conclusion

Converting HTML to Markdown in Python is a valuable skill for developers handling documentation pipelines, migrating web content, or processing data from APIs. With Spire.Doc for Python, you can:

Convert individual HTML files into Markdown with ease.
Transform HTML strings directly into .md files.
Automate batch conversions to efficiently manage large projects.

By applying these methods, you can streamline your workflows and ensure your content remains clean, maintainable, and ready for modern publishing platforms.

FAQs

Q1: Can I convert Markdown back to HTML in Python?

A1: Yes, Spire.Doc supports the conversion of Markdown to HTML, allowing for seamless transitions between these formats.

Q2: Will the conversion preserve complex HTML elements like tables?

A2: While Spire.Doc effectively handles standard HTML elements, it's advisable to review complex layouts, such as tables and nested elements, to ensure accurate conversion results.

Q3: Can I automate batch conversion for multiple HTML files?

A3: Absolutely! You can automate batch conversion using scripts in Python, enabling efficient processing of multiple HTML files at once.

Q4: Is Spire.Doc free to use?

A4: Spire.Doc provides both free and commercial versions, giving developers the flexibility to access essential features at no cost or unlock advanced functionality with a license.

Published in Conversion


How to Convert Markdown to HTML in Python: Step-by-Step Guide

2025-09-05 02:05:04 Written by Administrator

Python Guide to Convert Markdown to HTML

Markdown (.md) is widely used in web development, documentation, and technical writing. Its simple syntax makes content easy to write and read. However, web browsers do not render Markdown directly. Converting Markdown to HTML ensures your content is structured, readable, and compatible with web platforms.

In this step-by-step guide, you will learn how to efficiently convert Markdown (.md) files into HTML using Python and Spire.Doc for Python, complete with practical code examples, clear instructions, and best practices for both single-file and batch conversions.

What is Markdown
Why Convert Markdown to HTML
Introducing Spire.Doc for Python
Step-by-Step Guide: Converting Markdown to HTML in Python
Automating Batch Conversion
Best Practices for Markdown to HTML Conversion
Conclusion
FAQs

What is Markdown?

Markdown is a lightweight markup language designed for readability and ease of writing. Unlike HTML, which can be verbose and harder to write by hand, Markdown uses simple syntax to indicate headings, lists, links, images, and more.

Example Markdown:

# This is a Heading

This is a paragraph with **bold text** and *italic text*.

- Item 1
- Item 2

Even in its raw form, Markdown is easy to read, which makes it popular for documentation, blogging, README files, and technical writing.

For more on Markdown syntax, see the Markdown Guide.

Why Convert Markdown to HTML?

While Markdown is excellent for authoring content, web browsers cannot render it natively. Converting Markdown to HTML allows you to:

Publish content on websites – Most CMS platforms require HTML for web pages.
Enhance styling – HTML supports CSS and JavaScript for advanced formatting and interactivity.
Maintain compatibility – HTML is universally supported by browsers, ensuring content displays correctly everywhere.
Integrate with web frameworks – Frameworks like React, Vue, and Angular ultimately render HTML in the browser, so converted content can be embedded in their templates or components.

Introducing Spire.Doc for Python

Spire.Doc for Python is a robust library for handling multiple document formats. It supports reading and writing Word documents, Markdown files, and exporting content to HTML. The library allows developers to convert Markdown directly to HTML with minimal code, preserving proper formatting and structure.

In addition to HTML, Spire.Doc for Python also allows you to convert Markdown to Word in Python or convert Markdown to PDF in Python, making it particularly useful for developers who want a unified tool for handling Markdown across different output formats.

Benefits of Using Spire.Doc for Python for Markdown to HTML Conversion

Easy-to-use API – Simple, intuitive methods that reduce development effort.
Accurate formatting – Preserves all Markdown elements such as headings, lists, links, and emphasis in HTML.
No extra dependencies – Eliminates the need for manual parsing or third-party libraries.
Flexible usage – Supports both single-file conversion and automated batch processing.

Step-by-Step Guide: Converting Markdown to HTML in Python

Now that you understand the purpose and benefits of converting Markdown to HTML, let’s walk through a clear, step-by-step process to transform your Markdown files into structured, web-ready HTML.

Step 1: Install Spire.Doc for Python

First, ensure that Spire.Doc for Python is installed in your environment. You can install it directly from PyPI using pip:

pip install spire.doc

Step 2: Prepare Your Markdown File

Next, create a sample Markdown file that you want to convert. For example, example.md:

Example Markdown File

Step 3: Write the Python Script

Now, write a Python script that loads the Markdown file and converts it to HTML:

from spire.doc import *

# Create a Document object
doc = Document()

# Load the Markdown file
doc.LoadFromFile("example.md", FileFormat.Markdown)

# Save the document as HTML
doc.SaveToFile("example.html", FileFormat.Html)

# Close the document
doc.Close()

Explanation of the code:

Document() initializes a new document object.
LoadFromFile("example.md", FileFormat.Markdown) loads the Markdown file into memory.
SaveToFile("example.html", FileFormat.Html) converts the loaded content into HTML and saves it to disk.
doc.Close() ensures resources are released properly, which is particularly important when processing multiple files or running batch operations.
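One way to make sure Close() always runs, even when loading or saving raises an error, is a small context manager. This is a sketch rather than a library feature: `open_document` is a hypothetical helper, and it only assumes the object returned by `factory` has a Close() method, as Document does in the examples above.

```python
from contextlib import contextmanager


@contextmanager
def open_document(factory):
    """Create a document via `factory` and always call Close() afterwards."""
    doc = factory()
    try:
        yield doc
    finally:
        doc.Close()


# With Spire.Doc installed, usage would look like:
#   from spire.doc import Document, FileFormat
#   with open_document(Document) as doc:
#       doc.LoadFromFile("example.md", FileFormat.Markdown)
#       doc.SaveToFile("example.html", FileFormat.Html)
```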

Step 4: Verify the HTML Output

Finally, open the generated example.html file in a web browser or HTML editor. Verify that the Markdown content has been correctly converted.

HTML File Converted from Markdown using Python

Automating Batch Conversion

You can convert multiple Markdown files in a folder automatically:

import os
from spire.doc import *

# Set the folder containing Markdown files
input_folder = "markdown_files"

# Set the folder where HTML files will be saved
output_folder = "html_files"

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Loop through all files in the input folder
for filename in os.listdir(input_folder):
    # Process only Markdown files
    if filename.endswith(".md"):
        # Create a new Document object for each file
        doc = Document()

        # Load the Markdown file into the Document object
        doc.LoadFromFile(os.path.join(input_folder, filename), FileFormat.Markdown)

        # Construct the output file path by swapping the extension for .html
        # (splitext only touches the extension, not ".md" elsewhere in the name)
        output_file = os.path.join(output_folder, os.path.splitext(filename)[0] + ".html")

        # Save the loaded Markdown content as HTML
        doc.SaveToFile(output_file, FileFormat.Html)

        # Close the document to release resources
        doc.Close()

This approach allows you to process multiple Markdown files efficiently and generate corresponding HTML files automatically.
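If the Markdown files live in nested subfolders, the flat loop above will miss them. The sketch below, using only pathlib, plans the source and target paths recursively while mirroring the folder layout; `plan_conversions` is a hypothetical helper and performs no conversion itself.

```python
from pathlib import Path


def plan_conversions(input_dir, output_dir, src_ext=".md", dst_ext=".html"):
    """Return (source, target) path pairs, mirroring subfolders under output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pairs = []
    for src in sorted(input_dir.rglob("*")):
        # Match the extension case-insensitively and skip directories
        if src.is_file() and src.suffix.lower() == src_ext:
            relative = src.relative_to(input_dir)
            pairs.append((src, output_dir / relative.with_suffix(dst_ext)))
    return pairs
```

Each pair can then be fed to LoadFromFile() and SaveToFile() after creating the target's parent folder with `target.parent.mkdir(parents=True, exist_ok=True)`.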

Best Practices for Markdown to HTML Conversion

While the basic steps are enough to complete a Markdown-to-HTML conversion, following a few best practices will help you avoid common pitfalls, improve compatibility, and ensure your output is both clean and professional:

Use Proper Markdown Syntax: Ensure headings, lists, links, and emphasis are correctly written.
Use UTF-8 Encoding: Always save your Markdown files in UTF-8 encoding to avoid issues with special characters or non-English text.
Batch Processing: If you need to convert multiple files, wrap your script in a loop and process entire folders. This saves time and ensures consistent formatting across documents.
Enhance Styling: Remember that HTML gives you the flexibility to add CSS and JavaScript for custom layouts, responsive design, and interactivity—something not possible in raw Markdown.
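As an illustration of the styling point, the sketch below adds a style block to an HTML string after conversion. It is plain string processing, independent of Spire.Doc; `inject_css` is a hypothetical helper, and whether the generated file contains a head element is an assumption, so the function handles both cases.

```python
import re


def inject_css(html: str, css: str) -> str:
    """Insert a <style> block before </head>, or prepend it if no head exists."""
    style_block = f"<style>\n{css.strip()}\n</style>\n"
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match:
        return html[:match.start()] + style_block + html[match.start():]
    return style_block + html
```

When reading and rewriting the generated file, opening it with an explicit encoding such as `encoding="utf-8"` keeps special characters intact, provided the file was written in that encoding.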

Conclusion

Converting Markdown to HTML using Python with Spire.Doc is simple, reliable, and efficient. It preserves formatting, supports automation, and produces clean HTML output ready for web use. By following this guide, you can implement a smooth Markdown to HTML workflow for both single documents and batch operations.

FAQs

Q1: Can I convert multiple Markdown files to HTML in Python?

A1: Yes, you can automate batch conversions by iterating through Markdown files in a directory and applying the conversion logic to each file.

Q2: Will the HTML preserve all Markdown formatting?

A2: Yes, Spire.Doc effectively preserves all essential Markdown formatting, including headings, lists, bold and italic text, links, and more.

Q3: Is there a way to handle images in Markdown during conversion?

A3: Yes, Spire.Doc supports the conversion of images embedded in Markdown, ensuring they are included in the resulting HTML.

Q4: Do I need additional libraries besides Spire.Doc?

A4: No additional libraries are required. Spire.Doc for Python provides a comprehensive solution for converting Markdown to HTML without any external dependencies.

Q5: Can I use the generated HTML in web frameworks?

A5: Yes, the output is standard HTML, so it can be used in pages built with popular web frameworks such as React, Vue, and Angular.

Published in Conversion


Convert HTML to Text in Python | Simple Plain Text Output

2025-09-03 01:16:17 Written by Administrator

Python Convert HTML Text Quickly and Easily

HTML (HyperText Markup Language) is a markup language used to create web pages, allowing developers to build rich and visually appealing layouts. However, HTML files often contain a large number of tags, which makes them difficult to read if you only need the main content. By using Python to convert HTML to text, this problem can be easily solved. Unlike raw HTML, the converted text file strips away all unnecessary markup, leaving only clean and readable content that is easier to store, analyze, or process further.

Install HTML to Text Converter in Python
Python Convert HTML File to Text
Python Convert HTML String to Text
The Conclusion
FAQs

Install HTML to Text Converter in Python

To simplify the task, we recommend using Spire.Doc for Python. This Python Word library allows you to quickly remove HTML markup and extract clean plain text with ease. It not only works as an HTML-to-text converter, but also offers a wide range of features—covering almost everything you can do in Microsoft Word.

To install it, you can run the following pip command:

pip install spire.doc

Alternatively, you can download the Spire.Doc package and install it manually.

Python Convert HTML Files to Text in 3 Steps

After preparing the necessary tools, let's dive into today's main topic: how to convert HTML to plain text using Python. With the help of Spire.Doc, this task can be accomplished in just three simple steps: create a new document object, load the HTML file, and save it as a text file. It’s straightforward and efficient, even for beginners. Let’s take a closer look at how this process can be implemented in code!

Code Example – Converting an HTML File to a Text File:

from spire.doc import *
from spire.doc.common import *

# Open an html file
document = Document()
document.LoadFromFile("/input/htmlsample.html", FileFormat.Html, XHTMLValidationType.none)
# Save it as a Text document.
document.SaveToFile("/output/HtmlFileTotext.txt", FileFormat.Txt)

document.Close()

The following is a preview comparison between the source document (.html) and the output document (.txt):

Python Convert an HTML File to a Text Document

Note that if the HTML file contains tables, the output text file will only retain the values within the tables and cannot preserve the original table formatting. If you want to keep certain styles while removing markup, it is recommended to convert HTML to a Word document. This way, you can retain headings, tables, and other formatting, making the content easier to edit and use.

How to Convert an HTML String to Text in Python

Sometimes, we don’t need the entire content of a web page and only want to extract specific parts. In such cases, you can convert an HTML string directly to text. This approach allows you to precisely control the information you need without further editing. Using Python to convert an HTML string to a text file is also straightforward. Here’s a detailed step-by-step guide:

Steps to convert an HTML string to a text document using Spire.Doc:

Input the HTML string directly or read it from a local file.
Create a Document object and add sections and paragraphs.
Use Paragraph.AppendHTML() method to insert the HTML string into a paragraph.
Save the document as a .txt file using Document.SaveToFile() method.

The following code demonstrates how to convert an HTML string to a text file using Python:

from spire.doc import *
from spire.doc.common import *

# Alternatively, read the HTML string from a local file:
# with open(inputFile, encoding="utf-8") as fp:
#     html = fp.read()

# Load HTML from string
html = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>HTML to Text Example</title>
  <style>
    body { font-family: Arial, sans-serif; margin: 20px; }
    header { background: #f4f4f4; padding: 10px; }
    nav a { margin: 0 10px; text-decoration: none; color: #333; }
    main { margin-top: 20px; }
  </style>
</head>
<body>
  <header>
    <h1>My Demo Page</h1>
    <nav>
      <a href="#">Home</a>
      <a href="#">About</a>
      <a href="#">Contact</a>
    </nav>
  </header>
  
  <main>
    <h2>Convert HTML to Text</h2>
    <p>This is a simple demo showing how HTML content can be displayed before converting it to plain text.</p>
  </main>
</body>
</html>
"""

# Create a new document
document = Document()
section = document.AddSection()
section.AddParagraph().AppendHTML(html)

# Save directly as TXT
document.SaveToFile("/output/HtmlStringTotext.txt", FileFormat.Txt)
document.Close()

Here's the preview of the converted .txt file:

Python Convert an HTML String to a Text Document

The Conclusion

In today’s tutorial, we focused on how to use Python to convert HTML to a text file. With the help of Spire.Doc, you can handle both HTML files and HTML strings in just a few lines of code, easily generating clean plain text files. If you’re interested in the other powerful features of the Python Word library, you can request a 30-day free trial license and explore its full capabilities for yourself.

FAQs about Converting HTML to Text in Python

Q1: How can I convert HTML to plain text using Python?

A: Use Spire.Doc to load an HTML file or string, insert it into a Document object with AppendHTML(), and save it as a .txt file.

Q2: Can I keep some formatting when converting HTML to text?

A: To retain styles like headings or tables, convert HTML to a Word document first, then export to text if needed.

Q3: Is it possible to convert only part of an HTML page to text?

A: Yes, extract the specific HTML segment as a string and convert it to text using Python for precise control.
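As a rough illustration of Q3, the sketch below pulls the inner markup of one element out of a larger HTML string using only Python's standard html.parser module. The class name SegmentExtractor and the function extract_segment are hypothetical helpers, not part of Spire.Doc; the returned string is simply something you could then pass to Paragraph.AppendHTML() as in the example above.

```python
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collect the inner HTML of the first element with a given tag name."""

    def __init__(self, target_tag):
        super().__init__(convert_charrefs=False)
        self.target_tag = target_tag
        self.depth = 0      # nesting depth inside the target element
        self.done = False   # True once the first match has been closed
        self.parts = []     # raw markup collected so far

    def _inside(self):
        return not self.done and self.depth > 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())
            if tag == self.target_tag:
                self.depth += 1
        elif tag == self.target_tag:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged
        if self._inside():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._inside():
            return
        if tag == self.target_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._inside():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._inside():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._inside():
            self.parts.append(f"&#{name};")


def extract_segment(html, tag):
    """Return the inner HTML of the first <tag> element, or '' if absent."""
    parser = SegmentExtractor(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

For example, extract_segment(html, "main") would return only the markup inside the main element of the demo page, leaving the header and navigation links out of the converted text.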

Published in Conversion

Tagged under

doc Python Conversion

Creating Word Documents with Python: A Step-By-Step Guide

2025-07-08 01:35:11 Written by zaki zou

Python Examples for Creating Word Documents

Automating the creation of Word documents is a powerful way to generate reports and produce professional-looking files. With Python, you can use various libraries for this purpose; one excellent option is Spire.Doc for Python, which is designed specifically for handling Word documents.

This guide will provide a clear, step-by-step process for creating Word documents in Python using Spire.Doc. We’ll cover everything from setting up the library to adding formatted text, images, tables, and more. Whether you're generating reports, invoices, or any other type of document, these techniques will equip you with the essential tools to enhance your workflow effectively.

Table of Contents:

What's Spire.Doc for Python?
Set Up Spire.Doc in Your Python Project
Step 1: Create a Blank Word Document
Step 2: Add Formatted Text (Headings, Paragraphs)
Step 3: Insert Images to a Word Document
Step 4: Create and Format Tables
Step 5: Add Numbered or Bulleted Lists
Best Practices for Word Document Creation in Python
FAQs
Conclusion

What's Spire.Doc for Python?

Spire.Doc is a powerful library for creating, manipulating, and converting Word documents in Python. It enables developers to generate professional-quality documents programmatically without needing Microsoft Word. Here are some key features:

Supports Multiple Formats: Works with DOCX, DOC, RTF, and HTML.
Extensive Functionalities: Add text, images, tables, and charts.
Styling and Formatting: Apply various styles for consistent document appearance.
User-Friendly API: Simplifies automation of document generation processes.
Versatile Applications: Ideal for generating reports, invoices, and other documents.

With Spire.Doc, you have the flexibility and tools to streamline your Word document creation tasks effectively.

Set Up Spire.Doc in Your Python Project

To get started with Spire.Doc in your Python project, follow these simple steps:

Install Spire.Doc: First, install the Spire.Doc library using pip. Open your terminal or command prompt and run the following command:

pip install spire.doc

Import the Library: Once installed, import the Spire.Doc module in your Python script to access its functionality, using the following import statements:

from spire.doc import *
from spire.doc.common import *

With the setup complete, you can begin writing your Python code to create Word documents according to your needs.

Step 1: Create a Blank Word Document in Python

The first step in automating Word document creation is to create a blank document. We start by creating a Document object, which serves as the foundation of the Word document. We then add a section to organize content and set the page size to A4 with 60-unit margins. These settings determine the document's layout and readability.

Below is the code to initialize a document and set up the page configuration:

# Create a Document object
doc = Document()

# Add a section
section = doc.AddSection()

# Set page size and page margins
section.PageSetup.PageSize = PageSize.A4()
section.PageSetup.Margins.All = 60

# Save the document
doc.SaveToFile("BlankDocument.docx")
doc.Dispose()
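The margin value of 60 is easier to reason about with a small conversion helper. The sketch below assumes that page measurements are expressed in points (1/72 inch); that is our reading of the API rather than something this guide states, so check it against the library reference. The function names mm_to_points and inches_to_points are hypothetical.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to points (assumed unit for page measurements)."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def inches_to_points(inches):
    """Convert inches to points."""
    return inches * POINTS_PER_INCH


# Hypothetical usage: a 20 mm margin expressed in points
margin = mm_to_points(20)
```

Under that assumption, a value such as mm_to_points(20) could be assigned to section.PageSetup.Margins.All in place of the literal 60.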

Step 2: Add Formatted Text (Headings, Paragraphs)

1. Add Title, Headings, Paragraphs

In this step, we add text content by first creating paragraphs using the AddParagraph method, followed by inserting text with the AppendText method.

Different paragraphs can be styled using various BuiltinStyle options, such as Title, Heading1, and Normal, allowing for quick generation of document elements. Additionally, the TextRange.CharacterFormat property can be used to adjust the font, size, and other styles of the text, ensuring a polished and organized presentation.

Below is the code to insert and format these elements:

# Add a title
title_paragraph = section.AddParagraph()
textRange = title_paragraph.AppendText("My First Document")
title_paragraph.ApplyStyle(BuiltinStyle.Title)
textRange.CharacterFormat.FontName = "Times New Roman"
textRange.CharacterFormat.FontSize = 24

# Add a heading
heading_paragraph = section.AddParagraph()
textRange = heading_paragraph.AppendText("This Is Heading1")
heading_paragraph.ApplyStyle(BuiltinStyle.Heading1)
textRange.CharacterFormat.FontName = "Times New Roman"
textRange.CharacterFormat.FontSize = 16

# Add a paragraph
normal_paragraph = section.AddParagraph()
textRange = normal_paragraph.AppendText("This is a sample paragraph.")
normal_paragraph.ApplyStyle(BuiltinStyle.Normal)
textRange.CharacterFormat.FontName = "Times New Roman"
textRange.CharacterFormat.FontSize = 12

2. Apply Formatting to Paragraph

To ensure consistent formatting across multiple paragraphs, we can create a ParagraphStyle that defines key properties such as font attributes (name, size, color, boldness) and paragraph settings (spacing, indentation, alignment) within a single object. This style can then be easily applied to the selected paragraphs for uniformity.

Below is the code to define and apply the paragraph style:

# Define a paragraph style
style = ParagraphStyle(doc)
style.Name = "paraStyle"
style.CharacterFormat.FontName = "Arial"
style.CharacterFormat.FontSize = 13
style.CharacterFormat.TextColor = Color.get_Red()
style.CharacterFormat.Bold = True
style.ParagraphFormat.AfterSpacing = 12
style.ParagraphFormat.BeforeSpacing = 12
style.ParagraphFormat.FirstLineIndent = 4
style.ParagraphFormat.LineSpacing = 10
style.ParagraphFormat.HorizontalAlignment = HorizontalAlignment.Left
doc.Styles.Add(style)

# Apply the style to the specific paragraph
normal_paragraph.ApplyStyle("paraStyle")

Step 3: Insert Images to a Word Document

1. Insert an Image

In this step, we add an image to our document, allowing for visual enhancements that complement the text. We begin by creating a paragraph to host the image and then insert the desired image file using the Paragraph.AppendPicture method. After the image is inserted, we can adjust its dimensions and alignment to ensure it fits well within the document layout.

Below is the code to insert and format the image:

# Add a paragraph
paragraph = section.AddParagraph()

# Insert an image
picture = paragraph.AppendPicture("C:\\Users\\Administrator\\Desktop\\logo.png")

# Scale the image dimensions
picture.Width = picture.Width * 0.9
picture.Height = picture.Height * 0.9

# Set text wrapping style
picture.TextWrappingStyle = TextWrappingStyle.TopAndBottom

# Center-align the image horizontally
picture.HorizontalAlignment = HorizontalAlignment.Center

2. Position Image at Precise Location

To gain precise control over the positioning of images within your Word document, you can set both the horizontal and vertical origins and specify the image's coordinates relative to those origins. This allows for accurate placement of the image, so that it aligns with the overall layout of your document.

Below is the code to set the image's position.

picture.HorizontalOrigin = HorizontalOrigin.LeftMarginArea 
picture.VerticalOrigin = VerticalOrigin.TopMarginArea 
picture.HorizontalPosition = 180.0 
picture.VerticalPosition = 165.0

Note: Absolute positioning does not apply when using the Inline text wrapping style.

Step 4: Create and Format Tables

In this step, we will create a table within the document and customize its appearance and functionality. This includes defining the table's structure, adding header and data rows, and setting formatting options to enhance readability.

Steps for creating and customizing a table in Word:

Add a Table: Use the Section.AddTable method to create a new table.
Specify Table Data: Define the data that will populate the table.
Set Rows and Columns: Specify the number of rows and columns with the Table.ResetCells method.
Access Rows and Cells: Retrieve a specific row using Table.Rows[rowIndex] and a specific cell using TableRow.Cells[cellIndex].
Populate the Table: Add paragraphs with text to the designated cells.
Customize Appearance: Modify the table and cell styles through the Table.TableFormat and TableCell.CellFormat properties.

The following code demonstrates how to add a table when creating Word documents in Python:

# Add a table
table = section.AddTable(True)

# Specify table data
header_data = ["Header 1", "Header 2", "Header 3"]
row_data = [["Row 1, Col 1", "Row 1, Col 2", "Row 1, Col 3"],
            ["Row 2, Col 1", "Row 2, Col 2", "Row 2, Col 3"]]

# Set the row number and column number of table
table.ResetCells(len(row_data) + 1, len(header_data))

# Set the width of table
table.PreferredWidth = PreferredWidth(WidthType.Percentage, int(100))

# Get header row
headerRow = table.Rows[0]
headerRow.IsHeader = True
headerRow.Height = 23
headerRow.RowFormat.BackColor = Color.get_DarkBlue()  # Header color

# Fill the header row with data and set the text formatting
for i in range(len(header_data)):
    headerRow.Cells[i].CellFormat.VerticalAlignment = VerticalAlignment.Middle
    paragraph = headerRow.Cells[i].AddParagraph()
    paragraph.Format.HorizontalAlignment = HorizontalAlignment.Center
    txtRange = paragraph.AppendText(header_data[i])
    txtRange.CharacterFormat.Bold = True
    txtRange.CharacterFormat.FontSize = 15
    txtRange.CharacterFormat.TextColor = Color.get_White()  # White text color

# Fill the remaining rows with data and set the text formatting
for r in range(len(row_data)):
    dataRow = table.Rows[r + 1]
    dataRow.Height = 20
    dataRow.HeightType = TableRowHeightType.Exactly

    for c in range(len(row_data[r])):
        dataRow.Cells[c].CellFormat.VerticalAlignment = VerticalAlignment.Middle
        paragraph = dataRow.Cells[c].AddParagraph()
        paragraph.Format.HorizontalAlignment = HorizontalAlignment.Center
        txtRange = paragraph.AppendText(row_data[r][c])
        txtRange.CharacterFormat.FontSize = 13

# Alternate row color
for j in range(1, table.Rows.Count):
    if j % 2 == 0:
        row2 = table.Rows[j]
        for f in range(row2.Cells.Count):
            row2.Cells[f].CellFormat.BackColor = Color.get_LightGray()  # Alternate row color

# Set the border of table
table.TableFormat.Borders.BorderType = BorderStyle.Single
table.TableFormat.Borders.LineWidth = 1.0
table.TableFormat.Borders.Color = Color.get_Black()
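Because the table code indexes row_data[r][c] against a grid sized from header_data, a row with too few or too many values would fail partway through filling the table. One way to catch that early is a small check before calling Table.ResetCells. The helper name table_shape is hypothetical and the sketch uses only plain Python lists.

```python
def table_shape(header, rows):
    """Return (row_count, column_count), counting the header as one row.

    Raises ValueError if any data row does not match the header length.
    """
    columns = len(header)
    for index, row in enumerate(rows, start=1):
        if len(row) != columns:
            raise ValueError(
                f"Row {index} has {len(row)} cells, expected {columns}"
            )
    return len(rows) + 1, columns


header_data = ["Header 1", "Header 2", "Header 3"]
row_data = [["Row 1, Col 1", "Row 1, Col 2", "Row 1, Col 3"],
            ["Row 2, Col 1", "Row 2, Col 2", "Row 2, Col 3"]]

row_count, column_count = table_shape(header_data, row_data)
```

The resulting row_count and column_count could then be passed to table.ResetCells(row_count, column_count) in place of the inline length calculations.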

Step 5: Add Numbered or Bulleted Lists

In this step, we create and apply both numbered and bulleted lists to enhance the document's organization. Spire.Doc offers the ListStyle class to define and manage different types of lists with customizable formatting options. Once created, these styles can be applied to any paragraph in the document, ensuring a consistent look across all list items.

Steps for generating numbered/bulleted lists in Word:

Define the List Style: Initialize a ListStyle for the numbered or bulleted list, specifying properties such as name, pattern type, and text position.
Add the List Style to Document: Use the Document.ListStyles.Add() method to incorporate the new list style into the document's styles collection.
Create List Items: For each item, create a paragraph and apply the corresponding list style using the Paragraph.ListFormat.ApplyStyle() method.
Format Text Properties: Adjust font size and type for each item to ensure consistency and readability.

Below is the code to generate numbered and bulleted lists:

# Create a numbered list style
listStyle = ListStyle(doc, ListType.Numbered)
listStyle.Name = "numberedList"
listStyle.Levels[0].PatternType = ListPatternType.Arabic
listStyle.Levels[0].TextPosition = 60
doc.ListStyles.Add(listStyle)

# Create a numbered list
for item in ["First item", "Second item", "Third item"]:
    paragraph = section.AddParagraph()
    textRange = paragraph.AppendText(item)
    textRange.CharacterFormat.FontSize = 13
    textRange.CharacterFormat.FontName = "Times New Roman"
    paragraph.ListFormat.ApplyStyle("numberedList")

# Create a bulleted list style
listStyle = ListStyle(doc, ListType.Bulleted)
listStyle.Name = "bulletedList"
listStyle.Levels[0].BulletCharacter = "\u00B7"
listStyle.Levels[0].CharacterFormat.FontName = "Symbol"
listStyle.Levels[0].TextPosition = 20
doc.ListStyles.Add(listStyle)

# Create a bulleted list
for item in ["Bullet item one", "Bullet item two", "Bullet item three"]:
    paragraph = section.AddParagraph()
    textRange = paragraph.AppendText(item)
    textRange.CharacterFormat.FontSize = 13
    textRange.CharacterFormat.FontName = "Times New Roman"
    paragraph.ListFormat.ApplyStyle("bulletedList")

Here’s a screenshot of the Word document created using the code snippets provided above:

Word document generated with Python code.

Best Practices for Word Document Creation in Python

Reuse Styles: Define paragraph and list styles upfront to maintain consistency.
Modular Code: Break document generation into functions (e.g., add_heading(), insert_table()) for reusability.
Error Handling: Validate file paths and inputs to avoid runtime errors.
Performance Optimization: Dispose of document objects (doc.Dispose()) to free resources.
Use Templates: For complex documents, create MS Word templates with placeholders and replace them programmatically to save development time.

By implementing these practices, you can streamline document automation, reduce manual effort, and ensure professional-quality outputs.
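To make the error-handling advice concrete, here is a minimal sketch of a path check that could run before an image path is handed to Paragraph.AppendPicture. The function name require_file and the default list of extensions are assumptions chosen for illustration; adjust them to the formats you actually use.

```python
from pathlib import Path


def require_file(path, allowed_suffixes=(".png", ".jpg", ".jpeg")):
    """Return the path as a string if it exists and has an allowed extension."""
    candidate = Path(path)
    if not candidate.is_file():
        raise FileNotFoundError(f"Input file not found: {candidate}")
    if candidate.suffix.lower() not in allowed_suffixes:
        raise ValueError(
            f"Unsupported file type '{candidate.suffix}' for {candidate}"
        )
    return str(candidate)
```

Failing early with a clear message is usually easier to debug than an error raised from inside the document library.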

FAQs

Q1: Does Spire.Doc support adding headers and footers to a Word document?

Yes, you can add and customize headers and footers, including page numbers, images, and custom text.

Q2: Can I generate Word documents on a server without Microsoft Office installed?

Yes, Spire.Doc works without Office dependencies, making it ideal for server-side automation.

Q3: Can I create Word documents from a template using Spire.Doc?

Of course, you can. Refer to the tutorial: Create Word Documents from Templates with Python

Q4: Can I convert Word documents to other formats using Spire.Doc?

Yes, Spire.Doc supports converting Word documents to various formats, including PDF, HTML, and plain text.

Q5: Can Spire.Doc edit existing Word documents?

Yes, Spire.Doc supports reading, editing, and saving DOCX/DOC files programmatically. Check out this documentation: How to Edit or Modify Word Documents in Python

Conclusion

In this article, we've explored how to create Word documents in Python using the Spire.Doc library, highlighting its potential to enhance productivity while enabling the generation of highly customized and professional documents. By following the steps outlined in this guide, you can fully leverage Spire.Doc, making your document creation process both efficient and straightforward.

As you implement best practices and delve into the library's extensive functionalities, you'll discover that automating document generation significantly reduces manual effort, allowing you to concentrate on more critical tasks. Embrace the power of Python and elevate your document creation capabilities today!

Published in Document Operation

Tagged under

doc Python Document Operation

Read Word DOC or DOCX Files in Python - Extract Text, Images, Tables and More

2025-06-30 01:41:17 Written by zaki zou

Python Examples to Read Word DOC and DOCX Files

Reading Word documents in Python is a common task for developers who work with document automation, data extraction, or content processing. Whether you're working with modern .docx files or legacy .doc formats, being able to open, read, and extract content like text, tables, and images from Word files can save time and streamline your workflows.

While many Python libraries support .docx, reading .doc files—the older binary format—can be more challenging. Fortunately, there are reliable methods for handling both file types in Python.

In this tutorial, you'll learn how to read Word documents (.doc and .docx) in Python using the Spire.Doc for Python library. We'll walk through practical code examples to extract text, images, tables, comments, lists, and even metadata. Whether you're building an automation script or a full document parser, this guide will help you work with Word files effectively across formats.

Why Read Word Documents Programmatically in Python?
Install the Library for Parsing Word Documents in Python
Read Text from Word DOC or DOCX in Python
- Get Text from Entire Document
- Get Text from Specific Section or Paragraph
Read Specific Elements from a Word Document in Python
Conclusion
FAQs

Why Read Word Documents Programmatically in Python?

Reading Word files using Python allows for powerful automation of content processing tasks, such as:

Extracting data from reports, resumes, or forms.
Parsing and organizing content into databases or dashboards.
Converting or analyzing large volumes of Word documents.
Integrating document reading into web apps, APIs, or back-end systems.

Programmatic reading eliminates manual copy-paste workflows and ensures consistent and scalable results.

Install the Library for Parsing Word Documents in Python

To read .docx and .doc files in Python, you need a library that can handle both formats. Spire.Doc for Python is a versatile and easy-to-use library that lets you extract text, images, tables, comments, lists, and metadata from Word documents. It runs independently of Microsoft Word, so Office installation is not required.

To get started, install Spire.Doc easily with pip:

pip install Spire.Doc

Read Text from Word DOC or DOCX in Python

Extracting text from Word documents is a common requirement in many automation and data processing tasks. Depending on your needs, you might want to read the entire content or focus on specific sections or paragraphs. This section covers both approaches.

Get Text from Entire Document

When you need to retrieve the complete textual content of a Word document — for tasks like full-text indexing or simple content export — you can use the Document.GetText() method. The following example demonstrates how to load a Word file, extract all text, and save it to a file:

from spire.doc import *

# Load the Word .docx or .doc file
document = Document()
document.LoadFromFile("sample.docx") 

# Get all text
text = document.GetText()

# Save to a text file
with open("extracted_text.txt", "w", encoding="utf-8") as file:
    file.write(text)

document.Close()

Python Example to Retrieve All Text from Word Documents

Get Text from Specific Section or Paragraph

Many documents, such as reports or contracts, are organized into multiple sections. Extracting text from a specific section enables targeted processing when you need content from a particular part only. By iterating through the paragraphs of the selected section, you can isolate the relevant text:

from spire.doc import *

# Load the Word .docx or .doc file
document = Document()
document.LoadFromFile("sample.docx")

# Access the desired section
section = document.Sections[0]

# Get text from the paragraphs of the section
with open("paragraphs_output.txt", "w", encoding="utf-8") as file:
    for paragraph in section.Paragraphs:
        file.write(paragraph.Text + "\n")

document.Close()

Read Specific Elements from a Word Document in Python

Beyond plain text, Word documents often include rich content like images, tables, comments, lists, metadata, and more. These elements can easily be programmatically accessed and extracted.

Extract Images

Word documents often embed images like logos, charts, or illustrations. To extract these images:

Traverse each paragraph and its child objects.
Identify objects of type DocPicture.
Retrieve the image bytes and save them as separate files.

from spire.doc import *
import os

# Load the Word document
document = Document()
document.LoadFromFile("sample.docx")

# Create a list to store image byte data
images = []

# Iterate over sections
for s in range(document.Sections.Count):
    section = document.Sections[s]
    
    # Iterate over paragraphs
    for p in range(section.Paragraphs.Count):
        paragraph = section.Paragraphs[p]
        
        # Iterate over child objects
        for c in range(paragraph.ChildObjects.Count):
            obj = paragraph.ChildObjects[c]
            # Extract image data
            if isinstance(obj, DocPicture):
                picture = obj
                # Get image bytes
                dataBytes = picture.ImageBytes  
                # Store in the list
                images.append(dataBytes)        

# Create the output directory if it doesn't exist
output_folder = "ExtractedImages"
os.makedirs(output_folder, exist_ok=True)

# Save each image from byte data
for i, item in enumerate(images):
    fileName = f"Image-{i+1}.png"
    with open(os.path.join(output_folder, fileName), 'wb') as imageFile:
        imageFile.write(item)

# Close the document
document.Close()

Python Example to Extract Images from Word Documents
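The example above saves every image with a .png extension, even though embedded pictures may be JPEG, GIF, or BMP. A possible refinement is to inspect the leading bytes of each image and pick the extension from its file signature. This is a sketch that assumes the image data is a bytes-like object; guess_image_extension is a hypothetical helper and only covers a few common formats.

```python
def guess_image_extension(data):
    """Guess a file extension from the leading signature bytes of an image."""
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"\xff\xd8\xff", ".jpg"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
    ]
    head = bytes(data[:8])
    for magic, extension in signatures:
        if head.startswith(magic):
            return extension
    # Unknown format: fall back to a neutral extension
    return ".bin"
```

In the saving loop, the file name could then be built as f"Image-{i+1}{guess_image_extension(item)}".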

Get Table Data

Tables organize data such as schedules, financial records, or lists. To extract all tables and their content:

Loop through tables in each section.
Loop through rows and cells in each table.
Traverse each cell’s paragraphs and combine their text.
Save the extracted table data in a readable text format.

from spire.doc import *
import os

# Load the Word document
document = Document()
document.LoadFromFile("tables.docx")

# Ensure output directory exists
output_dir = "output/Tables"
os.makedirs(output_dir, exist_ok=True)

# Loop through each section
for s in range(document.Sections.Count):
    section = document.Sections[s]
    tables = section.Tables

    # Loop through each table in the section
    for i in range(tables.Count):
        table = tables[i]
        table_data = ""

        # Loop through each row
        for j in range(table.Rows.Count):
            row = table.Rows[j]

            # Loop through each cell
            for k in range(row.Cells.Count):
                cell = row.Cells[k]
                cell_text = ""

                # Combine text from all paragraphs in the cell
                for p in range(cell.Paragraphs.Count):
                    para_text = cell.Paragraphs[p].Text
                    cell_text += para_text + " "

                table_data += cell_text.strip()

                # Add tab between cells (except after the last cell)
                if k < row.Cells.Count - 1:
                    table_data += "\t"
            table_data += "\n"

        # Save the table data to a separate text file
        output_path = os.path.join(output_dir, f"WordTable_{s+1}_{i+1}.txt")
        with open(output_path, "w", encoding="utf-8") as output_file:
            output_file.write(table_data)

# Close the document
document.Close()

Python Example to Get Table Data from Word Documents
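Tab-separated text works for simple tables, but a cell that itself contains a tab, comma, or line break can make the output ambiguous. If that is a concern, one alternative is to collect each table as a list of rows and serialize it with Python's built-in csv module, which quotes such values. The helper rows_to_csv below is a hypothetical name and operates on plain lists of strings.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (each a list of cell strings) as CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical table content gathered from the cell loop
example_rows = [["Name", "Note"], ["A", "x, y"]]
csv_text = rows_to_csv(example_rows)
```

When writing the result to disk, open the file with newline="" so the line endings produced by the csv module are not translated a second time.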

Read Lists

Lists are frequently used to structure content in Word documents. This example identifies paragraphs formatted as list items and writes the list marker together with the text to a file.

from spire.doc import *

# Load the Word document
document = Document()
document.LoadFromFile("sample.docx")

# Open a text file for writing the list items
with open("list_items.txt", "w", encoding="utf-8") as output_file:

    # Iterate over sections 
    for s in range(document.Sections.Count):
        section = document.Sections[s]

        # Iterate over paragraphs 
        for p in range(section.Paragraphs.Count):
            paragraph = section.Paragraphs[p]

            # Check if the paragraph is a list
            if paragraph.ListFormat.ListType != ListType.NoList:
                # Write the combined list marker and paragraph text to file
                output_file.write(paragraph.ListText + paragraph.Text + "\n")

# Close the document
document.Close()

Extract Comments

Comments are typically used for collaboration and feedback in Word documents. This code retrieves all comments, including the author and content, and saves them to a file with clear formatting for later review or audit.

from spire.doc import *

# Load the Word .docx or .doc document
document = Document()
document.LoadFromFile("sample.docx")

# Open a text file to save comments
with open("extracted_comments.txt", "w", encoding="utf-8") as output_file:

    # Iterate over the comments 
    for i in range(document.Comments.Count):
        comment = document.Comments[i]

        # Write comment header with comment number
        output_file.write(f"Comment {i + 1}:\n")
        
        # Write comment author
        output_file.write(f"Author: {comment.Format.Author}\n")

        # Extract full comment text by concatenating all paragraph texts
        comment_text = ""
        for j in range(comment.Body.Paragraphs.Count):
            paragraph = comment.Body.Paragraphs[j]
            comment_text += paragraph.Text + "\n"

        # Write the comment text
        output_file.write(f"Content: {comment_text.strip()}\n")

        # Add a blank line between comments
        output_file.write("\n")

# Close the document
document.Close()

Retrieve Metadata (Document Properties)

Metadata provides information about the document such as author, title, creation date, and modification date. This code extracts common built-in properties for reporting or cataloging purposes.

from spire.doc import *

# Load the Word .docx or .doc document
document = Document()
document.LoadFromFile("sample.docx")

# Get the built-in document properties
props = document.BuiltinDocumentProperties

# Open a text file to write the properties
with open("document_properties.txt", "w", encoding="utf-8") as output_file:
    output_file.write(f"Title: {props.Title}\n")
    output_file.write(f"Author: {props.Author}\n")
    output_file.write(f"Subject: {props.Subject}\n")
    output_file.write(f"Created: {props.CreateDate}\n")
    output_file.write(f"Modified: {props.LastSaveDate}\n")

# Close the document
document.Close()

Conclusion

Reading both .doc and .docx Word documents in Python is fully achievable with the right tools. With Spire.Doc, you can:

Read text from the entire document, any section or paragraph.
Extract tables and process structured data.
Export images embedded in the document.
Extract comments and lists from the document.
Work with both modern and legacy Word formats without extra effort.

Try Spire.Doc today to simplify your Word document parsing workflows in Python!

FAQs

Q1: How do I read a Word DOC or DOCX file in Python?

A1: Use a Python library like Spire.Doc to load and extract content from Word files.

Q2: Do I need Microsoft Word installed to use Spire.Doc?

A2: No, it works without any Office installation.

Q3: Can I generate or update Word documents with Spire.Doc?

A3: Yes, Spire.Doc not only allows you to read and extract content from Word documents but also provides powerful features to create, modify, and save Word files programmatically.

Get a Free License

To fully experience the capabilities of Spire.Doc for Python without any evaluation limitations, you can request a free 30-day trial license.

Published in Document Operation

Tagged under

doc Python Document Operation

How to Count Word Frequency in a Word Document Using Python

2025-05-22 09:16:03 Written by Administrator

Want to count the frequency of words in a Word document? Whether you're analyzing content, generating reports, or building a document tool, Python makes it easy to find how often a specific word appears—across the entire document, within specific sections, or even in individual paragraphs. In this guide, you’ll learn how to use Python to count word occurrences accurately and efficiently, helping you extract meaningful insights from your Word files without manual effort.

Count Frequency of Words in Word with Python

Count Frequency of Words in an Entire Word Document
Count Word Frequency by Section
Count Word Frequency by Paragraph
To Wrap Up
FAQ

In this tutorial, we’ll use Spire.Doc for Python, a powerful and easy-to-use library for Word document processing. It supports a wide range of features like reading, editing, and analyzing DOCX files programmatically—without requiring Microsoft Office.

You can install it via pip:

pip install spire.doc

Let’s see how it works in practice, starting with counting word frequency in an entire Word document.

How to Count Frequency of Words in an Entire Word Document

Let’s start by learning how to count how many times a specific word or phrase appears in an entire Word document. This is a common task—imagine you need to check how often the word "contract" appears in a 50-page file.
With the FindAllString() method from Spire.Doc for Python, you can quickly search through the entire document and get an exact count in just a few lines of code—saving you both time and effort.

Steps to count the frequency of a word in the entire Word document:

Create an object of Document class and read a source Word document.
Specify the keyword to find.
Find all occurrences of the keyword in the document using Document.FindAllString() method.
Count the number of matches and print it out.

The following code shows how to count the frequency of the keyword "AI-Generated Art" in the entire Word document:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Load a Word document
document.LoadFromFile("E:/Administrator/Python1/input/AI-Generated Art.docx")

# Customize the keyword to find
keyword = "AI-Generated Art"

# Find all matches (False: ignore case; True: match whole words only)
textSelections = document.FindAllString(keyword, False, True)

# Count the number of matches
count = len(textSelections)

# Print the result
print(f'"{keyword}" appears {count} times in the entire document.')

# Close the document
document.Close()

Count Frequency of Word in the Entire Document with Python
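The two boolean flags change what counts as a match, which is easy to see on plain text. The sketch below reproduces the idea with Python's re module only; it is an illustration of case-sensitive versus whole-word matching, not a claim about how the library matches internally, and count_keyword is a hypothetical helper.

```python
import re


def count_keyword(text, keyword, case_sensitive=False, whole_word=True):
    """Count occurrences of keyword in text with optional matching rules."""
    pattern = re.escape(keyword)
    if whole_word:
        # Reject matches that are glued to a letter, digit, or underscore
        pattern = rf"(?<!\w){pattern}(?!\w)"
    flags = 0 if case_sensitive else re.IGNORECASE
    return len(re.findall(pattern, text, flags))


sample = "AI art and ai-generated art. Said: AI."
whole_words = count_keyword(sample, "AI")                    # skips "Said"
substrings = count_keyword(sample, "AI", whole_word=False)   # includes "Said"
exact_case = count_keyword(sample, "AI", case_sensitive=True)
```

Comparing the three counts on your own text is a quick way to decide which flag combination fits a given analysis.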

How to Count Word Frequency by Section in a Word Document Using Python

A Word document is typically divided into multiple sections, each containing its own paragraphs, tables, and other elements. Sometimes, instead of counting a word's frequency across the entire document, you may want to know how often it appears in each section. To achieve this, we’ll loop through all the document sections and search for the target word within each one. Let’s see how to count word frequency by section using Python.

Steps to count the frequency of a word by section in Word documents:

Create a Document object and load the Word file.
Define the target keyword to search.
Loop through all sections in the document. Within each section, loop through all paragraphs.
Use regular expressions to count keyword occurrences.
Accumulate and print the count for each section and the total count.

This code demonstrates how to count how many times "AI-Generated Art" appears in each section of a Word document:

import re
from spire.doc import *
from spire.doc.common import *

# Create a Document object and load a Word file
document = Document()
document.LoadFromFile("E:/Administrator/Python1/input/AI.docx")

# Specify the keyword
keyword = "AI-Generated Art"

# The total count of the keyword
total_count = 0

# Get all sections
sections = document.Sections

# Loop through each section
for i in range(sections.Count):
    section = sections.get_Item(i)
    paragraphs = section.Paragraphs

    section_count = 0  
    print(f"\n=== Section {i + 1} ===")

    # Loop through each paragraph in the section
    for j in range(paragraphs.Count):
        paragraph = paragraphs.get_Item(j)
        text = paragraph.Text

        # Find all matches using regular expressions
        count = len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))
        section_count += count
        total_count += count

    print(f'Total in Section {i + 1}: {section_count} time(s)')

print(f'\n=== Total occurrences in all sections: {total_count} ===')

# Close the document
document.Close()

How to Count Word Frequency by Sections in a Word File
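
The regular expression used above counts the keyword wherever it appears, including inside longer words. If only whole-word matches should be counted, one option is to add lookarounds to the pattern. The following is a minimal sketch in plain Python, independent of Spire.Doc; the helper names `count_substring` and `count_whole_word` and the sample sentence are illustrative assumptions, not part of the library.

```python
import re


def count_substring(text: str, keyword: str) -> int:
    """Count case-insensitive matches, including those inside longer words."""
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))


def count_whole_word(text: str, keyword: str) -> int:
    """Count case-insensitive matches that are not glued to other word characters."""
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Illustrative sample text (not taken from any real document)
sample = "AI-Generated Art is new. Many ai-generated arts exist."

substring_hits = count_substring(sample, "AI-Generated Art")
whole_word_hits = count_whole_word(sample, "AI-Generated Art")

print(substring_hits)   # 2: "arts" also contains the keyword
print(whole_word_hits)  # 1: only the standalone phrase is counted
```

Either helper could take the place of the `re.findall` call inside the paragraph loop, depending on which behavior you need.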

How to Count Word Frequency by Paragraph in a Word Document

When it comes to tasks like sensitive word detection or content auditing, it's crucial to perform a more granular analysis of word frequency. In this section, you’ll learn how to count word frequency by paragraph in a Word document, which gives you deeper insight into how specific terms are distributed across your content. Let’s walk through the steps and see a code example in action.

Steps to count the frequency of words by paragraph in Word files:

Instantiate a Document object and load a Word document from a file.
Specify the keyword to search for.
Loop through each section and each paragraph in the document.
Find and count the occurrence of the keyword using regular expressions.
Print out the count for each paragraph where the keyword appears and the total number of occurrences.

Use the following code to calculate the frequency of "AI-Generated Art" by paragraphs in a Word document:

import re
from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Load a Word document
document.LoadFromFile("E:/Administrator/Python1/input/AI.docx")

# Customize the keyword to find
keyword = "AI-Generated Art"

# Initialize variables
total_count = 0
paragraph_index = 1

# Loop through sections and paragraphs
sections = document.Sections
for i in range(sections.Count):
    section = sections.get_Item(i)
    paragraphs = section.Paragraphs
    for j in range(paragraphs.Count):
        paragraph = paragraphs.get_Item(j)
        text = paragraph.Text

        # Find all occurrences of the keyword while ignoring case
        count = len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))

        # Print the result
        if count > 0:
            print(f'Paragraph {paragraph_index}: "{keyword}" appears {count} time(s)')
            total_count += count
        paragraph_index += 1

# Print the total count
print(f'\nTotal occurrences in all paragraphs: {total_count}')
document.Close()

Count Word Frequency by Paragraphs Using Python

To Wrap Up

This guide demonstrates how to count the frequency of specific words across an entire Word document, by section, and by paragraph using Python. Whether you're analyzing long reports, filtering sensitive terms, or building smart document tools, automating the task with Spire.Doc for Python can save time and boost accuracy. Give these techniques a try in your own projects and take full control of your Word document content.

FAQs about Counting the Frequency of Words

Q1: How to count the number of times a word appears in Word?

A: You can count word frequency in Word manually using the “Find” feature, or automatically using Python and libraries like Spire.Doc. This lets you scan the entire document or target specific sections or paragraphs.

Q2: Can I analyze word frequency across multiple Word files?

A: Yes. Use a Python loop to load each document in turn, apply the same word-count logic to every file, and aggregate the results. This works well for batch processing or document audits.
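
As a rough sketch of that batch approach, the counting logic can be kept separate from the file loading so it is easy to reuse. The helper names below (`count_keyword`, `count_in_texts`, `load_texts`) are hypothetical. `load_texts` assumes that `Document.GetText()` returns the full text of a loaded file; it is not exercised in the demo at the bottom.

```python
import re
from typing import Dict, Iterable


def count_keyword(text: str, keyword: str) -> int:
    """Count case-insensitive occurrences of keyword in text."""
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))


def count_in_texts(texts_by_name: Dict[str, str], keyword: str) -> Dict[str, int]:
    """Return a per-file count for already extracted document texts."""
    return {name: count_keyword(text, keyword) for name, text in texts_by_name.items()}


def load_texts(paths: Iterable[str]) -> Dict[str, str]:
    """Hypothetical loader: read each Word file's text with Spire.Doc."""
    # Imported lazily so the counting helpers also work without Spire.Doc installed
    from spire.doc import Document

    texts = {}
    for path in paths:
        document = Document()
        document.LoadFromFile(path)
        texts[path] = document.GetText()  # assumption: returns the whole document text
        document.Close()
    return texts


# Demo with in-memory strings standing in for extracted document text
texts = {"a.docx": "AI-Generated Art. ai-generated art!", "b.docx": "Nothing here."}
counts = count_in_texts(texts, "AI-Generated Art")
total = sum(counts.values())
print(counts, total)
```

With real files, `count_in_texts(load_texts(paths), keyword)` would produce the same kind of per-file dictionary.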

Published in Text

Tagged under

doc Python Text

Python: Integrate Excel Tables into Word Documents

2025-04-09 01:04:52 Written by Koohji

Modern workflows often span multiple platforms: while analysts work with data in Excel, polished reports are created in Word. Manually copying data between these documents can lead to errors, version conflicts, and inconsistent formatting. Python-driven automation provides an efficient solution by seamlessly integrating Excel's data capabilities with Word's formatting strengths. This integration ensures data integrity, reduces repetitive formatting, and accelerates report creation for financial, academic, and compliance-related tasks.

This article explores how to use Spire.Office for Python to insert Excel tables into Word documents using Python code.

Read Excel Data and Insert It into Word Documents
Copy Data and Formatting from Excel to Word
Integrate Excel Worksheets as OLE into Word Documents

Install Spire.Office for Python

This scenario requires Spire.Office for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

Package Manager

pip install Spire.Office

Read Excel Data and Insert It into Word Documents

With Spire.XLS for Python, developers can extract data from Excel worksheets while preserving number formatting using the CellRange.NumberText property. The extracted data can then be inserted into a Word table created using Spire.Doc for Python. This method is ideal for simple Excel worksheets and cases requiring table reformatting.

Steps to Read Excel Data and Insert It into Word:

Create an instance of the Workbook class and load an Excel file using the Workbook.LoadFromFile() method.
Retrieve the worksheet using the Workbook.Worksheets.get_Item() method and obtain the used cell range with the Worksheet.AllocatedRange property.
Initialize a Document instance to create a Word document.
Add a section using the Document.AddSection() method and insert a table using the Section.AddTable() method.
Define the number of rows and columns based on the used cell range with the Table.ResetCells() method.
Iterate through the rows and columns of the used cell range.
Retrieve the corresponding table cell using the Table.Rows.get_Item().Cells.get_Item() method and add a paragraph using the TableCell.AddParagraph() method.
Extract the cell value using the CellRange.get_Item().NumberText property and append it to the paragraph using the Paragraph.AppendText() method.
Apply the required formatting to the Word table.
Save the Word document using the Document.SaveToFile() method.

Python

from spire.doc import Document, AutoFitBehaviorType, FileFormat, DefaultTableStyle
from spire.xls import Workbook

# Specify the file names
excel_file = "Sample.xlsx"
word_file = "output/ExcelDataToWord.docx"

# Create a Workbook instance
workbook = Workbook()
# Load the Excel file
workbook.LoadFromFile(excel_file)

# Get the first worksheet
sheet = workbook.Worksheets.get_Item(0)
# Get the used cell range in the first worksheet
allocatedRange = sheet.AllocatedRange

# Create a Document instance
doc = Document()

# Add a section to the document and add a table to the section
section = doc.AddSection()
table = section.AddTable()

# Reset the number of rows and columns in the Word table to match the number of rows and columns in the Excel worksheet
table.ResetCells(allocatedRange.RowCount, allocatedRange.ColumnCount)

# Loop through each row and column in the used cell range
for rowIndex in range(allocatedRange.RowCount):
    # Loop through each column in the row
    for colIndex in range(allocatedRange.ColumnCount):
        # Get the corresponding cell in the Word table and add a paragraph to it
        cell = table.Rows.get_Item(rowIndex).Cells.get_Item(colIndex)
        paragraph = cell.AddParagraph()
        # Append the cell value to the Word table
        paragraph.AppendText(allocatedRange.get_Item(rowIndex + 1, colIndex + 1).NumberText)

# Auto-fit the table to the window and apply a table style
table.AutoFit(AutoFitBehaviorType.AutoFitToWindow)
table.ApplyStyle(DefaultTableStyle.GridTable1LightAccent6)

# Save the Word document
doc.SaveToFile(word_file, FileFormat.Docx2019)

# Dispose resources
doc.Dispose()
workbook.Dispose()

Read Excel Data and Write it Into a Word Document Table with Spire.Doc

Copy Data and Formatting from Excel to Word

Spire.XLS for Python and Spire.Doc for Python can also be used together to copy both data and formatting from Excel to Word, preserving the table's original structure and appearance.

To handle format preservation, two helper methods are needed:

MergeCells: Merges table cells in Word according to the merged cells in the Excel worksheet.
CopyFormatting: Copies Excel cell formatting (font style, background color, horizontal and vertical alignment) to the Word table.

Steps to Copy Data and Formatting:

Create a Workbook instance and load an Excel file using the Workbook.LoadFromFile() method.
Retrieve a worksheet using the Workbook.Worksheets.get_Item() method.
Initialize a Document instance and add a section with the Document.AddSection() method.
Insert a table using the Section.AddTable() method.
Adjust the table’s structure based on the worksheet using the Table.ResetCells() method.
Apply cell merging using the MergeCells() method.
Iterate through each worksheet row and set row heights using the Table.Rows.get_Item().Height property.
For each column in a row:
- Retrieve worksheet cells using the Worksheet.Range.get_Item() method and table cells using the TableRow.Cells.get_Item() method.
- Extract cell data using the CellRange.NumberText property and append it to the table cell using the TableCell.AddParagraph().AppendText() method.
- Apply formatting using the CopyFormatting() method.
Save the Word document using the Document.SaveToFile() method.

Python

from spire.xls import Workbook, HorizontalAlignType, ExcelPatternType, VerticalAlignType
from spire.doc import Document, Color, HorizontalAlignment, VerticalAlignment, PageOrientation, FileFormat

def MergeCells(worksheet, wordTable):
    # Check if there are merged cells
    if not worksheet.HasMergedCells:
        return
    for cell_range in worksheet.MergedCells:
        start_row, start_col = cell_range.Row, cell_range.Column
        row_count, col_count = cell_range.RowCount, cell_range.ColumnCount
        # Process horizontal merging
        if col_count > 1:
            for row in range(start_row, start_row + row_count):
                wordTable.ApplyHorizontalMerge(row - 1, start_col - 1, start_col - 1 + col_count - 1)
        # Process vertical merging
        if row_count > 1:
            wordTable.ApplyVerticalMerge(start_col - 1, start_row - 1, start_row - 1 + row_count - 1)

def CopyFormatting(tableTextRange, excelCell, wordCell):
    # Copy font styles
    font = excelCell.Style.Font
    tableTextRange.CharacterFormat.TextColor = Color.FromRgb(font.Color.R, font.Color.G, font.Color.B)
    tableTextRange.CharacterFormat.FontSize = float(font.Size)
    tableTextRange.CharacterFormat.FontName = font.FontName
    tableTextRange.CharacterFormat.Bold = font.IsBold
    tableTextRange.CharacterFormat.Italic = font.IsItalic
    # Copy background colors
    if excelCell.Style.FillPattern != ExcelPatternType.none:
        wordCell.CellFormat.BackColor = Color.FromRgb(excelCell.Style.Color.R, excelCell.Style.Color.G,
                                                      excelCell.Style.Color.B)
    # Copy the horizontal alignment
    hAlignMap = {
        HorizontalAlignType.Left: HorizontalAlignment.Left,
        HorizontalAlignType.Center: HorizontalAlignment.Center,
        HorizontalAlignType.Right: HorizontalAlignment.Right
    }
    if excelCell.HorizontalAlignment in hAlignMap:
        tableTextRange.OwnerParagraph.Format.HorizontalAlignment = hAlignMap[excelCell.HorizontalAlignment]
    # Copy the vertical alignment
    vAlignMap = {
        VerticalAlignType.Top: VerticalAlignment.Top,
        VerticalAlignType.Center: VerticalAlignment.Middle,
        VerticalAlignType.Bottom: VerticalAlignment.Bottom
    }
    if excelCell.VerticalAlignment in vAlignMap:
        wordCell.CellFormat.VerticalAlignment = vAlignMap[excelCell.VerticalAlignment]

# Specify the file names
excelFileName = "Sample.xlsx"
wordFileName = "output/ExcelDataFormatToWord.docx"

# Create a Workbook instance and load the Excel file
workbook = Workbook()
workbook.LoadFromFile(excelFileName)

# Get a worksheet
sheet = workbook.Worksheets.get_Item(0)

# Create a Document instance
doc = Document()
# Add a section to the document and set the page orientation
section = doc.AddSection()
section.PageSetup.Orientation = PageOrientation.Landscape

# Add a table to the section
table = section.AddTable()
# Set the number of rows and columns according to the number of rows and columns in the Excel worksheet
table.ResetCells(sheet.LastRow, sheet.LastColumn)

# Execute the MergeCells method to merge cells
MergeCells(sheet, table)

# Iterate through each row and column in the Excel worksheet
for r in range(1, sheet.LastRow + 1):
    tableRow = table.Rows.get_Item(r - 1)
    tableRow.Height = float(sheet.Rows.get_Item(r - 1).RowHeight)
    for c in range(1, sheet.LastColumn + 1):
        # Get the corresponding cell in the Excel worksheet and the cell in the Word table
        eCell = sheet.Range.get_Item(r, c)
        wCell = table.Rows.get_Item(r - 1).Cells.get_Item(c - 1)
        # Append the cell value to the Word table
        textRange = wCell.AddParagraph().AppendText(eCell.NumberText)
        # Copy the cell formatting
        CopyFormatting(textRange, eCell, wCell)

# Save the Word document
doc.SaveToFile(wordFileName, FileFormat.Docx2019)
doc.Dispose()
workbook.Dispose()

Copy Excel Worksheet Data and Formatting to Word Documents Using Python
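
The index arithmetic in MergeCells is easy to get wrong, because the Excel range values are 1-based while the Word table indices are 0-based. The sketch below reproduces only that arithmetic in plain Python so it can be checked without either library. The function name `plan_merges` and the tuple layout are illustrative assumptions.

```python
from typing import List, Tuple


def plan_merges(start_row: int, start_col: int,
                row_count: int, col_count: int) -> List[Tuple[str, int, int, int]]:
    """Translate a 1-based Excel merged range into 0-based Word merge calls.

    Each "horizontal" entry is (kind, row, first_col, last_col);
    the "vertical" entry is (kind, col, first_row, last_row).
    """
    calls = []
    # One horizontal merge per row that the merged range covers
    if col_count > 1:
        for row in range(start_row, start_row + row_count):
            calls.append(("horizontal", row - 1, start_col - 1, start_col + col_count - 2))
    # One vertical merge along the first column of the range
    if row_count > 1:
        calls.append(("vertical", start_col - 1, start_row - 1, start_row + row_count - 2))
    return calls


# Example: the Excel range B2:C3 starts at row 2, column 2 and spans 2 rows and 2 columns
plan = plan_merges(2, 2, 2, 2)
for call in plan:
    print(call)
```

For B2:C3 this yields two horizontal merges (table rows 1 and 2, columns 1 to 2) followed by one vertical merge in column 1 across rows 1 to 2, which mirrors the order of calls in MergeCells.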

Integrate Excel Worksheets as OLE into Word Documents

Beyond copying data and formatting, Excel worksheets can be embedded as OLE objects in Word documents. This approach enables full worksheet visualization and allows users to edit Excel data directly from the Word document.

Using the Paragraph.AppendOleObject(filename: str, DocPicture, OleObjectType.ExcelWorksheet) method, developers can easily insert an Excel file as an OLE object.

Steps to Insert an Excel Worksheet as an OLE Object:

Create a Workbook instance and load an Excel file using the Workbook.LoadFromFile() method.
Retrieve a worksheet using the Workbook.Worksheets.get_Item() method and save it as an image using the Worksheet.ToImage().Save() method.
Initialize a Document instance to create a Word document.
Add a section using the Document.AddSection() method and insert a paragraph using the Section.AddParagraph() method.
Create a DocPicture instance and load the saved image using the DocPicture.LoadImage() method.
Resize the image according to the page layout using the DocPicture.Width property.
Insert the Excel file as an OLE object into the paragraph using the Paragraph.AppendOleObject() method.
Set the DocOleObject.DisplayAsIcon property to False to ensure that the OLE object updates dynamically after worksheet edits.
Save the Word document using the Document.SaveToFile() method.

Python

from spire.doc import Document, DocPicture, FileFormat, OleObjectType
from spire.xls import Workbook

# Specify the file path and names
excelFileName = "Sample.xlsx"
wordFileName = "output/ExcelOleToWord.docx"
tempImageName = "SheetImage.png"

# Create a Workbook instance and load the Excel file
workbook = Workbook()
workbook.LoadFromFile(excelFileName)

# Save the first worksheet as an image
sheet = workbook.Worksheets.get_Item(0)
sheet.ToImage(1, 1, sheet.LastRow, sheet.LastColumn).Save(tempImageName)

# Initialize a Document instance to create a Word document
doc = Document()
# Add a section to the document and add a paragraph to the section
section = doc.AddSection()
paragraph = section.AddParagraph()

# Create a DocPicture instance and load the image
pic = DocPicture(doc)
pic.LoadImage(tempImageName)

# Set the image width
pic.Width = section.PageSetup.PageSize.Width - section.PageSetup.Margins.Left - section.PageSetup.Margins.Right

# Insert the Excel file into the Word document as an OLE object and set the saved image as the display image
ole = paragraph.AppendOleObject(excelFileName, pic, OleObjectType.ExcelWorksheet)
# Set to not display the OLE object as an icon
ole.DisplayAsIcon = False

# Save the Word document
doc.SaveToFile(wordFileName, FileFormat.Docx2019)
workbook.Dispose()
doc.Dispose()

Excel Worksheets Inserted into Word Documents as OLE Object with Python
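
The width assigned to the picture above is the page width minus the left and right margins. If the height also has to be set explicitly to keep the aspect ratio, the calculation might look like the sketch below. The function names and the sample numbers are assumptions for illustration only, and the source does not state whether setting the width alone already rescales the height.

```python
from typing import Tuple


def usable_width(page_width: float, left_margin: float, right_margin: float) -> float:
    """Width available between the left and right page margins."""
    return page_width - left_margin - right_margin


def scale_to_width(image_width: float, image_height: float,
                   target_width: float) -> Tuple[float, float]:
    """Return (width, height) scaled to target_width with the aspect ratio kept."""
    if image_width <= 0:
        raise ValueError("image_width must be positive")
    factor = target_width / image_width
    return target_width, image_height * factor


# Illustrative numbers, not read from any real page setup or image
width = usable_width(600.0, 50.0, 50.0)
size = scale_to_width(1000.0, 400.0, width)
print(width, size)  # 500.0 (500.0, 200.0)
```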

Get a Free License

To fully experience the capabilities of Spire.Office for Python without any evaluation limitations, you can request a free 30-day trial license.

Published in Table

Tagged under

doc Python Table

Python: Apply Emphasis Mark to Text in Word

2025-04-03 01:09:58 Written by Koohji

Word documents often contain extensive text, and applying emphasis marks is an effective way to highlight key information. Whether you need to accentuate important terms or enhance text clarity with styled formatting, emphasis marks can make your content more readable and professional. Instead of manually adjusting formatting, this guide demonstrates how to use Spire.Doc for Python to efficiently apply emphasis to text in Word with Python, saving time while ensuring a polished document.

Apply Emphasis Marks to First Matched Text
Apply Emphasis Marks to All Matched Text
Apply Emphasis Marks to Text with Regular Expression

Install Spire.Doc for Python

This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

Package Manager

pip install Spire.Doc

If you are unsure how to install, please refer to this tutorial: How to Install Spire.Doc for Python on Windows.

Apply Emphasis Marks to First Matched Text in Word Documents

When crafting a Word document, highlighting keywords or phrases can improve readability and draw attention to important information. With Spire.Doc's CharacterFormat.EmphasisMark property, you can easily apply emphasis marks to any text, ensuring clarity and consistency.

Steps to apply emphasis marks to the first matched text in a Word document:

Create an object of the Document class.
Load the source Word document using the Document.LoadFromFile() method.
Find the text that you want to emphasize with the Document.FindString() method.
Apply emphasis marks to the text through the CharacterFormat.EmphasisMark property.
Save the updated Word document using the Document.SaveToFile() method.

Below is the code example showing how to emphasize the first matching text of "AI-Generated Art" in a Word document:

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()
doc.LoadFromFile("/AI-Generated Art.docx")

# Find the first occurrence of the text (case-sensitive, whole word)
matchingtext = doc.FindString("AI-Generated Art", True, True)

# Apply the emphasis mark to the matched text
matchingtext.GetAsOneRange().CharacterFormat.EmphasisMark = Emphasis.CommaAbove

# Save the document as a new one
doc.SaveToFile("/ApplyEmphasisMark_FirstMatch.docx", FileFormat.Docx2013)
doc.Close()

Apply Emphasis Marks to the First Matched Text in Word Documents
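
The example above assumes the text is present in the document. A slightly more defensive variant, sketched below, checks the search result before formatting it. It assumes that FindString returns None when nothing matches; `emphasize_first_match` is a hypothetical helper, and `_FakeDoc` is a stand-in used only so the sketch can run without a real file.

```python
from types import SimpleNamespace


def emphasize_first_match(doc, text, mark) -> bool:
    """Apply an emphasis mark to the first match; return False if nothing matched."""
    selection = doc.FindString(text, True, True)
    if selection is None:  # assumption: no match is reported as None
        return False
    selection.GetAsOneRange().CharacterFormat.EmphasisMark = mark
    return True


class _FakeDoc:
    """Stand-in for a loaded Document, used only to exercise the helper."""

    def __init__(self, selection):
        self._selection = selection

    def FindString(self, text, case_sensitive, whole_word):
        return self._selection


# Build a fake selection whose character format can be inspected afterwards
fmt = SimpleNamespace(EmphasisMark=None)
hit = SimpleNamespace(GetAsOneRange=lambda: SimpleNamespace(CharacterFormat=fmt))

found = emphasize_first_match(_FakeDoc(hit), "AI-Generated Art", "dot")
missing = emphasize_first_match(_FakeDoc(None), "AI-Generated Art", "dot")
print(found, missing)
```

With a real document you would pass the loaded Document object and an Emphasis value in place of the stand-ins.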

Apply Emphasis Marks to All Matched Text in Word Files

In the previous section, we demonstrated how to add an emphasis mark to the first matched text. Now, let's take it a step further—how can we emphasize all occurrences of a specific text? The solution is simple: use the Document.FindAllString() method to locate all matches and then apply emphasis marks using the CharacterFormat.EmphasisMark property. Below, you'll find detailed steps and code examples to guide you through the process.

Steps to apply emphasis marks to all matched text:

Create an instance of the Document class.
Read a Word file through the Document.LoadFromFile() method.
Find all the matching text using the Document.FindAllString() method.
Loop through all occurrences and apply the emphasis effect to the text through the CharacterFormat.EmphasisMark property.
Save the modified Word document through the Document.SaveToFile() method.

The following code demonstrates how to apply emphasis to all occurrences of "AI-Generated Art" while ignoring case sensitivity:

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()
doc.LoadFromFile("/AI-Generated Art.docx")

# Find all occurrences of the text (case-insensitive, whole word)
textselections = doc.FindAllString("AI-Generated Art", False, True)

# Loop through the text selections and apply the emphasis mark to the text
for textselection in textselections:
    textselection.GetAsOneRange().CharacterFormat.EmphasisMark = Emphasis.CircleAbove

# Save the document as a new one
doc.SaveToFile("/ApplyEmphasisMark_AllMatch.docx", FileFormat.Docx2013)
doc.Close()

Python to Emphasize All Matched Text in Word Documents

Apply Emphasis Marks to Text in Word Documents with Regular Expression

Sometimes, the text you want to highlight may vary but follow a similar structure, such as email addresses, phone numbers, dates, or patterns like two to three words followed by special symbols (#, *, etc.). The best way to identify such text is by using regular expressions. Once located, you can apply emphasis marks using the same method. Let's go through the steps!

Steps to apply emphasis marks to text using regular expressions:

Create a Document instance.
Load a Word document from local storage using the Document.LoadFromFile() method.
Find the text that you want to emphasize with the Document.FindAllPattern() method.
Iterate through all occurrences and apply the emphasis effect to the text through the CharacterFormat.EmphasisMark property.
Save the resulting Word file through the Document.SaveToFile() method.

The code example below shows how to emphasize "AI" and the word after it in a Word document:

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()
doc.LoadFromFile("/AI-Generated Art.docx")

# Match "AI" and the next word using regular expression
pattern = Regex(r"AI\s+\w+")

# Find all matching text
textSelections = doc.FindAllPattern(pattern)

# Loop through all the matched text and apply an emphasis mark
for selection in textSelections:
    selection.GetAsOneRange().CharacterFormat.EmphasisMark = Emphasis.DotBelow

# Save the document as a new one
doc.SaveToFile("/ApplyEmphasisMark_Regex.docx", FileFormat.Docx2013)
doc.Close()

Apply Emphasis Marks to Text Using Regular Expressions in Word
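
Before running a pattern against a document, it can help to preview what it matches on a sample string. The sketch below uses Python's built-in re module for that. The helper name `preview_matches` and the sample sentence are illustrative, and the Regex class used with FindAllPattern may differ from re in some details, so treat this as a quick sanity check rather than a guarantee.

```python
import re
from typing import List


def preview_matches(text: str, pattern: str) -> List[str]:
    """Return every substring of text matched by pattern, in order."""
    return [match.group(0) for match in re.finditer(pattern, text)]


# Illustrative sample text
sample = "AI art and AI-Generated Art, plus AI tools."
matches = preview_matches(sample, r"AI\s+\w+")
print(matches)  # ['AI art', 'AI tools']
```

Note that "AI-Generated" is not matched, because the pattern requires whitespace directly after "AI".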

Get a Free License

To fully experience the capabilities of Spire.Doc for Python without any evaluation limitations, you can request a free 30-day trial license.

Published in Text

Tagged under

doc Python Text

Python: Add Border to Text and Paragraphs in Word

2025-04-03 01:00:21 Written by Koohji

Adding borders to specific text and paragraphs in Word documents is an effective way to highlight key information and improve the document's structure. Whether it's important terms or entire sections, borders help them stand out. In this guide, we'll show you how to use Spire.Doc for Python to add borders to text and paragraphs in Word with Python, boosting both the readability and professionalism of your document while saving you time from manual formatting.

Add Border to Text in Word Documents
Add Border to Paragraphs in Word Documents

Install Spire.Doc for Python

This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

Package Manager

pip install Spire.Doc

If you are unsure how to install, please refer to this tutorial: How to Install Spire.Doc for Python on Windows.

Add a Border to Text in Word Documents with Python

In Word documents, important information like technical terms, company names, or legal clauses can be highlighted with borders to draw readers' attention. Using Python, you can locate the required text with the Document.FindAllString() method and apply borders using the CharacterFormat.Border.BorderType property. Here's a step-by-step guide to help you do this efficiently.

Steps to add borders to all matched text in a Word document:

Create an object of the Document class.
Load the source Word document using the Document.LoadFromFile() method.
Find all occurrences of the specified text through the Document.FindAllString() method.
Loop through all matched text and get each match as a text range.
Add a border to the text with the CharacterFormat.Border.BorderType property.
Customize the color of the border through the CharacterFormat.Border.Color property.
Save the modified document with the Document.SaveToFile() method.

The code example below shows how to add a border to all occurrences of "AI-Generated Art":

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()
doc.LoadFromFile("/AI-Generated Art.docx")

# Set the target text
target_text = "AI-Generated Art"
# Find all matches (case-insensitive, whole word)
text_selections = doc.FindAllString(target_text, False, True)

# Loop through the text selections
for selection in text_selections:
    text_range = selection.GetAsOneRange()
    # Add a border to the text
    text_range.CharacterFormat.Border.BorderType = BorderStyle.Single
    # Set the border color
    text_range.CharacterFormat.Border.Color = Color.get_Blue()

# Save the resulting document
doc.SaveToFile("/AddBorder_Text.docx", FileFormat.Docx2013)
doc.Close()

Add Border to Text in Word Documents Using Python

Add a Border to Paragraphs in Word Files Using Python

Important clauses or legal statements in contracts, summaries in reports, and quotations in papers often require adding borders to paragraphs for emphasis or distinction. Unlike text borders, adding a border to a paragraph involves finding the target paragraph by its index and then using the Format.Borders.BorderType property. Let's check out the detailed instructions.

Steps to add a border to paragraphs in Word documents:

Create a Document instance.
Read a Word document through the Document.LoadFromFile() method.
Get the target paragraph by index through Document.Sections[].Paragraphs[].
Add a border to the paragraph using Format.Borders.BorderType.
Set the color of the border.
Save the resulting Word file through the Document.SaveToFile() method.

Here is an example showing how to add a border to the fourth paragraph in a Word document:

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()
doc.LoadFromFile("/AI-Generated Art.docx")

# Get the fourth paragraph
paragraph = doc.Sections[0].Paragraphs[3]

# Add a border to the paragraph
borders = paragraph.Format.Borders
borders.BorderType(BorderStyle.DotDotDash)
borders.Color(Color.get_Blue())

# Save the updated document
doc.SaveToFile("/AddBorder_Paragraph.docx", FileFormat.Docx2013)
doc.Close()

Add Border to Paragraphs in Word Files with Python
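
Fetching a paragraph by a fixed index fails if the document has fewer paragraphs than expected. One possible guard, sketched below, checks the collection sizes first, using the Count and get_Item members that appear in the counting examples earlier on this page. `get_paragraph` is a hypothetical helper, and `_FakeCollection` is a stand-in used only so the sketch can run without a real document.

```python
from types import SimpleNamespace


def get_paragraph(doc, section_index: int, paragraph_index: int):
    """Return the requested paragraph, or None if either index is out of range."""
    sections = doc.Sections
    if not 0 <= section_index < sections.Count:
        return None
    paragraphs = sections.get_Item(section_index).Paragraphs
    if not 0 <= paragraph_index < paragraphs.Count:
        return None
    return paragraphs.get_Item(paragraph_index)


class _FakeCollection:
    """Stand-in for a section or paragraph collection, for demonstration only."""

    def __init__(self, items):
        self._items = list(items)

    @property
    def Count(self):
        return len(self._items)

    def get_Item(self, index):
        return self._items[index]


# A fake document with one section holding four paragraphs
section = SimpleNamespace(Paragraphs=_FakeCollection(["p0", "p1", "p2", "p3"]))
doc = SimpleNamespace(Sections=_FakeCollection([section]))

fourth = get_paragraph(doc, 0, 3)
out_of_range = get_paragraph(doc, 0, 4)
print(fourth, out_of_range)
```

If the helper returns None, the border step can be skipped or reported instead of raising an error.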

Get a Free License

To fully experience the capabilities of Spire.Doc for Python without any evaluation limitations, you can request a free 30-day trial license.

Published in Paragraph

Tagged under

doc Python Paragraph
