
Modern business systems, from retail checkout lanes to warehouse inventory tracking, rely heavily on barcode scanning, and Python-based solutions have become a popular choice due to their versatility and ease of use. In this article, we’ll explore how to read barcodes in Python using the Spire.Barcode for Python library, covering setup, scanning from image files or bytes, and customization options for improved accuracy.
Table of Contents:
- Python Library for Reading Barcodes
- Integrate Spire.Barcode into Your Python Application
- Read a Barcode from an Image File
- Read Multiple Barcodes from an Image File
- Read Barcodes from Image Bytes
- Adjust Barcode Recognition Settings
- Conclusion
- FAQs
Python Library for Reading Barcodes
Spire.Barcode for Python is a powerful library specifically crafted for creating and reading barcodes in Python applications. The library supports a variety of barcode formats, including:
- 1D Barcodes: such as Code 128, Code 39, EAN-13, and UPC-A.
- 2D Barcodes: including QR Code, DataMatrix, and PDF417.
Notable Features of Spire.Barcode
- Format Support: Capable of reading barcodes from various image formats, including PNG, JPG, BMP, GIF, and TIFF.
- Batch Scanning: Enables the detection of multiple barcodes within a single image file.
- Recognition Accuracy: Utilizes advanced algorithms to deliver reliable barcode detection.
- Customization: Allows users to specify barcode types and enable checksum verification for enhanced recognition efficiency.
This library provides the capability to read barcodes from both image files and bytes, along with extensive customization options to meet diverse requirements.
Integrate Spire.Barcode into Your Python Application
To get started with Spire.Barcode, you first need to install the library. You can do this via pip. Open your terminal and run:
pip install spire.barcode
Once you have installed the library, you will need a license key to unlock its full capabilities. You can obtain a trial license from our website, then apply it in your Python script:
from spire.barcode import *
License.SetLicenseKey("your license key")
Now that you have the library in place, you can begin reading barcodes using Python.
Read a Barcode from an Image File in Python
Reading a single barcode from an image file is straightforward with Spire.Barcode. Here's how you can do it:
from spire.barcode import *
# Apply license key to unlock full capabilities
License.SetLicenseKey("your license key")
# Read barcode from an image file
result = BarcodeScanner.ScanOneFile("C:/Users/Administrator/Desktop/qr_code.png")
# Print the result
print(result)
Explanation
- License.SetLicenseKey(): Initializes the library with your license key.
- BarcodeScanner.ScanOneFile(): Reads a single barcode from the specified image file.
- The result is printed to the console, displaying the barcode's data.

Read Multiple Barcodes from an Image File in Python
If you need to read multiple barcodes from a single image file, Spire.Barcode makes this easy as well. Here’s an example:
from spire.barcode import *
# Apply license key to unlock full capabilities
License.SetLicenseKey("your license key")
# Read multiple barcodes from an image file
results = BarcodeScanner.ScanFile("C:/Users/Administrator/Desktop/barcodes.jpg")
# Print results
print(results)
Explanation
- BarcodeScanner.ScanFile(): Scans the entire image for multiple barcodes.
- The results are stored as a list. Each element in the list contains the data from a detected barcode.
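To run the same scan over many images, one option is a small helper that walks a folder and collects the results per file. The sketch below is an illustration rather than part of Spire.Barcode: scan_folder and its scan_func parameter are hypothetical names, and it assumes scan_func behaves like BarcodeScanner.ScanFile, taking a file path and returning a list of decoded strings.

```python
import os
from typing import Callable, Dict, List

# Image extensions to consider; adjust to the formats you actually use.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tif", ".tiff")


def scan_folder(folder: str, scan_func: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Apply scan_func to every image in folder and map file name -> results.

    scan_func is assumed to accept a file path and return a list of decoded
    barcode strings (for example, BarcodeScanner.ScanFile).
    """
    results = {}
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            path = os.path.join(folder, name)
            # Convert to a plain list so callers get a consistent type
            results[name] = list(scan_func(path))
    return results
```

With the library installed, a call might look like scan_folder("C:/scans", BarcodeScanner.ScanFile).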

Read Barcodes from Image Bytes in Python
In addition to reading barcodes directly from files, Spire.Barcode for Python supports decoding barcodes from in-memory image bytes. This approach is useful when working with dynamically loaded images (e.g., from APIs, databases, or user uploads).
Here’s how to do it:
from spire.barcode import *
# Apply license key to unlock full capabilities
License.SetLicenseKey("your license key")
# Read an image file into bytes
image_path = "C:/Users/Administrator/Desktop/barcodes.jpg"
with open(image_path, "rb") as file:
    image_bytes = file.read()
# Wrap bytes in Spire.Barcode's Stream object
stream = Stream(image_bytes)
# Read one barcode from stream
# result = BarcodeScanner.ScanOneStream(stream)
# Read multiple barcodes from stream
results = BarcodeScanner.ScanStream(stream)
# Print results
print(results)
Explanation
- image_bytes: Raw binary data read from an image file (e.g., PNG, JPG) or other sources like APIs or databases.
- Stream (Spire.Barcode class): Converts image_bytes into an in-memory stream compatible with Spire.Barcode’s scanner.
- BarcodeScanner.ScanStream(): Scans the stream for barcodes and returns a list of detected barcodes.
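When the image arrives from an API or a web form, it is often base64-encoded rather than raw bytes. The helper below is a minimal sketch (the name image_bytes_from_base64 is hypothetical) that decodes either a plain base64 string or a data: URI into bytes, which could then be wrapped in Stream(...) as in the example above.

```python
import base64


def image_bytes_from_base64(payload: str) -> bytes:
    """Decode a base64 string or a data URI (data:image/png;base64,...) to bytes."""
    if payload.startswith("data:"):
        # Keep only the part after the first comma, which holds the encoded data
        header, _, payload = payload.partition(",")
        if ";base64" not in header:
            raise ValueError("only base64-encoded data URIs are supported")
    # validate=True rejects characters outside the base64 alphabet
    return base64.b64decode(payload, validate=True)
```

Rejecting malformed input early keeps decoding errors separate from scanning errors, which can make failures easier to diagnose.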
Adjust Barcode Recognition Settings
The BarcodeScanner class provides various methods to customize barcode recognition settings. This can help improve detection accuracy and efficiency. Some of the key methods include:
- ScanOneFileBarCodeTypeIncludeCheckSum(fileName: str, barcodeType: BarCodeType, IncludeCheckSum: bool)
- ScanFileBarCodeTypeIncludeCheckSum(fileName: str, barcodeType: BarCodeType, IncludeCheckSum: bool)
- ScanOneStreamBarCodeTypeIncludeCheckSum(stream: Stream, barcodeType: BarCodeType, IncludeCheckSum: bool)
- ScanStreamBarCodeTypeIncludeCheckSum(stream: Stream, barcodeType: BarCodeType, IncludeCheckSum: bool)
Here’s an example of how to specify a barcode type and include checksum verification:
from spire.barcode import *
# Apply license key to unlock full capabilities
License.SetLicenseKey("your license key")
# Specify the barcode type (e.g., EAN13)
barcode_type = BarCodeType.EAN13
# Read a barcode from an image file with checksum included
result = BarcodeScanner.ScanOneFileBarCodeTypeIncludeCheckSum("C:/Users/Administrator/Desktop/EAN_13.png", barcode_type, True)
# Print the result
print(result)
Explanation
- BarCodeType: Specifies the type of barcode you want to scan.
- IncludeCheckSum (bool): Determines whether to verify the checksum during scanning. Setting it to True can help catch errors in data.
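To make the checksum idea concrete, here is a short, library-independent sketch of how an EAN-13 check digit is verified. It illustrates the standard EAN-13 rule and is not a description of Spire.Barcode's internals; the function name ean13_is_valid is hypothetical.

```python
def ean13_is_valid(code: str) -> bool:
    """Return True if code is 13 digits and its last digit is the correct check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits (left to right)
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    check = (10 - total % 10) % 10
    return check == digits[12]


print(ean13_is_valid("4006381333931"))  # correct check digit
print(ean13_is_valid("4006381333932"))  # last digit altered
```

A check like this can also be applied to decoded results after scanning, as a second line of defense against misreads.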
Conclusion
In this article, we explored how to read barcodes in Python using the Spire.Barcode library. We covered the setup process, reading single and multiple barcodes from image files, and reading from image bytes. Additionally, we discussed how to customize barcode detection settings for improved accuracy. With these tools at your disposal, you can easily integrate barcode scanning capabilities into your Python applications.
FAQs
Q1: What types of barcodes can I read with Spire.Barcode?
Spire.Barcode supports a wide range of barcode formats, including QR codes, UPC, EAN, Code 128, Code 39, and many others.
Q2: Do I need a license to use Spire.Barcode?
Yes, a license key is required to unlock the full functionality of the library. You can obtain a free 30-day trial license from our website.
Q3: Can I read barcodes from a webcam using Spire.Barcode?
While Spire.Barcode does not directly support webcam input, you can capture images from a webcam and then read barcodes from those images using the library.
Q4: How can I improve barcode scanning accuracy?
You can improve accuracy by specifying the barcode type and enabling checksum verification during scanning. Additionally, ensure that the images are clear and well-lit.
Q5: Can I generate barcodes using Spire.Barcode for Python?
Yes, Spire.Barcode supports both barcode recognition and generation. For detailed instructions, check out this tutorial: How to Generate Barcodes in Python: A Step-by-Step Guide.
Read Word DOC or DOCX Files in Python - Extract Text, Images, Tables and More
2025-06-30 01:41:17 Written by zaki zou
Reading Word documents in Python is a common task for developers who work with document automation, data extraction, or content processing. Whether you're working with modern .docx files or legacy .doc formats, being able to open, read, and extract content like text, tables, and images from Word files can save time and streamline your workflows.
While many Python libraries support .docx, reading .doc files—the older binary format—can be more challenging. Fortunately, there are reliable methods for handling both file types in Python.
In this tutorial, you'll learn how to read Word documents (.doc and .docx) in Python using the Spire.Doc for Python library. We'll walk through practical code examples to extract text, images, tables, comments, lists, and even metadata. Whether you're building an automation script or a full document parser, this guide will help you work with Word files effectively across formats.
Table of Contents
- Why Read Word Documents Programmatically in Python?
- Install the Library for Parsing Word Documents in Python
- Read Text from Word DOC or DOCX in Python
- Read Specific Elements from a Word Document in Python
- Conclusion
- FAQs
Why Read Word Documents Programmatically in Python?
Reading Word files using Python allows for powerful automation of content processing tasks, such as:
- Extracting data from reports, resumes, or forms.
- Parsing and organizing content into databases or dashboards.
- Converting or analyzing large volumes of Word documents.
- Integrating document reading into web apps, APIs, or back-end systems.
Programmatic reading eliminates manual copy-paste workflows and ensures consistent and scalable results.
Install the Library for Parsing Word Documents in Python
To read .docx and .doc files in Python, you need a library that can handle both formats. Spire.Doc for Python is a versatile and easy-to-use library that lets you extract text, images, tables, comments, lists, and metadata from Word documents. It runs independently of Microsoft Word, so Office installation is not required.
To get started, install Spire.Doc easily with pip:
pip install Spire.Doc
Read Text from Word DOC or DOCX in Python
Extracting text from Word documents is a common requirement in many automation and data processing tasks. Depending on your needs, you might want to read the entire content or focus on specific sections or paragraphs. This section covers both approaches.
Get Text from Entire Document
When you need to retrieve the complete textual content of a Word document — for tasks like full-text indexing or simple content export — you can use the Document.GetText() method. The following example demonstrates how to load a Word file, extract all text, and save it to a file:
from spire.doc import *
# Load the Word .docx or .doc file
document = Document()
document.LoadFromFile("sample.docx")
# Get all text
text = document.GetText()
# Save to a text file
with open("extracted_text.txt", "w", encoding="utf-8") as file:
    file.write(text)
document.Close()

Get Text from Specific Section or Paragraph
Many documents, such as reports or contracts, are organized into multiple sections. Extracting text from a specific section enables targeted processing when you need content from a particular part only. By iterating through the paragraphs of the selected section, you can isolate the relevant text:
from spire.doc import *
# Load the Word .docx or .doc file
document = Document()
document.LoadFromFile("sample.docx")
# Access the desired section
section = document.Sections[0]
# Get text from the paragraphs of the section
with open("paragraphs_output.txt", "w", encoding="utf-8") as file:
    for paragraph in section.Paragraphs:
        file.write(paragraph.Text + "\n")
document.Close()
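Each example in this tutorial ends with document.Close(), which is skipped if an exception occurs earlier in the script. One way to guarantee cleanup is a small context manager. The sketch below is an illustration under assumptions: closing_document is a hypothetical helper, and it relies only on the object having a Close() method, as Document does in the examples here.

```python
from contextlib import contextmanager


@contextmanager
def closing_document(document):
    """Yield document and call its Close() method on exit, even after an error."""
    try:
        yield document
    finally:
        document.Close()
```

Usage might look like "with closing_document(Document()) as document:" followed by document.LoadFromFile("sample.docx") and the extraction code inside the with block.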
Read Specific Elements from a Word Document in Python
Beyond plain text, Word documents often include rich content like images, tables, comments, lists, metadata, and more. These elements can easily be programmatically accessed and extracted.
Extract Images
Word documents often embed images like logos, charts, or illustrations. To extract these images:
- Traverse each paragraph and its child objects.
- Identify objects of type DocPicture.
- Retrieve the image bytes and save them as separate files.
from spire.doc import *
import os
# Load the Word document
document = Document()
document.LoadFromFile("sample.docx")
# Create a list to store image byte data
images = []
# Iterate over sections
for s in range(document.Sections.Count):
    section = document.Sections.get_Item(s)
    # Iterate over paragraphs
    for p in range(section.Paragraphs.Count):
        paragraph = section.Paragraphs.get_Item(p)
        # Iterate over child objects
        for c in range(paragraph.ChildObjects.Count):
            obj = paragraph.ChildObjects[c]
            # Extract image data
            if isinstance(obj, DocPicture):
                picture = obj
                # Get image bytes
                dataBytes = picture.ImageBytes
                # Store in the list
                images.append(dataBytes)
# Create the output directory if it doesn't exist
output_folder = "ExtractedImages"
os.makedirs(output_folder, exist_ok=True)
# Save each image from byte data
for i, item in enumerate(images):
    fileName = f"Image-{i+1}.png"
    with open(os.path.join(output_folder, fileName), 'wb') as imageFile:
        imageFile.write(item)
# Close the document
document.Close()
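The example above saves every image with a .png extension, although embedded pictures may be JPEG, GIF, or another format. A possible refinement is to pick the extension from the leading bytes of the data. The helper below is a sketch based on well-known file signatures; guess_image_extension is a hypothetical name, not part of Spire.Doc, and it assumes the image data is available as a bytes object.

```python
def guess_image_extension(data: bytes, default: str = ".png") -> str:
    """Guess a file extension from the leading signature bytes of image data."""
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"\xff\xd8\xff", ".jpg"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
        (b"II*\x00", ".tiff"),  # little-endian TIFF
        (b"MM\x00*", ".tiff"),  # big-endian TIFF
    ]
    for magic, extension in signatures:
        if data.startswith(magic):
            return extension
    # Unknown signature: fall back to the caller's default
    return default
```

In the saving loop, the file name could then be built as f"Image-{i+1}{guess_image_extension(item)}".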

Get Table Data
Tables organize data such as schedules, financial records, or lists. To extract all tables and their content:
- Loop through tables in each section.
- Loop through rows and cells in each table.
- Traverse over each cell’s paragraphs and combine their texts.
- Save the extracted table data in a readable text format.
from spire.doc import *
import os
# Load the Word document
document = Document()
document.LoadFromFile("tables.docx")
# Ensure output directory exists
output_dir = "output/Tables"
os.makedirs(output_dir, exist_ok=True)
# Loop through each section
for s in range(document.Sections.Count):
    section = document.Sections.get_Item(s)
    tables = section.Tables
    # Loop through each table in the section
    for i in range(tables.Count):
        table = tables.get_Item(i)
        table_data = ""
        # Loop through each row
        for j in range(table.Rows.Count):
            row = table.Rows.get_Item(j)
            # Loop through each cell
            for k in range(row.Cells.Count):
                cell = row.Cells.get_Item(k)
                cell_text = ""
                # Combine text from all paragraphs in the cell
                for p in range(cell.Paragraphs.Count):
                    para_text = cell.Paragraphs.get_Item(p).Text
                    cell_text += para_text + " "
                table_data += cell_text.strip()
                # Add tab between cells (except after the last cell)
                if k < row.Cells.Count - 1:
                    table_data += "\t"
            table_data += "\n"
        # Save the table data to a separate text file
        output_path = os.path.join(output_dir, f"WordTable_{s+1}_{i+1}.txt")
        with open(output_path, "w", encoding="utf-8") as output_file:
            output_file.write(table_data)
# Close the document
document.Close()
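Because the example writes cells separated by tabs and rows separated by newlines, the saved files can be read back as structured rows with Python's csv module. A minimal sketch, with read_table_rows as a hypothetical helper name and the sample text invented for illustration:

```python
import csv
import io


def read_table_rows(table_text: str) -> list:
    """Split tab-delimited table text into a list of rows (lists of cell strings)."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip blank lines so a trailing newline does not create an empty row
    return [row for row in reader if row]


sample = "Name\tQty\nWidget\t3\n"
print(read_table_rows(sample))
```

Note that this only round-trips cleanly when cell text itself contains no tabs or line breaks.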

Read Lists
Lists are frequently used to structure content in Word documents. This example identifies paragraphs formatted as list items and writes the list marker together with the text to a file.
from spire.doc import *
# Load the Word document
document = Document()
document.LoadFromFile("sample.docx")
# Open a text file for writing the list items
with open("list_items.txt", "w", encoding="utf-8") as output_file:
    # Iterate over sections
    for s in range(document.Sections.Count):
        section = document.Sections.get_Item(s)
        # Iterate over paragraphs
        for p in range(section.Paragraphs.Count):
            paragraph = section.Paragraphs.get_Item(p)
            # Check if the paragraph is a list
            if paragraph.ListFormat.ListType != ListType.NoList:
                # Write the combined list marker and paragraph text to file
                output_file.write(paragraph.ListText + paragraph.Text + "\n")
# Close the document
document.Close()
Extract Comments
Comments are typically used for collaboration and feedback in Word documents. This code retrieves all comments, including the author and content, and saves them to a file with clear formatting for later review or audit.
from spire.doc import *
# Load the Word .docx or .doc document
document = Document()
document.LoadFromFile("sample.docx")
# Open a text file to save comments
with open("extracted_comments.txt", "w", encoding="utf-8") as output_file:
    # Iterate over the comments
    for i in range(document.Comments.Count):
        comment = document.Comments.get_Item(i)
        # Write comment header with comment number
        output_file.write(f"Comment {i + 1}:\n")
        # Write comment author
        output_file.write(f"Author: {comment.Format.Author}\n")
        # Extract full comment text by concatenating all paragraph texts
        comment_text = ""
        for j in range(comment.Body.Paragraphs.Count):
            paragraph = comment.Body.Paragraphs[j]
            comment_text += paragraph.Text + "\n"
        # Write the comment text
        output_file.write(f"Content: {comment_text.strip()}\n")
        # Add a blank line between comments
        output_file.write("\n")
# Close the document
document.Close()
Retrieve Metadata (Document Properties)
Metadata provides information about the document such as author, title, creation date, and modification date. This code extracts common built-in properties for reporting or cataloging purposes.
from spire.doc import *
# Load the Word .docx or .doc document
document = Document()
document.LoadFromFile("sample.docx")
# Get the built-in document properties
props = document.BuiltinDocumentProperties
# Open a text file to write the properties
with open("document_properties.txt", "w", encoding="utf-8") as output_file:
    output_file.write(f"Title: {props.Title}\n")
    output_file.write(f"Author: {props.Author}\n")
    output_file.write(f"Subject: {props.Subject}\n")
    output_file.write(f"Created: {props.CreateDate}\n")
    output_file.write(f"Modified: {props.LastSaveDate}\n")
# Close the document
document.Close()
Conclusion
Reading both .doc and .docx Word documents in Python is fully achievable with the right tools. With Spire.Doc, you can:
- Read text from the entire document, any section or paragraph.
- Extract tables and process structured data.
- Export images embedded in the document.
- Extract comments and lists from the document.
- Work with both modern and legacy Word formats without extra effort.
Try Spire.Doc today to simplify your Word document parsing workflows in Python!
FAQs
Q1: How do I read a Word DOC or DOCX file in Python?
A1: Use a Python library like Spire.Doc to load and extract content from Word files.
Q2: Do I need Microsoft Word installed to use Spire.Doc?
A2: No, it works without any Office installation.
Q3: Can I generate or update Word documents with Spire.Doc?
A3: Yes, Spire.Doc not only allows you to read and extract content from Word documents but also provides powerful features to create, modify, and save Word files programmatically.
Get a Free License
To fully experience the capabilities of Spire.Doc for Python without any evaluation limitations, you can request a free 30-day trial license.
Python Tutorial: Delete Text Boxes in PowerPoint Automatically
2025-06-12 07:00:22 Written by Administrator
Text boxes are one of the most common elements used to display content in PowerPoint. However, as slides get frequently edited, you may end up with a clutter of unnecessary text boxes. Manually deleting them can be time-consuming. This guide will show you how to delete text boxes in PowerPoint using Python. Whether you want to delete all text boxes, remove a specific one, or clean up only the empty ones, you'll learn how to do it in just a few lines of code, saving time and making your workflow much more efficient.
- Install the Python Library
- Delete All Text Boxes
- Delete a Specific Text Box
- Delete Empty Text Boxes
- Compare All Three Methods
- FAQs
Install the Python Library for PowerPoint Automation
To make this task easier, installing the right Python library is essential. In this guide, we’ll use Spire.Presentation for Python to demonstrate how to automate the removal of text boxes in a PowerPoint file. As a standalone third-party component, Spire.Presentation doesn’t require Microsoft Office to be installed on your machine. Its API is simple and beginner-friendly, and installation is straightforward — just run:
pip install spire.presentation
Alternatively, you can download the package for custom installation. A free version is also available, which is great for small projects and testing purposes.
How to Delete All Text Boxes in PowerPoint
Let’s start by looking at how to delete all text boxes, a common need when you're cleaning up a PowerPoint template. Instead of adjusting each text box and its content manually, it's often easier to remove them all and then re-add only what you need. With Spire.Presentation, this takes just a few lines of code.
Steps to delete all text boxes in a PowerPoint presentation with Python:
- Create an instance of the Presentation class and load a sample PowerPoint file.
- Loop through all slides and all shapes on each slide, checking whether each shape is an IAutoShape and a text box.
- Remove each text box from its slide with slide.Shapes.Remove(), as the code below does.
- Save the modified PowerPoint file.
The following is a complete code example for deleting all text boxes in a PowerPoint presentation:
from spire.presentation import *
# Create a Presentation object and load a PowerPoint file
presentation = Presentation()
presentation.LoadFromFile("E:/Administrator/Python1/input/pre1.pptx")
# Loop through all slides
for slide in presentation.Slides:
    # Loop through all shapes in the slide
    for i in range(slide.Shapes.Count - 1, -1, -1):
        shape = slide.Shapes[i]
        # Check if the shape is IAutoShape and is a text box
        if isinstance(shape, IAutoShape) and shape.IsTextBox:
            # Remove the shape
            slide.Shapes.Remove(shape)
# Save the modified presentation
presentation.SaveToFile("E:/Administrator/Python1/output/RemoveAllTextBoxes.pptx", FileFormat.Pptx2013)
presentation.Dispose()

Tip: When looping through shapes, iterate in reverse order so that removing a shape does not shift the indices of shapes you have not yet visited.
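The effect is easy to reproduce with a plain Python list standing in for a slide's shape collection. In this sketch, remove_forward and remove_reverse are hypothetical helper names, and the strings are placeholders for shapes.

```python
def remove_forward(items, should_remove):
    """Buggy: deleting while iterating forward skips the element after each deletion."""
    items = list(items)
    i = 0
    while i < len(items):
        if should_remove(items[i]):
            del items[i]
        i += 1  # advances even after a deletion, so the next element is skipped
    return items


def remove_reverse(items, should_remove):
    """Correct: iterating from the end means deletions never shift unvisited indices."""
    items = list(items)
    for i in range(len(items) - 1, -1, -1):
        if should_remove(items[i]):
            del items[i]
    return items


def is_textbox(shape):
    return shape == "textbox"


shapes = ["textbox", "textbox", "picture", "textbox"]
print(remove_forward(shapes, is_textbox))  # a text box survives
print(remove_reverse(shapes, is_textbox))  # only the picture remains
```

The forward version misses the second of two adjacent text boxes, which is exactly the case the reverse loop avoids.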
How to Delete a Specific Text Box in PowerPoint
If you only need to remove a specific text box, for example a particular one on the second slide, this method is a good fit. In Python, you can first locate the target slide by its index, then identify the text box by its content, and finally remove it. This approach gives you precise control when you know exactly which text box needs to be deleted.
Steps to delete a specific text box in PowerPoint using Python:
- Create an object of the Presentation class and load a PowerPoint document.
- Get a slide by index through the Presentation.Slides collection.
- Loop through each shape on the slide and check whether it is the target text box.
- Remove the text box with slide.Shapes.Remove().
- Save the modified PowerPoint presentation.
The following code demonstrates how to delete a text box with the content "Text Box 1" on the second slide of the presentation:
from spire.presentation import *
# Create a new Presentation object and load a PowerPoint file
presentation = Presentation()
presentation.LoadFromFile("E:/Administrator/Python1/input/pre1.pptx")
# Get the second slide
slide = presentation.Slides[1]
# Loop through all shapes on the slide
for i in range(slide.Shapes.Count - 1, -1, -1):
    shape = slide.Shapes[i]
    # Check if the shape is a text box and its text is "Text Box 1"
    if isinstance(shape, IAutoShape) and shape.IsTextBox:
        if shape.TextFrame.Text.strip() == "Text Box 1":
            slide.Shapes.Remove(shape)
# Save the modified presentation
presentation.SaveToFile("E:/Administrator/Python1/output/RemoveSpecificTextbox.pptx", FileFormat.Pptx2013)
presentation.Dispose()
How to Delete Empty Text Boxes in PowerPoint
Another common scenario is removing all empty text boxes from a PowerPoint file, especially when you're cleaning up slides exported from other tools or merging multiple presentations and want to get rid of unused placeholders. Instead of checking each slide manually, you can automate the process with Python to remove all blank text boxes and keep only the meaningful content.
Steps to delete empty text boxes in a PowerPoint file using Python:
- Create an object of the Presentation class and load a PowerPoint file.
- Loop through all slides and all shapes on each slide.
- Check whether each shape is a text box and whether its text is empty.
- Remove each empty text box with slide.Shapes.Remove().
- Save the modified PowerPoint file.
Here's the code example that shows how to delete empty text boxes in a PowerPoint presentation:
from spire.presentation import *
# Create a Presentation instance and load a sample file
presentation = Presentation()
presentation.LoadFromFile("E:/Administrator/Python1/input/pre1.pptx")
# Loop through each slide
for slide in presentation.Slides:
    # Iterate through shapes
    for i in range(slide.Shapes.Count - 1, -1, -1):
        shape = slide.Shapes[i]
        # Check if the shape is a textbox and its text is empty
        if isinstance(shape, IAutoShape) and shape.IsTextBox:
            text = shape.TextFrame.Text.strip()
            # Remove the shape if it is empty
            if not text:
                slide.Shapes.Remove(shape)
# Save the result file
presentation.SaveToFile("E:/Administrator/Python1/output/RemoveEmptyTextBoxes.pptx", FileFormat.Pptx2013)
presentation.Dispose()

Compare All Three Methods: Which One Should You Use?
Each of the three methods we've discussed has its own ideal use case. If you're still unsure which one fits your needs after reading through them, the table below will help you compare them at a glance — so you can quickly pick the most suitable solution.
| Method | Best For | Keeps Valid Content? |
|---|---|---|
| Delete All Text Boxes | Cleaning up entire templates or resetting slides | ❌ No |
| Delete Specified Text Box | When you know exactly which text box to remove (e.g., slide 2, shape 1) | ✅ Yes |
| Delete Empty Text Boxes | Cleaning up imported or merged presentations | ✅ Yes |
Conclusion and Best Practice
Whether you're refreshing templates, fine-tuning individual slides, or cleaning up empty placeholders, automating PowerPoint with Python can save you hours of manual work. Choose the method that fits your workflow best — and start making your presentations cleaner and more efficient today.
FAQs about Deleting Text Boxes in PowerPoint
Q1: Why can't I delete a text box in PowerPoint?
A: One common reason is that the text box is placed inside the Slide Master layout. In this case, it can’t be selected or deleted directly from the normal slide view. You’ll need to go to the View → Slide Master tab, locate the layout, and delete it from there.
Q2: How can I delete a specific text box using Python?
A: You can locate the specific text box by accessing the slide and then searching for the shape based on its index or text content. Once identified, remove it from the slide's shape collection with slide.Shapes.Remove(), as in the code examples in this guide. This is useful when you know exactly which text box needs to be removed.
Q3: Is it possible to remove a text box without deleting the content?
A: If you want to keep the content but remove the text box formatting (like borders or background), you can extract the text before deleting the shape and reinsert it elsewhere — for example, as a plain paragraph. However, PowerPoint doesn’t natively support detaching text from its container without removing the shape.
How to Extract Text from Image Using Python (OCR Code Examples)
2025-06-11 01:58:59 Written by Administrator
Extracting text from images using Python is a widely used technique in OCR-driven workflows such as document digitization, form recognition, and invoice processing. Many important documents still exist only as scanned images or photos, making it essential to convert visual information into machine-readable text.
With the help of powerful Python libraries, you can easily perform text extraction from image files with Python — even for multilingual documents or layout-sensitive content. In this article, you’ll learn how to use Python to extract text from an image, through practical OCR examples, useful tips, and proven methods to improve recognition accuracy.
The guide is structured as follows:
- Powerful Python Library to Extract Text from Image
- Step-by-Step: Python Code to Extract Text from Image
- Real-World Use Cases for Text Extraction from Images
- Supported Languages and Image Formats
- How to Improve OCR Accuracy (Best Practices)
- FAQ
Powerful Python Library to Extract Text from Image
Spire.OCR for Python is a powerful OCR library for Python, especially suited for applications requiring structured layout extraction and multilingual support. This Python OCR engine supports:
- Text recognition with layout and position information
- Multilingual support (English, Chinese, French, etc.)
- Multiple image formats, including JPG, PNG, BMP, GIF, and TIFF
Setup: Install Dependencies and OCR Models
Before extracting text from images using Python, you need to install the spire.ocr library and download the OCR model files compatible with your operating system.
1. Install the Spire.OCR Python Package
Use pip to install the Spire.OCR for Python package:
pip install spire.ocr
2. Download the OCR Model Package
Download the OCR model files based on your OS:
- Windows: win-x64.zip
- Linux: linux.zip
- macOS: mac.zip
- Linux (AArch64): linux_aarch.zip
After downloading, extract the files and set the model path in your Python script when configuring the OCR engine.
Step-by-Step: Python Code to Extract Text from Image
In this section, we’ll walk through different ways to extract text from images using Python — starting with a simple plain-text extraction, and then moving to more advanced structured recognition.
Basic OCR Text Extraction (Image to Plain Text)
Here’s how to extract plain text from an image using Python:
from spire.ocr import *
# Create OCR scanner instance
scanner = OcrScanner()
# Configure OCR model path and language
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'D:\OCR\win-x64'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)
# Perform OCR on the image
scanner.Scan(r'Sample.png')
# Save extracted text to file
text = scanner.Text.ToString()
with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(text + '\n')
Optional: Clean and Preprocess Extracted Text (Post-OCR)
After OCR, the output may contain empty lines or noise. This snippet shows how to clean the text:
# Clean extracted text: remove empty or short lines
clean_lines = [line.strip() for line in text.split('\n') if len(line.strip()) > 2]
cleaned_text = '\n'.join(clean_lines)
# Save to a clean version
with open('output_clean.txt', 'w', encoding='utf-8') as file:
    file.write(cleaned_text)
Use Case: Useful for post-processing OCR output before feeding into NLP tasks or database storage.
Here’s an example of plain-text OCR output using Spire.OCR:

Extract Text from Image with Coordinates
In forms or invoices, you may need both text content and layout. The code below outputs each block’s bounding box info:
from spire.ocr import *
scanner = OcrScanner()
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'D:\OCR\win-x64'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)
scanner.Scan(r'sample.png')
text = scanner.Text
# Extract block-level text with position
block_text = ""
for block in text.Blocks:
    rectangle = block.Box
    block_info = f'{block.Text} -> x: {rectangle.X}, y: {rectangle.Y}, w: {rectangle.Width}, h: {rectangle.Height}'
    block_text += block_info + '\n'
with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(block_text + '\n')
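Once each block's text and bounding box are available, a common next step is to order the blocks top to bottom and left to right. The sketch below works on plain tuples so it runs without the OCR engine; the Block type, the row_tolerance parameter, and the sample values are hypothetical stand-ins for the block.Text and block.Box values read in the loop above.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    text: str
    x: float
    y: float
    width: float
    height: float


def reading_order(blocks: List[Block], row_tolerance: float = 10.0) -> List[Block]:
    """Sort blocks top-to-bottom, then left-to-right within roughly the same row."""
    rows: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b.y):
        # Join the current row if the block's top edge is close to the row's first block
        if rows and abs(block.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [b for row in rows for b in sorted(row, key=lambda b: b.x)]


blocks = [
    Block("Total:", 40, 203, 60, 18),
    Block("Invoice", 40, 20, 90, 24),
    Block("$12.50", 300, 200, 70, 18),
]
print([b.text for b in reading_order(blocks)])
```

The tolerance lets a label and its value stay on the same logical row even when their top edges differ by a few pixels; a suitable value depends on the image resolution.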
Extract Text from Multiple Images in a Folder
You can also batch process a folder of images:
import os
from spire.ocr import *
def extract_text_from_folder(folder_path, model_path):
    scanner = OcrScanner()
    config = ConfigureOptions()
    config.ModelPath = model_path
    config.Language = 'English'
    scanner.ConfigureDependencies(config)
    for filename in os.listdir(folder_path):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(folder_path, filename)
            scanner.Scan(image_path)
            text = scanner.Text.ToString()
            # Save each result as a separate file
            output_file = os.path.splitext(filename)[0] + '_output.txt'
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(text)
# Example usage
extract_text_from_folder(r'D:\images', r'D:\OCR\win-x64')
The recognized text blocks with position information are shown below:

Real-World Use Cases for Text Extraction from Images
Python-based OCR can be applied in:
- ✅ Invoice and receipt scanning
- ✅ Identity document OCR (passport, license)
- ✅ Business card digitization
- ✅ Form and survey data extraction
- ✅ Multilingual document indexing
Tip: For text extraction from PDF documents instead of images, you might also want to explore this tutorial on extracting text from PDF using Python.
Supported Languages and Image Formats
Spire.OCR supports multiple languages and a wide range of image formats for broader application scenarios.
Supported Languages:
- English
- Simplified / Traditional Chinese
- French
- German
- Japanese
- Korean
You can set the language using configureOptions.Language.
Supported Image Formats:
- JPG / JPEG
- PNG
- BMP
- GIF
- TIFF
How to Improve OCR Accuracy (Best Practices)
For better OCR text extraction from images using Python, follow these tips:
- Use high-resolution images (≥300 DPI)
- Preprocess with grayscale, thresholding, or denoising
- Avoid skewed or noisy scans
- Match the OCR language with the image content
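To make the grayscale and thresholding tip concrete, here is a minimal pure-Python sketch of both steps on a tiny pixel grid. It is an illustration under stated assumptions: the function names and the fixed threshold of 128 are hypothetical, and in practice you would typically run the equivalent operations through an imaging library before passing the file to the OCR scanner.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(pixels: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to 0-255 luminance using the ITU-R BT.601 weights."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row] for row in pixels]


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map each gray value to 0 (black) or 255 (white)."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray]


# Hypothetical 2x2 image: light background on the left, dark text pixels on the right
image = [
    [(255, 255, 255), (30, 30, 30)],
    [(200, 180, 160), (90, 60, 40)],
]
print(binarize(to_grayscale(image)))
```

A fixed threshold works for evenly lit scans; unevenly lit photos usually need an adaptive threshold instead.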
FAQ
How to extract text from an image in Python code?
To extract text from an image using Python, you can use an OCR library like Spire.OCR for Python. With just a few lines of Python code, you can recognize text in scanned documents or photos and convert it into editable, searchable content.
What is the best Python library to extract text from an image?
Spire.OCR for Python is a Python OCR library that offers high-accuracy recognition, multilingual support, and layout-aware output. It also works with Spire.Office components, so you can automate follow-up steps such as saving extracted text to Excel, Word, or searchable PDFs. Depending on your needs and preferences, you can also explore open-source tools for your image text extraction projects.
How to extract data (including position) from an image in Python?
When extracting text from an image with Python, Spire.OCR returns not only the recognized text but also the bounding box coordinates of each block, which is ideal for structured content like tables, forms, or receipts.
How to extract text using Python from scanned PDF files?
To perform text extraction from scanned PDF files using Python, you can first convert each PDF page into an image, then apply OCR using Spire.OCR for Python. For this, we recommend using Spire.PDF for Python — it allows you to save PDF pages as images or directly extract embedded images from scanned PDFs, making it easy to integrate with your OCR pipeline.
Conclusion: Efficient Text Extraction from Images with Python
Thanks to powerful libraries like Spire.OCR, text extraction from images in Python is both fast and reliable. Whether you're processing receipts or building an intelligent OCR pipeline, this approach gives you precise control over both content and layout.
If you want to remove usage limitations of Spire.OCR for Python, you can apply for a free temporary license.
How to Read PDF Files in Python – Text, Tables, Images, and More
2025-06-06 08:07:20 Written by zaki zou
Reading PDF files using Python is essential for tasks like document automation, content analysis, and data scraping. Whether you're working with contracts, reports, invoices, or scientific papers, being able to programmatically access PDF content saves time and enables powerful workflows.
To read PDF content in Python, including text, tables, images, and metadata, you need a dependable PDF reader library. In this guide, we’ll show you how to read PDFs in Python using Spire.PDF for Python, a professional and easy-to-use library that supports full-featured PDF reading without relying on any third-party tools.
Here's what's covered:
- Preparing Your Environment
- Load a PDF File in Python
- Read Text from PDF Pages in Python
- Read Table Data from PDFs in Python
- Read Images from PDFs in Python
- Read PDF Metadata (Title, Author, etc.)
- Common Questions on Reading PDFs
Environment Setup for Reading PDFs in Python
Spire.PDF for Python is a powerful Python PDF reader that allows users to read PDF content with simple Python code, including text, tables, images, and metadata. It offers a developer-friendly interface and supports a wide range of PDF reading operations:
- Read PDF files from disk or memory
- Access text, tables, metadata, and images
- No need for third-party tools
- High accuracy for structured data reading
- Free version available
It’s suitable for developers who want to read and process PDFs with minimal setup.
You can install Spire.PDF for Python via pip:
pip install spire.pdf
Or install the free edition, Free Spire.PDF for Python, for small tasks:
pip install spire.pdf.free
Load a PDF File in Python
Before accessing content, the first step is to load the PDF into memory. Spire.PDF lets you read PDF files from a path on disk or directly from in-memory byte streams — ideal for reading from web uploads or APIs.
Read PDF from File Path
To begin reading a PDF in Python, load the file using PdfDocument.LoadFromFile(). This creates a document object you can use to access content.
from spire.pdf import PdfDocument
# Create a PdfDocument instance
pdf = PdfDocument()
# Load a PDF document
pdf.LoadFromFile("sample.pdf")
Read PDF from Bytes (In-Memory)
To read a PDF file from memory without saving it to disk, you can first load its byte content and then initialize a PdfDocument using a Stream object. This method is especially useful when handling PDF files received from web uploads, APIs, or temporary in-memory data.
from spire.pdf import PdfDocument, Stream
# Read the PDF file to a byte array
with open("sample.pdf", "rb") as f:
    byte_data = f.read()
# Create a stream using the byte array
pdfStream = Stream(byte_data)
# Create a PdfDocument using the stream
pdf = PdfDocument(pdfStream)
To go further, check out this guide: Loading and Saving PDFs via Byte Streams in Python
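When bytes arrive from an upload or an API, it can be worth checking that they look like a PDF before handing them to the library. The sketch below is an assumption-laden illustration: `looks_like_pdf` and its `search_window` parameter are hypothetical helpers, not part of Spire.PDF. It relies only on the fact that PDF files carry a `%PDF-` header marker near the start of the data.

```python
def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the '%PDF-' marker appears near the start of the data."""
    return b"%PDF-" in data[:search_window]


# Hypothetical payloads
print(looks_like_pdf(b"%PDF-1.7\n% sample body"))
print(looks_like_pdf(b"<html><body>Not a PDF</body></html>"))
```

Rejecting obviously wrong payloads early gives a clearer error message than a parsing failure deeper in the pipeline. Set `search_window` to 5 if you want to require the marker at the very first byte.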
Read Text from PDF Pages in Python
Reading text from a PDF file is one of the most common use cases in document automation. With Spire.PDF, you can easily retrieve all visible text from the entire PDF or from individual pages using simple methods.
Read All Text from PDF
To extract all text from a PDF, loop through each page and call PdfTextExtractor.ExtractText() to collect visible text content.
from spire.pdf import PdfDocument, PdfTextExtractor, PdfTextExtractOptions
# Create a PdfDocument instance
pdf = PdfDocument()
# Load a PDF document
pdf.LoadFromFile("sample.pdf")
all_text = ""
# Loop through each page
for pageIndex in range(pdf.Pages.Count):
    page = pdf.Pages.get_Item(pageIndex)
    # Create a PdfTextExtractor instance
    text_extractor = PdfTextExtractor(page)
    # Configure extracting options
    options = PdfTextExtractOptions()
    options.IsExtractAllText = True
    options.IsSimpleExtraction = True
    # Extract text from the current page
    all_text += text_extractor.ExtractText(options)
print(all_text)
Sample text content retrieved:

Read Text from Specific Area of a Page
You can also read text from a defined region of a page using a bounding box. This is useful when only a portion of the layout contains relevant information.
from spire.pdf import RectangleF, PdfDocument, PdfTextExtractor, PdfTextExtractOptions
# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")
# Get the first page
page = pdf.Pages.get_Item(0)
# Create a PdfTextExtractor instance
textExtractor = PdfTextExtractor(page)
# Set the area to extract text by configuring the PdfTextExtractOptions
options = PdfTextExtractOptions()
area = RectangleF.FromLTRB(0, 200, page.Size.Width, 270)  # left, top, right, bottom
options.ExtractArea = area
options.IsSimpleExtraction = True
# Extract text from the area
text = textExtractor.ExtractText(options)
print(text)
The text read from the PDF page area:

Read Table Data from PDFs in Python
PDF tables are often used in reports, invoices, and statements. With Spire.PDF, you can read PDF tables in Python by extracting structured tabular content using its layout-aware table extractor, making it ideal for financial and business documents. Use PdfTableExtractor.ExtractTable() to detect tables page by page and output each row and cell as structured text.
from spire.pdf import PdfDocument, PdfTableExtractor
# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")
# Create a PdfTableExtractor instance
table_extractor = PdfTableExtractor(pdf)
# Extract the table from the first page
tables = table_extractor.ExtractTable(0)
for table in tables:
    # Get the number of rows and columns
    row_count = table.GetRowCount()
    column_count = table.GetColumnCount()
    # Iterate all rows
    for i in range(row_count):
        table_row = []
        # Iterate all columns
        for j in range(column_count):
            # Get the cell
            cell_text = table.GetText(i, j)
            table_row.append(cell_text)
        print(table_row)
Table content extracted using the code above:

Want to extract text from scanned PDFs using OCR? Read this guide on OCR with Python
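The table rows collected in the extraction loop are plain lists of strings. If the first row of a table is a header, the rows can be turned into dictionaries for easier downstream processing. This is a minimal sketch, assuming a header row exists; `rows_to_records` is a hypothetical helper and the sample rows stand in for the `table_row` lists built from `table.GetText(i, j)`.

```python
from typing import Dict, List


def rows_to_records(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Treat the first row as the header and map every later row to a dict."""
    if not rows:
        return []
    header = [name.strip() for name in rows[0]]
    records = []
    for row in rows[1:]:
        # Pad short rows so every header has a value; extra cells are dropped by zip
        padded = row + [""] * (len(header) - len(row))
        records.append(dict(zip(header, (cell.strip() for cell in padded))))
    return records


# Hypothetical extracted rows
rows = [["Name", "Qty", "Price"], ["Widget", "2", "9.50"], ["Gadget", "1"]]
for record in rows_to_records(rows):
    print(record)
```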
Read Images from PDF in Python
PDF files often contain logos, scanned pages, or embedded images. Spire.PDF allows you to read and export these images, which is helpful for working with digitized documents or preserving visual content. Use PdfImageHelper.GetImagesInfo() on each page to retrieve and save all embedded images.
from spire.pdf import PdfDocument, PdfImageHelper
# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")
# Get the first page
page = pdf.Pages.get_Item(0)
# Create a PdfImageHelper object
image_helper = PdfImageHelper()
# Get the image information from the page
images_info = image_helper.GetImagesInfo(page)
# Save the images from the page as image files
for i in range(len(images_info)):
    images_info[i].Image.Save("output/Images/image" + str(i) + ".png")
The image read from the PDF file:

Read PDF Metadata (Title, Author, etc.)
Sometimes you may want to access document metadata like author, subject, and title. This can be helpful for indexing or organizing files. Use the DocumentInformation property to read metadata fields.
from spire.pdf import PdfDocument
# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")
# Get the document properties
properties = pdf.DocumentInformation
print("Title: " + properties.Title)
print("Author: " + properties.Author)
print("Subject: " + properties.Subject)
print("Keywords: " + properties.Keywords)
The metadata read from the PDF document:

Common Questions on Reading PDFs
Can Python parse a PDF file?
Yes. Libraries like Spire.PDF for Python allow you to read PDF text, extract tables, and access embedded images or metadata. It supports methods like PdfTextExtractor.ExtractText() and PdfTableExtractor.ExtractTable() for structured content parsing.
How do I read a PDF in Jupyter?
Spire.PDF works seamlessly in Jupyter Notebooks. Just install it via pip and use its API to read PDF files, extract text, or parse tables and images directly in your notebook environment.
How to read text from a PDF file?
Use the PdfTextExtractor.ExtractText() method on each page after loading the PDF with Spire.PDF. This lets you convert a PDF file to text in Python and retrieve visible content for processing or analysis.
Can I read a PDF file without saving it to disk?
Yes. You can use LoadFromStream() to read PDF content as bytes and load it directly from memory. This is useful for processing PDFs received from web APIs or file uploads.
Conclusion
With Spire.PDF for Python, you can easily read a PDF in Python — including reading PDF text, tables, images, and metadata — and even read a PDF file to text for further processing or automation. This makes it an ideal solution for document automation, data ingestion, and content parsing in Python.
Need to process large PDF files or unlock all features? Request a free license and take full advantage of Spire.PDF for Python today!
How to Convert CSV to Excel (XLSX) in Python – Single & Batch Guide
2025-06-06 08:04:25 Written by zaki zou
While working with CSV files is common in data processing, Excel (XLSX) often provides more advantages when it comes to data sharing, visualization, and large-scale analysis. In this guide, you’ll learn how to convert CSV to Excel in Python, including both single file and batch conversion methods. Whether you're automating reports or preparing data for further analysis, this guide will help you handle the conversion efficiently.

- Why Convert CSV to Excel
- Install Required Python Libraries
- Convert Single CSV to Excel
- Batch Convert CSV to XLSX
- FAQs
Why Convert CSV to Excel?
While CSV files are widely used for data storage and exchange due to their simplicity, they come with several limitations—especially when it comes to formatting, presentation, and usability. Converting CSV to Excel can bring several advantages:
Benefits of Converting CSV to Excel
- Better formatting support: Excel allows rich formatting options like fonts, colors, borders, and cell merging, making your data easier to read and present.
- Multiple worksheets: Unlike CSV files that support only a single sheet, Excel files can store multiple worksheets in one file, which is better for large datasets.
- Built-in formulas and charts: You can apply Excel formulas, pivot tables, and charts to analyze and visualize your data.
- Improved compatibility for business users: Excel is the preferred tool for many non-technical users, making it easier to share and collaborate on data.
Limitations of CSV Files
- No styling or formatting (plain text only)
- Single-sheet structure only
- Encoding issues (e.g., with non-English characters)
- Not ideal for large datasets or advanced reporting
If your workflow involves reporting, data analysis, or sharing data with others, converting CSV to Excel is often a more practical and flexible choice.
Install Required Python Libraries
This guide demonstrates how to effortlessly convert CSV to Excel using Spire.XLS for Python. Spire.XLS is a powerful and professional Python Excel library that allows you to read, edit, and convert Excel files (both .xlsx and .xls) without relying on Microsoft Excel. Installing this CSV to Excel converter on your device is simple — just run the following command:
pip install Spire.XLS
Alternatively, you can download the Spire.XLS package manually for custom installation.
How to Convert CSV to Excel in Python: Single File
Now let’s get to the main part — how to convert a single CSV file to Excel using Python. With the help of Spire.XLS, this task becomes incredibly simple. All it takes is three easy steps: create a new workbook, load the CSV file, and save it as an Excel (.xlsx) file. Below is a detailed walkthrough along with a complete code example — let’s take a look!
Steps to convert a single CSV to Excel in Python:
- Create a Workbook instance.
- Load a sample CSV file using Workbook.LoadFromFile() method.
- Save the CSV file as Excel through Workbook.SaveToFile() method.
Below is the Python code to convert a CSV file to Excel. It also sets the NumberAsText ignore-error option on a cell range and automatically adjusts the column widths for better readability.
from spire.xls import *
from spire.xls.common import *
# Create a workbook
workbook = Workbook()
# Load a csv file
workbook.LoadFromFile("/sample csv.csv", ",", 1, 1)
# Set ignore error options
sheet = workbook.Worksheets[0]
sheet.Range["D2:E19"].IgnoreErrorOptions = IgnoreErrorType.NumberAsText
sheet.AllocatedRange.AutoFitColumns()
# Save the workbook as an .xlsx file
workbook.SaveToFile("/CSVToExcel1.xlsx", ExcelVersion.Version2013)

Warm Note: If you're only working with small files or doing some light testing, you can also use the free Spire.XLS. It's a great option for getting started quickly.
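The example passes "," as the separator to LoadFromFile(), but exported CSV files sometimes use semicolons, tabs, or pipes instead. The sketch below guesses the separator from a few sample lines so it can be passed in place of the hard-coded comma. It is an illustration only: `guess_delimiter` is a hypothetical helper, and the simple counting approach does not account for separators inside quoted fields.

```python
from typing import Sequence


def guess_delimiter(sample_lines: Sequence[str], candidates: str = ",;\t|") -> str:
    """Pick the candidate that occurs the same non-zero number of times on every line."""
    lines = [line for line in sample_lines if line.strip()]
    best, best_count = ",", 0
    for candidate in candidates:
        counts = {line.count(candidate) for line in lines}
        # A real delimiter should split every line into the same number of fields
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = candidate, count
    return best


# Hypothetical semicolon-separated export with decimal commas
sample = ["id;name;price", "1;Widget;9,50", "2;Gadget;12,00"]
print(repr(guess_delimiter(sample)))
```

When no candidate is consistent across the sample, the function falls back to a comma, which matches the default used in the conversion code.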
How to Batch Convert CSV to XLSX in Python
Another common scenario is when you need to convert multiple CSV files to Excel. Instead of manually replacing the file path and name for each one, there's a much more efficient approach. Simply place all the CSV files in the same folder, then use Python to loop through each file and convert them to Excel using the Workbook.SaveToFile() method. Let’s walk through the detailed steps below!
Steps to batch convert CSVs to Excel files in Python:
- Specify the file paths of input and output folders.
- Loop through all CSV files in the input folder.
- Create an object of Workbook class.
- Load each CSV file from the input folder with Workbook.LoadFromFile() method.
- Save the current CSV as an Excel file through Workbook.SaveToFile() method.
Here's the Python code to batch convert CSV to Excel (.XLSX):
import os
from spire.xls import *
input_folder = r"E:input\New folder"
output_folder = r"output\New folder"
# Loop through each CSV file
for csv_file in os.listdir(input_folder):
    if csv_file.endswith(".csv"):
        input_path = os.path.join(input_folder, csv_file)
        output_name = os.path.splitext(csv_file)[0] + ".xlsx"
        output_path = os.path.join(output_folder, output_name)
        # Create a Workbook instance and load CSV files
        workbook = Workbook()
        workbook.LoadFromFile(input_path, ",", 1, 1)
        # Save each CSV file as an Excel file
        workbook.SaveToFile(output_path, ExcelVersion.Version2013)
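The path handling in the batch loop can be separated from the conversion itself, which makes it easier to test. The following sketch, using only the standard library, pairs each CSV file with its target .xlsx path, matches the extension case-insensitively, and creates the output folder if needed. `plan_conversions` is a hypothetical helper, and the demo uses a temporary folder rather than real data.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def plan_conversions(input_folder: str, output_folder: str) -> List[Tuple[Path, Path]]:
    """Return (csv_path, xlsx_path) pairs and make sure the output folder exists."""
    source_dir = Path(input_folder)
    target_dir = Path(output_folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    pairs = []
    for csv_path in sorted(source_dir.iterdir()):
        # Compare the suffix case-insensitively so "REPORT.CSV" is not skipped
        if csv_path.is_file() and csv_path.suffix.lower() == ".csv":
            pairs.append((csv_path, target_dir / (csv_path.stem + ".xlsx")))
    return pairs


# Demo with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp, "in")
    src_dir.mkdir()
    for name in ("sales.csv", "REPORT.CSV", "notes.txt"):
        (src_dir / name).write_text("a,b\n1,2\n", encoding="utf-8")
    for src, dst in plan_conversions(str(src_dir), str(Path(tmp, "out"))):
        print(src.name, "->", dst.name)
```

Each pair can then be fed to the Workbook.LoadFromFile() and Workbook.SaveToFile() calls shown in the batch code.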

Conclusion
This guide showed you how to convert CSV to Excel in Python with step-by-step instructions and complete code examples. Whether you're working with a single CSV file or multiple files, Spire.XLS makes the process simple, fast, and hassle-free. Need help with more advanced scenarios or other Excel-related tasks? Feel free to contact us anytime!
FAQs about Converting CSV to Excel
Q1: How to convert CSV to Excel in Python without pandas?
A: You can use libraries like Spire.XLS, openpyxl, or xlsxwriter to convert CSV files without relying on pandas. These tools provide simple APIs to load .csv files and export them as xlsx—no Microsoft Excel installation required.
Q2: What is the easiest way to convert multiple CSV files to Excel in Python?
A: Just place all CSV files in one folder, then loop through them in Python and convert each using Workbook.SaveToFile(). This approach is ideal for batch processing. Alternatively, online converters can be a quick fix for occasional use.
Q3: How to auto-adjust column width when converting CSV to Excel in Python?
A: After loading the CSV, call sheet.AllocatedRange.AutoFitColumns() on the worksheet, as in the single-file example, to automatically resize columns based on content before saving the Excel file.

Text files (.txt) are a common way to store data due to their simplicity, but they lack the structure and analytical power of Excel spreadsheets. Converting TXT files to Excel allows for better data organization, visualization, and manipulation.
While manually importing a text file into Excel works for small datasets, automating the process saves time and reduces errors. Python, with its powerful libraries, offers an efficient solution. In this guide, you’ll learn how to convert TXT to Excel in Python using Spire.XLS for Python, a robust API for Excel file manipulation.
Prerequisites
Install Python and Spire.XLS
- Install Python on your machine from python.org.
- Install the Spire.XLS for Python library via PyPI. Open your terminal and run the following command:
pip install Spire.XLS
Prepare a TXT File
Ensure your TXT file follows a consistent structure, typically with rows separated by newlines and columns separated by delimiters (e.g., commas, tabs, or spaces). For example, a sample text file might look like this: 
Step-by-Step Guide to Convert Text File to Excel
Step 1: Import Required Modules
In your Python script, import the necessary classes from Spire.XLS:
from spire.xls import *
from spire.xls.common import *
Step 2: Read and Parse the TXT File
Read the text file and split it into rows and columns using Python’s built-in functions. Define your delimiter (tab, in this case):
with open("Data.txt", "r") as file:
    lines = file.readlines()
data = [line.strip().split("\t") for line in lines]
Note: If your file uses a different delimiter, replace the "\t" argument of the split() method (e.g., for spaces: split(" ")).
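Plain split() breaks down when a field contains the delimiter inside quotes or when rows have different lengths. As an alternative, the standard csv module can do the parsing. This is a minimal sketch, not the method used in this guide: `parse_delimited` is a hypothetical helper that also pads short rows so every row has the same number of columns.

```python
import csv
import io
from typing import List


def parse_delimited(text: str, delimiter: str = "\t") -> List[List[str]]:
    """Parse delimited text into equally long rows of stripped strings."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Blank lines come back as empty lists and are skipped
    rows = [[cell.strip() for cell in row] for row in reader if row]
    width = max((len(row) for row in rows), default=0)
    # Pad short rows with empty strings so every row has the same column count
    return [row + [""] * (width - len(row)) for row in rows]


# Hypothetical tab-separated content with a quoted field and a short row
sample = 'Name\tCity\tNote\nAnna\tBerlin\t"likes\ttabs"\nBen\tOslo\n'
print(parse_delimited(sample))
```

The resulting list has the same shape as the `data` variable above, so it can be written to the worksheet in the same way.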
Step 3: Create an Excel Workbook
Initialize a workbook object and access the first worksheet:
workbook = Workbook()
sheet = workbook.Worksheets[0]
Step 4: Write Data to the Worksheet
Loop through the parsed data and populate the Excel cells.
for row_num, row_data in enumerate(data):
    for col_num, cell_data in enumerate(row_data):
        sheet.Range[row_num + 1, col_num + 1].Value = cell_data
        sheet.Range[1, col_num + 1].Style.Font.IsBold = True
Step 5: Save the Excel File
Export the workbook as an XLSX file (you can also use .xls for older formats):
workbook.SaveToFile("TXTtoExcel.xlsx", ExcelVersion.Version2016)
TXT to Excel Full Code Example
from spire.xls import *
from spire.xls.common import *
# Read TXT data
with open("Data.txt", "r") as file:
    lines = file.readlines()
# Split data by delimiter
data = [line.strip().split("\t") for line in lines]
# Create an Excel workbook
workbook = Workbook()
# Get the first worksheet
sheet = workbook.Worksheets[0]
# Iterate through each row and column in the list
for row_num, row_data in enumerate(data):
    for col_num, cell_data in enumerate(row_data):
        # Write the data into the corresponding Excel cells
        sheet.Range[row_num + 1, col_num + 1].Value = cell_data
        # Set the header row to bold
        sheet.Range[1, col_num + 1].Style.Font.IsBold = True
# Autofit column width
sheet.AllocatedRange.AutoFitColumns()
# Save as Excel (.xlsx or .xls) file
workbook.SaveToFile("TXTtoExcel.xlsx", ExcelVersion.Version2016)
workbook.Dispose()
The Excel workbook converted from a text file:

Conclusion
Converting TXT files to Excel in Python using Spire.XLS automates data workflows, saving time and reducing manual effort. Whether you’re processing logs, survey results, or financial records, this method ensures structured, formatted outputs ready for analysis.
Pro Tip: Explore Spire.XLS’s advanced features—such as charts, pivot tables, and encryption—to further enhance your Excel files.
FAQs
Q1: Can Spire.XLS convert large TXT files?
Yes, the Python Excel library is optimized for performance and can process large files efficiently. However, ensure your system has sufficient memory for very large datasets (e.g., millions of rows). For optimal results, process data in chunks or use batch operations.
Q2: Can I convert Excel back to TXT using Spire.XLS?
Yes, Spire.XLS lets you read Excel cells and write their values to a text file. A comprehensive guide is available at: Convert Excel to TXT in Python
Q3: How to handle the encoding issues during conversion?
Specify encoding if the text file uses non-standard characters (e.g., utf-8):
with open("Data.txt", "r", encoding='utf-8') as file:
    lines = file.readlines()
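If you do not know the encoding in advance, one option is to try a short list of likely encodings in order. The sketch below is an assumption, not a Spire.XLS feature: `read_text_with_fallback` and its default encoding list are hypothetical, and the right candidates depend on where your files come from.

```python
import tempfile
from pathlib import Path
from typing import Sequence, Tuple


def read_text_with_fallback(
    path: str, encodings: Sequence[str] = ("utf-8-sig", "cp1252")
) -> Tuple[str, str]:
    """Return (text, encoding) using the first encoding that decodes without error."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails
    return raw.decode("latin-1"), "latin-1"


# Demo: a file saved in a legacy Windows encoding instead of UTF-8
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "Data.txt"
    sample_path.write_bytes("Caf\u00e9\tParis\n".encode("cp1252"))
    text, used = read_text_with_fallback(str(sample_path))
    print(used, repr(text))
```

Using "utf-8-sig" first also strips a byte order mark if one is present, which otherwise ends up in the first cell.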
Get a Free License
To fully experience the capabilities of Spire.XLS for Python without any evaluation limitations, you can request a free 30-day trial license.
How to Count Word Frequency in a Word Document Using Python
2025-05-22 09:16:03 Written by Administrator
Want to count the frequency of words in a Word document? Whether you're analyzing content, generating reports, or building a document tool, Python makes it easy to find how often a specific word appears across the entire document, within specific sections, or even in individual paragraphs. In this guide, you’ll learn how to use Python to count word occurrences accurately and efficiently, helping you extract meaningful insights from your Word files without manual effort.

- Count Frequency of Words in an Entire Word Document
- Count Word Frequency by Section
- Count Word Frequency by Paragraph
- To Wrap Up
- FAQ
In this tutorial, we’ll use Spire.Doc for Python, a powerful and easy-to-use library for Word document processing. It supports a wide range of features like reading, editing, and analyzing DOCX files programmatically—without requiring Microsoft Office.
You can install it via pip:
pip install spire.doc
Let’s see how it works in practice, starting with counting word frequency in an entire Word document.
How to Count Frequency of Words in an Entire Word Document
Let’s start by learning how to count how many times a specific word or phrase appears in an entire Word document. This is a common task—imagine you need to check how often the word "contract" appears in a 50-page file.
With the FindAllString() method from Spire.Doc for Python, you can quickly search through the entire document and get an exact count in just a few lines of code—saving you both time and effort.
Steps to count the frequency of a word in the entire Word document:
- Create an object of Document class and read a source Word document.
- Specify the keyword to find.
- Find all occurrences of the keyword in the document using Document.FindAllString() method.
- Count the number of matches and print it out.
The following code shows how to count the frequency of the keyword "AI-Generated Art" in the entire Word document:
from spire.doc import *
from spire.doc.common import *
# Create a Document object
document = Document()
# Load a Word document
document.LoadFromFile("E:/Administrator/Python1/input/AI-Generated Art.docx")
# Customize the keyword to find
keyword = "AI-Generated Art"
# Find all matches (second argument False: case-insensitive; third argument True: whole-word match)
textSelections = document.FindAllString(keyword, False, True)
# Count the number of matches
count = len(textSelections)
# Print the result
print(f'"{keyword}" appears {count} times in the entire document.')
# Close the document
document.Close()
How to Count Word Frequency by Section in a Word Document Using Python
A Word document is typically divided into multiple sections, each containing its own paragraphs, tables, and other elements. Sometimes, instead of counting a word's frequency across the entire document, you may want to know how often it appears in each section. To achieve this, we’ll loop through all the document sections and search for the target word within each one. Let’s see how to count word frequency by section using Python.
Steps to count the frequency of a word by section in Word documents:
- Create a Document object and load the Word file.
- Define the target keyword to search.
- Loop through all sections in the document. Within each section, loop through all paragraphs.
- Use regular expressions to count keyword occurrences.
- Accumulate and print the count for each section and the total count.
This code demonstrates how to count how many times "AI-Generated Art" appears in each section of a Word document:
import re
from spire.doc import *
from spire.doc.common import *
# Create a Document object and load a Word file
document = Document()
document.LoadFromFile("E:/Administrator/Python1/input/AI.docx")
# Specify the keyword
keyword = "AI-Generated Art"
# The total count of the keyword
total_count = 0
# Get all sections
sections = document.Sections
# Loop through each section
for i in range(sections.Count):
    section = sections.get_Item(i)
    paragraphs = section.Paragraphs
    section_count = 0
    print(f"\n=== Section {i + 1} ===")
    # Loop through each paragraph in the section
    for j in range(paragraphs.Count):
        paragraph = paragraphs.get_Item(j)
        text = paragraph.Text
        # Find all matches using regular expressions
        count = len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))
        section_count += count
        total_count += count
    print(f'Total in Section {i + 1}: {section_count} time(s)')
print(f'\n=== Total occurrences in all sections: {total_count} ===')
# Close the document
document.Close()
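Note that the regular expression above counts every occurrence of the keyword, including ones embedded in longer words. If you need whole-word matches only, the pattern can be wrapped in lookarounds. This is a minimal sketch using the standard re module; `count_keyword` and its parameters are hypothetical names, and the sample sentence stands in for `paragraph.Text`.

```python
import re


def count_keyword(text: str, keyword: str, whole_word: bool = False, ignore_case: bool = True) -> int:
    """Count keyword occurrences, optionally as whole words only."""
    pattern = re.escape(keyword)
    if whole_word:
        # Reject matches that are glued to other word characters on either side
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern, text, flags=flags))


# Hypothetical paragraph text
text = "Art inspires artists. AI-generated art is still ART."
print(count_keyword(text, "art"))                   # substring matches
print(count_keyword(text, "art", whole_word=True))  # whole-word matches only
```

In the sample sentence the substring count includes the "art" inside "artists", while the whole-word count does not.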
How to Count Word Frequency by Paragraph in a Word Document
When it comes to tasks like sensitive word detection or content auditing, it's crucial to perform a more granular analysis of word frequency. In this section, you’ll learn how to count word frequency by paragraph in a Word document, which gives you deeper insight into how specific terms are distributed across your content. Let’s walk through the steps and see a code example in action.
Steps to count the frequency of words by paragraph in Word files:
- Instantiate a Document object and load a Word document from files.
- Specify the keyword to search for.
- Loop through each section and each paragraph in the document.
- Find and count the occurrence of the keyword using regular expressions.
- Print out the count for each paragraph where the keyword appears and the total number of occurrences.
Use the following code to calculate the frequency of "AI-Generated Art" by paragraphs in a Word document:
import re
from spire.doc import *
from spire.doc.common import *
# Create a Document object
document = Document()
# Load a Word document
document.LoadFromFile("E:/Administrator/Python1/input/AI.docx")
# Customize the keyword to find
keyword = "AI-Generated Art"
# Initialize variables
total_count = 0
paragraph_index = 1
# Loop through sections and paragraphs
sections = document.Sections
for i in range(sections.Count):
    section = sections.get_Item(i)
    paragraphs = section.Paragraphs
    for j in range(paragraphs.Count):
        paragraph = paragraphs.get_Item(j)
        text = paragraph.Text
        # Find all occurrences of the keyword while ignoring case
        count = len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))
        # Print the result
        if count > 0:
            print(f'Paragraph {paragraph_index}: "{keyword}" appears {count} time(s)')
            total_count += count
        paragraph_index += 1
# Print the total count
print(f'\nTotal occurrences in all paragraphs: {total_count}')
document.Close()
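The same paragraph loop can also feed a frequency table for every word rather than a single keyword. The sketch below uses collections.Counter and is an illustration only: `top_words` is a hypothetical helper, the tokenizing pattern is a simple assumption that suits English text, and the sample strings stand in for the `paragraph.Text` values collected in the loop above.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(paragraphs: Iterable[str], limit: int = 3) -> List[Tuple[str, int]]:
    """Return the most frequent words across all paragraphs."""
    counter: Counter = Counter()
    for text in paragraphs:
        # Lower-case the text and keep runs of letters, digits, apostrophes and hyphens
        counter.update(re.findall(r"[a-z0-9'-]+", text.lower()))
    return counter.most_common(limit)


# Hypothetical paragraph texts
paragraphs = ["AI art is art made with AI.", "Some art is made by hand."]
print(top_words(paragraphs))
```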
To Wrap Up
This guide demonstrated how to count the frequency of specific words across an entire Word document, by section, and by paragraph using Python. Whether you're analyzing long reports, filtering sensitive terms, or building smart document tools, automating the task with Spire.Doc for Python can save time and boost accuracy. Give these methods a try in your own projects and take full control of your Word document content.
FAQs about Counting the Frequency of Words
Q1: How to count the number of times a word appears in Word?
A: You can count word frequency in Word manually using the “Find” feature, or automatically using Python and libraries like Spire.Doc. This lets you scan the entire document or target specific sections or paragraphs.
Q2: Can I analyze word frequency across multiple Word files?
A: Yes. Use a Python loop to load multiple documents, apply the same word-count logic to each file, and aggregate the results. This is ideal for batch processing or document audits.
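As a rough sketch of that aggregation step, the function below takes text that has already been extracted from each file and returns a per-file count plus a total. `keyword_counts` and the sample file names are hypothetical; in practice the text for each entry would be gathered from the document's paragraphs as shown in the earlier examples.

```python
import re
from typing import Dict


def keyword_counts(documents: Dict[str, str], keyword: str) -> Dict[str, int]:
    """Count case-insensitive keyword occurrences per document and overall."""
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    counts = {name: len(pattern.findall(text)) for name, text in documents.items()}
    # Add the grand total under a reserved key
    counts["TOTAL"] = sum(counts.values())
    return counts


# Hypothetical file names mapped to their extracted text
documents = {
    "report.docx": "AI-generated art is everywhere. Is AI-Generated Art really art?",
    "memo.docx": "No mention of the topic here.",
}
print(keyword_counts(documents, "AI-Generated Art"))
```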
How to Convert PDF to CSV in Python (Fast & Accurate Table Extraction)
2025-05-19 03:43:16 Written by Administrator
Working with PDFs that contain tables, reports, or invoice data? Manually copying that information into spreadsheets is slow, error-prone, and just plain frustrating. Fortunately, there's a smarter way: you can convert PDF to CSV in Python automatically — making your data easy to analyze, import, or automate.
In this guide, you’ll learn how to use Python for PDF to CSV conversion by directly extracting tables with Spire.PDF for Python — a pure Python library that doesn’t require any external tools.
✅ No Adobe or third-party tools required
✅ High-accuracy table recognition
✅ Ideal for structured data workflows
In this guide, we’ll cover:
- Convert PDF to CSV in Python Using Table Extraction
- Related Use Cases
- Why Use Spire.PDF for PDF to CSV Conversion in Python?
- Frequently Asked Questions
Convert PDF to CSV in Python Using Table Extraction
The best way to convert PDF to CSV using Python is by extracting tables directly — no need for intermediate formats like Excel. This method is fast, clean, and highly effective for documents with structured data such as invoices, bank statements, or reports. It gives you usable CSV output with minimal code and high accuracy, making it ideal for automation and data analysis workflows.
Step 1: Install Spire.PDF for Python
Before writing code, make sure to install the required library. You can install Spire.PDF for Python via pip:
pip install spire.pdf
You can also install Free Spire.PDF for Python if you're working on smaller tasks:
pip install spire.pdf.free
Step 2: Python Code — Extract Table from PDF and Save as CSV
- Python
from spire.pdf import PdfDocument, PdfTableExtractor
import csv
import os
# Load the PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")
# Create a table extractor
extractor = PdfTableExtractor(pdf)
# Ensure output directory exists
os.makedirs("output/Tables", exist_ok=True)
# Loop through each page in the PDF
for page_index in range(pdf.Pages.Count):
    # Extract tables on the current page
    tables = extractor.ExtractTable(page_index)
    for table_index, table in enumerate(tables):
        table_data = []
        # Extract all rows and columns
        for row in range(table.GetRowCount()):
            row_data = []
            for col in range(table.GetColumnCount()):
                # Get cleaned cell text
                cell_text = table.GetText(row, col).replace("\n", "").strip()
                row_data.append(cell_text)
            table_data.append(row_data)
        # Write the table to a CSV file
        output_path = os.path.join("output", "Tables", f"Page{page_index + 1}-Table{table_index + 1}.csv")
        with open(output_path, "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerows(table_data)
# Release PDF resources
pdf.Dispose()
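The conversion result: each table detected in the PDF is written to its own CSV file under output/Tables, named by page and table number (for example, Page1-Table1.csv).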

What is PdfTableExtractor?
PdfTableExtractor is a utility class provided by Spire.PDF for Python that detects and extracts table structures from PDF pages. Unlike plain text extraction, it maintains the row-column alignment of tabular data, making it ideal for converting PDF tables to CSV with clean structure.
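To see why keeping the row-column structure matters, here is a small standard-library sketch of the CSV-writing half of the workflow. It assumes the table has already been read into a list of rows; the sample rows and the helper names (clean_cell, rows_to_csv) are illustrative and not part of Spire.PDF.

```python
import csv
import io

def clean_cell(text):
    """Remove line breaks and surrounding whitespace from one cell."""
    return text.replace("\n", "").strip()

def rows_to_csv(rows):
    """Serialize a list of rows to CSV text, one table row per line."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([clean_cell(cell) for cell in row])
    return buffer.getvalue()

# Hypothetical rows, shaped like the table_data list built in the example above
sample_rows = [
    ["Item", "Qty", "Amount"],
    ["Laptop, 15 inch", "2", " 1,998.00\n"],
    ["Mouse", "10", "150.00"],
]

csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Because csv.writer quotes any cell that contains a comma, values such as "Laptop, 15 inch" stay in a single column when the file is opened in a spreadsheet.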
Best for:
- PDFs with structured tabular data
- Automated Python PDF to CSV conversion
- Fast Python-based data workflows
Related Article: How to Convert PDFs to Excel XLSX Files with Python
Related Use Cases
If your PDF doesn't contain traditional tables, for example when the content is formatted as paragraphs or key-value pairs, or was scanned as an image, the following approaches can help you convert it to CSV with Python:
- Text extraction: useful when data is in paragraph or report form; extract the text and format it into table-like CSV using Python logic.
- OCR: suited to image-based PDFs; use OCR to detect tables and export them to CSV.
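For the paragraph or key-value case, the "Python logic" can be as simple as a regular expression applied line by line. The sketch below assumes the page text has already been extracted into a string; the sample text, the KEY_VALUE pattern, and the key_values_to_rows helper are hypothetical and would need adjusting to the layout of a real document.

```python
import csv
import io
import re

# Matches lines such as "Invoice No: 1001" (label, colon, value)
KEY_VALUE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")

def key_values_to_rows(text):
    """Turn 'label: value' lines into [label, value] rows; skip other lines."""
    rows = []
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            rows.append([match.group(1), match.group(2)])
    return rows

# Hypothetical text extracted from a PDF page
sample_text = """Invoice No: 1001
Customer: Acme Corp
This line has no label and is skipped
Total: 250.00"""

rows = key_values_to_rows(sample_text)

# Write the rows as CSV text with a header line
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Field", "Value"])
writer.writerows(rows)
print(buffer.getvalue())
```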
Why Choose Spire.PDF for Python?
Spire.PDF for Python is a robust PDF SDK tailored for developers. Whether you're building automated reports, analytics tools, or ETL pipelines — it just works.
Key Benefits:
- Accurate Table Recognition: Smartly extracts structured data from tables
- Pure Python, No Adobe Needed: Lightweight and dependency-free
- Multi-Format Support: Also supports conversion to text, images, Excel, and more
Frequently Asked Questions
Can I convert PDF to CSV using Python?
Yes, you can convert PDF to CSV in Python using Spire.PDF. It supports both direct table extraction to CSV and an optional workflow that converts PDFs to Excel first. No Adobe Acrobat or third-party tools are required.
What's the best way to extract tables from PDFs in Python?
The most efficient way is using Spire.PDF’s PdfTableExtractor class. It automatically detects tables on each page and lets you export structured data to CSV with just a few lines of Python code — ideal for invoices, reports, and automated processing.
Why would I convert PDF to Excel before CSV?
You might convert PDF to Excel first if the layout is complex or needs manual review. This gives you more control over formatting and cleanup before saving as CSV, but it's slower than direct extraction and not recommended for automation workflows.
Does Spire.PDF work without Adobe Acrobat?
Yes. Spire.PDF for Python is a 100% standalone library that doesn’t rely on Adobe Acrobat or any external software. It's a pure Python solution for converting, extracting, and manipulating PDF content programmatically.
Conclusion
Converting PDF to CSV in Python doesn’t have to be a hassle. With Spire.PDF for Python, you can:
- Automatically extract structured tables to CSV
- Build seamless, automated workflows in Python
- Handle both native PDFs and scanned ones (with OCR)
Get a Free License
Spire.PDF for Python offers a free edition suitable for basic tasks. If you need access to more features, you can also apply for a free license for evaluation use. Simply submit a request, and a license key will be sent to your email after approval.
How to Filter Excel Pivot Tables with Python: Step-by-Step Guide
2025-05-16 10:01:44 Written by Administrator
Introduction
Pivot Tables in Excel are versatile tools that enable efficient data summarization and analysis. They allow users to explore data, uncover insights, and generate reports dynamically. One of the most powerful features of Pivot Tables is filtering, which lets users focus on specific data subsets without altering the original data structure.
What This Tutorial Covers
This tutorial provides a detailed, step-by-step guide on how to programmatically apply various types of filters to a Pivot Table in Excel using Python with the Spire.XLS for Python library. It covers the following topics:
- Benefits of Filtering Data in Pivot Tables
- Install Python Excel Library – Spire.XLS for Python
- Add Report Filter to Pivot Table
- Apply Row Field Filter in Pivot Table
- Apply Column Field Filter in Pivot Table
- Conclusion
- FAQs
Benefits of Filtering Data in Pivot Tables
Filtering is an essential feature of Pivot Tables that provides the following benefits:
- Enhanced Data Analysis: Quickly focus on specific segments or categories of your data to draw meaningful insights.
- Dynamic Data Updates: Filters automatically adjust to reflect changes when the underlying data is refreshed, keeping your analysis accurate.
- Improved Data Organization: Display only relevant data in your reports without altering or deleting the original dataset, preserving data integrity.
Install Python Excel Library – Spire.XLS for Python
Before working with Pivot Tables in Excel using Python, ensure the Spire.XLS for Python library is installed. The quickest way to do this is using pip, Python’s package manager. Simply run the following command in your terminal or command prompt:
pip install spire.xls
Once installed, you’re ready to start automating Pivot Table filtering in your Python projects.
Add Report Filter to Pivot Table
A report filter allows you to filter the entire Pivot Table based on a particular field and value. This type of filter is useful when you want to display data for a specific category or item globally across the Pivot Table, without changing the layout.
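Conceptually, a report filter restricts which source records feed the summary while the row layout stays the same. The following plain-Python model illustrates that behavior; it is not Spire.XLS code, and the sample records and the summarize helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical source data behind a pivot table
records = [
    {"Product": "Laptop", "Region": "East", "Sales": 4000},
    {"Product": "Laptop", "Region": "West", "Sales": 3000},
    {"Product": "Phone", "Region": "East", "Sales": 2500},
    {"Product": "Phone", "Region": "West", "Sales": 1500},
]

def summarize(rows, report_field=None, report_value=None):
    """Sum Sales per Region, optionally keeping only rows that match a report filter."""
    totals = defaultdict(int)
    for row in rows:
        if report_field is not None and row[report_field] != report_value:
            continue  # record excluded by the report filter
        totals[row["Region"]] += row["Sales"]
    return dict(totals)

all_products = summarize(records)
laptops_only = summarize(records, "Product", "Laptop")
print(all_products)
print(laptops_only)
```

Both summaries have the same Region rows; only the totals change, which mirrors how a report filter narrows the data without altering the table layout.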
Steps to Add a Report Filter
- Initialize the Workbook: Create a Workbook object to manage your Excel file.
- Load the Excel File: Use Workbook.LoadFromFile() to load an existing file containing a Pivot Table.
- Access the Worksheet: Use Workbook.Worksheets[] to select the desired worksheet.
- Locate the Pivot Table: Use Worksheet.PivotTables[] to access the specific Pivot Table.
- Define the Report Filter: Create a PivotReportFilter object specifying the field to filter.
- Apply the Report Filter: Add the filter to the Pivot Table using XlsPivotTable.ReportFilters.Add().
- Save the Updated File: Use Workbook.SaveToFile() to save your changes.
Code Example
- Python
from spire.xls import *
# Create an object of the Workbook class
workbook = Workbook()
# Load an existing Excel file containing a Pivot Table
workbook.LoadFromFile("Sample.xlsx")
# Access the first worksheet
sheet = workbook.Worksheets[0]
# Access the first Pivot Table in the worksheet
pt = sheet.PivotTables[0]
# Create a report filter for the field "Product"
reportFilter = PivotReportFilter("Product", True)
# Add the report filter to the pivot table
pt.ReportFilters.Add(reportFilter)
# Save the updated workbook to a new file
workbook.SaveToFile("AddReportFilter.xlsx", FileFormat.Version2016)
workbook.Dispose()

Apply Row Field Filter in Pivot Table
Row field filters allow you to filter data displayed in the row fields of an Excel Pivot Table. These filters can be based on labels (specific text values) or values (numeric criteria).
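The difference between the two filter kinds is easiest to see on a small summary. This plain-Python model is only an illustration of the concept, not the Spire.XLS API; the records and the pivot_by_row_field helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical source rows: (salesperson, product, amount)
records = [
    ("Mike", "Laptop", 4000),
    ("Mike", "Phone", 2500),
    ("Anna", "Laptop", 3000),
    ("Anna", "Phone", 1500),
    ("Raj", "Laptop", 7000),
]

def pivot_by_row_field(rows):
    """Sum the amount for each row-field label (salesperson)."""
    totals = defaultdict(int)
    for person, _product, amount in rows:
        totals[person] += amount
    return dict(totals)

totals = pivot_by_row_field(records)

# Label filter: keep rows whose label equals a given text value
label_filtered = {name: value for name, value in totals.items() if name == "Mike"}

# Value filter: keep rows whose aggregated value exceeds a threshold
value_filtered = {name: value for name, value in totals.items() if value > 5000}

print(label_filtered)
print(value_filtered)
```

A label filter looks only at the row label text, while a value filter compares the aggregated data field for each row against the numeric criterion.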
Steps to Add a Row Field Filter
- Initialize the Workbook: Create a Workbook object to manage the Excel file.
- Load the Excel File: Use Workbook.LoadFromFile() to load your target file containing a Pivot Table.
- Access the Worksheet: Select the desired worksheet using Workbook.Worksheets[].
- Locate the Pivot Table: Access the specific Pivot Table using Worksheet.PivotTables[].
- Add a Row Field Filter: Add a label filter or value filter using XlsPivotTable.RowFields[].AddLabelFilter() or XlsPivotTable.RowFields[].AddValueFilter().
- Calculate Pivot Table Data: Use XlsPivotTable.CalculateData() to calculate the pivot table data.
- Save the Updated File: Save your changes using Workbook.SaveToFile().
Code Example
- Python
from spire.xls import *
# Create an object of the Workbook class
workbook = Workbook()
# Load an Excel file
workbook.LoadFromFile("Sample.xlsx")
# Get the first worksheet
sheet = workbook.Worksheets[0]
# Get the first pivot table
pt = sheet.PivotTables[0]
# Add a value filter to the first row field in the pivot table
pt.RowFields[0].AddValueFilter(PivotValueFilterType.GreaterThan, pt.DataFields[0], Int32(5000), None)
# Or add a label filter to the first row field in the pivot table
# pt.RowFields[0].AddLabelFilter(PivotLabelFilterType.Equal, "Mike", None)
# Calculate the pivot table data
pt.CalculateData()
# Save the resulting file
workbook.SaveToFile("AddRowFieldFilter.xlsx", FileFormat.Version2016)
workbook.Dispose()

Apply Column Field Filter in Pivot Table
Column field filters in Excel Pivot Tables allow you to filter data displayed in the column fields. Similar to row field filters, column field filters can be based on labels (text values) or values (numeric criteria).
Steps to Add Column Field Filter
- Initialize the Workbook: Create a Workbook object to manage your Excel file.
- Load the Excel File: Use Workbook.LoadFromFile() to open your file containing a Pivot Table.
- Access the Worksheet: Select the target worksheet using Workbook.Worksheets[].
- Locate the Pivot Table: Use Worksheet.PivotTables[] to access the desired Pivot Table.
- Add a Column Field Filter: Add a label filter or value filter using XlsPivotTable.ColumnFields[].AddLabelFilter() or XlsPivotTable.ColumnFields[].AddValueFilter().
- Calculate Pivot Table Data: Use XlsPivotTable.CalculateData() to calculate the Pivot Table data.
- Save the Updated File: Save your changes using Workbook.SaveToFile().
Code Example
- Python
from spire.xls import *
# Create an object of the Workbook class
workbook = Workbook()
# Load the Excel file
workbook.LoadFromFile("Sample.xlsx")
# Access the first worksheet
sheet = workbook.Worksheets[0]
# Access the first Pivot Table
pt = sheet.PivotTables[0]
# Add a label filter to the first column field
pt.ColumnFields[0].AddLabelFilter(PivotLabelFilterType.Equal, String("Laptop"), None)
# Or add a value filter to the first column field
# pt.ColumnFields[0].AddValueFilter(PivotValueFilterType.Between, pt.DataFields[0], Int32(5000), Int32(10000))
# Calculate the pivot table data
pt.CalculateData()
# Save the updated workbook
workbook.SaveToFile("AddColumnFieldFilter.xlsx", FileFormat.Version2016)
workbook.Dispose()

Conclusion
Filtering Pivot Tables in Excel is crucial for effective data analysis, allowing users to zoom in on relevant information without disrupting the table’s structure. Using Spire.XLS for Python, developers can easily automate adding, modifying, and managing filters on Pivot Tables programmatically. This tutorial covered the primary filter types—report filters, row field filters, and column field filters—with detailed code examples to help you get started quickly.
FAQs
Q: Can I add multiple filters to the same Pivot Table?
A: Yes, you can add multiple report filters, row filters, and column filters simultaneously to customize your data views with Spire.XLS.
Q: Do filters update automatically if the source data changes?
A: Yes, after refreshing the Pivot Table or recalculating with CalculateData(), filters will reflect the latest data.
Q: Can I filter based on custom conditions?
A: Spire.XLS supports many filter types including label filters (equals, begins with, contains) and value filters (greater than, less than, between).
Q: Is it possible to remove filters programmatically?
A: Yes, you can remove filters by clearing or resetting the respective filter collections or fields.
Get a Free License
To fully experience the capabilities of Spire.XLS for Python without any evaluation limitations, you can request a free 30-day trial license.
