Python (359)

Converting PDF to database is a common requirement in data-driven applications. Many business documents—such as invoices, reports, and financial records—store structured information in PDF format, but this data is not directly usable for querying or analysis.
To make this data accessible, developers often need to convert PDF to SQL by extracting structured content and inserting it into relational databases like SQL Server, MySQL, or PostgreSQL. Manually handling this process is inefficient and error-prone, especially at scale.
In this guide, we focus on extracting table data from PDFs and building a complete pipeline to transform and insert it into an SQL database in Python with Spire.PDF for Python. This approach reflects the most practical and scalable solution for real-world PDF to database workflows.
Quick Navigation
- Understanding the Workflow
- Prerequisites
- Step 1: Extract Table Data from PDF
- Step 2: Transform and Insert Data into Database
- Complete Pipeline: From PDF Extraction to SQL Storage
- Adapting to Other SQL Databases
- Handling Other Types of PDF Data
- Common Pitfalls When Converting PDF Data to a Database
- Conclusion
- FAQ
Understanding the Workflow
Before diving into the implementation, it's important to understand the overall process of converting PDF data into a database.
Instead of treating each operation as completely separate, this workflow can be viewed as two main stages:

Each stage plays a distinct role in the pipeline:
-
Extract Tables: Retrieve structured table data from the PDF document
-
Process & Store Data: Clean, structure, and insert the extracted data into a relational database
- Transform Data: Convert raw rows into structured, database-ready records
- Insert into SQL Database: Persist the processed data into an SQL database
This end-to-end pipeline reflects how most real-world systems handle PDF to database workflows—by first extracting usable data, then processing and storing it in a database for querying and analysis.
Prerequisites
Before getting started, make sure you have the following:
-
Python 3.x installed
-
Spire.PDF for Python installed:
pip install Spire.PDFYou can also download Spire.PDF for Python and add it to your project manually.
-
A relational database system (e.g., SQLite, SQL Server, MySQL, or PostgreSQL)
This guide demonstrates the workflow using SQLite for simplicity, while also showing how the same approach can be applied to other SQL databases.
Step 1: Extract Table Data from PDF
In most business documents, such as invoices or reports, data is organized in tables. These tables already follow a row-and-column structure, making them ideal for direct insertion into an SQL database.
Table data in PDFs is typically already structured in rows and columns, making it the most suitable format for database storage.
Extract Tables Using Python
Below is an example of how to extract table data from a PDF file using Spire.PDF:
from spire.pdf import *
from spire.pdf.common import *
# Load PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Quarterly Sales.pdf")
# Method for ligature normalization
def normalize_text(text: str) -> str:
if not text:
return text
ligature_map = {
'\ue000': 'ff', '\ue001': 'ft', '\ue002': 'ffi', '\ue003': 'ffl', '\ue004': 'ti', '\ue005': 'fi',
}
for k, v in ligature_map.items():
text = text.replace(k, v)
return text.strip()
table_data = []
# Iterate through pages
for i in range(pdf.Pages.Count):
# Extract tables from pages
extractor = PdfTableExtractor(pdf)
tables = extractor.ExtractTable(i)
if tables:
print(f"Page {i} has {len(tables)} tables.")
for table in tables:
rows = []
for row in range(table.GetRowCount()):
row_data = []
for col in range(table.GetColumnCount()):
text = table.GetText(row, col)
text = normalize_text(text)
row_data.append(text.strip() if text else "")
rows.append(row_data)
table_data.extend(rows)
pdf.Close()
# Print extracted data
for row in table_data:
print(row)
Below is a preview of the extracting result:

Code Explanation
- LoadFromFile: Loads the PDF document
- PdfTableExtractor: Identifies tables within each page
- GetText(row, col): Retrieves cell content
- table_data: Stores extracted rows as a list of lists
At this stage, the data is extracted but still unstructured in terms of database usage. Once the table data is extracted, we need to convert it into a structured format for SQL insertion.
Alternatively, you can export the extracted data to a CSV file for validation or batch import. See: Convert PDF Tables to CSV in Python
Step 2: Transform and Insert Data into Database
Raw table data extracted from PDFs often requires cleaning and structuring before it can be inserted into an SQL database.
For simplicity, the following examples demonstrate how to process a single extracted table. In real-world scenarios, PDFs may contain multiple tables, which can be handled using the same logic in a loop.
Transform Data (Single Table Example)
structured_data = []
# Assume first row is header
headers = table_data[0]
for row in table_data[1:]:
if not any(row):
continue
record = {}
for i in range(len(headers)):
value = row[i] if i < len(row) else ""
record[headers[i]] = value
structured_data.append(record)
# Preview structured data
for item in structured_data:
print(item)
What This Step Does
- Converts rows into dictionary-based records
- Maps column headers to values
- Filters out empty rows
- Prepares structured data for database insertion
You can also:
- Normalize column names for SQL compatibility
- Convert numeric fields
- Standardize date formats
Transforming raw PDF data into a structured format ensures it can be reliably inserted into a relational database. After transformation, the data is immediately ready for database insertion, which completes the pipeline.
Insert Data into SQLite (Single Table Example)
Using the structured data from a single table, we can dynamically create a database schema and insert records without hardcoding column names.
import sqlite3
# Connect to SQLite database
conn = sqlite3.connect("sales_data.db")
cursor = conn.cursor()
# Create table dynamically based on headers
columns_def = ", ".join([f'"{h}" TEXT' for h in headers])
cursor.execute(f"""
CREATE TABLE IF NOT EXISTS invoices (
id INTEGER PRIMARY KEY AUTOINCREMENT,
{columns_def}
)
""")
# Prepare insert statement
placeholders = ", ".join(["?" for _ in headers])
column_names = ", ".join([f'"{h}"' for h in headers])
# Insert data
for record in structured_data:
values = [record.get(h, "") for h in headers]
cursor.execute(f"""
INSERT INTO invoices ({column_names})
VALUES ({placeholders})
""", values)
# Commit and close
conn.commit()
conn.close()
Key Points
- Dynamically creates database tables based on extracted headers
- Uses parameterized queries (
?) to prevent SQL injection - Keeps the schema flexible without hardcoding column names
- Column names can be normalized to ensure SQL compatibility
- Batch inserts can improve performance for large datasets
This section demonstrates the core workflow for converting PDF table data into a relational database using a single table example. In the next section, we extend this approach to handle multiple tables automatically.
Complete Pipeline: From PDF Extraction to SQL Storage
Here's a complete runnable example that demonstrates the entire workflow from PDF to database:
from spire.pdf import *
from spire.pdf.common import *
import sqlite3
import re
# ---------------------------
# Utility Functions
# ---------------------------
def normalize_text(text: str) -> str:
if not text:
return ""
ligature_map = {
'\ue000': 'ff', '\ue001': 'ft', '\ue002': 'ffi',
'\ue003': 'ffl', '\ue004': 'ti', '\ue005': 'fi',
}
for k, v in ligature_map.items():
text = text.replace(k, v)
return text.strip()
def normalize_column_name(name: str, index: int) -> str:
if not name:
return f"column_{index}"
name = name.lower()
name = re.sub(r'[^a-z0-9]+', '_', name).strip('_')
return name or f"column_{index}"
def deduplicate_columns(columns):
seen = set()
result = []
for col in columns:
base = col
count = 1
while col in seen:
col = f"{base}_{count}"
count += 1
seen.add(col)
result.append(col)
return result
# ---------------------------
# Step 1: Extract Tables (STRUCTURED)
# ---------------------------
pdf = PdfDocument()
pdf.LoadFromFile("Quarterly Sales.pdf")
extractor = PdfTableExtractor(pdf)
all_tables = []
for i in range(pdf.Pages.Count):
tables = extractor.ExtractTable(i)
if tables:
for table in tables:
table_rows = []
for row in range(table.GetRowCount()):
row_data = []
for col in range(table.GetColumnCount()):
text = table.GetText(row, col)
row_data.append(normalize_text(text))
table_rows.append(row_data)
if table_rows:
all_tables.append(table_rows)
pdf.Close()
if not all_tables:
raise ValueError("No tables found in PDF.")
# ---------------------------
# Step 2 & 3: Process + Insert Each Table
# ---------------------------
conn = sqlite3.connect("sales_data.db")
cursor = conn.cursor()
for table_index, table in enumerate(all_tables):
if len(table) < 2:
continue # skip invalid tables
raw_headers = table[0]
# Normalize headers
normalized_headers = [
normalize_column_name(h, i)
for i, h in enumerate(raw_headers)
]
normalized_headers = deduplicate_columns(normalized_headers)
# Generate table name
table_name = f"table_{table_index+1}"
# Create table
columns_def = ", ".join([f'"{col}" TEXT' for col in normalized_headers])
cursor.execute(f"""
CREATE TABLE IF NOT EXISTS "{table_name}" (
id INTEGER PRIMARY KEY AUTOINCREMENT,
{columns_def}
)
""")
# Prepare insert
placeholders = ", ".join(["?" for _ in normalized_headers])
column_names = ", ".join([f'"{col}"' for col in normalized_headers])
insert_sql = f"""
INSERT INTO "{table_name}" ({column_names})
VALUES ({placeholders})
"""
# Insert data
batch = []
for row in table[1:]:
if not any(row):
continue
values = [
row[i] if i < len(row) else ""
for i in range(len(normalized_headers))
]
batch.append(values)
if batch:
cursor.executemany(insert_sql, batch)
print(f"Inserted {len(batch)} rows into {table_name}")
conn.commit()
conn.close()
print(f"Processed {len(all_tables)} tables from PDF.")
Below is a preview of the insertion result in the database:

This complete example demonstrates the full PDF to database pipeline:
- Load and extract table data from PDF using Spire.PDF
- Transform raw data into structured records
- Insert into SQLite database with proper schema
SQLite automatically creates a system table called sqlite_sequence when using AUTOINCREMENT to track the current maximum ID. This is expected behavior and does not affect your data. You can run this code directly to convert PDF table data into a database.
Adapting to Other SQL Databases
While this guide uses SQLite for simplicity, the same approach works for other SQL databases. The extraction and transformation steps remain identical—only the database connection and insertion syntax vary slightly.
The following examples assume you are using the normalized column names (headers) generated in the previous step.
SQL Server Example
import pyodbc
# Connect to SQL Server
conn_str = (
"DRIVER={SQL Server};"
"SERVER=your_server_name;"
"DATABASE=your_database_name;"
"UID=your_username;"
"PWD=your_password"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
# Generate dynamic column definitions using normalized headers
columns_def = ", ".join([f"[{h}] NVARCHAR(MAX)" for h in headers])
# Create table dynamically
cursor.execute(f"""
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'invoices')
BEGIN
CREATE TABLE invoices (
id INT IDENTITY(1,1) PRIMARY KEY,
{columns_def}
)
END
""")
# Prepare insert statement
placeholders = ", ".join(["?" for _ in headers])
column_names = ", ".join([f"[{h}]" for h in headers])
# Insert data
for record in structured_data:
values = [record.get(h, "") for h in headers]
cursor.execute(f"""
INSERT INTO invoices ({column_names})
VALUES ({placeholders})
""", values)
# Commit and close
conn.commit()
conn.close()
MySQL Example
import mysql.connector
conn = mysql.connector.connect(
host="localhost",
user="your_username",
password="your_password",
database="your_database"
)
cursor = conn.cursor()
# Use the same dynamic table creation and insert logic as shown earlier,
# with minor syntax adjustments if needed
PostgreSQL Example
import psycopg2
conn = psycopg2.connect(
host="localhost",
database="your_database",
user="your_username",
password="your_password"
)
cursor = conn.cursor()
# Use the same dynamic table creation and insert logic as shown earlier,
# with minor syntax adjustments if needed
The core extraction and transformation steps remain the same across different SQL databases, especially when using normalized column names for compatibility.
Handling Other Types of PDF Data
While this guide focuses on table extraction, PDFs often contain other types of data that can also be integrated into a database, depending on your use case.
Text Data (Unstructured → Structured)
In many documents, important information such as invoice numbers, customer names, or dates is embedded in plain text rather than tables.
You can extract raw text using:
from spire.pdf import *
pdf = PdfDocument()
pdf.LoadFromFile("Quarterly Sales.pdf")
for i in range(pdf.Pages.Count):
page = pdf.Pages.get_Item(i)
extractor = PdfTextExtractor(page)
options = PdfTextExtractOptions()
options.IsExtractAllText = True
text = extractor.ExtractText(options)
print(text)
However, raw text cannot be directly inserted into a database. It typically requires parsing into structured fields, for example:
- Using regular expressions to extract key-value pairs
- Identifying patterns such as dates, IDs, or totals
- Converting text into dictionaries or structured records
Once structured, the data can be inserted into a database as part of the same transformation and insertion pipeline described earlier.
For more advanced techniques, you can learn more in the detailed Python PDF text extraction guide.
Images (OCR or File Reference)
Images in PDFs are usually not directly usable as structured data, but they can still be integrated into database workflows in two ways:
Option 1: OCR (Recommended for data extraction) Convert images to text using OCR tools, then process and store the extracted content.
Option 2: File Storage (Recommended for document systems) Store images as:
- File paths in the database
- Binary (BLOB) data if needed
Below is an example of extracting images:
from spire.pdf import *
pdf = PdfDocument()
pdf.LoadFromFile("Quarterly Sales.pdf")
helper = PdfImageHelper()
for i in range(pdf.Pages.Count):
page = pdf.Pages.get_Item(i)
images = helper.GetImagesInfo(page)
for j, img in enumerate(images):
img.Image.Save(f"image_{i}_{j}.png")
To further process image-based content, you can use OCR to extract text from images with Spire.OCR for Python.
Full PDF Storage (BLOB or File Reference)
In some scenarios, the goal is not to extract structured data, but to store the entire PDF file in a database.
This is commonly used in:
- Document management systems
- Archival systems
- Compliance and auditing workflows
You can store PDFs as:
- BLOB data in the database
- File paths referencing external storage
This approach represents another meaning of "PDF in database", but it is different from structured data extraction.
Key Takeaway
While PDFs can contain multiple types of content, table data remains the most efficient and scalable format for database integration. Other data types typically require additional processing before they can be stored or queried effectively.
Common Pitfalls When Converting PDF Data to a Database
While the process of converting PDF to a database may seem straightforward, several practical challenges can arise.
1. Inconsistent Table Structures
Not all PDFs follow a consistent table format:
- Missing columns
- Merged cells
- Irregular layouts
Solution:
- Validate row lengths
- Normalize structure
- Handle missing values
2. Poor Table Detection
Some PDFs do not define tables properly internally, such as no grid structure or irregular cell sizes.
Solution:
- Test with multiple files
- Use fallback parsing logic
- Preprocess PDFs if needed
3. Data Cleaning Issues
Extracted data may contain:
- Extra spaces
- Line breaks
- Formatting issues
Solution:
- Strip whitespace
- Normalize values
- Validate types
4. Character Encoding Issues (Ligatures & Fonts)
PDF table extraction can introduce unexpected characters due to font encoding and ligatures. For example, common letter combinations such as:
fi,ff,ffi,ffl,ft,ti
may be stored as single glyphs in the PDF. When extracted, they may appear as:
di\ue000erence → difference
o\ue002ce → office
\ue005le → file
These are typically private Unicode characters (e.g., \ue000–\uf8ff) caused by custom font mappings.
Solution:
-
Detect private Unicode characters (
\ue000–\uf8ff) -
Build a mapping table for ligatures, such as:
\ue000 → ff\ue001 → ft\ue002 → ffi\ue003 → ffl\ue004 → ti\ue005 → fi
-
Normalize text before inserting into the database
-
Optionally log unknown characters for further analysis
Handling encoding issues properly ensures data accuracy and prevents subtle corruption in downstream processing.
5. Cross-Page Table Fragmentation
Large tables in PDFs are often split across multiple pages. When extracted, each page may be treated as a separate table, leading to:
- Broken datasets
- Repeated headers
- Incomplete records
Solution:
- Compare column counts between consecutive tables
- Check header consistency or data type patterns in the first row
- Merge tables when structure and schema match
- Skip duplicated header rows when concatenating data
In practice, combining column structure and value pattern detection provides a reliable way to reconstruct full tables across pages.
6. Database Schema Mismatch
Incorrect mapping between extracted data and database columns can cause errors.
Solution:
- Align headers with schema
- Use explicit field mapping
7. Performance Issues with Large Files
Processing large PDFs can be slow.
Solution:
- Use batch processing
- Optimize insert operations
By anticipating these issues, you can build a more reliable PDF to database workflow.
Conclusion
Converting PDF to a database is not a one-step operation, but a structured process involving extracting data and processing it for database storage (including transformation and insertion)
By focusing on table data and using Python, you can efficiently implement a complete PDF to database pipeline, making it easier to automate data integration tasks.
This approach is especially useful for handling invoices, reports, and other structured business documents that need to be stored in SQL Server or other relational databases.
If you want to evaluate the performance of Spire.PDF for Python and remove any limitations, you can apply for a 30-day free trial.
FAQ
What does "PDF to database" mean?
It refers to the process of extracting structured data from PDF files and storing it in a database. This typically involves parsing PDF content, transforming it into structured formats, and inserting it into SQL databases for further querying and analysis.
Can Python convert PDF directly to a database?
No. Python cannot directly convert a PDF into a database in one step. The process usually involves extracting data from the PDF first, transforming it into structured records, and then inserting it into a database using SQL connectors.
How do I convert PDF to SQL using Python?
The typical workflow includes:
- Extracting table or text data from the PDF
- Converting it into structured records (rows and columns)
- Inserting the processed data into an SQL database such as SQLite, MySQL, or SQL Server using Python database libraries
Can I store PDF files directly in a database?
Yes. PDF files can be stored as binary (BLOB) data in a database. However, this approach is mainly used for document storage systems, while structured extraction is preferred for data analysis and querying.
What SQL databases can I use for PDF data integration?
You can use almost any SQL database, including SQLite, SQL Server, MySQL, and PostgreSQL. The overall extraction and transformation process remains the same, while only the database connection and insertion syntax differ slightly.

Importing an Excel file in Python typically involves more than just reading the file. In most cases, the data needs to be converted into Python structures such as lists, dictionaries, or other formats that can be directly used in your application.
This transformation step is important because Excel data is usually stored in a tabular format, while Python applications often require structured data for processing, integration, or storage. Depending on how the data will be used, it may be represented as a list for sequential processing, a dictionary for field-based access, custom objects for structured modeling, or a database for persistent storage.
This guide demonstrates how to import Excel file in Python and convert the data into multiple structures using Spire.XLS for Python, with practical examples for each approach.
Overall Implementation Approach and Quick Example
Importing Excel data into Python is essentially a two-step process:
- Load Excel file – Load the Excel file and access its raw data
- Transform data – Convert the data into Python structures such as lists, dictionaries, or objects
This separation is important because in real-world applications, simply reading Excel is not enough—the data must be transformed into a format that can be processed, stored, or integrated into systems.
Key Components
When importing Excel data using Spire.XLS for Python, the following components are involved:
- Workbook – Represents the entire Excel file and is responsible for loading data from disk
- Worksheet – Represents a single sheet within the Excel file
- CellRange – Represents a group of cells that contain actual data
- Data Transformation Layer – Your Python logic that converts cell values into target structures
Data Flow Overview
The typical workflow looks like this:
Excel File → Workbook → Worksheet → CellRange → Python Data Structure
Understanding this pipeline helps you design flexible import logic for different scenarios.
Quick Example: Import Excel File in Python
Before running the example, install Spire.XLS for Python using pip:
pip install spire.xls
If needed, you can also download Spire.XLS for Python manually and include it in your project.
The following example shows the simplest way to import Excel data into Python:
from spire.xls import *
workbook = Workbook()
workbook.LoadFromFile("SalesReport.xlsx")
data = []
sheet = workbook.Worksheets[0]
# Get the used cell range
cellRange = sheet.AllocatedRange
# Get the data from the first row
for col in range(cellRange.Columns.Count):
data.append(sheet.Range[1, col +1].Value)
print(data)
workbook.Dispose()
Below is a preview of the data imported from the Excel file:

This minimal example demonstrates the fundamental workflow: initialize a workbook, load the Excel file, access the worksheet and cell data, and then dispose of the workbook to release resources.
For more advanced scenarios, such as reading Excel files from memory or handling file streams, see how to import Excel data from a stream in Python.
Import Excel Data in Python as a List
One of the simplest ways to import Excel data in Python is to convert it into a list of rows. This structure is useful for iteration and basic data processing.
Example
from spire.xls import *
# Load the Workbook
workbook = Workbook()
workbook.LoadFromFile("SalesReport.xlsx")
# Get the used range in the first worksheet
sheet = workbook.Worksheets[0]
cellRange = sheet.AllocatedRange
# Create a list to store the data
data = []
for row_index in range(cellRange.RowCount):
row_data = []
for cell_index in range(cellRange.ColumnCount):
row_data.append(cellRange[row_index + 1, cell_index + 1].Value)
data.append(row_data)
workbook.Dispose()
Technical Explanation
Importing Excel data as a list treats each row in the worksheet as a Python list, preserving the original row order.
How the code works:
- A nested loop is used to traverse the worksheet in a row-first (row-major) pattern
- The outer loop iterates through rows, while the inner loop accesses each cell
- Index offsets (
+1) are applied because Spire.XLS uses 1-based indexing
Why this design works:
- AllocatedRange limits iteration to only populated cells, improving efficiency
- Row-by-row extraction keeps the structure consistent with Excel’s layout
- The intermediate row_data list ensures clean aggregation before appending
This structure is ideal for sequential processing, simple transformations, or as a base format before converting into dictionaries or objects.
If you want to load more than just text and numeric data, see How to Read Excel Files in Python for more data types.
Import Excel Data as a Dictionary in Python
If your Excel file contains headers, importing it as a dictionary provides better data organization and access by column names.
Example
from spire.xls import *
workbook = Workbook()
workbook.LoadFromFile("SalesReport.xlsx")
sheet = workbook.Worksheets[0]
cellRange = sheet.AllocatedRange
rows = list(cellRange.Rows)
headers = [cellRange[1, cell_index + 1].Value for cell_index in range(cellRange.ColumnCount)]
data_dict = []
for row in rows[1:]:
row_dict = {}
for i, cell in enumerate(row.Cells):
row_dict[headers[i]] = cell.Value
data_dict.append(row_dict)
workbook.Dispose()
Technical Explanation
Importing Excel data as a dictionary converts each row into a key-value structure using column headers.
How the code works:
- The first row is extracted as headers
- Each subsequent row is iterated and processed
- Cell values are mapped to headers using their column index
Why this design works:
- Both headers and row cells follow the same column order, enabling simple index-based mapping
- This removes reliance on fixed column positions
- The result is a self-descriptive structure with named fields
This method is useful when you need structured data access, such as working with JSON, APIs, or labeled datasets.
Import Excel Data into Custom Objects
For structured applications, you may need to import Excel data into Python objects to maintain type safety and encapsulate business logic.
Example
class Employee:
def __init__(self, name, age, department):
self.name = name
self.age = age
self.department = department
from spire.xls import *
from spire.xls.common import *
workbook = Workbook()
workbook.LoadFromFile("EmployeeData.xlsx")
sheet = workbook.Worksheets[0]
cellRange = sheet.AllocatedRange
employees = []
for row in list(cellRange.Rows)[1:]:
name = row.Cells[0].Value
age = int(row.Cells[1].Value) if row.Cells[1].Value else None
department = row.Cells[2].Value
emp = Employee(name, age, department)
employees.append(emp)
workbook.Dispose()
Technical Explanation
Importing Excel data into objects maps each row to a structured class instance.
How the code works:
- A class is defined to represent the data model
- Each row is read and its values are extracted
- Values are passed into the class constructor to create objects
Why this design works:
- The constructor acts as a controlled transformation point
- It allows validation, type conversion, or preprocessing
- Data is no longer loosely structured, but aligned with domain logic
This is ideal for applications with clear data models, such as backend systems or business logic layers.
Import Excel Data to Database in Python
In many applications, Excel data needs to be stored in a database for persistent storage and querying.
Example
import sqlite3
from spire.xls import *
# Connect to SQLite database
conn = sqlite3.connect("sales.db")
cursor = conn.cursor()
# Create table matching the Excel structure
cursor.execute("""
CREATE TABLE IF NOT EXISTS sales (
product TEXT,
category TEXT,
region TEXT,
sales REAL,
units_sold INTEGER
)
""")
# Load the Excel file
workbook = Workbook()
workbook.LoadFromFile("Sales.xlsx")
# Access the first worksheet
sheet = workbook.Worksheets[0]
rows = list(sheet.AllocatedRange.Rows)
# Iterate through rows (skip header row)
for row in rows[1:]:
product = row.Cells[0].Value
category = row.Cells[1].Value
region = row.Cells[2].Value
# Remove thousand-separators and convert to float
sales_text = row.Cells[3].Value
sales = float(str(sales_text).replace(",", "")) if sales_text else 0
# Convert units sold to integer
units_text = row.Cells[4].Value
units_sold = int(units_text) if units_text else 0
# Insert data into the database
cursor.execute(
"INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
(product, category, region, sales, units_sold)
)
# Commit changes and close connection
conn.commit()
conn.close()
# Release Excel resources
workbook.Dispose()
Here is a preview of the Excel data and the SQLite database structure:

Technical Explanation
Importing Excel data into a database converts each row into a persistent record.
How the code works:
- A database connection is established and a table is created
- The table schema is aligned with the Excel structure
- Each row is read and inserted using parameterized SQL queries
Why this design works:
- Schema alignment ensures consistent data mapping
- Data normalization (e.g., numeric conversion) improves compatibility
- Parameterized queries provide safety and proper type handling
When to use this approach:
This approach is suitable for data storage, querying, and integration into larger data pipelines.
For a more detailed guide on importing Excel data into Databases, check out How to Transfer Data Between Excel Files and Databases.
Why Use Spire.XLS for Importing Excel Data
The examples in this guide use Spire.XLS for Python because it provides a clear and consistent way to access and transform Excel data. The main advantages in this context include:
-
Structured Object Model The library exposes components such as Workbook, Worksheet, and CellRange, which align directly with how Excel data is organized. This makes the data flow easier to understand and implement. See more details on Spire.XLS for Python API Reference.
-
Focused Data Access Layer Instead of handling low-level file parsing, you can work directly with cell values and ranges, allowing the import logic to focus on data transformation rather than file structure.
-
Format Compatibility It supports common Excel formats, such as XLS and XLSX, and other spreadsheet formats, such as CSV, ODS, and OOXML, enabling the same import logic to be applied across different file types.
-
No External Dependencies Excel files can be processed without requiring Microsoft Excel to be installed, which is important for backend services and automated environments.
Common Pitfalls
Incorrect File Path
Ensure the Excel file path is correct and accessible from your script. Use absolute paths or verify the current working directory.
import os
print(os.getcwd()) # Check current directory
Missing Headers
When importing as a dictionary, verify that your Excel file has headers in the first row. Otherwise, the keys will be incorrect.
Memory Management
Always dispose of the workbook object after processing to release resources, especially when processing large files.
workbook.Dispose()
Data Type Conversion
Excel cells may return different data types than expected. Validate and convert data types as needed for your application.
Import vs Read Excel in Python
In Python, "reading" and "importing" Excel files refer to related but distinct steps in data processing.
Read Excel focuses on accessing raw file content. This typically involves retrieving cell values, rows, or specific ranges without changing how the data is structured.
Import Excel includes both reading and transformation. After extracting the data, it is converted into structures such as lists, dictionaries, objects, or database records so that it can be used directly within an application.
In practice, reading is a subset of importing. The distinction lies in the goal—reading retrieves data, while importing prepares it for use.
Conclusion
Importing Excel file in Python is not just about reading data—it's about converting it into structures that your application can use effectively. In this guide, you learned how to import Excel file in Python as a list, convert Excel data into dictionaries, map Excel data into Python objects, and import Excel data into a database.
With Spire.XLS for Python, you can easily import Excel data into different structures with minimal code. The library provides a consistent API for handling various Excel formats and complex content, making it suitable for a wide range of data processing scenarios.
To evaluate the full performance of Spire.XLS for Python, you can apply for a 30 day trial license.
FAQ
What does it mean to import Excel file in Python?
Importing Excel means converting Excel data into Python structures such as lists, dictionaries, or databases for further processing and integration into your applications.
How do I import Excel data into Python?
You can use libraries like Spire.XLS for Python to load Excel files and convert their content into usable Python data structures. The process involves loading the workbook, accessing the worksheet, and iterating through cells to extract data.
Can I import Excel data into a database using Python?
Yes, you can read Excel data and insert it into databases like SQLite, MySQL, or PostgreSQL using Python. This approach is commonly used for data migration and backend system integration.
What is the best structure for importing Excel data?
The best structure depends on your use case. Lists are suitable for simple iteration, dictionaries for structured data access by column names, objects for type safety and business logic, and databases for persistent storage and querying.
Do I need Microsoft Excel installed to import Excel files in Python?
No, libraries like Spire.XLS for Python work independently and do not require Microsoft Excel to be installed on the system.

Developers often need to include Python code inside Word documents for technical documentation, tutorials, code reviews, internal reports, or client deliverables. While copying and pasting code manually works for small snippets, automated solutions provide better consistency, formatting control, and scalability — especially when working with long scripts or multiple files.
This tutorial demonstrates multiple practical methods to export Python code into Word documents using Python. Each method has its own strengths depending on whether you prioritize formatting, automation, syntax highlighting, or readability.
On This Page:
- Install Required Libraries
- Export Python Code to Word as Plain Text
- Add Syntax-Highlighted Python Code to Word
- Conclusion
- FAQs
Install Required Libraries
Install the necessary dependencies before running the examples:
pip install spire.doc pygments
Library Overview:
- Spire.Doc for Python — used to create and manipulate Word documents programmatically
- Pygments — used to generate syntax-highlighted code in RTF, HTML, or image formats
- Pathlib (built-in) — used for reading Python files from disk
- textwrap (built-in) — used to wrap long code lines before generating images formatting
Export Python Code to Word as Plain Text
Plain text insertion is the most straightforward method for embedding code in Word. It keeps scripts fully editable and preserves formatting such as indentation and line breaks.
Method 1. Insert Raw Python Code into a Word Document
This method reads a .py file and inserts the code directly into Word while applying a monospace font style.
from pathlib import Path
from spire.doc import *
# Read Python file
code_string = Path("demo.py").read_text(encoding="utf-8")
# Create a Word document
doc = Document()
# Add a section
section = doc.AddSection()
section.PageSetup.Margins.All = 60
# Add a paragraph
paragraph = section.AddParagraph()
# Insert code string to the paragraph
paragraph.AppendText(code_string)
# Create a paragraph style
style = ParagraphStyle(doc)
style.Name = "code"
style.CharacterFormat.FontName = "Consolas"
style.CharacterFormat.FontSize = 12
style.ParagraphFormat.LineSpacing = 12
doc.Styles.Add(style)
# Apply the style to the paragraph
paragraph.ApplyStyle("code")
# Save the document
doc.SaveToFile("Output.docx", FileFormat.Docx2019)
doc.Dispose()
How It Works:
This technique treats Python code as plain text and inserts it directly into a Word paragraph. The script reads the .py file using Path.read_text(), preserving indentation, blank lines, and overall structure.
After inserting the text, a custom paragraph style is created and applied. The use of a monospace font such as Consolas ensures alignment and readability, while fixed line spacing maintains consistent formatting across lines.
Because no intermediate format is used, this is the simplest and fastest approach. However, it does not provide syntax highlighting or semantic styling—Word only displays the code as formatted text.
Output:

You May Also Like: Generate Word Documents Using Python
Method 2. Generate a Word File from Markdown-Wrapped Code
If your workflow already uses Markdown, wrapping Python code inside fenced blocks provides a structured way to convert scripts into Word documents.
from pathlib import Path
from spire.doc import *
# Read Python file
code = Path("demo.py").read_text(encoding="utf-8")
# Convert to Markdown
md_content = f"```python\n{code}\n```"
Path("temp.md").write_text(md_content, encoding="utf-8")
# Load Markdown into Word
doc = Document()
doc.LoadFromFile("temp.md")
# Update page settings
doc.Sections[0].PageSetup.Margins.All = 60
# Save as a DOCX file
doc.SaveToFile("Output.docx", FileFormat.Docx)
doc.Dispose()
How It Works:
Instead of inserting text directly, this method wraps Python code inside Markdown fenced code blocks. The generated Markdown file is then loaded into Word using Spire.Doc’s Markdown parsing capability.
When Word imports Markdown, it automatically preserves code formatting such as indentation and line breaks. This approach is useful when your documentation workflow already uses Markdown or when code needs to coexist with headings, lists, and descriptive text.
Since Markdown itself does not inherently apply syntax coloring inside Word, the result is still plain code formatting—but the structure is cleaner and easier to manage within technical documentation pipelines.
Output:

Add Syntax-Highlighted Python Code to Word
Syntax highlighting makes code easier to read and understand. By integrating Pygments, Python scripts can be converted into stylized formats before being embedded into Word.
This section explores three approaches — RTF, HTML, and image rendering — each with different strengths depending on your formatting goals.
Method 1. Use RTF for Preformatted Code Blocks
RTF allows syntax-highlighted code to remain fully editable within Word.
from pathlib import Path
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import RtfFormatter
from spire.doc import *
# Read Python file
code = Path("demo.py").read_text(encoding="utf-8")
# Set font
formatter = RtfFormatter(fontface ="Consolas")
# Specify the lexer
rtf_text = highlight(code, PythonLexer(), formatter)
rtf_text = rtf_text.replace(r"\f0", r"\f0\fs24") # font size (24 for 12-point font)
# Create a Word document
doc = Document()
# Add a section
section = doc.AddSection()
section.PageSetup.Margins.All = 60
# Add a paragraph
paragraph = section.AddParagraph()
# Insert the syntax-highlighted code as RTF
paragraph.AppendRTF(rtf_text)
# Save the document
doc.SaveToFile("Output.docx", FileFormat.Docx2019)
doc.Dispose()
How It Works:
Pygments analyzes Python syntax using a lexer, identifying tokens such as keywords, strings, and comments. The RTF formatter applies styling rules that represent colors and fonts using RTF control words.
The resulting RTF string is inserted directly into Word using AppendRTF(). Because RTF is a native Word-compatible format, the document preserves fonts, colors, and spacing without requiring additional rendering steps.
Font size is controlled by modifying RTF control words (e.g., \fs24), allowing precise control over appearance. This method produces editable, selectable code with syntax highlighting inside Word.
Output:

Method 2. Render Highlighted Code via HTML Formatting
HTML rendering provides visually rich syntax highlighting and automatic text wrapping.
from pathlib import Path
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
from spire.doc import *
# Read Python file
code = Path("demo.py").read_text(encoding="utf-8")
# Generate HTML from the Python code with syntax highlighting
html_text = highlight(code, PythonLexer(), HtmlFormatter(full=True))
# Create a Word document
doc = Document()
# Add a section
section = doc.AddSection()
section.PageSetup.Margins.All = 60
# Add a paragraph
paragraph = section.AddParagraph()
# Add the HTML string to the paragraph
paragraph.AppendHTML(html_text)
# Save the document
doc.SaveToFile("Output.docx", FileFormat.Docx2019)
doc.Dispose()
How It Works:
Here, Pygments converts Python code into styled HTML using the HtmlFormatter. The HTML output includes inline styles or CSS rules that represent syntax colors and formatting.
Spire.Doc then interprets the HTML content and renders it into Word. During this process, HTML elements are translated into Word formatting structures, allowing the highlighted code to appear visually similar to web-based code blocks.
This approach is ideal when code originates from web content, static documentation sites, or Markdown-to-HTML workflows.
Output:

You May Also Like: Convert HTML to Word DOC or DOCX in Python
Method 3. Insert Syntax-Highlighted Code as Images
For scenarios where visual consistency matters more than editability, code can be rendered as an image before insertion.
from pathlib import Path
import textwrap
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import ImageFormatter
from spire.doc import *
# Read Python file
code = Path("demo.py").read_text(encoding="utf-8")
# Wrap long lines manually
def wrap_code_lines(code_text, max_width=75):
wrapped_lines = []
for line in code_text.splitlines():
if len(line) > max_width:
wrapped_lines.extend(textwrap.wrap(
line,
width=max_width,
replace_whitespace=False,
drop_whitespace=False
))
else:
wrapped_lines.append(line)
return "\n".join(wrapped_lines)
code = wrap_code_lines(code, max_width=75)
# Step 3: Generate image
formatter = ImageFormatter(
font_name="Consolas",
font_size=18,
scale=2,
image_pad=10,
line_pad=2,
background_color="#ffffff"
)
img_bytes = highlight(code, PythonLexer(), formatter)
with open("code.png", "wb") as f:
f.write(img_bytes)
# Create a Word document
doc = Document()
section = doc.AddSection()
section.PageSetup.Margins.All = 60
# Insert into Word
paragraph = section.AddParagraph()
picture = paragraph.AppendPicture("code.png")
# Ensure image fits page width
page_width = (
section.PageSetup.PageSize.Width
- section.PageSetup.Margins.Left
- section.PageSetup.Margins.Right
)
picture.Width = page_width
# Save the document
doc.SaveToFile("Output.docx", FileFormat.Docx2019)
doc.Dispose()
How It Works:
This method renders Python code as an image instead of editable text. Pygments generates a syntax-highlighted bitmap using the ImageFormatter, allowing full visual control over fonts, colors, padding, and DPI.
Since image rendering does not automatically wrap long lines, the script manually wraps lengthy code lines using Python’s textwrap module before generating the image. This prevents oversized images that exceed page width.
After inserting the image into Word, its width is dynamically resized to fit the printable page area. Because the code is embedded as a graphic, it preserves exact visual appearance across platforms and prevents formatting inconsistencies—but the text is no longer editable.
Output:

Conclusion
Converting Python code to Word documents can be achieved through several approaches depending on your goals. Plain text methods provide simplicity and flexibility, while RTF and HTML techniques offer powerful syntax highlighting with selectable text. Image-based code blocks deliver consistent visual formatting but require careful line wrapping and scaling.
For most documentation workflows:
- Use plain text for editable technical content
- Use HTML or RTF for syntax-highlighted documentation
- Use images when formatting consistency is critical
FAQs
Which method is best for tutorials?
HTML or RTF methods provide clear syntax highlighting with selectable text.
How can I preserve indentation and blank lines?
Read the .py file using .read_text() without stripping or modifying lines.
Why do image-based code blocks become too small?
Word scales images to fit page width. Increasing the image formatter’s scale or adjusting the wrapping width can improve readability.
Can readers copy code from Word?
Yes — except when code is inserted as an image.
Do I need Markdown for conversion?
No. Markdown is optional but useful when working with documentation pipelines.
Can I export the generated document as a PDF file?
Yes. When saving the document, simply specify PDF as the output format in the Document.SaveToFile() method.
Get a Free License
To fully experience the capabilities of Spire.Doc for Python without any evaluation limitations, you can request a 30-day trial license.

CSV (Comma-Separated Values) files are the backbone of data exchange across industries—from data analysis to backend systems. They’re lightweight, human-readable, and compatible with almost every tool (Excel, Google Sheets, databases). If you’re a developer seeking a reliable way to create a CSV file in Python, Spire.XLS for Python is a powerful library that simplifies the process.
In this comprehensive guide, we'll explore how to generate a CSV file in Python with Spire.XLS, covering basic CSV creation and advanced use cases like list to CSV and Excel to CSV conversion.
What You’ll Learn
- Installation and Setup
- Basic: Create a Simple CSV File in Python
- Dynamic Data: Generate CSV from a List of Dictionaries in Python
- Excel-to-CSV: Generate CSV From an Excel File in Python
- Best Practices for CSV Creation
- FAQ: Create CSV in Python
Installation and Setup
Getting started with Spire.XLS for Python is straightforward. Follow these steps to set up your environment:
Step 1: Ensure Python 3.6 or higher is installed.
Step 2: Install the library via pip (the official package manager for Python):
pip install Spire.XLS
Step 3 (Optional): Request a temporary free license to test full features without any limitations.
Basic: Create a Simple CSV File in Python
Let’s start with a simple scenario: creating a CSV file from scratch with static data (e.g., a sales report). The code below creates a new workbook, populates it with data, and saves it as a CSV file.
from spire.xls import *
from spire.xls.common import *
# 1. Create a new workbook
workbook = Workbook()
# 2. Get the first worksheet (default sheet)
worksheet = workbook.Worksheets[0]
# 3. Populate data into cells
# Header row
worksheet.Range["A1"].Text = "ProductID"
worksheet.Range["B1"].Text = "ProductName"
worksheet.Range["C1"].Text = "Price"
worksheet.Range["D1"].Text = "QuantitySold"
worksheet.Range["A2"].NumberValue = 101
worksheet.Range["B2"].Text = "Wireless Headphones"
worksheet.Range["C2"].NumberValue = 79.99
worksheet.Range["D2"].NumberValue = 250
worksheet.Range["A3"].NumberValue = 102
worksheet.Range["B3"].Text = "Bluetooth Speaker"
worksheet.Range["C3"].NumberValue = 49.99
worksheet.Range["D3"].NumberValue = 180
# Save the worksheet to CSV
worksheet.SaveToFile("BasicSalesReport.csv", ",", Encoding.get_UTF8())
workbook.Dispose()
Core Workflow
- Initialize Core object: Workbook() creates a new Excel workbook, Worksheets[0] accesses the target sheet.
- Fill data into cells: Use .Text (for strings) and .NumberValue (for numbers) to ensure correct data types.
- Export & cleanup: SaveToFile() exports the worksheet to CSV , and Dispose() prevents memory leaks.
Output:
The resulting BasicSalesReport.csv will look like this:

Dynamic Data: Generate CSV from a List of Dictionaries in Python
In real-world scenarios, data is often stored in dictionaries (e.g., from APIs/databases). The code below converts a list of dictionaries to a CSV:
from spire.xls import *
from spire.xls.common import *
# Sample data (e.g., from a database/API)
customer_data = [
{"CustomerID": 1, "Name": "John Doe", "Email": "john@example.com", "Country": "USA"},
{"CustomerID": 2, "Name": "Maria Garcia", "Email": "maria@example.es", "Country": "Spain"},
{"CustomerID": 3, "Name": "Li Wei", "Email": "wei@example.cn", "Country": "China"}
]
# 1. Create workbook and worksheet
workbook = Workbook()
worksheet = workbook.Worksheets[0]
# 2. Write headers (extract keys from the first dictionary)
headers = list(customer_data[0].keys())
for col_idx, header in enumerate(headers, start=1):
worksheet.Range[1, col_idx].Text = header # Row 1 = headers
# 3. Write data rows
for row_idx, customer in enumerate(customer_data, start=2): # Start at row 2
for col_idx, key in enumerate(headers, start=1):
# Handle different data types (text/numbers)
value = customer[key]
if isinstance(value, (int, float)):
worksheet.Range[row_idx, col_idx].NumberValue = value
else:
worksheet.Range[row_idx, col_idx].Text = value
# 4. Save as CSV
worksheet.SaveToFile("CustomerData.csv", ",", Encoding.get_UTF8())
workbook.Dispose()
This example is ideal for JSON to CSV conversion, database dumps, and REST API data exports. Key advantages include:
- Dynamic Headers: Automatically extracts headers from the keys of the first dictionary in the dataset.
- Scalable: Seamlessly adapts to any volume of dictionaries or key-value pairs (perfect for dynamic data).
- Clean Output: Preserves the original order of dictionary keys for consistent CSV structure.
The generated CSV file:

Excel-to-CSV: Generate CSV From an Excel File in Python
Spire.XLS excels at converting Excel (XLS/XLSX) to CSV in Python. This is useful if you have Excel reports and need to export them to CSV for data pipelines or third-party tools.
from spire.xls import *
# 1. Initialize a workbook instance
workbook = Workbook()
# 2. Load a xlsx file
workbook.LoadFromFile("Expenses.xlsx")
# 3. Save Excel as a CSV file
workbook.SaveToFile("XLSXToCSV.csv", FileFormat.CSV)
workbook.Dispose()
Conversion result:

Note: By default, SaveToFile() converts only the first worksheet. For converting multiple sheets to separate CSV files, refer to the comprehensive guide: Convert Excel (XLSX/XLS) to CSV in Python – Batch & Multi-Sheet
Best Practices for CSV Creation
Follow these guidelines to ensure robust and professional CSV output:
- Validate Data First: Clean empty rows/columns before exporting to CSV.
- Use UTF-8 Encoding: Always specify UTF-8 encoding (Encoding.get_UTF8()) to support international characters seamlessly.
- Batch Process Smartly: For 100k+ rows, process data in chunks (avoid loading all data into memory at once).
- Choose the Correct Delimiter: Be mindful of regional settings. For European users, use a semicolon (;) as the delimiter to avoid locale issues.
- Dispose Objects: Release workbook/worksheet resources with Dispose() to prevent memory leaks.
Conclusion
Spire.XLS simplifies the process of leveraging Python to generate CSV files. Whether you're creating reports from scratch, converting Excel workbooks, or handling dynamic data from APIs and databases, this library delivers a robust and flexible solution.
By following this guide, you can easily customize delimiters, specify encodings such as UTF-8, and manage data types—ensuring your CSV files are accurate, compatible, and ready for any application. For more advanced features, you can explore the Spire.XLS for Python tutorials.
FAQ: Create CSV in Python
Q1: Why choose Spire.XLS over Python’s built-in csv module?
A: While Python's csv module is excellent for basic read/write operations, Spire.XLS offers significant advantages:
- Better data type handling: Automatic distinction between text and numeric data.
- Excel Compatibility: Seamlessly converts between Excel (XLSX/XLS) and CSV—critical for teams using Excel as a data source.
- Advanced Customization: Supports customizing the delimiter and encoding of the generated CSV file.
- Batch processing: Efficient handling of large datasets and multiple files.
- Cross-Platform Support: Works on Windows, macOS, and Linux (no Excel installation required).
Q2: Can I use Spire.XLS for Python to read CSV files?
A: Yes. Spire.XLS supports parsing CSV files and extracting their data. Details refer to: How to Read CSV Files in Python: A Comprehensive Guide
Q3: Can Spire.XLS convert CSV files back to Excel format?
A: Yes! Spire.XLS supports bidirectional conversion. A quick example:
from spire.xls import *
# Create a workbook
workbook = Workbook()
# Load a CSV file
workbook.LoadFromFile("sample.csv", ",", 1, 1)
# Save CSV as Excel
workbook.SaveToFile("CSVToExcel.xlsx", ExcelVersion.Version2016)
Q4: How do I change the CSV delimiter?
A: The SaveToFile() method’s second parameter controls the delimiter:
# Semicolon (for European locales):
worksheet.SaveToFile("EU.csv", ";", Encoding.get_UTF8())
# Tab (for tab-separated values/TSV)
worksheet.SaveToFile("TSV_File.csv", "\t", Encoding.get_UTF8())

Creating Word documents programmatically is a common requirement in Python applications. Reports, invoices, contracts, audit logs, and exported datasets are often expected to be delivered as editable .docx files rather than plain text or PDFs.
Unlike plain text output, a Word document is a structured document composed of sections, paragraphs, styles, and layout rules. When generating Word documents in Python, treating .docx files as simple text containers quickly leads to layout issues and maintenance problems.
This tutorial focuses on practical Word document creation in Python using Spire.Doc for Python. It demonstrates how to construct documents using Word’s native object model, apply formatting at the correct structural level, and generate .docx files that remain stable and editable as content grows.
Content Overview
- 1. Understanding Word Document Structure in Python
- 2. Creating a Basic Word Document in Python
- 3. Adding and Formatting Text Content
- 4. Inserting Images into a Word Document
- 5. Creating and Populating Tables
- 6. Adding Headers and Footers
- 7. Controlling Page Layout with Sections
- 8. Setting Document Properties and Metadata
- 9. Saving, Exporting, and Performance Considerations
- 10. Common Pitfalls When Creating Word Documents in Python
1. Understanding Word Document Structure in Python
Before writing code, it is important to understand how a Word document is structured internally.
A .docx file is not a linear stream of text. It consists of multiple object layers, each with a specific responsibility:
- Document – the root container for the entire file
- Section – defines page-level layout such as size, margins, and orientation
- Paragraph – represents a logical block of text
- Run (TextRange) – an inline segment of text with character formatting
- Style – a reusable formatting definition applied to paragraphs or runs
When you create a Word document in Python, you are explicitly constructing this hierarchy in code. Formatting and layout behave predictably only when content is added at the appropriate level.
Spire.Doc for Python provides direct abstractions for these elements, allowing you to work with Word documents in a way that closely mirrors how Word itself organizes content.
2. Creating a Basic Word Document in Python
This section shows how to generate a valid Word document in Python using Spire.Doc. The example focuses on establishing the correct document structure and essential workflow.
Installing Spire.Doc for Python
pip install spire.doc
Alternatively, you can download Spire.Doc for Python and integrate it manually.
Creating a Simple .docx File
from spire.doc import Document, FileFormat
# Create the document container
document = Document()
# Add a section (defines page-level layout)
section = document.AddSection()
# Add a paragraph to the section
paragraph = section.AddParagraph()
paragraph.AppendText(
"This document was generated using Python. "
"It demonstrates basic Word document creation with Spire.Doc."
)
# Save the document
document.SaveToFile("basic_document.docx", FileFormat.Docx)
document.Close()
This example creates a minimal but valid .docx file that can be opened in Microsoft Word. It demonstrates the essential workflow: creating a document, adding a section, inserting a paragraph, and saving the file.

From a technical perspective:
- The Document object represents the Word file structure and manages its content.
- The Section defines the page-level layout context for paragraphs.
- The Paragraph contains the visible text and serves as the basic unit for all paragraph-level formatting.
All Word documents generated with Spire.Doc follow this same structural pattern, which forms the foundation for more advanced operations.
3. Adding and Formatting Text Content
Text in a Word document is organized hierarchically. Formatting can be applied at the paragraph level (controlling alignment, spacing, indentation, etc.) or the character level (controlling font, size, color, bold, italic, etc.). Styles provide a convenient way to store these formatting settings so they can be consistently applied to multiple paragraphs or text ranges without redefining the formatting each time. Understanding the distinction between paragraph formatting, character formatting, and styles is essential when creating or editing Word documents in Python.
Adding and Setting Paragraph Formatting
All visible text in a Word document must be added through paragraphs, which serve as containers for text and layout. Paragraph-level formatting controls alignment, spacing, and indentation, and can be set directly via the Paragraph.Format property. Character-level formatting, such as font size, bold, or color, can be applied to text ranges within the paragraph via the TextRange.CharacterFormat property.
from spire.doc import Document, HorizontalAlignment, FileFormat, Color
document = Document()
section = document.AddSection()
# Add the title paragraph
title = section.AddParagraph()
title.Format.HorizontalAlignment = HorizontalAlignment.Center
title.Format.AfterSpacing = 20 # Space after the title
title.Format.BeforeSpacing = 20
title_range = title.AppendText("Monthly Sales Report")
title_range.CharacterFormat.FontSize = 18
title_range.CharacterFormat.Bold = True
title_range.CharacterFormat.TextColor = Color.get_LightBlue()
# Add the body paragraph
body = section.AddParagraph()
body.Format.FirstLineIndent = 20
body_range = body.AppendText(
"This report provides an overview of monthly sales performance, "
"including revenue trends across different regions and product categories. "
"The data presented below is intended to support management decision-making."
)
body_range.CharacterFormat.FontSize = 12
# Save the document
document.SaveToFile("formatted_paragraph.docx", FileFormat.Docx)
document.Close()
Below is a preview of the generated Word document.

Technical notes
- Paragraph.Format sets alignment, spacing, and indentation for the entire paragraph
- AppendText() returns a TextRange object, which allows character-level formatting (font size, bold, color)
- Every paragraph must belong to a section, and paragraph order determines reading flow and pagination
Creating and Applying Styles
Styles allow you to define paragraph-level and character-level formatting once and reuse it across the document. They can store alignment, spacing, font, and text emphasis, making formatting more consistent and easier to maintain. Word documents support both custom styles and built-in styles, which must be added to the document before being applied.
Creating and Applying a Custom Paragraph Style
from spire.doc import (
Document, HorizontalAlignment, BuiltinStyle,
TextAlignment, ParagraphStyle, FileFormat
)
document = Document()
# Create a new custom paragraph style
custom_style = ParagraphStyle(document)
custom_style.Name = "CustomStyle"
custom_style.ParagraphFormat.HorizontalAlignment = HorizontalAlignment.Center
custom_style.ParagraphFormat.TextAlignment = TextAlignment.Auto
custom_style.CharacterFormat.Bold = True
custom_style.CharacterFormat.FontSize = 20
# Inherit properties from a built-in heading style
custom_style.ApplyBaseStyle(BuiltinStyle.Heading1)
# Add the style to the document
document.Styles.Add(custom_style)
# Apply the custom style
title_para = document.AddSection().AddParagraph()
title_para.ApplyStyle(custom_style.Name)
title_para.AppendText("Regional Performance Overview")
Adding and Applying Built-in Styles
# Add a built-in style to the document
built_in_style = document.AddStyle(BuiltinStyle.Heading2)
document.Styles.Add(built_in_style)
# Apply the built-in style
heading_para = document.Sections.get_Item(0).AddParagraph()
heading_para.ApplyStyle(built_in_style.Name)
heading_para.AppendText("Sales by Region")
document.SaveToFile("document_styles.docx", FileFormat.Docx)
Preview of the generated Word document.

Technical Explanation
- ParagraphStyle(document) creates a reusable style object associated with the current document
- ParagraphFormat controls layout-related settings such as alignment and text flow
- CharacterFormat defines font-level properties like size and boldness
- ApplyBaseStyle() allows the custom style to inherit semantic meaning and default behavior from a built-in Word style
- Adding the style to document.Styles makes it available for use across all sections
Built-in styles, such as Heading 2, can be added explicitly and applied in the same way, ensuring the document remains compatible with Word features like outlines and tables of contents.
4. Inserting Images into a Word Document
In Word’s document model, images are embedded objects that belong to paragraphs, which ensures they flow naturally with text. Paragraph-anchored images adjust pagination automatically and maintain relative positioning when content changes.
Adding an Image to a Paragraph
from spire.doc import Document, TextWrappingStyle, HorizontalAlignment, FileFormat
document = Document()
section = document.AddSection()
section.AddParagraph().AppendText("\r\n\r\nExample Image\r\n")
# Insert an image
image_para = section.AddParagraph()
image_para.Format.HorizontalAlignment = HorizontalAlignment.Center
image = image_para.AppendPicture("Screen.jpg")
# Set the text wrapping style
image.TextWrappingStyle = TextWrappingStyle.Square
# Set the image size
image.Width = 350
image.Height = 200
# Set the transparency
image.FillTransparency(0.7)
# Set the horizontal alignment
image.HorizontalAlignment = HorizontalAlignment.Center
document.SaveToFile("document_images.docx", FileFormat.Docx)
Preview of the generated Word document.

Technical details
- AppendPicture() inserts the image into the paragraph, making it part of the text flow
- TextWrappingStyle determines how surrounding text wraps around the image
- Width and Height control the displayed size of the image
- FillTransparency() sets the image opacity
- HorizontalAlignment can center the image within the paragraph
Adding images to paragraphs ensures they behave like part of the text flow.
- Pagination adjusts automatically when images change size.
- Surrounding text reflows correctly when content is edited.
- When exporting to formats like PDF, images maintain their relative position.
These behaviors are consistent with Word’s handling of inline images.
For more advanced image operations in Word documents using Python, see how to insert images into a Word document with Python for a complete guide.
5. Creating and Populating Tables
Tables are commonly used to present structured data such as reports, summaries, and comparisons.
Internally, a table consists of rows, cells, and paragraphs inside each cell.
Creating and Formatting a Table in a Word Document
from spire.doc import Document, DefaultTableStyle, FileFormat, AutoFitBehaviorType
document = Document()
section = document.AddSection()
section.AddParagraph().AppendText("\r\n\r\nExample Table\r\n")
# Define the table data
table_headers = ["Region", "Product", "Units Sold", "Unit Price ($)", "Total Revenue ($)"]
table_data = [
["North", "Laptop", 120, 950, 114000],
["North", "Smartphone", 300, 500, 150000],
["South", "Laptop", 80, 950, 76000],
["South", "Smartphone", 200, 500, 100000],
["East", "Laptop", 150, 950, 142500],
["East", "Smartphone", 250, 500, 125000],
["West", "Laptop", 100, 950, 95000],
["West", "Smartphone", 220, 500, 110000]
]
# Add a table to the section
table = section.AddTable()
table.ResetCells(len(table_data) + 1, len(table_headers))
# Populate table headers
for col_index, header in enumerate(table_headers):
header_range = table.Rows[0].Cells[col_index].AddParagraph().AppendText(header)
header_range.CharacterFormat.FontSize = 14
header_range.CharacterFormat.Bold = True
# Populate table data
for row_index, row_data in enumerate(table_data):
for col_index, cell_data in enumerate(row_data):
data_range = table.Rows[row_index + 1].Cells[col_index].AddParagraph().AppendText(str(cell_data))
data_range.CharacterFormat.FontSize = 12
# Apply a default table style and auto-fit columns
table.ApplyStyle(DefaultTableStyle.ColorfulListAccent6)
table.AutoFit(AutoFitBehaviorType.AutoFitToContents)
document.SaveToFile("document_tables.docx", FileFormat.Docx)
Preview of the generated Word document.

Technical details
- Section.AddTable() inserts the table into the section content flow
- ResetCells(rows, columns) defines the table grid explicitly
- Table[row, column] or Table.Rows[row].Cells[col] returns a TableCell
Tables in Word are designed so that each cell acts as an independent content container. Text is always inserted through paragraphs, and each cell can contain multiple paragraphs, images, or formatted text. This structure allows tables to scale from simple grids to complex report layouts, making them flexible for reports, summaries, or any structured content.
For more detailed examples and advanced operations using Python, such as dynamically generating tables, merging cells, or formatting individual cells, see how to insert tables into Word documents with Python for a complete guide.
6. Adding Headers and Footers
Headers and footers in Word are section-level elements. They are not part of the main content flow and do not affect body pagination.
Each section owns its own header and footer, which allows different parts of a document to display different repeated content.
Adding Headers and Footers in a Section
from spire.doc import Document, FileFormat, HorizontalAlignment, FieldType, BreakType
document = Document()
section = document.AddSection()
section.AddParagraph().AppendBreak(BreakType.PageBreak)
# Add a header
header = section.HeadersFooters.Header
header_para1 = header.AddParagraph()
header_para1.AppendText("Monthly Sales Report").CharacterFormat.FontSize = 12
header_para1.Format.HorizontalAlignment = HorizontalAlignment.Left
header_para2 = header.AddParagraph()
header_para2.AppendText("Company Name").CharacterFormat.FontSize = 12
header_para2.Format.HorizontalAlignment = HorizontalAlignment.Right
# Add a footer with page numbers
footer = section.HeadersFooters.Footer
footer_para = footer.AddParagraph()
footer_para.Format.HorizontalAlignment = HorizontalAlignment.Center
footer_para.AppendText("Page ").CharacterFormat.FontSize = 12
footer_para.AppendField("PageNum", FieldType.FieldPage).CharacterFormat.FontSize = 12
footer_para.AppendText(" of ").CharacterFormat.FontSize = 12
footer_para.AppendField("NumPages", FieldType.FieldNumPages).CharacterFormat.FontSize = 12
document.SaveToFile("document_header_footer.docx", FileFormat.Docx)
document.Dispose()
Preview of the generated Word document.

Technical notes
- section.HeadersFooters.Header / .Footer provides access to header/footer of the section
- AppendField() inserts dynamic fields like FieldPage or FieldNumPages to display dynamic content
Headers and footers are commonly used for report titles, company information, and page numbering. They update automatically as the document changes and are compatible with Word, PDF, and other export formats.
For more detailed examples and advanced operations, see how to insert headers and footers in Word documents with Python.
7. Controlling Page Layout with Sections
In Spire.Doc for Python, all page-level layout settings are managed through the Section object. Page size, orientation, and margins are defined by the section’s PageSetup and apply to all content within that section.
Configuring Page Size and Orientation
from spire.doc import PageSize, PageOrientation
section.PageSetup.PageSize = PageSize.A4()
section.PageSetup.Orientation = PageOrientation.Portrait
Technical explanation
- PageSetup is a layout configuration object owned by the Section
- PageSize defines the physical dimensions of the page
- Orientation controls whether pages are rendered in portrait or landscape mode
PageSetup defines the layout for the entire section. All paragraphs, tables, and images added to the section will follow these settings. Changing PageSetup in one section does not affect other sections in the document, allowing different sections to have different page layouts.
Setting Page Margins
section.PageSetup.Margins.Top = 50
section.PageSetup.Margins.Bottom = 50
section.PageSetup.Margins.Left = 60
section.PageSetup.Margins.Right = 60
Technical explanation
- Margins defines the printable content area for the section
- Margin values are measured in document units
Margins control the body content area for the section. They are evaluated at the section level, so you do not need to set them for individual paragraphs, and header/footer areas are not affected.
Using Multiple Sections for Different Layouts
When a document requires different page layouts, additional sections must be created.
landscape_section = document.AddSection()
landscape_section.PageSetup.Orientation = PageOrientation.Landscape
Technical notes
- AddSection() creates a new section and appends it to the document
- Each section maintains its own PageSetup, headers, and footers
- Content added after this call belongs to the new section
Using multiple sections allows mixing portrait and landscape pages or applying different layouts within a single Word document.
Below is an example preview of the above settings in a Word document:

8. Setting Document Properties and Metadata
In addition to visible content, Word documents expose metadata through built-in document properties. These properties are stored at the document level and do not affect layout or rendering.
Assigning Built-in Document Properties
document.BuiltinDocumentProperties.Title = "Monthly Sales Report"
document.BuiltinDocumentProperties.Author = "Data Analytics System"
document.BuiltinDocumentProperties.Company = "Example Corp"
Technical notes
BuiltinDocumentPropertiesprovides access to standard document properties- Properties such as
Title,Author, andCompanycan be set programmatically
Document properties are commonly used for file indexing, search, document management, and audit workflows. In addition to built-in properties, Word documents support other metadata such as Keywords, Subject, Comments, and Hyperlink base. You can also define custom properties using Document.CustomDocumentProperties.
For a guide on managing document custom properties with Python, see how to manage custom metadata in Word documents with Python.
9. Saving, Exporting, and Performance Considerations
After constructing a Word document in memory, the final step is saving or exporting it to the required output format. Spire.Doc for Python supports multiple export formats through a unified API, allowing the same document structure to be reused without additional formatting logic.
Saving and Exporting Word Documents in Multiple Formats
A document can be saved as DOCX for editing or exported to other commonly used formats for distribution.
from spire.doc import FileFormat
document.SaveToFile("output.docx", FileFormat.Docx)
document.SaveToFile("output.pdf", FileFormat.PDF)
document.SaveToFile("output.html", FileFormat.Html)
document.SaveToFile("output.rtf", FileFormat.Rtf)
The export process preserves document structure, including sections, tables, images, headers, and footers, ensuring consistent layout across formats. Check out all the supported formats in the FileFormat enumeration.
Performance Considerations for Document Generation
For scenarios involving frequent or large-scale Word document generation, performance can be improved by:
- Reusing document templates and styles
- Avoiding unnecessary section creation
- Writing documents to disk only after all content has been generated
- After saving or exporting, explicitly releasing resources using document.Close()
When generating many similar documents with different data, mail merge is more efficient than inserting content programmatically for each file. Spire.Doc for Python provides built-in mail merge support for batch document generation. For details, see how to generate Word documents in bulk using mail merge in Python.
Saving and exporting are integral parts of Word document generation in Python. By using Spire.Doc for Python’s export capabilities and following basic performance practices, Word documents can be generated efficiently and reliably for both individual files and batch workflows.
10. Common Pitfalls When Creating Word Documents in Python
The following issues frequently occur when generating Word documents programmatically.
Treating Word Documents as Plain Text
Issue Formatting breaks when content length changes.
Recommendation Always work with sections, paragraphs, and styles rather than inserting raw text.
Hard-Coding Formatting Logic
Issue Global layout changes require editing multiple code locations.
Recommendation Centralize formatting rules using styles and section-level configuration.
Ignoring Section Boundaries
Issue Margins or orientation changes unexpectedly affect the entire document.
Recommendation Use separate sections to isolate layout rules.
11. Conclusion
Creating Word documents in Python involves more than writing text to a file. A .docx document is a structured object composed of sections, paragraphs, styles, and embedded elements.
By using Spire.Doc for Python and aligning code with Word’s document model, you can generate editable, well-structured Word files that remain stable as content and layout requirements evolve. This approach is especially suitable for backend services, reporting pipelines, and document automation systems.
For scenarios involving large documents or document conversion requirements, a licensed version is required.

CSV (Comma-Separated Values) is a universal file format for storing tabular data, while lists are Python’s fundamental data structure for easy data manipulation. Converting CSV to lists in Python enables seamless data processing, analysis, and integration with other workflows. While Python’s built-in csv module works for basic cases, Spire.XLS for Python simplifies handling structured CSV data with its intuitive spreadsheet-like interface.
This article will guide you through how to use Python to read CSV into lists (and lists of dictionaries), covering basic to advanced scenarios with practical code examples.
Table of Contents:
- Why Choose Spire.XLS for CSV to List Conversion?
- Basic Conversion: CSV to Python List
- Advanced: Convert CSV to List of Dictionaries
- Handle Special Scenarios
- Conclusion
- Frequently Asked Questions
Why Choose Spire.XLS for CSV to List Conversion?
Spire.XLS is a powerful library designed for spreadsheet processing, and it excels at CSV handling for several reasons:
- Simplified Indexing: Uses intuitive 1-based row/column indexing (matching spreadsheet logic).
- Flexible Delimiters: Easily specify custom separators (commas, tabs, semicolons, etc.).
- Structured Access: Treats CSV data as a worksheet, making row/column traversal straightforward.
- Robust Data Handling: Automatically parses numbers, dates, and strings without extra code.
Installation
Before starting, install Spire.XLS for Python using pip:
pip install Spire.XLS
This command installs the latest stable version, enabling immediate use in your projects.
Basic Conversion: CSV to Python List
If your CSV file has no headers (pure data rows), Spire.XLS can directly read rows and convert them to a list of lists (each sublist represents a CSV row).
Step-by-Step Process:
- Import the Spire.XLS module.
- Create a Workbook object and load the CSV file.
- Access the first worksheet (Spire.XLS parses CSV into a worksheet).
- Traverse rows and cells, extracting values into a Python list.
CSV to List Python Code Example:
from spire.xls import *
from spire.xls.common import *
# Initialize Workbook and load CSV
workbook = Workbook()
workbook.LoadFromFile("Employee.csv",",")
# Get the first worksheet
sheet = workbook.Worksheets[0]
# Convert sheet data to a list of lists
data_list = []
for i in range(sheet.Rows.Length):
row = []
for j in range(sheet.Columns.Length):
cell_value = sheet.Range[i + 1, j + 1].Value
row.append(cell_value)
data_list.append(row)
# Display the result
for row in data_list:
print(row)
# Dispose resources
workbook.Dispose()
Output:

If you need to convert the list back to CSV, refer to: Python List to CSV: 1D/2D/Dicts – Easy Tutorial
Advanced: Convert CSV to List of Dictionaries
For CSV files with headers (e.g., name,age,city), converting to a list of dictionaries (where keys are headers and values are row data) is more intuitive for data manipulation.
CSV to Dictionary Python Code Example:
from spire.xls import *
# Initialize Workbook and load CSV
workbook = Workbook()
workbook.LoadFromFile("Customer_Data.csv", ",")
# Get the first worksheet
sheet = workbook.Worksheets[0]
# Extract headers (first row)
headers = []
for j in range(sheet.Columns.Length):
headers.append(sheet.Range[1, j + 1].Value)
# Convert data rows to list of dictionaries
dict_list = []
for i in range(1, sheet.Rows.Length): # Skip header row
row_dict = {}
for j in range(sheet.Columns.Length):
key = headers[j]
value = sheet.Range[i + 1, j + 1].Value
row_dict[key] = value
dict_list.append(row_dict)
# Output the result
for record in dict_list:
print(record)
workbook.Dispose()
Explanation
- Load the CSV: Use LoadFromFile() method of Workbook class.
- Extracting Headers: Pull the first row of the worksheet to use as dictionary keys.
- Map Rows to Dictionaries: For each data row (skipping the header row), create a dictionary where keys are headers and values are cell contents.
Output:

Handle Special Scenarios
CSV with Custom Delimiters (e.g., Tabs, Semicolons)
To process CSV files with delimiters other than commas (e.g., tab-separated TSV files), specify the delimiter in LoadFromFile:
# Load a tab-separated file
workbook.LoadFromFile("data.tsv", "\t")
# Load a semicolon-separated file
workbook.LoadFromFile("data_eu.csv", ";")
Clean Empty Values
Empty cells in the CSV are preserved as empty strings ('') in the list. To replace empty strings with a custom value (e.g., "N/A"), modify the cell value extraction:
cell_value = sheet.Range[i + 1, j + 1].Value or "N/A"
Conclusion
Converting CSV to lists in Python using Spire.XLS is efficient, flexible, and beginner-friendly. Whether you need a list of lists for raw data or a list of dictionaries for structured analysis, this library handles parsing, indexing, and resource management efficiently. By following the examples above, you can integrate this conversion into data pipelines, analysis scripts, or applications with minimal effort.
For more advanced features (e.g., CSV to Excel conversion, batch processing), you can visit the Spire.XLS for Python documentation.
Frequently Asked Questions
Q1: Is Spire.XLS suitable for large CSV files?
A: Spire.XLS handles large files efficiently, but for very large datasets (millions of rows), consider processing in chunks or using specialized big data tools. For typical business datasets, it performs excellently.
Q2: How does this compare to using pandas for CSV to list conversion?
A: Spire.XLS offers more control over the parsing process and doesn't require additional data science dependencies. While pandas is great for analysis, Spire.XLS is ideal when you need precise control over CSV parsing or are working in environments without pandas.
Q3: How do I handle CSV files with headers when converting to lists?
A: For headers, use the dictionary conversion method. Extract the first row as headers, then map subsequent rows to dictionaries where keys are header values. This preserves column meaning and enables easy data access by column name.
Q4: How do I convert only specific columns from my CSV to a list?
A: Modify the inner loop to target specific columns:
# Convert only columns 1 and 3 (index 0 and 2)
target_columns = [0, 2]
for i in range(sheet.Rows.Length):
row = []
for j in target_columns:
cell_value = sheet.Range[i + 1, j + 1].Value
row.append(cell_value)
data_list.append(row)
Python TXT to CSV Tutorial | Convert TXT Files to CSV in Python
2025-10-15 07:42:33 Written by zaki zou
When working with data in Python, converting TXT files to CSV is a common and essential task for data analysis, reporting, or sharing data between applications. TXT files often store unstructured plain text, which can be difficult to process, while CSV files organize data into rows and columns, making it easier to work with and prepare for analysis. This tutorial explains how to convert TXT to CSV in Python efficiently, covering single-file conversion, batch conversion, and tips for handling different delimiters.
Table of Contents
- What is a CSV File
- Python TXT to CSV Library - Installation
- Convert a TXT File to CSV in Python (Step-by-Step)
- Automate Batch Conversion of Multiple TXT Files
- Advanced Tips for Python TXT to CSV Conversion
- Conclusion
- FAQs: Python Text to CSV
What is a CSV File?
A CSV (Comma-Separated Values) file is a simple text-based file format used to store tabular data. Each line in a CSV file represents a row, and values within the row are separated by commas (or another delimiter such as tabs or semicolons).
CSV is widely supported by spreadsheet applications, databases, and programming languages like Python. Its simple format makes it easy to import, export, and use across platforms such as Excel, Google Sheets, R, and SQL for data analysis and automation.
An Example CSV File:
Name, Age, City
John, 28, New York
Alice, 34, Los Angeles
Bob, 25, Chicago
Python TXT to CSV Library - Installation
To perform TXT to CSV conversion in Python, we will use Spire.XLS for Python, a powerful library for creating and manipulating Excel and CSV files, without requiring Microsoft Excel to be installed.

You can install it directly from PyPI with the following command:
pip install Spire.XLS
If you need instructions for the installation, visit the guide on How to Install Spire.XLS for Python.
Convert a TXT File to CSV in Python (Step-by-Step)
Converting a text file to CSV in Python is straightforward. You can complete the task in just a few steps. Below is a basic outline of the process:
- Prepare and read the text file: Load your TXT file and read its content line by line.
- Split the text data: Separate each line into fields using a specific delimiter such as a space, tab, or comma.
- Write data to CSV: Use Spire.XLS to write the processed data into a new CSV file.
- Verify the output: Check the CSV in Excel, Google Sheets, or a text editor.
The following code demonstrates how to export a TXT file to CSV using Python:
from spire.xls import *
# Read the txt file
with open("data.txt", "r", encoding="utf-8") as file:
lines = file.readlines()
# Process each line by splitting based on spaces (you can change the delimiter if needed)
processed_data = [line.strip().split() for line in lines]
# Create an Excel workbook
workbook = Workbook()
# Get the first worksheet
sheet = workbook.Worksheets[0]
# Write data from the processed list to the worksheet
for row_num, row_data in enumerate(processed_data):
for col_num, cell_data in enumerate(row_data):
# Write data into cells
sheet.Range[row_num + 1, col_num + 1].Value = cell_data
# Save the sheet as CSV file (UTF-8 encoded)
sheet.SaveToFile("TxtToCsv.csv", ",", Encoding.get_UTF8())
# Dispose the workbook to release resources
workbook.Dispose()
TXT to CSV Output:

If you are also interested in converting a TXT file to Excel, see the guide on converting TXT to Excel in Python.
Automate Batch Conversion of Multiple TXT Files
If you have multiple text files that you want to convert to CSV automatically, you can loop through all .txt files in a folder and convert them one by one.
The following code demonstrates how to batch convert multiple TXT files to CSV in Python:
import os
from spire.xls import *
# Folder containing TXT files
input_folder = "txt_files"
output_folder = "csv_files"
# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)
# Function to process a single TXT file
def convert_txt_to_csv(file_path, output_path):
# Read the TXT file
with open(file_path, "r", encoding="utf-8") as f:
lines = f.readlines()
# Process each line (split by space, modify if your delimiter is different)
processed_data = [line.strip().split() for line in lines if line.strip()]
# Create workbook and access the first worksheet
workbook = Workbook()
sheet = workbook.Worksheets[0]
# Write processed data into the sheet
for row_num, row_data in enumerate(processed_data):
for col_num, cell_data in enumerate(row_data):
sheet.Range[row_num + 1, col_num + 1].Value = cell_data
# Save the sheet as CSV with UTF-8 encoding
sheet.SaveToFile(output_path, ",", Encoding.get_UTF8())
workbook.Dispose()
print(f"Converted '{file_path}' -> '{output_path}'")
# Loop through all TXT files in the folder and convert each to a CSV file with the same file name
for filename in os.listdir(input_folder):
if filename.lower().endswith(".txt"):
input_path = os.path.join(input_folder, filename)
output_name = os.path.splitext(filename)[0] + ".csv"
output_path = os.path.join(output_folder, output_name)
convert_txt_to_csv(input_path, output_path)
Advanced Tips for Python TXT to CSV Conversion
Converting text files to CSV can involve variations in text file layout and potential errors, so these tips will help you handle different scenarios more effectively.
1. Handle Different Delimiters
Not all text files use spaces to separate values. If your TXT file uses tabs, commas, or other characters, you can adjust the split() function to match the delimiter.
- For tab-separated files (.tsv):
processed_data = [line.strip().split('\t') for line in lines]
- For comma-separated files:
processed_data = [line.strip().split(',') for line in lines]
- For custom delimiters (e.g., |):
processed_data = [line.strip().split('|') for line in lines]
This ensures that your data is correctly split into columns before writing to CSV.
2. Add Error Handling
When reading or writing files, it's a good practice to use try-except blocks to catch potential errors. This makes your script more robust and prevents unexpected crashes.
try:
# your code here
except Exception as e:
print("Error:", e)
Tip: Use descriptive error messages to help understand the problem.
- Skip Empty Lines
Sometimes, text files may have empty lines. You can filter out the blank lines to avoid creating empty rows in CSV:
processed_data = [line.strip().split() for line in lines if line.strip()]
Conclusion
In this article, you learned how to convert a TXT file to CSV format in Python using Spire.XLS for Python. This conversion is an essential step in data preparation, helping organize raw text into a structured format suitable for analysis, reporting, and sharing. With Spire.XLS for Python, you can automate the text to CSV conversion, handle different delimiters, and efficiently manage multiple text files.
If you have any questions or need technical assistance about Python TXT to CSV conversion, visit our Support Forum for help.
FAQs: Python Text to CSV
Q1: Can I convert TXT files to CSV without Microsoft Excel installed?
A1: Yes. Spire.XLS for Python works independently of Microsoft Excel, allowing you to create and export CSV files directly.
Q2: How to batch convert multiple TXT files to CSV in Python?
A2: Use a loop to read all TXT files in a folder and apply the conversion function for each. The tutorial includes a ready-to-use Python example for batch conversion.
Q3: How do I handle empty lines or inconsistent rows in TXT files when converting to CSV?
A3: Filter out empty lines during processing and implement checks for consistent column counts to avoid errors or blank rows in the output CSV.
Q4: How do I convert TXT files with tabs or custom delimiters to CSV in Python?
A4: You can adjust the split() function in your Python script to match the delimiter in your TXT file-tabs (\t), commas, or custom characters-before writing to CSV.

In today's data-driven world, Python developers frequently need to convert lists (a fundamental Python data structure) into Excel spreadsheets. Excel remains the standard for data presentation, reporting, and sharing across industries. Whether you're generating reports, preparing data for analysis, or sharing information with non-technical stakeholders, the ability to efficiently export Python lists to Excel is a valuable skill.
While lightweight libraries like pandas can handle basic exports, Spire.XLS for Python gives you full control over Excel formatting, styles, and file generation – all without requiring Microsoft Excel. In this comprehensive guide, we'll explore how to use the library to convert diverse list structures into Excel in Python, complete with detailed examples and best practices.
- Why Convert Python Lists to Excel?
- Installation Guide
- Basic – Convert a Simple Python List to Excel
- Convert Nested Lists to Excel in Python
- Convert a List of Dictionaries to Excel
- 4 Tips to Optimize Your Excel Outputs
- Conclusion
- FAQs
Why Convert Python Lists to Excel?
Lists in Python are versatile for storing structured or unstructured data, but Excel offers advantages in:
- Collaboration: Excel is universally used, and stakeholders can edit, sort, or filter data without Python knowledge.
- Reporting: Add charts, pivot tables, or summaries to Excel after export.
- Compliance: Many industries require data in Excel for audits or record-keeping.
- Visualization: Excel’s formatting tools (colors, borders, headers) make data easier to read than raw Python lists.
Whether you’re working with sales data, user records, or survey results, writing lists to Excel in Python ensures your data is accessible and professional.
Installation Guide
To get started with Spire.XLS for Python, install it using pip:
pip install Spire.XLS
The Python Excel library supports Excel formats like .xls or .xlsx and lets you customize formatting (bold headers, column widths, colors), perfect for production-ready files.
To fully experience the capabilities of Spire.XLS for Python, you can request a free 30-day trial license here.
Basic – Convert a Simple Python List to Excel
For a basic one-dimensional list, iterate through the items and write them to consecutive cells in a single column.
This code example converts a list of text strings into a single column. If you need to convert a list of numeric values, you can set their number format before saving.
from spire.xls import *
from spire.xls.common import *
# Create a Workbook object
workbook = Workbook()
# Clear the default sheets
workbook.Worksheets.Clear()
# Add a new worksheet
worksheet = workbook.Worksheets.Add("Simple List")
# Sample list
data_list = ["Alexander", "Bob", "Charlie", "Diana", "Eve"]
# Write list data to Excel cells (starting from row 1, column 1)
for index, value in enumerate(data_list):
worksheet.Range[index + 1, 1].Value = value
# Set column width for better readability
worksheet.Range[1, 1].ColumnWidth = 15
# Save the workbook
workbook.SaveToFile("SimpleListToExcel.xlsx", ExcelVersion.Version2016)
workbook.Dispose()
If you need to write the list in a single row, use the following:
for index, value in enumerate(data_list):
worksheet.Range[1, index + 1].Value = value
Output: A clean Excel file with one column of names, properly spaced.

Convert Nested Lists to Excel in Python
Nested lists (2D Lists) represent tabular data with rows and columns, making them perfect for direct conversion to Excel tables. Let’s convert a nested list of employee data (name, age, department) to an Excel table.
from spire.xls import *
from spire.xls.common import *
# Create a Workbook object
workbook = Workbook()
# Clear the default sheets
workbook.Worksheets.Clear()
# Add a new worksheet
worksheet = workbook.Worksheets.Add("Employee Data")
# Nested list (rows: [Name, Age, Department])
employee_data = [
["Name", "Age", "Department"], # Header row
["Alexander", 30, "HR"],
["Bob", 28, "Engineering"],
["Charlie", 35, "Marketing"],
["Diana", 29, "Finance"]
]
# Write nested list to Excel
for row_idx, row_data in enumerate(employee_data):
for col_idx, value in enumerate(row_data):
if isinstance(value, int):
worksheet.Range[row_idx + 1, col_idx + 1].NumberValue = value
else:
worksheet.Range[row_idx + 1, col_idx + 1].Value = value
# Format header row
worksheet.Range["A1:C1"].Style.Font.IsBold = True
worksheet.Range["A1:C1"].Style.Color = Color.get_Yellow()
# Set column widths
worksheet.Range[1, 1].ColumnWidth = 10
worksheet.Range[1, 2].ColumnWidth = 6
worksheet.Range[1, 3].ColumnWidth = 15
# Save the workbook
workbook.SaveToFile("NestedListToExcel.xlsx", ExcelVersion.Version2016)
workbook.Dispose()
Explanation:
- Nested List Structure: The first sub-list acts as headers, and subsequent sub-lists are data rows.
- 2D Loop: We use nested loops to write each row and column to Excel cells.
Output: An Excel table with bold yellow headers and correctly typed data.

To make your Excel files more professional, you can add cell borders, set conditional formatting, or apply other formatting options with Spire.XLS for Python.
Convert a List of Dictionaries to Excel
Lists of dictionaries are common in Python for storing structured data with labeled fields. This example converts a list of dictionaries (e.g., customer records) to Excel and auto-extracts headers from dictionary keys.
from spire.xls import *
from spire.xls.common import *
# Create a Workbook object
workbook = Workbook()
# Clear the default sheets
workbook.Worksheets.Clear()
# Add a new worksheet
worksheet = workbook.Worksheets.Add("Customer Data")
# List of dictionaries
customers = [
{"ID": 101, "Name": "John Doe", "Email": "john@example.com"},
{"ID": 102, "Name": "Jane Smith", "Email": "jane@example.com"},
{"ID": 103, "Name": "Mike Johnson", "Email": "mike@example.com"}
]
# Extract headers from dictionary keys
headers = list(customers[0].keys())
# Write headers to row 1
for col, header in enumerate(headers):
worksheet.Range[1, col + 1].Value = header
worksheet.Range[1, col + 1].Style.Font.IsBold = True # Bold headers
# Write data rows
for row, customer in enumerate(customers, start=2): # Start from row 2
for col, key in enumerate(headers):
value = customer[key]
if isinstance(value, (int, float)):
worksheet.Range[row, col + 1].NumberValue = value
else:
worksheet.Range[row, col + 1].Value = value
# Adjust column widths
worksheet.AutoFitColumn(2)
worksheet.AutoFitColumn(3)
# Save the file
workbook.SaveToFile("CustomerDataToExcel.xlsx", ExcelVersion.Version2016)
workbook.Dispose()
Why This Is Useful:
- Auto-Extracted Headers: Saves time. No need to retype headers like “ID” or “Email”.
- Auto-Fit Columns: Excel automatically adjusts column width to fit the longest text.
- Scalable: Works for large lists of dictionaries (e.g., 1000+ customers).
Output: Excel file with headers auto-created, data types preserved, and columns automatically sized.

4 Tips to Optimize Your Excel Outputs
- Preserve Data Types: Always use NumberValue for numbers (avoids issues with Excel calculations later).
- Auto-Fit Columns: Use worksheet.AutoFitColumn() to skip manual width adjustments.
- Name Worksheets Clearly: Instead of “Sheet1”, use names like “Q3 Sales” to make files user-friendly.
- Dispose of Workbooks: Always call workbook.Dispose() to free memory (critical for large datasets).
Conclusion
Converting lists to Excel in Python is a critical skill for data professionals, and Spire.XLS makes it easy to create polished, production-ready files. Whether you’re working with simple lists, nested data, or dictionaries, the examples above can be adapted to your needs.
For even more flexibility (e.g., adding charts or formulas), explore Spire.XLS’s documentation.
FAQs for List to Excel Conversion
Q1: How is Spire.XLS different from pandas for converting lists to Excel?
A: Pandas is great for quick, basic exports, but it lacks fine-grained control over Excel formatting. Spire.XLS is better when you need:
- Custom styles (colors, fonts, borders).
- Advanced Excel features (freeze panes, conditional formatting, charts).
- Standalone functionality (no Excel installation required).
Q2: How do I save my Excel file in different formats?
A: Use the ExcelVersion parameter in SaveToFile:
workbook.SaveToFile("output.xlsx", ExcelVersion.Version2016)
workbook.SaveToFile("output.xls", ExcelVersion.Version97to2003)
Q3: How does Spire.XLS handle different data types?
A: Spire.XLS provides specific properties for different data types:
- Use .Text for strings
- Use .NumberValue for numerical data
- Use .DateTimeValue for dates
- Use .BooleanValue for True/False values
Q4: Why clear default worksheets before adding new ones?
A: Spire.XLS for Python creates default sheets when you create a Workbook. Therefore, if you don't clear it with the Workbook.Worksheets.Clear(), your file will have extra empty sheets.
Q5: My data isn't showing correctly in Excel. What's wrong?
A: Check that you're using 1-based indexing and that your data types match the expected format. Also, verify that you're saving the file before disposing of the workbook.

CSV (Comma-Separated Values) is one of the most widely used formats for data exchange between applications, databases, and programming languages. For Python developers, the need to convert Python lists to CSV format arises constantly - whether exporting application data, generating reports, or preparing datasets for analysis. Spire.XLS for Python streamlines this critical process with an intuitive, reliable approach that eliminates common conversion pitfalls.
This comprehensive guide will explore how to write lists to CSV in Python. You'll discover how to handle everything from simple one-dimensional lists to complex nested dictionaries, while maintaining data integrity and achieving professional-grade output.
Table of Contents:
- Getting Started with Spire.XLS for Python
- Convert 1D List to CSV in Python
- Convert 2D List to CSV in Python
- Convert List of Dictionaries to CSV in Python
- Advanced: Custom Delimiters and Encoding
- Conclusion
- FAQs
Getting Started with Spire.XLS for Python
Why Use Spire.XLS for List-to-CSV Conversion?
While Python's built-in csv module is excellent for simple CSV operations, Spire.XLS offers additional benefits:
- Handles various data types seamlessly
- Lets you customize CSV output (e.g., semicolon delimiters for European locales).
- Can save in multiple file formats (CSV, XLSX, XLS, etc.)
- Works well with both simple and complex data structures
Install via pip
The Spire.XLS for Python lets you create, modify, and save Excel/CSV files programmatically. To use it, run this command in your terminal or command prompt:
pip install Spire.XLS
This command downloads and installs the latest version, enabling you to start coding immediately.
Convert 1D List to CSV in Python
A 1D (one-dimensional) list is a simple sequence of values (e.g., ["Apple", "Banana", "Cherry"]). The following are the steps to write these values to a single row or column in a CSV.
Step 1: Import Spire.XLS Modules
First, import the necessary classes from Spire.XLS:
from spire.xls import *
from spire.xls.common import *
Step 2: Create a Workbook and Worksheet
Spire.XLS uses workbooks and worksheets to organize data. We’ll create a new workbook and add a new worksheet:
# Create a workbook instance
workbook = Workbook()
# Remove the default worksheet and add a new one
workbook.Worksheets.Clear()
worksheet = workbook.Worksheets.Add()
Step 3: Write 1D List Data to the Worksheet
Choose to write the list to a single row (horizontal) or a single column (vertical).
Example 1: Write 1D List to a Single Row
# Sample 1D list
data_list = ["Apple", "Banana", "Orange", "Grapes", "Mango"]
# Write list to row 1
for i, item in enumerate(data_list):
worksheet.Range[1, i+1].Value = item
Example 2: Write 1D List to a Single Column
# Sample 1D list
data_list = ["Apple", "Banana", "Orange", "Grapes", "Mango"]
# Write list to column 1
for i, item in enumerate(data_list):
worksheet.Range[i + 1, 1].Value = item
Step 4: Save the Worksheet as CSV
Use SaveToFile() to export the workbook to a CSV file. Specify FileFormat.CSV to ensure proper formatting:
# Save as CSV file
workbook.SaveToFile("ListToCSV.csv", FileFormat.CSV)
# Close the workbook to free resources
workbook.Dispose()
Output:

Convert 2D List to CSV in Python
A 2D (two-dimensional) list is a list of lists that represents tabular data. More commonly, you'll work with this type of list, where each inner list represents a row in the CSV file.
Python Code for 2D List to CSV:
from spire.xls import *
from spire.xls.common import *
# Create a workbook instance
workbook = Workbook()
# Remove the default worksheet and add a new one
workbook.Worksheets.Clear()
worksheet = workbook.Worksheets.Add()
# Sample 2D list (headers + data)
data = [
["Name", "Age", "City", "Salary"],
["John Doe", 30, "New York", 50000],
["Jane Smith", 25, "Los Angeles", 45000],
["Bob Johnson", 35, "Chicago", 60000],
["Alice Brown", 28, "Houston", 52000]
]
# Write 2D list to worksheet
for row_index, row_data in enumerate(data):
for col_index, cell_data in enumerate(row_data):
worksheet.Range[row_index + 1, col_index + 1].Value = str(cell_data)
# Save as a CSV file
workbook.SaveToFile("2DListToCSV.csv", FileFormat.CSV)
workbook.Dispose()
Key points:
- Ideal for structured tabular data with headers
- Nested loops handle both rows and columns
- Converting all values to strings ensures compatibility
Output:

The generated CSV can be converted to PDF for secure presentation, or converted to JSON for web/API data exchange.
Convert List of Dictionaries to CSV in Python
Lists of dictionaries are ideal when data has named fields (e.g., [{"Name": "Alice", "Age": 30}, {"Name": "Bob", "Age": 25}]). The dictionary keys become CSV headers, and values become rows.
Python Code for List of Dictionaries to CSV
from spire.xls import *
from spire.xls.common import *
# Create a workbook instance
workbook = Workbook()
# Remove the default worksheet and add a new one
workbook.Worksheets.Clear()
worksheet = workbook.Worksheets.Add()
# Sample 2D list (headers + data)
customer_list = [
{"CustomerID": 101, "Name": "Emma Wilson", "Email": "emma@example.com"},
{"CustomerID": 102, "Name": "Liam Brown", "Email": "liam@example.com"},
{"CustomerID": 103, "Name": "Olivia Taylor", "Email": "olivia@example.com"}
]
# Extract headers (dictionary keys) and write to row 1
if customer_list: # Ensure the list is not empty
headers = list(customer_list[0].keys())
# Write headers
for col_index, header in enumerate(headers):
worksheet.Range[1, col_index + 1].Value = str(header)
# Write dictionary values to rows 2 onwards
for row_index, record in enumerate(customer_list):
for col_index, header in enumerate(headers):
# Safely get value, use empty string if key doesn't exist
value = record.get(header, "")
worksheet.Range[row_index + 2, col_index + 1].Value = str(value)
# Save as CSV file
workbook.SaveToFile("Customer_Data.csv", FileFormat.CSV)
workbook.Dispose()
Key points:
- Extracts headers from the first dictionary's keys
- Uses .get() method to safely handle missing keys
- Maintains column order based on the header row
Output:

Advanced: Custom Delimiters and Encoding
One of the biggest advantages of using Spire.XLS for Python is its flexibility in saving CSV files with custom delimiters and encodings. This allows you to tailor your CSV output for different regions, applications, and data requirements.
To specify the delimiters and encoding, simply change the corresponding parameter in the SaveToFile() method of the Worksheet class. Example:
# Save with different delimiters and encodings
worksheet.SaveToFile("semicolon_delimited.csv", ";", Encoding.get_UTF8())
worksheet.SaveToFile("tab_delimited.csv", "\t", Encoding.get_UTF8())
worksheet.SaveToFile("unicode_encoded.csv", ",", Encoding.get_Unicode())
Conclusion
Converting Python lists to CSV is straightforward with the right approach. Whether you're working with simple 1D lists, structured 2D arrays, or more complex lists of dictionaries, Spire.XLS provides a robust solution. By choosing the appropriate method for your data structure, you can ensure efficient and accurate CSV generation in any application.
For more advanced features and detailed documentation, you can visit the official Spire.XLS for Python documentation.
Frequently Asked Questions (FAQs)
Q1: What are the best practices for list to CSV conversion?
- Validate input data before processing
- Handle exceptions with try-catch blocks
- Test with sample data before processing large datasets
- Clean up resources using Dispose()
Q2: Can I export multiple lists into separate CSV files in one go?
Yes. Loop through your lists and save each as a separate CSV:
lists = {
"fruits": ["Apple", "Banana", "Cherry"],
"scores": [85, 92, 78]
}
for name, data in lists.items():
wb = Workbook()
wb.Worksheets.Clear()
ws = wb.Worksheets.Add(name)
for i, val in enumerate(data):
ws.Range[i + 1, 1].Value = str(val)
wb.SaveToFile(f"{name}.csv", FileFormat.CSV)
wb.Dispose()
Q3: How to format numbers (e.g., currency, decimals) in CSV?
CSV stores numbers as plain text, so formatting must be applied before saving:
ws.Range["A1:A10"].NumberFormat = "$#,##0.00"
This ensures numbers appear as $1,234.56 in the CSV. For more number formatting options, refer to: Set the Number Format in Python
Q4: Does Spire.XLS for Python work on all operating systems?
Yes! Spire.XLS for Python is cross-platform and supports Windows, macOS, and Linux systems.

When it comes to working with web content and documents, the ability to parse HTML in Python is an essential skill for developers across various domains. HTML parsing involves extracting meaningful information from HTML documents, manipulating content, and processing web data efficiently. Whether you're working on web scraping projects, data extraction tasks, content analysis, or document processing, mastering HTML parsing techniques in Python can significantly enhance your productivity and capabilities.
In this guide, we'll explore how to effectively read HTML in Python using Spire.Doc for Python. You'll learn practical techniques for processing HTML content from strings, local files, and URLs, and implementing best practices for HTML parsing in your projects.
- Why Parse HTML in Python?
- Getting Started: Install HTML Parser in Python
- How Spire.Doc Parses HTML: Core Concepts
- Best Practices for Effective HTML Parsing
- Conclusion
Why Parse HTML in Python?
HTML (HyperText Markup Language) is the backbone of the web, used to structure and present content on websites. Parsing HTML enables you to:
- Extract specific data (text, images, tables, hyperlinks) from web pages or local files.
- Analyze content structure for trends, keywords, or patterns.
- Automate data collection for research, reporting, or content management.
- Clean and process messy HTML into structured data.
While libraries like BeautifulSoup excel at lightweight parsing, Spire.Doc for Python shines when you need to integrate HTML parsing with document creation or conversion. It offers a robust framework to parse and interact with HTML content as a structured document object model (DOM).
Getting Started: Install HTML Parser in Python
Before diving into parsing, you’ll need to install Spire.Doc for Python. The library is available via PyPI, making installation straightforward:
pip install Spire.Doc
This command installs the latest version of the library, along with its dependencies. Once installed, you’re ready to start parsing HTML.
How Spire.Doc Parses HTML: Core Concepts
At its core, Spire.Doc parses HTML by translating HTML’s tag-based structure into a hierarchical document model. This model is composed of objects that represent sections, paragraphs, and other elements, mirroring the original HTML’s organization. Let’s explore how this works in practice.
1. Parsing HTML Strings in Python
If you have a small HTML snippet (e.g., from an API response or user input), parse it directly from a string. This is great for testing or working with short, static HTML.
from spire.doc import *
from spire.doc.common import *
# Define HTML content as a string
html_string = """
<html>
<head>
<title>Sample HTML</title>
</head>
<body>
<h1>Main Heading</h1>
<p>This is a paragraph with <strong>bold text</strong>.</p>
<div>
<p>A nested paragraph inside a div.</p>
</div>
<ul>
<li>List item 1</li>
<li>List item 2</li>
<li>List item 3</li>
</ul>
</body>
</html>
"""
# Initialize a new Document object
doc = Document()
# Add a section and paragraph to the document
section = doc.AddSection()
paragraph = section.AddParagraph()
# Load HTML content from the string
paragraph.AppendHTML(html_string)
print("Parsed HTML Text:")
print("-----------------------------")
# Extract text content from the parsed HTML
parsed_text = doc.GetText()
# Print the result
print(parsed_text)
# Close the document
doc.Close()
How It Works:
- HTML String: We define a sample HTML snippet with common elements (headings, paragraphs, lists).
- Document Setup: Spire.Doc uses a Word-like structure (sections → paragraphs) to organize parsed HTML.
- Parse HTML:
AppendHTML()converts the string into structured Word elements (e.g.,<h1>becomes a "Heading 1" style,<ul>becomes a list). - Extract Text:
GetText()pulls clean, plain text from the parsed document (no HTML tags).
Output:

Spire.Doc supports exporting parsed HTML content to multiple formats such as TXT, Word via the SaveToFile() method.
2. Parsing HTML Files in Python
For local HTML files, Spire.Doc can load and parse them with a single method. This is useful for offline content (e.g., downloaded web pages, static reports).
from spire.doc import *
from spire.doc.common import *
# Define the path to your local HTML file
html_file_path = "example.html"
# Create a Document instance
doc = Document()
# Load and parse the HTML file
doc.LoadFromFile(html_file_path, FileFormat.Html)
# Analyze document structure
print(f"Document contains {doc.Sections.Count} section(s)")
print("-"*40)
# Process each section
for section_idx in range(doc.Sections.Count):
section = doc.Sections.get_Item(section_idx)
print(f"SECTION {section_idx + 1}")
print(f"Section has {section.Body.Paragraphs.Count} paragraph(s)")
print("-"*40)
# Traverse through paragraphs in the current section
for para_idx in range(section.Paragraphs.Count):
para = section.Paragraphs.get_Item(para_idx)
# Get paragraph style name and text content
style_name = para.StyleName
para_text = para.Text
# Print paragraph information if content exists
if para_text.strip():
print(f"[{style_name}] {para_text}\n")
# Add spacing between sections
print()
# Close the document
doc.Close()
Key Features:
- Load Local Files:
LoadFromFile()reads the HTML file and auto-parses it into a Word structure. - Structure Analysis: Check the number of sections/paragraphs and their styles (critical for auditing content).
- Style Filtering: Identify headings (e.g., "Heading 1") or lists (e.g., "List Paragraph") to organize content.
Output:

After loading the HTML file into the Document object, you can use Spire.Doc to extract specific elements like tables, hyperlinks from HTML.
3. Parsing a URL in Python
To parse HTML directly from a live web page, first fetch the HTML content from the URL using a library like requests, then pass the content to Spire.Doc for parsing. This is core for web scraping and real-time data extraction.
Install the Requests library via pip:
pip install requests
Python code to parse web page:
from spire.doc import *
from spire.doc.common import *
import requests
# Fetch html content from a URL
def fetch_html_from_url(url):
"""Fetch HTML from a URL and handle errors (e.g., 404, network issues)"""
# Mimic a browser with User-Agent (avoids being blocked by websites)
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
try:
response = requests.get(url, headers=headers)
response.raise_for_status() # Raise exception for HTTP errors
return response.text # Return raw HTML content
except requests.exceptions.RequestException as e:
raise Exception(f"Error fetching HTML: {str(e)}")
# Specify the target URL
url = "https://www.e-iceblue.com/privacypolicy.html"
print(f"Fetching HTML from: {url}")
# Get HTML content
html_content = fetch_html_from_url(url)
# Create document and insert HTML content into it
doc = Document()
section = doc.AddSection()
paragraph = section.AddParagraph()
paragraph.AppendHTML(html_content)
# Extract and display summary information
print("\nParsed Content Summary:")
print(f"Sections: {doc.Sections.Count}")
print("-------------------------------------------")
# Extract and display headings
print("Headings found:")
for para_idx in range(section.Paragraphs.Count):
para = section.Paragraphs.get_Item(para_idx)
if isinstance(para, Paragraph) and para.StyleName.startswith("Heading"):
print(f"- {para.Text.strip()}")
# Close the document
doc.Close()
Steps Explained:
- Use requests.get() to fetch the HTML content from the URL.
- Pass the raw HTML text to Spire.Doc for parsing.
- Extract specific content (e.g., headings) from live pages for SEO audits or content aggregation.
Output:

Best Practices for Effective HTML Parsing
To optimize your HTML parsing workflow with Spire.Doc, follow these best practices:
- Validate Input Sources: Before parsing, check that HTML content (strings or files) is accessible and not corrupted. This reduces parsing errors:
import os
html_file = "data.html"
if os.path.exists(html_file):
doc.LoadFromFile(html_file, FileFormat.Html)
else:
print(f"Error: File '{html_file}' not found.")
- Handle Exceptions: Wrap parsing operations in try-except blocks to catch catch errors (e.g., missing files, invalid HTML):
try:
doc.LoadFromFile("sample.html", FileFormat.Html)
except Exception as e:
print(f"Error loading HTML: {e}")
- Optimize for Large Files: For large HTML files, consider loading content in chunks or disabling non-essential parsing features to improve performance.
- Clean Extracted Data: Use Python’s string methods (e.g., strip(), replace()) to remove extra whitespace or unwanted characters from extracted text.
- Keep the Library Updated: Regularly update Spire.Doc with
pip install --upgrade Spire.Docto benefit from improved parsing logic and bug fixes.
Conclusion
Python makes HTML parsing accessible for all skill levels. Whether you’re working with HTML strings, local files, or remote URLs, the combination of Requests (for fetching) and Spire.Doc (for structuring) simplifies complex tasks like web scraping and content extraction.
By following the examples and best practices in this guide, you’ll turn unstructured HTML into actionable, organized data in minutes. To unlock the full potential of Spire.Doc for Python, you can request a 30-day trial license here.