Excel formatting with Python tutorial illustration

When working with spreadsheets, readability is just as important as the data itself. A well-formatted Excel file makes it easier to analyze, present, and share information. Instead of manually adjusting styles in Excel, you can use Python for Excel formatting to automate the process and save significant time.

This tutorial shows you how to format Excel with Python using the library Spire.XLS for Python. We’ll cover basic styling, advanced formatting, and practical use cases, while also explaining the key classes and properties that make Excel formatting in Python efficient.

Here's What's Covered:


Why Use Python for Excel Formatting

Formatting Excel manually is time-consuming, especially when handling large datasets or generating reports dynamically. By using Python Excel formatting, you can:

  • Apply consistent formatting to multiple workbooks.
  • Automate repetitive tasks like setting fonts, borders, and colors.
  • Generate styled reports programmatically for business or research.
  • Save time while improving accuracy and presentation quality.

With Python, you can quickly build scripts that apply professional-looking styles to your spreadsheets. Next, let’s see how to set up the environment.


Setting Up the Environment and Project

To follow this tutorial, you need to install Spire.XLS for Python, a library designed for working with Excel files. It supports creating, reading, modifying, and formatting Excel documents programmatically.

Install Spire.XLS for Python

Install the library via pip:

pip install Spire.XLS

Then import it in your Python script:

from spire.xls import *

Creating or Loading an Excel Workbook

Before we start formatting, we need a workbook to work with.

Create a new workbook:

workbook = Workbook()
sheet = workbook.Worksheets[0]

Or load an existing file:

workbook = Workbook()
workbook.LoadFromFile("input.xlsx")
sheet = workbook.Worksheets[0]

After applying formatting, save the result:

workbook.SaveToFile("output/formatted_output.xlsx", ExcelVersion.Version2016)

With the workbook ready, let’s move on to formatting examples.


Basic Excel Formatting in Python

Before diving into advanced operations, it’s important to master the fundamental Excel formatting features in Python. These basic techniques—such as fonts, alignment, borders, background colors, and adjusting column widths or row heights—are the building blocks of clear, professional spreadsheets. Once familiar with them, you can combine and extend these methods to create more complex styles later.

1. Formatting Fonts

Changing font properties is one of the most frequent tasks when working with styled Excel sheets. In Spire.XLS for Python, font settings are accessed through the CellRange.Style.Font object, which lets you control the typeface, size, color, and emphasis (bold, italic, underline).

cell = sheet.Range[2, 2]
cell.Text = "Python Excel Formatting"

cell.Style.Font.FontName = "Arial"
cell.Style.Font.Size = 14
cell.Style.Font.Color = Color.get_Blue()
cell.Style.Font.IsBold = True

This modifies the text appearance directly within the cell by adjusting its style attributes.

2. Alignment and Wrapping

Cell alignment is managed through the HorizontalAlignment and VerticalAlignment properties of the Style object. In addition, the WrapText property ensures that longer text fits within a cell without overflowing.

cell = sheet.Range[4, 2]
cell.Text = "This text is centered and wrapped."

cell.Style.HorizontalAlignment = HorizontalAlignType.Center
cell.Style.VerticalAlignment = VerticalAlignType.Center
cell.Style.WrapText = True

This produces neatly centered text that remains readable even when it spans multiple lines.

3. Adding Borders

Borders are defined through the Borders collection on the Style object, where you can set the line style and color for each edge individually.

cell = sheet.Range[6, 2]
cell.Text = "Border Example"

cell.Style.Borders[BordersLineType.EdgeBottom].LineStyle = LineStyleType.Thin
cell.Style.Borders[BordersLineType.EdgeBottom].Color = Color.get_Black()

This adds a thin black line to separate the cell from the content below.

4. Background Colors

Cell background is controlled by the Style.Color property. This is often used to highlight headers or important values.

cell = sheet.Range[8, 2]
cell.Text = "Highlighted Cell"
cell.Style.Color = Color.get_Yellow()

This makes the cell stand out in the worksheet.

5. Setting Column Widths and Row Heights

Besides styling text and borders, adjusting the column widths and row heights ensures that your content fits properly without overlapping or leaving excessive blank space.

# Set specific column width
sheet.Columns[1].ColumnWidth = 20
# Set specific row height
sheet.Rows[7].RowHeight = 20

In addition to specifying widths and heights, Excel rows and columns can also be automatically adjusted to fit their content:

  • Use Worksheet.AutoFitColumn(columnIndex) or Worksheet.AutoFitRow(rowIndex) to adjust a specific column or row.
  • Use CellRange.AutoFitColumns() or CellRange.AutoFitRows() to adjust all columns or rows in a selected range.

This helps create well-structured spreadsheets where data is both readable and neatly aligned.

Preview of Basic Formatting Result

Here’s what the basic Excel formatting looks like in practice:

Basic Excel formatting with fonts, alignment, borders, background colors, and adjusted column widths

In addition to these visual styles, you can also customize how numbers, dates, or currency values are displayed—see How to Set Excel Number Format in Python for details.


Extended Excel Formatting in Python

Instead of formatting individual cells, you can also work with ranges and reusable styles. These operations typically involve the CellRange object for merging and sizing, and the Workbook.Styles collection for creating reusable Style definitions.

1. Merging Cells

Merging cells is often used to create report titles or section headers that span multiple columns.

range = sheet.Range[2, 2, 2, 4]
range.Merge()
range.Text = "Quarterly Report"
range.Style.HorizontalAlignment = HorizontalAlignType.Center
range.RowHeight = 30

Here, three columns (B2:D2) are merged into one, styled as a bold, centered title with a shaded background—perfect for highlighting key sections in a report.

2. Applying Built-in Styles

Excel comes with a set of predefined styles that can be applied quickly. In Spire.XLS, you can directly assign these built-in styles to ranges.

range.BuiltInStyle = BuiltInStyles.Heading1

This instantly applies Excel’s built-in Heading 1 style, giving the range a consistent and professional appearance without manually setting fonts, borders, or colors.

3. Creating a Custom Style and Applying It

Creating a custom style is useful when you need to apply the same formatting rules across multiple cells or sheets. The Workbook.Styles.Add() method allows you to define a named style that can be reused throughout the workbook.

# Create a custom style
custom_style = workbook.Styles.Add("CustomStyle")
custom_style.Font.FontName = "Calibri"
custom_style.Font.Size = 12
custom_style.Font.Color = Color.get_DarkGreen()
custom_style.Borders[BordersLineType.EdgeTop].LineStyle = LineStyleType.MediumDashDot

# Apply style to a cell
cell = sheet.Range[4, 2]
cell.Text = "Custom Style Applied"
cell.Style = custom_style

This method makes Excel style management in Python more efficient, especially when working with large datasets.

Preview of Advanced Formatting Result

Below is an example showing the results of merged cells, built-in styles, and a custom style applied:

Advanced Excel formatting with merged cells, built-in heading style, and custom reusable style

For highlighting data patterns and trends, you can also use conditional rules—see How to Apply Conditional Formatting in Python Excel.


Key APIs for Excel Styling in Python

After exploring both basic and advanced formatting examples, it’s helpful to step back and look at the core classes, properties, and methods that make Excel styling in Python possible. Understanding these elements will give you the foundation to write scripts that are flexible, reusable, and easier to maintain.

The following table summarizes the most important classes, properties, and methods for formatting Excel with Python. You can use it as a quick reference whenever you need to recall the key styling options.

Class / Property / Method Description
Workbook / Worksheet Represent Excel files and individual sheets.
Workbook.LoadFromFile() / SaveToFile() Load and save Excel files.
Workbook.Styles.Add() Create and define a custom reusable style that can be applied across cells.
CellRange Represents one or more cells; used for applying styles or formatting.
CellRange.Style Represents formatting information (font, alignment, borders, background, text wrapping, etc.).
CellRange.Merge() Merge multiple cells into a single cell.
CellRange.BuiltInStyle Apply one of Excel’s predefined built-in styles (e.g., Heading1).
CellRange.BorderAround() / BorderInside() Apply border formatting to the outside or inside of a range.
CellRange.ColumnWidth / RowHeight Adjust the width of columns and the height of rows.
CellRange.NumberFormat Defines the display format for numbers, dates, or currency.

With these building blocks, you can handle common formatting tasks—such as fonts, borders, alignment, colors, and number formats—in a structured and highly customizable way. This ensures that your spreadsheets look professional, remain easy to manage, and can be consistently styled across different Python automation projects. For more details, see the Spire.XLS for Python API reference.


Real-World Use Case: Formatting an Annual Excel Sales Report with Python

Now that we’ve explored both basic and advanced Excel formatting techniques, let’s apply them in a real-world scenario. The following example demonstrates how to generate a comprehensive Excel annual sales report with Python. By combining structured data, regional breakdowns, and advanced formatting, the report becomes much easier to interpret and presents business data in a clear and professional way.

from spire.xls import *

workbook = Workbook()
sheet = workbook.Worksheets[0]
sheet.Name = "Sales Report"

# Report title
title = sheet.Range[1, 1, 1, 7]
title.Merge()
title.Text = "Annual Sales Report - 2024"
title.Style.Font.IsBold = True
title.Style.Font.Size = 16
title.Style.HorizontalAlignment = HorizontalAlignType.Center
title.Style.Color = Color.get_LightGray()
title.RowHeight = 30

# Data
data = [
    ["Product", "Region", "Q1", "Q2", "Q3", "Q4", "Total"],
    ["Laptop", "North", 1200, 1500, 1300, 1600, 5600],
    ["Laptop", "South", 1000, 1200, 1100, 1300, 4600],
    ["Tablet", "North", 800, 950, 1000, 1200, 3950],
    ["Tablet", "South", 700, 850, 900, 1000, 3450],
    ["Phone", "North", 2000, 2200, 2100, 2500, 8800],
    ["Phone", "South", 1800, 1900, 2000, 2200, 7900],
    ["Accessories", "All", 600, 750, 720, 900, 2970],
    ["", "", "", "", "", "Grand Total", 39370]
]

for r in range(len(data)):
    for c in range(len(data[r])):
        sheet.Range[r+2, c+1].Text = str(data[r][c])

# Header formatting
header = sheet.Range[2, 1, 2, 7]
header.Style.Font.IsBold = True
header.Style.Color = Color.get_LightBlue()
header.Style.HorizontalAlignment = HorizontalAlignType.Center
header.Style.Borders[BordersLineType.EdgeBottom].LineStyle = LineStyleType.Thin

# Numeric columns as currency
for row in range(3, 10):
    for col in range(3, 8):
        cell = sheet.Range[row, col]
        if cell.Text.isdigit():
            cell.NumberValue = float(cell.Text)
            cell.NumberFormat = "$#,##0"

# Highlight totals
grand_total = sheet.Range[10, 7]
grand_total.Style.Color = Color.get_LightYellow()
grand_total.Style.Font.IsBold = True

# Freeze the first row and the first two columns
sheet.FreezePanes(2, 3)

# Auto fit columns
sheet.AllocatedRange.AutoFitColumns()

workbook.SaveToFile("output/annual_sales_report.xlsx", ExcelVersion.Version2016)

This script combines several formatting techniques—such as a merged and styled title, bold headers with borders, currency formatting, highlighted totals, and frozen panes—into a single workflow. Applying these styles programmatically improves readability and ensures consistency across every report, which is especially valuable for business and financial data.

Here’s the final styled annual sales report generated with Python:

Annual Excel sales report with a styled title, bold headers, regional breakdown, currency formatting, highlighted grand total, and frozen panes


Conclusion

Formatting Excel with Python is a practical way to automate report generation and ensure professional presentation of data. By combining basic styling with advanced techniques like custom styles and column adjustments, you can create clear, consistent, and polished spreadsheets.

Whether you are working on financial reports, research data, or business dashboards, formatting Excel with Python helps you save time while maintaining presentation quality. With the right use of styles, properties, and formatting options, your spreadsheets will not only contain valuable data but also deliver it in a visually effective way.

You can apply for a free temporary license to unlock the full functionality of Spire.XLS for Python or try Free Spire.XLS for Python to get started quickly.


FAQ - Excel Fpormatting in Python

Q1: Can you format Excel with Python?

Yes. Python provides libraries that allow you to apply fonts, colors, borders, alignment, conditional formatting, and more to Excel files programmatically.

Q2: How to do formatting in Python?

For Excel formatting, you can use a library such as Spire.XLS for Python. It lets you change fonts, set alignment, adjust column widths, merge cells, and apply custom or built-in styles through code.

Q3: Can I use Python to edit Excel?

Yes. Python can not only format but also create, read, modify, and save Excel files, making it useful for dynamic reporting and data automation.

Q4: What’s the best way to automate repeated Excel styling tasks?

Define custom styles or reusable functions that standardize formatting rules across multiple workbooks and sheets. This ensures consistency and saves time.

Python Guide to Export Flat and Nested JSON to CSV

JSON (JavaScript Object Notation) is the most widely used format for structured data in web applications, APIs, and configuration files. CSV (Comma-Separated Values) is a popular format for data analysis, reporting, and seamless integration with Excel. Converting JSON to CSV is a common requirement for developers, data engineers, and analysts who need to transform hierarchical JSON data into a tabular format for easier processing, visualization, and automation.

This step-by-step guide demonstrates how to convert JSON to CSV using Python, including techniques for handling flat and nested JSON data, and optimizing performance for large JSON datasets. By following this tutorial, you can streamline your data workflows and efficiently integrate JSON data into Excel or other tabular systems.

What You Will Learn

Why Convert JSON to CSV?

Although JSON is ideal for hierarchical and complex data, CSV offers several practical advantages:

  • Spreadsheet Compatibility: CSV files can be opened in Excel, Google Sheets, and other BI tools.
  • Ease of Analysis: Tabular data is easier to filter, sort, summarize, and visualize.
  • Data Pipeline Integration: Many ETL workflows and reporting systems rely on CSV for seamless data integration.

Real-World Use Cases:

  • API Data Extraction: Transform JSON responses from web APIs into CSV for analysis or reporting.
  • Reporting Automation: Convert application or system logs into CSV for automated dashboards or scheduled reports.
  • Data Analytics: Prepare hierarchical JSON data for Excel, Google Sheets, or BI tools like Tableau to perform pivot tables and visualizations.
  • ETL Pipelines: Flatten and export JSON from databases or log files into CSV for batch processing or integration into data warehouses.

Converting JSON files to CSV format bridges the gap between structured storage and table-based analytics, making it a common requirement in reporting, data migration, and analytics pipelines.

Understanding JSON Structures

Before implementing JSON to CSV conversion in Python, it is important to understand the two common structures of JSON data:

  • Flat JSON: Each object contains simple key-value pairs.
[
    {"name": "Alice", "age": 28, "city": "New York"},
    {"name": "Bob", "age": 34, "city": "Los Angeles"}
]
  • Nested JSON: Objects contain nested dictionaries or arrays.
[
    {
        "name": "Alice",
        "age": 28,
        "contacts": {"email": "alice@example.com", "phone": "123-456"}
    }
]

Flat JSON can be directly mapped to CSV columns, while nested JSON requires flattening to ensure proper tabular representation.

Python JSON to CSV Converter - Installation

To export JSON to CSV in Python, you can use Spire.XLS for Python, a spreadsheet library that enables Python developers to create, read, manipulate, and export spreadsheet files directly from Python. It supports common formats like .xls, .xlsx, .csv, and .ods.

Installation

You can install the library from PyPI via pip:

pip install spire.xls

If you need assistance with the installation, refer to this tutorial: How to Install Spire.XLS for Python on Windows.

Once installed, you can import it into your Python scripts:

from spire.xls import *

This setup allows seamless JSON-to-CSV conversion with complete control over the workbook structure and output format.

Step-by-Step JSON to CSV Conversion in Python

Converting JSON to CSV involves four main steps: loading JSON data, creating a workbook, writing headers and rows, and exporting the final file as CSV. Below, we’ll go through the process for both flat JSON and nested JSON.

Handling Flat JSON Data

Step 1: Load JSON and Create Workbook

First, load your JSON file into Python and create a new workbook where the data will be written.

import json
from spire.xls import *

# Load JSON data
with open('data.json') as f:
    data = json.load(f)

# Create workbook and worksheet
workbook = Workbook()
sheet = workbook.Worksheets[0]

Explanation:

  • json.load parses your JSON file into Python objects (lists and dictionaries).
  • Workbook() creates a new Excel workbook in memory.
  • workbook.Worksheets[0] accesses the first worksheet where data will be written.

Step 2: Write Headers Dynamically

Next, generate column headers from the JSON keys. This ensures that the CSV reflects the structure of your data.

# Extract headers from JSON keys
headers = list(data[0].keys())

# Write headers to the first row
for col, header in enumerate(headers, start=1):
    sheet.Range[1, col].Value = header

Explanation:

  • list(data[0].keys()) retrieves all top-level JSON keys.
  • sheet.Range[1, col].Value writes headers in the first row, starting from column 1.

Step 3: Write Data Rows

After headers are set, populate the worksheet row by row with values from each JSON object.

# Populate values from each JSON object to the subsequent rows
for row_index, item in enumerate(data, start=2):
    for col_index, key in enumerate(headers, start=1):
        value = item.get(key, "")
        sheet.Range[row_index, col_index].Value = str(value) if value is not None else ""

Explanation:

  • Loop starts from row 2 because row 1 is reserved for headers.
  • Each JSON object is mapped to a row, and each key is mapped to a column.

Step 4: Save the Worksheet as CSV

Finally, save the worksheet as a CSV file and clean up resources.

# Save the worksheet as a CSV file
sheet.SaveToFile("output.csv", ",", Encoding.get_UTF8())
workbook.Dispose()

Resulting CSV (output.csv):

Flat JSON to CSV conversion output in Python

This example uses Python’s built-in json module to parse JSON data. For more details on its functions and usage, refer to the Python json module documentation.

If you also want to implement JSON to Excel conversion, see our guide on converting JSON to/from Excel in Python.

Handling Nested JSON with Dictionaries and Arrays

When JSON objects contain nested dictionaries or arrays, direct CSV export is not possible because CSV is a flat format. To ensure compatibility, nested fields must be flattened to preserve all data in a readable CSV format.

Suppose you have the following JSON file (nested.json) with nested data:

[
    {
        "Name": "Alice",
        "Age": 28,
        "City": "New York",
        "Contacts": {"Email": "alice@example.com", "Phone": "123-456"},
        "Skills": ["Python", "Excel", "SQL"]
    },
    {
        "Name": "Bob",
        "Age": 34,
        "City": "Los Angeles",
        "Contacts": {"Email": "bob@example.com", "Phone": "987-654"},
        "Skills": ["Java", "CSV", "JSON"]
    }
]

This JSON contains:

  • Flat fields: Name, Age, City
  • Nested dictionary: Contacts
  • Array: Skills

To export all fields to CSV, nested dictionaries need to be flattened, and arrays need to be joined into semicolon-separated strings. The following steps show you how to achieve this.

Step 1: Flatten Nested JSON Data

Flattening nested JSON with dictionaries and arrays involves converting nested dictionaries into dot-notation keys and joining array values into a single string. The following helper function performs this flattening:

import json
from spire.xls import *

# ---------------------------
# Step 1: Flatten nested JSON
# ---------------------------
def flatten_json(item):
    flat = {}
    for key, value in item.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list):
            flat[key] = "; ".join(map(str, value))
        else:
            flat[key] = value
    return flat

with open('nested.json') as f:
    data = json.load(f)

flat_data = [flatten_json(item) for item in data]

Explanation:

  • flatten_json() checks each field in the JSON object.
  • If the value is a dictionary, its keys are prefixed with the parent key (dot notation).
  • If the value is a list, all elements are joined into one string separated by semicolons.
  • Other values are kept as they are.
  • Finally, the original nested JSON is transformed into flat_data, which can now be exported using the same workflow as flat JSON.

Step 2: Export Flattened Data to CSV

After flattening, the export process is the same as flat JSON: create a workbook, generate headers, write rows, and save as CSV.

# ------------------------------------------------
# Step 2: Export flattened data to CSV
# (create workbook, write headers, populate rows, save)
# ------------------------------------------------
workbook = Workbook()
sheet = workbook.Worksheets[0]

# Generate headers from flattened JSON keys
headers = list(flat_data[0].keys())
for col, header in enumerate(headers, start=1):
    sheet.Range[1, col].Value = header

# Populate rows from flat_data
for row_index, item in enumerate(flat_data, start=2):
    for col_index, key in enumerate(headers, start=1):
        value = item.get(key, "")
        sheet.Range[row_index, col_index].Value = str(value) if value is not None else ""

# Save the worksheet as CSV (comma delimiter, UTF-8)
sheet.SaveToFile("output.csv", ",", Encoding.get_UTF8())

# Clean up resources
workbook.Dispose()

Explanation:

  • Headers and rows are generated from flat_data (instead of the original data).
  • This ensures nested fields (Contacts.Email, Contacts.Phone) and arrays (Skills) are properly included in the CSV.

Resulting CSV (output.csv):

Nested JSON to CSV conversion output in Python

Performance Tips for Exporting Large or Complex JSON Files to CSV

When working with large or complex JSON files, applying a few practical strategies can help optimize memory usage, improve performance, and ensure your CSV output remains clean and accurate.

  • For very large datasets, process JSON in chunks to avoid memory issues.
  • When arrays are very long, consider splitting into multiple rows instead of joining with semicolons.
  • Maintain consistent JSON keys for cleaner CSV output.
  • Test the CSV output in Excel to verify formatting, especially for dates and numeric values.

Conclusion

Converting JSON to CSV in Python is essential for developers and data engineers. With Spire.XLS for Python, you can:

  • Convert flat and nested JSON into structured CSV.
  • Maintain consistent headers and data formatting.
  • Handle large datasets efficiently.
  • Export directly to CSV or Excel without additional dependencies.

By following this guide, you can seamlessly transform JSON into CSV for reporting, analytics, and integration.

FAQs

Q1: Can Spire.XLS for Python convert nested JSON objects to CSV?

A1: Yes. You can flatten nested JSON objects, including dictionaries and arrays, and export them as readable CSV columns with dynamic headers. This ensures all hierarchical data is preserved in a tabular format.

Q2: How do I install Spire.XLS for Python?

A2: You can install it via pip with the command:

pip install spire.xls

Then import it into your Python script using:

from spire.xls import

This setup enables seamless JSON-to-CSV conversion in Python.

Q3: Can I join array elements from JSON into a single CSV cell?

A3: Yes. You can join array elements from JSON into a single CSV cell using a delimiter like ;. This keeps multiple values readable and consistent in your exported CSV file.

Q4: How can I optimize performance when converting large JSON files to CSV in Python?

A4: To handle large JSON files efficiently: flatten nested JSON before writing, process data row-by-row, and export in batches. This minimizes memory usage and ensures smooth CSV generation.

Q5: Can I customize the CSV delimiter or column order when exporting JSON to CSV?

A5: Yes. Spire.XLS allows you to set custom delimiters (e.g., comma, semicolon, or tab) and manually define the column order, giving you full control over professional CSV output.

Python Guide to Export HTML to Markdown

Converting HTML to Markdown using Python is a common task for developers managing web content, documentation, or API data. While HTML provides powerful formatting and structure, it can be verbose and harder to maintain for tasks like technical writing or static site generation. Markdown, by contrast, is lightweight, human-readable, and compatible with platforms such as GitHub, GitLab, Jekyll, and Hugo.

Automating HTML to Markdown conversion with Python streamlines workflows, reduces errors, and ensures consistent output. This guide covers everything from converting HTML files and strings to batch processing multiple files, along with best practices to ensure accurate Markdown results.

What You Will Learn

Why Convert HTML to Markdown?

Before diving into the code, let’s look at why developers often prefer Markdown over raw HTML in many workflows:

  • Simplicity and Readability
    Markdown is easier to read and edit than verbose HTML tags.
  • Portability Across Tools
    Markdown is supported by GitHub, GitLab, Bitbucket, Obsidian, Notion, and static site generators like Hugo and Jekyll.
  • Better for Version Control
    Being plain text, Markdown makes it easier to track changes with Git, review diffs, and collaborate.
  • Faster Content Creation
    Writing Markdown is quicker than remembering HTML tag structures.
  • Integration with Static Site Generators
    Popular frameworks rely on Markdown as the main content format. Converting from HTML ensures smooth migration.
  • Cleaner Documentation Workflows
    Many documentation systems and wikis use Markdown as their primary format.

In short, converting HTML to Markdown improves maintainability, reduces clutter, and fits seamlessly into modern developer workflows.

Install HTML to Markdown Library for Python

Before converting HTML content to Markdown in Python, you’ll need a library that can handle both formats effectively. Spire.Doc for Python is a reliable choice that allows you to transform HTML files or HTML strings into Markdown while keeping headings, lists, images, and links intact.

You can install it from PyPI using pip:

pip install spire.doc

Once installed, you can automate the HTML to Markdown conversion in your Python scripts. The same library also supports broader scenarios. For example, when you need editable documents, you can rely on its HTML to Word conversion feature to transform web pages into Word files. And for distribution or archiving, HTML to PDF conversion is especially useful for generating standardized, platform-independent documents.

Convert an HTML File to Markdown in Python

One of the most common use cases is converting an existing .html file into a .md file. This is especially useful when migrating old websites, technical documentation, or blog posts into Markdown-based workflows, such as static site generators (Jekyll, Hugo) or Git-based documentation platforms (GitHub, GitLab, Read the Docs).

Steps

  • Create a new Document instance.
  • Load the .html file into the document using LoadFromFile().
  • Save the document as a .md file using SaveToFile() with FileFormat.Markdown.
  • Close the document to release resources.

Code Example

from spire.doc import *

# Create a Document instance
doc = Document()

# Load an existing HTML file
doc.LoadFromFile("input.html", FileFormat.Html)

# Save as Markdown file
doc.SaveToFile("output.md", FileFormat.Markdown)

# Close the document
doc.Close()

This converts input.html into output.md, preserving structural elements such as headings, paragraphs, lists, links, and images.

Python Example to Convert HTML File to Markdown

If you’re also interested in the reverse process, check out our guide on converting Markdown to HTML in Python.

Convert an HTML String to Markdown in Python

Sometimes, HTML content is not stored in a file but is dynamically generated—for example, when retrieving web content from an API or scraping. In these scenarios, you can convert directly from a string without needing to create a temporary HTML file.

Steps

  • Create a new Document instance.
  • Add a Section to the document.
  • Add a Paragraph to the section.
  • Append the HTML string to the paragraph using AppendHTML().
  • Save the document as a Markdown file using SaveToFile().
  • Close the document to release resources.

Code Example

from spire.doc import *

# Sample HTML string
html_content = """
<h1>Welcome</h1>
<p>This is a <strong>sample</strong> paragraph with <em>emphasis</em>.</p>
<ul>
  <li>First item</li>
  <li>Second item</li>
</ul>
"""

# Create a Document instance
doc = Document()

# Add a section
section = doc.AddSection()

# Add a paragraph and append the HTML string
paragraph = section.AddParagraph()
paragraph.AppendHTML(html_content)

# Save the document as Markdown
doc.SaveToFile("string_output.md", FileFormat.Markdown)

# close the document to release resources
doc.Close()

The resulting Markdown will look like this:

Python Example to Convert HTML String to Markdown

Batch Conversion of Multiple HTML Files

For larger projects, you may need to convert multiple .html files in bulk. A simple loop can automate the process.

import os
from spire.doc import *

# Define the folder containing HTML files to convert
input_folder = "html_files"

# Define the folder where converted Markdown files will be saved
output_folder = "markdown_files"

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Loop through all files in the input folder
for filename in os.listdir(input_folder):
    # Process only files with .html extension
    if filename.endswith(".html"):
        # Create a new Document object
        doc = Document()

        # Load the HTML file into the Document object
        doc.LoadFromFile(os.path.join(input_folder, filename), FileFormat.Html)

        # Generate the output file path by replacing .html with .md
        output_file = os.path.join(output_folder, filename.replace(".html", ".md"))

        # Save the Document as a Markdown file
        doc.SaveToFile(output_file, FileFormat.Markdown)

        # Close the Document to release resources
        doc.Close()

This script processes all .html files inside html_files/ and saves the Markdown results into markdown_files/.

Best Practices for HTML to Markdown Conversion

Turning HTML to Markdown makes content easier to read, manage, and version-control. To ensure accurate and clean conversion, follow these best practices:

  • Validate HTML Before Conversion
    Ensure your HTML is properly structured. Invalid tags can cause incomplete or broken Markdown output.
  • Understand Markdown Limitations
    Markdown does not support advanced CSS styling or custom HTML tags. Some formatting might get lost.
  • Choose File Encoding Carefully
    Always be aware of character encoding. Open and save your files with a specific encoding (like UTF-8) to prevent issues with special characters.
  • Batch Processing

If converting multiple files, create a robust script that includes error handling (try-except blocks), logging, and skips problematic files instead of halting the entire process.

Conclusion

Converting HTML to Markdown in Python is a valuable skill for developers handling documentation pipelines, migrating web content, or processing data from APIs. With Spire.Doc for Python, you can:

  • Convert individual HTML files into Markdown with ease.
  • Transform HTML strings directly into .md files.
  • Automate batch conversions to efficiently manage large projects.

By applying these methods, you can streamline your workflows and ensure your content remains clean, maintainable, and ready for modern publishing platforms.

FAQs

Q1: Can I convert Markdown back to HTML in Python?

A1: Yes, Spire.Doc supports the conversion of Markdown to HTML, allowing for seamless transitions between these formats.

Q2: Will the conversion preserve complex HTML elements like tables?

A2: While Spire.Doc effectively handles standard HTML elements, it's advisable to review complex layouts, such as tables and nested elements, to ensure accurate conversion results.

Q3: Can I automate batch conversion for multiple HTML files?

A3: Absolutely! You can automate batch conversion using scripts in Python, enabling efficient processing of multiple HTML files at once.

Q4: Is Spire.Doc free to use?

A4: Spire.Doc provides both free and commercial versions, giving developers the flexibility to access essential features at no cost or unlock advanced functionality with a license.

Python Guide to Convert Markdown to HTML

Markdown (.md) is widely used in web development, documentation, and technical writing. Its simple syntax makes content easy to write and read. However, web browsers do not render Markdown directly. Converting Markdown to HTML ensures your content is structured, readable, and compatible with web platforms.

In this step-by-step guide, you will learn how to efficiently convert Markdown (.md) files into HTML using Python and Spire.Doc for Python, complete with practical code examples, clear instructions, and best practices for both single-file and batch conversions.

Table of Contents

What is Markdown?

Markdown is a lightweight markup language designed for readability and ease of writing. Unlike HTML, which can be verbose and harder to write by hand, Markdown uses simple syntax to indicate headings, lists, links, images, and more.

Example Markdown:

# This is a Heading

This is a paragraph with \*\*bold text\*\* and \*italic text\*.

- Item 1

- Item 2

Even in its raw form, Markdown is easy to read, which makes it popular for documentation, blogging, README files, and technical writing.

For more on Markdown syntax, see the Markdown Guide.

Why Convert Markdown to HTML?

While Markdown is excellent for authoring content, web browsers cannot render it natively. Converting Markdown to HTML allows you to:

  • Publish content on websites – Most CMS platforms require HTML for web pages.
  • Enhance styling – HTML supports CSS and JavaScript for advanced formatting and interactivity.
  • Maintain compatibility – HTML is universally supported by browsers, ensuring content displays correctly everywhere.
  • Integrate with web frameworks – Frameworks like React, Vue, and Angular require HTML as the base for rendering components.

Introducing Spire.Doc for Python

Spire.Doc for Python is a robust library for handling multiple document formats. It supports reading and writing Word documents, Markdown files, and exporting content to HTML. The library allows developers to convert Markdown directly to HTML with minimal code, preserving proper formatting and structure.

In addition to HTML, Spire.Doc for Python also allows you to convert Markdown to Word in Python or convert Markdown to PDF in Python, making it particularly useful for developers who want a unified tool for handling Markdown across different output formats.

Benefits of Using Spire.Doc for Python for Markdown to HTML Conversion

  • Easy-to-use API – Simple, intuitive methods that reduce development effort.
  • Accurate formatting – Preserves all Markdown elements such as headings, lists, links, and emphasis in HTML.
  • No extra dependencies – Eliminates the need for manual parsing or third-party libraries.
  • Flexible usage – Supports both single-file conversion and automated batch processing.

Step-by-Step Guide: Converting Markdown to HTML in Python

Now that you understand the purpose and benefits of converting Markdown to HTML, let’s walk through a clear, step-by-step process to transform your Markdown files into structured, web-ready HTML.

Step 1: Install Spire.Doc for Python

First, ensure that Spire.Doc for Python is installed in your environment. You can install it directly from PyPI using pip:

pip install spire.doc

Step 2: Prepare Your Markdown File

Next, create a sample Markdown file that you want to convert. For example, example.md:

Example Markdown File

Step 3: Write the Python Script

Now, write a Python script that loads the Markdown file and converts it to HTML:

from spire.doc import *

# Create a Document object
doc = Document()

# Load the Markdown file
doc.LoadFromFile("example.md", FileFormat.Markdown)

# Save the document as HTML
doc.SaveToFile("example.html", FileFormat.Html)

# Close the document
doc.Close()

Explanation of the code:

  • Document() initializes a new document object.
  • LoadFromFile("example.md", FileFormat.Markdown) loads the Markdown file into memory.
  • SaveToFile("example.html", FileFormat.Html) converts the loaded content into HTML and saves it to disk.
  • doc.Close() ensures resources are released properly, which is particularly important when processing multiple files or running batch operations.

Step 4: Verify the HTML Output

Finally, open the generated example.html file in a web browser or HTML editor. Verify that the Markdown content has been correctly converted.

HTML File Converted from Markdown using Python

Automating Batch Conversion

You can convert multiple Markdown files in a folder automatically:

import os
from spire.doc import *

# Set the folder containing Markdown files
input_folder = "markdown_files"

# Set the folder where HTML files will be saved
output_folder = "html_files"

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Loop through all files in the input folder
for filename in os.listdir(input_folder):
    # Process only Markdown files
    if filename.endswith(".md"):
        # Create a new Document object for each file
        doc = Document()

        # Load the Markdown file into the Document object
        doc.LoadFromFile(os.path.join(input_folder, filename), FileFormat.Markdown)

        # Construct the output file path by replacing .md extension with .html
        output_file = os.path.join(output_folder, filename.replace(".md", ".html"))

        # Save the loaded Markdown content as HTML
        doc.SaveToFile(output_file, FileFormat.Html)

        # Close the document to release resources
        doc.Close()

This approach allows you to process multiple Markdown files efficiently and generate corresponding HTML files automatically.

Best Practices for Markdown to HTML Conversion

While the basic steps are enough to complete a Markdown-to-HTML conversion, following a few best practices will help you avoid common pitfalls, improve compatibility, and ensure your output is both clean and professional:

  • Use proper Markdown syntax – Ensure headings, lists, links, and emphasis are correctly written.
  • Use UTF-8 Encoding: Always save your Markdown files in UTF-8 encoding to avoid issues with special characters or non-English text.
  • Batch Processing: If you need to convert multiple files, wrap your script in a loop and process entire folders. This saves time and ensures consistent formatting across documents.
  • Enhance Styling: Remember that HTML gives you the flexibility to add CSS and JavaScript for custom layouts, responsive design, and interactivity—something not possible in raw Markdown.

Conclusion

Converting Markdown to HTML using Python with Spire.Doc is simple, reliable, and efficient. It preserves formatting, supports automation, and produces clean HTML output ready for web use. By following this guide, you can implement a smooth Markdown to HTML workflow for both single documents and batch operations.

FAQs

Q1: Can I convert multiple Markdown files to HTML in Python?

A1: Yes, you can automate batch conversions by iterating through Markdown files in a directory and applying the conversion logic to each file.

Q2: Will the HTML preserve all Markdown formatting?

A2: Yes, Spire.Doc effectively preserves all essential Markdown formatting, including headings, lists, bold and italic text, links, and more.

Q3: Is there a way to handle images in Markdown during conversion?

A3: Yes, Spire.Doc supports the conversion of images embedded in Markdown, ensuring they are included in the resulting HTML.

Q4: Do I need additional libraries besides Spire.Doc?

A4: No additional libraries are required. Spire.Doc for Python provides a comprehensive solution for converting Markdown to HTML without any external dependencies.

Q5: Can I use the generated HTML in web frameworks?

A5: Yes, the HTML produced is fully compatible with popular web frameworks such as React, Vue, and Angular, making integration seamless.

Python Convert HTML  Text Quickly and Easily

HTML (HyperText Markup Language) is a markup language used to create web pages, allowing developers to build rich and visually appealing layouts. However, HTML files often contain a large number of tags, which makes them difficult to read if you only need the main content. By using Python to convert HTML to text, this problem can be easily solved. Unlike raw HTML, the converted text file strips away all unnecessary markup, leaving only clean and readable content that is easier to store, analyze, or process further.

Install HTML to Text Converter in Python

To simplify the task, we recommend using Spire.Doc for Python. This Python Word library allows you to quickly remove HTML markup and extract clean plain text with ease. It not only works as an HTML-to-text converter, but also offers a wide range of features—covering almost everything you can do in Microsoft Word.

To install it, you can run the following pip command:

pip install spire.doc

Alternatively, you can download the Spire.Doc package and install it manually.

Python Convert HTML Files to Text in 3 Steps

After preparing the necessary tools, let's dive into today's main topic: how to convert HTML to plain text using Python. With the help of Spire.Doc, this task can be accomplished in just three simple steps: create a new document object, load the HTML file, and save it as a text file. It’s straightforward and efficient, even for beginners. Let’s take a closer look at how this process can be implemented in code!

Code Example – Converting an HTML File to a Text File:

from spire.doc import *
from spire.doc.common import *

# Open an html file
document = Document()
document.LoadFromFile("/input/htmlsample.html", FileFormat.Html, XHTMLValidationType.none)
# Save it as a Text document.
document.SaveToFile("/output/HtmlFileTotext.txt", FileFormat.Txt)

document.Close()

The following is a preview comparison between the source document (.html) and the output document (.txt):

Python Convert an HTML File to a Text Document

Note that if the HTML file contains tables, the output text file will only retain the values within the tables and cannot preserve the original table formatting. If you want to keep certain styles while removing markup, it is recommended to convert HTML to a Word document . This way, you can retain headings, tables, and other formatting, making the content easier to edit and use.

How to Convert an HTML String to Text in Python

Sometimes, we don’t need the entire content of a web page and only want to extract specific parts. In such cases, you can convert an HTML string directly to text. This approach allows you to precisely control the information you need without further editing. Using Python to convert an HTML string to a text file is also straightforward. Here’s a detailed step-by-step guide:

Steps to convert an HTML string to a text document using Spire.Doc:

  • Input the HTML string directly or read it from a local file.
  • Create a Document object and add sections and paragraphs.
  • Use Paragraph.AppendHTML() method to insert the HTML string into a paragraph.
  • Save the document as a .txt file using Document.SaveToFile() method.

The following code demonstrates how to convert an HTML string to a text file using Python:

from spire.doc import *
from spire.doc.common import *

#Get html string.
#with open(inputFile) as fp:
    #HTML = fp.read()

# Load HTML from string
html = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>HTML to Text Example</title>
  <style>
    body { font-family: Arial, sans-serif; margin: 20px; }
    header { background: #f4f4f4; padding: 10px; }
    nav a { margin: 0 10px; text-decoration: none; color: #333; }
    main { margin-top: 20px; }
  </style>
</head>
<body>
  <header>
    <h1>My Demo Page</h1>
    <nav>
      <a href="#">Home</a>
      <a href="#">About</a>
      <a href="#">Contact</a>
    </nav>
  </header>
  
  <main>
    <h2>Convert HTML to Text</h2>
    <p>This is a simple demo showing how HTML content can be displayed before converting it to plain text.</p>
  </main>
</body>
</html>
"""

# Create a new document
document = Document()
section = document.AddSection()
section.AddParagraph().AppendHTML(html)

# Save directly as TXT
document.SaveToFile("/output/HtmlStringTotext.txt", FileFormat.Txt)
document.Close()

Here's the preview of the converted .txt file: Python Convert an HTML String to a Text Document

The Conclusion

In today’s tutorial, we focused on how to use Python to convert HTML to a text file. With the help of Spire.Doc, you can handle both HTML files and HTML strings in just a few lines of code, easily generating clean plain text files. If you’re interested in the other powerful features of the Python Word library, you can request a 30-day free trial license and explore its full capabilities for yourself.

FAQs about Converting HTML to Text in Python

Q1: How can I convert HTML to plain text using Python?

A: Use Spire.Doc to load an HTML file or string, insert it into a Document object with AppendHTML(), and save it as a .txt file.

Q2: Can I keep some formatting when converting HTML to text?

A: To retain styles like headings or tables, convert HTML to a Word document first, then export to text if needed.

Q3: Is it possible to convert only part of an HTML page to text?

A: Yes, extract the specific HTML segment as a string and convert it to text using Python for precise control.

Read CSV file and convert to Excel in Python

In development, reading CSV files in Python is a common task in data processing, analytics, and backend integration. While Python offers built-in modules like csv and pandas for handling CSV files, Spire.XLS for Python provides a powerful, feature-rich alternative for working with CSV and Excel files programmatically.

In this article, you’ll learn how to use Python to read CSV files, from basic CSV parsing to advanced techniques.


Getting Started with Spire.XLS for Python

Spire.XLS for Python is a feature-rich library for processing Excel and CSV files. Unlike basic CSV parsers in Python, it offers advanced capabilities such as:

  • Simple API to load, read, and manipulate CSV data.
  • Reading/writing CSV files with support for custom delimiters.
  • Converting CSV files to Excel formats (XLSX, XLS) and vice versa.

These features make Spire.XLS ideal for data analysts, developers, and anyone working with structured data in CSV format.

Install via pip

Before getting started, install the library via pip. It works with Python 3.6+ on Windows, macOS, and Linux:

pip install Spire.XLS

Basic Example: Read a CSV in Python

Let’s start with a simple example: parsing a CSV file and extracting its data. Suppose we have a CSV file named “input.csv” with the following content:

Name,Age,City,Salary
Alice,30,New York,75000
Bob,28,Los Angeles,68000
Charlie,35,San Francisco,90000

Python Code to Read the CSV File

Here’s how to load and get data from the CSV file with Python:

from spire.xls import *
from spire.xls.common import *

# Create a Workbook object
workbook = Workbook()

# Load a CSV file
workbook.LoadFromFile("input.csv", ",", 1, 1)

# Get the first worksheet (CSV files are loaded as a single sheet)
worksheet = workbook.Worksheets[0]

# Get the number of rows and columns with data
row_count = worksheet.LastRow 
col_count = worksheet.LastColumn 

# Iterate through rows and columns to print data
print("CSV Data:")
for row in range(row_count):
    for col in range(col_count):
        # Get cell value
        cell_value = worksheet.Range[row+1, col+1].Value
        print(cell_value, end="\t")
    print()  # New line after each row

# Close the workbook
workbook.Dispose()

Explanation:

  • Workbook Initialization: The Workbook class is the core object for handling Excel files.

  • Load CSV File: LoadFromFile() imports the CSV data. Its parameters are:

    • fileName: The CSV file to read.
    • separator: Specified delimiter (e.g., “,”).
    • row/column: The starting row/column index.
  • Access Worksheet: The CSV data is loaded into the first worksheet.

  • Read Data: Iterate through rows and columns to extract cell values via worksheet.Range[].Value.

Output: Get data from a CSV file and print in a tabular format.

Read a CSV file using Python


Advanced CSV Reading Techniques

1. Read CSV with Custom Delimiters

Not all CSVs use commas. If your CSV file uses a different delimiter (e.g., ;), specify it during loading:

# Load a CSV file
workbook.LoadFromFile("input.csv", ";", 1, 1)

2. Skip Header Rows

If your CSV has headers, skip them by adjusting the row iteration to start from the second row instead of the first.

for row in range(1, row_count):
    for col in range(col_count):
        # Get cell value (row+1 because Spire.XLS uses 1-based indexing)
        cell_value = worksheet.Range[row+1, col+1].Value
        print(cell_value, end="\t")

3. Convert CSV to Excel in Python

One of the most powerful features of Spire.XLS is the ability to convert a CSV file into a native Excel format effortlessly. For example, you can read a CSV and then:​

  • Apply Excel formatting (e.g., set cell colors, borders).​
  • Create charts (e.g., a bar chart for sales by region).​
  • Save the data as an Excel file (.xlsx) for sharing.

Code Example: Convert CSV to Excel (XLSX) in Python – Single & Batch


Conclusion

Reading CSV files in Python with Spire.XLS simplifies both basic and advanced data processing tasks. Whether you need to extract CSV data, convert it to Excel, or handle advanced scenarios like custom delimiters, the examples outlined in this guide enables you to implement robust CSV reading capabilities in your projects with minimal effort.

Try the examples above, and explore the online documentation for more advanced features!

Python Read PowerPoint Documents

PowerPoint (PPT & PPTX) files are rich with diverse content, including text, images, tables, charts, shapes, and metadata. Extracting these elements programmatically can unlock a wide range of use cases, from automating repetitive tasks to performing in-depth data analysis or migrating content across platforms.

In this tutorial, we'll explore how to read PowerPoint documents in Python using Spire.Presentation for Python, a powerful library for processing PowerPoint files.

Table of Contents:

1. Python Library to Read PowerPoint Files

To work with PowerPoint files in Python, we'll use Spire.Presentation for Python. This feature-rich library enables developers to create, edit, and read content from PowerPoint presentations efficiently. It allows for the extraction of text, images, tables, SmartArt, and metadata with minimal coding effort.

Before we begin, install the library using pip:

pip install spire.presentation

Now, let's dive into different ways to extract content from PowerPoint files.

2. Extracting Text from Slides in Python

PowerPoint slides contain text in various forms—shapes, tables, SmartArt, and more. We'll explore how to extract text from each of these elements.

2.1 Extract Text from Shapes

Most text in PowerPoint slides resides within shapes (text boxes, labels, etc.). Here’s how to extract text from shapes:

Steps-by-Step Guide

  • Initialize the Presentation object and load your PowerPoint file.
  • Iterate through each slide and its shapes.
  • Check if a shape is an IAutoShape (a standard text container).
  • Extract text from each paragraph in the shape.

Code Example

from spire.presentation import *
from spire.presentation.common import *

# Create an object of Presentation class
presentation = Presentation()

# Load a PowerPoint presentation
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Create a list
text = []

# Loop through the slides in the document
for slide_index, slide in enumerate(presentation.Slides):
    # Add slide marker
    text.append(f"====slide {slide_index + 1}====")

    # Loop through the shapes in the slide
    for shape in slide.Shapes:
        # Check if the shape is an IAutoShape object
        if isinstance(shape, IAutoShape):
            # Loop through the paragraphs in the shape
            for paragraph in shape.TextFrame.Paragraphs:
                # Get the paragraph text and append it to the list
                text.append(paragraph.Text)

# Write the text to a txt file
with open("output/ExtractAllText.txt", "w", encoding='utf-8') as f:
    for s in text:
        f.write(s + "\n")

# Dispose resources
presentation.Dispose()

Output:

A text file containing all extracted text from shapes in the presentation.

2.2 Extract Text from Tables

Tables in PowerPoint store structured data. Extracting this data requires iterating through each cell to maintain the table’s structure.

Step-by-Step Guide

  • Initialize the Presentation object and load your PowerPoint file.
  • Iterate through each slide to access its shapes.
  • Identify table shapes (ITable objects).
  • Loop through rows and cells to extract text.

Code Example

from spire.presentation import *
from spire.presentation.common import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint file
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Create a list for tables
tables = []

# Loop through the slides
for slide in presentation.Slides:
    # Loop through the shapes in the slide
    for shape in slide.Shapes:
        # Check whether the shape is a table
        if isinstance(shape, ITable):
            tableData = "

            # Loop through the rows in the table
            for row in shape.TableRows:
                rowData = "

                # Loop through the cells in the row
                for i in range(row.Count):
                    # Get the cell value
                    cellValue = row[i].TextFrame.Text
                    # Add cell value with spaces for better readability
                    rowData += (cellValue + "  |  " if i < row.Count - 1 else cellValue)
                tableData += (rowData + "\n")
            tables.append(tableData)

# Write the tables to text files
for idx, table in enumerate(tables, start=1):
    fileName = f"output/Table-{idx}.txt"
    with open(fileName, "w", encoding='utf-8') as f:
        f.write(table)

# Dispose resources
presentation.Dispose()

Output:

A text file containing structured table data from the presentation.

2.3 Extract Text from SmartArt

SmartArt is a unique feature in PowerPoint used for creating diagrams. Extracting text from SmartArt involves accessing its nodes and retrieving the text from each node.

Step-by-Step Guide

  • Load the PowerPoint file into a Presentation object.
  • Iterate through each slide and its shapes.
  • Identify ISmartArt shapes in slides.
  • Loop through each node in the SmartArt.
  • Extract and save the text from each node.

Code Example

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint file
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Iterate through each slide in the presentation
for slide_index, slide in enumerate(presentation.Slides):

    # Create a list to store the extracted text for the current slide
    extracted_text = []

    # Loop through the shapes on the slide and find the SmartArt shapes
    for shape in slide.Shapes:
        if isinstance(shape, ISmartArt):
            smartArt = shape

            # Extract text from the SmartArt nodes and append to the list
            for node in smartArt.Nodes:
                extracted_text.append(node.TextFrame.Text)

    # Write the extracted text to a separate text file for each slide
    if extracted_text:  # Only create a file if there's text extracted
        file_name = f"output/SmartArt-from-slide-{slide_index + 1}.txt"
        with open(file_name, "w", encoding="utf-8") as text_file:
            for text in extracted_text:
                text_file.write(text + "\n")

# Dispose resources
presentation.Dispose()

Output:

A text file containing node text of a SmartArt.

You might also be interested in: Read Speaker Notes in PowerPoint in Python

3. Saving Images from Slides in Python

In addition to text, slides often contain images that may be important for your analysis. This section will show you how to save images from the slides.

Step-by-Step Guide

  • Initialize the Presentation object and load your PowerPoint file.
  • Access the Images collection in the presentation.
  • Iterate through each image and save it in a desired format (e.g., PNG).

Code Example

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint document
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Get the images in the document
images = presentation.Images

# Iterate through the images in the document
for i, image in enumerate(images):

    # Save a certain image in the specified path
    ImageName = "Output/Images_"+str(i)+".png"
    image_data = (IImageData)(image)
    image_data.Image.Save(ImageName)

# Dispose resources
presentation.Dispose()

Output:

PNG files of all images extracted from the PowerPoint.

4. Accessing Metadata (Document Properties) in Python

Extracting metadata provides insights into the presentation, such as its title, author, and keywords. This section will guide you on how to access and save this metadata.

Step-by-Step Guide

  • Create and load your PowerPoint file into a Presentation object.
  • Access the DocumentProperty object.
  • Extract properties like Title , Author , and Keywords .

Code Example

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint document
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Get the DocumentProperty object
documentProperty = presentation.DocumentProperty

# Prepare the content for the text file
properties = [
    f"Title: {documentProperty.Title}",
    f"Subject: {documentProperty.Subject}",
    f"Author: {documentProperty.Author}",
    f"Manager: {documentProperty.Manager}",
    f"Company: {documentProperty.Company}",
    f"Category: {documentProperty.Category}",
    f"Keywords: {documentProperty.Keywords}",
    f"Comments: {documentProperty.Comments}",
]

# Write the properties to a text file
with open("output/DocumentProperties.txt", "w", encoding="utf-8") as text_file:
    for line in properties:
        text_file.write(line + "\n")

# Dispose resources
presentation.Dispose()

Output:

A text file containing PowerPoint document properties.

You might also be interested in: Add Document Properties to a PowerPoint File in Python

5. Conclusion

With Spire.Presentation for Python, you can effortlessly read and extract various elements from PowerPoint files—such as text, images, tables, and metadata. This powerful library streamlines automation tasks, content analysis, and data migration, allowing for efficient management of PowerPoint files. Whether you're developing an analytics tool, automating document processing, or managing presentation content, Spire.Presentation offers a robust and seamless solution for programmatically handling PowerPoint files.

6. FAQs

Q1. Can Spire.Presentation handle password-protected PowerPoint files?

Yes, Spire.Presentation can open and process password-protected PowerPoint files. To access an encrypted file, use the LoadFromFile() method with the password parameter:

presentation.LoadFromFile("encrypted.pptx", "yourpassword")

Q2. How can I read comments from PowerPoint slides?

You can read comments from PowerPoint slides using the Spire.Presentation library. Here’s how:

from spire.presentation import *

presentation = Presentation()
presentation.LoadFromFile("Input.pptx")

with open("PowerPoint_Comments.txt", "w", encoding="utf-8") as file:
    for slide_idx, slide in enumerate(presentation.Slides):
        slide = (ISlide)(slide)
        if len(slide.Comments) > 0:
            for comment_idx, comment in enumerate(slide.Comments):
                file.write(f"Comment {comment_idx + 1} from Slide {slide_idx + 1}: {comment.Text}\n")

Q3. Does Spire.Presentation preserve formatting when extracting text?

Basic text extraction retrieves raw text content. For formatted text (fonts, colors), you would need to access additional properties like TextRange.LatinFont and TextRange.Fill .

Q4. Are there any limitations on file size when reading PowerPoint files in Python?

While Spire.Presentation can handle most standard presentations, extremely large files (hundreds of MB) may require optimization for better performance.

Q5. Can I create or modify PowerPoint documents using Spire.Presentation?

Yes, you can create PowerPoint documents and modify existing ones using Spire.Presentation. The library provides a range of features that allow you to add new slides, insert text, images, tables, and shapes, as well as edit existing content.

Get a Free License

To fully experience the capabilities of Spire.Presentation for Python without any evaluation limitations, you can request a free 30-day trial license.

Python code examples for creating PowerPoint documents

Creating PowerPoint presentations programmatically can save time and enhance efficiency in generating reports, slideshows, and other visual presentations. By automating the process, you can focus on content and design rather than manual formatting.

In this tutorial, we will explore how to create PowerPoint documents in Python using Spire.Presentation for Python. This powerful tool allows developers to manipulate and generate PPT and PPTX files seamlessly.

Table of Contents:

1. Python Library to Work with PowerPoint Files

Spire.Presentation is a robust library for creating, reading, and modifying PowerPoint files in Python , without requiring Microsoft Office. This library supports a wide range of features, including:

  • Create PowerPoint documents from scratch or templates.
  • Add text, images, lists, tables, charts, and shapes.
  • Customize fonts, colors, backgrounds, and layouts.
  • Save as or export to PPT, PPTX, PDF, or images.

In the following sections, we will walk through the steps to install the library, create presentations, and add various elements to your slides.

2. Installing Spire.Presentation for Python

To get started, you need to install the Spire.Presentation library. You can install it using pip:

pip install spire.presentation

Once installed, you can begin utilizing its features in your Python scripts to create PowerPoint documents.

3. Creating a PowerPoint Document from Scratch

3.1 Generate and Save a Blank Presentation

Let's start by creating a basic PowerPoint presentation from scratch. The following code snippet demonstrates how to generate and save a blank presentation:

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation object
presentation = Presentation()

# Set the slide size type
presentation.SlideSize.Type = SlideSizeType.Screen16x9

# Add a slide (there is one slide in the document by default)
presentation.Slides.Append()

# Save the document as a PPT or PPTX file
presentation.SaveToFile("BlankPowerPoint.pptx", FileFormat.Pptx2019)
presentation.Dispose()

In this code:

  • Presentation : Root class representing the PowerPoint file.
  • SlideSize.Type : Sets the slide dimensions (e.g., SlideSizeType.Screen16x9 for widescreen).
  • Slides.Append() : Adds a new slide to the presentation. By default, a presentation starts with one slide.
  • SaveToFile() : Saves the presentation to the specified file format (PPTX in this case).

3.2 Add Basic Elements to Your Slides

Now that we have a blank presentation, let's add some basic elements like text, images, lists, and tables.

Add Formatted Text

To add formatted text, we can use the following code:

# Get the first slide
first_slide = presentation.Slides[0]

# Add a shape to the slide
rect = RectangleF.FromLTRB (30, 60, 900, 150)
shape = first_slide.Shapes.AppendShape(ShapeType.Rectangle, rect)
shape.ShapeStyle.LineColor.Color = Color.get_Transparent()
shape.Fill.FillType = FillFormatType.none

# Add text to the shape
shape.AppendTextFrame("This guide demonstrates how to create a PowerPoint document using Python.")

# Get text of the shape as a text range
textRange = shape.TextFrame.TextRange

# Set font name, style (bold & italic), size and color
textRange.LatinFont = TextFont("Times New Roman")
textRange.IsBold = TriState.TTrue
textRange.FontHeight = 32
textRange.Fill.FillType = FillFormatType.Solid
textRange.Fill.SolidColor.Color = Color.get_Black()

# Set alignment
textRange.Paragraph.Alignment = TextAlignmentType.Left

In this code:

  • AppendShape() : Adds a shape to the slide. We create a rectangle shape that will house our text.
  • AppendTextFrame() : Adds a text frame to the shape, allowing us to insert text into it.
  • TextFrame.TextRange : Accesses the text range of the shape, enabling further customization such as font style, size, and color.
  • Paragraph.Alignment : Sets the alignment of the text within the shape.

Add an Image

To include an image in your presentation, use the following code snippet:

# Get the first slide
first_slide = presentation.Slides[0]

# Load an image file
imageFile = "C:\\Users\\Administrator\\Desktop\\logo.png"
stream = Stream(imageFile)
imageData = presentation.Images.AppendStream(stream)

# Reset size
width = imageData.Width * 0.6
height = imageData.Height * 0.6

# Append it to the slide
rect = RectangleF.FromLTRB (750, 50, 750 + width, 50 + height)
image = first_slide.Shapes.AppendEmbedImageByImageData(ShapeType.Rectangle, imageData, rect)
image.Line.FillType = FillFormatType.none

In this code:

  • Stream() : Creates a stream from the specified image file path.
  • AppendStream() : Appends the image data to the presentation's image collection.
  • AppendEmbedImageByImageData() : Adds the image to the slide at the specified rectangle coordinates.

You may also like: Insert Shapes in PowerPoint in Python

Add a List

To add a bulleted list to your slide, we can use:

# Get the first slide
first_slide = presentation.Slides[0]

# Specify list bounds and content
listBounds = RectangleF.FromLTRB(30, 150, 500, 350)
listContent = [
    " Step 1. Install Spire.Presentation for Python.",
    " Step 2. Create a Presentation object.",
    " Step 3. Add text, images, etc. to slides.",
    " Step 5. Set a background image or color.",
    " Step 6. Save the presentation to a PPT(X) file."
]

# Add a shape
autoShape = first_slide.Shapes.AppendShape(ShapeType.Rectangle, listBounds)
autoShape.TextFrame.Paragraphs.Clear()
autoShape.Fill.FillType = FillFormatType.none
autoShape.Line.FillType = FillFormatType.none

for content in listContent:
    # Create paragraphs based on the list content and add them to the shape
    paragraph = TextParagraph()
    autoShape.TextFrame.Paragraphs.Append(paragraph)
    paragraph.Text = content
    paragraph.TextRanges[0].Fill.FillType = FillFormatType.Solid
    paragraph.TextRanges[0].Fill.SolidColor.Color = Color.get_Black()
    paragraph.TextRanges[0].FontHeight = 20
    paragraph.TextRanges[0].LatinFont = TextFont("Arial")

    # Set the bullet type for these paragraphs
    paragraph.BulletType = TextBulletType.Symbol

    # Set line spacing
    paragraph.LineSpacing = 150

In this code:

  • AppendShape() : Creates a rectangle shape for the list.
  • TextFrame.Paragraphs.Append() : Adds paragraphs for each list item.
  • BulletType : Sets the bullet style for the list items.

Add a Table

To include a table, you can use the following:

# Get the first slide
first_slide = presentation.Slides[0]

# Define table dimensions and data
widths = [200, 200, 200]
heights = [18, 18, 18, 18]
dataStr = [
    ["Slide Number", "Title", "Content Type"], 
    ["1", "Introduction", "Text/Image"], 
    ["2", "Project Overview", "Chart/Graph"],  
    ["3", "Key Findings", "Text/List"]
]

# Add table to the slide
table = first_slide.Shapes.AppendTable(30, 360, widths, heights)

# Fill table with data and apply formatting
for rowNum, rowData in enumerate(dataStr):
    for colNum, cellData in enumerate(rowData):  
        cell = table[colNum, rowNum]
        cell.TextFrame.Text = cellData
        
        textRange = cell.TextFrame.Paragraphs[0].TextRanges[0]
        textRange.LatinFont = TextFont("Times New Roman")
        textRange.FontHeight = 20
        
        cell.TextFrame.Paragraphs[0].Alignment = TextAlignmentType.Center

# Apply a built-in table style
table.StylePreset = TableStylePreset.MediumStyle2Accent1

In this code:

  • AppendTable() : Adds a table to the slide at specified coordinates with defined widths and heights for columns and rows.
  • Cell.TextFrame.Text : Sets the text for each cell in the table.
  • StylePreset : Applies a predefined style to the table for enhanced aesthetics.

3.3 Apply a Background Image or Color

To set a custom background for your slide, use the following code:

# Get the first slide
first_slide = presentation.Slides[0]

# Get the background of the first slide
background = first_slide.SlideBackground

# Create a stream from the specified image file
stream = Stream("C:\\Users\\Administrator\\Desktop\\background.jpg")
imageData = presentation.Images.AppendStream(stream)

# Set the image as the background 
background.Type = BackgroundType.Custom
background.Fill.FillType = FillFormatType.Picture
background.Fill.PictureFill.FillType = PictureFillType.Stretch
background.Fill.PictureFill.Picture.EmbedImage = imageData

In this code:

  • SlideBackground : Accesses the background properties of the slide.
  • Fill.FillType : Specifies the type of fill (in this case, an image).
  • PictureFill.FillType : Sets how the background image is displayed (stretched, in this case).
  • Picture.EmbedImage : Sets image data for the background.

For additional background options, refer to this tutorial: Set Background Color or Picture for PowerPoint Slides in Python

Output:

Below is a screenshot of the PowerPoint document generated by the code snippets provided above.

Create a PowerPoint file from scratch in Python

4. Creating PowerPoint Documents Based on a Template

Using templates can simplify the process of creating presentations by allowing you to replace placeholders with actual data. Below is an example of how to create a PowerPoint document based on a template:

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint document from a specified file path
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\template.pptx")

# Get a specific slide from the presentation
slide = presentation.Slides[0]

# Define a list of replacements where each tuple consists of the placeholder and its corresponding replacement text
replacements = [
    ("{project_name}", "GreenCity Solar Initiative"),
    ("{budget}", "$1,250,000"),
    ("{status}", "In Progress (65% Completion)"),
    ("{start_date}", "March 15, 2023"),
    ("{end_date}", "November 30, 2024"),
    ("{manager}", "Emily Carter"),
    ("{client}", "GreenCity Municipal Government")
]

# Iterate through each replacement pair
for old_string, new_string in replacements:

    # Replace the first occurrence of the old string in the slide with the new string
    slide.ReplaceFirstText(old_string, new_string, False)

# Save the modified presentation to a new file
presentation.SaveToFile("Template-Based.pptx", FileFormat.Pptx2019)
presentation.Dispose()

In this code:

  • LoadFromFile() : Loads an existing PowerPoint file that serves as the template.
  • ReplaceFirstText() : Replaces placeholders within the slide with actual values. This is useful for dynamic content generation.
  • SaveToFile() : Saves the modified presentation as a new file.

Output:

Create PowerPoint documents based on a template in Python

5. Best Practices for Python PowerPoint Generation

When creating PowerPoint presentations using Python, consider the following best practices:

  • Maintain Consistency : Ensure that the formatting (fonts, colors, styles) is consistent across slides for a professional appearance.
  • Modular Code: Break document generation into functions (e.g., add_list(), insert_image()) for reusability.
  • Optimize Images : Resize and compress images before adding them to presentations to reduce file size and improve loading times.
  • Use Templates : Whenever possible, use templates to save time and maintain a cohesive design.
  • Test Your Code : Always test your presentation generation code to ensure that all elements are added correctly and appear as expected.

6. Wrap Up

In this tutorial, we explored how to create PowerPoint documents in Python using the Spire.Presentation library. We covered the installation, creation of presentations from scratch, adding various elements, and using templates for dynamic content generation. With these skills, you can automate the creation of presentations, making your workflow more efficient and effective.

7. FAQs

Q1. What is Spire.Presentation?

Spire.Presentation is a powerful library used for creating, reading, and modifying PowerPoint files in various programming languages, including Python.

Q2. Does this library require Microsoft Office to be installed?

No, Spire.Presentation operates independently and does not require Microsoft Office.

Q3. Can I customize the layout of slides in my presentation?

Yes, you can customize the layout of each slide by adjusting properties such as size, background, and the placement of shapes, text, and images.

Q4. Does Spire.Presentation support both PPT and PPTX format?

Yes, Spire.Presentation supports both PPT and PPTX formats, allowing you to create and manipulate presentations in either format.

Q5. Can I add charts to my presentations?

Yes, Spire.Presentation supports the addition of charts to your slides, allowing you to visualize data effectively. For detailed instruction, refer to: How to Create Column Charts in PowerPoint Using Python

Get a Free License

To fully experience the capabilities of Spire.Presentation for Python without any evaluation limitations, you can request a free 30-day trial license.

OCR PDF and Extract Text Using Python In daily work, extracting text from PDF files is a common task. For standard digital documents—such as those exported from Word to PDF—this process is usually straightforward. However, things get tricky when dealing with scanned PDFs, which are essentially images of printed documents. In such cases, traditional text extraction methods fail, and OCR (Optical Character Recognition) becomes necessary to recognize and convert the text within images into editable content.

In this article, we’ll walk through how to perform PDF OCR using Python to automate this workflow and significantly reduce manual effort.

Why OCR is Needed for PDF Text Extraction

When it comes to extracting text from PDF files, one important factor that determines your approach is the type of PDF. Generally, PDFs fall into two categories: scanned (image-based) PDFs and searchable PDFs. Each requires a different strategy for text extraction.

  • Scanned PDFs are typically created by digitizing physical documents such as books, invoices, contracts, or magazines. While the text appears readable to the human eye, it's actually embedded as an image—making it inaccessible to traditional text extraction tools. Older digital files or password-protected PDFs may also lack an actual text layer.

  • Searchable PDFs, on the other hand, contain a hidden text layer that allows computers to search, copy, or parse the content. These files are usually generated directly from applications like Microsoft Word or PDF editors and are much easier to process programmatically.

This distinction highlights the importance of OCR (Optical Character Recognition) when working with scanned PDFs. With tools like Python PDF OCR, we can convert these image-based PDFs into images, run OCR to recognize the text, and extract it for further use—all in an automated way.

Best Python OCR Libraries for PDF Processing

Before diving into the implementation, let’s take a quick look at the tools we’ll be using in this tutorial. To simplify the process, we’ll use Spire.PDF for Python and Spire.OCR for Python to perform PDF OCR in Python.

  • Spire.PDF will handle the conversion from PDF to images.
  • Spire.OCR, a powerful OCR tool for PDF files, will recognize the text in those images and extract it as editable content.

You can install Spire.PDF using the following pip command:

pip install spire.pdf

and install Spire.OCR with:

pip install spire.ocr

Alternatively, you can download and install them manually by visiting the official Spire.PDF and Spire.OCR pages.

Convert PDF Pages to Images Using Python

Before we dive into Python PDF OCR, it's crucial to understand a foundational step: OCR technology doesn't directly process PDF files. Especially with image-based PDFs (like those created from scanned documents), we first need to convert them into individual image files.

Converting PDFs to images using the Spire.PDF library is straightforward. You simply load your target PDF document and then iterate through each page. For every page, call the PdfDocument.SaveAsImage() method to save it as a separate image file. Once this step is complete, your images are ready for the subsequent OCR process.

Here's a code example showing how to convert PDF to PNG:

from spire.pdf import *

# Load the PDF file
pdf = PdfDocument()
pdf.LoadFromFile("/AI-Generated Art.pdf")

# Loop through pages and save as images
for i in range(pdf.Pages.Count):
    # Convert each page to image
    with pdf.SaveAsImage(i) as image:
        
        # Save in different formats as needed
        image.Save(f"/output/pdftoimage/ToImage_{i}.png")
        # image.Save(f"Output/ToImage_{i}.jpg")
        # image.Save(f"Output/ToImage_{i}.bmp")

# Close the PDF document
pdf.Close()

Conversion result preview: Convert PDF to PNG in Python

Scan and Extract Text from Images Using Spire.OCR

After converting the scanned PDF into images, we can now move on to OCR PDF with Python and to extract text from the PDF. With OcrScanner.Scan() from Spire.OCR, recognizing text in images becomes straightforward. It supports multiple languages such as English, Chinese, French, and German. Once the text is extracted, you can easily save it to a .txt file or generate a Word document.

The code example below shows how to OCR the first PDF page and export to text in Python:

from spire.ocr import *

# Create OCR scanner instance
scanner = OcrScanner()

# Configure OCR model path and language
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'E:/DownloadsNew/win-x64/'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)

# Perform OCR on the image
scanner.Scan(r'/output/pdftoimage/ToImage_0.png')

# Save extracted text to file
text = scanner.Text.ToString()
with open('/output/scannedpdfoutput.txt', 'a', encoding='utf-8') as file:
    file.write(text + '\n')

Result preview: OCR the First PDF Image and Extract Text Using Python

The Conclusion

In this article, we covered how to perform PDF OCR with Python—from converting PDFs to images, to recognizing text with OCR, and finally saving the extracted content as a plain text file. With this streamlined approach, extracting text from scanned PDFs becomes effortless. If you're looking to automate your PDF processing workflows, feel free to reach out and request a 30-day free trial. It’s time to simplify your document management.

Visual guide of PDF to Markdown in Python

PDFs are ubiquitous in digital document management, but their rigid formatting often makes them less than ideal for content that needs to be easily edited, updated, or integrated into modern workflows. Markdown (.md), on the other hand, offers a lightweight, human-readable syntax perfect for web publishing, documentation, and version control. In this guide, we'll explore how to leverage the Spire.PDF for Python library to perform single or batch conversions from PDF to Markdown in Python efficiently.

Why Convert PDFs to Markdown?

Markdown offers several advantages over PDF for content creation and management:

  • Version control friendly: Easily track changes in Git
  • Lightweight and readable: Plain text format with simple syntax
  • Editability: Simple to modify without specialized software
  • Web integration: Natively supported by platforms like GitHub, GitLab, and static site generators (e.g., Jekyll, Hugo).

Spire.PDF for Python provides a robust solution for extracting text and structure from PDFs while preserving essential formatting elements like tables, lists, and basic styling.

Python PDF Converter Library - Installation

To use Spire.PDF for Python in your projects, you need to install the library via PyPI (Python Package Index) using pip. Open your terminal/command prompt and run:

pip install Spire.PDF

To upgrade an existing installation to the latest version:

pip install --upgrade spire.pdf

Convert PDF to Markdown in Python

Here’s a basic example demonstrates how to use Python to convert a PDF file to a Markdown (.md) file.

from spire.pdf.common import *
from spire.pdf import *

# Create an instance of PdfDocument class
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile("TestFile.pdf")

# Convert the PDF to a Markdown file
pdf.SaveToFile("PDFToMarkdown.md", FileFormat.Markdown) 
pdf.Close()

This Python script loads a PDF file and then uses the SaveToFile() method to convert it to Markdown format. The FileFormat.Markdown parameter specifies the output format.

How Conversion Works

The library extracts text, images, tables, and basic formatting from the PDF and converts them into Markdown syntax.

  • Text: Preserved with paragraphs/line breaks.
  • Images: Images in the PDF are converted to base64-encoded PNG format and embedded directly in the Markdown.
  • Tables: Tabular data is converted to Markdown table syntax (rows/columns with pipes |).
  • Styling: Basic formatting (bold, italic) is retained using Markdown syntax.

Output: Convert a PDF file to a Markdown file.

Batch Convert Multiple PDFs to Markdown in Python

This Python script uses a loop to convert all PDF files in a specified directory to Markdown format.

import os
from spire.pdf import *

# Configure paths
input_folder = "pdf_folder/"
output_folder = "markdown_output/"

# Create output directory
os.makedirs(output_folder, exist_ok=True)

# Process all PDFs in folder
for file_name in os.listdir(input_folder):
    if file_name.endswith(".pdf"):
        # Initialize document
        pdf = PdfDocument()
        pdf.LoadFromFile(os.path.join(input_folder, file_name))
        
        # Generate output path
        md_name = os.path.splitext(file_name)[0] + ".md"
        output_path = os.path.join(output_folder, md_name)
        
        # Convert to Markdown
        pdf.SaveToFile(output_path, FileFormat.Markdown)
        pdf.Close()

Key Characteristics

  • Batch Processing: Automatically processes all PDFs in input folder, improving efficiency for bulk operations.
  • 1:1 Conversion: Each PDF generates corresponding Markdown file.
  • Sequential Execution: Files processed in alphabetical order.
  • Resource Management: Each PDF is closed immediately after conversion.

Output:

Batch convert multiple PDF files to Markdown files.

Need to convert Markdown to PDF? Refer to: Convert Markdown to PDF in Python


Frequently Asked Questions (FAQs)

Q1: Is Spire.PDF for Python free?

A: Spire.PDF offers a free version with limitations (e.g., maximum 3 pages per conversion). For unlimited use, request a 30-day free trial for evaluation.

Q2: Can I convert password-protected PDFs to Markdown?

A: Yes. Use the LoadFromFile method with the password as a second parameter:

pdf.LoadFromFile("ProtectedFile.pdf", "your_password")

Q3: Can Spire.PDF convert scanned/image-based PDFs to Markdown?

A: No. The library extracts text-based content only. For scanned PDFs, use OCR tools (like Spire.OCR for Python) to create searchable PDFs first.


Conclusion

Spire.PDF for Python simplifies PDF to Markdown conversion for both single file and batch processing.

Its advantages include:

  • Simple API with minimal code
  • Preservation of document structure
  • Batch processing capabilities
  • Cross-platform compatibility

Whether you're migrating documentation, processing research papers, or building content pipelines, by following the examples in this guide, you can efficiently transform static PDF documents into flexible, editable Markdown content, streamlining workflows and improving collaboration.

Page 2 of 26
page 2