Hyperlinks are a commonly used tool in Excel that facilitates navigation between different sheets, workbooks, websites, or even specific cells within a worksheet. There are instances where you may need to manage hyperlinks in Excel files, such as extracting hyperlinks for further analysis, modifying existing links, or removing them entirely. In this article, we will introduce how to extract, modify, and remove hyperlinks in Excel in Python using Spire.XLS for Python.
- Extract Hyperlinks from Excel in Python
- Modify Hyperlinks in Excel in Python
- Remove Hyperlinks from Excel in Python
Install Spire.XLS for Python
This scenario requires Spire.XLS for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.
pip install Spire.XLS
If you are unsure how to install, please refer to this tutorial: How to Install Spire.XLS for Python on Windows
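To confirm that both packages are present before running the examples, a quick check with the standard library can help. The following is a minimal sketch: the helper name installed_version is our own (it is not part of Spire.XLS), and it assumes the packages are registered under the distribution names Spire.XLS and plum-dispatch.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages named in this article
for name in ("Spire.XLS", "plum-dispatch"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```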
Extract Hyperlinks from Excel in Python
Extracting hyperlinks from an Excel worksheet can be beneficial when you need to analyze or export the link data for further processing.
The following steps demonstrate how to extract hyperlinks from an Excel worksheet in Python using Spire.XLS for Python:
- Create a Workbook object.
- Load an Excel file using Workbook.LoadFromFile() method.
- Get a specific worksheet using Workbook.Worksheets[] property.
- Get the collection of all hyperlinks in the worksheet using Worksheet.HyperLinks property.
- Create an empty list to store the extracted hyperlink information.
- Loop through the hyperlinks in the hyperlink collection.
- Get the address of each hyperlink using XlsHyperlink.Address property and append the address to the list.
- Write the addresses in the list into a text file.
from spire.xls import *
from spire.xls.common import *
# Create a Workbook object
workbook = Workbook()
# Load an Excel file
workbook.LoadFromFile("Hyperlinks.xlsx")
# Get the first worksheet of the file
sheet = workbook.Worksheets[0]
# Get the hyperlink collection of the worksheet
links = sheet.HyperLinks
# Create an empty list to store the extracted hyperlink addresses
addresses = []
# Loop through the hyperlinks in the hyperlink collection
for link in links:
    # Get the address of each hyperlink
    address = link.Address
    # Append the address to the list
    addresses.append(address)
# Write the extracted hyperlink addresses to a text file
with open("ExtractHyperlinks.txt", "w", encoding="utf-8") as file:
    for item in addresses:
        file.write(item + "\n")
workbook.Dispose()
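If the extracted addresses are meant for further analysis, it may help to group them by type before writing them out. The sketch below is one possible approach, not part of Spire.XLS: the function name group_addresses is hypothetical, and it only assumes that each hyperlink object exposes an Address attribute, as in the steps above. Stand-in objects are used so that it runs without an Excel file.

```python
from collections import defaultdict
from types import SimpleNamespace
from urllib.parse import urlparse


def group_addresses(links):
    """Group hyperlink addresses by URL scheme, skipping empty ones and duplicates."""
    groups = defaultdict(list)
    for link in links:
        address = getattr(link, "Address", None)
        if not address:
            continue
        # Addresses without a scheme (for example sheet references) go under "other"
        scheme = urlparse(address).scheme.lower() or "other"
        if address not in groups[scheme]:
            groups[scheme].append(address)
    return dict(groups)


# Stand-in objects (an assumption for this sketch) in place of real hyperlinks
sample_links = [
    SimpleNamespace(Address="https://www.e-iceblue.com"),
    SimpleNamespace(Address="mailto:support@example.com"),
    SimpleNamespace(Address="https://www.e-iceblue.com"),
    SimpleNamespace(Address=None),
]
print(group_addresses(sample_links))
```

With a real workbook, the same function could be called as group_addresses(sheet.HyperLinks). Note that urlparse treats a Windows drive letter as a scheme, so local file paths would need separate handling.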

Modify Hyperlinks in Excel in Python
Modifying hyperlinks allows you to update URLs or alter the display text to suit your needs.
The following steps demonstrate how to modify an existing hyperlink in an Excel worksheet in Python using Spire.XLS for Python:
- Create a Workbook object.
- Load an Excel file using Workbook.LoadFromFile() method.
- Get a specific worksheet using Workbook.Worksheets[] property.
- Get a specific hyperlink in the worksheet using Worksheet.HyperLinks[] property.
- Modify the display text and address of the hyperlink using XlsHyperlink.TextToDisplay and XlsHyperlink.Address properties.
- Save the resulting file using Workbook.SaveToFile() method.
from spire.xls import *
from spire.xls.common import *
# Create a Workbook object
workbook = Workbook()
# Load an Excel file
workbook.LoadFromFile("Hyperlinks.xlsx")
# Get the first worksheet of the file
sheet = workbook.Worksheets[0]
# Get the first hyperlink in the worksheet
link = sheet.HyperLinks[0]
# Change the display text of the hyperlink
link.TextToDisplay = "Spire.XLS for .NET"
# Change the address of the hyperlink
link.Address = "http://www.e-iceblue.com"
# Save the resulting file
workbook.SaveToFile("ModifyHyperlink.xlsx", ExcelVersion.Version2016)
workbook.Dispose()
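The example above changes a single hyperlink. When many links need the same change, the update can be wrapped in a small loop. The following is a rough sketch under stated assumptions: upgrade_to_https is a hypothetical helper of our own, and it relies only on the Address property used above being readable and writable. Stand-in objects replace real hyperlinks so the code runs on its own.

```python
from types import SimpleNamespace


def upgrade_to_https(links):
    """Rewrite http:// addresses to https:// in place and return how many changed."""
    changed = 0
    for link in links:
        address = getattr(link, "Address", None) or ""
        if address.lower().startswith("http://"):
            link.Address = "https://" + address[len("http://"):]
            changed += 1
    return changed


# Stand-in objects (an assumption for this sketch) in place of real hyperlinks
sample_links = [
    SimpleNamespace(Address="http://www.e-iceblue.com"),
    SimpleNamespace(Address="https://www.e-iceblue.com/Introduce/xls-for-python.html"),
    SimpleNamespace(Address=None),
]
changed_count = upgrade_to_https(sample_links)
print(changed_count, [link.Address for link in sample_links])
```

With a loaded worksheet, the call would presumably be upgrade_to_https(sheet.HyperLinks), followed by Workbook.SaveToFile() as in the steps above.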

Remove Hyperlinks from Excel in Python
Removing hyperlinks can help eliminate unnecessary links and clean up your spreadsheet.
The following steps demonstrate how to remove a specific hyperlink from an Excel worksheet in Python using Spire.XLS for Python:
- Create a Workbook object.
- Load an Excel file using Workbook.LoadFromFile() method.
- Get a specific worksheet using Workbook.Worksheets[] property.
- Remove a specific hyperlink from the worksheet using Worksheet.HyperLinks.RemoveAt() method.
- Save the resulting file using Workbook.SaveToFile() method.
from spire.xls import *
from spire.xls.common import *
# Create a Workbook object
workbook = Workbook()
# Load an Excel file
workbook.LoadFromFile("Hyperlinks.xlsx")
# Get the first worksheet of the file
sheet = workbook.Worksheets[0]
# Remove the first hyperlink and keep its display text
sheet.HyperLinks.RemoveAt(0)
# Save the resulting file
workbook.SaveToFile("RemoveHyperlink.xlsx", ExcelVersion.Version2016)
workbook.Dispose()
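When more than one hyperlink has to go, the order of removal matters: removing an item by index shifts every later item down by one, so it is safer to work from the highest index to the lowest. The sketch below illustrates the idea with a plain Python list standing in for the hyperlink collection. The helper indices_to_remove is hypothetical, and list.pop(i) plays the role of RemoveAt(i).

```python
def indices_to_remove(addresses, should_remove):
    """Return the indices of matching addresses, highest first."""
    matches = [i for i, address in enumerate(addresses) if should_remove(address)]
    return sorted(matches, reverse=True)


# A plain list stands in for the hyperlink collection (an assumption for this sketch)
addresses = [
    "http://old.example.com/a",
    "https://www.e-iceblue.com",
    "http://old.example.com/b",
]
for index in indices_to_remove(addresses, lambda a: "old.example.com" in a):
    # With a real worksheet this would be sheet.HyperLinks.RemoveAt(index)
    addresses.pop(index)
print(addresses)
```

Going from the end of the collection toward the start means the indices that are still pending always point at the same hyperlinks as when they were computed.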

Apply for a Temporary License
If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.
By extracting text from Word documents, you can effortlessly obtain the written information contained within them. This allows for easier manipulation, analysis, and organization of textual content, enabling tasks such as text mining, sentiment analysis, and natural language processing. Extracting images, on the other hand, provides access to visual elements embedded within Word documents, which can be crucial for tasks like image recognition, content extraction, or creating image databases. In this article, you will learn how to extract text and images from a Word document in Python using Spire.Doc for Python.
- Extract Text from a Specific Paragraph in Python
- Extract Text from an Entire Word Document in Python
- Extract Images from an Entire Word Document in Python
Install Spire.Doc for Python
This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.
pip install Spire.Doc
If you are unsure how to install, please refer to this tutorial: How to Install Spire.Doc for Python on Windows
Extract Text from a Specific Paragraph in Python
To get a certain paragraph from a section, use Section.Paragraphs[index] property. Then, you can get the text of the paragraph through Paragraph.Text property. The detailed steps are as follows.
- Create a Document object.
- Load a Word file using Document.LoadFromFile() method.
- Get a specific section through Document.Sections[index] property.
- Get a specific paragraph through Section.Paragraphs[index] property.
- Get text from the paragraph through Paragraph.Text property.
from spire.doc import *
from spire.doc.common import *
# Create a Document object
doc = Document()
# Load a Word document
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\input.docx")
# Get a specific section
section = doc.Sections[0]
# Get a specific paragraph
paragraph = section.Paragraphs[2]
# Get text from the paragraph
text = paragraph.Text
# Print result
print(text)

Extract Text from an Entire Word Document in Python
If you want to get text from a whole document, you can simply use Document.GetText() method. Below are the steps.
- Create a Document object.
- Load a Word file using Document.LoadFromFile() method.
- Get text from the document using Document.GetText() method.
from spire.doc import *
from spire.doc.common import *
# Create a Document object
doc = Document()
# Load a Word file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\input.docx")
# Get text from the entire document
text = doc.GetText()
# Print result
print(text)
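Text extracted from a document often contains empty lines and uneven spacing, which can get in the way of later analysis. The following sketch shows one way to tidy the string returned by Document.GetText() using only the standard library. The function names clean_lines and word_count are our own, and a literal string stands in for the extracted text.

```python
import re


def clean_lines(raw_text):
    """Split text into lines, collapse runs of whitespace, and drop empty lines."""
    lines = []
    for line in raw_text.splitlines():
        collapsed = re.sub(r"\s+", " ", line).strip()
        if collapsed:
            lines.append(collapsed)
    return lines


def word_count(raw_text):
    """Count whitespace-separated words across all non-empty lines."""
    return sum(len(line.split()) for line in clean_lines(raw_text))


# A literal string stands in for the result of doc.GetText()
sample_text = "Title\r\n\r\n  First   paragraph.\t\n"
print(clean_lines(sample_text))
print(word_count(sample_text))
```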

Extract Images from an Entire Word Document in Python
Spire.Doc for Python does not provide a single method for getting images from a Word document. Instead, you iterate through the child objects in the document and check whether each one is a DocPicture. If it is, you get the image data using the DocPicture.ImageBytes property and save it to an image file, such as a PNG. The main steps are as follows.
- Create a Document object.
- Load a Word file using Document.LoadFromFile() method.
- Loop through the child objects in the document.
- Determine if a specific child object is a DocPicture. If yes, get the image data through DocPicture.ImageBytes property.
- Write the image data as a PNG file.
import os
import queue
from spire.doc import *
from spire.doc.common import *
# Create a Document object
doc = Document()
# Load a Word file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\input.docx")
# Create a Queue object and start the traversal at the document itself
nodes = queue.Queue()
nodes.put(doc)
# Create a list to hold the image data
images = []
while nodes.qsize() > 0:
    node = nodes.get()
    # Loop through the child objects of the current node
    for i in range(node.ChildObjects.Count):
        child = node.ChildObjects.get_Item(i)
        # Determine if a child object is a picture
        if child.DocumentObjectType == DocumentObjectType.Picture and isinstance(child, DocPicture):
            # Add the image data to the list
            images.append(child.ImageBytes)
        elif isinstance(child, ICompositeObject):
            # Queue composite objects so their children are visited too
            nodes.put(child)
# Make sure the output folder exists
os.makedirs("ExtractedImages", exist_ok=True)
# Loop through the images in the list
for i, item in enumerate(images):
    fileName = "Image-{}.png".format(i)
    with open("ExtractedImages/" + fileName, "wb") as imageFile:
        # Write the image to the specified path
        imageFile.write(item)
doc.Close()
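The example above saves every image with a .png extension, although pictures embedded in a Word document may be stored in other formats. One possible refinement is to look at the first few bytes of the data and choose the extension accordingly. The sketch below is an illustration only: guess_image_extension is a hypothetical helper, and it assumes that DocPicture.ImageBytes yields a bytes-like object.

```python
def guess_image_extension(data):
    """Guess a file extension from the leading bytes of image data."""
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"\xff\xd8\xff", ".jpg"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
    ]
    for magic, extension in signatures:
        if data.startswith(magic):
            return extension
    # Unknown format: fall back to a neutral extension
    return ".bin"


# Short byte strings stand in for real image data
print(guess_image_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(guess_image_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 8))
```

In the saving loop, the file name could then be built as "Image-{}{}".format(i, guess_image_extension(item)).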

How to Convert PDF to Excel with Formatting in Python (Step-by-Step Guide)
2023-09-07 01:11:29 Written by Koohji
Converting PDF files to Excel spreadsheets in Python is an effective way to extract structured data for analysis, reporting, and automation. While PDFs are excellent for preserving layout across platforms, their static format often makes data extraction challenging.
Excel, on the other hand, provides robust features for sorting, filtering, calculating, and visualizing data. By using Python along with the Spire.PDF for Python library, you can automate the entire PDF to Excel conversion process — from basic one-page documents to complex, multi-page PDFs.
Whether you're automating data extraction from PDFs or integrating PDF content into Excel workflows, this tutorial will walk you through both quick-start and advanced methods for reliable Python PDF to Excel conversion.
Table of Contents
- Why Convert PDF to Excel Programmatically in Python
- Setting Up Your Development Environment
- Quick Start: Convert PDF to Excel in Python
- Advanced PDF to Excel Conversion with Layout Control and Formatting Options
- Conclusion
Why Convert PDF to Excel Programmatically in Python
PDFs are ideal for sharing documents with consistent formatting, but their fixed structure makes them difficult to analyze or reuse, especially if they contain tables.
Converting PDF to Excel allows you to:
- Extract tabular data for analysis or visualization
- Automate monthly or recurring report extraction
- Enable downstream processing in Excel
- Save hours of manual copy-pasting
Using Python for this task adds automation, flexibility, and scalability — ideal for integration into data pipelines or backend services.
Setting Up Your Development Environment
Before you start converting PDF files to Excel using Python, it’s essential to set up your development environment properly. This ensures you have all the necessary tools and libraries installed to follow the tutorial smoothly.
Install Python
If you haven’t already installed Python on your system, download and install the latest version from the official website.
Make sure to add Python to your system PATH during installation to run Python commands from the terminal or command prompt easily.
Install Spire.PDF for Python
Spire.PDF for Python is the core library used in this tutorial to load, manipulate, and convert PDF documents.
To install it, open your terminal and run:
pip install Spire.PDF
This command downloads and installs Spire.PDF along with any required dependencies.
If you encounter any issues or need detailed installation help, please refer to our step-by-step guide: How to Install Spire.PDF for Python on Windows
Quick Start: Convert PDF to Excel in Python
If your PDF has a clean and simple layout without complex formatting or multiple page structures, you can convert it directly to Excel with just a few lines of code using Spire.PDF for Python.
Steps to Quickly Export PDF to Excel
Follow these straightforward steps to export your PDF file to an Excel spreadsheet in Python:
- Import the required classes.
- Create a PdfDocument object.
- Load your PDF file with the LoadFromFile method.
- Export the PDF to Excel (.xlsx) format using the SaveToFile method and specify FileFormat.XLSX as the output format.
Code Example
from spire.pdf import *
# Create a PdfDocument object
pdf = PdfDocument()
# Load your PDF file
pdf.LoadFromFile("Sample.pdf")
# Convert and save the PDF to Excel
pdf.SaveToFile("output.xlsx", FileFormat.XLSX)
# Close the document
pdf.Close()
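For recurring reports, the same conversion usually has to run over a whole folder of files. The following sketch covers only the bookkeeping part, pairing each PDF with an output path, and leaves the conversion itself to the calls shown above. The function name plan_conversions and the folder layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def plan_conversions(input_dir, output_dir):
    """Pair every PDF in input_dir with an .xlsx path of the same name in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pairs = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        pairs.append((pdf_path, output_dir / (pdf_path.stem + ".xlsx")))
    return pairs


# Demo with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_pairs = [(src.name, dst.name) for src, dst in plan_conversions(tmp, Path(tmp) / "out")]
print(demo_pairs)
```

For each (source, target) pair, the conversion would follow the quick-start steps: create a PdfDocument, call LoadFromFile with the source path, call SaveToFile with the target path and FileFormat.XLSX, then call Close.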
Advanced PDF to Excel Conversion with Layout Control and Formatting Options
For more complex PDF documents, such as those containing multiple pages, rotated text, table cells with multiple lines of text, or overlapping content, you can use the XlsxLineLayoutOptions class to gain precise control over the conversion process.
This allows you to preserve the original structure and formatting of your PDF more accurately when exporting to Excel.
Layout Options You Can Configure
The XlsxLineLayoutOptions class in Spire.PDF provides several properties that give you granular control over how PDF content is exported to Excel. Below is a breakdown of each option and its behavior:
| Option | Description |
|---|---|
| convertToMultipleSheet | Determines whether to convert each PDF page into a separate worksheet. The default value is true. |
| rotatedText | Specifies whether to preserve the original rotation of angled text. The default value is true. |
| splitCell | Determines whether to split a PDF table cell with multiple lines of text into separate rows in the Excel output. The default value is true. |
| wrapText | Determines whether to enable word wrap inside Excel cells. The default value is true. |
| overlapText | Specifies whether text overlapping in the original PDF should be preserved in the Excel output. The default value is false. |
Code Example
from spire.pdf import *
# Create a PdfDocument object
pdf = PdfDocument()
# Load your PDF file
pdf.LoadFromFile("Sample.pdf")
# Define layout options
# Parameters: convertToMultipleSheet, rotatedText, splitCell, wrapText, overlapText
layout_options = XlsxLineLayoutOptions(True, True, False, True, False)
# Apply layout options
pdf.ConvertOptions.SetPdfToXlsxOptions(layout_options)
# Convert and save the PDF to Excel
pdf.SaveToFile("advanced_output.xlsx", FileFormat.XLSX)
# Close the document
pdf.Close()
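Five positional booleans in a row are easy to mix up. One way to keep them readable is to name each flag in a small data class and unpack it when constructing the options. This is a sketch, not part of Spire.PDF: the LayoutFlags class is hypothetical, its field order follows the parameter order given in the code comment above, and its defaults are taken from the table in this article.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class LayoutFlags:
    # Field order mirrors the parameter order listed in the article
    convert_to_multiple_sheet: bool = True
    rotated_text: bool = True
    split_cell: bool = True
    wrap_text: bool = True
    overlap_text: bool = False

    def as_args(self):
        """Return the flags as a tuple in constructor order."""
        return astuple(self)


# Override only the flag that differs from the defaults
flags = LayoutFlags(split_cell=False)
print(flags.as_args())
```

The options object from the example above could then be created as XlsxLineLayoutOptions(*flags.as_args()), which makes it clear at the call site which setting was changed.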

Conclusion
Converting PDF files to Excel in Python is an efficient way to automate data extraction and processing tasks. Whether you need a quick conversion or fine-grained layout control, Spire.PDF for Python offers flexible options that scale from simple to complex scenarios.
Ready to automate your PDF to Excel conversions?
Get a free trial license for Spire.PDF for Python and explore the full Spire.PDF Documentation to get started today!
FAQs
Q1: Can I convert each PDF page into a separate Excel worksheet?
A1: Yes. Use the convertToMultipleSheet=True option in the XlsxLineLayoutOptions class to export each page to its own sheet.
Q2: What Excel format does Spire.PDF export to?
A2: Spire.PDF converts PDFs to .xlsx, the modern Excel format supported by Excel 2007 and later.
Q3: Can I convert a PDF to Excel in Python without losing formatting?
A3: Yes. Using Spire.PDF for Python, you can retain the original formatting, including merged cells, cell background colors, and other format settings when saving PDFs to Excel.
Q4: Can I extract only a specific table from a PDF to Excel instead of converting the whole document?
A4: Yes, Spire.PDF for Python supports extracting specific tables from PDF files. You can then write the extracted table data to Excel using our Excel processing library - Spire.XLS for Python.