Extract Images from PDF in Python – A Complete Guide

2024-07-19 01:08:00 Written by Administrator

PDF files often contain critical embedded images (e.g., charts, diagrams, scanned documents). For developers, knowing how to extract images from PDF in Python allows them to repurpose graphical content for automated report generation or feed these visuals into machine learning models for analysis and OCR tasks.

This article explores how to leverage the Spire.PDF for Python library to extract images from PDF files via Python, covering the following aspects:

Installation & Environment Setup
How to Extract Images from PDFs using Python
- Example 1: Extract Images from a PDF Page
- Example 2: Extract All Images from a PDF File
Handle Different Image Formats While Extraction
Frequently Asked Questions
Conclusion (Extract Text and More)

Installation & Environment Setup

Before you start using Spire.PDF for Python to extract images from PDF, make sure you have the following in place:

Python Environment: Ensure that you have Python installed on your system. It is recommended to use the latest stable version for the best compatibility and performance.
Spire.PDF for Python Library: You need to install the Python PDF SDK, and the easiest way is using pip, the Python package installer.

Open your command prompt or terminal and run the following command:

pip install Spire.PDF
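
To confirm that the installation worked before running the examples, a small check like the one below can help. It is a minimal sketch using only the standard library; the helper name is_installed is hypothetical, and it assumes the top-level package is named spire, as the import statements later in this article suggest.

```python
import importlib.util


def is_installed(top_level_package: str) -> bool:
    """Return True if the import system can locate the named top-level package."""
    return importlib.util.find_spec(top_level_package) is not None


if __name__ == "__main__":
    # "spire" is assumed to be the top-level package installed by pip install Spire.PDF
    print("spire importable:", is_installed("spire"))
```

If the check prints False, the package was probably installed into a different Python environment than the one running the script.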

How to Extract Images from PDFs using Python

Example 1: Extract Images from a PDF Page

Here’s a complete Python script to extract and save images from a specified page in a PDF:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF file
pdf.LoadFromFile("template1.pdf")

# Get the first page
page = pdf.Pages.get_Item(0)

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()

# Get the image information on the page
imageInfo = imageHelper.GetImagesInfo(page)

# Iterate through the image information
for i in range(0, len(imageInfo)):
    # Save images to file
    imageInfo[i].Image.Save("PageImage\\Image" + str(i) + ".png")

# Release resources
pdf.Dispose()

Key Steps Explained:

Load the PDF: Use the LoadFromFile() method to load a PDF file.
Access a Page: Access a specified PDF page by index.
Extract Image Information:
- Create a PdfImageHelper instance to facilitate image extraction.
- Use the GetImagesInfo() method to retrieve image information from the specified page; it returns a list of PdfImageInfo objects.
Save Images to Files:
- Loop through all detected images on the page.
- Use the Image.Save() method of each PdfImageInfo object to save the image to disk.
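
The steps above can also be wrapped in a reusable function. The following is a sketch rather than an official recipe: it uses only the Spire.PDF calls shown in the script above, and it assumes PdfDocument and PdfImageHelper can be imported by name from spire.pdf, as the wildcard import suggests. The helper names check_page_index and extract_page_images are hypothetical. The sketch adds two safeguards that the script lacks: it creates the output folder if it is missing and rejects a page index outside the document.

```python
import os


def check_page_index(page_index: int, page_count: int) -> int:
    """Validate a zero-based page index and return it unchanged."""
    if not 0 <= page_index < page_count:
        raise IndexError(
            f"page index {page_index} is not valid for a document with {page_count} page(s)"
        )
    return page_index


def extract_page_images(pdf_path: str, page_index: int, out_dir: str) -> int:
    """Save every image found on one page as a PNG file and return the image count."""
    # Imported here so check_page_index() stays usable without Spire.PDF installed.
    # Assumption: these names are exported by spire.pdf.
    from spire.pdf import PdfDocument, PdfImageHelper

    os.makedirs(out_dir, exist_ok=True)  # create the output folder if needed
    pdf = PdfDocument()
    try:
        pdf.LoadFromFile(pdf_path)
        page = pdf.Pages.get_Item(check_page_index(page_index, pdf.Pages.Count))
        image_info = PdfImageHelper().GetImagesInfo(page)
        for i in range(len(image_info)):
            image_info[i].Image.Save(os.path.join(out_dir, f"Image{i}.png"))
        return len(image_info)
    finally:
        # Release resources even if loading or saving fails
        pdf.Dispose()
```

A call such as extract_page_images("template1.pdf", 0, "PageImage") would then mirror the script above.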

Output:

Extract all images from the first page in a PDF

Example 2: Extract All Images from a PDF File

Building on the single-page extraction method, you can iterate through all pages of the PDF document to extract every embedded image.

Python code example:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF file
pdf.LoadFromFile("template1.pdf")

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()

# Iterate through the pages in the document
for i in range(0, pdf.Pages.Count):
    # Get the current page
    page = pdf.Pages.get_Item(i)
    # Get the image information on the page
    imageInfo = imageHelper.GetImagesInfo(page)
    # Iterate through the image information items
    for j in range(0, len(imageInfo)):
        # Save the current image to file
        imageInfo[j].Image.Save(f"Images\\Image{i}_{j}.png")

# Release resources
pdf.Close()

Output:

Extract all images from an entire PDF file.
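
The scripts above build output paths with a Windows-style backslash ("Images\\Image..."). On Linux or macOS a backslash is not a path separator, so the file would likely be created with a literal backslash in its name instead of inside an Images folder. A minimal sketch of a portable alternative using the standard pathlib module follows; the helper name image_path and the zero-padded naming scheme are illustrative choices, not part of Spire.PDF.

```python
from pathlib import Path


def image_path(out_dir: str, page_index: int, image_index: int, ext: str = "png") -> Path:
    """Build an output path from zero-based page and image indices.

    The file name uses one-based, zero-padded numbers so that files sort
    in page order in a file browser.
    """
    return Path(out_dir) / f"page{page_index + 1:03d}_img{image_index + 1:02d}.{ext}"


# Possible use inside the nested loop of Example 2 (i = page index, j = image index):
#     imageInfo[j].Image.Save(str(image_path("Images", i, j)))
print(image_path("Images", 0, 0).name)
```

The output folder still has to exist before saving; Path("Images").mkdir(parents=True, exist_ok=True) takes care of that.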

Handle Different Image Formats While Extraction

Spire.PDF for Python supports extracting images in various formats such as PNG, JPG/JPEG, BMP, etc. When saving the extracted images, you can choose the appropriate format based on your needs.

Common Image Formats:

Format	Best Use Cases	PDF Extraction Notes
JPG/JPEG	Photos, scanned documents	Common in PDFs; quality loss on re-compress
PNG	Web graphics, diagrams, screenshots	Preserves transparency; larger file sizes
BMP	Windows applications, temp storage	Rare in modern PDFs; avoid for web use
TIFF	Archiving, print, OCR input	Ideal for document preservation; multi-page
EMF	Windows vector editing	Editable in Illustrator/Inkscape
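
A file extension does not guarantee what a file actually contains, so it can be useful to verify the format of a saved image. The sketch below identifies four of the formats in the table by their leading signature bytes, using only the standard library. The function name sniff_image_format is hypothetical, and EMF is left out because its signature is not at the start of the file.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess an image format from the first bytes of a file."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if data.startswith(b"BM"):
        return "BMP"
    if data.startswith((b"II*\x00", b"MM\x00*")):  # little-endian / big-endian TIFF
        return "TIFF"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first bytes of a file on disk and guess its image format."""
    with open(path, "rb") as handle:
        return sniff_image_format(handle.read(8))
```

For example, sniff_file("Images/Image0_0.png") reports whether a file saved with a .png extension really holds PNG data.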

Frequently Asked Questions

Q1: Is Spire.PDF for Python a free library?

Spire.PDF for Python offers both free and commercial versions. The free version has limitations, such as a maximum of 10 pages per PDF. For commercial use or to remove these restrictions, you can request a trial license.

Q2: Can I extract images from a specified page range only?

Yes. Instead of iterating through all pages, specify the page indices you want. For example, to extract images from pages 2 to 5:

# Extract images from pages 2 to 5
for i in range(1, 5): # Pages are zero-indexed, so pages 2 to 5 are indices 1 to 4
    page = pdf.Pages.get_Item(i)
    # Process images as before
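
Off-by-one mistakes are easy to make when converting page numbers to indices. A small helper such as the following can do the conversion in one place; it is a sketch, and the name page_indices is hypothetical. It takes one-based, inclusive page numbers and clamps the end of the range to the document length.

```python
def page_indices(first_page: int, last_page: int, page_count: int) -> range:
    """Convert a one-based, inclusive page range into zero-based page indices.

    The range is clamped so it never runs past the last page of the document.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return range(first_page - 1, min(last_page, page_count))


print(list(page_indices(2, 5, 10)))  # pages 2 to 5 of a 10-page document
# → [1, 2, 3, 4]
```

In the loop above, this would read: for i in page_indices(2, 5, pdf.Pages.Count).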

Q3: Is it possible to extract text from images?

Yes. For scanned PDF files, you can first extract the images and then recognize the text in them with the Spire.OCR for Python library.

A step-by-step guide: How to Extract Text from Image Using Python (OCR Code Examples)

Conclusion (Extract Text and More)

Spire.PDF simplifies image extraction from PDF in Python with minimal code. By following this guide, you can:

Extract images from single pages or entire PDF documents.
Save images from PDF in various formats (PNG, JPG, BMP or TIFF).

Because a PDF document can contain many kinds of content, the Python PDF library can also extract other elements, such as text.

Python: Add Bookmarks to a Word Document

2023-10-07 01:13:31 Written by Koohji

Adding bookmarks to Word documents is a useful feature that allows users to mark specific locations within their documents for quick reference or navigation. Bookmarks serve as virtual placeholders, making it easier to find and revisit important sections of a document without scrolling through lengthy pages. In this article, you will learn how to add bookmarks to a Word document in Python using Spire.Doc for Python.

Add Bookmarks to a Paragraph in Python
Add Bookmarks to Selected Text in Python

Install Spire.Doc for Python

This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

Package Manager

pip install Spire.Doc

If you are unsure how to install, please refer to this tutorial: How to Install Spire.Doc for Python on Windows

Add Bookmarks to a Paragraph in Python

Spire.Doc for Python offers the BookmarkStart to represent the start of a bookmark and the BookmarkEnd to represent the end of a bookmark. To bookmark a paragraph, a BookmarkStart object is placed at the beginning of the paragraph and a BookmarkEnd object is appended at the end of the paragraph. The following are the detailed steps.

Create a Document object.
Load a Word file using Document.LoadFromFile() method.
Get a specific paragraph through Document.Sections[index].Paragraphs[index] property.
Create a BookmarkStart using Paragraph.AppendBookmarkStart() method and insert it at the beginning of the paragraph using Paragraph.Items.Insert() method.
Append a BookmarkEnd at the end of the paragraph using Paragraph.AppendBookmarkEnd() method.
Save the document to a different Word file using Document.SaveToFile() method.

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()

# Load a sample Word file
doc.LoadFromFile('C:/Users/Administrator/Desktop/input.docx')

# Get the third paragraph (index 2; indices are zero-based)
paragraph = doc.Sections[0].Paragraphs[2]

# Create a bookmark start
start = paragraph.AppendBookmarkStart('myBookmark')

# Insert it at the beginning of the paragraph
paragraph.Items.Insert(0, start)

# Append a bookmark end at the end of the paragraph
paragraph.AppendBookmarkEnd('myBookmark')

# Save the file
doc.SaveToFile('output/AddBookmarkToParagraph.docx', FileFormat.Docx2019)

Add Bookmarks to Selected Text in Python

To bookmark a piece of text, first find the text in the document and get its position inside its owner paragraph. Then place a BookmarkStart before it and a BookmarkEnd after it. The detailed steps are as follows.

Create a Document object.
Load a Word file using Document.LoadFromFile() method.
Find the string to be marked from the document.
Get its owner paragraph and its position inside the paragraph.
Insert a BookmarkStart before the text and a BookmarkEnd after the text.
Save the document to a different Word file using Document.SaveToFile() method.

Python

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()

# Load a sample Word file
doc.LoadFromFile('C:/Users/Administrator/Desktop/input.docx')

# Specify the string to find
stringToFind = 'programming paradigms'

# Find the selected text from the document
finds = doc.FindAllString(stringToFind, False, True)
specificText = finds[0]

# Find the paragraph where the text is located
paragraph = specificText.GetAsOneRange().OwnerParagraph

# Get the index of the text in the paragraph
index = paragraph.ChildObjects.IndexOf(specificText.GetAsOneRange())

# Create a bookmark start
start = paragraph.AppendBookmarkStart("myBookmark")

# Insert the bookmark start at the index position
paragraph.ChildObjects.Insert(index, start)

# Create a bookmark end
end = paragraph.AppendBookmarkEnd("myBookmark")

# Insert the bookmark end at the end of the selected text
paragraph.ChildObjects.Insert(index + 2, end)

# Save the document to a different file
doc.SaveToFile("output/AddBookmarkToSelectedText.docx", FileFormat.Docx2019)
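
The index + 2 offset in the script above can look arbitrary. The sketch below models the paragraph's child objects as a plain Python list to show where it comes from. This is only an illustration and does not use the Spire.Doc API; the function name and the placeholder strings are hypothetical.

```python
def wrap_with_bookmark(children: list, index: int) -> list:
    """Return a copy of children with bookmark markers around the item at index."""
    result = list(children)
    # Inserting the start marker shifts the target text from index to index + 1 ...
    result.insert(index, "<BookmarkStart>")
    # ... so the end marker belongs at index + 2, right after the text.
    result.insert(index + 2, "<BookmarkEnd>")
    return result


children = ["Python supports several ", "programming paradigms", "."]
print(wrap_with_bookmark(children, 1))
# → ['Python supports several ', '<BookmarkStart>', 'programming paradigms', '<BookmarkEnd>', '.']
```

Note also that the script takes finds[0] without checking the search result, so it is worth confirming that the string was actually found before using it.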

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Python: Split a PDF File into Multiple PDFs

2023-09-28 03:20:37 Written by Koohji

Large PDF files can sometimes be cumbersome to handle, especially when sharing or uploading them. Splitting a large PDF file into multiple smaller PDFs reduces the file size, making it more manageable and quicker to open and process. In this article, we will demonstrate how to split PDF documents in Python using Spire.PDF for Python.

Split a PDF File into Multiple Single-Page PDFs in Python
Split a PDF File by Page Ranges in Python

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

Package Manager

pip install Spire.PDF

If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Split a PDF File into Multiple Single-Page PDFs in Python

Spire.PDF for Python offers the PdfDocument.Split() method to divide a multi-page PDF document into multiple single-page PDF files. The following are the detailed steps.

Create a PdfDocument object.
Load a PDF document using PdfDocument.LoadFromFile() method.
Split the document into multiple single-page PDFs using PdfDocument.Split() method.

Python

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Sample.pdf")

# Split the PDF file into multiple single-page PDFs
doc.Split("Output/SplitDocument-{0}.pdf", 1)

# Close the PdfDocument object
doc.Close()
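
To preview the file names a split would produce, the pattern can be expanded in plain Python. This sketch assumes that Split() replaces {0} with a counter that starts at the second argument and increases by one per page, as the call above suggests; that behavior is not confirmed here. The helper name split_output_names is hypothetical.

```python
def split_output_names(pattern: str, page_count: int, start_number: int = 1) -> list[str]:
    """Expand a '{0}' file name pattern into one name per page."""
    return [pattern.format(start_number + i) for i in range(page_count)]


for name in split_output_names("Output/SplitDocument-{0}.pdf", 3):
    print(name)
# → Output/SplitDocument-1.pdf
# → Output/SplitDocument-2.pdf
# → Output/SplitDocument-3.pdf
```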

Split a PDF File by Page Ranges in Python

To split a PDF file into two or more PDF files by page ranges, you need to create two or more new PDF files, and then import the specific page or range of pages from the source PDF into the newly created PDF files. The following are the detailed steps.

Create a PdfDocument object.
Load a PDF document using PdfDocument.LoadFromFile() method.
Create three PdfDocument objects.
Import the first page from the source file into the first document using PdfDocument.InsertPage() method.
Import pages 2-4 from the source file into the second document using PdfDocument.InsertPageRange() method.
Import the remaining pages from the source file into the third document using PdfDocument.InsertPageRange() method.
Save the three documents using PdfDocument.SaveToFile() method.

Python

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Sample.pdf")

# Create three PdfDocument objects
newDoc_1 = PdfDocument()
newDoc_2 = PdfDocument()
newDoc_3 = PdfDocument()

# Insert the first page of the source file into the first document
newDoc_1.InsertPage(doc, 0)

# Insert pages 2-4 of the source file into the second document
newDoc_2.InsertPageRange(doc, 1, 3)

# Insert the remaining pages of the source file into the third document
newDoc_3.InsertPageRange(doc, 4, doc.Pages.Count - 1)

# Save the three documents
newDoc_1.SaveToFile("Output1/Split-1.pdf")
newDoc_2.SaveToFile("Output1/Split-2.pdf")
newDoc_3.SaveToFile("Output1/Split-3.pdf")

# Close the PdfDocument objects
doc.Close()
newDoc_1.Close()
newDoc_2.Close()
newDoc_3.Close()
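
The hard-coded indices above (0, then 1 to 3, then 4 to the last page) can be derived from a list of starting pages. The following sketch computes zero-based, inclusive (start, end) pairs of the kind passed to InsertPageRange() in the script; the helper name ranges_from_starts is hypothetical and the function does not call Spire.PDF itself.

```python
def ranges_from_starts(starts: list[int], page_count: int) -> list[tuple[int, int]]:
    """Turn the zero-based first page index of each part into inclusive (start, end) pairs."""
    if not starts or starts[0] != 0:
        raise ValueError("the first part must start at page index 0")
    bounds = starts + [page_count]
    ranges = []
    for start, next_start in zip(bounds, bounds[1:]):
        if not start < next_start:
            raise ValueError("starts must be strictly increasing and below page_count")
        ranges.append((start, next_start - 1))
    return ranges


print(ranges_from_starts([0, 1, 4], 7))  # same layout as the script, for a 7-page file
# → [(0, 0), (1, 3), (4, 6)]
```

Each pair could then drive one InsertPageRange(doc, start, end) call, which also guards against a source file that has fewer pages than the script expects.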

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.
