C++: Extract Text and Images from PDF

Extracting text and images from PDF files lets you reuse that content in other types of files, such as Word documents, web pages, or presentations, without retyping it from scratch. In this article, you will learn how to extract text and images from a PDF file in C++ using Spire.PDF for C++.

Extract Text from a PDF File in C++
Extract Text from a Specific Page Area in a PDF File in C++
Extract Images from a PDF File in C++

Install Spire.PDF for C++

There are two ways to integrate Spire.PDF for C++ into your application: install it through NuGet, or download the package from our website and copy the libraries into your program. Installation via NuGet is simpler and is the recommended approach. You can find more details by visiting the following link.

Integrate Spire.PDF for C++ in a C++ Application

Extract Text from a PDF File in C++

Spire.PDF for C++ provides the PdfTextExtractor class to extract text from PDF pages. The detailed steps are as follows:

Initialize an instance of the PdfDocument class.
Load a PDF file using the PdfDocument->LoadFromFile() method.
Iterate through all pages in the file.
For each page, create a PdfTextExtractor object and use its ExtractText() method to extract the text content.
Save the extracted text to a .txt file.

#include "Spire.Pdf.o.h"
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

using namespace std;
using namespace Spire::Pdf;

int main()
{
	// Create a new instance of PdfDocument
	intrusive_ptr<PdfDocument> doc = new PdfDocument();

	// Load the PDF document from the input file
	doc->LoadFromFile(L"Input.pdf");

	// Variable to hold the extracted text
	wstring buffer = L"";

	// Iterate through all pages in the file
	for (int i = 0; i < doc->GetPages()->GetCount(); i++)
	{
		// Get the current page
		intrusive_ptr<PdfPageBase> page = doc->GetPages()->GetItem(i);

		// Create a text extractor for the specified page
		intrusive_ptr<PdfTextExtractor> textExtractor = new PdfTextExtractor(page);

		// Create options for text extraction
		intrusive_ptr<PdfTextExtractOptions> textExtractorOption = new PdfTextExtractOptions();

		// Extract text from the page
		buffer += textExtractor->ExtractText(textExtractorOption);
	}

	// Save the extracted text to a .txt file
	wofstream write(L"ExtractText.txt");
	auto LocUtf8 = locale(locale(""), new std::codecvt_utf8<wchar_t>);
	write.imbue(LocUtf8);
	write << buffer;
	write.close();
	doc->Close();
}
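
The example above writes the text through a stream imbued with std::codecvt_utf8, a facet that has been deprecated since C++17. As a rough alternative, the sketch below encodes the wide string to UTF-8 by hand and writes the bytes with a plain std::ofstream. ToUtf8 and SaveAsUtf8 are hypothetical helper names introduced for illustration; they are not part of Spire.PDF, and the sketch assumes the extracted text is held in a std::wstring as in the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: encode a wide string as UTF-8 bytes.
// Works with 16-bit wchar_t (UTF-16) and 32-bit wchar_t (UTF-32).
std::string ToUtf8(const std::wstring& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(text[i]);

        // Combine a UTF-16 surrogate pair into a single code point
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size())
        {
            std::uint32_t low = static_cast<std::uint32_t>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        // Replace any unpaired surrogate with U+FFFD
        if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD;
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: write the text to a file as UTF-8.
// Returns false if the file cannot be opened or written.
bool SaveAsUtf8(const std::string& path, const std::wstring& text)
{
    std::ofstream file(path, std::ios::binary);
    if (!file)
    {
        return false;
    }
    file << ToUtf8(text);
    return file.good();
}
```

With these helpers, the saving step in the example could be reduced to a call such as SaveAsUtf8("ExtractText.txt", buffer), and the return value gives a simple way to detect a failed write.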

Extract Text from a Specific Page Area in a PDF File in C++

You can extract text from a specific rectangular area of a PDF page using the PdfTextExtractor class and PdfTextExtractOptions in Spire.PDF for C++. The detailed steps are as follows:

Initialize an instance of the PdfDocument class.
Load a PDF file using the PdfDocument->LoadFromFile() method.
Iterate through all pages in the document or access a specific page using its index.
Create a PdfTextExtractor object for the specified page.
Define a rectangular area using RectangleF and set it in PdfTextExtractOptions using the SetExtractArea() method.
Extract text from the specified area using the PdfTextExtractor->ExtractText() method.
Save the extracted text to a .txt file.

#include "Spire.Pdf.o.h"
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

using namespace Spire::Pdf;
using namespace std;

int main()
{
	// Create a new instance of PdfDocument
	intrusive_ptr<PdfDocument> doc = new PdfDocument();

	// Load the PDF document from the input file
	doc->LoadFromFile(L"Input.pdf");

	// Variable to hold the extracted text
	wstring buffer = L"";

	// Iterate through all pages in the file
	for (int i = 0; i < doc->GetPages()->GetCount(); i++)
	{
		// Get the current page
		intrusive_ptr<PdfPageBase> page = doc->GetPages()->GetItem(i);

		// Create a text extractor for the specified page
		intrusive_ptr<PdfTextExtractor> textExtractor = new PdfTextExtractor(page);

		// Create options for text extraction and restrict them to a rectangular area
		// (arguments: x, y, width, height)
		intrusive_ptr<PdfTextExtractOptions> textExtractorOption = new PdfTextExtractOptions();
		textExtractorOption->SetExtractArea(new RectangleF(0, 0, 600, 200));

		// Extract text from the specified area of the page
		buffer += textExtractor->ExtractText(textExtractorOption);
	}

	// Save the extracted text to a .txt file
	wofstream write(L"ExtractTextFromPageArea.txt");
	auto LocUtf8 = locale(locale(""), new std::codecvt_utf8<wchar_t>);
	write.imbue(LocUtf8);
	write << buffer;
	write.close();
	doc->Close();
}
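
The rectangle in the example is given as four bare numbers (0, 0, 600, 200). If, as is common for PDF libraries, these values are measured in points (1/72 inch), an assumption worth checking against the Spire.PDF documentation, then a region measured on paper in millimetres needs converting first. The sketch below shows that arithmetic with standard C++ only; AreaInPoints, MmToPoints, and AreaFromMillimetres are hypothetical names introduced for illustration and are not part of Spire.PDF.

```cpp
// Hypothetical plain struct holding a rectangle in points (1/72 inch)
struct AreaInPoints
{
    float x;
    float y;
    float width;
    float height;
};

constexpr float kPointsPerInch = 72.0f;
constexpr float kMillimetresPerInch = 25.4f;

// Convert a length in millimetres to points
constexpr float MmToPoints(float mm)
{
    return mm * kPointsPerInch / kMillimetresPerInch;
}

// Build a rectangle in points from a position and size given in millimetres
constexpr AreaInPoints AreaFromMillimetres(float xMm, float yMm, float widthMm, float heightMm)
{
    return AreaInPoints{ MmToPoints(xMm), MmToPoints(yMm), MmToPoints(widthMm), MmToPoints(heightMm) };
}
```

Under the same assumption, the four fields of the result could then be passed as the arguments of the RectangleF used in the example. For instance, a strip 210 mm wide (the width of an A4 page) and 70 mm tall comes out at roughly 595 by 198 points, close to the 600 by 200 rectangle used above.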

Extract Images from a PDF File in C++

You can use the PdfImageHelper class to extract images from the pages in a PDF file. The detailed steps are as follows:

Initialize an instance of the PdfDocument class.
Load a PDF file using the PdfDocument->LoadFromFile() method.
Initialize an instance of the PdfImageHelper class to assist with image extraction.
Iterate through all pages in the document.
For each page, retrieve information about all images on the page using the PdfImageHelper->GetImagesInfo() method.
Iterate through the extracted images and save them to PNG files.

#include "Spire.Pdf.o.h"
#include <string>
#include <vector>

using namespace Spire::Pdf;
using namespace std;

int main()
{
	// Create a new instance of PdfDocument
	intrusive_ptr<PdfDocument> doc = new PdfDocument();

	// Load the PDF document from the input file
	doc->LoadFromFile(L"Sample.pdf");

	// Initialize an image helper to extract images from the page
	intrusive_ptr<PdfImageHelper> imagehelper = new PdfImageHelper();

	int imageIndex = 0;

	// Iterate through all pages in the file
	for (int i = 0; i < doc->GetPages()->GetCount(); i++)
	{
		// Get the current page
		intrusive_ptr<PdfPageBase> page = doc->GetPages()->GetItem(i);

		// Retrieve information about all images on the page
		vector<intrusive_ptr<Utilities_PdfImageInfo>> exImages = imagehelper->GetImagesInfo(page);

		// Iterate through images and save them
		for (auto image : exImages)
		{
			std::wstring imageFileName = L"Image\\Image-" + to_wstring(imageIndex) + L".png";
			image->GetImage()->Save(imageFileName.c_str());
			imageIndex++;
		}
	}

	// Release the resources held by the document
	doc->Close();
}
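
The example saves every image under an "Image" folder using a Windows-style path separator. If the library does not create that folder itself (the source does not say either way), saving could fail when the folder is missing. The sketch below uses std::filesystem from C++17 to create the folder on demand and to build a portable file path. PrepareImagePath is a hypothetical helper name introduced for illustration; it is not part of Spire.PDF.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: make sure the output directory exists and
// return the path "<directory>/Image-<index>.png".
std::filesystem::path PrepareImagePath(const std::filesystem::path& directory, int index)
{
    // Creates the directory and any missing parents; does nothing if it already exists
    std::filesystem::create_directories(directory);

    // Build the file name, then join it with the platform's preferred separator
    std::filesystem::path fileName(L"Image-" + std::to_wstring(index) + L".png");
    return directory / fileName;
}
```

In the loop above, the result of PrepareImagePath(L"Image", imageIndex) could be converted with wstring() and passed to the Save() call in place of the hand-built file name.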

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents or lift the feature limitations, please request a 30-day trial license.
