Extracting text and images from PDF files lets you reuse that content in other types of files, such as Word documents, web pages, or presentations, without retyping it from scratch. In this article, you will learn how to extract text and images from a PDF file in C++ using Spire.PDF for C++.
- Extract Text from a PDF File in C++
- Extract Text from a Specific Page Area in a PDF File in C++
- Extract Images from a PDF File in C++
Install Spire.PDF for C++
There are two ways to integrate Spire.PDF for C++ into your application. One is to install it through NuGet; the other is to download the package from our website and copy the libraries into your program. Installation via NuGet is simpler and is the recommended approach. For details, see the following link.
Integrate Spire.PDF for C++ in a C++ Application
Extract Text from a PDF File in C++
Spire.PDF for C++ provides the PdfTextExtractor class to extract text from PDF pages. The detailed steps are as follows:
- Initialize an instance of the PdfDocument class.
- Load a PDF file using the PdfDocument->LoadFromFile() method.
- Iterate through all pages in the file.
- For each page, create a PdfTextExtractor object and use its ExtractText() method to extract the text content.
- Save the extracted text to a .txt file.
- C++
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include "Spire.Pdf.o.h"

using namespace std;
using namespace Spire::Pdf;

int main()
{
    // Create a new instance of PdfDocument
    intrusive_ptr<PdfDocument> doc = new PdfDocument();

    // Load the PDF document from the input file
    doc->LoadFromFile(L"Input.pdf");

    // Variable to hold the extracted text
    wstring buffer = L"";

    // Iterate through all pages in the file
    for (int i = 0; i < doc->GetPages()->GetCount(); i++)
    {
        // Get the current page
        intrusive_ptr<PdfPageBase> page = doc->GetPages()->GetItem(i);

        // Create a text extractor for the current page
        intrusive_ptr<PdfTextExtractor> textExtractor = new PdfTextExtractor(page);

        // Create options for text extraction (defaults: whole page)
        intrusive_ptr<PdfTextExtractOptions> textExtractorOption = new PdfTextExtractOptions();

        // Extract text from the page and append it to the buffer
        buffer += textExtractor->ExtractText(textExtractorOption);
    }

    // Save the extracted text to a .txt file as UTF-8
    // (a narrow file name is used because the wide-string constructor is an MSVC extension;
    // std::codecvt_utf8 is deprecated since C++17 but still available)
    wofstream write("ExtractText.txt");
    auto LocUtf8 = locale(locale(""), new std::codecvt_utf8<wchar_t>);
    write.imbue(LocUtf8);
    write << buffer;
    write.close();

    // Release the document
    doc->Close();
}
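
The sample above relies on std::codecvt_utf8, which has been deprecated since C++17. If that is a concern in your build, one possible alternative is to encode the wide string by hand and write the bytes with a plain ofstream. The following is only a minimal sketch under that assumption: ToUtf8 and SaveAsUtf8 are hypothetical helper names, not part of Spire.PDF, and unpaired surrogates are passed through without validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper (not part of Spire.PDF): encodes a wide string as UTF-8.
// Works with 32-bit wchar_t (one code point per element) and with
// 16-bit wchar_t (UTF-16, where surrogate pairs are combined first).
std::string ToUtf8(const std::wstring& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(text[i]);

        // Combine a UTF-16 high/low surrogate pair into a single code point
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size())
        {
            std::uint32_t low = static_cast<std::uint32_t>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: writes the text to a file as UTF-8 bytes.
// Returns false if the file could not be opened or written.
bool SaveAsUtf8(const std::string& fileName, const std::wstring& text)
{
    std::ofstream file(fileName, std::ios::binary);
    if (!file)
    {
        return false;
    }
    file << ToUtf8(text);
    return static_cast<bool>(file);
}
```

With these helpers, the last step of the sample could be written as SaveAsUtf8("ExtractText.txt", buffer), which also reports whether the write succeeded.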

Extract Text from a Specific Page Area in a PDF File in C++
You can extract text from a specific rectangular area of a PDF page using the PdfTextExtractor class and PdfTextExtractOptions in Spire.PDF for C++. The detailed steps are as follows:
- Initialize an instance of the PdfDocument class.
- Load a PDF file using the PdfDocument->LoadFromFile() method.
- Iterate through all pages in the document or access a specific page using its index.
- Create a PdfTextExtractor object for the specified page.
- Define a rectangular area using RectangleF and set it in PdfTextExtractOptions using the SetExtractArea() method.
- Extract text from the specified area using the PdfTextExtractor->ExtractText() method.
- Save the extracted text to a .txt file.
- C++
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include "Spire.Pdf.o.h"

using namespace Spire::Pdf;
using namespace std;

int main()
{
    // Create a new instance of PdfDocument
    intrusive_ptr<PdfDocument> doc = new PdfDocument();

    // Load the PDF document from the input file
    doc->LoadFromFile(L"Input.pdf");

    // Variable to hold the extracted text
    wstring buffer = L"";

    // Iterate through all pages in the file
    for (int i = 0; i < doc->GetPages()->GetCount(); i++)
    {
        // Get the current page
        intrusive_ptr<PdfPageBase> page = doc->GetPages()->GetItem(i);

        // Create a text extractor for the current page
        intrusive_ptr<PdfTextExtractor> textExtractor = new PdfTextExtractor(page);

        // Create options for text extraction and limit them to a rectangular area
        intrusive_ptr<PdfTextExtractOptions> textExtractorOption = new PdfTextExtractOptions();
        textExtractorOption->SetExtractArea(new RectangleF(0, 0, 600, 200));

        // Extract text from the specified area of the page
        buffer += textExtractor->ExtractText(textExtractorOption);
    }

    // Save the extracted text to a .txt file as UTF-8
    // (a narrow file name is used because the wide-string constructor is an MSVC extension)
    wofstream write("ExtractTextFromPageArea.txt");
    auto LocUtf8 = locale(locale(""), new std::codecvt_utf8<wchar_t>);
    write.imbue(LocUtf8);
    write << buffer;
    write.close();

    // Release the document
    doc->Close();
}
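
The rectangle in the sample is given as raw numbers, which can be hard to relate to a printed page. Assuming the four RectangleF arguments are x, y, width, and height expressed in points (1 point = 1/72 inch, the default PDF unit), a small conversion helper can make the intent clearer. The sketch below is illustrative only: AreaInPoints, MillimetersToPoints, and AreaFromMillimeters are hypothetical names, not part of Spire.PDF, and the argument order and unit are assumptions you should check against the library documentation.

```cpp
// Hypothetical plain struct mirroring the four values passed to RectangleF
// (assumed order: x, y, width, height; assumed unit: points)
struct AreaInPoints
{
    float x;
    float y;
    float width;
    float height;
};

constexpr float kPointsPerInch = 72.0f;       // default PDF user-space unit
constexpr float kMillimetersPerInch = 25.4f;

// Converts a length in millimeters to points
constexpr float MillimetersToPoints(float millimeters)
{
    return millimeters * kPointsPerInch / kMillimetersPerInch;
}

// Builds an area in points from measurements taken in millimeters
AreaInPoints AreaFromMillimeters(float xMm, float yMm, float widthMm, float heightMm)
{
    return AreaInPoints{
        MillimetersToPoints(xMm),
        MillimetersToPoints(yMm),
        MillimetersToPoints(widthMm),
        MillimetersToPoints(heightMm)
    };
}
```

For example, a strip spanning the full width of an A4 page (210 mm) and 50 mm tall could be computed with AreaFromMillimeters(0, 0, 210, 50) and its four fields passed to new RectangleF(...) in place of the literal values. An A4 page is roughly 595 points wide, so the width of 600 used in the sample covers the whole page width.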

Extract Images from a PDF File in C++
You can use the PdfImageHelper class to extract images from the pages in a PDF file. The detailed steps are as follows:
- Initialize an instance of the PdfDocument class.
- Load a PDF file using the PdfDocument->LoadFromFile() method.
- Initialize an instance of the PdfImageHelper class to assist with image extraction.
- Iterate through all pages in the document.
- For each page, retrieve information about all images on the page using the PdfImageHelper->GetImagesInfo() method.
- Iterate through the extracted images and save them to PNG files.
- C++
#include <string>
#include <vector>
#include "Spire.Pdf.o.h"

using namespace Spire::Pdf;
using namespace std;

int main()
{
    // Create a new instance of PdfDocument
    intrusive_ptr<PdfDocument> doc = new PdfDocument();

    // Load the PDF document from the input file
    doc->LoadFromFile(L"Sample.pdf");

    // Initialize an image helper to extract images from the pages
    intrusive_ptr<PdfImageHelper> imagehelper = new PdfImageHelper();

    // Running counter used to give each image a unique file name
    int imageIndex = 0;

    // Iterate through all pages in the file
    for (int i = 0; i < doc->GetPages()->GetCount(); i++)
    {
        // Get the current page
        intrusive_ptr<PdfPageBase> page = doc->GetPages()->GetItem(i);

        // Retrieve information about all images on the page
        vector<intrusive_ptr<Utilities_PdfImageInfo>> exImages = imagehelper->GetImagesInfo(page);

        // Iterate through the images and save each one as a PNG file in the "Image" folder
        for (const auto& image : exImages)
        {
            std::wstring imageFileName = L"Image\\Image-" + to_wstring(imageIndex) + L".png";
            image->GetImage()->Save(imageFileName.c_str());
            imageIndex++;
        }
    }

    // Release the document
    doc->Close();
}
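
The sample writes its output into an Image folder using a Windows-style path separator. This article does not state whether Save() creates a missing folder, so a cautious approach is to create it beforehand and build the path in a portable way. The following minimal sketch uses std::filesystem (C++17) for that; PrepareImagePath is a hypothetical helper name, not part of Spire.PDF.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: makes sure the output folder exists and returns
// the path "<folder>/Image-<index>.png" using the platform's separator.
std::filesystem::path PrepareImagePath(const std::filesystem::path& folder, int index)
{
    // Creates the folder (and any missing parents); does nothing if it already exists
    std::filesystem::create_directories(folder);

    // Build the file name from the running image index
    std::filesystem::path fileName(L"Image-" + std::to_wstring(index) + L".png");

    return folder / fileName;
}
```

Inside the inner loop of the sample, the file name could then be obtained with PrepareImagePath(L"Image", imageIndex).wstring() and passed to Save() through c_str(), as the original code does with imageFileName.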

Apply for a Temporary License
If you'd like to remove the evaluation message from the generated documents, or to lift the functional limitations, you can request a 30-day trial license.
