Perform OCR on Scanned PDFs in C# for Text Extraction

Optical Character Recognition (OCR) has become essential for developers working with scanned documents and image-based PDFs. In this tutorial, you will learn how to perform OCR on PDFs in C# to extract text from scanned documents or from images within a PDF, using the Spire.PDF for .NET and Spire.OCR for .NET libraries. By converting scanned PDFs into editable and searchable formats, you can significantly improve your document management processes.

Table of Contents:

Why Is OCR Needed for Scanned PDFs?
Setting Up: Installing Required Libraries
Performing OCR on Scanned PDFs
Extracting Text from Images within PDFs
Wrapping Up
FAQs

Why Is OCR Needed for Scanned PDFs?

Scanned PDFs are essentially image files: they contain pictures of text rather than actual selectable and searchable text content. When you scan a paper document or receive an image-based PDF, the text exists only as pixels, making it impossible to edit, search, or extract. This creates significant limitations for businesses and individuals who need to work with these documents digitally.

OCR technology solves this problem by analyzing the shapes of letters and numbers in scanned images and converting them into machine-readable text. This process transforms static PDFs into usable, searchable, and editable documents—enabling text extraction, keyword searches, and seamless integration with databases and workflow automation tools.

In fields such as legal, healthcare, and education, where large volumes of scanned documents are common, OCR plays a crucial role in document digitization, making important data easily accessible and actionable.

Setting Up: Installing Required Libraries

Before we dive into the code, let's first set up our development environment with the necessary components: Spire.PDF and Spire.OCR. Spire.PDF handles PDF operations, while Spire.OCR performs the actual text recognition.

Step 1. Install Spire.PDF and Spire.OCR via NuGet

To begin, open the NuGet Package Manager in Visual Studio, and search for "Spire.PDF" and "Spire.OCR" to install them in your project. Alternatively, you can use the Package Manager Console:

Install-Package Spire.PDF
Install-Package Spire.OCR

Step 2. Download OCR Models

Spire.OCR requires pre-trained language models for text recognition. Download the appropriate model files for your operating system (Windows, Linux, or macOS) and extract them to a directory (e.g., D:\win-x64).

Important Note: Ensure your project targets the x64 platform (Project Properties > Build > Platform target), as Spire.OCR only supports 64-bit systems.

Set platform target to x64.
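
Both setup requirements above (a 64-bit process and an extracted model folder) can be checked at startup with standard .NET calls before the scanner is configured. The following is a minimal sketch; the class and method names (OcrPreflight, FindProblems) are hypothetical helpers written for illustration and are not part of Spire.OCR.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of Spire.OCR): collects setup problems
// so they can be reported before any OCR work starts.
public static class OcrPreflight
{
    public static string[] FindProblems(string modelPath)
    {
        var problems = new List<string>();

        // The tutorial states that Spire.OCR only supports 64-bit systems.
        if (!Environment.Is64BitProcess)
        {
            problems.Add("The process is not 64-bit; set the platform target to x64.");
        }

        // The model folder must exist and should contain the extracted files.
        if (string.IsNullOrWhiteSpace(modelPath) || !Directory.Exists(modelPath))
        {
            problems.Add("Model directory not found: " + modelPath);
        }
        else if (Directory.GetFileSystemEntries(modelPath).Length == 0)
        {
            problems.Add("Model directory is empty: " + modelPath);
        }

        return problems.ToArray();
    }
}
```

Calling OcrPreflight.FindProblems(@"D:\win-x64") at the top of Main and printing any returned messages gives a clearer failure than an exception raised later during scanning.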

Performing OCR on Scanned PDFs in C#

With the necessary libraries installed, we can now perform OCR on scanned PDFs. Below is a sample code snippet demonstrating this process.

using Spire.OCR;
using Spire.Pdf;
using Spire.Pdf.Graphics;
using System.Drawing;
using System.IO; // MemoryStream, Path, File, Directory

namespace OCRPDF
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create an instance of the OcrScanner class
            OcrScanner scanner = new OcrScanner();

            // Configure the scanner
            ConfigureOptions configureOptions = new ConfigureOptions
            {
                ModelPath = @"D:\win-x64", // Set model path
                Language = "English"        // Set language
            };

            // Apply the configuration options
            scanner.ConfigureDependencies(configureOptions);

            // Load a PDF document
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Input5.pdf");

            // Make sure the output folder exists; File.WriteAllText does not create it
            string outputDir = @"C:\Users\Administrator\Desktop\Output";
            Directory.CreateDirectory(outputDir);

            // Iterate through all pages
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                // Convert page to image (disposed at the end of each iteration)
                using (Image image = doc.SaveAsImage(i, PdfImageType.Bitmap))
                // Convert the image to a MemoryStream
                using (MemoryStream stream = new MemoryStream())
                {
                    image.Save(stream, System.Drawing.Imaging.ImageFormat.Png);
                    stream.Position = 0; // Reset the stream position

                    // Perform OCR on the image stream
                    scanner.Scan(stream, OCRImageFormat.Png);
                    string pageText = scanner.Text.ToString();

                    // Save extracted text to a separate file
                    string outputTxtPath = Path.Combine(outputDir, $"Page-{i + 1}.txt");
                    File.WriteAllText(outputTxtPath, pageText);
                }
            }

            // Close the document
            doc.Close();
        }
    }
}

Key Components Explained:

OcrScanner Class: This class is crucial for performing OCR. It provides methods to configure and execute the scanning operation.
ConfigureOptions Class: This class is used to set up the OCR scanner's configurations. The ModelPath property specifies the path to the OCR model files, and the Language property allows you to specify the language for text recognition.
PdfDocument Class: This class represents the PDF document. The LoadFromFile method loads the PDF file that you want to process.
Image Conversion: Each PDF page is converted to an image using the SaveAsImage method. This is essential because OCR works on image files.
MemoryStream: The image is saved into a MemoryStream, allowing us to perform OCR without saving the image to disk.
OCR Processing: The Scan method performs OCR on the image stream. The recognized text can be accessed using the Text property of the OcrScanner instance.
Output: The extracted text is saved to a text file for each page.
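
The code resets stream.Position to 0 before scanning because writing to a MemoryStream leaves the position at the end of the data, so a reader that starts there sees nothing. The sketch below illustrates this with plain bytes instead of an image; StreamRewindDemo and Measure are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;

// Hypothetical demo: shows how many bytes a reader gets from a MemoryStream
// before and after the position is reset to the start.
public static class StreamRewindDemo
{
    public static (int withoutRewind, int withRewind) Measure(byte[] payload)
    {
        using (var stream = new MemoryStream())
        {
            // After this write, Position equals payload.Length (end of data).
            stream.Write(payload, 0, payload.Length);

            var buffer = new byte[payload.Length];

            // Reading from the end of the stream returns 0 bytes.
            int withoutRewind = stream.Read(buffer, 0, buffer.Length);

            // Rewind so the next reader starts at the first byte.
            stream.Position = 0;
            int withRewind = stream.Read(buffer, 0, buffer.Length);

            return (withoutRewind, withRewind);
        }
    }
}
```

The same applies to the PNG data written by image.Save: without the reset, the stream handed to Scan would already be positioned past the image bytes.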

Output:

Perform OCR on a PDF document in C#

To extract text from searchable PDFs, refer to this guide: Automate PDF Text Extraction Using C#

Extracting Text from Images within PDFs in C#

In addition to processing entire PDF pages, you can also extract text from images embedded within PDFs. Here’s how:

using Spire.OCR;
using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.Drawing;
using System.IO;

namespace OCRPDF
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create an instance of the OcrScanner class
            OcrScanner scanner = new OcrScanner();

            // Configure the scanner
            ConfigureOptions configureOptions = new ConfigureOptions
            {
                ModelPath = @"D:\win-x64", // Set model path
                Language = "English"        // Set language
            };

            // Apply the configuration options
            scanner.ConfigureDependencies(configureOptions);

            // Load a PDF document
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Input5.pdf");

            // Make sure the output folder exists before writing to it
            string outputDir = @"C:\Users\Administrator\Desktop\Output";
            Directory.CreateDirectory(outputDir);

            // Helper used to retrieve the images embedded in a page
            PdfImageHelper imageHelper = new PdfImageHelper();

            // Iterate through all pages
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                // Get the information of all images on the current page
                PdfImageInfo[] imagesInfo = imageHelper.GetImagesInfo(doc.Pages[i]);

                // Iterate through the images on the page
                for (int j = 0; j < imagesInfo.Length; j++)
                {
                    // Get the embedded image
                    Image image = imagesInfo[j].Image;

                    // Convert the image to a MemoryStream
                    using (MemoryStream stream = new MemoryStream())
                    {
                        image.Save(stream, System.Drawing.Imaging.ImageFormat.Png);
                        stream.Position = 0; // Reset the stream position

                        // Perform OCR on the image stream
                        scanner.Scan(stream, OCRImageFormat.Png);
                        string imageText = scanner.Text.ToString();

                        // Save extracted text to a separate file per image
                        string outputTxtPath = Path.Combine(outputDir, $"Page-{i + 1}-Image-{j + 1}.txt");
                        File.WriteAllText(outputTxtPath, imageText);
                    }
                }
            }

            // Close the document
            doc.Close();
        }
    }
}

Key Components Explained:

PdfImageHelper Class: This class is essential for extracting images from a PDF page. It provides methods to retrieve image information, such as GetImagesInfo, which returns an array of PdfImageInfo objects.
PdfImageInfo Class: Each PdfImageInfo object contains properties related to an image, including the actual Image object that can be processed further.
Image Processing: Similar to the previous example, each image is saved to a MemoryStream for OCR processing.
Output: The extracted text from each image is saved to a separate text file.

Output:

Extract text from images in PDF in C#

Wrapping Up

By combining Spire.PDF with Spire.OCR, you can seamlessly transform scanned PDFs and image-based documents into fully searchable and editable text. Whether you need to process entire pages or extract text from specific embedded images, the approach is straightforward and flexible.

This OCR integration not only streamlines document digitization but also enhances productivity by enabling search, copy, and automated data extraction. In industries where large volumes of scanned documents are the norm, implementing OCR with C# can significantly improve accessibility, compliance, and information retrieval speed.

FAQs

Q1. Can I perform OCR on non-English PDFs?

Yes, Spire.OCR supports multiple languages. You can set the Language property in ConfigureOptions to the desired language.

Q2. What should I do if the output is garbled or incorrect?

Check the quality of the input PDF images. If the images are blurry or have low contrast, OCR may struggle to recognize text accurately. Consider enhancing the image quality before processing.
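
One common way to enhance an image before OCR is to convert it to grayscale and raise its contrast. The sketch below does this with System.Drawing, which the examples in this tutorial already use. The class name OcrPreprocessing, the method name, and the default contrast factor of 1.5 are assumptions chosen for illustration; the right settings depend on your scans, and this step is not part of Spire.OCR itself.

```csharp
using System.Drawing;
using System.Drawing.Imaging;

// Hypothetical helper (not part of Spire.OCR): returns a grayscale copy of
// the image with contrast scaled around mid-gray.
public static class OcrPreprocessing
{
    public static Bitmap ToHighContrastGrayscale(Image source, float contrast = 1.5f)
    {
        var result = new Bitmap(source.Width, source.Height);
        result.SetResolution(source.HorizontalResolution, source.VerticalResolution);

        // Offset that keeps mid-gray (0.5) fixed while contrast is scaled.
        float t = 0.5f * (1f - contrast);
        float r = 0.299f * contrast;
        float g = 0.587f * contrast;
        float b = 0.114f * contrast;

        // Each output channel = contrast * (0.299 R + 0.587 G + 0.114 B) + t
        var matrix = new ColorMatrix(new float[][]
        {
            new float[] { r, r, r, 0, 0 },
            new float[] { g, g, g, 0, 0 },
            new float[] { b, b, b, 0, 0 },
            new float[] { 0, 0, 0, 1, 0 },
            new float[] { t, t, t, 0, 1 }
        });

        using (Graphics graphics = Graphics.FromImage(result))
        using (var attributes = new ImageAttributes())
        {
            attributes.SetColorMatrix(matrix);
            graphics.DrawImage(
                source,
                new Rectangle(0, 0, source.Width, source.Height),
                0, 0, source.Width, source.Height,
                GraphicsUnit.Pixel,
                attributes);
        }

        return result;
    }
}
```

In the page loop, the bitmap returned by ToHighContrastGrayscale(image) would be saved to the MemoryStream in place of the original image, and disposed afterwards.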

Q3. Can I extract text from images embedded within a PDF?

Yes, you can. Use the PdfImageHelper class to extract the images from each page, and then apply OCR to each image to recognize its text.

Q4. Can Spire.OCR handle handwritten text in PDFs?

Spire.OCR is primarily optimized for printed text. Handwriting recognition typically has lower accuracy.

Q5. Do I need to install additional language models for OCR?

Yes, Spire.OCR requires pre-trained language model files. Download and configure the appropriate models for your target language before performing OCR.

Get a Free License

To fully experience the capabilities of Spire.PDF for .NET and Spire.OCR for .NET without any evaluation limitations, you can request a free 30-day trial license.