Effortlessly Automate PDF Text Extraction Using C# .NET: A Complete Guide

Manually extracting text from PDF files can be tedious, error-prone, and inefficient—especially when working with large volumes of documents or complex layouts. PDFs store content based on coordinates rather than linear text flow, making it difficult to retrieve structured or readable text without specialized tools.

For developers working in C# .NET, automating PDF text extraction is essential for streamlining workflows such as document processing, content indexing, data migration, and digital archiving.

This comprehensive guide shows you how to read text from a PDF using C# and Spire.PDF for .NET, a powerful library for reading and processing PDF documents. You’ll learn how to:

Extract full-text content from entire documents
Retrieve text from individual pages
Capture content within defined regions
Obtain position and size metadata for advanced use cases

Whether you're building a PDF parser, developing an automated document management system, or migrating PDF data into structured formats, this article provides ready-to-use C# code examples and best practices to help you extract text from PDFs quickly, accurately, and at scale.

Why Use Spire.PDF for Text Extraction in .NET?
Extract Text from PDF (Basic Example)
Advanced Text Extraction Options
Conclusion
FAQs

Why Use Spire.PDF for Text Extraction in .NET?

Spire.PDF for .NET is a feature-rich, developer-friendly library that supports seamless text extraction from PDFs in .NET applications. Here's why it stands out:

Precise Layout Preservation: Maintains original layout, spacing, and reading order.
Detailed Extraction: Retrieves text along with metadata such as position and size.
No Adobe Dependency: Works independently of Adobe Acrobat or other third-party tools.
Quick Integration: Clean API and extensive documentation for faster development.

Installation

Before getting started, install the library in your project via NuGet:

Install-Package Spire.PDF

Or download the DLL and manually reference it in your solution.

Extract Text from PDF (Basic Example)

Extracting the full text content of a PDF is crucial when you need to capture all of its information for analysis or processing.

This basic example extracts all text content from a PDF file using the PdfTextExtractor class and saves it to a text file with the original spacing, line breaks, and layout preserved.

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.IO;
using System.Text;

namespace ExtractAllTextFromPDF
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Create a PDF document instance
            PdfDocument pdf = new PdfDocument();
            // Load the PDF file
            pdf.LoadFromFile("Sample.pdf");

            // Initialize a StringBuilder to hold the extracted text
            StringBuilder extractedText = new StringBuilder();
            // Loop through each page in the PDF
            foreach (PdfPageBase page in pdf.Pages)
            {
                // Create a PdfTextExtractor for the current page
                PdfTextExtractor extractor = new PdfTextExtractor(page);
                // Set extraction options
                PdfTextExtractOptions option = new PdfTextExtractOptions
                {
                    IsExtractAllText = true
                };
                // Extract text from the current page
                string text = extractor.ExtractText(option);
                // Append the extracted text to the StringBuilder
                extractedText.AppendLine(text);
            }

            // Save the extracted text to a text file
            File.WriteAllText("ExtractedText.txt", extractedText.ToString());
            // Close the PDF document
            pdf.Close();
        }
    }
}

Advanced Text Extraction Options

Spire.PDF offers more than basic full-document extraction. It supports advanced scenarios like retrieving text from specific pages, extracting content from defined areas, and accessing text layout details such as position and dimensions. This section explores these capabilities with practical examples.

Retrieve Text from Individual PDF Pages

Sometimes, you only need text from a particular page—for example, when processing a specific section of a multi-page document. You can access the desired page from the Pages collection of the document and then apply the extraction logic to it.

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.IO;

namespace ExtractTextFromIndividualPages
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Create a PDF document instance
            PdfDocument pdf = new PdfDocument();
            // Load the PDF file
            pdf.LoadFromFile("Sample.pdf");

            // Access the page to extract text from (e.g., index 1 = the second page)
            PdfPageBase page = pdf.Pages[1];

            // Create a PdfTextExtractor for the selected page
            PdfTextExtractor extractor = new PdfTextExtractor(page);
            // Set extraction options
            PdfTextExtractOptions option = new PdfTextExtractOptions
            {
                IsExtractAllText = true
            };
            // Extract text from the specified page
            string text = extractor.ExtractText(option);

            // Save the extracted text to a text file
            File.WriteAllText("IndividualPage.txt", text);
            // Close the PDF document
            pdf.Close();
        }
    }
}
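
When the page number comes from user input or configuration, it may be worth validating it before indexing into the Pages collection. The following is a minimal sketch under that assumption; `ToZeroBasedIndex` is a hypothetical helper introduced here for illustration and is not part of Spire.PDF. The Spire.PDF calls are the same ones used in the example above.

```csharp
using Spire.Pdf;
using Spire.Pdf.Texts;
using System;
using System.IO;

namespace ExtractTextFromPageSafely
{
    internal static class Program
    {
        // Hypothetical helper (not part of Spire.PDF): converts a 1-based page
        // number into a 0-based index and rejects out-of-range values.
        internal static int ToZeroBasedIndex(int pageNumber, int pageCount)
        {
            if (pageNumber < 1 || pageNumber > pageCount)
            {
                throw new ArgumentOutOfRangeException(
                    nameof(pageNumber),
                    $"Page {pageNumber} is outside the range 1..{pageCount}.");
            }
            return pageNumber - 1;
        }

        static void Main(string[] args)
        {
            PdfDocument pdf = new PdfDocument();
            pdf.LoadFromFile("Sample.pdf");

            try
            {
                // Page 2 as a person would count it; index 1 in the collection
                int index = ToZeroBasedIndex(2, pdf.Pages.Count);
                PdfPageBase page = pdf.Pages[index];

                PdfTextExtractor extractor = new PdfTextExtractor(page);
                PdfTextExtractOptions option = new PdfTextExtractOptions
                {
                    IsExtractAllText = true
                };
                File.WriteAllText("IndividualPage.txt", extractor.ExtractText(option));
            }
            finally
            {
                // Close the document even if the page number was invalid
                pdf.Close();
            }
        }
    }
}
```

Keeping the conversion in one place avoids off-by-one mistakes when the rest of the application talks about pages the way readers do, starting at 1.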

Read Text within a Defined Area on a PDF Page

If you're interested in text within a specific rectangular area (e.g., header or footer), you can set a rectangular extraction region via PdfTextExtractOptions.ExtractArea to limit the extraction scope.

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.IO;
using System.Drawing;

namespace ExtractTextFromDefinedArea
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Create a PDF document instance
            PdfDocument doc = new PdfDocument();
            // Load the PDF file
            doc.LoadFromFile("Sample.pdf");

            // Get the second page 
            PdfPageBase page = doc.Pages[1];

            // Create a PdfTextExtractor for the selected page
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);
            // Set extraction options with a defined rectangular area
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions
            {
                ExtractArea = new RectangleF(0, 0, 890, 170)
            };

            // Extract text from the specified rectangular area
            string text = textExtractor.ExtractText(extractOptions);

            // Save the extracted text to a text file
            File.WriteAllText("Extracted.txt", text);

            // Close the PDF document
            doc.Close();
        }
    }
}
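
Hard-coded rectangles such as the one above only fit one page size. As a rough sketch, the region can instead be derived from the page dimensions. This assumes that coordinates are measured in points from the top-left corner of the page (which the sample rectangle starting at 0, 0 suggests) and that the page exposes its dimensions through a `Size` property; verify both against your installed version. `HeaderRegion`, `FooterRegion`, and `ExtractRegionText` are hypothetical helper names, and the 72-point band height is an arbitrary illustrative value.

```csharp
using Spire.Pdf;
using Spire.Pdf.Texts;
using System;
using System.Drawing;
using System.IO;

namespace ExtractHeaderAndFooterText
{
    internal static class Program
    {
        // Hypothetical helper: a full-width band at the top of the page.
        internal static RectangleF HeaderRegion(SizeF pageSize, float bandHeight)
        {
            float height = Math.Min(bandHeight, pageSize.Height);
            return new RectangleF(0, 0, pageSize.Width, height);
        }

        // Hypothetical helper: a full-width band at the bottom of the page.
        internal static RectangleF FooterRegion(SizeF pageSize, float bandHeight)
        {
            float height = Math.Min(bandHeight, pageSize.Height);
            return new RectangleF(0, pageSize.Height - height, pageSize.Width, height);
        }

        // Extracts the text inside the given rectangle of one page.
        static string ExtractRegionText(PdfPageBase page, RectangleF area)
        {
            PdfTextExtractor extractor = new PdfTextExtractor(page);
            PdfTextExtractOptions options = new PdfTextExtractOptions
            {
                ExtractArea = area
            };
            return extractor.ExtractText(options);
        }

        static void Main(string[] args)
        {
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile("Sample.pdf");

            PdfPageBase page = doc.Pages[0];

            // 72 points (one inch) is an assumed band height; adjust per layout
            string header = ExtractRegionText(page, HeaderRegion(page.Size, 72f));
            string footer = ExtractRegionText(page, FooterRegion(page.Size, 72f));

            File.WriteAllText("Header.txt", header);
            File.WriteAllText("Footer.txt", footer);

            doc.Close();
        }
    }
}
```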

Get Text Position and Size Information for Advanced Processing

For advanced tasks like annotation or content overlay, accessing the position and size of each text fragment is crucial. You can obtain this information using PdfTextFinder and PdfTextFragment.

using Spire.Pdf;
using Spire.Pdf.Texts;
using System;
using System.Collections.Generic;
using System.Drawing;

namespace ExtractTextWithPositionAndSize
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the PDF document
            PdfDocument pdf = new PdfDocument();
            pdf.LoadFromFile("Sample.pdf");

            // Iterate through each page of the document
            for (int i = 0; i < pdf.Pages.Count; i++)
            {
                PdfPageBase page = pdf.Pages[i];

                // Create a PdfTextFinder object for the current page
                PdfTextFinder finder = new PdfTextFinder(page);

                // Find all text fragments on the page
                List<PdfTextFragment> fragments = finder.FindAllText();

                Console.WriteLine($"Page {i + 1}:");

                // Iterate over each text fragment
                foreach (PdfTextFragment fragment in fragments)
                {
                    // Extract text content
                    string text = fragment.Text;

                    // Get bounding rectangles with position and size
                    RectangleF[] rects = fragment.Bounds;

                    Console.WriteLine($"Text: \"{text}\"");

                    // Iterate through each rectangle for this fragment
                    foreach (var rect in rects)
                    {
                        Console.WriteLine($"Position: ({rect.X}, {rect.Y}), Size: ({rect.Width} x {rect.Height})");
                    }
                    Console.WriteLine();
                }
            }

            // Close the PDF document
            pdf.Close();
        }
    }
}
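
One possible use of this position data is to rebuild reading order yourself, for example when fragments arrive in an unexpected sequence. The sketch below is plain C# with no Spire.PDF dependency: it groups items whose Y coordinates are within a tolerance into one line, then orders each line left to right. The tuple shape, the `GroupIntoLines` name, and the 2-point tolerance are assumptions made for illustration; in practice the inputs would come from `fragment.Text` and one of the rectangles in `fragment.Bounds`.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

namespace OrderFragmentsByPosition
{
    internal static class Program
    {
        // Groups positioned text into lines (top to bottom) and joins each
        // line's items left to right. Items belong to the same line when their
        // Y coordinate is within `tolerance` of the line's first item.
        internal static List<string> GroupIntoLines(
            IEnumerable<(string Text, RectangleF Bounds)> items, float tolerance)
        {
            var lines = new List<List<(string Text, RectangleF Bounds)>>();

            foreach (var item in items.OrderBy(i => i.Bounds.Y))
            {
                var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
                if (current != null &&
                    Math.Abs(item.Bounds.Y - current[0].Bounds.Y) <= tolerance)
                {
                    current.Add(item);
                }
                else
                {
                    lines.Add(new List<(string Text, RectangleF Bounds)> { item });
                }
            }

            return lines
                .Select(line => string.Join(" ",
                    line.OrderBy(i => i.Bounds.X).Select(i => i.Text)))
                .ToList();
        }

        static void Main()
        {
            // Made-up sample data standing in for extracted fragments
            var items = new List<(string Text, RectangleF Bounds)>
            {
                ("World", new RectangleF(120f, 50.4f, 40f, 12f)),
                ("Total:", new RectangleF(72f, 90f, 35f, 12f)),
                ("Hello", new RectangleF(72f, 50f, 40f, 12f)),
            };

            foreach (string line in GroupIntoLines(items, 2f))
            {
                Console.WriteLine(line);
            }
        }
    }
}
```

With the sample data, "Hello" and "World" fall on the same line because their Y values differ by less than the tolerance, while "Total:" starts a new line. A fixed tolerance is a simplification; documents with mixed font sizes may need a tolerance derived from the fragment height.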

Conclusion

Whether you're performing simple extraction or building advanced document automation tools, Spire.PDF for .NET provides versatile and accurate methods to extract and manipulate PDF text:

Full-text extraction for complete documents
Page-level control to isolate relevant sections
Area-based targeting for structured or repeated patterns
Precise layout data for custom rendering and analysis

By combining these techniques, you can create powerful and flexible PDF processing workflows tailored to your application's needs.

FAQs

Q1: Can Spire.PDF extract text from password-protected PDFs?

A1: Yes. By providing the correct password when loading the document, Spire.PDF can open and extract text from secured PDFs.

Q2: Does Spire.PDF support batch extraction?

A2: Absolutely. You can iterate over a directory of PDF files and apply the same extraction logic programmatically.
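
As an illustration, a batch run might look like the sketch below. It reuses the per-page extraction logic from the basic example and only adds standard .NET file handling around it. The folder names `InputPdfs` and `ExtractedText` and the helpers `OutputPathFor` and `ExtractAllText` are assumptions made for this example, not part of Spire.PDF.

```csharp
using Spire.Pdf;
using Spire.Pdf.Texts;
using System;
using System.IO;
using System.Text;

namespace BatchExtractTextFromPDFs
{
    internal static class Program
    {
        // Hypothetical helper: maps "InputPdfs/Report.pdf" to "<outputDir>/Report.txt".
        internal static string OutputPathFor(string pdfPath, string outputDir)
        {
            string name = Path.GetFileNameWithoutExtension(pdfPath) + ".txt";
            return Path.Combine(outputDir, name);
        }

        // Same page-by-page extraction as the basic example, for one file.
        internal static string ExtractAllText(string pdfPath)
        {
            PdfDocument pdf = new PdfDocument();
            try
            {
                pdf.LoadFromFile(pdfPath);

                StringBuilder extractedText = new StringBuilder();
                foreach (PdfPageBase page in pdf.Pages)
                {
                    PdfTextExtractor extractor = new PdfTextExtractor(page);
                    PdfTextExtractOptions option = new PdfTextExtractOptions
                    {
                        IsExtractAllText = true
                    };
                    extractedText.AppendLine(extractor.ExtractText(option));
                }
                return extractedText.ToString();
            }
            finally
            {
                pdf.Close();
            }
        }

        static void Main(string[] args)
        {
            string inputDir = "InputPdfs";       // assumed input folder
            string outputDir = "ExtractedText";  // assumed output folder
            Directory.CreateDirectory(outputDir);

            foreach (string pdfPath in Directory.GetFiles(inputDir, "*.pdf"))
            {
                try
                {
                    string text = ExtractAllText(pdfPath);
                    File.WriteAllText(OutputPathFor(pdfPath, outputDir), text);
                }
                catch (Exception ex)
                {
                    // Report the failure and continue with the next file
                    Console.Error.WriteLine($"Skipped {pdfPath}: {ex.Message}");
                }
            }
        }
    }
}
```

Catching exceptions per file is a design choice for unattended runs: one corrupt or protected PDF is logged and skipped rather than stopping the whole batch.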

Q3: Can it extract font styles and sizes?

A3: Yes. Spire.PDF allows you to retrieve font-related details such as font name, size, and style.

Q4: Can I extract images or tables as well?

A4: While text extraction is the focus of this guide, Spire.PDF can also extract images and supports table detection with additional logic.

Q5: Can Spire.PDF extract text from scanned (image-based) PDFs?

A5: Scanned PDFs require OCR (Optical Character Recognition). Spire.PDF doesn't provide built-in OCR, but you can combine it with an OCR library such as Spire.OCR for image-to-text conversion.

Get a Free License

To fully experience the capabilities of Spire.PDF for .NET without any evaluation limitations, you can request a free 30-day trial license.