zaki zou

Wednesday, 30 August 2023 06:44

C#/VB.NET: Extract Text from PDF Documents

PDF documents have a fixed layout and do not allow users to modify their content. To make the content editable again, you can convert the PDF to Word or extract its text. In this article, you will learn how to extract text from a specific PDF page, how to extract text from a particular rectangular area, and how to extract text using SimpleTextExtractionStrategy in C# and VB.NET with Spire.PDF for .NET.

Extract Text from a Specified Page
Extract Text from a Rectangle
Extract Text Using SimpleTextExtractionStrategy

Install Spire.PDF for .NET

To begin with, you need to add the DLL files included in the Spire.PDF for .NET package as references in your .NET project. The DLL files can be downloaded from the Spire.PDF download page or installed via NuGet.

Package Manager

PM> Install-Package Spire.PDF

Extract Text from a Specified Page

The following are the steps to extract text from a given page of a PDF document using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property (the index is zero-based).
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsExtractAllText property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System;
    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace ExtractTextFromPage
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsExtractAllText to true
                extractOptions.IsExtractAllText = true;
    
                //Extract text from the page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }
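
Because PdfDocument.Pages[index] is zero-based, the second page in the example above is Pages[1]. The following is a minimal sketch of a helper that converts a page number as shown in a PDF viewer into such an index. The names PageIndexHelper and ToZeroBasedIndex are hypothetical and are not part of Spire.PDF.

```csharp
using System;

static class PageIndexHelper
{
    // Converts a 1-based page number (as shown in a PDF viewer) to the
    // 0-based index expected by an indexer such as PdfDocument.Pages[index].
    // pageCount is the total number of pages in the document.
    public static int ToZeroBasedIndex(int pageNumber, int pageCount)
    {
        if (pageCount < 0)
            throw new ArgumentOutOfRangeException(nameof(pageCount));
        if (pageNumber < 1 || pageNumber > pageCount)
            throw new ArgumentOutOfRangeException(
                nameof(pageNumber),
                $"Page number must be between 1 and {pageCount}.");
        return pageNumber - 1;
    }
}

static class PageIndexDemo
{
    static void Main()
    {
        // Page 2 of a 5-page document maps to index 1.
        Console.WriteLine(PageIndexHelper.ToZeroBasedIndex(2, 5)); // prints 1
    }
}
```

Validating the page number before indexing gives a clearer error than an out-of-range failure raised from inside the page collection.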

Extract Text from a Rectangle

The following are the steps to extract text from a rectangular area of a page using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and specify the rectangular area through its ExtractArea property.
Extract text from the rectangle using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using Spire.Pdf;
    using Spire.Pdf.Texts;
    using System.IO;
    using System.Drawing;
    
    namespace ExtractTextFromRectangleArea
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set the rectangle area
                extractOptions.ExtractArea = new RectangleF(0, 0, 890, 170);
    
                //Extract text from the rectangle
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }
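
To make the idea of area-based extraction concrete, the sketch below filters positioned text by a RectangleF without using Spire.PDF at all. It is only an illustration of the concept, not the library's implementation: TextChunk, AreaFilter and the sample coordinates are hypothetical, and a real extractor may treat text that straddles the rectangle border differently.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Hypothetical stand-in for a piece of page text and the point where it starts.
sealed class TextChunk
{
    public string Text { get; }
    public PointF Position { get; }

    public TextChunk(string text, PointF position)
    {
        Text = text;
        Position = position;
    }
}

static class AreaFilter
{
    // Returns the text of every chunk whose position lies inside the area,
    // joined with single spaces in the order the chunks were supplied.
    public static string TextInside(IEnumerable<TextChunk> chunks, RectangleF area)
    {
        return string.Join(" ", chunks.Where(c => area.Contains(c.Position)).Select(c => c.Text));
    }
}

static class AreaFilterDemo
{
    static void Main()
    {
        var chunks = new List<TextChunk>
        {
            new TextChunk("Terms", new PointF(50, 20)),
            new TextChunk("of Service", new PointF(120, 20)),
            new TextChunk("Section 4", new PointF(50, 400)) // outside the area
        };

        // Same rectangle as in the example above: x = 0, y = 0, width = 890, height = 170.
        var area = new RectangleF(0, 0, 890, 170);

        Console.WriteLine(AreaFilter.TextInside(chunks, area)); // prints "Terms of Service"
    }
}
```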

Extract Text Using SimpleTextExtractionStrategy

The methods above extract text line by line. When text is extracted using SimpleTextExtractionStrategy, the extractor keeps track of the current Y position of each string and inserts a line break into the output whenever the Y position changes. The detailed steps are as follows.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsSimpleExtraction property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace SimpleExtraction
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Invoice.pdf");
    
                //Get the first page
                PdfPageBase page = doc.Pages[0];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsSimpleExtraction to true
                extractOptions.IsSimpleExtraction = true;
    
                //Extract text from the selected page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }
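
The line-breaking rule described for SimpleTextExtractionStrategy (start a new line whenever the Y position changes) can be illustrated with a small library-free sketch. This is an assumption-laden illustration rather than Spire.PDF's actual code: SimpleStrategySketch and its input tuples are hypothetical, and Y values are compared exactly, whereas a real extractor may allow a tolerance.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SimpleStrategySketch
{
    // Appends the strings in the order given and starts a new line
    // whenever a string's Y position differs from the previous one's.
    public static string Join(IEnumerable<(string Text, float Y)> strings)
    {
        var output = new StringBuilder();
        float? lastY = null;

        foreach (var (text, y) in strings)
        {
            if (lastY.HasValue && y != lastY.Value)
                output.Append('\n');

            output.Append(text);
            lastY = y;
        }

        return output.ToString();
    }
}

static class SimpleStrategyDemo
{
    static void Main()
    {
        var strings = new List<(string Text, float Y)>
        {
            ("Invoice", 50f),
            (" No. 42", 50f), // same Y: stays on the same line
            ("Total", 80f)    // Y changed: a line break is inserted first
        };

        // Prints two lines: "Invoice No. 42" and "Total".
        Console.WriteLine(SimpleStrategySketch.Join(strings));
    }
}
```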

Apply for a Temporary License

If you would like to remove the evaluation message from the generated documents or lift the function limitations, please request a 30-day trial license.

See Also

Published in pdf

Wednesday, 30 August 2023 06:42

C#/VB.NET: Extract Text from PDF Documents

PDF documents have a fixed layout and do not allow users to modify their content. To make the content editable again, you can convert the PDF to Word or extract its text. In this article, you will learn how to extract text from a specific PDF page, how to extract text from a particular rectangular area, and how to extract text using SimpleTextExtractionStrategy in C# and VB.NET with Spire.PDF for .NET.

Extract Text from a Specified Page
Extract Text from a Rectangle
Extract Text Using SimpleTextExtractionStrategy

Install Spire.PDF for .NET

To begin with, you need to add the DLL files included in the Spire.PDF for .NET package as references in your .NET project. The DLL files can be downloaded from the Spire.PDF download page or installed via NuGet.

Package Manager

PM> Install-Package Spire.PDF

Extract Text from a Specified Page

The following are the steps to extract text from a given page of a PDF document using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property (the index is zero-based).
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsExtractAllText property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System;
    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace ExtractTextFromPage
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsExtractAllText to true
                extractOptions.IsExtractAllText = true;
    
                //Extract text from the page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Extract Text from a Rectangle

The following are the steps to extract text from a rectangular area of a page using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and specify the rectangular area through its ExtractArea property.
Extract text from the rectangle using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using Spire.Pdf;
    using Spire.Pdf.Texts;
    using System.IO;
    using System.Drawing;
    
    namespace ExtractTextFromRectangleArea
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set the rectangle area
                extractOptions.ExtractArea = new RectangleF(0, 0, 890, 170);
    
                //Extract text from the rectangle
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Extract Text Using SimpleTextExtractionStrategy

The methods above extract text line by line. When text is extracted using SimpleTextExtractionStrategy, the extractor keeps track of the current Y position of each string and inserts a line break into the output whenever the Y position changes. The detailed steps are as follows.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsSimpleExtraction property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace SimpleExtraction
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Invoice.pdf");
    
                //Get the first page
                PdfPageBase page = doc.Pages[0];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsSimpleExtraction to true
                extractOptions.IsSimpleExtraction = true;
    
                //Extract text from the selected page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Apply for a Temporary License

If you would like to remove the evaluation message from the generated documents or lift the function limitations, please request a 30-day trial license.

See Also

Published in pdf

Wednesday, 30 August 2023 06:41

C#/VB.NET: Extract Text from PDF Documents

PDF documents have a fixed layout and do not allow users to modify their content. To make the content editable again, you can convert the PDF to Word or extract its text. In this article, you will learn how to extract text from a specific PDF page, how to extract text from a particular rectangular area, and how to extract text using SimpleTextExtractionStrategy in C# and VB.NET with Spire.PDF for .NET.

Extract Text from a Specified Page
Extract Text from a Rectangle
Extract Text Using SimpleTextExtractionStrategy

Install Spire.PDF for .NET

To begin with, you need to add the DLL files included in the Spire.PDF for .NET package as references in your .NET project. The DLL files can be downloaded from the Spire.PDF download page or installed via NuGet.

Package Manager

PM> Install-Package Spire.PDF

Extract Text from a Specified Page

The following are the steps to extract text from a given page of a PDF document using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property (the index is zero-based).
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsExtractAllText property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System;
    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace ExtractTextFromPage
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsExtractAllText to true
                extractOptions.IsExtractAllText = true;
    
                //Extract text from the page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Extract Text from a Rectangle

The following are the steps to extract text from a rectangular area of a page using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and specify the rectangular area through its ExtractArea property.
Extract text from the rectangle using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using Spire.Pdf;
    using Spire.Pdf.Texts;
    using System.IO;
    using System.Drawing;
    
    namespace ExtractTextFromRectangleArea
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set the rectangle area
                extractOptions.ExtractArea = new RectangleF(0, 0, 890, 170);
    
                //Extract text from the rectangle
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Extract Text Using SimpleTextExtractionStrategy

The methods above extract text line by line. When text is extracted using SimpleTextExtractionStrategy, the extractor keeps track of the current Y position of each string and inserts a line break into the output whenever the Y position changes. The detailed steps are as follows.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsSimpleExtraction property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace SimpleExtraction
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Invoice.pdf");
    
                //Get the first page
                PdfPageBase page = doc.Pages[0];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsSimpleExtraction to true
                extractOptions.IsSimpleExtraction = true;
    
                //Extract text from the selected page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Apply for a Temporary License

If you would like to remove the evaluation message from the generated documents or lift the function limitations, please request a 30-day trial license.

See Also

Published in pdf

Wednesday, 30 August 2023 06:39

C#/VB.NET: Extract Text from PDF Documents

PDF documents have a fixed layout and do not allow users to modify their content. To make the content editable again, you can convert the PDF to Word or extract its text. In this article, you will learn how to extract text from a specific PDF page, how to extract text from a particular rectangular area, and how to extract text using SimpleTextExtractionStrategy in C# and VB.NET with Spire.PDF for .NET.

Extract Text from a Specified Page
Extract Text from a Rectangle
Extract Text Using SimpleTextExtractionStrategy

Install Spire.PDF for .NET

To begin with, you need to add the DLL files included in the Spire.PDF for .NET package as references in your .NET project. The DLL files can be downloaded from the Spire.PDF download page or installed via NuGet.

Package Manager

PM> Install-Package Spire.PDF

Extract Text from a Specified Page

The following are the steps to extract text from a given page of a PDF document using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property (the index is zero-based).
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsExtractAllText property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System;
    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace ExtractTextFromPage
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsExtractAllText to true
                extractOptions.IsExtractAllText = true;
    
                //Extract text from the page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Extract Text from a Rectangle

The following are the steps to extract text from a rectangular area of a page using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and specify the rectangular area through its ExtractArea property.
Extract text from the rectangle using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using Spire.Pdf;
    using Spire.Pdf.Texts;
    using System.IO;
    using System.Drawing;
    
    namespace ExtractTextFromRectangleArea
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set the rectangle area
                extractOptions.ExtractArea = new RectangleF(0, 0, 890, 170);
    
                //Extract text from the rectangle
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Extract Text Using SimpleTextExtractionStrategy

The methods above extract text line by line. When text is extracted using SimpleTextExtractionStrategy, the extractor keeps track of the current Y position of each string and inserts a line break into the output whenever the Y position changes. The detailed steps are as follows.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsSimpleExtraction property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace SimpleExtraction
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Invoice.pdf");
    
                //Get the first page
                PdfPageBase page = doc.Pages[0];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsSimpleExtraction to true
                extractOptions.IsSimpleExtraction = true;
    
                //Extract text from the selected page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Apply for a Temporary License

If you would like to remove the evaluation message from the generated documents or lift the function limitations, please request a 30-day trial license.

See Also

Published in pdf

Tagged under

itpdf

Wednesday, 30 August 2023 06:36

C#/VB.NET: Extract Text from PDF Documents

PDF documents have a fixed layout and do not allow users to modify their content. To make the content editable again, you can convert the PDF to Word or extract its text. In this article, you will learn how to extract text from a specific PDF page, how to extract text from a particular rectangular area, and how to extract text using SimpleTextExtractionStrategy in C# and VB.NET with Spire.PDF for .NET.

Extract Text from a Specified Page
Extract Text from a Rectangle
Extract Text Using SimpleTextExtractionStrategy

Install Spire.PDF for .NET

To begin with, you need to add the DLL files included in the Spire.PDF for .NET package as references in your .NET project. The DLL files can be downloaded from the Spire.PDF download page or installed via NuGet.

Package Manager

PM> Install-Package Spire.PDF

Extract Text from a Specified Page

The following are the steps to extract text from a given page of a PDF document using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property (the index is zero-based).
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsExtractAllText property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

    using System;
    using System.IO;
    using Spire.Pdf;
    using Spire.Pdf.Texts;
    
    namespace ExtractTextFromPage
    {
        class Program
        {
            static void Main(string[] args)
            {
                //Create a PdfDocument object
                PdfDocument doc = new PdfDocument();
    
                //Load a PDF file
                doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");
    
                //Get the second page
                PdfPageBase page = doc.Pages[1];
    
                //Create a PdfTextExtractor object
                PdfTextExtractor textExtractor = new PdfTextExtractor(page);
    
                //Create a PdfTextExtractOptions object
                PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();
    
                //Set IsExtractAllText to true
                extractOptions.IsExtractAllText = true;
    
                //Extract text from the page
                string text = textExtractor.ExtractText(extractOptions);
    
                //Write to a txt file
                File.WriteAllText("Extracted.txt", text);
            }
        }
    }

Extract Text from a Rectangle

The following are the steps to extract text from a rectangular area of a page using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and specify the rectangular area through its ExtractArea property.
Extract text from the rectangle using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.IO;
using System.Drawing;

namespace ExtractTextFromRectangleArea
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            //Load a PDF file
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");

            //Get the second page (the Pages indexer is zero-based)
            PdfPageBase page = doc.Pages[1];

            //Create a PdfTextExtractor object
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            //Create a PdfTextExtractOptions object
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

            //Set the rectangle area (x, y, width, height)
            extractOptions.ExtractArea = new RectangleF(0, 0, 890, 170);

            //Extract text from the rectangle
            string text = textExtractor.ExtractText(extractOptions);

            //Write to a txt file
            File.WriteAllText("Extracted.txt", text);
        }
    }
}
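
The rectangle in the listing above is 890 units wide, which may be wider than the page it is applied to. Whether Spire.PDF needs the area to lie inside the page is not stated here, so the following is only a sketch of how a requested area could be clamped to the page bounds with System.Drawing.RectangleF.Intersect. The helper name ClampToPage is hypothetical, and the 595 x 842 page size (A4 in points) is an assumption rather than a value read from a document.

```csharp
using System;
using System.Drawing;

namespace ExtractAreaSketch
{
    public static class AreaHelper
    {
        // Returns the part of the requested area that lies inside the page.
        // RectangleF.Intersect yields RectangleF.Empty when there is no overlap.
        // Hypothetical helper for illustration; not part of Spire.PDF.
        public static RectangleF ClampToPage(RectangleF requested, SizeF pageSize)
        {
            RectangleF pageBounds = new RectangleF(0, 0, pageSize.Width, pageSize.Height);
            return RectangleF.Intersect(requested, pageBounds);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            //Assumed page size: A4 in points (not read from a PDF)
            SizeF pageSize = new SizeF(595f, 842f);

            //The same rectangle as in the listing above
            RectangleF requested = new RectangleF(0, 0, 890, 170);

            //Keep only the part of the rectangle that is on the page
            RectangleF area = AreaHelper.ClampToPage(requested, pageSize);
            Console.WriteLine(area);
        }
    }
}
```

The clamped rectangle could then be assigned to the ExtractArea property in place of the hard-coded one.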

Extract Text using SimpleTextExtractionStrategy

The methods above extract text line by line. When text is extracted with SimpleTextExtractionStrategy, the extractor keeps track of the current Y position of each string and inserts a line break into the output whenever the Y position changes. The detailed steps are as follows.

Create a PdfDocument object.
Load a PDF file using the PdfDocument.LoadFromFile() method.
Get the specific page through the PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set its IsSimpleExtraction property to true.
Extract text from the selected page using the PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#

using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace SimpleExtraction
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            //Load a PDF file
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Invoice.pdf");

            //Get the first page
            PdfPageBase page = doc.Pages[0];

            //Create a PdfTextExtractor object
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            //Create a PdfTextExtractOptions object
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

            //Set IsSimpleExtraction to true
            extractOptions.IsSimpleExtraction = true;

            //Extract text from the selected page
            string text = textExtractor.ExtractText(extractOptions);

            //Write to a txt file
            File.WriteAllText("Extracted.txt", text);
        }
    }
}
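
To make the Y-position rule described in this section concrete, here is a minimal sketch of the idea in plain C#. It is not Spire.PDF's implementation: the (Text, Y) chunk tuples, the LineBreaker class and the sample values are all hypothetical, and comparing Y values for exact equality is a simplification.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

namespace SimpleStrategySketch
{
    public static class LineBreaker
    {
        // Joins text chunks in the order given and starts a new line
        // whenever the Y position differs from the previous chunk's.
        // Hypothetical helper for illustration; not part of Spire.PDF.
        public static string JoinByY(IEnumerable<(string Text, float Y)> chunks)
        {
            StringBuilder output = new StringBuilder();
            float? currentY = null;

            foreach ((string text, float y) in chunks)
            {
                //A changed Y position means the chunk belongs to a new line
                if (currentY.HasValue && y != currentY.Value)
                {
                    output.Append('\n');
                }
                output.Append(text);
                currentY = y;
            }
            return output.ToString();
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            //Made-up chunks: two strings at Y = 700, then two at Y = 680
            var chunks = new List<(string Text, float Y)>
            {
                ("Invoice ", 700f),
                ("No. 42", 700f),
                ("Total: ", 680f),
                ("100", 680f)
            };

            //Prints two lines: "Invoice No. 42" and "Total: 100"
            Console.WriteLine(LineBreaker.JoinByY(chunks));
        }
    }
}
```

A real extractor would likely allow a small tolerance when comparing Y values, since glyphs on the same visual line do not always share an identical baseline.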

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Wednesday, 30 August 2023 06:35

C#/VB.NET: Insert, Replace or Delete Images in PDF

Compared with text-only documents, documents containing images are undoubtedly more vivid and engaging to readers. When generating or editing a PDF document, you may sometimes need to insert images to improve its appearance and make it more appealing. In this article, you will learn how to insert, replace or delete images in PDF documents in C# and VB.NET using Spire.PDF for .NET.

Insert an Image into a PDF Document
Replace an Image with Another Image in a PDF Document
Delete a Specific Image in a PDF Document

Install Spire.PDF for .NET

To begin with, you need to add the DLL files included in the Spire.PDF for .NET package as references in your .NET project. The DLL files can be either downloaded from this link or installed via NuGet.

Package Manager

PM> Install-Package Spire.PDF

Insert an Image into a PDF Document in C# and VB.NET

The following steps demonstrate how to insert an image into an existing PDF document:

Initialize an instance of the PdfDocument class.
Load a PDF document using the PdfDocument.LoadFromFile() method.
Get the desired page in the PDF document through the PdfDocument.Pages[pageIndex] property.
Load an image using the PdfImage.FromFile() method.
Specify the width and height of the image area on the page.
Specify the X and Y coordinates at which to start drawing the image.
Draw the image on the page using the PdfPageBase.Canvas.DrawImage() method.
Save the result document using the PdfDocument.SaveToFile() method.

C#

using Spire.Pdf;
using Spire.Pdf.Graphics;

namespace InsertImage
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument instance
            PdfDocument pdf = new PdfDocument();
            pdf.LoadFromFile("Input.pdf");

            //Get the first page in the PDF document
            PdfPageBase page = pdf.Pages[0];

            //Load an image
            PdfImage image = PdfImage.FromFile("image.jpg");

            //Specify the width and height of the image area on the page
            float width = image.Width * 0.50f;
            float height = image.Height * 0.50f;

            //Specify the X and Y coordinates to start drawing the image
            float x = 180f;
            float y = 70f;

            //Draw the image at a specified location on the page
            page.Canvas.DrawImage(image, x, y, width, height);

            //Save the result document
            pdf.SaveToFile("AddImage.pdf", FileFormat.PDF);
        }
    }
}
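
The listing above scales the image by a fixed factor of 0.50. As an alternative, the sketch below computes a width and height that fit inside a maximum box while preserving the aspect ratio. The helper name FitWithin and the sample dimensions are hypothetical, and the code is plain arithmetic that does not call Spire.PDF.

```csharp
using System;

namespace ImageScaleSketch
{
    public static class ImageScaler
    {
        // Scales an image size down so that it fits inside the given box
        // while keeping the aspect ratio. Images that already fit are not enlarged.
        // Hypothetical helper for illustration; not part of Spire.PDF.
        public static (float Width, float Height) FitWithin(
            float imageWidth, float imageHeight, float maxWidth, float maxHeight)
        {
            if (imageWidth <= 0 || imageHeight <= 0)
            {
                throw new ArgumentException("Image dimensions must be positive.");
            }

            //The smaller ratio is the one that keeps both sides inside the box
            float scale = Math.Min(maxWidth / imageWidth, maxHeight / imageHeight);

            //Never upscale
            scale = Math.Min(scale, 1f);

            return (imageWidth * scale, imageHeight * scale);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            //Assumed values: a 1000 x 500 image placed in a 400 x 300 area
            (float width, float height) = ImageScaler.FitWithin(1000f, 500f, 400f, 300f);
            Console.WriteLine($"{width} x {height}");
        }
    }
}
```

The resulting width and height could be passed to DrawImage in place of the fixed 0.50 scaling used above.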

Replace an Image with Another Image in a PDF Document in C# and VB.NET

The following steps demonstrate how to replace an image with another image in a PDF document:

Initialize an instance of the PdfDocument class.
Load a PDF document using the PdfDocument.LoadFromFile() method.
Get the desired page in the PDF document through the PdfDocument.Pages[pageIndex] property.
Load an image using the PdfImage.FromFile() method.
Initialize an instance of the PdfImageHelper class.
Get the image information from the page using the PdfImageHelper.GetImagesInfo() method.
Replace a specific image on the page with the loaded image using the PdfImageHelper.ReplaceImage() method.
Save the result document using the PdfDocument.SaveToFile() method.

C#

using Spire.Pdf;
using Spire.Pdf.Graphics;
using Spire.Pdf.Utilities;

namespace ReplaceImage
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument instance
            PdfDocument doc = new PdfDocument();
            //Load a PDF document
            doc.LoadFromFile("AddImage.pdf");

            //Get the first page
            PdfPageBase page = doc.Pages[0];

            //Load an image
            PdfImage image = PdfImage.FromFile("image1.jpg");

            //Create a PdfImageHelper instance
            PdfImageHelper imageHelper = new PdfImageHelper();
            //Get the image information from the page
            PdfImageInfo[] imageInfo = imageHelper.GetImagesInfo(page);
            //Replace the first image on the page with the loaded image
            imageHelper.ReplaceImage(imageInfo[0], image);

            //Save the result document
            doc.SaveToFile("ReplaceImage.pdf", FileFormat.PDF);
        }
    }
}

Delete a Specific Image in a PDF Document in C# and VB.NET

The following steps demonstrate how to delete an image from a PDF document:

Initialize an instance of the PdfDocument class.
Load a PDF document using the PdfDocument.LoadFromFile() method.
Get the desired page in the PDF document through the PdfDocument.Pages[pageIndex] property.
Delete a specific image from the page using the PdfPageBase.DeleteImage() method.
Save the result document using the PdfDocument.SaveToFile() method.

C#

using Spire.Pdf;

namespace DeleteImage
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument instance
            PdfDocument pdf = new PdfDocument();
            //Load a PDF document
            pdf.LoadFromFile("AddImage.pdf");

            //Get the first page
            PdfPageBase page = pdf.Pages[0];

            //Delete the first image on the page
            page.DeleteImage(0);

            //Save the result document
            pdf.SaveToFile("DeleteImage.pdf", FileFormat.PDF);
        }
    }
}

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.
