OCR (Optical Character Recognition) is the standard technique for extracting text from images. Spire.OCR for Java gives developers a quick and efficient way to scan images and extract the text they contain. This article shows how to use Spire.OCR for Java to recognize and extract text from images in a Java project.

Obtaining Spire.OCR for Java

To scan and recognize text in images using Spire.OCR for Java, you first need to import the Spire.OCR.jar file, along with its other dependencies, into your Java project.

You can download Spire.OCR for Java from our website. If you are using Maven, you can add the following code to your project's pom.xml file to import the JAR file into your application.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.ocr</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>

Please download the other dependencies based on your operating system:

Linux

Windows x64

Install Dependencies

Step 1: Create a Java project in IntelliJ IDEA.

How to Scan and Recognize Text from Images in Java Projects

Step 2: Go to File > Project Structure > Modules > Dependencies in the menu and add Spire.OCR.jar as a project dependency.


Step 3: Download and extract the other dependency files. Copy all the files from the extracted "dependencies" folder to your project directory.
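A missing or empty dependency folder is an easy mistake to make at this step. The following is a minimal sketch, using only the Java standard library, of a check you could run before creating the scanner. The class and method names (DependencyCheck, isUsableDependencyDir) are hypothetical helpers for illustration and are not part of Spire.OCR; the folder name "dependencies" is the one used in this article's examples.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class DependencyCheck {

    // Returns true if 'dir' is an existing directory that contains at least one entry
    static boolean isUsableDependencyDir(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            // Treat an unreadable folder the same as a missing one
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the folder name used in this article; override with an argument
        Path dir = Path.of(args.length > 0 ? args[0] : "dependencies");
        if (isUsableDependencyDir(dir)) {
            System.out.println("Found dependency files in " + dir.toAbsolutePath());
        } else {
            System.out.println("No dependency files in " + dir.toAbsolutePath()
                    + "; copy the extracted files there first.");
        }
    }
}
```

The check only confirms that the folder exists and is not empty. It does not verify that the files match your operating system.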


Scanning and Recognizing Text from a Local Image

  • Java
import com.spire.ocr.OcrScanner;
import java.io.*;

public class ScanLocalImage {
    public static void main(String[] args) throws Exception {
        // Specify the path to the dependency files
        String dependencies = "dependencies/";
        // Specify the path to the image file to be scanned
        String imageFile = "data/Sample.png";
        // Specify the path to the output file
        String outputFile = "ScanLocalImage_out.txt";
        
        // Create an OcrScanner object
        OcrScanner scanner = new OcrScanner();
        // Set the dependency file path for the OcrScanner object
        scanner.setDependencies(dependencies);
        
        // Use the OcrScanner object to scan the specified image file
        scanner.scan(imageFile);
        
        // Get the scanned text content
        String scannedText = scanner.getText().toString();
        
        // Create an output file object
        File output = new File(outputFile);
        // If the output file already exists, delete it
        if (output.exists()) {
            output.delete();
        }
        // Create a BufferedWriter object to write content to the output file
        BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile));
        // Write the scanned text content to the output file
        writer.write(scannedText);
        // Close the BufferedWriter object to release resources
        writer.close();
    }
}
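The file-writing part of the example above can be written more compactly. The following is a minimal sketch, assuming Java 11 or later, that replaces the delete-then-write sequence with a single call and sets the encoding explicitly, which matters when the recognized text contains non-Latin characters. TextOutput and save are hypothetical names for illustration and are not part of Spire.OCR.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TextOutput {

    // Writes 'text' to 'target' as UTF-8, creating the file or replacing its contents
    static void save(Path target, String text) throws IOException {
        Files.writeString(target, text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // In the example above, this string would be the text returned by the scanner
        String scannedText = "Sample recognized text";
        save(Path.of("ScanLocalImage_out.txt"), scannedText);
    }
}
```

By default Files.writeString truncates an existing file, so there is no need to delete the old output first.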

Specify the Language File to Scan and Recognize Text from an Image

  • Java
import com.spire.ocr.OcrScanner;
import java.io.*;

public class ScanImageWithLanguageSelection {
    public static void main(String[] args) throws Exception {
        // Specify the path to the dependency files
        String dependencies = "dependencies/";
        // Specify the path to the language file
        String languageFile = "data/japandata";
        // Specify the path to the image file to be scanned
        String imageFile = "data/JapaneseSample.png";
        // Specify the path to the output file
        String outputFile = "ScanImageWithLanguageSelection_out.txt";
        
        // Create an OcrScanner object
        OcrScanner scanner = new OcrScanner();
        // Set the dependency file path for the OcrScanner object
        scanner.setDependencies(dependencies);
        // Load the specified language file
        scanner.loadLanguageFile(languageFile);
        
        // Use the OcrScanner object to scan the specified image file
        scanner.scan(imageFile);
        // Get the scanned text content
        String scannedText = scanner.getText().toString();

        // Create an output file object
        File output = new File(outputFile);
        // If the output file already exists, delete it
        if (output.exists()) {
            output.delete();
        }

        // Create a BufferedWriter object to write content to the output file
        BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile));
        // Write the scanned text content to the output file
        writer.write(scannedText);
        // Close the BufferedWriter object to release resources
        writer.close();
    }
}

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents or lift the functional limitations, please request a 30-day trial license.

Java: Compare PDF Documents

2023-10-16 01:30:33 Written by Koohji

Comparing PDF documents is an essential part of effective document management. A comparison highlights the differences between two documents, making it easier to understand, revise, and merge their content. This article shows how to use Spire.PDF for Java to compare PDF documents and find the differences.

Two sample PDF documents with differing content are used for the comparison in this article.

Install Spire.PDF for Java

First of all, you need to add the Spire.Pdf.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file by adding the following code to your project's pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.12.16</version>
    </dependency>
</dependencies>

Compare Two PDF Documents

Spire.PDF for Java provides the PdfComparer class, which is constructed from the two PDF documents to be compared. After creating the PdfComparer object, users can call the PdfComparer.compare(String fileName) method to compare the two documents and save the result as a new PDF file.

The resulting PDF document displays the two original documents on the left and the right, with the deleted items in red and the added items in yellow.

The detailed steps for comparing two PDF documents are as follows:

  • Create two objects of PdfDocument class and load two PDF documents using PdfDocument.loadFromFile() method.
  • Create an object of PdfComparer class with the two documents.
  • Compare the two documents and save the result as a new PDF document using PdfComparer.compare() method.

  • Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.comparison.PdfComparer;

public class ComparePDF {
    public static void main(String[] args) {
        //Create an object of PdfDocument class and load a PDF document
        PdfDocument pdf1 = new PdfDocument();
        pdf1.loadFromFile("Sample1.pdf");

        //Create another object of PdfDocument class and load another PDF document
        PdfDocument pdf2 = new PdfDocument();
        pdf2.loadFromFile("Sample2.pdf");

        //Create an object of PdfComparer class
        PdfComparer comparer = new PdfComparer(pdf1,pdf2);

        //Compare the two PDF documents and save the compare results to a new document
        comparer.compare("ComparisonResult.pdf");
    }
}
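Before handing two files to the comparer, it can help to confirm that both are actually PDF files. The following is a minimal sketch, assuming Java 11 or later, that checks whether a file starts with the "%PDF-" header that PDF files begin with. PdfInputCheck and looksLikePdf are hypothetical names for illustration and are not part of Spire.PDF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class PdfInputCheck {

    // The byte sequence expected at the very start of a PDF file
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if 'file' is a regular file whose first bytes are "%PDF-"
    static boolean looksLikePdf(Path file) {
        if (!Files.isRegularFile(file)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        } catch (IOException e) {
            // An unreadable file cannot be compared either
            return false;
        }
    }

    public static void main(String[] args) {
        // File names taken from the example above
        for (String name : new String[] {"Sample1.pdf", "Sample2.pdf"}) {
            System.out.println(name + " looks like a PDF: " + looksLikePdf(Path.of(name)));
        }
    }
}
```

This is only a quick sanity check on the first five bytes. It does not validate the rest of the file.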


Compare a Specified Page Range of Two PDF Documents

Before comparing, users can use the PdfComparer.getOptions().setPageRanges() method to limit the page range to be compared. The detailed steps are as follows:

  • Create two objects of PdfDocument class and load two PDF documents using PdfDocument.loadFromFile() method.
  • Create an object of PdfComparer class with the two documents.
  • Set the page range to be compared using PdfComparer.getOptions().setPageRanges() method.
  • Compare the two documents and save the result as a new PDF document using PdfComparer.compare() method.

  • Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.comparison.PdfComparer;

public class ComparePDFPageRange {
    public static void main(String[] args) {
        //Create an object of PdfDocument class and load a PDF document
        PdfDocument pdf1 = new PdfDocument();
        pdf1.loadFromFile("G:/Documents/Sample6.pdf");

        //Create another object of PdfDocument class and load another PDF document
        PdfDocument pdf2 = new PdfDocument();
        pdf2.loadFromFile("G:/Documents/Sample7.pdf");

        //Create an object of PdfComparer class
        PdfComparer comparer = new PdfComparer(pdf1,pdf2);

        //Set the page range to be compared
        comparer.getOptions().setPageRanges(1, 1, 1, 1);

        //Compare the two PDF documents and save the compare results to a new document
        comparer.compare("ComparisonResult.pdf");
    }
}
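This article does not spell out what the four arguments to setPageRanges() mean. The sketch below assumes they are the first and last page to compare in the first document, followed by the first and last page in the second document, counted from 1; please check this against the Spire.PDF API reference before relying on it. Under that assumption, a small guard can reject ranges that fall outside a document before they reach the comparer. PageRangeCheck and isValidRange are hypothetical names for illustration, and the page counts would come from each loaded document.

```java
class PageRangeCheck {

    // True if [start, end] is a non-empty, 1-based range inside a document of 'pageCount' pages
    static boolean isValidRange(int start, int end, int pageCount) {
        return start >= 1 && start <= end && end <= pageCount;
    }

    public static void main(String[] args) {
        // Illustrative page counts, not taken from real files
        int pagesInFirst = 5;
        int pagesInSecond = 3;

        System.out.println(isValidRange(1, 1, pagesInFirst));   // true
        System.out.println(isValidRange(2, 4, pagesInSecond));  // false: page 4 does not exist
    }
}
```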



Java: Extract Text from HTML

2023-08-29 01:53:40 Written by Koohji

HTML (Hypertext Markup Language) has become one of the most commonly used text markup languages on the Internet, and nearly all web pages are created using HTML. While HTML contains numerous tags and formatting information, the most valuable content is typically the visible text. It is important to know how to extract the text content from an HTML file when users intend to utilize it for tasks such as editing, AI training, or storing in databases. This article will demonstrate how to extract text from HTML using Spire.Doc for Java within Java programs.

Install Spire.Doc for Java

First of all, you're required to add the Spire.Doc.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file in your application by adding the following code to your project's pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>14.1.3</version>
    </dependency>
</dependencies>

Extract Text from HTML File

Spire.Doc for Java supports loading HTML files using the Document.loadFromFile(fileName, FileFormat.Html) method. Then, users can use Document.getText() method to get the text that is visible in browsers and write it to a TXT file. The detailed steps are as follows:

  • Create an object of Document class.
  • Load an HTML file using Document.loadFromFile(fileName, FileFormat.Html) method.
  • Get the text of the HTML file using Document.getText() method.
  • Write the text to a TXT file.

  • Java
import com.spire.doc.Document;
import com.spire.doc.FileFormat;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromHTML {
    public static void main(String[] args) throws IOException {

        //Create an object of Document class
        Document doc = new Document();

        //Load an HTML file
        doc.loadFromFile("Sample.html", FileFormat.Html);

        //Get text from the HTML file
        String text = doc.getText();

        //Write the text to a TXT file
        FileWriter fileWriter = new FileWriter("HTMLText.txt");
        fileWriter.write(text);
        fileWriter.close();
    }
}
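Text extracted from a web page often contains trailing spaces and long runs of blank lines, which can be inconvenient when the text is stored in a database or used as training data. The following is a minimal sketch, assuming Java 11 or later, of an optional cleanup step that could be applied to the extracted string before writing it out. TextCleanup and normalize are hypothetical names for illustration and are not part of Spire.Doc.

```java
import java.util.ArrayList;
import java.util.List;

class TextCleanup {

    // Strips trailing whitespace from each line and collapses runs of blank lines into one
    static String normalize(String text) {
        List<String> out = new ArrayList<>();
        boolean previousBlank = true; // starting as true drops leading blank lines
        for (String line : text.split("\\R", -1)) {
            String trimmed = line.stripTrailing();
            boolean blank = trimmed.isEmpty();
            if (blank && previousBlank) {
                continue; // skip repeated blank lines
            }
            out.add(trimmed);
            previousBlank = blank;
        }
        // Drop the single blank line that may remain at the end
        if (!out.isEmpty() && out.get(out.size() - 1).isEmpty()) {
            out.remove(out.size() - 1);
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        // In the example above, this string would be the value returned by doc.getText()
        String raw = "\n\nTitle  \n\n\n\nBody\t\n\n";
        System.out.println(normalize(raw));
    }
}
```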

[Figure: the sample HTML web page and the text extracted from it]

Extract Text from URL

To extract text from a URL, users need to create a custom method to retrieve the HTML file from the URL and then extract the text from it. The detailed steps are as follows:

  • Create an object of Document class.
  • Use the custom method readHTML() to get the HTML file from a URL and return the file path.
  • Load the HTML file using Document.loadFromFile(filename, FileFormat.Html) method.
  • Get the text from the HTML file using Document.getText() method.
  • Write the text to a TXT file.

  • Java
import com.spire.doc.Document;
import com.spire.doc.FileFormat;

import java.io.*;
import java.net.URL;
import java.net.URLConnection;

public class ExtractTextFromURL {
    public static void main(String[] args) throws IOException {
        //Create an object of Document
        Document doc = new Document();

        //Call the custom method to load the HTML file from a URL
        doc.loadFromFile(readHTML("https://aeon.co/essays/how-to-face-the-climate-crisis-with-spinoza-and-self-knowledge", "output.html"), FileFormat.Html);

        //Get the text from the HTML file
        String urlText = doc.getText();

        //Write the text to a TXT file and close the writer so the content is flushed to disk
        FileWriter fileWriter = new FileWriter("URLText.txt");
        fileWriter.write(urlText);
        fileWriter.close();
    }

    public static String readHTML(String urlString, String saveHtmlFilePath) throws IOException {

        //Create an object of URL class
        URL url = new URL(urlString);

        //Open the URL
        URLConnection connection = url.openConnection();

        //Save the url as an HTML file
        BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(saveHtmlFilePath), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.newLine();
        }

        reader.close();
        writer.close();

        //Return the file path of the saved HTML file
        return saveHtmlFilePath;
    }
}
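The custom readHTML() method can also be written so that the streams are closed even when an error occurs and so that a stalled server does not block the program indefinitely. The following is a minimal sketch using only the Java standard library. HtmlDownloader and download are hypothetical names for illustration, and the 10-second timeouts are arbitrary values chosen for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class HtmlDownloader {

    // Saves the content behind 'urlString' to 'target' and returns the target path
    static Path download(String urlString, Path target) throws IOException {
        URLConnection connection = URI.create(urlString).toURL().openConnection();
        connection.setConnectTimeout(10_000); // milliseconds
        connection.setReadTimeout(10_000);    // milliseconds

        // try-with-resources closes the stream even if the copy fails
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Pass the page address as the first argument
        if (args.length == 0) {
            System.out.println("Usage: java HtmlDownloader <url>");
            return;
        }
        Path saved = download(args[0], Path.of("output.html"));
        System.out.println("Saved to " + saved.toAbsolutePath());
    }
}
```

Unlike readHTML(), this version copies the raw bytes instead of decoding them as UTF-8 and rewriting line breaks, so the saved file keeps whatever encoding the server sent.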

[Figure: the web page at the URL and the text extracted from it]

