OCR (Optical Character Recognition) technology is the primary method for extracting text from images. Spire.OCR for Java gives developers a quick and efficient way to scan images and extract the text they contain. This article shows how to use Spire.OCR for Java to recognize and extract text from images in Java projects.
Obtaining Spire.OCR for Java
To scan and recognize text in images using Spire.OCR for Java, you first need to import the Spire.OCR.jar file, along with its other dependencies, into your Java project.
You can download Spire.OCR for Java from our website. If you are using Maven, you can add the following code to your project's pom.xml file to import the JAR file into your application.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.ocr</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>
Please download the other dependencies based on your operating system:
Install Dependencies
Step 1: Create a Java project in IntelliJ IDEA.

Step 2: Go to File > Project Structure > Modules > Dependencies in the menu and add Spire.OCR.jar as a project dependency.

Step 3: Download and extract the other dependency files. Copy all the files from the extracted "dependencies" folder to your project directory.
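Because the dependency files differ by operating system, it can help to confirm which platform the program is running on and that the copied folder is actually present before the scanner is configured. The following is a minimal sketch using only the standard library. The class name DependencyCheck, its method names, and the folder name "dependencies" are illustrative assumptions and are not part of Spire.OCR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical helper, not part of Spire.OCR
final class DependencyCheck {

    // Map the value of the "os.name" system property to a coarse OS family
    static String osFamily(String osName) {
        String name = osName.toLowerCase(Locale.ROOT);
        // Check macOS first: "darwin" also contains the substring "win"
        if (name.contains("mac") || name.contains("darwin")) {
            return "mac";
        }
        if (name.contains("win")) {
            return "windows";
        }
        if (name.contains("nux")) {
            return "linux";
        }
        return "other";
    }

    // True if the path is an existing directory that contains at least one entry
    static boolean hasFiles(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Detected OS family: " + osFamily(System.getProperty("os.name")));
        // "dependencies" is the folder name assumed in this article's examples
        System.out.println("Dependency folder usable: " + hasFiles(Paths.get("dependencies")));
    }
}
```

A check like hasFiles() only confirms that the folder is non-empty. It does not verify that the files inside match the detected operating system.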

Scanning and Recognizing Text from a Local Image
- Java
import com.spire.ocr.OcrScanner;
import java.io.*;

public class ScanLocalImage {
    public static void main(String[] args) throws Exception {
        // Specify the path to the dependency files
        String dependencies = "dependencies/";
        // Specify the path to the image file to be scanned
        String imageFile = "data/Sample.png";
        // Specify the path to the output file
        String outputFile = "ScanLocalImage_out.txt";
        // Create an OcrScanner object
        OcrScanner scanner = new OcrScanner();
        // Set the dependency file path for the OcrScanner object
        scanner.setDependencies(dependencies);
        // Use the OcrScanner object to scan the specified image file
        scanner.scan(imageFile);
        // Get the scanned text content
        String scannedText = scanner.getText().toString();
        // Create an output file object
        File output = new File(outputFile);
        // If the output file already exists, delete it
        if (output.exists()) {
            output.delete();
        }
        // Create a BufferedWriter object to write content to the output file
        BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile));
        // Write the scanned text content to the output file
        writer.write(scannedText);
        // Close the BufferedWriter object to release resources
        writer.close();
    }
}
Specify the Language File to Scan and Recognize Text from an Image
- Java
import com.spire.ocr.OcrScanner;
import java.io.*;

public class ScanImageWithLanguageSelection {
    public static void main(String[] args) throws Exception {
        // Specify the path to the dependency files
        String dependencies = "dependencies/";
        // Specify the path to the language file
        String languageFile = "data/japandata";
        // Specify the path to the image file to be scanned
        String imageFile = "data/JapaneseSample.png";
        // Specify the path to the output file
        String outputFile = "ScanImageWithLanguageSelection_out.txt";
        // Create an OcrScanner object
        OcrScanner scanner = new OcrScanner();
        // Set the dependency file path for the OcrScanner object
        scanner.setDependencies(dependencies);
        // Load the specified language file
        scanner.loadLanguageFile(languageFile);
        // Use the OcrScanner object to scan the specified image file
        scanner.scan(imageFile);
        // Get the scanned text content
        String scannedText = scanner.getText().toString();
        // Create an output file object
        File output = new File(outputFile);
        // If the output file already exists, delete it
        if (output.exists()) {
            output.delete();
        }
        // Create a BufferedWriter object to write content to the output file
        BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile));
        // Write the scanned text content to the output file
        writer.write(scannedText);
        // Close the BufferedWriter object to release resources
        writer.close();
    }
}
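When the recognized text is not plain ASCII, as with Japanese, the encoding of the output file matters. FileWriter uses the platform default charset on Java versions before 18, so the same program may produce different bytes on different machines. The sketch below writes the text as UTF-8 explicitly and closes the writer with try-with-resources. The class TextOutput and its save() method are hypothetical names introduced here for illustration and are not part of Spire.OCR.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Spire.OCR
final class TextOutput {

    // Write text as UTF-8, creating the file or overwriting an existing one
    static void save(String text, Path file) throws IOException {
        // try-with-resources closes the writer even if write() throws
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }
}
```

In the examples above, a call such as TextOutput.save(scannedText, Paths.get(outputFile)) could stand in for the delete-then-write steps, since Files.newBufferedWriter truncates an existing file by default.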
Apply for a Temporary License
If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.
Comparing PDF documents is essential for effective document management. A comparison lets users quickly identify the differences in content between two documents, which makes it easier to revise and merge them. This article introduces how to use Spire.PDF for Java to compare PDF documents and find the differences.
Examples of the two PDF documents that will be used for comparison:

Install Spire.PDF for Java
First of all, you need to add the Spire.Pdf.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file by adding the following code to your project's pom.xml file.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>11.12.16</version>
    </dependency>
</dependencies>
Compare Two PDF Documents
Spire.PDF for Java provides the PdfComparer class, which is constructed from the two PDF documents to be compared. After creating the PdfComparer object, users can call the PdfComparer.compare(String fileName) method to compare the two documents and save the result as a new PDF file.
The resulting PDF document displays the two original documents on the left and the right, with the deleted items in red and the added items in yellow.
The detailed steps for comparing two PDF documents are as follows:
- Create two objects of PdfDocument class and load two PDF documents using PdfDocument.loadFromFile() method.
- Create an object of PdfComparer class with the two documents.
- Compare the two documents and save the result as a new PDF document using PdfComparer.compare() method.
- Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.comparison.PdfComparer;

public class ComparePDF {
    public static void main(String[] args) {
        //Create an object of PdfDocument class and load a PDF document
        PdfDocument pdf1 = new PdfDocument();
        pdf1.loadFromFile("Sample1.pdf");
        //Create another object of PdfDocument class and load another PDF document
        PdfDocument pdf2 = new PdfDocument();
        pdf2.loadFromFile("Sample2.pdf");
        //Create an object of PdfComparer class
        PdfComparer comparer = new PdfComparer(pdf1, pdf2);
        //Compare the two PDF documents and save the comparison result to a new document
        comparer.compare("ComparisonResult.pdf");
    }
}

Compare a Specified Page Range of Two PDF Documents
Before comparing, users can use the PdfComparer.getOptions().setPageRanges() method to limit the page range to be compared. The detailed steps are as follows:
- Create two objects of PdfDocument class and load two PDF documents using PdfDocument.loadFromFile() method.
- Create an object of PdfComparer class with the two documents.
- Set the page range to be compared using PdfComparer.getOptions().setPageRanges() method.
- Compare the two documents and save the result as a new PDF document using PdfComparer.compare() method.
- Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.comparison.PdfComparer;

public class ComparePDFPageRange {
    public static void main(String[] args) {
        //Create an object of PdfDocument class and load a PDF document
        PdfDocument pdf1 = new PdfDocument();
        pdf1.loadFromFile("G:/Documents/Sample6.pdf");
        //Create another object of PdfDocument class and load another PDF document
        PdfDocument pdf2 = new PdfDocument();
        pdf2.loadFromFile("G:/Documents/Sample7.pdf");
        //Create an object of PdfComparer class
        PdfComparer comparer = new PdfComparer(pdf1, pdf2);
        //Set the page range to be compared
        comparer.getOptions().setPageRanges(1, 1, 1, 1);
        //Compare the two PDF documents and save the comparison result to a new document
        comparer.compare("ComparisonResult.pdf");
    }
}
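A page range that falls outside a document is an easy mistake to make, so it may be worth validating the numbers before passing them to setPageRanges(). The sketch below assumes that the four arguments are the start and end page of the first document followed by the start and end page of the second, and that pages are counted from 1, which is what the call setPageRanges(1, 1, 1, 1) above suggests. Check both points against the library documentation. PageRangeCheck and isValid() are hypothetical names that are not part of Spire.PDF.

```java
// Hypothetical helper, not part of Spire.PDF
final class PageRangeCheck {

    // True if start..end is a non-empty, 1-based range inside a document with pageCount pages
    static boolean isValid(int start, int end, int pageCount) {
        return start >= 1 && start <= end && end <= pageCount;
    }

    public static void main(String[] args) {
        int pageCount = 3; // placeholder value for illustration only
        System.out.println(isValid(1, 1, pageCount)); // prints true
        System.out.println(isValid(2, 5, pageCount)); // prints false
    }
}
```

In a real program the page count would come from each loaded document, for example through pdf1.getPages().getCount() if your version of the library exposes it, and the check would run once per document.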

HTML (Hypertext Markup Language) has become one of the most commonly used text markup languages on the Internet, and nearly all web pages are created using HTML. While HTML contains numerous tags and formatting information, the most valuable content is typically the visible text. It is important to know how to extract the text content from an HTML file when users intend to utilize it for tasks such as editing, AI training, or storing in databases. This article will demonstrate how to extract text from HTML using Spire.Doc for Java within Java programs.
Install Spire.Doc for Java
First of all, you're required to add the Spire.Doc.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file in your application by adding the following code to your project's pom.xml file.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>14.1.3</version>
    </dependency>
</dependencies>
Extract Text from HTML File
Spire.Doc for Java supports loading HTML files using the Document.loadFromFile(fileName, FileFormat.Html) method. Then, users can use Document.getText() method to get the text that is visible in browsers and write it to a TXT file. The detailed steps are as follows:
- Create an object of Document class.
- Load an HTML file using Document.loadFromFile(fileName, FileFormat.Html) method.
- Get the text of the HTML file using Document.getText() method.
- Write the text to a TXT file.
- Java
import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromHTML {
    public static void main(String[] args) throws IOException {
        //Create an object of Document class
        Document doc = new Document();
        //Load an HTML file
        doc.loadFromFile("Sample.html", FileFormat.Html);
        //Get text from the HTML file
        String text = doc.getText();
        //Write the text to a TXT file
        FileWriter fileWriter = new FileWriter("HTMLText.txt");
        fileWriter.write(text);
        fileWriter.close();
    }
}
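Text extracted from a web page often carries stray indentation and runs of blank lines, which can get in the way when the text is stored in a database or used as training data. The following is a minimal sketch of a clean-up step that could be applied to the extracted string before it is written out. TextCleanup and normalize() are hypothetical names that are not part of Spire.Doc, and the rules chosen here (trim each line, collapse repeated spaces, keep at most one blank line in a row) are only one reasonable option.

```java
// Hypothetical helper, not part of Spire.Doc
final class TextCleanup {

    // Trim each line, collapse runs of spaces/tabs, and keep at most one blank line in a row
    static String normalize(String raw) {
        StringBuilder out = new StringBuilder();
        boolean previousBlank = true; // starts true so leading blank lines are dropped
        for (String line : raw.split("\\R")) {
            // Collapse spaces, tabs and non-breaking spaces into a single space
            String cleaned = line.replaceAll("[ \\t\\u00A0]+", " ").trim();
            if (cleaned.isEmpty()) {
                if (!previousBlank) {
                    out.append('\n');
                }
                previousBlank = true;
            } else {
                out.append(cleaned).append('\n');
                previousBlank = false;
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String sample = "  Title \r\n\r\n\r\n Body  text \n";
        System.out.println(normalize(sample));
    }
}
```

For the sample string in main(), the result is "Title", one blank line, then "Body text".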
HTML Web Page:

Extracted Text:

Extract Text from URL
To extract text from a URL, users need to create a custom method to retrieve the HTML file from the URL and then extract the text from it. The detailed steps are as follows:
- Create an object of Document class.
- Use the custom method readHTML() to get the HTML file from a URL and return the file path.
- Load the HTML file using Document.loadFromFile(filename, FileFormat.Html) method.
- Get the text from the HTML file using Document.getText() method.
- Write the text to a TXT file.
- Java
import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import java.io.*;
import java.net.URL;
import java.net.URLConnection;

public class ExtractTextFromURL {
    public static void main(String[] args) throws IOException {
        //Create an object of Document
        Document doc = new Document();
        //Call the custom method to load the HTML file from a URL
        doc.loadFromFile(readHTML("https://aeon.co/essays/how-to-face-the-climate-crisis-with-spinoza-and-self-knowledge", "output.html"), FileFormat.Html);
        //Get the text from the HTML file
        String urlText = doc.getText();
        //Write the text to a TXT file
        FileWriter fileWriter = new FileWriter("URLText.txt");
        fileWriter.write(urlText);
        //Close the writer so the buffered text is flushed to disk
        fileWriter.close();
    }

    public static String readHTML(String urlString, String saveHtmlFilePath) throws IOException {
        //Create an object of URL class
        URL url = new URL(urlString);
        //Open the URL
        URLConnection connection = url.openConnection();
        //Save the content at the URL as an HTML file
        BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(saveHtmlFilePath), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.newLine();
        }
        reader.close();
        writer.close();
        //Return the file path of the saved HTML file
        return saveHtmlFilePath;
    }
}
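The custom download method can also be written so that the streams are always closed and the page bytes are saved unchanged, which avoids assuming that every page is encoded as UTF-8. The sketch below is one possible variant using only the standard library. The class name HtmlDownload and the timeout values are assumptions made for illustration, and readHTML() here is an alternative to the method shown above, not part of Spire.Doc.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Spire.Doc
final class HtmlDownload {

    // Save the content at urlString to saveHtmlFilePath and return that path
    static String readHTML(String urlString, String saveHtmlFilePath) throws IOException {
        URL url = URI.create(urlString).toURL();
        URLConnection connection = url.openConnection();
        // Illustrative timeouts so a slow server cannot block the program indefinitely
        connection.setConnectTimeout(10_000);
        connection.setReadTimeout(30_000);
        // try-with-resources closes the stream even if the copy fails
        try (InputStream in = connection.getInputStream()) {
            // Copy the raw bytes, replacing any file left over from an earlier run
            Files.copy(in, Paths.get(saveHtmlFilePath), StandardCopyOption.REPLACE_EXISTING);
        }
        return saveHtmlFilePath;
    }
}
```

The returned path can be passed to Document.loadFromFile(filename, FileFormat.Html) exactly as in the example above. Sites that require specific request headers or redirect handling would need more than this sketch provides.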
URL Web Page:

Extracted Text:
