Spire.Office Knowledgebase Page 23 | E-iceblue

RTF files are versatile, containing text, images, and formatting information. Converting these files into PDF and HTML ensures that they are accessible and display consistently across various devices and browsers. Whether you're building a document viewer or integrating document management features into your application, mastering RTF conversion is a valuable skill.

In this article, you will learn how to convert RTF to PDF and RTF to HTML in React using Spire.Doc for JavaScript.

Install Spire.Doc for JavaScript

To get started with converting RTF to PDF and HTML in a React application, you can either download Spire.Doc for JavaScript from our website or install it via npm with the following command:

npm i spire.doc

After that, copy the "Spire.Doc.Base.js" and "Spire.Doc.Base.wasm" files to the public folder of your project. Additionally, include the required font files to ensure accurate and consistent text rendering.

For more details, refer to the documentation: How to Integrate Spire.Doc for JavaScript in a React Project
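
As a minimal sketch of the font step, the helper below copies a list of font files from the public folder into the virtual file system. It assumes the FetchFileToVFS(fileName, vfsDirectory, baseUrl) call shape used in the examples in this article. The helper name loadFontsToVfs and its parameters are hypothetical and not part of Spire.Doc.

```javascript
// Hypothetical helper (not part of Spire.Doc): loads several font files into the VFS.
// Assumes wasmModule.FetchFileToVFS(fileName, vfsDirectory, baseUrl) returns a promise,
// as it is used in this article's examples.
async function loadFontsToVfs(wasmModule, fontFiles, baseUrl, fontDir = '/Library/Fonts/') {
  const loaded = [];
  for (const fontFile of fontFiles) {
    try {
      // Fetch one file at a time so a failure can be reported against a specific font
      await wasmModule.FetchFileToVFS(fontFile, fontDir, baseUrl);
      loaded.push(fontFile);
    } catch (err) {
      throw new Error(`Could not load font "${fontFile}": ${err.message}`);
    }
  }
  return loaded;
}

// Possible usage inside a conversion handler:
//   await loadFontsToVfs(wasmModule, ['times.ttf', 'timesbd.ttf'], `${process.env.PUBLIC_URL}/`);
```

Naming the failing font in the error makes a missing file in the public folder easier to spot than a generic fetch failure.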

Convert RTF to PDF with JavaScript

With Spire.Doc for JavaScript, converting an RTF file to PDF takes two calls: load the file with the Document.LoadFromFile() method, which preserves its formatting, and then save it as a PDF with the Document.SaveToFile() method.
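
The load-then-save sequence can be wrapped in a small function. This is only a sketch: convertInVfs and its parameter names are hypothetical, and it assumes the Document.Create(), LoadFromFile(), SaveToFile(), FS.readFile() and Dispose() calls behave as they are used in this article's examples.

```javascript
// Hypothetical wrapper (not part of Spire.Doc) around the calls used in this article.
async function convertInVfs(wasmModule, { inputFileName, outputFileName, fileFormat, baseUrl }) {
  // Copy the input file from the web server into the virtual file system
  await wasmModule.FetchFileToVFS(inputFileName, '', baseUrl);

  const doc = wasmModule.Document.Create();
  try {
    doc.LoadFromFile(inputFileName);
    doc.SaveToFile({ fileName: outputFileName, fileFormat });
    // Return the converted bytes read back from the virtual file system
    return wasmModule.FS.readFile(outputFileName);
  } finally {
    // Release the document even if loading or saving throws
    doc.Dispose();
  }
}
```

The try/finally block is a design choice of this sketch: it releases the document even when a conversion fails partway through.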

Here are the steps to convert RTF to PDF in React using Spire.Doc for JavaScript:

  • Load the font files used in the RTF document into the virtual file system (VFS).
  • Create a new Document object using the wasmModule.Document.Create() method.
  • Load the input RTF file using the Document.LoadFromFile() method.
  • Save the document as a PDF file using the Document.SaveToFile() method.
  • Generate a Blob from the PDF file, create a download link, and trigger the download.

JavaScript:
import React, { useState, useEffect } from 'react';

function App() {

  // State to hold the loaded WASM module
  const [wasmModule, setWasmModule] = useState(null);

  // useEffect hook to load the WASM module when the component mounts
  useEffect(() => {
    const loadWasm = async () => {
      try {

        // Access the Module and spiredoc from the global window object
        const { Module, spiredoc } = window;

        // Set the wasmModule state when the runtime is initialized
        Module.onRuntimeInitialized = () => {
          setWasmModule(spiredoc);
        };
      } catch (err) {

        // Log any errors that occur during loading
        console.error('Failed to load WASM module:', err);
      }
    };

    // Create a script element to load the WASM JavaScript file
    const script = document.createElement('script');
    script.src = `${process.env.PUBLIC_URL}/Spire.Doc.Base.js`;
    script.onload = loadWasm;

    // Append the script to the document body
    document.body.appendChild(script);

    // Cleanup function to remove the script when the component unmounts
    return () => {
      document.body.removeChild(script);
    };
  }, []); 

  // Function to convert RTF to PDF
  const convertRtfToPdf = async () => {
    if (wasmModule) {

      // Load the font files into the virtual file system (VFS)
      await wasmModule.FetchFileToVFS('times.ttf', '/Library/Fonts/', `${process.env.PUBLIC_URL}/`);
      await wasmModule.FetchFileToVFS('timesbd.ttf', '/Library/Fonts/', `${process.env.PUBLIC_URL}/`);
      await wasmModule.FetchFileToVFS('timesbi.ttf', '/Library/Fonts/', `${process.env.PUBLIC_URL}/`);
      await wasmModule.FetchFileToVFS('timesi.ttf', '/Library/Fonts/', `${process.env.PUBLIC_URL}/`);
      
      // Specify the input file path
      const inputFileName = 'input.rtf'; 
   
      // Create a new document
      const doc = wasmModule.Document.Create();

      // Fetch the input file and add it to the VFS
      await wasmModule.FetchFileToVFS(inputFileName, '', `${process.env.PUBLIC_URL}/`);
      
      // Load the RTF file
      doc.LoadFromFile(inputFileName);

      // Define the output file name
      const outputFileName = "RtfToPdf.pdf";

      // Save the document to the specified path
      doc.SaveToFile({fileName: outputFileName, fileFormat: wasmModule.FileFormat.PDF});
 
      // Read the generated PDF file from VFS
      const modifiedFileArray = wasmModule.FS.readFile(outputFileName);

      // Create a Blob object from the PDF file
      const modifiedFile = new Blob([modifiedFileArray], { type: 'application/pdf'});

      // Create a URL for the Blob
      const url = URL.createObjectURL(modifiedFile);

      // Create an anchor element to trigger the download
      const a = document.createElement('a');
      a.href = url;
      a.download = outputFileName;
      document.body.appendChild(a);
      a.click(); 
      document.body.removeChild(a); 
      URL.revokeObjectURL(url); 

      // Clean up resources
      doc.Dispose();
    }
  };

  return (
    <div style={{ textAlign: 'center', height: '300px' }}>
      <h1>Convert RTF to PDF in React</h1>
      <button onClick={convertRtfToPdf} disabled={!wasmModule}>
        Convert
      </button>
    </div>
  );
}

export default App;

Run the code to launch the React app at localhost:3000. Click "Convert," and a "Save As" window will appear, prompting you to save the output file in your chosen folder.
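
The Blob and anchor steps that trigger the download can be factored into one function. The sketch below is an assumption-laden illustration: triggerDownload is a hypothetical name, and the env parameter exists only so the function can be exercised outside a browser. In a browser it defaults to the global object, which provides Blob, URL and document.

```javascript
// Hypothetical helper: wraps the Blob + anchor download steps used in the component above.
function triggerDownload(bytes, fileName, mimeType, env = globalThis) {
  // Wrap the bytes in a Blob and expose it through a temporary object URL
  const blob = new env.Blob([bytes], { type: mimeType });
  const url = env.URL.createObjectURL(blob);

  // A temporary anchor element with a download attribute starts the download
  const a = env.document.createElement('a');
  a.href = url;
  a.download = fileName;
  env.document.body.appendChild(a);
  try {
    a.click();
  } finally {
    // Remove the anchor and release the object URL in every case
    env.document.body.removeChild(a);
    env.URL.revokeObjectURL(url);
  }
  return fileName;
}
```

With such a helper, the component would end with a call like triggerDownload(modifiedFileArray, outputFileName, 'application/pdf').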

[Image: React app running at localhost:3000]

Below is a screenshot of the generated PDF document:

[Image: Convert RTF to PDF in React]

Convert RTF to HTML with JavaScript

When converting RTF to HTML, it's crucial to decide whether to embed image files and CSS stylesheets as internal resources, as these elements significantly impact the HTML file's display.

With Spire.Doc for JavaScript, you can easily configure these settings using the Document.HtmlExportOptions.CssStyleSheetType and Document.HtmlExportOptions.ImageEmbedded properties.
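
To make "embedded" concrete: a common way to keep an image inside a single HTML file is a data URI, and the options table for Document.HtmlExportOptions on this page describes ImageEmbedded in those terms. The sketch below shows the general mechanism only. It is not Spire.Doc's internal implementation, and the function names are hypothetical.

```javascript
// Illustration only: builds a data URI from raw bytes, the general mechanism by which
// a binary resource such as an image can live inside an HTML file.
function toDataUri(bytes, mimeType) {
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  // btoa is available in browsers and in recent Node.js versions
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// A single-file HTML page can then reference the image without a separate file
function imgTag(bytes, mimeType, alt) {
  return `<img src="${toDataUri(bytes, mimeType)}" alt="${alt}">`;
}
```

Base64 encoding makes embedded resources larger than the original files, which is the usual trade-off against the convenience of a single self-contained HTML file.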

Here are the steps to convert RTF to HTML with embedded images and CSS stylesheets using Spire.Doc for JavaScript:

  • Load the font files used in the RTF document into the virtual file system (VFS).
  • Create a new Document object using the wasmModule.Document.Create() method.
  • Load the input RTF file using the Document.LoadFromFile() method.
  • Embed the CSS stylesheet in the HTML file by setting the Document.HtmlExportOptions.CssStyleSheetType property to Internal.
  • Embed image files in the HTML file by setting the Document.HtmlExportOptions.ImageEmbedded property to true.
  • Save the document as an HTML file using the Document.SaveToFile() method.
  • Generate a Blob from the HTML file, create a download link, and trigger the download.

JavaScript:
import React, { useState, useEffect } from 'react';

function App() {

  // State to hold the loaded WASM module
  const [wasmModule, setWasmModule] = useState(null);

  // useEffect hook to load the WASM module when the component mounts
  useEffect(() => {
    const loadWasm = async () => {
      try {

        // Access the Module and spiredoc from the global window object
        const { Module, spiredoc } = window;

        // Set the wasmModule state when the runtime is initialized
        Module.onRuntimeInitialized = () => {
          setWasmModule(spiredoc);
        };
      } catch (err) {

        // Log any errors that occur during loading
        console.error('Failed to load WASM module:', err);
      }
    };

    // Create a script element to load the WASM JavaScript file
    const script = document.createElement('script');
    script.src = `${process.env.PUBLIC_URL}/Spire.Doc.Base.js`;
    script.onload = loadWasm;

    // Append the script to the document body
    document.body.appendChild(script);

    // Cleanup function to remove the script when the component unmounts
    return () => {
      document.body.removeChild(script);
    };
  }, []); 

  // Function to convert RTF to HTML
  const convertRtfToHtml = async () => {
    if (wasmModule) {

      // Load the font files into the virtual file system (VFS)
      await wasmModule.FetchFileToVFS('times.ttf', '/Library/Fonts/', `${process.env.PUBLIC_URL}/`);
      await wasmModule.FetchFileToVFS('timesbd.ttf', '/Library/Fonts/', `${process.env.PUBLIC_URL}/`);
      await wasmModule.FetchFileToVFS('timesbi.ttf', '/Library/Fonts/', `${process.env.PUBLIC_URL}/`);
      await wasmModule.FetchFileToVFS('timesi.ttf', '/Library/Fonts/', `${process.env.PUBLIC_URL}/`);

      // Specify the input file path
      const inputFileName = 'input.rtf'; 
   
      // Create a new document
      const doc = wasmModule.Document.Create();

      // Fetch the input file and add it to the VFS
      await wasmModule.FetchFileToVFS(inputFileName, '', `${process.env.PUBLIC_URL}/`);
      
      // Load the RTF file
      doc.LoadFromFile(inputFileName);

      // Embed CSS file in the HTML file      
      doc.HtmlExportOptions.CssStyleSheetType = wasmModule.CssStyleSheetType.Internal;     

      // Embed images in the HTML file      
      doc.HtmlExportOptions.ImageEmbedded = true;

      // Define the output file name
      const outputFileName = "RtfToHtml.html";

      // Save the document to the specified path
      doc.SaveToFile({fileName: outputFileName, fileFormat: wasmModule.FileFormat.Html});
 
      // Read the generated HTML file from VFS
      const modifiedFileArray = wasmModule.FS.readFile(outputFileName);

      // Create a Blob object from the HTML file
      const modifiedFile = new Blob([modifiedFileArray], { type: 'text/html'});

      // Create a URL for the Blob
      const url = URL.createObjectURL(modifiedFile);

      // Create an anchor element to trigger the download
      const a = document.createElement('a');
      a.href = url;
      a.download = outputFileName;
      document.body.appendChild(a);
      a.click(); 
      document.body.removeChild(a); 
      URL.revokeObjectURL(url); 

      // Clean up resources
      doc.Dispose();
    }
  };

  return (
    <div style={{ textAlign: 'center', height: '300px' }}>
      <h1>Convert RTF to HTML in React</h1>
      <button onClick={convertRtfToHtml} disabled={!wasmModule}>
        Convert
      </button>
    </div>
  );
}

export default App;

[Image: Convert RTF to HTML in React]

Get a Free License

To fully experience the capabilities of Spire.Doc for JavaScript without any evaluation limitations, you can request a free 30-day trial license.

In web development, converting Word documents to HTML lets content creators draft web-ready content in the familiar Word editing environment. The conversion structures the content for web delivery and simplifies content management. With React, developers can run the conversion in the browser on the client side, which simplifies the development workflow and can reduce load times and server costs.

This article demonstrates how to use Spire.Doc for JavaScript to convert Word documents to HTML files within React applications.

Install Spire.Doc for JavaScript

To get started with converting Word documents to HTML in a React application, you can either download Spire.Doc for JavaScript from our website or install it via npm with the following command:

npm i spire.doc

After that, copy the "Spire.Doc.Base.js" and "Spire.Doc.Base.wasm" files into the public folder of your project. Additionally, include the required font files to ensure accurate and consistent text rendering.

For more details, refer to the documentation: How to Integrate Spire.Doc for JavaScript in a React Project
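
The components in this article assign Module.onRuntimeInitialized only after the script's load event. If the runtime happened to finish initializing before that assignment, the handler would presumably never run and the button would stay disabled. The sketch below guards against that case. It assumes Module is an Emscripten module object that sets Module.calledRun once the runtime has started, which is standard Emscripten behaviour but has not been verified for Spire.Doc. The name onRuntimeReady is hypothetical.

```javascript
// Hypothetical helper. Assumes `Module` is the Emscripten module object that
// Spire.Doc.Base.js places on window (an assumption, see the note above).
function onRuntimeReady(Module, callback) {
  if (Module.calledRun) {
    // Runtime already initialized: the hook would not fire again, so call back now
    callback();
    return;
  }
  const previous = Module.onRuntimeInitialized;
  Module.onRuntimeInitialized = () => {
    // Preserve any handler that was registered earlier
    if (typeof previous === 'function') previous();
    callback();
  };
}
```

Inside loadWasm, the assignment could then become onRuntimeReady(Module, () => setWasmModule(spiredoc)).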

Convert Word Documents to HTML Using JavaScript

With Spire.Doc for JavaScript, you can load Word documents into the WASM environment using the Document.LoadFromFile() method and convert them to HTML files with the Document.SaveToFile() method. This approach converts Word documents into HTML format with CSS files and images separated from the main HTML file, allowing developers to easily customize the HTML page.

Follow these steps to convert a Word document to HTML format using Spire.Doc for JavaScript in React:

  • Load the Spire.Doc.Base.js file to initialize the WebAssembly module.
  • Load the Word file into the virtual file system using the wasmModule.FetchFileToVFS() method.
  • Create a Document instance in the WASM module using the wasmModule.Document.Create() method.
  • Load the Word document into the Document instance using the Document.LoadFromFile() method.
  • Convert the Word document to HTML format using the Document.SaveToFile({ fileName: string, fileFormat: wasmModule.FileFormat.Html }) method.
  • Pack and download the result files or take further actions as needed.

JavaScript:
import React, { useState, useEffect } from 'react';
import JSZip from 'jszip';

function App() {

  // State to hold the loaded WASM module
  const [wasmModule, setWasmModule] = useState(null);

  // useEffect hook to load the WASM module when the component mounts
  useEffect(() => {
    const loadWasm = async () => {
      try {

        // Access the Module and spiredoc from the global window object
        const { Module, spiredoc } = window;

        // Set the wasmModule state when the runtime is initialized
        Module.onRuntimeInitialized = () => {
          setWasmModule(spiredoc);
        };
      } catch (err) {

        // Log any errors that occur during loading
        console.error('Failed to load WASM module:', err);
      }
    };

    // Create a script element to load the WASM JavaScript file
    const script = document.createElement('script');
    script.src = `${process.env.PUBLIC_URL}/Spire.Doc.Base.js`;
    script.onload = loadWasm;

    // Append the script to the document body
    document.body.appendChild(script);

    // Cleanup function to remove the script when the component unmounts
    return () => {
      document.body.removeChild(script);
    };
  }, []);

  // Function to convert the Word document to HTML format
  const WordToHTMLAndZip = async () => {
    if (wasmModule) {
      // Specify the input file name and the output folder name
      const inputFileName = 'Sample.docx';
      const outputFolderName = 'WordToHTMLOutput';

      // Fetch the input file and add it to the VFS
      await wasmModule.FetchFileToVFS(inputFileName, '', `${process.env.PUBLIC_URL}/`);

      // Create an instance of the Document class
      const doc = wasmModule.Document.Create();
      // Load the Word document
      doc.LoadFromFile({ fileName: inputFileName });

      // Save the Word document to HTML format in the output folder
      doc.SaveToFile({ fileName: `${outputFolderName}/document.html`, fileFormat: wasmModule.FileFormat.Html });

      // Release resources
      doc.Dispose();

      // Create a new JSZip object
      const zip = new JSZip();

      // Recursive function to add a directory and its contents to the ZIP
      const addFilesToZip = (folderPath, zipFolder) => {
        const items = wasmModule.FS.readdir(folderPath);
        items.filter(item => item !== "." && item !== "..").forEach((item) => {
          const itemPath = `${folderPath}/${item}`;

          try {
            // Attempt to read file data
            const fileData = wasmModule.FS.readFile(itemPath);
            zipFolder.file(item, fileData);
          } catch (error) {
            if (error.code === 'EISDIR') {
              // If it's a directory, create a new folder in the ZIP and recurse into it
              const zipSubFolder = zipFolder.folder(item);
              addFilesToZip(itemPath, zipSubFolder);
            } else {
              // Handle other errors
              console.error(`Error processing ${itemPath}:`, error);
            }
          }
        });
      };

      // Add all files in the output folder to the ZIP
      addFilesToZip(outputFolderName, zip);

      // Generate and download the ZIP file
      zip.generateAsync({ type: 'blob' }).then((content) => {
        const url = URL.createObjectURL(content);
        const a = document.createElement('a');
        a.href = url;
        a.download = `${outputFolderName}.zip`;
        document.body.appendChild(a);
        a.click();
        document.body.removeChild(a);
        URL.revokeObjectURL(url);
      });
    }
  };

  return (
      <div style={{ textAlign: 'center', height: '300px' }}>
        <h1>Convert Word File to HTML and Download as ZIP Using JavaScript in React</h1>
        <button onClick={WordToHTMLAndZip} disabled={!wasmModule}>
          Convert and Download
        </button>
      </div>
  );
}

export default App;

[Image: Word to HTML conversion result with JavaScript]
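
The component above detects directories by catching the error thrown by FS.readFile(). As an alternative sketch, the folder can be walked with an explicit directory check. This assumes that wasmModule.FS exposes Emscripten's FS.readdir(), FS.stat() and FS.isDir() functions, which has not been confirmed for Spire.Doc. The name listVfsFiles is hypothetical.

```javascript
// Hypothetical alternative to the try/catch directory check above.
// Returns the paths of all files below folderPath, recursing into sub-folders.
function listVfsFiles(FS, folderPath) {
  const files = [];
  for (const item of FS.readdir(folderPath)) {
    // Skip the current-directory and parent-directory entries
    if (item === '.' || item === '..') continue;
    const itemPath = `${folderPath}/${item}`;
    if (FS.isDir(FS.stat(itemPath).mode)) {
      files.push(...listVfsFiles(FS, itemPath));
    } else {
      files.push(itemPath);
    }
  }
  return files;
}
```

Each returned path could then be read with FS.readFile() and added to the JSZip object, so that errors unrelated to directories are no longer mixed into the directory check.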

Convert Word to HTML with Embedded CSS and Images

In addition to converting Word documents to HTML with separated files, CSS and images can be embedded into a single HTML file by configuring the Document.HtmlExportOptions.CssStyleSheetType property and the Document.HtmlExportOptions.ImageEmbedded property. The steps to achieve this are as follows:

  • Load the Spire.Doc.Base.js file to initialize the WebAssembly module.
  • Load the Word file into the virtual file system using the wasmModule.FetchFileToVFS() method.
  • Create a Document instance in the WASM module using the wasmModule.Document.Create() method.
  • Load the Word document into the Document instance using the Document.LoadFromFile() method.
  • Set the Document.HtmlExportOptions.CssStyleSheetType property to wasmModule.CssStyleSheetType.Internal to embed CSS styles in the resulting HTML file.
  • Set the Document.HtmlExportOptions.ImageEmbedded property to true to embed images in the resulting HTML file.
  • Convert the Word document to an HTML file with CSS styles and images embedded using the Document.SaveToFile({ fileName: string, fileFormat: wasmModule.FileFormat.Html }) method.
  • Download the resulting HTML file or take further actions as needed.

JavaScript:
import React, { useState, useEffect } from 'react';

function App() {

  // State to hold the loaded WASM module
  const [wasmModule, setWasmModule] = useState(null);

  // useEffect hook to load the WASM module when the component mounts
  useEffect(() => {
    const loadWasm = async () => {
      try {
        const { Module, spiredoc } = window;
        Module.onRuntimeInitialized = () => {
          setWasmModule(spiredoc);
        };
      } catch (err) {
        console.error('Failed to load WASM module:', err);
      }
    };

    const script = document.createElement('script');
    script.src = `${process.env.PUBLIC_URL}/Spire.Doc.Base.js`;
    script.onload = loadWasm;

    document.body.appendChild(script);

    return () => {
      document.body.removeChild(script);
    };
  }, []);

  // Function to convert the Word document to HTML format
  const WordToHTMLAndZip = async () => {
    if (wasmModule) {

      // Specify the input file name and the base output name
      const inputFileName = 'Sample.docx';
      const outputFileName = 'ConvertedDocument.html';

      // Fetch the input file and add it to the VFS
      await wasmModule.FetchFileToVFS(inputFileName, '', `${process.env.PUBLIC_URL}/`);

      // Create an instance of the Document class
      const doc = wasmModule.Document.Create();

      // Load the Word document
      doc.LoadFromFile({ fileName: inputFileName });

      // Embed CSS file in the HTML file
      doc.HtmlExportOptions.CssStyleSheetType = wasmModule.CssStyleSheetType.Internal;

      // Embed images in the HTML file
      doc.HtmlExportOptions.ImageEmbedded = true;

      // Save the Word document to HTML format
      doc.SaveToFile({ fileName: outputFileName, fileFormat: wasmModule.FileFormat.Html });

      // Release resources
      doc.Dispose();

      // Read the HTML file from the VFS
      const htmlFileArray = wasmModule.FS.readFile(outputFileName);

      // Generate a Blob from the HTML file array and trigger download
      const blob = new Blob([new Uint8Array(htmlFileArray)], { type: 'text/html' });
      const url = URL.createObjectURL(blob);
      const a = document.createElement("a");
      a.href = url;
      a.download = outputFileName;
      document.body.appendChild(a);
      a.click();
      document.body.removeChild(a);
      URL.revokeObjectURL(url);
    }
  };

  return (
      <div style={{ textAlign: 'center', height: '300px' }}>
        <h1>Convert Word to HTML Using JavaScript in React</h1>
        <button onClick={WordToHTMLAndZip} disabled={!wasmModule}>
          Convert and Download
        </button>
      </div>
  );
}

export default App;

[Image: Word to HTML conversion result with CSS and images embedded]

Convert Word to HTML with Customized Options

Spire.Doc for JavaScript also supports many other HTML export options, such as the CSS file name, headers and footers, and form fields, through the Document.HtmlExportOptions property. The table below lists the properties available under Document.HtmlExportOptions that can be used to tailor the Word-to-HTML conversion:

Property                     Description
CssStyleSheetType            Specifies the type of the HTML CSS style sheet (External or Internal).
CssStyleSheetFileName        Specifies the name of the HTML CSS style sheet file.
ImageEmbedded                Specifies whether to embed images in the HTML code using the Data URI scheme.
ImagesPath                   Specifies the folder for images in the exported HTML.
UseSaveFileRelativePath      Specifies whether the image file path is relative to the HTML file path.
HasHeadersFooters            Specifies whether headers and footers should be included in the exported HTML.
IsTextInputFormFieldAsText   Specifies whether text-input form fields should be exported as text in HTML.
IsExportDocumentStyles       Specifies whether to export document styles to the HTML <head>.
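
When several of these properties are set together, one option is to apply them from a plain object. The sketch below is hypothetical: applyHtmlExportOptions is not part of Spire.Doc, and it only assumes that the properties in the table can be assigned on doc.HtmlExportOptions as in this article's examples.

```javascript
// Property names taken from the table above; the helper itself is hypothetical.
const HTML_EXPORT_OPTION_NAMES = [
  'CssStyleSheetType', 'CssStyleSheetFileName', 'ImageEmbedded', 'ImagesPath',
  'UseSaveFileRelativePath', 'HasHeadersFooters', 'IsTextInputFormFieldAsText',
  'IsExportDocumentStyles',
];

function applyHtmlExportOptions(doc, options) {
  for (const [name, value] of Object.entries(options)) {
    if (!HTML_EXPORT_OPTION_NAMES.includes(name)) {
      // Fail early on typos instead of silently setting an unused property
      throw new Error(`Unknown HtmlExportOptions property: ${name}`);
    }
    doc.HtmlExportOptions[name] = value;
  }
  return doc;
}
```

A call such as applyHtmlExportOptions(doc, { ImageEmbedded: false, HasHeadersFooters: false }) keeps the chosen settings in one place, and the name check catches misspelled property names that plain assignment would accept silently.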

Follow these steps to customize options when converting Word documents to HTML format:

  • Load the Spire.Doc.Base.js file to initialize the WebAssembly module.
  • Load the Word file into the virtual file system using the wasmModule.FetchFileToVFS() method.
  • Create a Document instance in the WASM module using the wasmModule.Document.Create() method.
  • Load the Word document into the Document instance using the Document.LoadFromFile() method.
  • Customize the conversion options through properties under Document.HtmlExportOptions.
  • Convert the Word document to HTML format using the Document.SaveToFile({ fileName: string, fileFormat: wasmModule.FileFormat.Html }) method.
  • Pack and download the result files or take further actions as needed.

JavaScript:
import React, { useState, useEffect } from 'react';
import JSZip from 'jszip';

function App() {

  // State to hold the loaded WASM module
  const [wasmModule, setWasmModule] = useState(null);

  // useEffect hook to load the WASM module when the component mounts
  useEffect(() => {
    const loadWasm = async () => {
      try {

        // Access the Module and spiredoc from the global window object
        const { Module, spiredoc } = window;

        // Set the wasmModule state when the runtime is initialized
        Module.onRuntimeInitialized = () => {
          setWasmModule(spiredoc);
        };
      } catch (err) {

        // Log any errors that occur during loading
        console.error('Failed to load WASM module:', err);
      }
    };

    // Create a script element to load the WASM JavaScript file
    const script = document.createElement('script');
    script.src = `${process.env.PUBLIC_URL}/Spire.Doc.Base.js`;
    script.onload = loadWasm;

    // Append the script to the document body
    document.body.appendChild(script);

    // Cleanup function to remove the script when the component unmounts
    return () => {
      document.body.removeChild(script);
    };
  }, []);

  // Function to convert the Word document to HTML format
  const WordToHTMLAndZip = async () => {
    if (wasmModule) {
      // Specify the input file name and the base output file name
      const inputFileName = 'Sample.docx';
      const baseOutputFileName = 'WordToHTML';
      const outputFolderName = 'WordToHTMLOutput';

      // Fetch the input file and add it to the VFS
      await wasmModule.FetchFileToVFS(inputFileName, '', `${process.env.PUBLIC_URL}/`);

      // Create an instance of the Document class
      const doc = wasmModule.Document.Create();

      // Load the Word document
      doc.LoadFromFile({ fileName: inputFileName });

      // Un-embed the CSS file and set its name
      doc.HtmlExportOptions.CssStyleSheetType = wasmModule.CssStyleSheetType.External;
      doc.HtmlExportOptions.CssStyleSheetFileName = `${baseOutputFileName}CSS.css`;

      // Un-embed the image files and set their path
      doc.HtmlExportOptions.ImageEmbedded = false;
      doc.HtmlExportOptions.ImagesPath = `/Images`;
      doc.HtmlExportOptions.UseSaveFileRelativePath = true;

      // Set to ignore headers and footers
      doc.HtmlExportOptions.HasHeadersFooters = false;

      // Set form fields flattened as text
      doc.HtmlExportOptions.IsTextInputFormFieldAsText = true;

      // Set exporting document styles in the head section
      doc.HtmlExportOptions.IsExportDocumentStyles = true;

      // Save the Word document to HTML format
      doc.SaveToFile({
        fileName: `${outputFolderName}/${baseOutputFileName}.html`,
        fileFormat: wasmModule.FileFormat.Html
      });

      // Release resources
      doc.Dispose();

      // Create a new JSZip object
      const zip = new JSZip();

      // Recursive function to add a directory and its contents to the ZIP
      const addFilesToZip = (folderPath, zipFolder) => {
        const items = wasmModule.FS.readdir(folderPath);
        items.filter(item => item !== "." && item !== "..").forEach((item) => {
          const itemPath = `${folderPath}/${item}`;

          try {
            // Attempt to read file data. If it's a directory, this will throw an error.
            const fileData = wasmModule.FS.readFile(itemPath);
            zipFolder.file(item, fileData);
          } catch (error) {
            if (error.code === 'EISDIR') {
              // If it's a directory, create a new folder in the ZIP and recurse into it
              const zipSubFolder = zipFolder.folder(item);
              addFilesToZip(itemPath, zipSubFolder);
            } else {
              // Handle other errors
              console.error(`Error processing ${itemPath}:`, error);
            }
          }
        });
      };

      // Add the contents of the output folder to the ZIP
      addFilesToZip(`${outputFolderName}`, zip);

      // Generate and download the ZIP file
      zip.generateAsync({ type: 'blob' }).then((content) => {
        const url = URL.createObjectURL(content);
        const a = document.createElement("a");
        a.href = url;
        a.download = `${baseOutputFileName}.zip`;
        document.body.appendChild(a);
        a.click();
        document.body.removeChild(a);
        URL.revokeObjectURL(url);
      });
    }
  };

  return (
      <div style={{ textAlign: 'center', height: '300px' }}>
        <h1>Convert Word File to HTML and Download as ZIP Using JavaScript in React</h1>
        <button onClick={WordToHTMLAndZip} disabled={!wasmModule}>
          Convert and Download
        </button>
      </div>
  );
}

export default App;

[Image: Convert Word to HTML and customize conversion options]

In today's digital world, extracting text from images has become essential for many fields, including business, education, and data analysis. OCR (Optical Character Recognition) technology makes this process effortless by converting text in images into editable and searchable formats quickly and accurately. Whether it's turning handwritten notes into digital files or pulling key information from scanned documents, OCR simplifies tasks and makes work more efficient. In this article, we will demonstrate how to recognize text from images in Python using Spire.OCR for Python.

Install Spire.OCR for Python

This scenario requires Spire.OCR for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

pip install Spire.OCR

Download the Model of Spire.OCR for Python

Spire.OCR for Python provides different recognition models for different operating systems. Download the model suited to your system from one of the links below:

After downloading, extract the package and save it to a specific directory on your system.

Recognize Text from Images in Python

Spire.OCR for Python offers the OcrScanner.Scan() method to recognize text from images. Once the recognition is complete, you can use the OcrScanner.Text property to retrieve the recognized text and then save it to a file for further use. The detailed steps are as follows.

  • Create an instance of the OcrScanner class to handle OCR operations.
  • Create an instance of the ConfigureOptions class to configure the OCR settings.
  • Specify the file path to the model and the desired recognition language through the ConfigureOptions.ModelPath and ConfigureOptions.Language properties.
  • Apply the configuration settings to the OcrScanner instance using the OcrScanner.ConfigureDependencies() method.
  • Call the OcrScanner.Scan() method to perform text recognition on the image.
  • Retrieve the recognized text using the OcrScanner.Text property.
  • Save the extracted text to a file for further use.

Python:
from spire.ocr import *

# Create an instance of the OcrScanner class
scanner = OcrScanner()

# Configure OCR settings
configureOptions = ConfigureOptions()
# Set the file path to the model
configureOptions.ModelPath = r'D:\OCR\win-x64'  
# Set the recognition language. Supported languages include English, Chinese, Chinesetraditional, French, German, Japanese, and Korean.
configureOptions.Language = 'English'  
# Apply the settings to the OcrScanner instance
scanner.ConfigureDependencies(configureOptions)

# Recognize text from the image
scanner.Scan(r'Sample.png')

# Retrieve the recognized text and save it to a file
text = scanner.Text.ToString() + '\n'
with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(text + '\n')

[Image: Recognize text from images in Python]

Recognize Text with Coordinates from Images in Python

In scenarios where you need the exact position of text in an image, such as for layout analysis or advanced data processing, extracting coordinate information is essential. With Spire.OCR for Python, you can retrieve recognized text block by block. Each text block includes detailed positional data such as the x and y coordinates, width, and height. The detailed steps are as follows.

  • Create an instance of the OcrScanner class to handle OCR operations.
  • Create an instance of the ConfigureOptions class to configure the OCR settings.
  • Specify the file path to the model and the desired recognition language through the ConfigureOptions.ModelPath and ConfigureOptions.Language properties.
  • Apply the configuration settings to the OcrScanner instance using the OcrScanner.ConfigureDependencies() method.
  • Call the OcrScanner.Scan() method to perform text recognition on the image.
  • Retrieve the recognized text using the OcrScanner.Text property.
  • Iterate through the text blocks in the recognized text. For each block, use the IOCRTextBlock.Text property to get the text and the IOCRTextBlock.Box property to retrieve positional details (x, y, width, and height).
  • Save the results to a text file for further analysis.

Python:
from spire.ocr import *

# Create an instance of the OcrScanner class
scanner = OcrScanner()

# Configure OCR settings
configureOptions = ConfigureOptions()
# Set the file path to the model
configureOptions.ModelPath = r'D:\OCR\win-x64' 
# Set the recognition language. Supported languages include English, Chinese, Chinesetraditional, French, German, Japanese, and Korean.
configureOptions.Language = 'English' 
# Apply the settings to the OcrScanner instance
scanner.ConfigureDependencies(configureOptions)

# Recognize text from the image
scanner.Scan(r'sample.png')
# Retrieve the recognized text 
text = scanner.Text

# Iterate through the text blocks in the recognized text. For each text block, retrieve its text and positional data (x, y, width, and height)
block_text = ""
for block in text.Blocks:
    rectangle = block.Box
    block_info = f'{block.Text} -> x: {rectangle.X}, y: {rectangle.Y}, w: {rectangle.Width}, h: {rectangle.Height}'
    block_text += block_info + '\n'

# Save the results to a file
with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(block_text + '\n')

[Image: Recognize text with coordinates from images in Python]

Apply for a Temporary License

To remove the evaluation message from the generated documents or to lift the function limitations, request a 30-day trial license.
