Spire.Doc for .NET (337)
Children categories

In .NET development, converting HTML to plain text is a common task, whether you need to extract content from web pages, process HTML emails, or generate lightweight text reports. However, HTML’s rich formatting, tags, and structural elements can complicate workflows that require clean, unformatted text. This is why using C# for HTML to text conversion becomes essential.
Spire.Doc for .NET simplifies this process: it’s a robust library for document manipulation that natively supports loading HTML files/strings and converting them to clean plain text. This guide will explore how to convert HTML to plain text in C# using the library, including detailed breakdowns of two core scenarios: converting HTML strings (in-memory content) and HTML files (disk-based content).
- Why Use Spire.Doc for HTML to Text Conversion?
- Installing Spire.Doc
- Convert HTML Strings to Text in C#
- Convert HTML File to Text in C#
- FAQs
- Conclusion
Why Use Spire.Doc for HTML to Text Conversion?
Spire.Doc is a .NET document processing library that stands out for HTML-to-text conversion due to:
- Simplified Code: Minimal lines of code to handle even complex HTML.
- Structure Preservation: Maintains logical formatting (line breaks, list indentation) in the output text.
- Special Character Support: Automatically converts HTML entities to their plain text equivalents.
- Lightweight: Avoids heavy dependencies, making it suitable for both desktop and web applications
Installing Spire.Doc
Spire.Doc is available via NuGet, the easiest way to manage dependencies:
- In Visual Studio, right-click your project > Manage NuGet Packages.
- Search for Spire.Doc and install the latest stable version.
- Alternatively, use the Package Manager Console:
Install-Package Spire.Doc
After installing, you can dive into the C# code to extract text from HTML.
Convert HTML Strings to Text in C#
This example renders an HTML string into a Document object, then uses SaveToFile() to save it as a plain text file.
using Spire.Doc;
using Spire.Doc.Documents;
namespace HtmlToTextSaver
{
class Program
{
static void Main(string[] args)
{
// Define HTML content
string htmlContent = @"
<html>
<body>
<h1>Sample HTML Content</h1>
<p>This is a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
<p>Another line with a <a href='https://example.com'>link</a>.</p>
<ul>
<li>List item 1</li>
<li>List item 2 (with <em>italic</em> text)</li>
</ul>
<p>Special characters: © & ®</p>
</body>
</html>";
// Create a Document object
Document doc = new Document();
// Add a section to hold content
Section section = doc.AddSection();
// Add a paragraph
Paragraph paragraph = section.AddParagraph();
// Render HTML into the paragraph
paragraph.AppendHTML(htmlContent);
// Save as plain text
doc.SaveToFile("HtmlStringtoText.txt", FileFormat.Txt);
}
}
}
How It Works:
- HTML String Definition: We start with a sample HTML string containing headings, paragraphs, formatting tags (
<strong>,<em>), links, lists, and special characters. - Document Setup: A
Documentobject is created to manage the content, with aSectionandParagraphto structure the HTML rendering. - HTML Rendering:
AppendHTML()parses the HTML string and converts it into the document's internal structure, preserving content hierarchy. - Text Conversion:
SaveToFile()withFileFormat.Txtconverts the rendered content to plain text, stripping HTML tags while retaining readable structure.
Output:

Extended reading: Parse or Read HTML in C#
Convert HTML File to Text in C#
This example directly loads an HTML file and converts it to text. Ideal for batch processing or working with pre-existing HTML documents (e.g., downloaded web pages, local templates).
using Spire.Doc;
using Spire.Doc.Documents;
namespace HtmlToText
{
class Program
{
static void Main()
{
// Create a Document object
Document doc = new Document();
// Load an HTML file
doc.LoadFromFile("sample.html", FileFormat.Html, XHTMLValidationType.None);
// Convert HTML to plain text
doc.SaveToFile("HTMLtoText.txt", FileFormat.Txt);
doc.Dispose();
}
}
}
How It Works:
- Document Initialization: A
Documentobject is created to handle the file operations. - HTML File Loading:
LoadFromFile()imports the HTML file, withFileFormat.Htmlspecifying the input type.XHTMLValidationType.Noneensures compatibility with non-strict HTML. - Text Conversion:
SaveToFile()withFileFormat.Txtconverts the loaded HTML content to plain text.

To preserve the original formatting and style, you can refer to the C# tutorial to convert the HTML file to Word.
FAQs
Q1: Can Spire.Doc process malformed HTML?
A: Yes. Spire.Doc includes built-in tolerance for malformed HTML, but you may need to disable strict validation to ensure proper parsing.
When loading HTML files, use XHTMLValidationType.None (as shown in the guide) to skip strict XHTML checks:
doc.LoadFromFile("malformed.html", FileFormat.Html, XHTMLValidationType.None);
This setting tells Spire.Doc to parse the HTML like a web browser (which automatically corrects minor issues like unclosed <p> or <li> tags) instead of rejecting non-compliant content.
Q2: Can I extract specific elements from HTML (like only paragraphs or headings)?
A: Yes, after loading the HTML into a Document object, you can access specific elements through the object model (like paragraphs, tables, etc.) and extract text from only those specific elements rather than the entire document.
Q3: Can I convert HTML to other formats besides plain text using Spire.Doc?
A: Yes, Spire.Doc supports conversion to multiple formats, including Word DOC/DOCX, PDF, image, RTF, and more, making it a versatile document processing solution.
Q4: Does Spire.Doc work with .NET Core/.NET 5+?
A: Spire.Doc fully supports .NET Core, .NET 5/6/7/8, and .NET Framework 4.0+. There’s no difference in functionality across these frameworks, which means you can use the same code (e.g., Document, AppendHTML(), SaveToFile()) regardless of which .NET runtime you’re targeting.
Conclusion
Converting HTML to text in C# is straightforward with the Spire.Doc library. Whether you’re working with HTML strings or files, Spire.Doc simplifies the process by handling HTML parsing, structure preservation, and text conversion. By following the examples in this guide, you can seamlessly integrate HTML-to-text conversion into your C# applications.
You can request a free 30-day trial license here to unlock full functionality and remove limitations of the Spire.Doc library.

The need to efficiently parse HTML in C# is a common requirement for many development tasks, from web scraping, data extraction to content automation. While .NET offers built-in tools (e.g., HtmlAgilityPack), Spire.Doc simplifies HTML parsing in C# with its intuitive object model and seamless integration.
This guide explores how to leverage Spire.Doc for .NET to parse HTML, including loading HTML from various sources, navigating document structure, and extracting critical data.
- Setting Up Spire.Doc
- How Spire.Doc Parses HTML
- How to Load and Parse HTML Content
- Conclusion
- Common Questions
Setting Up Spire.Doc
The easiest way to integrate the C# HTML parser library into your project is via NuGet:
- Open your project in Visual Studio.
- Right-click the project in the Solution Explorer → Select Manage NuGet Packages.
- In the NuGet Package Manager, search for Spire.Doc.
- Select the latest stable version and click Install.
Alternatively, download the library directly from the E-iceblue website, extract the ZIP file, and reference Spire.Doc.dll in your project.
How Spire.Doc Parses HTML
Spire.Doc converts HTML into a structured object model, where elements like <p>, <a>, and <table> are mapped to classes you can programmatically access. Key components include:
- Document: Acts as the container for parsed HTML content.
- Section: Represents a block of content (similar to HTML’s
<body>or<div>sections). - Paragraph: Maps to HTML block elements like
<p>,<h1>, or<li>. - DocumentObject: Base class for all elements within a Paragraph (images, links, etc.).
This model ensures that HTML structures are preserved and accessible via intuitive C# properties and methods.
How to Load and Parse HTML Content
Spire.Doc supports parsing HTML from strings, local files, or even remote URLs (when combined with HTTP clients). Below are detailed examples for each scenario.
Parse an HTML String in C#
Parse an HTML string (e.g., from a web API or database) into Spire.Doc’s object model for inspection.
using Spire.Doc;
using Spire.Doc.Documents;
namespace ParseHtmlString
{
class Program
{
static void Main(string[] args)
{
// Create a Document object
Document doc = new Document();
// Add a section to act as a container
Section section = doc.AddSection();
// Add a paragraph
Paragraph para = section.AddParagraph();
// Define HTML content to parse
string htmlContent = @"
<h2>Sample HTML String</h2>
<p>This is a paragraph with <strong>bold text</strong> and a <a href='https://www.e-iceblue.com/'>link</a>.</p>
<ul>
<li>List item 1</li>
<li>List item 2</li>
</ul>
";
// Parse the HTML string into the paragraph
para.AppendHTML(htmlContent);
// Print all paragraph text
Console.WriteLine("Parsed HTML Content:");
Console.WriteLine("---------------------");
foreach (Paragraph paragraph in section.Paragraphs)
{
Console.WriteLine(paragraph.Text);
}
}
}
}
In this code, the method AppendHTML() automatically converts HTML tags to corresponding Spire.Doc objects (e.g., <h1> → Heading1 style, <ul> → list paragraphs).
Output:

Pro Tip: You can also call the SaveToFile() method to convert the HTML string to Word in C#.
Parse an HTML File in C#
For HTML content stored in a file (e.g., downloaded web pages, static HTML reports), load it via LoadFromFile() and then analyze its structure (e.g., extracting headings, paragraphs).
using Spire.Doc;
using Spire.Doc.Documents;
namespace ParseHtmlFile
{
class Program
{
static void Main(string[] args)
{
// Create a Document object
Document doc = new Document();
// Load an HTML file
doc.LoadFromFile("sample.html", FileFormat.Html);
// Traverse sections (HTML body blocks)
foreach (Section section in doc.Sections)
{
Console.WriteLine($"Section {doc.Sections.IndexOf(section) + 1}:");
Console.WriteLine("---------------------------------");
// Traverse paragraphs in the section
foreach (Paragraph para in section.Paragraphs)
{
// Print paragraph text and style (e.g., heading level)
string styleName = para.StyleName;
Console.WriteLine($"[{styleName}] {para.Text}"+ "\n");
}
Console.WriteLine();
}
}
}
}
This C# code example loads a local HTML file and then uses the Paragraph.StyleName and Paragraph.Text properties to extract content along with its styling information.
Output:

Spire.Doc’s object model allows you to interact with an HTML file just like you would with a Word document. In addition to extracting text content, you can also extract elements like links, tables from HTML.
Parse a URL in C#
To parse HTML from a web page, combine Spire.Doc with HttpClient to fetch the HTML content first, then parse it.
using Spire.Doc;
using Spire.Doc.Documents;
namespace HtmlUrlParsing
{
class Program
{
// HttpClient instance for web requests
private static readonly HttpClient httpClient = new HttpClient();
static async Task Main(string[] args)
{
try
{
// Fetch HTML from a URL
string url = "https://www.e-iceblue.com/privacypolicy.html";
Console.WriteLine($"Fetching HTML from: {url}");
string htmlContent = await FetchHtmlFromUrl(url);
// Parse the fetched HTML
Document doc = new Document();
Section section = doc.AddSection();
Paragraph paragraph = section.AddParagraph();
paragraph.AppendHTML(htmlContent);
// Extract key information
Console.WriteLine("\nParsed Content Summary:");
Console.WriteLine($"Sections: {doc.Sections.Count}");
Console.WriteLine($"Paragraphs: {section.Paragraphs.Count}");
Console.WriteLine("-------------------------------------------");
// Extract all heading paragraphs
foreach (Paragraph para in section.Paragraphs)
{
if (para.StyleName.StartsWith("Heading"))
{
string headings = para.Text;
Console.WriteLine($"Headings: {headings}");
}
}
}
catch (Exception ex)
{
Console.WriteLine($"Error: {ex.Message}");
}
}
// Helper method to fetch HTML from a URL
private static async Task<string> FetchHtmlFromUrl(string url)
{
// Set a user-agent to avoid being blocked by servers
httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
// Send GET request and return HTML content
HttpResponseMessage response = await httpClient.GetAsync(url);
response.EnsureSuccessStatusCode(); // Throw on HTTP errors (4xx, 5xx)
return await response.Content.ReadAsStringAsync();
}
}
}
This C# code combines web scraping (fetching HTML from a URL) with document parsing (using Spire.Doc) to extract structured information (like headings) from web content. It’s useful for scenarios like content analysis or web data extraction.
Output:

Conclusion
Spire.Doc for .NET provides a comprehensive solution for reading HTML in C# applications. Whether you're working with HTML strings, local files, or even web URLs, this library streamlines the process with intuitive APIs and reliable performance. By following the examples outlined in this guide, you can efficiently integrate HTML parsing capabilities into your .NET projects.
To fully experience the capabilities of Spire.Doc for .NET, request a free 30-day trial license here.
Common Questions
Q1: Why use Spire.Doc for HTML parsing instead of HtmlAgilityPack?
A: Spire.Doc and HtmlAgilityPack serve different primary goals, so the choice depends on your needs:
- HtmlAgilityPack: A lightweight library only for parsing and manipulating raw HTML (e.g., extracting tags, fixing invalid HTML). It does not handle document formatting or export to Word.
- Spire.Doc: Designed for document manipulation first - it parses HTML and maps it directly to structured Word elements (sections, paragraphs, styles like headings/bold). This is critical if you need to:
- Preserve HTML structure in an editable Word file.
- Extract styled content (e.g., identify "Heading 1" vs. "Normal" text).
- Export parsed HTML to RTF, TXT, PDF, etc.
Q2. How do I convert HTML to Text in C#
A: To convert an HTML file to plain text in C#, get its text content via the GetText() method and then write the result to a .txt file.
// Create a Document object
Document doc = new Document();
// Load an HTML file
doc.LoadFromFile("sample.html", FileFormat.Html);
// Get text from HTML
string text = doc.GetText();
// Write to a text file
File.WriteAllText("HTMLText.txt", text);
Q3: Can Spire.Doc handle malformed or incomplete HTML?
A: Spire.Doc has good error tolerance and can handle imperfect HTML to some extent. However, severely malformed HTML might cause parsing issues. For best results, ensure your HTML is well-formed or use HTML sanitization libraries before parsing with Spire.Doc.
Q3: Can I use Spire.Doc in ASP.NET Core applications?
A: Yes, Spire.Doc is fully compatible with ASP.NET Core applications. The installation and usage process is the same as in other .NET applications.