HTML Parsing in C#: Strings, Files, and URLs

How to Parse HTML in C#

The need to efficiently parse HTML in C# is a common requirement for many development tasks, from web scraping, data extraction to content automation. While .NET offers built-in tools (e.g., HtmlAgilityPack), Spire.Doc simplifies HTML parsing in C# with its intuitive object model and seamless integration.

This guide explores how to leverage Spire.Doc for .NET to parse HTML, including loading HTML from various sources, navigating document structure, and extracting critical data.

Setting Up Spire.Doc
How Spire.Doc Parses HTML
How to Load and Parse HTML Content
Conclusion
Common Questions

Setting Up Spire.Doc

The easiest way to integrate the C# HTML parser library into your project is via NuGet:

Open your project in Visual Studio.
Right-click the project in the Solution Explorer → Select Manage NuGet Packages.
In the NuGet Package Manager, search for Spire.Doc.
Select the latest stable version and click Install.

Alternatively, download the library directly from the E-iceblue website, extract the ZIP file, and reference Spire.Doc.dll in your project.

How Spire.Doc Parses HTML

Spire.Doc converts HTML into a structured object model, where elements like <p>, <a>, and <table> are mapped to classes you can programmatically access. Key components include:

Document: Acts as the container for parsed HTML content.
Section: Represents a block of content (similar to HTML’s <body> or <div> sections).
Paragraph: Maps to HTML block elements like <p>, <h1>, or <li>.
DocumentObject: Base class for all elements within a Paragraph (images, links, etc.).

This model ensures that HTML structures are preserved and accessible via intuitive C# properties and methods.

How to Load and Parse HTML Content

Spire.Doc supports parsing HTML from strings, local files, or even remote URLs (when combined with HTTP clients). Below are detailed examples for each scenario.

Parse an HTML String in C#

Parse an HTML string (e.g., from a web API or database) into Spire.Doc’s object model for inspection.

using Spire.Doc;
using Spire.Doc.Documents;

namespace ParseHtmlString
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Document object
            Document doc = new Document();

            // Add a section to act as a container
            Section section = doc.AddSection();
            // Add a paragraph
            Paragraph para = section.AddParagraph();

            // Define HTML content to parse
            string htmlContent = @"
                <h2>Sample HTML String</h2>
                <p>This is a paragraph with <strong>bold text</strong> and a <a href='https://www.e-iceblue.com/'>link</a>.</p>
                <ul>
                    <li>List item 1</li>
                    <li>List item 2</li>
                </ul>
            ";

            // Parse the HTML string into the paragraph
            para.AppendHTML(htmlContent);

            // Print all paragraph text
            Console.WriteLine("Parsed HTML Content:");
            Console.WriteLine("---------------------");

            foreach (Paragraph paragraph in section.Paragraphs)
            {
                Console.WriteLine(paragraph.Text);
            }
        }
    }
}

In this code, the method AppendHTML() automatically converts HTML tags to corresponding Spire.Doc objects (e.g., <h1> → Heading1 style, <ul> → list paragraphs).

Output:

Parse an HTML string using C#

Pro Tip: You can also call the SaveToFile() method to convert the HTML string to Word in C#.

Parse an HTML File in C#

For HTML content stored in a file (e.g., downloaded web pages, static HTML reports), load it via LoadFromFile() and then analyze its structure (e.g., extracting headings, paragraphs).

using Spire.Doc;
using Spire.Doc.Documents;

namespace ParseHtmlFile
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Document object
            Document doc = new Document();

            // Load an HTML file
            doc.LoadFromFile("sample.html", FileFormat.Html);

            // Traverse sections (HTML body blocks)
            foreach (Section section in doc.Sections)
            {
                Console.WriteLine($"Section {doc.Sections.IndexOf(section) + 1}:");
                Console.WriteLine("---------------------------------");

                // Traverse paragraphs in the section
                foreach (Paragraph para in section.Paragraphs)
                {
                    // Print paragraph text and style (e.g., heading level)
                    string styleName = para.StyleName;
                    Console.WriteLine($"[{styleName}] {para.Text}"+ "\n");
                }
                Console.WriteLine();
            }
        }
    }
}

This C# code example loads a local HTML file and then uses the Paragraph.StyleName and Paragraph.Text properties to extract content along with its styling information.

Output:

Load and parse an HTML file using C#

Spire.Doc’s object model allows you to interact with an HTML file just like you would with a Word document. In addition to extracting text content, you can also extract elements like links, tables from HTML.

Parse a URL in C#

To parse HTML from a web page, combine Spire.Doc with HttpClient to fetch the HTML content first, then parse it.

using Spire.Doc;
using Spire.Doc.Documents;

namespace HtmlUrlParsing
{
    class Program
    {
        // HttpClient instance for web requests
        private static readonly HttpClient httpClient = new HttpClient();

        static async Task Main(string[] args)
        {
            try
            {
                // Fetch HTML from a URL
                string url = "https://www.e-iceblue.com/privacypolicy.html";
                Console.WriteLine($"Fetching HTML from: {url}");
                string htmlContent = await FetchHtmlFromUrl(url);

                // Parse the fetched HTML
                Document doc = new Document();
                Section section = doc.AddSection();
                Paragraph paragraph = section.AddParagraph();
                paragraph.AppendHTML(htmlContent);

                // Extract key information
                Console.WriteLine("\nParsed Content Summary:");
                Console.WriteLine($"Sections: {doc.Sections.Count}");
                Console.WriteLine($"Paragraphs: {section.Paragraphs.Count}");
                Console.WriteLine("-------------------------------------------");

                // Extract all heading paragraphs
                foreach (Paragraph para in section.Paragraphs)
                {
                    if (para.StyleName.StartsWith("Heading"))
                    {
                        string headings = para.Text;
                        Console.WriteLine($"Headings: {headings}");
                    }
                }
                
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error: {ex.Message}");
            }
        }

        // Helper method to fetch HTML from a URL
        private static async Task<string> FetchHtmlFromUrl(string url)
        {
            // Set a user-agent to avoid being blocked by servers
            httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64)");

            // Send GET request and return HTML content
            HttpResponseMessage response = await httpClient.GetAsync(url);
            response.EnsureSuccessStatusCode(); // Throw on HTTP errors (4xx, 5xx)
            return await response.Content.ReadAsStringAsync();
        }
    }
}

This C# code combines web scraping (fetching HTML from a URL) with document parsing (using Spire.Doc) to extract structured information (like headings) from web content. It’s useful for scenarios like content analysis or web data extraction.

Output:

Parse HTML from a web URL using C#

Conclusion

Spire.Doc for .NET provides a comprehensive solution for reading HTML in C# applications. Whether you're working with HTML strings, local files, or even web URLs, this library streamlines the process with intuitive APIs and reliable performance. By following the examples outlined in this guide, you can efficiently integrate HTML parsing capabilities into your .NET projects.

To fully experience the capabilities of Spire.Doc for .NET, request a free 30-day trial license here.

Common Questions

Q1: Why use Spire.Doc for HTML parsing instead of HtmlAgilityPack?

A: Spire.Doc and HtmlAgilityPack serve different primary goals, so the choice depends on your needs:

HtmlAgilityPack: A lightweight library only for parsing and manipulating raw HTML (e.g., extracting tags, fixing invalid HTML). It does not handle document formatting or export to Word.
Spire.Doc: Designed for document manipulation first - it parses HTML and maps it directly to structured Word elements (sections, paragraphs, styles like headings/bold). This is critical if you need to:
- Preserve HTML structure in an editable Word file.
- Extract styled content (e.g., identify "Heading 1" vs. "Normal" text).
- Export parsed HTML to RTF, TXT, PDF, etc.

Q2. How do I convert HTML to Text in C#

A: To convert an HTML file to plain text in C#, get its text content via the GetText() method and then write the result to a .txt file.

// Create a Document object
Document doc = new Document();

// Load an HTML file
doc.LoadFromFile("sample.html", FileFormat.Html);

// Get text from HTML
string text = doc.GetText();

// Write to a text file
File.WriteAllText("HTMLText.txt", text);

Q3: Can Spire.Doc handle malformed or incomplete HTML?

A: Spire.Doc has good error tolerance and can handle imperfect HTML to some extent. However, severely malformed HTML might cause parsing issues. For best results, ensure your HTML is well-formed or use HTML sanitization libraries before parsing with Spire.Doc.

Q3: Can I use Spire.Doc in ASP.NET Core applications?

A: Yes, Spire.Doc is fully compatible with ASP.NET Core applications. The installation and usage process is the same as in other .NET applications.

HTML Parsing in C#: Strings, Files, and URLs

Setting Up Spire.Doc

How Spire.Doc Parses HTML

How to Load and Parse HTML Content

Parse an HTML String in C#

Parse an HTML File in C#

Parse a URL in C#

Conclusion

Common Questions

Q1: Why use Spire.Doc for HTML parsing instead of HtmlAgilityPack?

Q2. How do I convert HTML to Text in C#

Q3: Can Spire.Doc handle malformed or incomplete HTML?

Q3: Can I use Spire.Doc in ASP.NET Core applications?

See Also