June 15, 2015 by Nitesh

Getting Started With HTML Agility Pack

Friends,

In this post we will see how to get started with HTML Agility Pack, with code samples showing how web scraping can be done with this package in C#. For readers unfamiliar with HTML Agility Pack: it is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. In simple words, it is a .NET code library that allows you to parse “out of the web” files (be it HTML, PHP or ASPX).

Put more simply, you can use this library to scrape web pages on the Internet.

How to Get HTML Agility Pack in your application

You can add HTML Agility Pack to your application using NuGet. To install it in your project, run the following command in the Package Manager Console:

Install-Package HtmlAgilityPack 

Read this: How to add Nuget packages in your project

After adding the reference via NuGet, import the namespace in your code file with:

using HtmlAgilityPack;

Load a Page From Internet

To load a page directly from the web, you can use the following code:

 HtmlWeb web = new HtmlWeb();
 HtmlDocument document = web.Load("http://www.c-sharpcorner.com");

After executing these two lines of code, we have the entire page of http://c-sharpcorner.com in the document object, an instance of the HtmlDocument class.
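As a quick check that a page was parsed, we can read its title from the document. The sketch below is illustrative only: the class TitleReader and the helper GetTitle are hypothetical names, and the demo parses an inline string with LoadHtml so that it runs without network access. The same helper should accept the document returned by web.Load.

```csharp
using System;
using HtmlAgilityPack;

public static class TitleReader
{
    // Hypothetical helper: returns the trimmed <title> text, or null if the page has none.
    public static string GetTitle(HtmlDocument document)
    {
        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode titleNode = document.DocumentNode.SelectSingleNode("//title");
        return titleNode == null ? null : titleNode.InnerText.Trim();
    }

    public static void Main()
    {
        // Inline markup stands in for a downloaded page (an assumption for this demo).
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml("<html><head><title> Demo page </title></head><body></body></html>");
        Console.WriteLine(GetTitle(document));
    }
}
```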

Load a Page from a Saved Document

Often we need to load an HTML document from a file saved on the hard disk. To do that, we write the following code:

HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");

At this point, we have the entire HTML parsed and loaded in the document2 object.

Now let us see the sample HTML that we’re using in the sample.txt file:





	
	Link 3 outside all divs	
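For readers following along without the file, the sketch below defines a hypothetical stand-in for sample.txt and parses it from a string with LoadHtml. The markup is an assumption inferred from the examples in this post (a div with id div1, a second div with id div2, and a link outside all divs); the actual file may differ. The class name SampleMarkup and its members are also made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class SampleMarkup
{
    // Hypothetical stand-in for sample.txt; not the original file.
    public const string Html = @"<html>
<body>
  <div id='div1'>
    <a href='div1-a1'>Link 1 inside div1</a>
    <a href='div1-a2'>Link 2 inside div1</a>
  </div>
  <div id='div2'>
    <a href='div2-a1'>Link 1 inside div2</a>
    <a href='div2-a2'>Link 2 inside div2</a>
  </div>
  <a href='a3'>Link 3 outside all divs</a>
</body>
</html>";

    // Parses the stand-in markup from memory instead of from disk.
    public static HtmlDocument Load()
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(Html);
        return document;
    }

    public static void Main()
    {
        HtmlDocument document = Load();
        // Count the anchors to confirm that the markup was parsed.
        Console.WriteLine(document.DocumentNode.SelectNodes("//a").Count);
    }
}
```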
	



Get all Hyperlinks in a page

Once we have the HTML document loaded, let us see how we can get all hyperlinks from the page.

HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");
// SelectNodes returns null when nothing matches; ToArray() needs "using System.Linq;".
HtmlNode[] nodes = document2.DocumentNode.SelectNodes("//a").ToArray();
foreach (HtmlNode item in nodes)
{
    Console.WriteLine(item.InnerHtml);
}

This prints the inner HTML of every hyperlink in the page:

[Screenshot: html-agility-pack-1]
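In practice we often want the link targets rather than the inner HTML. The following is a minimal sketch under a few assumptions: the class LinkLister and the helper GetHrefs are hypothetical names, and the demo markup is made up. It also guards against the null that SelectNodes returns when a page has no matching anchors.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkLister
{
    // Hypothetical helper: returns the href value of every anchor that has one.
    public static List<string> GetHrefs(HtmlDocument document)
    {
        List<string> hrefs = new List<string>();
        HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs; // no anchors with an href in this document
        }
        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the default used if the attribute is absent.
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        // Made-up markup: one anchor with an href, one without.
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml("<div><a href='/home'>Home</a><a name='top'>Top</a></div>");
        foreach (string href in GetHrefs(document))
        {
            Console.WriteLine(href);
        }
    }
}
```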

Select a specific div in a page

To get a specific div in a page, we will use the following code:

HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");
// First() needs "using System.Linq;"; SelectNodes returns null if no div matches.
HtmlNode node = document2.DocumentNode.SelectNodes("//div[@id='div1']").First();

This code selects the div with id ‘div1’ from the page and returns it in node. You can now iterate over the ChildNodes property of the HtmlNode class to reach the child elements of that DOM element.
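Iterating over ChildNodes can be sketched as follows. Note that ChildNodes also contains text and comment nodes, so the hypothetical helper GetChildElementNames (in a made-up class ChildLister) keeps only elements. The inline markup is an assumption for the demo.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ChildLister
{
    // Hypothetical helper: returns the tag names of the direct child elements.
    public static List<string> GetChildElementNames(HtmlNode parent)
    {
        List<string> names = new List<string>();
        foreach (HtmlNode child in parent.ChildNodes)
        {
            // Skip text and comment nodes; keep only element nodes.
            if (child.NodeType == HtmlNodeType.Element)
            {
                names.Add(child.Name);
            }
        }
        return names;
    }

    public static void Main()
    {
        // Made-up markup with two child elements and some loose text.
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml("<div id='div1'><a href='#'>One</a> text <span>Two</span></div>");
        HtmlNode node = document.DocumentNode.SelectSingleNode("//div[@id='div1']");
        foreach (string name in GetChildElementNames(node))
        {
            Console.WriteLine(name);
        }
    }
}
```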

Select all Hyperlinks within a specific div

To select all hyperlinks within a specific div, we can use either of the following two approaches:

HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");

// Approach 1: select the div first, then search relative to it (note the leading dot)
HtmlNode node = document2.DocumentNode.SelectNodes("//div[@id='div1']").First();

HtmlNode[] aNodes = node.SelectNodes(".//a").ToArray();

// Approach 2: a single XPath from the document root
HtmlNode[] aNodes2 = document2.DocumentNode.SelectNodes("//div[@id='div1']//a").ToArray();

Both approaches give the following output:

[Screenshot: html-agility-pack-2]
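The leading dot in Approach 1 matters: an XPath that starts with // is evaluated from the document root even when it is called on a child node. The sketch below illustrates the difference with made-up markup; the class ScopedSearch and the helper CountMatches are hypothetical names.

```csharp
using System;
using HtmlAgilityPack;

public static class ScopedSearch
{
    // Hypothetical helper: counts the nodes an XPath selects when run from 'scope'.
    public static int CountMatches(HtmlNode scope, string xpath)
    {
        HtmlNodeCollection found = scope.SelectNodes(xpath);
        return found == null ? 0 : found.Count;
    }

    public static void Main()
    {
        // Made-up markup: two links inside div1 and one outside it.
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(
            "<div id='div1'><a href='#1'>One</a><a href='#2'>Two</a></div>" +
            "<a href='#3'>Three</a>");
        HtmlNode div = document.DocumentNode.SelectSingleNode("//div[@id='div1']");

        // Relative query: searches only below the div.
        Console.WriteLine(CountMatches(div, ".//a"));
        // Absolute query: searches the whole document, even though it is called on the div.
        Console.WriteLine(CountMatches(div, "//a"));
    }
}
```

With the inline markup above, the relative query counts only the two links inside div1, while the absolute query also counts the link outside it.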

Filter hyperlinks for certain conditions

If you want to filter nodes based on conditions, you can also use LINQ to query the nodes and return only the ones you need. For example, the code below returns all hyperlinks whose link text contains ‘div2’.

HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");

HtmlNode[] nodes = document2.DocumentNode.SelectNodes("//a").Where(x=>x.InnerHtml.Contains("div2")).ToArray();
foreach (HtmlNode item in nodes)
{
    Console.WriteLine(item.InnerHtml);
}

The above code gives the following output:

[Screenshot: html-agility-pack-3]
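A variation on the same idea uses the Descendants method instead of XPath. Descendants yields an empty sequence rather than null when nothing matches, so the LINQ chain does not need a null check. This is only a sketch: the class LinkFilter, the helper LinksContaining and the inline markup are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkFilter
{
    // Hypothetical helper: returns the text of every anchor whose text contains 'text'.
    public static List<string> LinksContaining(HtmlDocument document, string text)
    {
        return document.DocumentNode
            .Descendants("a")                        // all <a> elements, never null
            .Where(a => a.InnerText.Contains(text))  // keep anchors with matching text
            .Select(a => a.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        // Made-up markup with links in two divs.
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(
            "<div id='div1'><a href='#1'>Link 1 inside div1</a></div>" +
            "<div id='div2'><a href='#2'>Link 1 inside div2</a></div>");
        foreach (string linkText in LinksContaining(document, "div2"))
        {
            Console.WriteLine(linkText);
        }
    }
}
```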

Hope this post gives you a head start with HTML Agility Pack. If you have any questions or would like me to provide some support using this, please connect with me here.

#C# #HTML Agility Pack #scraping
  • Ketul Suthar

    How do I get a node which does not have the class “footer”?

    e.g. document2.SelectSingleNode(".//ul/li/div/dl/dd/a[@href]")

    or

    document2.SelectSingleNode("//div[@class != footer]ul/li/div/dl/dd/a[@href]") ?

    • The first approach will not work, as it does not check for the “footer” class on the div. The second approach should work, although I have not checked it yet. Also, you’re missing // after div in the second approach.
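A hedged sketch of the question above: in XPath 1.0 the value footer needs quotes, and not(@class='footer') also keeps divs that have no class attribute at all, whereas @class != 'footer' drops them. The markup, the class FooterFilter and the helper HrefsFor are hypothetical, and the query is called on DocumentNode.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FooterFilter
{
    // Made-up markup: a footer div, a div with another class, and a div with no class.
    public const string Html =
        "<div class='footer'><ul><li><div><dl><dd><a href='/f'>F</a></dd></dl></div></li></ul></div>" +
        "<div class='main'><ul><li><div><dl><dd><a href='/m'>M</a></dd></dl></div></li></ul></div>" +
        "<div><ul><li><div><dl><dd><a href='/n'>N</a></dd></dl></div></li></ul></div>";

    // Hypothetical helper: returns the href of every anchor the XPath selects.
    public static List<string> HrefsFor(string xpath)
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(Html);
        List<string> hrefs = new List<string>();
        HtmlNodeCollection found = document.DocumentNode.SelectNodes(xpath);
        if (found != null)
        {
            foreach (HtmlNode anchor in found)
            {
                hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
            }
        }
        return hrefs;
    }

    public static void Main()
    {
        // Keeps divs whose class is not 'footer', including divs with no class.
        Console.WriteLine(string.Join(", ",
            HrefsFor("//div[not(@class='footer')]/ul/li/div/dl/dd/a[@href]")));
        // Keeps only divs that have a class attribute different from 'footer'.
        Console.WriteLine(string.Join(", ",
            HrefsFor("//div[@class != 'footer']/ul/li/div/dl/dd/a[@href]")));
    }
}
```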
