In this post, I show an example of scraping data in C# using HtmlAgilityPack. I came across HtmlAgilityPack because I needed to get data from Zillow to analyze property deals. I was able to scrape the data I wanted without much trouble using HtmlAgilityPack with a bit of XPath, LINQ, and regular expressions.
Below is a screenshot of a sample Zillow listing page that contains the data I want to scrape.
I want to scrape the data under Facts and Features. The process is simple using .NET HttpClient and HtmlAgilityPack. First, I fetch the HTML content of the page as a string. Then, I use HtmlAgilityPack to parse the document and extract the data using XPath.
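The parse-and-extract step can be sketched on a small inline HTML string. This is only an illustration under stated assumptions: the `FactsParser` class, the `fact-list` class name, and the XPath expression are made up for the example and are not Zillow's actual page structure. It also assumes the HtmlAgilityPack NuGet package is installed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FactsParser
{
    // Returns the trimmed text of every <li> directly under a <ul class="fact-list">.
    // The class name is hypothetical; inspect the real page to find the right XPath.
    public static List<string> ExtractFacts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//ul[@class='fact-list']/li");
        if (nodes == null)
        {
            return new List<string>();
        }

        // Decode HTML entities such as &amp; and strip surrounding whitespace.
        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToList();
    }
}
```

Calling `FactsParser.ExtractFacts` on a string such as `<ul class='fact-list'><li>Bedrooms: 3</li></ul>` should give a list with one entry per `<li>`. The null check matters because a page whose layout has changed would otherwise throw when the result is enumerated.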
Stream HTML
It is quite easy to fetch the HTML of a Zillow listing page using .NET HttpClient, as shown in the code snippet below.
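using System.Net.Http;
using System.Threading.Tasks;

public class ZillowClient : IZillowClient
{
    private readonly HttpClient _httpClient;

    public ZillowClient(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    // Downloads the listing page for the given address and returns its HTML as a string.
    public Task<string> GetHtml(string address)
    {
        return _httpClient.GetStringAsync(BuildUrl(ZillowUtil.NormalizeAddress(address)));
    }

    // Builds the listing URL from an already-normalized address.
    private string BuildUrl(string address)
    {
        return $"https://www.zillow.com/homes/{address}";
    }
}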
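The snippet above calls `ZillowUtil.NormalizeAddress`, which is not shown here. As a rough idea of what such a helper might do, below is a minimal sketch. The `AddressSlug` class and its behavior are hypothetical: it assumes the URL expects a dash-separated address, which I have not verified against the site, and it is not the actual implementation behind `ZillowUtil`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressSlug
{
    // Hypothetical stand-in for ZillowUtil.NormalizeAddress, which the post does not show.
    public static string FromAddress(string address)
    {
        if (string.IsNullOrWhiteSpace(address))
        {
            throw new ArgumentException("Address must not be empty.", nameof(address));
        }

        // Collapse each run of whitespace into a single dash.
        string dashed = Regex.Replace(address.Trim(), @"\s+", "-");

        // Percent-encode anything that is not safe inside a URL path segment.
        return Uri.EscapeDataString(dashed);
    }
}
```

With a helper like this, the client could be used as `var html = await new ZillowClient(new HttpClient()).GetHtml("123 Main St");`. In a real application it is usually better to inject a shared `HttpClient` than to create a new one per call.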