Convert (render) HTML to Text with correct line-breaks

The code below works correctly with the example provided and even handles odd markup such as <div><br></div>. There are still some things to improve, but the basic idea is there. See the comments.

    public static string FormatLineBreaks(string html)
    {
        // first - remove all the existing '\n' from HTML
        // they mean nothing in HTML, but break our …

XPath wildcard in attribute value

Use the following expression:

    //span[contains(concat(' ', @class, ' '), ' amount ')]

You could use contains on its own, but that would also match classes like someamount. Test the above expression on the following input:

    <root>
      <span class="test amount blah"/>
      <span class="amount test"/>
      <span class="test amount"/>
      <span class="amount"/>
      <span class="someamount"/>
    </root>

It will select the …
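To check the expression against that input, here is a minimal sketch that assumes the XPath 1.0 engine built into .NET (System.Xml) rather than any HTML library. The class name, the MatchClasses helper and the naive comparison expression are illustrative and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathClassTokenDemo
{
    // The sample input from the answer, as a C# verbatim string.
    public const string SampleXml = @"<root>
  <span class=""test amount blah""/>
  <span class=""amount test""/>
  <span class=""test amount""/>
  <span class=""amount""/>
  <span class=""someamount""/>
</root>";

    // Pads @class with spaces so that ' amount ' only matches a whole class token.
    public const string TokenMatch =
        "//span[contains(concat(' ', @class, ' '), ' amount ')]";

    // Plain substring test, shown only for comparison.
    public const string NaiveMatch = "//span[contains(@class, 'amount')]";

    // Hypothetical helper: returns the class attribute of every node the XPath selects.
    public static List<string> MatchClasses(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var classes = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            classes.Add(node.Attributes["class"].Value);
        }
        return classes;
    }

    public static void Main()
    {
        // The padded expression skips "someamount"; the naive one does not.
        Console.WriteLine(string.Join(" | ", MatchClasses(SampleXml, TokenMatch)));
        Console.WriteLine(string.Join(" | ", MatchClasses(SampleXml, NaiveMatch)));
    }
}
```

With this input the padded expression matches four of the five spans, while the plain contains test also picks up someamount.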

HTML agility pack – removing unwanted tags without removing content?

I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm. It removes all tags except strong, em, u and raw text nodes.

    internal static string RemoveUnwantedTags(string data)
    {
        if (string.IsNullOrEmpty(data)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(data);

        var acceptableTags = new String[] { "strong", "em", "u" };

        var nodes = new …
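Since the excerpt stops before the loop, here is a separate sketch of the same goal. It uses a different technique from the truncated answer: instead of editing the parsed tree, it walks the tree recursively and rebuilds the output string. It assumes the HtmlAgilityPack NuGet package is referenced. The class and method names are hypothetical, and attributes on the kept tags are dropped for simplicity.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class TagWhitelistSketch
{
    // Tags that survive; everything else is unwrapped (its content is kept).
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "strong", "em", "u" };

    public static string KeepOnlyAllowedTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(html);

        var output = new StringBuilder();
        AppendChildren(document.DocumentNode, output);
        return output.ToString();
    }

    private static void AppendChildren(HtmlNode parent, StringBuilder output)
    {
        foreach (HtmlNode child in parent.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Raw text is copied through unchanged.
                output.Append(((HtmlTextNode)child).Text);
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                bool keep = Allowed.Contains(child.Name);
                if (keep) output.Append('<').Append(child.Name).Append('>');
                AppendChildren(child, output);
                if (keep) output.Append("</").Append(child.Name).Append('>');
            }
            // Comment nodes are skipped entirely.
        }
    }

    public static void Main()
    {
        Console.WriteLine(KeepOnlyAllowedTags(
            "<div><p>Hello <strong>big</strong> <span>world</span></p></div>"));
    }
}
```

Rebuilding the string avoids modifying a node collection while iterating over it, at the cost of losing attributes on the kept tags.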

HtmlAgilityPack and HtmlDecode

The Html Agility Pack is equipped with a utility class called HtmlEntity. It has a static method with the following signature:

    /// <summary>
    /// Replace known entities by characters.
    /// </summary>
    /// <param name="text">The source text.</param>
    /// <returns>The result text.</returns>
    public static string DeEntitize(string text)

It supports well-known entities (like &nbsp;) and encoded characters such …
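A minimal usage sketch of the method described above, assuming the HtmlAgilityPack NuGet package is referenced. The wrapper class, the Decode helper and the sample string are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

public static class DeEntitizeSketch
{
    // Thin wrapper so the call can be exercised from Main or from a test.
    public static string Decode(string text)
    {
        return HtmlEntity.DeEntitize(text);
    }

    public static void Main()
    {
        // Named entities in the input are replaced by the characters they stand for.
        Console.WriteLine(Decode("Tom &amp; Jerry &lt;3"));
    }
}
```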

HTML Agility pack – parsing tables

How about something like this, using HTML Agility Pack:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
    {
        Console.WriteLine("Found: " + table.Id);
        foreach (HtmlNode row in table.SelectNodes("tr"))
        {
            Console.WriteLine("row");
            foreach (HtmlNode cell in row.SelectNodes("th|td"))
            {
                Console.WriteLine("cell: " + cell.InnerText);
            }
        }
    }

Note that you can make it prettier with LINQ-to-Objects if …
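One way the same traversal might look with LINQ-to-Objects is sketched below. This is an assumption about the general shape, not the author's elided code. It requires the HtmlAgilityPack NuGet package, the class and method names are hypothetical, and it flattens the rows of every table in the document into a single list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableLinqSketch
{
    // Returns one list of cell texts per <tr>, across all tables in the document.
    public static List<List<string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("tr")
            .Select(row => row.ChildNodes
                .Where(cell => cell.Name == "th" || cell.Name == "td")
                .Select(cell => cell.InnerText)
                .ToList())
            .ToList();
    }

    public static void Main()
    {
        var rows = ReadRows(
            @"<html><body><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");

        foreach (var row in rows)
        {
            Console.WriteLine(string.Join(", ", row));
        }
    }
}
```

Unlike SelectNodes, which returns null when nothing matches, Descendants yields an empty sequence, so a document with no tables simply produces an empty list.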