Parsing HTML with the WebBrowser Control

It is relatively easy to parse HTML with the WebBrowser control from VB.NET, C#, or any other .NET language. The trick is to first load a blank document into the control, using one of the following methods:

1)

CSharp


WebBrowser wb = new WebBrowser();

wb.Navigate("about:blank");

VB.net


Dim wb As New WebBrowser()

wb.Navigate("about:blank")

2)

CSharp


WebBrowser wb = new WebBrowser();

wb.Navigate(string.Empty);

VB.net


Dim wb As New WebBrowser()

wb.Navigate(String.Empty)

The steps above initialize the browser. You can then supply the content, either by setting the DocumentText or DocumentStream property (for example from a memory stream or from text loaded from the database) or by writing to the document directly:

CSharp


WebBrowser wb = new WebBrowser();

wb.Navigate(string.Empty);

wb.Document.Write("<html><head><title></title></head><body><p>Hello World</p></body></html>");

VB.net


Dim wb As New WebBrowser()

wb.Navigate(String.Empty)

wb.Document.Write("<html><head><title></title></head><body><p>Hello World</p></body></html>")
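Navigate does not necessarily finish before it returns, so depending on timing wb.Document may still be null on the very next line. A more defensive variant, sketched below, waits for the DocumentCompleted event before writing. The class name BlankDocumentDemo, the host form, and the written flag are illustrative assumptions, not part of the original code.

```csharp
using System;
using System.Windows.Forms;

static class BlankDocumentDemo
{
    [STAThread] // WebBrowser wraps an ActiveX control and needs an STA thread
    static void Main()
    {
        var form = new Form { Text = "WebBrowser demo" };
        var wb = new WebBrowser { Dock = DockStyle.Fill };
        form.Controls.Add(wb);

        const string html =
            "<html><head><title></title></head><body><p>Hello World</p></body></html>";
        bool written = false; // guard in case the event fires more than once

        wb.DocumentCompleted += (sender, e) =>
        {
            if (written) return;
            written = true;

            // The blank document has loaded, so Document is available here.
            HtmlDocument doc = wb.Document.OpenNew(true);
            doc.Write(html);
        };

        wb.Navigate("about:blank");
        Application.Run(form);
    }
}
```

If you only need to show a string of markup, assigning it to wb.DocumentText is usually the shorter route, since it does not require loading a blank page first.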

Note that the two versions differ only slightly: the VB.NET code has no trailing semicolons, and the variable declaration syntax is different.

Getting Access to the document object from the WebBrowser control

The document object hierarchy can be reached in a couple of different ways. The default and easiest way is to use the WebBrowser control's built-in Document property.
CSharp


HtmlDocument doc = wb.Document;

VB.net


Dim doc As HtmlDocument = wb.Document

From this object you can walk the HtmlElement hierarchy, or get a flat list of all elements through HtmlDocument.All. The problem with this approach is that you don't get direct access to the text nodes.
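As a rough illustration of both views, the sketch below prints the flat All collection and then the nested tree via HtmlElement.Children. The class and method names (ElementDump, PrintAllTags, PrintTree) are made up for this example, and it assumes the document has already finished loading.

```csharp
using System;
using System.Windows.Forms;

static class ElementDump
{
    // Flat view: every element in the document, in document order.
    public static void PrintAllTags(HtmlDocument doc)
    {
        foreach (HtmlElement el in doc.All)
        {
            Console.WriteLine(el.TagName);
        }
    }

    // Hierarchical view: indent each element by its depth in the tree.
    public static void PrintTree(HtmlElement el, int depth)
    {
        if (el == null) return; // e.g. Body is null before a page has loaded

        Console.WriteLine(new string(' ', depth * 2) + el.TagName);
        foreach (HtmlElement child in el.Children)
        {
            PrintTree(child, depth + 1);
        }
    }
}
```

A typical call would be ElementDump.PrintTree(wb.Document.Body, 0) from a DocumentCompleted handler. Only elements show up in either listing; the text between them does not.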

Accessing the Text nodes and complete hierarchy with the WebBrowser control

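To get access to the text nodes, you have to use the DomDocument property and cast it to the appropriate interface. These interfaces live in the mshtml namespace, so the project needs a reference to the Microsoft.mshtml interop assembly.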
CSharp


IHTMLDocument3 doc = (IHTMLDocument3)wb.Document.DomDocument;
IHTMLDOMChildrenCollection col = (IHTMLDOMChildrenCollection)doc.childNodes;

VB.net


Dim doc As IHTMLDocument3 = DirectCast(wb.Document.DomDocument, IHTMLDocument3)
Dim col As IHTMLDOMChildrenCollection = DirectCast(doc.childNodes, IHTMLDOMChildrenCollection)

Once you have the collection, you can tell the node types apart by checking nodeType, which is 3 for text nodes.
CSharp

foreach (IHTMLDOMNode el in col){
    if (el.nodeType == 3){
        // do something with text node
    }else{
        // do something with regular node
    }
}

VB.net

For Each el As IHTMLDOMNode In col
    If el.nodeType = 3 Then
        ' do something with text node
    Else
        ' do something with regular node
    End If
Next
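The loops above only look at the top-level children of the document. To reach text nodes deeper in the tree, one option is to recurse into each non-text node, as in the sketch below. The names TextNodeWalker, CollectText, and Walk are hypothetical, and the code assumes, as the loops above do, that the children collection can be enumerated with foreach.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;
using mshtml; // requires a reference to the Microsoft.mshtml interop assembly

static class TextNodeWalker
{
    const int TextNode = 3; // DOM nodeType value for text nodes

    // Returns the value of every text node in the loaded document.
    public static List<string> CollectText(WebBrowser wb)
    {
        var result = new List<string>();
        var doc = (IHTMLDocument3)wb.Document.DomDocument;
        Walk((IHTMLDOMChildrenCollection)doc.childNodes, result);
        return result;
    }

    static void Walk(IHTMLDOMChildrenCollection nodes, List<string> result)
    {
        foreach (IHTMLDOMNode node in nodes)
        {
            if (node.nodeType == TextNode)
            {
                // nodeValue holds the text of a text node
                result.Add(node.nodeValue as string);
            }
            else if (node.hasChildNodes())
            {
                // regular node: descend into its children
                Walk((IHTMLDOMChildrenCollection)node.childNodes, result);
            }
        }
    }
}
```

Call TextNodeWalker.CollectText(wb) only after the document has loaded, since wb.Document can be null before then.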

That's it for now. Happy coding!

ttessier

About ttessier

Professional Developer and Operator of SwhistleSoft