Parsing Html with WebBrowser Control
It is relatively easy to parse HTML with the WebBrowser control from VB.net, C#, or any other .NET language. The trick is to first load a blank document into the control, using one of the following methods:
1)
CSharp
WebBrowser wb = new WebBrowser();
wb.Navigate("about:blank");
VB.net
Dim wb As New WebBrowser()
wb.Navigate("about:blank")
2)
CSharp
WebBrowser wb = new WebBrowser();
wb.Navigate(String.Empty);
VB.net
Dim wb As New WebBrowser()
wb.Navigate(String.Empty)
Either of the above steps initializes the browser. You can then move ahead and fill the document, either by setting the document text or document stream (for example from a memory stream or from text stored in the database) or by writing the markup directly:
CSharp
WebBrowser wb = new WebBrowser();
wb.Navigate(String.Empty);
wb.Document.Write("<html><head><title></title></head><body><p>Hello World</p></body></html>");
VB.net
Dim wb As New WebBrowser()
wb.Navigate(String.Empty)
wb.Document.Write("<html><head><title></title></head><body><p>Hello World</p></body></html>")
Please note that the two versions differ only in syntax: the VB.net code has no trailing semicolons, and the variable is declared slightly differently.
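The "document text" route mentioned above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: it assumes a console project that references System.Windows.Forms, and it waits for the DocumentCompleted event before reading the document, since the control loads its content asynchronously. The class name Program and the choice to print the body text are illustrative only.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread] // the WebBrowser control requires a single-threaded apartment
    static void Main()
    {
        var wb = new WebBrowser();

        // Read the document only after the control reports that loading finished.
        wb.DocumentCompleted += (sender, e) =>
        {
            HtmlDocument doc = wb.Document;
            if (doc != null && doc.Body != null)
            {
                Console.WriteLine(doc.Body.InnerText);
            }
            Application.ExitThread(); // stop the message loop started below
        };

        // Assigning DocumentText loads the markup without an explicit Navigate call.
        wb.DocumentText =
            "<html><head><title></title></head><body><p>Hello World</p></body></html>";

        // The control needs a running message loop to process the load.
        Application.Run();
    }
}
```

Inside a normal Windows Forms application the message loop already exists, so only the event handler and the DocumentText assignment would be needed.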
Getting Access to the document object from the WebBrowser control
The document object hierarchy can be reached in a couple of different ways. The default and easiest way is to use the built-in Document property of the WebBrowser control.
CSharp
HtmlDocument doc = wb.Document;
VB.net
Dim doc As HtmlDocument = wb.Document
From this object you can retrieve the HtmlElement hierarchy, or a flat list of all the elements via HtmlDocument.All. The problem with this approach is that you don’t get direct access to the text nodes.
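As an illustration of both options, here is a small sketch that prints the flat list from HtmlDocument.All and then walks the HtmlElement hierarchy through the Children collection. The class and method names (ElementDump, PrintFlat, PrintTree) are hypothetical helpers, and the document is assumed to be fully loaded before they are called.

```csharp
using System;
using System.Windows.Forms;

static class ElementDump
{
    // Prints the tag name of every element in the flat HtmlDocument.All list.
    public static void PrintFlat(HtmlDocument doc)
    {
        foreach (HtmlElement el in doc.All)
        {
            Console.WriteLine(el.TagName);
        }
    }

    // Prints the element hierarchy, indenting each level by two spaces.
    public static void PrintTree(HtmlElement el, int depth)
    {
        Console.WriteLine(new string(' ', depth * 2) + el.TagName);
        foreach (HtmlElement child in el.Children)
        {
            PrintTree(child, depth + 1);
        }
    }
}
```

A call such as ElementDump.PrintTree(wb.Document.Body, 0) would list the elements under the body. Text between the tags does not appear, because only elements are exposed this way.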
Accessing the Text nodes and complete hierarchy with the WebBrowser control
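In order to get access to the text nodes, you have to use the DomDocument property and cast it to the appropriate MSHTML interface. These interfaces live in the mshtml namespace, so the project needs a reference to Microsoft.mshtml.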
CSharp
IHTMLDocument3 doc = (IHTMLDocument3)wb.Document.DomDocument;
IHTMLDOMChildrenCollection col = (IHTMLDOMChildrenCollection)doc.childNodes;
VB.net
Dim doc As IHTMLDocument3 = DirectCast(wb.Document.DomDocument, IHTMLDocument3)
Dim col As IHTMLDOMChildrenCollection = DirectCast(doc.childNodes, IHTMLDOMChildrenCollection)
Once this is accomplished, the nodeType property can be used to tell the node types apart. A value of 3 marks a text node.
CSharp
foreach (IHTMLDOMNode el in col)
{
    if (el.nodeType == 3)
    {
        // do something with text node
    }
    else
    {
        // do something with regular node
    }
}
VB.net
For Each el As IHTMLDOMNode In col
    If el.nodeType = 3 Then
        ' do something with text node
    Else
        ' do something with regular node
    End If
Next
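The loop above only looks at the top-level nodes. To reach text nodes deeper in the tree, the same nodeType check can be applied recursively. The following is a sketch under stated assumptions: the project references Microsoft.mshtml, and DomWalker and Walk are hypothetical names chosen for illustration. It follows the firstChild and nextSibling members of IHTMLDOMNode instead of casting each childNodes collection.

```csharp
using System;
using mshtml; // assumes a project reference to Microsoft.mshtml

static class DomWalker
{
    const int TextNode = 3; // DOM nodeType value for text nodes

    // Prints each node with indentation, showing the value of text nodes.
    public static void Walk(IHTMLDOMNode node, int depth)
    {
        string indent = new string(' ', depth * 2);

        if (node.nodeType == TextNode)
        {
            Console.WriteLine(indent + "#text: " + node.nodeValue);
        }
        else
        {
            Console.WriteLine(indent + node.nodeName);
        }

        // Visit the children in order by following the sibling chain.
        for (IHTMLDOMNode child = node.firstChild; child != null; child = child.nextSibling)
        {
            Walk(child, depth + 1);
        }
    }
}
```

One way to start the walk, assuming the document has finished loading, is to cast the documentElement of the IHTMLDocument3 shown earlier: DomWalker.Walk((IHTMLDOMNode)doc.documentElement, 0).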
That's it for now, happy coding.