Mac / iPhone App Development
Home of a Small Time Developer

Objective-C HTML Parser

Here is a basic HTML parser which I've used in several projects recently. The code depends on libxml2 and is basically a thin wrapper that provides a more convenient interface for parsing HTML with Objective-C. It has only been tested on iPhone OS 3.1.3 and 3.2; if you're using OS X, you're probably better off using WebKit to manipulate your DOM.

The code below is provided under an MIT license, but if you do make any updates it would be great if you could send them back.

Usage:

    // Example: download Google's page source and print the URLs of all the images
    NSError * error = nil;
    HTMLParser * parser = [[HTMLParser alloc] initWithContentsOfURL:[NSURL URLWithString:@"http://www.google.com"] error:&error];

    if (error) {
        NSLog(@"Error: %@", error);
        return;
    }

    HTMLNode * bodyNode = [parser body]; // Find the body tag

    NSArray * imageNodes = [bodyNode findChildTags:@"img"]; // Get all the img tags

    for (HTMLNode * imageNode in imageNodes) { // Loop through all the img tags
        NSLog(@"Found image with src: %@", [imageNode getAttributeNamed:@"src"]); // Print the src attribute
    }

    [parser release];

You can grab a copy of the code here on github:
http://github.com/zootreeves/Objective-C-HMTL-Parser
git@github.com:zootreeves/Objective-C-HMTL-Parser.git

Note:

-(NSString*)rawContents; does not work. To dump the entire HTML contents of a node, use NSString * rawContentsOfNode(xmlNode * node, htmlDocPtr doc); instead.
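
For anyone curious what dumping a node involves at the libxml2 level, here is a minimal sketch. It is not the code from the repository: the helper name DumpNodeAsHTML is made up for illustration, and it assumes you already have the underlying xmlNodePtr in hand. It relies only on libxml2's htmlNodeDump, xmlBufferCreate, xmlBufferContent and xmlBufferFree.

```objective-c
#import <Foundation/Foundation.h>
#import <string.h>
#import <libxml/HTMLparser.h>
#import <libxml/HTMLtree.h>

// Hypothetical helper, not part of the published HTMLNode API.
// Serialises a node, including its tags and children, to an NSString.
static NSString * DumpNodeAsHTML(xmlNodePtr node) {
    if (node == NULL) {
        return nil;
    }

    xmlBufferPtr buffer = xmlBufferCreate();
    if (buffer == NULL) {
        return nil;
    }

    NSString * result = nil;

    // htmlNodeDump writes the node's markup into the buffer; -1 means failure
    if (htmlNodeDump(buffer, node->doc, node) >= 0) {
        const xmlChar * bytes = xmlBufferContent(buffer);
        if (bytes != NULL) {
            result = [NSString stringWithUTF8String:(const char *)bytes];
        }
    }

    xmlBufferFree(buffer); // the NSString holds its own copy of the bytes
    return result;
}

int main(void) {
    @autoreleasepool {
        const char * source = "<p>Hello <b>bold</b></p>";

        // Parse straight from memory so the example needs no network access
        htmlDocPtr doc = htmlReadMemory(source, (int)strlen(source), NULL, "UTF-8",
                                        HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (doc != NULL) {
            NSLog(@"%@", DumpNodeAsHTML(xmlDocGetRootElement(doc)));
            xmlFreeDoc(doc);
        }
    }
    return 0;
}
```

To build this you will typically need the libxml2 headers on the header search path (often /usr/include/libxml2) and to link against libxml2 (-lxml2). The sketch takes a single node argument and reads the document from node->doc, so adapt the signature if your copy of the code expects the document to be passed in separately.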

Comments Ahead

  1. by James Allchin on October 23rd, 2010 at 4:49 pm

    Hi Ben,

    On GitHub there seems to be a problem with your latest check-in: HTMLNode.h has not been updated to match the code change in HTMLNode.m.

    The mod required to HTMLNode.h is:

    //Returns the contents including html tags
    -(NSString*)rawContents;
    // JA
    //NSString * rawContentsOfNode(xmlNode * node, htmlDocPtr doc);
    NSString * rawContentsOfNode(xmlNode * node);

    There is also a mistake in the latest HTMLNode.m: there is no function called hxmlNodeDumpOutput.

    The mod required to HTMLNode.m is:

    // JA
    //hxmlNodeDumpOutput(buf, node->doc, node, 3, 0, NULL);
    xmlNodeDumpOutput(buf, node->doc, node, 3, 0, NULL);

    Hope this helps.

    Cheers

    James
