Using Spare Time to Learn and Share: Parsing Microformats

It bugs me when I look at the previous XML example and see "Brian Suda" encoded twice, once for FN and then again for N. With HTML this isn't a problem: we can combine those two XML elements using space-separated values in the class attribute. It is a little-known fact that the class, rel, and rev attributes in HTML can actually take a space-separated list of values. If we combine the FN and N we get something like this:
<div class="n fn">
<div class="given-name">Brian</div>
<div class="family-name">Suda</div>
</div>

Now the N property still has its children and the FN has the same value as before. Remember, HTML collapses whitespace, so the FN is still "Brian Suda" even though it is now spread over two <div>s with whitespace between them.
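To see the whitespace collapsing at work, here is a minimal Python sketch. Python is used only for illustration, and the helper name `collapsed_text` is hypothetical. It reads the combined fragment with the standard library and computes the FN value roughly the way a parser would.

```python
import xml.etree.ElementTree as ET

# The combined n/fn fragment from the text above.
FRAGMENT = """
<div class="n fn">
<div class="given-name">Brian</div>
<div class="family-name">Suda</div>
</div>
"""

def collapsed_text(element):
    """Join all descendant text, then collapse whitespace runs to single spaces."""
    raw = "".join(element.itertext())
    return " ".join(raw.split())

root = ET.fromstring(FRAGMENT.strip())
fn = collapsed_text(root)
print(fn)  # Brian Suda
```

The line breaks between the child elements end up as a single space, so FN and N can share one element without changing the FN value.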
So, we have sorted out the ability to condense multiple properties with the same value. The next thing that bothers me about the XML example is that the URL is displayed; it doesn't seem natural. In XML we are talking about data, but the HTML is being displayed to people in a browser. Coincidentally, there is an <a> element, which has an href attribute that takes the URL value and also a node-value to display more human-friendly text. We can further refine our HTML example to include the URL by switching the <div> to an <a> element.
<a class="n fn url" href="http://suda.co.uk">
<span class="given-name">Brian</span>
<span class="family-name">Suda</span>
</a>
After switching to the <a> element, we needed to change the child <div>s to <span>s because the <a> element can only contain inline elements as children. Microformats do not force publishers to use specific elements, but it is recommended that you use the most semantic element for each case. For URL data, an <a> element makes the most sense. Because of this, the parsing rules change slightly (we'll discuss this in a bit).
The final hCard microformat might look something like the following in HTML:
<div class="vcard">
<a class="n fn url" href="http://suda.co.uk">
<span class="given-name">Brian</span>
<span class="family-name">Suda</span>
</a>
</div>
To me, this is much more intuitive, simpler, and more compact than the XML example at the start. People are already publishing blogrolls and links in this manner and all browsers recognize and style this information, plus it can easily be passed around inside a feed.

Parsing with XSLT

Let's take that HTML example and try to parse it using XSLT.
Microformats are designed to work with HTML 4 and up. The downside to using XSLT is that the document needs to be well-formed, and HTML 4 does not require that: it can use <br>, <img>, and <p> elements without closing tags. If you were using a different technology, such as regular expressions or the DOM, to extract microformats, this would be a separate issue, but with XSLT we need to clean up the HTML first. There are two simple ways to do this: TIDY, or a function like HTMLlib or loadHTML. Either will load the HTML document and convert it into a usable state for XSLT.
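As a rough illustration of what such a clean-up step does, the following Python sketch turns loosely written HTML into a well-formed tree using only the standard library. It is a stand-in for the idea, not a replacement for TIDY or loadHTML: the class name `WellFormer`, the function `to_well_formed`, and the short list of void elements are all assumptions made for this example, and it handles only unclosed <br>, <img>, and <p> style cases.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML 4 (partial list, an assumption).
VOID_ELEMENTS = {"br", "img", "hr", "input", "meta", "link"}

class WellFormer(HTMLParser):
    """Builds an ElementTree from loosely written HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # A new <p> implicitly closes a <p> that is still open.
        if tag == "p" and "p" in self.open_tags:
            self._close_through("p")
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_ELEMENTS:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close everything up to this tag.
        if tag in self.open_tags:
            self._close_through(tag)

    def handle_data(self, data):
        self.builder.data(data)

    def _close_through(self, tag):
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def tree(self):
        self.close()  # flush any buffered input
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()

def to_well_formed(html):
    parser = WellFormer()
    parser.feed(html)
    return parser.tree()

root = to_well_formed('<div class="vcard"><p>one<br><p>two<img src="x.png"></div>')
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The serialized result has every element closed, so it can be handed to an XML or XSLT processor. Real-world pages need far more recovery rules than this, which is why a dedicated tool is the practical choice.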
Now that we know we have a well-formed HTML document, we can begin to extract the microformat data. The following is a very rough XSLT that is far from comprehensive, but it should get you started. For more information you can see the microformats.org wiki page about parsing or use the XSLT templates that do most of the heavy-lifting data extraction (available at hg.microformats.org).
All the data inside an hCard is contained within the element that has a class of "vcard". In our example this is a <div>, but it could be any element, so we'll start with:
//*[@class="vcard"]
This XPath expression looks for any element anywhere in the tree that has a class equal to "vcard". At first glance, this should find all the hCards, but the problem is that the class attribute can take a space-separated list of values. So, class="vcard myStyle" would not be picked up by that XPath expression. To fix this we can use the contains function.
//*[contains(@class,"vcard")]
This is better: now we find any element whose class attribute contains the term "vcard". This will successfully find the "vcard" in class="vcard myStyle", but there is still a problem. The contains function is not word-safe; it is a substring match. So class="my-vcard" would be matched by contains() just the same as class="vcard", even though "my-vcard" is not the property name that marks an hCard microformat. That is a false positive. To fix this we pad the class value with a space on each side, then search for the term with spaces around it. It sounds complicated, but it really isn't.
//*[contains(concat(" ",@class," "), " vcard ")]
With padding, class="my-vcard" becomes " my-vcard " and does not contain the substring " vcard ", which solves the substring problem. In the other case, class="vcard myStyle" becomes " vcard myStyle ", which does contain " vcard ", so the issue of space-separated values in a class is also solved by the padding technique.
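The padding trick is not specific to XPath. As a small sketch, here is the same test in Python, with a hypothetical helper named `has_class`, next to the plain substring check it replaces.

```python
def has_class(class_attr, name):
    """Mirror contains(concat(" ", @class, " "), " name ") from the XPath above."""
    return f" {name} " in f" {class_attr} "

print("vcard" in "my-vcard")                # True: plain substring match, a false positive
print(has_class("my-vcard", "vcard"))       # False
print(has_class("vcard myStyle", "vcard"))  # True
print(has_class("vcard", "vcard"))          # True
```

Like the XPath expression, this assumes the class values are separated by single spaces. If tabs or line breaks may appear, the value needs normalizing first, for example with normalize-space() in XPath or `class_attr.split()` in Python.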
Now that we know how to find the data, let's loop through each hCard using XSLT and begin to extract it into vCard output. At this point, it is pretty easy to see how using XSLT can let you easily convert this HTML data into just about any format you want. This includes other HTML, XML, RDF, flat vCard text, CSV, SPARQL results, JSON, or just about anything else your heart desires.
The for-each will find all instances of an hCard on the page and create a new vCard for each one. While creating each vCard it applies the templates to look for any properties inside an hCard, such as FN, N, and URL.
<xsl:for-each select="//*[contains(concat(' ',@class,' '), ' vcard ')]">
<xsl:text>BEGIN:VCARD</xsl:text>
<xsl:apply-templates />
<xsl:text>END:VCARD</xsl:text>
</xsl:for-each>
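For comparison, the same loop can be sketched in Python over an already well-formed document. The page content, the `find_hcards` and `skeleton` helpers, and the "Example Person" entry are made up for this illustration; only the BEGIN:VCARD and END:VCARD wrapper comes from the XSLT above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: two hCards and one element that must not match.
PAGE = """<html><body>
<div class="vcard"><span class="fn">Brian Suda</span></div>
<p class="my-vcard">not an hCard</p>
<div class="vcard myStyle"><span class="fn">Example Person</span></div>
</body></html>"""

def has_class(element, name):
    """Padded, word-safe class test, as in the XPath expression."""
    return f" {name} " in f" {element.get('class', '')} "

def find_hcards(root):
    """Every element, at any depth, whose class list includes vcard."""
    return [el for el in root.iter() if has_class(el, "vcard")]

def skeleton(root):
    """Emit one empty vCard wrapper per hCard found."""
    lines = []
    for _card in find_hcards(root):
        lines.append("BEGIN:VCARD")
        # FN, N, and URL extraction would go here
        lines.append("END:VCARD")
    return "\n".join(lines)

root = ET.fromstring(PAGE)
output = skeleton(root)
print(output)
```

The element with class="my-vcard" is skipped, so the output contains exactly two wrappers.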

The FN is a simple template that extracts the node-value of the element that contains FN as a class value.
<xsl:template match="//*[contains(concat(' ',@class,' '), ' fn ')]">
<xsl:text>FN:</xsl:text>
<xsl:value-of select="."/>
</xsl:template>
The N template is slightly more complex. It first has to look for an element with a class containing N. Then it looks for child elements that contain subproperties of N, such as family-name and given-name, and outputs those values.
<xsl:template match="//*[contains(concat(' ',@class,' '), ' n ')]">
<xsl:text>N:</xsl:text>
<xsl:value-of select=".//*[contains(concat(' ',@class,' '), ' family-name ')]"/>
<xsl:text>;</xsl:text>
<xsl:value-of select=".//*[contains(concat(' ',@class,' '), ' given-name ')]"/>
<xsl:text>;;;</xsl:text>
</xsl:template>
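The same two-step lookup, first the N element and then its subproperties, can be sketched in Python. The helper names `first_text` and `n_line` are hypothetical, and the search is deliberately limited to descendants of the N element so that several hCards on one page do not leak into each other.

```python
import xml.etree.ElementTree as ET

# The final hCard from the text above.
HCARD = """<div class="vcard">
<a class="n fn url" href="http://suda.co.uk">
<span class="given-name">Brian</span>
<span class="family-name">Suda</span>
</a>
</div>"""

def has_class(element, name):
    """Padded, word-safe class test."""
    return f" {name} " in f" {element.get('class', '')} "

def first_text(scope, name):
    """Collapsed text of the first descendant of scope with class `name`, or ""."""
    for el in scope.iter():
        if el is not scope and has_class(el, name):
            return " ".join("".join(el.itertext()).split())
    return ""

def n_line(n_element):
    """Build the vCard N line: family name, given name, three empty fields."""
    family = first_text(n_element, "family-name")
    given = first_text(n_element, "given-name")
    return f"N:{family};{given};;;"

card = ET.fromstring(HCARD)
n_element = next(el for el in card.iter() if has_class(el, "n"))
print(n_line(n_element))  # N:Suda;Brian;;;
```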
The template for URL uses the choose element to determine where the most semantic information for the URL value is encoded. It tests whether the element with class="url" is an <a> element. If it is, the value of URL is extracted from the @href; otherwise it uses the node-value.
<xsl:template match="//*[contains(concat(' ',@class,' '), ' url ')]">
<xsl:text>URL:</xsl:text>
<xsl:choose>
<xsl:when test="local-name() = 'a'">
<xsl:value-of select="@href"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
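The element-dependent rule is easy to see in a few lines of Python. This is only a sketch: `url_line` is a hypothetical helper, and it assumes the elements carry no XML namespace, so that the tag name is plain "a".

```python
import xml.etree.ElementTree as ET

def url_line(element):
    """Take the URL from href on an <a>, otherwise from the element's text."""
    if element.tag == "a":
        value = element.get("href", "")
    else:
        value = " ".join("".join(element.itertext()).split())
    return f"URL:{value}"

link = ET.fromstring('<a class="n fn url" href="http://suda.co.uk">Brian Suda</a>')
plain = ET.fromstring('<div class="url">http://suda.co.uk</div>')

print(url_line(link))   # URL:http://suda.co.uk
print(url_line(plain))  # URL:http://suda.co.uk
```

Both encodings produce the same URL line, even though the human-readable text of the <a> element is a name rather than an address.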
The <a> element and many others carry implied semantics. In our original HTML example the URL had been encoded on a <div>; in that case, the node-value would have been extracted and the value of URL would have been the same. This is just one of the many ways microformats differ from XML. The parsing of microformats data depends on the type of data and on the HTML element it was encoded on.

Source:
http://www.xml.com/pub/a/2007/09/04/parsing-microformats.html?page=1