Parsing tables with XPath without hardcoded indexes
24 Sep
Let’s say you have to get some data from a XML provided by a 3rd party. You don’t have access or rights to modify that XML and the semantic, well, is not as good as you, semantic lover, would like to. In fact, let’s face it, the XML sucks in semantic and it has tables inside that makes you remember the nineties and those HTMLs.
First thing you do is a XPath expression based on hardcoded indexes, of course. “Please sir, would you mind to give me the TR[4] from TD[1]? Yes, I already know where is this data from, thanks”. But let’s imagine for a moment that your friend, the 3rd party, decide one day to change the XML, oor simply to change the order of the fields in the table, oor to add more fields to the table, oor… You don’t want to be dependent on the 3rd party mood to change your code and your XPath expressions one day and another, right? So better live without hardcoded indexes, if you can.
But… how? I had to fight with two scenarios, and I managed to get it done without hardcoded indexes. Of course your mileage may vary, but here’s how I did it:
First scenario
In this first scenario we have a XML file that looks like this:
<RESOURCE name="books">
<TABLE name="books">
<FIELD name="title" datatype="char"/>
<FIELD name="author" datatype="char"/>
<FIELD name="pages" unit="--" datatype="int"/>
<DATA>
<TABLEDATA>
<TR>
<TD>Moby-Dick</TD>
<TD>Herman Melville</TD>
<TD>704</TD>
</TR>
<TR>
<TD>1984</TD>
<TD>George Orwell</TD>
<TD>176</TD>
</TR>
</TABLEDATA>
</DATA>
</TABLE>
</RESOURCE>
From a semantic point of view, kind of nasty, right? To avoid using hardcoded indexes, you should resolve what index position has the attribute you’re looking for when the XPath expression is parsed. For example, for the pages attribute:
/RESOURCE[@name='books']/TABLE[@name='books']//TR[1]/TD[
count( /RESOURCE[@name='books']/TABLE[@name='books']/FIELD[@name='pages']/preceding-sibling::*)+1
]
Don’t be scared, I’m going to explain a little further that hard one liner XPath expression. Let’s start for the inner expression, which is the more important:
count( /RESOURCE[@name='books']/TABLE[@name='books']/FIELD[@name='pages']/preceding-sibling::*)+1
That will return the position (index) for the attribute pages inside the table. For that, it will use the preceding-sibling::* expression that returns all the sibling nodes that has the attribute node before it. In the example of pages it will return the nodes title and author. So if we do a count() for that expression and add one, we should have the index for pages inside the table.
The rest of the expression is quite easy, and can be resumed as a simple table access with an index:
/RESOURCE[@name='books']/TABLE[@name='books']//TR[1]/TD[3]
but replacing the hardcoded index (TD[3]) for the count() expression.
Second scenario
This second scenario is when someone in the 3rd party was told to produce a XML file with the data, he took an already produced HTML file, change the file extension to XML and told to himself “Good work pal!”
<TABLE name="image">
<FIELD name="name" datatype="char"/>
<FIELD name="address" datatype="char"/>
<DATA>
<TABLEDATA>
<TR>
<TD>name</TD>
<TD>pic12316972.jpg</TD>
</TR>
<TR>
<TD>resolution</TD>
<TD>1024x768</TD>
</TR>
<TR>
<TD>timestamp</TD>
<TD>2008-04-03T11:57:40.015</TD>
</TR>
</TABLEDATA>
</DATA>
</TABLE>
Here what we would like, besides blame the one who created that and call it “XML”, is ask for the resolution value, but since resolution is part of the information and not the meta information, we have a situation here.
The approach in this scenario is different from the previous one, but is also simple when you have it figured out:
child::TR/child::TD[text()='resolution']/following::TD[1]
What this does, is from all the TR nodes (child::TR), get a child TD which text is the same as the attribute we’re looking for (child::TD[text()='resolution']). And once we have that node, we just have to get the following TD (following::TD[1]), which is the one with the attribute value.
Summary
- If you have the chance, avoid unsemantic XML files. They are ugly.
- If you have to produce an XML file, please! don’t forget the semantic
- If you have a 3rd party XML file without semantic… I’m sorry!