How to ignore namespaces when selecting XML nodes with XPath

April 9, 2016

I have several full-text articles in PMC XML format to parse, and I plan to use XPath. However, the namespace always breaks my paths. For example, in the following XML I cannot use *//article-id to retrieve all article-id tags. Instead, I have to use *//{}article-id, with the namespace spelled out in front of every tag name. As you can see, it is really painful.

<?xml version="1.0" encoding="UTF-8"?><OAI-PMH
    <request verb="GetRecord" identifier="" metadataPrefix="pmc"></request>
                <article xmlns:xlink="" xmlns:mml="" xmlns:xsi="" xmlns="" xsi:schemaLocation="" article-type="research-article">
                          <article-id pub-id-type="accession">PMC2702331</article-id>
                          <article-id pub-id-type="pmcid">PMC2702331</article-id>
                          <article-id pub-id-type="pmc-uid">2702331</article-id>
                          <article-id pub-id-type="publisher-id">1742-4690-6-47</article-id>
                          <article-id pub-id-type="pmid">19454010</article-id>

How can we ignore the namespace when selecting XML nodes with XPath? One way is to use the local-name() function in XPath, as in //*[local-name()='article-id']. But it is still inconvenient, because every step of the path needs its own predicate.
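To make the local-name() approach concrete, here is a minimal sketch using lxml. The XML snippet and the namespace URI http://example.org/ns are made up for illustration and are not the real PMC namespace; only the article-id values are taken from the sample above.

```python
from lxml import etree

# Hypothetical document: the namespace URI below is a placeholder,
# not the actual PMC namespace.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<record xmlns="http://example.org/ns">
  <article>
    <article-id pub-id-type="pmcid">PMC2702331</article-id>
    <article-id pub-id-type="pmid">19454010</article-id>
  </article>
</record>"""

root = etree.fromstring(XML)

# A plain name test finds nothing, because every element
# lives in the default namespace.
plain = root.xpath(".//article-id")

# local-name() compares only the part of the tag after the namespace.
by_local_name = root.xpath(".//*[local-name()='article-id']")
ids = [elem.text for elem in by_local_name]

print(plain)  # []
print(ids)    # ['PMC2702331', '19454010']
```

This works without touching the tree, but the predicate has to be repeated for every element named in a longer path.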

In Python, I use lxml with the following code.

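from lxml import etree, objectify

# metadata is assumed to be the path (or file object) of the XML record to parse
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()

for elem in root.iter():
    # Comments and processing instructions expose a function as .tag, not a string
    if not hasattr(elem.tag, 'find'): continue
    i = elem.tag.find('}')
    if i >= 0:
        # Drop the leading '{namespace}' part and keep only the local name
        elem.tag = elem.tag[i+1:]
# Remove type annotations and namespace declarations that are no longer used
objectify.deannotate(root, cleanup_namespaces=True)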

The above code does three things:

  1. skips nodes such as Comment, whose tag attribute is a function rather than a string
  2. removes the {namespace} prefix from the tag if it is present
  3. uses lxml.objectify.deannotate to recursively de-annotate the elements of the XML tree by removing py:pytype and xsi:type attributes (xsi:nil attributes are removed only when xsi_nil=True is passed); with cleanup_namespaces=True it also drops namespace declarations that are no longer used.

As a result, the plain XPath .//article-id will do the job.
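The whole procedure can be sketched end to end on a small in-memory document. This is a variation on the code above, not the same code: the helper name strip_namespaces and the namespace URI http://example.org/ns are hypothetical, and it uses etree.QName(...).localname and etree.cleanup_namespaces in place of the string slicing and objectify.deannotate.

```python
from lxml import etree

# Hypothetical document: the default namespace URI is a placeholder.
XML = b"""<record xmlns="http://example.org/ns">
  <!-- a comment node, whose tag is not a string -->
  <article article-type="research-article">
    <article-id pub-id-type="pmcid">PMC2702331</article-id>
    <article-id pub-id-type="pmid">19454010</article-id>
  </article>
</record>"""


def strip_namespaces(root):
    """Rename every element in place to its local name (hypothetical helper)."""
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(elem.tag, str):
            continue
        elem.tag = etree.QName(elem).localname
    # Drop namespace declarations that no element or attribute uses any more.
    etree.cleanup_namespaces(root)
    return root


root = strip_namespaces(etree.fromstring(XML))

# Plain, namespace-free paths now match.
all_ids = [elem.text for elem in root.xpath(".//article-id")]
pmids = root.xpath(".//article-id[@pub-id-type='pmid']/text()")

print(root.tag)  # record
print(all_ids)   # ['PMC2702331', '19454010']
print(pmids)     # ['19454010']
```

Note that this rewrites the tree in place, so it suits read-only extraction; if the document has to be written back out with its original namespaces, the local-name() approach is the safer choice.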
