I have several full articles in PMC XML format to parse. I plan to use XPath. However, the namespace always breaks my paths. For example, in the following XML I cannot use *//article-id to retrieve all article-id tags. Instead, I have to use *//{http://dtd.nlm.nih.gov/ns/archiving/2.3/}article-id. As you can see, it is really painful.
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2016-04-08T01:08:52Z</responseDate>
  <request verb="GetRecord" identifier="oai:pubmedcentral.nih.gov:2702331" metadataPrefix="pmc">http://www.ncbi.nlm.nih.gov/oai/oai.cgi</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:pubmedcentral.nih.gov:2702331</identifier>
        <datestamp>2009-06-27</datestamp>
        <setSpec>retrovir</setSpec>
        <setSpec>pmc-open</setSpec>
      </header>
      <metadata>
        <article xmlns:xlink="http://www.w3.org/1999/xlink"
                 xmlns:mml="http://www.w3.org/1998/Math/MathML"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xmlns="http://dtd.nlm.nih.gov/ns/archiving/2.3/"
                 xsi:schemaLocation="http://dtd.nlm.nih.gov/ns/archiving/2.3/ http://dtd.nlm.nih.gov/archiving/2.3/xsd/archivearticle.xsd"
                 article-type="research-article">
          <front>
            <article-meta>
              <article-id pub-id-type="accession">PMC2702331</article-id>
              <article-id pub-id-type="pmcid">PMC2702331</article-id>
              <article-id pub-id-type="pmc-uid">2702331</article-id>
              <article-id pub-id-type="publisher-id">1742-4690-6-47</article-id>
              <article-id pub-id-type="pmid">19454010</article-id>
              ...
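To make the problem concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree. SAMPLE is a trimmed stand-in for the record above, and the prefix "nlm" is an arbitrary name chosen for this example, not something defined by the document.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the PMC record (only two article-id tags kept).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <article xmlns="http://dtd.nlm.nih.gov/ns/archiving/2.3/">
      <front><article-meta>
        <article-id pub-id-type="pmcid">PMC2702331</article-id>
        <article-id pub-id-type="pmid">19454010</article-id>
      </article-meta></front>
    </article>
  </metadata></record></GetRecord>
</OAI-PMH>"""

NLM = "http://dtd.nlm.nih.gov/ns/archiving/2.3/"
root = ET.fromstring(SAMPLE)

# A bare tag name only matches elements that are in no namespace.
plain = root.findall(".//article-id")
# Spelling out the namespace in {uri}tag form matches.
clark = root.findall(".//{%s}article-id" % NLM)
# A prefix map is a shorter way to write the same thing.
mapped = root.findall(".//nlm:article-id", namespaces={"nlm": NLM})

print(len(plain), len(clark), len(mapped))  # 0 2 2
```

The prefix map avoids repeating the full URI, but every step of the path still needs the prefix.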
How can I ignore the namespace when selecting XML nodes with XPath? One way is to use the local-name() function in XPath, such as //*[local-name()='article-id'], but that is still inconvenient.
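For reference, a minimal sketch of the local-name() approach with lxml's xpath() method. SAMPLE is a trimmed stand-in for the record above, not the full document.

```python
from lxml import etree

# Trimmed stand-in for the PMC record (bytes, so lxml accepts it directly).
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <metadata>
    <article xmlns="http://dtd.nlm.nih.gov/ns/archiving/2.3/">
      <article-id pub-id-type="pmcid">PMC2702331</article-id>
      <article-id pub-id-type="pmid">19454010</article-id>
    </article>
  </metadata>
</OAI-PMH>"""

root = etree.fromstring(SAMPLE)

# Match on the local part of the name, whatever namespace the element is in.
ids = root.xpath("//*[local-name()='article-id']")

# Further predicates can be chained, but each path step needs its own
# local-name() test, which is what makes this style verbose.
pmid = root.xpath("//*[local-name()='article-id'][@pub-id-type='pmid']/text()")

print([e.text for e in ids], pmid)
```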
In Python, I use lxml with the following code.
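from lxml import etree, objectify

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)  # metadata: path or file object of the XML
root = tree.getroot()

####
for elem in root.iter():  # iter() replaces the deprecated getiterator()
    # Comments and processing instructions have a function as .tag; skip them
    if not hasattr(elem.tag, 'find'):
        continue
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]  # drop the leading "{namespace}"
objectify.deannotate(root, cleanup_namespaces=True)
####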
The above code does three things:
- removes the {namespace} prefix from the tag if it is present
- skips nodes such as Comment, whose tag attribute returns a function rather than a string
- uses lxml.objectify.deannotate to recursively de-annotate the elements of the XML tree by removing py:pytype and/or xsi:type attributes and/or xsi:nil attributes
As a result, the XPath .//article-id will do the job.
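The steps above can be wrapped in a small helper and checked against a sample. This is only a sketch: strip_namespaces is a hypothetical name introduced here, SAMPLE is a trimmed stand-in for the real record, and the isinstance check is a variation on the hasattr test used above.

```python
from lxml import etree, objectify


def strip_namespaces(root):
    """Rewrite every element tag in place without its '{namespace}' prefix."""
    for elem in root.iter():
        # Comments and processing instructions expose a function as .tag.
        if not isinstance(elem.tag, str):
            continue
        if elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    # Drop leftover annotations and now-unused namespace declarations.
    objectify.deannotate(root, cleanup_namespaces=True)
    return root


SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <!-- comment nodes are skipped -->
  <metadata>
    <article xmlns="http://dtd.nlm.nih.gov/ns/archiving/2.3/">
      <article-id pub-id-type="pmcid">PMC2702331</article-id>
      <article-id pub-id-type="pmid">19454010</article-id>
    </article>
  </metadata>
</OAI-PMH>"""

root = etree.fromstring(SAMPLE)
before = root.findall(".//article-id")  # namespaced tree: no match
strip_namespaces(root)
after = root.findall(".//article-id")   # stripped tree: plain names match

print(len(before), [e.text for e in after])
```

Note that this only rewrites element tags. Namespaced attribute names such as xsi:schemaLocation are left as they are, and the change is made in place, so the original namespaces are lost if the tree is serialized afterwards.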