Category Archives: xml

How to ignore namespace when selcting XML nodes with XPath

I have several full article in PMC XML format to parse. I plan to use XPath. However, the namespace always breaks my path. For example, in the following XML I cannot use *//article-id to retrieve all article-id tags. Instead, I have to use *//{http://dtd.nlm.nih.gov/ns/archiving/2.3/}article-id. As you can see, it is really painful.

<?xml version="1.0" encoding="UTF-8"?><OAI-PMH
    xmlns="http://www.openarchives.org/OAI/2.0/" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2016-04-08T01:08:52Z</responseDate>
    <request verb="GetRecord" identifier="oai:pubmedcentral.nih.gov:2702331" metadataPrefix="pmc">http://www.ncbi.nlm.nih.gov/oai/oai.cgi</request>
    <GetRecord>
        <record>
            <header>
                <identifier>oai:pubmedcentral.nih.gov:2702331</identifier>
                <datestamp>2009-06-27</datestamp>
                <setSpec>retrovir</setSpec>
                <setSpec>pmc-open</setSpec>
            </header>
            <metadata>
                <article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://dtd.nlm.nih.gov/ns/archiving/2.3/" xsi:schemaLocation="http://dtd.nlm.nih.gov/ns/archiving/2.3/ http://dtd.nlm.nih.gov/archiving/2.3/xsd/archivearticle.xsd" article-type="research-article">
                      <front>
                        <article-meta>
                          <article-id pub-id-type="accession">PMC2702331</article-id>
                          <article-id pub-id-type="pmcid">PMC2702331</article-id>
                          <article-id pub-id-type="pmc-uid">2702331</article-id>
                          <article-id pub-id-type="publisher-id">1742-4690-6-47</article-id>
                          <article-id pub-id-type="pmid">19454010</article-id>
...

read more

Tidy config for XML

The following configuration won’t wrap text, which is useful if users don’t want to insert spaces while reformatting XML files.

char-encoding: utf8
indent: auto
indent-spaces: 2
wrap: 0

Usage:

tidy -xml -i -config tidy.config -m XMLFILE