Understanding CDATA in XML
Posted on 19 September 2009
Update: this post is under review and is being rewritten; it may change without notice. This notice will disappear when the rewrite is over.
CDATA sections in XML are often misunderstood. Programmers get the wrong advice, such as "element data must be in a CDATA section to be valid!", or they hear that strings should go in CDATA sections because they contain Unicode or non-ASCII characters. This is all nonsense. In XML, a CDATA section is only a convenience that saves you from using entities for the < (less-than) and & (ampersand) signs. Anything written in CDATA can also be written without CDATA.
What is CDATA
The term CDATA comes from the SGML world. SGML is the complex predecessor of XML and was used to describe the original HTML 2, 3 and 4 specifications. The term is short for Character Data and means that the data consists of characters only and should not be parsed. Tags, entities, attributes and processing instructions inside CDATA are treated as text, not as XML markup.
PCDATA, on the other hand, means Parsed Character Data. It is the default for the content of any XML element, and attribute values are likewise scanned for entity references.
CDATA and PCDATA in HTML
This article is about XML, but let us allow a small step into HTML. In HTML, the default is determined by the HTML DTD, per element or attribute. Most HTML attributes are of type CDATA and most elements are of type PCDATA. The SCRIPT element, however, is of type CDATA, which means that literal < and & (less-than and ampersand) characters are allowed and will be interpreted literally, i.e., they will not be parsed. This is handy in scripts, which traditionally contain lots of these characters.
What is PCDATA
The following sections go into the details of PCDATA. I recommend reading these parts, as they are invaluable in understanding the intricacies of CDATA: why it exists, why you need it, or, more importantly, why you don’t need it.
Understanding Parsed Character Data (PCDATA)
To understand CDATA, you must first understand PCDATA.
In XML, all content defaults to PCDATA. That means that all content is parsed. Parsed data means that when the parser encounters a less-than symbol (<) it knows that an element starts, and it will look for the greater-than symbol (>), which ends the opening tag of that element. Between the less-than and the greater-than, only name+value pairs are allowed: the attributes. Each name must be preceded by whitespace (one or more spaces, tabs or line breaks) and is followed by optional whitespace, an equals sign, optional whitespace and then, between single or double quotes, the data of the attribute. This slightly simplified definition explains why it is so easy for an XML parser to see the following two examples both as opening tags with one attribute. To the parser, both examples are exactly equal and no tool should distinguish between the two:
<element-name attribute-name = "attribute data" >
<element-name
attribute-name
= "attribute data"
>
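To see that a parser really treats the two layouts as the same thing, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The self-closing form of the tags and the variable names are my own additions for illustration; the snippets above only show the opening tag.

```python
import xml.etree.ElementTree as ET

# The two opening tags from above, made self-closing so that each string
# is a complete document (an assumption made for this sketch).
compact = '<element-name attribute-name = "attribute data" />'
spread_out = '''<element-name
attribute-name
= "attribute data"
/>'''

first = ET.fromstring(compact)
second = ET.fromstring(spread_out)

# Both layouts yield the same element name and the same attribute dictionary.
same_tag = first.tag == second.tag
same_attributes = first.attrib == second.attrib
print(same_tag, same_attributes)
```

Whitespace inside a tag, including line breaks, is only a separator, so the parser has nothing to distinguish the two forms by.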
In XML, the less-than sign is holy: it means the start of the next element, and an element is the basic building block of XML. Even more: an XML document without any element is not well-formed and cannot be parsed. Nor can an XML document with multiple root elements be parsed. In other words, each XML document contains one and only one root element.
XML is all about very strict rules. When an element starts, the parser can rely on that element also being closed. Because these rules are so strict, it is relatively easy to build an XML parser, which is in part responsible for the success of XML. I won’t go too deeply into the rules of XML; what’s important for you to understand is that XML is a textual document where each part has a special meaning and certain characters trigger special treatment. The parser knows this.
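The one-root-element rule is easy to check for yourself. The following is a small sketch, again in Python with xml.etree.ElementTree; the helper name is_well_formed and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

one_root = is_well_formed("<root>one root element</root>")   # accepted
no_element = is_well_formed("just text, no element at all")  # rejected
two_roots = is_well_formed("<first/><second/>")              # rejected
print(one_root, no_element, two_roots)
```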
Understanding entities
There are many types of entities in XML: general entities, external and internal entities, and parameter entities. Some types can be either parsed or unparsed. The plethora of possibilities these smallest of XML particles have can give even the most dyed-in-the-wool programmer headaches. For 99% of the cases in which you work with XML, however, you only have to deal with two types of entities: internal parsed general entities and predefined internal general entities. Quite a mouthful, but they are the official names behind the &mdash; and &apos; that you’ve become so acquainted with through the years.
What’s an entity really
Many people have trouble with the word “entity”. In XML parlance, it is nothing more than a replacement for something that the parser could otherwise not deal with directly, or a shortcut or abbreviation for something that would otherwise require too much typing. Entities can be much more, but with this definition you can get quite far. Examples of entities are:
<entities-example>
&excl; <!-- exclamation mark -->
&john; <!-- short for John "The Baptist" Church -->
</entities-example>
In XML, an entity only means something when it is actually defined, unless it is one of the five predefined entities (see table below). All other entities must be defined in the DTD, otherwise the parser cannot know what to do with them. If you ever get an “unresolved entity” error, you now know what’s missing: the definition of the entity. An example of how to define the entities used in the code snippet above is this:
<!DOCTYPE entities-example[
<!ELEMENT entities-example (#PCDATA)>
<!ENTITY excl "!" >
<!ENTITY john "John &quot;The Baptist&quot; Church" >
]>
As you can see from the examples, entities can reference each other. The reason? They are parsed entities, remember? That means that, after the parser replaces the entity reference (&john;) with the entity content (John &quot;The Baptist&quot; Church), it will parse that content as well, resulting in the actual string John "The Baptist" Church. In another article, I will go deeper into using entities, explaining why referencing other entities from entity definitions can be a dangerous business.
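A quick way to watch this happen is to feed such a document to a parser. This is a minimal sketch with Python's xml.etree.ElementTree, which expands entities declared in the internal DTD subset; the nested reference is written as &quot; here, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Document with its own entity definitions in the internal DTD subset.
with_dtd = '''<!DOCTYPE entities-example [
<!ELEMENT entities-example (#PCDATA)>
<!ENTITY excl "!" >
<!ENTITY john "John &quot;The Baptist&quot; Church" >
]>
<entities-example>&john;&excl;</entities-example>'''

# &john; is replaced first, then the &quot; inside it is parsed as well.
expanded = ET.fromstring(with_dtd).text
print(expanded)

# The same reference without a definition: the parser refuses the document.
without_dtd = "<entities-example>&john;</entities-example>"
try:
    ET.fromstring(without_dtd)
    undefined_entity_rejected = False
except ET.ParseError:
    undefined_entity_rejected = True
print(undefined_entity_rejected)
```

The second half is the “unresolved entity” situation: the reference is syntactically fine, but the parser has no definition to replace it with.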
Internal parsed general entities
The internal parsed entities are the ones you saw in the previous examples on entities. I like to think of them in a different way: placeholders. They can contain anything, including XML markup. Why? Because they are parsed. Yes, I know, I keep repeating that…
Long story short: an entity is a placeholder for a little piece of PCDATA.
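Because the replacement is parsed, markup inside an entity becomes real markup. The sketch below assumes a made-up document (the names note, strong and warn are hypothetical) and uses Python's xml.etree.ElementTree to show that the entity content turns into a child element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "warn" contains markup, not just text.
doc = '''<!DOCTYPE note [
<!ENTITY warn "<strong>careful</strong>">
]>
<note>Be &warn; with entities.</note>'''

note = ET.fromstring(doc)
child = note[0]  # the <strong> element that came out of the entity

print(repr(note.text), child.tag, child.text, repr(child.tail))
```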
Internal predefined general entities
I generally like to think about those as internal predefined general unparsed entities, but that is not true: technically these are parsed entities. That is why those two magic characters, the less-than and the ampersand, are doubly escaped in the official entity declarations that each XML parser loads by default (even when the declarations are absent from the document they are available, which is why we call them predefined: they are always available):
<!ENTITY lt "&#38;#60;">
<!ENTITY amp "&#38;#38;">
<!ENTITY gt "&#62;">
<!ENTITY quot "&#34;">
<!ENTITY apos "&#39;">
You may have seen this before, and now you know why: when the parsed entity “lt” is replaced with its entity content, that content is parsed as well. In the case of “lt” the result is a literal < character, which will not be parsed again (parsing means: parsing once!). Here’s a more readable list of the famous five predefined entities:
| character | must escape | numerical entity | character entity | description |
|---|---|---|---|---|
| < | YES | &#60; | &lt; | must always be escaped when used literally |
| & | YES | &#38; | &amp; | must always be escaped when used literally |
| > | NO | &#62; | &gt; | often escaped, but you can really save yourself the typing (except in the sequence ]]>) |
| " | NO | &#34; | &quot; | sometimes necessary to use &quot; inside attributes |
| ' | NO | &#39; | &apos; | some browsers do not understand &apos; in HTML |
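If you want to convince yourself that the numeric and the named form of each predefined entity end up as the same character, a short check like the following will do. It is a sketch using Python's xml.etree.ElementTree; the wrapper element c and the helper decode are made up for the occasion.

```python
import xml.etree.ElementTree as ET

# (character, numeric reference, named reference)
PREDEFINED = [
    ("<", "&#60;", "&lt;"),
    ("&", "&#38;", "&amp;"),
    (">", "&#62;", "&gt;"),
    ('"', "&#34;", "&quot;"),
    ("'", "&#39;", "&apos;"),
]

def decode(reference):
    """Parse a one-element document and return its text content."""
    return ET.fromstring("<c>" + reference + "</c>").text

results = [(char, decode(numeric), decode(named))
           for char, numeric, named in PREDEFINED]

# Every row should hold the same character three times.
all_equal = all(char == from_numeric == from_named
                for char, from_numeric, from_named in results)
print(all_equal)
```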
CDATA sections in XML
A CDATA section looks like the following:
<content><![CDATA[Romeo & Juliet]]></content>
which means exactly the same, or better yet, which is exactly the same (from an XML parser’s point of view) as the following:
<content>Romeo &amp; Juliet</content>
That’s really all there is to it. To make the example complete, consider this piece of code:
<entity-table>
<cell><![CDATA[the ampersand: &]]></cell>
<cell><![CDATA[the less-than sign: <]]></cell>
<cell><![CDATA[the greater-than sign: >]]></cell>
<cell><![CDATA[the double quote: "]]></cell>
<cell><![CDATA[the single quote or apostrophe: ']]></cell>
</entity-table>
and compare that with the following, which is again exactly the same (!) and quite likely easier to read:
<entity-table>
<cell>the ampersand: &amp;</cell>
<cell>the less-than sign: &lt;</cell>
<cell>the greater-than sign: &gt;</cell>
<cell>the double quote: &quot;</cell>
<cell>the single quote or apostrophe: &apos;</cell>
</entity-table>
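You don't have to take my word for it. Here is a minimal sketch in Python (xml.etree.ElementTree, with a helper cell_texts that I made up) which parses a CDATA version and an entity version of such a table and compares what the application actually gets to see.

```python
import xml.etree.ElementTree as ET

with_cdata = """<entity-table>
<cell><![CDATA[the ampersand: &]]></cell>
<cell><![CDATA[the less-than sign: <]]></cell>
<cell><![CDATA[the greater-than sign: >]]></cell>
<cell><![CDATA[the double quote: "]]></cell>
<cell><![CDATA[the single quote or apostrophe: ']]></cell>
</entity-table>"""

without_cdata = """<entity-table>
<cell>the ampersand: &amp;</cell>
<cell>the less-than sign: &lt;</cell>
<cell>the greater-than sign: &gt;</cell>
<cell>the double quote: &quot;</cell>
<cell>the single quote or apostrophe: &apos;</cell>
</entity-table>"""

def cell_texts(document):
    """Return the text of every <cell> as the application sees it."""
    return [cell.text for cell in ET.fromstring(document).findall("cell")]

identical = cell_texts(with_cdata) == cell_texts(without_cdata)
print(identical)

# Writing the parsed data out again: this serializer uses entity
# references, because the CDATA section was never part of the data.
romeo = ET.fromstring("<content><![CDATA[Romeo & Juliet]]></content>")
round_trip = ET.tostring(romeo, encoding="unicode")
print(round_trip)
```

After parsing, nothing is left that tells the two notations apart, which is exactly why software should not depend on one of them.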
Confusing issues
TBD
Conclusion
Use CDATA sections sparingly and know when you really need them: never. If your superiors tell you differently, refer them to me and I’ll help you in the discussion. Use CDATA as a convenience and always be aware that it is not necessary to use it and no software should ever be written that demands actual CDATA sections in a certain element.