Tip: Force UTF8 or other encoding for XmlWriter with StringBuilder

When you use an XmlWriter or XmlTextWriter and the output target is a StringBuilder or StringWriter, it is not possible to set the encoding to anything other than UTF-16. This tip explains why Microsoft designed it this way and offers a way around this all-too-restrictive behavior of the XmlWriter, allowing any encoding, not only UTF-8 or UTF-16.

Strings and encodings

The default encoding of any string in .NET is UTF-16. There is quite some debate on this subject, in particular whether it is the older UCS-2 encoding supported by Win32, which never became a standard, or true UTF-16. I won’t go into those details now: apart from some rare occasions, bugs and oddities, strings in .NET are truly UTF-16 encoded internally.

Whenever you are coding with strings and you need to output a certain string to a stream of some kind, or when you need to convert from bytes to strings and back, you need to specify the encoding. But for this discussion it is sufficient to know that the encoding of strings internally is always UTF-16. Period.
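To make the point about specifying an encoding concrete, here is a minimal sketch of my own (the class and method names are hypothetical, not part of any library). It only uses Encoding.GetBytes to show that the same in-memory string becomes a different number of bytes depending on the encoding you pick when it leaves the string world:

```csharp
using System.Text;

public static class EncodingSizes
{
    // Returns how many bytes the given text occupies once it is
    // converted with the given encoding (no byte order mark included).
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetBytes(text).Length;
    }
}
```

For the four-character string "<a/>", EncodingSizes.ByteCount gives 4 with Encoding.UTF8 and 8 with Encoding.Unicode (which is .NET's name for UTF-16 little endian).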

XML and encodings

The default encoding for XML is UTF-8 or UTF-16. Let no-one tell you that US-ASCII or even CP1252 or ISO-8859-1 is the default, it is not. Any XML parser must inspect the first bytes for the presence of a byte order mark and must use that information to get the correct encoding of the document. Unless there’s an <?xml version="1.0" encoding="xyz-enc" ?> at the top which specifically gives another encoding, the document must be in UTF-8, UTF-16 or UTF-32 (but the latter is not mandatory to support).

This inspection is rather simple. The short version is: if the document starts with 0xFF 0xFE or 0xFE 0xFF then the document is UTF-16, if it starts with 0xEF 0xBB 0xBF then it is UTF-8. If neither is found, there’s no BOM, but the detection is about the same: the first character must be a less-than sign (now you know why no content or whitespace is allowed before it), and that sign is encoded differently in each of the major Unicode encodings (the specification also covers EBCDIC).
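The BOM part of that inspection can be sketched in a few lines. This is my own simplified illustration with hypothetical names, not what any XML parser literally does: it deliberately ignores UTF-32 (whose little-endian BOM also starts with 0xFF 0xFE) and the fallback check on the first less-than sign.

```csharp
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading byte order mark,
    // or null when the data starts with none of the BOMs listed above.
    public static Encoding DetectFromBom(byte[] data)
    {
        if (data == null) return null;

        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;

        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;           // UTF-16, little endian

        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;  // UTF-16, big endian

        return null;                           // no BOM: further sniffing needed
    }
}
```

A caller that gets null back would continue with the declaration-based detection described above, or fall back to UTF-8.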

.NET Strings and XML and encodings

Following the logic of the XML standard, the encoding in the XML declaration must match the encoding of the document. The strings are stored as UTF-16, so the logical and only choice one has is the following statement at the beginning of the XML document:

<?xml version="1.0" encoding="UTF-16" ?>
<encoding should-be="utf-8" />

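This is all very nice and dandy of course, but what if you just wanted to display a piece of XML in a textbox or something and you wanted to write the XML using some form of the XmlWriter? If you read the Microsoft documentation, they are pretty clear on this subject (not!) and tell you the following when you check the documentation on XmlWriter.Create:

“If you wish to specify the features to support on the created writer, use the overload that takes an XmlWriterSettings object as one of its arguments, and pass in a XmlWriterSettings object with the correct settings.”

which led me to writing the following snippet: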

// initialize the settings for UTF8
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.UTF8;

// create XmlWriter for output to string
StringBuilder xmlAsUtf8 = new StringBuilder();
using (XmlWriter xmlWriter = XmlWriter.Create(xmlAsUtf8, settings))
{
    xmlWriter.WriteStartElement("encoding");
    xmlWriter.WriteAttributeString("should-be", "utf-8");
    xmlWriter.WriteEndElement();
}

// output content to debug output window
Debug.Write(xmlAsUtf8.ToString());

Can you guess what the output is of this little snippet? I would expect something like the following. Do you concur? Or do you think the output is something different?

<?xml version="1.0" encoding="UTF-8" ?>
<encoding should-be="utf-8" />

Of course it is different! Because of the rules laid out above about encodings, the XmlWriter tries to behave correctly and standards compliant and instead of using the provided encoding, it takes the encoding from the underlying StringWriter. Abbreviated, the following happens when you take a look under the hood using Reflector:

/*
    The static XmlWriter.Create used in the snippet above
    which creates the StringWriter based on the StringBuilder and
    passes it on to the private method CreateWriterImpl(..)
*/
public static XmlWriter Create(StringBuilder output, XmlWriterSettings settings)
{
    return CreateWriterImpl(new StringWriter(output, CultureInfo.InvariantCulture), settings);
}

/*
    The actual writer implementation code, which will
    create a writer depending on output type (XML, HTML, Text) and
    indentation style (indent or not). Greatly simplified for clarity

    NOTE: code greatly simplified, non relevant lines removed!
*/
private static XmlWriter CreateWriterImpl(TextWriter output, XmlWriterSettings settings)
{
    XmlWriter writer;
    switch (settings.OutputMethod)
    {
        case XmlOutputMethod.Xml:
            writer = new XmlEncodedRawTextWriter(output, settings);
            break;

        default:
            return null;
    }

    return new XmlWellFormedWriter(writer, settings);
}

/*
    For XML output (but also for the other types), when using
    a StringBuilder (internally used in a TextWriter), we end up here
    where the actual internal writer is created. Here, the
    encoding settings of the writer is used and NOT of the settings!
*/
public XmlEncodedRawTextWriter(TextWriter writer, XmlWriterSettings settings)
      : this(settings, settings.CloseOutput)
{
    this.writer = writer;

    /*
        This is THE SPOT! Here's the offending line which
        determines the encoding, which is always UTF-16 (Encoding.Unicode)
        when writer is a StringWriter (created in XmlWriter.Create above)
    */
    this.encoding = writer.Encoding;

    this.bufChars = new char[0x1820];
    if (settings.AutoXmlDeclaration)
    {
        this.WriteXmlDeclaration(this.standalone);
        this.autoXmlDeclaration = true;
    }
}

Following the logic above, where the line this.encoding = writer.Encoding; clearly shows that the encoding is defined by the underlying TextWriter object, we now know that we’re quite out of luck when it comes to our chances of changing this behavior:

  • The TextWriter encoding is read-only;
  • After initialization, the encoding of the XmlWriter is read-only;
  • If you could change the encoding of the TextWriter, it is dangerous and causes internal strings to misbehave;
  • The place where the encoding is used to write the XML declaration is hard to override through inheritance.

I’ll spare you my travels through the rest of the implementation in Reflector. The conclusion was and is that there are no real or straightforward ways to change this encoding reliably through normal coding practices like aggregation or overriding through inheritance.

One other conclusion I cannot withhold from you, though: the this.encoding field is only used for writing the correct encoding name in the XML header; it is not used anywhere else in the implementations of XmlWriters based on StringBuilder or TextWriter.

Three ways of changing the encoding regardless

Being out of “normal” ideas and possibilities, we’ll go the unpaved and dangerous road of using reflection to force the encoding. But not before I first present you with two simple, not-so-elegant ways that partially solve your problem.

Do not use an XML declaration at all

This may not really sound like a true solution, but per the XML specification, it is allowed to have no XML declaration. If one is absent, the encoding is determined by auto-detection on the first four bytes. If no UTF-16, UTF-32 or UTF-8 encoding is detected, UTF-8 is assumed and mandatory. For display purposes, the XML declaration is of limited use, because it only says something about the version (usually 1.0 anyway) and about the encoding (which we don’t need to display the data in a text field). Furthermore, if we pass it to a database, the text is stored in whatever encoding the column uses internally (an NVARCHAR column in SQL Server, for instance, holds UCS-2/UTF-16 data), so a declared encoding can easily end up contradicting the stored one. When it comes to MS SQL Server 2000, there’s a bug in storing UTF-16 XML, which is yet another reason why removing the XML declaration is a good idea.
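Leaving out the declaration needs no tricks at all, because XmlWriterSettings has an OmitXmlDeclaration property for it. The following is a minimal sketch reusing the element from the running example; the wrapper class and method names are hypothetical:

```csharp
using System.Text;
using System.Xml;

public static class NoDeclarationExample
{
    // Writes the sample element to a StringBuilder without any
    // <?xml ... ?> declaration, so no encoding name appears at all.
    public static string BuildWithoutDeclaration()
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;

        StringBuilder xml = new StringBuilder();
        using (XmlWriter xmlWriter = XmlWriter.Create(xml, settings))
        {
            xmlWriter.WriteStartElement("encoding");
            xmlWriter.WriteAttributeString("should-be", "utf-8");
            xmlWriter.WriteEndElement();
        }

        return xml.ToString();
    }
}
```

The returned string starts directly with the encoding element, which is usually all you want for a textbox.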

Disadvantages? Personally, I always insist that an XML declaration is mandatory, if only to make it absolutely clear what the document is about. Also, if you want to show someone how you write encoding="ISO-8859-1", how would you go about doing that if your XmlWriter cannot be used to illustrate that behavior?

That’s why we have option two:

Use simple string replacement

That’s trivial to do, of course. After all, we were writing to a StringBuilder all the time. Why not simply use StringBuilder’s methods to replace the encoding part and set it to ours? Let’s do that neatly:

  • Search only where we need to search;
    (the longest XML declaration with standalone="yes" is 56 characters when utf-16 is the encoding)
  • Search based on the official name of the encoding, which is what is used internally by the XmlWriter
    (the EncodingX.WebName gives this official name)
  • Do it after the string is flushed;
    (doing it earlier can interfere with the XmlWriter)

I can’t make it much neater than that. If we put that together with the code above, the snippet now looks as follows:

StringBuilder xmlAsUtf8 = new StringBuilder();
using (XmlWriter xmlWriter = XmlWriter.Create(xmlAsUtf8, settings))
{
    xmlWriter.WriteStartElement("encoding");
    xmlWriter.WriteAttributeString("should-be", "utf-8");
    xmlWriter.WriteEndElement();
}

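/*
    replace the encoding neatly, correctly and quickly;
    Math.Min (System namespace) keeps the search range valid for
    documents shorter than 56 characters, where Replace would otherwise throw
*/
xmlAsUtf8.Replace(Encoding.Unicode.WebName, Encoding.UTF8.WebName, 0, Math.Min(56, xmlAsUtf8.Length));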

Debug.Write(xmlAsUtf8.ToString());

Now the output is what we expected it to be:

<?xml version="1.0" encoding="utf-8"?>
<encoding should-be="utf-8" />

But it just doesn’t feel right to have to do that after the fact. Unfortunately, I can’t change that. This is by far the easiest solution to force another encoding name in the declaration when you write XmlWriter output to a StringBuilder. But there’s one other possibility: brute force, which is explained in the next section.

Use reflection to set the encoding field

The following section assumes a moderate understanding of .NET Reflection techniques. If you only want to use it, rather than also understand it, consider just copying the code or downloading the library.

Perhaps this is more an exercise in Reflection than something I should endorse as a professional and solid workaround. Reflection is handy and useful in many situations, but when one needs reflection to set inner fields that are declared private or even read-only, the water is deep and covered with very thin ice. If you need to override this behavior, I can fully understand it, but hopefully, in a future version of XmlWriter, Microsoft will allow us to override the encoding setting for string-based output.

Rule: Most Public Properties/Methods First

When working with reflection, in any language, it is imperative to stick to this rule. The reason is simple: a private property, method or field is private for a reason: the author decided to make it part of his implementation and by declaring it private he says “keep your hands off of this, I can change it any time I wish”. Internal methods and properties are slightly better, but can change for an assembly just like private ones. Best is to stick to protected and public properties, methods and classes. These are least likely to change.

The best I could do for this solution was a protected field and an internal property. They are well-named and used throughout the assembly, so the chances that they change are not large. But the classes I use are private and therefore everything they expose can basically change from one version to another. If you find that the code below does not work flawlessly with your version of .NET, please drop a note here and I’ll adjust the code to fit other versions of .NET just the same.

The solution through reflection

The solution, which we’ll briefly discuss in the next two sections, looks as follows, where I included the code from above to make it easier to understand where the reflection should be placed. For a more solid implementation, see further on (TBD).

For clarity, exception handling is removed in the code below.

XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.UTF8;

StringBuilder xmlAsUtf8 = new StringBuilder();
using (XmlWriter xmlWriter = XmlWriter.Create(xmlAsUtf8, settings))
{
    /*
        get the writer implementation property InnerWriter, which is
        of internal scope
    */
    PropertyInfo innerWriterPropInfo = xmlWriter.GetType().GetProperty(
        "InnerWriter",
        BindingFlags.NonPublic | BindingFlags.Instance);

    /*
        get the XmlEncodedRawTextWriter or the XmlEncodedRawTextWriterIndent
        which are the actual underlying XmlWriter objects that do the meat of the work
        when writing to a StringBuilder or StringWriter
    */
    XmlWriter innerWriter = (XmlWriter) innerWriterPropInfo.GetValue(xmlWriter, null);

    /*
        get the FieldInfo for the "encoding" field, which is a Protected field
        this is the actual field we need for our "overriding" technique
    */
    FieldInfo encodingFieldInfo = innerWriter.GetType().GetField(
        "encoding",
        BindingFlags.NonPublic | BindingFlags.Instance);

    /*
        assign the encoding we want, any encoding can be used here
    */
    encodingFieldInfo.SetValue(innerWriter, Encoding.UTF8);

    xmlWriter.WriteStartElement("encoding");
    xmlWriter.WriteAttributeString("should-be", "utf-8");
    xmlWriter.WriteEndElement();
}

// this will now output with utf-8 as encoding
Debug.Write(xmlAsUtf8.ToString());

Step 1: getting the underlying writer

The implementation of XmlWriter and helper objects follows the Builder Pattern of the GoF (page 97), with one exception: it uses the non-inheritability of static methods to use one class, XmlWriter, for both the Director (which decides what implementation to use) and the Builder (which is the abstract base class for all xml writers, the ConcreteBuilders). Once you know what pattern is used, it is not so hard anymore to find what you need.

The static XmlWriter.Create overloaded members serve as an excellent Director->Construct() of the GoF pattern. It calls two versions of the private CreateWriterImpl (also static) which, based on some boolean logic sets the local writer variable to either of the following implementations (all private classes):

  • an XmlEncodedRawTextWriter when settings.Indent is false and the output method is set to XML;
  • an XmlEncodedRawTextWriterIndent when settings.Indent is true and the output method is set to XML;
  • an HtmlEncodedRawTextWriter or HtmlEncodedRawTextWriterIndent (same logic, now for output is HTML);
  • a TextEncodedRawTextWriter if output method is Text (indent is ignored);
  • an XmlAutoDetectWriter if output is AutoDetect;
  • null if none of the above (cannot happen: the enumeration prohibits this);
  • a QueryOutputWriter when IsQuerySpecific is set (this overrides any of the previous, except AutoDetect, which is higher in precedence);

and passes it on to the final construction of the return value: an XmlWellFormedWriter, which sets its property InnerWriter to one of the above. For our exercise, only the first two are relevant, the others can be ignored here. Now that we know where to look and what to look for, it becomes trivial to get the underlying writer. Let’s have a look at the relevant lines of code:

PropertyInfo innerWriterPropInfo = xmlWriter.GetType().GetProperty(
        "InnerWriter",
        BindingFlags.NonPublic | BindingFlags.Instance);

XmlWriter innerWriter = (XmlWriter) innerWriterPropInfo.GetValue(xmlWriter, null);

The first line above gets the type of our writer object, searches for a property by the name “InnerWriter” and tells the default binder to look for non public (that is: private, internal, protected and protected internal) properties that are part of the instance (that is: search the instance, not the static properties).

The last line above actually retrieves this property. It is not unlikely that the property cannot be found. That happens when this code runs under a lower trust level (e.g., in Silverlight) or when the property simply isn’t there anymore (e.g., some new version of the .NET Framework may use another name; after all, this property is internal and can be changed without notifying us). The code in the download section will raise a proper exception when the property cannot be found.

The final step is then to cast this property to an XmlWriter object. This is technically not necessary, but signifies that we are dealing with an object that is castable to an XmlWriter. We cannot use any of the real classes from our list above, because all of these are private (actually: we can by using other types of reflection, but there’s really no need for it: we know the type, we won’t need the type, because we use reflection which works on objects, not on type-safe classes).

Step 2: getting and setting the encoding field

The hard part is over. We’ve got the internal implementation of the XmlWriter, the ConcreteBuilder from the GoF pattern. The whole idea of that pattern was to hide the concrete implementations. Microsoft did a good job there, because when you use the XmlWriter, you have no idea that you are actually using one of a bunch of writer implementations (and there are many more, actually, for streams and other uses).

The class XmlEncodedRawTextWriter has a protected field called encoding. The class XmlEncodedRawTextWriterIndent does not, but inherits it from XmlEncodedRawTextWriter, which is its base class. Let’s have a look at the code:

FieldInfo encodingFieldInfo = innerWriter.GetType().GetField(
    "encoding",
    BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.FlattenHierarchy);

encodingFieldInfo.SetValue(innerWriter, Encoding.UTF8);

The first line above retrieves the field information for the encoding field; again we are looking for something non-public (the field is protected) and on the instance (not a static field). That is already enough to find the field on the indented writer as well: a protected instance field declared in a base class is returned for the derived type too (only private base-class fields are not). The third binding flag, FlattenHierarchy, affects only public and protected static members of base classes, so it is harmless here but it is not what makes the lookup succeed.
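To see how GetField treats a field that a class merely inherits, here is a small stand-alone check. The two demo classes are hypothetical stand-ins of my own for the private writer classes, not the real ones. As far as I know, NonPublic | Instance alone is enough to find an inherited protected instance field, and FlattenHierarchy only concerns static members:

```csharp
using System.Reflection;

// Hypothetical stand-in for a base writer with a protected field.
public class DemoRawWriter
{
    protected string encoding = "utf-16";
    public string EncodingName { get { return encoding; } }
}

// Hypothetical stand-in for the derived "indent" writer, which
// declares no field of its own and only inherits the base one.
public class DemoRawWriterIndent : DemoRawWriter { }

public static class InheritedFieldDemo
{
    // Looks up the inherited protected field through the derived type,
    // overwrites it, and returns the value the object now reports
    // (or null if the field could not be found).
    public static string OverwriteInheritedField(string newValue)
    {
        DemoRawWriterIndent writer = new DemoRawWriterIndent();

        FieldInfo field = writer.GetType().GetField(
            "encoding",
            BindingFlags.NonPublic | BindingFlags.Instance);
        if (field == null) return null;

        field.SetValue(writer, newValue);
        return writer.EncodingName;
    }
}
```

Calling InheritedFieldDemo.OverwriteInheritedField("utf-8") returns "utf-8", which shows that the field was found and changed through the derived type.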

The last line is self explanatory: it sets the UTF8 encoding for the encoding field on the object of the underlying writer.

The same precautions as in the previous section should be taken into account: though the field is protected and is used by more than one class, it is inside a private class which can change its complete implementation; it can even disappear completely without ever giving a signal. The download contains the full code with the necessary exceptions.
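Those precautions can be folded into a small defensive helper. The sketch below is not the code from the download; the class and method names are hypothetical. It returns false instead of failing when the internal members are missing. The second field name it tries, "_encoding", is purely my guess at how a newer runtime might have renamed the field, so treat it as an unverified assumption.

```csharp
using System.Reflection;
using System.Text;
using System.Xml;

public static class XmlWriterEncodingHack
{
    // Tries to overwrite the encoding reported by the inner raw writer.
    // Returns true when the field was found and set, false otherwise.
    public static bool TryForceEncoding(XmlWriter writer, Encoding encoding)
    {
        if (writer == null || encoding == null) return false;

        const BindingFlags flags = BindingFlags.NonPublic | BindingFlags.Instance;

        // internal property on the well-formed writer; may not exist
        PropertyInfo innerProperty = writer.GetType().GetProperty("InnerWriter", flags);
        if (innerProperty == null) return false;

        object innerWriter = innerProperty.GetValue(writer, null);
        if (innerWriter == null) return false;

        // "encoding" is the name used in this article; "_encoding" is an
        // assumed alternative name and may simply never match
        foreach (string name in new[] { "encoding", "_encoding" })
        {
            FieldInfo field = innerWriter.GetType().GetField(name, flags);
            if (field != null && field.FieldType.IsInstanceOfType(encoding))
            {
                field.SetValue(innerWriter, encoding);
                return true;
            }
        }

        return false;
    }
}
```

As with the snippet above, call it directly after XmlWriter.Create and before the first write, and check the return value: when it is false, the string replacement from the previous section is still available as a fallback.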

Conclusion

Using reflection was by far the hardest and most complex approach. I consider it a way to “correct” the implementation where Microsoft treated the XML specification a tad too strictly. I hope that one day this behavior is replaced and it becomes possible to simply override the encoding, not for streams (in which case it already correctly uses the setting in the XmlWriterSettings object) but for strings (StringBuilder and TextWriter).

I only rarely find a real reason to use reflection on internal or private methods and classes, however this is an example where I feel it is allowed, provided that you take the necessary precautions and remove it when a better method becomes available.

Download

You can download my implementation of the XmlWriter extension method, there’s a full source and a binary for debug and release builds available. The library contains other methods as well, which are free to use. The library comes with no warranty whatsoever. I appreciate any feedback on the library or any problems you encounter.

How to use it

The library in the namespace UnderMyHat.Toolchest contains several helper methods. The ones you can use for overriding the encoding of the XmlWriter through reflection are the following methods:

  • XmlExtender.XmlWriterCreate static method creates an XmlWriter object with a defaulting encoding;
  • XmlExtender.ForceEncoding extension method forces the encoding of an existing XmlWriter object; if the optional third parameter is false, the method will not throw exceptions related to reflection, the default is false. Returns true when successful, false otherwise;
  • Util.SuppressedException shows the latest suppressed exception of the Toolchest library, use it to get the exception if you initially suppressed it. The library uses this for all reflection methods, to make it easier to implement reflected extension methods into your own code.

I only implemented the reflection method. It is implemented as an extension method on XmlWriter, which makes it very easy to use. Simply add a reference to my library (which contains other simple utilities as well that may come in handy) and add this line:

using UnderMyHat.Toolchest;

on top of your class. The example code we used throughout then becomes as follows; note the ForceEncoding call:

XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.UTF8;

StringBuilder xmlAsUtf8 = new StringBuilder();
using (XmlWriter xmlWriter = XmlWriter.Create(xmlAsUtf8, settings))
{
    /*
        Set the correct encoding using the extension
        method of the UnderMyHat library.
        MUST BE THE FIRST LINE AFTER INSTANTIATION
    */
    xmlWriter.ForceEncoding(Encoding.UTF8);

    xmlWriter.WriteStartElement("encoding");
    xmlWriter.WriteAttributeString("should-be", "utf-8");
    xmlWriter.WriteEndElement();
}

// this will now output with utf-8 as encoding
Debug.Write(xmlAsUtf8.ToString());

The ForceEncoding extension method is named like that to emphasize that you really force the encoding to be different than allowed by the original implementation. You can use any encoding you like.

– Abel –

If you wish to specify the features to support on the created writer, use the overload that takes an XmlWriterSettings object as one of its arguments, and pass in a XmlWriterSettings object with the correct settings.

  • PersonalPronoun

    Thanks so much, you’ve solved an issue I was having with generating an iTunes Podcast / RSS feed – my feed was coming out in UTF16, and my Page.Response was reporting UTF8.

  • Matheus Rufca

    I found this and it works:
    MemoryStream memoryStream = new MemoryStream();
    XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
    xmlWriterSettings.Encoding = new UTF8Encoding(false);
    xmlWriterSettings.ConformanceLevel = ConformanceLevel.Document;
    xmlWriterSettings.Indent = true;

    XmlWriter xmlWriter = XmlWriter.Create(memoryStream, xmlWriterSettings);
    xmlWriter.WriteStartDocument();
    xmlWriter.WriteStartElement("root", "http://www.timvw.be/ns");
    xmlWriter.WriteEndElement();
    xmlWriter.WriteEndDocument();
    xmlWriter.Flush();
    xmlWriter.Close();

    string xmlString = Encoding.UTF8.GetString(memoryStream.ToArray());

    source: http://www.timvw.be/2007/01/08/generating-utf-8-with-systemxmlxmlwriter/

    • http://undermyhat.org Abel Braaksma

      That’s an interesting link. Essentially it explains how to create your own Encoding class and use that instead, plus that you need to use a MemoryStream as opposed to a StringWriter. It’s still quite a lot of work, but nice to know that there are other workarounds.

    • http://twitter.com/Rynkadink Ricardo

      Many thanks dude! Great solution…
      Solved my problem.

      Cheers!

    • DropChris

      this is the solution!

  • http://www.seventymm.com Uttam

    Try This —>

    public class StringWriterWithEncoding : StringWriter
    {
    private readonly Encoding encoding;
    public StringWriterWithEncoding(StringBuilder sb, Encoding newEncoding) : base(sb) { encoding = newEncoding; }
    public override Encoding Encoding { get { return encoding ?? base.Encoding; } }
    }

    private static string GenerateXML()
    {
    var stringBuilder = new StringBuilder();
    var xmlTextWriter = new XmlTextWriter(new StringWriterWithEncoding(stringBuilder, Encoding.UTF8)) { Formatting = Formatting.None };
    xmlTextWriter.WriteStartDocument();

    #region Generate your xml
    #endregion

    xmlTextWriter.WriteEndDocument();
    xmlTextWriter.Flush();
    xmlTextWriter.Close();
    return stringBuilder.ToString();
    }

  • Craig

    This author disagrees with you, and says that the default encoding in VB.net is UTF-8. What is your opinion?

    http://techrepublic.com.com/5208-6230-0.html?forumID=102&threadID=226398&messageID=3222797

    • http://www.undermyhat.org Abel Braaksma

      Hi Craig, just have the guy read the very first sentence of the subject on strings in the .NET Framework reference. This is true for all .NET languages that are CLR compliant (like VB):

      “A string is an object of type String whose value is text. Internally, the text is stored as a readonly collection of Char objects, each of which represents one Unicode character encoded in UTF-16.” (See this reference)

      (note that this was true from the very first version of .NET. In fact, they all follow the Unicode version 3.0 standard pretty flawlessly internally. Unfortunately, Microsoft never upgraded, not even for the recent VS 2010)

  • http://www.gocek.org Gary

    Nice explanation on the utf-16 vs. utf-8 XML header issue. I have a program that is invoked from the command line. Its output, normally sent to the console, can be redirected to a file. Problem is, that’s an 8-bit text file, even if the content is XML with a utf-16 header. The result is that the XML file can’t be opened in a browser because the XML header does not match the byte size of the text in the file. So, I took the simple route of string replacement, but I had wondered why I couldn’t just provide UTF8 settings to the XmlWriter constructor.

    • http://www.undermyhat.org Abel Braaksma

      That was what this little article was about, to explain the particular reasons and solutions why UTF-16 is used even when you don’t want that. There are other ways. For instance, the header is correctly honored if you simply write your output to a file (but watch the method you use, you must use an XML classes method)

  • Peter

    Thanks for the article. It does shine some light on this topic.

    An easier and more “correct” way would be:

    byte[] xmlData; // declared outside the using block so it is still in scope below
    using (MemoryStream buffer = new MemoryStream())
    {
    using (StreamWriter stream = new StreamWriter(buffer, Encoding.UTF8))
    {
    using (XmlWriter writer = XmlWriter.Create(stream))
    {
    writer.WriteStartElement("should-be-utf-8"); // element names cannot contain spaces
    writer.WriteEndElement();
    }
    }
    xmlData = buffer.ToArray();
    }

    // For testing or to put into a textbox:
    string xml = Encoding.UTF8.GetString(xmlData); // You might want to strip the BOM here

    This method does create the underlying data in the correct encoding and it does write the BOM.

    Why force something (StringBuilder) to UTF-8 if it is not? If using StringBuilder you can create the XML with XmlWriter and the setting OmitXmlDeclaration and then prepend the XML declaration yourself when converting the StringBuilder string to UTF-8 or whatever encoding.

    • http://www.undermyhat.org Abel Braaksma

      I see your point, and I like this other approach, however, it still does not allow you to simply write with any encoding string to a StringBuilder or TextWriter. One idea to use StringBuilder and XmlWriter is efficiency, by writing it first to a memory stream, then converting it to a byte array (second copy) and then converting it to a string (third copy) defies that idea a bit.

      At the other end, when you need string output (i.e. for display purposes), it is debatable whether such optimizations are needed.

      Though I can agree that your approach is the obvious more “correct” way, and the OmitXmlDeclaration is yet another “dirty” way because you have to prepend it by hand, which you just want to prevent by using a standardized writer, I don’t see an easy way of implementing this more correct way as a simple accessible extension so that users do not need to worry about this issue.

      I don’t see it as forcing something it is not, because a string is always a UTF-16 string internally and here it should represent a data stream as a string. I don’t force the string to be something else, the string will remain UTF-16, but will output with the intended encoding just like any other XmlWriter.

      – Abel –

  • Arjan

    > The default encoding for XML is UTF-8 or UTF-16. Let no-one tell you
    > that US-ASCII or even CP1252 or ISO-8859-1 is the default, it is not.

    Well, let me try anyway… ;-)

    When specifying the Content-Type to be “text/xml” while transferring XML over HTTP, then unfortunately the (authoritative) charset defaults to US-ASCII as per “RFC2046 Media Types”, http://ietf.org/rfc/rfc2046 … Even worse, when using that default, a processor should ignore any BOM. (Or maybe a processor should even raise an error when a BOM is found while no charset was given in the HTTP headers; “RFC2376 XML Media Types”, http://ietf.org/rfc/rfc2376 probably describes the rules but I never bothered to find them.)

    Using “application/xml” (rather than “text/xml”) allows for interpreting a BOM, and solves the problems of the 13 year old default of US-ASCII for the “text/*” media types. But of course when transferring over HTTP one should simply always specify the charset in the Content-Type as well.

    (Don’t shoot the messenger…)

    • http://www.undermyhat.org Abel Braaksma

      Very intriguing points and you’re right on the mark when it comes to the brilliance of IETF and W3.ORG to create confusion. However, it doesn’t hold (and I won’t shoot you), let me explain why (sorry for the length):

      You referred to RFC2046, which is a standard for MIME types but was created in 1996 before XML was standardized and before Unicode was widely accepted. In addition, the RFC2046 only mandates this for text/plain: “The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.” and then the following: “The specification for any future subtypes of “text” must specify whether or not they will also utilize a “charset” parameter, and may possibly restrict its values as well”, which means new types MUST specify what they do with this.

      Though the text/xml section in your RFC2376 follows your line of reasoning, it is not a standard, it is only an Informational Message. This is the all-time ‘informational’ confusing text:

      “Omitting the charset parameter is NOT RECOMMENDED for text/xml. For example, even if the contents of the XML entity are UTF-16 or UTF-8, or the XML entity has an explicit encoding declaration, XML and MIME processors must assume the charset is “us-ascii”.”

      In addition, the original XML specification itself does not refer to RFC2376 nor RFC2046 at all. There is, however, a link to RFC3023, which was added in a later edition of XML. And — here it comes! — the RFC3023 has vital information about our little discussion and is a Real Standard:

      “For this example, the XML MIME entity begins with a BOM. Since the charset has been omitted, a conforming XML processor follows the requirements of [XML], section 4.3.3. Specifically, the XML processor reads the BOM, and thus knows deterministically that the charset is UTF-16.”

      This here now follows the actual XML specification, which is good and makes them compatible.

      Thanks for taking the time to point at the omissions in my story. It is one of the least understood parts of the XML specification, perhaps I should write about it. ;-)

      – Abel –

      • Arjan van Bentem

        Too bad that last quote is about “application/xml”. Upwards from that example you’ll still find “8.5 Text/xml with Omitted Charset”, which in both RFC2376 and its successor RFC3023 sadly says the same as I did, and the same as you quoted: “even if the contents of the XML MIME entity are UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding declaration, XML and MIME processors MUST assume the charset is us-ascii”. I don’t see how that is not part of the Real Standard if your last quote is?

        Indeed, http://www.w3.org/TR/xml/ states: “When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 3023] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance.”

        Still, you may very well be right that some standard should have (or in fact might have) defined another default charset for “text/xml”, but I haven’t searched for it.

        However, I have learned the hard way that at least some well-known processors (as in: widely used libraries) do use US-ASCII for “text/xml” when no charset is defined in the HTTP headers. (Or, more likely: they get that charset from the libraries that handle the HTTP transport, without even knowing if it was explicitly specified by the sender.) So I know that when using “text/xml” to serve XML to third-party clients you may get into a fight about specifications some day. Or, when being that client, you may find yourself in need of a workaround to have your strict (or erroneous) third-party libraries not use that stupid US-ASCII default when the sender doesn’t want to add the real charset… Worst of all: you may not notice until you use some extended ASCII character in the XML. (When you’re only developing the client, make sure you get some special characters during testing.)

        In other words: I advise people not to use “text/xml”. Instead, use “text/xml; charset = ..”, or use “application/xml” with an optional charset. To get people to listen, I guess I’ll keep telling them the default is US-ASCII, no matter if you can convince me about another default. ;-)

        • http://www.undermyhat.org Abel Braaksma

          Thanks for pointing out that error in my story, I overlooked that ;-)

          But we are a bit off-topic: the discussion was, I believe, about the default for XML. The default for XML is defined as UTF-8 and UTF-16, which all processors must be able to read. These RFCs are talking about MIME types, which are a description of a type of document, not the specification of that document. It is a bit like saying: the default for the extension *.xml is CP-1252 (which is unfortunately true for Windows in many respects). But this is NOT the default for XML ;-)

          Luckily, I hardly ever see US-ASCII when dealing with XML (but if I do, I make sure they understand they’re almost 50 (!!) years behind). Nowadays, the internet as a whole has embraced UTF-8, and it is now the “most used character encoding”.

          You’ll only notice the ‘extended US-ASCII characters’ (which don’t exist: US-ASCII is inherently 7-bit, 8th bit always zero) as errors if they are not properly escaped as numeric character references. The beauty of XML is then that the document is invalid and must be returned to sender.

          There’s no real shame in using US-ASCII, other than living half a century in the past. Most software correctly (un)escapes entities, but many people don’t, and they create files by hand (string concatenation) with incorrect encoding, which is where the real trouble starts…

          – Abel –

          PS: no, I don’t want to convince anyone, I just like to have it stated correctly, which is, as you can see, hard enough…

        • http://www.undermyhat.org Abel Braaksma

          > have learned the hard way that at least some well-known
          > processors do use US-ASCII for “text/xml” when no
          > charset is defined in the HTTP headers

          What we found out in this little discussion is that the well-known processors are correct (Saxon 9 is one). Treating XML as US-ASCII is not a problem; you should treat it as encoding-neutral anyway. With any processor, it takes two lines of code to transform it into another encoding if desired.
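
          To make that concrete for .NET, here is a minimal sketch of re-encoding a document. It assumes the XML is already parsed into an XDocument; the class name ReEncode and the sample element are made up for this example. The two lines that matter are the XmlWriterSettings line and the Save call. The output goes to a stream, because with a StringBuilder or StringWriter the result stays UTF-16, as the article explains.

          ```csharp
          using System;
          using System.IO;
          using System.Text;
          using System.Xml;
          using System.Xml.Linq;

          static class ReEncode
          {
              static void Main()
              {
                  // Parsed from a string, so the document itself carries no byte encoding yet.
                  XDocument doc = XDocument.Parse("<greeting>caf\u00E9</greeting>");

                  // Ask the writer for US-ASCII output when serializing to a stream.
                  var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
                  using (var stream = new MemoryStream())
                  {
                      using (XmlWriter writer = XmlWriter.Create(stream, settings))
                      {
                          doc.Save(writer);
                      }

                      // Decode the bytes only to display them on the console.
                      Console.WriteLine(Encoding.ASCII.GetString(stream.ToArray()));
                  }
              }
          }
          ```

          With US-ASCII as the target, the writer should emit the non-ASCII character as a numeric character reference rather than as a raw byte, which is the escaping discussed above. Swapping Encoding.ASCII for another Encoding instance changes the target encoding in the same way.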
