Recently, I’ve had to interpret some user input and then place this input into an XML file for processing by BizTalk Server 2006. Unfortunately, BizTalk Server 2006 likes you to encode characters using their XML equivalents. Let me explain.
Background
This can seem quite easy using the System.Xml.Serialization.XmlSerializer, which generates the XML and escapes invalid characters for us automatically. There are problems, though.
Here is a template class:
public class TestClass
{
    public string Element1 { get; set; }
    public string Element2 { get; set; }
    public string Element3 { get; set; }
}
Now some code to test it being serialized:
TestClass tc = new TestClass();
tc.Element1 = "Hello World!";
tc.Element2 = "Yo yo yo yo !";
tc.Element3 = "And some more text here!";

XmlSerializer serializer = new XmlSerializer(typeof(TestClass));

// Dispose the writer so the output is flushed and the file is closed.
using (XmlWriter writer = XmlWriter.Create("UTF-8.xml"))
{
    serializer.Serialize(writer, tc);
}
And what do we get?
<?xml version="1.0" encoding="utf-8"?>
<TestClass> <!-- Schemas removed for clarity -->
    <Element1>Hello World!</Element1>
    <Element2>Yo yo yo yo !</Element2>
    <Element3>And some more text here!</Element3>
</TestClass>
Problem
So what is the problem? Well, let's see how it copes with characters that XML would prefer to be encoded, such as &, £ and ":
TestClass tc = new TestClass();
tc.Element1 = "Hello World!";
tc.Element2 = "Yo yo yo yo !";
tc.Element3 = "\"£$%^&*^%$£\"£(£^^%&\"^%£$£\"$%";
Our resulting XML?
<?xml version="1.0" encoding="utf-8"?>
<TestClass> <!-- Schema removed for clarity -->
    <Element1>Hello World!</Element1>
    <Element2>Yo yo yo yo !</Element2>
    <Element3>"£$%^&amp;*^%$£"£(£^^%&amp;"^%£$£"$%</Element3>
</TestClass>
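To see exactly which characters get escaped without opening the file, the markup can be captured in a string. This is only a minimal sketch: the EscapingDemo class and its WriteElement helper are hypothetical names, not part of the code above, and it writes through an XmlWriter directly (the same kind of writer the serializer is handed above) rather than through the XmlSerializer.

```csharp
using System.Text;
using System.Xml;

public static class EscapingDemo
{
    // Hypothetical helper: writes <e>text</e> and returns the raw markup.
    public static string WriteElement(string text)
    {
        StringBuilder markup = new StringBuilder();

        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true; // keep the output to the element itself

        using (XmlWriter writer = XmlWriter.Create(markup, settings))
        {
            // The writer escapes markup characters in the text for us.
            writer.WriteElementString("e", text);
        }

        return markup.ToString();
    }
}
```

Calling EscapingDemo.WriteElement("\"£&<") should give <e>"£&amp;&lt;</e>: the ampersand and the angle bracket are escaped, while the quote and the pound sign are written as they are, because element content does not require them to be escaped.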
Good work .NET! But what does BizTalk Server 2006 say if you try and interpret this XML?
There is no Unicode byte order mark. Cannot switch to Unicode.
Okay… After checking the file in EditPad and switching to hex mode, the byte order mark was definitely in the file. Well, why don’t I just use the System.Web.HttpUtility.HtmlEncode() method to encode everything and hopefully sort this problem out? We’ll have to make some changes to the TestClass class:
public class TestClass
{
    public string Element1 { get; set; }
    public string Element2 { get; set; }

    // Raw value: hidden from the XmlSerializer.
    [XmlIgnore()]
    public string Element3 { get; set; }

    // Encoded value: serialized under the name "Element3".
    [XmlElement("Element3")]
    public string EncodedElement3
    {
        get
        {
            return System.Web.HttpUtility.HtmlEncode(this.Element3);
        }
        set
        {
            this.Element3 = value;
        }
    }
}
[XmlIgnoreAttribute()] tells the XmlSerializer not to serialize the public property it decorates. Instead, we want it to serialize a different public property, using [XmlElementAttribute()] to override the element name so that it is still written as Element3.
That should sort out any invalid characters, right? We are encoding the text before passing it through the XmlSerializer, so what is the result? Using a shorter version of the string above ("£$%^&*), IE8 shows the element like this:
<Element3>&quot;&#163;$%^&amp;*</Element3>
Is this right? No! " has turned into &quot; and £ has become &#163;, and those entities are now literal text. The XmlSerializer escapes the & at the start of every entity that HtmlEncode() produced, so the value is encoded twice but decoded only once when the XML is read. What we should be seeing (from IE8’s rendering) is:
<Element3>"£$%^&*</Element3>
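The double encoding can be reproduced in memory. What follows is a minimal sketch under a couple of assumptions: it uses System.Net.WebUtility.HtmlEncode as a stand-in for System.Web.HttpUtility.HtmlEncode so that it runs without a reference to System.Web, and the Sample and DoubleEncodingDemo names are hypothetical.

```csharp
using System.IO;
using System.Net;
using System.Xml.Serialization;

// Hypothetical stand-in for TestClass, reduced to a single element.
public class Sample
{
    public string Element3 { get; set; }
}

public static class DoubleEncodingDemo
{
    // Serializes to a string instead of a file so the markup can be inspected.
    public static string Serialize(Sample sample)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Sample));
        using (StringWriter text = new StringWriter())
        {
            serializer.Serialize(text, sample);
            return text.ToString();
        }
    }

    public static string EncodeThenSerialize(string raw)
    {
        // First pass: HtmlEncode turns " into &quot; and & into &amp;.
        Sample sample = new Sample { Element3 = WebUtility.HtmlEncode(raw) };

        // Second pass: the serializer escapes the & of each entity again.
        return Serialize(sample);
    }
}
```

For the input "& (a quote followed by an ampersand), the Element3 line of the result should read <Element3>&amp;quot;&amp;amp;</Element3>. A parser decodes that once, which leaves the literal text &quot;&amp; instead of the original two characters.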
Now, even if we wrap the text in <![CDATA[ …… ]]> ourselves, the XmlSerializer is still none the wiser. It escapes our markers as ordinary text, and you can end up with even worse results:
<Element3>&lt;![CDATA[&quot;&#163;$%^&amp;*]]&gt;</Element3>
Solution
The easiest way to get round this is to add an extra property, like the above example, for the sole purpose of serialization. But instead of giving it a return type of string, you should use System.Xml.XmlCDataSection, like this:
[XmlIgnore()]
public string Element3 { get; set; }

[XmlElementAttribute("Element3")]
public XmlCDataSection CDataElement
{
    get
    {
        // CDATA nodes are created through an owning XmlDocument.
        XmlDocument xmlDoc = new XmlDocument();
        return xmlDoc.CreateCDataSection(this.Element3);
    }
    set
    {
        this.Element3 = value.Value;
    }
}
Now, if you try this, the result is:
<Element3><![CDATA["£$%^&*]]></Element3>
Perfect! And the good thing is BizTalk Server 2006 interprets the content correctly, so any split functions or substrings work as expected!
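For anyone who wants to try the whole round trip without BizTalk, here is a self-contained sketch of the same idea. The CDataSample and CDataRoundTrip names are hypothetical, the null checks are my own addition, and it serializes to a string rather than a file. Treat it as an illustration under those assumptions rather than the exact code used above.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public class CDataSample
{
    // Raw value used by the rest of the application.
    [XmlIgnore]
    public string Element3 { get; set; }

    // Serialization-only view of Element3, written as a CDATA section.
    [XmlElement("Element3")]
    public XmlCDataSection Element3CData
    {
        get
        {
            // Returning null leaves the element out when there is no value.
            if (Element3 == null) return null;
            return new XmlDocument().CreateCDataSection(Element3);
        }
        set
        {
            Element3 = (value == null) ? null : value.Value;
        }
    }
}

public static class CDataRoundTrip
{
    public static string Serialize(CDataSample sample)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(CDataSample));
        using (StringWriter text = new StringWriter())
        {
            // Default XmlWriter settings: no indentation is added.
            using (XmlWriter writer = XmlWriter.Create(text))
            {
                serializer.Serialize(writer, sample);
            }
            return text.ToString();
        }
    }

    public static CDataSample Deserialize(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(CDataSample));
        using (StringReader reader = new StringReader(xml))
        {
            return (CDataSample)serializer.Deserialize(reader);
        }
    }
}
```

Serializing a CDataSample whose Element3 is "£$%^&* and reading it back with Deserialize should return the same string. One thing to keep in mind is that the setter expects the element to contain a CDATA section, so this sketch is aimed at XML produced by the same class.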