Bug 25401 - XmlTextReader.Encoding is Unicode after loading XML that is UTF-8 and has encoding="utf-8" in the declaration
Status: RESOLVED FIXED
Alias: None
Product: Class Libraries
Classification: Mono
Component: System.XML
Version: unspecified
Hardware: PC Linux
Importance: --- normal
Target Milestone: Untriaged
Assignee: Bugzilla
Reported: 2014-12-15 14:00 UTC by Greg Najda
Modified: 2015-03-17 16:43 UTC


Description Greg Najda 2014-12-15 14:00:29 UTC
After loading XML that is encoded in UTF-8 and begins with <?xml version="1.0" encoding="utf-8"?>, XmlTextReader.Encoding is set to Unicode, even if a UTF-8 BOM is present.

https://github.com/mono/mono/blob/f5e62d4d2f2ca3fd492c41224ee923f4cc650c61/mcs/class/System.XML/System.Xml/XmlTextReader.cs#L2049 appears to be the cause.

Reproduction code is below. On Microsoft .NET, the output is:

Encoding of XmlTextReader without UTF-8 BOM: null
Encoding of XmlTextReader with UTF-8 BOM: null

On Mono 3.2.8 on Linux, the output is:

Encoding of XmlTextReader without UTF-8 BOM: Unicode
Encoding of XmlTextReader with UTF-8 BOM: Unicode

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

namespace XmlEncodingRepro
{
    class Program
    {
        static void Main(string[] args)
        {
            string xml = @"<?xml version=""1.0"" encoding=""utf-8""?>
<configuration></configuration>";

            byte[] utf8Bom = Encoding.UTF8.GetPreamble();
            byte[] utf8WithoutBom = Encoding.UTF8.GetBytes(xml);
            byte[] utf8WithBom = utf8Bom.Concat(utf8WithoutBom).ToArray();

            using (MemoryStream utf8WithoutBomStream = new MemoryStream(utf8WithoutBom))
            using (StreamReader utf8WithoutBomReader = new StreamReader(utf8WithoutBomStream, true))
            using (XmlTextReader utf8WithoutBomXmlTextReader = new XmlTextReader(utf8WithoutBomReader))
            using (MemoryStream utf8WithBomStream = new MemoryStream(utf8WithBom))
            using (StreamReader utf8WithBomReader = new StreamReader(utf8WithBomStream, true))
            using (XmlTextReader utf8WithBomXmlTextReader = new XmlTextReader(utf8WithBomReader))
            {
                XmlDocument withoutBomDoc = new XmlDocument();
                withoutBomDoc.Load(utf8WithoutBomXmlTextReader);
                Console.WriteLine("Encoding of XmlTextReader without UTF-8 BOM: {0}",
                    utf8WithoutBomXmlTextReader.Encoding != null ? utf8WithoutBomXmlTextReader.Encoding.EncodingName : "null");

                XmlDocument withBomDoc = new XmlDocument();
                withBomDoc.Load(utf8WithBomXmlTextReader);
                Console.WriteLine("Encoding of XmlTextReader with UTF-8 BOM: {0}",
                    utf8WithBomXmlTextReader.Encoding != null ? utf8WithBomXmlTextReader.Encoding.EncodingName : "null");
            }
        }
    }
}

Comment 1 Atsushi Eno 2015-03-17 16:43:05 UTC
XmlReader is now based on referencesource, so even this kind of corner case behaves identically to .NET.