Bug 40002 - decoder.convert class consumes more bytes than is necessary
Summary: decoder.convert class consumes more bytes than is necessary
Status: RESOLVED NOT_REPRODUCIBLE
Alias: None
Product: Class Libraries
Classification: Mono
Component: General ()
Version: master
Hardware: PC Linux
Importance: --- normal
Target Milestone: Untriaged
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2016-03-30 20:16 UTC by Jason Curl
Modified: 2016-04-08 08:38 UTC

Tags:
Is this bug a regression?: ---
Last known good build:

Description Jason Curl 2016-03-30 20:16:41 UTC
System.Text.Decoder for UTF-8 behaves differently from .NET on Windows under very specific conditions, which can result in decoding errors.

I'm using Ubuntu 16.04 Xenial (development branch), but using the Xamarin built mono binaries, installed with "apt install mono-complete". Note, the Ubuntu binaries are broken, so see the Xamarin pages on how I installed this.

Mono JIT compiler version 4.2.3 (Stable 4.2.3.4/832de4b Wed Mar 16 13:19:08 UTC 2016)
Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
        TLS:           __thread
        SIGSEGV:       altstack
        Notifications: epoll
        Architecture:  amd64
        Disabled:      none
        Misc:          softdebug 
        LLVM:          supported, not enabled.
        GC:            sgen

The following test case fails (run with NUnit 2.6.4 from NuGet):

namespace BugTest
{
    using System;
    using System.Text;
    using NUnit.Framework;

    [TestFixture]
    public class MonoDecoder
    {
        [Test]
        public void TestDecoder()
        {
            Encoding encoding = Encoding.GetEncoding("UTF-8");
            Decoder decoder = encoding.GetDecoder();

            byte[] data = new byte[] { 0x61, 0xE2, 0x82, 0xAC, 0x40, 0x41 };
            char[] oneChar = new char[2];

            int bu; int cu; bool complete;
            decoder.Convert(data, 0, 2, oneChar, 0, 1, false, out bu, out cu, out complete);
            Assert.That(bu, Is.EqualTo(1));
            Assert.That(cu, Is.EqualTo(1));
        }

    }
}

Note that this test case passes using .NET 4.6 on Windows 7.

Interestingly, if I pass a byte count of 6 to Convert, the test case passes, reporting 1 byte used for 1 character. If I pass a byte count of 2 (as shown here), bu is 2 where 1 is expected. One might thus wrongly conclude that the character 'a' consists of 2 bytes.

The purpose of this code is to get exactly one character at a time and measure how many bytes were used.

Current workaround: I always have to pass a byte count of 1 to guarantee that bu is the smallest possible value.
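
A minimal sketch of that workaround, assuming the goal is to read one character at a time and count its bytes. The class and method names (OneCharReader, ReadOneChar) are hypothetical and not taken from the reporter's project. The sketch only handles characters that fit in a single char (no surrogate pairs) and has not been checked against invalid input.

```csharp
using System;
using System.Text;

static class OneCharReader
{
    // Hypothetical helper: decodes one char starting at data[offset],
    // feeding the decoder a single byte per Convert call.
    // Returns the number of bytes consumed, or 0 if no char was produced.
    public static int ReadOneChar(Decoder decoder, byte[] data, int offset, char[] dest)
    {
        int consumed = 0;
        while (offset + consumed < data.Length)
        {
            int bu, cu;
            bool complete;
            // byteCount is always 1, so at most one byte is taken per call.
            decoder.Convert(data, offset + consumed, 1, dest, 0, 1, false,
                            out bu, out cu, out complete);
            consumed += bu;
            if (cu > 0) return consumed;      // one char is ready
            if (bu == 0) break;               // no progress, give up
        }
        return 0;
    }

    static void Main()
    {
        // Same bytes as the test case: 'a', U+20AC (3 bytes), '@', 'A'.
        byte[] data = { 0x61, 0xE2, 0x82, 0xAC, 0x40, 0x41 };
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] one = new char[1];
        int offset = 0;
        int used;
        while ((used = ReadOneChar(decoder, data, offset, one)) > 0)
        {
            Console.WriteLine("U+{0:X4} used {1} byte(s)", (int)one[0], used);
            offset += used;
        }
    }
}
```

For the six bytes above, the byte counts should come out as 1, 3, 1 and 1. The cost is one Convert call per input byte.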
Comment 1 Jason Curl 2016-03-30 20:46:54 UTC
I seem to be having problems related to this bug in other test cases of my small project, where whenever the number of bytes given in the input array is 2, data is also lost.
Comment 2 Jason Curl 2016-03-30 21:55:35 UTC
Ignore Comment 1. That was a different bug in my code, and I couldn't create a small test case for it. Just concentrate on the test case that I provided in the problem report.
Comment 3 Jon Purdy 2016-04-04 17:49:44 UTC
This is curious, because we’re using the System.Text.Decoder and UTF8Encoding from referencesource, mostly unmodified. Looking into it.
Comment 4 Jon Purdy 2016-04-04 18:24:49 UTC
The test case also fails for me on Windows.

I’m guessing you could be using a version in which they’ve already fixed this bug. What is the output of this command for you?

    reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"

It gives me:

    Version          REG_SZ       4.6.01055
    CBS              REG_DWORD    0x1
    TargetVersion    REG_SZ       4.0.0
    Install          REG_DWORD    0x1
    InstallPath      REG_SZ       C:\Windows\Microsoft.NET\Framework64\v4.0.30319\
    Servicing        REG_DWORD    0x0
    Release          REG_DWORD    0x6041f
Comment 5 Jason Curl 2016-04-06 17:45:07 UTC
Well, this is embarrassing. I retested on a few of my machines and VMs, as I could have sworn this worked on Windows, but I've confirmed it doesn't. The test case fails on all of the following, from the original .NET 4.0 on WinXP to Win10:

================================================================================
== BUGATTI: Windows 10 x64 1511
== RESULT: FAILS
================================================================================
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full
    CBS            REG_DWORD    0x1
    Install        REG_DWORD    0x1
    InstallPath    REG_SZ       C:\Windows\Microsoft.NET\Framework\v4.0.30319\
    Release        REG_DWORD    0x6040e
    Servicing      REG_DWORD    0x0
    TargetVersion  REG_SZ       4.0.0
    Version        REG_SZ       4.6.01038

================================================================================
== VEYRON: Windows 10 x64 1511 (Surface Book)
== RESULT: FAILS
================================================================================
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full
    CBS            REG_DWORD    0x1
    Install        REG_DWORD    0x1
    InstallPath    REG_SZ       C:\Windows\Microsoft.NET\Framework64\v4.0.30319\
    Release        REG_DWORD    0x6040e
    Servicing      REG_DWORD    0x0
    TargetVersion  REG_SZ       4.0.0
    Version        REG_SZ       4.6.01038

================================================================================
== LEON: Windows 7 SP1 (VS2015 installed)
== RESULT: FAILS
================================================================================
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full
    MSI            REG_DWORD    0x1
    Install        REG_DWORD    0x1
    InstallPath    REG_SZ       C:\Windows\Microsoft.NET\Framework\v4.0.30319\
    Release        REG_DWORD    0x6041f
    Servicing      REG_DWORD    0x0
    TargetVersion  REG_SZ       4.0.0
    Version        REG_SZ       4.6.01055

================================================================================
== WIN7VMDEV: Windows 7 SP1 x64 (VS2015 installed)
== RESULT: FAILS
================================================================================
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full
    MSI            REG_DWORD    0x1
    Install        REG_DWORD    0x1
    InstallPath    REG_SZ       C:\Windows\Microsoft.NET\Framework64\v4.0.30319\
    Release        REG_DWORD    0x60051
    Servicing      REG_DWORD    0x0
    TargetVersion  REG_SZ       4.0.0
    Version        REG_SZ       4.6.00081

================================================================================
== WIN7OFF: Windows 7 SP1 x64 (VS2015 installed)
== RESULT: FAILS
================================================================================
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full
    MSI            REG_DWORD    0x1
    Install        REG_DWORD    0x1
    InstallPath    REG_SZ       C:\Windows\Microsoft.NET\Framework64\v4.0.30319\
    Release        REG_DWORD    0x5cbf5
    Servicing      REG_DWORD    0x0
    TargetVersion  REG_SZ       4.0.0
    Version        REG_SZ       4.5.51209

================================================================================
== WINXPDEV: Windows XP SP3
== RESULT: FAILS
================================================================================
reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full
    MSI            REG_DWORD    0x1
    Install        REG_DWORD    0x1
    InstallPath    REG_SZ       c:\WINDOWS\Microsoft.NET\Framework\v4.0.30319\
    Servicing      REG_DWORD    0x0
    TargetVersion  REG_SZ       4.0.0
    Version        REG_SZ       4.0.30319
Comment 6 Jon Purdy 2016-04-08 01:32:45 UTC
Okay. I guess I’m going to mark this as “not reproducible”, as we match .NET here.

Technically speaking, you asked “please convert these 2 bytes into characters” and it responded “okay, I was able to decode those 2 bytes into 1 character”. It’s just that the second byte didn’t correspond to valid UTF-8.

You can tell how many bytes a UTF-8 character will use with UTF8Encoding.GetByteCount(). Alternatively, you can do it by the range of the character or the most significant bits of the first byte.
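
As a rough illustration of the last suggestion, the sequence length can be read from the high bits of the first byte. This is a sketch only: the names Utf8Lead and SequenceLength are hypothetical, and the function looks at the lead byte alone, so it does not validate continuation bytes or reject lead bytes such as 0xC0 and 0xC1 that never occur in well-formed UTF-8.

```csharp
using System;
using System.Text;

static class Utf8Lead
{
    // Hypothetical helper: length of the UTF-8 sequence announced by a lead byte.
    // Returns 0 for a byte that cannot start a sequence (a continuation byte
    // 10xxxxxx, or 11111xxx which UTF-8 does not use).
    public static int SequenceLength(byte lead)
    {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;
    }

    static void Main()
    {
        Console.WriteLine(SequenceLength(0x61)); // 1: 'a'
        Console.WriteLine(SequenceLength(0xE2)); // 3: lead byte of U+20AC
        Console.WriteLine(SequenceLength(0x82)); // 0: continuation byte

        // The character-side count mentioned above, for comparison.
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u20AC")); // 3
    }
}
```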
Comment 7 Jason Curl 2016-04-08 08:38:44 UTC
Thanks for your time.

I'd like to point out, though, that the implementation is inconsistent, whether in Mono (Mono can take the lead here :) or in the reference sources.

If you give it three bytes and ask for one character, the API works as expected! That is, it consumes only 1 byte and the test case passes (it does not consume 2 or 3 bytes).

While I'm familiar enough with UTF8 to do that, it's not very practical for my work where the user specifies the decoder to use. See "github.com/jcurl/serialportstream" for the use case. It was implemented due to other decoding bugs in the SerialPort of System.IO.Ports (when it converts characters back to bytes thus screwing up the byte stream in case of non-decodable situations).

The suggestion to use UTF8Encoding.GetByteCount() is incorrect, as it assumes valid data. An easy way to create a case where this doesn't work is to set the fallback character for decoding failures to '.', for which GetByteCount() returns 1, while an input stream can be constructed in which 3 bytes are consumed that turn out to be invalid.
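
A small sketch of how one might observe this, assuming a UTF-8 encoding configured with '.' as the replacement string. The input bytes are made up for illustration. No expected output is given, because how many replacement characters a truncated sequence produces can depend on the runtime version.

```csharp
using System;
using System.Text;

static class FallbackDemo
{
    static void Main()
    {
        // UTF-8 that substitutes "." for anything it cannot encode or decode.
        Encoding enc = Encoding.GetEncoding("utf-8",
            new EncoderReplacementFallback("."),
            new DecoderReplacementFallback("."));

        // 0xE2 0x82 begins a 3-byte sequence, but 0x41 ('A') is not a
        // continuation byte, so the sequence is invalid.
        byte[] bad = { 0xE2, 0x82, 0x41 };
        string text = enc.GetString(bad);

        Console.WriteLine("decoded text:      " + text);
        Console.WriteLine("input byte count:  " + bad.Length);
        // Byte count recomputed from the decoded text; every '.' counts as 1.
        Console.WriteLine("GetByteCount:      " + enc.GetByteCount(text));
    }
}
```

If the last two numbers differ, the byte count recomputed from the decoded characters cannot be used to work out how many input bytes the decoder consumed.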

Thus the only consistent reliable solution for a Decoder is to consume the minimum number of bytes (or characters) for a given input, which this test case shows is not the case for a very specific type of input.

I do ask that this be looked into, or at least that it be accepted that a fix is required so that a patch from anyone would be considered. Failing that, a suggestion on what to do next would help.