Bug 8117 - Implementation of System.Text.GetEncodings() appears to be flawed
Summary: Implementation of System.Text.GetEncodings() appears to be flawed
Status: RESOLVED FIXED
Alias: None
Product: Class Libraries
Classification: Mono
Component: mscorlib ()
Version: 2.10.x
Hardware: PC Linux
: --- normal
Target Milestone: Untriaged
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2012-10-31 17:08 UTC by Mike
Modified: 2016-04-16 08:17 UTC (History)
2 users (show)

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
Potential patch for Encoding.cs (31.33 KB, text/plain)
2013-01-10 08:58 UTC, Mike
Details


Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.


Please create a new report on GitHub or Developer Community with your current version information, steps to reproduce, and relevant error messages or log files if you are hitting an issue that looks similar to this resolved bug and you do not yet see a matching new report.

Related Links:
Status:
RESOLVED FIXED

Description Mike 2012-10-31 17:08:06 UTC
I'm having an issue with IronPython that I believe is related to the implementation of System.Text.GetEncodings() in Mono's Corlib. The issue appears that when GetEncodings is called it assembles significantly more CodePages than Mono currently has implementation for. When the code pages are then passed into GetEncoding the method throws an exception (as it should) because only a handful of encodings are handled.

This effectively makes GetEncodings unusable and anything else that calls the method unusable as well. Shouldn't it by default only return Encodings that mono can handle?

Here is the stack trace of the issue.

Stack Trace:
System.NotSupportedException: CodePage 37 not supported
  at System.Text.Encoding.GetEncoding (Int32 codepage) [0x00165] in /home/abuild/rpmbuild/BUILD/mono-2.10.6/mcs/class/corlib/System.Text/Encoding.cs:483
  at System.Text.EncodingInfo.GetEncoding () [0x00000] in /home/abuild/rpmbuild/BUILD/mono-2.10.6/mcs/class/corlib/System.Text/EncodingInfo.cs:76
  at System.Text.EncodingInfo.get_Name () [0x0000b] in /home/abuild/rpmbuild/BUILD/mono-2.10.6/mcs/class/corlib/System.Text/EncodingInfo.cs:57
  at IronPython.Runtime.Operations.StringOps+CodecsInfo.MakeCodecsDict () [0x00000] in <filename unknown>:0
  at IronPython.Runtime.Operations.StringOps+CodecsInfo..cctor () [0x00000] in <filename unknown>:0


Perhaps it makes sense to wrap the GetEncoding method in a try catch as a hack to a hack? Or better yet have a static list of implemented encodings that are shared amongst the entire class?

Am I right in my analysis of the issue? 

Thanks for all the great work,
Mike
Comment 1 Marek Safar 2012-11-11 10:37:37 UTC
GetEncoding is OS dependant, it's does not guarantee what encoding are support. You can call GetEncodings () to get list of supported codes.
Comment 2 Mike 2012-11-12 14:52:10 UTC
I do not follow you.

The GetEncodings() method (at line 538 in System.Text.Encoding) has a hardcoded list of encoding numbers in them. Each one is passed into a constructor (int) as an EncodingInfo. When the encoding info's properties are used GetEncoding(int codepage) is used to retreive it. (reviewing the source code makes me wonder if when i put this bug in I was looking at old source, because i could have sworn it had GetEncoding for each codepage inside GetEncodings, but the latest appears that it just returns a bunch of lazilyloaded EncodingInfo objects).

This means that if a codepage is not supported (it would appear 37 in my case, which just so happens to be the first one in GetEncodings), if i attempt to get any properties out of it, it blows up.

I believe my current install of Mono is using the source that I was describing before where GetEncoding was called for each hardcoded encoding type. I am using mono 2.10.6.

All one has to do is call GetEncodings() in a main function and it will fail.

Even with the new source if one where to attempt to use the results of GetEncodings, when the EncodingInfo object is first used (initialized) it would need to be located in a try catch.

As an Example, if i wanted to create a dictionary of codepages to their appropriate encoding I might try this:

var myDictionary = GetEncodings().ToDictionary(info => info.CodePage, info => info.GetEncoding())

This is not possible.
Comment 3 Mike 2012-11-12 15:01:04 UTC
Rereading the stack I posted above, nullifies my thought about GetEncoding being called within GetEncodings, however, its being called by the EncodingInfo objects, so GetEncodings() doesn't return supported encodings, it just returns the array of codepages as EncodingInfos.

IronPython uses GetEncodings() as I described, and at least to me, it makes sense to expect that GetEncodings will return objects that aren't guaranteed to throw exceptions. To me, it seems that doing the following, doesn't quite make sense.

List<EncodingInfo> goodEncodings = new List<EncodingInfo>();
foreach(var info in GetEncodings())
{
    try
    {
        var name = info.Name;
        goodEncodings.Add(info);
    }
    catch(Exception e)
    {
       // Codepage must not be supported.
    }
}

//Do something with the good encodings
var supportedEncodings = goodEncodings.ToDictionary(info => info.CodePage, info => info.Encoding);
Comment 4 Marek Safar 2012-11-12 15:12:11 UTC
My fault, I read the bug description to be about GetEncoding. I agree GetEncodings is buggy if returns hardcoded list.
Comment 5 Mike 2012-12-29 01:09:11 UTC
I would like to fix this and submit it for review. Is there any suggestions that can be provided or perhaps some source that I can look up?

Other than looking at the getencoding method which shows the hardcoded encodings that Mono can handle is there some sort of other reference I should know about? I'm not familiar (from briefly looking at the source code) of the calls that attempt to get "encoding readers" outside of mono, so that is my only hesitation. 

This is an extremely critical issue for me to fix. As of right now all Gnu/ Non-Windows based development has completely and totally stopped because of this issue. So long as 3rd party software calls GetEncodings() (such as IronPython, which is reasonable) this will remain an issue. 

Thanks alot guys,
Mike
Comment 6 Marek Safar 2013-01-02 04:05:52 UTC
I was quickly looking at the issue and it's not simple to fix.

There are 2 things to be done.
- Determine which encoding are always available, part of mscorlib/runtime
- Change external encoding assemblies to advertise which encoding they contain. That's for assemblies like I18N*.dll. There is also possibility for user code to implement encoding but I am not sure that ever used.
Comment 7 Mike 2013-01-09 18:38:27 UTC
Thanks for the reply!

As a quick fix I submitted an Issue Bug over at IronPython so that they might know about the issue, and perhaps put a try catch around each encoding that they get from GetEncodings(). Obviously it isn't the method that fails, but the individual encodings that come out. I made the "fix" on my personal source of IronPython but have not been able to test it as it would require alot of work getting my code ported over. I'll give that a shot here soon.

In the mean time I ran a test just to see the kind of results that it provides. I called GetEncodings and try catched each result when getting the name. Out of 95 encodings, 23 worked while 72 exceptions were thrown. Youchers.

It would appear that the first thing you have, is partially already done. I don't have the source sitting in front of me, but there was hardcoded cases in GetEncoding. Perhaps those ones need to be moved to a static field perhaps?

The 2nd requirement definitely makes sense. 


Another possibility that I thought about was a lazily initialized array of non-failure encodings. So instead of just returning the array of encodings that are uninitialized, perhaps each encoding is tested before going out. Try catch each one, if it fails, it is ignored. Before returning the array of encodings, the array of encodings is statically saved in a field. (Perhaps this all fits better in a static accessor so it can be accessed anywhere within the class). Any successive calls would just get the already created list. 

In my case that would result in a 72 caught exceptions and returning supported encodings, rather than one exception that breaks unsuspecting third party code.


Thanks for the help,
Mike
Comment 8 Mike 2013-01-10 08:58:26 UTC
Created attachment 3187 [details]
Potential patch for Encoding.cs

Here is a potential patch that I mentioned regarding the try catch. I noted that the lazily initialization is already there so it was simply a matter of creating a maximum array with all the values, creating the EncodingInfo objects, testing the GetEncoding method, if it fails it does not end up in the final encoding_infos field and sent out.
Comment 9 Mike 2013-01-10 12:46:32 UTC
Marek,
I should have added in that I am considering looking into what it takes to write a couple of code pages because a lot of extremely well used code pages across the world don't have implementations yet in MONO.

In particular I'm considering writing Big 5, gb2312, and gb18030.

I'll study the current sources and see if this is a reasonable undertaking.
Comment 10 Marek Safar 2016-04-16 08:17:57 UTC
Should be fixed as we switched to RS code.