Bug 16938 - Java Iterator over a collection of strings causes illegal JNI GREF access
Summary: Java Iterator over a collection of strings causes illegal JNI GREF access
Alias: None
Product: Android
Classification: Xamarin
Component: Mono runtime / AOT Compiler ()
Version: 4.10.0.x
Hardware: PC Windows
: Normal normal
Target Milestone: ---
Assignee: Jonathan Pryor
Depends on:
Reported: 2013-12-22 19:25 UTC by T.J. Purtell
Modified: 2013-12-27 09:59 UTC (History)
3 users (show)

Is this bug a regression?: ---
Last known good build:

Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.

Please create a new report on Developer Community or GitHub with your current version information, steps to reproduce, and relevant error messages or log files if you are hitting an issue that looks similar to this resolved bug and you do not yet see a matching new report.

Related Links:

Description T.J. Purtell 2013-12-22 19:25:15 UTC
Test case is in a repository here: https://bitbucket.org/tpurtell/xamarin-jni-string-interning-crash

It generally takes less than 15 seconds for this to produce a crash on my HTC One with Android 4.3 firmware in a Release build.  Basically all the test case does is start 3 threads.  One thread GCs repeatedly with a very short random delay (<5ms).  The other two thread allocate a JavaDictionary<string, string> for themselves (one per thread) and then repeatedly iterate over the Keys in in the JavaDictionary.  The actual string value of the key is "one" and is identical in all dictionaries.  The end result is that one of the threads frees the GREF of the string the other thread is accessing.

My best guess monodroid_get_identity_hash code does something special for strings that isn't taken into account in other places.  Or the code that converts from Java.Lang.String => string automagically for these dictionary types must have a bug where it is disposing of a gref.  Either that or there is some weird string interning going on in Android.  Given that the documentation for android offers an option to manually intern strings, I don't think this problem is an Android OS issue.

I originally encountered this error attempting to work around: https://bugzilla.xamarin.com/show_bug.cgi?id=15135

When using the ModernHttpClient, I found that my process would randomly crash or hang under concurrent network load.  As a workaround I had to acquire a GLOBAL LOCK around access to a JavaDictionary of strings.  You can see this atrocity here: https://github.com/tpurtell/ModernHttpClient/commit/57a37db5e6e6b489d848d30e391de55c1d577b7e#diff-ce22ec22a5fe0a65957f41e7b93462f2R80

I imagine this problem can occur for basically any API in Xamarin that maps to JNI and returns a string argument.  BTW, if you get bored of fixing this, try making a JavaDictionary<string, int>.  That fails instantly always.

======= CRASH LOG ====
>12-22 15:53:56.838: E/dalvikvm(2766): JNI ERROR (app bug): attempt to use stale global reference 0x1da00d7a
>12-22 15:53:56.838: E/dalvikvm(2766): VM aborting
>12-22 15:53:56.838: E/mono-rt(2766): Stacktrace:
>12-22 15:53:56.838: E/mono-rt(2766):   at <unknown> <0xffffffff>
>12-22 15:53:56.838: E/mono-rt(2766):   at (wrapper managed-to-native) object.wrapper_native_0x4095d175 (intptr,intptr,intptr) <0xffffffff>
>12-22 15:53:56.838: E/mono-rt(2766):   at Android.Runtime.JNIEnv.CallBooleanMethod (intptr,intptr) <0x0009f>
>12-22 15:53:56.838: E/mono-rt(2766):   at Java.Util.IIteratorInvoker.get_HasNext () <0x00063>
>12-22 15:53:56.838: E/mono-rt(2766):   at System.Linq.Extensions/<ToEnumerator_Dispose>c__Iterator8`1.MoveNext () <0x0016f>
>12-22 15:53:56.838: E/mono-rt(2766):   at Android.Runtime.JavaDictionary`2/<GetEnumerator>c__Iterator4.MoveNext () <0x0024f>
>12-22 15:53:56.838: E/mono-rt(2766):   at AndroidApplication4.StringInterningCorruption.Really () <0x0013b>
>12-22 15:53:56.838: E/mono-rt(2766):   at Java.Lang.Thread/RunnableImplementor.Run () <0x0003f>
>12-22 15:53:56.838: E/mono-rt(2766):   at Java.Lang.IRunnableInvoker.n_Run (intptr,intptr) <0x0003b>
>12-22 15:53:56.838: E/mono-rt(2766):   at (wrapper dynamic-method) object.206ff215-4fb4-4461-84c7-27c1f979e708 (intptr,intptr) <0x0003b>
>12-22 15:53:56.838: E/mono-rt(2766):   at (wrapper native-to-managed) object.206ff215-4fb4-4461-84c7-27c1f979e708 (intptr,intptr) <0xffffffff>
>12-22 15:53:56.838: E/mono-rt(2766): =================================================================
>12-22 15:53:56.838: E/mono-rt(2766): Got a SIGSEGV while executing native code. This usually indicates
>12-22 15:53:56.838: E/mono-rt(2766): a fatal error in the mono runtime or one of the native libraries 
>12-22 15:53:56.838: E/mono-rt(2766): used by your application.
>12-22 15:53:56.838: E/mono-rt(2766): =================================================================
Comment 1 T.J. Purtell 2013-12-23 03:17:05 UTC
I suppose that crash trace shows an iterator->hasNext call which is failing.  

The original ModernHttpClient crash log on a Nexus 5 is this
>E/dalvikvm(16514): JNI ERROR (app bug): attempt to use stale global reference 0x23a014b6
>E/dalvikvm(16514): VM aborting
>E/mono-rt (16514): Stacktrace:
>E/mono-rt (16514): 
>E/mono-rt (16514):   at <unknown> <0xffffffff>
>E/mono-rt (16514):   at (wrapper managed-to-native) object.wrapper_native_0x41539381 (intptr,intptr) <IL 0x00026, 0xffffffff>
>E/mono-rt (16514):   at Android.Runtime.JNIEnv.NewGlobalRef (intptr) <IL 0x00010, 0x000b7>
>E/mono-rt (16514):   at Java.Lang.Object.RegisterInstance (Android.Runtime.IJavaObject,intptr,Android.Runtime.JniHandleOwnership) <IL 0x00034, 0x00153>
>E/mono-rt (16514):   at Java.Lang.Object.SetHandle (intptr,Android.Runtime.JniHandleOwnership) <IL 0x00004, 0x0005b>
>E/mono-rt (16514):   at Java.Lang.Object..ctor (intptr,Android.Runtime.JniHandleOwnership) <IL 0x0002d, 0x000f7>
>E/mono-rt (16514):   at Java.Lang.String..ctor (intptr,Android.Runtime.JniHandleOwnership) <IL 0x00003, 0x00053>
>E/mono-rt (16514):   at Java.Interop.JavaConvert.<JniHandleConverters>m__D1 (intptr,Android.Runtime.JniHandleOwnership) <IL 0x00002, 0x0007b>
>E/mono-rt (16514):   at Java.Interop.JavaConvert.FromJavaObject<T> (Android.Runtime.IJavaObject) <IL 0x00097, 0x003fb>
>E/mono-rt (16514):   at Java.Interop.JavaConvert.FromJniHandle<T> (intptr,Android.Runtime.JniHandleOwnership) <IL 0x0002f, 0x0017f>
>E/mono-rt (16514):   at System.Linq.Extensions/<ToEnumerator_Dispose>c__Iterator8`1.MoveNext () <IL 0x0005a, 0x00247>
>E/mono-rt (16514):   at ModernHttpClient.OkHttpNetworkHandler/<SendAsync>c__async0/<SendAsync>c__AnonStorey9.<>m__1 () <IL 0x00152, 0x00a33>
>E/mono-rt (16514):   at System.Threading.Tasks.TaskActionInvoker/FuncInvoke`1.Invoke (System.Threading.Tasks.Task,object,System.Threading.Tasks.Task) <IL 0x0000c, 0x00087>
>E/mono-rt (16514):   at System.Threading.Tasks.Task.InnerInvoke () <IL 0x0003f, 0x00123>
>E/mono-rt (16514):   at System.Threading.Tasks.Task.ThreadStart () <IL 0x00098, 0x004b7>
>E/mono-rt (16514):   at System.Threading.Tasks.Task.Execute () <IL 0x00001, 0x00043>
>E/mono-rt (16514):   at System.Threading.Tasks.TpScheduler.TaskExecuterCallback (object) <IL 0x00008, 0x0008f>
>E/mono-rt (16514):   at (wrapper runtime-invoke) <Module>.runtime_invoke_void__this___object (object,intptr,intptr,intptr) <IL 0x00052, 0xffffffff>
>E/mono-rt (16514): 
>E/mono-rt (16514): =================================================================
>E/mono-rt (16514): Got a SIGABRT while executing native code. This usually indicates
>E/mono-rt (16514): a fatal error in the mono runtime or one of the native libraries 
>E/mono-rt (16514): used by your application.
>E/mono-rt (16514): =================================================================
>E/mono-rt (16514):
Comment 2 T.J. Purtell 2013-12-23 03:22:11 UTC
The log from the original ModernHttpClient crash that inspired the test case is identical on the HTC One and Nexus 5

>E/dalvikvm(21443): JNI ERROR (app bug): attempt to use stale global reference 0x23500ac6
>E/dalvikvm(21443): VM aborting
>E/mono-rt (21443): Stacktrace:
>E/mono-rt (21443): 
>E/mono-rt (21443):   at <unknown> <0xffffffff>
>E/mono-rt (21443):   at (wrapper managed-to-native) object.wrapper_native_0x4095d695 (intptr,intptr) <0xffffffff>
>E/mono-rt (21443):   at Android.Runtime.JNIEnv.NewGlobalRef (intptr) <0x00043>
>E/mono-rt (21443):   at Java.Lang.Object.RegisterInstance (Android.Runtime.IJavaObject,intptr,Android.Runtime.JniHandleOwnership) <0x00083>
>E/mono-rt (21443):   at Java.Lang.Object.SetHandle (intptr,Android.Runtime.JniHandleOwnership) <0x00023>
>E/mono-rt (21443):   at Java.Lang.Object..ctor (intptr,Android.Runtime.JniHandleOwnership) <0x00047>
>E/mono-rt (21443):   at Java.Lang.String..ctor (intptr,Android.Runtime.JniHandleOwnership) <0x00023>
>E/mono-rt (21443):   at Java.Interop.JavaConvert.<JniHandleConverters>m__D1 (intptr,Android.Runtime.JniHandleOwnership) <0x00037>
>E/mono-rt (21443):   at Java.Interop.JavaConvert.FromJavaObject<T> (Android.Runtime.IJavaObject) <0x0019b>
>E/mono-rt (21443):   at Java.Interop.JavaConvert.FromJniHandle<T> (intptr,Android.Runtime.JniHandleOwnership) <0x0006b>
>E/mono-rt (21443):   at System.Linq.Extensions/<ToEnumerator_Dispose>c__Iterator8`1.MoveNext () <0x000f3>
>E/mono-rt (21443):   at ModernHttpClient.OkHttpNetworkHandler/<InternalSendAsync>c__async1/<InternalSendAsync>c__AnonStoreyB.<>m__1 () <0x004ab>
>E/mono-rt (21443):   at System.Threading.Tasks.TaskActionInvoker/FuncInvoke`1.Invoke (System.Threading.Tasks.Task,object,System.Threading.Tasks.Task) <0x0004f>
>E/mono-rt (21443):   at System.Threading.Tasks.Task.InnerInvoke () <0x0007f>
>E/mono-rt (21443):   at System.Threading.Tasks.Task.ThreadStart () <0x00213>
>E/mono-rt (21443):   at System.Threading.Tasks.Task.Execute () <0x00013>
>E/mono-rt (21443):   at System.Threading.Tasks.TpScheduler.TaskExecuterCallback (object) <0x0004b>
>E/mono-rt (21443):   at (wrapper runtime-invoke) <Module>.runtime_invoke_void__this___object (object,intptr,intptr,intptr) <0xffffffff>
>E/mono-rt (21443): 
>E/mono-rt (21443): =================================================================
>E/mono-rt (21443): Got a SIGSEGV while executing native code. This usually indicates
>E/mono-rt (21443): a fatal error in the mono runtime or one of the native libraries 
>E/mono-rt (21443): used by your application.
>E/mono-rt (21443): =================================================================
Comment 3 T.J. Purtell 2013-12-23 18:58:25 UTC
The global lock seems not to be an adequate work around.  I've switched to using java code to convert the Map<String, List<String>> into a Entry[] with Entry having a Key and a single Value.  First level of crashes is hidden behind this.  It will take another day or testing across many devices to find out if it hides the problem completely.
Comment 4 T.J. Purtell 2013-12-25 17:17:50 UTC
Global lock may have worked.  The remaining class of crashes was from the IMT bug (bug #16075) fix not being on the build server (it was on x.a stable).  Hiding the iterator in pure java code and just returning an array worked too.
Comment 5 Jonathan Pryor 2013-12-26 10:03:30 UTC
Please try the Xamarin.Android 4.11.0 alpha release, which contains a number of multithreading fixes. I'm unable to reproduce your crash in a timely manner (i.e. it runs much longer than 15s without crashing).

This is with tpurtell-xamarin-jni-string-interning-crash-dcec0501e77e and at the time of this writing the count has reached 17707 and is still counting.
Comment 6 Jonathan Pryor 2013-12-26 11:07:35 UTC
App is still running, and the button count has reached 46803 before apparently "freezing" but not crashing.

This is expected; it's possible to Post() more messages to the UI loop than the UI loop can process, and since you're posting on every 100 loop iterations the UI message loop is likely "full" and not being processed fast enough to keep up with the new messages being posted...

I'm now killing the repro.
Comment 7 T.J. Purtell 2013-12-26 12:12:58 UTC
Are you using Xamarin.Android 4.11 alpha? 

Oh oh I can't believe there was another Christmas present I forgot to open!
Comment 8 T.J. Purtell 2013-12-26 14:07:14 UTC
I modified the test case to also print the number of messages pending for the UI.  It pretty much stays at 0 for me.  I think the hang you are seeing is actually a different fault of some kind in Xamarin.Android.  I pushed the source code change to the repository that shows the number of pending requests.  It doesn't get too many Posts pending because the garbage collector keeps everything nice and slow.

Also, note that the button has always been clickable and it will display a different message on click so that you can easily see if the app is still responsive.

I installed the 4.11 alpha. When I run the test case, I no longer get crashes within 30 seconds.  

So... seems like this issue is fixed in 4.11.  Thanks Jonathan.
Comment 9 T.J. Purtell 2013-12-26 14:12:42 UTC
I looked at the logs while running under debug and it looks like GC_BRIDGE time gets worse with every passing moment.  This is probably why you eventually got the ANR.

Currently @ 15903 iterations, I see 780-820ms GC_BRIDGE time.  Eventually this will increase to a level where the stop the world will cause Android to report ANR for the process.
Comment 10 T.J. Purtell 2013-12-26 14:27:15 UTC
Given linear growth in GC time, perhaps it is fixed because its leaking GREFs completely now???

>12-26 11:05:22.798: D/Mono(22853): GC_BRIDGE num-objects 6971 num_hash_entries 6972 sccs size 6972 init 0.00ms df1 10.86ms sort 5.19ms dfs2 2.26ms setup-cb 5.01ms free-data 2.81ms user-cb 255.46ms clenanup 13.22ms links 425/425/425/1 dfs passes 2846960/1423905
>12-26 11:05:22.798: D/Mono(22853): GC_MAJOR: (user request) pause 16.42ms, total 16.97ms, bridge 284.08ms major 880K/880K los 0K/0K

>12-26 11:06:02.681: D/Mono(22853): GC_BRIDGE num-objects 8829 num_hash_entries 8830 sccs size 8830 init 0.00ms df1 20.81ms sort 9.77ms dfs2 2.81ms setup-cb 6.56ms free-data 3.60ms user-cb 269.01ms clenanup 11.29ms links 563/563/563/1 dfs passes 5020872/2510999
>12-26 11:06:02.681: D/Mono(22853): GC_MAJOR: (user request) pause 31.50ms, total 33.30ms, bridge 303.10ms major 944K/944K los 0K/0K

>12-26 11:07:15.609: D/Mono(22853): GC_BRIDGE num-objects 11541 num_hash_entries 11542 sccs size 11542 init 0.00ms df1 27.07ms sort 17.98ms dfs2 4.94ms setup-cb 10.93ms free-data 6.29ms user-cb 459.97ms clenanup 14.31ms links 732/732/732/1 dfs passes 8446818/4224141
>12-26 11:07:15.609: D/Mono(22853): GC_MAJOR: (user request) pause 40.50ms, total 42.15ms, bridge 514.51ms major 1056K/1056K los 0K/0K

>12-26 11:12:58.334: D/Mono(22853): GC_BRIDGE num-objects 17769 num_hash_entries 17770 sccs size 17770 init 0.00ms df1 29.12ms sort 16.24ms dfs2 5.16ms setup-cb 15.93ms free-data 9.12ms user-cb 931.36ms clenanup 22.77ms links 1160/1160/1160/1 dfs passes 21021850/10512085
>12-26 11:12:58.334: D/Mono(22853): GC_MAJOR: (user request) pause 43.12ms, total 44.04ms, bridge 1000.70ms major 1296K/1296K los 0K/0K
Comment 11 Jonathan Pryor 2013-12-27 09:59:27 UTC
> Given linear growth in GC time, perhaps it is fixed because its leaking GREFs
> completely now???

Indeed, it would seem that the GC isn't collecting the JavaDictionary or the JavaSet instances, though I don't know _why_; sounds like a question for our resident GC guru.

That said, if we explicitly dispose of the underlying JavaSet instance, then the GREF count is stable at ~30 (-ish) GREFs:

    var jd = ...
    for ( ; ; ) {
        var _keys = jd.Keys;
        foreach (var s in _keys)
            if (s != "1")
                throw new InvalidProgramException();
        ((IDisposable) _keys).Dispose();

The problem here is twofold:
(1) JavaDictionary.Keys allocates a new JavaSet instance on each access, which takes a GREF, but
(2) There's no way for you the developer to actually know (1) and know to do anything about it.

Oops. Yet another item for the "breaking changes for 5.0" list...
Comment 12 Jonathan Pryor 2013-12-27 09:59:53 UTC
Marking as fixed as per Comment #8.