Bug 22968 - Crash when lazy loading assemblies
Summary: Crash when lazy loading assemblies
Status: VERIFIED FIXED
Alias: None
Product: Android
Classification: Xamarin
Component: Mono runtime / AOT Compiler
Version: 4.16.0
Hardware: PC Windows
: Normal normal
Target Milestone: 4.20
Assignee: Radek Doulik
URL:
Depends on:
Blocks:
 
Reported: 2014-09-13 14:52 UTC by Cody Beyer (MSFT)
Modified: 2014-11-17 04:33 UTC

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
Error Log (66.50 KB, text/plain)
2014-09-13 14:52 UTC, Cody Beyer (MSFT)
Reproduction case (130.80 KB, text/plain)
2014-10-02 12:05 UTC, Adam Kapos
Logs/etc (15.08 KB, application/zip)
2014-10-22 16:40 UTC, Jon Douglas [MSFT]




Description Cody Beyer (MSFT) 2014-09-13 14:52:03 UTC
Created attachment 8056 [details]
Error Log

Basic error message information follows; a more detailed log is attached. The error does not occur if the libraries are loaded early in the constructor.

E/Surface (14724): dequeueBuffer_DEPRECATED: Fence::wait returned an error: -4

W/Adreno-EGLSUB(14724): <DequeueBuffer:585>: dequeue native buffer fail: Interrupted system call, buffer=0x0, handle=0x0

W/Adreno-ES20(14724): <gl2_surface_swap:43>: GL_OUT_OF_MEMORY

W/Adreno-EGL(14724): <qeglDrvAPI_eglSwapBuffers:3590>: EGL_BAD_SURFACE
Comment 1 Prashant manu 2014-09-15 02:44:41 UTC
Could you please provide us the direction/suggestions to check this issue?
Comment 2 T.J. Purtell 2014-09-18 12:14:33 UTC
Are you running under the debugger when this situation occurs?  I have had issues with this on Qualcomm-based chipsets with Xamarin under the debugger for a long time.  Basically, Qualcomm has bugs in their graphics driver where an interrupted system call is not handled correctly.  Either the graphics driver does not free the memory it was using when the system call is interrupted, or it is just misinterpreting the return code.

It's quite aggravating as it causes hangs or crashes while trying to debug apps on a device.  I suspect that this error can also be triggered by the garbage collector signals interrupting the graphics driver system calls.


Aside from working with Qualcomm to address this issue (which won't help existing devices for a long time, if ever), I think Xamarin can solve this issue by setting the SA_RESTART flag on the signal handlers that it registers.  This would make it so that the SIGSEGV signals used for the debugger and the SIGxxx signals used for GC wouldn't cause system calls in other Android subsystems to return unexpected errors.

I also see another weird manifestation of this signal handler behavior when I am doing network requests under the debugger.  Often the requests will fail, reporting an EINTR.  Again, the GC likely causes these as well.  Normal Android apps generally run with no signals occurring, so the extra signals in Xamarin/mono often seem to cause unexpected behaviors...
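
For code that the caller controls, the usual defensive pattern against EINTR is to retry the call. The following is a sketch of that pattern only; read_retry is a hypothetical name, and nothing in this report says the failing network code or the graphics driver does this.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call read() again whenever it fails only because a
 * signal handler interrupted it (errno == EINTR). Any other result,
 * including real errors and end-of-file, is returned unchanged. */
static ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

This only helps in code you can change; it does nothing for third-party native code, such as a vendor driver, that treats EINTR as a hard failure.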
Comment 3 Adam Kapos 2014-10-02 11:02:24 UTC
Yes. This only happens when the debugger is connected.
Comment 4 Adam Kapos 2014-10-02 12:05:01 UTC
Created attachment 8287 [details]
Reproduction case

I managed to come up with a reproduction. Launch the game with the debugger attached, tap the screen and wait a few seconds. The bug occurred on a Galaxy S4. It didn't occur on an LG L65.
Comment 6 Adam Kapos 2014-10-22 05:23:21 UTC
It's been three weeks without a sign. Anyone looking at these tickets?
Comment 7 Jon Douglas [MSFT] 2014-10-22 16:40:41 UTC
Created attachment 8475 [details]
Logs/etc
Comment 8 Jon Douglas [MSFT] 2014-10-22 16:42:22 UTC
I was able to reproduce this on a physical Samsung Galaxy S4; however, the issue did not reproduce on every attempt. I also noticed that this issue doesn't happen on Samsung Galaxy S devices other than the S4. I've attached logs/build output/etc. to confirm that I was able to replicate the issue.
Comment 9 Adam Kapos 2014-11-04 18:38:43 UTC
Any updates?
Comment 10 Jonathan Pryor 2014-11-04 23:17:03 UTC
I'm rather confused by this.

The crash appears to have nothing to do with lazy-loading of assemblies; the crash appears to be an Out Of Memory condition within OpenGL, and things falling to pieces from there:

> [Adreno-EGLSUB] <DequeueBuffer:593>: dequeue native buffer fail: Interrupted system call, buffer=0x0, handle=0x0
> [Adreno-ES20] <gl2_surface_swap:43>: GL_OUT_OF_MEMORY
> [Adreno-EGL] <qeglDrvAPI_eglSwapBuffers:3661>: EGL_BAD_SURFACE
> Thread finished: <Thread Pool> #14
> [Error in swap buffers] OpenTK.Platform.Android.EglException: EglSwapBuffers failed with error 12301 (0x300d)
> [Error in swap buffers]   at OpenTK.Platform.Android.AndroidGraphicsContext.Swap () [0x00090] in /Users/builder/data/lanes/monodroid-mlion-monodroid-4.18-series/3b7ef0a7/source/monodroid/src/OpenGLES/Android/AndroidGraphicsContext.cs:146 
> [Error in swap buffers]   at OpenTK.Platform.Android.AndroidGameView.SwapBuffers () [0x0001f] in /Users/builder/data/lanes/monodroid-mlion-monodroid-4.18-series/3b7ef0a7/source/monodroid/src/OpenGLES/Android/AndroidGameView.cs:228 
> [Error in swap buffers]   at Microsoft.Xna.Framework.AndroidGamePlatform.Present () [0x00025] in /Users/dominicnahous/Downloads/Repro/MonoGame/MonoGame.Framework/Android/AndroidGamePlatform.cs:171 

The "Repro Case" in Comment #4 appears to be additional log output. I am further confused.

@Adam's URL in Comment #5 is taking an eternity to download; I haven't been able to view it yet. It's been stuck at 99% for several minutes.

Comment #2 is somewhat enlightening -- it's a hardware/driver issue! yay! -- though the advice to use SA_RESTART is further confusing, as Mono itself *does* use SA_RESTART in numerous places, so I'm not sure what would be missing.
Comment 11 T.J. Purtell 2014-11-04 23:36:33 UTC
The GC signals do use SA_RESTART, so I am wrong about the GC causing the same behavior, but it appears that none of the other posix signal handlers do... e.g. my good friend SIGSEGV.

https://github.com/mono/mono/blob/master/mono/mini/mini-posix.c#L489

The only way an EINTR is getting returned, AFAIK, is if a signal interrupted it.  For example, the debugger is interacting with the running app in some way, e.g. pausing the app via signals, a breakpoint is triggered, a conditional breakpoint is evaluated, the UI tries to read some state and sends some signal unbeknownst to the debugger monkey operating it :)

IIRC debug code from mono is generated with a read at a magic page address in between statements, so that the magic address can be changed to invalid to cause program execution to pause after each single step event.  For me personally, single stepping under the debugger (even in non-GUI code) would trigger this issue.
Comment 12 Jonathan Pryor 2014-11-05 10:58:22 UTC
SIGSEGV needs restartable?! The one with a default action that terminates the process?

That said, Mono *does* have support to "delegate" the SIGSEGV chain to the native (dalvik/ART) SIGSEGV handler (added in Xamarin.Android 4.16+) which will cause Android to field the SIGSEGV, invoking debuggerd for additional logcat logging...

https://github.com/mono/mono/blob/87f4b147/mono/mini/driver.c#L2232-2243

But I do not understand why SIGSEGV of all signals should be SA_RESTARTable...
Comment 14 T.J. Purtell 2014-11-05 11:44:27 UTC
When many signals are delivered, they pause all threads in the process.  SIGSEGV is an example.  SA_RESTART controls what happens to the other threads' pending syscalls when the signal handler returns.  It specifically controls whether a pending syscall results in an error return value (EINTR) or whether the kernel restarts the pending system call (e.g. a wait-type operation).  If a signal type is not targeted to a specific thread, then all threads will have their system calls interrupted.

The Xamarin/mono debug system for single stepping (breakpoints?) works by essentially rewriting code to include a potential SIGSEGV between lines.  For example:

int foo = 1;
foo *= 2;

would logically become

int foo = 1;
IGNORE_RESULT(*magic_page);
foo *= 2;

The magic_page pointer is changed by the soft debugger to be inaccessible when you are in single stepping mode.  This causes the application to generate a SIGSEGV between every statement.

Now consider I have two threads running. One is doing some graphics work (UI thread/render thread/etc), the other is in some code I am trying to debug.

T1: code in the Qualcomm driver
A: result = wait_for_graphics_operation_to_complete();
B: if (result != OK)
C:  crap_all_over_the_graphics_state();

T2: my code that needs to be debugged
D: int foo = 1;
E: IGNORE_RESULT(*magic_page);
F: foo *= 2;

I am in single-stepping mode.  Thread 1 pauses at line A waiting for a result.  Meanwhile, Thread 2 reaches line E and the memory read generates a SIGSEGV.  **If** the SIGSEGV causes the android kernel to "interrupt" all threads, then this would result in result != OK and the driver would crap all over the graphics state.

In theory, a SIGSEGV could be properly targeted to a thread, but sometimes these things are surprising.  For example, someone may have decided it would be sensible that if a program thread fails for a SIGSEGV all threads should be stopped so the most consistent handling can be performed.

If SA_RESTART was enabled for SIGSEGV, I don't think there would be any behavior change for the thread that triggered the SIGSEGV.  The kernel checks all the buffers for validity up front, so a waiting syscall could never SIGSEGV.

The other signal handlers related to debugging likely also have an impact.

Glancing at the signal handler registration code, without having properly looked into all of the handlers' actual usage, the SIGINT handler stands out to me.  It is only enabled in debug mode, and sounds like it might be used to send a request from the debugger UI to pause the process to query state/change configuration/etc.

SIGINT also seems less likely to be a thread-targeted signal.  Using kill(xyz) to deliver a signal doesn't appear to offer a way to target a specific thread (Linux PID/TID overlap), so I think the API must always deliver them at the process level.
Comment 15 Jonathan Pryor 2014-11-05 12:36:42 UTC
@Zoltan: This looks like a debugger-related issue, as per Comment #3.

Repro is Comment #13; repro steps in Comment #4.
Comment 16 Zoltan Varga 2014-11-05 14:25:38 UTC
I can't reproduce this with the testcase provided. I don't think SIGSEGV signals cause other threads to be interrupted, but the debugger does use signals to suspend/resume threads. Will look into the SA_RESTART flag.
Comment 17 T.J. Purtell 2014-11-05 15:26:20 UTC
I think that this EINTR problem is also the cause of some Lollipop issues

Android's mutex implementation can be interrupted by signals (see lock which
returns an error code)

https://android.googlesource.com/platform/system/core/+/android-5.0.0_r2/include/utils/Mutex.h

Android's Looper implementation (used by RenderThread) does not check the
result of this lock operation, so it proceeds to 

https://android.googlesource.com/platform/system/core/+/android-5.0.0_r2/libutils/Looper.cpp
 (see line 229)

Although I originally saw that Qualcomm's driver had this issue where adding
new signals to the normal lifetime of Android apps caused unexpected behavior,
it now seems that Android itself has this problem.  

The Looper is not new code in Android and it seems to have never expected to be
interrupted.  I suspect there are more instances of this across the entire
Android code base, so it's critical to work around with SA_RESTART for any new
signals that Xamarin introduces.
Comment 18 T.J. Purtell 2014-11-05 16:45:16 UTC
RenderThread also uses AutoMutex which doesn't check the result of mLock.lock() so there are new places in Lollipop that cause random crashes due to native corruption from a false acquisition of a lock due to EINTR.

https://android.googlesource.com/platform/frameworks/base/+/android-5.0.0_r2/libs/hwui/renderthread/RenderThread.cpp
Comment 19 T.J. Purtell 2014-11-05 17:27:34 UTC
Actually, pthread_mutex_lock apparently never returns EINTR, so this proposed explanation is wrong, at least for the exact mechanism. :(  Maybe there is some other Android code in the render thread that doesn't handle EINTR well, though.
Comment 20 Zoltan Varga 2014-11-07 14:07:23 UTC
This will hopefully be fixed in xamarin.android 4.20.
Comment 21 Zoltan Varga 2014-11-07 16:05:56 UTC
Fixed.
Comment 22 Adam Kapos 2014-11-07 16:08:06 UTC
What was the problem?
Comment 23 Parmendra Kumar 2014-11-13 13:44:49 UTC
I have checked this issue and observed that the application deploys successfully on an Android device (Samsung Galaxy S4 4.4.2).

I have also observed that when we tap on the screen, it throws an 'Unhandled Exception' after a few seconds and the application crashes.

Hence reopening this issue.

Screencast: http://www.screencast.com/t/axoBlIUi5HMP

ADB Logcat: https://gist.github.com/Parmendrak/a59c496b03e5f4cb87ef
OutputLog: https://gist.github.com/Parmendrak/a242d2b3d3181e7ef7f7
ZipXamLog: https://gist.github.com/Parmendrak/b86bb566420bc4d1b975

Environment Info:
Microsoft Visual Studio Professional 2013
Version 12.0.30723.00 Update 3
Microsoft .NET Framework
Version 4.5.51641

Installed Version: Professional

Xamarin   3.8.134.0
Xamarin.Android   4.20.0.24
Xamarin.iOS   8.4.0.0
Comment 24 Jonathan Pryor 2014-11-13 16:25:12 UTC
@radek: Please review the OutputLog in Comment #23. Any ideas?

> E/mono    (11640): OpenTK.Platform.Android.EglException: MakeCurrent failed with error 12299 (0x300b)
Comment 25 T.J. Purtell 2014-11-13 17:41:03 UTC
https://gist.github.com/Parmendrak/a242d2b3d3181e7ef7f7#file-outputlog-L875

> 11-13 23:24:42.000 E/Surface (11640): dequeueBuffer_DEPRECATED: Fence::wait returned an error: -4

That is -EINTR, so more rogue signals are afoot...
Comment 26 Radek Doulik 2014-11-14 05:16:18 UTC
(In reply to comment #24)
> @radek: Please review the OutputLog in Comment #23. Any ideas?
> 
> > E/mono    (11640): OpenTK.Platform.Android.EglException: MakeCurrent failed with error 12299 (0x300b)

Will take a look. I saw such error once when MakeCurrent was called from a wrong thread.
Comment 27 Radek Doulik 2014-11-14 09:17:18 UTC
I tried to reproduce on Galaxy S4 and don't see the exception there, so hopefully it is fixed. (tested with XA 4.20.0.25)
Comment 28 Parmendra Kumar 2014-11-17 04:33:09 UTC
I have checked this issue on a Galaxy S4 and now it's working fine.

Environment Info:
Microsoft Visual Studio Professional 2013
Version 12.0.30723.00 Update 3
Microsoft .NET Framework
Version 4.5.51641

Xamarin   3.8.145.0
Xamarin.Android   4.20.0.26