Bug 14129 - Android App crashes with signal 4 or 11 if actions are passed between threads (4.8.2)
Summary: Android App crashes with signal 4 or 11 if actions are passed between threads...
Status: VERIFIED FIXED
Alias: None
Product: Android
Classification: Xamarin
Component: Mono runtime / AOT Compiler
Version: 4.8.x
Hardware: All
OS: All
Importance: High normal
Target Milestone: 4.12.0 (KitKat)
Assignee: Rodrigo Kumpera
URL:
Depends on: 14050
Blocks:
Reported: 2013-08-21 01:36 UTC by T.J. Purtell
Modified: 2014-01-22 04:32 UTC
CC List: 10 users

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments:
- Repro for crashing with signal 4/11 (1.66 KB, text/plain) - 2013-08-21 01:36 UTC, T.J. Purtell
- debug.mono.log=all on HTC One Android 4.1.2 (1.20 MB, application/x-zip-compressed) - 2013-09-28 01:32 UTC, T.J. Purtell
- test case: deep recursive on java main thread while GCing from a background thread (2.42 KB, text/plain) - 2013-10-14 01:47 UTC, T.J. Purtell
- test case: kick off java gc and mono gc while executing a 1000 frame call on main thread via handler.post (2.79 KB, text/plain) - 2013-10-14 02:47 UTC, T.J. Purtell
- test case: simultaneous gc kick off while running a long (non-recursive) call via Handler.Post (2.96 KB, text/plain) - 2013-10-18 23:17 UTC, T.J. Purtell
- Linux Kernel patch that illustrates the original source of this bug (838 bytes, application/octet-stream) - 2013-10-24 02:12 UTC, T.J. Purtell
- ARM Architecture Clipping highlighting that the kernel is relying on unspecified behavior (49.58 KB, application/pdf) - 2013-10-24 14:39 UTC, T.J. Purtell



Description T.J. Purtell 2013-08-21 01:36:42 UTC
Created attachment 4682 [details]
Repro for crashing with signal 4/11

I have been receiving random crashes since the update to the initial Xamarin Android 4.8. These have persisted across all versions up to the latest available, 4.8.2. My app would typically die with signal 4 or 11. I have found that passing Actions and other function types safely between threads while poking the garbage collector causes these crashes reliably.

Attached is a sample that replaces the code of the default Android project created by Visual Studio. It crashes within a minute or so on my device with signal 4 or 11, or else hangs the app (it's random). My device is an HTC One running Android 4.1.2.
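
For context, a minimal sketch of the pattern described above (this is not the attached repro; class names, thread structure, and timings are illustrative): one thread produces Action delegates that capture freshly allocated state, a second thread consumes and invokes them, and a third thread keeps poking the garbage collector so thread suspension happens frequently.

using System;
using System.Collections.Concurrent;
using System.Threading;

class ActionPassingSketch
{
    // Bounded so the producer blocks instead of flooding the heap.
    static readonly BlockingCollection<Action> queue = new BlockingCollection<Action>(1024);

    public static void Run()
    {
        // Producer: create delegates that capture freshly allocated state.
        new Thread(() => {
            var rand = new Random();
            while (true) {
                int payload = rand.Next();
                queue.Add(() => Console.WriteLine("payload " + payload));
            }
        }).Start();

        // Consumer: invoke the delegates on a different thread.
        new Thread(() => {
            foreach (var action in queue.GetConsumingEnumerable())
                action();
        }).Start();

        // GC poker: keep collections (and therefore thread suspends) frequent.
        new Thread(() => {
            while (true) {
                GC.Collect();
                Thread.Sleep(10);
            }
        }).Start();
    }
}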
Comment 1 T.J. Purtell 2013-08-21 12:23:14 UTC
This test case doesn't seem to repro for me within 3 minutes on the following other devices I have tried:
- Motorola Atrix 4G running 2.3.6
- Mono for Android API 10 emulator
- Original Nexus 7 - Android 4.3
Comment 2 T.J. Purtell 2013-08-21 14:26:20 UTC
Note: always using the Release build.
Comment 3 T.J. Purtell 2013-08-22 17:21:58 UTC
The repro also happens with 4.8.1 and with Debug builds. I see these random crashes with my app on other devices as well, but this particular reliable repro test case only works on my HTC One with Android 4.1.2.
Comment 5 Jonathan Pryor 2013-09-27 21:53:43 UTC
When you get a crash, what's the stack trace from `adb logcat`?

Is it a SIGSEGV? Does it look like Bug #13858?

> E/mono-rt(1600): Stacktrace:
> E/mono-rt(1600): at <unknown> <0xffffffff>
> E/mono-rt(1600): at (wrapper managed-to-native) object.wrapper_native_0x8233c760 (intptr,intptr,intptr)
> E/mono-rt(1600): at Android.Runtime.JNIEnv.CallVoidMethod (intptr,intptr) [0x00005] 
> ...
> E/mono-rt(1600): =================================================================
> E/mono-rt(1600): Got a SIGSEGV while executing native code.  This usually indicates
> E/mono-rt(1600): a fatal error in the mono runtime or one of the native libraries
> E/mono-rt(1600): used by your application.
> E/mono-rt(1600): =================================================================

In particular, is JNIEnv.Call*Method() near the top of the stack trace?

I've run it for several minutes on a Nexus 10 w/ Android v4.3, and not hit a crash yet. I'll let it run overnight...
Comment 6 T.J. Purtell 2013-09-28 01:30:18 UTC
Thanks for having a look at this. Unfortunately, neither the debug nor the release version produces a mono-rt stacktrace. I enabled the full mono debug log and will attach it (it still doesn't include a stack trace). At the end it says core dumped, but I don't know where to get the core dump file. Any suggestions? Do I need to root the phone and look in /data/tombstones?

This specific test case did not reproduce for me on the other devices I use for testing, including a Galaxy Nexus, Nexus S, Motorola Atrix, and Nexus 7 first gen. 

I was thinking it might have something to do with the device having 4 cores. Is there anything the GC does differently if it has more cores available?
Comment 7 T.J. Purtell 2013-09-28 01:32:29 UTC
Created attachment 5015 [details]
debug.mono.log=all on HTC One Android 4.1.2
Comment 12 Jonathan Pryor 2013-09-28 09:56:16 UTC
> I've run it for several minutes on a Nexus 10 w/ Android v4.3, and not hit a
> crash yet. I'll let it run overnight...

My overnight run is complete. It did NOT crash. It DID hang; the UI was unresponsive come morning.

Guess I'll need to run within gdb to see where it's hanging...
Comment 13 T.J. Purtell 2013-09-30 03:22:46 UTC
So, I discovered the command "adb shell setprop log.redirect-stdio 1", which supposedly should make the crash traces end up in logcat. The really weird thing is that now the test application **freezes** within one minute rather than crashing with a signal 4 or 11. I rooted the phone so that I could poke around more, but I still haven't figured out what magic to invoke to get a core dump written out.
Comment 15 T.J. Purtell 2013-10-02 00:40:24 UTC
I managed to get a little more information out of GDB, clarifying that the SIGILL is actually an ILLOPC. I haven't managed to get a reasonable stack trace, but it seems to repeatedly happen at a BL instruction. The CPSR indicates that the chip is not in Thumb mode, so I believe the disassembly was correct. I did a bit of disassembling in IDA and found that, at least in one case, the particular call seemed to be a fetch of errno (from an unnamed function). Note that the trace below isn't the one I did the math on to discover that the BL target was an errno function call. I did load the libraries and verify that I could set breakpoints in the named functions, but the crash never seemed to include any of those on the stack. I don't know whether I am using the Android GDB incorrectly, but a command like "disass mono_jit_exec" did give sensible results, so I think that only unnamed functions in the binaries bundled in the apk are on the stack at this time.


>>>>>> some gdb info <<<<<<<<
Program received signal SIGPWR, Power fail/restart.

Program received signal SIGXCPU, CPU time limit exceeded.

Program received signal SIGXCPU, CPU time limit exceeded.

Program received signal SIGPWR, Power fail/restart.

Program received signal SIGPWR, Power fail/restart.

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 21773]
0x6ad1c45c in ?? ()
(gdb) x/i $pc
Cannot access memory at address 0xfffffe10
=> 0x6ad1c45c:  bl      0x6aba23d8
(gdb) info registers
r0             0x1e     30
r1             0x6f1d4698       1864189592
r2             0x6f1d4718       1864189720
r3             0x33c6edf        54292191
r4             0x0      0
r5             0x6f1d4718       1864189720
r6             0x6d3cea30       1832708656
r7             0x5f     95
r8             0x40980140       1083703616
r9             0x9      9
r10            0x0      0
r11            0xfffffe10       4294966800
r12            0x6d3bc1c4       1832632772
sp             0x6f1d4698       0x6f1d4698
lr             0xffff050c       -64244
pc             0x6ad1c45c       0x6ad1c45c
cpsr           0x10     16
(gdb) p $_siginfo
$3 = {si_signo = 4, si_errno = 0, si_code = 1, _sifields = {_pad = {
      1792132188, 301244, -1093564636, -1093564512, 0, 269440, 174239,
      1074700288, 13991162, 0, 241411734, 262180, 13991162, 1, 307064, 0,
      204132, 218, 5056, 436, -1073741824, 0, 204132, 0, 307064, 307064, 0,
      307064, 307064}, _kill = {si_pid = 1792132188, si_uid = 301244},
    _timer = {si_tid = 1792132188, si_overrun = 301244, si_sigval = {
        sival_int = -1093564636, sival_ptr = 0xbed18724}}, _rt = {
      si_pid = 1792132188, si_uid = 301244, si_sigval = {
        sival_int = -1093564636, sival_ptr = 0xbed18724}}, _sigchld = {
      si_pid = 1792132188, si_uid = 301244, si_status = -1093564636,
      si_utime = -1093564512, si_stime = 0}, _sigfault = {
      si_addr = 0x6ad1c45c}, _sigpoll = {si_band = 1792132188,
      si_fd = 301244}}}
(gdb) bt
#0  0x6ad1c45c in ?? ()
#1  0xffff050c in ?? ()
Cannot access memory at address 0xfffffe10
#2  0xffff050c in ?? ()
Cannot access memory at address 0xfffffe10
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Comment 16 T.J. Purtell 2013-10-02 16:46:34 UTC
Now that I have become familiar with the steps to use GDB with Xamarin, I completely wiped the Mono frameworks and the test application off the phone before proceeding through them again. I was finally able to get the stack traces to include the method names from within libmonosgen-2.0.so and libmonodroid.so.

I have included backtraces, siginfo information, etc. to help identify the issue. The thread whose $_siginfo shows SIGILL/ILLOPC seems to have its si_addr pointing to a valid ARM instruction (BL). I also included the CPSR flags for each thread, but I don't know how to get these from the frame that triggered the SIGILL. If the code were executed in Thumb mode, then the second half of that word would be an invalid opcode. That would only make sense if the ARM instruction decoder decoded both instructions in one clock and faulted with the earlier address, or if some part of gdb or the kernel fudged the address so that si_addr didn't point to the exact invalid opcode.

The code looks reasonable when disassembled as ARM:
(gdb) disass $_siginfo._sifields._sigfault.si_addr
Dump of assembler code for function suspend_handler:
   0x6ad1c454 <+0>:     push    {r4, r5, r6, lr}
   0x6ad1c458 <+4>:     mov     r5, r2
=> 0x6ad1c45c <+8>:     bl      0x6aba23d8
   0x6ad1c460 <+12>:    ldr     r4, [r0]
   0x6ad1c464 <+16>:    bl      0x6ad70100 <mono_thread_info_current>
   0x6ad1c468 <+20>:    cmp     r0, #0
   0x6ad1c46c <+24>:    beq     0x6ad1c478 <suspend_handler+36>
   0x6ad1c470 <+28>:    mov     r1, r5
   0x6ad1c474 <+32>:    bl      0x6ad1c2a0 <suspend_thread>
   0x6ad1c478 <+36>:    bl      0x6aba23d8
   0x6ad1c47c <+40>:    str     r4, [r0]
   0x6ad1c480 <+44>:    pop     {r4, r5, r6, pc}
End of assembler dump.

The first invalid instruction in the function, if it were interpreted as Thumb code, is in the second half of the word containing the faulting BL:
(gdb) set arm force-mode thumb
(gdb) disass $_siginfo._sifields._sigfault.si_addr
Dump of assembler code for function suspend_handler:
   0x6ad1c454 <+0>:     eors    r0, r6
   0x6ad1c456 <+2>:     stmdb   sp!, {r1, r12, lr}
   0x6ad1c45a <+6>:     b.n     0x6ad1c79e <sgen_sort_addresses+98>
=> 0x6ad1c45c <+8>:     asrs    r5, r3, #31
   0x6ad1c45e <+10>:                    ; <UNDEFINED> instruction: 0xebfa4000
   0x6ad1c462 <+14>:    b.n     0x6ad1bf86 <mono_image_create_pefile+5054>
   0x6ad1c464 <+16>:    ldr     r7, [pc, #148]  ; (0x6ad1c4fc <sgen_wait_for_sus
pend_ack+36>)
   0x6ad1c466 <+18>:    add.w   r0, r1, r0
   0x6ad1c46a <+22>:    b.n     0x6ad1cb0e <mono_gc_set_allow_synchronous_major+
26>
   0x6ad1c46c <+24>:    movs    r1, r0
   0x6ad1c46e <+26>:    lsrs    r0, r0, #8
   0x6ad1c470 <+28>:    asrs    r5, r0, #32
   0x6ad1c472 <+30>:    b.n     0x6ad1c7b6 <sgen_sort_addresses+122>
   0x6ad1c474 <+32>:                    ; <UNDEFINED> instruction: 0xff89ebff
   0x6ad1c478 <+36>:    asrs    r6, r2, #31
   0x6ad1c47a <+38>:                    ; <UNDEFINED> instruction: 0xebfa4000
   0x6ad1c47e <+42>:    b.n     0x6ad1bf82 <mono_image_create_pefile+5050>
   0x6ad1c480 <+44>:    strh    r0, [r6, #2]
   0x6ad1c482 <+46>:    ldmia.w sp!, {r4, lr}
End of assembler dump.

At this point, I am not quite sure what else I can provide you to help.  Is there any progress towards resolving this issue? 


MORE GDB DETAILS
----------------


>>> for thread 3 (sigill) x/i $_siginfo._sifields._sigfault.si_addr <<<
=> 0x6ad1c45c <suspend_handler+8>:      bl      0x6aba23d8


>>>> thread apply all bt <<<<
Thread 15 (Thread 25877):
#0  0x40111bd0 in ?? ()
Cannot access memory at address 0x1
#1  0x4012d1e0 in ?? ()
#2  0x4012d1e0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 14 (Thread 25876):
#0  0x40111bd0 in ?? ()
Cannot access memory at address 0x1
#1  0x4012d1e0 in ?? ()
#2  0x4012d1e0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 13 (Thread 25862):
#0  0x40112ce0 in ?? ()
Cannot access memory at address 0x0
#1  0x4012e4a8 in ?? ()
#2  0x4012e4a8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 12 (Thread 25797):
#0  0x40111bd0 in ?? ()
Cannot access memory at address 0x1
#1  0x4012d1e0 in ?? ()
#2  0x4012d1e0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 11 (Thread 25796):
#0  0x40111bd0 in ?? ()
Cannot access memory at address 0x1
#1  0x4012d1e0 in ?? ()
#2  0x4012d1e0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 10 (Thread 25795):
#0  0x40112ce0 in ?? ()
Cannot access memory at address 0xfffffe10
#1  0x40118094 in ?? ()
Cannot access memory at address 0xfffffe10
#2  0x40118094 in ?? ()
Cannot access memory at address 0xfffffe10
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 9 (Thread 25794):
#0  0x40112ce0 in ?? ()
Cannot access memory at address 0xfffffe10
#1  0x40118094 in ?? ()
Cannot access memory at address 0xfffffe10
#2  0x40118094 in ?? ()
Cannot access memory at address 0xfffffe10
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 8 (Thread 25793):
#0  0x40112ce0 in ?? ()
Cannot access memory at address 0xfffffe10
#1  0x40118094 in ?? ()
Cannot access memory at address 0xfffffe10
#2  0x40118094 in ?? ()
Cannot access memory at address 0xfffffe10
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 7 (Thread 25792):
#0  0x40112ce0 in ?? ()
Cannot access memory at address 0x1
#1  0x40117210 in ?? ()
Cannot access memory at address 0x1
#2  0x40117210 in ?? ()
Cannot access memory at address 0x1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 6 (Thread 25791):
#0  0x40111cf4 in ?? ()
Cannot access memory at address 0x1
#1  0x409ca18e in ?? ()
#2  0x409ca18e in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 5 (Thread 25790):
#0  0x40112594 in ?? ()
Cannot access memory at address 0x4a3d0
#1  0x4011f4c6 in ?? ()
#2  0x4011f4c6 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 4 (Thread 25789):
#0  0x40112ce4 in ?? ()
Cannot access memory at address 0x1
#1  0x40118094 in ?? ()
Cannot access memory at address 0x1
#2  0x40118094 in ?? ()
Cannot access memory at address 0x1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 3 (Thread 25861):
#0  suspend_handler (sig=30, siginfo=0x6f1d4698, context=0x6f1d4718)
    at /Users/builder/data/lanes/monodroid-mlion-monodroid-4.8.2-branch/3c344aeb
/source/mono/mono/metadata/sgen-os-posix.c:129
#1  0x000064b8 in ?? ()
#2  0x000064b8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 25799):
#0  0x4011253c in ?? ()
Cannot access memory at address 0xe

Thread 1 (Thread 25785):
#0  0x4011253c in ?? ()
#1  0x6ad1c368 in suspend_thread (info=0x6a48f180, context=<optimized out>)
    at /Users/builder/data/lanes/monodroid-mlion-monodroid-4.8.2-branch/3c344aeb
/source/mono/mono/metadata/sgen-os-posix.c:113
#2  0x000000f0 in ?? ()
#3  0x000000f0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

>>>> print $_siginfo  for thread 1 ( in suspend handler )<<<<
$28 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}
	  
>>>> print $_siginfo for thread 3 (in suspend_handler) <<<
$29 = {si_signo = 4, si_errno = 0, si_code = 1, _sifields = {_pad = {
      1792132188, 0, -782532608, 0, 136, 1, 0, -641912640, -1072992484,
      -1096640704, 1610612755, 0, 0, -782524584, -641912640, 0, 0,
      1792132188, 1, -782532608, 0, -1096640812, -1073117680, -641912640, 0,
      -1096640964, -1073115824, -1096640964, -1096640952}, _kill = {
      si_pid = 1792132188, si_uid = 0}, _timer = {si_tid = 1792132188,
      si_overrun = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _rt = {si_pid = 1792132188, si_uid = 0,
      si_sigval = {sival_int = -782532608, sival_ptr = 0xd15b8000}},
    _sigchld = {si_pid = 1792132188, si_uid = 0, si_status = -782532608,
      si_utime = 0, si_stime = 136}, _sigfault = {si_addr = 0x6ad1c45c},
    _sigpoll = {si_band = 1792132188, si_fd = 0}}}
	
>>>> thread apply all print $_siginfo <<<<
Thread 15 (Thread 25877):
$30 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 14 (Thread 25876):
$31 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 13 (Thread 25862):
$32 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 12 (Thread 25797):
$33 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 11 (Thread 25796):
$34 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 10 (Thread 25795):
$35 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 9 (Thread 25794):
$36 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 8 (Thread 25793):
$37 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 7 (Thread 25792):
$38 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 6 (Thread 25791):
$39 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 5 (Thread 25790):
$40 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 4 (Thread 25789):
$41 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 3 (Thread 25861):
$42 = {si_signo = 4, si_errno = 0, si_code = 1, _sifields = {_pad = {
      1792132188, 0, -782532608, 0, 136, 1, 0, -641912640, -1072992484,
      -1096640704, 1610612755, 0, 0, -782524584, -641912640, 0, 0,
      1792132188, 1, -782532608, 0, -1096640812, -1073117680, -641912640, 0,
      -1096640964, -1073115824, -1096640964, -1096640952}, _kill = {
      si_pid = 1792132188, si_uid = 0}, _timer = {si_tid = 1792132188,
      si_overrun = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _rt = {si_pid = 1792132188, si_uid = 0,
      si_sigval = {sival_int = -782532608, sival_ptr = 0xd15b8000}},
    _sigchld = {si_pid = 1792132188, si_uid = 0, si_status = -782532608,
      si_utime = 0, si_stime = 136}, _sigfault = {si_addr = 0x6ad1c45c},
    _sigpoll = {si_band = 1792132188, si_fd = 0}}}

Thread 2 (Thread 25799):
$43 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

Thread 1 (Thread 25785):
$44 = {si_signo = 19, si_errno = 0, si_code = -6, _sifields = {_pad = {25848,
      0, -782532608, 0, 136, 1, 0, -641912640, -1072992484, -1096640704,
      1610612755, 0, 0, -782524584, -641912640, 0, 0, 1792132188, 1,
      -782532608, 0, -1096640812, -1073117680, -641912640, 0, -1096640964,
      -1073115824, -1096640964, -1096640952}, _kill = {si_pid = 25848,
      si_uid = 0}, _timer = {si_tid = 25848, si_overrun = 0, si_sigval = {
        sival_int = -782532608, sival_ptr = 0xd15b8000}}, _rt = {
      si_pid = 25848, si_uid = 0, si_sigval = {sival_int = -782532608,
        sival_ptr = 0xd15b8000}}, _sigchld = {si_pid = 25848, si_uid = 0,
      si_status = -782532608, si_utime = 0, si_stime = 136}, _sigfault = {
      si_addr = 0x64f8}, _sigpoll = {si_band = 25848, si_fd = 0}}}

>>> thread apply all printf "%x", $cpsr <<<
>>> everyone is in regular arm mode <<<<
Thread 15 (Thread 25877):
Cannot access memory at address 0x1
10
Thread 14 (Thread 25876):
Cannot access memory at address 0x1
10
Thread 13 (Thread 25862):
Cannot access memory at address 0x0
60000010
Thread 12 (Thread 25797):
Cannot access memory at address 0x1
10
Thread 11 (Thread 25796):
Cannot access memory at address 0x1
10
Thread 10 (Thread 25795):
Cannot access memory at address 0xfffffe10
60000010
Thread 9 (Thread 25794):
Cannot access memory at address 0xfffffe10
60000010
Thread 8 (Thread 25793):
Cannot access memory at address 0xfffffe10
60000010
Thread 7 (Thread 25792):
Cannot access memory at address 0x1
60000010
Thread 6 (Thread 25791):
Cannot access memory at address 0x1
40000010
Thread 5 (Thread 25790):
Cannot access memory at address 0x4a3d0
20000010
Thread 4 (Thread 25789):
Cannot access memory at address 0x1
60000010
Thread 3 (Thread 25861):
10
Thread 2 (Thread 25799):
Cannot access memory at address 0xe	  
40000010
Thread 1 (Thread 25785):
40000010

>>>> thread apply all x/i $pc <<<<
>>>> all these opcodes look ok <<<<
Thread 15 (Thread 25877):
Cannot access memory at address 0x1
=> 0x40111bd0:  svc     0x00000000

Thread 14 (Thread 25876):
Cannot access memory at address 0x1
=> 0x40111bd0:  svc     0x00000000

Thread 13 (Thread 25862):
Cannot access memory at address 0x0
=> 0x40112ce0:  svc     0x00000000

Thread 12 (Thread 25797):
Cannot access memory at address 0x1
=> 0x40111bd0:  svc     0x00000000

Thread 11 (Thread 25796):
Cannot access memory at address 0x1
=> 0x40111bd0:  svc     0x00000000

Thread 10 (Thread 25795):
Cannot access memory at address 0xfffffe10
=> 0x40112ce0:  svc     0x00000000

Thread 9 (Thread 25794):
Cannot access memory at address 0xfffffe10
=> 0x40112ce0:  svc     0x00000000

Thread 8 (Thread 25793):
Cannot access memory at address 0xfffffe10
=> 0x40112ce0:  svc     0x00000000

Thread 7 (Thread 25792):
Cannot access memory at address 0x1
=> 0x40112ce0:  svc     0x00000000

Thread 6 (Thread 25791):
Cannot access memory at address 0x1
=> 0x40111cf4:  svc     0x00000000

Thread 5 (Thread 25790):
Cannot access memory at address 0x4a3d0
=> 0x40112594:  pop     {r4, r7}

Thread 4 (Thread 25789):
Cannot access memory at address 0x1
=> 0x40112ce4:  pop     {r4, r7}

Thread 3 (Thread 25861):
=> 0x6ad1c45c <suspend_handler+8>:      bl      0x6aba23d8

Thread 2 (Thread 25799):
Cannot access memory at address 0xe
=> 0x4011253c:  svc     0x00000000

Thread 1 (Thread 25785):
=> 0x4011253c:  svc     0x00000000
Comment 17 T.J. Purtell 2013-10-14 01:46:12 UTC
I have been trying to find modifications to my test case that produce other visible error logs. I managed to create a variant of this test case that causes an assertion log statement.

10-13 22:40:07.672: A/(2709): * Assertion at /Users/builder/data/lanes/monodroid-mlion-monodroid-4.8.2-branch/a25a31d0/source/mono/mono/metadata/sgen-os-posix.c:90, condition `!info->stack_start' not met
10-13 22:40:07.672: E/mono-rt(2709): Stacktrace:
10-13 22:40:07.672: E/mono-rt(2709): =================================================================
10-13 22:40:07.672: E/mono-rt(2709): Got a SIGSEGV while executing native code. This usually indicates
10-13 22:40:07.672: E/mono-rt(2709): a fatal error in the mono runtime or one of the native libraries 
10-13 22:40:07.672: E/mono-rt(2709): used by your application.
10-13 22:40:07.672: E/mono-rt(2709): =================================================================


Code for this test case will be attached momentarily. It doesn't always produce a log message; sometimes it produces a signal 4/11 with no error message. Sometimes it produces an assert with no backtrace (as in the above). Other times it has produced a stack trace which shows some C# methods.

F/        (27712): * Assertion at /Users/builder/data/lanes/monodroid-mlion-monodroid-4.8.2-branch/a25a31d0/source/mono/mono/metadata/sgen-os-posix.c:90, condition `!info->stack_start' not met
E/mono-rt (27712): Stacktrace:
E/mono-rt (27712): 
E/mono-rt (27712):   at <unknown> <0xffffffff>
E/mono-rt (27712):   at (wrapper managed-to-native) System.Environment.get_TickCount () <0xffffffff>
E/mono-rt (27712):   at System.Random..ctor () <0x0000f>
E/mono-rt (27712):   at AndroidApplication3.LongCallFromJavaThread.Recurser (int) <0x0001f>
E/mono-rt (27712):   at AndroidApplication3.LongCallFromJavaThread.Recurser (int) <0x00053>
... repeats thousands of times (the test code is doing a recursive call) ...
E/mono-rt (27712):   at AndroidApplication3.LongCallFromJavaThread.Recurser (int) <0x00053>
E/mono-rt (27712):   at AndroidApplication3.LongCallFromJavaThread/<>c__DisplayClass1.<ReekHavok>b__0 () <0x00013>
E/mono-rt (27712):   at Java.Lang.Thread/RunnableImplementor.Run () <0x0003f>
E/mono-rt (27712):   at Java.Lang.IRunnableInvoker.n_Run (intptr,intptr) <0x0003b>
E/mono-rt (27712):   at (wrapper dynamic-method) object.c07f6396-4971-40a4-a0a0-d27678123b34 (intptr,intptr) <0x0003b>
E/mono-rt (27712):   at (wrapper native-to-managed) object.c07f6396-4971-40a4-a0a0-d27678123b34 (intptr,intptr) <0xffffffff>
E/mono-rt (27712): 
E/mono-rt (27712): =================================================================
E/mono-rt (27712): Got a SIGSEGV while executing native code. This usually indicates
E/mono-rt (27712): a fatal error in the mono runtime or one of the native libraries 
E/mono-rt (27712): used by your application.
E/mono-rt (27712): =================================================================
E/mono-rt (27712):
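
For readers without the attached test case, a rough sketch of its shape as described above (illustrative names; the attachment is authoritative): a deep recursive call chain runs on the Java main thread while a background thread repeatedly forces Mono collections.

using System;
using System.Threading;
using Android.App;

static class DeepRecursionSketch
{
    static int Recurser(int depth)
    {
        var r = new Random();                 // the logged trace shows a Random per frame
        if (depth <= 0)
            return r.Next();
        return Recurser(depth - 1) ^ r.Next(2);
    }

    public static void Run(Activity activity)
    {
        // Background thread: repeatedly force Mono collections.
        new Thread(() => {
            while (true) {
                GC.Collect();
                Thread.Sleep(20);
            }
        }).Start();

        // Deep call chain on the Java main thread; depth here is illustrative,
        // the attached case recursed much deeper (see Comment 25).
        activity.RunOnUiThread(() => Console.WriteLine("result " + Recurser(1000)));
    }
}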
Comment 18 T.J. Purtell 2013-10-14 01:47:21 UTC
Created attachment 5141 [details]
test case: deep recursive on java main thread while GCing from a background thread
Comment 20 T.J. Purtell 2013-10-14 02:47:12 UTC
Created attachment 5142 [details]
test case: kick off java gc and mono gc while executing a 1000 frame call on main thread via handler.post
Comment 21 T.J. Purtell 2013-10-14 02:49:53 UTC
I attached another test case that uses semaphores to attempt to trigger both a Java GC and a Mono GC simultaneously while executing some simple C# code on the main Java thread. On my quad-core device (HTC One, Android 4.1.2) this crashes extremely quickly (70% of the time within 1 second) and always with a signal 4.
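
A rough sketch of the coordination described above (not the attached test case itself; it assumes the Handler.Post(System.Action) convenience overload and uses Java.Lang.JavaSystem.Gc(), the binding of java.lang.System.gc()): two workers wait on a semaphore so a Java GC and a Mono GC start at roughly the same moment, while simple managed work is posted to the main Java thread.

using System;
using System.Threading;
using Android.OS;

class SimultaneousGcSketch
{
    static readonly SemaphoreSlim start = new SemaphoreSlim(0);

    public static void Run(Handler mainHandler)   // a Handler bound to the main looper
    {
        // Worker 1: Java-side collection.
        new Thread(() => {
            while (true) {
                start.Wait();
                Java.Lang.JavaSystem.Gc();
            }
        }).Start();

        // Worker 2: Mono-side collection.
        new Thread(() => {
            while (true) {
                start.Wait();
                GC.Collect();
            }
        }).Start();

        // Driver: release both workers together, then post simple managed work
        // to the main Java thread.
        new Thread(() => {
            int pass = 0;
            while (true) {
                start.Release(2);
                mainHandler.Post(() => Console.WriteLine("pass " + ++pass));
                Thread.Sleep(50);
            }
        }).Start();
    }
}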
Comment 22 T.J. Purtell 2013-10-14 16:29:57 UTC
The first and second test cases crash quickly on a MotoX as well.  The second test case crashes with the same assertion at least some of the time.
Comment 24 Jonathan Pryor 2013-10-18 17:22:11 UTC
> Is there any chance that Qualcomm provides an alternative optimized garbage
> collector which violates the assumptions drawn from the open source dalvik
> code?  

Anything is possible, though I've tried to stick to straight JNI. I don't believe that we're making any Dalvik-oriented assumptions.

> I am wondering why there has not been more activity on this bug.

Our biggest hurdle has been simply trying to reproduce it. We have so far been unable to reproduce it on the test hardware we have at hand.

Unfortunately I seem to have missed Comment #18 and Comment #20 (the new testcases). I'll try those shortly.
Comment 25 Jonathan Pryor 2013-10-18 22:19:58 UTC
Attachment #5142 [details] doesn't crash on my Nexus 10/Android v4.3, even after running for hours.

Attachment #5141 [details] crashes (for me) with a stack overflow:

> F/        (29227): * Assertion at /Users/jon/Development/xamarin/mono/mono/metadata/sgen-os-posix.c:89, condition `!info->stack_start' not met
> E/mono-rt (29227): Stacktrace:
> E/mono-rt (29227): 
> E/mono-rt (29227):   at <unknown> <0xffffffff>
> E/mono-rt (29227):   at (wrapper managed-to-native) System.Environment.get_TickCount () <IL 0x00020, 0xffffffff>
> E/mono-rt (29227):   at System.Random..ctor () <IL 0x00001, 0x0000f>
> E/mono-rt (29227):   at Scratch.Bxc14129LongCall.MainActivity.Recurser (int) [0x00001] in /Users/jon/Development/Projects/Scratch.Bxc14129/Scratch.Bxc14129-LongCall/MainActivity.cs:39
> E/mono-rt (29227):   at Scratch.Bxc14129LongCall.MainActivity.Recurser (int) [0x00018] in /Users/jon/Development/Projects/Scratch.Bxc14129/Scratch.Bxc14129-LongCall/MainActivity.cs:42
# ...repeat a lot...

This is probably due to line 51:

	Recurser(10000);

I highly doubt that you can get 10000 nested method calls, so this isn't completely unexpected, and IIRC on pre-Android v4.1/v4.2 targets (the ones that don't copy stdout/stderr to logcat), this would cause _exactly_ what you described: app death with no notice of any kind.

Consequently, as-written Attachment #5141 [details] is invalid.

If I change `Recurser(10000)` to `Recurser(1000)`, the app doesn't crash instantly. (It has run for several tens of seconds at the time of this writing; time to let it run longer...)
Comment 26 T.J. Purtell 2013-10-18 22:50:18 UTC
OK, thank you for taking the time to look at this. I thought getting an assert was promising, as it might lead to a discovery about the cause, but this seems to just be a stack blowout. I also ran a variant of this case with no recursion, where the calls run in a loop, and I still get the crashes in that case.

Do you have access to an HTC One device? I have 3 of these devices and they all exhibit these random signal 4/11 errors. It's a pretty notable phone, so I don't think it's unreasonable to expect that it should be fully supported by Xamarin.

I ran some of these tests on my co-worker's new MotoX and that exhibited crashes as well. I will make sure we re-run these without the large recursion depth.
Comment 27 T.J. Purtell 2013-10-18 23:17:30 UTC
Created attachment 5177 [details]
test case: simultaneous gc kick off while running a long (non-recursive) call via Handler.Post

This is a replacement for test case 5142 that:
- prints a little bit of log information so that I can better illustrate the crashing behavior with excerpts from logcat
- does not use heavy recursion at all, to eliminate any auxiliary problem that might come from stack-size handling.

It still crashes instantly on the HTC One, a quad-core Qualcomm flagship Android device.
Comment 28 T.J. Purtell 2013-10-18 23:22:14 UTC
Here are excerpts from the crash log indicating the failure; this covers several runs.

10-18 20:11:01.670: W/MonoDroid-Debugger(7137): Not starting the debugger as the timeout value has been reached; current-time: 1382152261  timeout: 1382004576
10-18 20:11:01.710: W/monodroid-gc(7137): GREF GC Threshold: 46800
10-18 20:11:02.320: I/mono-stdout(7137): pass 1
10-18 20:11:02.340: I/mono-stdout(7137): pass 2
10-18 20:11:02.360: D/Zygote(189): Process 7137 terminated by signal (4)
10-18 20:11:04.673: W/MonoDroid-Debugger(7176): Not starting the debugger as the timeout value has been reached; current-time: 1382152264  timeout: 1382004576
10-18 20:11:04.703: W/monodroid-gc(7176): GREF GC Threshold: 46800
10-18 20:11:05.274: I/mono-stdout(7176): pass 1
10-18 20:11:05.314: I/mono-stdout(7176): pass 2
10-18 20:11:05.374: I/mono-stdout(7176): pass 3
10-18 20:11:05.414: I/mono-stdout(7176): pass 4
10-18 20:11:05.454: I/mono-stdout(7176): pass 5
10-18 20:11:05.504: I/mono-stdout(7176): pass 6
10-18 20:11:05.504: I/mono-stdout(7176): pass 7
10-18 20:11:05.514: I/mono-stdout(7176): pass 8
10-18 20:11:05.514: I/mono-stdout(7176): pass 9
10-18 20:11:05.684: I/mono-stdout(7176): pass 10
10-18 20:11:05.684: I/mono-stdout(7176): pass 11
10-18 20:11:05.694: I/mono-stdout(7176): pass 12
10-18 20:11:05.754: D/Zygote(189): Process 7176 terminated by signal (4)
10-18 20:11:06.805: W/MonoDroid-Debugger(7196): Not starting the debugger as the timeout value has been reached; current-time: 1382152266  timeout: 1382004576
10-18 20:11:06.835: W/monodroid-gc(7196): GREF GC Threshold: 46800
10-18 20:11:07.386: I/mono-stdout(7196): pass 1
10-18 20:11:07.416: I/mono-stdout(7196): pass 2
10-18 20:11:07.456: I/mono-stdout(7196): pass 3
10-18 20:11:07.466: D/Zygote(189): Process 7196 terminated by signal (4)
10-18 20:11:08.877: W/MonoDroid-Debugger(7218): Not starting the debugger as the timeout value has been reached; current-time: 1382152268  timeout: 1382004576
10-18 20:11:08.907: W/monodroid-gc(7218): GREF GC Threshold: 46800
10-18 20:11:09.478: I/mono-stdout(7218): pass 1
10-18 20:11:09.508: I/mono-stdout(7218): pass 2
10-18 20:11:09.548: I/mono-stdout(7218): pass 3
10-18 20:11:09.558: I/mono-stdout(7218): pass 4
10-18 20:11:09.648: I/mono-stdout(7218): pass 5
10-18 20:11:09.648: I/mono-stdout(7218): pass 6
10-18 20:11:09.738: I/mono-stdout(7218): pass 7
10-18 20:11:09.748: I/mono-stdout(7218): pass 8
10-18 20:11:09.748: I/mono-stdout(7218): pass 9
10-18 20:11:09.798: I/mono-stdout(7218): pass 10
10-18 20:11:09.928: I/mono-stdout(7218): pass 11
10-18 20:11:09.979: I/mono-stdout(7218): pass 12
10-18 20:11:10.019: I/mono-stdout(7218): pass 13
10-18 20:11:10.079: I/mono-stdout(7218): pass 14
10-18 20:11:10.089: D/Zygote(189): Process 7218 terminated by signal (4)
10-18 20:11:10.930: W/MonoDroid-Debugger(7238): Not starting the debugger as the timeout value has been reached; current-time: 1382152270  timeout: 1382004576
10-18 20:11:10.960: W/monodroid-gc(7238): GREF GC Threshold: 46800
10-18 20:11:11.560: I/mono-stdout(7238): pass 1
10-18 20:11:11.590: I/mono-stdout(7238): pass 2
10-18 20:11:11.630: I/mono-stdout(7238): pass 3
10-18 20:11:11.680: I/mono-stdout(7238): pass 4
10-18 20:11:11.720: I/mono-stdout(7238): pass 5
10-18 20:11:11.770: I/mono-stdout(7238): pass 6
10-18 20:11:11.820: I/mono-stdout(7238): pass 7
10-18 20:11:11.861: I/mono-stdout(7238): pass 8
10-18 20:11:11.861: I/mono-stdout(7238): pass 9
10-18 20:11:11.871: I/mono-stdout(7238): pass 10
10-18 20:11:11.921: I/mono-stdout(7238): pass 11
10-18 20:11:11.921: I/mono-stdout(7238): pass 12
10-18 20:11:11.921: I/mono-stdout(7238): pass 13
10-18 20:11:11.931: I/mono-stdout(7238): pass 14
10-18 20:11:12.091: I/mono-stdout(7238): pass 15
10-18 20:11:12.091: I/mono-stdout(7238): pass 16
10-18 20:11:12.191: I/mono-stdout(7238): pass 17
10-18 20:11:12.321: I/mono-stdout(7238): pass 18
10-18 20:11:12.321: I/mono-stdout(7238): pass 19
10-18 20:11:12.431: I/mono-stdout(7238): pass 20
10-18 20:11:12.441: I/mono-stdout(7238): pass 21
10-18 20:11:12.441: I/mono-stdout(7238): pass 22
10-18 20:11:12.611: I/mono-stdout(7238): pass 23
10-18 20:11:12.691: I/mono-stdout(7238): pass 24
10-18 20:11:12.701: D/Zygote(189): Process 7238 terminated by signal (4)
10-18 20:11:13.612: W/MonoDroid-Debugger(7263): Not starting the debugger as the timeout value has been reached; current-time: 1382152273  timeout: 1382004576
10-18 20:11:13.642: W/monodroid-gc(7263): GREF GC Threshold: 46800
10-18 20:11:14.233: I/mono-stdout(7263): pass 1
10-18 20:11:14.263: I/mono-stdout(7263): pass 2
10-18 20:11:14.323: I/mono-stdout(7263): pass 3
10-18 20:11:14.373: I/mono-stdout(7263): pass 4
10-18 20:11:14.423: I/mono-stdout(7263): pass 5
10-18 20:11:14.473: I/mono-stdout(7263): pass 6
10-18 20:11:14.513: I/mono-stdout(7263): pass 7
10-18 20:11:14.563: I/mono-stdout(7263): pass 8
10-18 20:11:14.613: I/mono-stdout(7263): pass 9
10-18 20:11:14.674: I/mono-stdout(7263): pass 10
10-18 20:11:14.724: I/mono-stdout(7263): pass 11
10-18 20:11:14.734: D/Zygote(189): Process 7263 terminated by signal (4)
10-18 20:11:16.095: W/MonoDroid-Debugger(7286): Not starting the debugger as the timeout value has been reached; current-time: 1382152276  timeout: 1382004576
10-18 20:11:16.115: W/monodroid-gc(7286): GREF GC Threshold: 46800
10-18 20:11:16.726: I/mono-stdout(7286): pass 1
10-18 20:11:16.766: I/mono-stdout(7286): pass 2
10-18 20:11:16.806: I/mono-stdout(7286): pass 3
10-18 20:11:16.856: I/mono-stdout(7286): pass 4
10-18 20:11:16.896: I/mono-stdout(7286): pass 5
10-18 20:11:16.936: I/mono-stdout(7286): pass 6
10-18 20:11:16.986: I/mono-stdout(7286): pass 7
10-18 20:11:16.986: I/mono-stdout(7286): pass 8
10-18 20:11:17.086: I/mono-stdout(7286): pass 9
10-18 20:11:17.086: I/mono-stdout(7286): pass 10
10-18 20:11:17.166: I/mono-stdout(7286): pass 11
10-18 20:11:17.216: I/mono-stdout(7286): pass 12
10-18 20:11:17.216: I/mono-stdout(7286): pass 13
10-18 20:11:17.266: D/Zygote(189): Process 7286 terminated by signal (4)
10-18 20:11:18.648: W/MonoDroid-Debugger(7305): Not starting the debugger as the timeout value has been reached; current-time: 1382152278  timeout: 1382004576
10-18 20:11:18.678: W/monodroid-gc(7305): GREF GC Threshold: 46800
10-18 20:11:19.258: I/mono-stdout(7305): pass 1
10-18 20:11:19.298: I/mono-stdout(7305): pass 2
10-18 20:11:19.339: I/mono-stdout(7305): pass 3
10-18 20:11:19.339: I/mono-stdout(7305): pass 4
10-18 20:11:19.339: I/mono-stdout(7305): pass 5
10-18 20:11:19.469: I/mono-stdout(7305): pass 6
10-18 20:11:19.519: I/mono-stdout(7305): pass 7
10-18 20:11:19.559: I/mono-stdout(7305): pass 8
10-18 20:11:19.629: I/mono-stdout(7305): pass 9
10-18 20:11:19.679: I/mono-stdout(7305): pass 10
10-18 20:11:19.679: I/mono-stdout(7305): pass 11
10-18 20:11:19.799: I/mono-stdout(7305): pass 12
10-18 20:11:19.849: I/mono-stdout(7305): pass 13
10-18 20:11:19.919: I/mono-stdout(7305): pass 14
10-18 20:11:19.979: I/mono-stdout(7305): pass 15
10-18 20:11:19.979: I/mono-stdout(7305): pass 16
10-18 20:11:20.089: I/mono-stdout(7305): pass 17
10-18 20:11:20.139: I/mono-stdout(7305): pass 18
10-18 20:11:20.199: I/mono-stdout(7305): pass 19
10-18 20:11:20.199: I/mono-stdout(7305): pass 20
10-18 20:11:20.310: I/mono-stdout(7305): pass 21
10-18 20:11:20.320: D/Zygote(189): Process 7305 terminated by signal (4)
10-18 20:11:21.371: W/MonoDroid-Debugger(7331): Not starting the debugger as the timeout value has been reached; current-time: 1382152281  timeout: 1382004576
10-18 20:11:21.391: W/monodroid-gc(7331): GREF GC Threshold: 46800
10-18 20:11:21.971: I/mono-stdout(7331): pass 1
10-18 20:11:22.001: I/mono-stdout(7331): pass 2
10-18 20:11:22.061: I/mono-stdout(7331): pass 3
10-18 20:11:22.101: I/mono-stdout(7331): pass 4
10-18 20:11:22.111: D/Zygote(189): Process 7331 terminated by signal (4)
10-18 20:11:23.993: W/MonoDroid-Debugger(7371): Not starting the debugger as the timeout value has been reached; current-time: 1382152283  timeout: 1382004576
10-18 20:11:24.014: W/monodroid-gc(7371): GREF GC Threshold: 46800
10-18 20:11:24.594: I/mono-stdout(7371): pass 1
10-18 20:11:24.624: I/mono-stdout(7371): pass 2
10-18 20:11:24.654: I/mono-stdout(7371): pass 3
10-18 20:11:24.694: I/mono-stdout(7371): pass 4
10-18 20:11:24.744: I/mono-stdout(7371): pass 5
10-18 20:11:24.794: I/mono-stdout(7371): pass 6
10-18 20:11:24.844: I/mono-stdout(7371): pass 7
10-18 20:11:24.904: I/mono-stdout(7371): pass 8
10-18 20:11:24.904: I/mono-stdout(7371): pass 9
10-18 20:11:24.914: I/mono-stdout(7371): pass 10
10-18 20:11:24.914: I/mono-stdout(7371): pass 11
10-18 20:11:25.105: I/mono-stdout(7371): pass 12
10-18 20:11:25.145: I/mono-stdout(7371): pass 13
10-18 20:11:25.195: I/mono-stdout(7371): pass 14
10-18 20:11:25.235: I/mono-stdout(7371): pass 15
10-18 20:11:25.285: I/mono-stdout(7371): pass 16
10-18 20:11:25.325: I/mono-stdout(7371): pass 17
10-18 20:11:25.385: I/mono-stdout(7371): pass 18
10-18 20:11:25.435: I/mono-stdout(7371): pass 19
10-18 20:11:25.495: I/mono-stdout(7371): pass 20
10-18 20:11:25.555: I/mono-stdout(7371): pass 21
10-18 20:11:25.595: I/mono-stdout(7371): pass 22
10-18 20:11:25.605: I/mono-stdout(7371): pass 23
10-18 20:11:25.715: I/mono-stdout(7371): pass 24
10-18 20:11:25.725: I/mono-stdout(7371): pass 25
10-18 20:11:25.835: I/mono-stdout(7371): pass 26
10-18 20:11:25.896: I/mono-stdout(7371): pass 27
10-18 20:11:25.916: D/Zygote(189): Process 7371 terminated by signal (4)
10-18 20:11:28.018: W/MonoDroid-Debugger(7416): Not starting the debugger as the timeout value has been reached; current-time: 1382152288  timeout: 1382004576
10-18 20:11:28.038: W/monodroid-gc(7416): GREF GC Threshold: 46800
10-18 20:11:28.598: I/mono-stdout(7416): pass 1
10-18 20:11:28.628: I/mono-stdout(7416): pass 2
10-18 20:11:28.668: I/mono-stdout(7416): pass 3
10-18 20:11:28.719: I/mono-stdout(7416): pass 4
10-18 20:11:28.759: I/mono-stdout(7416): pass 5
10-18 20:11:28.809: I/mono-stdout(7416): pass 6
10-18 20:11:28.849: I/mono-stdout(7416): pass 7
10-18 20:11:28.899: I/mono-stdout(7416): pass 8
10-18 20:11:28.949: I/mono-stdout(7416): pass 9
10-18 20:11:28.999: I/mono-stdout(7416): pass 10
10-18 20:11:29.059: I/mono-stdout(7416): pass 11
10-18 20:11:29.099: I/mono-stdout(7416): pass 12
10-18 20:11:29.159: I/mono-stdout(7416): pass 13
10-18 20:11:29.219: I/mono-stdout(7416): pass 14
10-18 20:11:29.259: I/mono-stdout(7416): pass 15
10-18 20:11:29.299: I/mono-stdout(7416): pass 16
10-18 20:11:29.359: I/mono-stdout(7416): pass 17
10-18 20:11:29.419: I/mono-stdout(7416): pass 18
10-18 20:11:29.469: I/mono-stdout(7416): pass 19
10-18 20:11:29.539: I/mono-stdout(7416): pass 20
10-18 20:11:29.549: I/mono-stdout(7416): pass 21
10-18 20:11:29.549: I/mono-stdout(7416): pass 22
10-18 20:11:29.680: D/Zygote(189): Process 7416 terminated by signal (4)
Comment 29 T.J. Purtell 2013-10-19 00:56:43 UTC
On the new MotoX, another recent flagship device (notably only dual-core), the original test case (Attachment #4682 [details]) crashes quickly. The other test cases that I created do not crash quickly (within 1 minute).
Comment 30 T.J. Purtell 2013-10-19 01:27:47 UTC
I flashed the open-source CyanogenMod firmware onto the HTC One device and found that it still crashes almost immediately in both current test cases (Attachment #4682 [details] == original, and Attachment #5177 [details] == updated concurrent GC launch with no recursion). There are no extra debug logs besides the termination messages. The crashes manifest primarily as signal 11. This is a 4.2.2 firmware.

So, I can confirm the crashes exist on these device/firmware combos:
- HTC One ATT - Stock 4.1.2 image - all crash quickly
- HTC One ATT - CyanogenMod AOSP based 4.2.2  Image - all crash quickly
- MotoX Verizon - Stock - only attachment 4682 [details] (first test case) crashes quickly

I mentioned before that I do not see them on some devices.
Comment 31 T.J. Purtell 2013-10-20 03:41:15 UTC
Today I confirmed that these issues are still present in the latest Xamarin.Android 4.10 alpha build.

I have an even more simplified test case available.
- No Handler.Post or use of the main thread
- No explicit call to the Mono GC, only a timed call to the Java GC
- No use of any framework classes, I make my own class with one member function
- No use of recursion; there is only a loop during which 1000 objects are allocated and one (non-virtual, even) method is called on each one immediately after it is allocated

I have been using a git repository to develop these tests as I attempted to narrow down the minimum case that fails.  I made it public so that you can just get all my test case code from there; I think the multiple attachments are starting to get confusing on this bug.  

The repository is available here: 
https://bitbucket.org/tpurtell/monodroid-action-gc-crash

I also created a pure Java version of the quickest test case for reproduction. I wanted to eliminate, in another way, the possibility that there was something else wrong in the device's frameworks (I had previously tried the reproduction cases using a build of CyanogenMod for the HTC One). The pure Java equivalent does not seem to crash. This case is also included in the repository.
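
A minimal sketch of the simplified case described above (illustrative names; the real code is in the linked repository): a plain class with one non-virtual method, a loop that allocates 1000 instances per pass, and a background thread that periodically asks Dalvik to collect.

using System;
using System.Threading;

class Payload
{
    int value;
    public Payload(int v) { value = v; }
    public int Bump() { return ++value; }   // single, non-virtual instance method
}

class SimplifiedRepro
{
    public static void Run()
    {
        // Background thread: periodically ask the Dalvik VM to collect
        // (Java.Lang.JavaSystem.Gc() binds java.lang.System.gc()).
        new Thread(() => {
            while (true) {
                Java.Lang.JavaSystem.Gc();
                Thread.Sleep(100);
            }
        }).Start();

        // Allocation loop: 1000 short-lived objects per pass, each touched once;
        // no explicit Mono GC call anywhere.
        int pass = 0;
        while (true) {
            int sum = 0;
            for (int i = 0; i < 1000; i++)
                sum += new Payload(i).Bump();
            Console.WriteLine("pass " + ++pass + " (" + sum + ")");
        }
    }
}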
Comment 32 T.J. Purtell 2013-10-20 04:13:29 UTC
My best guesses at this point are:
- an interaction between the Mono and Dalvik garbage collectors
- tracking-logic errors in the garbage collector for things reachable from actively running code (e.g. stack or register GC roots)
- missing memory barriers somewhere, which may expose themselves more on newer hardware
Comment 33 T.J. Purtell 2013-10-20 16:56:16 UTC
I have been digging through the mono commits and other reported bugs to try to find more leads on any of these potential causes.  I noticed this one which was of particular interest: https://github.com/mono/mono/commit/3f587a4ec5c9b201cd36d9210f4f5ccdf9f0dcae#diff-70f8005b72a9a3fd335ebf5afe931040

Essentially, the x86/amd64 context-restoration code was doing some bad things with respect to the stack pointer, leading the signal handler to corrupt the stack. Mark Probst did an excellent write-up of this problem here: http://schani.wordpress.com/2013/10/05/stacks-and-signals/

He noted that the same error was also in the amd64 port; however, no one had explicitly reported it, probably because of the difficulty of reproduction. I noticed that there was not a corresponding change to the ARM port of mono. I just took a look through that code and noticed it uses a full-register-set LDM instruction to restore the context. This operation is not atomic and can be interrupted (and restarted either in full or partially) according to the ARM architecture design.

Reference (I haven't had a chance to read through the ARM spec in enough detail to confirm this):
http://stackoverflow.com/questions/9857760/can-an-arm-interrupt-occur-in-mid-instruction

If this is in fact the case, perhaps a power-management function of the chip or some other interrupt causes a core to be interrupted mid-LDM, and the chip cancels the operation, planning to restart it. Because the ARM cache line is equal to the size of the full register set (http://www.7-cpu.com/cpu/Krait.html: cache line size is 64 bytes, and 16 regs * 4 bytes/reg = 64 bytes), there would often be AT LEAST two completely separate memory operations due to stack alignment. There could be significantly more depending on the exact implementation details of the chip. Meanwhile, a suspend signal is dispatched by the mono garbage collector; the garbage collector sees an indeterminate state and walks the stack or registers incorrectly.
Comment 34 T.J. Purtell 2013-10-21 05:25:31 UTC
I made another test case (in the repo) which does not do a Dalvik GC; instead it throws 100 exceptions. This crashes quickly as well. This is similar to the gc-altstack test case, which appears to be disabled in wrench. It would lend some more credence to the theory about there being an issue in the architecture-specific context-restore handler.
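
A rough sketch of this variant, assuming it keeps the allocation loop of the earlier simplified case and merely replaces the timed Java GC with bursts of thrown-and-caught managed exceptions (names and timings are illustrative; the repository has the real code):

using System;
using System.Threading;

class ExceptionBurstSketch
{
    public static void Run()
    {
        // Background thread: instead of a timed Java GC, periodically throw and
        // catch a burst of 100 managed exceptions.
        new Thread(() => {
            while (true) {
                for (int i = 0; i < 100; i++) {
                    try { throw new InvalidOperationException("burst " + i); }
                    catch (InvalidOperationException) { /* the unwind itself is the point */ }
                }
                Thread.Sleep(50);
            }
        }).Start();

        // Foreground loop: keep allocating and doing trivial work, as in the
        // earlier simplified test case.
        int pass = 0;
        while (true) {
            int sum = 0;
            for (int i = 0; i < 1000; i++)
                sum += new byte[16].Length;   // allocate so collections still happen
            Console.WriteLine("pass " + ++pass + " (" + sum + ")");
        }
    }
}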
Comment 35 T.J. Purtell 2013-10-21 13:58:59 UTC
One of the things that I have been seeing a lot, and which I thought was ultimately related to this problem, is frequent crashing of apps on the newer test devices while running under the soft debugger.

The soft debugger uses a function called "create_function_wrapper" which appears to share much of the code with the ARM architecture context restorer.

https://github.com/mono/mono/blob/master/mono/mini/mini-arm.c#L760

https://github.com/mono/mono/blob/master/mono/mini/exceptions-arm.c#L74
Comment 36 T.J. Purtell 2013-10-21 14:59:06 UTC
I have now confirmed that some of the test cases crash very quickly (particularly the equivalent of the gc-altstack test case) on the 2013 edition of the Google Nexus 7. It also has a Qualcomm Krait-architecture SoC. A lot of our development devices are Krait architecture, and all of them have been exhibiting the crashes frequently in our apps and while making use of the soft debugger.
Comment 37 Mark Probst 2013-10-21 17:29:32 UTC
I've reproduced the crashes of CrossThreadPostGC.cs, SimplestLongCall.cs and LessRandomizedLongCall.cs on the HTC One.  GCAltStackEquivalent.cs does not crash, but locks up.

I'll fix the ARM context restore issue and see whether that makes a difference.
Comment 38 T.J. Purtell 2013-10-21 20:18:13 UTC
I am glad to hear you have a solid reproduction case now. If the all-registers LDM/STM is not the issue, please let me know as soon as you can so that I can assist in looking for other possibilities.
Comment 39 T.J. Purtell 2013-10-21 22:27:04 UTC
I confirmed that the issues are also reproducible on an ASUS PadFone Infinity. This is another Krait-architecture CPU.
Comment 41 T.J. Purtell 2013-10-22 16:05:47 UTC
Have you been able to determine if adjusting the context restore code fixes the issue?
Comment 42 Mark Probst 2013-10-22 17:58:50 UTC
Fixing the context restore does not fix the crashes.

I did notice that while the latest Xamarin Android release makes GCAltStackEquivalent.cs lock up, the latest master makes it crash.  I'll bisect to see where it went from locking up to crashing - that might give us a clue as to what's going on.
Comment 43 T.J. Purtell 2013-10-22 19:54:30 UTC
:( Doh!

When I run these tests under the debugger and catch the signal of doom, the CPU state usually looks pretty weird. In particular, the PC is often < 0x8000. So it seems like there is some MonoContext corruption/mis-save/mis-restore, or there is another redzone issue somewhere else that we just aren't seeing.

I suppose we could systematically probe for redzone issues by using ptrace to single-step the process and purposefully corrupt everything above the redzone pointer while tracking the previous full register state. This could let us see what code is actually running before the process explodes. Actually, perhaps this is a good idea even without the automatic redzone corruption. If the issue can still be triggered while ptracing, then knowing the sequence of previous PCs and register states could shed a lot of light on the issue.
Comment 44 T.J. Purtell 2013-10-22 23:22:59 UTC
It is possible that making a build with some compiler-provided memory error detection might tease out issues too. For example, clang and gcc appear to support -fsanitize=address, which places guard data around stack-allocated objects. However, it seems these might have some limitations with regard to mono because of its use of restore-context (which is like swapcontext). https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer
Comment 45 T.J. Purtell 2013-10-24 02:12:41 UTC
Created attachment 5227 [details]
Linux Kernel patch that illustrates the original source of this bug
Comment 46 Jonathan Pryor 2013-10-24 12:24:35 UTC
It looks like the Linux kernel patch is to work around a Krait processor bug.

Fun!
Comment 47 T.J. Purtell 2013-10-24 14:39:24 UTC
Created attachment 5230 [details]
ARM Architecture Clipping highlighting that the kernel is relying on unspecified behavior
Comment 48 Rodrigo Kumpera 2013-11-13 18:36:03 UTC
Fix is on 4.10.1.
Comment 49 T.J. Purtell 2013-11-13 19:33:01 UTC
Linux kernel patch has made its way to mainline kernel https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6ecf830e5029598732e04067e325d946097519cb
Comment 50 Sadik Ali 2013-12-17 07:01:10 UTC
I have checked this issue and it's working fine; below are the steps and environment:

1. Created a new sample Android project.
2. Added the code from both attached test cases to the sample project.
3. Debugged the sample app on an HTC device running Android 4.0 for 5-6 minutes without getting any exception.

Checked with:
Windows 7
VS 2013, 12
XS: 4.2.3 (Build 3)
XA: 4.10.01073
Comment 51 Michal Dobrodenka 2014-01-20 09:35:10 UTC
I'm probably hitting this bug on Scorpion CPUs with XA 4.10.2.
Comment 52 Peter Collins 2014-01-20 12:47:38 UTC
Michal, I was unable to reproduce this issue on an HTC Droid Incredible (v2.3.4, Scorpion CPU). Could you possibly link to, or paste, the test case that is failing for you on devices with a Scorpion CPU?
Comment 53 Michal Dobrodenka 2014-01-21 05:25:00 UTC
I cannot reproduce it, but after updating to the 4.10 branch I'm getting crash logs like these in Flurry:

Exception of type 'Java.Lang.RuntimeException' was thrown.
at Android.Runtime.JNIEnv.CallVoidMethod (intptr,intptr) ...

And many clients using old phones are reporting crashes, giving 1-star ratings :(

Problematic phones:

Galaxy S2 (SGH-T989) - using scorpion core
HTC Desire HD(ace) - scorpion
HTC Desire - scorpion

(We had one strange crash report from customer from HTC Legend phone some time ago, but maybe not connected to this bug.)
Comment 54 Jonathan Pryor 2014-01-21 10:27:51 UTC
@Michal:
> Exception of type 'Java.Lang.RuntimeException' was thrown.

Can you please provide the full stack trace, including the Java-side stack trace?

Java.Lang.RuntimeException means that the Java code threw a java.lang.RuntimeException, and we wrapped it in Java.Lang.RuntimeException. Consequently, the Java-side stack trace is important, because Java threw the exception.

This isn't necessarily a Xamarin.Android bug; it could be a usage or Android "bug".
Comment 55 Michal Dobrodenka 2014-01-22 04:32:10 UTC
@Jonathan

You are right, it could be a bug in usage or in Android; if I get a test case, I'll share it here.

Unfortunately I don't have the full stack trace; Flurry cuts our logs :(