Bug 16075 - SIGSEGVs (with interface trampoline??), multiple devices, gdb info
Status: RESOLVED FIXED
Alias: None
Product: Android
Classification: Xamarin
Component: Mono runtime / AOT Compiler
Version: 4.10.0.x
Hardware: PC Windows
Importance: --- normal
Target Milestone: 4.12.0 (KitKat)
Assignee: Rodrigo Kumpera
URL:
Depends on:
Blocks:
 
Reported: 2013-11-09 03:34 UTC by T.J. Purtell
Modified: 2014-01-14 04:50 UTC (History)

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
HTC One 4.3 (Eng - Brightstar Firmware) - Debug Xamarin 4.10.1 beta - SIGSEGV GDB Session (5.41 KB, text/plain)
2013-11-09 03:34 UTC, T.J. Purtell
Nexus 5 (Stock Firmware) - Debug Xamarin 4.10.1 beta - SIGSEGV GDB Session (2.78 KB, text/plain)
2013-11-09 03:35 UTC, T.J. Purtell
Nexus 5 (Stock Firmware) - Release Xamarin 4.10.1 beta - SIGSEGV GDB Session (2.31 KB, text/plain)
2013-11-09 03:35 UTC, T.J. Purtell
Nexus 5 (Stock Firmware) - Release Xamarin 4.10.1 beta - SIGSEGV Logcat (30.93 KB, text/plain)
2013-11-09 03:39 UTC, T.J. Purtell
Nexus 5 (Stock Firmware) - Debug Xamarin 4.10.1 beta - SIGSYS debugging session (45.14 KB, text/plain)
2013-11-10 16:56 UTC, T.J. Purtell



Description T.J. Purtell 2013-11-09 03:34:27 UTC
Created attachment 5377 [details]
HTC One 4.3 (Eng - Brightstar Firmware) - Debug Xamarin 4.10.1 beta - SIGSEGV GDB Session

I have been seeing numerous SIGSEGVs that produce a mono-rt crash log with no stack trace.  I have several HTC One devices running the 4.3 firmware that crash like this extremely often when running the debug runtime.  On the Nexus 5 (4.4), I see the same type of crash with the debug runtime.  Additionally, I frequently see crashes with the RELEASE runtime on the Nexus 5 (easily catchable in gdb).  I will attach gdb sessions for each of these three scenarios.

I have attempted to isolate the error to usage of a particular type of API (e.g. SQLite, Network, XML, etc.), but it seems to depend more on simply having several threads doing work at the same time.  Any time I get a repeatable occurrence of the crash in a hacky stripped-down version of my app, I try to make a test case for what it's doing.  These test cases never reproduce the error.

All instances of the crash end up calling a NULL pointer, and the few opcodes preceding the LR address appear to be EXACTLY the same in every case:
  add     lr, pc, #4
  ldr     pc, [r1, #-60]  ; 0x3c
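
For readers less familiar with ARM, these two instructions amount to an indirect call through a table slot: `add lr, pc, #4` sets the return address, and `ldr pc, [r1, #-60]` loads the target from 60 bytes before the address in `r1` and jumps to it. The C sketch below only illustrates that pattern; the names (`slot_fn`, `call_slot`, `SLOTS_BEFORE_BASE`) are hypothetical, and the figure of 15 slots assumes 4-byte pointers as on 32-bit ARM. It shows why an unfilled slot turns into a jump to address 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for a function stored in the table. */
typedef int (*slot_fn)(int);

/* 60 bytes before the base, assuming 4-byte pointers (32-bit ARM). */
enum { SLOTS_BEFORE_BASE = 15 };

/*
 * Mirrors "ldr pc, [r1, #-60]" with an added NULL check.
 * Returns 0 and stores the callee's result if the slot is filled,
 * or -1 if the slot is NULL (where the machine code would jump to 0).
 */
int call_slot(slot_fn const *base, int arg, int *result)
{
    assert(base != NULL && result != NULL);

    slot_fn target = base[-SLOTS_BEFORE_BASE];
    if (target == NULL)
        return -1;

    *result = target(arg);
    return 0;
}

/* Trivial callee used to fill a slot in the usage example. */
int add_one(int x)
{
    return x + 1;
}
```

With a 16-entry array `table` and `base = &table[15]`, the slot read by `call_slot` is `table[0]`; leaving it NULL reproduces the "call to 0x00000000" shape seen in the backtraces.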


The crashy behavior also appears with stable (with a thumb signal handler fix applied to rule out any of the already fixed issues); however, all of these traces were taken with the 4.10.1 beta downloaded this evening.  All of these test devices are also Krait; however, these crashes look nothing at all like the signal handling issues from before.

When I tried to narrow the crash window down to a single function in a relatively repeatable scenario, I found that if I set breakpoints at the top and bottom of this function, the second breakpoint was never hit in the crashing cases.

        public override bool OnPrepareOptionsMenu(IMenu menu)
        {
            if (_Feed == null) {
                menu.GetItem (0).SetTitle ("Start Chat");
                menu.GetItem (0).SetEnabled (true);
            } else {
                menu.GetItem (0).SetEnabled (_AllowAdd);
            }
            return base.OnPrepareOptionsMenu (menu);
        }

That said, there is a lot of background stuff going on in my app, and as I remove things, the repeatability drops.  Eliminating entire types of API access didn't yield fruitful results so it seems that the issue is either race-condition or garbage related.

Since this method in particular uses a Java interface, the binding itself doesn't have any useful IL code to look at for JNI-related issues.  I suspect that the internal tables that store the function pointers for invoking these interface methods have a race, a missing memory barrier, or something similar.
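
To make the suspected failure mode concrete, here is a minimal sketch of publishing a function pointer into a shared slot with release/acquire ordering using C11 atomics. It is an illustration only: `dispatch_slot`, `slot_publish`, `slot_lookup`, and `slot_invoke` are hypothetical names, not Mono's actual data structures or API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical signature for a resolved interface method. */
typedef int (*method_fn)(int);

/* Hypothetical dispatch slot shared between threads. */
typedef struct {
    _Atomic(method_fn) target;
} dispatch_slot;

/* Start with an empty (unresolved) slot. */
void slot_init(dispatch_slot *slot)
{
    assert(slot != NULL);
    atomic_init(&slot->target, (method_fn)NULL);
}

/* Writer: the release store makes everything written before it visible
 * to any reader whose acquire load observes this pointer. */
void slot_publish(dispatch_slot *slot, method_fn fn)
{
    assert(slot != NULL);
    atomic_store_explicit(&slot->target, fn, memory_order_release);
}

/* Reader: returns NULL while the slot is still unresolved. */
method_fn slot_lookup(dispatch_slot *slot)
{
    assert(slot != NULL);
    return atomic_load_explicit(&slot->target, memory_order_acquire);
}

/* Calls the target if one is published, otherwise returns fallback. */
int slot_invoke(dispatch_slot *slot, int arg, int fallback)
{
    method_fn fn = slot_lookup(slot);
    return fn != NULL ? fn(arg) : fallback;
}

/* Trivial target used in the usage example. */
int twice(int x)
{
    return 2 * x;
}
```

Without the ordering (or with a plain, non-atomic store), a reader on another core could in principle observe a stale or partially initialized slot, which is the kind of bug that shows up only under heavy multithreaded load.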


Minor note: on the HTC 4.3 Debug crash trace there is a second SIGSEGV in the teardown libc code.  I have included it in case it is relevant, but I am pretty sure that only the original SIGSEGV is relevant.
Comment 1 T.J. Purtell 2013-11-09 03:35:11 UTC
Created attachment 5378 [details]
Nexus 5 (Stock Firmware) - Debug Xamarin 4.10.1 beta - SIGSEGV GDB Session
Comment 2 T.J. Purtell 2013-11-09 03:35:38 UTC
Created attachment 5379 [details]
Nexus 5 (Stock Firmware) - Release Xamarin 4.10.1 beta - SIGSEGV GDB Session
Comment 3 T.J. Purtell 2013-11-09 03:39:28 UTC
Created attachment 5380 [details]
Nexus 5 (Stock Firmware) - Release Xamarin 4.10.1 beta - SIGSEGV Logcat

The gdb trace from this specific run has exactly the same structure.
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x7a793fa8 in ?? ()
#2  0x7a793fa8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) x/10i $lr - 12
   0x7a793f9c:  add     lr, pc, #4
   0x7a793fa0:  ldr     pc, [r1, #-60]  ; 0x3c
   0x7a793fa4:                  ; <UNDEFINED> instruction: 0x7795a4b8
   0x7a793fa8:  bl      0x7a772ed0
   0x7a793fac:  add     sp, sp, #4
   0x7a793fb0:  pop     {r7, r8, pc}
   0x7a793fb4:
    push        {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, lr}
   0x7a793fb8:  ldr     r1, [pc, #8]    ; 0x7a793fc8
   0x7a793fbc:  mov     lr, pc
   0x7a793fc0:  bx      r1
Comment 4 Rodrigo Kumpera 2013-11-10 12:45:11 UTC
I'm working on it.
Comment 5 T.J. Purtell 2013-11-10 16:55:04 UTC
Thanks.  

I have been running more tests as well to see if I can catch any more hints about what is going on.  I have noticed that while debugging (with both the soft debugger and gdb attached), I see SIGSYS delivered to the process.  The mono source code doesn't appear to contain SIGSYS, so this might be indicative of another failure mode.

I will attach gdb info from one time where I caught this SIGSYS.  It's not clear that it always happens in the same place, I'll have to run some more iterations to know that for sure.  

This particular run shows SIGSYS often being delivered while in the sem_wait code.  The SI_CODE field is -6, indicating that the signal was delivered as a result of a tkill operation.  I disassembled all the libraries on the devices and found that libchromium_net contains a bsd_signal handler for SIGSYS.  It appears to be related to some sort of profiling feature.  The library appears to be compiled as thumb code, but I am not 100% sure the signal handler function is.  I think it is, but it was hard to figure out its address from the disassembly, so I could have gotten it wrong.  The handler essentially had no body, which seems to match the ifdef in the chromium net code (base::EnableInProcessStackDumping(void)).  I am also not 100% sure that this library is sending the signal, but it is the only lead I have so far.
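
For reference, the origin of a signal can be read from `si_code` inside an `SA_SIGINFO` handler; on Linux, `SI_TKILL` has the value -6. The sketch below is a generic illustration with hypothetical helper names (`install_recorder`, `describe_si_code`); it is not taken from Mono or Chromium, and it uses `SIGUSR1` in its usage example rather than `SIGSYS`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Last signal number and si_code seen by the handler. */
volatile sig_atomic_t recorded_signo;
volatile sig_atomic_t recorded_code;

/* SA_SIGINFO-style handler: only records what it was given. */
void record_siginfo(int signo, siginfo_t *info, void *context)
{
    (void)context;
    recorded_signo = signo;
    recorded_code = info->si_code;
}

/* Installs the recording handler for signo; returns 0 on success. */
int install_recorder(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_siginfo;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Maps a few common si_code values to the call that produces them. */
const char *describe_si_code(int code)
{
#ifdef SI_TKILL
    if (code == SI_TKILL)       /* Linux-specific; value -6 */
        return "tkill/tgkill";
#endif
    if (code == SI_USER)
        return "kill";
    if (code == SI_QUEUE)
        return "sigqueue";
    return "other";
}
```

After `install_recorder(SIGUSR1)` and `raise(SIGUSR1)`, `describe_si_code(recorded_code)` names the sending mechanism; the exact value observed depends on how the C library implements `raise`.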

I noticed that the sem_wait code will actually act as if the wait succeeded if the error code returned is not EINTR.  bsd_signal should just restart the system call, so I don't see how this could be the issue.  One guess is that the SIGSYS may simply be making the problems caused by some access to JIT data outside a lock more apparent.

Link to the relevant mono source code.
https://github.com/mono/mono/blob/433017c01185739599deb2fd8293b015892452be/mono/utils/mono-semaphore.c#L79

Link to the chromium source code.  Note, this is the newest version of the file that looks like it matches up with the disassembly of the framework on my device (uses bsd_signal called as signal in that code vs. sigaction)
http://git.chromium.org/gitweb/?p=chromium.git;a=blob;f=base/debug/stack_trace_posix.cc;hb=dc23f1974c24bb59abe9488ca0156f94caa21683
Comment 6 T.J. Purtell 2013-11-10 16:56:33 UTC
Created attachment 5387 [details]
Nexus 5 (Stock Firmware) - Debug Xamarin 4.10.1 beta - SIGSYS debugging session

I was previously ignoring SIGSYS because I thought it was a Xamarin signal.  Since it isn't, these logs show where the SIGSYS signals are occurring.
Comment 7 T.J. Purtell 2013-11-10 18:46:49 UTC
I didn't see these signals on the Android 4.1 firmware, but I do see them on 4.3 and 4.4.  I did some more runs and found that unless the soft debugger is connected, I don't see the SIGSYS in gdb.  There definitely is a signal handler installed for signal 31 (I checked).
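
One way to check from inside a process whether a handler is installed for a given signal is to query the current disposition by calling `sigaction` with a NULL new action. The sketch below is an illustration with hypothetical names (`has_custom_handler`, `noop_handler`); it is not necessarily how the check above was performed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/*
 * Returns 1 if a custom handler is installed for signo, 0 if the
 * disposition is SIG_DFL or SIG_IGN, and -1 if the query fails.
 */
int has_custom_handler(int signo)
{
    struct sigaction cur;

    assert(signo > 0);
    /* Passing NULL as the new action only reads the current one. */
    if (sigaction(signo, NULL, &cur) != 0)
        return -1;

    /* SA_SIGINFO means a three-argument handler was registered. */
    if (cur.sa_flags & SA_SIGINFO)
        return 1;

    return cur.sa_handler != SIG_DFL && cur.sa_handler != SIG_IGN;
}

/* Empty handler used only in the usage example. */
void noop_handler(int signo)
{
    (void)signo;
}
```

Calling `has_custom_handler(SIGSYS)` early in startup and again after the framework libraries have loaded would show whether something installed a handler in between.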

Note that the crashes I reported happen regardless of whether the soft debugger is attached, so I suspect the soft debugger and these SIGSYSs are affecting timing in a way that brings out a hidden issue.

I made a test case that generates tons of SIGSYS signals by using the default thread pool to reflectively call methods found by walking the Xamarin.Android dll.  I don't see any crashes while doing this, but if the SIGSYS is unexpected, this should let you easily observe it on a Nexus 5.

https://bitbucket.org/tpurtell/monodroid-reflector-sigsys

Start it
Connect with soft debugger
Connect with gdb
Then click "Default Invoke"
You'll see stops in lots of places in libc/dalvik/mono etc
Comment 8 Rodrigo Kumpera 2013-11-11 16:50:59 UTC
I believe it's a corner case of interface dispatching, probably similar to another report we have.
Comment 9 T.J. Purtell 2013-11-11 17:19:36 UTC
Ok.  Well, I stripped the heck out of my app code, down to a small chunk of code that reproduces the error.  This stripped-down case only repros reliably on one of the three test devices I have.  If you end up needing more leads, I can arrange to loan the device out and get you the stripped-down code.  Let me know if you need me to try out a potential fix to confirm the issues are the same.
Comment 10 Rodrigo Kumpera 2013-11-12 17:42:35 UTC
Bug fixed; the fix will be part of 4.10.2.
Comment 11 narayanp 2014-01-14 04:50:39 UTC
I have checked this issue with the following builds:

Mac
X.S 4.2.2(build 2)
Mono 3.2.5
X.Android 4.10.1-68

Device Info:
HTC One version 4.0.334
Samsung S3 version 4.1.1
Samsung S4 version 4.3

I debugged the attached project from the comment above. After I click 'Default Invoke', the application shows a system error, as shown in this gist: https://gist.github.com/saurabh360/fb323751ffd09092b676

Is this the same issue that T.J. Purtell is facing?
Please let me know whether I followed the correct steps to reproduce this issue.