Bug 21939 - Boehm GC SIGSEGV During GC Finalize routines
Summary: Boehm GC SIGSEGV During GC Finalize routines
Status: RESOLVED NOT_REPRODUCIBLE
Alias: None
Product: Runtime
Classification: Mono
Component: GC ()
Version: 3.2.x
Hardware: Other Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2014-08-08 03:43 UTC by evolvedmicrobe
Modified: 2014-08-18 21:37 UTC
CC List: 4 users

Tags:
Is this bug a regression?: ---
Last known good build:


Description evolvedmicrobe 2014-08-08 03:43:20 UTC
I have a program that periodically crashes at the same location during a GC (using Boehm).  I am working on creating a test case, but the program is complex and finding a simple and reproducible example has so far been hard for this error.

However, I believe I know the cause based on GDB examination of the core dumps.  The error occurs during a GC, as shown by the stack trace at the end of this report.  The SIGSEGV is raised while GC_make_disappearing_links_disappear is traversing dl_hashtbl to clear links: when it checks one link with GC_is_marked, the program faults.

GC_is_marked tries to examine the header for an allocated chunk.  However, the pointer passed just before the fault has no header, because it appears to be the special size_zero_object created at pthread_support.c:260.  That file describes the object as follows:


/* We statically allocate a single "size 0" object. It is linked to	*/
/* itself, and is thus repeatedly reused for all size 0 allocation	*/
/* requests.  (Size 0 gcj allocation requests are incorrect, and	*/
/* we arrange for those to fault asap.)					*/
static ptr_t size_zero_object = (ptr_t)(&size_zero_object);

It appears that a reference to this object somehow wound up in the code in finalize.c.  Because size_zero_object has no associated header, the SIGSEGV occurs when the GC tries to check its header to see whether it is marked.
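To illustrate the failure mode described above, here is a minimal sketch in plain C. It is a toy model, not libgc code: every `toy_` name, the 4096-byte block size, and the linear header table are assumptions made for illustration. It shows why a mark test that looks up a per-block header has nothing to dereference when handed a pointer to a statically allocated, self-referential object.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model (assumption): one header per managed heap block. */
#define TOY_BLOCK_SIZE 4096u
#define TOY_MAX_BLOCKS 8

typedef struct {
    uintptr_t block_start;   /* address of the managed block */
    unsigned char marked;    /* simplified single mark bit   */
} toy_header;

static toy_header toy_headers[TOY_MAX_BLOCKS];
static size_t toy_header_count = 0;

/* Record a block as collector-managed (hypothetical helper). */
static void toy_add_block(void *start, int marked)
{
    if (toy_header_count < TOY_MAX_BLOCKS) {
        toy_headers[toy_header_count].block_start = (uintptr_t)start;
        toy_headers[toy_header_count].marked = (unsigned char)(marked != 0);
        toy_header_count++;
    }
}

/* Return the header covering p, or NULL if p is outside the managed heap. */
static toy_header *toy_find_header(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    for (size_t i = 0; i < toy_header_count; i++) {
        uintptr_t start = toy_headers[i].block_start;
        if (addr >= start && addr - start < TOY_BLOCK_SIZE)
            return &toy_headers[i];
    }
    return NULL;
}

/* Guarded mark test: 1 = marked, 0 = unmarked, -1 = no header found. */
static int toy_is_marked(const void *p)
{
    toy_header *hdr = toy_find_header(p);
    if (hdr == NULL)
        return -1;  /* an unguarded hdr->marked here would fault */
    return hdr->marked;
}

/* Self-referential static object, mirroring the declaration quoted above. */
static char *toy_size_zero_object = (char *)(&toy_size_zero_object);
```

In this model, calling `toy_is_marked` on an address inside a registered block returns the mark bit, while calling it on `toy_size_zero_object` returns -1, which is the point where an unguarded header dereference would crash.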

I was looking online and it appears that, after the Mono fork of the Boehm GC was made, Hans Boehm modified his code to remove this size_zero_object and instead use a 1-granule object that could be freed by the runtime.  The commit for his change is here:

https://gitorious.org/w64/bohem-gc/commit/870bd70d1a713f05d6fab01ac13255adbd4b1710

I think making those, or similar, changes to the Mono runtime might solve this bug.
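The general idea of replacing a shared static zero-size object with a real one-granule allocation can be sketched as follows. This is not the code from the commit linked above; the 16-byte granule size, the `sketch_` names, and the use of `calloc` as a stand-in for the collector's heap are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_GRANULE_BYTES 16u  /* assumed granule size */

/* Round a request up to whole granules; a zero-byte request gets one granule. */
static size_t sketch_granules_for(size_t bytes)
{
    if (bytes == 0)
        return 1;
    return bytes / SKETCH_GRANULE_BYTES + (bytes % SKETCH_GRANULE_BYTES != 0);
}

/* Stand-in allocator: calloc takes the place of the collector's heap. */
static void *sketch_alloc(size_t bytes)
{
    return calloc(sketch_granules_for(bytes), SKETCH_GRANULE_BYTES);
}
```

With this shape, two zero-byte requests yield two distinct, ordinary heap objects, so any later bookkeeping (such as a mark check) sees a normal allocation rather than a special sentinel.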

GDB BACKTRACE BELOW

#0  0x00002b001d88ab65 in *__GI_raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00002b001d88e6b0 in *__GI_abort () at abort.c:92
#2  0x000000000049bbf6 in mono_handle_native_sigsegv (signal=<value optimized out>,
    ctx=<value optimized out>) at mini-exceptions.c:2367
#3  0x00000000004f534b in mono_arch_handle_altstack_exception (sigctx=0x2aaae0e26ac0,
    fault_addr=<value optimized out>, stack_ovf=0) at exceptions-amd64.c:884
#4  0x0000000000415929 in mono_sigsegv_signal_handler (_dummy=11, info=0x2aaae0e26bf0,
    context=0x2aaae0e26ac0) at mini.c:6379
#5  <signal handler called>
#6  0x00000000005f5c60 in GC_is_marked (p=0x93a558 "\230/\273\001") at mark.c:209
#7  0x00000000005f2397 in GC_make_disappearing_links_disappear (dl_hashtbl=0x93a2a0)
    at finalize.c:625
#8  0x00000000005f25a6 in GC_finalize () at finalize.c:684
#9  0x00000000005f012b in GC_finish_collection () at alloc.c:696
#10 0x00000000005ef913 in GC_try_to_collect_inner (stop_func=0x5ef2a7 <GC_never_stop_func>)
    at alloc.c:393
#11 0x00000000005f09bb in GC_collect_or_expand (needed_blocks=5, ignore_off_page=0)
    at alloc.c:1045
#12 0x00000000005f416d in GC_alloc_large (lw=2052, k=1, flags=0) at malloc.c:60
#13 0x00000000005f4534 in GC_generic_malloc (lb=16416, k=1) at malloc.c:204
#14 0x00000000005f47f2 in GC_malloc (lb=16416) at malloc.c:311
#15 0x00000000005fe87f in GC_local_malloc (bytes=16416) at pthread_support.c:339
#16 0x000000000059b25a in mono_object_allocate (vtable=vtable(0x2aab04081070), n=2048)
    at object.c:4339
#17 mono_array_new_specific (vtable=vtable(0x2aab04081070), n=2048) at object.c:4924
#18 0x00000000413e766d in ?? ()
#19 0x00002aaadc009a50 in ?? ()
#20 0x00002aab040608d8 in ?? ()
#21 0x0000000000000030 in ?? ()
#22 0x00002aaae0e20e60 in ?? ()
#23 0x0000000000000030 in ?? ()
#24 0x00002aaae0e20e60 in ?? ()
#25 0x00002aaae0e20c20 in ?? ()
#26 0x0000000000000000 in ?? ()

Comment 1 Rodrigo Kumpera 2014-08-12 19:59:23 UTC
We'll need a test case for that bug. As an alternative, could you try changing libgc's source code to trap those zero-size allocations, as Mono should not produce them?

In the meantime, you should try using SGen.
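One way to trap zero-size allocations, as suggested above, is to put a check in front of the allocator that fires a hook whenever a request for zero bytes arrives. The sketch below is an assumption-laden illustration, not libgc code: `trapping_alloc`, the hook names, and the use of `malloc` in place of the collector's allocator are all hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Default hook (hypothetical): report the call site and abort so that a
   core dump captures the offending caller. */
static void trap_abort(size_t bytes, const char *site)
{
    fprintf(stderr, "zero-size allocation trapped at %s (bytes=%zu)\n",
            site, bytes);
    abort();
}

/* Non-fatal hook (hypothetical): only counts hits, useful for long runs. */
static unsigned trap_hits = 0;
static void trap_count_only(size_t bytes, const char *site)
{
    (void)bytes;
    (void)site;
    trap_hits++;
}

/* The active hook; swap it to choose between aborting and counting. */
static void (*trap_on_zero_alloc)(size_t, const char *) = trap_abort;

/* Allocation wrapper: malloc stands in for the collector's allocator. */
static void *trapping_alloc(size_t bytes, const char *site)
{
    if (bytes == 0) {
        trap_on_zero_alloc(bytes, site);
        return NULL;  /* reached only if the hook returns */
    }
    return malloc(bytes);
}
```

Setting `trap_on_zero_alloc = trap_count_only` lets a long-running reproduction continue while recording how often a zero-size request occurs; leaving the default hook in place stops at the first one.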

Comment 2 evolvedmicrobe 2014-08-17 12:31:06 UTC
I could not generate a test case that would reliably produce this bug without running for over an hour with a large code base in Mono 3.0.  I then git cloned the most recent Mono and tried to trap the size_zero_object allocation directly.  However, with the new version (still using Boehm), my test case no longer triggers the bug at all, and my code trapping the allocation and GC link registration never appears to be hit.

I have no idea what changed to fix this, but suppose something did, and without a test case on the newer code this does not seem worth pursuing.

Comment 3 evolvedmicrobe 2014-08-17 12:33:04 UTC
Can't create test case on current version.

Comment 4 evolvedmicrobe 2014-08-18 21:37:25 UTC
Some updates on this bug in case anyone else sees it in future versions.  I am not sure it is guaranteed not to occur again; I just can't make a test case on the latest Mono version.
 
I tracked this bug down in mono v3.0.7.  This bug was thrown because when the weak references were being cleared during a GC, somehow a reference was held not to a well-formed MonoObject struct, but to the special size_zero_object.  To trace what registered it, I recompiled Mono 3.0.7 to throw an exception whenever a weak reference to this size_zero_object was added. After several attempts at running the test case, an exception was eventually thrown when such a weak reference was registered (stack trace of exception shown at end).
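The kind of check described above can be sketched as a guard in the weak-reference registration path that compares the target against the sentinel's address. This is a standalone illustration, not the patch that was applied to Mono 3.0.7: all `demo_` names, the fixed-size link table, and the return codes are assumptions.

```c
#include <stddef.h>

/* Stand-in for the collector's static sentinel (hypothetical name). */
static char *demo_size_zero_object = (char *)(&demo_size_zero_object);

#define DEMO_MAX_LINKS 16
static void **demo_links[DEMO_MAX_LINKS];
static size_t demo_link_count = 0;
static unsigned demo_sentinel_hits = 0;

/* Register *link as a weak reference to obj.
   Returns 0 on success, -1 if obj is the sentinel, -2 if the table is full. */
static int demo_register_weak_link(void **link, void *obj)
{
    if (obj == (void *)demo_size_zero_object) {
        /* A debugging build would raise an exception or abort here so the
           registering caller shows up in the stack trace. */
        demo_sentinel_hits++;
        return -1;
    }
    if (demo_link_count == DEMO_MAX_LINKS)
        return -2;
    *link = obj;
    demo_links[demo_link_count++] = link;
    return 0;
}
```

Rejecting the sentinel at registration time moves the failure from a later collection, where the original caller is long gone, to the moment the bad reference is introduced.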

Examining the trace, the error appears to originate at Lazy.cs:146.  A summary of the relevant code:

    // In the Lazy class constructor, this object was allocated as a
    // pointer to size_zero_object, which is definitely a bug.
    object monitor = new object();

    // Later, on line 146, the lock statement registers a weak reference
    // to this size_zero_object with the GC, leading to a SIGSEGV during
    // a future collection.
    lock (monitor) {
        // ...
    }

The Lazy object was made by some FSharp.Collections classes.  My workaround for Mono v3.0.7 was to change the monitor object from a plain object type to a List<object> type, on the theory that somewhere in the allocation code this would make the size_zero_object reference less likely.  So far that seems to work...