Bug 1917 - SIGSEGV during GC when write barrier is set to cardtable
Summary: SIGSEGV during GC when write barrier is set to cardtable
Status: RESOLVED FIXED
Alias: None
Product: Runtime
Classification: Mono
Component: GC ()
Version: unspecified
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2011-11-07 16:45 UTC by Sam Lang
Modified: 2012-06-19 13:26 UTC

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
Test cases which periodically causes SIGSEGV (3.57 KB, text/plain)
2011-11-08 12:29 UTC, Sam Lang
More minimal test (2.30 KB, text/plain)
2011-11-08 12:50 UTC, Sam Lang
patch to fix bug (3.96 KB, patch)
2012-04-27 15:30 UTC, Sam Lang
patch to fix bug (3.89 KB, patch)
2012-04-27 16:34 UTC, Sam Lang
patch attempt #3 (3.13 KB, patch)
2012-04-28 16:21 UTC, Sam Lang
patch #4 (452 bytes, patch)
2012-05-21 15:39 UTC, Sam Lang



Description Sam Lang 2011-11-07 16:45:42 UTC
If I run my program with the Boehm GC, with MONO_GC_PARAMS=major=copying, or with MONO_GC_PARAMS=wbarrier=remset, it runs fine.  However, if I run it with the SGen GC and its default cardtable write barrier, it fails every time, anywhere from ~2 to 15 minutes after the program starts.

I am running Ubuntu 11.04 server edition on a 64-bit machine.
I have tried versions 2.10.2 and 2.10.6 of mono.

I've been digging around the source code to try and find the problem, but haven't had any luck so far.  Any insight as to what could be the problem or pointers for how I could track the problem down would be greatly appreciated.  Also, do you have any estimate of the performance benefit of cardtable over remset?

Here's a sample output if I run it with MONO_GC_DEBUG=check-at-minor-collections

Oldspace->newspace reference 0x7f8bf97812f0 at offset 2048 in object 0x7f8bf720a018 (.Slot[]) not found in remsets.
Oldspace->newspace reference 0x7f8bf97812f0 at offset 2056 in object 0x7f8bf720a018 (.Slot[]) not found in remsets.
* Assertion at sgen-gc.c:6809, condition `!missing_remsets' not met

Stacktrace:

  at (wrapper managed-to-native) object.__icall_wrapper_mono_gc_alloc_vector (intptr,intptr,intptr) <0xffffffff>
  at (wrapper alloc) object.AllocVector (intptr,intptr) <0xffffffff>
  at System.Collections.Generic.HashSet`1<string>.InitArrays (int) <0x000bf>
  at System.Collections.Generic.HashSet`1<string>.Init (int,System.Collections.Generic.IEqualityComparer`1<string>)   <0x000a7>
  at System.Collections.Generic.HashSet`1<string>..ctor () <0x0000c>

Native stacktrace:

        /home/slang/m6l/bin/mono-sgen() [0x498f21]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xfc60) [0x7f8bfa0d4c60]
        /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f8bf9b4ed05]
        /lib/x86_64-linux-gnu/libc.so.6(abort+0x186) [0x7f8bf9b52ab6]
        /home/slang/m6l/bin/mono-sgen() [0x609434]
        /home/slang/m6l/bin/mono-sgen() [0x6095b9]
        /home/slang/m6l/bin/mono-sgen() [0x5c23b2]
        /home/slang/m6l/bin/mono-sgen() [0x5c3da7]
        /home/slang/m6l/bin/mono-sgen() [0x5c4207]
        /home/slang/m6l/bin/mono-sgen() [0x5c45df]
        [0x40c1a930]

Debug info from gdb:

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.

=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
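
For readers unfamiliar with the check behind the "not found in remsets" lines above, here is a minimal sketch of the general idea, not Mono's code. It assumes a toy old generation made of reference slots and a card table with one byte per group of slots. A store that goes through the write barrier marks the card; a store that bypasses it leaves an old-to-young reference the minor collector would never find. Every name and size below (toy_heap, CARD_SHIFT, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model, NOT Mono's implementation: all names and sizes are hypothetical. */
#define HEAP_SLOTS 1024                        /* old-generation reference slots */
#define CARD_SHIFT 6                           /* 64 slots per card (assumed) */
#define NUM_CARDS  (HEAP_SLOTS >> CARD_SHIFT)

typedef struct {
    int     slots[HEAP_SLOTS];  /* 0 = null, nonzero = id of a nursery object */
    uint8_t cards[NUM_CARDS];   /* 1 = card may hold an old->young reference */
} toy_heap;

static void toy_heap_init(toy_heap *h)
{
    memset(h, 0, sizeof *h);
}

/* Store through a card-marking write barrier: write the slot, dirty its card. */
static void store_with_barrier(toy_heap *h, size_t slot, int young_ref)
{
    h->slots[slot] = young_ref;
    h->cards[slot >> CARD_SHIFT] = 1;
}

/* Store that skips the barrier: models the class of bug the check looks for. */
static void store_without_barrier(toy_heap *h, size_t slot, int young_ref)
{
    h->slots[slot] = young_ref;
}

/* Consistency check: count old->young references whose card is not marked. */
static size_t count_missing_remsets(const toy_heap *h)
{
    size_t missing = 0;
    for (size_t i = 0; i < HEAP_SLOTS; i++)
        if (h->slots[i] != 0 && !h->cards[i >> CARD_SHIFT])
            missing++;
    return missing;
}
```

In this model, two barrier-less stores into adjacent slots of an otherwise clean card make count_missing_remsets() return 2, which resembles the two adjacent offsets (2048 and 2056) reported in the output above. The sketch says nothing about where the real runtime lost the mark.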
Comment 1 Sam Lang 2011-11-08 12:29:39 UTC
Created attachment 845 [details]
Test cases which periodically causes SIGSEGV

I am able to (usually) reproduce the bug with this program.
Command: "mcs gc-test.cs && mono-sgen gc-test.exe"

If I set MONO_GC_PARAMS=wbarrier=remset it appears to run fine (albeit slowly).

I'll try to create a more minimal test.
Comment 2 Sam Lang 2011-11-08 12:50:10 UTC
Created attachment 846 [details]
More minimal test

Here's a more minimal version of the test.
Comment 3 Sam Lang 2011-11-08 18:12:59 UTC
After inserting calls to check_consistency() in various places, it seems that the call to scan_from_card_tables(nursery_start, nursery_next, &gray_queue) in sgen-gc.c is what puts mono in an inconsistent state.
Comment 4 Sam Lang 2011-11-09 10:38:57 UTC
Running the test case with the following command allows you to reproduce the bug much more consistently:

MONO_GC_DEBUG=9,check-at-minor-collections mono-sgen --debug gc-test.exe
Comment 5 Sam Lang 2011-11-09 16:52:03 UTC
It seems this bug is fixed in master.  Specifically, applying this change to 2.10.6 fixes the problem: https://github.com/mono/mono/commit/4a8b716e0bc63b79a2f13596680566773de42d18#diff-0
Comment 6 Sam Lang 2011-11-10 08:49:10 UTC
The test also fails on a 32 bit ubuntu virtual machine and the patch referenced in the previous comment does not fix the problem.
Comment 7 Rodrigo Kumpera 2011-11-23 14:45:53 UTC
I can still repro your bug on master.

Working on it.
Comment 8 Rodrigo Kumpera 2011-11-23 17:57:56 UTC
Fixed on 2.10 and master.

Thanks for the test case, it was incredibly helpful!
Comment 9 Sam Lang 2012-02-01 09:49:47 UTC
This test is still failing for me on 2.10.8 (running Ubuntu 11.04 server edition on a 64-bit machine).
Comment 10 Sam Lang 2012-02-01 12:02:07 UTC
If I manually apply the patch you committed on 2011-11-23 to 2.10.6, then it runs fine.  It seems someone has introduced a new bug since then.
Comment 11 Sam Lang 2012-02-02 14:18:25 UTC
It appears to deadlock on master.
Comment 12 Rodrigo Kumpera 2012-02-02 15:23:23 UTC
It crashes on master, but for a pretty weird reason. I'm looking into it.
Comment 13 Zoltan Varga 2012-02-03 05:42:07 UTC
On Linux, it usually hangs during stop-the-world.
Comment 14 Sam Lang 2012-03-07 14:54:58 UTC
This test is still failing for me on master.  Any further developments?
Comment 15 Rodrigo Kumpera 2012-03-09 13:21:38 UTC
It really looks like a Linux-only problem. I can't repro on OS X, 32 or 64 bit.
Comment 16 Sam Lang 2012-03-23 11:55:33 UTC
I'm not sure if this is helpful, but specifying the nursery size affects the behavior dramatically.

For nursery sizes 1m, 2m, 4m, 8m, 16m, and 32m it usually hangs, but sometimes crashes.
For nursery sizes 64m, 128m, and 256m it usually finishes, but sometimes hangs.
For a nursery size of 512m it crashes immediately.
Comment 17 Sam Lang 2012-03-23 12:25:34 UTC
And if I set the major collector to copying, then it tends to crash more often on this assertion: "Assertion at sgen-debug.c:176, condition `!missing_remsets' not met"
Comment 18 Sam Lang 2012-04-19 15:30:57 UTC
I tried running the test against mono 2.11 and I got a stack trace that might be helpful: http://pastebin.com/faTy1UdT
Comment 19 Sam Lang 2012-04-23 12:00:35 UTC
I've spent some time digging into the code and the problem seems to occur whenever a new thread is registered immediately before stop_world() is called.

The garbage collector either hangs when it attempts to resume the new thread in restart_threads_until_none_in_managed_allocator(), because the thread was never suspended, or segfaults when it attempts to scan the new thread, because stack_start was not initialized yet.
Comment 20 Rodrigo Kumpera 2012-04-23 12:02:50 UTC
We pushed a few fixes WRT that before 2.11.2. It's worth a shot to try it.
Comment 21 Sam Lang 2012-04-23 12:51:40 UTC
I don't see 2.11.2, so I'm assuming you mean 2.11.1.  I just tried it out and I am no longer seeing any crashes, but it still hangs in the same spot.
Comment 22 Sam Lang 2012-04-23 13:53:52 UTC
The GC lock is held during the call to sgen_thread_register(), but is not held in register_thread().  So register_thread() is able to insert the new thread into the linked list of threads while stop_world() is in the middle of being executed.

First stop_world() iterates through the linked list once, suspending all of the threads in sgen_thread_handshake().  Then the new thread gets inserted into the linked list.  Then stop_world() iterates through the linked list again in restart_threads_until_none_in_managed_allocator(), assuming that all of the threads have been suspended (but the new thread never was).

It seems to me like stop_world() should hold a lock that prevents any modification of the linked list of threads, but the gc logic is a bit over my head so I'm not sure exactly.
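
The interleaving described in this comment can be replayed deterministically. The following is a minimal single-threaded sketch, not Mono's code: it models the thread list as a singly linked list, a first walk that suspends every thread, a registration that slips in afterwards, and a second walk that assumes everything is suspended. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy, single-threaded replay of the interleaving; all names are hypothetical. */
typedef struct toy_thread {
    struct toy_thread *next;
    bool suspended;
} toy_thread;

/* Models registration without the GC lock: push onto the head of the list. */
static void toy_register_thread(toy_thread **list, toy_thread *t)
{
    t->suspended = false;
    t->next = *list;
    *list = t;
}

/* First walk: suspend every thread that is in the list right now. */
static void toy_suspend_all(toy_thread *list)
{
    for (toy_thread *t = list; t != NULL; t = t->next)
        t->suspended = true;
}

/* Second walk: count threads the collector would wrongly assume are suspended. */
static int toy_count_never_suspended(const toy_thread *list)
{
    int n = 0;
    for (const toy_thread *t = list; t != NULL; t = t->next)
        if (!t->suspended)
            n++;
    return n;
}
```

Registering two threads, calling toy_suspend_all(), and then registering a third leaves toy_count_never_suspended() at 1: the late thread is visible to the second walk even though the first walk never touched it.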
Comment 23 Rodrigo Kumpera 2012-04-23 14:02:33 UTC
Your description makes sense and this might be a bug, I'll take a look and post a patch that might fix that later.
Comment 24 Zoltan Varga 2012-04-23 16:02:58 UTC
-> reopen.
Comment 25 Sam Lang 2012-04-27 15:30:03 UTC
Created attachment 1767 [details]
patch to fix bug

I locked the methods that insert and remove thread info into the linked list.  I then tried locking just the stop_world() function.  This stopped the deadlocking that was occurring, but did not fix the problem where the gc would attempt to scan a thread with stack_start still set to NULL.  So I instead tried locking all of the functions that call stop_world().  This appears to fix the issue.
Comment 26 Sam Lang 2012-04-27 16:34:33 UTC
Created attachment 1769 [details]
patch to fix bug

The test was hanging every once in a while.  I believe it was because I was acquiring locks in the wrong order.  Here's an updated patch.
Comment 27 Rodrigo Kumpera 2012-04-27 20:22:34 UTC
Thanks for the patch Sam, but unfortunately we can't use it.

The thread list is a lockless data structure and must remain so to avoid worse problems.

I believe it's simpler to just tag what threads have been seen first during STW and only process those.
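
One way to read the tagging suggestion is sketched below. This is an illustration under assumptions, not the patch that was eventually committed: the stop pass sets a per-thread flag on every thread it suspends, and the restart pass skips any thread without the flag. The field name seen_in_stw and all function names are hypothetical, and the sketch is single-threaded, so it ignores the atomic accesses and memory ordering a real lockless list would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: type, field, and function names are hypothetical, not Mono's. */
typedef struct tagged_thread {
    struct tagged_thread *next;
    bool suspended;     /* set while the thread is stopped for a collection */
    bool seen_in_stw;   /* hypothetical tag: set only by the stop pass */
} tagged_thread;

/* Models registration: a new thread is pushed at the head, unsuspended, untagged. */
static void tagged_register(tagged_thread **list, tagged_thread *t)
{
    t->suspended = false;
    t->seen_in_stw = false;
    t->next = *list;
    *list = t;
}

/* Stop pass: suspend and tag every thread currently in the list. Returns the count. */
static int tagged_stop_pass(tagged_thread *list)
{
    int n = 0;
    for (tagged_thread *t = list; t != NULL; t = t->next) {
        t->suspended = true;
        t->seen_in_stw = true;
        n++;
    }
    return n;
}

/* Restart pass: resume only tagged threads and clear the tag. Returns the count. */
static int tagged_restart_pass(tagged_thread *list)
{
    int n = 0;
    for (tagged_thread *t = list; t != NULL; t = t->next) {
        if (!t->seen_in_stw)
            continue;   /* registered after the stop pass: leave it alone */
        t->suspended = false;
        t->seen_in_stw = false;
        n++;
    }
    return n;
}
```

With two threads registered before tagged_stop_pass() and one registered after it, tagged_restart_pass() resumes exactly two threads and never touches the late one, without taking a lock on the list.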
Comment 28 Sam Lang 2012-04-28 16:21:56 UTC
Created attachment 1772 [details]
patch attempt #3

Alright, I took your advice and took another stab at patching it.  It seems to fix the issue so far.  I'll keep doing some testing though.
Comment 29 Rodrigo Kumpera 2012-04-30 13:23:07 UTC
Hi Sam,

Nice patch! I had a much more complex idea in mind, which was to have a per-collection version number, store it in that field and use that.

But thinking about it further, that's overengineering and your approach is simpler.

Please keep me posted and I'll integrate it or a variant that does fewer stores.
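
For comparison, the per-collection version number mentioned in this comment might look roughly like the sketch below. It is a guess at the shape of the idea, not code from any patch: the collector bumps a counter at the start of each stop-the-world, stamps that value into each thread it suspends, and later processes only threads whose stamp matches the current value. All names are hypothetical, the model is single-threaded, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: all names are hypothetical, not Mono's. */
typedef struct versioned_thread {
    struct versioned_thread *next;
    unsigned stopped_in_collection;  /* 0 = never stopped, else collection number */
} versioned_thread;

typedef struct {
    versioned_thread *threads;
    unsigned collection;             /* current collection number, starts at 0 */
} toy_gc;

/* Models registration: push at the head with a stamp no collection ever uses. */
static void versioned_register(toy_gc *gc, versioned_thread *t)
{
    t->stopped_in_collection = 0;
    t->next = gc->threads;
    gc->threads = t;
}

/* Stop pass: start a new collection and stamp every thread present. Returns the count. */
static int versioned_stop(toy_gc *gc)
{
    int n = 0;
    gc->collection++;                /* wraparound is ignored in this sketch */
    for (versioned_thread *t = gc->threads; t != NULL; t = t->next) {
        t->stopped_in_collection = gc->collection;
        n++;
    }
    return n;
}

/* Restart pass: count threads stamped by the current collection; nothing is cleared. */
static int versioned_restart(const toy_gc *gc)
{
    int n = 0;
    for (const versioned_thread *t = gc->threads; t != NULL; t = t->next)
        if (t->stopped_in_collection == gc->collection)
            n++;
    return n;
}
```

In this model a thread registered between versioned_stop() and versioned_restart() carries a stale stamp and is skipped, and the next collection picks it up normally. The restart pass needs no per-thread store to reset state, at the cost of the extra counter.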
Comment 30 Sam Lang 2012-05-01 15:06:13 UTC
I've run the test many times now, confirming the patch fixes the problem.

On an unrelated side note, I noticed something peculiar.  I was running the test in a loop while running the top command in another terminal and noticed that while the %CPU of the process was consistently around 170, the load average on the machine was near zero.  To experiment, I started running the test simultaneously in another terminal.  When I did this, the %CPU of both tests went to 100 (as expected), but load average jumped from near zero to greater than 2.  I'm not sure if this is meaningful at all, but I thought I would share in case it gave you guys some insight into how to optimize multi-threaded programs for multi-core devices or something.

This was on a dual core machine running a 64 bit version of Ubuntu.
Comment 31 Sam Lang 2012-05-21 15:39:11 UTC
Created attachment 1932 [details]
patch #4

I ran the test against master today and it failed on the assertion 'g_assert (info->doing_handshake)' in the sgen_thread_handshake method in sgen-os-posix.c.  Here's a small patch to prevent the gc from trying to restart a thread that wasn't stopped.
Comment 32 Rodrigo Kumpera 2012-06-19 13:26:36 UTC
I pushed your fix to master. Sorry for taking this long.