If I run my program using the boehm gc, with MONO_GC_PARAMS=major=copying, or with MONO_GC_PARAMS=wbarrier=remset, then it runs fine. However, if I run it using the sgen gc and let it use the default cardtable write barrier, it fails every time, anywhere from ~2 to 15 minutes after my program starts.
I am running Ubuntu 11.04 server edition on 64 bit machine.
I have tried using versions 2.10.2 and 2.10.6 of mono.
I've been digging around the source code to try and find the problem, but haven't had any luck so far. Any insight as to what could be the problem or pointers for how I could track the problem down would be greatly appreciated. Also, do you have any estimate of the performance benefit of cardtable over remset?
Here's a sample output if I run it with MONO_GC_DEBUG=check-at-minor-collections
Oldspace->newspace reference 0x7f8bf97812f0 at offset 2048 in object 0x7f8bf720a018 (.Slot) not found in remsets.
Oldspace->newspace reference 0x7f8bf97812f0 at offset 2056 in object 0x7f8bf720a018 (.Slot) not found in remsets.
* Assertion at sgen-gc.c:6809, condition `!missing_remsets' not met
at (wrapper managed-to-native) object.__icall_wrapper_mono_gc_alloc_vector (intptr,intptr,intptr) <0xffffffff>
at (wrapper alloc) object.AllocVector (intptr,intptr) <0xffffffff>
at System.Collections.Generic.HashSet`1<string>.InitArrays (int) <0x000bf>
at System.Collections.Generic.HashSet`1<string>.Init (int,System.Collections.Generic.IEqualityComparer`1<string>) <0x000a7>
at System.Collections.Generic.HashSet`1<string>..ctor () <0x0000c>
Debug info from gdb:
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
Created attachment 845 [details]
Test case which periodically causes SIGSEGV
I am able to (usually) reproduce the bug with this program.
Command: "mcs gc-test.cs && mono-sgen gc-test.exe"
If I set MONO_GC_PARAMS=wbarrier=remset it appears to run fine (albeit slowly).
I'll try to create a more minimal test.
Created attachment 846 [details]
More minimal test
Here's a more minimal version of the test.
After inserting calls to check_consistency() in various places, it seems that the call to scan_from_card_tables(nursery_start, nursery_next, &gray_queue) in sgen-gc.c is what puts mono in an inconsistent state.
Running the test case with the following command allows you to reproduce the bug much more consistently:
MONO_GC_DEBUG=9,check-at-minor-collections mono-sgen --debug gc-test.exe
It seems this bug is fixed in master. Specifically, applying this change to 2.10.6 fixes the problem: https://github.com/mono/mono/commit/4a8b716e0bc63b79a2f13596680566773de42d18#diff-0
The test also fails on a 32 bit ubuntu virtual machine and the patch referenced in the previous comment does not fix the problem.
I can still repro your bug on master.
Working on it.
Fixed on 2.10 and master.
Thanks for the test case, it was incredibly helpful!
This test is still failing for me on 2.10.8 (running Ubuntu 11.04 server edition on 64 bit machine)
If I manually apply the patch you committed on 2011-11-23 to 2.10.6, then it runs fine. It seems someone has introduced a new bug since then.
It appears to deadlock on master.
It crashes on master, but for a pretty weird reason. I'm looking into it.
On Linux, it usually hangs during stop-the-world.
This test is still failing for me on master. Any further developments?
It really looks like a Linux-only problem. I can't repro on OS X, either 32 or 64 bit.
I'm not sure if this is helpful, but specifying the nursery size affects the behavior dramatically.
For nursery sizes 1m, 2m, 4m, 8m, 16m, and 32m it usually hangs, but sometimes crashes.
For nursery sizes 64m, 128m, and 256m it usually finishes, but sometimes hangs.
For a nursery size of 512m it crashes immediately.
And if I set the major collector to copying, then it tends to crash more often on this assertion: "Assertion at sgen-debug.c:176, condition `!missing_remsets' not met"
I tried running the test against mono 2.11 and I got a stack trace that might be helpful: http://pastebin.com/faTy1UdT
I've spent some time digging into the code and the problem seems to occur whenever a new thread is registered immediately before stop_world() is called.
The garbage collector either hangs when it attempts to resume the new thread in restart_threads_until_none_in_managed_allocator(), because the thread was never suspended, or segfaults when it attempts to scan the new thread, because stack_start was not initialized yet.
We pushed a few fixes WRT that before 2.11.2. It's worth a shot.
I don't see 2.11.2, so I'm assuming you mean 2.11.1. I just tried it out and I am no longer seeing any crashes, but it still hangs in the same spot.
The GC lock is held during the call to sgen_thread_register(), but is not held in register_thread(). So register_thread() is able to insert the new thread into the linked list of threads while stop_world() is in the middle of being executed.
First, stop_world() iterates through the linked list once, suspending all of the threads in sgen_thread_handshake(). Then the new thread gets inserted into the linked list. Then stop_world() iterates through the linked list again in restart_threads_until_none_in_managed_allocator(), assuming that all of the threads have been suspended (but the new thread was never suspended).
It seems to me like stop_world() should hold a lock that prevents any modification of the linked list of threads, but the gc logic is a bit over my head so I'm not sure exactly.
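The interleaving described above can be replayed deterministically in a small single-threaded sketch. This is only an illustration under stated assumptions: `ThreadInfo`, `register_thread_sketch`, `handshake_pass`, `count_unsuspended`, and `replay_race` are hypothetical names, not the runtime's actual structures or functions, and real suspension is replaced by a boolean flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime's per-thread record. */
typedef struct ThreadInfo {
    struct ThreadInfo *next;
    bool suspended; /* set by the first (handshake) pass */
} ThreadInfo;

/* Push a thread onto the front of the list, as a late registration might. */
void register_thread_sketch(ThreadInfo **head, ThreadInfo *info)
{
    assert(head != NULL && info != NULL);
    info->suspended = false;
    info->next = *head;
    *head = info;
}

/* Pass 1: mark every thread currently in the list as suspended. */
int handshake_pass(ThreadInfo *head)
{
    int n = 0;
    for (ThreadInfo *t = head; t != NULL; t = t->next) {
        t->suspended = true;
        n++;
    }
    return n;
}

/* Pass 2 assumes every listed thread is suspended; count the ones that are not. */
int count_unsuspended(ThreadInfo *head)
{
    int n = 0;
    for (ThreadInfo *t = head; t != NULL; t = t->next) {
        if (!t->suspended)
            n++;
    }
    return n;
}

/* Replays the order of events: pass 1, late registration, pass 2. */
int replay_race(void)
{
    ThreadInfo a = {0}, b = {0}, late = {0};
    ThreadInfo *head = NULL;

    register_thread_sketch(&head, &a);
    register_thread_sketch(&head, &b);
    handshake_pass(head);                 /* suspends a and b */
    register_thread_sketch(&head, &late); /* arrives between the two passes */
    return count_unsuspended(head);       /* threads pass 2 would mishandle */
}
```

Calling `replay_race()` returns the number of threads that the second pass would wrongly treat as suspended; in this replay that is the single late-registered thread.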
Your description makes sense and this might be a bug, I'll take a look and post a patch that might fix that later.
Created attachment 1767 [details]
patch to fix bug
I locked the methods that insert and remove thread info into the linked list. I then tried locking just the stop_world() function. This stopped the deadlocking that was occurring, but did not fix the problem where the gc would attempt to scan a thread with stack_start still set to NULL. So I instead tried locking all of the functions that call stop_world(). This appears to fix the issue.
Created attachment 1769 [details]
patch to fix bug
The test was hanging every once in a while. I believe it was because I was acquiring locks in the wrong order. Here's an updated patch.
Thanks for the patch Sam, but unfortunately we can't use it.
The thread list is a lockless data structure and must remain so to avoid worse problems.
I believe it's simpler to just tag what threads have been seen first during STW and only process those.
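A minimal sketch of the tagging idea, assuming a per-thread flag that the first pass sets and later passes check. It is not the patch that was attached or merged: `TaggedThread`, `seen_in_stw`, and the `tagged_*` functions are hypothetical names, and suspension is modelled with a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread record with a "seen during stop-the-world" tag. */
typedef struct TaggedThread {
    struct TaggedThread *next;
    bool seen_in_stw; /* set only by the first pass */
    bool suspended;
} TaggedThread;

/* Registration never takes a lock here; it just pushes onto the list. */
void tagged_register(TaggedThread **head, TaggedThread *info)
{
    assert(head != NULL && info != NULL);
    info->seen_in_stw = false;
    info->suspended = false;
    info->next = *head;
    *head = info;
}

/* First pass: suspend each listed thread and tag it as seen. */
int tagged_suspend_all(TaggedThread *head)
{
    int n = 0;
    for (TaggedThread *t = head; t != NULL; t = t->next) {
        t->suspended = true;
        t->seen_in_stw = true;
        n++;
    }
    return n;
}

/* Later pass: restart only tagged threads and skip late arrivals. */
int tagged_restart_all(TaggedThread *head)
{
    int restarted = 0;
    for (TaggedThread *t = head; t != NULL; t = t->next) {
        if (!t->seen_in_stw)
            continue; /* registered after the first pass: never stopped */
        assert(t->suspended);
        t->suspended = false;
        t->seen_in_stw = false; /* clear the tag for the next collection */
        restarted++;
    }
    return restarted;
}

/* Same interleaving as in the report, with the tag check in place. */
int tagged_replay(void)
{
    TaggedThread a = {0}, b = {0}, late = {0};
    TaggedThread *head = NULL;

    tagged_register(&head, &a);
    tagged_register(&head, &b);
    tagged_suspend_all(head);        /* tags a and b */
    tagged_register(&head, &late);   /* arrives mid-STW, untagged */
    return tagged_restart_all(head); /* touches only a and b */
}
```

The list itself stays untouched by locks in this sketch; the only extra cost is the tag store in the first pass and the clearing store in the restart pass.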
Created attachment 1772 [details]
patch attempt #3
Alright, I took your advice and took another stab at patching it. It seems to fix the issue so far. I'll keep doing some testing though.
Nice patch! I had a much more complex idea in mind, which was to have a per collection version number, store it on that field and use that.
But thinking about it further, that's overengineering and your approach is simpler.
Please keep me posted and I'll integrate it or a variant that does fewer stores.
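For comparison, the per-collection version number idea mentioned above might look roughly like the following. This is a guess at the shape of that alternative, not code from the runtime: `VersionedThread`, `stopped_in_collection`, and the `versioned_*` functions are hypothetical, and collection numbers are assumed to start at 1 so that 0 can mean "never stopped".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread record stamped with the collection that stopped it. */
typedef struct VersionedThread {
    struct VersionedThread *next;
    unsigned long stopped_in_collection; /* 0 means never stopped */
} VersionedThread;

void versioned_register(VersionedThread **head, VersionedThread *info)
{
    assert(head != NULL && info != NULL);
    info->stopped_in_collection = 0;
    info->next = *head;
    *head = info;
}

/* First pass: stamp each listed thread with the current collection number. */
int versioned_suspend_all(VersionedThread *head, unsigned long collection)
{
    int n = 0;
    for (VersionedThread *t = head; t != NULL; t = t->next) {
        t->stopped_in_collection = collection;
        n++;
    }
    return n;
}

/* Later pass: process only threads stamped by this collection. */
int versioned_restart_all(VersionedThread *head, unsigned long collection)
{
    int restarted = 0;
    for (VersionedThread *t = head; t != NULL; t = t->next) {
        if (t->stopped_in_collection == collection)
            restarted++; /* no clearing store: the stamp simply goes stale */
    }
    return restarted;
}

/* Two collections; a thread registers in the middle of the first one. */
int versioned_replay(void)
{
    VersionedThread a = {0}, b = {0}, late = {0};
    VersionedThread *head = NULL;
    unsigned long collection = 1;

    versioned_register(&head, &a);
    versioned_register(&head, &b);
    versioned_suspend_all(head, collection);
    versioned_register(&head, &late); /* unstamped during collection 1 */
    int first = versioned_restart_all(head, collection);

    collection++;
    versioned_suspend_all(head, collection); /* now stamps all three */
    int second = versioned_restart_all(head, collection);

    return first * 10 + second; /* encodes both counts in one value */
}
```

In this replay the first collection restarts two threads and the second restarts three, so `versioned_replay()` returns 23. Compared with a boolean tag, the stamp does not need to be cleared afterwards, at the price of carrying a counter through every pass.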
I've run the test many times now, confirming the patch fixes the problem.
On an unrelated side note, I noticed something peculiar. I was running the test in a loop while running the top command in another terminal and noticed that while the %CPU of the process was consistently around 170, the load average on the machine was near zero. To experiment, I started running the test simultaneously in another terminal. When I did this, the %CPU of both tests went to 100 (as expected), but load average jumped from near zero to greater than 2. I'm not sure if this is meaningful at all, but I thought I would share in case it gave you guys some insight into how to optimize multi-threaded programs for multi-core devices or something.
This was on a dual core machine running a 64 bit version of Ubuntu.
Created attachment 1932 [details]
I ran the test against master today and it failed on the assertion 'g_assert (info->doing_handshake)' in the sgen_thread_handshake method in sgen-os-posix.c. Here's a small patch to prevent the gc from trying to restart a thread that wasn't stopped.
I pushed your fix to master. Sorry for taking this long.