Bug 19037 - sgen hang/deadlock during nursery collection
Summary: sgen hang/deadlock during nursery collection
Status: RESOLVED FIXED
Alias: None
Product: Runtime
Classification: Mono
Component: GC
Version: 3.2.x
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2014-04-14 13:00 UTC by Eric Roller
Modified: 2014-08-11 18:37 UTC

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
thread apply all backtrace (39.04 KB, text/plain)
2014-04-14 13:00 UTC, Eric Roller
another gdb backtrace after the hung program is allowed to run for a while longer (39.04 KB, text/plain)
2014-04-14 13:05 UTC, Eric Roller
test case (54 bytes, text/plain)
2014-04-21 22:24 UTC, Eric Roller
SIGSEGV stack trace from test case (35.84 KB, text/plain)
2014-04-21 22:25 UTC, Eric Roller




Description Eric Roller 2014-04-14 13:00:31 UTC
Created attachment 6578 [details]
thread apply all backtrace

After running my program many times, it eventually hangs after ~24 hours. When it hangs, CPU usage is at ~700%, meaning 7 cores are in use on a machine with 32 cores and 126.0 GB of RAM. The mono process is using less than 0.1% of the total memory available. The GDB backtrace (attached) indicates sgen is performing a collection. This is on 64-bit Fedora with mono 3.2.8.
Comment 1 Eric Roller 2014-04-14 13:05:24 UTC
Created attachment 6579 [details]
another gdb backtrace after the hung program is allowed to run for a while longer

Here is another backtrace from all threads, taken after I allowed the program to continue running in this hung state. Note that the line numbers in sgen-nursery-allocator.c have changed, so presumably this is where the CPUs are spinning.
Comment 2 Eric Roller 2014-04-15 13:13:22 UTC
So far testing using mono 2.10.9 has produced no hangs after 120 runs. Mono 3.2.8 was hanging after ~80 runs. I will continue testing with mono 2.10.9 to be sure, but so far it seems like this bug does not exist in that version.
Comment 3 Eric Roller 2014-04-17 11:25:25 UTC
I have confirmed that indeed the bug does not exist in mono 2.10.9. My most recent test using mono 3.2.8 resulted in a hang/deadlock after only 9 runs of my program with each run taking about 10 min. This time the CPU usage is hovering at 500% rather than 600%.
Comment 4 Mark Probst 2014-04-17 15:59:13 UTC
Could you provide a test case?
Comment 5 Eric Roller 2014-04-21 22:24:04 UTC
Created attachment 6627 [details]
test case

After I created this test case and compiled with mcs I didn't have the hang issue, but instead it caused a SIGSEGV (after about 25 rounds). The stacktrace indicated sgen was performing a nursery collection. On my machine each round took about 4 minutes so you might have to wait a while to see the SIGSEGV.
Comment 6 Eric Roller 2014-04-21 22:25:18 UTC
Created attachment 6628 [details]
SIGSEGV stack trace from test case
Comment 7 Eric Roller 2014-04-21 22:29:15 UTC
To run the test case download the file SgenHang.zip from the link above and extract.

Compile the program:

mcs -r:MathNet.Numerics.IO.dll -r:MathNet.Numerics.dll Program.cs

Run the program:

mono Program.exe 0.cleaned

Wait for the SIGSEGV. The program will cycle for up to 120 rounds, but the SIGSEGV should happen before that.
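
Because the failure can take hours to show up, an unattended wrapper may help tell a crash from a hang. The following is only a sketch and is not part of the attached test case: the function name run_until_failure and its arguments are hypothetical, and it assumes GNU coreutils `timeout`, which exits with status 124 when the time limit is reached.

```shell
# Hypothetical helper (not from the attached test case): repeat a command
# until it fails or appears to hang.
# Usage: run_until_failure MAX_RUNS TIME_LIMIT_SECONDS COMMAND [ARGS...]
run_until_failure() {
    ruf_max=$1
    ruf_limit=$2
    shift 2
    ruf_i=1
    while [ "$ruf_i" -le "$ruf_max" ]; do
        ruf_status=0
        # GNU timeout exits with 124 if the command exceeds the time limit.
        timeout "$ruf_limit" "$@" || ruf_status=$?
        if [ "$ruf_status" -eq 124 ]; then
            echo "run $ruf_i: timed out (possible hang)"
            return 124
        elif [ "$ruf_status" -ne 0 ]; then
            # Any other non-zero status, including death by a signal.
            echo "run $ruf_i: exited with status $ruf_status"
            return "$ruf_status"
        fi
        ruf_i=$((ruf_i + 1))
    done
    echo "no failure in $ruf_max runs"
}
```

For example, `run_until_failure 1 36000 mono Program.exe 0.cleaned` would flag a single run of the test case that takes longer than ten hours. The ten-hour limit is an arbitrary assumption and should be adjusted to the machine.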
Comment 8 Eric Roller 2014-05-23 11:55:56 UTC
This bug is confirmed in mono 3.4.0 also. It happened after 72 cycles.
Comment 11 Eric Roller 2014-07-15 12:13:08 UTC
Any update on this bug? It has been confirmed in mono 3.6. It seems to need a machine with many cores (>8) to reproduce, and it might actually be two different issues (one for the hang and one for the SIGSEGV), both related to garbage collection.
Comment 12 Rodrigo Kumpera 2014-07-15 14:22:12 UTC
We haven't been able to reproduce it, which limits our ability to fix it.

Until we do, it will remain as is.
Comment 13 Rodrigo Kumpera 2014-07-15 14:25:02 UTC
We have committed a fix for the hang in mono/d2f66f2d9b4de1d2f79f029b7bec10581084601b

It's not part of 3.6.0, it will catch the next train.
Comment 14 Eric Roller 2014-07-15 15:42:35 UTC
Thanks Rodrigo. To troubleshoot the other issue (SIGSEGV during sgen collection) I can provide access to a VM with 32 cores. Contact me if you are interested in that route.
Comment 15 Mark Probst 2014-07-29 13:19:25 UTC
Eric, as far as I can tell the bug goes away if you use the environment variable

  MONO_GC_DEBUG=clear-at-gc

Please use that as a workaround for now.  We're looking into steps to fix the bug.
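
As a minimal sketch of applying this workaround to the test case from comment 7, assuming a POSIX shell and that Program.exe was built in the current directory (the file-existence guard is an addition for illustration, not part of the original instructions):

```shell
# Set the sgen debug option for this shell session and its child processes.
export MONO_GC_DEBUG=clear-at-gc

# Run the test case only if it has been built in the current directory.
if [ -f Program.exe ]; then
    mono Program.exe 0.cleaned
fi
```

The variable can also be set for a single invocation by prefixing the command, as in `MONO_GC_DEBUG=clear-at-gc mono Program.exe 0.cleaned`.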
Comment 16 Mark Probst 2014-08-05 15:37:01 UTC
Eric, can you confirm that this issue is fixed on mono master?
Comment 17 Eric Roller 2014-08-11 18:36:42 UTC
Yes, this looks to be fixed in mono master. No crashes/hangs after 120 runs of the test case (took about 6 hours on a 40 core machine). Thanks Mark!
Comment 18 Eric Roller 2014-08-11 18:37:33 UTC
No crashes/hangs after 120 runs of the test case