Bug 29212 - SIGSEGV on shutdown in threadpool stress test
Status: RESOLVED FIXED
Alias: None
Product: Runtime
Classification: Mono
Component: GC ()
Version: 3.12.0
Hardware: Other Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2015-04-17 15:39 UTC by Michael Thwaite
Modified: 2016-05-17 20:02 UTC

Tags:
Is this bug a regression?: ---
Last known good build:


Description Michael Thwaite 2015-04-17 15:39:35 UTC
The issue first raised in bug-18026 has resurfaced on kernels 3.13.0-48 and 3.13.0-49 on multi-CPU installations.

Running the stress test code:

wget https://github.com/mono/mono/raw/master/mono/tests/bug-18026.cs
mcs bug-18026.cs
mono bug-18026.exe

This causes a SIGSEGV on the current stable Ubuntu kernels 3.13.0-48 and 3.13.0-49 when run on multi-CPU VMs. The code executes fine on single-CPU configurations.

A table of the configurations that pass or fail is at https://docs.google.com/spreadsheets/d/1WF60r_92vCrK8AqciyZ9tb0kes8Z9u8GouXP-3cM-uc/edit?usp=sharing

I also tried Mono 4.0.0 alpha on Debian Jessie (kernel 3.16.0), but it didn't install cleanly, and the test code failed with SIGSEGV at any CPU count.
Comment 1 Alexander Kyte 2015-04-20 16:05:20 UTC
I cannot reproduce this on OSX or on a 4-core virtual machine running Ubuntu 14.10 with mono/master. Do you believe that the problem is isolated to a kernel minor version?
Comment 2 Michael Thwaite 2015-04-20 16:10:18 UTC
OSX is fine for me too. So too is Debian Wheezy 3.2.0-4.

Can you test with Ubuntu 3.13.0-49 - basically, the latest Ubuntu build - and Mono 3.12.1 from Debian Wheezy main? It's certainly an issue with the minor build, but my fear is that it's present on all versions after 3.13.0-46.
Comment 3 Michael Thwaite 2015-04-20 16:13:03 UTC
Ah, 14.04 LTS is the version that I'm experiencing the issue on.
Comment 5 Alexander Kyte 2015-04-20 16:49:54 UTC
I will try to reproduce on 14.04.
Comment 6 Alexander Kyte 2015-04-20 16:50:44 UTC
Are you on a 32 or 64 bit virtual machine?
Comment 7 Michael Thwaite 2015-04-20 17:03:06 UTC
64-bit.
Comment 8 Taloth Saldono 2015-04-21 15:01:27 UTC
I've run this in an Ubuntu 14.04 LTS 64-bit VirtualBox VM with Mono 3.12.1, booting into various kernel versions.

With two CPUs configured, I tested these kernels:
3.13.0-43 (ok)
3.13.0-44 (ok)
3.13.0-46 (ok)
3.13.0-48 (crashes)
3.13.0-49 (crashes)
3.16.0-64 (crashes, appears to crash more rapidly, but that could just be me)

Earlier I tried running the test with 'taskset 1 mono bug-18026.exe', which seemed to greatly reduce the chance of it crashing, but it still crashes eventually when looped: 'while taskset 1 mono bug-18026.exe; do echo -n +; done'
Unfortunately this means I don't have a workaround at the moment, other than telling people to boot an older kernel.
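
A counted variant of the loop above can make results easier to compare between kernels. This is only a sketch: run_until_fail is a hypothetical helper name (it is not part of mono or its test suite), and it assumes that a crash shows up as a non-zero exit status of the command being run.

```shell
# run_until_fail MAX CMD [ARGS...]
# Hypothetical helper: runs CMD up to MAX times and stops at the first
# non-zero exit status. Output of CMD itself is discarded.
# Prints "failed on run N (exit status S)" and returns 1 on a failure,
# or "MAX runs without failure" and returns 0 otherwise.
run_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        if "$@" >/dev/null 2>&1; then
            status=0
        else
            status=$?
        fi
        if [ "$status" -ne 0 ]; then
            echo "failed on run $i (exit status $status)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$max runs without failure"
    return 0
}

# Example invocations (assuming bug-18026.exe was built as in the description):
#   run_until_fail 200 mono bug-18026.exe
#   run_until_fail 200 taskset 1 mono bug-18026.exe
```

Since the crash is intermittent, a clean result for a small MAX says little; the run counts reported in this thread range from a crash on the first run to a crash only after dozens of runs.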

The number of users affected by these random SIGSEGV crashes is increasing, probably because they are being upgraded to those newer kernel versions.

I also just went back to 1 CPU configured and was rather surprised to see it crash again. Earlier tests with one CPU didn't fail. So I went through the kernel versions again, with the same results as above. That directly contradicts the earlier tests, but it still shows there's a problem.

I'm inclined to create a new vm from scratch to retest everything.
Comment 9 Taloth Saldono 2015-04-23 18:45:09 UTC
Recreated a VM from scratch.
VirtualBox, 64-bit, 2 CPUs with 3 GB of RAM, running on a Win7 64-bit host.
Ubuntu 14.04.2 server ISO.

Apparently this installed kernel 3.16.0.
- Updated all packages.
- Installed mono 3.2.8 from the ubuntu repo, test didn't crash at first.
- Compiled 3.10.x from git in parallel env, test crashed.
- Added official mono repo and upgraded to 3.12.1, test crashed.
- Compiled 3.6.x from git in parallel env, test crashed.
- Compiled 3.2.8 from git in parallel env, test crashed.
- Reinstalled mono 3.2.8 from ubuntu repo, test didn't crash at first.

I wondered why the Ubuntu repo version didn't crash, so I ran the test many times in a row. Finally, after 61 runs, it crashed. I'm not sure what's happening there. Ubuntu has a few patches on top of 3.2.8, which could be a factor, but it still crashes; it just takes a while longer.

- Installed 3.13.0-46 kernel.
- Ran test 90 times, didn't crash.
- Installed mono 3.12.1 again.
- Ran test 175 times, didn't crash.
- Rebooted kernel 3.13.0-48.
- Ran test, crashed on first run.

Anything else I can do to get more info?
Comment 10 Taloth Saldono 2015-04-28 10:56:04 UTC
I've run a complete git bisect on the kernel (8x 3h compile) to isolate the exact commit that triggers the problem:
https://github.com/torvalds/linux/commit/1ddf0b1b11aa8a90cef6706e935fc31c75c406ba

One user reported that kernel 4.0 no longer exhibits the error. I haven't been able to verify this, but these commits COULD be related (don't take my word for it):
https://github.com/torvalds/linux/commit/80f7fdb1c7f0f9266421f823964fd1962681f6ce
https://github.com/torvalds/linux/commit/0a4e6be9ca17c54817cf814b4b5aa60478c6df27

At this point I'm wondering why it affects mono so strongly vs other applications. And whether it's an issue with mono or the kernel.

In any case, we could really use some expert insight on this. I have no idea how to proceed.
Comment 11 Taloth Saldono 2015-05-01 17:15:57 UTC
Submitted a bug report to the ubuntu kernel bugtracker at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1450584
Comment 12 Taloth Saldono 2015-06-16 02:36:04 UTC
Kernel bug is fixed, so this is fixed as well.

Michael, can you verify and close this bug?
Comment 13 Michael Thwaite 2015-06-16 06:01:24 UTC
Will check that today. Thanks.
Comment 14 Michael Thwaite 2015-06-16 07:28:59 UTC
Linux version 3.13.0-55-generic (buildd@kapok) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #92-Ubuntu SMP Sun Jun 14 18:32:20 UTC 2015

Ten runs of bug-18026.exe - no errors.

Thanks all!
Comment 15 satadru 2015-07-16 14:53:35 UTC
Hello all, on an AMD64 system running Ubuntu vivid, the stress test runs fine in ubuntu kernel 4.0.8. Kernel from here: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.0.8-wily/

Using the 4.1.2 kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1.2-unstable/  it crashes thus:

Method (wrapper managed-to-managed) string:.ctor (char[],int,int) emitted at 0x40b5b1b0 to 0x40b5b1d9 (code length 41) 

[bug-18026.exe]
converting method (wrapper managed-to-native) object:__icall_wrapper_mono_gc_alloc_string (intptr,intptr,int)
Method (wrapper managed-to-native) object:__icall_wrapper_mono_gc_alloc_string (intptr,intptr,int) emitted at 0x40b5b1f0 to 0x40b5b284 (code length 148) [bug-18026.exe]

Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object
  at Test.Main () [0x00000] in <filename unknown>:0 
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
  at Test.Main () [0x00000] in <filename unknown>:0
Comment 16 Seif Attar 2015-08-03 18:08:43 UTC
We're also hitting this issue with linux kernel 4.1.2, tried with mono 3.10/3.12/4.0.2 and nightlies.

Should there be a new issue, or should this one be re-opened?

Taloth Saldono has managed to bisect to find the kernel commit that caused this:

http://lists.ximian.com/pipermail/mono-devel-list/2015-July/043135.html

https://github.com/torvalds/linux/commit/c70e1b475f37f07ab7181ad28458666d59aae634
Comment 17 renan jegouzo 2015-08-24 00:49:05 UTC
The test crashes on elementary OS Freya.

Mono JIT compiler version 4.0.3 (Stable 4.0.3.20/d6946b4 Tue Aug  4 09:43:57 UTC 2015)
kernel: 3.16.0-34-generic
Comment 18 Aymen 2015-10-14 16:38:13 UTC
We are experiencing this issue on Ubuntu 14.04.1 x86_64 with Linux 3.13.0-48-generic and mono 4.0.4 (Stable 4.0.4.1/5ab4c0d).
Comment 19 Aymen 2015-10-14 20:17:08 UTC
Issue resolved with revision 65 (Linux 3.13.0-65).
Comment 20 luto 2016-05-16 18:46:20 UTC
Hi all-

There's a report that this bug is back.  I'm the kernel maintainer of the code that is alleged to be at fault.

First, I'm 99.9% sure that your initial bisection is nonsense.  You're blaming:

https://github.com/torvalds/linux/commit/1ddf0b1b11aa8a90cef6706e935fc31c75c406ba

That fixed a code generation issue.  I guarantee it's not the root cause.  That commit fixed an infinite loop that caused hangs, and I can imagine it fixing incorrect time readings.

Can someone who can reproduce this answer some questions, please:

1. What are the contents of
/sys/devices/system/clocksource/clocksource0/current_clocksource

2. If you do:

 echo tsc >/sys/devices/system/clocksource/clocksource0/current_clocksource

can you still reproduce it?

3. I rewrote the whole vdso pvclock mess in Linux 4.5.  Does the bug
exist in Linux 4.5?

4. What is actually crashing?  The stack trace says:

Method (wrapper managed-to-managed) string:.ctor (char[],int,int)
emitted at 0x40b5b1b0 to 0x40b5b1d9 (code length 41)

[bug-18026.exe]
converting method (wrapper managed-to-native)
object:__icall_wrapper_mono_gc_alloc_string (intptr,intptr,int)
Method (wrapper managed-to-native)
object:__icall_wrapper_mono_gc_alloc_string (intptr,intptr,int)
emitted at 0x40b5b1f0 to 0x40b5b284 (code length 148) [bug-18026.exe]

Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance
of an object
  at Test.Main () [0x00000] in <filename unknown>:0
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException:
Object reference not set to an instance of an object
  at Test.Main () [0x00000] in <filename unknown>:0

What on earth does that mean?  Is mono crashing in the vdso?  Is mono
crashing because time went backwards?  Is mono crashing because its GC uses clock_gettime and has a race condition
that is or is not triggered depending on how long clock_gettime takes?

An actual stack dump of the segfault (the native stack, not what mono
thinks the stack is) would be nice.
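
For questions 1 and 2, a small sketch for collecting the clocksource information in one go. show_clocksource is a hypothetical helper name, and the sibling file available_clocksource is an addition not mentioned in this thread; the optional directory argument exists only so the function can be pointed somewhere other than the sysfs path quoted in question 1.

```shell
# show_clocksource [DIR]
# Hypothetical helper: prints the current clocksource and, if present,
# the list of available clocksources found under DIR.
# DIR defaults to the sysfs directory quoted in question 1.
show_clocksource() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    if [ ! -r "$dir/current_clocksource" ]; then
        echo "no clocksource information in $dir"
        return 1
    fi
    echo "current: $(cat "$dir/current_clocksource")"
    if [ -r "$dir/available_clocksource" ]; then
        echo "available: $(cat "$dir/available_clocksource")"
    fi
}

# Example invocation on the affected machine:
#   show_clocksource
# Switching the clocksource as in question 2 needs root, for example:
#   echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
```

Recording the output before and after the switch, together with the result of the stress test in each state, would answer both questions at once.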
Comment 21 Michael Thwaite 2016-05-16 19:09:42 UTC
I've tested ten iterations of the bug trigger tool without seeing any bug on:

Ubuntu 14.04 LTS Linux version 3.13.0-66-generic, Mono 4.2.3
Debian Jessie Linux version 3.16.0-4-amd64, Mono 4.2.3

Each machine had two CPUs per the original report.
Comment 22 luto 2016-05-16 21:47:49 UTC
Is that a machine that had the problem before?  Is it a KVM guest?
Comment 23 Taloth Saldono 2016-05-16 23:49:00 UTC
Hey luto,

I concur that the bisect commit mentioned isn't the direct cause; it's likely only triggering a bug in mono. But I did mention that here, on the Ubuntu bug tracker, and on the mono dev mailing list.

Last year when the problem resurfaced I actually ended up recompiling the kernel a couple of times, tested a dozen variations, and looked at what the commit actually did to the vdso assembly code.
My last mail on the subject is on the mono mailing list (http://lists.ximian.com/pipermail/mono-devel-list/2015-August/043181.html).
The last thing I found was that the commit caused vread_pvclock to be inlined, which in itself shouldn't affect anything, yet it did.
As you can see in that mail, I've been desperately hoping for mono experts to look into it, but I've gotten no responses since.

I'll check if I still have the virtualbox vm lying around, but it was vanilla ubuntu 14.04.2, so not hard to create again.

It may take me a couple of days before I have time to get a vm with kernel 4.5 working, but anyone else is welcome to have a crack at it. There are several guys who reproduced it on their system with those older kernels.
I had a modified test-case that reliably reproduced the crash in minutes, but often just seconds.
Comment 24 luto 2016-05-17 02:42:10 UTC
I have a wild guess: Mono incorrectly relies on vclock_gettime being a fairly strong memory barrier.

There were a bunch of kernel versions in which gcc generated correct but bizarre code due to the missing 'volatile': it hoisted an LSL instruction way out into the vclock_gettime common code.  LSL is microcoded and does some reads, and I wouldn't be at all surprised if it has some ordering properties.

The Intel sequence LFENCE;RDTSC is not much of a barrier.  LFENCE doesn't order anything that isn't already ordered memory-wise; but it *does* order RDTSC with respect to previous loads.

IOW, is it possible that you're doing something where:

LSL; LFENCE; RDTSC

gives you an ordering property that you're accidentally relying on, as does:

MFENCE; RDTSC

but the correct sequence:

LFENCE; RDTSC

does not?
Comment 25 Taloth Saldono 2016-05-17 20:02:48 UTC
That was my first thought as well, so in 2015 I tested adding a strict memory barrier to mono_100ns_ticks, the major location where mono directly calls clock_gettime, but it crashed just as easily. There are some other locations where clock_gettime is called, but nothing stood out, and I didn't investigate that further.

I just got my vm back up:
1) checked clocksource: tsc
2) n/a
3) I first retested on 3.13.0-48, just to check that the environment is still OK: it crashed in seconds.
Then I installed an Ubuntu 4.5.0 kernel package and ran over 200 cycles of the test case, so far with no crashes.

Linux testmono 4.5.0-040500-generic #201603140130 SMP Mon Mar 14 05:32:22 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I'll compile the kernel from the mainline sources and see how that goes.
Considering that in the past it seemed to depend on how the vdso got optimized by the compiler, I'll have to check a few variations.
That will take a while though, and I've got some other stuff on my plate.

PS: I'm just an app developer that got sucked into the rabbit hole. Got some experience with embedded and realtime systems, so assembly isn't foreign. But the only thing I know about mono and the vdso is what I learned during my investigation.