Bug 29299 - NRE in System.Threading.Timer+Scheduler.SchedulerThread
Summary: NRE in System.Threading.Timer+Scheduler.SchedulerThread
Status: RESOLVED NOT_REPRODUCIBLE
Alias: None
Product: Runtime
Classification: Mono
Component: GC ()
Version: 3.12.0
Hardware: PC Linux
: --- normal
Target Milestone: ---
Assignee: Mark Probst
URL:
Depends on:
Blocks:
 
Reported: 2015-04-22 06:00 UTC by Vsevolod Kukol
Modified: 2017-07-12 23:13 UTC (History)
9 users (show)

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
Thmer Scheduler NRE demo (2.35 KB, application/x-zip)
2015-04-22 06:00 UTC, Vsevolod Kukol
Details
dmidecode (18.56 KB, text/plain)
2015-05-12 11:54 UTC, Dan Shechter
Details
TimerTest.zip (4.58 KB, application/zip)
2016-01-19 09:21 UTC, Olaf Kober
Details


Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.


Please create a new report on GitHub or Developer Community with your current version information, steps to reproduce, and relevant error messages or log files if you are hitting an issue that looks similar to this resolved bug and you do not yet see a matching new report.

Related Links:
Status:
RESOLVED NOT_REPRODUCIBLE

Description Vsevolod Kukol 2015-04-22 06:00:35 UTC
Created attachment 10858 [details]
Thmer Scheduler NRE demo

There seems to be a hardly reproducible problem with System.Threading.Timer Scheduler. After a unspecific period of time the Scheduler thread throws an internal NullReferenceException. Sometimes it happens after some hours and sometimes after days of running without problems.

The project I'm working on is pretty large and there are many timers used in a high multithreaded context (with tasks, remoting, ...). Since the timers in the project are created and disposed dynamically, I thought that I missed some lock or that some timer gets Disposed too early. But in that case, the exception would be thrown inside a callback and not the Scheduler itself.
So I have created a little project to test and demonstrate the problem (attached). It simply starts 10 timers (keeping a state object with a back reference to the timer and a Stopwatch in a ConcurrentDictionary) and the callbacks print out the  timer uptime and restart them with a random due time (500ms-1500ms). If an unhandled Exception is thrown, the handler prints the current state of the state objects and all timers are referenced correctly.

Here is the output after the crash:

	UNHANDLED EXCEPTION RAISED BY TimerTest.exe
	AND TERMINATING!!!
	08: Timer is OK, Elapsed 13:35:18.9070343, 
	04: Timer is OK, Elapsed 13:35:18.9070629, 
	02: Timer is OK, Elapsed 13:35:18.9073530, 
	10: Timer is OK, Elapsed 13:35:18.9070693, 
	06: Timer is OK, Elapsed 13:35:18.9070759, 
	01: Timer is OK, Elapsed 13:35:18.9275112, 
	09: Timer is OK, Elapsed 13:35:18.9070850, 
	05: Timer is OK, Elapsed 13:35:18.9070914, 
	03: Timer is OK, Elapsed 13:35:18.9070979, 
	07: Timer is OK, Elapsed 13:35:18.9070991, 
	System.NullReferenceException: Object reference not set to an instance of an object
	  at System.Threading.Timer+Scheduler.SchedulerThread () [0x0002f] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/Timer.cs:328 
	  at System.Threading.Thread.StartInternal () [0x00016] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/Thread.cs:691 
	[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
	  at System.Threading.Timer+Scheduler.SchedulerThread () [0x0002f] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/Timer.cs:328 
	  at System.Threading.Thread.StartInternal () [0x00016] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/Thread.cs:691 

(the ManualResetEvent should never be null there: https://github.com/mono/mono/blob/mono-3.12.0-branch/mcs/class/corlib/System.Threading/Timer.cs#L328)

It seems to highly depend on the hardware, how long the test can run without problems (runs longer with more cores and/or larger memory). Sometimes the NRE occurs somewhere else inside SchedulerThread or even deeper in TimerCB, but never in a callback. In really rare situations the NRE comes from ManualResetEvent deeper down the stack. It looks like the Scheduler instance or its members get collected by gc, or the ManualResetEvent looses its native pointer, or even both (?).

I have tested this with mono 3.12.1 and 3.10 from http://origin-download.mono-project.com/repo/debian/on Ubuntu 12.04, 14.04 and 14.10 and with the old mono from official Ubuntu repositories (3.2.8+dfsg-4ubuntu2.1).

Any suggestions what the problem could be or maybe how I could provide more information?

Thanks!
Comment 1 Vsevolod Kukol 2015-04-22 07:11:17 UTC
And here is an other case when the NRE comes from EventWaitHandle, which is really rare (1 of 10 maybe):

[NullReferenceException] Object reference not set to an instance of an object
  at System.Threading.EventWaitHandle.Reset () [0x00012] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/EventWaitHandle.cs:198    at (wrapper remoting-invoke-with-check) System.Threading.EventWaitHandle:Reset ()   at System.Threading.Timer+Scheduler.SchedulerThread () [0x0002f] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/Timer.cs:328    at System.Threading.Thread.StartInternal () [0x00016] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/Thread.cs:691 
[ERROR] FATAL UNHANDLED EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object
  at System.Threading.EventWaitHandle.Reset () [0x00012] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/EventWaitHandle.cs:198 
  at (wrapper remoting-invoke-with-check) System.Threading.EventWaitHandle:Reset ()
  at System.Threading.Timer+Scheduler.SchedulerThread () [0x0002f] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/Timer.cs:328 
  at System.Threading.Thread.StartInternal () [0x00016] in /tmp/buildd/mono-3.12.1/mcs/class/corlib/System.Threading/Thread.cs:691
Comment 2 Vsevolod Kukol 2015-04-24 03:04:56 UTC
After running some more tests with different GC settings (MONO_GC_PARAMS, --server option) I'm pretty sure that this is an SGen bug. With BOEHM everything runs as expected.
Comment 3 Vsevolod Kukol 2015-05-04 04:12:40 UTC
I have updated an Ubuntu 14.04 test machine to mono 4.0.1 (official Xamarin repository) and there is a similar error with SGen:

UNHANDLED EXCEPTION RAISED BY TimerTest.exe
AND TERMINATING!!!
01: Timer is OK, Elapsed 3.17:57:56.0869320, 
02: Timer is OK, Elapsed 3.17:57:56.0824996, 
03: Timer is OK, Elapsed 3.17:57:56.0824368, 
04: Timer is OK, Elapsed 3.17:57:56.0824541, 
05: Timer is OK, Elapsed 3.17:57:56.0824807, 
06: Timer is OK, Elapsed 3.17:57:56.0825017, 
07: Timer is OK, Elapsed 3.17:57:56.0825215, 
08: Timer is OK, Elapsed 3.17:57:56.0825399, 
09: Timer is OK, Elapsed 3.17:57:56.0825598, 
10: Timer is OK, Elapsed 3.17:57:56.0825766, 
System.NullReferenceException: Object reference not set to an instance of an object
  at System.Threading.Timer+Scheduler.ShrinkIfNeeded (System.Collections.Generic.List`1 list, Int32 initial) [0x00000] in <filename unknown>:0 
  at System.Threading.Timer+Scheduler.SchedulerThread () [0x00000] in <filename unknown>:0 
  at System.Threading.Thread.StartInternal () [0x00000] in <filename unknown>:0
Comment 4 Mark Probst 2015-05-05 13:58:57 UTC
If you shorten the timeouts and/or make the nursery smaller, does the bug occur more or less often?  I tried doing that in the hope that it would, but no luck after 17 hours.  I reduced the timeouts by a factor of 100 and used

    MONO_GC_PARAMS=nursery-size=32k

Will try an overnight run with the defaults.
Comment 5 Vsevolod Kukol 2015-05-05 16:08:19 UTC
Hi,
I've played around with different nursery sizes (from 64k up to 64m as proposed in an other bug report here: http://opensimulator.org/mantis/view.php?id=7519) and different Timeouts (100ms-10000ms) and concurrent timers (tested with up to 100 timers at the same time). It doesn't look like it would depend on one of these variables.

The example above (comment 3) was running 3 days and almost 18h on a i3 System with 4 GB memory. The first report comes from the same machine and crashed after 13h and 35min (only the mono version was different).

On an other machine the test crashed more often with the same default parameters and mono 3.2 and it is an old Atom N450 with 2GB. So it looks like it depends on the amount of concurrent threads supported by the CPU, or maybe not.
Comment 6 Vsevolod Kukol 2015-05-11 06:33:11 UTC
I have an other symtom here now with SGen and 100 parallel timers after 5 days running the test:


Timer 01 5.10:24:44.3530696
[...]
* Assertion: should not be reached at monitor.c:721

Stacktrace:

  at <unknown> <0xffffffff>
  at System.Threading.Timer/Scheduler.SchedulerThread () <0x00097>
  at System.Threading.Thread.StartInternal () <0x000c2>
  at (wrapper runtime-invoke) object.runtime_invoke_void__this__ (object,intptr,intptr,intptr) <IL 0x0005e, 0xffffffff>

Native stacktrace:

	mono() [0x4b1d6c]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7f97eaa00340]
	/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f97ea661cc9]
	/lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f97ea6650d8]
	mono() [0x629079]
	mono() [0x629287]
	mono() [0x6293d6]
	mono() [0x5a44ff]
	[0x40d9fb6e]

Debug info from gdb:

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Vorgang nicht zulässig.
No threads.

=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================
Comment 7 Dan Shechter 2015-05-11 08:59:13 UTC
Just adding my 2 cents here...

I've seen this happen all the way back to mono 3.2.8.

In a weird coincidence, I am using mono 3.2.8 on two machines with the same application, one of them was recently upgraded to ubuntu 14.04, the older one still running on ubuntu 12.04

The only change was the OS version (and libc of course etc.).

One of the instances, on ubuntu 14.04 crashed quite frequently (50+ threads).
The 12.04 machine is completely stable for MONTHS...
Comment 8 Mark Probst 2015-05-11 17:57:13 UTC
I've been running this now in a Ubuntu 14.10 64-bit VM for six days and it hasn't crashed yet.
Comment 9 Vsevolod Kukol 2015-05-12 03:11:18 UTC
@Mark: let it run longer :) it's really strange, on my i7 notebook (8GB, Ubuntu 15.04, mono 4.0) I was not able to reproduce this bug in a controlled environment (not even in a VirtualBox machine), since I can not let it run longer then several days without shutting it down. But it occurred some times while I was working on something else, so I was not able to get the trace.
Maybe you could try it with QEMU and some older CPU emulation, or some older/slower pc.

@Dan: well, I wasn't able to narrow it down to a specific hardware or os configuration. It happens on very different systems with different mono versions, kernels, and Ubuntus (12.05 - 14.10). And do you mean threads ("50+") or timers? All timers share one single scheduler thread. Only the callbacks are delegated to the thread pool, which has by default one thread per Cpu core (or even less) on Linux. On my system the timer test has always 5-7 threads (3-5x pool + main + scheduler) with 100 active timers.
Comment 10 Dan Shechter 2015-05-12 11:54:18 UTC
Created attachment 11167 [details]
dmidecode
Comment 11 Dan Shechter 2015-05-12 11:58:17 UTC
I've attached "dmidecode 17" output of the machine.

In my case, there are exactly 64 threads, each with a 1 minute timer that is waking up to write a log message.

These 64 threads are pushing data into a blocking collection (one collection per such thread) and there is another set of 64 threads, trying to dequeue from its collection and do something with the data passed to it.

When running with --gc=boehm there is no crash.
When defaulting to sgen I usually get a crash within a few minutes, longest I've seen it work was 2 hours, and that we really not typical.

This is a private machine, so I can't open it up too freely, but if someone from the mono team wants to access it "somehow", please get in touch with me, I don't mind running it together, but it would have to be during the weekend...
Comment 12 David Evans 2015-05-14 14:54:17 UTC
Is this a duplicate of bug 29212?

I noticed this in my automated tests recently after I upgraded to latest Ubuntu patches. Occurred on my test suite every time, running what was top of tree around mono 3.10, once I upgraded my Ubuntu 14.04 kernel to version 3.13.0-52-generic. Sometimes I would get a sig fault and sometimes I would get NullReference exceptions in the SchedulerThread but in different spots.

Same automated tests were passing like a charm on another x86_64 box running mono 3.10.0 with Ubuntu 14.04 kernel 3.13.0-44-generic, however.

I tried out top of tree mono this week and the problem reproduced.
Then I downgraded to kernel 3.13.0-46 and the problem has not recurred.
Comment 13 Vsevolod Kukol 2015-05-14 17:26:23 UTC
I don't think so. The bug 29212 seems to have something to do with "newer" kernels (various reports about 3.13+?, see also https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1450584, maybe a fix?). The Timer NRE occurs on older kernels, too (e.g. with 2.6.32 with lucid).
Comment 14 Earl Camp 2015-06-13 12:31:11 UTC
Can you tell us exactly which GC settings/MONO_GC_PARAMS you used to conclude it was a problem with SGen? I'm experiencing a similar bug and want to make sure it's the same thing (and employ the workaround if possible)

Thanks.
Comment 15 Dan Shechter 2015-06-14 09:05:37 UTC
--gc=bohem kills the problem.
It never happens with it.
Comment 16 Vsevolod Kukol 2015-06-19 06:56:58 UTC
It looks like this is fixed with #29212. On an Ubuntu 14.04.2 test system the bug disappeared with the last kernel updates (excessively tested with recent 3.19.0-20 vivid backport kernel).

On our old lucid systems our software seems to have stopped crashing with one of the last mono 4.0.1 updates and/or the latest 2.6.32-02063267-generic kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.32.67-lucid/. I'll have to do some more testing here, since I have not tracked the upgrade paths on working machines.
Comment 17 Mark Probst 2015-09-23 12:08:33 UTC
Any update on this?
Comment 18 Vsevolod Kukol 2015-09-23 12:47:03 UTC
Nothing new from my side. We have updated the Kernel on all testing and customer machines last month and the issue was fixed. I didn't test anything with 4.1+ Kernel (according to last comments on https://bugzilla.xamarin.com/show_bug.cgi?id=29212, the issue is still there) since I don't work on that project any more ;)

So for now I would propose to mark it as duplicate of #29212.

I'll ping my old team, maybe they can tell more.
Comment 19 Olaf Kober 2016-01-19 09:21:24 UTC
I'm also able to observe some nasty NRE when calling Change() on a System.Threading.Timer and I was able to put together a small sample application that reproduces this behavior, at least on our target systems (embedded systems). Find the solution attached (TimerTest.zip).

Our target systems are iMX6-based embedded systems with a custom-build linux and Mono 4.2.1.60. We have two flavors of this system, a single core and a dual core variant. On the single core machine it is pretty easy to reproduce the NRE, on the dual core machine it happens occasionally, much less often, but it happens too.

On the other hand, I was also able to reproduce it on my personal Pi 2 with latest Raspbian and Mono 4.2.1.102. This is a four core machine and here it was also pretty easy to generate the NREs.

The attached sample application, starts a new non-recurring timer. In the timer callback we call some other callbacks that took a random duration to execute (simulated). We then calculate the due time to the next second tick and restart the timer again.

The output from our embedded systems looks like:

---------
0:   10:08:26.823
c
0:   10:08:26.998
C
c
C
0:   10:08:27.049
c
C
0:   10:08:28.001
c
C
0:   10:08:29.003
c
C
d
D
---------
c
X: System.NullReferenceException: Object reference not set to an instance of an object
   at TimerTest.DefaultTimeProvider._RestartTimer (TimeSpan dueTime) in <filename unknown>:line 0
p
P
d
D
---------
2:   10:08:32.514
c
C
2:   10:08:33.001
c
C
♥

There are some curious things to note.

- If I increase the Thread.Sleep(3000) in Program.cs to say 10000 ms, I can no longer reproduce the behavior.

- If I allocate some more objects, for example output more diagnostic information onto the console, I'm no longer able to reproduce it.

- If I add another Thread.Sleep(3000) after provider.Dispose() in Program.cs to give the current timer some time to dispose, before the next timer is started, I no longer see the NRE. My feeling is that this comes from exactly this situation. One timer is disposed, while another timer is started.

- If I let my program run longer, the NRE doesn't show up anymore. If I get a NRE then it happens in the first or second loop of _RunTimer().


Hopefully, this helps someone.
Comment 20 Olaf Kober 2016-01-19 09:21:53 UTC
Created attachment 14625 [details]
TimerTest.zip
Comment 21 Olaf Kober 2016-02-12 09:23:11 UTC
I have to say that my test sample that I wrote to reproduce my initial issue is *incorrect* and contains a race condition that shows a NRE, but this has nothing to do with a potential Mono bug.

It's a race condition in the test code itself. Field mTimer might not be set before the first initial timer callback is called (this is invoked immediately by specifying 0).
Comment 22 Ludovic Henry 2017-07-12 23:13:39 UTC
Can you still reproduce with latest version of mono? If you can, please reopen and provide a complete repro case. Thank you