Bug 29827 - Segfault running workflow application with many threads and timers on Ubuntu 14.04
Status: RESOLVED INVALID
Alias: None
Product: Runtime
Classification: Mono
Component: GC
Version: 4.0.0
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2015-05-07 08:08 UTC by blinke76
Modified: 2015-06-29 09:20 UTC

Tags:
Is this bug a regression?: ---
Last known good build:


Description blinke76 2015-05-07 08:08:57 UTC
An in-house workflow application aborts with a segfault under Ubuntu 14.04 with Mono 4.0.1 (Xamarin packages). After installing the debug packages and running the workflow again, the following stacktrace is generated:

Stacktrace:


Native stacktrace:

	mono() [0x4b1d6c]
	mono() [0x50833e]
	mono() [0x428bfd]
	/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7fbdc6371340]

Debug info from gdb:

  File "/usr/lib/debug/usr/bin/mono-sgen-gdb.py", line 34
    c = "\u%X".format (val)
                       ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
  File "/usr/lib/debug/usr/bin/mono-sgen-gdb.py", line 34
    c = "\u%X".format (val)
                       ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
[New LWP 30857]
[New LWP 30856]
[New LWP 30855]
[New LWP 30854]
[New LWP 30853]
[New LWP 30837]
[New LWP 30836]
[New LWP 30835]
[New LWP 30834]
[New LWP 30831]
[New LWP 30830]
[New LWP 30829]
[New LWP 30828]
[New LWP 30827]
[New LWP 30826]
[New LWP 30825]
[New LWP 30824]
[New LWP 30822]
[New LWP 30821]
[New LWP 30820]
[New LWP 30819]
[New LWP 30818]
[New LWP 30817]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
31	../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
  Id   Target Id         Frame 
  24   Thread 0x7fbdc2e94700 (LWP 30817) "Finalizer" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  23   Thread 0x7fbdc172b700 (LWP 30818) "Timer-Scheduler" 0x00007fbdc6370ee9 in __libc_waitpid (pid=pid@entry=30859, stat_loc=stat_loc@entry=0x7fbdc152a31c, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:40
  22   Thread 0x7fbdc10fb700 (LWP 30819) "mono" 0x00007fbdc6096b13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
  21   Thread 0x7fbdc10ba700 (LWP 30820) "IO Threadpool w" sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
  20   Thread 0x7fbdc1079700 (LWP 30821) "Threadpool moni" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  19   Thread 0x7fbdc1034700 (LWP 30822) "Threadpool work" sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
  18   Thread 0x7fbdc0e2f700 (LWP 30824) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  17   Thread 0x7fbdc0c2a700 (LWP 30825) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  16   Thread 0x7fbdc0a25700 (LWP 30826) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  15   Thread 0x7fbdc0820700 (LWP 30827) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  14   Thread 0x7fbdc061b700 (LWP 30828) "mono" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  13   Thread 0x7fbdc0416700 (LWP 30829) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  12   Thread 0x7fbdc0211700 (LWP 30830) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  11   Thread 0x7fbd8bfff700 (LWP 30831) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  10   Thread 0x7fbd8b9fc700 (LWP 30834) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  9    Thread 0x7fbd8b7f7700 (LWP 30835) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
  8    Thread 0x7fbd8bdfe700 (LWP 30836) "Threadpool work" sem_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S:101
  7    Thread 0x7fbd65b07700 (LWP 30837) "mono" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  6    Thread 0x7fbd65306700 (LWP 30853) "mono" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  5    Thread 0x7fbd64b05700 (LWP 30854) "mono" 0x00007fbdc608912d in poll () at ../sysdeps/unix/syscall-template.S:81
  4    Thread 0x7fbd64304700 (LWP 30855) "mono" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  3    Thread 0x7fbd63b03700 (LWP 30856) "mono" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  2    Thread 0x7fbd8bbfd700 (LWP 30857) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
* 1    Thread 0x7fbdc6ea37c0 (LWP 30816) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300 <suspend_signal_mask>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31

Thread 24 (Thread 0x7fbdc2e94700 (LWP 30817)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x000000000061e418 in mono_sem_wait (sem=sem@entry=0x944e60 <finalizer_sem>, alertable=alertable@entry=1) at mono-semaphore.c:101
#2  0x00000000005a19be in finalizer_thread (unused=<optimized out>) at gc.c:1074
#3  0x0000000000586cf8 in start_wrapper_internal (data=<optimized out>) at threads.c:664
#4  start_wrapper (data=<optimized out>) at threads.c:711
#5  0x0000000000623246 in inner_start_thread (arg=0x7fffd77c6780) at mono-threads-posix.c:92
#6  0x00007fbdc6369182 in start_thread (arg=0x7fbdc2e94700) at pthread_create.c:312
#7  0x00007fbdc609647d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 23 (Thread 0x7fbdc172b700 (LWP 30818)):
#0  0x00007fbdc6370ee9 in __libc_waitpid (pid=pid@entry=30859, stat_loc=stat_loc@entry=0x7fbdc152a31c, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:40
#1  0x00000000004b1df9 in mono_handle_native_sigsegv (signal=signal@entry=11, ctx=ctx@entry=0x7fbdc152ac40, info=info@entry=0x7fbdc152ad70) at mini-exceptions.c:2347
#2  0x000000000050833e in mono_arch_handle_altstack_exception (sigctx=sigctx@entry=0x7fbdc152ac40, siginfo=siginfo@entry=0x7fbdc152ad70, fault_addr=<optimized out>, stack_ovf=stack_ovf@entry=0) at exceptions-amd64.c:851
#3  0x0000000000428bfd in mono_sigsegv_signal_handler (_dummy=11, _info=0x7fbdc152ad70, context=0x7fbdc152ac40) at mini.c:6796
#4  <signal handler called>
#5  0x0000000000000000 in ?? ()
/build/buildd/gdb-7.7.1/gdb/dwarf2-frame.c:692: internal-error: Unknown CFI encountered.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) [answered Y; input not from terminal]
/build/buildd/gdb-7.7.1/gdb/dwarf2-frame.c:692: internal-error: Unknown CFI encountered.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) [answered Y; input not from terminal]

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

Aborted

-------

The error is reproducible; the failing thread is always the 'Timer-Scheduler' thread. It seems to keep running even though a garbage collection is active (most of the other mono threads are blocked in do_sigsuspend()), which results in the segfault.

Unfortunately, I cannot provide a test case of the application itself.
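
To check which threads are parked in do_sigsuspend() and which are not, the "info threads" table from a dump like the one above can be tallied by top frame. The following is only a minimal sketch: the regular expression, the function names top_frames and tally, and the SAMPLE rows (abbreviated from the dump) are illustrative assumptions, not part of Mono or gdb.

```python
import re
from collections import Counter

# One row of gdb's "info threads" table, in the layout seen in the dump:
#   [*] <id> Thread 0x... (LWP <n>) "<name>" [0x... in] <function> (...
# The pattern is an assumption based on that layout only.
ROW = re.compile(
    r'^\*?\s*(?P<id>\d+)\s+Thread\s+0x[0-9a-f]+\s+\(LWP\s+(?P<lwp>\d+)\)\s+'
    r'"(?P<name>[^"]*)"\s+(?:0x[0-9a-f]+\s+in\s+)?(?P<func>[^\s(]+)'
)

def top_frames(dump):
    """Return (thread name, top-frame function) pairs for every table row found."""
    rows = []
    for line in dump.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((m.group("name"), m.group("func")))
    return rows

def tally(dump):
    """Count how many threads share each top-frame function."""
    return Counter(func for _, func in top_frames(dump))

# Three rows abbreviated from the dump (argument lists shortened).
SAMPLE = '''
  24   Thread 0x7fbdc2e94700 (LWP 30817) "Finalizer" sem_wait () at sem_wait.S:85
  23   Thread 0x7fbdc172b700 (LWP 30818) "Timer-Scheduler" 0x00007fbdc6370ee9 in __libc_waitpid (pid=30859) at waitpid.c:40
* 1    Thread 0x7fbdc6ea37c0 (LWP 30816) "mono" 0x00007fbdc5fd3062 in do_sigsuspend (set=0x945300) at sigsuspend.c:31
'''

if __name__ == "__main__":
    for func, count in tally(SAMPLE).most_common():
        print(count, func)
```

Feeding the full table into tally() gives a quick view of how many threads are suspended and which ones are doing something else at the time of the crash.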
Comment 1 blinke76 2015-05-08 10:34:59 UTC
I did some tests with an older Mono version (3.2.8 from the Ubuntu repository) and finally came up with this stacktrace:

Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object
  at System.Threading.Timer+Scheduler.SchedulerThread () [0x0002f] in /build/buildd/mono-3.2.8+dfsg/mcs/class/corlib/System.Threading/Timer.cs:328 
  at System.Threading.Thread.StartInternal () [0x00016] in /build/buildd/mono-3.2.8+dfsg/mcs/class/corlib/System.Threading/Thread.cs:691 


Another round of googling turned up this thread:

http://forum.repetier.com/discussion/397/rh-started-crashing-after-10-minutes

The solution mentioned in that thread (downgrading to Ubuntu kernel 3.13.0-46) has kept the application from crashing so far.

So there's definitely something wrong in either the Ubuntu kernels after 3.13.0-46 or in the way the mono runtime interacts with the kernel.
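
For anyone who wants to check a machine against that observation, here is a rough sketch that compares the running kernel release with 3.13.0-46. The function names and the LAST_KNOWN_GOOD constant are made up for illustration, and the only fact taken from this report is that 3.13.0-46 did not crash for me; a newer release is a hint to investigate, not proof that the kernel is affected.

```python
import platform
import re

# Assumption taken from this report: 3.13.0-46 is the last Ubuntu kernel
# that was observed NOT to crash. Nothing here knows which builds are fixed.
LAST_KNOWN_GOOD = (3, 13, 0, 46)

def parse_release(release):
    """Turn '3.13.0-46-generic' into (3, 13, 0, 46); None if the shape differs."""
    m = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    return tuple(int(g) for g in m.groups()) if m else None

def is_newer_than_known_good(release):
    """True/False within the 3.13.0 series, None when no comparison is possible."""
    parsed = parse_release(release)
    if parsed is None:
        return None  # unrecognised release string
    if parsed[:3] != LAST_KNOWN_GOOD[:3]:
        return None  # different kernel series; build numbers are not comparable
    return parsed > LAST_KNOWN_GOOD

if __name__ == "__main__":
    release = platform.release()
    print(release, is_newer_than_known_good(release))
```

Returning None for other kernel series is deliberate: the build number after the dash only means something within one series, so the sketch refuses to guess.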
Comment 2 xamarin 2015-05-25 08:07:12 UTC
This might be related to bug #29462 I'm getting.

Also getting this sometimes:
16:41:18 [ERROR] Unhandled Exception: Object reference not set to an instance of an object
  at System.Threading.EventWaitHandle.Reset () [0x00000] in <filename unknown>:0
  at (wrapper remoting-invoke-with-check) System.Threading.EventWaitHandle:Reset ()
  at System.Threading.Timer+Scheduler.SchedulerThread () [0x00000] in <filename unknown>:0
  at System.Threading.Thread.StartInternal () [0x00000] in <filename unknown>:0

My kernel is 3.16.0-4-amd64.
Comment 3 Pavlos Touboulidis 2015-06-08 02:13:55 UTC
Similar to the OP, a long running socket based application has started crashing lately. I first noticed it using the Ubuntu 14.04 packages and it's still crashing with the Xamarin packages.

I don't have any debug packages installed but I'm running the application with the debug flag.

Sometimes it segfaults like this:

Native stacktrace:

        mono() [0x4b1fac]
        mono() [0x5085de]
        mono() [0x428f2d]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7feb9e32e340]

Other times there's no stacktrace.

The strangest thing is that most of the time I get NullReferenceExceptions on things that practically cannot be null. For example:

public bool AddSample(IDataSample sample)
{
    if (sample == null)
        throw new ArgumentNullException("sample");

    if (sample.SomeSimpleIntegerGetter == this.SomeIntegerAutoProperty)

and it crashes with an NRE on the last line above. The NRE location varies between runs, but it usually involves accessing a property of "this" or of another object like "sample" that just can't be null.
Comment 4 Pavlos Touboulidis 2015-06-08 02:24:47 UTC
Looks like this is the same as #29212, which points to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1450584, and the fix is on the way.
Comment 5 blinke76 2015-06-08 08:21:54 UTC
Thanks for the hint. I've updated a test system to the latest kernel and have already started some test runs.
Comment 6 blinke76 2015-06-29 09:20:12 UTC
Resolved by kernel update.