Bug 42665 - Random hangs / deadlock when running mono
Summary: Random hangs / deadlock when running mono
Status: RESOLVED NOT_REPRODUCIBLE
Alias: None
Product: Runtime
Classification: Mono
Component: General ()
Version: 4.4.0 (C7)
Hardware: PC Linux
: Normal normal
Target Milestone: Future Cycle (TBD)
Assignee: Ludovic Henry
URL:
Depends on:
Blocks:
 
Reported: 2016-07-19 16:07 UTC by Glover George
Modified: 2017-04-27 15:59 UTC (History)
5 users (show)

Tags:
Is this bug a regression?: ---
Last known good build:

Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.


Please create a new report on GitHub or Developer Community with your current version information, steps to reproduce, and relevant error messages or log files if you are hitting an issue that looks similar to this resolved bug and you do not yet see a matching new report.

Related Links:
Status:
RESOLVED NOT_REPRODUCIBLE

Description Glover George 2016-07-19 16:07:43 UTC
We have been fighting this bug now for several versions of mono 4.x.  It continues to exist in the master branch as of July 18th, 2016.  We have tried for 6 months to narrow down our code to what is causing the problem, as well as trying to create reproduction code to no avail.  It is random.  It occurs in about 10% of our runs, when running thousands of runs of the same code. We initially thought this might be a shutdown issue, but I'm not sure anymore.  I am willing to try any suggestions, but I'm not familiar enough with the code to make rapid progress at the moment.  

Can someone explain to me what each of the three threads are? (GC, Finalizer, Main thread?)

NOTE: The last two lines of the STDOUT listed below :
Shutting down finalizer thread timed out.
* Assertion at gc.c:867, condition `finalizer_thread_exited' not met
..  This may be the same issue as bug id 41833 and should be fixed in master. This output if from 4.4.0.182.

See below:
OS: Suse EL 11 SP4
GCC 4.9.3
Mono 4.4.0.182 (also exists 4.2.x, 4.3.x, 4.4.x, and git-master)

I have included the stack trace, ps, and strace.  I have also included the CUE card that gets output.  

Ps listing : (not sure why it's not shown below, but the state is Sl (interruptible sleep, multithreaded)
-----------------------------------------------------------------------------------------------
ggeorge  64515 64382 64515 31    3 17:06 ?        00:08:37 mono --runtime=v4.0.30319 mycode.exe
ggeorge  64515 64382 64574  0    3 17:06 ?        00:00:00 mono --runtime=v4.0.30319 mycode.exe
ggeorge  64515 64382 64596  0    3 17:06 ?        00:00:06 mono --runtime=v4.0.30319 mycode.exe
ggeorge  67340 67160 67340  0    1 17:33 pts/0    00:00:00 grep mono

strace of hung process/threads:
-----------------------------------------------------------------------------------------------
ggeorge@r8i2n17:~> strace -p 64515
Process 64515 attached - interrupt to quit
futex(0x7fffedb05968, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...>
Process 64515 detached
ggeorge@r8i2n17:~> strace -p 64382
Process 64382 attached - interrupt to quit
wait4(-1, ^C <unfinished ...>
Process 64382 detached
ggeorge@r8i2n17:~> strace -p 64574
Process 64574 attached - interrupt to quit
futex(0x97f604, FUTEX_WAIT_PRIVATE, 523, NULL^C <unfinished ...>
Process 64574 detached
ggeorge@r8i2n17:~> strace -p 64596
Process 64596 attached - interrupt to quit
rt_sigsuspend(~[PWR RTMIN RT_1 RT_5]^C <unfinished ...>
Process 64596 detached
ggeorge@r8i2n17:~> 

gdb backtrace of hung process/threads:
-----------------------------------------------------------------------------------------------
Thread 3 (Thread 0x7fffebfff700 (LWP 64574)):
#0  0x00007fffecd3d66c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000060db93 in mono_os_cond_wait (mutex=0x97f640 <lock>, cond=0x97f600 <work_cond>) at ../../mono/utils/mono-os-mutex.h:105
#2  thread_func (thread_data=0x0) at sgen-thread-pool.c:118
#3  0x00007fffecd39806 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffec87d9bd in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fffd8093700 (LWP 64596)):
#0  0x00007fffec7d1c8b in sigsuspend () from /lib64/libc.so.6
#1  0x000000000063e3e6 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>, context=0x7fffd808ff80) at mono-threads-posix-signals.c:209
#2  <signal handler called>
#3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2
#5  0x00007fffed8ecca1 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#7  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#8  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#9  0x00007fffecf4ff9b in dlopen_doit () from /lib64/libdl.so.2
#10 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#11 0x00007fffecf5033c in _dlerror_run () from /lib64/libdl.so.2
#12 0x00007fffecf4ff01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#13 0x00000000006327f5 in mono_dl_open_file (file=<optimized out>, flags=<optimized out>) at mono-dl-posix.c:67
#14 0x0000000000632029 in mono_dl_open (name=name@entry=0x1c604d0 "/p/home/apps/unsupported/NAVAIR/build/mono-4.4.0/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", flags=flags@entry=1, 
    error_msg=error_msg@entry=0x7fffd8090e80) at mono-dl.c:150
#15 0x000000000054cf90 in cached_module_load (name=name@entry=0x1c604d0 "/p/home/apps/unsupported/NAVAIR/build/mono-4.4.0/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", err=err@entry=0x7fffd8090e80, 
    flags=1) at loader.c:1402
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#16 0x000000000054e218 in mono_lookup_pinvoke_call (method=method@entry=, exc_class=exc_class@entry=0x7fffd8091f00, exc_arg=exc_arg@entry=0x7fffd8091f08) at loader.c:1645
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#17 0x00000000005641f6 in mono_marshal_get_native_wrapper (method=method@entry=, check_exceptions=check_exceptions@entry=1, aot=0) at marshal.c:7396
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#18 0x00000000004602cf in mono_method_to_ir (cfg=cfg@entry=0x1c5f920, method=method@entry=, start_bblock=<optimized out>, start_bblock@entry=0x0, end_bblock=<optimized out>, end_bblock@entry=0x0, return_var=return_var@entry=0x0, 
    inline_args=inline_args@entry=0x0, inline_offset=0, is_virtual_call=0) at method-to-ir.c:9280
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#19 0x000000000050acf9 in mini_method_compile (method=method@entry=, opts=opts@entry=370239999, domain=domain@entry=0x9c91f0, flags=flags@entry=JIT_FLAG_RUN_CCTORS, parts=parts@entry=0, aot_method_index=aot_method_index@entry=-1)
    at mini.c:3608
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#20 0x000000000050c4d5 in mono_jit_compile_method_inner (method=method@entry=, target_domain=target_domain@entry=0x9c91f0, opt=opt@entry=370239999, jit_ex=jit_ex@entry=0x7fffd8092678) at mini.c:4263
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#21 0x00000000004287f8 in mono_jit_compile_method_with_opt (method=method@entry=, opt=370239999, ex=ex@entry=0x7fffd8092678) at mini-runtime.c:1952
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#22 0x0000000000428fbb in mono_jit_compile_method (method=) at mini-runtime.c:2008
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#23 0x00000000004aeb43 in common_call_trampoline_inner (regs=regs@entry=0x7fffd8092890, code=code@entry=0x401c3ee4 "\270\001", m=m@entry=, vt=vt@entry=0x0, vtable_slot=<optimized out>, vtable_slot@entry=0x0)
    at mini-trampolines.c:694
Python Exception <type 'exceptions.ValueError'> zero length field name in format: 
#24 0x00000000004af2a0 in common_call_trampoline (regs=0x7fffd8092890, code=0x401c3ee4 "\270\001", m=, vt=0x0, vtable_slot=0x0) at mini-trampolines.c:808
#25 0x0000000040000289 in ?? ()
#26 0x0000000000a25145 in ?? ()
#27 0x00000000401c3ee4 in ?? ()
#28 0x0000000000a25145 in ?? ()
#29 0x00007fffd80929d0 in ?? ()
#30 0x00007fffd8092890 in ?? ()
#31 0x00007fffd8093698 in ?? ()
#32 0x00007fffd80d6188 in ?? ()
#33 0x00007fffd80d61a0 in ?? ()
#34 0x00007fffd8092af0 in ?? ()
#35 0x0000000000000003 in ?? ()
#36 0x00007fffd80929d0 in ?? ()
#37 0x00007fffd8092a60 in ?? ()
#38 0x0000000000000001 in ?? ()
#39 0x00007fffd80d6188 in ?? ()
#40 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7fffedae7780 (LWP 64515)):
#0  0x00007fffecd40324 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffecd3b684 in _L_lock_1091 () from /lib64/libpthread.so.0
#2  0x00007fffecd3b4f6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffec8b5530 in do_dlopen () from /lib64/libc.so.6
#5  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffec8b55e5 in dlerror_run () from /lib64/libc.so.6
#7  0x00007fffec8b56d7 in __libc_dlopen_mode () from /lib64/libc.so.6
#8  0x00007fffec8902e5 in init () from /lib64/libc.so.6
#9  0x00007fffecd3ed03 in pthread_once () from /lib64/libpthread.so.0
#10 0x00007fffec89043c in backtrace () from /lib64/libc.so.6
#11 0x00000000004ad425 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>, info=<optimized out>) at mini-exceptions.c:2309
#12 <signal handler called>
#13 0x00007fffec7d1875 in raise () from /lib64/libc.so.6
#14 0x00007fffec7d2e51 in abort () from /lib64/libc.so.6
#15 0x00000000006468be in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, message=0x1c70ac0 "suspend_thread suspend took 2000 ms, which is more than the allowed 2000 ms", unused_data=0x0) at goutput.c:233
#16 0x00000000006466ab in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=0x703240 "suspend_thread suspend took %d ms, which is more than the allowed %d ms", args=0x7fffffffc588) at goutput.c:113
#17 0x0000000000646761 in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR, format=0x703240 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:123
#18 0x000000000063b77f in mono_threads_wait_pending_operations () at mono-threads.c:236
#19 0x000000000063bf0d in suspend_sync (interrupt_kernel=1, tid=140736817870592) at mono-threads.c:875
#20 suspend_sync_nolock (interrupt_kernel=1, id=140736817870592) at mono-threads.c:890
#21 mono_thread_info_safe_suspend_and_run (id=140736817870592, interrupt_kernel=interrupt_kernel@entry=1, callback=callback@entry=0x58ead0 <abort_thread_critical>, user_data=user_data@entry=0x7fffffffcb10) at mono-threads.c:933
#22 0x0000000000592f96 in abort_thread_internal (thread=thread@entry=0x7fffd813c230, install_async_abort=install_async_abort@entry=1, can_raise_exception=1) at threads.c:4728
#23 0x0000000000593039 in mono_thread_internal_stop (thread=0x7fffd813c230) at threads.c:2385
#24 0x00000000005b266e in mono_gc_cleanup () at gc.c:842
#25 0x00000000005abfbe in mono_runtime_cleanup (domain=domain@entry=0x9c91f0) at appdomain.c:356
#26 0x000000000042702b in mini_cleanup (domain=0x9c91f0) at mini-runtime.c:4017
#27 0x0000000000480dc6 in mono_main (argc=11, argv=<optimized out>) at driver.c:2119
#28 0x0000000000424cdf in mono_main_with_options (argv=0x7fffffffce78, argc=11) at main.c:28
#29 main (argc=11, argv=0x7fffffffce78) at main.c:177
(gdb) 


STATE CUE CARD of above process:
-----------------------------------------------------------------------------------------------
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number)
        0x0     - starting (GOOD, unless the thread is running managed code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa27f30 id 0x7fffd8093700 [(nil)] state 105
--thread 0x9c7a30 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 2000 ms, which is more than the allowed 2000 ms
Shutting down finalizer thread timed out.
* Assertion at gc.c:867, condition `finalizer_thread_exited' not met
Comment 1 Rodrigo Kumpera 2016-07-20 00:44:30 UTC
This issue has being observed on machines under heavy cpu load - where loadavg > core count.

Can you try setting the MONO_SLEEP_ABORT_LIMIT env var to 10000 and see if things work as expected?

This would tell us whether it's a load issue or an actual deadlock in the runtime.

There's a discussion on improving this on this PR: https://github.com/mono/mono/pull/3292
Comment 2 Rodrigo Kumpera 2016-07-21 07:01:21 UTC
Hi Glover,

We merged a tentative fix yesterday into mono. 25494815948e52187ef607124e2f69290387208c

Would you be so kind to try it and tell whether it fixed your issue?
Comment 3 Glover George 2016-07-21 14:07:07 UTC
Hi,

The error still occurs.  Exact same stack trace, the only difference now is that the timeout reflected is 10000: (CUE CARD below). 

I'd like to point out that in this mono instance that hung, I was running 36 mono processes (identical code) on a 36-core, hyperthreaded (72 thread) Intel.  The load avg was just over 36.  The code is essentially performing a large loop over many input points, and each iteration can call up to a 100 functions, so the GC is fairly heavily involved.  Not sure if this helps.

STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number)
        0x0     - starting (GOOD, unless the thread is running managed code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa29c40 id 0x7fffd7f8d700 [(nil)] state 105
--thread 0x9c9d40 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms

I had actually tried this before the change, by changing SLEEP_DURATION_BEFORE_ABORT to 10000, and saw the same thing.  This is where the suspicion of deadlock came from.
Comment 4 Rodrigo Kumpera 2016-07-21 16:28:18 UTC
Glover,

What was the gdb dump in this case? That will tell me where the stray thread was.

Ludovic,

Can you work with Glover to troubleshoot this one? You done quite a lot of that recently and this one is right in your area.
Comment 5 Glover George 2016-07-22 22:48:53 UTC
Hi,

I didn't get the stack trace for that CUE card above, so I've reproduced it.  I was running 72 instances of MAOIE on 36 cores on a single machine (yes, I know that's a load average of 72+, but when it works, it's 33% faster to use 72 than 36).  Of the 36, around 9 failed.  Rather than post all of the hung processes backtraces, I diff two of them at the end.  I inspected all of them, and they all hung exactly the same.

Output below explained:
1. State Cue Card
2. gdb -p <hungprocessID> ; thread apply all bt
3. diff of this backtrace and the backtrace of another hung process on this machine.

1. --------------------------------------------
STATE CUE CARD: (? means a positive number, usually 1 or 2, * means any number)
        0x0     - starting (GOOD, unless the thread is running managed code)
        0x1     - running (BAD, unless it's the gc thread)
        0x2     - detached (GOOD, unless the thread is running managed code)
        0x?03   - async suspended (GOOD)
        0x?04   - self suspended (GOOD)
        0x?05   - async suspend requested (BAD)
        0x?06   - self suspend requested (BAD)
        0x*07   - blocking (GOOD)
        0x?08   - blocking with pending suspend (GOOD)
--thread 0xa29c40 id 0x7fffd7f8d700 [(nil)] state 105
--thread 0x9c9d40 id 0x7fffedae7780 [(nil)] state 1  GC INITIATOR
WAITING for 1 threads, got 0 suspended
suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms

2. ---------------------------------------------------------
ggeorge@r20i5n6:~> gdb -p 39538
GNU gdb (GDB; SUSE Linux Enterprise 11) 7.9.1
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 39538
Reading symbols from /p/home/apps/unsupported/NAVAIR/build/mono-git/bin/mono-sgen...done.
Reading symbols from /p/home/ggeorge/Projects/AFRL/mstc/support/gcc/linux64/gcc-4.7.1/lib64/libstdc++.so.6...done.
Reading symbols from /lib64/libm.so.6...Missing separate debuginfo for /lib64/libm.so.6
Try: zypper install -C "debuginfo(build-id)=f0721cb50ab9fbdf06314a53bff5af581bbefe64"
(no debugging symbols found)...done.
Reading symbols from /lib64/librt.so.1...Missing separate debuginfo for /lib64/librt.so.1
Try: zypper install -C "debuginfo(build-id)=e1a13ecb56367b69b89d1c9ca1a4c42167336030"
(no debugging symbols found)...done.
Reading symbols from /lib64/libdl.so.2...Missing separate debuginfo for /lib64/libdl.so.2
Try: zypper install -C "debuginfo(build-id)=732292820e69f70459cb927ade5b49bc56d32b0f"
(no debugging symbols found)...done.
Reading symbols from /lib64/libpthread.so.0...Missing separate debuginfo for /lib64/libpthread.so.0
Try: zypper install -C "debuginfo(build-id)=48172710254f4e2549684d7d3e9f9622272d6c66"
(no debugging symbols found)...done.
[New LWP 39634]
[New LWP 39596]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /p/home/ggeorge/Projects/AFRL/mstc/support/gcc/linux64/gcc-4.7.1/lib64/libgcc_s.so.1...done.
Reading symbols from /lib64/libc.so.6...Missing separate debuginfo for /lib64/libc.so.6
Try: zypper install -C "debuginfo(build-id)=e2cab3c95cb1189420734b4af264b047355be2e5"
(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: zypper install -C "debuginfo(build-id)=9fdc592b21682a31f460f6f043f50eea8c8b6821"
(no debugging symbols found)...done.
Reading symbols from /p/home/apps/unsupported/NAVAIR/build/mono-git/lib/mono/4.5/mscorlib.dll.so...done.
0x00007fffecd48324 in __lll_lock_wait () from /lib64/libpthread.so.0
Mono support loaded.
(gdb) thread apply all bt

Thread 3 (Thread 0x7fffebfff700 (LWP 39596)):
#0  0x00007fffecd4566c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000624f77 in mono_os_cond_wait (mutex=0x998f80 <lock>, cond=0x998f40 <work_cond>)
    at ../../mono/utils/mono-os-mutex.h:112
#2  thread_func (thread_data=0x0) at sgen-thread-pool.c:110
#3  0x00007fffecd41806 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffec8879bd in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fffd7f8d700 (LWP 39634)):
#0  0x00007fffec7dbc8b in sigsuspend () from /lib64/libc.so.6
#1  0x000000000064ea86 in suspend_signal_handler (_dummy=<optimized out>, info=<optimized out>,
    context=0x7fffd7f8a100) at mono-threads-posix-signals.c:208
#2  <signal handler called>
#3  0x00007fffed8faf97 in open64 () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffed8ea82d in open_verify () from /lib64/ld-linux-x86-64.so.2
#5  0x00007fffed8ecca1 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffed8f7400 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#7  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#8  0x00007fffed8f6e3b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#9  0x00007fffecf57f9b in dlopen_doit () from /lib64/libdl.so.2
#10 0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#11 0x00007fffecf5833c in _dlerror_run () from /lib64/libdl.so.2
#12 0x00007fffecf57f01 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#13 0x00000000006429b5 in mono_dl_open_file (file=<optimized out>, flags=<optimized out>) at mono-dl-posix.c:68
#14 0x000000000064217d in mono_dl_open (
    name=name@entry=0xe060f0 "/p/home/apps/unsupported/NAVAIR/build/mono-git/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", flags=flags@entry=1, error_msg=error_msg@entry=0x7fffd7f8b030)
    at mono-dl.c:151
#15 0x0000000000558750 in cached_module_load (
    name=name@entry=0xe060f0 "/p/home/apps/unsupported/NAVAIR/build/mono-git/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", err=err@entry=0x7fffd7f8b030, flags=1) at loader.c:1119
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#16 0x00000000005594d4 in mono_lookup_pinvoke_call (method=method@entry=,
    exc_class=exc_class@entry=0x7fffd7f8c0b0, exc_arg=exc_arg@entry=0x7fffd7f8c0b8) at loader.c:1362
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#17 0x000000000056fe56 in mono_marshal_get_native_wrapper (method=method@entry=,
    check_exceptions=check_exceptions@entry=1, aot=0) at marshal.c:7663
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#18 0x00000000004667c9 in mono_method_to_ir (cfg=cfg@entry=0x1ec08b0, method=method@entry=,
    start_bblock=<optimized out>, start_bblock@entry=0x0, end_bblock=<optimized out>, end_bblock@entry=0x0,
    return_var=return_var@entry=0x0, inline_args=inline_args@entry=0x0, inline_offset=0, is_virtual_call=0)
    at method-to-ir.c:9311
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#19 0x00000000005136e1 in mini_method_compile (method=method@entry=, opts=opts@entry=370239999,
    domain=domain@entry=0x9cb940, flags=flags@entry=JIT_FLAG_RUN_CCTORS, parts=parts@entry=0,
    aot_method_index=aot_method_index@entry=-1) at mini.c:3547
---Type <return> to continue, or q <return> to quit---
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#20 0x00000000005153e9 in mono_jit_compile_method_inner (method=method@entry=,
    target_domain=target_domain@entry=0x9cb940, opt=opt@entry=370239999, error=error@entry=0x7fffd7f8c650)
    at mini.c:4239
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#21 0x000000000042763a in mono_jit_compile_method_with_opt (method=method@entry=, opt=370239999,
    error=error@entry=0x7fffd7f8c650) at mini-runtime.c:1914
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
#22 0x0000000000427eaa in mono_jit_compile_method (method=method@entry=, error=error@entry=0x7fffd7f8c650)
    at mini-runtime.c:1958
#23 0x00000000004b4e65 in common_call_trampoline (regs=regs@entry=0x7fffd7f8c728,
Python Exception <type 'exceptions.ValueError'> zero length field name in format:
    code=code@entry=0x4023b034 "\270\001", m=m@entry=, vt=vt@entry=0x0, vtable_slot=<optimized out>,
    vtable_slot@entry=0x0, error=error@entry=0x7fffd7f8c650) at mini-trampolines.c:701
#24 0x00000000004b55b2 in mono_magic_trampoline (regs=0x7fffd7f8c728, code=0x4023b034 "\270\001",
    arg=0x2016270, tramp=<optimized out>) at mini-trampolines.c:831
#25 0x000000004000028c in ?? ()
#26 0x00007fffd6fcc130 in ?? ()
#27 0x0000000000a28b15 in ?? ()
#28 0x000000004023b034 in ?? ()
#29 0x0000000000a28b15 in ?? ()
#30 0x00007fffd7f8c870 in ?? ()
#31 0x00007fffd7f8c728 in ?? ()
#32 0x00007fffd7f8d690 in ?? ()
#33 0x00007fffd6fcc130 in ?? ()
#34 0x0000000000000004 in ?? ()
#35 0x0000000000000003 in ?? ()
#36 0x00007fffd6fcc130 in ?? ()
#37 0x00007fffd7f8c870 in ?? ()
#38 0x00007fffd7f8c8d0 in ?? ()
#39 0x0000000000000001 in ?? ()
#40 0x00007fffd6fcc130 in ?? ()
#41 0x0000000000000001 in ?? ()
#42 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7fffedae7780 (LWP 39538)):
#0  0x00007fffecd48324 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffecd43684 in _L_lock_1091 () from /lib64/libpthread.so.0
#2  0x00007fffecd434f6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffed8f6dcc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffec8bf530 in do_dlopen () from /lib64/libc.so.6
#5  0x00007fffed8f2e86 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6  0x00007fffec8bf5e5 in dlerror_run () from /lib64/libc.so.6
#7  0x00007fffec8bf6d7 in __libc_dlopen_mode () from /lib64/libc.so.6
#8  0x00007fffec89a2e5 in init () from /lib64/libc.so.6
#9  0x00007fffecd46d03 in pthread_once () from /lib64/libpthread.so.0
#10 0x00007fffec89a43c in backtrace () from /lib64/libc.so.6
#11 0x00000000004b35a5 in mono_handle_native_sigsegv (signal=<optimized out>, ctx=<optimized out>,
---Type <return> to continue, or q <return> to quit---
    info=<optimized out>) at mini-exceptions.c:2388
#12 <signal handler called>
#13 0x00007fffec7db875 in raise () from /lib64/libc.so.6
#14 0x00007fffec7dce51 in abort () from /lib64/libc.so.6
#15 0x000000000065939e in monoeg_log_default_handler (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR,
    message=0xe06240 "suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms",
    unused_data=0x0) at goutput.c:231
#16 0x000000000065918b in monoeg_g_logv (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR,
    format=0x7177c8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms",
    args=0x7fffffffc648) at goutput.c:111
#17 0x0000000000659241 in monoeg_g_log (log_domain=0x0, log_level=G_LOG_LEVEL_ERROR,
    format=0x7177c8 "suspend_thread suspend took %d ms, which is more than the allowed %d ms") at goutput.c:121
#18 0x000000000064c09f in mono_threads_wait_pending_operations () at mono-threads.c:248
#19 0x000000000064c712 in suspend_sync (interrupt_kernel=1, tid=140736816797440) at mono-threads.c:917
#20 suspend_sync_nolock (interrupt_kernel=1, id=140736816797440) at mono-threads.c:934
#21 mono_thread_info_safe_suspend_and_run (id=140736816797440, interrupt_kernel=interrupt_kernel@entry=1,
    callback=callback@entry=0x59b420 <async_abort_critical>, user_data=user_data@entry=0x7fffffffcbd0)
    at mono-threads.c:977
#22 0x000000000059b5ed in async_abort_internal (thread=<optimized out>, install_async_abort=<optimized out>)
    at threads.c:4825
#23 0x00000000005bfea0 in mono_gc_cleanup () at gc.c:900
#24 0x00000000005b9c0e in mono_runtime_cleanup (domain=domain@entry=0x9cb940) at appdomain.c:397
#25 0x0000000000425def in mini_cleanup (domain=0x9cb940) at mini-runtime.c:4065
#26 0x00000000004862ef in mono_main (argc=10, argv=<optimized out>) at driver.c:2149
#27 0x00000000004239af in mono_main_with_options (argv=0x7fffffffcf78, argc=10) at main.c:28
#28 main (argc=10, argv=0x7fffffffcf78) at main.c:177
(gdb)

3. --------------------------------------------------------------------------
Diff of backtrace between the process above and another hung process:
ggeorge@topaz04:~> diff 39538.st 39541.st 
1c1
< ggeorge@r20i5n6:~> gdb -p 39538
---
> ggeorge@r20i5n6:~> gdb -p 39541
16c16
< Attaching to process 39538
---
> Attaching to process 39541
31,32c31,32
< [New LWP 39634]
< [New LWP 39596]
---
> [New LWP 39669]
> [New LWP 39616]
47c47
< Thread 3 (Thread 0x7fffebfff700 (LWP 39596)):
---
> Thread 3 (Thread 0x7fffebfff700 (LWP 39616)):
56c56
< Thread 2 (Thread 0x7fffd7f8d700 (LWP 39634)):
---
> Thread 2 (Thread 0x7fffd7f8d700 (LWP 39669)):
73c73
<     name=name@entry=0xe060f0 "/p/home/apps/unsupported/NAVAIR/build/mono-git/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", flags=flags@entry=1, error_msg=error_msg@entry=0x7fffd7f8b030)
---
>     name=name@entry=0xe061b0 "/p/home/apps/unsupported/NAVAIR/build/mono-git/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", flags=flags@entry=1, error_msg=error_msg@entry=0x7fffd7f8b030)
76c76
<     name=name@entry=0xe060f0 "/p/home/apps/unsupported/NAVAIR/build/mono-git/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", err=err@entry=0x7fffd7f8b030, flags=1) at loader.c:1119
---
>     name=name@entry=0xe061b0 "/p/home/apps/unsupported/NAVAIR/build/mono-git/lib/mono/gac/System.Data/4.0.0.0__b77a5c561934e089/libSystem.Data.dll", err=err@entry=0x7fffd7f8b030, flags=1) at loader.c:1119
84c84
< #18 0x00000000004667c9 in mono_method_to_ir (cfg=cfg@entry=0x1ec08b0, method=method@entry=, 
---
> #18 0x00000000004667c9 in mono_method_to_ir (cfg=cfg@entry=0x1ec2540, method=method@entry=, 
128c128
< Thread 1 (Thread 0x7fffedae7780 (LWP 39538)):
---
> Thread 1 (Thread 0x7fffedae7780 (LWP 39541)):
147c147
<     message=0xe06240 "suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms", 
---
>     message=0xe06300 "suspend_thread suspend took 10000 ms, which is more than the allowed 10000 ms", 
ggeorge@topaz04:~>
Comment 6 Ludovic Henry 2017-02-08 20:34:25 UTC
Hi Glover, can you still reproduce with latest mono? There has been a lot of bug fixes in our thread suspend mechanism, so your bug has certainly been fixed. Thank you!
Comment 7 Ludovic Henry 2017-02-08 21:35:07 UTC
Glover, by latest mono, I mean the latest stable (available at http://www.mono-project.com/download/), and if it doesn't work, then please try the latest alpha (available at http://www.mono-project.com/download/alpha/)
Comment 8 Glover George 2017-02-28 20:27:57 UTC
This issue still occurs in mono-4.6.2.16, however we have yet to reproduce
the issue in 4.8.0.483.  We are still testing, but so far so good.
Normally, 100,000 runs would hang 20% of the runs.  So far, 100,000 runs
has produced no hangs.

I should be more confident in saying this is fixed in a few more weeks
(fingers are still crossed right now).

Cheers.
Comment 9 Ludovic Henry 2017-03-02 22:12:15 UTC
Glover, I will mark this bug as not reproducible. If you witness this hang again, please do not hesitate to either reopen this bug or submit a new one. Thank you very much!