Bug 4389 - handle_cleanup / thread_exit race condition segfault
Summary: handle_cleanup / thread_exit race condition segfault
Status: RESOLVED FIXED
Alias: None
Product: Runtime
Classification: Mono
Component: io-layer
Version: unspecified
Hardware: PC Other
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2012-04-11 06:54 UTC by Kurt Miller
Modified: 2013-02-02 12:22 UTC
CC: 3 users

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
patch (1.37 KB, patch)
2012-04-12 04:35 UTC, Zoltan Varga
Details
patch (432 bytes, patch)
2013-01-15 18:32 UTC, Zoltan Varga
Details
patch (8.64 KB, patch)
2013-01-20 22:58 UTC, Zoltan Varga
Details



Description Kurt Miller 2012-04-11 06:54:26 UTC
On mono 2.10.6 running on OpenBSD-current (post 5.1 release) there is an intermittent segfault on process termination: the main thread is executing handle_cleanup() while another thread is in thread_exit(). handle_cleanup() frees memory that thread_exit() is still using, and thread_exit() crashes with a use-after-free.
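
In pthread terms the failure pattern looks roughly like the stand-alone sketch below. This is an illustration of the race only, not Mono source code, and all names in it are made up:

/* Illustration only, not Mono source. 'table' stands in for
 * _wapi_private_handles, cleanup() for handle_cleanup(), and worker()
 * for a thread that is still in its exit path. */
#include <pthread.h>
#include <stdlib.h>

static int *table;

static void cleanup (void)
{
        free (table);          /* runs on the main thread, inside exit(), via atexit() */
}

static void *worker (void *arg)
{
        /* ... the thread's real work would happen here ... */
        table[0] = 1;          /* use-after-free if cleanup() has already run */
        return NULL;
}

int main (void)
{
        pthread_t t;

        table = calloc (16, sizeof (int));
        atexit (cleanup);
        pthread_create (&t, NULL, worker, NULL);
        /* main exits without joining the worker, so exit() runs cleanup()
         * while the worker may still be dereferencing 'table' */
        return 0;
}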

Here is some debug info to demonstrate the problem:

(gdb) info threads
* 2 process 1715365438  0x1c1d54d0 in handle_cleanup () at handles.c:233
  1 process 1731225150  0x0e40ed15 in sem_destroy (semp=0x7c3ac1d4)
    at /usr/src/lib/librthread/rthread_sem.c:123
(gdb) thread 1
[Switching to thread 1 (process 1731225150)]#0  0x0e40ed15 in sem_destroy (semp=0x7c3ac1d4)
    at /usr/src/lib/librthread/rthread_sem.c:123
123             *semp = NULL;
(gdb) bt
#0  0x0e40ed15 in sem_destroy (semp=0x7c3ac1d4) at /usr/src/lib/librthread/rthread_sem.c:123
#1  0x1c1eafc3 in _wapi_thread_set_termination_details (handle=0x207, exitstatus=0) at wthreads.c:167
#2  0x1c1eb13e in thread_exit (exitstatus=Variable "exitstatus" is not available.
) at wthreads.c:204
#3  0x1c1eb237 in thread_start_routine (args=0x7c3ac1b8) at wthreads.c:287
#4  0x03c6b545 in GC_inner_start_routine (sb=0x7c4e8fd8, arg=0x3c0fbfc0) at pthread_support.c:1079
#5  0x03c6b5c3 in GC_start_routine (arg=0x3c0fbfc0) at pthread_support.c:1114
#6  0x0e40e48e in _rthread_start (v=0x7c363b00) at /usr/src/lib/librthread/rthread.c:113
#7  0x071bde81 in __tfork_thread () from /usr/lib/libc.so.63.0
#8  0x0e40d725 in _spinlock (lock=0x7c0cd340) at /usr/src/lib/librthread/rthread.c:75
#9  0x271a4c98 in ?? () from /usr/lib/libc.so.63.0
#10 0x00000000 in ?? ()
(gdb) p semp
$1 = (sem_t *) 0x7c3ac1d4
(gdb) p *semp
Cannot access memory at address 0x7c3ac1d4
(gdb) thread 2
[Switching to thread 2 (process 1715365438)]#0  0x1c014405 in monoeg_g_free (ptr=0x0) at gmem.c:37
37      }
(gdb) bt
#0  0x1c014405 in monoeg_g_free (ptr=0x0) at gmem.c:37
#1  0x1c1d54d0 in handle_cleanup () at handles.c:233
#2  0x07238725 in __cxa_finalize (dso=0x0) at /usr/src/lib/libc/stdlib/atexit.c:153
#3  0x0721638e in exit (status=0) at /usr/src/lib/libc/stdlib/exit.c:57
#4  0x1c013eef in ___start ()
#5  0x1c013e62 in _start ()

To reproduce this problem the following steps were taken (adjust paths as needed):

$ pwd
/usr/obj/ports/mono-2.10.6/mono-2.10.6/mcs/class/System.Configuration
$ cat doit
export MONO_PATH=./../../class/lib/basic:
until [ $? -ne 0 ]; do
/usr/obj/ports/mono-2.10.6/mono-2.10.6/runtime/mono-wrapper ./../../class/lib/basic/sn.exe -q -v ../../class/lib/net_2_0/System.Configuration.dll
done
echo $?
$ ./doit

After about 300 iterations mono will drop core. Other machines with different performance characteristics can reproduce it much more easily.

Please let me know if there is other information needed to help diagnose the problem.
Comment 1 Zoltan Varga 2012-04-12 04:25:53 UTC
This shouldn't happen; the runtime is supposed to wait for all threads to terminate before calling exit ().
At least, the runtime waits until this line is executed in wthreads.c:172:

	_wapi_handle_set_signal_state (handle, TRUE, TRUE);

That line is after the one that causes the crash.
Comment 2 Zoltan Varga 2012-04-12 04:34:33 UTC
Please try the attached patch.
Comment 3 Zoltan Varga 2012-04-12 04:35:05 UTC
Created attachment 1658 [details]
patch
Comment 4 Kurt Miller 2012-04-12 10:36:32 UTC
Hi Zoltan,

Thank you for the patch. Prior to receiving it, the port maintainer robert@openbsd.org sent me an update to 2.10.8. The update corrected this issue. I have run the 'doit' script in a loop for > 10,000 iterations without hitting the problem again. There is a different problem now, but I will file a separate report for it since it is not the same issue.

Regards,
-Kurt
Comment 5 Kurt Miller 2013-01-08 19:07:44 UTC
This problem has returned with version 2.10.9 on OpenBSD 5.2-current. The patch from comment 3 does not correct it.

I have made more debugging progress and determined the following:

The problem is not caused by multiple calls to handle_cleanup. It is caused by the main thread exiting before all threads have been waited for. When the main thread exits, it calls the atexit handlers; handle_cleanup is one of them. At the time handle_cleanup is called, one thread is still executing: finalizer_thread. Depending on how the kernel schedules the threads, handle_cleanup can be executing while the finalizer_thread is trying to call _wapi_thread_set_termination_details, and sometimes the interleaving is such that _wapi_thread_set_termination_details attempts a use-after-free.

Debugging the core file led me to this call in io-layer/wthreads.c, thread_start_routine():

        thread_exit (thread->start_routine (thread->start_arg),
                     thread->handle, thread->start_routine, thread->start_arg);

thread->start_routine was the start_wrapper() function in metadata/threads.c.
I modified start_wrapper_internal() slightly to output the address of the start_info->func argument as follows:

        /* start_func is set only for unmanaged start functions */
        if (start_func) {
                g_message ("%s: (%p) start_func executing", __func__, start_func);
                fflush(stdout);
                start_func (start_arg);
                g_message ("%s: (%p) start_func finished", __func__, start_func);
                fflush(stdout);
        } else {
                void *args [1];
                g_assert (start_delegate != NULL);
                args [0] = start_arg;
                g_message ("%s: (%p) start_delegate executing", __func__, start_delegate);
                fflush(stdout);
                /* we may want to handle the exception here. See comment below on unhandled exceptions */
                mono_runtime_delegate_invoke (start_delegate, args, NULL);
                g_message ("%s: (%p) start_delegate finished", __func__, start_delegate);
                fflush(stdout);
        }

With this instrumentation, the address of start_func is always the same:

"start_wrapper_internal: (0x1c12a400) start_func executing"

The function at address 0x1c12a400 is finalizer_thread:
(gdb) x 0x1c12a400
0x1c12a400 <finalizer_thread>:  0x57e58955

In addition, most executions of the doit script don't output the corresponding 'finished' message before the program exits. This also supports the theory that the finalizer_thread is not waited for.

Please let me know if there is any additional information you may need to further diagnose this intermittent crash.
Comment 6 Zoltan Varga 2013-01-12 08:06:31 UTC
The runtime waits for the finalizer thread to exit in mono_gc_cleanup () in gc.c. That code is a bit problematic, since the finalizer thread might refuse to die, in which case we continue with the shutdown even though the finalizer thread might still be accessing data we are freeing. You might be running into this situation, although there is a log message to alert the user:
					g_warning ("Shutting down finalizer thread timed out.");
Comment 7 Kurt Miller 2013-01-13 17:20:58 UTC
Hi Zoltan,

Thank you for your reply. I've read the code in mono_gc_cleanup() and see that it will wait for the finalizer_thread to reach SetEvent (shutdown_event). In our case the finalizer_thread does not get stuck; it does reach the SetEvent (shutdown_event) line, and we do not see the "Shutting down finalizer thread timed out." message. The race occurs after that.

When the finalizer_thread returns, control goes back to start_wrapper_internal(), then to start_wrapper(), and then to thread_start_routine(), which calls thread_exit(). thread_exit() calls _wapi_thread_set_termination_details().

However, just before the finalizer_thread returns it calls SetEvent (shutdown_event), and that wakes up the main thread waiting in mono_gc_cleanup(). If the main thread gets to handle_cleanup() before the finalizer_thread calls _wapi_thread_set_termination_details(), and the process has not finished exiting by then, the finalizer_thread's call to _wapi_thread_set_termination_details() will segfault with a use-after-free.
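
Put as a timeline (reconstructed from the description above; the ordering is approximate):

finalizer_thread                                main thread
----------------                                -----------
SetEvent (shutdown_event)                       blocked in mono_gc_cleanup ()
                                                wakes up, mono_gc_cleanup () returns
                                                shutdown continues, exit () is reached
                                                atexit -> handle_cleanup () frees the handle data
returns from finalizer_thread ()
... -> thread_start_routine () -> thread_exit ()
_wapi_thread_set_termination_details ()
sem_destroy () on the freed handle data -> SIGSEGV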

To force the issue, nanosleep can be used to control the timing:

======
--- /usr/obj/ports/mono-2.10.9/mono-2.10.9/mono/metadata/gc.c.orig      Sun Jan 13 17:03:33 2013
+++ /usr/obj/ports/mono-2.10.9/mono-2.10.9/mono/metadata/gc.c   Sun Jan 13 16:55:01 2013
@@ -1098,6 +1098,10 @@ finalizer_thread (gpointer unused)
        }

        SetEvent (shutdown_event);
+       struct timespec req;
+       req.tv_sec = 1;
+       req.tv_nsec= 0;
+       while(nanosleep(&req, NULL) == -1 && errno == EINTR);
        return 0;
 }

--- /usr/obj/ports/mono-2.10.9/mono-2.10.9/mono/io-layer/handles.c.orig Sun Jan 13 17:16:24 2013
+++ /usr/obj/ports/mono-2.10.9/mono-2.10.9/mono/io-layer/handles.c      Sun Jan 13 16:46:59 2013
@@ -231,6 +231,10 @@ static void handle_cleanup (void)

        for (i = 0; i < _WAPI_PRIVATE_MAX_SLOTS; ++i)
                g_free (_wapi_private_handles [i]);
+       struct timespec req;
+       req.tv_sec=2;
+       req.tv_nsec=0;
+       nanosleep(&req, NULL);
 }

 void _wapi_cleanup ()

=====

The following diff has stabilized mono on OpenBSD:

=======
--- mono/metadata/gc.c.orig     Thu Jan 10 10:55:32 2013
+++ mono/metadata/gc.c  Thu Jan 10 11:24:55 2013
@@ -1098,6 +1098,7 @@ finalizer_thread (gpointer unused)
        }

        SetEvent (shutdown_event);
+       pthread_exit(NULL);
        return 0;
 }

=======

Regards,
-Kurt
Comment 8 Zoltan Varga 2013-01-15 18:32:23 UTC
Please try the attached patch.
Comment 9 Zoltan Varga 2013-01-15 18:32:48 UTC
Created attachment 3223 [details]
patch
Comment 10 Kurt Miller 2013-01-20 17:44:44 UTC
Hi Zoltan,

I tried the patch. It still segfaults. A bit more debugging shows it does take the new code path from the patch that calls WaitForSingleObjectEx(). However, WaitForSingleObjectEx() doesn't actually wait; it returns early at wait.c line 169:

        if (own_if_signalled (handle) == TRUE) {
#ifdef DEBUG
                g_message ("%s: handle %p already signalled", __func__,
                           handle);
#endif

                ret=WAIT_OBJECT_0;
                goto done;
        }

Thanks,
-Kurt
Comment 11 Kurt Miller 2013-01-20 21:17:31 UTC
Hi Zoltan,

I'm sorry; I misstated what's happening. When I put a nanosleep at the end of the finalizer_thread, right before it returns, I see that WaitForSingleObjectEx() does wait. The 'already signalled' call path happens when the finalizer_thread has already returned and has mostly completed _wapi_thread_set_termination_details(). So the window for the race is smaller now, but it is still there. It can be forced with the following nanosleeps:

--- mono/io-layer/wthreads.c.orig       Sun Jan 20 21:11:41 2013
+++ mono/io-layer/wthreads.c    Sun Jan 20 21:12:55 2013
@@ -177,6 +177,10 @@ void _wapi_thread_set_termination_details (gpointer ha
        g_message("%s: Recording thread handle %p id %ld status as %d",
                  __func__, handle, thread_handle->id, exitstatus);
 #endif
+        struct timespec req;
+        req.tv_sec = 1;
+        req.tv_nsec= 0;
+        while(nanosleep(&req, &req) == -1 && errno == EINTR);

        /* The thread is no longer active, so unref it */
        _wapi_handle_unref (handle);
--- mono/io-layer/handles.c.orig        Sun Jan 20 21:10:22 2013
+++ mono/io-layer/handles.c     Sun Jan 20 20:59:57 2013
@@ -231,6 +231,11 @@ static void handle_cleanup (void)

        for (i = 0; i < _WAPI_PRIVATE_MAX_SLOTS; ++i)
                g_free (_wapi_private_handles [i]);
+
+        struct timespec req;
+        req.tv_sec = 2;
+        req.tv_nsec= 0;
+        while(nanosleep(&req, &req) == -1 && errno == EINTR);
 }

 void _wapi_cleanup ()
Comment 12 Zoltan Varga 2013-01-20 22:11:52 UTC
Yes, we have a problem because we are waiting for threads to terminate using WaitForSingleObjectEx (), but that only waits for the 'thread exited' event, while the actual thread is still running its shutdown code.
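
In generic pthread terms, the difference is between waiting on a flag the thread raises before it has finished tearing down and actually joining the thread. A minimal sketch of that distinction (an analogy only, not the attached patch; all names are made up):

/* Analogy only, not the attached patch: waiting on an "exited" event is
 * not the same as joining the thread. */
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

static sem_t exited_event;      /* plays the role of the 'thread exited' event */
static int  *shared_state;      /* plays the role of the wapi handle data */

static void *worker (void *arg)
{
        sem_post (&exited_event);       /* signal "I have exited" ... */
        shared_state[0] = 1;            /* ... but teardown code still runs afterwards */
        return NULL;
}

int main (void)
{
        pthread_t t;

        shared_state = calloc (1, sizeof (int));
        sem_init (&exited_event, 0, 0);
        pthread_create (&t, NULL, worker, NULL);

        sem_wait (&exited_event);       /* like the event wait: the worker may still
                                           be running its teardown at this point */
        pthread_join (t, NULL);         /* only this guarantees the worker is gone */
        free (shared_state);
        return 0;
}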
Comment 13 Zoltan Varga 2013-01-20 22:57:58 UTC
Please try the attached patch.
Comment 14 Zoltan Varga 2013-01-20 22:58:28 UTC
Created attachment 3248 [details]
patch
Comment 15 Kurt Miller 2013-01-21 16:53:50 UTC
That patch looks like it should fix it. I'm building with it now and will run an overnight test of my script that reproduces the segfault. I'll let you know how it goes. Thanks!
Comment 16 Kurt Miller 2013-01-24 08:54:14 UTC
Looking good so far. I have submitted these changes to OpenBSD's ports tree for wider testing.
Comment 17 Kurt Miller 2013-02-02 12:08:11 UTC
Final follow-up on this bug report. The intermittent segfault is gone now. The patch has been running on several multicore OpenBSD build clusters for about a week and mono has not segfaulted since. Please include it in the next release of mono if possible. Thank you for your help with this problem. -Kurt
Comment 18 Zoltan Varga 2013-02-02 12:22:46 UTC
Committed to master/2.10.