Bug 2642 - SIGSEGV in __strstr_sse42
Summary: SIGSEGV in __strstr_sse42
Status: RESOLVED FIXED
Alias: None
Product: Runtime
Classification: Mono
Component: JIT ()
Version: unspecified
Hardware: PC Linux
: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
: 2644 ()
Depends on:
Blocks:
 
Reported: 2011-12-27 02:07 UTC by Bassam
Modified: 2012-03-28 17:44 UTC (History)
4 users (show)

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
patch (490 bytes, patch)
2012-02-19 08:07 UTC, Zoltan Varga
Details
patch (2.64 KB, patch)
2012-02-19 09:24 UTC, Zoltan Varga
Details
workarund (398 bytes, patch)
2012-02-29 21:12 UTC, Zoltan Varga
Details


Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.


Please create a new report on GitHub or Developer Community with your current version information, steps to reproduce, and relevant error messages or log files if you are hitting an issue that looks similar to this resolved bug and you do not yet see a matching new report.

Related Links:
Status:
RESOLVED FIXED

Description Bassam 2011-12-27 02:07:45 UTC
We are seeing numerous crashes in mono in our ASP.NET farm. The following backtrace is common:

#0  0x00007f693c123165 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f693c125f70 in *__GI_abort () at abort.c:92
#2  0x00000000004933d0 in mono_handle_native_sigsegv (signal=814058848, ctx=<value optimized out>) at mini-exceptions.c:2246
#3  0x00000000004e6c9d in mono_arch_handle_altstack_exception (sigctx=0x7f69374d6ac0, fault_addr=<value optimized out>, stack_ovf=0) at exceptions-amd64.c:957
#4  0x00000000004176e4 in mono_sigsegv_signal_handler (_dummy=11, info=0x7f69374d6bf0, context=0x7f69374d6ac0) at mini.c:5882
#5  <signal handler called>
#6  0x00007f693c1f9c0b in __strstr_sse42 (s1=0x1aa71b8 "__icall_wrapper_mono_thread_force_interruption_checkpoint", s2=0x600351 "__icall_wrapper_") at ../sysdeps/x86_64/multiarch/strstr.c:286
#7  0x000000000041e731 in mono_jit_compile_method_with_opt (method=0x1aa7180, opt=51472895, ex=0x7f68f43a7050) at mini.c:5343
#8  0x000000000041eede in mono_jit_compile_method (method=0x600351) at mini.c:5404
#9  0x0000000000496fef in common_call_trampoline (regs=0x7f68f43a7300, code=0x40133ac0 "\353", <incomplete sequence \344>, m=0x1aa7180, tramp=<value optimized out>, vt=0x0, vtable_slot=0x0, need_rgctx_tramp=0) at mini-trampolines.c:492
#10 0x000000000049790d in mono_magic_trampoline (regs=0x600351, code=0x600351 "__icall_wrapper_", arg=0x5f, tramp=0x8 <Address 0x8 out of bounds>) at mini-trampolines.c:594
#11 0x0000000040fb316a in ?? ()
#12 0x00007f68f43a7168 in ?? ()
#13 0x00000000005e2d7f in monoeg_g_hash_table_lookup (hash=0x1aa7180, key=0x7f68f43a73f8) at ghashtable.c:280
#14 0x0000000040133ac0 in ?? ()
#15 0x0000000000000000 in ?? ()

which seems very related to https://bugzilla.novell.com/show_bug.cgi?id=647464.

(gdb)disassemble
...
0x00007fb83ffc8c08 <__strstr_sse42+936>:        mov    %r13,%rdi
0x00007fb83ffc8c0b <__strstr_sse42+939>:        movdqa %xmm0,(%rsp) <----
0x00007fb83ffc8c10 <__strstr_sse42+944>:        callq  0x7fb83ffc8720 <__m128i_strloadu>
0x00007fb83ffc8c15 <__strstr_sse42+949>:        movdqa %xmm0,0x10(%rsp)
...

(gdb) p $rsp
$1 = (void *) 0x7fb831b5bf38

which is not 16-byte aligned and hence the SIGSEGV. Is this a problem with the mono runtime?
Comment 1 Zoltan Varga 2011-12-27 08:23:08 UTC
Can you paste the output of mono --version ? Also, is this problem reproducible, or does it happen randomly ?
Comment 2 Bassam 2011-12-27 11:04:00 UTC
Here's the output of mono --version:

Mono JIT compiler version 2.10.6 ((no/82d9fb2 Mon Nov 14 17:54:17 UTC 2011)
Copyright (C) 2002-2011 Novell, Inc, Xamarin, Inc and Contributors. www.mono-project.com
        TLS:           __thread
        SIGSEGV:       altstack
        Notifications: epoll
        Architecture:  amd64
        Disabled:      none
        Misc:          softdebug
        LLVM:          supported, not enabled.
        GC:            Included Boehm (with typed GC and Parallel Mark)

The problem happens randomly and seems to occur once every couple of days.
Comment 3 Zoltan Varga 2012-01-07 13:00:43 UTC
Could you try applying 2f985da336f3c23c317cdf085b96c4ca1b08de13 from our git repo ?
Comment 4 Zoltan Varga 2012-01-14 13:05:57 UTC
-> hopefully fixed.
Comment 5 TravisN 2012-02-13 17:41:38 UTC
The sse42 error is still being hit after applying 2f985da336f3c23c317cdf085b96c4ca1b08de13. The stack trace and alignment issue looks the same as what Bassam specified originally. Here is the version in use now:

Mono JIT compiler version 2.10.8.1 ((no/3e7cd1f Wed Jan 25 22:22:34 UTC 2012)
Copyright (C) 2002-2011 Novell, Inc, Xamarin, Inc and Contributors. www.mono-project.com
        TLS:           __thread
        SIGSEGV:       altstack
        Notifications: epoll
        Architecture:  amd64
        Disabled:      none
        Misc:          softdebug
        LLVM:          supported, not enabled.
        GC:            Included Boehm (with typed GC and Parallel Mark)
Comment 6 Bassam 2012-02-18 16:08:07 UTC
This bug is bringing our frontends down every few hours. Is there a workaround for this? Do you need more information to diagnose?
Comment 7 Zoltan Varga 2012-02-18 16:20:36 UTC
The problem is caused by the native stack getting misaligned. We do a lot of stack manipulation so this will be pretty hard to track down.

When the problem happens, in the stack trace, is this line always the same:
#6  0x00007f693c1f9c0b in __strstr_sse42 (s1=0x1aa71b8
"__icall_wrapper_mono_thread_force_interruption_checkpoint", s2=0x600351

I.e. does it always say "__icall_wrapper_mono_thread_force_interruption_checkpoint" ?
Comment 8 Bassam 2012-02-18 16:35:45 UTC
Not always. Here's a sample of some recent crashes:

> grep _strstr_sse42 mono-crashes-new.log
#6  0x00007fba4a72d52b in __strstr_sse42 (s1=0x26976f8 "__icall_wrapper_helper_ldstr", s2=0x600c53 "__icall_wrapper_")
#6  0x00007fa44a27652b in __strstr_sse42 (s1=0x10adb80 "__icall_wrapper_mono_thread_force_interruption_checkpoint",
#6  0x00007f5122c4f52b in __strstr_sse42 (s1=0x194cb80 "__icall_wrapper_mono_thread_force_interruption_checkpoint",
#6  0x00007ff2af80f52b in __strstr_sse42 (s1=0x7ff2a438ada8 "__icall_wrapper___emul_iconv_to_r_un", s2=0x600c53 "__icall_wrapper_")
#6  0x00007fb5f8d4152b in __strstr_sse42 (s1=0x13b9b80 "__icall_wrapper_mono_thread_force_interruption_checkpoint",
#6  0x00007fa079bd252b in __strstr_sse42 (s1=0x7fa0704439c0 "__icall_wrapper___emul_iconv_to_r_un", s2=0x600c53 "__icall_wrapper_")
#6  0x00007f2cfe40e52b in __strstr_sse42 (s1=0x7f2cf4257330 "__icall_wrapper___emul_iconv_to_r_un", s2=0x600c53 "__icall_wrapper_")
#6  0x00007fd0506c652b in __strstr_sse42 (s1=0x10fba68 "__icall_wrapper_helper_ldstr", s2=0x600c53 "__icall_wrapper_")
#6  0x00007fb1923bc52b in __strstr_sse42 (s1=0x1c07b80 "__icall_wrapper_mono_thread_force_interruption_checkpoint",

The issue is very reproducible, and I'd be happy to setup a debugging session if needed to catch it if that helps. Is this specific to the amd64 jit'er?
Comment 9 Zoltan Varga 2012-02-18 17:00:04 UTC
Probably yes. x86 doesn't have such stack alignment requirements.
Comment 10 Bassam 2012-02-18 17:21:46 UTC
sorry for the basic question, but is there a way to force x86 JITing on an amd64 at runtime? Or is this a mono build option?
Comment 11 Zoltan Varga 2012-02-18 17:50:08 UTC
Its a built time option. You can try running an x86 mono in a 32 bit chroot.
Comment 12 Zoltan Varga 2012-02-19 08:06:45 UTC
Please try the attached patch.
Comment 13 Zoltan Varga 2012-02-19 08:07:39 UTC
Created attachment 1384 [details]
patch
Comment 14 Zoltan Varga 2012-02-19 09:24:13 UTC
Created attachment 1385 [details]
patch
Comment 15 Bassam 2012-02-20 00:05:39 UTC
we will try this out and get back to you in a few days. Thanks Zoltan for the quick turnaround!
Comment 16 Zoltan Varga 2012-02-23 14:03:33 UTC
Did the patch work out ?
Comment 17 Bassam 2012-02-24 15:20:46 UTC
So far it looks good. I'm going to let the test run through the weekend, and report back here on Monday.
Comment 18 Bassam 2012-02-27 00:54:48 UTC
The patch seems to have worked. We are seeing a different issue now with the GC which I will open a different ticket on. Thanks Zoltan for the great work.
Comment 19 Bassam 2012-02-27 01:50:35 UTC
Any chance that due to the patch mono is using more memory (or potentially leaking)? We are no seeing the following on a daily basis now with the patch applied:

#0  0x00007f2b0349b1b5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#0  0x00007f2b0349b1b5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f2b0349dfc0 in *__GI_abort () at abort.c:92
#2  0x000000000048f566 in mono_handle_native_sigsegv (signal=16, ctx=<value optimized out>) at mini-exceptions.c:2223
#3  <signal handler called>
#4  0x00007f2b0349b1b5 in *__GI_raise (sig=<vcd ..alue optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007f2b0349dfc0 in *__GI_abort () at abort.c:92
#6  0x00000000005e4fda in GC_abort (msg=0x671235 "mmap(...PROT_NONE...) failed") at misc.c:1099
#7  0x00000000005d9ebd in GC_unmap (start=0x7f2aa5b94000 "\200", bytes=8192) at os_dep.c:2037
#8  0x00000000005dbc5f in GC_unmap_old () at allchblk.c:383
#9  0x00000000005e6399 in GC_finish_collection () at alloc.c:776
#10 0x00000000005e5a11 in GC_try_to_collect_inner (stop_func=0x5e539b <GC_never_stop_func>) at alloc.c:393
#11 0x00000000005e6a75 in GC_collect_or_expand (needed_blocks=1, ignore_off_page=0) at alloc.c:1045
#12 0x00000000005e6cf5 in GC_allocobj (sz=10, kind=0) at alloc.c:1125
#13 0x00000000005dee91 in GC_generic_malloc_inner (lb=80, k=0) at malloc.c:136
#14 0x00000000005e0072 in GC_generic_malloc_many (lb=80, k=0, result=0x7f2acb8ea890) at mallocx.c:513
#15 0x00000000005d4e24 in GC_local_malloc_atomic (bytes=76) at pthread_support.c:395
#16 0x000000000050bd3a in mono_object_allocate_ptrfree (vtable=0x7f2af80ae988, n=22) at object.c:4220
#17 mono_array_new_specific (vtable=0x7f2af80ae988, n=22) at object.c:4779
#18 0x00000000410ff39d in ?? ()
#19 0x00000000010ef530 in ?? ()
#20 0x00007f2a397449f0 in ?? ()
#21 0x0000000000000001 in ?? ()
#22 0x0000000000000000 in ?? ()

which seems similar to http://mono.1490590.n4.nabble.com/mmap-PROT-NONE-failed-td3234031.html where Gonzalo thinks that the process id running out of memory. Indeed the process looks like it using more memory than before the patch was applied.
Comment 20 Zoltan Varga 2012-02-27 03:52:29 UTC
We _might_ be leaking some stack space after the call, but that shouldn't be enough to trigger an out-of-memory condition. 
The problem you are running into is probably:
https://bugzilla.novell.com/show_bug.cgi?id=487130

Try removing -DUSE_MUNMAP from libgc/Makefile and recompile.
Comment 21 Zoltan Varga 2012-02-27 11:24:02 UTC
The original issue should be fixed in HEAD/mono-2-10.
Comment 22 Bassam 2012-02-29 20:58:10 UTC
we tried running without USE_MUNMAP and we still see a substantial difference in memory usage between mono with and without this patch. 

One of our frontends running without the patch is using 363MB of memory, while the frontend with the patch is using 1.1GB. The frontends running very comparable workloads for the same amount of time.

Could there be a different explanation to this?
Comment 23 Zoltan Varga 2012-02-29 21:11:50 UTC
Please try the attached workaround patch.
Comment 24 Zoltan Varga 2012-02-29 21:12:15 UTC
Created attachment 1443 [details]
workarund
Comment 25 Zoltan Varga 2012-03-11 13:34:42 UTC
Did that workaround work ?
Comment 26 Bassam 2012-03-14 17:47:43 UTC
Unfortunately not. We still see an increase in memory usage with both patches (and  regardless of USE_MUNMAP). The SEGVs are gone, but the OOM killer keeps getting the mono process when running with the patches. Is there anything else you'd like me to try, or can I get you more information?
Comment 27 Zoltan Varga 2012-03-14 17:56:39 UTC
The workaround disables the code path affected by the first patch, so if there is memory problem, I don't think it is caused by this patch, it is probably a completely unrelated issue.
Comment 28 Bassam 2012-03-19 00:21:46 UTC
Zoltan, I think our memory problem was due to a process limit on locked memory. With mmap its likely that the GC is trying to mmap > 64KB of memory and failing with EAGAIN which would cause the ABRT. I increased the limit and trying again now. 

Q: is the second patch/workaround necessary? It it better to run without it?
Comment 29 Zoltan Varga 2012-03-19 00:23:57 UTC
I made the workaround to determine whenever this memory problem was caused by the first patch or not. So its not necessary.
Comment 30 Bassam 2012-03-20 14:38:25 UTC
thanks. closing.
Comment 31 Bassam 2012-03-28 17:42:09 UTC
*** Bug 2644 has been marked as a duplicate of this bug. ***
Comment 32 Bassam 2012-03-28 17:44:18 UTC
Just want to report that we have been running quite well with this change and after enabling large-heap, increasing the max_map_count limit in linux, and max locked memory ulimit.