Bug 4353 - [Crash] MonoDevelop 2.9.4 crashing randomly
Summary: [Crash] MonoDevelop 2.9.4 crashing randomly
Status: RESOLVED FIXED
Alias: None
Product: Xamarin Studio
Classification: Desktop
Component: General
Version: 2.9.x
Hardware: PC Mac OS
Importance: High critical
Target Milestone: 3.0
Assignee: Bugzilla
URL:
Duplicates: 4520
Depends on: 4366
Blocks:
Reported: 2012-04-09 19:00 UTC by Eric Beisecker
Modified: 2012-12-13 14:45 UTC
CC List: 11 users

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
MonoDevelop 2.9.4 crash report (63.83 KB, text/plain)
2012-04-09 19:00 UTC, Eric Beisecker
MonoDevelop "2.9.5" Crash report (62.11 KB, text/plain)
2012-04-23 14:31 UTC, Eric Beisecker
patch fixing crash pointed out by Alan (2.13 KB, patch)
2012-05-04 16:58 UTC, Kristian Rietveld (inactive)


Description Eric Beisecker 2012-04-09 19:00:22 UTC
Created attachment 1639
MonoDevelop 2.9.4 crash report

While working with MonoDevelop 2.9.4, I am experiencing random crashes while typing, saving, or opening a context menu on a type.

I can't seem to get a reproducible set of steps, but it happens pretty often.

I've attached the Crash Report.
Comment 1 Alan McGovern 2012-04-11 10:43:10 UTC
Looks like random memory corruption. Bug #4366 in the version of pango we are currently shipping could be causing this.
Comment 2 Eric Beisecker 2012-04-17 18:02:38 UTC
Since QA was not seeing this behavior in the 2.8.8 releases, we should treat this as a priority before we release 2.9.x to Beta/Stable.
Comment 3 Duncan Mak 2012-04-23 14:23:15 UTC
We released Mono 2.10.9_10 with Pango 1.30.0 last week, which should fix #4366. 

We need to know if this is a duplicate of 4366 or not.
Comment 4 Eric Beisecker 2012-04-23 14:31:24 UTC
Created attachment 1730
MonoDevelop "2.9.5" Crash report
Comment 5 Eric Beisecker 2012-04-23 14:31:56 UTC
I am still seeing this issue with the new version of Mono and the latest build of MonoDevelop.
Comment 6 Rolf Bjarne Kvinge [MSFT] 2012-04-23 16:21:44 UTC
*** Bug 4520 has been marked as a duplicate of this bug. ***
Comment 7 Miguel de Icaza [MSFT] 2012-04-23 19:18:26 UTC
Reopening, since we can repro in the office.

What other info is needed, guys?
Comment 8 Michael Natterer 2012-04-26 06:29:22 UTC
Hmm, I don't see any indication in this bug report, or from the
place it crashes, that it's related to smooth scrolling, but
just to make sure that we are on the same page, you do use

"scrolling-4.patch" from
https://bugzilla.gnome.org/show_bug.cgi?id=516725

Other than that, it doesn't look related to bug #4366 either.

I'm rather suspecting it is a different-but-similar incarnation
of the issue where we fail to find the right window for an event,
so could be related to bug #2158.

Is there any way to turn the offsets in the MD crash reports into
line numbers?
Comment 9 Alan McGovern 2012-04-26 06:31:10 UTC
*** Bug 4651 has been marked as a duplicate of this bug. ***
Comment 10 Alan McGovern 2012-04-26 06:38:59 UTC
The reason for suspecting one of our custom patches was that the functions in the backtrace are ones which we are patching. I'm not sure which of the files is supposed to be 'scrolling-4.patch' in that report. Did you give the right filename?

Is there anything we can do to help diagnose this? At the moment we are struggling to get a reliable repro for it. 

I'll raise a critical bug to get our mono releases shipped with proper debug symbols so that future crashes will contain line numbers.
Comment 11 Michael Natterer 2012-04-26 08:59:52 UTC
Yes, how do I figure a line number from an offset in the MD crash reports?
Comment 12 Marek Habersack 2012-04-26 09:50:17 UTC
You can try the otool utility on OS X. It can dump and disassemble the (__TEXT,__text) section of the Mono runtime, where the code lives, so you should be able to find the address range where the crash occurs and map it to a function name in the runtime. Example command to dump the mono binary from the attached crash report:

otool -tv /Library/Frameworks/Mono.framework/Versions/2.10.9/bin/mono
Comment 13 Mikayla Hutchinson [MSFT] 2012-04-26 10:54:07 UTC
I have tried unsuccessfully to get line info from the Gtk bundled with Mono, using otool & friends. The symbols appear to be corrupt.
Comment 14 Michael Natterer 2012-04-27 06:32:55 UTC
When I use "otool -tV" (not -tv), it shows the called functions
symbolically, so it's quite easy to find where the crash happens.
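Turning an address into "function + offset" is a nearest-preceding-symbol lookup: find the last symbol whose start address is at or below the address, then subtract. A minimal C sketch, assuming a symbol table already sorted by address (for example, transcribed by hand from nm or otool -tV output). Symbol and find_symbol are hypothetical names, and the two start addresses are derived from the gdb listing in comment 17: 0x03de6c56 is proxy_button_event+459, so that function starts at 0x03de6a8b, and the call target 0x3de5d9d is get_pointer_window. Getting from an offset to a source line additionally needs debug info, which this does not cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol-table entry: a function name and its start address. */
typedef struct {
    const char *name;
    uintptr_t start;
} Symbol;

/* Return the index of the symbol with the greatest start address <= addr,
 * or -1 if addr lies before every symbol. `table` must be sorted by start. */
static long find_symbol(const Symbol *table, size_t n, uintptr_t addr)
{
    assert(n == 0 || table != NULL);
    size_t lo = 0, hi = n;          /* binary search over [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].start <= addr)
            lo = mid + 1;           /* mid starts at or before addr */
        else
            hi = mid;               /* mid starts after addr */
    }
    /* lo is now the number of symbols starting at or before addr. */
    return lo == 0 ? -1 : (long)(lo - 1);
}

/* Start addresses derived from the gdb listing in comment 17 (illustrative). */
static const Symbol md_symbols[] = {
    { "get_pointer_window", 0x03de5d9d },
    { "proxy_button_event", 0x03de6a8b },
};
```

With this table, find_symbol(md_symbols, 2, 0x03de6c56) returns index 1, and 0x03de6c56 minus md_symbols[1].start is 459, matching the proxy_button_event+459 frame. The lookup assumes each function runs up to the start of the next symbol, so missing or corrupt symbols can attribute an address to the wrong function.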
Comment 15 Michael Natterer 2012-04-27 06:39:00 UTC
The relevant part seems to be:

000000000004085c        callq   _get_pointer_window
0000000000040861        movq    %rax,0xd0(%rbp)
0000000000040865        movl    0xfc(%rbp),%ecx  <--- offset 459 into
                                                      proxy_button_event()
0000000000040868        movq    0x90(%rbp),%rax
000000000004086c        movl    0xf4(%rbp),%edx
000000000004086f        movq    0xd0(%rbp),%rsi
0000000000040873        movq    0xa8(%rbp),%rdi
0000000000040877        movq    %rax,%r9
000000000004087a        movl    $0x00000000,%r8d
0000000000040880        callq   _get_event_window

In C, this looks like:

  pointer_window = get_pointer_window (display, toplevel_window,
				       toplevel_x, toplevel_y,
				       serial);

  event_win = get_event_window (display,
				pointer_window,
				type, state,
				NULL, serial);

I have absolutely no clue how this can crash here.

I'm looking at my own build of gtk-2-24, and this file and function have
not changed in ages, so I guess the assembly would look the same
as in MD? I might be doing something horribly wrong here; I'm not
an assembler expert at all.
Comment 16 Michael Natterer 2012-04-27 06:51:13 UTC
For lack of a better idea: get_event_window() would crash if
get_pointer_window() returns NULL. Can you try to insert

  if (pointer_window == NULL)
    return TRUE;

after the get_pointer_window() call?

I would test it myself, but I can't reproduce the crash.
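The effect of that guard can be illustrated with a self-contained mock. MockWindow, mock_toplevel, mock_get_pointer_window, mock_get_event_window and mock_proxy_button_event are hypothetical stand-ins, not GDK's API: the real functions take a GdkDisplay, GdkWindow, coordinates and a serial, and proxy_button_event() returns a gboolean rather than a window. The mock pointer lookup returns NULL for a point outside the toplevel or for a NULL toplevel, and the mock event-window lookup asserts where the real code is suspected to dereference without checking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a GdkWindow: only the size matters here. */
typedef struct {
    int width, height;
} MockWindow;

static MockWindow mock_toplevel = { 800, 600 };
static int event_window_lookups = 0;  /* counts calls, for illustration */

/* Stand-in for get_pointer_window(): NULL when there is no toplevel or when
 * (x, y) falls outside it, e.g. a negative toplevel_x. */
static MockWindow *mock_get_pointer_window(MockWindow *toplevel, int x, int y)
{
    if (toplevel == NULL || x < 0 || y < 0 ||
        x >= toplevel->width || y >= toplevel->height)
        return NULL;
    return toplevel;
}

/* Stand-in for get_event_window(): the assert marks the spot where the real
 * function is suspected to dereference a NULL pointer_window. */
static MockWindow *mock_get_event_window(MockWindow *pointer_window)
{
    assert(pointer_window != NULL);
    event_window_lookups++;
    return pointer_window;
}

/* Sketch of proxy_button_event() with the suggested guard. Returns the window
 * that would receive the event, or NULL when the guard drops it. */
static MockWindow *mock_proxy_button_event(MockWindow *toplevel, int x, int y)
{
    MockWindow *pointer_window = mock_get_pointer_window(toplevel, x, y);
    if (pointer_window == NULL)
        return NULL;                 /* the guard: bail out early */
    return mock_get_event_window(pointer_window);
}
```

Without the early return, a click at (-2, 100) would reach mock_get_event_window with a NULL argument and trip its assert, which is the mock's version of the crash. Note that the guard only hides the symptom; it does not explain why pointer_window is NULL in the first place.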
Comment 17 Kristian Rietveld (inactive) 2012-04-29 18:36:32 UTC
The relevant assembly from MD is (obtained using gdb):

0x03de6c4e <proxy_button_event+451>:    mov    %eax,(%esp)
0x03de6c51 <proxy_button_event+454>:    call   0x3de5d9d <get_pointer_window>
0x03de6c56 <proxy_button_event+459>:    mov    %eax,-0x18(%ebp)
0x03de6c59 <proxy_button_event+462>:    mov    -0x24(%ebp),%edx
0x03de6c5c <proxy_button_event+465>:    mov    0xc(%ebp),%eax
0x03de6c5f <proxy_button_event+468>:    mov    %eax,0x14(%esp)
0x03de6c63 <proxy_button_event+472>:    movl   $0x0,0x10(%esp)
0x03de6c6b <proxy_button_event+480>:    mov    %edx,0xc(%esp)


The only addresses being accessed here are the stack pointer and the base pointer, so one of these could have been busted by the function call somehow. A corrupted stack can cause the get_pointer_window function epilogue (leave instruction) to restore a busted %ebp. This is similar to the x86_64 code posted by Mitch in comment 15, which also mainly works with the base pointer.

It also looks like the offsets in the stack trace are all one ahead, so perhaps the crash actually occurs at +454 (but then the big question is why this frame is not present in the stack trace).
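One common reason for offsets that look "one ahead" is that every frame except the innermost records a return address, which points at the instruction after the call; symbolizers therefore usually look up return_addr - 1 so the result lands inside the call instruction. A minimal C sketch using the instruction start addresses from the gdb listing in comment 17. call_site_for_return and insn_starts are hypothetical names, and the array is assumed to be sorted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Instruction start addresses in proxy_button_event, copied from the
 * gdb listing in comment 17; must be sorted ascending. */
static const uintptr_t insn_starts[] = {
    0x03de6c4e, 0x03de6c51, 0x03de6c56, 0x03de6c59,
    0x03de6c5c, 0x03de6c5f, 0x03de6c63, 0x03de6c6b,
};

/* Given a return address saved in a caller's frame, return the start of the
 * instruction containing return_addr - 1 (the call that was in progress),
 * or 0 if no listed instruction starts at or before that byte. */
static uintptr_t call_site_for_return(const uintptr_t *starts, size_t n,
                                      uintptr_t return_addr)
{
    assert(return_addr > 0);
    uintptr_t target = return_addr - 1;  /* a byte inside the call */
    uintptr_t best = 0;
    for (size_t i = 0; i < n && starts[i] <= target; i++)
        best = starts[i];
    return best;
}
```

For the return address 0x03de6c56 (proxy_button_event+459) this returns 0x03de6c51, the 5-byte call to get_pointer_window at proxy_button_event+454. If that reading is right, the instruction that actually faulted would be inside get_pointer_window, whose frame is missing from the native trace.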
Comment 18 Kristian Rietveld (inactive) 2012-04-29 19:26:01 UTC
In bug 4651 we also observe that a crash happens in a function call which does not appear in the "native stack trace", but this function frame is present in the gdb info.

Therefore, an alternative theory could be that the crash happens inside get_pointer_window() (since the offsets are often one ahead). A plausible way this could happen is event_window being NULL, which for proxy_button_event() means toplevel_window is NULL. It is not yet clear to me how toplevel_window can be NULL at this point: if event_window == NULL were passed into convert_native_coords_to_toplevel(), a crash would occur earlier, because gdk_window_is_toplevel() accesses the window without checking.
Comment 19 Duncan Mak 2012-04-30 15:30:33 UTC
Do we still want to include the patch to test for pointer_window == NULL, as suggested in comment 16?
Comment 20 Alan McGovern 2012-05-04 09:53:24 UTC
I have a 100% repro for a similar crash, though I'm not sure if it's exactly the same one.

1) Open MD 
2) Move the mouse to the very left edge of MonoDevelop, right where you'd normally left click and drag to resize the window.
3) Right click here.

Result:

Program received signal SIGSEGV, Segmentation fault.
0x04bd5aa4 in gdk_event_translate (event=0x177eb00, nsevent=0xed2b020) at gdkevents-quartz.c:1323
1323	  if (GDK_IS_WINDOW (window))
Current language:  auto; currently objective-c
(gdb) bt
#0  0x04bd5aa4 in gdk_event_translate (event=0x177eb00, nsevent=0xed2b020) at gdkevents-quartz.c:1323
#1  0x04bd6400 in _gdk_events_queue (display=0xaa2000) at gdkevents-quartz.c:1517
#2  0x04bd7742 in gdk_event_dispatch (source=0x6726a0, callback=0, user_data=0x0) at gdkeventloop-quartz.c:670
#3  0x043619b8 in g_main_dispatch (context=0x672700) at gmain.c:2441
#4  0x043631c5 in g_main_context_dispatch (context=0x672700) at gmain.c:3014
#5  0x04363749 in g_main_context_iterate (context=0x672700, block=1, dispatch=1, self=0x2831b60) at gmain.c:3092
#6  0x0436403a in g_main_loop_run (loop=0xb8cdeb0) at gmain.c:3300
#7  0x046623f0 in gtk_main () at gtkmain.c:1256
#8  0x0ec341f4 in ?? ()
#9  0x0ec341bc in ?? ()
#10 0x0ec3419c in ?? ()
#11 0x041507b8 in ?? ()
#12 0x00741f90 in ?? ()
#13 0x00741d9c in ?? ()
#14 0x00741e56 in ?? ()
#15 0x00010f4f in mono_jit_runtime_invoke (method=0x32a0e1c, obj=0x0, params=0xbffff1a8, exc=0x0) at mini.c:5791
#16 0x002216ca in mono_runtime_invoke (method=0x32a0e1c, obj=0x0, params=0xbffff1a8, exc=0x0) at object.c:2755
#17 0x002243ac in mono_runtime_exec_main (method=0x32a0e1c, args=0x4f4e00, exc=0x0) at object.c:3930
#18 0x00223611 in mono_runtime_run_main (method=0x32a0e1c, argc=0, argv=0xbffff4bc, exc=0x0) at object.c:3560
#19 0x000acaef in mono_jit_exec (domain=0x4ede00, assembly=0x2829f80, argc=1, argv=0xbffff4b8) at driver.c:944
#20 0x000acd40 in main_thread_handler (user_data=0xbffff3d8) at driver.c:1003
#21 0x000af198 in mono_main (argc=3, argv=0xbffff4b0) at driver.c:1855
#22 0x00002494 in mono_main_with_options (argc=3, argv=0xbffff4b0) at main.c:66
#23 0x00002528 in main (argc=3, argv=0xbffff4b0) at main.c:97
Comment 21 Alan McGovern 2012-05-04 11:07:55 UTC
The crash does not happen when building Mono using rev 3f16b2f298feda1c1b5518849a9615d8421cd88c of git://github.com/xamarin/bockbuild.git. It does happen with current master, rev 2f535837dfc026f3804b684541e257fba2a9a66b. That leaves the following commits, one of which most likely introduced or triggered the bug. I am currently trying to bisect them to figure out which commit it was.

commit 2f535837dfc026f3804b684541e257fba2a9a66b
Author: Duncan Mak <duncan.mak@xamarin.com>
Date:   Thu Apr 26 18:12:13 2012 -0400

    Fix typo.

commit ca88d4c792a6d175c2117b2b104895eea2a8663a
Author: Duncan Mak <duncan.mak@xamarin.com>
Date:   Thu Apr 26 17:50:24 2012 -0400

    Include libffi.py.

commit 320c7cc1d6dd711ffbe6e586f9fdbcfc5d216d95
Author: Michael Hutchinson <m.j.hutchinson@gmail.com>
Date:   Wed Apr 25 17:00:31 2012 -0400

    Remove old patch files

commit 4722779e56bb0282a33a66dbf9ef57a81283da49
Author: Michael Hutchinson <m.j.hutchinson@gmail.com>
Date:   Wed Apr 25 16:58:35 2012 -0400

    [libffi] Update to 3.0.11

commit 7e5698ae60403f90e96a083503c476b6d65b7796
Author: Michael Hutchinson <m.j.hutchinson@gmail.com>
Date:   Wed Apr 25 16:57:52 2012 -0400

    [gtk] Update the patch set
    
    * Take window resize patches from git, not bugzilla
    * Add DnD patches from git
    * Update event crash diagnostic patch
    * Clipboard persistence fix

commit 3c462118f4d7f2e24130dbf4c41dd51709fe62f4
Author: Michael Hutchinson <m.j.hutchinson@gmail.com>
Date:   Wed Apr 25 16:55:23 2012 -0400

    [pango] Update the font map crash patch

commit 068402b575cbf41c4e856801ad40d5b45c0d59a1
Author: Michael Hutchinson <m.j.hutchinson@gmail.com>
Date:   Wed Apr 25 16:55:00 2012 -0400

    [atk] Update to 2.2 to build w/new glib
    
    Can't update to 2.4, that needs glib 2.32

commit 61fac0925e17d7e05604a778a7b5f2817bdfbd86
Author: Michael Hutchinson <m.j.hutchinson@gmail.com>
Date:   Wed Apr 25 16:54:21 2012 -0400

    [glib] Update to 2.30.3
    
    Should fix 1758 - [GTK] Complete hang if you drag a file in the OpenFile dialog

commit dc4d36f6cc466d679c96496bd4d8117885458866
Author: Michael Hutchinson <m.j.hutchinson@gmail.com>
Date:   Wed Apr 25 16:52:45 2012 -0400

    [murrine] Update to 0.98.2

commit d0ce430b70d0491809d314563a2434498360a383
Author: Duncan Mak <duncan.mak@xamarin.com>
Date:   Tue Apr 24 11:16:09 2012 -0400

    Revert "Add back the custom gtkrc to the gtk+.py' - gtkrc is not a patch.
    
    This reverts commit a5ad6104361e99df0ea9f820fc8cf54e66d7cbcc.

commit a5ad6104361e99df0ea9f820fc8cf54e66d7cbcc
Author: Alex Corrado <alexander.corrado@gmail.com>
Date:   Fri Apr 20 20:07:02 2012 -0300

    Add back the custom gtkrc to the gtk+.py patch list- it was lost in the merge

commit 14ea80b029647a0eeb9f4e1a560488a39cd2cfd9
Author: Duncan Mak <duncan.mak@xamarin.com>
Date:   Fri Apr 20 17:44:01 2012 -0400

    Bump version number to 2.11.2.

commit 1561ae6a306ee19ea5fed57e6d4ce2cbdc1ea0ae
Author: Duncan Mak <duncan.mak@xamarin.com>
Date:   Thu Apr 19 18:35:57 2012 -0400

    Include XZ and GNU Tar so that we can build the new Pango.
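Bisecting the thirteen commits listed above is a binary search for the first bad commit, assuming every commit after it is also bad; git bisect automates the same search. A minimal C sketch, with commits indexed oldest first (index 0 is 1561ae6, index 12 is 2f53583) and the known-good 3f16b2f just before index 0. first_bad_commit and example_is_bad are hypothetical: example_is_bad stands in for "build bockbuild at this commit and run the repro". As comment 22 notes, the repro used at this point was not the right one, so this only illustrates the method.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Predicate: does the build at commit index i reproduce the crash? */
typedef bool (*is_bad_fn)(size_t i);

/* Hypothetical predicate for illustration: pretend commit 4 introduced the
 * crash. A real one would build at that commit and run the repro. */
static bool example_is_bad(size_t i)
{
    return i >= 4;
}

/* Commits are indexed 0..n-1 from oldest to newest; the known-good revision
 * sits just before index 0 and commit n-1 is known to be bad. Returns the
 * index of the first bad commit, assuming every later commit is bad too. */
static size_t first_bad_commit(size_t n, is_bad_fn is_bad)
{
    assert(n > 0);
    size_t lo = 0, hi = n - 1;  /* the answer is in [lo, hi]; hi is bad */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (is_bad(mid))
            hi = mid;           /* first bad commit is mid or earlier */
        else
            lo = mid + 1;       /* mid is good, so it comes later */
    }
    return lo;
}
```

For thirteen commits this needs at most four repro runs instead of up to twelve when testing them one by one.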
Comment 22 Alan McGovern 2012-05-04 13:06:22 UTC
Unfortunately I didn't have the repro exactly right so I was incorrect when I said those commits contained the commit introducing the problem.

The repro is actually to right click on the border of monodevelop *and* move the mouse at the same time.

I traced the issue in gdb as best I could, and it looks like the bug is that in this snippet [0] nswindow is NULL but toplevel_under_pointer is non-NULL. As a result we fail to bail out, access nswindow while it is NULL, and blow up.

[0] https://gist.github.com/2667edba4cbc149fc655
Comment 23 Mikayla Hutchinson [MSFT] 2012-05-04 13:29:31 UTC
Alan, could you open a separate bug for that? I think the trace and the repro/context are quite different to this, so it's just confusing this one.
Comment 24 Kristian Rietveld (inactive) 2012-05-04 14:49:51 UTC
Using Alan's reproduction technique, I get a stack trace crashing in proxy_button_event + 459; so exactly like the stack trace from comment 4.

So it could be related, will attempt some debugging now.
Comment 25 Kristian Rietveld (inactive) 2012-05-04 15:04:20 UTC
> The repro is actually to right click on the border of monodevelop *and* move
> the mouse at the same time.

Moving is not necessary, but you need to click a bit *beyond* the border. For the left edge, the toplevel_x coordinate is then negative, and this seems to trigger some broken behavior (e.g. for toplevel_x < 0, no pointer_window can be found in proxy_button_event()).
Comment 26 Kristian Rietveld (inactive) 2012-05-04 16:58:44 UTC
Created attachment 1805
patch fixing crash pointed out by Alan

This patch fixes the crash pointed out by Alan.  I cannot tell if this will fix the other cases as well, for this we will have to identify whether the same code path and cause is leading to a NULL pointer_window or whether this is because of a different code path.
Comment 27 Duncan Mak 2012-05-04 17:05:06 UTC
I just added attachment 1805 to bockbuild.
Comment 28 Kristian Rietveld (inactive) 2012-05-04 17:10:54 UTC
To briefly explain what the patch does: OS X does push down events with x coordinates in the range [-3, 0] with a window set to the application. LeftMouseDown is already caught and ignored, because this is the coordinate range where window resizing may be started. However, for right and other mouse-down events, the event is forwarded into GDK. Obviously, there is no GdkWindow in the range [-3, 0], and this causes problems. The solution is to ignore these events, as we already do for LeftMouseDown.

The patch will now also ignore events for all mouse buttons when the coordinates are beyond the window border on the right or at the bottom. (Though I want to double-check the logic for y, now that I have a second look at the code that was already there.)
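The filtering rule described here amounts to a bounds check applied to every mouse-down, not just LeftMouseDown. A minimal C sketch; should_drop_mouse_down is a hypothetical helper, not the patch itself. Coordinates are assumed to be toplevel content coordinates with the origin at the top-left, and the exact edge handling (whether x == 0 counts as outside, and the top edge, which is not checked) is an assumption, since the y logic is still marked for double-checking.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether a mouse-down at (x, y) should be dropped instead of
 * forwarded to GDK. The rule applies to every button, so the button is not
 * an input. Edge handling is an assumption, not the patch's exact logic. */
static bool should_drop_mouse_down(double x, double y,
                                   double width, double height)
{
    assert(width > 0 && height > 0);
    if (x < 0)                  /* left border strip, e.g. x = -2 */
        return true;
    if (x >= width)             /* beyond the right border */
        return true;
    if (y >= height)            /* beyond the bottom border */
        return true;
    return false;               /* inside the content area: forward to GDK */
}
```

Under these assumptions, a right-click at x = -2 in an 800x600 window is dropped before GDK ever looks for a pointer_window, which is the case Alan's repro hit, while a click at (10, 10) is still forwarded.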
Comment 29 Mikayla Hutchinson [MSFT] 2012-05-04 18:04:15 UTC
I couldn't repro the crash with bockbuild master before the patch - it hits the Gdk-warnings we added about a week ago. Alan, are you sure you were actually using master?

Anyway, the patch cleans up the warnings in that particular case, which is good. It looks like Alan found a repro for the original report of bug 2158, since that involved a right-click at the edge of the screen.

But I don't think it can be responsible for the window lookup crashes related to typing in the text editor; I'm pretty sure no mouse button other than left mouse was involved in those. I still suspect we have two issues there. There's the issue with the completion window grabs described in bug 4497 comment 15, but the commit that introduced completion window re-use was two months after the first report of a window lookup crash in this context, in bug 2158. The completion window grab issue just made it worse. So I'm not sure if that's one bug or two bugs.

After we land and verify this patch, I'll remove Mike's workarounds and try to repro the completion window issue.
Comment 30 Alan McGovern 2012-05-04 19:48:45 UTC
I was definitely on bockbuild master at the time. Here's a screencast of me randomly right-clicking and (eventually) triggering the crash: http://screencast.com/t/MeILx7ebk1 . You have to right-click in exactly the right place to trigger it this way, so it's quite possible you just didn't click in the right place.

I'll update to the latest bockbuild and try to verify the patch works as expected.
Comment 31 Mikayla Hutchinson [MSFT] 2012-05-04 19:53:11 UTC
Well, I could reproduce what would have been a crash without https://github.com/xamarin/bockbuild/blob/master/packages/gtk%2B.py#L40 but that was added a while ago, so I don't think your build was up to date. Note that bockbuild does *not* automatically rebuild when patches are updated - you have to remove the .success files.
Comment 32 Alan McGovern 2012-05-04 20:08:13 UTC
If you look at the screencast at about 0:19 you'll see the message being printed in my terminal saying that grab->window is NULL, so that patch was definitely in my build.

Kristian, I can confirm that the patch you supplied fixes the crash in the two repro cases I had. The first testcase was clicking near the edge of the main MonoDevelop window; the second was right/left clicking continuously on a widget to make its context menu appear and disappear. Both were a reliable crash within a few seconds, but I've clicked for ages now with no issue.
Comment 33 Kristian Rietveld (inactive) 2012-05-05 06:24:41 UTC
> Anyway, the patch cleans up the warnings in that particular case, which is
> good. It looks like Alan found a repro for the original report of bug 2158,
> since that involved a right-click at the edge of the screen.

This one is almost certainly fixed: in my post-mortem analysis in bug 2158 comment 27, I concluded that an event was likely generated with coordinates out of the bounds of the toplevel. That is exactly what we found now.


> But I don't think it can be responsible for the window lookup crashes related
> to typing in the text editor - I'm pretty sure no mouse button other than left
> mouse was involved in those. I still suspect we have two issues there - there's
> the issue with the completion window grabs described in bug 4497 comment 15,

I agree. We need to find out whether the cases where grabs are messed up actually corrupt the grab state by leaving a grab with grab->window == NULL.
Comment 34 Kristian Rietveld (inactive) 2012-05-05 06:42:45 UTC
As I just commented in bug 4651, that bug is likely different. In 4651, event->any.window is NULL, whereas this bug was caused by out-of-bounds coordinates (we did have a window).
Comment 35 Kristian Rietveld (inactive) 2012-05-13 12:21:17 UTC
Attachment 1805 from comment 26 has been upstreamed.
Comment 36 Alan McGovern 2012-12-13 08:21:52 UTC
Can this bug be closed now? I think it has been fixed for quite a few months.
Comment 37 Mikayla Hutchinson [MSFT] 2012-12-13 14:45:34 UTC
Yes, I think we can.