Bug 6349 - [kqueue]: socket.close blocks indefinitely
Summary: [kqueue]: socket.close blocks indefinitely
Status: RESOLVED FIXED
Alias: None
Product: iOS
Classification: Xamarin
Component: Xamarin.iOS.dll ()
Version: 5.99.x
Hardware: PC Mac OS
: --- normal
Target Milestone: Untriaged
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2012-08-01 06:41 UTC by Alex
Modified: 2013-03-28 04:45 UTC (History)
7 users (show)

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
iPhone test application to reproduce the bug (1.32 MB, application/zip)
2012-08-02 12:26 UTC, Alex
Details


Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.


Please create a new report on Developer Community or GitHub with your current version information, steps to reproduce, and relevant error messages or log files if you are hitting an issue that looks similar to this resolved bug and you do not yet see a matching new report.

Related Links:
Status:
RESOLVED FIXED

Description Alex 2012-08-01 06:41:42 UTC
Dear all,

we have found a problem with the socket.Close method in our application for iOS. When closing a TCP connection the socket.Close method blocks indefinitely. We have debugged through Mono source code and found out that the Close_internal method never returns (this method is called in socket.Dispose(bool explicit disposing), called by socket.Dispose which is called in socket.Close.

I think that no test code is necessary, because it is not hard to reproduce - just create an IPv4 TCP socket, connect to any TCP server, send some data and then call socket.Close. To reproduce this it is not necessary to Close directly after sending the data, it is not time dependent. We are able to produce one deadlock in the iOS simulator from approximately 10 - 20 trials, so it does not happen every time. We have also tried to explicitly set the linger time to 0 or to call Shutdown on the socket, but this does not help. If necessary, we can provide the test code based on the above description.

As this seems to be critical with no good workaround, any chances to have this fixed soon?

Best regards

Alex
Comment 1 Sebastien Pouliot 2012-08-01 14:12:05 UTC
Could you please attach your test case to the bug report ? The whole System.Net HTTP stack is built on top of the Socket class and this works (and get tested on every release). 

Also provide the exact version number of MonoTouch you're using (in case it's not the *alpha* 5.99).
Thanks
Comment 2 Alex 2012-08-02 03:15:30 UTC
Hi Sebastien,

ok, I started to develop an iOS app to make the smallest possible repro of this. When doing this I was able to track down the cause of this behaviour. I still think it is a bug, because this behaviour differs from how the same code behaves on Windows.

After creating a socket we use it for asynchronous data sends and receives (SendAsync and ReceiveAsync pattern). The first portion of data is sent directly in the connect callback method of the TCP client socket. After a call to SendAsync we also call ReceiveAsync to get the response from the server. This scenario leads (not always but often) to the problem, that a later call to socket.Close blocks in Close_internal (called by Dispose). When we call ReceiveAsync BEFORE the first SendAsync, the Close method always returns.

I can imagine that this might be caused by the fact that some TCP packets may actually reach the socket before a receive handler has been installed, which might be a synchronization issue. But it never causes a problem on Windows.

Best regards

Alex
Comment 3 Sebastien Pouliot 2012-08-02 07:52:46 UTC
Hello Alex,

I did not doubt there can be a bug (even more if this does work as expected on Windows) but it would be one, like you described, that can only be hit in some specific circumstances. This is why I asked for a test case, since we could create several of our own(s) test cases without hitting the same issue (this happens too many times).

To avoid any misunderstanding could you attach your test case to this bug report ? Thanks
Comment 4 Alex 2012-08-02 12:26:00 UTC
Created attachment 2291 [details]
iPhone test application to reproduce the bug
Comment 5 Alex 2012-08-02 12:27:03 UTC
Hi Sebastien,

thanks for looking into this. Yes, you are absolutely right, it can be hit under specific circumstances only. I did my best to help you to reproduce the problem. Please follow the instructions below and in case of having any additional questions, feel free to update this bug report. I have created the smallest possible application to reproduce the problem (about hundred lines of code). It is a simple Single ViewController iPhone application.

1. Open the MonoTouch solution.

2. For the bug to be hit you will need a TCP server. I have added a very very easy TCP server directly to the iPhone app for your convenience, but we have been able to reproduce this bug with various TCP or HTTP servers in our local network. If you want to run the app against a different TCP server, just edit the source code. If you do not make any changes, the TCP server will be listening on port 7000 and the TCP client socket will connect to 127.0.0.1:7000.

3. In case you will use the TCP server in the iPhone app, hit the "Start server" button.

4. Now. Click the "Connect and send" button. A new socket will be created and connected to the server and will be used to send 10 bytes of data to the server. Then Click the "Close" button. Since the Connect and Send action happens very fast, you can click the Close button immediately after the first one, but you may also wait some time.

5. Repeat step 4 until you hit the bug. The socket.Close method will block indefinitely in Close_internal, which is called by socket.Dispose, called in socket.Close. I typically need 4 - 30 trials to hit the bug. I never had to repeat the previous step more than 50 times to hit the bug. I am using a Mac Mini with Intel i5, 4GB RAM, MonoTouch 5.2.12, MonoDevelop 3.0.3.4, Mono 2.10.9.


Some additional remarks:

- when you change the order of calling BeginReceive and BeginSend in the sendandreceive() method (see source code), you will not hit the bug - this is our current workaround. Our application uses sockets heavily and suffered from these locks very much.

- The bug will be hit regardless of what programming model for asynchronous socket operations is used - using the ReceiveAsync and SendAsync pattern also leads to this problem.

Hope this helps. Please let me know, if you were able to reproduce.

Alex
Comment 6 Sebastien Pouliot 2012-08-15 14:06:17 UTC
Thanks for your sample. I was not able to reproduce this (60 tries) with 'master' (i.e. the latest code) but it uses a more recent version of mono (2.10.x). 

I'll try it again with the 5.4 branch and (if I have to go back) 5.2 branch.
Comment 7 Sebastien Pouliot 2012-08-15 15:45:26 UTC
I could duplicate it with 5.4-branch and (tested again) on master too.
Comment 8 Adam Kemp 2012-08-24 11:01:17 UTC
Any progress on this? I think that this is hitting our application as well, and it doesn't even appear to be assigned to anyone.
Comment 9 Martin Baulig 2012-08-30 06:22:31 UTC
I can easily reproduce this, it never takes about 5 attempts before it's blocking.  I'll have a look at this now.
Comment 10 Martin Baulig 2012-08-30 10:59:47 UTC
Observations so far:

This can easily be reproduced with a command-line mono program (without monotouch) - and I tested both mono 2.10 and latest master.

I could attach to the process and get a backtrace when it's hanging:

====

(gdb) attach 14577
Attaching to process 14577.
Reading symbols for shared libraries . done
Reading symbols for shared libraries .................................... done
[Switching to process 14577 thread 0x1503]
0x98d09c22 in mach_msg_trap ()
(gdb) bt
#0  0x98d09c22 in mach_msg_trap ()
#1  0x98d09280 in mach_msg ()
#2  0x0016bf11 in mach_exception_thread (arg=0x0) at mini-darwin.c:139
#3  0x00312fc8 in GC_start_routine (arg=0x542f60) at pthread_support.c:1508
#4  0x925d7ed9 in _pthread_start ()
#5  0x925db6de in thread_start ()
(gdb) thread apply all bt

Thread 8 (process 14577):
#0  0x98d09c76 in semaphore_timedwait_trap ()
#1  0x002e2f4d in mono_sem_timedwait (sem=0x40b028, timeout_ms=2000, 
    alertable=1) at mono-semaphore.c:76
#2  0x0028582a in async_invoke_thread (data=0x0) at threadpool.c:1513
#3  0x00286ff1 in start_wrapper_internal (data=0x40b2dc0) at threads.c:589
#4  0x00287112 in start_wrapper (data=0x40b2dc0) at threads.c:635
#5  0x002d1699 in thread_start_routine (args=0x1986fa8) at wthreads.c:286
#6  0x002e6301 in inner_start_thread (arg=0x40b1dd0) at mono-threads-posix.c:49
#7  0x00312fc8 in GC_start_routine (arg=0x542f00) at pthread_support.c:1508
#8  0x925d7ed9 in _pthread_start ()
#9  0x925db6de in thread_start ()

Thread 7 (process 14577):
#0  0x98d0bbb2 in __semwait_signal ()
#1  0x9258c7b9 in nanosleep$UNIX2003 ()
#2  0x002d2530 in SleepEx (ms=500, alertable=1) at wthreads.c:834
#3  0x0028305a in monitor_thread (unused=0x0) at threadpool.c:783
#4  0x00286ff1 in start_wrapper_internal (data=0x40b2950) at threads.c:589
#5  0x00287112 in start_wrapper (data=0x40b2950) at threads.c:635
#6  0x002d1699 in thread_start_routine (args=0x1986f2c) at wthreads.c:286
#7  0x002e6301 in inner_start_thread (arg=0x40b2d90) at mono-threads-posix.c:49
---Type <return> to continue, or q <return> to quit---
#8  0x00312fc8 in GC_start_routine (arg=0x542f00) at pthread_support.c:1508
#9  0x925d7ed9 in _pthread_start ()
#10 0x925db6de in thread_start ()

Thread 6 (process 14577):
#0  0x98d09c76 in semaphore_timedwait_trap ()
#1  0x002e2f4d in mono_sem_timedwait (sem=0x40af48, timeout_ms=2000, 
    alertable=1) at mono-semaphore.c:76
#2  0x0028582a in async_invoke_thread (data=0x0) at threadpool.c:1513
#3  0x00286ff1 in start_wrapper_internal (data=0x10a7da0) at threads.c:589
#4  0x00287112 in start_wrapper (data=0x10a7da0) at threads.c:635
#5  0x002d1699 in thread_start_routine (args=0x1986eb0) at wthreads.c:286
#6  0x002e6301 in inner_start_thread (arg=0x10a7aa0) at mono-threads-posix.c:49
#7  0x00312fc8 in GC_start_routine (arg=0x542f00) at pthread_support.c:1508
#8  0x925d7ed9 in _pthread_start ()
#9  0x925db6de in thread_start ()

Thread 5 (process 14577):
#0  0x98d0c90a in kevent ()
#1  0x00281091 in tp_kqueue_wait (p=0x40afc0) at tpool-kqueue.c:99
#2  0x00286ff1 in start_wrapper_internal (data=0x10a75c0) at threads.c:589
#3  0x00287112 in start_wrapper (data=0x10a75c0) at threads.c:635
#4  0x002d1699 in thread_start_routine (args=0x1986e34) at wthreads.c:286
---Type <return> to continue, or q <return> to quit---
#5  0x002e6301 in inner_start_thread (arg=0x10a8080) at mono-threads-posix.c:49
#6  0x00312fc8 in GC_start_routine (arg=0x542f00) at pthread_support.c:1508
#7  0x925d7ed9 in _pthread_start ()
#8  0x925db6de in thread_start ()

Thread 4 (process 14577):
#0  0x98d0c90a in kevent ()
#1  0x9a359e10 in _dispatch_mgr_invoke ()
#2  0x9a35885f in _dispatch_mgr_thread ()

Thread 3 (process 14577):
#0  0x98d09c5e in semaphore_wait_trap ()
#1  0x002e2fc2 in mono_sem_wait (sem=0x40a754, alertable=1)
    at mono-semaphore.c:115
#2  0x001b960d in finalizer_thread (unused=0x0) at gc.c:1094
#3  0x00286ff1 in start_wrapper_internal (data=0x1088fe0) at threads.c:589
#4  0x00287112 in start_wrapper (data=0x1088fe0) at threads.c:635
#5  0x002d1699 in thread_start_routine (args=0x1986db8) at wthreads.c:286
#6  0x002e6301 in inner_start_thread (arg=0x1088b30) at mono-threads-posix.c:49
#7  0x00312fc8 in GC_start_routine (arg=0x542f60) at pthread_support.c:1508
#8  0x925d7ed9 in _pthread_start ()
#9  0x925db6de in thread_start ()

---Type <return> to continue, or q <return> to quit---
Thread 2 (process 14577):
#0  0x98d09c22 in mach_msg_trap ()
#1  0x98d09280 in mach_msg ()
#2  0x0016bf11 in mach_exception_thread (arg=0x0) at mini-darwin.c:139
#3  0x00312fc8 in GC_start_routine (arg=0x542f60) at pthread_support.c:1508
#4  0x925d7ed9 in _pthread_start ()
#5  0x925db6de in thread_start ()

Thread 1 (process 14577):
#0  0x98d0c2ea in close$UNIX2003 ()
#1  0x002c80d4 in socket_close (handle=0x5, data=0xbffff3ec) at sockets.c:104
#2  0x002b466f in _wapi_handle_unref_full (handle=0x5, 
    ignore_private_busy_handles=0) at handles.c:1148
#3  0x002b473f in _wapi_handle_unref (handle=0x5) at handles.c:1156
#4  0x002c84c2 in closesocket (fd=5) at sockets.c:191
#5  0x0027b2f1 in ves_icall_System_Net_Sockets_Socket_Close_internal (sock=5, 
    error=0xbffff514) at socket-io.c:815
#6  0x005a6830 in ?? ()
#7  0x005a6664 in ?? ()
#8  0x005a65b6 in ?? ()
#9  0x005a6596 in ?? ()
#10 0x02e49b70 in ?? ()
#11 0x005a0e24 in ?? ()
---Type <return> to continue, or q <return> to quit---
#12 0x005a0ee7 in ?? ()
#13 0x00011581 in mono_jit_runtime_invoke (method=0x1994a1c, obj=0x0, 
    params=0xbffff6f8, exc=0x0) at mini.c:5928
#14 0x0023f9aa in mono_runtime_invoke (method=0x1994a1c, obj=0x0, 
    params=0xbffff6f8, exc=0x0) at object.c:2810
#15 0x002426bc in mono_runtime_exec_main (method=0x1994a1c, args=0x54be00, 
    exc=0x0) at object.c:4008
#16 0x0024187c in mono_runtime_run_main (method=0x1994a1c, argc=0, 
    argv=0xbffff9ec, exc=0x0) at object.c:3630
#17 0x000b1bbf in mono_jit_exec (domain=0x546e00, assembly=0x1089870, argc=1, 
    argv=0xbffff9e8) at driver.c:954
#18 0x000b1e10 in main_thread_handler (user_data=0xbffff900) at driver.c:1013
#19 0x000b4562 in mono_main (argc=3, argv=0xbffff9e0) at driver.c:1926
#20 0x00001e5f in mono_main_with_options (argc=3, argv=0xbffff9e0) at main.c:93
#21 0x00001ef8 in main (argc=3, argv=0xbffff9e0) at main.c:124
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x98d09c22 in mach_msg_trap ()
(gdb) bt
#0  0x98d09c22 in mach_msg_trap ()
#1  0x98d09280 in mach_msg ()
#2  0x0016bf11 in mach_exception_thread (arg=0x0) at mini-darwin.c:139
#3  0x00312fc8 in GC_start_routine (arg=0x542f60) at pthread_support.c:1508
#4  0x925d7ed9 in _pthread_start ()
#5  0x925db6de in thread_start ()
(gdb) thread 1
[Switching to thread 1 (process 14577), "com.apple.main-thread"]
0x98d0c2ea in close$UNIX2003 ()
(gdb) bt
#0  0x98d0c2ea in close$UNIX2003 ()
#1  0x002c80d4 in socket_close (handle=0x5, data=0xbffff3ec) at sockets.c:104
#2  0x002b466f in _wapi_handle_unref_full (handle=0x5, 
    ignore_private_busy_handles=0) at handles.c:1148
#3  0x002b473f in _wapi_handle_unref (handle=0x5) at handles.c:1156
#4  0x002c84c2 in closesocket (fd=5) at sockets.c:191
#5  0x0027b2f1 in ves_icall_System_Net_Sockets_Socket_Close_internal (sock=5, 
    error=0xbffff514) at socket-io.c:815
#6  0x005a6830 in ?? ()
#7  0x005a6664 in ?? ()
#8  0x005a65b6 in ?? ()
#9  0x005a6596 in ?? ()
#10 0x02e49b70 in ?? ()
#11 0x005a0e24 in ?? ()
#12 0x005a0ee7 in ?? ()
#13 0x00011581 in mono_jit_runtime_invoke (method=0x1994a1c, obj=0x0, 
    params=0xbffff6f8, exc=0x0) at mini.c:5928
#14 0x0023f9aa in mono_runtime_invoke (method=0x1994a1c, obj=0x0, 
    params=0xbffff6f8, exc=0x0) at object.c:2810
#15 0x002426bc in mono_runtime_exec_main (method=0x1994a1c, args=0x54be00, 
    exc=0x0) at object.c:4008
#16 0x0024187c in mono_runtime_run_main (method=0x1994a1c, argc=0, 
    argv=0xbffff9ec, exc=0x0) at object.c:3630
---Type <return> to continue, or q <return> to quit---q
#18 0x000b1e10 in main_thread_handler (user_data=0xbffff900) at driver.c:1013
#19 0x000b4562 in mono_main (argc=3, argv=0xbffff9e0) at driver.c:1926
#20 0x00001e5f in mono_main_with_options (argc=3, argv=0xbffff9e0) at main.c:93
#21 0x00001ef8 in main (argc=3, argv=0xbffff9e0) at main.c:124
(gdb) kill
Kill the program being debugged? (y or n) y
====

After that, gdb hangs, I can't even kill the process.  The same thing also happens when running mono from the command line.

====
[martin@nathan:SocketTest]$ ps x | grep mono
  467   ??  U      0:01.61 /Library/Frameworks/Mono.framework/Versions/2.10.9/bin/mono --debug /Developer/MonoTouch/Source/test/SocketTest/SocketTest/bin/Debug/SocketTest.exe
  478   ??  U      0:01.63 /work/INSTALL/bin/mono --debug /Developer/MonoTouch/Source/test/SocketTest/SocketTest/bin/Debug/SocketTest.exe
  699   ??  U      0:01.58 /work/INSTALL/bin/mono --debug --debugger-agent=transport=dt_socket,address=127.0.0.1:49355 /Developer/MonoTouch/Source/test/SocketTest/SocketTest/bin/Debug/SocketTest.exe
  717   ??  U      0:01.44 /work/INSTALL/bin/mono --debug /Developer/MonoTouch/Source/test/SocketTest/SocketTest/bin/Debug/SocketTest.exe
  746   ??  U      0:01.43 /work/INSTALL/bin/mono --debug /Developer/MonoTouch/Source/test/SocketTest/SocketTest/bin/Debug/SocketTest.exe
  927   ??  U      0:03.49 /work/INSTALL/bin/mono --debug /Developer/MonoTouch/Source/test/SocketTest/SocketTest/bin/Debug/SocketTest.exe
  943   ??  U      0:01.36 /work/INSTALL/bin/mono --debug /Developer/MonoTouch/Source/test/SocketTest/SocketTest/bin/Debug/SocketTest.exe
  967   ??  U      0:01.32 /work/INSTALL/bin/mono --debug /Developer/MonoTouch/Source/test/SocketTest/SocketTest/bin/Debug/SocketTest.exe
 3844 s000  ?E+    0:00.00 (mono)
 6406 s001  ?E     0:00.00 (mono)
12023 s001  ?E     0:00.00 (mono)
14577 s001  ?E+    0:00.00 (mono)
14703 s003  R+     0:00.00 grep mono
=====

I can't even kill the '?E' processes as root.

=====
[root@nathan:~]# kill -9 3844
[root@nathan:~]# ps 3844
  PID   TT  STAT      TIME COMMAND
 3844 s000  ?E+    0:00.00 (mono)
[root@nathan:~]# kill -9 3844
[root@nathan:~]# ps 3844
  PID   TT  STAT      TIME COMMAND
 3844 s000  ?E+    0:00.00 (mono)
======

Now I'm completely confused.  It pretty much looks like the process gets into some uninterruptible, blocking wait in close().

It's also worth nothing that the connection stays in the 'CLOSING' state:

====
[martin@nathan:SocketTest]$ netstat -na | grep 9999
tcp4       0      0  127.0.0.1.61141        127.0.0.1.9999         CLOSING    
tcp4       0      0  127.0.0.1.57177        127.0.0.1.9999         ESTABLISHED
tcp4       0      0  *.49999                *.*                    ESTABLISHED
tcp4       0      0  *.9999                 *.*                    LISTEN     
====

I actually let this running for about half an hour, so this is way past any potential network timeouts.
Comment 11 Alex 2012-09-12 02:47:32 UTC
Hi again,

ok, thanks for the effort. This bug is hitting our application really hard, I think that even the Mono's http implementation might not be safe, if it's built on top of the tcp socket layer. All workarounds I described earlier seem to be partial solutions only, as our app still suffers from the deadlocks during socket closure.

Still no workaround or a solution for this?

Thank.

Alex
Comment 16 Sebastien Pouliot 2012-09-28 15:19:22 UTC
MonoTouch 6.0.2 is available with queue disabled (beta channel at the moment). Please try it and re-open the bug if the issue is not completely fixed. Thanks