Bug 58299 - Race conditions of hazard-pointer.c
Summary: Race conditions of hazard-pointer.c
Status: RESOLVED ANSWERED
Alias: None
Product: Runtime
Classification: Mono
Component: General
Version: unspecified
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Bugzilla
URL:
Depends on:
Blocks:
 
Reported: 2017-07-21 17:11 UTC by Armin
Modified: 2017-08-07 21:45 UTC
CC List: 4 users

Tags:
Is this bug a regression?: ---
Last known good build:


Attachments
Stack traces (98.66 KB, text/plain)
2017-07-21 17:11 UTC, Armin



Description Armin 2017-07-21 17:11:34 UTC
Created attachment 23722 [details]
Stack traces

The data races that I would like to discuss next are connected to hazard-pointer.c. I believe that everything in this report is harmless. However, before I blacklist anything for Clang's ThreadSanitizer, I'd appreciate a discussion and your thoughts.

I identified 9 more or less unique race reports, subdivided them into 4 groups, and attached the corresponding stack traces.

Most of these races occur in slightly different setups (reached through different callers or threads). However, I believe the exact context is mostly irrelevant, since the race conditions are tied to how hazard pointers work in general; this is why I grouped similar stack traces together. If necessary, I can provide more (slightly different) traces.

The commit I used to generate these race reports is 792a0169b9095d814f21d7603f3e4c0c0058d395; in the excerpts below, the lines marked with "!" are the locations of the reported races.

If green-lit, I would like to globally disable data race reports for `is_pointer_hazardous ()` and `mono_get_hazardous_pointer ()` using the recently approved `MONO_NO_SANITIZE_THREAD` macro (https://github.com/mono/mono/pull/5191):

--------------------------------------------------------------------------------

Race type #1 (133:20, 159:16)
'''''''''''''''''''''''''''''

This race type seems safe to me: the write is protected by `small_id_mutex`, and a read-side lock should not be necessary (see the sketch after the excerpts).

int mono_thread_small_id_alloc (void) {
  // ...
  mono_os_mutex_lock (&small_id_mutex);
  // ...
! highest_small_id = id;
  // ...
  mono_os_mutex_unlock (&small_id_mutex);
  // ...
}

static gboolean is_pointer_hazardous (gpointer p) {
  // ...
! int highest = highest_small_id;
  // ...
}
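
For reference, the same pattern in isolation (a minimal sketch with hypothetical names, not Mono code): the writer raises a shared high-water mark under a mutex, while readers load it without locking. ThreadSanitizer flags the unlocked load, but the worst a reader can observe is a stale, smaller bound.

#include <pthread.h>

/* Hypothetical stand-ins for small_id_mutex / highest_small_id. */
static pthread_mutex_t watermark_mutex = PTHREAD_MUTEX_INITIALIZER;
static volatile int watermark = -1;

/* Writer: updates are serialized by the mutex and the value only
   ever increases. */
void watermark_raise (int id) {
  pthread_mutex_lock (&watermark_mutex);
  if (id > watermark)
    watermark = id;
  pthread_mutex_unlock (&watermark_mutex);
}

/* Reader: deliberately lock-free; this load is the reported race. */
int watermark_read (void) {
  return watermark;
}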

--------------------------------------------------------------------------------

Race type #2 (165:8, 205:3)
'''''''''''''''''''''''''''

Another race tied to `is_pointer_hazardous ()` is the read of `hazard_table [i].hazard_pointers [j]` in combination with `mono_get_hazardous_pointer ()` / `mono_hazard_pointer_set ()`; a standalone sketch of the scanning side follows the excerpts:

static gboolean is_pointer_hazardous (gpointer p) {
  // ...
! if (hazard_table [i].hazard_pointers [j] == p)
  // ...
}

#define mono_hazard_pointer_set(hp, i, v) \
  do { g_assert ((i) >= 0 && (i) < HAZARD_POINTER_COUNT); \
    (hp)->hazard_pointers [(i)] = (v); \
    mono_memory_write_barrier (); \
  } while (0)

gpointer mono_get_hazardous_pointer (gpointer volatile *pp, MonoThreadHazardPointers *hp, int hazard_index) {
  // ...
! mono_hazard_pointer_set (hp, hazard_index, p);
  // ...
}
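
To make the reader/writer pairing concrete, here is a condensed, self-contained sketch of the scanning side (the table size, slot count, and struct layout are assumptions for brevity; glib types assumed): the loads from hazard_table race with mono_hazard_pointer_set () by design, and safety comes from the publish-then-validate protocol rather than from data-race freedom.

#include <glib.h>

#define HAZARD_POINTER_COUNT 3    /* assumed per-thread slot count */
#define HAZARD_TABLE_SIZE    1024 /* assumed fixed size for brevity */

typedef struct {
  gpointer hazard_pointers [HAZARD_POINTER_COUNT];
} ThreadHazardPointersSketch;

static ThreadHazardPointersSketch hazard_table [HAZARD_TABLE_SIZE];

/* Scanner: p may only be freed if no thread currently advertises it
   in any slot. The unsynchronized loads below are the reported race;
   they are intentional. */
static gboolean is_pointer_hazardous_sketch (gpointer p, int highest) {
  int i, j;
  for (i = 0; i <= highest; ++i)
    for (j = 0; j < HAZARD_POINTER_COUNT; ++j)
      if (hazard_table [i].hazard_pointers [j] == p)
        return TRUE;
  return FALSE;
}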

--------------------------------------------------------------------------------

Race type #3 (165:8)
''''''''''''''''''''

`is_pointer_hazardous ()` also races with `mono_lls_find ()` in mono-linked-list-set.c:

static gboolean is_pointer_hazardous (gpointer p) {
  // ...
! if (hazard_table [i].hazard_pointers [j] == p)
  // ...
}

gboolean mono_lls_find (MonoLinkedListSet *list, MonoThreadHazardPointers *hp, uintptr_t key) {
  // ...
! mono_hazard_pointer_set (hp, 2, prev);
  // ...
}

--------------------------------------------------------------------------------

Race type #4 (199:7)
''''''''''''''''''''

`mono_get_hazardous_pointer ()` is further detected in read-before-write and read-after-write races with several functions in jit-info.c (127:3, 604:14, 624:30, 633:23) as well as in combination with mono-conc-hashtable.c:107:20. A sketch of the elided retry loop follows the excerpt.

gpointer mono_get_hazardous_pointer (gpointer volatile *pp, MonoThreadHazardPointers *hp, int hazard_index) {
  gpointer p;
  // ...
! p = *pp;
  // ...
}
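
My reading of the elided body, as a simplified sketch (the real function also has to handle a NULL hp, which I omit here): read the shared pointer, publish it via mono_hazard_pointer_set (), then re-read to validate that the published value was still live. Both reads of *pp are unsynchronized, which is exactly what ThreadSanitizer reports:

gpointer get_hazardous_sketch (gpointer volatile *pp, MonoThreadHazardPointers *hp, int hazard_index) {
  gpointer p;
  for (;;) {
    p = *pp;                                        /* racy read */
    mono_hazard_pointer_set (hp, hazard_index, p);  /* publish + write barrier */
    if (p == *pp)                                   /* still the live value? */
      return p;
    /* *pp changed in between; the published value may be stale, so retry. */
  }
}
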
Comment 1 Rodrigo Kumpera 2017-08-07 21:45:36 UTC
Race 1 is fine
	This race is only fine because highest_small_id grows monotonically.
	It's ok that there's no sync between reading it and the actual hazard pointers.

Races 2 and 3 are fine
	Yeah, hazard pointers are, by design, racy between readers (is_pointer_hazardous) and writers (mono_lls_find and others).


Race 4 is fine
	This is another racy aspect of how hazard pointers are used: the reader needs to read, make hazardous, then validate.

In Race 4, we read from the conc-hashtable, write to the hazard table, then read from the conc-hashtable again to ensure that what's stored in the HP table is the current live value.

We do it this way because the way hazard pointers (HP) work is that you need to store the pointer in the HP table before using it. So the validate step is to ensure we stored a live value.

This works because the writer side must do a full membar between the store to the hazardous area and reading the HP table.
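
Sketched in pseudo-C (retire/defer_free are hypothetical helpers, not the actual reclamation path in hazard-pointer.c; in Mono the unlink is typically an atomic exchange rather than a plain store):

static void defer_free (gpointer p); /* hypothetical: queue p and retry later */

void retire_sketch (gpointer volatile *pp, gpointer p) {
  *pp = NULL;                /* unlink: p is no longer reachable */
  mono_memory_barrier ();    /* full membar before scanning the HP table */
  if (!is_pointer_hazardous (p))
    g_free (p);              /* no reader advertises p: safe to free */
  else
    defer_free (p);          /* still hazardous: free it later */
}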


This code could, in theory, livelock if mutation keeps it failing to validate. We never hit that in practice and we're OK living with it.