Bug 1544386

Summary: freeipa-server removing i686 arch support in F28
Product: Fedora
Component: freeipa
Reporter: Christian Heimes <cheimes>
Assignee: IPA Maintainers <ipa-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Version: 28
Hardware: i686
OS: Linux
CC: abokovoy, codonell, dgilmore, frenaud, fweimer, ipa-maint, jakub, jcholast, jhrozek, mharmsen, mreynolds, pbrobinson, pvoborni, rbarlow, rcritten, samuel-rhbugs, sgallagh, ssorce, triegel, wibrown
Fixed In Version: freeipa-4.6.90.pre1-1.fc28
Last Closed: 2018-03-18 00:49:04 UTC
Type: Bug

Description Christian Heimes 2018-02-12 10:56:49 UTC
389-ds-base 1.4 has removed support for the i686 architecture and other 32-bit platforms (i386, s390, ppc). The server component of freeIPA depends on 389-ds-base. Therefore freeIPA will also remove 32-bit support for the freeipa-server package and related components.

Starting with Fedora 28, the packages

* freeipa-server
* python2-ipaserver
* python3-ipaserver
* freeipa-server-common
* freeipa-server-dns
* freeipa-server-trust-ad

are only available and supported on 64bit platforms.

freeipa-client is not affected and enrolment of 32bit clients is still supported.

upstream ticket: https://pagure.io/freeipa/issue/7400
389-ds-ticket: https://bugzilla.redhat.com/show_bug.cgi?id=1530832
389-ds-base package commit that dropped i686: https://src.fedoraproject.org/rpms/389-ds-base/c/482fad4fbab0140a163ce08ddd84e5b4c7a0e47f?branch=master

Comment 1 Dennis Gilmore 2018-02-14 23:04:50 UTC
This really requires a system-wide Fedora change. I accidentally stumbled on it, and I happen to be negatively affected, as I run freeipa on 32-bit ARM, an arch not mentioned in this bug.

Comment 2 Christian Heimes 2018-02-15 07:42:28 UTC
There isn't much freeIPA can do about this unless the 389-ds-base developers change their mind. Sorry :/

The subject of this BZ is a bit too broad. 32bit ARMv7 is still supported by 389-ds-base. The current rawhide build excludes the architectures i386, i686, s390, and ppc. My PR https://github.com/freeipa/freeipa/pull/1564/files does not exclude ARM either.

Comment 3 mreynolds 2018-02-15 13:59:58 UTC
(In reply to Christian Heimes from comment #2)
> There isn't much freeIPA can do about this unles 389-ds-base developers
> change their mind. Sorry :/
> 
> The subject of this BZ is a bit too broad. 32bit ARMv7 is still supported by
> 389-ds-base. The current rawhide build excludes the architectures i386,
> i686, s390, and ppc. My PR
> https://github.com/freeipa/freeipa/pull/1564/files does not exclude ARM
> either.

This was a mistake on my part, as I didn't realize ARM was 32-bit, so ARM will also be excluded in F28. The original bug description was correct.

Here is a little more detail about why we are moving off 32-bit arches (quoting William Brown, who did the majority of the atomics work in DS):

======================================================================
In libc there are a number of __atomic_* types that promise that "They perform atomic manipulations of the data, falling back to a mutex in the case that a native cpu atomic can not be used". On 32-bit platforms this fallback does *not* occur correctly, meaning that only the lower or the upper half is atomically updated. This is due to variable alignment not being correctly emitted by gcc, meaning we would then have to align every piece of data that uses this.

This was discovered due to expansion of our testing capability of the C
code base - a server stress test showed that counters could become
wildly inaccurate.

Given we use atomics for reference counting in a number of objects, as well as for monitors, this can cause objects to leak, be freed early, or report incorrect data.
======================================================================


Basically we saw crashes, invalid data, and unexpected behavior on 32-bit platforms. Rewriting all of the atomics work for 32-bit platforms is a massive change, for what we feel is a very small consumer base. Other factors, like RHEL not shipping 32-bit packages, also led us to this decision.

I was also told freeIPA had already stopped shipping 32-bit packages (before we decided to do it), but perhaps that was just downstream, and not upstream.
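
For readers following along, here is a minimal sketch (illustrative names, not 389-ds-base code) of the kind of 64-bit atomic reference counting the quoted explanation describes:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical reference counter, analogous to the 64-bit counters
 * described above. On 64-bit targets these builtins compile to a single
 * lock-prefixed instruction; the discussion in this bug is about whether
 * the same guarantee holds for a 64-bit value on a 32-bit target. */
static uint64_t refcount;

static void ref_acquire(void)
{
    __atomic_fetch_add(&refcount, 1, __ATOMIC_SEQ_CST);
}

static uint64_t ref_release(void)
{
    /* Returns the new value; 0 would mean the object can be freed. */
    return __atomic_sub_fetch(&refcount, 1, __ATOMIC_SEQ_CST);
}

int main(void)
{
    ref_acquire();
    printf("count after release: %llu\n",
           (unsigned long long)ref_release());
    return 0;
}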

Comment 4 Dennis Gilmore 2018-02-15 14:21:58 UTC
Can gcc and glibc be fixed? Regardless, this is something that needs to be well communicated to the wider Fedora ecosystem. I have escalated to FESCo (https://pagure.io/fesco/issue/1845). I am open to suggestions on how it can be communicated without involving the Fedora Change process.

Comment 5 Rob Crittenden 2018-02-15 14:23:53 UTC
IPA is dropping support because 389-ds has dropped support.

Comment 6 mreynolds 2018-02-15 14:33:18 UTC
(In reply to Dennis Gilmore from comment #4)
> can gcc, and glibc be fixed?

I'm not entirely sure whether it's a bug in gcc or simply a limitation. William did this investigation and he is currently on sick leave. I would think that if it were an obvious bug he would have filed a ticket with gcc/glibc, so I'm inclined to say it's something more complicated than that.

I expect (hope) to hear from William next week, and I will get more details from him at that time.

Comment 7 Carlos O'Donell 2018-02-15 18:10:36 UTC
(In reply to mreynolds from comment #6)
> (In reply to Dennis Gilmore from comment #4)
> > can gcc, and glibc be fixed?
> 
> I'm not entirely sure it's a bug in gcc, or if it's simply a limitation. 

Nobody runs on i486s anymore; the i686 runtime is effectively used only on modern 32-bit systems, all of which support efficient atomics or kernel-assisted emulation.

For example, glibc will not run on a *real* i386; we minimally require __sync_val_compare_and_swap to build the higher-level parallel and concurrent (P&C) primitives that we use.

I know of no known issues with the __sync_* or __atomic_* primitives on i686, s390, ppc, or arm; we use all of them in glibc to implement robust P&C support.

While this is not my decision to make, I agree with Dennis: this should go through a system-wide change request before updating the upstream version. There needs to be discussion around what can be done and whether resources from Fedora might help fix the existing support. Dropping 32-bit support in key packages is going to affect the viability of the 32-bit distributions, which leads to other cascading questions :-)

Comment 8 Christian Heimes 2018-02-15 18:39:08 UTC
Mark,

I was going through William's mails and noticed that only i686 is affected.

~~~~~~~~~~~~~~~~
William wrote in reply to this question:
> Is it only about i686 or problem is in all 32 bit architectures?
> Fedora also has "armv7hl"

i686 only.
~~~~~~~~~~~~~~~~

We can lift the restriction and only exclude the ix86 architectures. 32-bit ARMv7 should be fine.

Comment 9 Carlos O'Donell 2018-02-15 19:25:21 UTC
(In reply to Christian Heimes from comment #8)
> Mark,
> 
> I was going through William's mails and noticed that only i686 is affected.
> 
> ~~~~~~~~~~~~~~~~
> William wrote in reply to this question:
> > Is it only about i686 or problem is in all 32 bit architectures?
> > Fedora also has "armv7hl"
> 
> i686 only.
> ~~~~~~~~~~~~~~~~
> 
> We can lift the restriction and only exclude ix86 architecture. 32bit ARMv7
> should be fine.

I'm surprised it's an i686 issue.

I just double checked 32-bit i686, s390, and ppc.

On i686 the compiler effectively generates 'lock cmpxchg8b' to work with the 64-bit values atomically, no problems there.

On s390 the compiler effectively generates 'cds' (compare double and swap) to work with the 64-bit values atomically, no problems there.

On ppc the compiler generates out-of-line calls to __atomic_compare_exchange_8 and the like and call into libatomic (part of gcc), so you have to support linking that in for 32-bit ppc.

There is a lot of use of __atomic_* operations in the code, along with the potential to use NSPR's PR_Atomic* operations also, so one has to know exactly which was selected at build time.

However, I note that slapi_counter.c has at least some instances of relaxed memory ordering which will lose counts, so it has to be a rough estimate of the count, but if you fall back to the pthread operations, the counts will be accurate. No comments seem to mention this in the API for slapi_counter_add, slapi_counter_subtract, slapi_counter_set_value, or slapi_counter_get_value.

Likewise lfds711_porting_abstraction_layer_compiler.h uses relaxed memory ordering which could lead to multiple threads succeeding at LFDS711_PAL_ATOMIC_CAS (depending on the version used), but there the user in lfds711_freelist_push seems to couple this with load and store barriers to create synchronization points with other threads to be able to see their stores.  So this looks more correct than the slapi_counter.c functions.

However, vattr_map_entry_free certainly looks like it could have double-free calls to slapi_ch_free by using relaxed MO, and this looks like it could result in double calls to free() in glibc, which is undefined behaviour. This would need more investigation.

Overall there are a lot of suspicious uses of relaxed MO in the code which might cause problems.

I'm curious to hear what problems arose on i686 and what their root cause was.
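
For context, here is a sketch (illustrative, not taken from slapi_counter.c) of what a 64-bit atomic read-modify-write looks like at the source level — the kind of operation that, per the analysis above, is lowered to 'lock cmpxchg8b' on i686:

#include <stdint.h>
#include <stdio.h>

/* Atomically add 'delta' to *counter with a CAS loop. Per the analysis
 * above, GCC lowers the 8-byte compare-and-swap to 'lock cmpxchg8b' on
 * i686; on 64-bit targets it is an ordinary lock cmpxchg. */
uint64_t counter_add(uint64_t *counter, uint64_t delta)
{
    uint64_t old = __atomic_load_n(counter, __ATOMIC_RELAXED);
    uint64_t desired;
    do {
        desired = old + delta;
        /* On failure, 'old' is refreshed with the current value. */
    } while (!__atomic_compare_exchange_n(counter, &old, desired,
                                          0 /* strong */,
                                          __ATOMIC_SEQ_CST,
                                          __ATOMIC_SEQ_CST));
    return desired;
}

int main(void)
{
    uint64_t c = 0;
    printf("%llu\n", (unsigned long long)counter_add(&c, 5));
    return 0;
}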

Comment 10 Alexander Bokovoy 2018-02-15 20:27:19 UTC
Carlos, thank you for the analysis. I think this discussion should actually happen on bug 1530832, which William filed a while ago, following the existing Fedora process for excluding the 389-ds-base build from a specific architecture.

Comment 11 Alexander Bokovoy 2018-02-15 20:29:28 UTC
I copied comment #9 to bug 1530832 and set NEEDINFO on William there.

Comment 12 Christian Heimes 2018-02-16 15:01:15 UTC
We have decided to temporarily drop the freeipa-server package from i686 arch.

Fixed upstream
master:
https://pagure.io/freeipa/c/631d3152fe578916e6121a1e731505d502fcea84

Comment 13 Carlos O'Donell 2018-02-16 23:34:52 UTC
I'm worried the failing i686 case is an early warning of more serious problems.

To give some further examples of problematic code:

In ldap/servers/slapd/backend.c:

void
be_addsuffix(Slapi_Backend *be, const Slapi_DN *suffix)
{
    if (be->be_state != BE_STATE_DELETED) {
        struct suffixlist *new_suffix, *list;

        new_suffix = (struct suffixlist *)slapi_ch_malloc(sizeof(struct suffixlist));
        new_suffix->be_suffix = slapi_sdn_dup(suffix);
        new_suffix->next = NULL;

        PR_Lock(be->be_suffixlock);

        if (be->be_suffixlist == NULL) {
            be->be_suffixlist = new_suffix;
        } else {
            list = be->be_suffixlist;
            while (list->next != NULL) {
                list = list->next;
            }
            list->next = new_suffix;
        }
        slapi_counter_increment(be->be_suffixcounter);

        PR_Unlock(be->be_suffixlock);
    }
}

We have a PR_Lock/PR_Unlock around the modification of be->be_suffixlist.

Within the critical section we have 'slapi_counter_increment(be->be_suffixcounter);', which, if 64-bit atomics are present, is carried out with relaxed MO.

Then we have:

int
slapi_be_issuffix(const Slapi_Backend *be, const Slapi_DN *suffix)
{
    struct suffixlist *list;
    int r = 0;
    /* this backend is no longer valid */
    if (be && be->be_state != BE_STATE_DELETED) {
        int i = 0, count;

        count = slapi_counter_get_value(be->be_suffixcounter);
        list = be->be_suffixlist;
        while (list && i < count) {
            if (slapi_sdn_compare(list->be_suffix, suffix) == 0) {
                r = 1;
                break;
            }
            i++;
            list = list->next;
        }
    }
    return r;
}

Which uses a relaxed MO load to read be_suffixcounter to iterate a list.

The use of relaxed MO means there is no synchronizes-with in these operations, across either function for the counter.

One thread may see the update of the counter *before* the list itself is updated.

Therefore slapi_be_issuffix may walk off the end of the list via list->next and crash.

Again it seems like the failing 32-bit code is just an early warning that the atomic usage in this code is not correct.

However, it is hard to know if the atomic code is correct or not because no documentation of intent is provided for the concurrent algorithm being used to ensure correctness.
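
To make the missing synchronizes-with edge concrete, here is a minimal sketch (illustrative names and layout, not the actual 389-ds-base structures) of pairing a release-MO counter increment on the writer side with acquire-MO loads on the reader side, so that a reader that observes the new count also observes the node published before it:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next;
    int key;
};

/* Illustrative shared state, not the actual be_suffixlist layout. */
static struct node *_Atomic list_head;
static atomic_uint_fast64_t list_count;

/* Writer (still serialized by its own lock, as in be_addsuffix): link the
 * node, publish it, then bump the counter with release MO so a reader that
 * sees the new count also sees the node added before it. */
void list_push(struct node *n)
{
    n->next = atomic_load_explicit(&list_head, memory_order_relaxed);
    atomic_store_explicit(&list_head, n, memory_order_release);
    atomic_fetch_add_explicit(&list_count, 1, memory_order_release);
}

/* Reader: the acquire loads pair with the writer's releases, so walking up
 * to 'count' nodes cannot observe a partially published node. */
bool list_contains(int key)
{
    uint64_t count = atomic_load_explicit(&list_count, memory_order_acquire);
    struct node *cur = atomic_load_explicit(&list_head, memory_order_acquire);
    for (uint64_t i = 0; cur != NULL && i < count; i++, cur = cur->next) {
        if (cur->key == key)
            return true;
    }
    return false;
}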

If you need any help auditing this code, please do not hesitate to reach out to the toolchain team. We have a lot of experience writing parallel and concurrent code, and have even given workshops on modern concurrency topics.

Comment 14 Carlos O'Donell 2018-02-16 23:37:55 UTC
Filed "Problematic uses of atomic operations with relaxed memory ordering"
https://bugzilla.redhat.com/show_bug.cgi?id=1546399, against 380-ds-base to discuss the issues there.

Comment 15 wibrown@redhat.com 2018-02-19 00:26:54 UTC
Hi there,

I've been working extensively on the thread safety of DS for a year or so now.

There are large parts of the code that rely on "luck" to not crash. It's not a good state.

I have done a lot of work on this safety and on testing to help find and remove these issues. We haven't solved them all yet, though!

One of the thread-SAFE parts of the code, in the connection subsystem (nunc-stans), is actually the source of the key test that has highlighted the issue.

nunc-stans relies on atomics to operate correctly, and this is showing the core issue.

The issue is that on 32bit i686, atomic 64bit operations are not safe.

I believe the cause is alignment of the value, where i686 is attempting to use an SSE instruction to perform an atomic cmpxchg, but because the input 64-bit value is not aligned it "sometimes fails" to correctly alter the value, sometimes manipulating bits incorrectly or, in some cases, triggering operations multiple times. There are some tests where atomic counters double-add or end up wildly wrong.

Now, because we use 64-bit atomics for ref counts in other parts of the code, these are likely affected the same way, but we DON'T have testing for them, so we can't prove the issues as well.

Finally, what's really amazing here is that despite these issues, not one bug has ever been filed about i686 DS misbehaving.

So I suspect

* We need gcc/libc to correctly align values on i686, but I also don't really have the time or motivation to raise this issue or pursue it
* We have no users of i686
* We could spend ages of my time to test and align everything.

Given that we want to use uint64_t more because it's provably faster for operations on modern CPUs, I also don't really care about i686.

We are heavily under-resourced as it is. There is only one "william". I'm not going to waste my time fixing a platform that sees limited use and is provably slower, just to satisfy some desire that we "support everything".

Instead, I'm going to:

* Continue to resolve and fix thread safety issues in 64bit DS only
* Deprecate i686 DS support (and thus IPA)
* Try to help educate my team and others about cpu memory safety and models
* Spend my time more productively on features that matter to our largest user base (64bit)

Hope that helps explain my position and this issue.

Comment 16 wibrown@redhat.com 2018-02-19 01:12:23 UTC
(In reply to Christian Heimes from comment #12)
> We have decided to temporarily drop the freeipa-server package from i686
> arch.
> 
> Fixed upstream
> master:
> https://pagure.io/freeipa/c/631d3152fe578916e6121a1e731505d502fcea84

PS: it's permanent, not temporary.

Comment 17 Carlos O'Donell 2018-02-19 04:40:05 UTC
(In reply to wibrown from comment #15)
> The issue is that on 32bit i686, atomic 64bit operations are not safe.

Please understand that my position right now is that I'm worried you've found a toolchain bug.

I would appreciate any help you can provide to rule this out.

64-bit operations should be safe on IA32, and the compiler will align the types appropriately.

In ns_thrpool.c all of the types which are manipulated appear to be 32-bit types.

  72     int32_t shutdown;
  73     int32_t shutdown_event_loop;

Are there 64-bit types which are manipulated indirectly via atomics?

> I believe the cause is alignment of the value, where i686 is attempting to
> use an SSE instruction to perform an atomic cmpxchg, but because the input
> 64bit value is not alligned it "sometimes fails" to correctly alter the
> value, sometimes manipulating bits incorrectly, or in some cases, triggering
> operations multiple times. There are some tests where atomic counters double
> add or end up wildly wrong.

The generated code should be using 'lock cmpxchg8b' in all cases, which uses normal registers and is IA32-compatible for any system newer than the Pentium.

Can you point me at the code that implements the atomic counter?

If the counters we are talking about are slapi_counter.c, then you will have problems on more than just 32-bit.

> So I suspect
> 
> * We need gcc/libc to correctly align values on i686, but I also don't
> really have the time or motivation to raise this issue or pursue it

You should not have to do this by hand.

> * We have no users of i686
> * We could spend ages of my time to test and align everything.

This is your call.

I'm just trying to determine if there is a core toolchain issue.

> Given that we want to use uint64_t more because it's provably faster on
> modern cpus for operations, I also don't really care about i686.

Certainly, and it should work for i686 also.

> * Continue to resolve and fix thread safety issues in 64bit DS only
> * Deprecate i686 DS support (and thus IPA)
> * Try to help educate my team and others about cpu memory safety and models
> * Spend my time more productively on features that matter to our largest
> user base (64bit)

Sounds like a reasonable plan to me, again I'm looking for a core toolchain issue.

Comment 18 wibrown@redhat.com 2018-02-19 05:39:19 UTC
(In reply to Carlos O'Donell from comment #17)
> (In reply to wibrown from comment #15)
> > The issue is that on 32bit i686, atomic 64bit operations are not safe.
> 
> Please understand that my position right now is that I'm worried you've
> found a toolchain bug.
> 
> I would appreciate any help you can provide to rule this out.
> 
> 64-bit operations should be safe on IA32, and the compiler will align the
> types appropriately.
> 
> In ns_thrpool.c all of the types which are manipulated appear to be 32-bit
> types.
> 
>   72     int32_t shutdown;
>   73     int32_t shutdown_event_loop;
> 
> Are there 64-bit types which are manipulated indirectly via atomics?

https://pagure.io/389-ds-base/blob/master/f/src/nunc-stans/test/test_nuncstans_stress_core.c#_56

This is the affected code. The following int64_t types are used to track success and failure counts. They are modified:

https://pagure.io/389-ds-base/blob/master/f/src/nunc-stans/test/test_nuncstans_stress_core.c#_132

They use SEQ_CST even.

In some test executions these values are *double* their expected amount, or wrong.

This test works 100% on ALL other arches, with no modification, failing repeatedly only on i686.

> 
> > I believe the cause is alignment of the value, where i686 is attempting to
> > use an SSE instruction to perform an atomic cmpxchg, but because the input
> > 64bit value is not alligned it "sometimes fails" to correctly alter the
> > value, sometimes manipulating bits incorrectly, or in some cases, triggering
> > operations multiple times. There are some tests where atomic counters double
> > add or end up wildly wrong.
> 
> The generated code should be using 'lock cmpxchg8b' in all cases which uses
> normal registers and is IA32 compatible for any systems newer than Pentium.
> 
> Can you point me at the code that implements the atomic counter?
> 
> If the counters we are talking about are slapi_counter.c, then you will have
> problems on more than just 32-bit.
> 
> > So I suspect
> > 
> > * We need gcc/libc to correctly align values on i686, but I also don't
> > really have the time or motivation to raise this issue or pursue it
> 
> You should not have to do this by hand.
> 
> > * We have no users of i686
> > * We could spend ages of my time to test and align everything.
> 
> This is your call.
> 
> I'm just trying to determine if there is a core toolchain issue.

Yep, thanks. I've already pretty well made the decision not to pursue this issue, as our team is heavily under-resourced.

> 
> > Given that we want to use uint64_t more because it's provably faster on
> > modern cpus for operations, I also don't really care about i686.
> 
> Certainly, and it should work for i686 also.
> 
> > * Continue to resolve and fix thread safety issues in 64bit DS only
> > * Deprecate i686 DS support (and thus IPA)
> > * Try to help educate my team and others about cpu memory safety and models
> > * Spend my time more productively on features that matter to our largest
> > user base (64bit)
> 
> Sounds like a reasonable plan to me, again I'm looking for a core toolchain
> issue.

Fair enough. See above for the affected code and calls.
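
To make the reported failure mode concrete, here is a hedged sketch of the kind of check the stress test described above performs (illustrative constants and names, not the actual test_nuncstans_stress_core.c code): several threads each increment a 64-bit counter with SEQ_CST atomics, and the final total must be exactly NTHREADS * ITERS; a doubled or otherwise wrong total is what was reported on i686.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    100000

static int64_t success_count;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        __atomic_add_fetch(&success_count, 1, __ATOMIC_SEQ_CST);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    /* On a correct platform the two numbers must match exactly. */
    printf("expected %lld, got %lld\n",
           (long long)NTHREADS * ITERS,
           (long long)__atomic_load_n(&success_count, __ATOMIC_SEQ_CST));
    return 0;
}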

Comment 19 Torvald Riegel 2018-02-20 00:38:24 UTC
(In reply to wibrown from comment #18)
> (In reply to Carlos O'Donell from comment #17)
> > (In reply to wibrown from comment #15)
> > > The issue is that on 32bit i686, atomic 64bit operations are not safe.
> > 
> > Please understand that my position right now is that I'm worried you've
> > found a toolchain bug.
> > 
> > I would appreciate any help you can provide to rule this out.
> > 
> > 64-bit operations should be safe on IA32, and the compiler will align the
> > types appropriately.
> > 
> > In ns_thrpool.c all of the types which are manipulated appear to be 32-bit
> > types.
> > 
> >   72     int32_t shutdown;
> >   73     int32_t shutdown_event_loop;
> > 
> > Are there 64-bit types which are manipulated indirectly via atomics?

In the thread pool implementation, the __atomic builtins are not used for those 32b fields, but it falls back to PR_Atomic if ATOMIC_64BIT_OPERATIONS is not defined.

> 
> https://pagure.io/389-ds-base/blob/master/f/src/nunc-stans/test/
> test_nuncstans_stress_core.c#_56
> 
> This is the affected code. The following int64_t types are used to track
> success and failure counts. They are modified:
> 
> https://pagure.io/389-ds-base/blob/master/f/src/nunc-stans/test/
> test_nuncstans_stress_core.c#_132
> 
> They use SEQ_CST even.
> 
> In some test executions these values are *double* their expected amount, or
> wrong.
> 
> This test works 100% on ALL other arches, with no modification, only failing
> repeatedly on i686

This doesn't mean that the implementation of the atomics is wrong.  It could just be a timing problem, or an out-of-bounds access that just triggers a bug in a 32b build.

The atomic builtins have fallbacks for data types for which atomic ops are not supported natively by the particular hardware. For these tests, you should simply call the builtins and not fall back to the other implementation of atomics. The fallback uses an internal array of locks, which should be fine for simple atomic counters that aren't performance-critical. I think it's also very unlikely that the atomic builtins are the root cause here.

Looking at nunc-stans/ns/ns_thrpool.c a bit more:

Line 242: The mutex is destroyed immediately after the critical section.  That doesn't make sense, because the destruction means that there must not be another thread that could try to enter the critical section concurrently, and so the critical section would not be needed in the first place.  There are other places with the same issue.

Line 310: The condvar is signaled, the mutex unlocked, and then the function is called that destroys the mutex and condvar (Line 242 etc.).  There's no guarantee the waiting threads get to acquire the lock before it's destroyed (and same for the condvar).

Line 343-345: condvars can have spurious wake-ups.  They should always be used with a variable that represents the condition, and be in a loop (there is a loop in the caller, but the way the condvar is used this can be more like a yield, not a true producer/consumer relationship).  One will not get the normal ordering after the signal with a spurious wake-up, for example.

Line 441-442: This doesn't make sense.  The return is not semantically different from just falling through to the return at the end of the function.

Line 903 (and others for the other job types): Looks like an unnecessary critical section, given that it seems like this is a new event and not yet accessible to other threads.  That's not a problem in itself, but it's a red flag indicating that ownership of jobs and visibility to other threads aren't clear.

Line 1079: loading the data field after the critical section seems wrong (it might be the critical section is not needed).  The data field is set within the critical section on Line 1093, so this is inconsistent.  Similar issues exist in the accessors of the other fields.

Line 1580 (and others): No documentation why release MO is used.  A shutdown flag wouldn't need it, if all you want to communicate is the need to shut down.  There might be a misunderstanding what the different MOs do.

Overall, the lack of high-level documentation of how the concurrent code is supposed to work, the bugs I've seen just on the first read of the code, and the nature of some of the existing comments in the code are warning signs to me, indicating that this may not work as intended.  (For example, if there's behavior that shouldn't happen but has been observed, this should always be considered an actual bug.)

My wild guess would be that there's a bug in the thread pool / work queue implementation, and that the counters in the test are just fine (which can be checked by using just the __atomic builtins and their fallbacks).  For a test, it should also be sufficient to make them 32b, thus reducing one further unlikely but potential cause.
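
As a reference for the condvar point above (Line 343-345), here is a minimal sketch (illustrative names, not the nunc-stans API) of the predicate-in-a-loop pattern that tolerates spurious wake-ups:

#include <pthread.h>
#include <stdbool.h>

/* The wait is always guarded by a predicate and re-checked in a loop, so a
 * spurious wake-up simply goes back to waiting. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_ready = PTHREAD_COND_INITIALIZER;
static bool have_work = false;

void producer_signal(void)
{
    pthread_mutex_lock(&lock);
    have_work = true;                 /* set the condition... */
    pthread_cond_signal(&work_ready); /* ...then signal under the lock */
    pthread_mutex_unlock(&lock);
}

void consumer_wait(void)
{
    pthread_mutex_lock(&lock);
    while (!have_work)                /* loop handles spurious wake-ups */
        pthread_cond_wait(&work_ready, &lock);
    have_work = false;
    pthread_mutex_unlock(&lock);
}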

Comment 20 wibrown@redhat.com 2018-02-20 01:56:45 UTC
(In reply to Torvald Riegel from comment #19)
> (In reply to wibrown from comment #18)
> > (In reply to Carlos O'Donell from comment #17)
> > > (In reply to wibrown from comment #15)
> > > > The issue is that on 32bit i686, atomic 64bit operations are not safe.
> > > 
> > > Please understand that my position right now is that I'm worried you've
> > > found a toolchain bug.
> > > 
> > > I would appreciate any help you can provide to rule this out.
> > > 
> > > 64-bit operations should be safe on IA32, and the compiler will align the
> > > types appropriately.
> > > 
> > > In ns_thrpool.c all of the types which are manipulated appear to be 32-bit
> > > types.
> > > 
> > >   72     int32_t shutdown;
> > >   73     int32_t shutdown_event_loop;
> > > 
> > > Are there 64-bit types which are manipulated indirectly via atomics?
> 
> In the thread pool implementation, the __atomic builtins are not used for
> those 32b fields, but it falls back to PR_Atomic if ATOMIC_64BIT_OPERATIONS
> is not defined.

i686 defines ATOMIC_64BIT_OPERATIONS.

PR_Atomic doesn't have the bug .... 

> 
> > 
> > https://pagure.io/389-ds-base/blob/master/f/src/nunc-stans/test/
> > test_nuncstans_stress_core.c#_56
> > 
> > This is the affected code. The following int64_t types are used to track
> > success and failure counts. They are modified:
> > 
> > https://pagure.io/389-ds-base/blob/master/f/src/nunc-stans/test/
> > test_nuncstans_stress_core.c#_132
> > 
> > They use SEQ_CST even.
> > 
> > In some test executions these values are *double* their expected amount, or
> > wrong.
> > 
> > This test works 100% on ALL other arches, with no modification, only failing
> > repeatedly on i686
> 
> This doesn't mean that the implementation of the atomics is wrong.  It could
> just be a timing problem, or an out-of-bounds access that just triggers a
> bug in a 32b build.
> 
> The atomic builtins have fallbacks if data types for which atomic ops are
> not supported natively by the particular hardware.  For this tests, you
> should simply call the builtins and not fallback to the other implementation
> of atomics.  The fallback uses an internal array of locks, which should be
> fine for simple atomic counters that aren't performance-critical.  I think
> it's also very unlikely that the atomic builtins are the root cause here.
> 
> Looking at nunc-stans/ns/ns_thrpool.c a bit more:
> 
> Line 242: The mutex is destroyed immediately after the critical section. 
> That doesn't make sense, because the destruction means that there must not
> be another thread that could try to enter the critical section concurrently,
> and so the critical section would not be needed in the first place.  There
> are other places with the same issue.

"coverity whinges and complains"

> 
> Line 310: The condvar is signaled, the mutex unlocked, and then the function
> is called that destroys the mutex and condvar (Line 242 etc.).  There's no
> guarantee the waiting threads get to acquire the lock before it's destroyed
> (and same for the condvar).

 I think this is the cause of a random (1 in 10,000) deadlock I see in the tests sometimes, so good spot on this. I'll have to fix it.

> 
> Line 343-345: condvars can have spurious wake-ups.  They should always be
> used with a variable that represents the condition, and be in a loop (there
> is a loop in the caller, but the way the condvar is used this can be more
> like a yield, not a true producer/consumer relationship).  One will not get
> the normal ordering after the signal with a spurious wake-up, for example.

I think this is a non-issue because if we wake up spuriously with no work, we will re-trigger the wait event, and we don't break anything. So I really see no issue here ...

> 
> Line 441-442: This doesn't make sense.  The return is not semantically
> different from just falling through to the return at the end of the function.

If you don't return there, then on the check in the next else if it's possible a thread has called ns_job_done, triggering a heap use-after-free. This code exists because the bug is real, and we test for it.

> 
> Line 903 (and others for the other job types): Looks like an unnecessary
> critical section, given that it seems like this is a new event and not yet
> accessible to other threads.  That's not a problem in itself, but it's a red
> flag indicating that ownership of jobs and visibility to other threads
> aren't clear.

That's called "coverity whinges needlessly and we need to shut it up".

> 
> Line 1079: loading the data field after the critical section seems wrong (it
> might be the critical section is not needed).  The data field is set within
> the critical section on Line 1093, so this is inconsistent.  Similar issues
> exist in the accessors of the other fields.

You can only load this field if the job is in a state where it's guaranteed not to be destroyed, or in a state where you have exclusive ownership. This is one of the more "bad" parts of the original design that I can't escape ... today.

> 
> Line 1580 (and others): No documentation why release MO is used.  A shutdown
> flag wouldn't need it, if all you want to communicate is the need to shut
> down.  There might be a misunderstanding what the different MOs do.

To answer all this there is method to the apparent madness, and there is documentation about how it all works. https://firstyear.fedorapeople.org/nunc-stans/md_docs_job-safety.html should answer a few of the questions.

The safety doesn't come from the mutexes but from the design of the job lifecycle and how jobs act. There are guarantees in the framework, in the design, and in how jobs are meant to be called.

Of course this is all limited by the fact that the code is all in C, so we really have no basis or ability to reason about it at all. Second, the original design was not thread-safe at all - all the safety was "added later", which means we violated "correct, simple, fast". This means, yes, there are "edge cases" and it does rely on the CALLER to handle it correctly as an API; the API can't actually guarantee safety at all.

For the stress tests, however, this is a correctly implemented client, and one of the most reliable pieces of code in the server, having shaken out bugs in many other parts of the code and its design. It's really, really sensitive to these memory ordering issues, and that sensitivity has helped diagnose many issues above and below.

Given the time and opportunity I will be rewriting this, however, because the original design is really flawed and limited :(

But for today, I think the issues you raise may or may not be correct, because there are other guarantees in the system which you may not have accounted for.

> 
> Overall, the lack of high-level documentation of how the concurrent code is
> supposed to work, the bugs I've seen just on the first read of the code, and
> the nature of some of the existing comments in the code are warning signs to
> me, indicating that this may not work as intended.  (For example, if there's
> behavior that shouldn't happen but has been observed, this should always be
> considered an actual bug.)
> 
> My wild guess would be that there's a bug in the thread pool / work queue
> implementation, and that the counters in the test are just fine (which can
> be checked by using just the __atomic builtins and their fallbacks).  For a
> test, it should also be sufficient to make them 32b, thus reducing one
> further unlikely but potential cause.

Only i686 exhibits the issue, and only when using the glibc atomic types. PR_Atomic isn't affected, and no other platform is affected. 

I think when I made the counters 32b they worked as intended, but I need to do it again to test - of course, I also don't care about i686 anymore, and it's the *only* broken platform we have, so I probably won't ever check this :)

Comment 21 Torvald Riegel 2018-02-20 12:33:01 UTC
(In reply to wibrown from comment #20)
> (In reply to Torvald Riegel from comment #19)
> > (In reply to wibrown from comment #18)
> > > (In reply to Carlos O'Donell from comment #17)
> > > > (In reply to wibrown from comment #15)
> > > > > The issue is that on 32bit i686, atomic 64bit operations are not safe.
> > > > 
> > > > Please understand that my position right now is that I'm worried you've
> > > > found a toolchain bug.
> > > > 
> > > > I would appreciate any help you can provide to rule this out.
> > > > 
> > > > 64-bit operations should be safe on IA32, and the compiler will align the
> > > > types appropriately.
> > > > 
> > > > In ns_thrpool.c all of the types which are manipulated appear to be 32-bit
> > > > types.
> > > > 
> > > >   72     int32_t shutdown;
> > > >   73     int32_t shutdown_event_loop;
> > > > 
> > > > Are there 64-bit types which are manipulated indirectly via atomics?
> > 
> > In the thread pool implementation, the __atomic builtins are not used for
> > those 32b fields, but it falls back to PR_Atomic if ATOMIC_64BIT_OPERATIONS
> > is not defined.
> 
> i686 defines ATOMIC_64BIT_OPERATIONS.

Okay.  So the macro isn't even really about 64b atomic operations, given that it's required to be defined for 32b operations too?

> > 
> > Line 310: The condvar is signaled, the mutex unlocked, and then the function
> > is called that destroys the mutex and condvar (Line 242 etc.).  There's no
> > guarantee the waiting threads get to acquire the lock before it's destroyed
> > (and same for the condvar).
> 
>  I think this is the cause of a random (1 in 10,000) deadlock I see in the
> tests sometimes, so good spot on this. I'll have to fix it.

It may surface as a deadlock, but please understand that this can lead to anything.  For example, if the memory of the mutex is reused after it's destroyed, and then some other thread tries to acquire the mutex, it can overwrite random other data...

> > 
> > Line 343-345: condvars can have spurious wake-ups.  They should always be
> > used with a variable that represents the condition, and be in a loop (there
> > is a loop in the caller, but the way the condvar is used this can be more
> > like a yield, not a true producer/consumer relationship).  One will not get
> > the normal ordering after the signal with a spurious wake-up, for example.
> 
> I think this is a non issue because if we wake up spuriously with no work,
> we will re-trigger the wait event, and we don't break anything. So I really
> see no issue here .... 

I haven't looked at the sds queue yet, so perhaps this gives enough ordering; if not, then the consumer may run before the producer has signaled.
Even if that works for the code present, the condvar is misused, and at the very least this can confuse other people working on the code in the future. 

> > 
> > Line 441-442: This doesn't make sense.  The return is not semantically
> > different from just falling through to the return at the end of the function.
> 
> If you don't return there, on the check in the next else if, it's possible a
> thread has called ns_job_done trigger a heap use after free. This code
> exists because the bug is real, and we test for it.

The checks on the next else if branches will not be executed, because after line 440, line 470 will be executed next (when ignoring the unnecessary return). Lines 443, 451, and 459 all start else branches (and I think I didn't overlook any braces...).

I suspect you may create a timing difference with the unnecessary return, which makes it less likely to trigger the actual bug -- a bug that's still there.  The use-after-free sounds like threads can't agree whether something has been destroyed already.

> > 
> > Line 903 (and others for the other job types): Looks like an unnecessary
> > critical section, given that it seems like this is a new event and not yet
> > accessible to other threads.  That's not a problem in itself, but it's a red
> > flag indicating that ownership of jobs and visibility to other threads
> > aren't clear.
> 
> That's called "coverity whinges needlessly and we need to shut it up".

That's the wrong approach.  This code is confusing, and the fact that someone put in the critical section without reason is a sign that there's something wrong in the underlying understanding of how this needs to be synchronized properly.  Removing the critical section would make the code simpler.

And I hope you don't just ignore the first case I mentioned either.

> > 
> > Line 1079: loading the data field after the critical section seems wrong (it
> > might be the critical section is not needed).  The data field is set within
> > the critical section on Line 1093, so this is inconsistent.  Similar issues
> > exist in the accessors of the other fields.
> 
> You can only load this field if the job is in a state where it's guaranteed
> not to be destroyed, or in a state where you have exclusive ownership. This
> is one of the more "bad" parts of the original design that I can't escape
> ... today.

Even performing the load inside the critical section would make this code more consistent.  Doing it outside of the critical section is trying to optimize at the wrong place.  And doing it outside requires that "guaranteed not to be destroyed" is either something that holds forever, or there is additional synchronization from something after the load to whatever changes the state.  If that's not the case, you have a potential data race here.

> > 
> > Line 1580 (and others): No documentation why release MO is used.  A shutdown
> > flag wouldn't need it, if all you want to communicate is the need to shut
> > down.  There might be a misunderstanding what the different MOs do.
> 
> To answer all this there is method to the apparent madness, and there is
> documentation about how it all works.
> https://firstyear.fedorapeople.org/nunc-stans/md_docs_job-safety.html should
> answer a few of the questions.
> 
> The safety doesn't come from the mutexes, but the design of the job
> lifecycle, and how they act. There are guarantees in the framework and the
> design, and how jobs are meant to be called. 

This design description misses some essential things one needs to consider when dealing with synchronization.  For example, one needs to think about ownership or visibility when one can destroy data that may be visible to other threads; the deleted state doesn't make sense if destruction can happen right after it.  At destruction time, no other thread must be able to access the data any more.

I don't want to write a long critique of this description here, but feel free to reach out.

> Of course this is all limited by the fact that the code is all in C, so we
> really actually have no basis or ability to reason about it at all.

That's not true at all.  The abstract machine (see any C standard) and C11 memory model give you all you need.

> Second,
> the original design was not thread safe at all - all the safety is "added
> later" which means we violated "correct, simple, fast". This means, yes,
> there are "edge cases" and it does rely on the CALLER to handle it correctly
> as an API, the API can't actually guarantee safety at all. 

Adding it later isn't the problem.  The question is whether the synchronization scheme is correct or not.  It's fine if there are additional requirements on callers, if these are spelled out.

> For the stress tests however, this is a correctly implemented client, and
> one of the most reliable pieces of code in the server, having shaken out
> bugs in many other parts of the code and it's design. It's really really
> sensitive to these memory ordering issues, and that sensitivity has helped
> diagnose many issues above and below. 

I doubt you know whether the test is really sensitive in the sense of being a reliable indicator for a bug.  Exhaustive testing of all executions of concurrent code is a really tough problem, and one is unlikely to achieve that with a test that isn't hugely complex.

This is one of the things that makes concurrency hard: Unlike for sequential code, you can't easily test whether a piece of code is correct or not.  It's important to remember that, so as to not draw more confidence from a test result than is justified.

> Given the time an opportunity I will be rewriting this however, because the
> original design is really flawed and limited :( 
> 
> But for today, I think the issues you raise may or may not be correct,
> because there are other guarantees in the system which you may not have
> accounted for.

Possibly.

> > 
> > Overall, the lack of high-level documentation of how the concurrent code is
> > supposed to work, the bugs I've seen just on the first read of the code, and
> > the nature of some of the existing comments in the code are warning signs to
> > me, indicating that this may not work as intended.  (For example, if there's
> > behavior that shouldn't happen but has been observed, this should always be
> > considered an actual bug.)
> > 
> > My wild guess would be that there's a bug in the thread pool / work queue
> > implementation, and that the counters in the test are just fine (which can
> > be checked by using just the __atomic builtins and their fallbacks).  For a
> > test, it should also be sufficient to make them 32b, thus reducing one
> > further unlikely but potential cause.
> 
> Only i686 exhibits the issue, and only when using the glibc atomic types.
> PR_Atomic isn't affected, and no other platform is affected. 
> 
> I think when I made the counters 32b they worked as intended, but I need to
> do it again to test - of course, I also don't care about i686 anymore, and
> it's the *only* broken platform we have, so I probably won't ever check this
> :)

This code probably has concurrency bugs on all platforms, even if they may be less likely to trigger in your testing.

Comment 22 Fedora End Of Life 2018-02-20 15:37:38 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 23 wibrown@redhat.com 2018-02-21 22:17:10 UTC
(In reply to Torvald Riegel from comment #21)
> (In reply to wibrown from comment #20)
> > (In reply to Torvald Riegel from comment #19)
> > > (In reply to wibrown from comment #18)
> > > > (In reply to Carlos O'Donell from comment #17)
> > > > > (In reply to wibrown from comment #15)
> > > > > > The issue is that on 32bit i686, atomic 64bit operations are not safe.
> > > > > 
> > > > > Please understand that my position right now is that I'm worried you've
> > > > > found a toolchain bug.
> > > > > 
> > > > > I would appreciate any help you can provide to rule this out.
> > > > > 
> > > > > 64-bit operations should be safe on IA32, and the compiler will align the
> > > > > types appropriately.
> > > > > 
> > > > > In ns_thrpool.c all of the types which are manipulated appear to be 32-bit
> > > > > types.
> > > > > 
> > > > >   72     int32_t shutdown;
> > > > >   73     int32_t shutdown_event_loop;
> > > > > 
> > > > > Are there 64-bit types which are manipulated indirectly via atomics?
> > > 
> > > In the thread pool implementation, the __atomic builtins are not used for
> > > those 32b fields, but it falls back to PR_Atomic if ATOMIC_64BIT_OPERATIONS
> > > is not defined.
> > 
> > i686 defines ATOMIC_64BIT_OPERATIONS.
> 
> Okay.  So the macro isn't even really about 64b atomic operations, given
> that it's required to be defined for 32b operations too?

How great is legacy code. 

> 
> > > 
> > > Line 310: The condvar is signaled, the mutex unlocked, and then the function
> > > is called that destroys the mutex and condvar (Line 242 etc.).  There's no
> > > guarantee the waiting threads get to acquire the lock before it's destroyed
> > > (and same for the condvar).
> > 
> >  I think this is the cause of a random (1 in 10,000) deadlock I see in the
> > tests sometimes, so good spot on this. I'll have to fix it.
> 
> It may surface as a deadlock, but please understand that this can lead to
> anything.  For example, if the memory of the mutex is reused after it's
> destroyed, and then some other thread tries to acquire the mutex, it can
> overwrite random other data...

Yes, I'm aware. I have already opened an issue to resolve this now.

> 
> > > 
> > > Line 343-345: condvars can have spurious wake-ups.  They should always be
> > > used with a variable that represents the condition, and be in a loop (there
> > > is a loop in the caller, but the way the condvar is used this can be more
> > > like a yield, not a true producer/consumer relationship).  One will not get
> > > the normal ordering after the signal with a spurious wake-up, for example.
> > 
> > I think this is a non issue because if we wake up spuriously with no work,
> > we will re-trigger the wait event, and we don't break anything. So I really
> > see no issue here .... 
> 
> I haven't looked at the sds queue yet, so perhaps this gives enough
> ordering; if not, then the consumer may run before the producer has signaled.
> Even if that works for the code present, the condvar is misused, and at the
> very least this can confuse other people working on the code in the future. 

The queue has the conditions to check, so this works and I don't think it's an issue.

> 
> > > 
> > > Line 441-442: This doesn't make sense.  The return is not semantically
> > > different from just falling through to the return at the end of the function.
> > 
> > If you don't return there, on the check in the next else if, it's possible a
> > thread has called ns_job_done trigger a heap use after free. This code
> > exists because the bug is real, and we test for it.
> 
> The checks on the next else if branches will not be executed because after
> line 440, line 470 will be executed next (when ignoring the unnecessary
> return).  Lines 443,451,459 all start else branches (and I think I didn't
> overlook any braces...).
> 
> I suspect you may create a timing difference with the unnecessary return,
> which makes it less likely to trigger the actual bug -- a bug that's still
> there.  The use-after-free sounds like threads can't agree whether something
> has been destroyed already.

No! If you don't return there, the if statement CAN be and HAS been run after another thread freed the memory, causing a heap use-after-free! This is a real issue, and I have been able to prove it in the past.

> 
> > > 
> > > Line 903 (and others for the other job types): Looks like an unnecessary
> > > critical section, given that it seems like this is a new event and not yet
> > > accessible to other threads.  That's not a problem in itself, but it's a red
> > > flag indicating that ownership of jobs and visibility to other threads
> > > aren't clear.
> > 
> > That's called "coverity whinges needlessly and we need to shut it up".
> 
> That's the wrong approach.  This code is confusing, and the fact that
> someone put in the critical section without reason is a sign that there's
> something wrong in the underlying understanding of how this needs to be
> synchronized properly.  Removing the critical section would make the code
> simpler.
> 
> And I hope you don't just ignore the first case I mentioned either.

During job creation, only the creating thread can see the job, and it's not until later that this matters. Initially there was NO locking on that section because there wasn't a need; it's only needed once you either copy the pointer to another thread or mark the job armed. As a result, everything after job_create does do locking and checks.

> 
> > > 
> > > Line 1079: loading the data field after the critical section seems wrong (it
> > > might be the critical section is not needed).  The data field is set within
> > > the critical section on Line 1093, so this is inconsistent.  Similar issues
> > > exist in the accessors of the other fields.
> > 
> > You can only load this field if the job is in a state where it's guaranteed
> > not to be destroyed, or in a state where you have exclusive ownership. This
> > is one of the more "bad" parts of the original design that I can't escape
> > ... today.
> 
> Even performing the load inside the critical section would make this code
> more consistent.  Doing it outside of the critical section is trying to
> optimize at the wrong place.  And doing it outside requires that "guaranteed
> not to be destroyed" is either something that holds forever, or there is
> additional synchronization from something after the load to whatever changes
> the state.  If that's not the case, you have a potential data race here.

Yes, I am very aware of this. I have mentioned that there are some really problematic design decisions made "not by me", and now they are hard to escape. There are *other* guarantees besides the lock that prevent this from being an issue today.

> 
> > > 
> > > Line 1580 (and others): No documentation why release MO is used.  A shutdown
> > > flag wouldn't need it, if all you want to communicate is the need to shut
> > > down.  There might be a misunderstanding what the different MOs do.
> > 
> > To answer all this there is method to the apparent madness, and there is
> > documentation about how it all works.
> > https://firstyear.fedorapeople.org/nunc-stans/md_docs_job-safety.html should
> > answer a few of the questions.
> > 
> > The safety doesn't come from the mutexes, but the design of the job
> > lifecycle, and how they act. There are guarantees in the framework and the
> > design, and how jobs are meant to be called. 
> 
> This design description misses some essential things one needs to consider
> when dealing with synchronization.  For example, one needs to think about
> ownership or visibility when one can destroy data that may be visible to
> other threads; the deleted state doesn't make sense if destruction can
> happen right after it.  At destruction time, no other thread must be able to
> access the data any more.
> 
> I don't want to write a long critique of this description here, but feel
> free to reach out.

The design does consider it; its entire point is to understand that. There are some requirements, needed to prevent issues, which it is the burden of the *caller* of the API to ensure are not violated, and provided this holds true, it does work very well.

But once you start to step outside that highly limited, focused scope of the design, you WILL absolutely have issues. The safety is a combination of caller correctness and the library doing "just enough". And this all stems from the fact that I inherited something that was written as "fast, fast, fast". No "simple" or "correct" ever crossed the path of the code. When I got this, it used integer assignment as if it were atomic! It didn't have locks on jobs. It used a lock-free queue that had a resource leak and data races.

If I had enough time I would nuke this code from orbit and start again. But today it works "just enough", if you hold it correctly, to be passable.

> 
> > Of course this is all limited by the fact that the code is all in C, so we
> > really actually have no basis or ability to reason about it at all.
> 
> That's not true at all.  The abstract machine (see any C standard) and C11
> memory model give you all you need.

No! C as a language lets you do anything you please, at any time, with no verification. That's why you can have locks that "aren't called" on critical sections, or atomics with the wrong orderings. It's up to a human to work it out, and only at RUNTIME can you really determine the presence of an issue. C has NO POSSIBILITY of reasoning about its correctness or safety until you run it!

> 
> > Second,
> > the original design was not thread safe at all - all the safety is "added
> > later" which means we violated "correct, simple, fast". This means, yes,
> > there are "edge cases" and it does rely on the CALLER to handle it correctly
> > as an API, the API can't actually guarantee safety at all. 
> 
> Adding it later isn't the problem.  The question is whether the
> synchronization scheme is correct or not.  It's fine if there are additional
> requirements on callers, if these are spelled out.

No, no! You absolutely have to do "correct" first. This API never did. Now it's like this.

> 
> > For the stress tests however, this is a correctly implemented client, and
> > one of the most reliable pieces of code in the server, having shaken out
> > bugs in many other parts of the code and it's design. It's really really
> > sensitive to these memory ordering issues, and that sensitivity has helped
> > diagnose many issues above and below. 
> 
> I doubt you know whether the test is really sensitive in the sense of being
> a reliable indicator for a bug.  Exhaustive testing of all executions of
> concurrent code is a really tough problem, and one is unlikely to achieve
> that with a test that isn't hugely complex.
> 

I'm really aware of this. 

> This is one of the things that makes concurrency hard: Unlike for sequential
> code, you can't easily test whether a piece of code is correct or not.  It's
> important to remember that, so as to not draw more confidence from a test
> result than is justified.

I'm also really aware of this ... tests only get you so far, and in C, that's where it ends. You have no type checking (none that's meaningful, anyway), no verification, nothing. Tests are about as good an assertion as you can get when it comes to C.

> 
> > Given the time an opportunity I will be rewriting this however, because the
> > original design is really flawed and limited :( 
> > 
> > But for today, I think the issues you raise may or may not be correct,
> > because there are other guarantees in the system which you may not have
> > accounted for.
> 
> Possibly.
> 
> > > 
> > > Overall, the lack of high-level documentation of how the concurrent code is
> > > supposed to work, the bugs I've seen just on the first read of the code, and
> > > the nature of some of the existing comments in the code are warning signs to
> > > me, indicating that this may not work as intended.  (For example, if there's
> > > behavior that shouldn't happen but has been observed, this should always be
> > > considered an actual bug.)
> > > 
> > > My wild guess would be that there's a bug in the thread pool / work queue
> > > implementation, and that the counters in the test are just fine (which can
> > > be checked by using just the __atomic builtins and their fallbacks).  For a
> > > test, it should also be sufficient to make them 32b, thus reducing one
> > > further unlikely but potential cause.
> > 
> > Only i686 exhibits the issue, and only when using the glibc atomic types.
> > PR_Atomic isn't affected, and no other platform is affected. 
> > 
> > I think when I made the counters 32b they worked as intended, but I need to
> > do it again to test - of course, I also don't care about i686 anymore, and
> > it's the *only* broken platform we have, so I probably won't ever check this
> > :)
> 
> This code probably has concurrency bugs on all platforms, even if they may
> be less likely to trigger in your testing.

Only if you "hold it the wrong way". 


This is also getting really off track from the bug. I've already opened the issue from what you showed, and the rest I don't think are issues here, and in the future it won't matter anyway as I will rewrite this eventually. There are too many limits in the design choices that were made for it to really be a long-term usable system for us.

Comment 24 Carlos O'Donell 2018-02-21 22:44:41 UTC
(In reply to wibrown from comment #23)
> This is also getting really off track from the bug. I've already opened the
> issue from what you showed, and the rest I don't think are issues here, and
> in the future it won't matter anyway as I will rewrite this eventually.
> There are too many limits in the design choices that were made for it to
> really be a long-term usable system for us.

Yes, we are a little off track, but the point is that if we are removing 32-bit arches because of concurrency issues, then it's a false dichotomy because all arches seem to be affected equally by concurrency issues.

Again, from our perspective we would like to narrow down exactly what failed on i686 concurrency wise and reproduce it.

How do I get a hold of a failing test? Shall we follow up in 1530832?

Comment 26 Florian Weimer 2018-02-23 17:14:57 UTC
> > * We need gcc/libc to correctly align values on i686, but I also don't
> > really have the time or motivation to raise this issue or pursue it
> 
> You should not have to do this by hand.

long long inside structs has 4-byte alignment on i386, although its alignment is 8 outside structs.  This means some manual work may be required to get 8-byte alignment.

I think this means that the GCC-provided atomics need to work with 4-byte-aligned pointers, too.  If they do not, they are indeed broken.

Carlos, did your review of the built-ins take this into account?

Even for this:

struct foo
{
  int x;
  _Atomic long long y;
};

we do not produce 8-byte alignment.  The struct is 12 bytes large and has no padding.
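
For anyone reproducing this, a small stand-alone check (illustrative only; the exact numbers depend on the target ABI and compiler version) makes the layout described above visible:

#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

struct foo
{
  int x;
  _Atomic long long y;
};

int main(void)
{
  /* On i386/i686 this is expected to report a 4-byte offset for 'y' and a
     12-byte struct, i.e. no padding up to 8-byte alignment.  A 64-bit
     __atomic operation on such an address may then fall back to a lock. */
  printf("_Alignof(long long)     = %zu\n", alignof(long long));
  printf("_Alignof(struct foo)    = %zu\n", alignof(struct foo));
  printf("sizeof(struct foo)      = %zu\n", sizeof(struct foo));
  printf("offsetof(struct foo, y) = %zu\n", offsetof(struct foo, y));
  return 0;
}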

Comment 27 Jakub Jelinek 2018-02-23 17:33:40 UTC
(In reply to Florian Weimer from comment #26)
> > > * We need gcc/libc to correctly align values on i686, but I also don't
> > > really have the time or motivation to raise this issue or pursue it
> > 
> > You should not have to do this by hand.
> 
> long long inside structs has 4-byte alignment on i386, although its
> alignment is 8 outside structs.  This means some manual work may be required
> to get 8-byte alignment.

Yes, just use some typedef for int64_t that can be used in atomics, with __attribute__((aligned (8))).

> I think this means that the GCC-provided atomics need to work with
> 4-byte-aligned pointers, too.  If they do not, they are indeed broken.

I disagree, it is up to the user to make sure it is properly aligned.  There is no HW support for 64-bit atomics with only 32-bit alignment; falling back to locking depending on whether the value is aligned is something libatomic can do, but performant code shouldn't.
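
To illustrate the typedef suggestion above, a minimal sketch (the type and struct names here are made up, not taken from the 389-ds code):

#include <stdint.h>

/* A 64-bit integer type that stays 8-byte aligned even as a struct
   member, so the GCC __atomic builtins can use a native operation. */
typedef int64_t aligned_int64_t __attribute__ ((aligned (8)));

struct counters
{
  int flags;                 /* padding is inserted after this field   */
  aligned_int64_t reads;     /* now 8-byte aligned inside the struct   */
};

static inline int64_t
counters_bump_reads (struct counters *c)
{
  /* With the alignment attribute this can use a native 64-bit atomic on
     i686 (cmpxchg8b); without it, the builtin may have to go through
     libatomic (link with -latomic) and possibly take a lock. */
  return __atomic_add_fetch (&c->reads, 1, __ATOMIC_RELAXED);
}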

Comment 28 Torvald Riegel 2018-02-23 21:01:46 UTC
(In reply to wibrown from comment #23)
> (In reply to Torvald Riegel from comment #21)
> > (In reply to wibrown from comment #20)
> > > (In reply to Torvald Riegel from comment #19)
> > > > (In reply to wibrown from comment #18)
> > > > > (In reply to Carlos O'Donell from comment #17)
> > > > > > (In reply to wibrown from comment #15)
> > 
> > > > 
> > > > Line 343-345: condvars can have spurious wake-ups.  They should always be
> > > > used with a variable that represents the condition, and be in a loop (there
> > > > is a loop in the caller, but the way the condvar is used this can be more
> > > > like a yield, not a true producer/consumer relationship).  One will not get
> > > > the normal ordering after the signal with a spurious wake-up, for example.
> > > 
> > > I think this is a non issue because if we wake up spuriously with no work,
> > > we will re-trigger the wait event, and we don't break anything. So I really
> > > see no issue here .... 
> > 
> > I haven't looked at the sds queue yet, so perhaps this gives enough
> > ordering; if not, then the consumer may run before the producer has signaled.
> > Even if that works for the code present, the condvar is misused, and at the
> > very least this can confuse other people working on the code in the future. 
> 
> The queue has the conditions to check, so this works and I don't think it's
> an issue.

I'd suggest at least adding a comment to the code stating that the condvar isn't used quite the way condvars should be used, and that the expectation is that the queue is supposed to ensure ordering.
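
For future readers, the usual pattern being referred to looks roughly like this (a generic pthread sketch, not the sds/nunc-stans code):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static bool work_ready = false;     /* the condition the condvar stands for */

void
producer (void)
{
  pthread_mutex_lock (&q_lock);
  work_ready = true;                /* publish the condition first ...      */
  pthread_cond_signal (&q_cond);    /* ... then wake a waiter               */
  pthread_mutex_unlock (&q_lock);
}

void
consumer (void)
{
  pthread_mutex_lock (&q_lock);
  while (!work_ready)               /* loop: a wake-up alone proves nothing,
                                       spurious wake-ups are allowed        */
    pthread_cond_wait (&q_cond, &q_lock);
  work_ready = false;               /* consume the work under the lock      */
  pthread_mutex_unlock (&q_lock);
}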

> > 
> > > > 
> > > > Line 441-442: This doesn't make sense.  The return is not semantically
> > > > different from just falling through to the return at the end of the function.
> > > 
> > > If you don't return there, on the check in the next else if, it's possible a
> > > thread has called ns_job_done trigger a heap use after free. This code
> > > exists because the bug is real, and we test for it.
> > 
> > The checks on the next else if branches will not be executed because after
> > line 440, line 470 will be executed next (when ignoring the unnecessary
> > return).  Lines 443,451,459 all start else branches (and I think I didn't
> > overlook any braces...).
> > 
> > I suspect you may create a timing difference with the unnecessary return,
> > which makes it less likely to trigger the actual bug -- a bug that's still
> > there.  The use-after-free sounds like threads can't agree whether something
> > has been destroyed already.
> 
> No! If you don't return there, the if statement CAN be and HAS been run after
> another thread freed the memory, causing a heap-use-after-free! This is a
> real issue, and I have been able to prove it in the past.

The code is basically this:
if (...) {
 ...
 return;
} else if (...) {
 ...
}
return;

The first return doesn't make a semantic difference; removing it cannot affect the allowed outcomes of the program.  The second if statement is in the else branch and won't run if the first if's condition evaluated to true.

If you did observe the other branch to get executed, you probably have a concurrency problem somewhere in your program, or some other case of undefined behavior.

What did your "proof" look like?

> > 
> > > > 
> > > > Line 903 (and others for the other job types): Looks like an unnecessary
> > > > critical section, given that it seems like this is a new event and not yet
> > > > accessible to other threads.  That's not a problem in itself, but it's a red
> > > > flag indicating that ownership of jobs and visibility to other threads
> > > > aren't clear.
> > > 
> > > That's called "coverity whinges needlessly and we need to shut it up".
> > 
> > That's the wrong approach.  This code is confusing, and the fact that
> > someone put in the critical section without reason is a sign that there's
> > something wrong in the underlying understanding of how this needs to be
> > synchronized properly.  Removing the critical section would make the code
> > simpler.
> > 
> > And I hope you don't just ignore the first case I mentioned either.
> 
> During the job create, only the creation thread can see the job, and it's
> not until later that this matters. Initially there was NO locking on that
> section because there wasn't a need; it's only needed once you either copy
> the pointer to another thread or you mark the job armed. As a result,
> everything after job_create does do locking and checks.

I can't follow you.  My point was that you use critical sections on data that isn't yet visible to other threads.

> > 
> > > > 
> > > > Line 1079: loading the data field after the critical section seems wrong (it
> > > > might be the critical section is not needed).  The data field is set within
> > > > the critical section on Line 1093, so this is inconsistent.  Similar issues
> > > > exist in the accessors of the other fields.
> > > 
> > > You can only load this field if the job is in a state where it's guaranteed
> > > not to be destroyed, or in a state where you have exclusive ownership. This
> > > is one of the more "bad" parts of the original design that I can't escape
> > > ... today.
> > 
> > Even performing the load inside the critical section would make this code
> > more consistent.  Doing it outside of the critical section is trying to
> > optimize at the wrong place.  And doing it outside requires that "guaranteed
> > not to be destroyed" is either something that holds forever, or there is
> > additional synchronization from something after the load to whatever changes
> > the state.  If that's not the case, you have a potential data race here.
> 
> Yes, I am very aware of this. I have mentioned there are some really
> problematic design decisions made "not by me", and now they are hard to
> escape. There are *other* guarantees besides the lock that prevent this
> being an issue today. 

Is there a reason for not moving the loads inside of the respective critical sections to make this at least more consistent?
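
As a generic illustration of that suggestion (hypothetical names, not the actual nunc-stans accessors), the snapshot can simply be taken while the lock is still held:

#include <pthread.h>

struct job
{
  pthread_mutex_t lock;
  void *data;
};

void *
job_get_data (struct job *j)
{
  void *snapshot;

  pthread_mutex_lock (&j->lock);
  snapshot = j->data;          /* read under the same lock the setter uses */
  pthread_mutex_unlock (&j->lock);
  return snapshot;
}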

> > 
> > > > 
> > > > Line 1580 (and others): No documentation why release MO is used.  A shutdown
> > > > flag wouldn't need it, if all you want to communicate is the need to shut
> > > > down.  There might be a misunderstanding what the different MOs do.
> > > 
> > > To answer all this there is method to the apparent madness, and there is
> > > documentation about how it all works.
> > > https://firstyear.fedorapeople.org/nunc-stans/md_docs_job-safety.html should
> > > answer a few of the questions.
> > > 
> > > The safety doesn't come from the mutexes, but the design of the job
> > > lifecycle, and how they act. There are guarantees in the framework and the
> > > design, and how jobs are meant to be called. 
> > 
> > This design description misses some essential things one needs to consider
> > when dealing with synchronization.  For example, one needs to think about
> > ownership or visibility when one can destroy data that may be visible to
> > other threads; the deleted state doesn't make sense if destruction can
> > happen right after it.  At destruction time, no other thread must be able to
> > access the data any more.
> > 
> > I don't want to write a long critique of this description here, but feel
> > free to reach out.
> 
> The design does consider it; its entire point is to understand that.

Just consider this example:  There's no need to designate and manage a "deleted" state if the only thread that is allowed to see this state is going to immediately delete the object.  The thought that it would be helpful to other threads would mean that there are other threads that race with deletion, which would be a bug. 

> There
> are some things which are needed to prevent issues which are the burden of
> the *caller* of the API to ensure they do not violate, and provided this
> holds true, it does work very well.
> 
> But once you start to step outside that highly limited, focused scope of the
> design, you WILL absolutely have issues.

This is not the point I am making.  The design is always a combination of *specified* preconditions (eg, requirements on callers) and promised post-conditions.  With that in mind, the program is either correctly implemented or it is not.

> > > Of course this is all limited by the fact that the code is all in C, so we
> > > really actually have no basis or ability to reason about it at all.
> > 
> > That's not true at all.  The abstract machine (see any C standard) and C11
> > memory model give you all you need.
> 
> No! C as a language lets you do anything you please, at any time, with no
> verification.

Your statement was that there's no way to reason about it, which is not true.  Verification is something else -- you don't need automatic verification to be able to reason about a program.

> That's why you can have locks that "aren't called" on critical
> sections, or atomics with the wrong orderings. It's up to a human to work it
> out, and only at RUNTIME can you really determine the presence of an issue.
> C has NO POSSIBILITY of reasoning about its correctness or safety until you
> run it!

That is absolutely wrong.  The C standard specifies semantics, and those semantics allow one to reason about it.  It may not be a fully formalized model, but there are formalizations of the memory model, for example:  http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

Please familiarize yourself with the notion of the abstract machine, and how the C standard uses it to define what executions (and thus outcomes) are allowed for a particular program.
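
As a concrete example of that kind of reasoning (illustrative names, C11 atomics):

#include <stdatomic.h>

static int payload;            /* plain, non-atomic data           */
static atomic_int ready;       /* flag used to publish the payload */

void
writer (void)
{
  payload = 42;                                              /* A */
  atomic_store_explicit (&ready, 1, memory_order_release);   /* B */
}

void
reader (void)
{
  if (atomic_load_explicit (&ready, memory_order_acquire))   /* C */
    {
      /* If C reads the value stored by B, the release/acquire pair
         orders A before this read: the abstract machine guarantees we
         see 42, with no data race.  That conclusion follows from the
         standard alone, without running the program. */
      int v = payload;
      (void) v;
    }
}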

> > 
> > > Second,
> > > the original design was not thread safe at all - all the safety is "added
> > > later" which means we violated "correct, simple, fast". This means, yes,
> > > there are "edge cases" and it does rely on the CALLER to handle it correctly
> > > as an API, the API can't actually guarantee safety at all. 
> > 
> > Adding it later isn't the problem.  The question is whether the
> > synchronization scheme is correct or not.  It's fine if there are additional
> > requirements on callers, if these are spelled out.
> 
> No no! You absolutely have to perform "correct" first. This API never did,
> and now it's like this.

What I was saying is that you can have a correct sequential program, and then add multi-threading or other parallelism/concurrency later on, and still have it be correct.  There's no need to design for concurrency right from the start.

> > 
> > This is one of the things that makes concurrency hard: Unlike for sequential
> > code, you can't easily test whether a piece of code is correct or not.  It's
> > important to remember that, so as to not draw more confidence from a test
> > result than is justified.
> 
> I'm also really aware of this .... tests only get you so far. And in C,
> that's where it ends.

No, you can reason about the allowed executions of your program.

> You have no type checking (none that's meaningful, anyway), no verification,
> nothing. Tests are about as good an assertion as you can get when it comes
> to C.

You can write type-safe programs with strong type checking if you stick to some good practices (note that that's not quite the same as getting memory safety).
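
One common practice along those lines (a sketch, not taken from the 389-ds code) is to wrap raw values in single-member structs so the compiler rejects accidental mixing:

/* Distinct wrapper types for values that share a representation. */
typedef struct { int fd; } listen_fd_t;
typedef struct { int fd; } timer_fd_t;

static void
close_listen_fd (listen_fd_t s)
{
  /* ... close (s.fd) ... */
  (void) s;
}

/* close_listen_fd ((timer_fd_t) { .fd = 3 });   <- rejected at compile time */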

I'd also add that actual verification is typically difficult no matter the language, because you need to build a formal specification of your program (ie, what is the program actually supposed to do?), against which the implementation can then be verified.

> > 
> > > > 
> > > > Overall, the lack of high-level documentation of how the concurrent code is
> > > > supposed to work, the bugs I've seen just on the first read of the code, and
> > > > the nature of some of the existing comments in the code are warning signs to
> > > > me, indicating that this may not work as intended.  (For example, if there's
> > > > behavior that shouldn't happen but has been observed, this should always be
> > > > considered an actual bug.)
> > > > 
> > > > My wild guess would be that there's a bug in the thread pool / work queue
> > > > implementation, and that the counters in the test are just fine (which can
> > > > be checked by using just the __atomic builtins and their fallbacks).  For a
> > > > test, it should also be sufficient to make them 32b, thus reducing one
> > > > further unlikely but potential cause.
> > > 
> > > Only i686 exhibits the issue, and only when using the glibc atomic types.
> > > PR_Atomic isn't affected, and no other platform is affected. 
> > > 
> > > I think when I made the counters 32b they worked as intended, but I need to
> > > do it again to test - of course, I also don't care about i686 anymore, and
> > > it's the *only* broken platform we have, so I probably won't ever check this
> > > :)
> > 
> > This code probably has concurrency bugs on all platforms, even if they may
> > be less likely to trigger in your testing.
> 
> Only if you "hold it the wrong way". 

I don't know what you mean.  Either it works under the preconditions that are spelled out somewhere, or it doesn't.  There's no fuzzy middle ground in which programs happen to work magically.

Comment 29 Randy Barlow 2018-03-04 00:50:22 UTC
FESCo is trying to make a decision about this issue, and it would help us if we could get answers to a few questions:

0) Should the bug title be updated to state that this affects all 32-bit architectures, or does it only affect i686?
1) Are there any known real-world use cases where problems are detected, or are the problems only noticeable via tests (i.e., is it reproducible under real usage)?
2) Is there any evidence as to whether this issue affects supported architectures?

If you feel strongly that this should be dropped in F28, please file a change request for FESCo to review during our next meeting. Thanks!

Comment 30 Alexander Bokovoy 2018-03-04 09:01:09 UTC
Randy,

I believe this is the wrong bug for these questions. They should be asked of the 389-ds developers in the bug they filed following the process for excluding packages from an architecture: https://bugzilla.redhat.com/show_bug.cgi?id=1530832. The Fedora process asks for a bug that blocks the architecture-specific tracking bug https://bugzilla.redhat.com/show_bug.cgi?id=179258, and that was promptly done, with no reaction whatsoever.

Comment 31 Fedora Update System 2018-03-16 21:09:40 UTC
freeipa-4.6.90.pre1-1.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-2fd7295cb9

Comment 32 Fedora Update System 2018-03-17 19:30:02 UTC
dogtag-pki-10.6.0-0.2.fc28, dogtag-pki-theme-10.6.0-0.2.fc28, freeipa-4.6.90.pre1-1.fc28, pki-console-10.6.0-0.2.fc28, pki-core-10.6.0-0.2.fc28, tomcat-8.5.29-1.fc28, tomcatjss-7.3.0-0.2.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-2fd7295cb9

Comment 33 Fedora Update System 2018-03-18 00:49:04 UTC
dogtag-pki-10.6.0-0.2.fc28, dogtag-pki-theme-10.6.0-0.2.fc28, freeipa-4.6.90.pre1-1.fc28, pki-console-10.6.0-0.2.fc28, pki-core-10.6.0-0.2.fc28, tomcat-8.5.29-1.fc28, tomcatjss-7.3.0-0.2.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

Comment 34 Randy Barlow 2018-03-26 02:33:49 UTC
Please file a change for this for Fedora 28.

Comment 35 Florence Blanc-Renaud 2018-05-31 09:12:58 UTC
After discussion with jkurik about the right process, an informative ticket has been filed on FESCo (since Fedora 28 is already available):
https://pagure.io/fesco/issue/1900

Comment 36 Florence Blanc-Renaud 2018-06-18 06:40:28 UTC
Outcome of the FESCo meeting (2018-06-15):
FESCo acknowledges (+7, 0, -1)