Recent discussions in bug 1544386 have raised issues about the general correctness of the 389-ds-base atomic operations, not just on 32-bit platforms but on all platforms.

In ldap/servers/slapd/backend.c (around line 197):

void
be_addsuffix(Slapi_Backend *be, const Slapi_DN *suffix)
{
    if (be->be_state != BE_STATE_DELETED) {
        struct suffixlist *new_suffix, *list;

        new_suffix = (struct suffixlist *)slapi_ch_malloc(sizeof(struct suffixlist));
        new_suffix->be_suffix = slapi_sdn_dup(suffix);
        new_suffix->next = NULL;

        PR_Lock(be->be_suffixlock);

        if (be->be_suffixlist == NULL) {
            be->be_suffixlist = new_suffix;
        } else {
            list = be->be_suffixlist;
            while (list->next != NULL) {
                list = list->next;
            }
            list->next = new_suffix;
        }
        slapi_counter_increment(be->be_suffixcounter);

        PR_Unlock(be->be_suffixlock);
    }
}

We have a PR_Lock/PR_Unlock pair around the modification of be->be_suffixlist. Within the critical section we have slapi_counter_increment(be->be_suffixcounter), which, if 64-bit atomics are present, is carried out with relaxed memory order (MO). Then we have (around line 168):

int
slapi_be_issuffix(const Slapi_Backend *be, const Slapi_DN *suffix)
{
    struct suffixlist *list;
    int r = 0;
    /* this backend is no longer valid */
    if (be && be->be_state != BE_STATE_DELETED) {
        int i = 0, count;

        count = slapi_counter_get_value(be->be_suffixcounter);
        list = be->be_suffixlist;
        while (list && i < count) {
            if (slapi_sdn_compare(list->be_suffix, suffix) == 0) {
                r = 1;
                break;
            }
            i++;
            list = list->next;
        }
    }
    return r;
}

This function uses a relaxed-MO load to read be_suffixcounter and then iterates over the list. The use of relaxed MO means there is no synchronizes-with relation between these operations, across either function, for the counter: one thread may observe the update of the counter *before* it observes the update of the list itself.
Therefore slapi_be_issuffix may walk off the end of the list via list->next and crash. Again, it seems that the failing 32-bit code is just an early warning that the atomic usage in this code is not correct in general. However, it is hard to know whether the atomic code is correct, because no documentation of intent is provided for the concurrent algorithm that is supposed to ensure correctness. If you need any help auditing this code, please do not hesitate to reach out to the toolchain team. We have a lot of experience writing parallel and concurrent code, and have even given workshops on modern concurrency topics.
The issue here is the misuse of the counter as a memory barrier, which it is not.

Previously, no synchronization was performed by NSPR either.

We need a lock around this. Plain and simple.

There are SO MANY cases like this in the code base. I've spent months working on issues like this, and I keep finding more.
(In reply to wibrown from comment #1)
> The issue here is the mis-use of the counter for acting as a memory barrier,
> when it's not.
>
> Previously, no sync was performed by NSPR either.
>
> We need a lock around this. plain and simple.
>
> There are SO MANY cases like this in the code base. I've spent months
> working on issues like this, and I keep finding more.

Sorry to hear that! :-(

Given that I was doing some auditing myself to double-check what might be wrong with the 32-bit code being generated by the compiler (related to bug 1544386), I didn't want to lose the fact that both I and Torvald Riegel (our toolchain concurrency expert) felt there were serious issues in the code that could cause serious user problems. This bug exists just so I didn't lose this comment, in the event that you were not aware of the issue. However, it sounds like you are well aware of the problem.

Yes, in the case above, taking the be_suffixlock in *all* instances where the same structure is accessed in a critical section is sufficient. My comment about the counter still stands, though: the relaxed-MO counter does not implement what is written as pthread code in the #else fallback, so the counter may not increment monotonically as expected.
(In reply to wibrown from comment #1)
> The issue here is the mis-use of the counter for acting as a memory barrier,
> when it's not.
>
> Previously, no sync was performed by NSPR either.
>
> We need a lock around this. plain and simple.

This would work, but is not strictly necessary, I believe (based on a quick read of the code). The problem in this piece of code is similar to how double-checked locking is sometimes implemented incorrectly (e.g., a lack of barriers and thus of synchronizes-with relations, speaking in terms of the C11/C++11 memory model). It can be implemented correctly, though.

> There are SO MANY cases like this in the code base. I've spent months
> working on issues like this, and I keep finding more.

One approach that can work well in practice is to review existing code and document both the intent of the synchronization involved and how the implementation achieves the intended behavior through the programming language's memory model (in this case, the C11/C++11 memory model is the one to use, even though this may not be a C11 program). The high-level intents to focus on are atomicity and required ordering. The review will then highlight issues such as the one reported here (i.e., that iterating over the list must happen after the list has been initialized, and that inter-thread ordering of the kind needed here cannot be achieved with relaxed memory order). Conversely, relaxed memory order really says that you don't care about ordering (ignoring corner cases), which is a red flag unless there's a comment in the code explaining why that's fine.
The problem is that I don't want to use such low-level primitives, because I'm the only member of the team who currently has this knowledge. Education is one step toward adding the right barriers, but it's hard to communicate and teach all the complexity and subtlety.

In this case we could add the barrier, but the algorithms in use aren't designed to work atomically, as they often update multiple pointers in possibly unaligned locations/different cache lines. The use of relaxed MO in the counters is because they are exactly that - counters. They are not meant to provide synchronisation of data. Again, we shouldn't be using the counter in this way here; it's just wrong. The count should be a value held behind a rwlock. And that's why in this case, where we have possibly multiple pointer updates AND possibly other consumers, we really need locking: converting this to a concurrently readable or lock-free algorithm would be really hard - and pointless.

Next, the codebase is hugely complex, often with parts and accesses of variables spanning many .c files. Reviewing it is hugely complex and difficult. I only have so much time in a day, so I have used tools like ASAN/TSAN/LSAN to catch the low-hanging fruit, and our test suites to uncover issues in common code. That's why we didn't catch this: I think you would have to be searching in the mapping tree lookup code at the same time as you are doing a backend add or delete op. Not a common scenario at all!

The next consideration is reference counting - reference counts don't scale across cores anyway, so every instance of refcounting has to go at some point. That's the other major user of atomics. Our final major use is performance counters, but I also plan to rewrite these with a different method, because the continual calling of atomics is slow.

But yeah, thanks for opening this. I will resolve the issue soon.
(In reply to wibrown from comment #4)
> The problem is that I don't want to use such low level primitives, because
> I'm the only member of the team who currently has this knowledge. Education
> is one step for adding in the right barriers, but it's hard to communicate
> and teach all the complexity and subtelty.

Fair enough, but either you use atomics properly, or you get rid of them and replace them with critical sections.

> In this case, we could add the barrier, but the algorithms in use aren't
> designed to work atomically as the often update multiple pointers in
> possibly unaligned locations/different cache lines.

Don't think about cache lines here, except if you want to think about performance. The C11 memory model is what you want to base your reasoning on, and its data-race freedom is key.

> The use of relaxed in the counters is because they are exactly that -
> counters. They are not meant to provide synchronisation of data.

Then these are just statistics, and you need to understand that there's no guarantee when the accesses happen compared to the code around them.

> Again, here
> we shouldn't be using the counter in this way, it's just wrong. The count
> should be a value held behind a rwlock.

If the rwlock critical sections would just contain the access to the counter, then this isn't semantically different from using an atomic counter with seq-cst MO (provided that this is used consistently throughout the program). Also, for very short critical sections, just use a normal mutex.

> And next, is the codebase is hugely complex, often with parts and accesses
> of variables spanning many .c files. Reviewing it is hugely complex and
> difficult. I only have so much time in a day, so I have used other tools
> like ASAN/TSAN/LSAN to catch low hanging fruits,

That's okay...

> and using our test suites
> to uncover issues in common code.

... but don't expect that this will give you confidence similar to testing sequential code.
You need to reason about the concurrent code, and review it. We don't have the technology in place to actually test all interleavings.

> That's why we didn't catch this, because I
> think you would have to be searching and in the mapping tree lookup code at
> the same time as you are doing a backend add or delete op. Not a common
> scenario at all!

That's why one needs to review code properly, and reason about it thoroughly.

> The next consideration is reference counting - reference count doesn't scale
> across cores anyway,

This is not true in general. If there's enough work between the updates of the refcounts, you will be fine. It will not be as fast as something with less contention, but that would also be harder to use correctly.

> so every instance of rc has to go at some point. So
> that's the other major users. Our final major use is performance counters,
> but I also plan to rewrite these with a different method because the
> continual calling of atomics is slow.

I'm not sure how many events per second you are trying to count, but you have to optimize quite a bit before contention on performance counters becomes a significant slowdown. Given the existing correctness problems, I doubt you have to worry about this in the near future.

> But yeah, thanks for opening this. I will resolve the issue soon.

I suggest taking your time and doing a thorough review, instead of just trying to fix the bug based on a local analysis.

Feel free to reach out to Carlos or me regarding how we deal with concurrency in glibc.
(In reply to Torvald Riegel from comment #5)
> (In reply to wibrown from comment #4)
> > The problem is that I don't want to use such low level primitives, because
> > I'm the only member of the team who currently has this knowledge. Education
> > is one step for adding in the right barriers, but it's hard to communicate
> > and teach all the complexity and subtelty.
>
> Fair enough, but either you use atomics properly, or you get rid of them and
> replace them with critical sections.

That's exactly the plan!

> > In this case, we could add the barrier, but the algorithms in use aren't
> > designed to work atomically as the often update multiple pointers in
> > possibly unaligned locations/different cache lines.
>
> Don't think about cache lines here, except if you want to think about
> performance. C11 memory model is what you want to base your reasoning on,
> and it's data-race freedom is key.
>
> > The use of relaxed in the counters is because they are exactly that -
> > counters. They are not meant to provide synchronisation of data.
>
> Then these are just statistics, and you need to understand that there's no
> guarantee when the accesses happen compared to the code around them.

Yes, exactly. These stats are updated in a function anyway, and if my memory is correct, function calls act as an ordering guarantee to the compiler anyway.

> > Again, here
> > we shouldn't be using the counter in this way, it's just wrong. The count
> > should be a value held behind a rwlock.
>
> If the rwlock critical sections would just have the access to the counter in
> them, then this isn't different semantically from using an atomic counter
> with seq-cst MO (provided that this is consistently used throughout the
> program).
> Also, for very short critical sections, just use a normal mutex.

Yes, that's true.
But in this example the rwlock is best because you have many readers in parallel and occasional rare updates that need exclusive access :)

> > And next, is the codebase is hugely complex, often with parts and accesses
> > of variables spanning many .c files. Reviewing it is hugely complex and
> > difficult. I only have so much time in a day, so I have used other tools
> > like ASAN/TSAN/LSAN to catch low hanging fruits,
>
> That's okay...
>
> > and using our test suites
> > to uncover issues in common code.
>
> ... but don't expect that this will give you similar confidence as when
> testing sequential code. You need to reason about the concurrent code, and
> review it. We don't have the technology in place necessary to actually test
> all interleavings.
>
> > That's why we didn't catch this, because I
> > think you would have to be searching and in the mapping tree lookup code at
> > the same time as you are doing a backend add or delete op. Not a common
> > scenario at all!
>
> That's why one needs to review code properly, and reason about it thoroughly.

There were about 20 years of development on the project before I joined the team, which somewhat hampers my ability to review the legacy code. A review *today* would be costly, but I do want to do it.

> > The next consideration is reference counting - reference count doesn't scale
> > across cores anyway,
>
> This is not true in general. If there's enough work between the updates of
> the refcounts, you will be fine. It will not be as fast as using something
> with less contention, but that would also be harder to use correctly.
>
> > so every instance of rc has to go at some point. So
> > that's the other major users. Our final major use is performance counters,
> > but I also plan to rewrite these with a different method because the
> > continual calling of atomics is slow.
>
> I'm not sure how many events per second you try to count, but you have to
> optimize quite a bit until the contention on performance counters is a
> significant slowdown. Given the existing correctness problems, I doubt you
> have to worry about this in the near future.

Yeah, first I want to get our code working "correctly".

> > But yeah, thanks for opening this. I will resolve the issue soon.
>
> I suggest taking your time and doing a thorough review, instead of just
> trying to fix the bug based on a local analysis.
>
> Feel free to reach out to Carlos or me regarding how we deal with
> concurrency in glibc.

Thanks, that's a good idea. I'll add that to my list of things to do :)
(In reply to wibrown from comment #6)
> (In reply to Torvald Riegel from comment #5)
> > (In reply to wibrown from comment #4)
> > > In this case, we could add the barrier, but the algorithms in use aren't
> > > designed to work atomically as the often update multiple pointers in
> > > possibly unaligned locations/different cache lines.
> >
> > Don't think about cache lines here, except if you want to think about
> > performance. C11 memory model is what you want to base your reasoning on,
> > and it's data-race freedom is key.
> >
> > > The use of relaxed in the counters is because they are exactly that -
> > > counters. They are not meant to provide synchronisation of data.
> >
> > Then these are just statistics, and you need to understand that there's no
> > guarantee when the accesses happen compared to the code around them.
>
> Yes, exactly. These stats are updated in a function anyway, and if my memory
> is correct, function calls act as an ordering guarantee to the compiler
> anyway.

There's no such guarantee (think about LTO and inlining, for example). If you need ordering, it's best to express it through synchronization operations: atomics with a stronger-than-relaxed memory order, mutex lock/unlock (each of which generally prevents reordering in one direction only), and so on.

> > > Again, here
> > > we shouldn't be using the counter in this way, it's just wrong. The count
> > > should be a value held behind a rwlock.
> >
> > If the rwlock critical sections would just have the access to the counter in
> > them, then this isn't different semantically from using an atomic counter
> > with seq-cst MO (provided that this is consistently used throughout the
> > program).
> > Also, for very short critical sections, just use a normal mutex.
>
> Yes, that's true.
> But in this example the rwlock is best because you have
> many readers in parallel and occasional rare updates that need exclusive
> access :)

Uncontended rwlocks can be slower than uncontended mutexes when the critical sections are short. This depends on various factors, of course, so it is best to measure.

> > > And next, is the codebase is hugely complex, often with parts and accesses
> > > of variables spanning many .c files. Reviewing it is hugely complex and
> > > difficult. I only have so much time in a day, so I have used other tools
> > > like ASAN/TSAN/LSAN to catch low hanging fruits,
> >
> > That's okay...
> >
> > > and using our test suites
> > > to uncover issues in common code.
> >
> > ... but don't expect that this will give you similar confidence as when
> > testing sequential code. You need to reason about the concurrent code, and
> > review it. We don't have the technology in place necessary to actually test
> > all interleavings.
> >
> > > That's why we didn't catch this, because I
> > > think you would have to be searching and in the mapping tree lookup code at
> > > the same time as you are doing a backend add or delete op. Not a common
> > > scenario at all!
> >
> > That's why one needs to review code properly, and reason about it thoroughly.
>
> There were about 20 years in the project before I joined the team, which
> somewhat hampers my ability to perform code review on legacy code. And a
> review *today* would be costly, but I do want to do it.

I know it's hard to jump into an old code base, and it requires a lot of reconstruction. But on the positive side, it's also an opportunity to get rid of a lot of bit rot, and to bring a fresh perspective that's not clouded by past assumptions about the code.
re: rwlocks - correct, they are not always the best choice. Back in 2012/2013, when we did a lot of measurement of performance and lock contention (using systemtap, cachegrind, gdb stack snapshots, et al.), we found that rwlocks did not perform as well as plain old mutexes in some cases and in some code paths.

re: code inspection - many of the heavily used code paths have been inspected and tested, e.g. the main connection/operation handling in daemon.c/connection.c, the main ldap operation handling with connection/database locking, replication, etc. There is certainly room for improvement, but the existing code base isn't that bad.

For the event loop, we wanted to move to nunc-stans, which combines a more modern event framework with lock-free data structures; that should solve some of the race conditions and thread contention problems we see in that area of the code. For the database, we wanted to move to lmdb instead of bdb. But as William said, it is a huge code base written by ~100 people over 20 years, primarily for Windows, Solaris, and Linux.
A systemtap script (probably quite bit-rotted by now) to measure performance and lock contention, along with supporting scripts to massage the data and create spreadsheets:

https://github.com/richm/scripts/
https://github.com/richm/scripts/blob/master/stap-dirsrv.sh
https://github.com/richm/scripts/blob/master/stap-dirsrv.stp
https://github.com/richm/scripts/blob/master/stap-report.py
https://github.com/richm/scripts/blob/master/stap-spreadsheet.bas
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle. Changing version to '28'.
(In reply to Rich Megginson from comment #8)
> re: rwlocks - correct - they are not always the best choice - back in
> 2012/2013 when we did a lot of measurement of performance and lock
> contention (using systemtap, cachegrind, gdb stack snapshots, et. al.), we
> did find that rwlocks did not perform as well as plain old mutexes in some
> cases and in some code paths.

I have fixed that in glibc's implementation since then. New glibc rwlocks perform much better in mostly-reader scenarios now, but it's still essentially reference counting, so an uncontended exclusive lock can be a bit faster.

> re: code inspection - many of the heavily used code paths have been
> inspected and tested i.e. the main connection/operation handling in
> daemon.c/connection.c, the main ldap operation handling with
> connection/database locking, replication, etc. There is room for improvement
> certainly, but the existing code base isn't that bad.

I can only speak to what I've seen so far. I have no idea whether it was representative, but I found potential issues on my first read of the code.

> For the event loop, we wanted to move to nunc-stans which combines a more
> modern event framework with lock free data structures,

I've found (potential) issues in that one too (see bug 1544386).
This message is a reminder that Fedora 28 is nearing its end of life. On 2019-May-28, Fedora will stop maintaining and issuing updates for Fedora 28. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '28'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 28 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle. Changing version to '31'.
This package has changed maintainers in Fedora. Reassigning to the new maintainer of this component.
This message is a reminder that Fedora 31 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '31'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 31 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 31 changed to end-of-life (EOL) status on 2020-11-24. Fedora 31 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.