Bug 1939607
| Summary: | hang because of incorrect accounting of readers in vattr rwlock | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | thierry bordaz <tbordaz> | |
| Component: | 389-ds-base | Assignee: | thierry bordaz <tbordaz> | |
| Status: | CLOSED ERRATA | QA Contact: | RHDS QE <ds-qe-bugs> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 8.4 | CC: | bsmejkal, ldap-maint, mreynolds, sgouvern | |
| Target Milestone: | rc | Keywords: | Triaged | |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | sync-to-jira | |||
| Fixed In Version: | 389-ds-1.4-8050020210514191740-d5c171fc | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2018257 (view as bug list) | Environment: | ||
| Last Closed: | 2021-11-09 18:11:20 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2018257 | |||
The first analysis was wrong. There is no vattr lock leak. In fact, a pthread rwlock test program shows the exact same lock dump with:
- T1 reader holding the lock
- T2 writer waiting for T1
- T3 reader waiting for T2
- T4 reader waiting for T2
(gdb) print *the_map->lock
$41 = {__data = {__readers = 14, __writers = 0, __wrphase_futex = 2, __writers_futex = 1, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 0 '\000',
__pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 2},
__size = "\016\000\000\000\000\000\000\000\002\000\000\000\001", '\000' , "\002\000\000\000\000\000\000", __align = 14}
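One way to read the `__readers = 14` value in the dump above is sketched below. It assumes the bit layout used internally by glibc 2.25 and later, where the three low bits of `__readers` are flags and the reader count sits above them. These constants are glibc internals, not a public API, and the names `RW_*`, `rw_readers_view` and `decode_readers` are hypothetical. If that layout applies here, 14 decodes to a single reader holding the lock plus a waiting writer and waiting readers, which would be consistent with the T1 to T4 picture rather than with 14 readers.

```c
#include <stdio.h>

/* Assumed layout of __data.__readers (glibc >= 2.25 internals, not public API). */
#define RW_WRPHASE      1u  /* lock is in a write phase */
#define RW_WRLOCKED     2u  /* a writer holds or is trying to take the lock */
#define RW_RWAITING     4u  /* readers are blocked waiting for the write phase to end */
#define RW_READER_SHIFT 3   /* reader count is stored above the three flag bits */

struct rw_readers_view {
    unsigned int readers;  /* number of threads holding the lock in read mode */
    int wrphase;
    int wrlocked;
    int rwaiting;
};

/* Split a raw __readers value into reader count and flags. */
struct rw_readers_view decode_readers(unsigned int raw)
{
    struct rw_readers_view v;
    v.readers  = raw >> RW_READER_SHIFT;
    v.wrphase  = (raw & RW_WRPHASE)  != 0;
    v.wrlocked = (raw & RW_WRLOCKED) != 0;
    v.rwaiting = (raw & RW_RWAITING) != 0;
    return v;
}

/* Print the decoded view of a raw __readers value. */
void print_readers_view(unsigned int raw)
{
    struct rw_readers_view v = decode_readers(raw);
    printf("raw=%u readers=%u wrphase=%d wrlocked=%d rwaiting=%d\n",
           raw, v.readers, v.wrphase, v.wrlocked, v.rwaiting);
}
```

Calling `print_readers_view(14)` from a `main` shows the decoded fields for the value seen in gdb.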
The root cause (RC) of the hang is a deadlock involving three threads:
[08/Mar/2021:18:09:15.255947668 +0000] conn=4 op=561 ADD dn="cn=FleetCommander Desktop Profile Administrators,cn=roles,cn=accounts,dc=ipa,dc=test"
[08/Mar/2021:18:09:15.261133390 +0000] conn=4 op=562 SRCH base="cn=FleetCommander Desktop Profile Administrators,cn=privileges,cn=pbac,dc=ipa,dc=test" scope=0 filter="(objectClass=*)" attrs="objectClasses aci * attributeTypes"
[08/Mar/2021:18:09:15.263940289 +0000] conn=4 op=563 ADD dn="cn=FleetCommander Desktop Profile Administrators,cn=privileges,cn=pbac,dc=ipa,dc=test"
[08/Mar/2021:18:09:15.264024493 +0000] conn=4 op=562 RESULT err=32 tag=101 nentries=0 wtime=0.000045722 optime=0.002898152 etime=0.002941253
[08/Mar/2021:18:09:15.261304370 +0000] conn=4 op=561 RESULT err=0 tag=105 nentries=0 wtime=0.000103907 optime=0.005360651 etime=0.005394639
Thread 14
conn=4 op=561 ADD "cn=FleetCommander Desktop Profile Administrators,cn=roles,cn=accounts,dc=ipa,dc=test"
Holds the vattr lock in read mode and waits for a DB page (WAIT userRoot/objectclass.db) held by Thread 20
-> SIDGEN (post-op)
-> internal SRCH -b "dc=ipa,dc=test" "(objectclass=ipantdomainattrs)"
op_shared_search => hold vattr lock in read
-> index read => DB page
Thread 20
conn=4 op=563 ADD "cn=FleetCommander Desktop Profile Administrators,cn=privileges,cn=pbac,dc=ipa,dc=test"
Holds a DB page (HOLD userRoot/objectclass.db) while waiting for the vattr lock in read mode
-> ADD -> memberof modify (txnbe post)
-> DNA -> internal_search
-> vattr_map_lookup => wait for vattr in read
Thread 6
On a backend state change, it rebuilds the CoS cache
Tries to acquire the vattr lock in write mode, which blocks new readers
Internal search SRCH -b "dc=ipa,dc=test" "(&(|(objectclass=cosSuperDefinition)(objectclass=cosDefinition))(objectclass=ldapsubentry))"
-> cos_dn_defs_cb
-> vattr_map_insert : wait vattr in write
Thread 6 is blocked by Thread 14
Thread 14 is blocked by Thread 20
Thread 20 is blocked by Thread 6
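The three "blocked by" lines above form a cycle in the wait-for graph, which is what makes this a deadlock rather than a slow operation. A minimal sketch of that check follows. The `wait_edge` struct and `in_wait_cycle` function are hypothetical names for illustration only and are not part of 389-ds-base; the sketch assumes each thread waits on at most one other thread.

```c
#include <stddef.h>

/* One "waiter is blocked by holder" relation, e.g. {6, 14}. */
struct wait_edge {
    int waiter;
    int holder;
};

/*
 * Follow the blocked-by chain starting at `start`.
 * Returns 1 if the chain leads back to `start` (deadlock cycle), 0 otherwise.
 */
int in_wait_cycle(const struct wait_edge *edges, size_t n, int start)
{
    int cur = start;
    for (size_t step = 0; step < n; step++) {
        int next = -1;
        for (size_t i = 0; i < n; i++) {
            if (edges[i].waiter == cur) {
                next = edges[i].holder;
                break;
            }
        }
        if (next < 0) {
            return 0;  /* cur is not waiting on anyone: the chain ends */
        }
        if (next == start) {
            return 1;  /* came back to the starting thread */
        }
        cur = next;
    }
    return 0;
}
```

With the edges `{6, 14}`, `{14, 20}`, `{20, 6}` taken from this bug, the function reports a cycle for thread 6. Dropping any one edge breaks the cycle.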
Fix pushed upstream => POST

Build tested: 389-ds-base-1.4.3.23-1.module+el8.5.0+11016+7e7e9011.x86_64
I ran the freeipa installation 141 times in a loop without a hang. The fix is in the build. Marking as Verified:Tested, SanityOnly.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (389-ds-base bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:4203
Description of problem:
The hang occurs when a thread (CoS cache rebuild) tries to acquire the vattr rwlock (the_map->lock) in write mode. Because readers still hold the lock, this thread has to wait, but because writers have priority it also blocks the other SRCH threads that try to acquire the rwlock in read mode. The hang should end when the reader threads that acquired the lock before the writer thread (CoS cache) release it. The problem is that there are no other reader threads: the backtrace only shows readers that are waiting for the writer. The backtrace shows 5 readers and 1 writer, but the lock shows 14 readers. So some readers that completed their task have not released the lock.
(gdb) print *the_map->lock
$41 = {__data = {__readers = 14, __writers = 0, __wrphase_futex = 2, __writers_futex = 1, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 2},
__size = "\016\000\000\000\000\000\000\000\002\000\000\000\001", '\000' , "\002\000\000\000\000\000\000", __align = 14}

Related tickets:
#51068, which gives priority to writers and hangs later readers
#49873, which acquires the map rwlock at the operation level using a per-CPU variable

Version-Release number of selected component (if applicable):
The bug has likely existed since 1.4.1.2 but is more prone to happen since 1.4.3.8

How reproducible:
No simple test case has been identified at the moment. It currently occurs 1% of the time with freeipa tests.

Actual results:
DS hangs

Expected results:
DS should not hang
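The writer-priority behaviour described above can be reproduced in isolation with a small pthread program. The sketch below is not the test program used in this bug. It assumes glibc and a lock created with `PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP` (numeric value 2, which appears to match `__flags = 2` in the dump). The names `demo_lock`, `writer_thread` and `late_reader_result_while_writer_waits` are hypothetical. One reader holds the lock, a writer blocks behind it, and a later read attempt is then refused even though the lock is only held in read mode.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_rwlock_t demo_lock;

/* T2: the writer, blocks until the first reader releases the lock. */
static void *writer_thread(void *arg)
{
    (void)arg;
    pthread_rwlock_wrlock(&demo_lock);
    pthread_rwlock_unlock(&demo_lock);
    return NULL;
}

/*
 * Returns the pthread_rwlock_tryrdlock() result seen by a late reader once
 * the writer is waiting (EBUSY expected with writer preference), 0 if the
 * writer was never observed waiting, or -1 on a setup error.
 */
int late_reader_result_while_writer_waits(void)
{
    pthread_rwlockattr_t attr;
    pthread_t writer;
    struct timespec pause = {0, 1000000}; /* 1 ms between attempts */
    int rc = 0;

    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setkind_np(&attr,
                                  PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&demo_lock, &attr);
    pthread_rwlockattr_destroy(&attr);

    /* T1: first reader takes the lock and keeps it. */
    pthread_rwlock_rdlock(&demo_lock);

    if (pthread_create(&writer, NULL, writer_thread, NULL) != 0) {
        pthread_rwlock_unlock(&demo_lock);
        pthread_rwlock_destroy(&demo_lock);
        return -1;
    }

    /* T3/T4: later read attempts succeed until the writer starts waiting. */
    for (int i = 0; i < 5000; i++) {
        rc = pthread_rwlock_tryrdlock(&demo_lock);
        if (rc != 0) {
            break;                      /* refused: the writer is queued */
        }
        pthread_rwlock_unlock(&demo_lock);
        nanosleep(&pause, NULL);
    }

    /* T1 releases: the writer can now get the lock and finish. */
    pthread_rwlock_unlock(&demo_lock);
    pthread_join(writer, NULL);
    pthread_rwlock_destroy(&demo_lock);
    return rc;
}
```

In the server, the equivalent of T1 never reaches its unlock because it is itself waiting on a resource held by one of the blocked readers, so the refusal becomes permanent.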