Bug 1410514
Summary: | deadlock on cos cache rebuild | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Nikhil Dehadrai <ndehadra> | |
Component: | 389-ds-base | Assignee: | Noriko Hosoi <nhosoi> | |
Status: | CLOSED ERRATA | QA Contact: | Viktor Ashirov <vashirov> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 7.3 | CC: | gparente, mkosek, mreynolds, msauton, nkinder, nsoman, pvoborni, rcritten, rmeggins, sramling, tbordaz, wibrown | |
Target Milestone: | rc | Keywords: | TestBlocker, ZStream | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | 389-ds-base-1.3.6.1-3.el7 | Doc Type: | No Doc Update | |
Doc Text: |
See 7.3.z bug 1414678.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1414678 (view as bug list) | Environment: | ||
Last Closed: | 2017-08-01 21:14:10 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1414678 |
Description
Nikhil Dehadrai
2017-01-05 16:10:50 UTC
- The server hangs because of a deadlock between two threads 252 dd=84 locks held 0 write locks 0 pid/thread 4927/140354647430912 (7FA6DCE71700) 252 READ 1 WAIT userRoot/id2entry.db page 102 80002084 dd= 1 locks held 23 write locks 16 pid/thread 4927/140354179274496 (7FA6C0FF9700) 80002084 WRITE 1 HELD userRoot/id2entry.db page 102 - The deadlock is due to locks taken in the opposite order: - Thread 41: cos_cache lock , then db lock - Thread 20: db lock, then cos_cache lock - thread 41 is cos_cache_create. It is a thread responsible to rebuild cos cache when it is notified that a cos definition was updated thread 20 is adding entry (owning some db page). - The entry is in conflict -> triggers bepreop to rename it - The entry is a cos definition, so when it is renamed (by beprop) it notifies thread 41 to rebuild the cache - The cos definition is "cn=Default Password Policy,cn=computers,cn=accounts,<suffix>" - when thread 20 tries to notify thread 41, Thread 41 was already processing a cache rebuild This bug is likely triggered when there are several cos cache rebuild at the same time. Conflict on cos definition entry is a good way to trigger this but I think it can happened independantly (gdb) thread 20 [Switching to thread 20 (Thread 0x7fa6c0ff9700 (LWP 4955))] #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 135 2: movl %edx, %eax (gdb) frame 5 #5 0x00007fa6e6c4ff04 in cos_cache_change_notify (pb=pb@entry=0x7fa68c239200) at ldap/servers/plugins/cos/cos_cache.c:3372 3372 slapi_lock_mutex(change_lock); (gdb) list 3370,3376 3370 if(do_update) 3371 { 3372 slapi_lock_mutex(change_lock); 3373 slapi_notify_condvar( something_changed, 1 ); 3374 cos_cache_notify_flag = 1; 3375 slapi_unlock_mutex(change_lock); 3376 } (gdb) print cos_cache_notify_flag $38 = 1 - The deadlock is dynamic but in freeipa tests it happens very frequently In conclusion: The bug is a DS bug in COS component, it likely happens in freeipa context because cos cache being under load by conflict on cos definition. No immediat fix identified IMHO this bug is independent of https://bugzilla.redhat.com/show_bug.cgi?id=1404338 The issue is also noticed for similar upgrade setup from RHEL 7.3 > RHEL 7.3.2 MASTER: #################################### Before Upgrade: ================ [root@auto-hv-01-guest07 ~]# ipa domainlevel-get ----------------------- Current domain level: 1 ----------------------- After Upgrade: ================= [root@auto-hv-01-guest07 ~]# ipa domainlevel-set 1 ipa: ERROR: no modifications to be performed [root@auto-hv-01-guest07 ~]# ipa domainlevel-get ----------------------- Current domain level: 1 ----------------------- REPLICA: #################################### Before Upgrade: ================ [root@auto-hv-01-guest05 ~]# ipa domainlevel-get ----------------------- Current domain level: 1 ----------------------- After Upgrade: ================= [root@auto-hv-01-guest05 ~]# ipa domainlevel-set 1 ipa: ERROR: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST' [root@auto-hv-01-guest05 ~]# ipa domainlevel-get ipa: ERROR: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST' The test does thing which tries to create replication conflicts and thus break IPA in order to verify bug 1404338. But this risky behavior broke IPA in different way producing other issue. It cannot be used for verification of bug 1404338 but it doesn't mean that there isn't any other way to verify bug 1404338. This bug is a DS bug. The hang can be reproduced without IDM upgrade scenario. An upstream ticket is opened: https://fedorahosted.org/389/ticket/49079 Moving to DS component according to comment 10. Fixed upstream https://fedorahosted.org/389/ticket/49079#comment:6 To verify, I executed ticket49079_test.py script based on https://pagure.io/389-ds-base/issue/49079#comment-117791 [0 root@qeos-38 tickets]# py.test -s -v ticket49079_test.py ===================================================== test session starts ===================================================== platform linux2 -- Python 2.7.5, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/bin/python cachedir: .cache INFO:lib389:Starting total init cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config ('Update succeeded: status ', '0 Total update succeeded') INFO:dirsrvtests.tests.tickets.ticket49079_test:Replication is working. INFO:lib389:Pausing replication cn=meTo_localhost:39001,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config INFO:lib389:Pausing replication cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config INFO:lib389:Resuming replication cn=meTo_localhost:39001,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config INFO:lib389:Resuming replication cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config PASSEDInstance slapd-master_1 removed. Instance slapd-master_2 removed. ==================== 1 passed in 41.48 seconds ====================== Marking the bug as Verified since the tests PASSed. DS build: 1.3.6.1 389-ds-base: 1.3.6.1-13.el7 nss: 3.28.4-8.el7 nspr: 4.13.1-1.0.el7_3 openldap: 2.4.44-4.el7 svrcore: 4.1.3-2.el7 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2086 |