Bug 1410514 - deadlock on cos cache rebuild
Summary: deadlock on cos cache rebuild
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.3
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Noriko Hosoi
QA Contact: Viktor Ashirov
URL:
Whiteboard:
Depends On:
Blocks: 1414678
 
Reported: 2017-01-05 16:10 UTC by Nikhil Dehadrai
Modified: 2020-09-13 21:55 UTC
CC: 12 users

Fixed In Version: 389-ds-base-1.3.6.1-3.el7
Doc Type: No Doc Update
Doc Text:
See 7.3.z bug 1414678.
Clone Of:
: 1414678 (view as bug list)
Environment:
Last Closed: 2017-08-01 21:14:10 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github 389ds 389-ds-base issues 2138 0 None None None 2020-09-13 21:55:24 UTC
Red Hat Product Errata RHBA-2017:2086 0 normal SHIPPED_LIVE 389-ds-base bug fix and enhancement update 2017-08-01 18:37:38 UTC

Description Nikhil Dehadrai 2017-01-05 16:10:50 UTC
Description of problem:
ipa: ERROR: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST'. This message is received when running "ipa domainlevel-set 1" after upgrading the IPA server from RHEL 7.2.z to RHEL 7.3.2.


Version-Release number of selected component (if applicable):
ipa-server-4.4.0-14.el7_3.4.x86_64


Steps: (Upgrade from 7.2.z > 7.3.2)
====================================
1) Install master on RHEL 7.2.z. (In my case ipa-server.x86_64 0:4.2.0-15.el7_2.19).
2) Install replica on RHEL 7.2.z against master in step1, with ipa-replica-prepare command.
3) Stop replica server using "ipactl stop".
4) Configure repos for RHEL 7.3.2 on Master and Replica.
5) Upgrade master to RHEL 7.3.2 and stop master using command "ipactl stop".
6) Start replica using command "ipactl start" and upgrade replica to RHEL 7.3.2 using command "yum -y update 'ipa*' sssd".
7) Start master server using command "ipactl start"
8) Run "kinit admin" both on master and replica.
9) Run "ipa domainlevel-set 1" both on Master and Replica.


Actual results:
1) Both Master and Replica are upgraded successfully after steps 5 and 6.
2) After step 9, the following error message is received on the MASTER:
# ipa domainlevel-set 1
ipa: ERROR: Domain Level cannot be raised to 1, server <replica.testrelm.test> does not support it.

3) After step 9, the following error message is received on the REPLICA:
# ipa domainlevel-set 1
ipa: ERROR: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST'


Expected results:
After step 9, the following message should be shown on both the IPA MASTER and the REPLICA:

ipa: ERROR: Domain Level cannot be raised to 1, existing replication conflicts have to be resolved.

Comment 3 thierry bordaz 2017-01-05 17:41:11 UTC
    - The server hangs because of a deadlock between two threads:
     252 dd=84 locks held 0    write locks 0    pid/thread 4927/140354647430912 (7FA6DCE71700)
     252 READ          1 WAIT    userRoot/id2entry.db      page        102

80002084 dd= 1 locks held 23   write locks 16   pid/thread 4927/140354179274496 (7FA6C0FF9700)
80002084 WRITE         1 HELD    userRoot/id2entry.db      page        102

    - The deadlock is due to locks taken in the opposite order:
        - Thread 41: cos_cache lock, then db lock
        - Thread 20: db lock, then cos_cache lock

    - Thread 41 is cos_cache_create. It is the thread responsible for rebuilding the cos cache
      when it is notified that a cos definition has been updated.

      Thread 20 is adding an entry (and owns some db pages).
        - The entry is in conflict -> this triggers a bepreop to rename it
        - The entry is a cos definition, so when it is renamed (by the bepreop) it notifies thread 41 to rebuild the cache

    - The cos definition is "cn=Default Password Policy,cn=computers,cn=accounts,<suffix>"

    - When thread 20 tries to notify thread 41, thread 41 was already processing a cache rebuild.
      This bug is likely triggered when several cos cache rebuilds happen at the same time.
      A conflict on a cos definition entry is a good way to trigger this, but I think it can happen
      independently.
(gdb) thread 20
[Switching to thread 20 (Thread 0x7fa6c0ff9700 (LWP 4955))]
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 2:  movl    %edx, %eax
(gdb) frame 5
#5  0x00007fa6e6c4ff04 in cos_cache_change_notify (pb=pb@entry=0x7fa68c239200) at ldap/servers/plugins/cos/cos_cache.c:3372
3372            slapi_lock_mutex(change_lock);
(gdb) list 3370,3376
3370        if(do_update)
3371        {
3372            slapi_lock_mutex(change_lock);
3373            slapi_notify_condvar( something_changed, 1 );
3374            cos_cache_notify_flag = 1;
3375            slapi_unlock_mutex(change_lock);
3376        }
(gdb) print cos_cache_notify_flag
$38 = 1

     - The deadlock is timing-dependent, but in the freeipa tests it happens very frequently.

    In conclusion:

        The bug is a DS bug in the COS component. It likely shows up in the freeipa context
        because the cos cache is under load from conflicts on cos definitions.
        No immediate fix identified.
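One conventional remedy for this class of deadlock, sketched here under assumptions (all names are hypothetical; this is not the actual upstream fix for ticket 49079), is to make the notification asynchronous so the cache-side lock is never taken while a db page lock is held:

```python
import threading
import queue

# Illustrative pattern only: the ADD path enqueues a rebuild request
# after releasing its db locks, and a dedicated thread drains the queue.
rebuild_requests = queue.Queue()
rebuilt = threading.Event()

def cache_rebuilder():
    """Drains rebuild requests; never runs while the writer holds db locks."""
    rebuild_requests.get()
    rebuilt.set()        # stands in for the actual cos cache rebuild

worker = threading.Thread(target=cache_rebuilder, daemon=True)
worker.start()

def add_cos_definition():
    # ... the ADD and its bepreop rename happen here, under db page locks ...
    pass

add_cos_definition()                 # db locks acquired and released inside
rebuild_requests.put("rebuild")      # notify only after db locks are gone
assert rebuilt.wait(timeout=2)
print("rebuild notified without holding db locks")
```

Decoupling the notification from the db transaction removes the opposite-order acquisition entirely, at the cost of the rebuild happening slightly later.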

Comment 5 thierry bordaz 2017-01-05 17:43:53 UTC

IMHO this bug is independent of https://bugzilla.redhat.com/show_bug.cgi?id=1404338

Comment 6 Nikhil Dehadrai 2017-01-06 06:29:43 UTC
The issue is also seen with a similar upgrade path from RHEL 7.3 to RHEL 7.3.2.

MASTER:
####################################

Before Upgrade:
================
[root@auto-hv-01-guest07 ~]# ipa domainlevel-get
-----------------------
Current domain level: 1
-----------------------

After Upgrade:
=================
[root@auto-hv-01-guest07 ~]# ipa domainlevel-set 1
ipa: ERROR: no modifications to be performed
[root@auto-hv-01-guest07 ~]# ipa domainlevel-get
-----------------------
Current domain level: 1
-----------------------


REPLICA:
####################################

Before Upgrade:
================
[root@auto-hv-01-guest05 ~]# ipa domainlevel-get
-----------------------
Current domain level: 1
-----------------------

After Upgrade:
=================
[root@auto-hv-01-guest05 ~]# ipa domainlevel-set 1
ipa: ERROR: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST'
[root@auto-hv-01-guest05 ~]# ipa domainlevel-get
ipa: ERROR: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST'

Comment 7 Petr Vobornik 2017-01-06 09:39:01 UTC
The test deliberately tries to create replication conflicts (and thus break IPA) in order to verify bug 1404338. But this risky behavior broke IPA in a different way, producing this other issue. It cannot be used for verification of bug 1404338, but that doesn't mean there isn't another way to verify bug 1404338.

Comment 10 thierry bordaz 2017-01-06 15:11:19 UTC
This bug is a DS bug. The hang can be reproduced without IDM upgrade scenario.
An upstream ticket has been opened: https://fedorahosted.org/389/ticket/49079

Comment 11 Petr Vobornik 2017-01-10 14:15:19 UTC
Moving to DS component according to comment 10.

Comment 12 thierry bordaz 2017-01-11 12:49:52 UTC
Fixed upstream https://fedorahosted.org/389/ticket/49079#comment:6

Comment 15 Sankar Ramalingam 2017-05-16 00:33:53 UTC
To verify, I executed ticket49079_test.py script based on https://pagure.io/389-ds-base/issue/49079#comment-117791

[0 root@qeos-38 tickets]# py.test -s -v ticket49079_test.py 
===================================================== test session starts =====================================================
platform linux2 -- Python 2.7.5, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/bin/python
cachedir: .cache


INFO:lib389:Starting total init cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
('Update succeeded: status ', '0 Total update succeeded')
INFO:dirsrvtests.tests.tickets.ticket49079_test:Replication is working.
INFO:lib389:Pausing replication cn=meTo_localhost:39001,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
INFO:lib389:Pausing replication cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
INFO:lib389:Resuming replication cn=meTo_localhost:39001,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
INFO:lib389:Resuming replication cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
PASSED
Instance slapd-master_1 removed.
Instance slapd-master_2 removed.


==================== 1 passed in 41.48 seconds ======================

Marking the bug as Verified since the tests PASSed.

DS build: 1.3.6.1
389-ds-base: 1.3.6.1-13.el7
nss: 3.28.4-8.el7
nspr: 4.13.1-1.0.el7_3
openldap: 2.4.44-4.el7
svrcore: 4.1.3-2.el7

Comment 16 errata-xmlrpc 2017-08-01 21:14:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2086

