Bug 1410514

Summary: deadlock on cos cache rebuild
Product: Red Hat Enterprise Linux 7
Component: 389-ds-base
Version: 7.3
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Nikhil Dehadrai <ndehadra>
Assignee: Noriko Hosoi <nhosoi>
QA Contact: Viktor Ashirov <vashirov>
CC: gparente, mkosek, mreynolds, msauton, nkinder, nsoman, pvoborni, rcritten, rmeggins, sramling, tbordaz, wibrown
Target Milestone: rc
Keywords: TestBlocker, ZStream
Hardware: All
OS: Linux
Type: Bug
Fixed In Version: 389-ds-base-1.3.6.1-3.el7
Doc Type: No Doc Update
Doc Text: See 7.3.z bug 1414678.
Clones: 1414678
Bug Blocks: 1414678
Last Closed: 2017-08-01 21:14:10 UTC

Description Nikhil Dehadrai 2017-01-05 16:10:50 UTC
Description of problem:
The following error is received when running "ipa domainlevel-set 1" after upgrading the IPA server from RHEL 7.2.z to RHEL 7.3.2:

ipa: ERROR: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST'


Version-Release number of selected component (if applicable):
ipa-server-4.4.0-14.el7_3.4.x86_64


Steps: (Upgrade from 7.2.z > 7.3.2)
====================================
1) Install the master on RHEL 7.2.z (in my case ipa-server.x86_64 0:4.2.0-15.el7_2.19).
2) Install a replica on RHEL 7.2.z against the master from step 1, using the ipa-replica-prepare command.
3) Stop the replica server using "ipactl stop".
4) Configure RHEL 7.3.2 repos on the master and the replica.
5) Upgrade the master to RHEL 7.3.2 and stop it using "ipactl stop".
6) Start the replica using "ipactl start" and upgrade it to RHEL 7.3.2 using "yum -y update 'ipa*' sssd".
7) Start the master server using "ipactl start".
8) Run "kinit admin" on both the master and the replica.
9) Run "ipa domainlevel-set 1" on both the master and the replica.


Actual results:
1) Both the master and the replica are upgraded successfully after steps 5 and 6.
2) After step 9, the following error message is received on the MASTER:
#ipa domainlevel-set 1
ipa: ERROR: Domain Level cannot be raised to 1, server <replica.testrelm.test> does not support it.

3) After step 9, the following error message is received on the REPLICA:
ipa domainlevel-set 1
ipa: ERROR: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST'


Expected results:
After step 9, the following error message should be shown on both the IPA MASTER and the REPLICA:
ipa: ERROR: Domain Level cannot be raised to 1, existing replication conflicts have to be resolved.

Comment 3 thierry bordaz 2017-01-05 17:41:11 UTC
    - The server hangs because of a deadlock between two threads:
     252 dd=84 locks held 0    write locks 0    pid/thread 4927/140354647430912 (7FA6DCE71700)
     252 READ          1 WAIT    userRoot/id2entry.db      page        102

80002084 dd= 1 locks held 23   write locks 16   pid/thread 4927/140354179274496 (7FA6C0FF9700)
80002084 WRITE         1 HELD    userRoot/id2entry.db      page        102

    - The deadlock is due to locks taken in the opposite order (a stand-alone
      sketch follows this list):
        - Thread 41: cos_cache lock, then db lock
        - Thread 20: db lock, then cos_cache lock
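
      For illustration, this is the textbook two-mutex inversion. Below is a
      minimal stand-alone sketch with hypothetical mutex names; the real locks
      are the COS change lock and a Berkeley DB page lock on
      userRoot/id2entry.db (page 102 in the dump above).

/* Two threads take the same two mutexes in opposite order; the sleeps
 * widen the race window so this reliably deadlocks.
 * Build with: cc -pthread deadlock.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t cos_cache_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t db_page_lock   = PTHREAD_MUTEX_INITIALIZER;

/* Models thread 41 (cos_cache_create): cache lock first, then db lock. */
static void *rebuild_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&cos_cache_lock);   /* rebuilding the cache... */
    sleep(1);
    pthread_mutex_lock(&db_page_lock);     /* ...then reads cos defs from db */
    pthread_mutex_unlock(&db_page_lock);
    pthread_mutex_unlock(&cos_cache_lock);
    return NULL;
}

/* Models thread 20 (ADD of a cos definition): db page lock first, then
 * the cache lock when the change notification is delivered. */
static void *add_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&db_page_lock);     /* holds the id2entry page */
    sleep(1);
    pthread_mutex_lock(&cos_cache_lock);   /* blocks forever */
    pthread_mutex_unlock(&cos_cache_lock);
    pthread_mutex_unlock(&db_page_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, rebuild_thread, NULL);
    pthread_create(&t2, NULL, add_thread, NULL);
    pthread_join(t1, NULL);                /* never returns: deadlocked */
    pthread_join(t2, NULL);
    puts("no deadlock this run");
    return 0;
}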

    - Thread 41 is cos_cache_create. It is the thread responsible for rebuilding
      the cos cache when it is notified that a cos definition was updated.

      Thread 20 is adding an entry (and owns some db pages).
        - The entry is in conflict -> this triggers a bepreop to rename it
        - The entry is a cos definition, so when it is renamed (by the bepreop)
          it notifies thread 41 to rebuild the cache

   - The cos definition is "cn=Default Password Policy,cn=computers,cn=accounts,<suffix>"
    
   - When thread 20 tries to notify thread 41, thread 41 was already processing
     a cache rebuild. This bug is likely triggered when several cos cache
     rebuilds happen at the same time. A conflict on a cos definition entry is
     a good way to trigger this, but I think it can happen independently.
(gdb) thread 20
[Switching to thread 20 (Thread 0x7fa6c0ff9700 (LWP 4955))]
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 2:  movl    %edx, %eax
(gdb) frame 5
#5  0x00007fa6e6c4ff04 in cos_cache_change_notify (pb=pb@entry=0x7fa68c239200) at ldap/servers/plugins/cos/cos_cache.c:3372
3372            slapi_lock_mutex(change_lock);
(gdb) list 3370,3376
3370        if(do_update)
3371        {
3372            slapi_lock_mutex(change_lock);
3373            slapi_notify_condvar( something_changed, 1 );
3374            cos_cache_notify_flag = 1;
3375            slapi_unlock_mutex(change_lock);
3376        }
(gdb) print cos_cache_notify_flag
$38 = 1

     - The deadlock is timing-dependent, but in the freeipa tests it happens very frequently.

    In conclusion:

        This is a DS bug in the COS component. It likely shows up in the freeipa
        context because the cos cache is put under load by conflicts on cos
        definitions.
        No immediate fix has been identified.
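
    Not the actual upstream patch (comment 12 below links the real fix); the
    following is just a minimal sketch, with hypothetical helper names, of the
    conventional way to break such an inversion: consume the notification flag
    under change_lock, then release the lock before any database access, so the
    rebuild thread never holds the cache lock and a db page lock at the same
    time.

/* Sketch only: the writer signals under change_lock; the rebuild loop
 * consumes the flag, DROPS the lock, and only then touches the database.
 * A change arriving during a rebuild just sets the flag again and causes
 * one more pass, so no notification is lost. */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t change_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  something_changed = PTHREAD_COND_INITIALIZER;
static bool cos_cache_notify_flag = false;

/* Writer side (thread 20's position). It may hold db page locks here, but
 * the holder of change_lock never waits on the db, so this cannot deadlock. */
void notify_change(void)
{
    pthread_mutex_lock(&change_lock);
    cos_cache_notify_flag = true;
    pthread_cond_signal(&something_changed);
    pthread_mutex_unlock(&change_lock);
}

/* Rebuild side (thread 41's position). */
void *rebuild_loop(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&change_lock);
        while (!cos_cache_notify_flag)
            pthread_cond_wait(&something_changed, &change_lock);
        cos_cache_notify_flag = false;      /* consume the request */
        pthread_mutex_unlock(&change_lock); /* drop lock BEFORE db access */

        /* rebuild_cos_cache_from_db();  -- hypothetical; may take db page
         * locks safely here, since change_lock is no longer held. */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, rebuild_loop, NULL);
    notify_change();   /* triggers one rebuild pass */
    sleep(1);
    return 0;
}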

Comment 5 thierry bordaz 2017-01-05 17:43:53 UTC

IMHO this bug is independent of https://bugzilla.redhat.com/show_bug.cgi?id=1404338

Comment 6 Nikhil Dehadrai 2017-01-06 06:29:43 UTC
The issue is also seen with a similar upgrade setup from RHEL 7.3 to RHEL 7.3.2.

MASTER:
####################################

Before Upgrade:
================
[root@auto-hv-01-guest07 ~]# ipa domainlevel-get
-----------------------
Current domain level: 1
-----------------------

After Upgrade:
=================
[root@auto-hv-01-guest07 ~]# ipa domainlevel-set 1
ipa: ERROR: no modifications to be performed
[root@auto-hv-01-guest07 ~]# ipa domainlevel-get
-----------------------
Current domain level: 1
-----------------------


REPLICA:
####################################

Before Upgrade:
================
[root@auto-hv-01-guest05 ~]# ipa domainlevel-get
-----------------------
Current domain level: 1
-----------------------

After Upgrade:
=================
[root@auto-hv-01-guest05 ~]# ipa domainlevel-set 1
ipa: ERROR: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST'
[root@auto-hv-01-guest05 ~]# ipa domainlevel-get
ipa: ERROR: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (2529639068): Cannot contact any KDC for realm 'TESTRELM.TEST'

Comment 7 Petr Vobornik 2017-01-06 09:39:01 UTC
The test deliberately tries to create replication conflicts, and thus break IPA, in order to verify bug 1404338. But this risky behavior broke IPA in a different way, producing another issue. The test cannot be used for verification of bug 1404338, but that doesn't mean there isn't another way to verify bug 1404338.

Comment 10 thierry bordaz 2017-01-06 15:11:19 UTC
This bug is a DS bug. The hang can be reproduced without IDM upgrade scenario.
An upstream ticket is opened: https://fedorahosted.org/389/ticket/49079

Comment 11 Petr Vobornik 2017-01-10 14:15:19 UTC
Moving to DS component according to comment 10.

Comment 12 thierry bordaz 2017-01-11 12:49:52 UTC
Fixed upstream https://fedorahosted.org/389/ticket/49079#comment:6

Comment 15 Sankar Ramalingam 2017-05-16 00:33:53 UTC
To verify, I executed the ticket49079_test.py script, based on https://pagure.io/389-ds-base/issue/49079#comment-117791

[0 root@qeos-38 tickets]# py.test -s -v ticket49079_test.py 
===================================================== test session starts =====================================================
platform linux2 -- Python 2.7.5, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/bin/python
cachedir: .cache


INFO:lib389:Starting total init cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
('Update succeeded: status ', '0 Total update succeeded')
INFO:dirsrvtests.tests.tickets.ticket49079_test:Replication is working.
INFO:lib389:Pausing replication cn=meTo_localhost:39001,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
INFO:lib389:Pausing replication cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
INFO:lib389:Resuming replication cn=meTo_localhost:39001,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
INFO:lib389:Resuming replication cn=meTo_localhost:39002,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
PASSED
Instance slapd-master_1 removed.
Instance slapd-master_2 removed.


==================== 1 passed in 41.48 seconds ======================

Marking the bug as Verified since the test passed.

DS build: 1.3.6.1
389-ds-base: 1.3.6.1-13.el7
nss: 3.28.4-8.el7
nspr: 4.13.1-1.0.el7_3
openldap: 2.4.44-4.el7
svrcore: 4.1.3-2.el7

Comment 16 errata-xmlrpc 2017-08-01 21:14:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2086