Bug 852202 - Ipa master system initiated more than a dozen simultaneous replication sessions, shut itself down and wiped out its db
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
Version: 6.3
Hardware: x86_64 Linux
Priority: medium  Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Rich Megginson
QA Contact: Sankar Ramalingam
Depends On:
Blocks:
Reported: 2012-08-27 16:42 EDT by baiesi
Modified: 2013-02-21 03:20 EST
CC: 4 users

See Also:
Fixed In Version: 389-ds-base-1.2.11.14-1.el6
Doc Type: Bug Fix
Doc Text:
Cause: Replication with a heavily loaded consumer that has trouble keeping up with its suppliers.
Consequence: During an incremental update, the consumer suddenly goes into total update mode, wipes out the database, then finally aborts replication with an error.
Fix: Prevent a race condition in the consumer update code. The connection is a shared resource that must be protected against access from multiple threads and against multiple operations on the same connection.
Result: The server should not spontaneously go into total update mode and wipe out the database.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 03:20:40 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
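The race described in the Doc Text above lives in 389-ds-base's C consumer code; as a language-neutral illustration of the idea behind the fix (serializing all access to the shared replication connection behind a lock), here is a toy Python sketch. All names here (ReplicaConnection, send_update) are hypothetical and are not the actual 389-ds-base API.

```python
import threading

class ReplicaConnection:
    """Toy stand-in for the replication connection shared by consumer threads."""
    def __init__(self):
        self._lock = threading.Lock()
        self._busy = False
        self.ops = 0

    def send_update(self):
        # The connection is a shared resource: take the lock so only one
        # thread (and one operation) uses it at a time. Without the lock,
        # interleaved operations can corrupt the session state.
        with self._lock:
            assert not self._busy, "connection used by two operations at once"
            self._busy = True
            self.ops += 1
            self._busy = False

def run(n_threads=12, ops_per_thread=100):
    """Hammer one shared connection from many threads; returns total ops."""
    conn = ReplicaConnection()
    def worker():
        for _ in range(ops_per_thread):
            conn.send_update()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return conn.ops
```

With the lock in place every update is serialized and the busy check never fires; remove the lock and the interleaving the bug describes becomes possible.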
Description baiesi 2012-08-27 16:42:15 EDT
Description of problem:
The IPA LDAP server shut itself down while running, and the admin was unable to restart it.  Dev indicates the system initiated more than a dozen simultaneous replication sessions where only 1 should be running.  The db was wiped pretty clean.  Currently unable to recover the server with ipa-replica-manage re-initialize.

Test Env:
2 Ipa Masters
2 Ipa Clients

What was Active:
-Sudo client running 10 runtime allowed sudo command threads, each thread with a 5 sec delay.
-One admin client creating 10 sudo rules with a 5 min delay after each rule.
-2 UIs were up; no load applied via the IPA UI.

Version-Release number of selected component (if applicable):
ipa-server-2.2.0-16.el6.x86_64
389-ds-base-1.2.10.2-19.el6_3.x86_64

How reproducible:
intermittent, have not yet reproduced...

Symptoms:

* Sudo client load began failing, Can't contact LDAP server
* UI not connecting, kinit failing
* Admin load not connecting, kinit failing

*Could not restart via ipactl:
[root@sti-high-1 slapd-TESTRELM-COM]# ipactl start
Starting Directory Service
Starting dirsrv: 
    PKI-IPA...[  OK  ]
    TESTRELM-COM...[  OK  ]
Failed to read data from Directory Service: Failed to get list of services to probe status!
Configured hostname 'sti-high-1.testrelm.com' does not match any master server in LDAP:
No master found because of error: {'matched': 'dc=testrelm,dc=com', 'desc': 'No such object'}
Shutting down
Shutting down dirsrv: 
    PKI-IPA...[  OK  ]
    TESTRELM-COM...[  OK  ]



***/var/log/dirsrv/slapd-TESTRELM-COM/errors ,last few messages as it went down...

[27/Aug/2012:12:32:21 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Warning: unable to send endReplication extended operation (Timed out)
[27/Aug/2012:12:32:26 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth resumed
[27/Aug/2012:12:35:45 -0400] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=testrelm,dc=com is going offline; disabling replication
[27/Aug/2012:12:36:11 -0400] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=testrelm,dc=com: LDAP error - 1
[27/Aug/2012:12:36:12 -0400] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[27/Aug/2012:12:36:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:36:52 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:37:12 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:37:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:37:52 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:38:12 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:38:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:38:46 -0400] - ERROR bulk import abandoned
[27/Aug/2012:12:38:46 -0400] - import userRoot: Aborting all Import threads...
[27/Aug/2012:12:38:52 -0400] - import userRoot: Import threads aborted.
[27/Aug/2012:12:38:52 -0400] - import userRoot: Closing files...
[27/Aug/2012:12:38:52 -0400] - import userRoot: Import failed.


*Ldap db was wiped out on one of the Ipa Master servers
[root@sti-high-1 slapd-TESTRELM-COM]# ldapsearch -x -D "cn=Directory Manager" -w Secret123 -b "dc=testrelm,dc=com"
# extended LDIF
#
# LDAPv3
# base <dc=testrelm,dc=com> with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#

# compat, testrelm.com
dn: cn=compat,dc=testrelm,dc=com
objectClass: extensibleObject
cn: compat

# groups, compat, testrelm.com
dn: cn=groups,cn=compat,dc=testrelm,dc=com
objectClass: extensibleObject
cn: groups

# ng, compat, testrelm.com
dn: cn=ng,cn=compat,dc=testrelm,dc=com
objectClass: extensibleObject
cn: ng

# users, compat, testrelm.com
dn: cn=users,cn=compat,dc=testrelm,dc=com
objectClass: extensibleObject
cn: users

# sudoers, testrelm.com
dn: ou=sudoers,dc=testrelm,dc=com
objectClass: extensibleObject
ou: sudoers

# search result
search: 2
result: 32 No such object

# numResponses: 6
# numEntries: 5


*The other ipa master server info:
[root@sti-high-2 ~]# ldapsearch -x -D "cn=Directory Manager" -w Secret123 -b "cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com"
# extended LDIF
#
# LDAPv3
# base <cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com> with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#

# masters, ipa, etc, testrelm.com
dn: cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
cn: masters
objectClass: nsContainer
objectClass: top

# sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
cn: sti-high-1.testrelm.com
objectClass: top
objectClass: nsContainer

# CA, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=CA,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=c
 om
cn: CA
ipaConfigString: enabledService
ipaConfigString: startOrder 50
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# KDC, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=KDC,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=
 com
cn: KDC
ipaConfigString: enabledService
ipaConfigString: startOrder 10
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# KPASSWD, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=KPASSWD,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm
 ,dc=com
cn: KPASSWD
ipaConfigString: enabledService
ipaConfigString: startOrder 20
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# MEMCACHE, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=MEMCACHE,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrel
 m,dc=com
cn: MEMCACHE
ipaConfigString: enabledService
ipaConfigString: startOrder 39
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# HTTP, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=HTTP,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc
 =com
cn: HTTP
ipaConfigString: enabledService
ipaConfigString: startOrder 40
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# DNS, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=DNS,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=
 com
cn: DNS
ipaConfigString: enabledService
ipaConfigString: startOrder 30
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
objectClass: top
objectClass: nsContainer
cn: sti-high-2.testrelm.com

# KDC, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=KDC,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=
 com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 10
cn: KDC

# KPASSWD, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=KPASSWD,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm
 ,dc=com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 20
cn: KPASSWD

# MEMCACHE, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=MEMCACHE,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrel
 m,dc=com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 39
cn: MEMCACHE

# HTTP, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=HTTP,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc
 =com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 40
cn: HTTP

# DNS, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=DNS,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=
 com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 30
cn: DNS

# search result
search: 2
result: 0 Success

# numResponses: 15
# numEntries: 14


Additional Dev Comments:
Rich Megginson's chat comments indicate the issue looks like a combination of:
https://fedorahosted.org/389/ticket/374
https://fedorahosted.org/freeipa/ticket/2842
Comment 2 Colin Walters 2012-08-27 17:05:40 EDT
Wrong component?
Comment 5 Rich Megginson 2012-08-30 10:32:20 EDT
Upstream ticket:
https://fedorahosted.org/389/ticket/442
Comment 6 Rich Megginson 2012-09-07 14:24:42 EDT
Upstream ticket:
https://fedorahosted.org/389/ticket/374
Comment 8 baiesi 2012-10-16 16:52:41 EDT
With IPA server version 3.0.0.rc2, API version 2.44, on RHEL 6.3 installed in my test environment, I'm currently running system tests watching for this specific defect.  The plan is to run the test environment for a minimum of 7 days.  The test env consists of 2 IPA servers and 2 IPA clients.
Comment 9 baiesi 2012-10-26 15:46:06 EDT
Test Type:
-System Test Defect Verification

Test Run Duration:
-10+ days

Test Env /Config:
-IPA Server version 3.0.0.rc2. API version 2.44 on RhelServer 6.3(Santiago)
-Installed 2012-10-12 13:28:23 -04:00 Job=315555
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
ipa-server-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-server-selinux-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-client-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to tests that previously triggered the defect.  The test load runs a runtime ssh/sudo multi-thread load simulating 10 concurrent users while light administration work is applied via the CLI and UI.  Over 10+ days, approx 800k runtime attempts consisting of pos/neg ssh/sudo use cases were applied against the test environment.  The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.
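The shape of this load (a fixed pool of concurrent workers issuing positive/negative attempts and tallying results) could be driven by something like the minimal sketch below. It is an assumption-laden illustration, not the actual test harness: sudo_attempt is a placeholder for what would really be an ssh/sudo invocation against an IPA client.

```python
import concurrent.futures
import random

def sudo_attempt(user, positive):
    """Placeholder for one runtime ssh/sudo attempt against an IPA client.

    A real driver would ssh to a client as `user` and run a sudo command;
    here positive cases just 'succeed' and negative cases are 'denied'.
    """
    return positive

def run_load(n_workers=10, attempts=1000):
    """Run pos/neg attempts on a fixed-size worker pool and tally outcomes."""
    results = {"pass": 0, "denied": 0}
    rng = random.Random(42)  # deterministic pos/neg mix for the sketch
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(sudo_attempt, "user%d" % (i % 100), rng.random() < 0.5)
                   for i in range(attempts)]
        for f in futures:
            results["pass" if f.result() else "denied"] += 1
    return results
```

Every attempt is accounted for as either a pass or a denial, which is the property a long-running verification load of this kind needs in order to report totals after 10+ days.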

Observations:
-I'm not seeing the reported symptoms/results described in this defect. Not reproduced.

Comments:
-Requests have been made to also run this load again on the latest IPA code running under RHEL 6.4.
Comment 10 baiesi 2012-11-13 10:12:36 EST
Test Type:
-System Test Defect Verification

Test Run Duration:
-6+ days

Test Env /Config:
-Red Hat Enterprise Linux Server release 6.4 Beta (Santiago)
-IPA server version 3.0.1.pre1. API version 2.44
-Installed 2012-10-29 10:49:25 -04:00 Job=324310
-Kernel Linux version 2.6.32-330.el6.x86_64
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
ipa-server-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-client-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
libipa_hbac-python-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-ca-theme-9.0.3-7.20121101T0026z.el6.noarch
libipa_hbac-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-common-theme-9.0.3-7.20121101T0026z.el6.noarch
ipa-python-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-admintools-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-server-selinux-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
sssd-client-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
sssd-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-2.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to tests that previously triggered the defect.  The test load runs a runtime ssh/sudo multi-thread load, cycling through 10k IPA users, simulating 10 concurrent users 24/7 while light administration work is applied via the CLI and UI.  Over 6+ days, pos/neg ssh/sudo use cases were applied against the test environment.  The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.

Observations:
-I'm not seeing the reported symptoms/results described in this defect. Not able to reproduce; this is a good thing.

Closing the defect as Verified.
Comment 11 errata-xmlrpc 2013-02-21 03:20:40 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0503.html
