Bug 852202 - Ipa master system initiated more than a dozen simultaneous replication sessions, shut itself down and wiped out its db
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base
Version: 6.3
Hardware: x86_64 Linux
Priority: medium  Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Rich Megginson
QA Contact: Sankar Ramalingam
Depends On:
Blocks:
Reported: 2012-08-27 16:42 EDT by baiesi
Modified: 2013-02-21 03:20 EST
CC: 4 users

See Also:
Fixed In Version: 389-ds-base-1.2.11.14-1.el6
Doc Type: Bug Fix
Doc Text:
Cause: Replication with a heavily loaded consumer that has trouble keeping up with its suppliers.
Consequence: During an incremental update, the consumer suddenly goes into total update mode, wipes out the database, then finally aborts replication with an error.
Fix: Prevent a race condition in the consumer update code. The connection is a shared resource that must be protected against access from multiple threads and against multiple operations on the same connection.
Result: The server should not spontaneously go into total update mode and wipe out the database.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 03:20:40 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
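The race described in the Doc Text above lives in 389-ds-base's C consumer code; as a language-neutral illustration of the idea behind the fix (serializing all access to the shared replication connection behind a lock), here is a toy Python sketch. All names here (ReplicaConnection, send_update) are hypothetical and are not the actual 389-ds-base API.

```python
import threading

class ReplicaConnection:
    """Toy stand-in for the replication connection shared by consumer threads."""
    def __init__(self):
        self._lock = threading.Lock()
        self._busy = False
        self.ops = 0

    def send_update(self):
        # The connection is a shared resource: take the lock so only one
        # thread (and one operation) uses it at a time. Without the lock,
        # interleaved operations can corrupt the session state.
        with self._lock:
            assert not self._busy, "connection used by two operations at once"
            self._busy = True
            self.ops += 1
            self._busy = False

def run(n_threads=12, ops_per_thread=100):
    """Hammer one shared connection from many threads; returns total ops."""
    conn = ReplicaConnection()
    def worker():
        for _ in range(ops_per_thread):
            conn.send_update()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return conn.ops
```

With the lock in place every update is serialized and the busy check never fires; remove the lock and the interleaving the bug describes becomes possible.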
Description baiesi 2012-08-27 16:42:15 EDT
Description of problem:
The IPA LDAP server shut itself down while running, and the admin was unable to restart it.  Dev indicates the system initiated more than a dozen simultaneous replication sessions where only 1 should be running.  The db was wiped pretty clean.  Currently unable to recover the server with ipa-replica-manage re-initialize.

Test Env:
2 Ipa Masters
2 Ipa Clients

What was Active:
-Sudo client running 10 runtime allowed sudo command threads, each thread with a 5 sec delay.
-One admin client creating 10 sudo rules with a 5 min delay after each rule.
-2 UIs were up; no load applied via the IPA UI.

Version-Release number of selected component (if applicable):
ipa-server-2.2.0-16.el6.x86_64
389-ds-base-1.2.10.2-19.el6_3.x86_64

How reproducible:
intermittent, have not yet reproduced...

Symptoms:

* Sudo client load began failing, Can't contact LDAP server
* UI not connecting, kinit failing
* Admin load not connecting, kinit failing

*Could not restart via ipactl:
[root@sti-high-1 slapd-TESTRELM-COM]# ipactl start
Starting Directory Service
Starting dirsrv: 
    PKI-IPA...[  OK  ]
    TESTRELM-COM...[  OK  ]
Failed to read data from Directory Service: Failed to get list of services to probe status!
Configured hostname 'sti-high-1.testrelm.com' does not match any master server in LDAP:
No master found because of error: {'matched': 'dc=testrelm,dc=com', 'desc': 'No such object'}
Shutting down
Shutting down dirsrv: 
    PKI-IPA...[  OK  ]
    TESTRELM-COM...[  OK  ]



***/var/log/dirsrv/slapd-TESTRELM-COM/errors ,last few messages as it went down...

[27/Aug/2012:12:32:21 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Warning: unable to send endReplication extended operation (Timed out)
[27/Aug/2012:12:32:26 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth resumed
[27/Aug/2012:12:35:45 -0400] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=testrelm,dc=com is going offline; disabling replication
[27/Aug/2012:12:36:11 -0400] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=testrelm,dc=com: LDAP error - 1
[27/Aug/2012:12:36:12 -0400] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[27/Aug/2012:12:36:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:36:52 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:37:12 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:37:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:37:52 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:38:12 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:38:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
[27/Aug/2012:12:38:46 -0400] - ERROR bulk import abandoned
[27/Aug/2012:12:38:46 -0400] - import userRoot: Aborting all Import threads...
[27/Aug/2012:12:38:52 -0400] - import userRoot: Import threads aborted.
[27/Aug/2012:12:38:52 -0400] - import userRoot: Closing files...
[27/Aug/2012:12:38:52 -0400] - import userRoot: Import failed.


*Ldap db was wiped out on one of the Ipa Master servers
[root@sti-high-1 slapd-TESTRELM-COM]# ldapsearch -x -D "cn=Directory Manager" -w Secret123 -b "dc=testrelm,dc=com"
# extended LDIF
#
# LDAPv3
# base <dc=testrelm,dc=com> with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#

# compat, testrelm.com
dn: cn=compat,dc=testrelm,dc=com
objectClass: extensibleObject
cn: compat

# groups, compat, testrelm.com
dn: cn=groups,cn=compat,dc=testrelm,dc=com
objectClass: extensibleObject
cn: groups

# ng, compat, testrelm.com
dn: cn=ng,cn=compat,dc=testrelm,dc=com
objectClass: extensibleObject
cn: ng

# users, compat, testrelm.com
dn: cn=users,cn=compat,dc=testrelm,dc=com
objectClass: extensibleObject
cn: users

# sudoers, testrelm.com
dn: ou=sudoers,dc=testrelm,dc=com
objectClass: extensibleObject
ou: sudoers

# search result
search: 2
result: 32 No such object

# numResponses: 6
# numEntries: 5


*The other ipa master server info:
[root@sti-high-2 ~]# ldapsearch -x -D "cn=Directory Manager" -w Secret123 -b "cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com"
# extended LDIF
#
# LDAPv3
# base <cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com> with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#

# masters, ipa, etc, testrelm.com
dn: cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
cn: masters
objectClass: nsContainer
objectClass: top

# sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
cn: sti-high-1.testrelm.com
objectClass: top
objectClass: nsContainer

# CA, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=CA,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=c
 om
cn: CA
ipaConfigString: enabledService
ipaConfigString: startOrder 50
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# KDC, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=KDC,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=
 com
cn: KDC
ipaConfigString: enabledService
ipaConfigString: startOrder 10
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# KPASSWD, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=KPASSWD,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm
 ,dc=com
cn: KPASSWD
ipaConfigString: enabledService
ipaConfigString: startOrder 20
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# MEMCACHE, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=MEMCACHE,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrel
 m,dc=com
cn: MEMCACHE
ipaConfigString: enabledService
ipaConfigString: startOrder 39
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# HTTP, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=HTTP,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc
 =com
cn: HTTP
ipaConfigString: enabledService
ipaConfigString: startOrder 40
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# DNS, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=DNS,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=
 com
cn: DNS
ipaConfigString: enabledService
ipaConfigString: startOrder 30
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top

# sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
objectClass: top
objectClass: nsContainer
cn: sti-high-2.testrelm.com

# KDC, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=KDC,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=
 com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 10
cn: KDC

# KPASSWD, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=KPASSWD,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm
 ,dc=com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 20
cn: KPASSWD

# MEMCACHE, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=MEMCACHE,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrel
 m,dc=com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 39
cn: MEMCACHE

# HTTP, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=HTTP,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc
 =com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 40
cn: HTTP

# DNS, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
dn: cn=DNS,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=
 com
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
ipaConfigString: enabledService
ipaConfigString: startOrder 30
cn: DNS

# search result
search: 2
result: 0 Success

# numResponses: 15
# numEntries: 14


Additional Dev Comments:
Rich Megginson's chat comments indicate the issue looks like a combination of:
https://fedorahosted.org/389/ticket/374
https://fedorahosted.org/freeipa/ticket/2842
Comment 2 Colin Walters 2012-08-27 17:05:40 EDT
Wrong component?
Comment 5 Rich Megginson 2012-08-30 10:32:20 EDT
Upstream ticket:
https://fedorahosted.org/389/ticket/442
Comment 6 Rich Megginson 2012-09-07 14:24:42 EDT
Upstream ticket:
https://fedorahosted.org/389/ticket/374
Comment 8 baiesi 2012-10-16 16:52:41 EDT
With IPA server version 3.0.0.rc2, API version 2.44, on RHEL 6.3 installed in my test environment, I'm currently running system tests watching for this specific defect.  The plan is to run the test environment for a minimum of 7 days.  The test env consists of 2 IPA servers and 2 IPA clients.
Comment 9 baiesi 2012-10-26 15:46:06 EDT
Test Type:
-System Test Defect Verification

Test Run Duration:
-10+ days

Test Env /Config:
-IPA Server version 3.0.0.rc2. API version 2.44 on RhelServer 6.3(Santiago)
-Installed 2012-10-12 13:28:23 -04:00 Job=315555
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
ipa-server-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-server-selinux-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
ipa-client-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to tests that previously triggered the defect.  The test load runs a runtime ssh/sudo multi-thread load simulating 10 concurrent users while light administration work is applied via the CLI and UI.  Over 10+ days, approx 800k runtime attempts consisting of pos/neg ssh/sudo use cases were applied against the test environment.  The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.
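The shape of this load (a fixed pool of concurrent workers issuing positive/negative attempts and tallying results) could be driven by something like the minimal sketch below. It is an assumption-laden illustration, not the actual test harness: sudo_attempt is a placeholder for what would really be an ssh/sudo invocation against an IPA client.

```python
import concurrent.futures
import random

def sudo_attempt(user, positive):
    """Placeholder for one runtime ssh/sudo attempt against an IPA client.

    A real driver would ssh to a client as `user` and run a sudo command;
    here positive cases just 'succeed' and negative cases are 'denied'.
    """
    return positive

def run_load(n_workers=10, attempts=1000):
    """Run pos/neg attempts on a fixed-size worker pool and tally outcomes."""
    results = {"pass": 0, "denied": 0}
    rng = random.Random(42)  # deterministic pos/neg mix for the sketch
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(sudo_attempt, "user%d" % (i % 100), rng.random() < 0.5)
                   for i in range(attempts)]
        for f in futures:
            results["pass" if f.result() else "denied"] += 1
    return results
```

Every attempt is accounted for as either a pass or a denial, which is the property a long-running verification load of this kind needs in order to report totals after 10+ days.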

Observations:
-I'm not seeing the reported symptoms/results described in this defect. Not reproduced.

Comments:
-Requests have been made to also run this load again on the latest IPA code running under RHEL 6.4.
Comment 10 baiesi 2012-11-13 10:12:36 EST
Test Type:
-System Test Defect Verification

Test Run Duration:
-6+ days

Test Env /Config:
-Red Hat Enterprise Linux Server release 6.4 Beta (Santiago)
-IPA server version 3.0.1.pre1. API version 2.44
-Installed 2012-10-29 10:49:25 -04:00 Job=324310
-Kernel Linux version 2.6.32-330.el6.x86_64
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
ipa-server-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-client-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
libipa_hbac-python-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-ca-theme-9.0.3-7.20121101T0026z.el6.noarch
libipa_hbac-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
ipa-pki-common-theme-9.0.3-7.20121101T0026z.el6.noarch
ipa-python-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-admintools-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
ipa-server-selinux-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
sssd-client-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
sssd-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
389-ds-base-libs-1.2.11.15-2.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-2.el6.x86_64
389-ds-base-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to tests that previously triggered the defect.  The test load runs a runtime ssh/sudo multi-thread load, cycling through 10k IPA users, simulating 10 concurrent users 24/7 while light administration work is applied via the CLI and UI.  Over 6+ days, pos/neg ssh/sudo use cases were applied against the test environment.  The UI admin load use case was to add/delete sudo rules from each IPA server simultaneously each day.

Observations:
-I'm not seeing the reported symptoms/results described in this defect. Not able to reproduce; this is a good thing.

Closing the defect as Verified.
Comment 11 errata-xmlrpc 2013-02-21 03:20:40 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0503.html
