Bug 852202
| Summary: | Ipa master system initiated more than a dozen simultaneous replication sessions, shut itself down and wiped out its db | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | baiesi |
| Component: | 389-ds-base | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED ERRATA | QA Contact: | Sankar Ramalingam <sramling> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.3 | CC: | jgalipea, mkosek, nkinder, rmeggins |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | 389-ds-base-1.2.11.14-1.el6 | Doc Type: | Bug Fix |
| Doc Text: | Cause: Replication with a heavily loaded consumer that has trouble keeping up with its suppliers. <br> Consequence: During an incremental update, the consumer suddenly goes into total update mode, wipes out the database, then finally aborts replication with an error. <br> Fix: Prevent a race condition in the consumer update code. The connection is a shared resource that must be protected against access from multiple threads and against multiple operations on the same connection. <br> Result: The server should not spontaneously go into total update mode and wipe out the database. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-21 08:20:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
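The Doc Text above describes the fix as guarding the consumer's connection, a shared resource, against access from multiple threads and against overlapping operations on the same connection. As a rough sketch of that locking technique only (the names `repl_connection`, `conn_acquire`, and `conn_release` are invented for illustration and are not the actual 389-ds-base code):

```c
#include <pthread.h>

/* Hypothetical stand-in for the consumer's replication connection.
 * Illustrates the technique named in the Doc Text: one lock serializing
 * access, plus a flag so a second operation cannot piggyback on a
 * connection that is already mid-operation. */
typedef struct repl_connection {
    pthread_mutex_t lock;   /* serializes all access to this connection */
    int in_use;             /* set while an operation is in flight */
} repl_connection;

/* Claim exclusive use of the connection before starting an operation.
 * Returns 0 on success, -1 if another operation already owns it. */
static int conn_acquire(repl_connection *c)
{
    pthread_mutex_lock(&c->lock);
    if (c->in_use) {                 /* refuse to interleave operations */
        pthread_mutex_unlock(&c->lock);
        return -1;
    }
    c->in_use = 1;
    pthread_mutex_unlock(&c->lock);
    return 0;
}

/* Release the connection once the operation has fully completed. */
static void conn_release(repl_connection *c)
{
    pthread_mutex_lock(&c->lock);
    c->in_use = 0;
    pthread_mutex_unlock(&c->lock);
}
```

With a guard like this, a second replication session that tries to start an operation on a busy connection fails fast instead of silently interleaving with the first, which is the kind of interleaving the Doc Text blames for the consumer flipping into total update mode.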
Wrong component?

Upstream ticket: https://fedorahosted.org/389/ticket/442

Upstream ticket: https://fedorahosted.org/389/ticket/374

With IPA server version 3.0.0.rc2, API version 2.44 on RHEL 6.3 installed in my test environment, I'm currently running system tests watching for this specific defect. My plan is to run the test environment for a minimum of 7 days. The test env consists of 2 Ipa servers and 2 Ipa clients.

Test Type:
-System Test Defect Verification

Test Run Duration:
-10+ days

Test Env/Config:
-IPA Server version 3.0.0.rc2, API version 2.44 on RhelServer 6.3 (Santiago)
-Installed 2012-10-12 13:28:23 -04:00, Job=315555
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
-ipa-server-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
-ipa-server-selinux-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
-ipa-client-3.0.0-104.20121011T1837zgit9a305c5.el6.x86_64
-389-ds-base-1.2.11.15-2.el6.x86_64
-389-ds-base-libs-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to tests that previously triggered the defect. The test load runs a runtime ssh/sudo multi-thread load simulating 10 concurrent users while applying light administration work via the cli and UI. Over 10+ days, approx 800k runtime attempts consisting of pos/neg ssh/sudo use cases were applied against the test environment. The UI admin load use case was to add/delete sudo rules from each Ipa Server simultaneously each day.

Observations:
-I'm not seeing the reported symptoms/results described in this defect. Not reproduced.

Comments:
-Requests have been made to also run this load again on the latest Ipa code running under RHEL 6.4.

Test Type:
-System Test Defect Verification

Test Run Duration:
-6+ days

Test Env/Config:
-Red Hat Enterprise Linux Server release 6.4 Beta (Santiago)
-IPA server version 3.0.1.pre1, API version 2.44
-Installed 2012-10-29 10:49:25 -04:00, Job=324310
-Kernel Linux version 2.6.32-330.el6.x86_64
-Installed 2 Ipa Servers and 2 Ipa clients
-10K Users divided into 10 user groups
-11 sudo rules and approx 1500 sudo commands

Installed Packages:
-ipa-server-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
-ipa-client-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
-libipa_hbac-python-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
-ipa-pki-ca-theme-9.0.3-7.20121101T0026z.el6.noarch
-libipa_hbac-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
-ipa-pki-common-theme-9.0.3-7.20121101T0026z.el6.noarch
-ipa-python-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
-ipa-admintools-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
-ipa-server-selinux-3.0.0-106.20121101T0117zgit5ff4025.el6.x86_64
-sssd-client-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
-sssd-1.9.90-0.20121101T0057zgitf46bf56.el6.x86_64
-389-ds-base-libs-1.2.11.15-2.el6.x86_64
-389-ds-base-debuginfo-1.2.11.15-2.el6.x86_64
-389-ds-base-1.2.11.15-2.el6.x86_64

Load Applied:
-The test environment has been subjected to tests that previously triggered the defect. The test load runs a runtime ssh/sudo multi-thread load, cycling through 10k ipa users and simulating 10 concurrent users 24/7, while applying light administration work via the cli and UI. Over 6+ days, pos/neg ssh/sudo use cases were applied against the test environment. The UI admin load use case was to add/delete sudo rules from each Ipa Server simultaneously each day.
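For context, here is a minimal sketch of the shape of that ssh/sudo thread load: 10 concurrent simulated users, each looping on an ssh/sudo attempt with a pause between attempts. The user names, target host, and 5-second cadence are placeholders; the actual test harness is not part of this report.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_USERS 10   /* 10 concurrent simulated users, as in the test */

/* Each thread loops forever, attempting one ssh+sudo use case per cycle.
 * Host and user names are placeholder values for illustration. */
static void *user_thread(void *arg)
{
    long id = (long)arg;
    char cmd[256];
    for (;;) {
        snprintf(cmd, sizeof(cmd),
                 "ssh user%ld@sti-high-1.testrelm.com sudo -l >/dev/null 2>&1",
                 id);
        if (system(cmd) != 0)
            fprintf(stderr, "user%ld: ssh/sudo attempt failed\n", id);
        sleep(5);   /* placeholder delay between attempts */
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NUM_USERS];
    for (long i = 0; i < NUM_USERS; i++)
        pthread_create(&tids[i], NULL, user_thread, (void *)i);
    for (long i = 0; i < NUM_USERS; i++)
        pthread_join(tids[i], NULL);   /* threads run until interrupted */
    return 0;
}
```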
Observations:
-I'm not seeing the reported symptoms/results described in this defect. Not able to reproduce; this is a good thing. Closing the defect as Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0503.html
Description of problem:
Ipa ldap shut itself down while running; admin was unable to restart it. Dev indicates the system initiated more than a dozen simultaneous replication sessions where only 1 should be running. Db was wiped pretty clean. Currently unable to recover the server with ipa-replica-manage re-initialize.

Test Env:
-2 Ipa Masters
-2 Ipa Clients

What was Active:
-Sudo client running 10 runtime allowed-sudo-command threads, each thread with a 5 sec delay
-One admin client creating 10 sudo rules, with a 5 min delay after each rule
-2 UIs were currently up, no load applied via the Ipa UI

Version-Release number of selected component (if applicable):
ipa-server-2.2.0-16.el6.x86_64
389-ds-base-1.2.10.2-19.el6_3.x86_64

How reproducible:
intermittent, have not yet reproduced...

Symptoms:
* Sudo client load began failing: Can't contact LDAP server
* UI not connecting, kinit failing
* Admin load not connecting, kinit failing
* Could not restart via ipactl:

    [root@sti-high-1 slapd-TESTRELM-COM]# ipactl start
    Starting Directory Service
    Starting dirsrv:
        PKI-IPA...[  OK  ]
        TESTRELM-COM...[  OK  ]
    Failed to read data from Directory Service: Failed to get list of services to probe status!
    Configured hostname 'sti-high-1.testrelm.com' does not match any master server in LDAP:
    No master found because of error: {'matched': 'dc=testrelm,dc=com', 'desc': 'No such object'}
    Shutting down
    Shutting down dirsrv:
        PKI-IPA...[  OK  ]
        TESTRELM-COM...[  OK  ]

*** /var/log/dirsrv/slapd-TESTRELM-COM/errors, last few messages as it went down. The "import userRoot" lines show the consumer starting a total update (bulk import), which empties the backend before reloading it; here the import processes 0 entries and then aborts, leaving the db wiped:

    [27/Aug/2012:12:32:21 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Warning: unable to send endReplication extended operation (Timed out)
    [27/Aug/2012:12:32:26 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth resumed
    [27/Aug/2012:12:35:45 -0400] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=testrelm,dc=com is going offline; disabling replication
    [27/Aug/2012:12:36:11 -0400] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=testrelm,dc=com: LDAP error - 1
    [27/Aug/2012:12:36:12 -0400] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
    [27/Aug/2012:12:36:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
    [27/Aug/2012:12:36:52 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
    [27/Aug/2012:12:37:12 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
    [27/Aug/2012:12:37:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
    [27/Aug/2012:12:37:52 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
    [27/Aug/2012:12:38:12 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
    [27/Aug/2012:12:38:32 -0400] - import userRoot: Processed 0 entries -- average rate 0.0/sec, recent rate 0.0/sec, hit ratio 0%
    [27/Aug/2012:12:38:46 -0400] - ERROR bulk import abandoned
    [27/Aug/2012:12:38:46 -0400] - import userRoot: Aborting all Import threads...
    [27/Aug/2012:12:38:52 -0400] - import userRoot: Import threads aborted.
    [27/Aug/2012:12:38:52 -0400] - import userRoot: Closing files...
    [27/Aug/2012:12:38:52 -0400] - import userRoot: Import failed.
* Ldap db was wiped out on one of the Ipa Master servers:

    [root@sti-high-1 slapd-TESTRELM-COM]# ldapsearch -x -D "cn=Directory Manager" -w Secret123 -b "dc=testrelm,dc=com"
    # extended LDIF
    #
    # LDAPv3
    # base <dc=testrelm,dc=com> with scope subtree
    # filter: (objectclass=*)
    # requesting: ALL
    #

    # compat, testrelm.com
    dn: cn=compat,dc=testrelm,dc=com
    objectClass: extensibleObject
    cn: compat

    # groups, compat, testrelm.com
    dn: cn=groups,cn=compat,dc=testrelm,dc=com
    objectClass: extensibleObject
    cn: groups

    # ng, compat, testrelm.com
    dn: cn=ng,cn=compat,dc=testrelm,dc=com
    objectClass: extensibleObject
    cn: ng

    # users, compat, testrelm.com
    dn: cn=users,cn=compat,dc=testrelm,dc=com
    objectClass: extensibleObject
    cn: users

    # sudoers, testrelm.com
    dn: ou=sudoers,dc=testrelm,dc=com
    objectClass: extensibleObject
    ou: sudoers

    # search result
    search: 2
    result: 32 No such object

    # numResponses: 6
    # numEntries: 5

* The other ipa master server info:

    [root@sti-high-2 ~]# ldapsearch -x -D "cn=Directory Manager" -w Secret123 -b "cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com"
    # extended LDIF
    #
    # LDAPv3
    # base <cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com> with scope subtree
    # filter: (objectclass=*)
    # requesting: ALL
    #

    # masters, ipa, etc, testrelm.com
    dn: cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    cn: masters
    objectClass: nsContainer
    objectClass: top

    # sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    cn: sti-high-1.testrelm.com
    objectClass: top
    objectClass: nsContainer

    # CA, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=CA,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    cn: CA
    ipaConfigString: enabledService
    ipaConfigString: startOrder 50
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top

    # KDC, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=KDC,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    cn: KDC
    ipaConfigString: enabledService
    ipaConfigString: startOrder 10
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top

    # KPASSWD, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=KPASSWD,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    cn: KPASSWD
    ipaConfigString: enabledService
    ipaConfigString: startOrder 20
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top

    # MEMCACHE, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=MEMCACHE,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    cn: MEMCACHE
    ipaConfigString: enabledService
    ipaConfigString: startOrder 39
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top

    # HTTP, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=HTTP,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    cn: HTTP
    ipaConfigString: enabledService
    ipaConfigString: startOrder 40
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top

    # DNS, sti-high-1.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=DNS,cn=sti-high-1.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    cn: DNS
    ipaConfigString: enabledService
    ipaConfigString: startOrder 30
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top

    # sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    objectClass: top
    objectClass: nsContainer
    cn: sti-high-2.testrelm.com

    # KDC, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=KDC,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top
    ipaConfigString: enabledService
    ipaConfigString: startOrder 10
    cn: KDC

    # KPASSWD, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=KPASSWD,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top
    ipaConfigString: enabledService
    ipaConfigString: startOrder 20
    cn: KPASSWD

    # MEMCACHE, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=MEMCACHE,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top
    ipaConfigString: enabledService
    ipaConfigString: startOrder 39
    cn: MEMCACHE

    # HTTP, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=HTTP,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top
    ipaConfigString: enabledService
    ipaConfigString: startOrder 40
    cn: HTTP

    # DNS, sti-high-2.testrelm.com, masters, ipa, etc, testrelm.com
    dn: cn=DNS,cn=sti-high-2.testrelm.com,cn=masters,cn=ipa,cn=etc,dc=testrelm,dc=com
    objectClass: nsContainer
    objectClass: ipaConfigObject
    objectClass: top
    ipaConfigString: enabledService
    ipaConfigString: startOrder 30
    cn: DNS

    # search result
    search: 2
    result: 0 Success

    # numResponses: 15
    # numEntries: 14

Additional Dev Comments:
Rich Megginson's chat comments indicate the issue looks like a combination of:
https://fedorahosted.org/389/ticket/374
https://fedorahosted.org/freeipa/ticket/2842
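As a footnote on verification: the ldapsearch check above (a search under dc=testrelm,dc=com returning result 32, No such object) is what shows the suffix was wiped. The same check can be done programmatically; here is a minimal sketch using the OpenLDAP C client library, reusing the bind DN, password, host, and suffix that appear in this report (treat all of them as example values):

```c
/* Build with: cc -o check_suffix check_suffix.c -lldap -llber */
#include <stdio.h>
#include <string.h>
#include <ldap.h>

int main(void)
{
    LDAP *ld = NULL;
    LDAPMessage *res = NULL;
    int version = LDAP_VERSION3;
    char password[] = "Secret123";   /* example value from the report */
    struct berval cred = { strlen(password), password };

    if (ldap_initialize(&ld, "ldap://sti-high-1.testrelm.com:389") != LDAP_SUCCESS) {
        fprintf(stderr, "ldap_initialize failed\n");
        return 2;
    }
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    if (ldap_sasl_bind_s(ld, "cn=Directory Manager", LDAP_SASL_SIMPLE,
                         &cred, NULL, NULL, NULL) != LDAP_SUCCESS) {
        fprintf(stderr, "bind failed\n");
        return 2;
    }

    /* Base-scope search for the suffix entry itself: LDAP_NO_SUCH_OBJECT
     * (result 32) is the same "No such object" the wiped master returned. */
    int rc = ldap_search_ext_s(ld, "dc=testrelm,dc=com", LDAP_SCOPE_BASE,
                               "(objectclass=*)", NULL, 0,
                               NULL, NULL, NULL, 0, &res);
    if (rc == LDAP_NO_SUCH_OBJECT)
        printf("suffix missing: database looks wiped\n");
    else if (rc == LDAP_SUCCESS)
        printf("suffix present\n");
    else
        fprintf(stderr, "search failed: %s\n", ldap_err2string(rc));

    ldap_msgfree(res);
    ldap_unbind_ext_s(ld, NULL, NULL);
    return rc == LDAP_SUCCESS ? 0 : 1;
}
```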