Bug 813964
| Summary: | IPA dirsvr seg-fault during system longevity test | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | baiesi |
| Component: | 389-ds-base | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED ERRATA | QA Contact: | IDM QE LIST <seceng-idm-qe-list> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.3 | CC: | jgalipea, mkosek, rmeggins, shaines |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 389-ds-base-1.2.10.2-7.el6 | Doc Type: | Bug Fix |
Doc Text:
Cause: Delete and search operations are performed against the directory server under high load.
Consequence: The directory server crashes.
Fix: Entries may be deleted out from under a search request, and DB_MULTIPLE does not cope well when entries are removed out from under it. The server now handles this case by not returning deleted entries and not crashing.
Result: The server does not crash when performing searches and deletions under high load.
| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | Environment: | |
| Last Closed: | 2012-06-20 07:15:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | threads 27 4 and 1 (attachment 578485) | | |
Can you provide more detail on the UI not being available? What error are you getting? Does Apache have any logging on the failure?

(In reply to comment #2)
> Can you provide more detail on the UI not being available? What error are you
> getting? Does Apache have any logging on the failure?

If this is the problem that Bruce was having earlier, we need to get a core dump and a stack trace. Yes, this is a 389-ds-base problem - but the hard part will be to reproduce with only 389-ds-base and not IPA.

Created attachment 578485 [details]
threads 27 4 and 1

Update - STI run 2: reproduced the issue, now with a core file. I was able to re-provision the system test environment and reproduce the segfault within 24 hours. This time I enabled debugging, which worked and generated a core file: /var/log/dirsrv/slapd-TESTRELM-COM/core.16339. The file is too big to attach to this defect.

/var/log/messages:
Apr 18 16:21:39 sti-high-1 logger: 2012-04-18 16:21:38 /usr/bin/rhts-test-runner.sh 1210569 105720 hearbeat...
Apr 18 16:22:23 sti-high-1 kernel: ns-slapd[16381]: segfault at 7f9acbbd30cb ip 00007f99f82529bd sp 00007f99cbbd3000 error 4 in libback-ldbm.so[7f99f8222000+8f000]
Apr 18 16:22:23 sti-high-1 named[20108]: LDAP error: Can't contact LDAP server
Apr 18 16:22:23 sti-high-1 named[20108]: connection to the LDAP server was lost
Apr 18 16:22:23 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (KDC returned error string: PROCESS_TGS)
Apr 18 16:22:23 sti-high-1 named[20108]: Failed to init credentials (Generic error (see e-text))
Apr 18 16:22:24 sti-high-1 named[20108]: LDAP error: Can't contact LDAP server

I am able to reproduce the crash. Steps:
1) Set up 2-master replication.
2) On master 1, continuously add 1000 users and then delete them - the users should have many objectclasses:
oclist = ["top", "person", "organizationalperson", "inetorgperson",
"inetuser", "posixaccount", 'uidObject', 'pkiUser', 'pkiCA',
'deltaCRL', 'userSecurityInformation', 'simpleSecurityObject',
'shadowAccount', 'posixGroup', 'inetSubscriber', 'inetAdmin',
'accountPolicy', 'mailRecipient', 'nsMessagingServerUser',
'mailGroup', 'groupOfMailEnhancedUniqueNames', 'netscapeMailServer',
'eduPerson', 'mozillaAbPersonAlpha', 'authorizedServiceObject',
'hostObject', 'calEntry', 'printerServiceAuxClass', 'printerIPP']
The required attributes are sn, cn, uid, uidNumber, gidNumber, homeDirectory, and userPassword.
I also added a description and a 1024-byte userCertificate for good measure. (An illustrative C sketch of this add/delete loop follows the search command below.)
This is different from the IPA schema, but I believe the large number of objectclasses has something to do with the crash.
3) at the same time, do searches like this:
filt='(&(objectclass=top)(objectclass=person)(objectclass=organizationalperson)(objectclass=inetorgperson)(objectclass=inetuser)(objectclass=posixaccount)(objectclass=uidObject)(objectclass=pkiUser)(objectclass=pkiCA)(objectclass=deltaCRL)(objectclass=userSecurityInformation)(objectclass=simpleSecurityObject)(objectclass=shadowAccount)(objectclass=posixGroup)(objectclass=inetSubscriber)(objectclass=inetAdmin)(objectclass=accountPolicy)(objectclass=mailRecipient)(objectclass=nsMessagingServerUser)(objectclass=mailGroup)(objectclass=groupOfMailEnhancedUniqueNames)(objectclass=netscapeMailServer)(objectclass=eduPerson)(objectclass=mozillaAbPersonAlpha)(objectclass=authorizedServiceObject)(objectclass=hostObject)(objectclass=calEntry)(objectclass=printerServiceAuxClass)(objectclass=printerIPP))'
while [ 1 ] ; do ii=10 ; while [ $ii -ge 0 ] ; do ldapsearch -xLLL -h localhost -p 1389 -D "cn=directory manager" -w password -b dc=example,dc=com $filt dn > /dev/null & ii=`expr $ii - 1` ; done ; wait ; done
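To make step 2 more concrete, here is an illustrative C sketch of the add/delete load loop using the OpenLDAP client library. This is an assumption of how such a loop could look, not the actual test script: the URI, bind credentials, NUM_USERS, and attribute values are placeholders, and only a subset of the objectclass list above is shown.

```c
/* Hedged sketch of the step-2 load: repeatedly add and then delete users
 * carrying many objectclasses.  Host, port, credentials and NUM_USERS are
 * placeholder values, not taken from the original report. */
#include <stdio.h>
#include <ldap.h>

#define NUM_USERS 1000

int main(void)
{
    LDAP *ld;
    int version = LDAP_VERSION3;
    char dn[256], uid[32], num[16];
    /* Abbreviated objectclass list - the full list is given above. */
    char *oc_vals[] = {"top", "person", "organizationalperson",
                       "inetorgperson", "inetuser", "posixaccount",
                       "shadowAccount", "mailRecipient", NULL};
    struct berval cred = {8, "password"};

    ldap_initialize(&ld, "ldap://localhost:1389");
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);
    ldap_sasl_bind_s(ld, "cn=directory manager", LDAP_SASL_SIMPLE,
                     &cred, NULL, NULL, NULL);

    for (;;) {
        /* Add NUM_USERS entries with the required attributes. */
        for (int i = 0; i < NUM_USERS; i++) {
            snprintf(uid, sizeof(uid), "user%d", i);
            snprintf(num, sizeof(num), "%d", 1000 + i);
            snprintf(dn, sizeof(dn), "uid=%s,dc=example,dc=com", uid);

            char *uid_vals[]  = {uid, NULL};
            char *num_vals[]  = {num, NULL};
            char *home_vals[] = {"/home/test", NULL};
            char *pw_vals[]   = {"Secret123", NULL};
            LDAPMod oc   = {LDAP_MOD_ADD, "objectClass",   {oc_vals}};
            LDAPMod cn   = {LDAP_MOD_ADD, "cn",            {uid_vals}};
            LDAPMod sn   = {LDAP_MOD_ADD, "sn",            {uid_vals}};
            LDAPMod uidm = {LDAP_MOD_ADD, "uid",           {uid_vals}};
            LDAPMod uidn = {LDAP_MOD_ADD, "uidNumber",     {num_vals}};
            LDAPMod gidn = {LDAP_MOD_ADD, "gidNumber",     {num_vals}};
            LDAPMod home = {LDAP_MOD_ADD, "homeDirectory", {home_vals}};
            LDAPMod pw   = {LDAP_MOD_ADD, "userPassword",  {pw_vals}};
            LDAPMod *mods[] = {&oc, &cn, &sn, &uidm, &uidn, &gidn,
                               &home, &pw, NULL};

            ldap_add_ext_s(ld, dn, mods, NULL, NULL);
        }
        /* Delete them all again, so the searches race against deletions. */
        for (int i = 0; i < NUM_USERS; i++) {
            snprintf(dn, sizeof(dn), "uid=user%d,dc=example,dc=com", i);
            ldap_delete_ext_s(ld, dn, NULL, NULL);
        }
    }
    /* not reached */
    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}
```

The ldapsearch loop in step 3 then hits these same entries while they are being deleted, which is the race that triggers the crash.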
With both loads running, after a few minutes you will get a segfault crash in idl_new_fetch(). The problem is with DB_MULTIPLE_NEXT: the ptr variable holds the offset of the next data item (ID) from the beginning of the buffer, and a value of -1 means the buffer is exhausted and a new buffer needs to be fetched. For some reason, the next-to-last offset is -5; since that points before the beginning of the buffer, it points to random memory, and the attempt to dereference it causes the crash.
I have no idea where the -5 comes from - still investigating.
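For context, here is a minimal, self-contained C sketch of the Berkeley DB bulk-retrieval pattern that idl_new_fetch() relies on. This is not the actual 389-ds-base source; fetch_idl_bulk() and BULK_BUF_SIZE are illustrative names. It shows where a corrupted next-item offset in a DB_MULTIPLE buffer would turn into a read of random memory, matching the crash described above.

```c
/* Illustrative sketch of DB_MULTIPLE bulk retrieval with a cursor.
 * NOT the real idl_new_fetch(); names and buffer size are made up. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <db.h>

#define BULK_BUF_SIZE (64 * 1024)

static int fetch_idl_bulk(DB *db, DB_TXN *txn, DBT *key)
{
    DBC *cursor = NULL;
    DBT data;
    void *ptr, *retdata;
    size_t retlen;
    int rc;

    rc = db->cursor(db, txn, &cursor, 0);
    if (rc != 0)
        return rc;

    memset(&data, 0, sizeof(data));
    data.data  = malloc(BULK_BUF_SIZE);
    data.ulen  = BULK_BUF_SIZE;
    data.flags = DB_DBT_USERMEM;

    /* Position on the key and fill the buffer with as many duplicate
     * data items (entry IDs, in the idl case) as will fit. */
    rc = cursor->c_get(cursor, key, &data, DB_SET | DB_MULTIPLE);
    while (rc == 0) {
        DB_MULTIPLE_INIT(ptr, &data);
        for (;;) {
            /* Each call yields one data item; retdata == NULL means the
             * buffer is exhausted.  The offsets consumed here live in the
             * bulk buffer itself - if one of them is bogus (like the -5
             * seen in the crash), retdata points outside the buffer and
             * reading it segfaults. */
            DB_MULTIPLE_NEXT(ptr, &data, retdata, retlen);
            if (retdata == NULL)
                break;
            printf("got %zu-byte item\n", retlen);
            /* ... append the ID to the result list here ... */
        }
        /* Fetch the next buffer-load of duplicates for the same key. */
        rc = cursor->c_get(cursor, key, &data, DB_NEXT_DUP | DB_MULTIPLE);
    }

    free(data.data);
    cursor->c_close(cursor);
    return (rc == DB_NOTFOUND) ? 0 : rc;
}
```

Per the Doc Text above, the eventual fix is for the server to handle entries that are deleted out from under the search by not returning them, rather than crashing.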
Upstream ticket: https://fedorahosted.org/389/ticket/347
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause: Delete and search operations are performed against the directory server under high load.
Consequence: The directory server crashes.
Fix: Entries may be deleted out from under a search request, and DB_MULTIPLE does not cope well when entries are removed out from under it. The server now handles this case by not returning deleted entries and not crashing.
Result: The server does not crash when performing searches and deletions under high load.
Ran the same tests against the IPA test environment. The defect did not recur during the test run. Closing as Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0813.html
***Test Type: IPA System Longevity Test

***Issue:
I'm running a RHEL 6.3 nightly build IPA longevity test. The IPA Master dirsvr generated a seg-fault, resulting in IPA client test load failures after approximately 81 hours of runtime.

***Actual results:
1. The IPA Master DirSvr generated a seg-fault.
2. After an IPA restart, the systems were left in a state where users were not in sync from master to slave when searching for users with 'ipa user-find lname' commands. The IPA client also became unable to run ipa commands.
3. At light vuser thread loads the admin throughput was easily saturated.
4. No fail-over to the Slave occurred, but I am not sure how fail-over was implemented to work.

***Expected results:
Expected IPA to continue executing the virtual test client admin and authn load for the extended period of time, at a certain level of throughput based on the user population in the DirSvr.

****Server Symptoms:

IPA Master:
Status indicating Directory Server had STOPPED
kinit admin - failure getting initial credentials
ipa user-find - failure
UI - https://sti-high-1.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)
No core files available
No abrt crashes detected
Nothing under /var/spool/abrt other than abrt-db

IPA Slave:
Status indicating all is RUNNING
kinit admin - working
ipa user-find - functional
UI - https://sti-high-2.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)

IPA Client:
kinit admin - failure
ipa user-find - failures
UI - https://sti-high-X.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)
IPA admin test clients began failing load
IPA authn test clients began failing load

For this run, I increased the load from 5 to 10 virtual user threads for both the authn and ipa admin use cases (light load). Increasing the admin load, however, never increased the transaction throughput; it hit a saturation level. The systems running the test environment are the high-end machines listed below. After the failure and restart of the IPA servers, the systems were left in a strange state where the IPA master and slave users were not in sync, and my IPA client had issues with basic kinit / ipa commands.

***Next Steps:
To reproduce and get the test environment back to a known state, I re-provisioned the test environment, built a 1k user population and enabled debugging on the systems. No adjustments were made out of the box after the system was installed to accommodate any performance issues. I'm running the load again; so far so good at 24 hrs. I have been conversing with the developer, Rich Megginson, on the issues at hand.

***Repeatability:
With the released version of IPA on RHEL 6.2 I had successfully caused seg faults. These issues had been seen before, and defects were written against them by others. Those defects have been resolved as far as I know, so the intent now was to test specifically against the RHEL 6.3 IPA nightly. This seg-fault issue has happened once so far with this version (this defect). The current attempt is to enable debug and rerun the tests once more to collect core files for dev to debug the issues.

****System Test Env: IPA Master, Slave and Client
Red Hat Enterprise Linux Server release 6.3 Beta (Santiago)

Component:
rpm -qif ipa
Name        : ipa-server                      Relocations: (not relocatable)
Version     : 2.2.0                           Vendor: Red Hat, Inc.
Release     : 9.el6                           Build Date: Tue 10 Apr 2012 08:39:54 PM EDT
Install Date: Tue 17 Apr 2012 11:02:23 AM EDT  Build Host: hs20-bc2-5.build.redhat.com
Group       : System Environment/Base         Source RPM: ipa-2.2.0-9.el6.src.rpm
Size        : 3771583                         License: GPLv3+

rpm -qi 389-ds-base
Name        : 389-ds-base                     Relocations: (not relocatable)
Version     : 1.2.10.2                        Vendor: Red Hat, Inc.
Release     : 6.el6                           Build Date: Tue 10 Apr 2012 04:31:17 PM EDT
Install Date: Tue 17 Apr 2012 11:02:23 AM EDT  Build Host: hs20-bc2-5.build.redhat.com
Group       : System Environment/Daemons      Source RPM: 389-ds-base-1.2.10.2-6.el6.src.rpm
Size        : 4850666

Steps to Reproduce:
1. Provision IPA Nightly RHEL 6.3 Master
2. Provision IPA Nightly RHEL 6.3 Slave
3. Provision IPA Nightly RHEL 6.3 Client
4. Apply and run kerb authn and ipa admin load through STI to collect system test data and drive the tests at defined schedules
5. Increase load to 10 vusers for ipa admin
6. Increase load to 10 vusers for authn

***Additional info:

****Longevity Test Failure:
Run Identifier: run1-J216284
Start Date: 2012-04-11 15:49:00
Failure Date: 2012-04-16 01:50:00

****Test load:
Test failures after 81 hours of load. Both the Kerberos authn and the ipa administrative load are now failing. The IPA Master indicates the Dir Server is in a Stopped state. System logs indicate "ns-slapd[14134]: segfault at 7fac485d40cb ip 00007fab6f83f93d sp 00007fab485d4010 error 4 in libback-ldbm.s"

****Logs:
/var/log/messages snip:
Apr 16 00:42:01 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Request is a replay)
Apr 16 00:42:07 sti-high-1 logger: 2012-04-16 00:42:07 /usr/bin/rhts-test-runner.sh 1210506 400080 hearbeat...
Apr 16 00:46:22 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Request is a replay)
Apr 16 00:48:08 sti-high-1 logger: 2012-04-16 00:48:07 /usr/bin/rhts-test-runner.sh 1210506 400440 hearbeat...
Apr 16 00:48:50 sti-high-1 kernel: ns-slapd[14134]: segfault at 7fac485d40cb ip 00007fab6f83f93d sp 00007fab485d4010 error 4 in libback-ldbm.so[7fab6f80f000+8e000]
Apr 16 00:48:50 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (KDC returned error string: PROCESS_TGS)
Apr 16 00:48:50 sti-high-1 named[16591]: LDAP error: Can't contact LDAP server
Apr 16 00:48:50 sti-high-1 named[16591]: connection to the LDAP server was lost
Apr 16 00:48:50 sti-high-1 named[16591]: bind to LDAP server failed: Can't contact LDAP server

/var/log/DirSvr/slapd-TestRelm-COM/errors snip:
[13/Apr/2012:13:26:36 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized)) errno 110 (Connection timed out)
[13/Apr/2012:13:26:36 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: error -2 (Local error)
[13/Apr/2012:13:26:36 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized))
[13/Apr/2012:13:26:40 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth resumed
[15/Apr/2012:07:33:33 -0400] entryrdn-index - _entryrdn_put_data: Adding the parent link (P28354) failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
[16/Apr/2012:00:45:46 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!
[16/Apr/2012:00:46:53 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!
[16/Apr/2012:00:47:56 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!

****Test Load:
IPA Admin Use Case - Test 1:
- Positive Test Scenario
- Reached 10 Virtual Users
- ipa use case - (find, delete), add, find, disable, enable, modify, then delete
- 1 sec delay per thread
- Total users cycling in test is 30

Kerberos Authn - Test 1:
- Positive Test Scenario
- Reached 10 Virtual Users
- 1 sec delay per thread
- Total users in test is 1000

****Beaker Provisioned: Job J216284

****Hardware in System Test Environment:
Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors

sti-high-1.testrelm.com 10.16.24.27 IPA Master - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh1-lv_root    50G  5.1G   42G  11% /
tmpfs                             7.8G  288K  7.8G   1% /dev/shm
/dev/sda1                         485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh1-lv_home   1.8T  196M  1.7T   1% /home

sti-high-2.testrelm.com 10.16.24.29 IPA Slave - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh2-lv_root    50G  4.5G   43G  10% /
tmpfs                             7.8G   61M  7.8G   1% /dev/shm
/dev/sda1                         485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh2-lv_home   1.8T  196M  1.7T   1% /home

sti-high-2.testrelm.com 10.16.24.31 IPA Client - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh3-lv_root    50G  3.3G   44G   7% /
tmpfs                             7.8G  260K  7.8G   1% /dev/shm
/dev/sda1                         485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh3-lv_home   1.8T  197M  1.7T   1% /home