Bug 813964
| Summary: | IPA dirsvr seg-fault during system longevity test | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | baiesi |
| Component: | 389-ds-base | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED ERRATA | QA Contact: | IDM QE LIST <seceng-idm-qe-list> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.3 | CC: | jgalipea, mkosek, rmeggins, shaines |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 389-ds-base-1.2.10.2-7.el6 | Doc Type: | Bug Fix |
Doc Text:
Cause: Delete and search operations are performed against the directory server under high load.
Consequence: The directory server crashes.
Fix: Entries may be deleted out from under a search request, and DB_MULTIPLE does not cope well when entries are removed out from under it. The server now handles this case by not returning deleted entries and not crashing.
Result: The server does not crash when performing searches and deletions under high load.
| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | Environment: | |
| Last Closed: | 2012-06-20 07:15:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | threads 27 4 and 1 (attachment 578485) | | |
Can you provide more detail on the UI not being available? What error are you getting? Does Apache have any logging on the failure?

(In reply to comment #2)
> Can you provide more detail on the UI not being available? What error are you
> getting? Does Apache have any logging on the failure?

If this is the problem that Bruce was having earlier, we need to get a core dump and a stack trace. Yes, this is a 389-ds-base problem - but the hard part will be to reproduce with only 389-ds-base and not IPA.

Created attachment 578485 [details]
threads 27 4 and 1

Update - STI run 2: reproduced the issue, now with a core file. I was able to re-provision the system test environment and reproduce the segfault within 24 hours. This time I enabled debugging, which worked and generated a core file: /var/log/dirsrv/slapd-TESTRELM-COM/core.16339. The file is too big to attach to this defect.

/var/log/messages:
Apr 18 16:21:39 sti-high-1 logger: 2012-04-18 16:21:38 /usr/bin/rhts-test-runner.sh 1210569 105720 hearbeat...
Apr 18 16:22:23 sti-high-1 kernel: ns-slapd[16381]: segfault at 7f9acbbd30cb ip 00007f99f82529bd sp 00007f99cbbd3000 error 4 in libback-ldbm.so[7f99f8222000+8f000]
Apr 18 16:22:23 sti-high-1 named[20108]: LDAP error: Can't contact LDAP server
Apr 18 16:22:23 sti-high-1 named[20108]: connection to the LDAP server was lost
Apr 18 16:22:23 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (KDC returned error string: PROCESS_TGS)
Apr 18 16:22:23 sti-high-1 named[20108]: Failed to init credentials (Generic error (see e-text))
Apr 18 16:22:24 sti-high-1 named[20108]: LDAP error: Can't contact LDAP server

I am able to reproduce the crash. Steps:
1) Set up 2-master replication.
2) On master 1, continuously add 1000 users and then delete them - the users should have many objectclasses:
oclist = ["top", "person", "organizationalperson", "inetorgperson",
"inetuser", "posixaccount", 'uidObject', 'pkiUser', 'pkiCA',
'deltaCRL', 'userSecurityInformation', 'simpleSecurityObject',
'shadowAccount', 'posixGroup', 'inetSubscriber', 'inetAdmin',
'accountPolicy', 'mailRecipient', 'nsMessagingServerUser',
'mailGroup', 'groupOfMailEnhancedUniqueNames', 'netscapeMailServer',
'eduPerson', 'mozillaAbPersonAlpha', 'authorizedServiceObject',
'hostObject', 'calEntry', 'printerServiceAuxClass', 'printerIPP']
The required attributes are sn, cn, uid, uidNumber, gidNumber, homeDirectory, and userPassword.
I also added a description and a 1024-byte userCertificate for good measure. (An illustrative C sketch of this add/delete loop follows the search command below.)
This is different from the IPA schema, but I believe the large number of objectclasses has something to do with the crash.
3) at the same time, do searches like this:
filt='(&(objectclass=top)(objectclass=person)(objectclass=organizationalperson)(objectclass=inetorgperson)(objectclass=inetuser)(objectclass=posixaccount)(objectclass=uidObject)(objectclass=pkiUser)(objectclass=pkiCA)(objectclass=deltaCRL)(objectclass=userSecurityInformation)(objectclass=simpleSecurityObject)(objectclass=shadowAccount)(objectclass=posixGroup)(objectclass=inetSubscriber)(objectclass=inetAdmin)(objectclass=accountPolicy)(objectclass=mailRecipient)(objectclass=nsMessagingServerUser)(objectclass=mailGroup)(objectclass=groupOfMailEnhancedUniqueNames)(objectclass=netscapeMailServer)(objectclass=eduPerson)(objectclass=mozillaAbPersonAlpha)(objectclass=authorizedServiceObject)(objectclass=hostObject)(objectclass=calEntry)(objectclass=printerServiceAuxClass)(objectclass=printerIPP))'
while [ 1 ] ; do ii=10 ; while [ $ii -ge 0 ] ; do ldapsearch -xLLL -h localhost -p 1389 -D "cn=directory manager" -w password -b dc=example,dc=com $filt dn > /dev/null & ii=`expr $ii - 1` ; done ; wait ; done
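To make step 2 more concrete, here is an illustrative C sketch of the add/delete load loop using the OpenLDAP client library. This is an assumption of how such a loop could look, not the actual test script: the URI, bind credentials, NUM_USERS, and attribute values are placeholders, and only a subset of the objectclass list above is shown.

```c
/* Hedged sketch of the step-2 load: repeatedly add and then delete users
 * carrying many objectclasses.  Host, port, credentials and NUM_USERS are
 * placeholder values, not taken from the original report. */
#include <stdio.h>
#include <ldap.h>

#define NUM_USERS 1000

int main(void)
{
    LDAP *ld;
    int version = LDAP_VERSION3;
    char dn[256], uid[32], num[16];
    /* Abbreviated objectclass list - the full list is given above. */
    char *oc_vals[] = {"top", "person", "organizationalperson",
                       "inetorgperson", "inetuser", "posixaccount",
                       "shadowAccount", "mailRecipient", NULL};
    struct berval cred = {8, "password"};

    ldap_initialize(&ld, "ldap://localhost:1389");
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);
    ldap_sasl_bind_s(ld, "cn=directory manager", LDAP_SASL_SIMPLE,
                     &cred, NULL, NULL, NULL);

    for (;;) {
        /* Add NUM_USERS entries with the required attributes. */
        for (int i = 0; i < NUM_USERS; i++) {
            snprintf(uid, sizeof(uid), "user%d", i);
            snprintf(num, sizeof(num), "%d", 1000 + i);
            snprintf(dn, sizeof(dn), "uid=%s,dc=example,dc=com", uid);

            char *uid_vals[]  = {uid, NULL};
            char *num_vals[]  = {num, NULL};
            char *home_vals[] = {"/home/test", NULL};
            char *pw_vals[]   = {"Secret123", NULL};
            LDAPMod oc   = {LDAP_MOD_ADD, "objectClass",   {oc_vals}};
            LDAPMod cn   = {LDAP_MOD_ADD, "cn",            {uid_vals}};
            LDAPMod sn   = {LDAP_MOD_ADD, "sn",            {uid_vals}};
            LDAPMod uidm = {LDAP_MOD_ADD, "uid",           {uid_vals}};
            LDAPMod uidn = {LDAP_MOD_ADD, "uidNumber",     {num_vals}};
            LDAPMod gidn = {LDAP_MOD_ADD, "gidNumber",     {num_vals}};
            LDAPMod home = {LDAP_MOD_ADD, "homeDirectory", {home_vals}};
            LDAPMod pw   = {LDAP_MOD_ADD, "userPassword",  {pw_vals}};
            LDAPMod *mods[] = {&oc, &cn, &sn, &uidm, &uidn, &gidn,
                               &home, &pw, NULL};

            ldap_add_ext_s(ld, dn, mods, NULL, NULL);
        }
        /* Delete them all again, so the searches race against deletions. */
        for (int i = 0; i < NUM_USERS; i++) {
            snprintf(dn, sizeof(dn), "uid=user%d,dc=example,dc=com", i);
            ldap_delete_ext_s(ld, dn, NULL, NULL);
        }
    }
    /* not reached */
    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}
```

The ldapsearch loop in step 3 then hits these same entries while they are being deleted, which is the race that triggers the crash.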
With both loads running, after a few minutes you will get a segfault crash in idl_new_fetch(). The problem is with DB_MULTIPLE_NEXT: the ptr variable holds the offset of the next data item (ID) from the beginning of the buffer, and a value of -1 means the buffer is exhausted and a new buffer needs to be fetched. For some reason, the next-to-last offset is -5; since that points before the beginning of the buffer, it points to random memory, and the attempt to dereference it causes the crash.
I have no idea where the -5 comes from - still investigating.
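For context, here is a minimal, self-contained C sketch of the Berkeley DB bulk-retrieval pattern that idl_new_fetch() relies on. This is not the actual 389-ds-base source; fetch_idl_bulk() and BULK_BUF_SIZE are illustrative names. It shows where a corrupted next-item offset in a DB_MULTIPLE buffer would turn into a read of random memory, matching the crash described above.

```c
/* Illustrative sketch of DB_MULTIPLE bulk retrieval with a cursor.
 * NOT the real idl_new_fetch(); names and buffer size are made up. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <db.h>

#define BULK_BUF_SIZE (64 * 1024)

static int fetch_idl_bulk(DB *db, DB_TXN *txn, DBT *key)
{
    DBC *cursor = NULL;
    DBT data;
    void *ptr, *retdata;
    size_t retlen;
    int rc;

    rc = db->cursor(db, txn, &cursor, 0);
    if (rc != 0)
        return rc;

    memset(&data, 0, sizeof(data));
    data.data  = malloc(BULK_BUF_SIZE);
    data.ulen  = BULK_BUF_SIZE;
    data.flags = DB_DBT_USERMEM;

    /* Position on the key and fill the buffer with as many duplicate
     * data items (entry IDs, in the idl case) as will fit. */
    rc = cursor->c_get(cursor, key, &data, DB_SET | DB_MULTIPLE);
    while (rc == 0) {
        DB_MULTIPLE_INIT(ptr, &data);
        for (;;) {
            /* Each call yields one data item; retdata == NULL means the
             * buffer is exhausted.  The offsets consumed here live in the
             * bulk buffer itself - if one of them is bogus (like the -5
             * seen in the crash), retdata points outside the buffer and
             * reading it segfaults. */
            DB_MULTIPLE_NEXT(ptr, &data, retdata, retlen);
            if (retdata == NULL)
                break;
            printf("got %zu-byte item\n", retlen);
            /* ... append the ID to the result list here ... */
        }
        /* Fetch the next buffer-load of duplicates for the same key. */
        rc = cursor->c_get(cursor, key, &data, DB_NEXT_DUP | DB_MULTIPLE);
    }

    free(data.data);
    cursor->c_close(cursor);
    return (rc == DB_NOTFOUND) ? 0 : rc;
}
```

Per the Doc Text above, the eventual fix is for the server to handle entries that are deleted out from under the search by not returning them, rather than crashing.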
Upstream ticket: https://fedorahosted.org/389/ticket/347
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause: Delete and search operations are performed against the directory server under high load.
Consequence: The directory server crashes.
Fix: Entries may be deleted out from under a search request, and DB_MULTIPLE does not cope well when entries are removed out from under it. The server now handles this case by not returning deleted entries and not crashing.
Result: The server does not crash when performing searches and deletions under high load.
Ran the same tests against the IPA test environment. The defect did not recur during the test run. Closing as Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0813.html
***Test Type: IPA System Longevity Test

***Issue:
I'm running a RHEL 6.3 nightly build IPA longevity test. The IPA Master dirsvr generated a seg-fault, resulting in IPA client test load failures after approximately 81 hours of runtime.

***Actual results:
1. The IPA Master DirSvr generated a seg-fault.
2. After an IPA restart, the systems were left in a state where users were not in sync from master to slave when searching for users with 'ipa user-find lname' commands. The IPA client also became unable to run ipa commands.
3. At light vuser thread loads the admin throughput was easily saturated.
4. No fail-over to the Slave occurred, but I am not sure how fail-over was implemented to work.

***Expected results:
Expected IPA to continue executing the virtual test client admin and authn load for the extended period of time, at a certain level of throughput based on the user population in the DirSvr.

****Server Symptoms:

IPA Master:
Status indicating Directory Server had STOPPED
kinit admin - failure getting initial credentials
ipa user-find - failure
UI - https://sti-high-1.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)
No core files available
No abrt crashes detected
Nothing under /var/spool/abrt other than abrt-db

IPA Slave:
Status indicating all is RUNNING
kinit admin - working
ipa user-find - functional
UI - https://sti-high-2.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)

IPA Client:
kinit admin - failure
ipa user-find - failures
UI - https://sti-high-X.testrelm.com/ipa/ui/ - Server Not Found (Not Accessible)
IPA admin test clients began failing load
IPA authn test clients began failing load

For this run, I increased the load from 5 to 10 virtual user threads for both the authn and ipa admin use cases (light load). Increasing the admin load, however, never increased the transaction throughput; it hit a saturation level. The systems running the test environment are the high-end machines listed below. After the failure and restart of the IPA servers, the systems were left in a strange state where the IPA master and slave users were not in sync, and my IPA client had issues with basic kinit / ipa commands.

***Next Steps:
To reproduce and get the test environment back to a known state, I re-provisioned the test environment, built a 1k user population and enabled debugging on the systems. No adjustments were made out of the box after the system was installed to accommodate any performance issues. I'm running the load again; so far so good at 24 hrs. I have been conversing with the developer, Rich Megginson, on the issues at hand.

***Repeatability:
With the released version of IPA on RHEL 6.2 I had successfully caused seg faults. These issues had been seen before, and defects were written against them by others. Those defects have been resolved as far as I know, so the intent now was to test specifically against the RHEL 6.3 IPA nightly. This seg-fault issue has happened once so far with this version (this defect). The current attempt is to enable debug and rerun the tests once more to collect core files for dev to debug the issues.

****System Test Env: IPA Master, Slave and Client
Red Hat Enterprise Linux Server release 6.3 Beta (Santiago)

Component:
rpm -qif ipa
Name        : ipa-server                      Relocations: (not relocatable)
Version     : 2.2.0                           Vendor: Red Hat, Inc.
Release     : 9.el6                           Build Date: Tue 10 Apr 2012 08:39:54 PM EDT
Install Date: Tue 17 Apr 2012 11:02:23 AM EDT  Build Host: hs20-bc2-5.build.redhat.com
Group       : System Environment/Base         Source RPM: ipa-2.2.0-9.el6.src.rpm
Size        : 3771583                         License: GPLv3+

rpm -qi 389-ds-base
Name        : 389-ds-base                     Relocations: (not relocatable)
Version     : 1.2.10.2                        Vendor: Red Hat, Inc.
Release     : 6.el6                           Build Date: Tue 10 Apr 2012 04:31:17 PM EDT
Install Date: Tue 17 Apr 2012 11:02:23 AM EDT  Build Host: hs20-bc2-5.build.redhat.com
Group       : System Environment/Daemons      Source RPM: 389-ds-base-1.2.10.2-6.el6.src.rpm
Size        : 4850666

Steps to Reproduce:
1. Provision IPA Nightly RHEL 6.3 Master
2. Provision IPA Nightly RHEL 6.3 Slave
3. Provision IPA Nightly RHEL 6.3 Client
4. Apply and run kerb authn and ipa admin load through STI to collect system test data and drive the tests at defined schedules
5. Increase load to 10 vusers for ipa admin
6. Increase load to 10 vusers for authn

***Additional info:

****Longevity Test Failure:
Run Identifier: run1-J216284
Start Date: 2012-04-11 15:49:00
Failure Date: 2012-04-16 01:50:00

****Test load:
Test failures after 81 hours of load. Both the Kerberos authn and the ipa administrative load are now failing. The IPA Master indicates the Dir Server is in a Stopped state. System logs indicate "ns-slapd[14134]: segfault at 7fac485d40cb ip 00007fab6f83f93d sp 00007fab485d4010 error 4 in libback-ldbm.s"

****Logs:
/var/log/messages snip:
Apr 16 00:42:01 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Request is a replay)
Apr 16 00:42:07 sti-high-1 logger: 2012-04-16 00:42:07 /usr/bin/rhts-test-runner.sh 1210506 400080 hearbeat...
Apr 16 00:46:22 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Request is a replay)
Apr 16 00:48:08 sti-high-1 logger: 2012-04-16 00:48:07 /usr/bin/rhts-test-runner.sh 1210506 400440 hearbeat...
Apr 16 00:48:50 sti-high-1 kernel: ns-slapd[14134]: segfault at 7fac485d40cb ip 00007fab6f83f93d sp 00007fab485d4010 error 4 in libback-ldbm.so[7fab6f80f000+8e000]
Apr 16 00:48:50 sti-high-1 httpd: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (KDC returned error string: PROCESS_TGS)
Apr 16 00:48:50 sti-high-1 named[16591]: LDAP error: Can't contact LDAP server
Apr 16 00:48:50 sti-high-1 named[16591]: connection to the LDAP server was lost
Apr 16 00:48:50 sti-high-1 named[16591]: bind to LDAP server failed: Can't contact LDAP server

/var/log/DirSvr/slapd-TestRelm-COM/errors snip:
[13/Apr/2012:13:26:36 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized)) errno 110 (Connection timed out)
[13/Apr/2012:13:26:36 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: error -2 (Local error)
[13/Apr/2012:13:26:36 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: An invalid name was supplied (Hostname cannot be canonicalized))
[13/Apr/2012:13:26:40 -0400] NSMMReplicationPlugin - agmt="cn=meTosti-high-2.testrelm.com" (sti-high-2:389): Replication bind with GSSAPI auth resumed
[15/Apr/2012:07:33:33 -0400] entryrdn-index - _entryrdn_put_data: Adding the parent link (P28354) failed: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock (-30994)
[16/Apr/2012:00:45:46 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!
[16/Apr/2012:00:46:53 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!
[16/Apr/2012:00:47:56 -0400] ipapwd_setPasswordHistory - [file ipapwd_common.c, line 926]: failed to generate new password history!

****Test Load:
IPA Admin Use Case - Test 1:
- Positive Test Scenario
- Reached 10 Virtual Users
- ipa use case - (find, delete), add, find, disable, enable, modify, then delete
- 1 sec delay per thread
- Total users cycling in test is 30

Kerberos Authn - Test 1:
- Positive Test Scenario
- Reached 10 Virtual Users
- 1 sec delay per thread
- Total users in test is 1000

****Beaker Provisioned: Job J216284

****Hardware in System Test Environment:
Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors

sti-high-1.testrelm.com 10.16.24.27 IPA Master - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh1-lv_root    50G  5.1G   42G  11% /
tmpfs                             7.8G  288K  7.8G   1% /dev/shm
/dev/sda1                         485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh1-lv_home   1.8T  196M  1.7T   1% /home

sti-high-2.testrelm.com 10.16.24.29 IPA Slave - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh2-lv_root    50G  4.5G   43G  10% /
tmpfs                             7.8G   61M  7.8G   1% /dev/shm
/dev/sda1                         485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh2-lv_home   1.8T  196M  1.7T   1% /home

sti-high-2.testrelm.com 10.16.24.31 IPA Client - Dell PowerEdge M710 Blade Server, 2 Socket, 8 Core, 16 processors, 16GB RAM, x86-64
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_stihigh3-lv_root    50G  3.3G   44G   7% /
tmpfs                             7.8G  260K  7.8G   1% /dev/shm
/dev/sda1                         485M   37M  423M   8% /boot
/dev/mapper/vg_stihigh3-lv_home   1.8T  197M  1.7T   1% /home