Bug 1255042

Summary: Slapd crashes reported from replication tests
Product: Red Hat Enterprise Linux 7
Reporter: Sankar Ramalingam <sramling>
Component: 389-ds-base
Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED WORKSFORME
QA Contact: Viktor Ashirov <vashirov>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.2
CC: nkinder, rmeggins, sramling
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-14 12:02:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Backtraces for slapd crashes (flags: none)
  Stacktrace for slapd crash (flags: none)
  Stacktrace for slapd crash (flags: none)
  Stack trace from accPolicy tests (flags: none)
  Stacktrace from mmraccept tests (flags: none)

Description Sankar Ramalingam 2015-08-19 13:32:23 UTC
Created attachment 1064852 [details]
Backtraces for slapd crashes

Description of problem: Slapd crashes coming from multi-operations stress tests.


Version-Release number of selected component (if applicable): 389-ds-base-1.3.4.0-12


How reproducible: Consistently with the latest build of 389-ds-base-1.3.4.0-12


Steps to Reproduce:
1. Run Multi-operations stress tests on a beaker machine.
2. Job - https://beaker.engineering.redhat.com/jobs/1053896
3. Crashes reported from execution

Actual results: Slapd crashes


Expected results: No slapd crash


Additional info:

Comment 1 Sankar Ramalingam 2015-08-19 13:33:33 UTC
Created attachment 1064860 [details]
Stacktrace for slapd crash

Crash report from beaker

Comment 2 Sankar Ramalingam 2015-08-19 13:34:28 UTC
Created attachment 1064867 [details]
Stacktrace for slapd crash

Comment 4 Noriko Hosoi 2015-08-19 16:41:09 UTC
(In reply to Sankar Ramalingam from comment #1)
> Created attachment 1064860 [details]
> Stacktrace for slapd crash
> 
> Crash report from beaker

This is a stacktrace of ns-slapd.  It crashed while getting an entryrdn element from the entryrdn index during a delete operation.


> Sankar Ramalingam 2015-08-19 09:34:28 EDT
> Created attachment 1064867 [details]
> Stacktrace for slapd crash

This is a stacktrace of ldclt (not ns-slapd).


What do "Multi-operations stress tests" do?  Where can I see the test program?  Is it a test against a standalone server or MMR?

Are the cores (especially of ns-slapd) left on the beaker?  Can I see the error and access logs?  Also, ldclt logs?

Would it be possible to run the stress test with valgrind?

Comment 5 Sankar Ramalingam 2015-08-24 18:31:39 UTC
Cloned a beaker job to reproduce the crash. I will re-run the crash tests with Valgrind.
https://beaker.engineering.redhat.com/jobs/1060253

Comment 9 Sankar Ramalingam 2015-08-25 17:38:13 UTC
1. I could reproduce the crash on one of the beaker machines. However, this crash is from ldclt, not from ns-slapd.

Host - vm-idm-002.lab.eng.pnq.redhat.com
Root pw: Redhat123
Coredump - /var/spool/abrt/ccpp-2015-08-25-11:21:28-30568/coredump

2. I ran the stress tests with valgrind and the report is already out. As expected, there was no crash.
Host - vm-idm-017.lab.eng.pnq.redhat.com

Variables set by tests:
        MUOP01_MAX_RANGE=1000000
        MUOP01_NB_LOOPS=17280

I ran it with:
        MUOP01_MAX_RANGE=100000
        MUOP01_NB_LOOPS=1728

Feel free to ask more questions or for further execution.

Comment 10 Noriko Hosoi 2015-08-25 17:49:15 UTC
Thank you, Sankar.

Could you repeat the case 1 (no valgrind) on several beaker machines in parallel?  If the crash is not captured, I will give up...

Thanks...

Comment 11 Sankar Ramalingam 2015-08-25 18:28:32 UTC
(In reply to Noriko Hosoi from comment #10)
> Thank you, Sankar.
> 
> Could you repeat the case 1 (no valgrind) on several beaker machines in
> parallel?  
https://beaker.engineering.redhat.com/jobs/1061677
https://beaker.engineering.redhat.com/jobs/1061678
https://beaker.engineering.redhat.com/jobs/1061679
https://beaker.engineering.redhat.com/jobs/1061680

> If the crash is not captured, I will give up...
> 
> Thanks...

Comment 13 Sankar Ramalingam 2015-08-26 09:21:22 UTC
Created attachment 1067208 [details]
Stack trace from accPolicy tests

Managed to reproduce the crash with the accPolicy acceptance tests. The crash is not specific to stress tests. I have cloned another beaker job with accPolicy tests to reproduce the crash and provide access to the core files.
Attaching the stack trace.

Comment 14 Sankar Ramalingam 2015-08-26 10:42:31 UTC
Created attachment 1067219 [details]
Stacktrace from mmraccept tests

The mmraccept tests are also crashing. Attaching the stack trace.

Comment 16 Sankar Ramalingam 2015-08-27 07:00:26 UTC
The crash is not reproducible for me when I clone beaker jobs or manually trigger jobs from Jenkins. However, it keeps occurring in the automated execution from Jenkins, though not consistently. I am trying a few more runs today to reproduce the crash and to reserve the same machine for troubleshooting.

Comment 17 Sankar Ramalingam 2015-08-27 10:55:30 UTC
Managed to reproduce the crash with the mmraccept tests by manually triggering jobs from Jenkins. The machine is reserved and available for further troubleshooting.

Hostname - apollo.idmqe.lab.eng.bos.redhat.com
Root pw: Redhat123

[root@apollo ~]# find /var -name core*
/var/lib/systemd/coredump
/var/spool/abrt/ccpp-2015-08-27-06:18:17-344/coredump
/var/spool/abrt/ccpp-2015-08-27-06:18:17-344/core_backtrace

I guess the accPolicy tests would also crash the server. The run will reach the accPolicy tests about 5 hours from now. Feel free to access this machine for further investigation.
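
The `find` output above can be narrowed to just the core files with a short script. This is a minimal sketch, not part of the test suite: the function name `find_coredumps` and the demo directory name `ccpp-example` are hypothetical, and it assumes abrt stores each core as a file literally named `coredump`, as in the listing above.

```python
import os
import tempfile


def find_coredumps(root):
    """Return sorted paths of files named exactly 'coredump' under root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Match the exact file name only, so core_backtrace is skipped
        if "coredump" in filenames:
            hits.append(os.path.join(dirpath, "coredump"))
    return sorted(hits)


if __name__ == "__main__":
    # Demo on a throwaway directory that mimics one abrt crash directory
    # (the name "ccpp-example" is made up for illustration)
    with tempfile.TemporaryDirectory() as tmp:
        crash_dir = os.path.join(tmp, "ccpp-example")
        os.makedirs(crash_dir)
        for name in ("coredump", "core_backtrace"):
            open(os.path.join(crash_dir, name), "w").close()
        for path in find_coredumps(tmp):
            print(path)
```

On the reserved host this would presumably be called as `find_coredumps("/var/spool/abrt")`. Unlike `find /var -name core*`, it does not report directories or the `core_backtrace` file.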

Comment 18 Noriko Hosoi 2015-08-27 18:24:29 UTC
Sankar,

I see lots of file system full errors in /var/log/messages.  Could they be related to the test failure?  If so, could you rerun the test with more disk space?

....
Aug 27 13:33:56 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - slapd shutting down - closing down internal subsystems and plugins
Aug 27 13:33:56 apollo ns-slapd: Writing to the errors log failed.  Exiting...
Aug 27 13:33:56 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - Waiting for 4 database threads to stop
Aug 27 13:33:56 apollo ns-slapd: Writing to the errors log failed.  Exiting...
Aug 27 13:33:57 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - All database threads now stopped
Aug 27 13:33:57 apollo ns-slapd: Writing to the errors log failed.  Exiting...
Aug 27 13:33:57 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - slapd shutting down - freed 1 work q stack objects - freed 1 op stack objects
Aug 27 13:33:57 apollo ns-slapd: Writing to the errors log failed.  Exiting...
Aug 27 13:33:58 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - slapd stopped.
....

If you still see the crash on a host with a larger file system, there is another possibility: we learned that openldap was rebased for 7.2, so the issue may be related to the openldap upgrade.  In that case, could you please run the test after downgrading openldap to the 7.1 version?

Thanks.
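
To check quickly whether a host ran out of disk space during a run, the messages file can be scanned for the NSPR error quoted above. This is a rough sketch under stated assumptions: the function name `count_disk_full` is hypothetical, the error code -5956 is taken from the excerpt, and the lines are assumed to follow the "Mon DD HH:MM:SS host process: message" layout shown there.

```python
import re
from collections import Counter

# NSPR "device full" error text, copied from the log excerpt above
DISK_FULL = re.compile(r"Netscape Portable Runtime error -5956")


def count_disk_full(lines):
    """Return a Counter mapping process name to its number of disk-full lines."""
    counts = Counter()
    for line in lines:
        if DISK_FULL.search(line):
            # Assumed layout: month, day, time, host, "process:", message
            parts = line.split(None, 5)
            if len(parts) >= 5:
                counts[parts[4].rstrip(":")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Aug 27 13:33:56 apollo ns-slapd: Failed to write log, Netscape "
        "Portable Runtime error -5956 (The device for storing the file is "
        "full.):  - Waiting for 4 database threads to stop",
        "Aug 27 13:33:56 apollo ns-slapd: Writing to the errors log failed.  "
        "Exiting...",
    ]
    for process, hits in count_disk_full(sample).items():
        print(process, hits)
```

A non-zero count for ns-slapd around the time of a failure would suggest rerunning on a larger file system before digging into the core.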

Comment 19 Sankar Ramalingam 2015-08-28 14:54:07 UTC
The machine has already been returned to the beaker pool and I am doubtful that I could run the tests with a downgraded openldap. Moreover, it's difficult to consistently reproduce the crash by running the same set of tests. So, I feel this should be pushed to the next release unless we figure out a way to reproduce it very consistently.

Comment 20 Noriko Hosoi 2015-08-28 15:32:46 UTC
(In reply to Sankar Ramalingam from comment #19)
> The machine is already returned back to beaker pool and I am doubtful that I
> could run the tests by downgrading openldap. Moreover, its difficult to
> consistently reproduce the crash by running the same set of tests. So, I
> feel this should be pushed to next release unless we figure out a way to
> reproduce it very consistently.

Do you mean you want to stop investigating this crash for now?

Please note that this bug is already targeted for 7.3.0.

Comment 21 Sankar Ramalingam 2015-08-29 12:04:23 UTC
(In reply to Noriko Hosoi from comment #20)
> (In reply to Sankar Ramalingam from comment #19)
> > The machine is already returned back to beaker pool and I am doubtful that I
> > could run the tests by downgrading openldap. Moreover, its difficult to
> > consistently reproduce the crash by running the same set of tests. So, I
> > feel this should be pushed to next release unless we figure out a way to
> > reproduce it very consistently.
> 
> Do you mean you want to stop investigating this crash for now?
Yes. I feel I am spending a lot of time on this with no outcome.
> 
> Please note that this bug is already targeted as 7.3.0.
Okay. I thought I had found a reliable reproducer, but now it doesn't look like it. So I would like to give up for now and continue with RHEL 7.2 work.

Comment 45 Noriko Hosoi 2016-01-07 19:21:26 UTC
Upstream ticket:
https://fedorahosted.org/389/ticket/48403

Comment 46 Noriko Hosoi 2016-05-28 00:46:44 UTC
We did not have a chance to look into this issue recently.

We now have a rhel-7.3 candidate build (though, of course, we are still fixing more bugs):
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=495659
389-ds-base-1.3.5.4-1.el7

Sankar, could you please retry the test with this latest build on rhel-7.3?
> Description of problem: Slapd crashes coming from multi-operations stress tests.

And if the server still crashes, could you retain the test environment with a core file?

Thanks!

Comment 47 Sankar Ramalingam 2016-06-02 08:31:07 UTC
So far, no crashes have been observed with the TET acceptance tests on RHEL 7.3 389-ds-base builds. We have yet to start the Longduration (Tier2) and Stress/Reliability (Tier3) tests for RHEL 7.3. I will update the bug with more details if I encounter any crashes during Tier2 and Tier3 execution.

Comment 49 Sankar Ramalingam 2016-06-14 11:17:39 UTC
I cloned a beaker job - https://beaker.engineering.redhat.com/jobs/1369329. I will wait for this job to complete and then update the bug accordingly.

Comment 50 Sankar Ramalingam 2016-06-14 12:02:56 UTC
(In reply to Sankar Ramalingam from comment #49)
> I cloned a beaker job - https://beaker.engineering.redhat.com/jobs/1369329.
> I will wait for this job to complete and then update the bug accordingly.

I didn't observe any crash for the above beaker job. Hence, closing this bug as not reproducible.

Packages tested: 389-ds-base-1.3.5.4-1.el7.x86_64