Bug 1255042 - Slapd crashes reported from replication tests
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.2
Hardware: x86_64 Linux
Priority: medium  Severity: medium
Target Milestone: rc
Assigned To: Noriko Hosoi
QA Contact: Viktor Ashirov
Depends On:
Blocks:
Reported: 2015-08-19 09:32 EDT by Sankar Ramalingam
Modified: 2016-06-14 08:02 EDT (History)
3 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-14 08:02:56 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Backtraces for slapd crashes (258.43 KB, text/plain)
2015-08-19 09:32 EDT, Sankar Ramalingam
Stacktrace for slapd crash (258.43 KB, text/plain)
2015-08-19 09:33 EDT, Sankar Ramalingam
Stacktrace for slapd crash (35.95 KB, text/plain)
2015-08-19 09:34 EDT, Sankar Ramalingam
Stack trace from accPolicy tests (188.13 KB, text/plain)
2015-08-26 05:21 EDT, Sankar Ramalingam
Stacktrace from mmraccept tests (189.29 KB, text/plain)
2015-08-26 06:42 EDT, Sankar Ramalingam

Description Sankar Ramalingam 2015-08-19 09:32:23 EDT
Created attachment 1064852 [details]
Backtraces for slapd crashes

Description of problem: Slapd crashes occur during multi-operations stress tests.


Version-Release number of selected component (if applicable): 389-ds-base-1.3.4.0-12


How reproducible: Consistently with the latest build of 389-ds-base-1.3.4.0-12


Steps to Reproduce:
1. Run Multi-operations stress tests on a beaker machine.
2. Job - https://beaker.engineering.redhat.com/jobs/1053896
3. Crashes reported from execution

Actual results: Slapd crashes


Expected results: No slapd crash


Additional info:
Comment 1 Sankar Ramalingam 2015-08-19 09:33:33 EDT
Created attachment 1064860 [details]
Stacktrace for slapd crash

Crash report from beaker
Comment 2 Sankar Ramalingam 2015-08-19 09:34:28 EDT
Created attachment 1064867 [details]
Stacktrace for slapd crash
Comment 4 Noriko Hosoi 2015-08-19 12:41:09 EDT
(In reply to Sankar Ramalingam from comment #1)
> Created attachment 1064860 [details]
> Stacktrace for slapd crash
> 
> Crash report from beaker

This is a stacktrace of ns-slapd.  It crashed while retrieving an entryrdn element from the entryrdn index during a delete operation.


> Sankar Ramalingam 2015-08-19 09:34:28 EDT
> Created attachment 1064867 [details]
> Stacktrace for slapd crash

This is a stacktrace of ldclt (not ns-slapd).


What do "Multi-operations stress tests" do?  Where can I see the test program?  Is it a test against a standalone server or MMR?

Are the cores (especially of ns-slapd) left on the beaker?  Can I see the error and access logs?  Also, ldclt logs?

Would it be possible to run the stress test under valgrind?
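For reference, one common way to run ns-slapd under valgrind is to stop the instance and start the daemon in the foreground by hand. This is only a sketch: the instance name "EXAMPLE" and the log path are assumptions and must be adjusted for the actual test host.

```shell
# Stop the running instance first (instance name "EXAMPLE" is hypothetical).
systemctl stop dirsrv@EXAMPLE

# Run ns-slapd in the foreground (-d 0) under valgrind; memcheck output
# goes to a per-process log file in /tmp.
valgrind --tool=memcheck --leak-check=full --num-callers=40 \
         --log-file=/tmp/slapd.vg.%p \
         /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-EXAMPLE -d 0
```

Note that valgrind slows the server down considerably, which can itself hide timing-dependent crashes (as comment 9 later observes).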
Comment 5 Sankar Ramalingam 2015-08-24 14:31:39 EDT
Cloned a beaker job to reproduce the crash. I will re-run the crash tests with Valgrind.
https://beaker.engineering.redhat.com/jobs/1060253
Comment 9 Sankar Ramalingam 2015-08-25 13:38:13 EDT
1. I could reproduce the crash on one of the beaker machines. However, this is from ldclt, not from ns-slapd.

Host - vm-idm-002.lab.eng.pnq.redhat.com
Root pw: Redhat123
Coredump - /var/spool/abrt/ccpp-2015-08-25-11:21:28-30568/coredump

2. I ran the stress tests with valgrind and the report is available. As expected, there was no crash.
Host - vm-idm-017.lab.eng.pnq.redhat.com

Variables set by tests:
        MUOP01_MAX_RANGE=1000000
        MUOP01_NB_LOOPS=17280

I ran it with :
        MUOP01_MAX_RANGE=100000
        MUOP01_NB_LOOPS=1728

Feel free to ask more questions or request further runs.
Comment 10 Noriko Hosoi 2015-08-25 13:49:15 EDT
Thank you, Sankar.

Could you repeat case 1 (no valgrind) on several beaker machines in parallel?  If the crash is not captured, I will give up...

Thanks...
Comment 11 Sankar Ramalingam 2015-08-25 14:28:32 EDT
(In reply to Noriko Hosoi from comment #10)
> Thank you, Sankar.
> 
> Could you repeat the case 1 (no valgrind) on several beaker machines in
> parallel?  
https://beaker.engineering.redhat.com/jobs/1061677
https://beaker.engineering.redhat.com/jobs/1061678
https://beaker.engineering.redhat.com/jobs/1061679
https://beaker.engineering.redhat.com/jobs/1061680

> If the crash is not captured, I will give up...
> 
> Thanks...
Comment 13 Sankar Ramalingam 2015-08-26 05:21:22 EDT
Created attachment 1067208 [details]
Stack trace from accPolicy tests

Managed to reproduce the crash with the accPolicy acceptance tests. The crash is not specific to stress tests. I have cloned another beaker job with accPolicy tests to reproduce the crash and provide access to the core files.
Attaching the stack trace.
Comment 14 Sankar Ramalingam 2015-08-26 06:42:31 EDT
Created attachment 1067219 [details]
Stacktrace from mmraccept tests

The mmraccept tests are also crashing. Attaching the stack trace.
Comment 16 Sankar Ramalingam 2015-08-27 03:00:26 EDT
The crash is not reproducible for me when I clone beaker jobs or manually trigger jobs from Jenkins. However, it keeps occurring in the automated Jenkins runs, though not consistently. I am trying a few more runs today to reproduce the crash and to reserve the same machine for troubleshooting.
Comment 17 Sankar Ramalingam 2015-08-27 06:55:30 EDT
Managed to reproduce the crash with the mmraccept tests by manually triggering jobs from Jenkins. The machine is reserved and available for further troubleshooting.

Hostname - apollo.idmqe.lab.eng.bos.redhat.com
Root pw: Redhat123

[root@apollo ~]# find /var -name core*
/var/lib/systemd/coredump
/var/spool/abrt/ccpp-2015-08-27-06:18:17-344/coredump
/var/spool/abrt/ccpp-2015-08-27-06:18:17-344/core_backtrace

I guess the tests around accPolicy would also crash the server. The run will reach the accPolicy tests in about 5 hours. Feel free to access this machine for further investigation.
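A full backtrace can be extracted from the abrt coredump listed above without an interactive gdb session. This is a sketch; it assumes the matching 389-ds-base debuginfo packages are installed so the symbols are readable.

```shell
# Dump a full backtrace of every thread from the abrt core to a file
# (core path taken from the find output above).
gdb -batch -ex 'thread apply all bt full' \
    /usr/sbin/ns-slapd \
    /var/spool/abrt/ccpp-2015-08-27-06:18:17-344/coredump \
    > /tmp/slapd.bt.txt
```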
Comment 18 Noriko Hosoi 2015-08-27 14:24:29 EDT
Sankar,

I see lots of file-system-full errors in /var/log/messages.  Could this be related to the test failure?  If so, could you rerun the test with more disk space?

....
Aug 27 13:33:56 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - slapd shutting down - closing down internal subsystems and plugins
Aug 27 13:33:56 apollo ns-slapd: Writing to the errors log failed.  Exiting...
Aug 27 13:33:56 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - Waiting for 4 database threads to stop
Aug 27 13:33:56 apollo ns-slapd: Writing to the errors log failed.  Exiting...
Aug 27 13:33:57 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - All database threads now stopped
Aug 27 13:33:57 apollo ns-slapd: Writing to the errors log failed.  Exiting...
Aug 27 13:33:57 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - slapd shutting down - freed 1 work q stack objects - freed 1 op stack objects
Aug 27 13:33:57 apollo ns-slapd: Writing to the errors log failed.  Exiting...
Aug 27 13:33:58 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.):  - slapd stopped.
....
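The NSPR error -5956 in the excerpt above means the filesystem ns-slapd logs to filled up. A quick check before rerunning (the /var/log paths are typical defaults and may differ on the test host):

```shell
# Free space on the filesystem holding the server logs.
df -h /var

# Largest entries under /var/log, to see what consumed the space.
du -xsh /var/log/* 2>/dev/null | sort -rh | head
```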

If you still see the crash on a host with a larger file system: we learned that openldap was rebased for 7.2, so the issue may be related to the openldap upgrade.  Could you please rerun the test after downgrading openldap to the 7.1 version?

Thanks.
Comment 19 Sankar Ramalingam 2015-08-28 10:54:07 EDT
The machine has already been returned to the beaker pool, and I doubt I could run the tests with a downgraded openldap. Moreover, it's difficult to reproduce the crash consistently by running the same set of tests. So I feel this should be pushed to the next release unless we figure out a way to reproduce it consistently.
Comment 20 Noriko Hosoi 2015-08-28 11:32:46 EDT
(In reply to Sankar Ramalingam from comment #19)
> The machine is already returned back to beaker pool and I am doubtful that I
> could run the tests by downgrading openldap. Moreover, its difficult to
> consistently reproduce the crash by running the same set of tests. So, I
> feel this should be pushed to next release unless we figure out a way to
> reproduce it very consistently.

Do you mean you want to stop investigating this crash for now?

Please note that this bug is already targeted as 7.3.0.
Comment 21 Sankar Ramalingam 2015-08-29 08:04:23 EDT
(In reply to Noriko Hosoi from comment #20)
> (In reply to Sankar Ramalingam from comment #19)
> > The machine is already returned back to beaker pool and I am doubtful that I
> > could run the tests by downgrading openldap. Moreover, its difficult to
> > consistently reproduce the crash by running the same set of tests. So, I
> > feel this should be pushed to next release unless we figure out a way to
> > reproduce it very consistently.
> 
> Do you mean you want to stop investigating this crash for now?
Yes. I felt I was spending a lot of time on this with no outcome.
> 
> Please note that this bug is already targeted as 7.3.0.
Okay, I thought I had found a reliable reproducer, but it no longer looks like one. So I would like to stop here and continue with the RHEL 7.2 work.
Comment 45 Noriko Hosoi 2016-01-07 14:21:26 EST
Upstream ticket:
https://fedorahosted.org/389/ticket/48403
Comment 46 Noriko Hosoi 2016-05-27 20:46:44 EDT
We did not have a chance to look into this issue recently.

Now we have a rhel-7.3 candidate (we are still fixing more bugs, of course):
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=495659
389-ds-base-1.3.5.4-1.el7

Sankar, could you please retry the test with this latest build on rhel-7.3?
> Description of problem: Slapd crashes coming from multi-operations stress tests.

And if the server still crashes, could you retain the test environment with a core file?

Thanks!
Comment 47 Sankar Ramalingam 2016-06-02 04:31:07 EDT
So far, no crashes have been observed with the TET acceptance tests on the RHEL 7.3 389-ds-base builds. We have yet to start the Long-duration (Tier 2) and Stress/Reliability (Tier 3) tests for RHEL 7.3. I will update the bug with more details if I encounter any crashes during the Tier 2 and Tier 3 execution.
Comment 49 Sankar Ramalingam 2016-06-14 07:17:39 EDT
I cloned a beaker job - https://beaker.engineering.redhat.com/jobs/1369329. I will wait for this job to complete and then update the bug accordingly.
Comment 50 Sankar Ramalingam 2016-06-14 08:02:56 EDT
(In reply to Sankar Ramalingam from comment #49)
> I cloned a beaker job - https://beaker.engineering.redhat.com/jobs/1369329.
> I will wait for this job to complete and then update the bug accordingly.

I didn't observe any crash for the above beaker job. Hence, I am closing this bug as not reproducible.

Packages tested: 389-ds-base-1.3.5.4-1.el7.x86_64
