Bug 1255042
Summary: Slapd crashes reported from replication tests

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | 389-ds-base |
| Version | 7.2 |
| Status | CLOSED WORKSFORME |
| Severity | medium |
| Priority | medium |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Sankar Ramalingam <sramling> |
| Assignee | Noriko Hosoi <nhosoi> |
| QA Contact | Viktor Ashirov <vashirov> |
| CC | nkinder, rmeggins, sramling |
| Target Milestone | rc |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2016-06-14 12:02:56 UTC |
Description — Sankar Ramalingam, 2015-08-19 13:32:23 UTC

Created attachment 1064860 [details]: Stacktrace for slapd crash

Crash report from beaker.

Sankar Ramalingam:

Created attachment 1064867 [details]: Stacktrace for slapd crash
Noriko Hosoi:

(In reply to Sankar Ramalingam from comment #1)
> Created attachment 1064860 [details]
> Stacktrace for slapd crash
>
> Crash report from beaker

This is a stacktrace of ns-slapd. It crashed while getting an entryrdn element from the entryrdn index during a delete.

Noriko Hosoi:

> Sankar Ramalingam 2015-08-19 09:34:28 EDT
> Created attachment 1064867 [details]
> Stacktrace for slapd crash

This is a stacktrace of ldclt (not ns-slapd).

What do the "Multi-operations stress tests" do? Where can I see the test program? Is it a test against a standalone server or MMR? Are the cores (especially of ns-slapd) left on the beaker machine? Can I see the error and access logs? Also, the ldclt logs? Would it be possible to run the stress test with valgrind?

Sankar Ramalingam:

Cloned a beaker job to reproduce the crash. I will re-run the crash tests with valgrind.
https://beaker.engineering.redhat.com/jobs/1060253

Sankar Ramalingam:

1. I could reproduce the crash on one of the beaker machines; however, this core is from ldclt, not from ns-slapd.
   Host - vm-idm-002.lab.eng.pnq.redhat.com
   Root pw: Redhat123
   Coredump - /var/spool/abrt/ccpp-2015-08-25-11:21:28-30568/coredump
2. I ran the stress tests with valgrind and the report is already out. As expected, there was no crash.
   Host - vm-idm-017.lab.eng.pnq.redhat.com
   Variables set by the tests: MUOP01_MAX_RANGE=1000000, MUOP01_NB_LOOPS=17280
   I ran it with: MUOP01_MAX_RANGE=100000, MUOP01_NB_LOOPS=1728

Feel free to ask more questions or request further runs.

Noriko Hosoi (comment #10):

Thank you, Sankar.

Could you repeat case 1 (no valgrind) on several beaker machines in parallel? If the crash is not captured, I will give up... Thanks...

Sankar Ramalingam:

(In reply to Noriko Hosoi from comment #10)
> Could you repeat the case 1 (no valgrind) on several beaker machines in
> parallel?

https://beaker.engineering.redhat.com/jobs/1061677
https://beaker.engineering.redhat.com/jobs/1061678
https://beaker.engineering.redhat.com/jobs/1061679
https://beaker.engineering.redhat.com/jobs/1061680

> If the crash is not captured, I will give up...
> Thanks...

Sankar Ramalingam:

Created attachment 1067208 [details]
Stack trace from accPolicy tests
Managed to reproduce the crash with the accPolicy acceptance tests, so the crash is not specific to the stress tests. I have cloned another beaker job with the accPolicy tests to reproduce the crash and provide access to the core files.
Attaching the stack trace.
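The valgrind run discussed earlier in the thread can be wrapped roughly as follows. This is only a sketch: the instance name (`slapd-master1`), the paths, and the option set are assumptions for illustration, not the exact invocation used by the test harness.

```shell
# Sketch: run ns-slapd under valgrind instead of via the init script.
# Instance name and paths below are assumptions, not taken from this bug.
VG_OPTS="--leak-check=full --track-origins=yes --num-callers=40"
VG_LOG="/var/log/dirsrv/valgrind-ns-slapd.%p.log"
NS_SLAPD_CMD="/usr/sbin/ns-slapd -D /etc/dirsrv/slapd-master1 -i /var/run/dirsrv/slapd-master1.pid"
FULL_CMD="valgrind $VG_OPTS --log-file=$VG_LOG $NS_SLAPD_CMD"
# Print the command rather than executing it, since it needs a live instance:
echo "$FULL_CMD"
```

Note that a valgrind run slows the server down dramatically, which can itself hide timing-dependent crashes — consistent with the observation above that the crash did not reproduce under valgrind.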
Created attachment 1067219 [details]
Stacktrace from mmraccept tests
The mmraccept tests are also crashing. Attaching the stack trace.
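For anyone picking up the cores later: a full backtrace like the ones attached can be produced from a coredump non-interactively with gdb in batch mode. A sketch, assuming gdb and the matching debuginfo packages are installed; the coredump path is illustrative.

```shell
# Sketch: pull a full backtrace from an abrt coredump with gdb batch mode.
# The coredump path is illustrative; install 389-ds-base debuginfo for
# readable frames.
CORE="/var/spool/abrt/ccpp-2015-08-27-06:18:17-344/coredump"
BINARY="/usr/sbin/ns-slapd"
GDB_CMD="gdb -batch -ex 'thread apply all bt full' $BINARY $CORE"
# Printed rather than executed here, since it needs the core and debuginfo:
echo "$GDB_CMD"
```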
Sankar Ramalingam:

The crash is not reproducible for me when I clone beaker jobs or trigger jobs manually from Jenkins. However, it keeps coming up in the automated Jenkins runs, though not consistently. I am trying a few more runs today to reproduce the crash and to reserve the same machine for troubleshooting.

Sankar Ramalingam:

Managed to reproduce the crash with the mmraccept tests by manually triggering the jobs from Jenkins. The machine is reserved and available for further troubleshooting.

Hostname - apollo.idmqe.lab.eng.bos.redhat.com
Root pw: Redhat123

[root@apollo ~]# find /var -name core*
/var/lib/systemd/coredump
/var/spool/abrt/ccpp-2015-08-27-06:18:17-344/coredump
/var/spool/abrt/ccpp-2015-08-27-06:18:17-344/core_backtrace

I guess the tests around accPolicy would also crash the server; the run will reach the accPolicy tests about 5 hrs from now. Feel free to access this machine for further investigation.

Noriko Hosoi:

Sankar, I see lots of file-system-full errors in /var/log/messages. Could they be related to the test failure? If so, could you rerun the test with more disk space?

....
Aug 27 13:33:56 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.): - slapd shutting down - closing down internal subsystems and plugins
Aug 27 13:33:56 apollo ns-slapd: Writing to the errors log failed. Exiting...
Aug 27 13:33:56 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.): - Waiting for 4 database threads to stop
Aug 27 13:33:56 apollo ns-slapd: Writing to the errors log failed. Exiting...
Aug 27 13:33:57 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.): - All database threads now stopped
Aug 27 13:33:57 apollo ns-slapd: Writing to the errors log failed. Exiting...
Aug 27 13:33:57 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.): - slapd shutting down - freed 1 work q stack objects - freed 1 op stack objects
Aug 27 13:33:57 apollo ns-slapd: Writing to the errors log failed. Exiting...
Aug 27 13:33:58 apollo ns-slapd: Failed to write log, Netscape Portable Runtime error -5956 (The device for storing the file is full.): - slapd stopped.
....

If you still see the crash on a host with a larger file system: we learned that openldap was rebased for 7.2, so the issue may be related to the openldap upgrade. Could you please run the test with openldap downgraded to the 7.1 version? Thanks.

Sankar Ramalingam (comment #19):

The machine has already been returned to the beaker pool, and I doubt I could run the tests with openldap downgraded. Moreover, it is difficult to reproduce the crash consistently by running the same set of tests. So I feel this should be pushed to the next release unless we find a way to reproduce it very consistently.

Noriko Hosoi (comment #20):

(In reply to Sankar Ramalingam from comment #19)
> The machine is already returned back to beaker pool and I am doubtful that I
> could run the tests by downgrading openldap. Moreover, its difficult to
> consistently reproduce the crash by running the same set of tests. So, I
> feel this should be pushed to next release unless we figure out a way to
> reproduce it very consistently.

Do you mean you want to stop investigating this crash for now?

Please note that this bug is already targeted as 7.3.0.

Sankar Ramalingam:

(In reply to Noriko Hosoi from comment #20)
> (In reply to Sankar Ramalingam from comment #19)
> > The machine is already returned back to beaker pool and I am doubtful that I
> > could run the tests by downgrading openldap. Moreover, its difficult to
> > consistently reproduce the crash by running the same set of tests. So, I
> > feel this should be pushed to next release unless we figure out a way to
> > reproduce it very consistently.
> Do you mean you want to stop investigating this crash for now?

Yes. I felt I was spending more time on this with no outcome.

> Please note that this bug is already targeted as 7.3.0.

Okay. I thought I had found a reliable reproducer, but now it does not look like one. So I would like to give up for now and continue with the RHEL 7.2 work.

Noriko Hosoi:

Upstream ticket: https://fedorahosted.org/389/ticket/48403

Noriko Hosoi:

We did not have a chance to look into this issue recently. Now we have a rhel-7.3 candidate (of course, we are fixing more bugs, though):
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=495659
389-ds-base-1.3.5.4-1.el7

Sankar, could you please retry the test with this latest build on rhel-7.3?

> Description of problem: Slapd crashes coming from multi-operations stress tests.

And if the server still crashes, could you retain the test environment with a core file? Thanks!

Sankar Ramalingam:

So far, no crashes have been observed with the TET acceptance tests on the RHEL 7.3 389-ds-base builds. We are yet to start the Long Duration (Tier 2) and Stress/Reliability (Tier 3) tests for RHEL 7.3. I will update the bug with more details if I encounter any crashes during the Tier 2 and Tier 3 runs.

Sankar Ramalingam (comment #49):

I cloned a beaker job - https://beaker.engineering.redhat.com/jobs/1369329. I will wait for this job to complete and then update the bug accordingly.

Sankar Ramalingam:

(In reply to Sankar Ramalingam from comment #49)
> I cloned a beaker job - https://beaker.engineering.redhat.com/jobs/1369329.
> I will wait for this job to complete and then update the bug accordingly.

I didn't observe any crash for the above beaker job. Hence, closing this bug as not reproducible.

Packages tested: 389-ds-base-1.3.5.4-1.el7.x86_64
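One practical takeaway from the apollo run above is that the server died writing logs to a full file system, which can masquerade as (or mask) a genuine crash. A cheap pre-flight check before long stress runs would rule that failure mode out; a minimal sketch, where the 5 GiB threshold is an arbitrary assumption:

```shell
# Sketch: refuse to start a long stress run if /var is nearly full.
# The 5 GiB threshold is an arbitrary assumption, not from this bug.
need_kb=$((5 * 1024 * 1024))
avail_kb=$(df -Pk /var | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "WARNING: only ${avail_kb} KiB free on /var; logs may fill the disk"
else
    echo "OK: ${avail_kb} KiB free on /var"
fi
```

Such a check could run in the beaker/Jenkins setup step so that a full disk fails the job early instead of mid-test.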