1275508 – Crashes observed while troubleshooting trac47490 test failures with mmraccept tests

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	389-ds-base
Sub Component:
Version:	7.2
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Noriko Hosoi
QA Contact:	Viktor Ashirov
Docs Contact:
URL:	http://faf-report.itos.redhat.com/rep...
Whiteboard:
Depends On:
Blocks:

Reported:	2015-10-27 06:50 UTC by Sankar Ramalingam
Modified:	2017-01-09 12:23 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-01-09 12:23:02 UTC
Target Upstream Version:
Embargoed:

Description Sankar Ramalingam 2015-10-27 06:50:50 UTC

Description of problem: slapd crashes reported when running mmrepl/accept tests

Version-Release number of selected component (if applicable): 389-ds-base-1.3.4.0-19

How reproducible: Not consistently. It is reproducible when part of the mmrepl/accept tests is run manually.


Steps to Reproduce:
1. Reserve a beaker machine and run only mmrepl/accept tests.
2. Make sure the iclist is modified to run only a few tests:
iclist="ic0 ic1 ic2 ic3 ic4"
3. Run the tests more than once to reproduce the crash.

Actual results: Server crashes.

Expected results: No crashes.


Additional info: Comment from Thierry:

This looks like a double free:

(gdb) where
#0  __GI___libc_free (mem=0xffffffffffffffff) at malloc.c:2917
#1  0x00007f675fabbb0c in ber_free_buf (ber=0x7f6760ff6010) at io.c:188
#2  0x00007f675fabbb75 in ber_free (ber=0x7f6760ff6010, freebuf=<optimized out>) at io.c:203
#3  0x00007f675fcec7d2 in ldap_free_connection (ld=ld@entry=0x7f66e80010d0, lc=0x7f66e8010400, force=force@entry=1,
    unbind=unbind@entry=1) at request.c:781
#4  0x00007f675fce3aff in ldap_ld_free (ld=0x7f66e80010d0, close=close@entry=1, sctrls=sctrls@entry=0x0, cctrls=cctrls@entry=0x0)
    at unbind.c:118
#5  0x00007f675fce3e77 in ldap_unbind_ext (ld=<optimized out>, sctrls=sctrls@entry=0x0, cctrls=cctrls@entry=0x0) at unbind.c:52
#6  0x00007f6760b18ece in slapi_ldap_unbind (ld=<optimized out>) at ldap/servers/slapd/ldaputil.c:114
#7  0x00007f6755babf29 in close_connection_internal (conn=conn@entry=0x7f66ec007170)
    at ldap/servers/plugins/replication/repl5_connection.c:1369
#8  0x00007f6755bad3f5 in conn_read_result_ex (conn=conn@entry=0x7f66ec007170, retoidp=retoidp@entry=0x0,
    retdatap=retdatap@entry=0x0, returned_controls=returned_controls@entry=0x0, send_msgid=send_msgid@entry=-1,
    resp_msgid=resp_msgid@entry=0x7f66f1ffad78, block=block@entry=0) at ldap/servers/plugins/replication/repl5_connection.c:454
#9  0x00007f6755bafa76 in repl5_inc_result_threadmain (param=0x7f66e8009220)
    at ldap/servers/plugins/replication/repl5_inc_protocol.c:268
#10 0x00007f675ed2a7bb in _pt_root (arg=0x7f66e8000d10) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#11 0x00007f675e6cbdc5 in start_thread (arg=0x7f66f1ffb700) at pthread_create.c:308
#12 0x00007f675e3f91cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Comment 3 Viktor Ashirov 2015-10-27 08:42:20 UTC

I've got the same crash during mmrepl/fractional test execution.

Comment 4 Noriko Hosoi 2015-10-27 18:15:58 UTC

Sankar, do you happen to keep all the error and access logs from the crashed tet run?

The crash is in this disconnect error case.  I wonder what happened on the other side.  Thanks.
452                    if (IS_DISCONNECT_ERROR(rc))
453                    {
454                        close_connection_internal(conn); /* we already have the lock */
455                        return_value = CONN_NOT_CONNECTED;
456                    }

Comment 5 Noriko Hosoi 2015-10-27 20:30:50 UTC

Also, are the result files mmraccept.*.out.* available?

I assume the core file core.11123 is from slapd-s1.  Is that correct?

Comment 6 Noriko Hosoi 2015-10-27 21:32:52 UTC

I've repeated mmr/accept iclist="ic0 ic1 ic2 ic3 ic4" 3 times on a beaker box: tyan-gt24-13.rhts.eng.bos.redhat.com.  So far, no luck.  Might it need a beefier machine?  Would it be possible to let me log in to the test machine that got the crash?

Comment 7 Sankar Ramalingam 2015-10-28 03:56:40 UTC

(In reply to Noriko Hosoi from comment #6)
> I've repeated mmr/accept iclist="ic0 ic1 ic2 ic3 ic4" 3 times on a beaker
> box: tyan-gt24-13.rhts.eng.bos.redhat.com.  So far, no luck.  Might it need
> beefier machine?  Could it be possible to let me login the test machine
> which got the crash?

Sure!
Host - cloud-qe-4.idmqe.lab.eng.bos.redhat.com
Creds - root/Default beaker root passwd.

Comment 8 Sankar Ramalingam 2015-10-28 07:50:59 UTC

(In reply to Noriko Hosoi from comment #5)
> Also, result files mmraccept.*.out.* are available?
> 
> I assume the core file core.11123 is from slapd-s1.  Is that correct?

That is true. The instance slapd-s1 crashed. I have kept the core files backed up in /export on cloud-qe-4.idmqe.lab.eng.bos.redhat.com.

Comment 9 Noriko Hosoi 2015-10-29 19:40:35 UTC

Thank you, Sankar.

I think the server should not crash, but the priority of this case may be low.

trac47606 modifies the maxbersize of C1 to check the error handling.  It is supposed to set it back to the original value like this in mmr/accept/accept.sh:
5730     message "C1: Setting nsslapd-maxbersize: $MAXBERSIZE"
5731     $LDAPMODIFY -p $C1PORT -h localhost -D "${REPLINSTROOTDN}" -w ${REPLINSTROOTPW} <<EOF
5732 dn: cn=config
5733 changetype: modify
5734 replace: nsslapd-maxbersize
5735 nsslapd-maxbersize: $MAXBERSIZE
5736 EOF

But somehow, it failed like this (cloud-qe-4.idmqe.lab.eng.bos.redhat.com:/tet/testcases/DS/6.0/tet_tmp_dir/32090/mmraccept.run.out):
S1->C1: Adding an entry > 2MB is supposed to be logged in C1 error log

C1: Incoming BER Element size is 
expr: syntax error
C1: Setting nsslapd-maxbersize: 
ldap_modify: Invalid syntax
ldap_modify: additional info: nsslapd-maxbersize: value #0 invalid per syntax

Because the maxbersize on C1 was left too low:
# egrep maxbersize /etc/dirsrv/slapd-c1/dse.ldif
nsslapd-maxbersize: 20480
some of the following {s1,s2}->c1 replications failed, e.g.:
[29/Oct/2015:15:21:50 -0400] connection - conn=95963 fd=64 Incoming BER Element was 220429 bytes, max allowable is 20480 bytes. Change the nsslapd-maxbersize attribute in cn=config to increase.
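
One way to spot this condition across a whole error log is to pull the incoming size and the limit out of each such message. This is only a minimal sketch, assuming the message format quoted above; the helper name ber_sizes is hypothetical and not part of the test suite.

```shell
# Hypothetical helper: for every "Incoming BER Element" message read from
# stdin, print "<incoming bytes> <max allowable bytes>".
ber_sizes() {
    sed -nE 's/.*Incoming BER Element was ([0-9]+) bytes, max allowable is ([0-9]+) bytes.*/\1 \2/p'
}

# Example using the log line quoted above.
line='[29/Oct/2015:15:21:50 -0400] connection - conn=95963 fd=64 Incoming BER Element was 220429 bytes, max allowable is 20480 bytes. Change the nsslapd-maxbersize attribute in cn=config to increase.'
printf '%s\n' "$line" | ber_sizes
```

For the line above this prints "220429 20480". To scan a real log, redirect the instance's errors file into the function (the exact path depends on the installation).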

This is a sample error on S1->C1, in which schema replication failed.  Please note that "Can't contact LDAP server" is returned when the maxbersize is too small.  So, I would think this is one of the failure cases I described.  Also, the log confirms that C1 was up at this point (15:21:40).
[29/Oct/2015:15:21:39 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Replication bind with SIMPLE auth resumed
[29/Oct/2015:15:21:40 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Warning: unable to replicate schema: rc=2
[29/Oct/2015:15:21:40 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Failed to send update operation to consumer (uniqueid b39b3cb4-7c9311e5-90c5ba22-3d0bc543, CSN 562f4f05000000010000): Can't contact LDAP server. Will retry later.
[29/Oct/2015:15:21:40 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Consumer failed to replay change (uniqueid (null), CSN (null)): Can't contact LDAP server(-1). Will retry later.
[29/Oct/2015:15:21:40 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Warning: unable to send endReplication extended operation (Can't contact LDAP server)

Could you please fix the trac47606 test case to make sure it sets back the original (or default) maxbersize and rerun the test?  If it eliminates the crash, I'm going to push it to 7.3.  Thanks.
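
The failure output above shows $MAXBERSIZE expanding to an empty string, which is what makes ldapmodify reject the value. One possible guard, sketched here under assumptions: the helper name safe_maxbersize and the DEFAULT_MAXBERSIZE variable are hypothetical, and the fallback of 2097152 (2 MB) is assumed to be the server default and should be checked against the installed version.

```shell
# Assumed fallback value (2 MB); verify against the server's actual default.
DEFAULT_MAXBERSIZE=2097152

# Hypothetical helper: echo the saved value if it consists only of digits,
# otherwise echo the fallback so the restore step never sends an empty value.
safe_maxbersize() {
    case "$1" in
        ''|*[!0-9]*) echo "$DEFAULT_MAXBERSIZE" ;;
        *)           echo "$1" ;;
    esac
}

# Sanitize the saved value before it is used in the ldapmodify here-document.
MAXBERSIZE=$(safe_maxbersize "${MAXBERSIZE:-}")
echo "C1: Setting nsslapd-maxbersize: $MAXBERSIZE"
```

With a guard like this in place before the ldapmodify call, the restore step would still send a syntactically valid value even when the earlier expr computation fails.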

Comment 10 Sankar Ramalingam 2015-11-01 09:31:09 UTC

Modified the trac47606 tests to reset the value for nsslapd-maxbersize after test execution and restarted the same set of tests to check if there is a crash. 
$LDAPMODIFY -p $C1PORT -h localhost -D "${REPLINSTROOTDN}" -w ${REPLINSTROOTPW} <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-maxbersize
nsslapd-maxbersize: 10485760
EOF

With this, the execution just went fine. No crash.
http://wiki.idm.lab.bos.redhat.com/qa/archive/beaker/x86_64/389-ds-base-1.3.4.0-19/Linux/20151101-022701.html

Comment 11 Noriko Hosoi 2015-11-02 17:19:14 UTC

Good news.  Thanks for the result, Sankar.

Pushing the target to 7.3.
