Bug 1275508 - Crashes observed while troubleshooting trac47490 test failures with mmraccept tests
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.2
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Noriko Hosoi
QA Contact: Viktor Ashirov
URL: http://faf-report.itos.redhat.com/rep...
Depends On:
Blocks:
Reported: 2015-10-27 02:50 EDT by Sankar Ramalingam
Modified: 2017-01-09 07:23 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-09 07:23:02 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Sankar Ramalingam 2015-10-27 02:50:50 EDT
Description of problem: slapd crashes reported when running mmrepl/accept tests

Version-Release number of selected component (if applicable): 389-ds-base-1.3.4.0-19

How reproducible: Not consistently. It is reproducible when only part of the mmrepl/accept tests is run manually.


Steps to Reproduce:
1. Reserve a beaker machine and run only mmrepl/accept tests.
2. Make sure the iclist is modified to run only a few tests:
iclist="ic0 ic1 ic2 ic3 ic4"
3. Run the tests more than once to reproduce the crash.

Actual results: Server crashes.

Expected results: No crashes.


Additional info: Comment from Thierry:

This looks like a double free:

(gdb) where
#0  __GI___libc_free (mem=0xffffffffffffffff) at malloc.c:2917
#1  0x00007f675fabbb0c in ber_free_buf (ber=0x7f6760ff6010) at io.c:188
#2  0x00007f675fabbb75 in ber_free (ber=0x7f6760ff6010, freebuf=<optimized out>) at io.c:203
#3  0x00007f675fcec7d2 in ldap_free_connection (ld=ld@entry=0x7f66e80010d0, lc=0x7f66e8010400, force=force@entry=1,
    unbind=unbind@entry=1) at request.c:781
#4  0x00007f675fce3aff in ldap_ld_free (ld=0x7f66e80010d0, close=close@entry=1, sctrls=sctrls@entry=0x0, cctrls=cctrls@entry=0x0)
    at unbind.c:118
#5  0x00007f675fce3e77 in ldap_unbind_ext (ld=<optimized out>, sctrls=sctrls@entry=0x0, cctrls=cctrls@entry=0x0) at unbind.c:52
#6  0x00007f6760b18ece in slapi_ldap_unbind (ld=<optimized out>) at ldap/servers/slapd/ldaputil.c:114
#7  0x00007f6755babf29 in close_connection_internal (conn=conn@entry=0x7f66ec007170)
    at ldap/servers/plugins/replication/repl5_connection.c:1369
#8  0x00007f6755bad3f5 in conn_read_result_ex (conn=conn@entry=0x7f66ec007170, retoidp=retoidp@entry=0x0,
    retdatap=retdatap@entry=0x0, returned_controls=returned_controls@entry=0x0, send_msgid=send_msgid@entry=-1,
    resp_msgid=resp_msgid@entry=0x7f66f1ffad78, block=block@entry=0) at ldap/servers/plugins/replication/repl5_connection.c:454
#9  0x00007f6755bafa76 in repl5_inc_result_threadmain (param=0x7f66e8009220)
    at ldap/servers/plugins/replication/repl5_inc_protocol.c:268
#10 0x00007f675ed2a7bb in _pt_root (arg=0x7f66e8000d10) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#11 0x00007f675e6cbdc5 in start_thread (arg=0x7f66f1ffb700) at pthread_create.c:308
#12 0x00007f675e3f91cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Comment 3 Viktor Ashirov 2015-10-27 04:42:20 EDT
I've got the same crash during mmrepl/fractional test execution.
Comment 4 Noriko Hosoi 2015-10-27 14:15:58 EDT
Sankar, do you happen to keep all the error and access logs from the crashed tet run?

The crash is in this disconnect error case.  I wonder what happened on the other side.  Thanks.
452                    if (IS_DISCONNECT_ERROR(rc))
453                    {
454                        close_connection_internal(conn); /* we already have the lock */
455                        return_value = CONN_NOT_CONNECTED;
456                    }
Comment 5 Noriko Hosoi 2015-10-27 16:30:50 EDT
Also, are the result files mmraccept.*.out.* available?

I assume the core file core.11123 is from slapd-s1.  Is that correct?
Comment 6 Noriko Hosoi 2015-10-27 17:32:52 EDT
I've repeated mmr/accept with iclist="ic0 ic1 ic2 ic3 ic4" 3 times on a beaker box: tyan-gt24-13.rhts.eng.bos.redhat.com.  So far, no luck.  Might it need a beefier machine?  Would it be possible to let me log in to the test machine which got the crash?
Comment 7 Sankar Ramalingam 2015-10-27 23:56:40 EDT
(In reply to Noriko Hosoi from comment #6)
> I've repeated mmr/accept iclist="ic0 ic1 ic2 ic3 ic4" 3 times on a beaker
> box: tyan-gt24-13.rhts.eng.bos.redhat.com.  So far, no luck.  Might it need
> beefier machine?  Could it be possible to let me login the test machine
> which got the crash?

Sure!
Host - cloud-qe-4.idmqe.lab.eng.bos.redhat.com
Creds - root/Default beaker root passwd.
Comment 8 Sankar Ramalingam 2015-10-28 03:50:59 EDT
(In reply to Noriko Hosoi from comment #5)
> Also, result files mmraccept.*.out.* are available?
> 
> I assume the core file core.11123 is from slapd-s1.  Is that correct?

That is correct. The instance slapd-s1 crashed. I have kept the core files backed up in /export on cloud-qe-4.idmqe.lab.eng.bos.redhat.com.
Comment 9 Noriko Hosoi 2015-10-29 15:40:35 EDT
Thank you, Sankar.

I think the server should not crash, but the priority of this case may be low.

trac47606 modifies the maxbersize of C1 to check the error handling.  It is supposed to set it back to the original value, like this, in mmr/accept/accept.sh:
    message "C1: Setting nsslapd-maxbersize: $MAXBERSIZE"
    $LDAPMODIFY -p $C1PORT -h localhost -D "${REPLINSTROOTDN}" -w ${REPLINSTROOTPW} <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-maxbersize
nsslapd-maxbersize: $MAXBERSIZE
EOF

But somehow, it failed like this (cloud-qe-4.idmqe.lab.eng.bos.redhat.com:/tet/testcases/DS/6.0/tet_tmp_dir/32090/mmraccept.run.out):
S1->C1: Adding an entry > 2MB is supposed to be logged in C1 error log

C1: Incoming BER Element size is 
expr: syntax error
C1: Setting nsslapd-maxbersize: 
ldap_modify: Invalid syntax
ldap_modify: additional info: nsslapd-maxbersize: value #0 invalid per syntax

Because the maxbersize left on C1 is too low:
# egrep maxbersize /etc/dirsrv/slapd-c1/dse.ldif
nsslapd-maxbersize: 20480
some of the following {s1,s2}->c1 replications failed, e.g.:
[29/Oct/2015:15:21:50 -0400] connection - conn=95963 fd=64 Incoming BER Element was 220429 bytes, max allowable is 20480 bytes. Change the nsslapd-maxbersize attribute in cn=config to increase.
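
To see how often the consumer rejected incoming elements for this reason, the error log can be filtered for that message.  This is only a rough sketch: the function name ber_rejections is made up for illustration, and the log path in the usage line is an assumption about where the c1 instance writes its error log.

```shell
# Sketch: list connections rejected for exceeding nsslapd-maxbersize.
# ber_rejections is a hypothetical helper, not part of the test suite.
# It reads the files given as arguments, or stdin when none are given,
# and prints "conn=<id> size=<bytes> max=<bytes>" for each matching line.
ber_rejections() {
    sed -n 's/.*conn=\([0-9]*\) .*Incoming BER Element was \([0-9]*\) bytes, max allowable is \([0-9]*\) bytes.*/conn=\1 size=\2 max=\3/p' "$@"
}
```

Usage, assuming the usual error log location for the instance: ber_rejections /var/log/dirsrv/slapd-c1/errors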

This is a sample error on S1->C1, in which schema replication failed.  Please note that "Can't contact LDAP server" is returned when the maxbersize is too small.  So, I would think this is one of the failure cases I described.  Also, the log confirms that C1 was up at this point (15:21:40).
[29/Oct/2015:15:21:39 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Replication bind with SIMPLE auth resumed
[29/Oct/2015:15:21:40 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Warning: unable to replicate schema: rc=2
[29/Oct/2015:15:21:40 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Failed to send update operation to consumer (uniqueid b39b3cb4-7c9311e5-90c5ba22-3d0bc543, CSN 562f4f05000000010000): Can't contact LDAP server. Will retry later.
[29/Oct/2015:15:21:40 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Consumer failed to replay change (uniqueid (null), CSN (null)): Can't contact LDAP server(-1). Will retry later.
[29/Oct/2015:15:21:40 -0400] NSMMReplicationPlugin - agmt="cn=24202_to_24206" (cloud-qe-4:24206): Warning: unable to send endReplication extended operation (Can't contact LDAP server)

Could you please fix the trac47606 test case to make sure it restores the original (or default) maxbersize, and then rerun the test?  If that eliminates the crash, I'm going to push this bug to 7.3.  Thanks.
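
One way the reset step could guard against an empty or malformed $MAXBERSIZE (the case that produced "Invalid syntax" above) is sketched below.  This is not the code in accept.sh: restore_value and DEFAULT_MAXBERSIZE are hypothetical names, and the 2097152 fallback is an assumed default that should be checked against the server documentation.

```shell
# Hypothetical helper, not part of accept.sh: choose the value to restore.
# DEFAULT_MAXBERSIZE is an assumed fallback (2097152 bytes, i.e. 2 MB).
DEFAULT_MAXBERSIZE=2097152

restore_value() {
    case "$1" in
        # Empty or containing any non-digit character: use the fallback.
        ''|*[!0-9]*) printf '%s\n' "$DEFAULT_MAXBERSIZE" ;;
        # Otherwise the saved value is a plain number and can be reused.
        *)           printf '%s\n' "$1" ;;
    esac
}

# Intended use in the test, before the existing ldapmodify call:
# MAXBERSIZE=$(restore_value "$MAXBERSIZE")
```

With a guard like this, the modify request always carries a numeric value, so C1 is not left at the lowered limit when the saved value could not be computed.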
Comment 10 Sankar Ramalingam 2015-11-01 04:31:09 EST
Modified the trac47606 tests to reset the value for nsslapd-maxbersize after test execution and restarted the same set of tests to check if there is a crash. 
$LDAPMODIFY -p $C1PORT -h localhost -D "${REPLINSTROOTDN}" -w ${REPLINSTROOTPW} <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-maxbersize
nsslapd-maxbersize: 10485760
EOF

With this change, the run completed fine. No crash.
http://wiki.idm.lab.bos.redhat.com/qa/archive/beaker/x86_64/389-ds-base-1.3.4.0-19/Linux/20151101-022701.html
Comment 11 Noriko Hosoi 2015-11-02 12:19:14 EST
Good news.  Thanks for the result, Sankar.

Pushing the target to 7.3.
