Bug 1094468

Summary: 389-ds-base server reported crash in stan_GetCERTCertificate (pki3hack.c) under the replication replay failure condition
Product: Red Hat Enterprise Linux 7 Reporter: Sankar Ramalingam <sramling>
Component: nssAssignee: Elio Maldonado Batiz <emaldona>
Status: CLOSED ERRATA QA Contact: Alicja Kario <hkario>
Severity: high Docs Contact:
Priority: high    
Version: 7.0CC: amarecek, emaldona, ksrot, nhosoi, nkinder, rrelyea, sforsber, sramling, vanhoof
Target Milestone: rcKeywords: Regression, ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: nss-3.16.2.3-4.el7 Doc Type: Bug Fix
Doc Text:
Cause: nss internal call stan_GetCERTCertificate did nor properly ensure sure that objects do not go away until it it's finished with the operation Consequence: A crash in 389-ds-base server was reported in in nss's stan_GetCERTCertificate (pki3hack.c) under the replication replay failure condition. Fix: The certificate code was fixed to properly manage objects references. Result: The crashes reported by the 389 directory server no longer occur
Story Points: ---
Clone Of:
: 1139349 (view as bug list) Environment:
Last Closed: 2015-03-05 08:27:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1064025, 1113520, 1139349    
Attachments:
Description Flags
Manage objects references to ensure object does not go away until we finish
rrelyea: review+
changes to nss.spec in patch format
rrelyea: review+
Fix latest race
none
spec file with latest race fix.
none
Fix latest race
none
spec file with latest patch applied
none
spec file with latest patch applied - actually used for build none

Description Sankar Ramalingam 2014-05-05 18:38:18 UTC
Description of problem: Directory Server reported a crash when running acceptance tests.


Version-Release number of selected component (if applicable): 389-ds-base-1.3.1.6-25


How reproducible: Consistently


Steps to Reproduce:
1. Run automated acceptance for mmrepl/accept tests
2. Clone a similar beaker job - https://beaker.engineering.redhat.com/jobs/643479
3. Observe the test results - 15/66 FAIL

Actual results: Test failures reported with trac47606 tests. Server crashed when trying initialize one of the consumer.


Expected results: No crash reported.


Additional info:

The crash is in NSS....

Program terminated with signal 11, Segmentation fault.
#0  PR_EnterMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:527
527        if (!pthread_equal(mon->owner, self))
Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.3-4.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 nss-softokn-3.15.4-2.el7.x86_64 nss-softokn-freebl-3.15.4-2.el7.x86_64 nss-util-3.15.4-2.el7.x86_64 sqlite-3.7.17-4.el7.x86_64 xz-libs-5.1.2-8alpha.el7.x86_64 zlib-1.2.7-13.el7.x86_64
(gdb) bt
#0  PR_EnterMonitor (mon=0x0) at ../../../nspr/pr/src/pthreads/ptsynch.c:527
#1  0x00007f9fe29c74b6 in nssPKIObject_Lock (object=object@entry=0x7f9fc82836f0) at pkibase.c:22
#2  0x00007f9fe29c96e9 in stan_GetCERTCertificate (c=0x7f9fc82836f0, forceUpdate=forceUpdate@entry=1)
    at pki3hack.c:858
#3  0x00007f9fe29c9d04 in STAN_ForceCERTCertificateUpdate (c=<optimized out>) at pki3hack.c:914
#4  0x00007f9fe29c59c0 in nssTrustDomain_RemoveTokenCertsFromCache (td=0x7f9fe6101c60,
    token=<optimized out>) at tdcache.c:448
#5  0x00007f9fe298627c in nssToken_NotifyCertsNotVisible (tok=<optimized out>) at dev3hack.c:303
#6  0x00007f9fe29cac59 in nssSlot_IsTokenPresent (slot=0x7f9fe624e030) at devslot.c:172
#7  0x00007f9fe29cd66e in nssToken_IsPresent (token=<optimized out>) at devtoken.c:1442
#8  0x00007f9fe29af5d4 in pk11_IsPresentCertLoad (slot=0x7f9fbc030aa0, loadCerts=1)
    at pk11slot.c:1436
#9  0x00007f9fe29b28b1 in SECMOD_CloseUserDB (slot=0x7f9fbc030aa0) at pk11util.c:1522
#10 0x00007f9fe312bc73 in tlsm_ctx_free (ctx=0x7f9fbc009710) at tls_m.c:2148
#11 0x00007f9fe3127c85 in ldap_int_tls_destroy (lo=0x7f9fbc000e80) at tls2.c:105
#12 0x00007f9fe310c6d7 in ldap_ld_free (ld=0x7f9fbc000dd0, close=close@entry=1,
    sctrls=sctrls@entry=0x0, cctrls=cctrls@entry=0x0) at unbind.c:209
#13 0x00007f9fe310c807 in ldap_unbind_ext (ld=<optimized out>, sctrls=sctrls@entry=0x0,
    cctrls=cctrls@entry=0x0) at unbind.c:52
#14 0x00007f9fe3d2d3de in slapi_ldap_unbind (ld=<optimized out>) at ldap/servers/slapd/ldaputil.c:148
#15 0x00007f9fd97cbf99 in close_connection_internal (conn=conn@entry=0x7f9fe64211b0)
    at ldap/servers/plugins/replication/repl5_connection.c:1203
#16 0x00007f9fd97cc8c6 in perform_operation (conn=0x7f9fe64211b0, optype=optype@entry=1,
    dn=<optimized out>, attrs=attrs@entry=0x7f9fbc080de0, newrdn=newrdn@entry=0x0,
    newparent=newparent@entry=0x0, deleteoldrdn=deleteoldrdn@entry=0,
    update_control=update_control@entry=0x7f9fbc056430, extop_oid=extop_oid@entry=0x0,
    extop_payload=extop_payload@entry=0x0, message_id=message_id@entry=0x7f9fd1f1cc6c)
    at ldap/servers/plugins/replication/repl5_connection.c:738
#17 0x00007f9fd97cd01d in conn_send_add (conn=<optimized out>, dn=<optimized out>,
    attrs=attrs@entry=0x7f9fbc080de0, update_control=update_control@entry=0x7f9fbc056430,
    message_id=message_id@entry=0x7f9fd1f1cc6c)
    at ldap/servers/plugins/replication/repl5_connection.c:771
#18 0x00007f9fd97cf94a in replay_update (prp=prp@entry=0x7f9fe64465f0, op=0x7f9fd1f1ccd0,
    message_id=message_id@entry=0x7f9fd1f1cc6c)
    at ldap/servers/plugins/replication/repl5_inc_protocol.c:1362
#19 0x00007f9fd97d0e56 in send_updates (num_changes_sent=0x7f9fd1f1cc60,
    remote_update_vector=<optimized out>, prp=0x7f9fe64465f0)
    at ldap/servers/plugins/replication/repl5_inc_protocol.c:1719
#20 repl5_inc_run (prp=<optimized out>) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1056
#21 0x00007f9fd97d5a5c in prot_thread_main (arg=0x7f9fe6422c30)
    at ldap/servers/plugins/replication/repl5_protocol.c:296
#22 0x00007f9fe233a740 in _pt_root (arg=0x7f9fe641ff20)
    at ../../../nspr/pr/src/pthreads/ptthread.c:204
#23 0x00007f9fe1cdbdf3 in start_thread (arg=0x7f9fd1f1d700) at pthread_create.c:308
#24 0x00007f9fe1a093dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


# rpm -q nss
nss-3.15.4-6.el7.x86_64

This is the latest NSS for rhel-7.0.


Just in case, other nss related package versions.

nss-3.15.4-6.el7.x86_64

nss-softokn-freebl-3.15.4-2.el7.x86_64
nss-util-3.15.4-2.el7.x86_64
nss-tools-3.15.4-6.el7.x86_64
nss-debuginfo-3.15.4-6.el7.x86_64
nss-softokn-3.15.4-2.el7.x86_64
nss-sysinit-3.15.4-6.el7.x86_64

Comment 2 Noriko Hosoi 2014-05-08 20:07:12 UTC
Additional information.

Output from valgrind

    ==9703== Invalid read of size 8
    ==9703==    at 0x61E0880: stan_GetCERTCertificate (pki3hack.c:834)
    ==9703==    by 0x61DC9BF: nssTrustDomain_RemoveTokenCertsFromCache (tdcache.c:448)
    ==9703==    by 0x61E1C58: nssSlot_IsTokenPresent (devslot.c:172)
    ==9703==    by 0x61C65D3: pk11_IsPresentCertLoad (pk11slot.c:1436)
    ==9703==    by 0x61C98B0: SECMOD_CloseUserDB (pk11util.c:1522)
    ==9703==    by 0x5B06C72: tlsm_ctx_free (tls_m.c:2148)
    ==9703==    by 0x5B02C84: ldap_int_tls_destroy (tls2.c:105)
    ==9703==    by 0x5AE76D6: ldap_ld_free (unbind.c:209)
    ==9703==    by 0x10E04F98: close_connection_internal (repl5_connection.c:1203)
    ==9703==    by 0x10E058C5: perform_operation (repl5_connection.c:738)
    ==9703==    by 0x10E0601C: conn_send_add (repl5_connection.c:771)
    ==9703==    by 0x10E08949: replay_update (repl5_inc_protocol.c:1362)
    ==9703==    by 0x10E09E55: repl5_inc_run (repl5_inc_protocol.c:1719)
    ==9703==    by 0x10E0EA5B: prot_thread_main (repl5_protocol.c:296)
    ==9703==    by 0x68E773F: ??? (in /usr/lib64/libnspr4.so)
    ==9703==    by 0x6F26DF2: start_thread (in /usr/lib64/libpthread-2.17.so)
    ==9703==    by 0x72313DC: clone (in /usr/lib64/libc-2.17.so)
    ==9703==  Address 0xe86ca80 is 240 bytes inside a block of size 2,048 free'd
    ==9703==    at 0x4C29577: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==9703==    by 0x66BC61A: PL_FinishArenaPool (in /usr/lib64/libplds4.so)
    ==9703==    by 0x61E712E: nssArena_Destroy (arena.c:509)
    ==9703==    by 0x61D92B7: nssCertificate_Destroy (certificate.c:125)
    ==9703==    by 0x5F3ED48: ssl3_CleanupPeerCerts.isra.20 (ssl3con.c:9635)
    ==9703==    by 0x5F4AFE2: ssl3_DestroySSL3Info (ssl3con.c:11990)
    ==9703==    by 0x5F5883A: ssl_DestroySocketContents (sslsock.c:335)
    ==9703==    by 0x5F59BC1: ssl_FreeSocket (sslsock.c:398)
    ==9703==    by 0x5F5006F: ssl_DefClose (ssldef.c:205)
    ==9703==    by 0x5B05B49: tlsm_sb_remove (tls_m.c:3255)
    ==9703==    by 0x5D2BD7C: ber_sockbuf_remove_io (sockbuf.c:230)
    ==9703==    by 0x5D2C584: ber_int_sb_destroy (sockbuf.c:404)
    ==9703==    by 0x5D2C5FB: ber_sockbuf_free (sockbuf.c:75)
    ==9703==    by 0x5AE753D: ldap_ld_free (unbind.c:133)
    ==9703==    by 0x10E04F98: close_connection_internal (repl5_connection.c:1203)
    ==9703==    by 0x10E058C5: perform_operation (repl5_connection.c:738)
    ==9703==    by 0x10E0601C: conn_send_add (repl5_connection.c:771)
    ==9703==    by 0x10E08949: replay_update (repl5_inc_protocol.c:1362)
    ==9703==    by 0x10E09E55: repl5_inc_run (repl5_inc_protocol.c:1719)
    ==9703==    by 0x10E0EA5B: prot_thread_main (repl5_protocol.c:296)
    ==9703==    by 0x68E773F: ??? (in /usr/lib64/libnspr4.so)
    ==9703==    by 0x6F26DF2: start_thread (in /usr/lib64/libpthread-2.17.so)
    ==9703==    by 0x72313DC: clone (in /usr/lib64/libc-2.17.so) 

1) Not a race condition.  "perform_operation" calls PR_Lock(conn->lock) at the head of the function and holds it when close_connection_internal is called.

==9703==    by 0x10E04F98: close_connection_internal (repl5_connection.c:1203)
==9703==    by 0x10E058C5: perform_operation (repl5_connection.c:738)

2) Not a double free.  close_connection_internal sets NULL to the ldap handle conn->ld once slapi_ldap_unbind is done.  Thus,in the valgrind output, the 2 stacktraces cannot be from 2 different calls since if the call for the lower stacktrace is successfully finished, there is no handle conn->ld available to call slapi_ldap_unbind with for the upper stacktrace...
   close_connection_internal(Repl_Connection *conn)
    if (NULL != conn->ld)
    {
        /* Since we call slapi_ldap_init, we must call slapi_ldap_unbind */
        slapi_ldap_unbind(conn->ld);
    }
    conn->ld = NULL;

NSS related packages installed.

    nss-softokn-freebl-3.15.4-2.el7.x86_64
    nss-sysinit-3.15.4-6.el7.x86_64
    nss-util-devel-3.15.4-2.el7.x86_64
    nss-softokn-devel-3.15.4-2.el7.x86_64
    nss-debuginfo-3.15.4-6.el7.x86_64
    libsss_nss_idmap-1.11.2-65.el7.x86_64
    nss-util-3.15.4-2.el7.x86_64
    nss-softokn-freebl-devel-3.15.4-2.el7.x86_64
    nss-devel-3.15.4-6.el7.x86_64
    mod_nss-1.0.8-32.el7.x86_64
    nss-3.15.4-6.el7.x86_64
    python-nss-0.14.0-5.el7.x86_64
    libsss_nss_idmap-python-1.11.2-65.el7.x86_64
    nss-softokn-3.15.4-2.el7.x86_64
    nss-tools-3.15.4-6.el7.x86_64

Comment 3 Noriko Hosoi 2014-05-20 01:08:13 UTC
Hi Elio, could there be any progress on this bug?

We are getting this crash report from the automated beaker tests (sorry, debuginfo was not installed).  Do you need any other information for your debugging?

[IPA CRASH ALERT] Test_Suite:  Submitter:  Beaker JobID: 655359
:Thread 1 (Thread 0x7f29fffff700 (LWP 21597)):
:#0  0x00007f2a1d0221eb in PORT_ArenaAlloc_Util () from /lib64/libnssutil3.so
:No symbol table info available.
:#1  0x00007f2a1eb46b7a in cert_trust_from_stan_trust () from /lib64/libnss3.so
:No symbol table info available.
:#2  0x00007f2a1eb47640 in nssTrust_GetCERTCertTrustForCert () from /lib64/libnss3.so
:No symbol table info available.
:#3  0x00007f2a1eb479a5 in stan_GetCERTCertificate () from /lib64/libnss3.so
:No symbol table info available.
:#4  0x00007f2a1eb439c0 in nssTrustDomain_RemoveTokenCertsFromCache () from /lib64/libnss3.so
:No symbol table info available.
:#5  0x00007f2a1eb48c59 in nssSlot_IsTokenPresent () from /lib64/libnss3.so
:No symbol table info available.
:#6  0x00007f2a1eb2d5d4 in pk11_IsPresentCertLoad () from /lib64/libnss3.so
:No symbol table info available.
:#7  0x00007f2a1eb308b1 in SECMOD_CloseUserDB () from /lib64/libnss3.so
:No symbol table info available.
:#8  0x00007f2a1f2a9c73 in tlsm_ctx_free () from /lib64/libldap_r-2.4.so.2
:No symbol table info available.
:#9  0x00007f2a1f2a5c85 in ldap_int_tls_destroy () from /lib64/libldap_r-2.4.so.2
:No symbol table info available.
:#10 0x00007f2a1f28a6d7 in ldap_ld_free () from /lib64/libldap_r-2.4.so.2
:No symbol table info available.
:#11 0x00007f2a14b23f99 in close_connection_internal (conn=conn@entry=0x7f2a2179c6d0) at ldap/servers/plugins/replication/repl5_connection.c:1203
[...]

Thanks,
--noriko

Comment 10 Elio Maldonado Batiz 2014-08-07 04:28:42 UTC
Created attachment 924708 [details]
Manage objects references to ensure object does not go away until we finish

This is Bob Relyea's fix to get rid of the crash.

Comment 11 Elio Maldonado Batiz 2014-08-07 04:57:52 UTC
Created attachment 924709 [details]
changes to nss.spec in patch format

Comment 12 Bob Relyea 2014-08-11 18:26:05 UTC
Comment on attachment 924709 [details]
changes to nss.spec in patch format

r+ rrelyea

Comment 13 Sankar Ramalingam 2014-08-24 23:42:06 UTC
Once you have an official errata filed from NSS team, please let DS QE team and we will do the necessary verification. Please contact sramling for any further queries related to verifying this issue.

Comment 17 Sankar Ramalingam 2014-09-15 10:43:19 UTC
Tests automated in TET RHEL7 branch for 389-ds-base as trac47606 tests.

Comment 20 Sankar Ramalingam 2014-11-27 14:12:31 UTC
This crash is observed with the other beaker run. Hence, re-opening the bug.
Beaker job - https://beaker.engineering.redhat.com/jobs/812004

Comment 21 Noriko Hosoi 2014-12-12 19:53:55 UTC
(In reply to Sankar Ramalingam from comment #20)
> This crash is observed with the other beaker run. Hence, re-opening the bug.
> Beaker job - https://beaker.engineering.redhat.com/jobs/812004

Sankar, I cannot find the crash report (including the stacktraces) in the link you put.
https://beaker.engineering.redhat.com/jobs/812004
Could you attach the file to this bug?  If you don't find it, could you repeat running the test?

Also, every time you run, could you give the output of "rpm -qa |egrep '389-ds-base|nss'"?

Thanks,
--noriko

Comment 24 Bob Relyea 2014-12-18 17:28:18 UTC
Created attachment 970672 [details]
Fix latest race

Comment 25 Bob Relyea 2014-12-18 17:29:00 UTC
Created attachment 970673 [details]
spec file with latest race fix.

Comment 26 Elio Maldonado Batiz 2015-01-07 16:59:13 UTC
Created attachment 977437 [details]
Fix latest race

Updated patch from Bob that gave good results in Noriko's testing.

Comment 27 Elio Maldonado Batiz 2015-01-07 17:00:39 UTC
Created attachment 977439 [details]
spec file with latest patch applied

Comment 28 Elio Maldonado Batiz 2015-01-10 16:31:01 UTC
Created attachment 978539 [details]
spec file with latest patch applied - actually used for build

Comment 32 Sankar Ramalingam 2015-01-27 14:01:32 UTC
No crashes observed with the latest nss/nspr and 389-ds-base packages. Hence, marking the bug as Verified.

Packages tested:
[root@intel-s3eb1-03 MMR_WINSYNC]# rpm -qa |egrep 'nss-|nspr-'
nss-softokn-freebl-3.16.2.3-7.el7.x86_64
nss-tools-3.16.2.3-5.el7.x86_64
nspr-4.10.6-3.el7.x86_64
nss-util-3.16.2.3-2.el7.x86_64
nss-softokn-3.16.2.3-7.el7.x86_64
nss-3.16.2.3-5.el7.x86_64
nspr-debuginfo-4.10.6-3.el7.x86_64
nss-sysinit-3.16.2.3-5.el7.x86_64
nss-debuginfo-3.16.2.3-5.el7.x86_64

[root@intel-s3eb1-03 MMR_WINSYNC]# rpm -qa |egrep 389-ds-base
389-ds-base-libs-1.3.3.1-13.el7.x86_64
389-ds-base-debuginfo-1.3.3.1-13.el7.x86_64
389-ds-base-1.3.3.1-13.el7.x86_64

Comment 35 errata-xmlrpc 2015-03-05 08:27:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0364.html