Bug 707599
Summary: | replication with TLS does not work | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Gabor Szathmari <gabor.szathmari> | |
Component: | openldap | Assignee: | Jan Vcelak <jvcelak> | |
Status: | CLOSED ERRATA | QA Contact: | David Spurek <dspurek> | |
Severity: | high | Docs Contact: | ||
Priority: | medium | |||
Version: | 6.1 | CC: | ccsadmins, Dario.Palmisano, dspurek, ebenes, josh.mullis, jplans, jvcelak, jwest, kdudka, lutz.willek, omoris, rmeggins, rvokal, trevor.hemsley, tsmetana | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | openldap-2.4.23-29.el6 | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
OpenLDAP server configured for replication. TLS is enabled both for accepting connections from remote peers and for TLS client authentication to the other replicas. Different TLS configuration used for server and for connecting to the replicas.
Consequence:
Connecting to a replica can fail due to TLS certificate lookup errors or due to unknown PKCS#11 TLS errors.
Fix:
A set of patches applied, makes multiple TLS LDAP contexts within one process possible without affecting the others.
Result:
Multiple LDAP TLS contexts with different configurations within one process are possible. OpenLDAP replication in described conditions is possible.
|
Story Points: | --- | |
Clone Of: | ||||
: | 795763 (view as bug list) | Environment: | ||
Last Closed: | 2013-02-21 09:45:19 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 783445 | |||
Bug Blocks: | 795763, 836160 | |||
Attachments: |
Created attachment 500818 [details]
OpenLDAP debug log
Created attachment 500822 [details]
Certificate Authority
Created attachment 500823 [details]
Cert
Comment on attachment 500822 [details] Certificate Authority >Certificate: > Data: > Version: 3 (0x2) > Serial Number: > 7f:a7:7a:5d:00:01:00:00:17:18 > Signature Algorithm: sha1WithRSAEncryption > Issuer: DC=hu, DC=pear, DC=pearit, CN=pearIT CA service > Validity > Not Before: Feb 16 08:52:50 2011 GMT > Not After : Feb 15 08:52:50 2013 GMT > Subject: C=HU, ST=Budapest, L=Budapest, O=pear qwe Hungary Zrt., OU=IT Unix/RWE, CN=ldapserver1.pear.hu/emailAddress=Administrators > Subject Public Key Info: > Public Key Algorithm: rsaEncryption > Public-Key: (2048 bit) > Modulus: > 00:ba:18:6f:8f:b5:f8:c1:c8:5a:72:d9:30:3d:a6: > db:a8:35:b4:7d:04:a0:6c:bf:d4:b6:34:74:53:ba: > d9:60:ef:7a:58:bc:39:28:87:15:87:16:b4:19:2c: > > Exponent: 65537 (0x10001) > X509v3 extensions: > X509v3 Subject Key Identifier: > 7A:45:B5:D3:47:EE:7B:AE:F3:E3:CA:B7:5E:AA:A6:22 > X509v3 Authority Key Identifier: > keyid:46:B3:50:80:6C:82:FB:7D:25:76:9C:02:D3:74:40:7C > > X509v3 CRL Distribution Points: > > Full Name: > URI:ldap:///CN=pearIT%20CA[...]Class=cRLDistributionPoint > URI:http://xys001tat4.pearit.pear.hu[...].crl > > Authority Information Access: > CA Issuers - URI:ldap:///CN=pearIT%20CA%[...]cationAuthority > CA Issuers - URI:http://xys001ta[...]ce(1).crt > > 1.3.6.1.4.1.311.20.2: > ...W.e.b.S.e.r.v.e.r > X509v3 Basic Constraints: critical > CA:FALSE > X509v3 Key Usage: > Digital Signature, Key Encipherment > X509v3 Extended Key Usage: > TLS Web Server Authentication > Signature Algorithm: sha1WithRSAEncryption > 4f:f5:a9:27:28:d5:7a:35:70:4a:ff:01:62:10:96:6b:5e:2e: > e6:d2:b1:77:34:c2:9d:f1:59:75:c5:82:e0:12:eb:2a:f3:86: > 81:54:9e:ca:00:77:9b:b5:c0:ac:71:a6:f3:06:ea:18:31:27: > Rich, have you ever encountered that NSS error message? -12272 SSL_ERROR_BAD_MAC_ALERT "SSL peer reports incorrect Message Authentication Code." The remote system has reported that it received a message with a bad Message Authentication Code from the local system. This may indicate that an attack on that server is underway. (In reply to comment #6) > Rich, have you ever encountered that NSS error message? > > -12272 SSL_ERROR_BAD_MAC_ALERT > > "SSL peer reports incorrect Message Authentication Code." > The remote system has reported that it received a message with a bad Message > Authentication Code from the local system. This may indicate that an attack on > that server is underway. Looks like https://bugzilla.redhat.com/show_bug.cgi?id=596058 Okay, it seems that the certificate which was working perfectly are not compatible somehow with the new TLS backend. It seems it is a very nasty regression that happened in RHEL 6.1. Before upgrade: # gnutls-cli ldapserver2.server.hu --x509cafile /etc/openldap/cacerts/ca.crt -V -p 636 Processed 1 CA certificate(s). Resolving 'ldapserver2.server.hu'... Connecting to '10.222.1.2:636'... - Server's trusted authorities: [0]: DC=hu,DC=server,DC=it,CN=IT CA service - Successfully sent 0 certificate(s) to server. - Server has requested a certificate. - Certificate type: X.509 - Got a certificate list of 2 certificates. - Certificate[0] info: - X.509 Certificate Information: Version: 3 Serial Number (hex): 7fa842f6000100001719 Issuer: DC=hu,DC=server,DC=it,CN=IT CA service [...] After the upgrade: # gnutls-cli ldapserver1.server.hu --x509cafile /etc/openldap/cacerts/ca.crt -V -p 636 Processed 1 CA certificate(s). Resolving 'ldapserver1.server.hu'... Connecting to '10.222.1.1:636'... - Server's trusted authorities: [0]: DC=hu,DC=server,DC=it,CN=IT CA service [1]: DC=hu,DC=server,DC=it,CN=Users,CN=nx [2]: C=HU,ST=Budapest,L=Budapest,O=Organization,OU=IT Unix/BEA,CN=ldapserver1.server.hu,EMAIL=Administrators - Successfully sent 0 certificate(s) to server. *** Fatal error: A TLS fatal alert has been received. *** Received alert [20]: Bad record MAC *** Handshake has failed GNUTLS ERROR: A TLS fatal alert has been received. (In reply to comment #3) > Created attachment 500822 [details] > Certificate Authority Note: this is not a CA (X509v3 Basic Constraints: critical CA:FALSE) (In reply to comment #4) > Created attachment 500823 [details] > Cert This is a CA cert Before the upgrade: - Server's trusted authorities: [0]: DC=hu,DC=server,DC=it,CN=IT CA service After: - Server's trusted authorities: [0]: DC=hu,DC=server,DC=it,CN=IT CA service [1]: DC=hu,DC=server,DC=it,CN=Users,CN=nx [2]: C=HU,ST=Budapest,L=Budapest,O=Organization,OU=IT Unix/BEA,CN=ldapserver1.server.hu,EMAIL=Administrators Any idea why the server now has 2 more trusted authorities? Still trying to reproduce. This is what I've done so far: 1) using the script /etc/pki/tls/misc/CA and my own custom openssl.cnf file, create a CA cert and use that CA to issue a server cert and key. I changed openssl.cnf to even generate the MS extensions so that the CA cert and the server cert look as much like the posted ones as possible. So that I have the following files: ~/save/CA/cacert.pem - contains the 1 new CA cert ~/save/CA/certs/localhost.localdomain.cert.pem - contains the 1 server cert ~/save/CA/certs/localhost.localdomain.key.pem - contains the 1 server cert 2) set up slapd with the following in cn=config (I'm using slapd.d not slapd.conf but it should not make any difference) olcTLSCACertificateFile: /path/to/save/CA/cacert.pem olcTLSCertificateFile: /path/to/save/CA/certs/localhost.localdomain.cert.pem olcTLSCertificateKeyFile: /path/to/save/CA/certs/localhost.localdomain.key.pem 3) gnutls-cli localhost.localdomain --x509cafile ~/save/CA/cacert.pem -V -p SSLPORT Processed 1 CA certificate(s). Resolving 'localhost.localdomain'... Connecting to '127.0.0.1:1201'... - Certificate type: X.509 - Got a certificate list of 2 certificates. - Certificate[0] info: - X.509 Certificate Information: ... the server cert details ... - Certificate[1] info: - X.509 Certificate Information: Version: 3 ... the CA cert details ... Other Information: MD5 fingerprint: 6571d10d46b21a60876bc2a136345eac SHA-1 fingerprint: f9ec0573af8eccbe4151cc30910834f56e779917 Public Key Id: 8c6058575769995ab034e5f3380691137baf3825 - The hostname in the certificate matches 'localhost.localdomain'. - Peer's certificate is trusted - Version: TLS1.0 - Key Exchange: RSA - Cipher: AES-256-CBC - MAC: SHA1 - Compression: NULL - Session ID: 20:96:60:53:BF:73:DD:91:F4:3C:85:E2:7D:8E:DE:89:F8:27:F1:53:99:A5:12:04:AD:AE:07:1B:EE:C6:05:41 - Handshake was completed - Simple Client Mode: openssl s_client -connect localhost.localdomain:1201 -CAfile ~/save/CA/cacert.pem gives similar output LDAPTLS_CACERT=~/save/CA/cacert.pem bin/ldapsearch -x -H ldaps://localhost.localdomain:1201/ -s base -b "" works. So I must be missing something about how you have your cert files or server configured. It seems that I might be able to solve the problem today. Just to clarify, I used 3 certificates: one CA, one for the server and one for replication (for SASL EXTERNAL authentication) previously. These were generated by a Windows Server CA. When the upgraded to the OpenLDAP with the new GNUTLS backend, every single connection from any client (Linux machines, Apache Directory Studio) were denied with the "Bad record MAC" error message. It was caused by the invalid client certificate defined in the olcSyncrepl attribute. Then, I replaced the server certificate and tried to use it for the replication by not defining any certificate in the olcSyncrepl attribute (so it falls back to the cert defined in olcTLSCertificateFile). Now the connections are not rejected but the replication fails because it tries to use a server certificate instead of a client one. Now, I am waiting for a new client certificate and hope it is going to solve the problem. In summary, I see two serious problems here: - Invalid client certificate (in olcSyncrepl) used for replication (while olcTLSCACertificateFile and olcTLSCertificateFile is valid) causes all SSL connections to fail with the error message "Bad record MAC" - Not configuring a client certificate in the olcSyncrepl attribute in order to fall back to olcTLSCertificateFile, replication fails because its a server certificate. It is normal but not documented well, I reckon. I forgot to mention that the new certificate requests are generated by certtool now and it is signed by Windows Server. Previously, the whole certificate was generated by Windows. I really don't know. The ldapserver1 and ldapserver2 were identical, the former was upgraded to RHEL 6.1. So I suppose the new backend cases these extra lines. [0]: CA Certificate defined in olcTLSCACertificateFile [1]: Cert for replication defined in olcSyncrepl [2]: Server certificate in olcTLSCertificateFile (In reply to comment #11) > Before the upgrade: > - Server's trusted authorities: > [0]: DC=hu,DC=server,DC=it,CN=IT CA service > After: > - Server's trusted authorities: > [0]: DC=hu,DC=server,DC=it,CN=IT CA service > [1]: DC=hu,DC=server,DC=it,CN=Users,CN=nx > [2]: C=HU,ST=Budapest,L=Budapest,O=Organization,OU=IT > Unix/BEA,CN=ldapserver1.server.hu,EMAIL=Administrators > Any idea why the server now has 2 more trusted authorities? Sorry, it seems I mixed up the files, I upload them again. (In reply to comment #10) > (In reply to comment #4) > > Created attachment 500823 [details] > > Cert > This is a CA cert Created attachment 502739 [details]
Certificate Authority (this is the correct file)
Created attachment 502740 [details]
Certificate (this is the correct file)
Created attachment 502746 [details]
This cert was used to replicate and caused the Bad addr MAC error
This cert was used to replicate and caused the Bad addr MAC error
Unfortunately, the new client certificates are not working. Now I am using GSSAPI authentication instead of EXTERNAL for the replication thus it works as expected. > When the upgraded to the OpenLDAP with the new GNUTLS backend, every single Mozilla NSS, not GNUTLS. > connection from any client (Linux machines, Apache Directory Studio) were > denied with the "Bad record MAC" error message. It was caused by the invalid > client certificate defined in the olcSyncrepl attribute. Can you be more specific? How is the client certificate invalid? I don't see anything obvious. You can create a cert that is used as both a server cert and a client cert. I don't know if certtool/Microsoft CA will allow you to do that. What is your olcSyncrepl configuration line? In order to use client certificate auth, you should have had something like this: olcSyncrepl: provider=ldap://host ... bindmethod=sasl saslmech=EXTERNAL starttls=critical tls_cert=/path/to/cert/used/to/replicate tls_key=/path/to/cert/used/to/replicate/corresponding/private/key/file tls_cacert=/path/to/cacert.pem With the NSS backend you could omit tls_cacert and it should fallback. My previously working attribute was the following: olcSyncrepl: {0}rid=001 provider=ldaps://ldapserver1.pear.hu type=refreshAndPersist retry="5 5 300 +" searchbase="dc=pear,dc=hu" bindmethod=sasl saslmech=EXTERNAL tls_cert=/etc/openldap/cacerts/replicator.crt tls_key=/etc/openldap/cacerts/replicator.key tls_cacert=/etc/openldap/cacerts/PearITCAservice.crt timeout=1 The replicator.crt was uploaded here in Comment #19 I managed to reproduce. And I'm also getting some segfaults in various situations. And the configuration is valid, the replication works if I switch to other authentication method. I will attach some scripts to reproduce soon. I'm getting the same error codes only when PEM certificates are used. If I create MozNSS certdb and import the certificates, the replication works fine. However I'm getting segfaults from time to time. #0 __strrchr_sse2 () at ../sysdeps/x86_64/strrchr.S:33 #1 0x00007ffff7f66efa in tlsm_get_certdb_prefix (certdir=0x1 <Address 0x1 out of bounds>, realcertdir=0x7ffff1623cc0, prefix=0x7ffff1623cc8) at tls_m.c:1525 #2 0x00007ffff7f67102 in tlsm_deferred_init (arg=0x7fffe4108dc0) at tls_m.c:1612 #3 0x00007ffff7f67f8e in tlsm_deferred_ctx_init (arg=0x7fffe4108dc0) at tls_m.c:2102 #4 0x00007ffff5f2cf55 in PR_CallOnceWithArg (once=0x7fffe4108df8, func=<value optimized out>, arg=<value optimized out>) at ../../../mozilla/nsprpub/pr/src/misc/prinit.c:832 #5 0x00007ffff7f68c68 in tlsm_session_new (ctx=0x7fffe4108dc0, is_server=0) at tls_m.c:2466 #6 0x00007ffff7f6211f in alloc_handle (ctx_arg=0x7fffe4108dc0, is_server=0) at tls2.c:296 #7 0x00007ffff7f62277 in ldap_int_tls_connect (ld=0x7fffe8112930, conn=0x7fffe800d850) at tls2.c:341 #8 0x00007ffff7f634e5 in ldap_int_tls_start (ld=0x7fffe8112930, conn=0x7fffe800d850, srv=0x7fffe800d7b0) at tls2.c:833 #9 0x00007ffff7f34457 in ldap_int_open_connection (ld=0x7fffe8112930, conn=0x7fffe800d850, srv=0x7fffe800d7b0, async=0) at open.c:417 #10 0x00007ffff7f4da8e in ldap_new_connection (ld=0x7fffe8112930, srvlist=0x7fffe81129f8, use_ldsb=1, connect=1, bind=0x0) at request.c:425 #11 0x00007ffff7f3386b in ldap_open_defconn (ld=0x7fffe8112930) at open.c:41 #12 0x00007ffff7f3c359 in ldap_int_sasl_bind (ld=0x7fffe8112930, dn=0x7ffff8312870 "cn=manager,dc=redhat,dc=bug", mechs=0x7ffff8312760 "EXTERNAL", sctrls=0x0, cctrls=0x0, flags=2, interact=0x7ffff7f237d0 <lutil_sasl_interact>, defaults=0x7fffe800d2e0) at cyrus.c:421 #13 0x00007ffff7f40c23 in ldap_sasl_interactive_bind_s (ld=0x7fffe8112930, dn=0x7ffff8312870 "cn=manager,dc=redhat,dc=bug", mechs=0x7ffff8312760 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x7ffff7f237d0 <lutil_sasl_interact>, defaults=0x7fffe800d2e0) at sasl.c:464 #14 0x00007ffff7dbd191 in slap_client_connect (ldp=0x7ffff8312690, sb=0x7ffff8312470) at ../../../servers/slapd/config.c:1965 #15 0x00007ffff7e49b44 in do_syncrep1 (op=0x7ffff16244b0, si=0x7ffff8312440) at ../../../servers/slapd/syncrepl.c:571 #16 0x00007ffff7e4d1fd in do_syncrepl (ctx=0x7ffff1624b70, arg=0x7ffff83129f0) at ../../../servers/slapd/syncrepl.c:1422 #17 0x00007ffff7f3214e in ldap_int_thread_pool_wrapper (xpool=0x7ffff82bba70) at ../../../libraries/libldap_r/tpool.c:685 #18 0x00007ffff5cfd7f1 in start_thread (arg=0x7ffff1625700) at pthread_create.c:301 #19 0x00007ffff582f70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 I guess, the TLS configuration is propagated wrong somewhere. As you may see, certdir in frame #0 - address out of bounds, while: (gdb) frame 15 #15 0x00007ffff7e49b44 in do_syncrep1 (op=0x7ffff16244b0, si=0x7ffff8312440) at ../../../servers/slapd/syncrepl.c:571 571 rc = slap_client_connect( &si->si_ld, &si->si_bindconf ); (gdb) p si->si_bindconf $5 = {sb_uri = {bv_len = 35, bv_val = 0x7ffff8312710 "ldaps://ldapserver2.redhat.bug:2636"}, sb_version = 3, sb_tls = 0, sb_method = 163, sb_timeout_api = 1, sb_timeout_net = 0, sb_binddn = {bv_len = 27, bv_val = 0x7ffff8312870 "cn=manager,dc=redhat,dc=bug"}, sb_cred = {bv_len = 0, bv_val = 0x0}, sb_saslmech = {bv_len = 8, bv_val = 0x7ffff8312760 "EXTERNAL"}, sb_secprops = 0x0, sb_realm = {bv_len = 0, bv_val = 0x0}, sb_authcId = {bv_len = 0, bv_val = 0x0}, sb_authzId = {bv_len = 0, bv_val = 0x0}, sb_keepalive = {sk_idle = 0, sk_probes = 0, sk_interval = 0}, sb_tls_ctx = 0x7fffe4108dc0, sb_tls_cert = 0x7ffff8312740 "replicator", sb_tls_key = 0x0, sb_tls_cacert = 0x0, sb_tls_cacertdir = 0x7ffff83127c0 "/root/bz707599/certdb", sb_tls_reqcert = 0x7ffff8312970 "demand", sb_tls_cipher_suite = 0x0, sb_tls_protocol_min = 0x0, sb_tls_do_init = 0} And the same issue can be reproduced on latest 2.4.28 in Fedora Rawhide. (In reply to comment #26) > I'm getting the same error codes only when PEM certificates are used. If I > create MozNSS certdb and import the certificates, the replication works fine. > However I'm getting segfaults from time to time. The segfaulting is a different problem. But maybe related, it needs to be fixed first. After resolving the segfaults by fixing bug #783445, this bug is no longer present when NSS certdb is used. The problem appears only with PEM certificates. I'm trying to find the cause. Created attachment 558640 [details] reproducer - configuration and scripts Attaching configuration files and script for reproduction. The replication works fine when MozNSS DB is used as a certificate storage (with fix for #783445). It fails only with PEM certificates. Untar the reproducer to /root directory. Append included hosts file to /etc/hosts. Run "make" in server1 and server2 directories to generate the configuration. Start the servers with make run (or make debug). server1 TLS messages: TLS: using moznss security dir /root/bz707599/certdb prefix . TLS: certificate [E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ] is valid TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid TLS certificate verification: subject: CN=replicator,CN=users,DC=redhat,DC=bug, issuer: CN=Bug CA,DC=redhat,DC=bug, cipher: AES-256, security level: high, secret key bits: 256, total key bits: 256, cache hits: 0, cache misses: 1, cache not reusable: 0 TLS: could not initialize moznss - error -8018:Unknown PKCS #11 error.. TLS: could not perform TLS system initialization. TLS: error: could not initialize moznss security context - error -8018:Unknown PKCS #11 error. TLS: can't create ssl handle. TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid TLS certificate verification: subject: CN=replicator,CN=users,DC=redhat,DC=bug, issuer: CN=Bug CA,DC=redhat,DC=bug, cipher: AES-256, security level: high, secret key bits: 256, total key bits: 256, cache hits: 0, cache misses: 2, cache not reusable: 0 TLS: error: could not initialize moznss security context - error -5925:The one-time function was previously called and failed. Its error code is no longer available TLS: can't create ssl handle. TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid TLS certificate verification: subject: CN=replicator,CN=users,DC=redhat,DC=bug, issuer: CN=Bug CA,DC=redhat,DC=bug, cipher: AES-256, security level: high, secret key bits: 256, total key bits: 256, cache hits: 0, cache misses: 3, cache not reusable: 0 server2 TLS messages: TLS: loaded CA certificate file /root/bz707599/certs/cacert.pem. TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid TLS: certificate [E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ] is valid TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid TLS certificate verification: subject: E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ, issuer: CN=Bug CA,DC=redhat,DC=bug, cipher: AES-256, security level: high, secret key bits: 256, total key bits: 256, cache hits: 0, cache misses: 0, cache not reusable: 0 TLS: loaded CA certificate file /root/bz707599/certs/cacert.pem. TLS: certificate [E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ] is valid TLS: error: accept - force handshake failure: errno 0 - moznss error -5938 TLS: can't accept: TLS error -5938:Encountered end of file. It also breaks client connections: + ldapsearch -Y EXTERNAL -H ldaps://ldapserver1.redhat.bug:1636 -b dc=redhat,dc=bug -LLL SASL/EXTERNAL authentication started SASL username: cn=replicator,cn=users,dc=redhat,dc=bug SASL SSF: 0 No such object (32) + ldapsearch -Y EXTERNAL -H ldaps://ldapserver2.redhat.bug:2636 -b dc=redhat,dc=bug -LLL ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1) additional info: TLS error -12272:SSL peer reports incorrect Message Authentication Code. From some reason, NSS_InitContext fails when creating the connection to the remote server. I'm debugging on Fedora Rawhdie, but the problem is the same on RHEL6: (gdb) bt #0 nss_Init (configdir=0x5555557df080 "", certPrefix=0x5555557df080 "", keyPrefix=0x5555557df080 "", secmodName=0x5555557dfcc2 "secmod.db", updateDir=0x7ffff6d22620 "", updCertPrefix=0x7ffff6d22620 "", updKeyPrefix=0x7ffff6d22620 "", updateID=0x7ffff6d22620 "", updateName=0x7ffff6d22620 "", initContextPtr=0x7ffff130ab88, initParams=0x7ffff130abf0, readOnly=1, noCertDB=1, noModDB=1, forceOpen=0, noRootInit=1, optimizeSpace=0, noSingleThreadedModules=0, allowAlreadyInitializedModules=0, dontFinalizeModules=0) at nssinit.c:566 #1 0x00007ffff6c0162d in NSS_InitContext (configdir=0x5555557df080 "", certPrefix=0x5555557df080 "", keyPrefix=0x5555557df080 "", secmodName=0x5555557dfcc2 "secmod.db", initParams=0x7ffff130abf0, flags=7) at nssinit.c:861 #2 0x0000555555748b2b in tlsm_deferred_init (arg=0x7fffe4108eb0) at tls_m.c:1666 #3 0x0000555555749bac in tlsm_deferred_ctx_init (arg=0x7fffe4108eb0) at tls_m.c:2133 #4 0x00007ffff6399445 in PR_CallOnceWithArg () from /lib64/libnspr4.so #5 0x000055555574a8a4 in tlsm_session_new (ctx=0x7fffe4108eb0, is_server=0) at tls_m.c:2500 #6 0x0000555555743842 in alloc_handle (ctx_arg=0x7fffe4108eb0, is_server=0) at tls2.c:288 #7 0x0000555555743972 in ldap_int_tls_connect (ld=0x7fffe8100910, conn=0x7fffe8108ec0) at tls2.c:333 #8 0x0000555555744bd5 in ldap_int_tls_start (ld=0x7fffe8100910, conn=0x7fffe8108ec0, srv=0x7fffe8108de0) at tls2.c:834 #9 0x0000555555717920 in ldap_int_open_connection (ld=0x7fffe8100910, conn=0x7fffe8108ec0, srv=0x7fffe8108de0, async=0) at open.c:437 #10 0x000055555572ecb6 in ldap_new_connection (ld=0x7fffe8100910, srvlist=0x7fffe8100a30, use_ldsb=1, connect=1, bind=0x0, m_req=0, m_res=0) at request.c:480 #11 0x0000555555716b43 in ldap_open_defconn (ld=0x7fffe8100910) at open.c:41 #12 0x000055555571ec83 in ldap_int_sasl_bind (ld=0x7fffe8100910, dn=0x555555b47ff0 "cn=manager,dc=redhat,dc=bug", mechs=0x555555b47ee0 "EXTERNAL", sctrls=0x0, cctrls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, defaults=0x7fffe8108e80, result=0x0, rmech=0x7ffff130b360, msgid=0x7ffff130b354) at cyrus.c:425 #13 0x0000555555721db7 in ldap_sasl_interactive_bind (ld=0x7fffe8100910, dn=0x555555b47ff0 "cn=manager,dc=redhat,dc=bug", mechs=0x555555b47ee0 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, defaults=0x7fffe8108e80, result=0x0, rmech=0x7ffff130b360, msgid=0x7ffff130b354) at sasl.c:474 #14 0x0000555555721e6a in ldap_sasl_interactive_bind_s (ld=0x7fffe8100910, dn=0x555555b47ff0 "cn=manager,dc=redhat,dc=bug", mechs=0x555555b47ee0 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, defaults=0x7fffe8108e80) at sasl.c:511 #15 0x000055555559c5d6 in slap_client_connect (ldp=0x555555b47e00, sb=0x555555b47be0) at ../../../servers/slapd/config.c:2040 #16 0x0000555555628d19 in do_syncrep1 (op=0x7ffff130b480, si=0x555555b47bb0) at ../../../servers/slapd/syncrepl.c:611 #17 0x000055555562c5ab in do_syncrepl (ctx=0x7ffff130bb60, arg=0x555555b481b0) at ../../../servers/slapd/syncrepl.c:1510 #18 0x000055555571546d in ldap_int_thread_pool_wrapper (xpool=0x555555ae6b70) at ../../../libraries/libldap_r/tpool.c:685 #19 0x00007ffff73ead34 in start_thread (arg=0x7ffff130c700) at pthread_create.c:309 #20 0x00007ffff5ca4f4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) fin Run till exit from #0 nss_Init (configdir=0x5555557df080 "", certPrefix=0x5555557df080 "", keyPrefix=0x5555557df080 "", secmodName=0x5555557dfcc2 "secmod.db", updateDir=0x7ffff6d22620 "", updCertPrefix=0x7ffff6d22620 "", updKeyPrefix=0x7ffff6d22620 "", updateID=0x7ffff6d22620 "", updateName=0x7ffff6d22620 "", initContextPtr=0x7ffff130ab88, initParams=0x7ffff130abf0, readOnly=1, noCertDB=1, noModDB=1, forceOpen=0, noRootInit=1, optimizeSpace=0, noSingleThreadedModules=0, allowAlreadyInitializedModules=0, dontFinalizeModules=0) at nssinit.c:566 0x00007ffff6c0162d in NSS_InitContext (configdir=0x5555557df080 "", certPrefix=0x5555557df080 "", keyPrefix=0x5555557df080 "", secmodName=0x5555557dfcc2 "secmod.db", initParams=0x7ffff130abf0, flags=7) at nssinit.c:861 861 rv = nss_Init(configdir, certPrefix, keyPrefix, secmodName, Value returned is $2 = SECFailure what version of nss? There are some thread safety issues in nss_Init being fixed in NSS 3.13.2 - I believe fedora is using nss 3.13.1? Yes, I have nss-3.13.1-13.fc17. I can try to get fresh nss from upstream repository (if there is any?). (In reply to comment #31) > Yes, I have nss-3.13.1-13.fc17. I can try to get fresh nss from upstream > repository (if there is any?). Not sure. Check with emaldonado (In reply to comment #30) > what version of nss? There are some thread safety issues in nss_Init being > fixed in NSS 3.13.2 - I believe fedora is using nss 3.13.1? I built NSS packages from upstream CVS (NSS_3_13_2_BETA2 tag) and it didn't help. Still the same behavior. I also added a mutex to protect the whole NSS initialization with no success. I do not think this is a race. So it looks as though the initial problem is this: TLS: could not initialize moznss - error -8018:Unknown PKCS #11 error.. The next step is to step through this in the debugger. I don't know in how many places the error SEC_ERROR_UNKNOWN_PKCS11_ERROR can be returned, but set break points at places that look like candidates for the error return. This is interesting as well: additional info: TLS error -12272:SSL peer reports incorrect Message Authentication Code. We've seen this error before when the CA/server cert is using a MAC/signature algorithm not supported on the client, or vice versa. But if all of the systems are using recent versions of NSS, this is not likely to be a problem. Created attachment 559594 [details]
patch resolving SEC_ERROR_UNKNOWN_PKCS11_ERROR
Finally, some update. SEC_ERROR_UNKNOWN_PKCS11_ERROR was in fact internal CKR_CRYPTOKI_ALREADY_INITIALIZED error. I added NSS_INIT_PK11RELOAD to init flags and this problem is gone. Attaching the patch.
However, I'm still getting invalid MAC error.
And in addition all SSL connections deadlocks from time to time. One thread seems to wait for some data:
#0 0x00007ffff73f105d in read () at ../sysdeps/unix/syscall-template.S:82
#1 0x00005555557568de in sb_stream_read (sbiod=0x7fffe417bb80, buf=0x7fffe41a1f18, len=5) at ../../../libraries/liblber/sockbuf.c:493
#2 0x000055555575730d in sb_debug_read (sbiod=0x7fffe414f490, buf=0x7fffe41a1f18, len=5) at ../../../libraries/liblber/sockbuf.c:829
#3 0x000055555574b6d2 in tlsm_PR_Recv (fd=0x7fffe414f820, buf=0x7fffe41a1f18, len=5, flags=0, timeout=4294967295) at tls_m.c:2900
#4 0x00007ffff71bd8fa in ssl_DefRecv (ss=0x7fffe41a1b40, buf=0x7fffe41a1f18 "", len=5, flags=0) at ssldef.c:94
#5 0x00007ffff71b6a6a in ssl3_GatherData (ss=0x7fffe41a1b40, gs=0x7fffe41a1ec8, flags=0) at ssl3gthr.c:90
#6 0x00007ffff71b6c42 in ssl3_GatherCompleteHandshake (ss=0x7fffe41a1b40, flags=0) at ssl3gthr.c:196
#7 0x00007ffff71b901c in ssl_GatherRecord1stHandshake (ss=0x7fffe41a1b40) at sslcon.c:1253
#8 0x00007ffff71c2f9c in ssl_Do1stHandshake (ss=0x7fffe41a1b40) at sslsecur.c:153
#9 0x00007ffff71c38e1 in SSL_ForceHandshake (fd=0x7fffe414f7e0) at sslsecur.c:423
#10 0x000055555574aa71 in tlsm_session_accept_or_connect (session=0x7fffe414f7e0, is_accept=0) at tls_m.c:2547
#11 0x000055555574aba6 in tlsm_session_connect (ld=0x7fffe4103240, session=0x7fffe414f7e0) at tls_m.c:2582
#12 0x0000555555743ad3 in ldap_int_tls_connect (ld=0x7fffe4103240, conn=0x7fffe41983c0) at tls2.c:358
#13 0x0000555555744bd5 in ldap_int_tls_start (ld=0x7fffe4103240, conn=0x7fffe41983c0, srv=0x7fffe41964d0) at tls2.c:834
#14 0x0000555555717920 in ldap_int_open_connection (ld=0x7fffe4103240, conn=0x7fffe41983c0, srv=0x7fffe41964d0, async=0) at open.c:437
#15 0x000055555572ecb6 in ldap_new_connection (ld=0x7fffe4103240, srvlist=0x7fffe4102490, use_ldsb=1, connect=1, bind=0x0, m_req=0,
m_res=0) at request.c:480
#16 0x0000555555716b43 in ldap_open_defconn (ld=0x7fffe4103240) at open.c:41
#17 0x000055555571ec83 in ldap_int_sasl_bind (ld=0x7fffe4103240, dn=0x555555b48050 "cn=manager,dc=redhat,dc=bug",
mechs=0x555555b47f40 "EXTERNAL", sctrls=0x0, cctrls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>,
defaults=0x7fffe4173250, result=0x0, rmech=0x7ffff1af4360, msgid=0x7ffff1af4354) at cyrus.c:425
#18 0x0000555555721db7 in ldap_sasl_interactive_bind (ld=0x7fffe4103240, dn=0x555555b48050 "cn=manager,dc=redhat,dc=bug",
mechs=0x555555b47f40 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>,
defaults=0x7fffe4173250, result=0x0, rmech=0x7ffff1af4360, msgid=0x7ffff1af4354) at sasl.c:474
#19 0x0000555555721e6a in ldap_sasl_interactive_bind_s (ld=0x7fffe4103240, dn=0x555555b48050 "cn=manager,dc=redhat,dc=bug",
mechs=0x555555b47f40 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>,
defaults=0x7fffe4173250) at sasl.c:511
#20 0x000055555559c5d6 in slap_client_connect (ldp=0x555555b47e60, sb=0x555555b47c40) at ../../../servers/slapd/config.c:2040
#21 0x0000555555628d19 in do_syncrep1 (op=0x7ffff1af4480, si=0x555555b47c10) at ../../../servers/slapd/syncrepl.c:611
#22 0x000055555562c5ab in do_syncrepl (ctx=0x7ffff1af4b60, arg=0x555555b48210) at ../../../servers/slapd/syncrepl.c:1510
#23 0x000055555571546d in ldap_int_thread_pool_wrapper (xpool=0x555555ae6b70) at ../../../libraries/libldap_r/tpool.c:685
#24 0x00007ffff73ead34 in start_thread (arg=0x7ffff1af5700) at pthread_create.c:309
#25 0x00007ffff5ca4f4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
While the second waits for the lock:
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007ffff73ecc56 in _L_lock_858 () from /lib64/libpthread.so.0
#2 0x00007ffff73ecb58 in __pthread_mutex_lock (mutex=0x555555a659c0) at pthread_mutex_lock.c:65
#3 0x0000555555716963 in ldap_pvt_thread_mutex_lock (mutex=0x555555a659c0) at ../../../libraries/libldap_r/thr_posix.c:296
#4 0x000055555574aa65 in tlsm_session_accept_or_connect (session=0x7fffd80025d0, is_accept=1) at tls_m.c:2545
#5 0x000055555574ab83 in tlsm_session_accept (session=0x7fffd80025d0) at tls_m.c:2576
#6 0x0000555555743d05 in ldap_pvt_tls_accept (sb=0x7fffd80025a0, ctx_arg=0x555555b657d0) at tls2.c:421
#7 0x00005555555a89bb in connection_read (s=19, cri=0x7fffeb7fdb10) at ../../../servers/slapd/connection.c:1355
#8 0x00005555555a84ef in connection_read_thread (ctx=0x7fffeb7fdb60, argv=0x13) at ../../../servers/slapd/connection.c:1267
#9 0x000055555571546d in ldap_int_thread_pool_wrapper (xpool=0x555555ae6b70) at ../../../libraries/libldap_r/tpool.c:685
#10 0x00007ffff73ead34 in start_thread (arg=0x7fffeb7fe700) at pthread_create.c:309
#11 0x00007ffff5ca4f4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(In reply to comment #35) > And in addition all SSL connections deadlocks from time to time. I would not call it a deadlock. The tlsm_pem_mutex lock simply does not allow to perform multiple SSL handshakes at a time. If a single SSL handshake for whatever reason blocks, all other connections have to wait. I am not sure whether the waiting time is anyhow bounded by default. Maybe calling SSL_ForceHandshakeWithTimeout() instead of SSL_ForceHandshake() would improve the the responsiveness in this case? I don't understand - why is NSS giving CKR_CRYPTOKI_ALREADY_INITIALIZED? Is it really already initialized? If so, why haven't we been getting this error before? Is it because something changed in NSS recently? The deadlock - Kamil is correct, it's not really a deadlock, it is a lock to prevent multiple threads from calling SSL_ForceHandshake() at the same time - in earlier versions of NSS this was not thread safe. However, many thread safety issues have been addressed in NSS 3.13, so this may be ok, but perhaps for the NSS db implementation only - it may still be thread-unsafe when using PEM. (In reply to comment #37) > I don't understand - why is NSS giving CKR_CRYPTOKI_ALREADY_INITIALIZED? Is it > really already initialized? If so, why haven't we been getting this error > before? Is it because something changed in NSS recently? There are at least two TLS contexts used. The first one for accepting connections (olcTLS* options). The second one is used for replication as the client side (olcSyncRepl). Both live in a different thread. This error appears only when PEM module is used, and during the second initialization. If I understand it correctly, the NSS PKCS#11 module, which causes this error, is loaded only when PEM is used. I think we just haven't tested this use case properly when switching to MozNSS. > The deadlock - Kamil is correct, it's not really a deadlock, it is a lock to > prevent multiple threads from calling SSL_ForceHandshake() at the same time - > in earlier versions of NSS this was not thread safe. However, many thread > safety issues have been addressed in NSS 3.13, so this may be ok, but perhaps > for the NSS db implementation only - it may still be thread-unsafe when using > PEM. Yes, I wrote it wrong. I meant it the same. And yes - the mutex is effective only when PEM module is loaded - you wrote the patch: http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=commit;h=8e9af53 I tried sniffing the packets using Wireshark and compared the successful and failing connections: (1) - successful -> Client Hello <- Server Hello, Certificate, Certificate Request, Server Hello Done -> Certificate, Client Key Exchange, Certificate Verify, Change Cipher Spec, Finished <- Change Cipher Spec, Finished ... data ... (2) - failing -> Client Hello <- Server Hello, Certificate, Certificate Request, Server Hello Done -> Certificate, Client Key Exchange, Certificate Verify, Change Cipher Spec, Finished <- Alert (Bad Record MAC) The only difference I found is in "Certificate Request" message. Message in (1) contains only "id-at-commonName=Bug CA,dc=redhat,dc=bug", while the same message in (2) contains "id-at-commonName=replicator,id-at-commonName=users,dc=redhat,dc=bug", "id-at-commonName=Bug CA,dc=redhat,dc=bug", and "pkcs-9-at-emailAddress=root.bug,id-at-commonName=ldapservers.redhat.bug,id-at-organizationalUnitName=Bug,id-at-localityName=Brno,id-at-countryName=CZ". It seems like the syncrepl TLS configuration mixes up with the server TLS configuration. If you use the reproducer I provided, the server started earlier seems to be fine. The secondly started server reports this MAC error. And even other connections to this server are failing: $ ldapsearch -Y EXTERNAL -H ldaps://ldapserver2.redhat.bug:2636 -b dc=redhat,dc=bug -LLL ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1) additional info: TLS error -12272:SSL peer reports incorrect Message Authentication Code. Some progress: There are two TLS context within the server process. The first TLS context is used on the server side to accept connections, the second is used on the client side of syncrepl. While both contexts can have different TLS settings, they share one certificate database (CERTCertDBHandle *): tls_m.c:2140: ctx->tc_certdb = CERT_GetDefaultCertDB(); /* If there is ever a per-context db, change this */ When I set up the server to use server cert from MozNSS db, and PEM certificate for replication, I'm getting this error: TLS: using moznss security dir /root/bz707599/certdb prefix . TLS: error: the certificate server could not be found in the database - error -12285:Unable to find the certificate or key necessary for authentication. TLS: error: unable to find and verify server's cert and key for certificate server TLS: error: could not initialize moznss security context - error -12285:Unable to find the certificate or key necessary for authentication. TLS: can't create ssl handle. The call PK11_FindCertFromNickname in tlsm_find_and_verify_cert_key fails. The first connection created by slapd is syncrepl outbound connection. The initialization of server context happens when there is a first inbound connection. There is a following sequence of TLS function calls: tlsm_deferred_ctx_init (syncrepl) tlsm_deferred_init (syncrepl) tlsm_deferred_ctx_init (server) tlsm_deferred_init (server) tlsm_find_and_verify_cert_key (syncrepl, "PEM Token #0:replicator2.pem - 0") tlsm_find_and_verify_cert_key (syncrepl, "PEM Token #0:replicator2.pem - 0") tlsm_find_and_verify_cert_key (server, "server") When I suspended the syncrepl thread in do_syncrepl function (using gdb), the initialization of server context was successful. And after releasing the syncrepl thread, everything worked well. I think there is kind of a conflict when adding the PEM certificates to the database. Aren't the old replaced somehow? This will explain the error message above, and probably the invalid MAC error - if the key is removed too early. I haven't found any replacement for CERT_GetDefaultCertDB(). And according to mailing lists, implementation of that function was postponed. I do not know, if this information is up to date... Any ideas? We can probably workaround it by not doing deferred initialization for server context. Or, we can disable deferred initialization at all. But I don't know, it if makes sense and if it helps for all use cases. If we can assume that the only versions of NSS we support have NSS_InitContext, we can probably do away with deferred initialization. I implemented it that way because you have to do that in order to support the old style NSS_Initialize which must be done once per process. TLS: using moznss security dir /root/bz707599/certdb prefix . Does this mean there is a valid key/cert database in /root/bz707599/certdb? If so, is the problem that the process is trying to mix using a real NSS key/cert db with PEM certificates? I wonder if the problem is that you cannot add PEM keys/certs once you have a real NSS key/cert db. Or vice versa - perhaps once you have loaded the PEM module and added keys/certs, you cannot use an NSS key/cert db? The way NSS works is that it does not actually have separate "contexts" where keys/certs are stored, independent of other contexts. There is supposed to be only one key/cert context - if you ask NSS to verify a cert, it verifies it against all CA certs it knows about in all contexts. That's why there is only one CERT_GetDefaultCertDB() because there is no support for a per-context key/cert db. (In reply to comment #40) > I think there is kind of a conflict when adding the PEM certificates to the > database. The PEM reader does not add any objects to the NSS internal database. They are held in an independent array owned by the PEM module and exposed though its API. (In reply to comment #41) > TLS: using moznss security dir /root/bz707599/certdb prefix . > > Does this mean there is a valid key/cert database in /root/bz707599/certdb? If > so, is the problem that the process is trying to mix using a real NSS key/cert > db with PEM certificates? Yes. The certificate is there. > I wonder if the problem is that you cannot add PEM > keys/certs once you have a real NSS key/cert db. Or vice versa - perhaps once > you have loaded the PEM module and added keys/certs, you cannot use an NSS > key/cert db? According to the sequence of calls mentioned in comment #40 -- I think the second variant is the case. (In reply to comment #42) > The PEM reader does not add any objects to the NSS internal database. They are > held in an independent array owned by the PEM module and exposed though its > API. Then I will take a deep look, why FindCertFromNickname fails. I made another progress. When tls_certdir in syncrepl configuration differs from olcTLSCACertificatePath in server configuration, I'm getting the same error as above: TLS: error: the certificate server could not be found in the database If there are multiple TLS contexts within the application, PK11_FindCertFromNickname() doesn't work without specifying token description as a prefix. Otherwise some prefix is added for each certificate during the second (and any further) initialization calls. I'm getting certificate nicknames like "NSS Application Token 00000004:server". (I'm not sure if I used correct NSS terminology.) So I believe that adding a token description in the initParams for each NSS_InitContext() call and using this string as a prefix in PK11_FindCertFromNickname() can push us further in solving this bug. I haven't tried yet. (Some numbering or original TLS certdir path might be used as the description.) Or, we can call NSS_InitContext without certdir name, and load each database using SECMOD_OpenUserDB. Do not know, if this is a cleaner solution. I will attach a simple program which demonstrates both approaches. (In reply to comment #44) > When tls_certdir in syncrepl configuration differs from olcTLSCACertificatePath > in server configuration, I'm getting the same error as above: Forgot to mention: both MozNSS cert DBs. Created attachment 566365 [details]
multiple NSS_InitContext() calls demonstration
I have two MozNSS cert DBs (certdb_1, certdb_2). The first one contains only "ca" certificate, the second one contains only "server" certificate.
#define USE_OPENUSERDB 0
without setting initParams.dbTokenDescription
Loaded certificates:
- 0x238ae70 db2:server (E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ)
- 0x238ae40 db1:ca (CN=Bug CA,DC=redhat,DC=bug)
Certificate 'db1:ca', SN: CN=Bug CA,DC=redhat,DC=bug
Certificate 'db2:server', SN: E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ
#define USE_OPENUSERDB 0
Loaded certificates:
- 0x677d90 db2:server (E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ)
- 0x677d60 ca (CN=Bug CA,DC=redhat,DC=bug)
Certificate 'db1:ca', SN: CN=Bug CA,DC=redhat,DC=bug
Certificate 'db2:server', SN: E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ
#define USE_OPENUSERDB 1
Loaded certificates:
- 0x2630e70 db2:server (E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ)
- 0x2630e40 db1:ca (CN=Bug CA,DC=redhat,DC=bug)
Certificate 'db1:ca', SN: CN=Bug CA,DC=redhat,DC=bug
Certificate 'db2:server', SN: E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ
Sorry again, this is correct:
> #define USE_OPENUSERDB 0
> without setting initParams.dbTokenDescription
Loaded certificates:
- 0x1cc0d08 NSS Application Token 00000004:server (E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ)
- 0x1cc0cc0 ca (CN=Bug CA,DC=redhat,DC=bug)
Certificate 'ca', SN: CN=Bug CA,DC=redhat,DC=bug
Cannot find 'server' certificate, error -8174.
(In reply to comment #44) > If there are multiple TLS contexts within the application, > PK11_FindCertFromNickname() doesn't work without specifying token description > as a prefix. Otherwise some prefix is added for each certificate during the > second (and any further) initialization calls. I'm getting certificate > nicknames like "NSS Application Token 00000004:server". (I'm not sure if I used > correct NSS terminology.) Not sure if this is the same issue we were fixing in libcurl, but the symptoms are pretty similar. The solution was to avoid using PK11_FindCertFromNickname() to look for the certificates: https://github.com/bagder/curl/commit/f6980bb After all, we were forced to use PK11_FindCertFromNickname() anyway in our workaround preventing nss from an unnecessary SIGSEGV: https://github.com/bagder/curl/commit/491c5a4 (In reply to comment #48) > Not sure if this is the same issue we were fixing in libcurl, but the symptoms > are pretty similar. The solution was to avoid using > PK11_FindCertFromNickname() to look for the certificates: Thanks for sharing. But this is not applicable in our case unfortunatelly. PK11_FindCertFromNickname() is used to reference certificate nickname in the certdb. And the nickname is taken from the server configuration file. My idea is to set token description during initialization and use it when requesting the certificate. If two different contexts request the same certdb, NSS_InitContext succeeds, but I'm unable to query the get the certificates with the prefix from the second initialization. initParams.dbTokenDescription = "db1"; ctx1 = InitContext("./certdb", "", "", SECMOD_DB, &initParams, NSS_INIT_READONLY); initParams.dbTokenDescription = "db2"; ctx2 = InitContext("./certdb", "", "", SECMOD_DB, &initParams, NSS_INIT_READONLY); initParams.dbTokenDescription = "db3"; ctx2 = InitContext("././certdb", "", "", SECMOD_DB, &initParams, NSS_INIT_READONLY); CERT_FindCertByNickname(certdb, "server") --> works CERT_FindCertByNickname(certdb, "db1:server") --> works CERT_FindCertByNickname(certdb, "db2:server") --> does not work CERT_FindCertByNickname(certdb, "db3:server") --> works (surprise) It makes sense, that only unique certficates are added to the database. But it is unusable because during the second NSS_InitContext, the prefix is added automatically. Which means, that we can either lookup the certificate manually using PK11_ListCerts() and do some "heuristics". Or, use InitContext() with NSS_INIT_NOCERTDB|NSS_INIT_NOMODDB and open the certdb for each OpenLDAP TLS context using SECMOD_OpenUserDB(). Which seems to work always fine when prefixing token description. (In reply to comment #50) > My idea is to set token description during initialization and use it when > requesting the certificate. > > If two different contexts request the same certdb, NSS_InitContext succeeds, > but I'm unable to query the get the certificates with the prefix from the > second initialization. > > initParams.dbTokenDescription = "db1"; > ctx1 = InitContext("./certdb", "", "", SECMOD_DB, &initParams, > NSS_INIT_READONLY); > > initParams.dbTokenDescription = "db2"; > ctx2 = InitContext("./certdb", "", "", SECMOD_DB, &initParams, > NSS_INIT_READONLY); > > initParams.dbTokenDescription = "db3"; > ctx2 = InitContext("././certdb", "", "", SECMOD_DB, &initParams, > NSS_INIT_READONLY); > > CERT_FindCertByNickname(certdb, "server") --> works > CERT_FindCertByNickname(certdb, "db1:server") --> works > CERT_FindCertByNickname(certdb, "db2:server") --> does not work > CERT_FindCertByNickname(certdb, "db3:server") --> works (surprise) > > It makes sense, that only unique certficates are added to the database. But it > is unusable because during the second NSS_InitContext, the prefix is added > automatically. > > Which means, that we can either lookup the certificate manually using > PK11_ListCerts() and do some "heuristics". Or, use InitContext() with > NSS_INIT_NOCERTDB|NSS_INIT_NOMODDB and open the certdb for each OpenLDAP TLS > context using SECMOD_OpenUserDB(). Which seems to work always fine when > prefixing token description. I'm not sure if this is how NSS is intended to work, and I don't think apps are meant to call SECMOD_OpenUserDB() directly. You should talk to Robert Relyea and Elio Maldonado about this. *** Bug 809817 has been marked as a duplicate of this bug. *** Still working on it. Some update: I have a patch which uses SECMOD_OpenUserDB(), now I'm testing if it works will all types of configuration. But there is a problem with NSS_ShutdownContext() which always fails. I found a upstream bug report on this. I will polish and post the patch, hopefully soon. Still working on it. I keep my changes on github in moznss branch: https://github.com/fcelda/openldap Most of the problems seems to be solved. I see no certdb issues with various combinations of TLS config options. But there are still some problems with PEM certificates, I'm still getting SEC_ERROR_UNKNOWN_PKCS11_ERROR and SSL_ERROR_BAD_MAC_ALERT occasionally. Finally, I have a solution, there were too many bugs which were complicating this: - Every MozNSS certdb is opened using SECMOD_OpenUserDB with a token description unique for each context within libldap, certificate lookup by nickname always works in this case. - PEM certificates are not referenced by nickname derived from the certificate file base name, but by their DER value (this is already used in CURL). - There was a problem with uninitialized memory when retrieving the certdb pin which sometimes caused reading the pin file infinitely. - SEC_ERROR_UNKNOWN_PKCS11_ERROR (not only this error) was caused by thread-unsafety, when I locked the whole initialization and cleanup process by a mutex, the errors disappeared. Suppose this will not have any performance impact, we are doing this only once per context. - SSL_ERROR_BAD_MAC_ALERT was caused by PEM module supplying invalid key for the certificate. PEM module expects that all certificates with CKA_TRUST == true are CA certificates. After removing this for non-CA certificates, it works as expected. All these fixes are in my repo on github. Some are already merged upstream. I will fine-tune the rest before submitting them. All my patches were accepted upstream. And I just submitted an update for Fedora 17 (openldap-2.4.31-3.fc17) with these patches applied. Resolved in: openldap-2.4.23-29.el6 Is there some place I might find this updated RPM (or SRPM that I can rebuild)? I believe I am seeing this issue with 3 servers set up to do N-way multi-master replication on CentOS 6.3. Currently using 2.4.23-26.el6_3.2 and what I see is that if I am lucky I can start all 3 servers and everything works correctly. Quite often, I can start 2 out of 3 servers successfully but starting the 3rd one will kill one or both of the others - they all still respond to ldapsearch on ldap:// but requests to ldaps:// hang forever. If only one of the other 2 stops responding then just restarting the 3rd might fix both but if it kills both of the others then I have to stop any 2 and then the remaining one comes back to life. Hello Trevor, I have uploaded SRPM and test packages for x86_64, i686 to: http://people.redhat.com/jvcelak/bz707599/ These packages have been provided by Red Hat for testing purposes only and are NOT supported for any other use. These RPMs may contain changes that are necessary for debugging but that are not appropriate for other uses, or that are not compatible with third-party hardware or software. These RPMs should NOT be deployed for purposes other than testing and debugging. Please test in your environment, and let us know the outcome. Jan Thank Jan, I have downloaded and installed those packages but I still have the same problem so I think I have a different issue. If you had any tips on how I might debug it, I would appreciate it! Hi Trevor. Please, create a separate bug report for that. It would be great if you could provide configuration files for all the servers and testing certificates. If you do not have testing certificates, the textual representation (certutil -d . -L -n certname / openssl x509 -text) will do as well. Feel free to obfuscated any sensitive information. I can try to reproduce. But in general, starting the server manually with -d1 options should give you some clue, where the problem is. (slapd -u ldap -g ldap -F /etc/openldap/slapd.d -d1) Jan Jan, so it appears to be (probably) user error but I don't see anywhere documented that you should use a different certificate on the olcSyncRepl to the one used by each server so I had set each of my 3 servers up with a single certificate each and that was referenced in both olcTLSCertificateFile in cn=config.ldif and in the olcSyncrepl tls_cert= in olcDatabase={0}config.ldif Doing this seems to give approximately a 30% chance of hanging the ldaps listener. I can almost always start all 3 servers if I logon to each box and run `service slapd start` on server 1 then 2 then 3 but running `service slapd restart` on say server 3 will kill the ldaps listener on itself and at least one of the others about 1 time in 3 restarts. I've now set them up with 2 certificates per server, one referenced by olcTLSCertificateFile and then a different one in the olcSyncrepl and this seems to make the problem go away. Not sure if that would be classified as a bug or as user error but at least it's now googleable and may help someone else with the same symptoms in the future. Hello, I am also experiencing the exact same issue with the same setup, and while at first it appeared as though the different cert/key pairs on each server helped, after a few minutes the problem would reoccur when issuing a slapd restart. Thanks -josh Trevor and Josh, please, create a separate bug report for that problem. Jan - I'm sorry. I was redirected from bug 809817. I suppose that bug actually fits closer to my issue. If you need a separate bug filed, can you just reopen 809817 ? Thanks -josh On a side note, would it be possible to provide an RPM of openldap compiled with openssl instead of moznss? That seems to be the fix that's going around the openldap forums. Thanks -josh (In reply to comment #73) > Jan - > I'm sorry. I was redirected from bug 809817. Oh, I see. I will take a look at it. Thank you, Josh. (In reply to comment #74) > On a side note, would it be possible to provide an RPM of openldap compiled > with openssl instead of moznss? > > That seems to be the fix that's going around the openldap forums. I do not want to provide even testing packages compiled with OpenSSL, because there is no plan to support this configuration in future and I guess that Directory Server packages in RHEL rely on OpenLDAP with Mozilla NSS. (But you can easily recompile the package by yourself.) Jan (In reply to comment #75) > (In reply to comment #73) > > Jan - > > I'm sorry. I was redirected from bug 809817. > > Oh, I see. I will take a look at it. Thank you, Josh. > > (In reply to comment #74) > > On a side note, would it be possible to provide an RPM of openldap compiled > > with openssl instead of moznss? > > > > That seems to be the fix that's going around the openldap forums. > > I do not want to provide even testing packages compiled with OpenSSL, > because there is no plan to support this configuration in future and I guess > that Directory Server packages in RHEL rely on OpenLDAP with Mozilla NSS. True, and switching to OpenSSL would cause major short term problems for 389, but for long term, as long as the 389 team had enough lead time, the fact that 389 uses Mozilla NSS should not prevent OpenLDAP from using OpenSSL instead. > (But you can easily recompile the package by yourself.) . . . which is the real "fix" recommended by people on OpenLDAP forums. > > Jan Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-0364.html |
Created attachment 500817 [details] Main configuration file Description of problem: I can't connect to the new version of OpenLDAP with NSS support after the upgrade. I am using two identical OpenLDAP servers for multi master replication. One of them has been upgraded to openldap-2.4.23-15.el6.x86_64 recently. Everything works as expected until the other server (which has not been upgraded yet) connects to replicate. After then, every single client connection (including every authenticating Linux clients, Apache Directory Studio etc.) is rejected. # ldapwhoami -xZZ -vv ldap_initialize( <DEFAULT> ) ldap_start_tls: Connect error (-11) additional info: TLS error -12272:Unknown code ___P 16 Relevant OpenLDAP log: May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 fd=15 ACCEPT from IP=10.112.242.100:36484 (IP=0.0.0.0:389) May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 op=0 EXT oid=1.3.6.1.4.1.1466.20037 May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 op=0 STARTTLS May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 op=0 RESULT oid= err=0 text= May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 fd=15 closed (TLS negotiation failure) Relevant OpenLDAP debug log: connection_get(15): got connid=1000 connection_read(15): checking for input on id=1000 TLS: error: accept - force handshake failure: errno 11 - moznss error -12273 TLS: can't accept: TLS error -12273:Unknown code ___P 15. connection_read(15): TLS accept failure error=-1 id=1000, closing connection_close: conn=1000 sd=15 Version-Release number of selected component (if applicable): nss-3.12.9-9.el6.x86_64 openldap-2.4.23-15.el6.x86_64 openldap-clients-2.4.23-15.el6.x86_64 openldap-servers-2.4.23-15.el6.x86_64 Steps to Reproduce: 1. Upgrade to RHEL 6.1 2. Allow the other server to log in to replicate 3. No connection is allowed anymore Actual results: Every connection is denied Expected results: Every connection is allowed like in RHEL 6.0 Additional info: If I block the connection from the other replicating server with iptables, everything works as usual.