Bug 707599

Summary:

replication with TLS does not work

Product:

Red Hat Enterprise Linux 6

Reporter:

Gabor Szathmari <gabor.szathmari>

Component:

openldap

Assignee:

Jan Vcelak <jvcelak>

Status:

CLOSED ERRATA

QA Contact:

David Spurek <dspurek>

Severity:

high

Docs Contact:

Priority:

medium

Version:

6.1

CC:

ccsadmins, Dario.Palmisano, dspurek, ebenes, josh.mullis, jplans, jvcelak, jwest, kdudka, lutz.willek, omoris, rmeggins, rvokal, trevor.hemsley, tsmetana

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

openldap-2.4.23-29.el6

Doc Type:

Bug Fix

Doc Text:

Cause: OpenLDAP server configured for replication. TLS is enabled both for accepting connections from remote peers and for TLS client authentication to the other replicas. Different TLS configuration used for server and for connecting to the replicas. Consequence: Connecting to a replica can fail due to TLS certificate lookup errors or due to unknown PKCS#11 TLS errors. Fix: A set of patches applied, makes multiple TLS LDAP contexts within one process possible without affecting the others. Result: Multiple LDAP TLS contexts with different configurations within one process are possible. OpenLDAP replication in described conditions is possible.

Story Points:

---

Clone Of:

Clones:

795763 (view as bug list)

Environment:

Last Closed:

2013-02-21 09:45:19 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

783445

Bug Blocks:

795763, 836160

Attachments:

Description	Flags
Main configuration file	none
OpenLDAP debug log	none
Certificate Authority	none
Cert	none
Certificate Authority (this is the correct file)	none
Certificate (this is the correct file)	none
This cert was used to replicate and caused the Bad addr MAC error	none
reproducer - configuration and scripts	none
patch resolving SEC_ERROR_UNKNOWN_PKCS11_ERROR	none
multiple NSS_InitContext() calls demonstration	none

Description Gabor Szathmari 2011-05-25 13:39:26 UTC

Created attachment 500817 [details]
Main configuration file

Description of problem:
I can't connect to the new version of OpenLDAP with NSS support after the upgrade. I am using two identical OpenLDAP servers for multi master replication. One of them has been upgraded to openldap-2.4.23-15.el6.x86_64 recently. 

Everything works as expected until the other server (which has not been upgraded yet) connects to replicate. After then, every single client connection (including every authenticating Linux clients, Apache Directory Studio etc.) is rejected.

# ldapwhoami  -xZZ -vv
ldap_initialize( <DEFAULT> )
ldap_start_tls: Connect error (-11)
        additional info: TLS error -12272:Unknown code ___P 16

Relevant OpenLDAP log:
May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 fd=15 ACCEPT from IP=10.112.242.100:36484 (IP=0.0.0.0:389)
May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 op=0 EXT oid=1.3.6.1.4.1.1466.20037
May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 op=0 STARTTLS
May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 op=0 RESULT oid= err=0 text=
May 25 15:34:28 ldapserver1 slapd[3278]: conn=1078 fd=15 closed (TLS negotiation failure)

Relevant OpenLDAP debug log:
connection_get(15): got connid=1000
connection_read(15): checking for input on id=1000
TLS: error: accept - force handshake failure: errno 11 - moznss error -12273
TLS: can't accept: TLS error -12273:Unknown code ___P 15.
connection_read(15): TLS accept failure error=-1 id=1000, closing
connection_close: conn=1000 sd=15


Version-Release number of selected component (if applicable):
nss-3.12.9-9.el6.x86_64
openldap-2.4.23-15.el6.x86_64
openldap-clients-2.4.23-15.el6.x86_64
openldap-servers-2.4.23-15.el6.x86_64

Steps to Reproduce:
1. Upgrade to RHEL 6.1
2. Allow the other server to log in to replicate
3. No connection is allowed anymore
  
Actual results:
Every connection is denied

Expected results:
Every connection is allowed like in RHEL 6.0

Additional info:
If I block the connection from the other replicating server with iptables, everything works as usual.

Comment 1 Gabor Szathmari 2011-05-25 13:40:23 UTC

Created attachment 500818 [details]
OpenLDAP debug log

Comment 3 Gabor Szathmari 2011-05-25 14:00:55 UTC

Created attachment 500822 [details]
Certificate Authority

Comment 4 Gabor Szathmari 2011-05-25 14:03:45 UTC

Created attachment 500823 [details]
Cert

Comment 5 Gabor Szathmari 2011-05-25 14:05:03 UTC

Comment on attachment 500822 [details]
Certificate Authority

>Certificate:
>    Data:
>        Version: 3 (0x2)
>        Serial Number:
>            7f:a7:7a:5d:00:01:00:00:17:18
>        Signature Algorithm: sha1WithRSAEncryption
>        Issuer: DC=hu, DC=pear, DC=pearit, CN=pearIT CA service
>        Validity
>            Not Before: Feb 16 08:52:50 2011 GMT
>            Not After : Feb 15 08:52:50 2013 GMT
>        Subject: C=HU, ST=Budapest, L=Budapest, O=pear qwe Hungary Zrt., OU=IT Unix/RWE, CN=ldapserver1.pear.hu/emailAddress=Administrators
>        Subject Public Key Info:
>            Public Key Algorithm: rsaEncryption
>                Public-Key: (2048 bit)
>                Modulus:
>                    00:ba:18:6f:8f:b5:f8:c1:c8:5a:72:d9:30:3d:a6:
>                    db:a8:35:b4:7d:04:a0:6c:bf:d4:b6:34:74:53:ba:
>                    d9:60:ef:7a:58:bc:39:28:87:15:87:16:b4:19:2c:
>
>                Exponent: 65537 (0x10001)
>        X509v3 extensions:
>            X509v3 Subject Key Identifier:
>                7A:45:B5:D3:47:EE:7B:AE:F3:E3:CA:B7:5E:AA:A6:22
>            X509v3 Authority Key Identifier:
>                keyid:46:B3:50:80:6C:82:FB:7D:25:76:9C:02:D3:74:40:7C
>
>            X509v3 CRL Distribution Points:
>
>                Full Name:
>                  URI:ldap:///CN=pearIT%20CA[...]Class=cRLDistributionPoint
>                  URI:http://xys001tat4.pearit.pear.hu[...].crl
>
>            Authority Information Access:
>                CA Issuers - URI:ldap:///CN=pearIT%20CA%[...]cationAuthority
>                CA Issuers - URI:http://xys001ta[...]ce(1).crt
>
>            1.3.6.1.4.1.311.20.2:
>                ...W.e.b.S.e.r.v.e.r
>            X509v3 Basic Constraints: critical
>                CA:FALSE
>            X509v3 Key Usage:
>                Digital Signature, Key Encipherment
>            X509v3 Extended Key Usage:
>                TLS Web Server Authentication
>    Signature Algorithm: sha1WithRSAEncryption
>        4f:f5:a9:27:28:d5:7a:35:70:4a:ff:01:62:10:96:6b:5e:2e:
>        e6:d2:b1:77:34:c2:9d:f1:59:75:c5:82:e0:12:eb:2a:f3:86:
>        81:54:9e:ca:00:77:9b:b5:c0:ac:71:a6:f3:06:ea:18:31:27:
>

Comment 6 Jan Vcelak 2011-05-26 12:27:07 UTC

Rich, have you ever encountered that NSS error message?

-12272  SSL_ERROR_BAD_MAC_ALERT

"SSL peer reports incorrect Message Authentication Code."
The remote system has reported that it received a message with a bad Message Authentication Code from the local system. This may indicate that an attack on that server is underway.

Comment 7 Rich Megginson 2011-05-26 13:17:26 UTC

(In reply to comment #6)
> Rich, have you ever encountered that NSS error message?
> 
> -12272  SSL_ERROR_BAD_MAC_ALERT
> 
> "SSL peer reports incorrect Message Authentication Code."
> The remote system has reported that it received a message with a bad Message
> Authentication Code from the local system. This may indicate that an attack on
> that server is underway.

Looks like https://bugzilla.redhat.com/show_bug.cgi?id=596058

Comment 8 Gabor Szathmari 2011-06-02 08:02:05 UTC

Okay, it seems that the certificate which was working perfectly are not compatible somehow with the new TLS backend. It seems it is a very nasty regression that happened in RHEL 6.1.

Before upgrade:

# gnutls-cli ldapserver2.server.hu --x509cafile /etc/openldap/cacerts/ca.crt -V -p 636
Processed 1 CA certificate(s).
Resolving 'ldapserver2.server.hu'...
Connecting to '10.222.1.2:636'...
- Server's trusted authorities:
   [0]: DC=hu,DC=server,DC=it,CN=IT CA service
- Successfully sent 0 certificate(s) to server.
- Server has requested a certificate.
- Certificate type: X.509
 - Got a certificate list of 2 certificates.
 - Certificate[0] info:
  - X.509 Certificate Information:
        Version: 3
        Serial Number (hex): 7fa842f6000100001719
        Issuer: DC=hu,DC=server,DC=it,CN=IT CA service
[...]


After the upgrade:

# gnutls-cli ldapserver1.server.hu --x509cafile /etc/openldap/cacerts/ca.crt -V -p 636
Processed 1 CA certificate(s).
Resolving 'ldapserver1.server.hu'...
Connecting to '10.222.1.1:636'...
- Server's trusted authorities:
   [0]: DC=hu,DC=server,DC=it,CN=IT CA service
   [1]: DC=hu,DC=server,DC=it,CN=Users,CN=nx
   [2]: C=HU,ST=Budapest,L=Budapest,O=Organization,OU=IT Unix/BEA,CN=ldapserver1.server.hu,EMAIL=Administrators
- Successfully sent 0 certificate(s) to server.
*** Fatal error: A TLS fatal alert has been received.
*** Received alert [20]: Bad record MAC
*** Handshake has failed
GNUTLS ERROR: A TLS fatal alert has been received.

Comment 9 Rich Megginson 2011-06-02 18:59:44 UTC

(In reply to comment #3)
> Created attachment 500822 [details]
> Certificate Authority

Note: this is not a CA (X509v3 Basic Constraints: critical
                CA:FALSE)

Comment 10 Rich Megginson 2011-06-02 19:00:19 UTC

(In reply to comment #4)
> Created attachment 500823 [details]
> Cert

This is a CA cert

Comment 11 Rich Megginson 2011-06-02 19:01:39 UTC

Before the upgrade:
- Server's trusted authorities:
   [0]: DC=hu,DC=server,DC=it,CN=IT CA service
After:
- Server's trusted authorities:
   [0]: DC=hu,DC=server,DC=it,CN=IT CA service
   [1]: DC=hu,DC=server,DC=it,CN=Users,CN=nx
   [2]: C=HU,ST=Budapest,L=Budapest,O=Organization,OU=IT
Unix/BEA,CN=ldapserver1.server.hu,EMAIL=Administrators

Any idea why the server now has 2 more trusted authorities?

Comment 12 Rich Megginson 2011-06-02 23:00:59 UTC

Still trying to reproduce.  This is what I've done so far:

1) using the script /etc/pki/tls/misc/CA and my own custom openssl.cnf file, create a CA cert and use that CA to issue a server cert and key.  I changed openssl.cnf to even generate the MS extensions so that the CA cert and the server cert look as much like the posted ones as possible.  So that I have the following files:
~/save/CA/cacert.pem - contains the 1 new CA cert
~/save/CA/certs/localhost.localdomain.cert.pem - contains the 1 server cert
~/save/CA/certs/localhost.localdomain.key.pem - contains the 1 server cert

2) set up slapd with the following in cn=config (I'm using slapd.d not slapd.conf but it should not make any difference)
olcTLSCACertificateFile: /path/to/save/CA/cacert.pem
olcTLSCertificateFile: /path/to/save/CA/certs/localhost.localdomain.cert.pem
olcTLSCertificateKeyFile: /path/to/save/CA/certs/localhost.localdomain.key.pem

3) gnutls-cli localhost.localdomain --x509cafile ~/save/CA/cacert.pem -V -p SSLPORT
Processed 1 CA certificate(s).
Resolving 'localhost.localdomain'...
Connecting to '127.0.0.1:1201'...
- Certificate type: X.509
 - Got a certificate list of 2 certificates.
 - Certificate[0] info:
  - X.509 Certificate Information:
... the server cert details ...
 - Certificate[1] info:
  - X.509 Certificate Information:
	Version: 3
... the CA cert details ...
Other Information:
	MD5 fingerprint:
		6571d10d46b21a60876bc2a136345eac
	SHA-1 fingerprint:
		f9ec0573af8eccbe4151cc30910834f56e779917
	Public Key Id:
		8c6058575769995ab034e5f3380691137baf3825

- The hostname in the certificate matches 'localhost.localdomain'.
- Peer's certificate is trusted
- Version: TLS1.0
- Key Exchange: RSA
- Cipher: AES-256-CBC
- MAC: SHA1
- Compression: NULL
- Session ID: 20:96:60:53:BF:73:DD:91:F4:3C:85:E2:7D:8E:DE:89:F8:27:F1:53:99:A5:12:04:AD:AE:07:1B:EE:C6:05:41
- Handshake was completed

- Simple Client Mode:

openssl s_client -connect localhost.localdomain:1201 -CAfile ~/save/CA/cacert.pem

gives similar output

LDAPTLS_CACERT=~/save/CA/cacert.pem bin/ldapsearch -x -H ldaps://localhost.localdomain:1201/ -s base -b ""

works.

So I must be missing something about how you have your cert files or server configured.

Comment 13 Gabor Szathmari 2011-06-03 07:01:20 UTC

It seems that I might be able to solve the problem today.

Just to clarify, I used 3 certificates: one CA, one for the server and one for replication (for SASL EXTERNAL authentication) previously. These were generated by a Windows Server CA.

When the upgraded to the OpenLDAP with the new GNUTLS backend, every single connection from any client (Linux machines, Apache Directory Studio) were denied with the "Bad record MAC" error message. It was caused by the invalid client certificate defined in the olcSyncrepl attribute.

Then, I replaced the server certificate and tried to use it for the replication by not defining any certificate in the olcSyncrepl attribute (so it falls back to the cert defined in olcTLSCertificateFile). Now the connections are not rejected but the replication fails because it tries to use a server certificate instead of a client one. Now, I am waiting for a new client certificate and hope it is going to solve the problem.

In summary, I see two serious problems here:
- Invalid client certificate (in olcSyncrepl) used for replication (while olcTLSCACertificateFile and olcTLSCertificateFile is valid) causes all SSL connections to fail with the error message "Bad record MAC" 
- Not configuring a client certificate in the olcSyncrepl attribute in order to fall back to olcTLSCertificateFile, replication fails because its a server certificate. It is normal but not documented well, I reckon.

Comment 14 Gabor Szathmari 2011-06-03 07:03:54 UTC

I forgot to mention that the new certificate requests are generated by certtool now and it is signed by Windows Server. Previously, the whole certificate was generated by Windows.

Comment 15 Gabor Szathmari 2011-06-03 07:07:58 UTC

I really don't know. The ldapserver1 and ldapserver2 were identical, the former was upgraded to RHEL 6.1. So I suppose the new backend cases these extra lines.

[0]: CA Certificate defined in olcTLSCACertificateFile 
[1]: Cert for replication defined in olcSyncrepl
[2]: Server certificate in olcTLSCertificateFile 


(In reply to comment #11)
> Before the upgrade:
> - Server's trusted authorities:
>    [0]: DC=hu,DC=server,DC=it,CN=IT CA service
> After:
> - Server's trusted authorities:
>    [0]: DC=hu,DC=server,DC=it,CN=IT CA service
>    [1]: DC=hu,DC=server,DC=it,CN=Users,CN=nx
>    [2]: C=HU,ST=Budapest,L=Budapest,O=Organization,OU=IT
> Unix/BEA,CN=ldapserver1.server.hu,EMAIL=Administrators
> Any idea why the server now has 2 more trusted authorities?

Comment 16 Gabor Szathmari 2011-06-03 07:11:05 UTC

Sorry, it seems I mixed up the files, I upload them again.


(In reply to comment #10)
> (In reply to comment #4)
> > Created attachment 500823 [details]
> > Cert
> This is a CA cert

Comment 17 Gabor Szathmari 2011-06-03 07:14:07 UTC

Created attachment 502739 [details]
Certificate Authority (this is the correct file)

Comment 18 Gabor Szathmari 2011-06-03 07:15:53 UTC

Created attachment 502740 [details]
Certificate (this is the correct file)

Comment 19 Gabor Szathmari 2011-06-03 07:36:41 UTC

Created attachment 502746 [details]
This cert was used to replicate and caused the Bad addr MAC error

This cert was used to replicate and caused the Bad addr MAC error

Comment 20 Gabor Szathmari 2011-06-03 13:47:15 UTC

Unfortunately, the new client certificates are not working. Now I am using 

GSSAPI authentication instead of EXTERNAL for the replication thus it works as expected.

Comment 21 Rich Megginson 2011-06-03 16:35:58 UTC

> When the upgraded to the OpenLDAP with the new GNUTLS backend, every single

Mozilla NSS, not GNUTLS.

> connection from any client (Linux machines, Apache Directory Studio) were
> denied with the "Bad record MAC" error message. It was caused by the invalid
> client certificate defined in the olcSyncrepl attribute.

Can you be more specific?  How is the client certificate invalid?  I don't see anything obvious.

You can create a cert that is used as both a server cert and a client cert.  I don't know if certtool/Microsoft CA will allow you to do that.

Comment 22 Rich Megginson 2011-06-03 17:00:54 UTC

What is your olcSyncrepl configuration line?  In order to use client certificate auth, you should have had something like this:

olcSyncrepl:  provider=ldap://host ... bindmethod=sasl saslmech=EXTERNAL starttls=critical tls_cert=/path/to/cert/used/to/replicate tls_key=/path/to/cert/used/to/replicate/corresponding/private/key/file tls_cacert=/path/to/cacert.pem 

With the NSS backend you could omit tls_cacert and it should fallback.

Comment 23 Gabor Szathmari 2011-06-06 06:58:52 UTC

My previously working attribute was the following:

olcSyncrepl: {0}rid=001 provider=ldaps://ldapserver1.pear.hu type=refreshAndPersist retry="5 5 300 +" searchbase="dc=pear,dc=hu" bindmethod=sasl saslmech=EXTERNAL tls_cert=/etc/openldap/cacerts/replicator.crt tls_key=/etc/openldap/cacerts/replicator.key tls_cacert=/etc/openldap/cacerts/PearITCAservice.crt timeout=1

The replicator.crt was uploaded here in Comment #19

Comment 25 Jan Vcelak 2012-01-17 17:14:45 UTC

I managed to reproduce. And I'm also getting some segfaults in various situations. And the configuration is valid, the replication works if I switch to other authentication method.

I will attach some scripts to reproduce soon.

Comment 26 Jan Vcelak 2012-01-18 17:49:47 UTC

I'm getting the same error codes only when PEM certificates are used. If I create MozNSS certdb and import the certificates, the replication works fine. However I'm getting segfaults from time to time.

#0  __strrchr_sse2 () at ../sysdeps/x86_64/strrchr.S:33
#1  0x00007ffff7f66efa in tlsm_get_certdb_prefix (certdir=0x1 <Address 0x1 out of bounds>, realcertdir=0x7ffff1623cc0, 
    prefix=0x7ffff1623cc8) at tls_m.c:1525
#2  0x00007ffff7f67102 in tlsm_deferred_init (arg=0x7fffe4108dc0) at tls_m.c:1612
#3  0x00007ffff7f67f8e in tlsm_deferred_ctx_init (arg=0x7fffe4108dc0) at tls_m.c:2102
#4  0x00007ffff5f2cf55 in PR_CallOnceWithArg (once=0x7fffe4108df8, func=<value optimized out>, arg=<value optimized out>)
    at ../../../mozilla/nsprpub/pr/src/misc/prinit.c:832
#5  0x00007ffff7f68c68 in tlsm_session_new (ctx=0x7fffe4108dc0, is_server=0) at tls_m.c:2466
#6  0x00007ffff7f6211f in alloc_handle (ctx_arg=0x7fffe4108dc0, is_server=0) at tls2.c:296
#7  0x00007ffff7f62277 in ldap_int_tls_connect (ld=0x7fffe8112930, conn=0x7fffe800d850) at tls2.c:341
#8  0x00007ffff7f634e5 in ldap_int_tls_start (ld=0x7fffe8112930, conn=0x7fffe800d850, srv=0x7fffe800d7b0) at tls2.c:833
#9  0x00007ffff7f34457 in ldap_int_open_connection (ld=0x7fffe8112930, conn=0x7fffe800d850, srv=0x7fffe800d7b0, async=0) at open.c:417
#10 0x00007ffff7f4da8e in ldap_new_connection (ld=0x7fffe8112930, srvlist=0x7fffe81129f8, use_ldsb=1, connect=1, bind=0x0)
    at request.c:425
#11 0x00007ffff7f3386b in ldap_open_defconn (ld=0x7fffe8112930) at open.c:41
#12 0x00007ffff7f3c359 in ldap_int_sasl_bind (ld=0x7fffe8112930, dn=0x7ffff8312870 "cn=manager,dc=redhat,dc=bug", 
    mechs=0x7ffff8312760 "EXTERNAL", sctrls=0x0, cctrls=0x0, flags=2, interact=0x7ffff7f237d0 <lutil_sasl_interact>, 
    defaults=0x7fffe800d2e0) at cyrus.c:421
#13 0x00007ffff7f40c23 in ldap_sasl_interactive_bind_s (ld=0x7fffe8112930, dn=0x7ffff8312870 "cn=manager,dc=redhat,dc=bug", 
    mechs=0x7ffff8312760 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x7ffff7f237d0 <lutil_sasl_interact>, 
    defaults=0x7fffe800d2e0) at sasl.c:464
#14 0x00007ffff7dbd191 in slap_client_connect (ldp=0x7ffff8312690, sb=0x7ffff8312470) at ../../../servers/slapd/config.c:1965
#15 0x00007ffff7e49b44 in do_syncrep1 (op=0x7ffff16244b0, si=0x7ffff8312440) at ../../../servers/slapd/syncrepl.c:571
#16 0x00007ffff7e4d1fd in do_syncrepl (ctx=0x7ffff1624b70, arg=0x7ffff83129f0) at ../../../servers/slapd/syncrepl.c:1422
#17 0x00007ffff7f3214e in ldap_int_thread_pool_wrapper (xpool=0x7ffff82bba70) at ../../../libraries/libldap_r/tpool.c:685
#18 0x00007ffff5cfd7f1 in start_thread (arg=0x7ffff1625700) at pthread_create.c:301
#19 0x00007ffff582f70d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

I guess, the TLS configuration is propagated wrong somewhere. As you may see, certdir in frame #0 - address out of bounds, while:

(gdb) frame 15
#15 0x00007ffff7e49b44 in do_syncrep1 (op=0x7ffff16244b0, si=0x7ffff8312440) at ../../../servers/slapd/syncrepl.c:571
571             rc = slap_client_connect( &si->si_ld, &si->si_bindconf );
(gdb) p si->si_bindconf 
$5 = {sb_uri = {bv_len = 35, bv_val = 0x7ffff8312710 "ldaps://ldapserver2.redhat.bug:2636"}, sb_version = 3, sb_tls = 0, 
  sb_method = 163, sb_timeout_api = 1, sb_timeout_net = 0, sb_binddn = {bv_len = 27, 
    bv_val = 0x7ffff8312870 "cn=manager,dc=redhat,dc=bug"}, sb_cred = {bv_len = 0, bv_val = 0x0}, sb_saslmech = {bv_len = 8, 
    bv_val = 0x7ffff8312760 "EXTERNAL"}, sb_secprops = 0x0, sb_realm = {bv_len = 0, bv_val = 0x0}, sb_authcId = {bv_len = 0, 
    bv_val = 0x0}, sb_authzId = {bv_len = 0, bv_val = 0x0}, sb_keepalive = {sk_idle = 0, sk_probes = 0, sk_interval = 0}, 
  sb_tls_ctx = 0x7fffe4108dc0, sb_tls_cert = 0x7ffff8312740 "replicator", sb_tls_key = 0x0, sb_tls_cacert = 0x0, 
  sb_tls_cacertdir = 0x7ffff83127c0 "/root/bz707599/certdb", sb_tls_reqcert = 0x7ffff8312970 "demand", sb_tls_cipher_suite = 0x0, 
  sb_tls_protocol_min = 0x0, sb_tls_do_init = 0}

And the same issue can be reproduced on latest 2.4.28 in Fedora Rawhide.

Comment 27 Jan Vcelak 2012-01-20 13:04:06 UTC

(In reply to comment #26)
> I'm getting the same error codes only when PEM certificates are used. If I
> create MozNSS certdb and import the certificates, the replication works fine.
> However I'm getting segfaults from time to time.

The segfaulting is a different problem. But maybe related, it needs to be fixed first.

Comment 28 Jan Vcelak 2012-01-25 14:30:50 UTC

After resolving the segfaults by fixing bug #783445, this bug is no longer present when NSS certdb is used. The problem appears only with PEM certificates. I'm trying to find the cause.

Comment 29 Jan Vcelak 2012-01-31 16:05:42 UTC

Created attachment 558640 [details]
reproducer - configuration and scripts

Attaching configuration files and script for reproduction.

The replication works fine when MozNSS DB is used as a certificate storage (with fix for #783445). It fails only with PEM certificates.

Untar the reproducer to /root directory. Append included hosts file to /etc/hosts. Run "make" in server1 and server2 directories to generate the configuration. Start the servers with make run (or make debug).

server1 TLS messages:

TLS: using moznss security dir /root/bz707599/certdb prefix .
TLS: certificate [E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ] is valid
TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid
TLS certificate verification: subject: CN=replicator,CN=users,DC=redhat,DC=bug, issuer: CN=Bug CA,DC=redhat,DC=bug, cipher: AES-256, security level: high, secret key bits: 256, total key bits: 256, cache hits: 0, cache misses: 1, cache not reusable: 0
TLS: could not initialize moznss - error -8018:Unknown PKCS #11 error..
TLS: could not perform TLS system initialization.
TLS: error: could not initialize moznss security context - error -8018:Unknown PKCS #11 error.
TLS: can't create ssl handle.
TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid
TLS certificate verification: subject: CN=replicator,CN=users,DC=redhat,DC=bug, issuer: CN=Bug CA,DC=redhat,DC=bug, cipher: AES-256, security level: high, secret key bits: 256, total key bits: 256, cache hits: 0, cache misses: 2, cache not reusable: 0
TLS: error: could not initialize moznss security context - error -5925:The one-time function was previously called and failed. Its error code is no longer available
TLS: can't create ssl handle.
TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid
TLS certificate verification: subject: CN=replicator,CN=users,DC=redhat,DC=bug, issuer: CN=Bug CA,DC=redhat,DC=bug, cipher: AES-256, security level: high, secret key bits: 256, total key bits: 256, cache hits: 0, cache misses: 3, cache not reusable: 0

server2 TLS messages:

TLS: loaded CA certificate file /root/bz707599/certs/cacert.pem.
TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid
TLS: certificate [E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ] is valid
TLS: certificate [CN=replicator,CN=users,DC=redhat,DC=bug] is valid
TLS certificate verification: subject: E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ, issuer: CN=Bug CA,DC=redhat,DC=bug, cipher: AES-256, security level: high, secret key bits: 256, total key bits: 256, cache hits: 0, cache misses: 0, cache not reusable: 0
TLS: loaded CA certificate file /root/bz707599/certs/cacert.pem.
TLS: certificate [E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ] is valid
TLS: error: accept - force handshake failure: errno 0 - moznss error -5938
TLS: can't accept: TLS error -5938:Encountered end of file.

It also breaks client connections:

+ ldapsearch -Y EXTERNAL -H ldaps://ldapserver1.redhat.bug:1636 -b dc=redhat,dc=bug -LLL
SASL/EXTERNAL authentication started
SASL username: cn=replicator,cn=users,dc=redhat,dc=bug
SASL SSF: 0
No such object (32)
+ ldapsearch -Y EXTERNAL -H ldaps://ldapserver2.redhat.bug:2636 -b dc=redhat,dc=bug -LLL
ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1)
        additional info: TLS error -12272:SSL peer reports incorrect Message Authentication Code.

From some reason, NSS_InitContext fails when creating the connection to the remote server. I'm debugging on Fedora Rawhdie, but the problem is the same on RHEL6:

(gdb) bt
#0  nss_Init (configdir=0x5555557df080 "", certPrefix=0x5555557df080 "", keyPrefix=0x5555557df080 "", 
    secmodName=0x5555557dfcc2 "secmod.db", updateDir=0x7ffff6d22620 "", updCertPrefix=0x7ffff6d22620 "", 
    updKeyPrefix=0x7ffff6d22620 "", updateID=0x7ffff6d22620 "", updateName=0x7ffff6d22620 "", initContextPtr=0x7ffff130ab88, 
    initParams=0x7ffff130abf0, readOnly=1, noCertDB=1, noModDB=1, forceOpen=0, noRootInit=1, optimizeSpace=0, 
    noSingleThreadedModules=0, allowAlreadyInitializedModules=0, dontFinalizeModules=0) at nssinit.c:566
#1  0x00007ffff6c0162d in NSS_InitContext (configdir=0x5555557df080 "", certPrefix=0x5555557df080 "", keyPrefix=0x5555557df080 "", 
    secmodName=0x5555557dfcc2 "secmod.db", initParams=0x7ffff130abf0, flags=7) at nssinit.c:861
#2  0x0000555555748b2b in tlsm_deferred_init (arg=0x7fffe4108eb0) at tls_m.c:1666
#3  0x0000555555749bac in tlsm_deferred_ctx_init (arg=0x7fffe4108eb0) at tls_m.c:2133
#4  0x00007ffff6399445 in PR_CallOnceWithArg () from /lib64/libnspr4.so
#5  0x000055555574a8a4 in tlsm_session_new (ctx=0x7fffe4108eb0, is_server=0) at tls_m.c:2500
#6  0x0000555555743842 in alloc_handle (ctx_arg=0x7fffe4108eb0, is_server=0) at tls2.c:288
#7  0x0000555555743972 in ldap_int_tls_connect (ld=0x7fffe8100910, conn=0x7fffe8108ec0) at tls2.c:333
#8  0x0000555555744bd5 in ldap_int_tls_start (ld=0x7fffe8100910, conn=0x7fffe8108ec0, srv=0x7fffe8108de0) at tls2.c:834
#9  0x0000555555717920 in ldap_int_open_connection (ld=0x7fffe8100910, conn=0x7fffe8108ec0, srv=0x7fffe8108de0, async=0) at open.c:437
#10 0x000055555572ecb6 in ldap_new_connection (ld=0x7fffe8100910, srvlist=0x7fffe8100a30, use_ldsb=1, connect=1, bind=0x0, m_req=0, 
    m_res=0) at request.c:480
#11 0x0000555555716b43 in ldap_open_defconn (ld=0x7fffe8100910) at open.c:41
#12 0x000055555571ec83 in ldap_int_sasl_bind (ld=0x7fffe8100910, dn=0x555555b47ff0 "cn=manager,dc=redhat,dc=bug", 
    mechs=0x555555b47ee0 "EXTERNAL", sctrls=0x0, cctrls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, 
    defaults=0x7fffe8108e80, result=0x0, rmech=0x7ffff130b360, msgid=0x7ffff130b354) at cyrus.c:425
#13 0x0000555555721db7 in ldap_sasl_interactive_bind (ld=0x7fffe8100910, dn=0x555555b47ff0 "cn=manager,dc=redhat,dc=bug", 
    mechs=0x555555b47ee0 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, 
    defaults=0x7fffe8108e80, result=0x0, rmech=0x7ffff130b360, msgid=0x7ffff130b354) at sasl.c:474
#14 0x0000555555721e6a in ldap_sasl_interactive_bind_s (ld=0x7fffe8100910, dn=0x555555b47ff0 "cn=manager,dc=redhat,dc=bug", 
    mechs=0x555555b47ee0 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, 
    defaults=0x7fffe8108e80) at sasl.c:511
#15 0x000055555559c5d6 in slap_client_connect (ldp=0x555555b47e00, sb=0x555555b47be0) at ../../../servers/slapd/config.c:2040
#16 0x0000555555628d19 in do_syncrep1 (op=0x7ffff130b480, si=0x555555b47bb0) at ../../../servers/slapd/syncrepl.c:611
#17 0x000055555562c5ab in do_syncrepl (ctx=0x7ffff130bb60, arg=0x555555b481b0) at ../../../servers/slapd/syncrepl.c:1510
#18 0x000055555571546d in ldap_int_thread_pool_wrapper (xpool=0x555555ae6b70) at ../../../libraries/libldap_r/tpool.c:685
#19 0x00007ffff73ead34 in start_thread (arg=0x7ffff130c700) at pthread_create.c:309
#20 0x00007ffff5ca4f4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) fin
Run till exit from #0  nss_Init (configdir=0x5555557df080 "", certPrefix=0x5555557df080 "", keyPrefix=0x5555557df080 "", 
    secmodName=0x5555557dfcc2 "secmod.db", updateDir=0x7ffff6d22620 "", updCertPrefix=0x7ffff6d22620 "", 
    updKeyPrefix=0x7ffff6d22620 "", updateID=0x7ffff6d22620 "", updateName=0x7ffff6d22620 "", initContextPtr=0x7ffff130ab88, 
    initParams=0x7ffff130abf0, readOnly=1, noCertDB=1, noModDB=1, forceOpen=0, noRootInit=1, optimizeSpace=0, 
    noSingleThreadedModules=0, allowAlreadyInitializedModules=0, dontFinalizeModules=0) at nssinit.c:566
0x00007ffff6c0162d in NSS_InitContext (configdir=0x5555557df080 "", certPrefix=0x5555557df080 "", keyPrefix=0x5555557df080 "", 
    secmodName=0x5555557dfcc2 "secmod.db", initParams=0x7ffff130abf0, flags=7) at nssinit.c:861
861         rv = nss_Init(configdir, certPrefix, keyPrefix, secmodName,
Value returned is $2 = SECFailure

Comment 30 Rich Megginson 2012-02-01 17:10:09 UTC

what version of nss?  There are some thread safety issues in nss_Init being fixed in NSS 3.13.2 - I believe fedora is using nss 3.13.1?

Comment 31 Jan Vcelak 2012-02-01 17:36:20 UTC

Yes, I have nss-3.13.1-13.fc17. I can try to get fresh nss from upstream repository (if there is any?).

Comment 32 Rich Megginson 2012-02-01 17:42:17 UTC

(In reply to comment #31)
> Yes, I have nss-3.13.1-13.fc17. I can try to get fresh nss from upstream
> repository (if there is any?).

Not sure.  Check with emaldonado

Comment 33 Jan Vcelak 2012-02-02 16:07:46 UTC

(In reply to comment #30)
> what version of nss?  There are some thread safety issues in nss_Init being
> fixed in NSS 3.13.2 - I believe fedora is using nss 3.13.1?

I built NSS packages from upstream CVS (NSS_3_13_2_BETA2 tag) and it didn't help. Still the same behavior. I also added a mutex to protect the whole NSS initialization with no success. I do not think this is a race.

Comment 34 Rich Megginson 2012-02-02 16:25:33 UTC

So it looks as though the initial problem is this:

TLS: could not initialize moznss - error -8018:Unknown PKCS #11 error..

The next step is to step through this in the debugger.  I don't know in how many places the error SEC_ERROR_UNKNOWN_PKCS11_ERROR can be returned, but set break points at places that look like candidates for the error return.

This is interesting as well:

        additional info: TLS error -12272:SSL peer reports incorrect Message
Authentication Code.

We've seen this error before when the CA/server cert is using a MAC/signature algorithm not supported on the client, or vice versa.  But if all of the systems are using recent versions of NSS, this is not likely to be a problem.

Comment 35 Jan Vcelak 2012-02-06 10:15:49 UTC

Created attachment 559594 [details]
patch resolving SEC_ERROR_UNKNOWN_PKCS11_ERROR

Finally, some update. SEC_ERROR_UNKNOWN_PKCS11_ERROR was in fact internal CKR_CRYPTOKI_ALREADY_INITIALIZED error. I added NSS_INIT_PK11RELOAD to init flags and this problem is gone. Attaching the patch.

However, I'm still getting invalid MAC error.

And in addition all SSL connections deadlocks from time to time. One thread seems to wait for some data:

#0  0x00007ffff73f105d in read () at ../sysdeps/unix/syscall-template.S:82
#1  0x00005555557568de in sb_stream_read (sbiod=0x7fffe417bb80, buf=0x7fffe41a1f18, len=5) at ../../../libraries/liblber/sockbuf.c:493
#2  0x000055555575730d in sb_debug_read (sbiod=0x7fffe414f490, buf=0x7fffe41a1f18, len=5) at ../../../libraries/liblber/sockbuf.c:829
#3  0x000055555574b6d2 in tlsm_PR_Recv (fd=0x7fffe414f820, buf=0x7fffe41a1f18, len=5, flags=0, timeout=4294967295) at tls_m.c:2900
#4  0x00007ffff71bd8fa in ssl_DefRecv (ss=0x7fffe41a1b40, buf=0x7fffe41a1f18 "", len=5, flags=0) at ssldef.c:94
#5  0x00007ffff71b6a6a in ssl3_GatherData (ss=0x7fffe41a1b40, gs=0x7fffe41a1ec8, flags=0) at ssl3gthr.c:90
#6  0x00007ffff71b6c42 in ssl3_GatherCompleteHandshake (ss=0x7fffe41a1b40, flags=0) at ssl3gthr.c:196
#7  0x00007ffff71b901c in ssl_GatherRecord1stHandshake (ss=0x7fffe41a1b40) at sslcon.c:1253
#8  0x00007ffff71c2f9c in ssl_Do1stHandshake (ss=0x7fffe41a1b40) at sslsecur.c:153
#9  0x00007ffff71c38e1 in SSL_ForceHandshake (fd=0x7fffe414f7e0) at sslsecur.c:423
#10 0x000055555574aa71 in tlsm_session_accept_or_connect (session=0x7fffe414f7e0, is_accept=0) at tls_m.c:2547
#11 0x000055555574aba6 in tlsm_session_connect (ld=0x7fffe4103240, session=0x7fffe414f7e0) at tls_m.c:2582
#12 0x0000555555743ad3 in ldap_int_tls_connect (ld=0x7fffe4103240, conn=0x7fffe41983c0) at tls2.c:358
#13 0x0000555555744bd5 in ldap_int_tls_start (ld=0x7fffe4103240, conn=0x7fffe41983c0, srv=0x7fffe41964d0) at tls2.c:834
#14 0x0000555555717920 in ldap_int_open_connection (ld=0x7fffe4103240, conn=0x7fffe41983c0, srv=0x7fffe41964d0, async=0) at open.c:437
#15 0x000055555572ecb6 in ldap_new_connection (ld=0x7fffe4103240, srvlist=0x7fffe4102490, use_ldsb=1, connect=1, bind=0x0, m_req=0, 
    m_res=0) at request.c:480
#16 0x0000555555716b43 in ldap_open_defconn (ld=0x7fffe4103240) at open.c:41
#17 0x000055555571ec83 in ldap_int_sasl_bind (ld=0x7fffe4103240, dn=0x555555b48050 "cn=manager,dc=redhat,dc=bug", 
    mechs=0x555555b47f40 "EXTERNAL", sctrls=0x0, cctrls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, 
    defaults=0x7fffe4173250, result=0x0, rmech=0x7ffff1af4360, msgid=0x7ffff1af4354) at cyrus.c:425
#18 0x0000555555721db7 in ldap_sasl_interactive_bind (ld=0x7fffe4103240, dn=0x555555b48050 "cn=manager,dc=redhat,dc=bug", 
    mechs=0x555555b47f40 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, 
    defaults=0x7fffe4173250, result=0x0, rmech=0x7ffff1af4360, msgid=0x7ffff1af4354) at sasl.c:474
#19 0x0000555555721e6a in ldap_sasl_interactive_bind_s (ld=0x7fffe4103240, dn=0x555555b48050 "cn=manager,dc=redhat,dc=bug", 
    mechs=0x555555b47f40 "EXTERNAL", serverControls=0x0, clientControls=0x0, flags=2, interact=0x555555709a58 <lutil_sasl_interact>, 
    defaults=0x7fffe4173250) at sasl.c:511
#20 0x000055555559c5d6 in slap_client_connect (ldp=0x555555b47e60, sb=0x555555b47c40) at ../../../servers/slapd/config.c:2040
#21 0x0000555555628d19 in do_syncrep1 (op=0x7ffff1af4480, si=0x555555b47c10) at ../../../servers/slapd/syncrepl.c:611
#22 0x000055555562c5ab in do_syncrepl (ctx=0x7ffff1af4b60, arg=0x555555b48210) at ../../../servers/slapd/syncrepl.c:1510
#23 0x000055555571546d in ldap_int_thread_pool_wrapper (xpool=0x555555ae6b70) at ../../../libraries/libldap_r/tpool.c:685
#24 0x00007ffff73ead34 in start_thread (arg=0x7ffff1af5700) at pthread_create.c:309
#25 0x00007ffff5ca4f4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

While the second waits for the lock:

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007ffff73ecc56 in _L_lock_858 () from /lib64/libpthread.so.0
#2  0x00007ffff73ecb58 in __pthread_mutex_lock (mutex=0x555555a659c0) at pthread_mutex_lock.c:65
#3  0x0000555555716963 in ldap_pvt_thread_mutex_lock (mutex=0x555555a659c0) at ../../../libraries/libldap_r/thr_posix.c:296
#4  0x000055555574aa65 in tlsm_session_accept_or_connect (session=0x7fffd80025d0, is_accept=1) at tls_m.c:2545
#5  0x000055555574ab83 in tlsm_session_accept (session=0x7fffd80025d0) at tls_m.c:2576
#6  0x0000555555743d05 in ldap_pvt_tls_accept (sb=0x7fffd80025a0, ctx_arg=0x555555b657d0) at tls2.c:421
#7  0x00005555555a89bb in connection_read (s=19, cri=0x7fffeb7fdb10) at ../../../servers/slapd/connection.c:1355
#8  0x00005555555a84ef in connection_read_thread (ctx=0x7fffeb7fdb60, argv=0x13) at ../../../servers/slapd/connection.c:1267
#9  0x000055555571546d in ldap_int_thread_pool_wrapper (xpool=0x555555ae6b70) at ../../../libraries/libldap_r/tpool.c:685
#10 0x00007ffff73ead34 in start_thread (arg=0x7fffeb7fe700) at pthread_create.c:309
#11 0x00007ffff5ca4f4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 36 Kamil Dudka 2012-02-06 11:06:32 UTC

(In reply to comment #35)
> And in addition all SSL connections deadlocks from time to time.

I would not call it a deadlock.  The tlsm_pem_mutex lock simply does not allow to perform multiple SSL handshakes at a time.  If a single SSL handshake for whatever reason blocks, all other connections have to wait.

I am not sure whether the waiting time is anyhow bounded by default.  Maybe calling SSL_ForceHandshakeWithTimeout() instead of SSL_ForceHandshake() would improve the the responsiveness in this case?

Comment 37 Rich Megginson 2012-02-06 14:25:36 UTC

I don't understand - why is NSS giving CKR_CRYPTOKI_ALREADY_INITIALIZED?  Is it really already initialized?  If so, why haven't we been getting this error before?  Is it because something changed in NSS recently?

The deadlock - Kamil is correct, it's not really a deadlock, it is a lock to prevent multiple threads from calling SSL_ForceHandshake() at the same time - in earlier versions of NSS this was not thread safe.  However, many thread safety issues have been addressed in NSS 3.13, so this may be ok, but perhaps for the NSS db implementation only - it may still be thread-unsafe when using PEM.

Comment 39 Jan Vcelak 2012-02-06 15:29:06 UTC

(In reply to comment #37)
> I don't understand - why is NSS giving CKR_CRYPTOKI_ALREADY_INITIALIZED?  Is it
> really already initialized?  If so, why haven't we been getting this error
> before?  Is it because something changed in NSS recently?

There are at least two TLS contexts used. The first one for accepting connections (olcTLS* options). The second one is used for replication as the client side (olcSyncRepl). Both live in a different thread. This error appears only when PEM module is used, and during the second initialization. If I understand it correctly, the NSS PKCS#11 module, which causes this error, is loaded only when PEM is used. I think we just haven't tested this use case properly when switching to MozNSS.

> The deadlock - Kamil is correct, it's not really a deadlock, it is a lock to
> prevent multiple threads from calling SSL_ForceHandshake() at the same time -
> in earlier versions of NSS this was not thread safe.  However, many thread
> safety issues have been addressed in NSS 3.13, so this may be ok, but perhaps
> for the NSS db implementation only - it may still be thread-unsafe when using
> PEM.

Yes, I wrote it wrong. I meant it the same. And yes - the mutex is effective only when PEM module is loaded - you wrote the patch:
http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=commit;h=8e9af53

I tried sniffing the packets using Wireshark and compared the successful and failing connections:

(1) - successful

-> Client Hello
<- Server Hello, Certificate, Certificate Request, Server Hello Done
-> Certificate, Client Key Exchange, Certificate Verify, Change Cipher Spec, Finished
<- Change Cipher Spec, Finished
... data ...

(2) - failing

-> Client Hello
<- Server Hello, Certificate, Certificate Request, Server Hello Done
-> Certificate, Client Key Exchange, Certificate Verify, Change Cipher Spec, Finished
<- Alert (Bad Record MAC)

The only difference I found is in "Certificate Request" message. Message in (1) contains only "id-at-commonName=Bug CA,dc=redhat,dc=bug", while the same message in (2) contains "id-at-commonName=replicator,id-at-commonName=users,dc=redhat,dc=bug", "id-at-commonName=Bug CA,dc=redhat,dc=bug", and "pkcs-9-at-emailAddress=root.bug,id-at-commonName=ldapservers.redhat.bug,id-at-organizationalUnitName=Bug,id-at-localityName=Brno,id-at-countryName=CZ". 

It seems like the syncrepl TLS configuration mixes up with the server TLS configuration. If you use the reproducer I provided, the server started earlier seems to be fine. The secondly started server reports this MAC error. And even other connections to this server are failing:

$ ldapsearch -Y EXTERNAL -H ldaps://ldapserver2.redhat.bug:2636 -b dc=redhat,dc=bug -LLL
ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1)
        additional info: TLS error -12272:SSL peer reports incorrect Message Authentication Code.

Comment 40 Jan Vcelak 2012-02-07 15:42:50 UTC

Some progress:

There are two TLS context within the server process. The first TLS context is used on the server side to accept connections, the second is used on the client side of syncrepl. While both contexts can have different TLS settings, they share one certificate database (CERTCertDBHandle *):

tls_m.c:2140:
ctx->tc_certdb = CERT_GetDefaultCertDB(); /* If there is ever a per-context db, change this */

When I set up the server to use server cert from MozNSS db, and PEM certificate for replication, I'm getting this error:

TLS: using moznss security dir /root/bz707599/certdb prefix .
TLS: error: the certificate server could not be found in the database - error -12285:Unable to find the certificate or key necessary for authentication.
TLS: error: unable to find and verify server's cert and key for certificate server
TLS: error: could not initialize moznss security context - error -12285:Unable to find the certificate or key necessary for authentication.
TLS: can't create ssl handle.

The call PK11_FindCertFromNickname in tlsm_find_and_verify_cert_key fails.

The first connection created by slapd is syncrepl outbound connection. The initialization of server context happens when there is a first inbound connection.

There is a following sequence of TLS function calls:

tlsm_deferred_ctx_init (syncrepl)
tlsm_deferred_init (syncrepl)
tlsm_deferred_ctx_init (server)
tlsm_deferred_init (server)
tlsm_find_and_verify_cert_key (syncrepl, "PEM Token #0:replicator2.pem - 0")
tlsm_find_and_verify_cert_key (syncrepl, "PEM Token #0:replicator2.pem - 0")
tlsm_find_and_verify_cert_key (server, "server")

When I suspended the syncrepl thread in do_syncrepl function (using gdb), the initialization of server context was successful. And after releasing the syncrepl thread, everything worked well.

I think there is kind of a conflict when adding the PEM certificates to the database. Aren't the old replaced somehow? This will explain the error message above, and probably the invalid MAC error - if the key is removed too early.

I haven't found any replacement for CERT_GetDefaultCertDB(). And according to mailing lists, implementation of that function was postponed. I do not know, if this information is up to date...

Any ideas?

We can probably workaround it by not doing deferred initialization for server context. Or, we can disable deferred initialization at all. But I don't know, it if makes sense and if it helps for all use cases.

Comment 41 Rich Megginson 2012-02-07 16:02:00 UTC

If we can assume that the only versions of NSS we support have NSS_InitContext, we can probably do away with deferred initialization.  I implemented it that way because you have to do that in order to support the old style NSS_Initialize which must be done once per process.

TLS: using moznss security dir /root/bz707599/certdb prefix .

Does this mean there is a valid key/cert database in /root/bz707599/certdb?  If so, is the problem that the process is trying to mix using a real NSS key/cert db with PEM certificates?  I wonder if the problem is that you cannot add PEM keys/certs once you have a real NSS key/cert db.  Or vice versa - perhaps once you have loaded the PEM module and added keys/certs, you cannot use an NSS key/cert db?

The way NSS works is that it does not actually have separate "contexts" where keys/certs are stored, independent of other contexts.  There is supposed to be only one key/cert context - if you ask NSS to verify a cert, it verifies it against all CA certs it knows about in all contexts.  That's why there is only one CERT_GetDefaultCertDB() because there is no support for a per-context key/cert db.

Comment 42 Kamil Dudka 2012-02-07 16:31:48 UTC

(In reply to comment #40)
> I think there is kind of a conflict when adding the PEM certificates to the
> database.

The PEM reader does not add any objects to the NSS internal database.  They are held in an independent array owned by the PEM module and exposed though its API.

Comment 43 Jan Vcelak 2012-02-08 14:43:06 UTC

(In reply to comment #41)
> TLS: using moznss security dir /root/bz707599/certdb prefix .
> 
> Does this mean there is a valid key/cert database in /root/bz707599/certdb?  If
> so, is the problem that the process is trying to mix using a real NSS key/cert
> db with PEM certificates?

Yes. The certificate is there. 

> I wonder if the problem is that you cannot add PEM
> keys/certs once you have a real NSS key/cert db.  Or vice versa - perhaps once
> you have loaded the PEM module and added keys/certs, you cannot use an NSS
> key/cert db?

According to the sequence of calls mentioned in comment #40 -- I think the second variant is the case.

(In reply to comment #42)
> The PEM reader does not add any objects to the NSS internal database.  They are
> held in an independent array owned by the PEM module and exposed though its
> API.

Then I will take a deep look, why FindCertFromNickname fails.

Comment 44 Jan Vcelak 2012-02-28 17:48:48 UTC

I made another progress.

When tls_certdir in syncrepl configuration differs from olcTLSCACertificatePath in server configuration, I'm getting the same error as above:

TLS: error: the certificate server could not be found in the database

If there are multiple TLS contexts within the application, PK11_FindCertFromNickname() doesn't work without specifying token description as a prefix. Otherwise some prefix is added for each certificate during the second (and any further) initialization calls. I'm getting certificate nicknames like "NSS Application Token 00000004:server". (I'm not sure if I used correct NSS terminology.)

So I believe that adding a token description in the initParams for each NSS_InitContext() call and using this string as a prefix in PK11_FindCertFromNickname() can push us further in solving this bug. I haven't tried yet. (Some numbering or original TLS certdir path might be used as the description.)

Or, we can call NSS_InitContext without certdir name, and load each database using SECMOD_OpenUserDB. Do not know, if this is a cleaner solution.

I will attach a simple program which demonstrates both approaches.

Comment 45 Jan Vcelak 2012-02-28 17:49:51 UTC

(In reply to comment #44)
> When tls_certdir in syncrepl configuration differs from olcTLSCACertificatePath
> in server configuration, I'm getting the same error as above:

Forgot to mention: both MozNSS cert DBs.

Comment 46 Jan Vcelak 2012-02-28 17:53:31 UTC

Created attachment 566365 [details]
multiple NSS_InitContext() calls demonstration

I have two MozNSS cert DBs (certdb_1, certdb_2). The first one contains only "ca" certificate, the second one contains only "server" certificate.

#define USE_OPENUSERDB 0
without setting initParams.dbTokenDescription

Loaded certificates:
- 0x238ae70 db2:server (E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ)
- 0x238ae40 db1:ca (CN=Bug CA,DC=redhat,DC=bug)
Certificate 'db1:ca', SN: CN=Bug CA,DC=redhat,DC=bug
Certificate 'db2:server', SN: E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ

#define USE_OPENUSERDB 0

Loaded certificates:
- 0x677d90 db2:server (E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ)
- 0x677d60 ca (CN=Bug CA,DC=redhat,DC=bug)
Certificate 'db1:ca', SN: CN=Bug CA,DC=redhat,DC=bug
Certificate 'db2:server', SN: E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ

#define USE_OPENUSERDB 1

Loaded certificates:
- 0x2630e70 db2:server (E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ)
- 0x2630e40 db1:ca (CN=Bug CA,DC=redhat,DC=bug)
Certificate 'db1:ca', SN: CN=Bug CA,DC=redhat,DC=bug
Certificate 'db2:server', SN: E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ

Comment 47 Jan Vcelak 2012-02-28 17:55:41 UTC

Sorry again, this is correct:

> #define USE_OPENUSERDB 0
> without setting initParams.dbTokenDescription

Loaded certificates:
- 0x1cc0d08 NSS Application Token 00000004:server (E=root.bug,CN=ldapservers.redhat.bug,OU=Bug,L=Brno,C=CZ)
- 0x1cc0cc0 ca (CN=Bug CA,DC=redhat,DC=bug)
Certificate 'ca', SN: CN=Bug CA,DC=redhat,DC=bug
Cannot find 'server' certificate, error -8174.

Comment 48 Kamil Dudka 2012-02-28 19:03:15 UTC

(In reply to comment #44)
> If there are multiple TLS contexts within the application,
> PK11_FindCertFromNickname() doesn't work without specifying token description
> as a prefix. Otherwise some prefix is added for each certificate during the
> second (and any further) initialization calls. I'm getting certificate
> nicknames like "NSS Application Token 00000004:server". (I'm not sure if I used
> correct NSS terminology.)

Not sure if this is the same issue we were fixing in libcurl, but the symptoms are pretty similar.  The solution was to avoid using PK11_FindCertFromNickname() to look for the certificates:

https://github.com/bagder/curl/commit/f6980bb

After all, we were forced to use PK11_FindCertFromNickname() anyway in our workaround preventing nss from an unnecessary SIGSEGV:

https://github.com/bagder/curl/commit/491c5a4

Comment 49 Jan Vcelak 2012-02-29 14:12:51 UTC

(In reply to comment #48)
> Not sure if this is the same issue we were fixing in libcurl, but the symptoms
> are pretty similar.  The solution was to avoid using
> PK11_FindCertFromNickname() to look for the certificates:

Thanks for sharing. But this is not applicable in our case unfortunatelly. PK11_FindCertFromNickname() is used to reference certificate nickname in the certdb. And the nickname is taken from the server configuration file.

Comment 50 Jan Vcelak 2012-02-29 15:40:25 UTC

My idea is to set token description during initialization and use it when requesting the certificate.

If two different contexts request the same certdb, NSS_InitContext succeeds, but I'm unable to query the get the certificates with the prefix from the second initialization.

initParams.dbTokenDescription = "db1";
ctx1 = InitContext("./certdb", "", "", SECMOD_DB, &initParams, NSS_INIT_READONLY);

initParams.dbTokenDescription = "db2";
ctx2 = InitContext("./certdb", "", "", SECMOD_DB, &initParams, NSS_INIT_READONLY);

initParams.dbTokenDescription = "db3";
ctx2 = InitContext("././certdb", "", "", SECMOD_DB, &initParams, NSS_INIT_READONLY);

CERT_FindCertByNickname(certdb, "server") --> works
CERT_FindCertByNickname(certdb, "db1:server") --> works
CERT_FindCertByNickname(certdb, "db2:server") --> does not work
CERT_FindCertByNickname(certdb, "db3:server") --> works (surprise)

It makes sense, that only unique certficates are added to the database. But it is unusable because during the second NSS_InitContext, the prefix is added automatically.

Which means, that we can either lookup the certificate manually using PK11_ListCerts() and do some "heuristics". Or, use InitContext() with NSS_INIT_NOCERTDB|NSS_INIT_NOMODDB and open the certdb for each OpenLDAP TLS context using SECMOD_OpenUserDB(). Which seems to work always fine when prefixing token description.

Comment 51 Rich Megginson 2012-02-29 15:44:53 UTC

(In reply to comment #50)
> My idea is to set token description during initialization and use it when
> requesting the certificate.
> 
> If two different contexts request the same certdb, NSS_InitContext succeeds,
> but I'm unable to query the get the certificates with the prefix from the
> second initialization.
> 
> initParams.dbTokenDescription = "db1";
> ctx1 = InitContext("./certdb", "", "", SECMOD_DB, &initParams,
> NSS_INIT_READONLY);
> 
> initParams.dbTokenDescription = "db2";
> ctx2 = InitContext("./certdb", "", "", SECMOD_DB, &initParams,
> NSS_INIT_READONLY);
> 
> initParams.dbTokenDescription = "db3";
> ctx2 = InitContext("././certdb", "", "", SECMOD_DB, &initParams,
> NSS_INIT_READONLY);
> 
> CERT_FindCertByNickname(certdb, "server") --> works
> CERT_FindCertByNickname(certdb, "db1:server") --> works
> CERT_FindCertByNickname(certdb, "db2:server") --> does not work
> CERT_FindCertByNickname(certdb, "db3:server") --> works (surprise)
> 
> It makes sense, that only unique certficates are added to the database. But it
> is unusable because during the second NSS_InitContext, the prefix is added
> automatically.
> 
> Which means, that we can either lookup the certificate manually using
> PK11_ListCerts() and do some "heuristics". Or, use InitContext() with
> NSS_INIT_NOCERTDB|NSS_INIT_NOMODDB and open the certdb for each OpenLDAP TLS
> context using SECMOD_OpenUserDB(). Which seems to work always fine when
> prefixing token description.

I'm not sure if this is how NSS is intended to work, and I don't think apps are meant to call SECMOD_OpenUserDB() directly.  You should talk to Robert Relyea and Elio Maldonado about this.

Comment 52 Jan Vcelak 2012-04-04 14:54:58 UTC

*** Bug 809817 has been marked as a duplicate of this bug. ***

Comment 54 Jan Vcelak 2012-04-30 18:11:51 UTC

Still working on it. Some update: I have a patch which uses SECMOD_OpenUserDB(), now I'm testing if it works will all types of configuration. But there is a problem with NSS_ShutdownContext() which always fails. I found a upstream bug report on this.

I will polish and post the patch, hopefully soon.

Comment 55 Jan Vcelak 2012-06-06 16:37:26 UTC

Still working on it. I keep my changes on github in moznss branch:
https://github.com/fcelda/openldap

Most of the problems seems to be solved. I see no certdb issues with various combinations of TLS config options. But there are still some problems with PEM certificates, I'm still getting SEC_ERROR_UNKNOWN_PKCS11_ERROR and SSL_ERROR_BAD_MAC_ALERT occasionally.

Comment 56 Jan Vcelak 2012-06-18 12:36:53 UTC

Finally, I have a solution, there were too many bugs which were complicating this:

- Every MozNSS certdb is opened using SECMOD_OpenUserDB with a token description unique for each context within libldap, certificate lookup by nickname always works in this case.

- PEM certificates are not referenced by nickname derived from the certificate file base name, but by their DER value (this is already used in CURL).

- There was a problem with uninitialized memory when retrieving the certdb pin which sometimes caused reading the pin file infinitely.

- SEC_ERROR_UNKNOWN_PKCS11_ERROR (not only this error) was caused by thread-unsafety, when I locked the whole initialization and cleanup process by a mutex, the errors disappeared. Suppose this will not have any performance impact, we are doing this only once per context.

- SSL_ERROR_BAD_MAC_ALERT was caused by PEM module supplying invalid key for the certificate. PEM module expects that all certificates with CKA_TRUST == true are CA certificates. After removing this for non-CA certificates, it works as expected.


All these fixes are in my repo on github. Some are already merged upstream. I will fine-tune the rest before submitting them.

Comment 58 Jan Vcelak 2012-06-27 16:23:30 UTC

All my patches were accepted upstream. And I just submitted an update for Fedora 17 (openldap-2.4.31-3.fc17) with these patches applied.

Comment 62 Jan Vcelak 2012-09-25 16:10:09 UTC

Resolved in: openldap-2.4.23-29.el6

Comment 66 Trevor Hemsley 2012-11-21 12:58:21 UTC

Is there some place I might find this updated RPM (or SRPM that I can rebuild)? I believe I am seeing this issue with 3 servers set up to do N-way multi-master replication on CentOS 6.3. Currently using 2.4.23-26.el6_3.2 and what I see is that if I am lucky I can start all 3 servers and everything works correctly. Quite often, I can start 2 out of 3 servers successfully but starting the 3rd one will kill one or both of the others - they all still respond to ldapsearch on ldap:// but requests to ldaps:// hang forever. If only one of the other 2 stops responding then just restarting the 3rd might fix both but if it kills both of the others then I have to stop any 2 and then the remaining one comes back to life.

Comment 67 Jan Vcelak 2012-11-22 11:45:13 UTC

Hello Trevor,

I have uploaded SRPM and test packages for x86_64, i686 to:
http://people.redhat.com/jvcelak/bz707599/

These packages have been provided by Red Hat for testing purposes only and are
NOT supported for any other use. These RPMs may contain changes that are
necessary for debugging but that are not appropriate for other uses, or that
are not compatible with third-party hardware or software. These RPMs should
NOT be deployed for purposes other than testing and debugging.
 
Please test in your environment, and let us know the outcome.

Jan

Comment 68 Trevor Hemsley 2012-11-22 16:17:15 UTC

Thank Jan, I have downloaded and installed those packages but I still have the same problem so I think I have a different issue. If you had any tips on how I might debug it, I would appreciate it!

Comment 69 Jan Vcelak 2012-11-23 09:26:10 UTC

Hi Trevor.

Please, create a separate bug report for that. It would be great if you could provide configuration files for all the servers and testing certificates. If you do not have testing certificates, the textual representation (certutil -d . -L -n certname / openssl x509 -text) will do as well. Feel free to obfuscated any sensitive information. I can try to reproduce.

But in general, starting the server manually with -d1 options should give you some clue, where the problem is. (slapd -u ldap -g ldap -F /etc/openldap/slapd.d -d1)

Jan

Comment 70 Trevor Hemsley 2012-11-23 14:33:30 UTC

Jan, so it appears to be (probably) user error but I don't see anywhere documented that you should use a different certificate on the olcSyncRepl to the one used by each server so I had set each of my 3 servers up with a single certificate each and that was referenced in both olcTLSCertificateFile in cn=config.ldif and in the olcSyncrepl tls_cert= in olcDatabase={0}config.ldif

Doing this seems to give approximately a 30% chance of hanging the ldaps listener. I can almost always start all 3 servers if I logon to each box and run `service slapd start` on server 1 then 2 then 3 but running `service slapd restart` on say server 3 will kill the ldaps listener on itself and at least one of the others about 1 time in 3 restarts.

I've now set them up with 2 certificates per server, one referenced by olcTLSCertificateFile and then a different one in the olcSyncrepl and this seems to make the problem go away. Not sure if that would be classified as a bug or as user error but at least it's now googleable and may help someone else with the same symptoms in the future.

Comment 71 Josh Mullis 2013-01-02 20:22:05 UTC

Hello,

I am also experiencing the exact same issue with the same setup, and while at first it appeared as though the different cert/key pairs on each server helped, after a few minutes the problem would reoccur when issuing a slapd restart.



Thanks
-josh

Comment 72 Jan Vcelak 2013-01-03 18:56:32 UTC

Trevor and Josh, please, create a separate bug report for that problem.

Comment 73 Josh Mullis 2013-01-03 19:53:53 UTC

Jan -
I'm sorry.  I was redirected from bug 809817.
I suppose that bug actually fits closer to my issue.

If you need a separate bug filed, can you just reopen 809817 ?


Thanks
-josh

Comment 74 Josh Mullis 2013-01-03 21:17:46 UTC

On a side note, would it be possible to provide an RPM of openldap compiled with openssl instead of moznss?

That seems to be the fix that's going around the openldap forums.

Thanks
-josh

Comment 75 Jan Vcelak 2013-01-04 23:07:30 UTC

(In reply to comment #73)
> Jan -
> I'm sorry.  I was redirected from bug 809817.

Oh, I see. I will take a look at it. Thank you, Josh.

(In reply to comment #74)
> On a side note, would it be possible to provide an RPM of openldap compiled
> with openssl instead of moznss?
> 
> That seems to be the fix that's going around the openldap forums.

I do not want to provide even testing packages compiled with OpenSSL, because there is no plan to support this configuration in future and I guess that Directory Server packages in RHEL rely on OpenLDAP with Mozilla NSS. (But you can easily recompile the package by yourself.)

Jan

Comment 76 Rich Megginson 2013-01-04 23:19:00 UTC

(In reply to comment #75)
> (In reply to comment #73)
> > Jan -
> > I'm sorry.  I was redirected from bug 809817.
> 
> Oh, I see. I will take a look at it. Thank you, Josh.
> 
> (In reply to comment #74)
> > On a side note, would it be possible to provide an RPM of openldap compiled
> > with openssl instead of moznss?
> > 
> > That seems to be the fix that's going around the openldap forums.
> 
> I do not want to provide even testing packages compiled with OpenSSL,
> because there is no plan to support this configuration in future and I guess
> that Directory Server packages in RHEL rely on OpenLDAP with Mozilla NSS.

True, and switching to OpenSSL would cause major short term problems for 389, but for long term, as long as the 389 team had enough lead time, the fact that 389 uses Mozilla NSS should not prevent OpenLDAP from using OpenSSL instead.

> (But you can easily recompile the package by yourself.)

. . . which is the real "fix" recommended by people on OpenLDAP forums.

> 
> Jan

Comment 78 errata-xmlrpc 2013-02-21 09:45:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0364.html