Bug 1377646

Summary: Intermittent NSS SSL/TLS connectivity issue with error code SEC_ERROR_PKCS11_DEVICE_ERROR
Product: [Fedora] Fedora Reporter: Alexey Gladkov <agladkov>
Component: nssAssignee: Kai Engert (:kaie) (inactive account) <kengert>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: dueno, emaldona, hkario, kdudka, kengert, lakostis, pasik, rrelyea
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-24 10:28:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
tcpdump of failed session none

Description Alexey Gladkov 2016-09-20 09:28:53 UTC
Description of problem:

After update to recent recent nss (v3.26.0), strange problem appears: firefox (or curl) drop connections to sites having outdated security settings and probably not FIPS-compatible. See https://api.telecom13.ru/v2/user_auth as example:

----
Secure Connection Failed

An error occurred during a connection to api.telecom13.ru. A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot. Error code: SEC_ERROR_PKCS11_DEVICE_ERROR
----

I can reproduce this problem on recent Fedora, but not on Debian/Ubunty, Google Chrome and Firefox from mozilla.org site.

Quick code lookup brings an idea that this could due enabled FIPS mode (because SEC_ERROR_PKCS11_DEVICE_ERROR is triggered by NSSLOW_Init call). But at the same time I don't have FIPS mode enabled (checked on two computers where /proc/sys/crypto/fips_enabled is missing or contains 0).

Version-Release number of selected component (if applicable):
nss.x86_64 3.26.0-1.0.fc23
nss-softokn.x86_64 3.26.0-1.0.fc23
nss-softokn-freebl.x86_64 3.26.0-1.0.fc23
nss-sysinit.x86_64 3.26.0-1.0.fc23
nss-tools.x86_64 3.26.0-1.0.fc23
nss-util.x86_64 3.26.0-1.0.fc23

How reproducible:

The problem is reproducible constantly

Actual results:
Error code: SEC_ERROR_PKCS11_DEVICE_ERROR

Expected results:

Answer with status 405 and content:

{"data":{"message":"method not allowed"},"error":"405","message":"invalid_method","status":"error","request_id":"c6697051847aaf62b05bcd9fe0f34dcf","auth_token":""}

Comment 1 Kamil Dudka 2016-09-26 16:36:27 UTC
CKR_DEVICE_ERROR is misleading.  The error seems to come from RSA_CheckSign(), which is annotated by "XXX Doesn't set error code":

1310│     /*
1311│      * make sure we get the same results
1312│      */
1313│     if (PORT_Memcmp(buffer + modulusLen - dataLen, data, dataLen) != 0)
1314├>        goto loser;
1315│
1316│     PORT_Free(buffer);
1317│     return SECSuccess;
1318│
1319│ loser:
1320│     PORT_Free(buffer);
1321│ failure:
1322│     return SECFailure;
1323│ }

Comment 2 Kai Engert (:kaie) (inactive account) 2016-09-28 17:21:57 UTC
I cannot reproduce the bug on Fedora 24.
I'm getting the expected result (the text result, no error)
with NSS 3.26, with both firefox and curl.

nss-util-3.26.0-1.0.fc24.x86_64
nss-3.26.0-1.0.fc24.x86_64
nss-softokn-freebl-3.26.0-1.0.fc24.x86_64
nss-pem-1.0.2-2.fc24.x86_64
firefox-48.0.1-1.fc24.x86_64
nss-tools-3.26.0-1.0.fc24.x86_64
nss-sysinit-3.26.0-1.0.fc24.x86_64
nss-softokn-3.26.0-1.0.fc24.x86_64

/proc/sys/crypto/fips_enabled is 0

Is this bug specific to Fedora 23 ?

Do you have any global pkcs#11 modules registered on your systems?

Comment 3 Kai Engert (:kaie) (inactive account) 2016-09-28 17:30:54 UTC
This is a strange coincidence, because a few minutes after I looked at this bug for the first time, I saw the error myself for the very first time, on Fedora 24.

I got the same error message with a different web page. When I hit reload, the error went away, and the page loaded fine.

I wonder if this could be a side effect of a network connectivity issue occuring at an unepected time.

It doesn't seem FIPS related.

Comment 4 Kai Engert (:kaie) (inactive account) 2016-09-28 18:26:00 UTC
On your systems, do you see the error every time, or just sometimes?

Comment 6 Konstantin A. Lepikhov 2016-09-29 07:48:59 UTC
(In reply to Kai Engert (:kaie) from comment #4)
> On your systems, do you see the error every time, or just sometimes?

Hi Kai,

From originating bug at ALTLinux I see that they're replaced Apache by nginx and problem disappeared (obviously due more strict cipher suite there). But old interface where the problem still exists available for testing on different port:

https://api.telecom13.ru:8443/v2/user_auth

Cheers

Comment 7 Kamil Dudka 2016-09-29 08:05:45 UTC
(In reply to Kai Engert (:kaie) from comment #2)
> Is this bug specific to Fedora 23 ?

Unlikely.  Both f23 and f24 run the same version of nss if they are up2date.

> Do you have any global pkcs#11 modules registered on your systems?

I did not have any special setup when debugging this.

(In reply to Kai Engert (:kaie) from comment #4)
> On your systems, do you see the error every time, or just sometimes?

Every time but the original URL cannot be used any more as the server changed.

(In reply to Konstantin A. Lepikhov from comment #6)
> https://api.telecom13.ru:8443/v2/user_auth

Thanks!  The above URL works reliably for me:

$ curl -v https://api.telecom13.ru:8443/v2/user_auth
*   Trying 194.150.104.153...
* TCP_NODELAY set
* Connected to api.telecom13.ru (194.150.104.153) port 8443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* NSS error -8023 (SEC_ERROR_PKCS11_DEVICE_ERROR)
* A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot.
* Curl_http_done: called premature == 1
* stopped the pause stream!
* Closing connection 0
curl: (35) A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot.

Comment 8 Alexey Gladkov 2016-09-29 08:53:32 UTC
(In reply to Kai Engert (:kaie) from comment #2)
> I cannot reproduce the bug on Fedora 24.

They change URL. Try this: https://api.telecom13.ru:8443/v2/user_auth

> Do you have any global pkcs#11 modules registered on your systems?

No.

(In reply to Kai Engert (:kaie) from comment #4)
> On your systems, do you see the error every time, or just sometimes?

Reproduced constantly.

Comment 9 Kai Engert (:kaie) (inactive account) 2016-10-18 14:30:28 UTC
Bob, any ideas what could produce this error?

Comment 10 Kai Engert (:kaie) (inactive account) 2016-10-18 15:40:17 UTC
Could you use ssltap to find out which ciphers were being used here?

What computer hardware is involved here?

Comment 11 Hubert Kario 2016-10-18 16:32:08 UTC
the server at api.telecom13.ru:8443 is intolerant to additional signature algorithms (RSA-PSS from TLSv1.3 draft)

Comment 12 Kamil Dudka 2016-10-18 16:36:29 UTC
Hubert, are you still able to reproduce the SEC_ERROR_PKCS11_DEVICE_ERROR failure?

I am currently getting just PR_END_OF_FILE_ERROR, which differs from the behavior captured on September 29 (comment #7).  I suspect that something on the server has changed since then...

Comment 13 Alexey Gladkov 2016-10-19 11:32:42 UTC
I can reproduce:

$ curl -v https://api.telecom13.ru:8443/v2/user_auth
*   Trying 194.150.104.153...
* Connected to api.telecom13.ru (194.150.104.153) port 8443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* NSS error -8023 (SEC_ERROR_PKCS11_DEVICE_ERROR)
* A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot.
* Closing connection 0
curl: (35) A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has occurred with the token or slot.

Comment 14 Kamil Dudka 2016-10-19 12:06:09 UTC
Then NSS needs to be fixed such that it does not fail with CKR_DEVICE_ERROR while connecting to TLS intolerant servers.  I believe that the debugger snapshot presented in comment #1 is a good hint for writing the fix.

Comment 15 Hubert Kario 2016-10-19 13:11:43 UTC
(In reply to Kamil Dudka from comment #12)
> Hubert, are you still able to reproduce the SEC_ERROR_PKCS11_DEVICE_ERROR
> failure?

no, I never saw it
 
> I am currently getting just PR_END_OF_FILE_ERROR, which differs from the
> behavior captured on September 29 (comment #7).  I suspect that something on
> the server has changed since then...

that's what I always saw, and I think it's because of what I said in Comment #11

(In reply to Kamil Dudka from comment #14)
> Then NSS needs to be fixed such that it does not fail with CKR_DEVICE_ERROR
> while connecting to TLS intolerant servers.  I believe that the debugger
> snapshot presented in comment #1 is a good hint for writing the fix.

I don't think the intolerance is causing that, though I don't have any better explanation.

Comment 16 Hubert Kario 2016-10-19 13:13:29 UTC
(In reply to Alexey Gladkov from comment #13)
> I can reproduce:
> 
> $ curl -v https://api.telecom13.ru:8443/v2/user_auth
> *   Trying 194.150.104.153...
> * Connected to api.telecom13.ru (194.150.104.153) port 8443 (#0)
> * Initializing NSS with certpath: sql:/etc/pki/nssdb
> *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
>   CApath: none
> * NSS error -8023 (SEC_ERROR_PKCS11_DEVICE_ERROR)
> * A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has
> occurred with the token or slot.
> * Closing connection 0
> curl: (35) A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a
> problem has occurred with the token or slot.

can you please make a packet capture of a connection that results in this error? tcpdump format preferably

Comment 17 Konstantin A. Lepikhov 2016-10-19 18:10:28 UTC
(In reply to Hubert Kario from comment #16)
> (In reply to Alexey Gladkov from comment #13)
> > I can reproduce:
> > 
> > $ curl -v https://api.telecom13.ru:8443/v2/user_auth
> > *   Trying 194.150.104.153...
> > * Connected to api.telecom13.ru (194.150.104.153) port 8443 (#0)
> > * Initializing NSS with certpath: sql:/etc/pki/nssdb
> > *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
> >   CApath: none
> > * NSS error -8023 (SEC_ERROR_PKCS11_DEVICE_ERROR)
> > * A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a problem has
> > occurred with the token or slot.
> > * Closing connection 0
> > curl: (35) A PKCS #11 module returned CKR_DEVICE_ERROR, indicating that a
> > problem has occurred with the token or slot.
> 
> can you please make a packet capture of a connection that results in this
> error? tcpdump format preferably

Attached. Captured by tcpdump using this commmand:

$ sudo tcpdump -s 65535 -w bug1377646.dump -i eth0 host 194.150.104.153

Comment 18 Konstantin A. Lepikhov 2016-10-19 18:11:13 UTC
Created attachment 1212238 [details]
tcpdump of failed session

Comment 19 Hubert Kario 2016-10-20 13:57:02 UTC
I've checked the first 2 connections in the packet dump, in both cases the signature created by server is invalid - the hashes don't match the data exchanged.

So it's the server that is buggy, not the client.

While the error returned by NSS is misleading (as Kamil already said in Comment #1), the resulting connection abort (including the alert it sends back to the server) is entirely correct and expected.

Comment 20 Kai Engert (:kaie) (inactive account) 2016-10-24 10:28:35 UTC
Hubert, thank you very much for your analysis.

Marking as not-a-bug.

To get the error code improved, it would be best to file an upstream bug.

Comment 21 Kamil Dudka 2016-10-24 10:54:06 UTC
(In reply to Kai Engert (:kaie) from comment #20)
> To get the error code improved, it would be best to file an upstream bug.

Even with patch attached, it would take months or years to get a reply.  I have no local reproducer for this and, by the time someone actually replies, I will no longer remember what this bug was about.

Comment 22 Hubert Kario 2016-10-24 10:59:49 UTC
Kamil, I can create such misbehaving server in like 5 minutes of work :)

Comment 23 Kamil Dudka 2016-10-24 11:21:26 UTC
(In reply to Hubert Kario from comment #22)
> Kamil, I can create such misbehaving server in like 5 minutes of work :)

If you are able to create it in a way that NSS upstream can use, it would be really appreciated!

Based on comment #5 I thought that a local reproducer was close to impossible because the error I debugged was supposed to only happen as a result of some internal error?

Comment 24 Hubert Kario 2016-10-24 11:46:39 UTC
(In reply to Kamil Dudka from comment #23)
> (In reply to Hubert Kario from comment #22)
> > Kamil, I can create such misbehaving server in like 5 minutes of work :)
> 
> If you are able to create it in a way that NSS upstream can use, it would be
> really appreciated!
> 
> Based on comment #5 I thought that a local reproducer was close to
> impossible because the error I debugged was supposed to only happen as a
> result of some internal error?

well, if it *was* internal error then bug in bignum library would be correct, but your comment #1 looks to be correct - the error is just mislabeled

To be precise, the server sends a signature that can be decrypted using public key from certificate, but the decrypted value does not include correct hashes, but *does* include correct padding.

The solution I was thinking about was monkey-patching tlslite-ng - modifying ServerKeyExchange.hash() to ignore serverRandom for hash creation would be enough.