Bug 1730109

Summary: NSS fails to handshake on s390x
Product: Red Hat Enterprise Linux 8 Reporter: Alex Scheel <ascheel>
Component: jssAssignee: RHCS Maintainers <rhcs-maint>
Status: CLOSED UPSTREAM QA Contact: PKI QE <bugzilla-pkiqe>
Severity: high Docs Contact:
Priority: high    
Version: 8.1CC: ascheel, hkario, nss-nspr-maint, rrelyea
Target Milestone: rcKeywords: Triaged
Target Release: 8.2   
Hardware: s390   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-08-15 17:29:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
ltrace on relevant test case, s390x none

Description Alex Scheel 2019-07-15 21:03:21 UTC
Description of problem:


During the RHEL 8.1 release cycle, I had to disable a SSL handshake (client + server) test case in JSS. I've confirmed that this is a NSS issue. The following test case will not work on s390x with RSA certificates, but works fine with s390x and ECDSA certificates or on non-s390x architectures with either type of certificate.


https://github.com/dogtagpki/jss/blob/master/org/mozilla/jss/tests/TestBufferPRFD.c#L315-L338


Version-Release number of selected component (if applicable):

# rpm -qa | grep '^nss'
nss-softokn-freebl-3.41.0-5.el8.s390x
nss-util-3.41.0-5.el8.s390x
nss-util-devel-3.41.0-5.el8.s390x
nss-softokn-3.41.0-5.el8.s390x
nss-sysinit-3.41.0-5.el8.s390x
nss-softokn-devel-3.41.0-5.el8.s390x
nss-devel-3.41.0-5.el8.s390x
nss-3.41.0-5.el8.s390x
nss-softokn-freebl-devel-3.41.0-5.el8.s390x
nss-tools-3.41.0-5.el8.s390x


How reproducible:

Very



Steps to Reproduce:

1. Clone jss from upstream: git clone https://github.com/dogtagpki/jss
2. Install required dependencies: sudo dnf module enable -y javapackages-tools pki-deps && sudo dnf install apache-commons-codec apache-commons-lang gcc-c++ java-devel jpackage-utils slf4j zlib-devel glassfish-jaxb-api nss-tools nss-devel cmake junit tar zip unzip time git -y
3. Try building: cd build && cmake .. && make all && make test 



Actual results:

Two test cases will fail:

The following tests FAILED:
	 17 - JSS_Test_BufferPRFD (Failed)
	 36 - TestBufferPRFD_RSA (Failed)



Expected results:

All test cases succeed, like they do on every other architecture.

Comment 1 Alex Scheel 2019-08-08 16:50:24 UTC
This problem occurs on Fedora as well. It is likely related to async io.

Comment 2 Alex Scheel 2019-08-09 15:31:35 UTC
Created attachment 1602241 [details]
ltrace on relevant test case, s390x

bash -c 'LD_LIBRARY_PATH="$(pwd)/lib:" ltrace -l "*ss*" ./bin/TestBufferPRFD "/root/sandbox/jss/build/results/nssdb" "m1oZilla" "Server_RSA" 2>&1' 2>&1  > ltrace.txt

(Grabs stuff from ssl3 and nss3 libraries; not entirely sure it is helpful).

Comment 3 Alex Scheel 2019-08-09 17:10:19 UTC
Ok, so I think this is a weird edge case on s390x and its handling of async io. 


On s390x, the async io adapter gets:

> Found cert with nickname: Server_RSA
> Trying handshake...
> Client Handshake:
> <buffer:client_write_buffer:0xbbcc7c90>  jb_write of 191 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_write_buffer:0xbbcc7c90>  jb_write of 0 bytes | write_cap of 1857 bytes | read_cap of 191 bytes
> 
> 
> Server Handshake:
> <buffer:client_write_buffer:0xbbcc7c90>  jb_read of 5 bytes | write_cap of 1857 bytes | read_cap of 191 bytes
> <buffer:client_write_buffer:0xbbcc7c90>  jb_read of 0 bytes | write_cap of 1862 bytes | read_cap of 186 bytes
> <buffer:client_write_buffer:0xbbcc7c90>  jb_read of 186 bytes | write_cap of 1862 bytes | read_cap of 186 bytes
> <buffer:client_write_buffer:0xbbcc7c90>  jb_read of 0 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0xbbccc400>  jb_write of 3342 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0xbbccc400>  jb_write of 1294 bytes | write_cap of 0 bytes | read_cap of 2048 bytes
> 
> 
> Client Handshake:
> <buffer:client_read_buffer:0xbbccc400>  jb_read of 5 bytes | write_cap of 0 bytes | read_cap of 2048 bytes
> <buffer:client_read_buffer:0xbbccc400>  jb_read of 0 bytes | write_cap of 5 bytes | read_cap of 2043 bytes
> <buffer:client_read_buffer:0xbbccc400>  jb_read of 3337 bytes | write_cap of 5 bytes | read_cap of 2043 bytes
> <buffer:client_read_buffer:0xbbccc400>  jb_read of 1294 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> 
> 
> Server Handshake:
> 
> 
> Client Handshake:
> 
> 
> Server Handshake:
> 
> 
> Client Handshake:(i.e., 
> 
> 

It tries to write 3342 bytes into a buffer of size 2048. 1294 bytes remain, and it never tries to write those remaining bytes again. I think it should under semantics of write(...).


On x86_64 and other platforms: 

> Found cert with nickname: Server_RSA
> Trying handshake...
> Client Handshake:
> <buffer:client_write_buffer:0x1fe0430>  jb_write of 177 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_write of 0 bytes | write_cap of 1871 bytes | read_cap of 177 bytes
> 
> 
> Server Handshake:
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 5 bytes | write_cap of 1871 bytes | read_cap of 177 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 1876 bytes | read_cap of 172 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 172 bytes | write_cap of 1876 bytes | read_cap of 172 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 3277 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 1229 bytes | write_cap of 0 bytes | read_cap of 2048 bytes
> 
> 
> Client Handshake:
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 5 bytes | write_cap of 0 bytes | read_cap of 2048 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 0 bytes | write_cap of 5 bytes | read_cap of 2043 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 3272 bytes | write_cap of 5 bytes | read_cap of 2043 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 1229 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> 
> 
> Server Handshake:
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 1229 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 0 bytes | write_cap of 819 bytes | read_cap of 1229 bytes
> 
> 
> Client Handshake:
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 1229 bytes | write_cap of 819 bytes | read_cap of 1229 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 0 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_write of 93 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_write of 0 bytes | write_cap of 1955 bytes | read_cap of 93 bytes
> 
> 
> Server Handshake:
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 5 bytes | write_cap of 1955 bytes | read_cap of 93 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 1960 bytes | read_cap of 88 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 37 bytes | write_cap of 1960 bytes | read_cap of 88 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 1997 bytes | read_cap of 51 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 5 bytes | write_cap of 1997 bytes | read_cap of 51 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 2002 bytes | read_cap of 46 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 1 bytes | write_cap of 2002 bytes | read_cap of 46 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 2003 bytes | read_cap of 45 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 5 bytes | write_cap of 2003 bytes | read_cap of 45 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 2008 bytes | read_cap of 40 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 40 bytes | write_cap of 2008 bytes | read_cap of 40 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 51 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 0 bytes | write_cap of 1997 bytes | read_cap of 51 bytes
> 
> 
> Send a message from the client to the server...
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 5 bytes | write_cap of 1997 bytes | read_cap of 51 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 0 bytes | write_cap of 2002 bytes | read_cap of 46 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 1 bytes | write_cap of 2002 bytes | read_cap of 46 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 0 bytes | write_cap of 2003 bytes | read_cap of 45 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 5 bytes | write_cap of 2003 bytes | read_cap of 45 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 0 bytes | write_cap of 2008 bytes | read_cap of 40 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 40 bytes | write_cap of 2008 bytes | read_cap of 40 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 0 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_write of 40 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_write of 0 bytes | write_cap of 2008 bytes | read_cap of 40 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 5 bytes | write_cap of 2008 bytes | read_cap of 40 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 2013 bytes | read_cap of 35 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 35 bytes | write_cap of 2013 bytes | read_cap of 35 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_read of 0 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> Received message from client: Cooking MCs [len: 11]
> 
> 
> Send a message from the server to the client...
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 50 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 0 bytes | write_cap of 1998 bytes | read_cap of 50 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 5 bytes | write_cap of 1998 bytes | read_cap of 50 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 0 bytes | write_cap of 2003 bytes | read_cap of 45 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 45 bytes | write_cap of 2003 bytes | read_cap of 45 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_read of 0 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> Received message from client: like a pound of bacon [len: 21]
> <buffer:client_write_buffer:0x1fe0430>  jb_write of 31 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_write_buffer:0x1fe0430>  jb_write of 0 bytes | write_cap of 2017 bytes | read_cap of 31 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 31 bytes | write_cap of 2048 bytes | read_cap of 0 bytes
> <buffer:client_read_buffer:0x1fe04e0>  jb_write of 0 bytes | write_cap of 2017 bytes | read_cap of 31 bytes

It overflows the buffer size of 2048, (with a slightly larger write, I'm not sure why or what the differences are). And next time SSL_ForceHandshake gets called on the server socket, it tries to write the remaining bytes and succeeds.

Comment 5 Bob Relyea 2019-08-14 17:54:15 UTC
I think this is consistent with a race condition. I suspect we are racing on the ss->pendingBug.len.

Comment 6 Bob Relyea 2019-08-15 16:15:54 UTC
OK, The issue looks like it's in nspr: 

opt.option = PR_SockOpt_Nonblocking;
opt.value.non_blocking = PR_FALSE;
status = PR_GetSocketOption(fd, &opt);

is returning PR_FALSE for non-blocking instead of PR_TRUE (since the socket is non-blocking). This mean SSL_ForceHandshake() won't write out the the buffered write data (since there should be none on a blocking socket).

I don't know if the base problem is NSPR or the kernel. Moving component to NSPR. (doesn't change that I'm still actively looking at the underlying problem).

Comment 7 Bob Relyea 2019-08-15 16:56:18 UTC
The bug is in: ./org/mozilla/jss/ssl/javax/BufferPRFD.c


The following function is incorrect:

// Fake responses to getSocketOption requests
static PRStatus PRBufferGetSocketOption(PRFileDesc *fd, PRSocketOptionData *data)
{   
    /* getSocketOption takes a PRFileDesc and modifies the PRSocketOptionData
     * with the options on this. We set a couple of sane defaults here:
     *
     *   non_blocking = true
     *   reuse_addr = true
     *   keep_alive = false
     *   no_delay = true
     *
     * However the list above is far fom extensive. Note that responses are
     * "fake" in that calls to setSocketOption fail to reflect here.
     */

    if (data) {
        PRFilePrivate *internal = fd->secret;

        data->value.non_blocking = PR_TRUE;
        data->value.reuse_addr = PR_TRUE;
        data->value.keep_alive = PR_FALSE;
        data->value.mcast_loopback = PR_FALSE;
        data->value.no_delay = PR_TRUE;
        data->value.max_segment = jb_capacity(internal->read_buffer);
        data->value.recv_buffer_size = jb_capacity(internal->read_buffer);
        data->value.send_buffer_size = jb_capacity(internal->write_buffer);

        return PR_SUCCESS;
    }

    return PR_FAILURE;
}

PRSocketOptionData.value is a union. You need to switch on data->option to select the correct value. On all the other platforms this sort of worked because either sizeof data->value.send_buffer_size  == sizeof (PRBool), or the platform was little endian so value->non_blocking was non-zero. on s390, sizeof (data->value.send_buffer_size) was greater than sizeof (PRBool) and the platform was big endian.

Upshot PRBufferGetSocketOption needs to be coded as a switch statement and only set the requested value;

Comment 8 Bob Relyea 2019-08-15 17:17:09 UTC
A good testcase for this that would fail on all platforms:

Use PR_GetSocketOption() to fetch each option and make sure it's correct (compare bools to PR_TRUE or PR_FALSE).

Example:

opt.option = PR_SockOpt_Nonblocking;
opt.value.non_blocking = PR_FALSE;
status = PR_GetSocketOption(fd, &opt);
if (status != PR_SUCCESS)
    return FAIL;
if (opt.value.non_blocking != PR_TRUE)
    return FAIL;
return SUCCESS

Do this for each value PRBufferGetSocketOption intends to return. Also test on a socket with different write and read buffer lengths.

Comment 9 Alex Scheel 2019-08-15 17:29:46 UTC
Ahh... That does make a lot more sense why I only see the problem on s390x. 

I think I had thought at the time it was weird that PRBufferGetSocketOption didn't take a specific option to query like getsockopt(...) does. I definitely need to figure out where NSPR has put its other PRFileDesc related tests and exercise mine as well.

OTOH, we now know that NSS and NSPR both do non-blocking better than I can! :P

My bad, and many thanks for pointing me in the right direction!