
Bug 1200772

Summary: Upstream test suite run on i386 reports segmentation faults
Product: Red Hat Enterprise Linux 6
Reporter: Alicja Kario <hkario>
Component: nss
Assignee: Kai Engert (:kaie) (inactive account) <kengert>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: BaseOS QE Security Team <qe-baseos-security>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.6
CC: ebenes, emaldona, hkario, jrieden, kengert, ksrot, nkinder, pvrabec, rrelyea, tmraz
Target Milestone: rc   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-31 14:27:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alicja Kario 2015-03-11 11:23:30 UTC
Description of problem:
Running the upstream test suite on machines without SSE2 (such as Pentium III, possibly also first-generation Athlon) causes some tests to fail with segmentation faults.

Version-Release number of selected component (if applicable):
nss-3.16.2.3-3.el6_6.i686
nspr-devel-4.10.6-1.el6_5.i686
nspr-4.10.6-1.el6_5.i686

How reproducible:
Always

Steps to Reproduce:
1. recompile NSS and run test suite on PIII system

Actual results:
(multiple tests fail, showing just one for brevity)

trying to kill selfserv_9118 with PID 4200 at Mon Mar  9 16:02:34 EDT 2015
kill -USR1 4200
selfserv: 999 cache hits; 1 cache misses, 0 cache not reusable
          999 stateless resumes, 0 ticket parse failures
selfserv: normal termination
selfserv_9118 -b -p 9118 2>/dev/null;
selfserv_9118 with PID 4200 killed at Mon Mar  9 16:02:35 EDT 2015
ssl.sh: Stress TLS  RC4 128 with MD5 (compression) ----
selfserv_9118 starting at Mon Mar  9 16:02:35 EDT 2015
selfserv_9118 -D -p 9118 -d ../server -n localhost.localdomain -B -s \
         -e localhost.localdomain-ec -w nss -z -i ../tests_pid.13123  &
trying to connect to selfserv_9118 at Mon Mar  9 16:02:35 EDT 2015
tstclnt -p 9118 -h localhost.localdomain  -q \
        -d ../client -v < /tmp/tmp.sUzTc1nCjq/rpmroot/BUILD/nss-3.16.2.3/nss/tests/ssl/sslreq.dat
tstclnt: connecting to localhost.localdomain:9118 (address=::1)
kill -0 4300 >/dev/null 2>/dev/null
selfserv_9118 with PID 4300 found at Mon Mar  9 16:02:35 EDT 2015
selfserv_9118 with PID 4300 started at Mon Mar  9 16:02:35 EDT 2015
strsclnt -q -p 9118 -d ../client  -w nss -V ssl3: -c 1000 -C c -z \
          localhost.localdomain
strsclnt started at Mon Mar  9 16:02:35 EDT 2015
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
          0 stateless resumes
./ssl.sh: line 540:  4316 Segmentation fault      (core dumped) ${PROFTOOL} ${BINDIR}/strsclnt -q -p ${PORT} -d ${P_R_CLIENTDIR} ${CLIENT_OPTIONS} -w nss $cparam $verbose ${HOSTADDR}
selfserv: HDX PR_Read returned error -5961:
TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961:
TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961:
TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961:
TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961:
TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961:
TCP connection reset by peer
strsclnt completed at Mon Mar  9 16:02:37 EDT 2015
ssl.sh: #1672: Stress TLS  RC4 128 with MD5 (compression) produced a returncode of 139, expected is 0.  - Core file is detected - FAILED
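For readers unfamiliar with the return code in the log above: shells conventionally report 128 + N when a child is killed by signal N, so 139 corresponds to signal 11, which matches the "Segmentation fault" line. A minimal sketch of that decoding in Python (the helper name is made up for illustration, and the signal numbering assumes Linux):

```python
import signal


def describe_returncode(rc: int) -> str:
    """Interpret a shell-style exit status (128 + N means killed by signal N)."""
    if rc > 128:
        signum = rc - 128
        try:
            # signal.Signals maps a number to its symbolic name on this platform.
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exit status {rc}"


if __name__ == "__main__":
    # 139 is the return code reported by ssl.sh in the log above.
    print(describe_returncode(139))
```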


Quick debugging yields the following backtraces (thanks kaie):
    Core was generated by `/tmp/tmp.rt0lMYh7x4/rpmroot/BUILD/nss-3.16.2.3/dist/Linux2.6_x86_glibc_PTH_OPT.'.
    Program terminated with signal 11, Segmentation fault.
    #0  CERT_DestroyCertificate (cert=0xffffffff) at stanpcertdb.c:791
    791             NSSCertificate *tmp = cert->nssCertificate;
    Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.162.el6.i686 nspr-4.10.6-1.el6_5.i686 nss-softokn-3.14.3-22.el6_6.i686 nss-softokn-freebl-3.14.3-22.el6_6.i686 nss-util-3.16.2.3-2.el6_6.i686 sqlite-3.6.20-1.el6.i686 zlib-1.2.3-29.el6.i686
    (gdb) bt
    #0  CERT_DestroyCertificate (cert=0xffffffff) at stanpcertdb.c:791
    #1  0x0044bb6a in ssl_ResetSecurityInfo (sec=0xb773d050, doMemset=1) at sslsecur.c:1026
    #2  0x0044d19d in SSL_ResetHandshake (s=0xb7744008, asServer=0) at sslsecur.c:240
    #3  0x0804d87d in do_connects (a=0xbff552b0, b=0x8a8e878, tid=3) at strsclnt.c:848
    #4  0x0804c437 in thread_wrapper (arg=0x8060130) at strsclnt.c:410
    #5  0x0099c4f8 in ?? () from /lib/libnspr4.so
    #6  0x0034cb69 in start_thread () from /lib/libpthread.so.0
    #7  0x007ffc7e in clone () from /lib/libc.so.6
    (gdb) print cert
    $1 = (CERTCertificate *) 0xffffffff
    (gdb) print *cert
    Cannot access memory at address 0xffffffff


    (gdb) bt
    #0  0x0066b3ed in pthread_mutex_lock () from /lib/libpthread.so.0
    #1  0x00f1d4a3 in PR_Lock (lock=0xffffffff) at ../../../nspr/pr/src/pthreads/ptsynch.c:177
    #2  0x003b30eb in SSL_ResetHandshake (s=0xb77a5008, asServer=0) at sslsecur.c:210
    #3  0x0804d87d in do_connects (a=0xbfbecb70, b=0x9e05878, tid=1) at strsclnt.c:848
    #4  0x0804c437 in thread_wrapper (arg=0x80600f8) at strsclnt.c:410
    #5  0x00f244f8 in _pt_root (arg=0x9e44a08) at ../../../nspr/pr/src/pthreads/ptthread.c:212
    #6  0x00669b69 in start_thread () from /lib/libpthread.so.0
    #7  0x005abc7e in clone () from /lib/libc.so.6

Expected results:
No failures in test suite

Additional info:

Comment 1 Alicja Kario 2015-03-11 11:53:03 UTC
The version that does pass testing on the Athlon (fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow up) is nss-3.16.1-14.el6.i686.
The version that passed testing on the same machine as in the above report is nss-3.15.1-15.el6.i686.
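For reference, the flag list quoted above can be checked mechanically for SSE2. A minimal sketch, assuming Python and the usual Linux /proc/cpuinfo layout; the helper names are made up for illustration and are not part of any NSS tooling:

```python
# CPU flags of the Athlon test machine, copied from the comment above.
ATHLON_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 "
    "mmx fxsr sse syscall mp mmxext 3dnowext 3dnow up"
)


def has_flag(flags: str, name: str) -> bool:
    """Return True if `name` appears as a whole word in a cpuinfo flags string."""
    return name in flags.split()


def local_cpu_flags(path: str = "/proc/cpuinfo") -> str:
    """Return the flags string of the first CPU listed (Linux only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return line.split(":", 1)[1].strip()
    return ""


if __name__ == "__main__":
    print("Athlon has sse: ", has_flag(ATHLON_FLAGS, "sse"))
    print("Athlon has sse2:", has_flag(ATHLON_FLAGS, "sse2"))
```

Matching on whole words matters here, because a plain substring test for "sse" would also match "sse2".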

Comment 2 Alicja Kario 2015-03-23 11:37:29 UTC
Additional test runs have confirmed that the issue is not limited to machines without SSE2; it's just easier to reproduce there. It was reproduced on a modern system with the AVX instruction set (so SSE4.2 and lower).

Comment 3 Elio Maldonado Batiz 2015-04-02 15:59:23 UTC
I have logged in to the test system without SSE2 that you reserved. When I run the TCMS Test Case I see the segfaults. They happen in several stress tests. I also ran the test via more conventional methods, (1) via rhpkg local and (2) via rhpkg mockbuild, and in both cases there are no segfaults; all tests pass. I also checked which shared libraries the tools involved are dynamically linking to, using ldd path_to_tool_used, and it shows they are using the system-installed nss libraries.

Comment 5 Bob Relyea 2015-05-20 23:21:09 UTC
Elio or Kai, can you get me a core dump of this issue?

From the stack traceback and the source code:

1) 0xffffffff is clearly an invalid pointer to a certificate. It looks like we are referencing freed memory.
2) The sec structure that is passed in is part of the socket (ss), which sort of indicates that the ssl socket structure is being released underneath us.
3) We are holding the relevant locks in SSL_ResetHandshake() (SSL_LOCK_READER, SSL_LOCK_WRITER, sslGetXmitBufLock).
4) Suppose, however, that we are trying to close the socket on a different thread at the same time we are calling SSL_ResetHandshake(). It looks like there could be a race with close: we get the socket in SSL_ResetHandshake(), but then switch to ssl_Close, which gets the SSL_LOCK_READER() and SSL_LOCK_WRITER(), calls ssl_DefClose, removes the sslSock from the NSPR socket, then calls ssl_FreeSocket, which gets a bunch more locks, clears the sslSock data, and then unlocks all the locks. At this point we can switch back to SSL_ResetHandshake(), which eventually gets the SSL_LOCK_READER() and SSL_LOCK_WRITER() and proceeds to clear the ssl_secureInfo structure. Somewhere around this point we switch back to ssl_FreeSocket(), which actually frees the data structure. When we switch back again, the ssl data structure has been trashed.

All nice and neat except for one big issue: it's forbidden for applications to call PR_Close on a different thread than the one an active SSL session on the socket is already on. strstest in fact only makes socket calls on the same thread. I noticed some other bugs in error paths with ssl_FreeSocket(), but those aren't our crash here ;(.

bob
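The interleaving hypothesized in item 4 of the comment above can be replayed as a deterministic toy model. This is only a sketch in Python, not NSS code: the class, the dictionary standing in for the NSPR file descriptor, and the poison value are all made up for illustration. It shows why a reference obtained before the close can later point at cleared data.

```python
# Toy replay of the hypothesized race. All names here are illustrative only.
POISON = 0xFFFFFFFF  # stand-in for the garbage pointer value seen in the backtraces


class ToySocket:
    """Stands in for the ssl socket structure; holds one 'certificate' field."""

    def __init__(self):
        self.peer_cert = "valid-cert"


def run_interleaving():
    fd_table = {"fd": ToySocket()}  # stands in for the NSPR socket owning the sslSock
    events = []

    # Thread A (reset handshake path): looks up the socket, then is preempted.
    seen_by_a = fd_table["fd"]
    events.append("A: obtained socket reference")

    # Thread B (close path): detaches the socket and clears its data.
    detached = fd_table.pop("fd")
    detached.peer_cert = POISON
    events.append("B: detached socket and cleared its data")

    # Thread A resumes and uses its stale reference.
    events.append(f"A: sees peer_cert = {seen_by_a.peer_cert:#x}")
    return seen_by_a.peer_cert, events


if __name__ == "__main__":
    value, events = run_interleaving()
    for line in events:
        print(line)
```

As the comment itself points out, this scenario requires a close on a second thread, which the test client is not expected to do, so the model illustrates the theory rather than the confirmed cause.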

Comment 6 Alicja Kario 2015-05-21 10:17:47 UTC
In other words, we still don't know what the cause of the crash is, and the most likely candidate (a race in the test suite itself) was ruled out. Is that correct?

Comment 8 Bob Relyea 2015-05-26 16:31:07 UTC
Right, I need a core dump to poke around with to get further.

bob

Comment 9 Bob Relyea 2015-05-27 17:43:11 UTC
re comment 8

Comment 10 Alicja Kario 2015-05-28 11:35:12 UTC
It happens deep in the test suite, so it's rather hard for me to get one. I know that Kai and Elio were able to produce it.

Comment 11 Elio Maldonado Batiz 2015-05-28 13:35:54 UTC
For some reason the nss-3.16.2.3-3.el6_6.i686 build is not to be found in brew. I examined the spec file and the git log, and I can recreate that build. I had to create a temporary private branch 'private-bug1200772-segfault'; otherwise I wasn't allowed to build (or even create the srpm), so it may take some time to reproduce this.

Comment 12 Kai Engert (:kaie) (inactive account) 2015-06-15 15:19:06 UTC
This bug continues to be a mystery.

We saw the same crash on one of the TPS machines.

On that machine with hostname i386-6s-m1.ss.eng.bos.redhat.com, the issue can be reproduced IF, and ONLY IF rpmbuild-and-testsuite is executed as part of TPS (which runs as root).

On machine i386-6s-m1.ss.eng.bos.redhat.com, if rpmbuild-and-testsuite is executed outside of control of TPS, and using a regular user account, it works.

We tested using a different i386 machine, which has been set up to use TPS, too. On that other machine, running rpmbuild-and-testsuite as part of TPS - WORKS!

We have tried multiple other i386 machines, without using TPS, and all of them have succeeded.

It's not as simple as "TPS is bad".
It's not as simple as "certain hardware is bad".
It's not as simple as "running as root is bad".

It's more like:
 "The mix of TPS and certain hardware and the NSS test suite is bad."

While we investigated the crashes on the failing system, we were able to obtain multiple core files.

Bob had investigated the core files.

The best theory that Bob came up with seemed like a somewhat unrealistic scenario.

His theory was: Maybe the memory allocation code has a bug, let's initialize additional variables to NULL, just to be certain.

We performed a test with additional memory initialization. That didn't fix the crash. We still crash. We simply crash at a different location.


Given the circumstances required to trigger this crash, I expect that it would take a lot of experiments, code changes, logging, and assertions to isolate the cause.

If you want us to continue to investigate this bug, you will need to reserve the bad system for us exclusively.

Unfortunately, the mentioned machine currently is being used as a central machine for general QA testing. So it cannot be reserved to us easily.

(Reserving it for us is necessary, because the bug cannot be reproduced using an ordinary user account, only with root permissions. And the risk of disturbing other QA processes on that machine is too high.)


Please let me know if you can reserve the system exclusively for our investigation.


In the meantime, I recommend moving on despite this failure.

Comment 13 Kai Engert (:kaie) (inactive account) 2015-06-15 17:56:21 UTC
(In reply to Kai Engert (:kaie) from comment #12)
> On that machine with hostname i386-6s-m1.ss.eng.bos.redhat.com, the issue
> can be reproduced IF, and ONLY IF rpmbuild-and-testsuite is executed as part
> of TPS (which runs as root).
> 
> On machine i386-6s-m1.ss.eng.bos.redhat.com, if rpmbuild-and-testsuite is
> executed outside of control of TPS, and using a regular user account, it
> works.

Another test has shown:
It even works fine on that machine when running as root.

We crash only when running as root under TPS on that specific machine.
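The observations from comments 12 and 13 can be tabulated to check which combination of factors is consistent with the crashes. A minimal sketch in Python; the field and function names are made up for illustration, and the third row assumes the other TPS machine also ran as root, since TPS is said to run as root:

```python
from typing import NamedTuple, Sequence


class Run(NamedTuple):
    bad_machine: bool  # i386-6s-m1.ss.eng.bos.redhat.com
    under_tps: bool
    as_root: bool
    crashed: bool


RUNS = [
    Run(bad_machine=True, under_tps=True, as_root=True, crashed=True),      # comment 12
    Run(bad_machine=True, under_tps=False, as_root=False, crashed=False),   # comment 12
    Run(bad_machine=False, under_tps=True, as_root=True, crashed=False),    # comment 12, root assumed
    Run(bad_machine=True, under_tps=False, as_root=True, crashed=False),    # comment 13
]


def explains(factors: Sequence[str]) -> bool:
    """True if 'all listed factors present' matches the outcome of every run."""
    return all(
        all(getattr(run, factor) for factor in factors) == run.crashed
        for run in RUNS
    )


if __name__ == "__main__":
    for factors in (["bad_machine"], ["under_tps"], ["as_root"],
                    ["bad_machine", "under_tps"]):
        print(factors, "->", explains(factors))
```

No single factor matches all the reported runs; only the combination of the specific machine with TPS does, which agrees with the summary given in comment 12.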

Comment 15 Kai Engert (:kaie) (inactive account) 2015-10-09 17:11:02 UTC
I have spent a lot of time analyzing this bug.

I wasn't able to reproduce it in a way that allowed me to find the cause.

The issue is erratic.

The issue only appeared with one specific installation, and couldn't be reproduced with similar environments.

I don't know what to do about this bug.