Bug 1200772
| Summary: | Upstream test suite run on i386 reports segmentation faults | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Alicja Kario <hkario> |
| Component: | nss | Assignee: | Kai Engert (:kaie) (inactive account) <kengert> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | BaseOS QE Security Team <qe-baseos-security> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 6.6 | CC: | ebenes, emaldona, hkario, jrieden, kengert, ksrot, nkinder, pvrabec, rrelyea, tmraz |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | i386 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-08-31 14:27:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Version that does pass testing on Athlon (fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow up) is nss-3.16.1-14.el6.i686

Version that passed testing on the same machine as the above report is nss-3.15.1-15.el6.i686

Additional test runs have confirmed that the issue is not limited to machines without SSE2; it's just easier to reproduce there. It was reproduced on a modern system with the AVX instruction set (so SSE4.2 and lower).

I have logged in to the test system without SSE2 that you have reserved. When I run with the TCMS Test Case I see the segfaults. They happen in several stress tests. I also ran the test via more conventional methods, (1) via rhpkg local and (2) via rhpkg mockbuild, and in both cases there are no segfaults; all tests pass. I also checked the shared libraries the tools involved are dynamically linking to with ldd path_to_tool_used, and it shows me they are using the system-installed nss libraries.

Elio or Kai, can you get me a core dump of this issue?

From the stack traceback and the source code:
1) 0xffffffff is clearly an invalid pointer to a certificate. It looks like we are referencing freed memory.
2) The sec structure that is passed in is part of the socket (ss), which sort of indicates that the ssl socket structure is being released underneath us.
3) We are holding the relevant locks in SSL_ResetHandshake() (SSL_LOCK_READER, SSL_LOCK_WRITER, sslGetXmitBufLock).
4) If, however, we are trying to close the socket on a different thread at the same time we are calling SSL_ResetHandshake(), it looks like there could be a race with close, where we get the socket in SSL_ResetHandshake(), but then switch to ssl_Close, which gets the SSL_LOCK_READER() and SSL_LOCK_WRITER(), calls ssl_DefClose, removes the sslSock from the NSPR socket, then calls ssl_FreeSocket, which gets a bunch more locks, then clears up the sslSock data, then unlocks all the locks. At this point we can switch back to ResetHandshake, which gets the SSL_LOCK_READER() and SSL_LOCK_WRITER(), continues in SSL_ResetHandshake, and proceeds to clear the ssl_secureInfo structure. At this point somewhere we switch back to ssl_FreeSocket() and actually free up the data structure. When we switch back, the ssl data structure has been trashed.

All nice and neat except 1 big issue: it's forbidden for applications to call PR_Close on a different thread than the one an active SSL session on the socket is already on. strsclnt in fact only makes socket calls on the same thread. I noticed some other bugs in error paths with ssl_FreeSocket(), but those aren't our crash here ;(.

bob

In other words, we still don't know what the cause of the crash is, and the most likely candidate (a race in the test suite itself) was ruled out. Is that correct?

Right, I need a core dump to poke around with to get further.

bob

It happens deep in the test suite, so it's rather hard for me to get it. I know that Kai and Elio were able to produce it.

For some reason the nss-3.16.2.3-3.el6_6.i686 build is not to be found in brew. I examined the spec file and the git log, and I can recreate that build. I had to create a temporary private branch 'private-bug1200772-segfault', otherwise I wasn't allowed to build (not even create the srpm), so it may take some time to reproduce this.

This bug continues to be a mystery. We saw the same crash on one of the TPS machines. On that machine with hostname i386-6s-m1.ss.eng.bos.redhat.com, the issue can be reproduced IF, and ONLY IF rpmbuild-and-testsuite is executed as part of TPS (which runs as root). On machine i386-6s-m1.ss.eng.bos.redhat.com, if rpmbuild-and-testsuite is executed outside of control of TPS, and using a regular user account, it works. We tested using a different i386 machine, which has been set up to use TPS, too. On that other machine, running rpmbuild-and-testsuite as part of TPS - WORKS!
We have tried multiple other i386 machines, without using TPS, and all of them have succeeded. It's not as simple as "TPS is bad". It's not as simple as "certain hardware is bad". It's not as simple as "running as root is bad". It's more like: "The mix of TPS and certain hardware and the NSS test suite is bad."

While we investigated the crashes on the failing machine, we were able to obtain multiple core files. Bob investigated the core files. The best theory that Bob came up with seemed like a somewhat unrealistic scenario. His theory was: maybe the memory allocation code has a bug, so let's initialize additional variables to NULL, just to be certain. We performed a test with additional memory initialization. That didn't fix the crash. We still crash. We simply crash at a different location.

Given the circumstances required to trigger this crash, I expect that it would take a lot of experiments, code changes, logging, and assertions to isolate the cause of this crash. If you want us to continue to investigate this bug, you will need to reserve the bad system for us exclusively. Unfortunately, the mentioned machine is currently being used as a central machine for general QA testing, so it cannot be reserved for us easily. (Reserving it for us is necessary, because the bug cannot be reproduced using an ordinary user account, only with root permissions, and the risk of disturbing other QA processes on that machine is too high.)

Please let me know if you can reserve the system exclusively for our investigation. In the meantime, I recommend moving on despite this failure.

(In reply to Kai Engert (:kaie) from comment #12)
> On that machine with hostname i386-6s-m1.ss.eng.bos.redhat.com, the issue
> can be reproduced IF, and ONLY IF rpmbuild-and-testsuite is executed as part
> of TPS (which runs as root).
>
> On machine i386-6s-m1.ss.eng.bos.redhat.com, if rpmbuild-and-testsuite is
> executed outside of control of TPS, and using a regular user account, it
> works.

Another test has shown: it even works fine on that machine when running as root. Only when running as root under TPS on that specific machine do we crash.

I have spent a lot of time analyzing this bug. I wasn't able to reproduce it in a way that allowed me to find the cause. The issue is erratic. It only appeared with one specific installation, and couldn't be reproduced in similar environments. I don't know what to do about this bug. |
Description of problem:
Running upstream test suite on machines without SSE2 (like Pentium III, possibly also first gen Athlon) causes it to fail some tests with segmentation faults.

Version-Release number of selected component (if applicable):
nss-3.16.2.3-3.el6_6.i686
nspr-devel-4.10.6-1.el6_5.i686
nspr-4.10.6-1.el6_5.i686

How reproducible:
Always

Steps to Reproduce:
1. recompile NSS and run test suite on PIII system

Actual results:
(multiple tests fail, showing just one for brevity)
trying to kill selfserv_9118 with PID 4200 at Mon Mar 9 16:02:34 EDT 2015
kill -USR1 4200
selfserv: 999 cache hits; 1 cache misses, 0 cache not reusable
          999 stateless resumes, 0 ticket parse failures
selfserv: normal termination
selfserv_9118 -b -p 9118 2>/dev/null;
selfserv_9118 with PID 4200 killed at Mon Mar 9 16:02:35 EDT 2015
ssl.sh: Stress TLS RC4 128 with MD5 (compression) ----
selfserv_9118 starting at Mon Mar 9 16:02:35 EDT 2015
selfserv_9118 -D -p 9118 -d ../server -n localhost.localdomain -B -s \
         -e localhost.localdomain-ec -w nss -z -i ../tests_pid.13123 &
trying to connect to selfserv_9118 at Mon Mar 9 16:02:35 EDT 2015
tstclnt -p 9118 -h localhost.localdomain -q \
        -d ../client -v < /tmp/tmp.sUzTc1nCjq/rpmroot/BUILD/nss-3.16.2.3/nss/tests/ssl/sslreq.dat
tstclnt: connecting to localhost.localdomain:9118 (address=::1)
kill -0 4300 >/dev/null 2>/dev/null
selfserv_9118 with PID 4300 found at Mon Mar 9 16:02:35 EDT 2015
selfserv_9118 with PID 4300 started at Mon Mar 9 16:02:35 EDT 2015
strsclnt -q -p 9118 -d ../client -w nss -V ssl3: -c 1000 -C c -z \
          localhost.localdomain
strsclnt started at Mon Mar 9 16:02:35 EDT 2015
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
          0 stateless resumes
./ssl.sh: line 540: 4316 Segmentation fault (core dumped) ${PROFTOOL} ${BINDIR}/strsclnt -q -p ${PORT} -d ${P_R_CLIENTDIR} ${CLIENT_OPTIONS} -w nss $cparam $verbose ${HOSTADDR}
selfserv: HDX PR_Read returned error -5961: TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961: TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961: TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961: TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961: TCP connection reset by peer
selfserv: HDX PR_Read returned error -5961: TCP connection reset by peer
strsclnt completed at Mon Mar 9 16:02:37 EDT 2015
ssl.sh: #1672: Stress TLS RC4 128 with MD5 (compression) produced a returncode of 139, expected is 0. - Core file is detected - FAILED

Quick debugging reports following back traces (thanks kaie):

Core was generated by `/tmp/tmp.rt0lMYh7x4/rpmroot/BUILD/nss-3.16.2.3/dist/Linux2.6_x86_glibc_PTH_OPT.'.
Program terminated with signal 11, Segmentation fault.
#0 CERT_DestroyCertificate (cert=0xffffffff) at stanpcertdb.c:791
791 NSSCertificate *tmp = cert->nssCertificate;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.162.el6.i686 nspr-4.10.6-1.el6_5.i686 nss-softokn-3.14.3-22.el6_6.i686 nss-softokn-freebl-3.14.3-22.el6_6.i686 nss-util-3.16.2.3-2.el6_6.i686 sqlite-3.6.20-1.el6.i686 zlib-1.2.3-29.el6.i686
(gdb) bt
#0 CERT_DestroyCertificate (cert=0xffffffff) at stanpcertdb.c:791
#1 0x0044bb6a in ssl_ResetSecurityInfo (sec=0xb773d050, doMemset=1) at sslsecur.c:1026
#2 0x0044d19d in SSL_ResetHandshake (s=0xb7744008, asServer=0) at sslsecur.c:240
#3 0x0804d87d in do_connects (a=0xbff552b0, b=0x8a8e878, tid=3) at strsclnt.c:848
#4 0x0804c437 in thread_wrapper (arg=0x8060130) at strsclnt.c:410
#5 0x0099c4f8 in ?? () from /lib/libnspr4.so
#6 0x0034cb69 in start_thread () from /lib/libpthread.so.0
#7 0x007ffc7e in clone () from /lib/libc.so.6
(gdb) print cert
$1 = (CERTCertificate *) 0xffffffff
(gdb) print *cert
Cannot access memory at address 0xffffffff

(gdb) bt
#0 0x0066b3ed in pthread_mutex_lock () from /lib/libpthread.so.0
#1 0x00f1d4a3 in PR_Lock (lock=0xffffffff) at ../../../nspr/pr/src/pthreads/ptsynch.c:177
#2 0x003b30eb in SSL_ResetHandshake (s=0xb77a5008, asServer=0) at sslsecur.c:210
#3 0x0804d87d in do_connects (a=0xbfbecb70, b=0x9e05878, tid=1) at strsclnt.c:848
#4 0x0804c437 in thread_wrapper (arg=0x80600f8) at strsclnt.c:410
#5 0x00f244f8 in _pt_root (arg=0x9e44a08) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#6 0x00669b69 in start_thread () from /lib/libpthread.so.0
#7 0x005abc7e in clone () from /lib/libc.so.6

Expected results:
No failures in test suite

Additional info:
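Both backtraces in this report show a frame that received 0xffffffff as a pointer argument (cert in CERT_DestroyCertificate, lock in PR_Lock), each reached from SSL_ResetHandshake. When comparing many core files, it may help to pull such frames out mechanically. The following is only a rough sketch in Python, not part of NSS or its test suite: the helper name suspicious_args and the two regular expressions are made up for illustration, and the sample frames are copied from the traces above.

```python
import re

# Hypothetical helper, not part of NSS or gdb. It assumes frames look like
# "#N [0xADDR in] function (name=value, ...) ..." as in the traces above.
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-f]+ in )?(\w+) \(([^)]*)\)")
ARG_RE = re.compile(r"(\w+)=(0x[0-9a-f]+)")


def suspicious_args(backtrace, bad_values=(0xFFFFFFFF,)):
    """Return (frame number, function, argument name) for every
    hexadecimal argument whose value is listed in bad_values."""
    hits = []
    for line in backtrace.splitlines():
        match = FRAME_RE.search(line)
        if not match:
            continue  # not a frame line (source line, gdb prompt, ...)
        frame_no, func, args = int(match.group(1)), match.group(2), match.group(3)
        for name, value in ARG_RE.findall(args):
            if int(value, 16) in bad_values:
                hits.append((frame_no, func, name))
    return hits


# Frames copied from the two backtraces in this report.
TRACE_1 = """\
#0 CERT_DestroyCertificate (cert=0xffffffff) at stanpcertdb.c:791
#1 0x0044bb6a in ssl_ResetSecurityInfo (sec=0xb773d050, doMemset=1) at sslsecur.c:1026
#2 0x0044d19d in SSL_ResetHandshake (s=0xb7744008, asServer=0) at sslsecur.c:240
"""

TRACE_2 = """\
#0 0x0066b3ed in pthread_mutex_lock () from /lib/libpthread.so.0
#1 0x00f1d4a3 in PR_Lock (lock=0xffffffff) at ../../../nspr/pr/src/pthreads/ptsynch.c:177
#2 0x003b30eb in SSL_ResetHandshake (s=0xb77a5008, asServer=0) at sslsecur.c:210
"""

print(suspicious_args(TRACE_1))
print(suspicious_args(TRACE_2))
```

Run on the two traces, this flags cert in frame #0 of the first and lock in frame #1 of the second. It only spots the all-ones pattern; it says nothing about why the field was overwritten.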