Bug 768300

Summary: [abrt] bash-4.2.20-1.fc16: _dl_x86_64_restore_sse: Process /bin/bash was killed by signal 4 (SIGILL)
Product: [Fedora] Fedora Reporter: Bas <basvansummeren>
Component: glibcAssignee: Jeff Law <law>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 16CC: aoliva, fweimer, jakub, law, maxamillion, mishu, pmachata, rrakus, schwab
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: abrt_hash:7ff8871fa03c9becb11540a50e91308f60486937
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-03-05 19:28:23 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
File: coredump
none
File: smolt_data
none
File: var_log_messages
none
File: maps
none
File: backtrace none

Description Bas 2011-12-16 09:42:57 UTC
libreport version: 2.0.7
abrt_version:   2.0.6
backtrace_rating: 4
cmdline:        sh -c 'cat Bestand|./sortunique1>/dev/null'
crash_function: _dl_x86_64_restore_sse
executable:     /bin/bash
kernel:         3.1.5-2.fc16.x86_64
pid:            7693
pwd:            /home/bas/NetBeansProjects/SortUnique3c
reason:         Process /bin/bash was killed by signal 4 (SIGILL)
time:           Fri 16 Dec 2011 10:39:13 AM CET
uid:            1000
username:       bas

backtrace:      Text file, 4801 bytes
maps:           Text file, 4060 bytes
smolt_data:     Text file, 3111 bytes
var_log_messages: Text file, 9268 bytes

dso_list:
:/lib64/libdl-2.14.90.so glibc-2.14.90-21.x86_64 (Fedora Project) 1323683288
:/lib64/libpthread-2.14.90.so glibc-2.14.90-21.x86_64 (Fedora Project) 1323683288
:/lib64/libtinfo.so.5.9 ncurses-libs-5.9-2.20110716.fc16.x86_64 (Fedora Project) 1320287304
:/bin/bash bash-4.2.20-1.fc16.x86_64 (Fedora Project) 1323683296
:/lib64/libc-2.14.90.so glibc-2.14.90-21.x86_64 (Fedora Project) 1323683288
:/lib64/librt-2.14.90.so glibc-2.14.90-21.x86_64 (Fedora Project) 1323683288
:/lib64/libgcc_s-4.6.2-20111027.so.1 libgcc-4.6.2-1.fc16.x86_64 (Fedora Project) 1320287295
:/lib64/ld-2.14.90.so glibc-2.14.90-21.x86_64 (Fedora Project) 1323683288

environ:
:XDG_VTNR=1
:XDG_SESSION_ID=2
:HOSTNAME=localhost.localdomain
:GIO_LAUNCHED_DESKTOP_FILE_PID=1936
:IMSETTINGS_INTEGRATE_DESKTOP=yes
:GPG_AGENT_INFO=/tmp/keyring-4lwweL/gpg:0:1
:TERM=ansi
:SHELL=/bin/bash
:XDG_SESSION_COOKIE=d85850b1a5768353a02b6df500000011-1324026635.634194-1884796011
:HISTSIZE=1000
:GJS_DEBUG_OUTPUT=stderr
:GNOME_KEYRING_CONTROL=/tmp/keyring-4lwweL
:'GJS_DEBUG_TOPICS=JS ERROR;JS LOG'
:IMSETTINGS_MODULE=none
:USER=bas
:LD_LIBRARY_PATH=/home/bas/netbeans-7.0.1/dlight/tools/Linux-x86_64/bin:/home/bas/netbeans-7.0.1/dlight/tools/Linux-x86/bin
:SSH_AUTH_SOCK=/tmp/keyring-4lwweL/ssh
:SESSION_MANAGER=local/unix:@/tmp/.ICE-unix/1322,unix/unix:/tmp/.ICE-unix/1322
:USERNAME=bas
:GIO_LAUNCHED_DESKTOP_FILE=/home/bas/.local/share/applications/netbeans-7.0.1.desktop
:NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
:DESKTOP_SESSION=gnome
:PATH=/bin:/usr/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/bas/.local/bin:/home/bas/bin
:MAIL=/var/spool/mail/bas
:QT_IM_MODULE=xim
:PWD=/home/bas/NetBeansProjects/SortUnique3c
:XMODIFIERS=@im=none
:J2D_PIXMAPS=shared
:LANG=en_US.UTF-8
:GNOME_KEYRING_PID=1315
:KDE_IS_PRELINKED=1
:KDEDIRS=/usr
:XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
:GDMSESSION=gnome
:HISTCONTROL=ignoredups
:SHLVL=2
:XDG_SEAT=seat0
:HOME=/home/bas
:GNOME_DESKTOP_SESSION_ID=this-is-deprecated
:LOGNAME=bas
:CVS_RSH=ssh
:DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-Y1PbTt8HiC,guid=66138481505b1e44dcdd095100000044
:OMP_NUM_THREADS=2
:'LESSOPEN=||/usr/bin/lesspipe.sh %s'
:WINDOWPATH=1
:DISPLAY=:0
:XDG_RUNTIME_DIR=/run/user/bas
:XAUTHORITY=/var/run/gdm/auth-for-bas-WddQUZ/database
:LD_PRELOAD=prof_agent.so

event_log:
:2011-12-16-10:41:16> Querying server settings
:2011-12-16-10:41:17  Preparing an archive to upload
:2011-12-16-10:41:18  Uploading 38456 bytes
:2011-12-16-10:41:19  Upload successful
:2011-12-16-10:41:19  Retrace job started
:2011-12-16-10:41:30  Initializing virtual root
:2011-12-16-10:41:41  Initializing virtual root
:2011-12-16-10:41:52  Initializing virtual root
:2011-12-16-10:42:08  Initializing virtual root
:2011-12-16-10:42:19  Retrace job finished successfully
:2011-12-16-10:43:04> Smolt profile successfully saved

Comment 1 Bas 2011-12-16 09:43:07 UTC
Created attachment 547676 [details]
File: coredump

Comment 2 Bas 2011-12-16 09:43:09 UTC
Created attachment 547677 [details]
File: smolt_data

Comment 3 Bas 2011-12-16 09:43:11 UTC
Created attachment 547678 [details]
File: var_log_messages

Comment 4 Bas 2011-12-16 09:43:14 UTC
Created attachment 547679 [details]
File: maps

Comment 5 Bas 2011-12-16 09:43:16 UTC
Created attachment 547680 [details]
File: backtrace

Comment 6 Roman Rakus 2012-01-11 15:06:19 UTC
*** Bug 768306 has been marked as a duplicate of this bug. ***

Comment 7 Roman Rakus 2012-01-11 15:06:51 UTC
*** Bug 768343 has been marked as a duplicate of this bug. ***

Comment 8 Roman Rakus 2012-01-11 15:30:38 UTC
Seems like bug in glibc (or gcc?).

Comment 9 Petr Machata 2012-01-11 16:45:30 UTC
The offending instruction is in AVX block.  Could it be that glibc was built on AVX-enabled machine, but the reporter's system doesn't have that support?  (Note that the process was killed by SIGILL.)

Comment 10 Petr Machata 2012-01-11 16:51:51 UTC
Hm, except that in that case the check before the block should have fired.

Comment 11 Jeff Law 2012-01-25 18:40:09 UTC
*** Bug 768305 has been marked as a duplicate of this bug. ***

Comment 12 Petr Machata 2012-01-27 12:35:59 UTC
To original reporter: could you please show an output from /proc/cpuinfo?

Comment 13 Petr Machata 2012-01-27 14:18:59 UTC
(gdb) x/gx 0x3475222f78
0x3475222f78:	0x0000000000000000

0x3475222f78 is an address where have_avx is stored.  This is consistent in all these duplicate core dumps.

Zero implies that that field hasn't been initialized yet.  We are hitting this inside _dl_x86_64_restore_sse.  I went through the code of _dl_x86_64_save_sse, and the AVX detection seems correct (in line with Intel documentation anyway, I don't have a machine with that feature handy).  In any case, it should end in initializing that value to either +1 or -1.  Maybe we are just neglecting to call _*save_sse under some circumstances?  I'll have to look in.

Another curious thing that I noticed: %fs is set to 0.  I don't understand how this could be, given the code that follows (vmovdqa %fs:0x80,%ymm0 etc.)

And finally, there's some strange prof_agent.so mapped in the address space.

Comment 14 Jeff Law 2012-01-27 15:56:57 UTC
Any chance this BZ is related to this upstream bug report:

http://sourceware.org/bugzilla/show_bug.cgi?id=13583

What's interesting here is the fix for 13583 is it twiddles the how we check for YMM_Usable in init_cpu_features.

Comment 15 Petr Machata 2012-01-28 16:49:47 UTC
It's a related area, but I don't think it's directly related.  The code that I'm looking into is cited as an example of how to do the check correctly.  I'll have to look into the overall logic some more, to connect the dots between _save_sse and _restore_sse.  My current working theory is that _save_sse is somehow not called, because if it were called, the value of have_avx would have to be non-zero.

(As a side note, this doesn't seem to be a result of a race.  Even if several threads ended up in this region at once, they would just all consistently conclude whether the CPU supports AVX, and write that result in that memory location.  It's a waste of cycles in the worst case.)

Comment 16 Jeff Law 2012-01-30 05:14:21 UTC
Presumably all we have to work with are the core dumps and haven't been able to reproduce in-house.  I wonder if there's any nonobvious ways to get into that restore code.

Anyway, I'll keep monitoring and throw in any ideas that come up.

Comment 19 Jeff Law 2012-02-14 20:15:53 UTC
Is there any chance this is related to 720176?  I realize this bug is in the restore path and the other is in the save path, but it's an awful interesting coincidence...

Comment 20 Jeff Law 2012-02-17 06:19:34 UTC
Whoops.  I meant to ask if this was related to 752122 which shows a similar failure, but in the save path.

Jeff

Comment 21 Jeff Law 2012-02-17 06:46:48 UTC
How reproducible is this?  Like Petr I'm really confused as to how we can get into this hunk of code with the value of have_avx being zero.

If you can consistently trigger this failure, what happens if you run it under valgrind?  Can you get an strace of the process from start to finish?   Any chance you could give Petr or myself access to whatever machine where this is happening?

One of the things I find myself wondering is perhaps there's a wild pointer clobbering have_avx or something along those lines.  The only way to find that is likely to be able reproduce the failure.

Comment 22 Petr Machata 2012-02-17 09:20:13 UTC
FWIW, I tried running this under valgrind on the same theory (there's a test binary in one of the duplicates)  but it came out clean.  FWIW, from the data in Smolt it seems the machine does not support AVX, but that of course doesn't explain anything.

Comment 23 Jeff Law 2012-02-17 20:17:51 UTC
Petr,

I missed that tidbit.

So the illegal instruction fault makes sense...  Clearly we shouldn't be getting into this code.

The fact that have_avx is zero I think means one of two things.    One something might be clobbering have_avx.  valgrind isn't likely to catch that unless something has passed bogus args to memcpy or something similar.  Or two that the return address on the stack got twiddled causing a return insn to jump into this code rather than the proper return address.

Perhaps we could set a gdb hardware watchpoint on the have_avx location and see who changes it over time using whatever testcase you have in the duplicate..

Comment 24 Roman Rakus 2012-02-19 15:29:36 UTC
Or maybe the reporter used hibernation? Which could be similar to bug #551570.

Comment 25 Jeff Law 2012-02-23 06:44:15 UTC
I doubt this is related to 551570; the overall structure of the backtraces is radically different.

I think we also have to consider the possibility that rtld_must_xmm_save is getting clobbered.  If it were somehow set to zero after RTLD_ENABLE_FOREIGN_CALL, but before RTLD_FINALIZE_FOREIGN_CALL without a corresponding call to save_sse, then we might get this behaviour.

Comment 26 Alexandre Oliva 2012-03-01 14:51:21 UTC
I've been looking into this, and one think I found puzzling is that in all cases the bug hit, _dl_fixup has old_rtld_must_xmm_save == 1, even though AFAICT this should only ever happen when we're within another _dl_fixup call upthread, but there's no such additional _dl_fixup upthread.  Furthermore, all cases involve the dlight profiler, which might be messing things up somehow.  Maybe we should have a look at its source code?

Comment 27 Jeff Law 2012-03-01 16:58:01 UTC
I'd missed the fact that for old_rtld_must_xmm_save to have a nonzero value that we must already have a thread inside dl_fixup.

Hmm, I wonder if there's some kind of race which causes us to muck up those state variables and by the time we see the fault the other thread has left dl_fixup.

A comment prior to the definition of RTLD_ENABLE_FOREIGN_CALL hints that locking is not needed, but perhaps it's wrong.



/* NB: Don't use the xchg operation because that would imply a lock
   prefix which is expensive and unnecessary.  The cache line is also
   not contested at all.  */

Comment 28 Alexandre Oliva 2012-03-01 17:58:25 UTC
This is thread-local state, so whatever other threads do (short of scribbling over other threads' static TLS segments:-) is irrelevant.  It's the failing thread's own thread-local storage segment that seems to have been messed up.  If we had a reliable reproducer, setting a hardware watchpoint on that TLS location would help us figure out what's going wrong.  Do we have one?

Comment 29 Jeff Law 2012-03-01 18:19:14 UTC
No reliable reproducer.  Petr got access to a similar machine but was unable to trigger the failure there.

I'm not sure how long it's been since we've heard from the original submitter, but I'm starting to think we may have to close as INSUFFICIENT_DATA if we don't get some response from them soon.

Bas: What we really need is a semi-reliable way to reproduce this problem.  To date nobody has been able to identify a way to get into the failing routine without there having been memory corruption or similar problem.  Without a way to reproduce this particular failure it's unlikely we'll be able to resolve it.