1146967 – Latest firmware update causes SIGILL on xbeginq instruction on Haswell processors

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	21
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Carlos O'Donell
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	https://fedoraproject.org/wiki/Common...
Duplicates (3):	1146749 1147062 1147118
Depends On:
Blocks:

Reported:	2014-09-26 13:02 UTC by Amit Shah
Modified:	2016-11-24 12:21 UTC
CC List:	16 users
Fixed In Version:	glibc-2.20-5.fc21
Clone Of:
Environment:
Last Closed:	2014-09-30 03:47:18 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
gdb analysis of core dump (9.71 KB, text/plain) 2014-09-26 13:02 UTC, Amit Shah
cpuinfo (1.13 KB, text/plain) 2014-09-26 13:03 UTC, Amit Shah

Description Amit Shah 2014-09-26 13:02:45 UTC

Created attachment 941562 [details]
gdb analysis of core dump

Description of problem:

The 2.1-8 update to microcode_ctl incorporates the latest Intel firmware.  This update causes the RTM and HLE instructions in glibc's lock elision code to SIGILL, crashing all apps on my Haswell system.  The first crash happens in systemd-udevd, so I don't get much further in the boot sequence.

Siddhesh and I found this while debugging systemd-udevd core dumps, and we found xbeginq was the instruction getting SIGILL.

Checking for Intel errata, Siddhesh found a reference to https://lkml.org/lkml/2014/9/18/218 where they mention this behaviour in the firmware might be deliberate (to keep advertising hle/rtm instructions, but causing them to SIGILL), rather than a bug.

We might need fixes across the kernel and glibc for this, or quirks for this hardware and microcode update in glibc and the kernel.

Core dump info in attachments.

microcode_ctl-2.1-8 causes the badness; 2.1-7 is fine (which has older firmware).
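
For anyone wanting to check what their CPU advertises, here is a minimal sketch that reads the "flags" and "microcode" fields from /proc/cpuinfo-style text. The function names and the SAMPLE text (including the microcode value) are made up for illustration; only the "hle" and "rtm" flag names come from the kernel. Note that with the problematic firmware the flags can still be present while xbegin faults, so this only shows what is advertised, not whether the instructions work.

```python
# Flag names used by Linux in /proc/cpuinfo for the two TSX interfaces.
TSX_FLAGS = {"hle", "rtm"}


def first_cpu_fields(cpuinfo_text):
    """Return the key/value fields of the first processor entry."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            # A blank line separates processor entries; stop after the first.
            if fields:
                break
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def advertised_tsx(cpuinfo_text):
    """Return the sorted TSX-related flags the first CPU advertises."""
    flags = first_cpu_fields(cpuinfo_text).get("flags", "").split()
    return sorted(TSX_FLAGS.intersection(flags))


# Hypothetical sample; real input would be the contents of /proc/cpuinfo.
SAMPLE = """\
processor\t: 0
model name\t: Example Haswell CPU
microcode\t: 0x1c
flags\t\t: fpu vme sse2 avx2 hle rtm

processor\t: 1
flags\t\t: fpu vme sse2
"""

print(advertised_tsx(SAMPLE))                  # ['hle', 'rtm']
print(first_cpu_fields(SAMPLE)["microcode"])   # 0x1c
```

On a real system the text would come from reading /proc/cpuinfo; comparing the output before and after the microcode_ctl update should show whether the advertised flags changed.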

Comment 1 Amit Shah 2014-09-26 13:03:21 UTC

Created attachment 941563 [details]
cpuinfo

Comment 2 Carlos O'Donell 2014-09-26 13:04:42 UTC

We are spinning an F20, F21, and Rawhide glibc with lock elision disabled. This will hopefully mean that nobody ends up with a broken system if they update glibc *and* the microcode_ctl package at the same time.

Comment 3 Carlos O'Donell 2014-09-26 13:16:55 UTC

Scratch builds for F20, F21 and Rawhide in progress.

Comment 4 Carlos O'Donell 2014-09-26 13:42:10 UTC

Rawhide: http://koji.fedoraproject.org/koji/taskinfo?taskID=7702857
F21: http://koji.fedoraproject.org/koji/taskinfo?taskID=7702869
F20: http://koji.fedoraproject.org/koji/taskinfo?taskID=7702870

Comment 5 Carlos O'Donell 2014-09-26 17:01:39 UTC

The rawhide build has some testing issues I'm sorting out right now.

Final build for f21:
http://koji.fedoraproject.org/koji/taskinfo?taskID=7703592

Final build for f20:
http://koji.fedoraproject.org/koji/taskinfo?taskID=7703676

Comment 6 Siddhesh Poyarekar 2014-09-26 18:05:48 UTC

Some related reading on this:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195
https://bugs.launchpad.net/intel/+bug/1370352
https://lkml.org/lkml/2014/9/18/218

Comment 7 Fedora Update System 2014-09-26 18:19:12 UTC

glibc-2.18-16.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/glibc-2.18-16.fc20

Comment 8 Fedora Update System 2014-09-26 18:19:20 UTC

glibc-2.20-4.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/glibc-2.20-4.fc21

Comment 9 Carlos O'Donell 2014-09-26 18:35:18 UTC

*** Bug 1147062 has been marked as a duplicate of this bug. ***

Comment 10 Hans de Goede 2014-09-26 18:53:22 UTC

I've just tested the F-21 update for this, and I'm afraid that the problem is still present there. I've also regenerated my initrd to make sure that that included the new glibc too, but that did not help.

Comment 11 Andy Lutomirski 2014-09-26 18:54:44 UTC

There's a long discussion about a real fix here:

http://thread.gmane.org/gmane.linux.kernel/1790211

No great solution yet.

Comment 12 Carlos O'Donell 2014-09-27 03:59:32 UTC

(In reply to Hans de Goede from comment #10)
> I've just tested the F-21 update for this, and I'm afraid that the problem
> is still present there. I've also regenerated my initrd to make sure that
> that included the new glibc too, but that did not help.

Are you certain? This update absolutely removes elision, you shouldn't have any TSX usage going on after the update to glibc-2.20-4.fc21. Can you confirm the version of your installed glibc is correct? Can you track down if *all* your binaries fault or just some of them (statically compiled against libpthread)?

Comment 13 Carlos O'Donell 2014-09-27 04:44:53 UTC

(In reply to Carlos O'Donell from comment #12)
> (In reply to Hans de Goede from comment #10)
> > I've just tested the F-21 update for this, and I'm afraid that the problem
> > is still present there. I've also regenerated my initrd to make sure that
> > that included the new glibc too, but that did not help.
> 
> Are you certain? This update absolutely removes elision, you shouldn't have
> any TSX usage going on after the update to glibc-2.20-4.fc21. Can you
> confirm the version of your installed glibc is correct? Can you track down
> if *all* your binaries fault or just some of them (statically compiled
> against libpthread)?

OK, I think I found the problem. There is a code path in rwlock that is using TSX unconditionally. I'm going to fix that and push out another build.

Comment 14 Carlos O'Donell 2014-09-27 04:55:24 UTC

Hans,

Would you mind testing this scratch build?
http://koji.fedoraproject.org/koji/taskinfo?taskID=7707772

It should fully disable TSX usage in libpthread.so.0. I don't have easy access to a box where I can do this kind of testing, e.g. microcode updates.
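
As a rough way to see whether a library might still contain TSX instructions, the sketch below scans raw bytes for the xbegin, xabort, xend and xtest opcode patterns. This is only a heuristic under stated assumptions: the function name and the synthetic byte string are invented for illustration, and a raw byte match can land in data or inside another instruction's operands, so a proper disassembly (for example with objdump -d) is the reliable check.

```python
# Opcode byte patterns for the RTM instructions.
TSX_OPCODES = {
    "xbegin": bytes([0xC7, 0xF8]),        # followed by a rel16/rel32 offset
    "xabort": bytes([0xC6, 0xF8]),        # followed by an imm8
    "xend": bytes([0x0F, 0x01, 0xD5]),
    "xtest": bytes([0x0F, 0x01, 0xD6]),
}


def find_tsx_candidates(data):
    """Map each TSX mnemonic to the byte offsets where its pattern occurs."""
    hits = {}
    for name, pattern in TSX_OPCODES.items():
        offsets = []
        start = data.find(pattern)
        while start != -1:
            offsets.append(start)
            start = data.find(pattern, start + 1)
        if offsets:
            hits[name] = offsets
    return hits


# Synthetic bytes: nop, xbegin +6, nop, xend, ret.
blob = bytes([0x90, 0xC7, 0xF8, 0x06, 0x00, 0x00, 0x00,
              0x90, 0x0F, 0x01, 0xD5, 0xC3])

print(find_tsx_candidates(blob))   # {'xbegin': [1], 'xend': [8]}
```

To try it on a real file, pass the bytes read from the library in binary mode; an empty result suggests, but does not prove, that no TSX instructions remain.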

Comment 15 Hans de Goede 2014-09-27 09:08:46 UTC

(In reply to Carlos O'Donell from comment #12)
> (In reply to Hans de Goede from comment #10)
> > I've just tested the F-21 update for this, and I'm afraid that the problem
> > is still present there. I've also regenerated my initrd to make sure that
> > that included the new glibc too, but that did not help.
> 
> Are you certain? This update absolutely removes elision, you shouldn't have
> any TSX usage going on after the update to glibc-2.20-4.fc21. Can you
> confirm the version of your installed glibc is correct?

Yes I double checked I had the correct version (and regenerated my initrd and rebooted) before putting in the comment that the update does not fix things.

(In reply to Carlos O'Donell from comment #14)
> Hans,
> 
> Would you mind testing this scratch build?
> http://koji.fedoraproject.org/koji/taskinfo?taskID=7707772
> 
> It should fully disable TSX usage in libpthread.so.0. I don't have easy
> access to a box that I can do this kind of testing on e.g. microcode updates
> etc.

In the meantime I've installed this update:

https://admin.fedoraproject.org/updates/dracut-038-29.git20140903.fc21,kernel-3.16.3-302.fc21

That fixes things too, with less of a big-hammer approach.

I can still reproduce the problem by booting an older kernel, though. After verifying that the older kernel still exhibits the problem, I installed your glibc scratch build, and I can confirm that with that glibc the problem is gone, even on the older kernel.

Comment 16 Fedora Update System 2014-09-27 09:42:15 UTC

Package glibc-2.18-16.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.18-16.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-11586/glibc-2.18-16.fc20
then log in and leave karma (feedback).

Comment 17 Carlos O'Donell 2014-09-27 16:50:47 UTC

(In reply to Hans de Goede from comment #15)
> Which fixes things in a less of a big hammer approach, and that fixes things
> too.
> 
> I can still reproduce the problem by booting an older kernel though. I've
> verified that booting an older kernel still exhibits the problem, then I've
> installed your glibc scratch build, and I can confirm that the problem is
> gone, even when using the older kernel, when using the glibc from the
> scratch build.

We are going to push the new glibc into F21 to fix this problem for anyone that doesn't want to upgrade their kernel.

I think the conservative approach of both a kernel fix and a runtime fix is best here, given that missing an update could cause your box to break if you install a new microcode_ctl.

Final F21 build here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=7709522

Comment 18 Zbigniew Jędrzejewski-Szmek 2014-09-27 23:10:41 UTC

*** Bug 1147118 has been marked as a duplicate of this bug. ***

Comment 19 Carlos O'Donell 2014-09-28 16:49:00 UTC

OK, final update for FC21 with a full fix was just pushed into Bodhi.

https://admin.fedoraproject.org/updates/FEDORA-2014-11673/glibc-2.20-5.fc21
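
To check whether an installed glibc is at or past the builds named in this bug (glibc-2.18-16.fc20 and glibc-2.20-5.fc21), something like the sketch below could be used on the package name. The helper names are hypothetical, and the comparison is a simplified numeric one, not RPM's real version-comparison algorithm; it only understands names of the form glibc-VERSION-RELEASE.fcNN with an optional architecture suffix.

```python
import re

# Builds named in this bug as carrying the fix, keyed by dist tag.
FIXED_BUILDS = {
    "fc20": ((2, 18), 16),   # glibc-2.18-16.fc20
    "fc21": ((2, 20), 5),    # glibc-2.20-5.fc21
}

# Simplified pattern: numeric version, numeric release, Fedora dist tag,
# optional architecture suffix such as ".x86_64".
NVR_RE = re.compile(r"^glibc-(\d+(?:\.\d+)*)-(\d+)\.(fc\d+)(?:\.\w+)?$")


def parse_glibc_nvr(nvr):
    """Split a glibc package name into (version tuple, release, dist tag)."""
    match = NVR_RE.match(nvr)
    if match is None:
        raise ValueError(f"unrecognised glibc package name: {nvr!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, int(match.group(2)), match.group(3)


def has_elision_fix(nvr):
    """True/False for known dist tags, None when this bug names no fixed build."""
    version, release, dist = parse_glibc_nvr(nvr)
    if dist not in FIXED_BUILDS:
        return None
    return (version, release) >= FIXED_BUILDS[dist]


print(has_elision_fix("glibc-2.20-4.fc21"))          # False
print(has_elision_fix("glibc-2.20-5.fc21.x86_64"))   # True
print(has_elision_fix("glibc-2.18-16.fc20"))         # True
```

The first case reflects what happened in this thread: 2.20-4 was the initial attempt, but the rwlock path still used TSX until 2.20-5.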

Comment 20 Fedora Update System 2014-09-29 04:04:45 UTC

glibc-2.18-16.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 21 Amit Shah 2014-09-29 06:01:08 UTC

Given the kernel is going to inject the microcode on start, and no execution of 'cpuid' is going to see those tmem instructions being exposed by the cpu, the glibc fix may not be needed at all?  I think we can revert the glibc fix given the kernel has been fixed appropriately already.  Relevant kernel bug is bug 1083716.

Comment 22 Siddhesh Poyarekar 2014-09-29 08:48:46 UTC

(In reply to Amit Shah from comment #21)
> Given the kernel is going to inject the microcode on start, and no execution
> of 'cpuid' is going to see those tmem instructions being exposed by the cpu,
> the glibc fix may not be needed at all?  I think we can revert the glibc fix
> given the kernel has been fixed appropriately already.  Relevant kernel bug
> is bug 1083716.

The glibc fix is still useful because it fixes TSX code that sneaked out of the --enable-lock-elision configuration, which could potentially cause problems later.  Also, there is no point keeping the bits enabled because if TSX does get enabled in future microcode and it behaves differently (breaking glibc expectations), we'll have to rush to patch older systems.

We do need to re-enable elision for s390* in f21 and rawhide:

https://lists.fedoraproject.org/pipermail/glibc/2014-September/000062.html

Comment 23 Bojan Smojver 2014-09-29 22:56:03 UTC

That update to F-20, -16, appears broken. See my comments in the update.

I have no idea why xrdp would just hang like that, but going back to either -11 or -14 immediately fixes the problem.

I'm running this on an i686 VM, which is most likely running on VMWare ESX or something like that (I don't control this bit).

Comment 24 Carlos O'Donell 2014-09-29 22:59:16 UTC

(In reply to Bojan Smojver from comment #23)
> That update to F-20, -16, appears broken. See my comments in the update.
> 
> I have no idea why xrdp would just hang like that, but going back to either
> -11 or -14 immediately fixes the problem.
> 
> I'm running this on an i686 VM, which is most likely running on VMWare ESX
> or something like that (I don't control this bit).

Are you able to ssh into the box remotely, attach a debugger, and do a backtrace to see where it's hung?

Comment 25 Bojan Smojver 2014-09-29 23:22:47 UTC

(In reply to Carlos O'Donell from comment #24)
 
> Are you able to remote ssh into the box, attach a debugger, and do a
> backtrace to see where it's hung?

Yeah, ssh works. Xrdp runs a couple of processes; I can attach to both before I log in and see. Maybe something that is forked off crashes or something like that. No idea at this point - the logs have nothing useful.

A quick strace of the remaining xrdp processes just has one of them sitting in select.

Comment 26 Bojan Smojver 2014-09-30 03:10:59 UTC

(In reply to Bojan Smojver from comment #25)
 
> Yeah, ssh works. Xrdp runs a couple of processes, I can attach to both
> before I login and see. Maybe something that is forked off crashes or
> something like that. No idea at this point - the logs have nothing useful.
> 
> A quick strace of the remaining xrdp processes just has one of them sitting
> in select.

Wow - did I make a fool of myself or what? Xrdp works now with -16.

Yesterday when I upgraded, it hung on login. I rebooted the VM. Hung again. I reverted to -11, rebooted. Worked. Installed -14, rebooted, worked.

No idea...

Anyhow, it must have been some other condition somewhere that coincided with the glibc upgrade or something.

Comment 27 Carlos O'Donell 2014-09-30 03:47:18 UTC

(In reply to Bojan Smojver from comment #26)
> Wow - did I make a fool of myself or what? Xrdp works now with -16.

Bojan, You are not a fool. I am incredibly appreciative of people like you who are willing to step up and say something is broken and help out. I have infinite patience for that kind of dedication. Thank you for raising the issue. I'm glad it turned out to be nothing, but it might not have been.

Comment 28 Fedora Update System 2014-10-03 04:05:44 UTC

glibc-2.20-5.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 29 Boris Ranto 2014-10-31 12:22:08 UTC

We are seeing something similar to this in rados binary [1]. On shutdown, the program calls rados_shutdown() which calls the appropriate destructors. In particular, it calls ~RWLock() which issues pthread_rwlock_unlock(). This causes the program to receive SIGILL signal. Debugging with gdb, it seems that the instruction that causes this is xend [2] which is an Intel TSX instruction.

I am no expert in this matter but it seems that this issue is not fully resolved, yet. I've looked at the patches and xbegin seems to be explicitly disabled there. Maybe, we need to explicitly disable xend in the code as well?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1144794 ; in short just run 'rados df' to reproduce (you need to have ceph-common installed)

[2] layout asm in gdb shows this as the crashing line:

>│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19>        xend   |

Comment 30 Carlos O'Donell 2014-10-31 12:59:43 UTC

(In reply to Boris Ranto from comment #29)
> We are seeing something similar to this in rados binary [1]. On shutdown,
> the program calls rados_shutdown() which calls the appropriate destructors.
> In particular, it calls ~RWLock() which issues pthread_rwlock_unlock(). This
> causes the program to receive SIGILL signal. Debugging with gdb, it seems
> that the instruction that causes this is xend [2] which is an Intel TSX
> instruction.
> 
> I am no expert in this matter but it seems that this issue is not fully
> resolved, yet. I've looked at the patches and xbegin seems to be explicitly
> disabled there. Maybe, we need to explicitly disable xend in the code as
> well?
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1144794 ; in short just run
> 'rados df' to reproduce (you need to have ceph-common installed)
> 
> [2] layout asm in gdb shows this as the crashing line:
> 
> >│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19>        xend   |

Please open a new bug for this and we can triage there.

The 2.20-5 version should have fixed this.

Comment 31 Robert Hancock 2015-05-30 00:10:43 UTC

*** Bug 1146749 has been marked as a duplicate of this bug. ***
