801650 – Fedora PV guest 17 fails to boot under Xen (with SandyBridge or newer hardware that do xsave)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Jeff Law
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	AcceptedBlocker
Duplicates (3):	814376 820495 825586
Depends On:
Blocks:	F17Blocker, F17FinalBlocker

Reported:	2012-03-09 05:04 UTC by Major Hayden 🤠
Modified:	2016-11-24 16:12 UTC
CC List:	21 users
Fixed In Version:	glibc-2.15-37.fc17
Clone Of:
Environment:
Last Closed:	2012-05-15 05:25:29 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Fix for SandyBridge (497 bytes, patch) 2012-05-10 17:46 UTC, Konrad Rzeszutek Wilk

Description Major Hayden 🤠 2012-03-09 05:04:55 UTC

Description of problem:
Fedora 17 cannot complete the boot process within a Xen virtual machine.  A kernel panic occurs.

Version-Release number of selected component (if applicable):
Fedora 17 Alpha
kernel-3.3.0-0.rc3.git7.2.fc17.x86_64

How reproducible:
Boot any of the Fedora 17 install ISOs inside a Xen virtual machine.

Steps to Reproduce:
1. Install a server with a Xen hypervisor
2. Download a Fedora 17 ISO
3. Attempt to boot the ISO to install Fedora 17
  
Actual results:
Kernel panic as shown here -> http://pastie.org/pastes/3553286/text

Expected results:
Boot to anaconda.

Additional info:
Fedora 17 boots just fine on this server on bare metal.

Comment 1 Josh Boyer 2012-03-09 14:14:12 UTC

The warnings are mostly just warnings.  They will probably go away if you boot with a kernel that has the debugging options disabled.

The actual problem here is that libc crashes due to an invalid opcode:

[    4.288694] init[1] trap invalid opcode ip:7fcd0eff8145 sp:7fff7493d2f8 error:0 in libc-2.15.so[7fcd0eec1000+1ab000]

and that causes your init process to die, which causes the box to fail to boot.
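
The trap line carries enough information to locate the faulting instruction: subtracting the mapping base (the first number in the brackets) from `ip` gives the offset inside libc-2.15.so. A minimal sketch follows; the function name `fault_offset` and the regular expression are illustrative assumptions based only on the log format quoted in this report.

```python
import re

# Matches the user-space trap line in the format quoted in this report.
TRAP_RE = re.compile(
    r"trap (?P<trap>.+?) ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) "
    r"error:(?P<err>[0-9a-f]+) in (?P<obj>\S+?)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def fault_offset(line):
    """Return (object name, offset of the faulting ip inside its mapping)."""
    m = TRAP_RE.search(line)
    if m is None:
        raise ValueError("not a trap line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("ip lies outside the reported mapping")
    return m.group("obj"), offset

line = ("[    4.288694] init[1] trap invalid opcode ip:7fcd0eff8145 "
        "sp:7fff7493d2f8 error:0 in libc-2.15.so[7fcd0eec1000+1ab000]")
obj, off = fault_offset(line)
print(obj, hex(off))  # libc-2.15.so 0x137145
```

Assuming the mapping starts at file offset 0, the resulting offset can be looked up in a disassembly of the same libc build to see which instruction raised the trap.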

Comment 2 Major Hayden 🤠 2012-03-09 15:03:54 UTC

Thanks for the quick ticket response!

Should I adjust this ticket so that it's listed under the glibc component rather than kernel?  I haven't reported too many bugs in the past, so this is a bit new for me.

Comment 3 Justin M. Forbes 2012-03-09 18:41:54 UTC

Is this an HVM or PV guest install?

Comment 4 Major Hayden 🤠 2012-03-09 18:47:56 UTC

It was a PV install.  The HVM install failed because it couldn't find the CD drive using a label.

Comment 5 Justin M. Forbes 2012-03-09 19:17:51 UTC

This is a glibc issue resolved upstream.  Reference bug:

http://sourceware.org/bugzilla/show_bug.cgi?id=13583

Comment 6 Jeff Law 2012-03-09 22:22:24 UTC

Patches to fix upstream bug #13583 were pulled into rawhide & f17.  Builds spinning.

Comment 7 Major Hayden 🤠 2012-03-11 19:23:38 UTC

Have the changes made it into the development mirrors yet?

I noticed that the builds succeeded in Koji on Mar 9.  I just tested the installation today from the latest packages in the development repo for Fedora 17 and I'm getting the same problem.

Comment 8 Fedora Update System 2012-03-12 04:13:28 UTC

glibc-2.15-27.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/glibc-2.15-27.fc17

Comment 9 Jeff Law 2012-03-12 04:15:10 UTC

No, I just created the update tonight.  So unless you explicitly pulled the March 9 build out of koji you weren't getting the fixed version.

The update will go through the normal process of moving from the testing repo to the f17 repo once enough positive karma is acquired.

Comment 10 Major Hayden 🤠 2012-03-12 11:26:32 UTC

Thanks, Jeff.  I totally forgot about that process.  I'll blame the DST switch for that gaffe.

Comment 11 Fedora Update System 2012-03-15 15:35:24 UTC

glibc-2.15-28.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/glibc-2.15-28.fc17

Comment 12 Fedora Update System 2012-04-12 01:54:56 UTC

glibc-2.15-28.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 13 Major Hayden 🤠 2012-04-17 19:05:18 UTC

I just tried the F17 beta ISO as well as the F17 development repository's network install ISO and I'm still butting up against the same problem:

[    1.969176] init[1] trap invalid opcode ip:7fbca85cff15 sp:7fffe9fac868 error:0 in libc-2.15.so[7fbca8498000+1ac000]
[    1.969335] Kernel panic - not syncing: Attempted to kill init!

Comment 14 Major Hayden 🤠 2012-04-17 19:21:23 UTC

This seems Sandy Bridge specific. I can't replicate this on other XenServer boxes with other CPUs.

Comment 15 Major Hayden 🤠 2012-04-17 21:11:18 UTC

It looks like the patches from this commit made it into the glibc package:

http://sourceware.org/git/?p=glibc.git;a=commit;h=afc5ed09cbce5d6fd48b3a8c5ec427b31f996880

But these may have been missed:

http://sourceware.org/git/?p=glibc.git;a=commit;h=08cf777f9e7f6d826658a99c7d77a359f73a45bf

The second commit may be irrelevant for this bug but I noticed that Ulrich made the second commit within a couple of hours and put "Really fix AVX tests" in the commit message.  I'm still a little new to the innards of glibc.

Thanks in advance!

Comment 16 Jeff Law 2012-04-18 17:38:00 UTC

Both patches were installed into rawhide/f17 back in early March.  If you're still seeing this problem, then it must be something different than what Justin identified in c#5.

Comment 17 Jeff Law 2012-04-20 20:09:40 UTC

*** Bug 814376 has been marked as a duplicate of this bug. ***

Comment 18 Major Hayden 🤠 2012-04-20 20:28:09 UTC

Here's the latest boot failure with the latest boot.iso from Apr 20:

http://www.fpaste.org/5sYb/

If anyone has any ideas, I'll be glad to test some things out.

Comment 19 Jeff Law 2012-04-20 20:36:48 UTC

I have the theory that the first of the two patches from Uli fixed the problem and the second re-introduced it.  But to test that I'm going to have to get familiar enough with xen to start it :-)

Comment 20 Jeff Law 2012-04-20 21:40:50 UTC

Thanks for the fpaste update.  Both you and the submitter of 814376 are failing in the exact same place.

Comment 21 Major Hayden 🤠 2012-04-20 23:09:55 UTC

XenServer is a pretty quick way to get Xen installed on a box without much fuss.  I'll be glad to test something out if I can get a boot.iso made.

Might be able to give you access to a node if you want to mess with one.

Comment 22 Major Hayden 🤠 2012-04-20 23:11:48 UTC

The Arch folks may have found a working solution:

https://projects.archlinux.org/svntogit/packages.git/tree/trunk/glibc-2.15-avx.patch?h=packages/glibc
https://projects.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/glibc

Comment 23 Jeff Law 2012-04-21 03:06:03 UTC

Yes, I'm already aware of those patches.  In fact it was those patches which led me to believe the problem is the second referenced patch from Uli.

If you compare the Arch patches to the upstream changes it looks like they took the original patch from Uli, then just *part* of the second patch (the libm bits).  Thus they didn't pick up the reversion of the initial patch from Uli.

Comment 24 Jeff Law 2012-04-25 17:20:50 UTC

Have you tried passing xsave=1 to xen's grub configuration?  That seems to be common practice to get this resolved after applying the upstream glibc patches from Uli.

Comment 25 Major Hayden 🤠 2012-04-25 19:17:20 UTC

I gave that a try but didn't have any luck:

  http://pastebin.com/raw.php?i=cZ0CrVsg

Comment 26 Ferris Draugh 2012-04-28 04:14:30 UTC

I tried xsave=1 and it didn't work for me either.  Tried xsave=0 and noxsave for the heck of it with the same results.  

vmlinuz-3.3.2-8.fc17.x86_64
glibc-2.15-32.fc17.x86_64

I can make my test xen images and xva's available to the list if that would be helpful.

Comment 27 Ferris Draugh 2012-04-28 15:07:29 UTC

Just converted the image into an xva file (PV template) and executed under XCP 1.1 on a different box.  It boots clean under XCP with no problems noted so far.

The dom0 where it is not working has a Sandy Bridge CPU, if that bolsters comment 14.  The XCP box is not Sandy Bridge.

Comment 28 Mathieu Chouquet-Stringer 2012-04-28 18:00:46 UTC

I seem to be hitting a similar bug triggered by asterisk using glibc-2.15-32.fc17.x86_64.

Asterisk dies with the following backtrace (running on an Atom CPU) because glibc tries to use roundsd which is SSE4.1, the latter not being present on Atom...

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/asterisk -f -C /etc/asterisk/asterisk.conf'.
Program terminated with signal 4, Illegal instruction.
#0  floor (__x=<optimized out>) at /usr/include/bits/mathinline.h:218
218	  __asm ("roundsd $1, %1, %0" : "=x" (__res) : "xm" (__x));

Thread 29 (Thread 0x7f0666121700 (LWP 26507)):
#0  strchrnul () at ../sysdeps/x86_64/strchrnul.S:34
No locals.
#1  0x00007f06975b42df in __find_specmb (format=0x7f064d988e00 "Got 423 Interval too brief for service %s@%s, minimum is %d seconds\n") at printf-parse.h:99
No locals.
#2  _IO_vfprintf_internal (s=s@entry=0x7f066611ec40, format=format@entry=0x7f064d988e00 "Got 423 Interval too brief for service %s@%s, minimum is %d seconds\n", ap=ap@entry=0x7f066611ee08) at vfprintf.c:1277
        thousands_sep = 0x0
        grouping = 0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>
        done = 0
        f = <optimized out>
        lead_str_end = <optimized out>
        end_of_spec = <optimized out>
        work_buffer = '\000' <repeats 999 times>
        workstart = 0x0
        workend = <optimized out>
        ap_save = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7f066611f110, reg_save_area = 0x7f066611f020}}
        nspecs_done = 0
        save_errno = 0
        readonly_format = 0
        args_malloced = 0x0
        jump_table = "\001\000\000\004\000\016\000\006\000\000\a\002\000\003\t\000\005\b\b\b\b\b\b\b\b\b\000\000\000\000\000\000\000\032\000\031\000\023\023\023\000\035\000\000\f\000\000\000\000\000\000\025\000\000\000\000\022\000\r\000\000\000\000\000\000\032\000\024\017\023\023\023\n\017\034\000\v\030\027\021\026\f\000\025\033\020\000\000\022\000\r"
#3  0x00007f0697674ff0 in ___vsnprintf_chk (s=0x7f063c017ca8 "", maxlen=<optimized out>, flags=flags@entry=1, slen=slen@entry=18446744073709551615, format=format@entry=0x7f064d988e00 "Got 423 Interval too brief for service %s@%s, minimum is %d seconds\n", args=args@entry=0x7f066611ee08) at vsnprintf_chk.c:65
        sf = {f = {_sbf = {_f = {_flags = -72515583, _IO_read_ptr = 0x7f063c017ca8 "", _IO_read_end = 0x7f063c017ca8 "", _IO_read_base = 0x7f063c017ca8 "", _IO_write_base = 0x7f063c017ca8 "", _IO_write_ptr = 0x7f063c017ca8 "", _IO_write_end = 0x7f063c017da7 "", _IO_buf_base = 0x7f063c017ca8 "", _IO_buf_end = 0x7f063c017da7 "", _IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0, _markers = 0x0, _chain = 0x0, _fileno = 0, _flags2 = 4, _old_offset = 0, _cur_column = 0, _vtable_offset = 0 '\000', _shortbuf = "", _lock = 0x0, _offset = 0, _codecvt = 0x0, _wide_data = 0x0, _freeres_list = 0x0, _freeres_buf = 0x0, _freeres_size = 0, _mode = -1, _unused2 = '\000' <repeats 19 times>}, vtable = 0x7f069791d420}, _s = {_allocate_buffer = 0, _free_buffer = 0}}, overflow_buf = '\000' <repeats 56 times>"\300, \355\021f\006\177\000"}
        ret = <optimized out>

[...]

Dump of assembler code for function i_nint:
   0x00007f0658765c90 <+0>:	movss  (%rdi),%xmm0
   0x00007f0658765c94 <+4>:	ucomiss 0x14c9(%rip),%xmm0        # 0x7f0658767164
   0x00007f0658765c9b <+11>:	unpcklps %xmm0,%xmm0
   0x00007f0658765c9e <+14>:	jb     0x7f0658765cc0 <i_nint+48>
   0x00007f0658765ca0 <+16>:	cvtps2pd %xmm0,%xmm0
   0x00007f0658765ca3 <+19>:	addsd  0x14e5(%rip),%xmm0        # 0x7f0658767190
=> 0x00007f0658765cab <+27>:	roundsd $0x1,%xmm0,%xmm0
   0x00007f0658765cb1 <+33>:	cvttsd2si %xmm0,%eax
   0x00007f0658765cb5 <+37>:	retq
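
A quick way to confirm whether a given machine can execute `roundsd` is to look for the SSE4.1 feature flag, which Linux reports as `sse4_1` in /proc/cpuinfo. The sketch below is only an illustration; the helper names `cpu_flags` and `supports_roundsd` are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def supports_roundsd(cpuinfo_text):
    # roundsd belongs to SSE4.1, which Linux lists as "sse4_1".
    return "sse4_1" in cpu_flags(cpuinfo_text)
```

On a live system this could be called as `supports_roundsd(open("/proc/cpuinfo").read())`; a `False` result on the Atom box would match the SIGILL seen in the backtrace.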

Comment 29 Major Hayden 🤠 2012-04-30 14:51:27 UTC

I can pretty much guarantee* this is a Sandy Bridge problem only.  It's not reproducible on first-generation i5/i7 or Core 2 Duo CPUs.  Woodcrest/Kentfield Xeon CPUs are also unaffected.

* (For what that's worth)

Comment 30 Gordon McLellan 2012-05-03 03:27:11 UTC

I also have a Sandy Bridge system that fails to boot FC 17 beta under Xen (all packages latest as of May 2, 2012).

FC 17 installs fine except for setting /boot to ext4, which Xen doesn't like.  After yum install xen and a reboot, the Xen hypervisor appears to boot fine and then boots the FC 17 kernel (3.3.4-1?).  The screen goes black, and the next thing I see is the last line of the kernel panic before the system reboots.  I also tried booting Xen with 3.3.0, with the same problem.

The system is running an i7-3820 Sandy Bridge-E processor on an ASUS socket 2011 X79 board with 32 GB of RAM.  No overclocking or other funny business.  The motherboard BIOS is at its latest version.

Comment 31 Konrad Rzeszutek Wilk 2012-05-09 00:30:49 UTC

Trying to add this to the list of blockers. But more importantly - what happens if you put 'xsave=0' on the hypervisor line ? Does that fix the issue?

Comment 32 Ferris Draugh 2012-05-09 01:22:02 UTC

xsave=0|1 does not help
See my comment 26

Comment 33 Major Hayden 🤠 2012-05-09 01:35:13 UTC

I downloaded the latest x86_64 boot.iso from F17's development repo and tried using xsave=1.  The messages around the failure are slightly different this time around:

http://pastie.org/pastes/3882062/text

I still get the invalid opcode as I did before, but there's a call trace shortly after.  It didn't get that far before.  I'm trying to do all the debugging I know of, but if there are some extra kernel parameters I should be passing for extra debugging, please let me know.

Comment 34 Major Hayden 🤠 2012-05-09 01:38:07 UTC

Attempting to boot with xsave=0, xsave=1 and no xsave parameter gives the exact same results as shown in the paste from comment 33 on three different Sandy Bridge boxes.

Comment 35 Jeff Law 2012-05-09 03:20:32 UTC

I'm waiting on access to Ivybridge hardware so that I can directly investigate.    At this point I really need to be able to test things rather than throwing ideas/code over the wall.

Comment 36 Konrad Rzeszutek Wilk 2012-05-09 12:07:41 UTC

One thing that might be causing this is the patch "fix_xen_guest_on_old_EC2.patch", which was added to deal with Amazon's EC2:
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/UserProvidedkernels.html#PVGRUB_compatible_kernels

 "Kernels that disable the pv-ops XSAVE hypercall are known to work
   on all instance types, whereas those that enable this hypercall
   will fail to launch in some cases. Similarly, non-pv-ops kernels
   that do not adhere to the Xen 3.0.2 interface might fail to launch
   in some cases."

In other words, using the XSAVE capability on Amazon EC2 is not good. But newer hypervisors do expose it, and user-space tries to use it. With that patch in, part of enabling xsave is turned off while user-space still tries to use it. That is the theory, and it would be a bit of a Catch-22: take the patch out and it works with the newer hypervisor but not under Amazon EC2; leave the patch in and it works with older hypervisors but not with a newer hypervisor (and SandyBridge hardware).

<sigh>

Comment 37 Konrad Rzeszutek Wilk 2012-05-09 15:28:34 UTC

Scratch comment #36. This is all user-space related - where the CPUID flags are naked. And CPUID.OSXSAVE should be set to zero unless xsave=1 is set on the hypervisor line. Running a little test program showed me that AVX and XSAVE are set, but OSXSAVE was cleared.

So something is fishy with userspace executing the wrong instructions. It could be this as well: http://marc.info/?l=xen-devel&m=133612371602480
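
For readers unfamiliar with the flags mentioned here: XSAVE, OSXSAVE and AVX are bits 26, 27 and 28 of ECX as returned by CPUID leaf 1. The sketch below only decodes a raw ECX value; it is not the test program referred to in this comment, and the sample value is made up to reproduce the described state.

```python
# Bit positions in CPUID leaf 1, register ECX.
ECX_BITS = {"XSAVE": 26, "OSXSAVE": 27, "AVX": 28}

def decode_ecx(ecx):
    """Map each flag name to True/False for a raw CPUID.1:ECX value."""
    return {name: bool((ecx >> bit) & 1) for name, bit in ECX_BITS.items()}

# Made-up ECX value: XSAVE and AVX advertised, OSXSAVE clear.
sample = (1 << 26) | (1 << 28)
print(decode_ecx(sample))
# {'XSAVE': True, 'OSXSAVE': False, 'AVX': True}
```

With OSXSAVE clear, the AVX bit alone does not mean the guest may use AVX, because the operating system has not enabled the extended state the instructions rely on.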

Comment 38 Konrad Rzeszutek Wilk 2012-05-09 16:39:11 UTC

Ah, this patch might be the solution: https://launchpadlibrarian.net/104512210/fma4-depends-avx.diff

Comment 39 Adam Williamson 2012-05-10 15:53:02 UTC

Jeff, what's your timeline on getting hardware? This is a proposed blocker, and I'd say it's about 50/50 to be accepted. We're already way behind on RC1. We really need to be spinning it tomorrow. Will you have access to hardware by then, as things stand? If not, is there any way to expedite, or get help from someone who does have access? Thanks!



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 40 Jeff Law 2012-05-10 16:01:27 UTC

I've been waiting on suitable beaker hardware for at least a week.  Obviously with beaker there's no real indication when the required hardware will become available.  

Even once the hardware is available, I'm going to have to figure out enough Xen-fu to install a guest -- my attempts to do that in the past have been a total, miserable failure.

So right now I don't see any way to get this resolved in RC1.  I'm not going to keep throwing patches & ideas over the wall, I really need to be able to sit down with a failing system, debug the problem and verify a solution.

If you know someone that can get me access to an ivybridge box via ssh, then obviously that's a good first step.

Jeff

Comment 41 Konrad Rzeszutek Wilk 2012-05-10 16:27:27 UTC

*** Bug 820495 has been marked as a duplicate of this bug. ***

Comment 42 Konrad Rzeszutek Wilk 2012-05-10 16:42:13 UTC

It looks like the issue is also present with AMD hardware.

Comment 43 Mathieu Chouquet-Stringer 2012-05-10 16:47:20 UTC

And I get SIGILL in strchrnul on Atom (cf bug 820094).

Comment 44 Konrad Rzeszutek Wilk 2012-05-10 16:53:13 UTC

(In reply to comment #40)
> I've been waiting on suitable beaker hardware for at least a week.  Obviously
> with beaker there's no real indication when the required hardware will become
> available.  
> 
> Even once the hardware is available, I'm going to have to figure out enough
> Xen-fu to install a guest -- my attempts to do that in the past have been a
> total, miserable failure.

If you need help, just ping dariof or konrad on freenode.net and we can help.
In regards to the hardware, all mine is behind corporate firewall so can't help there - perhaps some of the other reports can?

Comment 45 Adam Williamson 2012-05-10 17:38:34 UTC

Jeff: thanks for the update. I'll see if we can do anything to speed things along.

Comment 46 Konrad Rzeszutek Wilk 2012-05-10 17:46:30 UTC

Created attachment 583644 [details]
Fix for SandyBridge

This fixes it on the SandyBridge box. To make sure I wasn't doing anything silly I did:

[boot the guest on an older Intel box]
yumdownloader --source glibc
rpm -i *.src.rpm
cd rpmbuild/SPECS
rpmbuild -ba glibc.spec
cd ../RPMS/x86_64
rpm -hiv --force *.rpm
dracut -f
poweroff
[boot the guest on the Sandybridge]
xm create -c test.xm
..
see it crash on
[    0.225467] init[1] trap invalid opcode ip:7f6fe709a895 sp:7ffff12a1848 error:0 in libc-2.15.so[7f6fe6f62000+1ac000]

[boot the guest on an older Intel box]
edit the glibc.spec (Also changed the release to 36 so that upgrade would work) and added the file.
built the new RPM
installed the new rpms: rpm -U ../RPMS/x86_64/*36*.rpm
dracut -f
poweroff

[boot the guest on the Sandybridge]
boots!

Comment 47 Laszlo Ersek 2012-05-10 17:47:04 UTC

Jeff,

I can also try to help (on the Beaker box), just ping me when it's provisioned.

Comment 48 Jeff Law 2012-05-10 20:51:50 UTC

I hate to do it, but I'm throwing a test build over the wall...  If y'all could give this a whirl (when the build is complete) it'd be greatly appreciated.  It's really just Konrad's change which in effect reverts one of Uli's bogus changes.

http://koji.fedoraproject.org/koji/taskinfo?taskID=4069046

Comment 49 Konrad Rzeszutek Wilk 2012-05-10 21:37:49 UTC

(In reply to comment #48)
> I hate to do it, but I'm throwing a test build over the wall...  If y'all could
> give this a whirl (when the build is complete) it'd be greatly appreciated. 
> It's really just Konrad's change which in effect reverts one of Uli's bogus
> changes.
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=4069046

Sure. Will give it a whirl in a couple of hours and try booting the guest on different hardware.

Comment 50 Ferris Draugh 2012-05-10 21:38:16 UTC

"Even once the hardware is available, I'm going to have to figure out enough
Xen-fu to install a guest -- my attempts to do that in the past have been a
total, miserable failure."

I can't help with the hardware but here are two domU images for Xen3/4 and XCP/XenServer respectively.  Built today from the beta repo.

Raw disk image for Xen 3/4:
http://stacklet.com/sites/stk/files/fedora.17.x86-64.20120510.img.tar.bz2

XVA Template for XCP/XenServer
http://stacklet.com/sites/stk/files/fedora.17.x86-64.20120510.xva.bz2

Comment 51 Konrad Rzeszutek Wilk 2012-05-10 21:51:32 UTC

The link you posted is for fc18. Is there a F17 version?

Comment 52 Konrad Rzeszutek Wilk 2012-05-10 22:05:17 UTC

So I installed the F18 RPMs on an F17 guest on an Intel(R) Core(TM) i5-2500 CPU @ 3.30GH on/DQ67SW, BIOS SWQ6710H.86A.0052.2011.0520.1
and it fixed the issue.

Later on tonight I will bootup some other CPUs to make sure there are no regressions when booting the guest on various families of CPUs.

Comment 53 Ferris Draugh 2012-05-11 00:11:45 UTC

I installed the new rpm's into the above image (comment 50) and it still won't boot for me:

Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz


[    0.657325] Freeing unused kernel memory: 1356k freed
[    0.661730] init[1] trap invalid opcode ip:7f1a3c711895 sp:7fff15c0ac48 error:0 in libc-2.15.so[7f1a3c5d9000+1ac000]
[    0.662664] init used greatest stack depth: 4368 bytes left
[    0.662694] Kernel panic - not syncing: Attempted to kill init!
[    0.701731] Pid: 1, comm: init Tainted: G        W    3.3.4-4.fc17.x86_64.debug #1
[    0.701743] Call Trace:
[    0.701753]  [<ffffffff8168e9f1>] panic+0xba/0x1cb
[    0.701764]  [<ffffffff8169aac0>] ? _raw_write_unlock_irq+0x30/0x50
[    0.701776]  [<ffffffff8106631a>] do_exit+0xa8a/0xa90
[    0.701786]  [<ffffffff8106666c>] do_group_exit+0x4c/0xc0
[    0.701796]  [<ffffffff8107a2c6>] get_signal_to_deliver+0x2b6/0x820
[    0.701808]  [<ffffffff810cca8d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
[    0.701820]  [<ffffffff81019288>] do_signal+0x68/0x7c0
[    0.701829]  [<ffffffff8169bf24>] ? do_trap+0x74/0x170
[    0.701838]  [<ffffffff8101a150>] ? do_invalid_op+0xb0/0xc0
[    0.701847]  [<ffffffff8169b83f>] ? retint_signal+0x11/0x92
[    0.701857]  [<ffffffff81019a90>] do_notify_resume+0x90/0xc0
[    0.701866]  [<ffffffff8169b87b>] retint_signal+0x4d/0x92

Comment 54 Konrad Rzeszutek Wilk 2012-05-11 01:14:04 UTC

(In reply to comment #53)
> I installed the new rpm's into the above image (comment 50) and it still won't
> boot for me:
> 

You need to run dracut -f after installing the RPMs for the initrd to pick up the new library.

Comment 55 Konrad Rzeszutek Wilk 2012-05-11 01:15:32 UTC

I tested the F17 PV guest with the new F18 RPMs  on these boxes:


AMD Phenom X2 on Tilapia prototype
AMD A8-3850 APU with Radeon(tm) HD Graphics on ASUS F1A75-M
Quad-Core AMD Opteron(tm) Processor 1352 Dell Inc. PowerEdge T105 /0RR825, BIOS 1.3.2 08/20/2008
AMD E-350 Processor on  AMD HDZS01       /AHD1S               , BIOS A93F1019 03/04/2011

Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz MSI MS-7680/H61M-P23 (MS-7680), BIOS V17.0 03/14/2011
Intel(R) Core(TM)2 Duo CPU     E6550  @ 2.33GHz on Gigabyte G31M-ES2L
Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz  Dell Inc. OptiPlex 780     
[Woodcrest] Genuine Intel(R) CPU 3.20GHz  Supermicro X7DB8/X7DB8, BIOS 2.1a 12/20/2008
[SandyBridge-EP] Genuine Intel(R) CPU  @ 2.30GHz on S2600CP
Intel(R) Core(TM) i5-2500 CPU @ 3.30GH on/DQ67SW, BIOS SWQ6710H.86A.0052.2011.0520.1802 05/20/2011


And it worked out great.

Comment 56 Ferris Draugh 2012-05-11 01:26:38 UTC

OK I recreated the initrd (thanks Konrad).  The domU boots now.  

So my changes to the image I uploaded earlier (new rpm's grabbed from koji):

yum install glibc-2.15-37.fc18.x86_64.rpm glibc-common-2.15-37.fc18.x86_64.rpm
dracut -f initramfs-3.3.4-4.fc17.x86_64.debug.img 3.3.4-4.fc17.x86_64.debug

Of course you will need to do this in a chroot if you are impacted by this bug.  I will start some more intensive testing tomorrow.

Comment 57 Jeff Law 2012-05-11 02:48:46 UTC

Konrad: re c#51.  glibc for f17 and f18 are the same right now (will be changing shortly).  I had somewhere to be this afternoon and didn't have time to spin both.  Obviously when we settle on a fix it'll go into f17.

Upstream glibc is discussing the issue right now; the direction they're taking is marginally better technically, but I'm leaning towards using your patch for f17. Simply because it's been through a smoke test.

Comment 58 Fedora Update System 2012-05-11 03:51:35 UTC

glibc-2.15-37.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/glibc-2.15-37.fc17

Comment 59 Adam Williamson 2012-05-11 05:04:03 UTC

On blocker status - this is a very close call, we do have a criterion for Xen but this doesn't break it on all systems, only IB...but then, that's an important platform. I'm a weak +1, I think.

Comment 60 Peter Teoh 2012-05-11 06:06:54 UTC

*** Bug 820495 has been marked as a duplicate of this bug. ***

Comment 62 Konrad Rzeszutek Wilk 2012-05-11 10:41:27 UTC

(In reply to comment #59)
> On blocker status - this is a very close call, we do have a criterion for Xen
> but this doesn't break it on all systems, only IB...but then, that's an
> important platform. I'm a weak +1, I think.

It is also SandyBridge (released last year) and the new SandyBridge EP platforms (server one) - at least those were dying on me before I used the new rpms.

Comment 63 Jeff Law 2012-05-11 14:21:07 UTC

OK, with Laszlo's help, I've got a box I can run additional tests on.  The patch we're using to unblock F17 is quite safe as it merely disables AVX when all the components aren't available.  It shouldn't have any effect on other systems.  There's a marginally better patch being developed upstream that I'll be testing as well.
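
The idea of disabling AVX "when all the components aren't available" can be sketched as the usual three-part check: the OS must have enabled XSAVE (OSXSAVE), the CPU must advertise AVX, and XCR0 must show that XMM and YMM state are both enabled. This is an illustration of that general rule only, not the attached patch or glibc's code; `avx_usable` and the `read_xcr0` callable are hypothetical stand-ins for the CPUID and XGETBV instructions.

```python
OSXSAVE_BIT = 1 << 27   # CPUID.1:ECX, OS has enabled XSAVE/XGETBV
AVX_BIT = 1 << 28       # CPUID.1:ECX, CPU implements AVX
XCR0_SSE_AVX = 0b110    # XCR0 bits 1 and 2: XMM and YMM state enabled

def avx_usable(ecx, read_xcr0):
    """ecx: CPUID.1:ECX value; read_xcr0: callable returning XCR0."""
    if not ecx & OSXSAVE_BIT:
        # XGETBV is not available without OSXSAVE, so do not read XCR0.
        return False
    if not ecx & AVX_BIT:
        return False
    return (read_xcr0() & XCR0_SSE_AVX) == XCR0_SSE_AVX
```

For the state reported in comment 37 (AVX set, OSXSAVE clear), the function returns False without ever reading XCR0, so the AVX code paths would simply not be selected.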

Comment 64 Jeff Law 2012-05-11 14:34:49 UTC

The updated RPMS seem to resolve the problem on my Sandybridge testbox.

Comment 65 Major Hayden 🤠 2012-05-11 14:39:45 UTC

Booting okay here with Sandy Bridge Xeons.

Comment 66 Konrad Rzeszutek Wilk 2012-05-11 15:41:19 UTC

(In reply to comment #58)
> glibc-2.15-37.fc17 has been submitted as an update for Fedora 17.
> https://admin.fedoraproject.org/updates/glibc-2.15-37.fc17

OK, left feedback and added +1 karma

Comment 67 Jeff Law 2012-05-11 16:30:51 UTC

Further testing managed to trip the strcasecmp avx code. I would have expected the code which disabled avx completely to have avoided this problem.  I'm investigating.

Comment 68 Jeff Law 2012-05-11 17:37:33 UTC

Hahaha.  Everything is working fine with the f17 rpms.  The failure was due to installing a new glibc with an incorrect fix from upstream on a running system.

After the files are installed, there's a scriptlet that runs and restarts the ssh daemon.  Of course, the scriptlet and its children are using the *new* glibc (which again, was broken).  Thus we were able to get into the avx strncasecmp.

Everything still looks good for f17 at this point.  And clearly the upstream fix we want to use for f18 needs refinement ;-)

Comment 69 Ferris Draugh 2012-05-11 19:03:32 UTC

After more testing the patched glibc is working solid for me on the affected system.  Once new packages hit the official repo I will rebuild everything and retest on all systems.

Comment 70 Jeff Law 2012-05-11 19:03:34 UTC

*** Bug 820495 has been marked as a duplicate of this bug. ***

Comment 71 Adam Williamson 2012-05-12 00:49:36 UTC

Discussed at 2012-05-11 blocker review meeting: http://meetbot.fedoraproject.org/fedora-bugzappers/2012-05-11/f17-final-blocker-review-meeting-5.2012-05-11-17.04.html . Accepted as a blocker per criterion "The release must boot successfully as Xen DomU with releases providing a functional, supported Xen Dom0 and widely used cloud providers utilizing Xen. This does not include any issues limited to the release functioning as Xen Dom0" on a significant number of newer platforms.

Comment 72 Adam Williamson 2012-05-12 00:51:41 UTC

Hey Jeff - please don't close blocker bugs at least until the update is pushed stable by Bodhi and ideally until it's clear they are included in the latest official compose. This is very important - if the bug gets closed it drops right off of our 'blocker tracking radar' and the fix might not make the compose! thanks :)

Comment 73 Jeff Law 2012-05-14 15:47:21 UTC

My bad, sorry Adam.

Comment 74 Ferris Draugh 2012-05-14 22:49:27 UTC

Latest disk images with working glibc:

Raw disk image for Xen 3/4:
http://stacklet.com/sites/stk/files/fedora.17.x86-64.20120514.img.tar.bz2

XVA Template for XCP/XenServer
http://stacklet.com/sites/stk/files/fedora.17.x86-64.20120514.xva.bz2

Comment 75 Fedora Update System 2012-05-15 05:25:29 UTC

glibc-2.15-37.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 76 Justin M. Forbes 2012-05-29 21:00:48 UTC

*** Bug 825586 has been marked as a duplicate of this bug. ***
