Bug 694492
Summary: RHEL 6 Xen PV domU kernel panic after migrate to older CPU
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.1
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Jinxin Zheng <jzheng>
Assignee: Andrew Jones <drjones>
QA Contact: Virtualization Bugs <virt-bugs>
CC: drjones, leiwang, lersek, qwan, xen-maint, yuzhang, yuzhou
Keywords: Reopened
Doc Type: Bug Fix
Last Closed: 2011-09-15 15:11:03 UTC
Bug Blocks: 653816

Attachments:
Created attachment 490551 [details]
xenctx output for cpu 0
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32-128.el6.x86_64 13 0
Created attachment 490552 [details]
xenctx output for cpu 1
/usr/lib64/xen/bin/xenctx -s System.map-2.6.32-128.el6.x86_64 13 1
Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

domain config:

bootloader = "/usr/bin/pygrub"
vif = ['mac=00:21:7F:B7:08:60,script=vif-bridge,bridge=xenbr0,type=netfront']
name = "rhel6-pv"
on_reboot = "restart"
localtime = "0"
apic = "1"
on_poweroff = "destroy"
on_crash = "preserve"
vcpus = "2"
pae = "1"
memory = "512"
maxmem = "512"
vnclisten = "0.0.0.0"
vnc = "1"
disk = ['tap:aio:/var/lib/xen/images/RHEL-Server-6.0-64-pv.raw,xvda,w']
acpi = "1"

It turns out this was a problem with the test environment. The guest's disk image is stored locally on the W3520 and shared by NFS to the other machines it was attempting to migrate to/from. However, it was exported with all_squash rather than no_root_squash, which doesn't work. I fixed the NFS export and was able to ping-pong the guest back and forth to/from all the hosts I was given access to. Closing as not-a-bug.

Hi Andrew,

Did you really solve the bug merely by modifying /etc/exports? The panic is still there:

W3520 => E5310: panic
E5504 => E5310: panic
E8400 => E5310: OK

All migrations in the opposite direction (E5310 => *) are OK. Also, if the domU was first started on E5310, migrating back and forth works; if the domU was first started on W3520/E5504, migrating to E5310 panics. I also tried a RHEL 5 guest: no problem.

Created attachment 490704 [details]
xm dmesg from E5310 after migration
Call Trace of DomU (RHEL6 kernel-2.6.32-128.el6.x86_64):
------------------------------------------------------------------
rsyslogd[911] trap invalid opcode ip:7fe9a0ba3730 sp:7fe99f067d18 error:0 in libc-2.12.so[7fe9a0a7d000+187000]
auditd[882] trap invalid opcode ip:7f0304e03818 sp:7fffe042e2e8 error:0 in libc-2.12.so[7f0304cde000+187000]
automount[1644] trap invalid opcode ip:7f835bed4730 sp:7f8359b15c78 error:0 in libc-2.12.so[7f835bdae000+187000]
automount[1642] trap invalid opcode ip:7f835bed4730 sp:7f835d373c78 error:0 in libc-2.12.so[7f835bdae000+187000]
abrtd[1266] trap invalid opcode ip:7ff158bc5730 sp:7fffb642ffd8 error:0 in libc-2.12.so[7ff158a9f000+187000]
dbus-daemon[965] trap invalid opcode ip:7f5958d929f3 sp:7fff55234b78 error:0 in libc-2.12.so[7f5958c6d000+187000]
hald[1096] trap invalid opcode ip:7fc10ec3d553 sp:7fffd5036e28 error:0 in libc-2.12.so[7fc10eb17000+187000]
wpa_supplicant[993] trap invalid opcode ip:7fddfae6e473 sp:7fff9aa9e898 error:0 in libc-2.12.so[7fddfad48000+187000]
modem-manager[981] trap invalid opcode ip:7ff365ab3553 sp:7ffffed38998 error:0 in libc-2.12.so[7ff36598d000+187000]
Pid 981(modem-manager) over core_pipe_limit
Skipping core dump
avahi-daemon[992] trap invalid opcode ip:7fa4c861b473 sp:7fffb8a9d158 error:0 in libc-2.12.so[7fa4c84f5000+187000]
Pid 1312(libvirtd) over core_pipe_limit
Skipping core dump
Pid 1558(console-kit-dae) over core_pipe_limit
Skipping core dump
Pid 1097(hald-runner) over core_pipe_limit
Skipping core dump
Pid 1136(hald-addon-inpu) over core_pipe_limit
Skipping core dump
Pid 1(init) over core_pipe_limit
Skipping core dump
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.32-128.el6.x86_64 #1
Call Trace:
 [<ffffffff814d9627>] ? panic+0x78/0x143
 [<ffffffff812ff3d6>] ? get_current_tty+0x66/0x70
 [<ffffffff8106c4b2>] ? do_exit+0x852/0x860
 [<ffffffff8107d96d>] ? __sigqueue_free+0x3d/0x50
 [<ffffffff8106c518>] ? do_group_exit+0x58/0xd0
 [<ffffffff81081966>] ? get_signal_to_deliver+0x1f6/0x460
 [<ffffffff8100731d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007b52>] ? check_events+0x12/0x20
 [<ffffffff81007b3f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100a365>] ? do_signal+0x75/0x800
 [<ffffffff814dcef5>] ? do_trap+0x75/0x160
 [<ffffffff8100ceb5>] ? do_invalid_op+0x95/0xb0
 [<ffffffff81007b3f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8100ab80>] ? do_notify_resume+0x90/0xc0
 [<ffffffff8100bbdc>] ? retint_signal+0x48/0x8c

Created attachment 490707 [details]
console output from the crashed domU
Created attachment 490708 [details]
xenctx output for cpu 0
(In reply to comment #7)
> Did you really solve the bug merely by modifying the /etc/exports?
> the panic is still there..

Yes. Everything worked great for me just last night on all these machines, once the NFS export was corrected.

> W3520 => E5310: panic

I just got back on the systems and see this is now true: it panics every time. So there is some problem, but I'm still not sure whether it's the environment or a real bug. For starters, the clocks on these machines are greatly out of sync. I've now set the one on the E5310 to match the W3520; it was off by six months. I'm attempting to reserve my own machines, where I will set up my own test environment. I couldn't find a W3520, but since you say E5504 => E5310 also panics, I'm attempting to reserve an E5504. I've already got an E5310 to start setting up.

OK, I've confirmed this is a reproducible bug on another, independently set up environment. Booting on E5310 first, I can ping-pong to/from E5504 without any problems. Booting on E5504 first, I cannot migrate to E5310: I get invalid-opcode logs and the backtrace shown in comment 9. It doesn't matter which host has the image locally or over NFS. I've created a crash dump and am taking a closer look.

When I looked at this bug Friday, I thought I had it figured out before I went home. I saw that a guest on an E5504 machine has both constant_tsc and nonstop_tsc in /proc/cpuinfo, but the E5310 has only constant_tsc, and that mismatch has been reported as problematic for migrations. Furthermore, I found that upstream Xen hypervisors mask nonstop_tsc unless migration is disabled (c/s 20402). Unfortunately, when I tested a migration yesterday morning with nonstop_tsc masked from the guest, it still failed the same way.

Yesterday I ran a range of experiments to try to find a working migration, or even a different failure, but all experiments gave me the same result. A persistent bug! Here is a summary of the experiments that all failed to migrate:
* RHEL 6.1 32-bit PV
* RHEL 6.0
* F14 (both PV and HVM guest without pv-drivers) using an F15 kernel (2.6.38-0.rc5.git1.1.fc15.x86_64)

So it's not a regression from 6.0, and it doesn't appear to be fixed upstream, at least not as of 2.6.38-0.rc5; therefore it doesn't need to be a 6.1 blocker. Moving to 6.2.

Perhaps stating the obvious, but the error messages in comment 9 are formatted by arch/x86/kernel/traps.c:

DO_ERROR_INFO(6, SIGILL, "invalid opcode", invalid_op, ILL_ILLOPN, regs->ip) -> do_trap() -> print_vma_addr()

The string "libc-2.12.so[7fe9a0a7d000+187000]" describes the file name backing the VMA, the start address of the VMA, and the size of the VMA. The file is always libc, and the VMA size is always 187000. Grouped by offset:

process              offset into libc VMA
-------------------  --------------------
hald[1096]           0x126553
auditd[882]          0x125818
abrtd[1266]          0x126730
rsyslogd[911]        0x126730
automount[1642]      0x126730
automount[1644]      0x126730
dbus-daemon[965]     0x1259F3
avahi-daemon[992]    0x126473
wpa_supplicant[993]  0x126473

I reproduced the problem. Here's one log entry:

rsyslogd[1004] trap invalid opcode ip:7f574a9f1890 sp:7f5748eb5d18 error:0 in libc-2.12.so[7f574a8cb000+187000]

addr=7f574a9f1890
vma_start=7f574a8cb000
offset_into_vma=126890

$ objdump -dR /lib64/libc-2.12.so >dump

Then search for "126890" in "dump":

0000000000126860 <__strlen_sse42>:
[...]
  126890: 66 0f 3a 63 4f 10 08    pcmpistri $0x8,0x10(%rdi),%xmm1

According to the function name (__strlen_sse42) and to the PCMPISTRI specification in the Intel Instruction Set Reference ("Packed Compare Implicit Length Strings, Return Index"), this instruction (66 0f 3a 63) requires SSE 4.2.
Now looking up some of the processors mentioned in comment 7:

> W3520 => E5310: panic
> E5504 => E5310: panic

http://pclinks.xtreemhost.com/server.htm

Xeon W3520  MMX SSE SSE2 SSE3 SSE4.2  (Bloomfield)
Xeon E5504  MMX SSE SSE2 SSE3 SSE4.2  (Gainestown, Nehalem-EP)
Xeon E5310  MMX SSE SSE2 SSE3         (Clovertown)

The problem seems to be that at program startup, libc determines that the CPU has SSE4.2, but after migration that no longer holds, and we get a bunch of SIGILLs. This problem probably doesn't hit under RHEL 5 (see the end of comment 7) because the glibc version shipped in RHEL 5 may have no SSE 4.2 specific code. I think this should be fixed somewhere in ldconfig / LD_HWCAP_MASK, but I'll have to research precisely how. (The panic happens because "init" gets a SIGILL too.)

I'm closing this as a duplicate of bug 525873 ("Support for cpuid masking per domU with Xen hypervisor"). Alternatively, we could close this as a duplicate of bug 526862 ("Mask out CPU features by default"), but I really don't think SSE4.2 should be masked out by default (too much performance loss). So, don't attempt to migrate like this.

*** This bug has been marked as a duplicate of bug 525873 ***
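For illustration only — the exact syntax varies by Xen version and this is a hypothetical sketch, not a tested workaround — the per-domU CPUID masking tracked by bug 525873 would amount to hiding the SSE4.2 feature bit (CPUID leaf 1, ECX bit 20) from the guest before it first boots on the newer host, so that glibc never selects __strlen_sse42:

```
# Hypothetical guest-config fragment: mask SSE4.2 (leaf 1, ECX bit 20)
# so the guest can safely migrate to a host without SSE4.2.
# The 32-character string is ECX with bit 31 leftmost; 'x' keeps the
# host value, '0' forces the bit off.
cpuid = ['1:ecx=xxxxxxxxxxx0xxxxxxxxxxxxxxxxxxxx']
```

The trade-off noted above applies: masking SSE4.2 everywhere by default costs string-function performance on hosts that do support it, which is why per-domU masking is the preferred fix.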
Created attachment 490550 [details]
console output from the crashed domU

Description of problem:
PV domU kernel panics after migrating from a newer CPU (Intel W3520) to an older one (Intel E5310).

Version-Release number of selected component (if applicable):
kernel-2.6.32-128.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up a RHEL 6 PV guest, install kernel -128, and boot the guest on the W3520.
2. xm migrate (not live) the domU to the E5310.

Actual results:
Kernel panic after a while.

Expected results:
The guest continues to work.