Bug 658720
| Summary: | xen domU between minor CPU revs fails | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Rich Graves <rgraves> |
| Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | ||
| Version: | 6.0 | CC: | drjones, mrezanin, pradhanparas, xen-maint |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-01-24 07:46:50 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 523117 | ||
| Attachments: | |||
|
Description
Rich Graves
2010-12-01 03:59:29 UTC
Hi Rich, how exactly does the migration fail? Any console output or logs? Have you always ping-ponged it like this in your testing (i.e. start on host A, migrate to host B, attempt to go back to A), or have you tried immediately B -> A? Which host, A or B, has the image stored locally? Or is the image accessible to both through the network from some host C? Thanks for the additional information.

Drew

No console output. No xm dmesg... but on third look, yes, there is something in xend.log. Will attach separately.

At the xm level, I see state go from bp to b on the destination host and the source vm goes away cleanly, but the clock never starts ticking:

    Name    ID  Mem(MiB)  VCPUs  State   Time(s)
    rhel6   36      2000      1  -b----      0.0

The VM's disk is a raw fibre channel LUN, with the same multipath name on all hosts.

    name = "rhel6"
    uuid = "76c15315-f2a8-6a06-7735-2c7204c63bfe"
    maxmem = 2000
    memory = 2000
    vcpus = 1
    bootloader = "/usr/bin/pygrub"
    on_poweroff = "destroy"
    on_reboot = "restart"
    on_crash = "restart"
    vfb = [ "type=vnc,vncunused=1,keymap=en-us" ]
    disk = [ "phy:/dev/mapper/rhel6,xvda,w" ]
    vif = [ "mac=00:16:36:4f:fc:18,bridge=xenbr0,script=vif-bridge" ]

Created attachment 464008 [details]
xend.log when booting rhel6 on L5520
Created attachment 464011 [details]
xend.log on L5520 when migrating to X5680
Created attachment 464013 [details]
xend.log on X5680 as rhel6 is migrating from L5520
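For anyone trying to reproduce this, the xm-level symptom described above (domain in state b while Time(s) stays at 0.0) can be checked by capturing `xm list` output twice, some time apart, and comparing the CPU time per domain. The following is only a minimal Python sketch, assuming the six-column `xm list` layout quoted above; `parse_xm_list` and `stalled_domains` are hypothetical helper names, not part of any Xen tooling, and an idle guest may legitimately accumulate very little time, so the interval between samples should be generous.

```python
def parse_xm_list(text):
    """Parse captured `xm list` text into a list of dicts, one per domain row.

    Assumes the six-column layout: Name ID Mem(MiB) VCPUs State Time(s).
    """
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a six-column domain row.
        if len(fields) != 6 or fields[0] == "Name":
            continue
        name, domid, mem, vcpus, state, cpu_time = fields
        rows.append({
            "name": name,
            "id": int(domid),
            "mem_mib": int(mem),
            "vcpus": int(vcpus),
            "state": state,
            "time_s": float(cpu_time),
        })
    return rows


def stalled_domains(before, after):
    """Return names of domains whose Time(s) did not advance between two samples.

    Domains are matched by ID, so a domain that was restarted or migrated
    away (and therefore got a new ID) is ignored.
    """
    earlier = {d["id"]: d["time_s"] for d in parse_xm_list(before)}
    return [
        d["name"]
        for d in parse_xm_list(after)
        if d["id"] in earlier and d["time_s"] <= earlier[d["id"]]
    ]
```

Feeding it two captures taken on the destination host after the migration would list rhel6 if its clock never starts ticking.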
On the destination X5680, qemu-dm.26641.log says

    domid: 37
    Change xvda to look like hda
    Watching /local/domain/37/logdirty/next-active
    Watching /local/domain/0/device-model/37/command
    xs_read(): vncpasswd get error. /vm/76c15315-f2a8-6a06-7735-2c7204c63bfe/vncpasswd.
    Sticking to new protocol

xen-hotplug.log says simply:

    Nothing to flush.
    Nothing to flush.

brctl show and /var/log/messages on the X5680 show that the network got connected, but the guest is not pingable or arpable.

    Dec 1 09:17:18 xen5 kernel: device vif37.0 entered promiscuous mode
    Dec 1 09:17:38 xen5 kernel: blkback: ring-ref 8, event-channel 9, protocol 1 (x86_64-abi)

(I haven't actually verified that migration between EXACT SAME CPU works. I don't currently have any host pairs that are *exactly* identical.)

After the migration fails can you grab one more thing? From the host that it's supposed to be running on, grab the output of xenctx (with a command like the following)

    # /usr/lib64/xen/bin/xenctx -s System.map-2.6.32-71.7.1.el6.x86_64 <domid>

Also please take a look at 'xm dmesg' on both hosts to see if anything interesting popped up.

thanks,
Drew

Nothing new has come out of xm dmesg since boot, but I'll attach the full output in case it's interesting.

Clarifying my comment #8, I can reliably live-migrate rhel6 X5680 to L5520... but it's possible I could not migrate rhel6 between two identical X5680's or two identical L5520's. rhel4 and rhel5 can move anywhere.

Anyway, xenctx says:

    [root@xen5 ~]# /usr/lib64/xen/bin/xenctx -s /mnt/System.map-2.6.32-71.7.1.el6.x86_64 37
    rip: ffffffff810093aa _stext+0x3aa
    rsp: ffffffff8170df08
    rax: 00000000  rbx: ffffffff8170c000  rcx: ffffffff810093aa  rdx: 00000000
    rsi: 00000000  rdi: 00000001  rbp: ffffffff8170df20
    r8: 00000000  r9: 00000000  r10: 00000000  r11: 00000246
    r12: ffffffff818a1b60  r13: 00000000  r14: ffffffffffffffff  r15: 00000000
    cs: 0000e033  ds: 00000000  fs: 00000000  gs: 00000000

    Stack:
    0000000000000000 0000000000000000 ffffffff8100f3a0 ffffffff8170df38
    ffffffff8100c405 ffffffff8170dfd8 ffffffff8170df68 ffffffff81011e96
    6db6db6db6db6db7 a421597014070596 0000000000000000 6db6db6db6db6db7
    ffffffff8170df78 ffffffff814b0daa ffffffff8170dfb8 ffffffff818c1ecd

    Code:
    cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc

    Call Trace:
    [<ffffffff810093aa>] _stext+0x3aa <--
    [<ffffffff8100f3a0>] xen_safe_halt+0x10
    [<ffffffff8100c405>] xen_idle+0x35
    [<ffffffff81011e96>] cpu_idle+0xb6
    [<ffffffff814b0daa>] rest_init+0x7a
    [<ffffffff818c1ecd>] start_kernel+0x413
    [<ffffffff818c133a>] x86_64_start_reservations+0x125
    [<ffffffff818c50b6>] xen_start_kernel+0x578

Created attachment 464037 [details]
xen dmsg output on X5680 (nothing new during migration)
Backtrace just shows that the guest isn't currently doing anything. Does the guest respond to xm commands, such as 'xm shutdown'?

To work on testing the ARAT (Always Running APIC Timer) theory, we can compare dmesg output from the guest from fresh boots on each host (no migration) to see if any message exists showing that we're using it in some way. We can also try booting with nolapic_timer on the guest kernel command line, and then attempting to migrate again.

When the guest has been migrated from L5520 to X5680, there is no response to xm shutdown, sysrq b, console, or mem-set. xm dump-core works. It pauses, creates a file in /var/lib/xen/dump, and unpauses, though the host remains unresponsive.

Booting the guest with nolapic_timer does not help. I'll collect and compare dmesg.

irqbalance is not running on either host (some online chatter that it can be bad). I am slightly downrev on host kernel -- 2.6.18-194.11.1.el5xen versus 2.6.18-194.26.1.el5 -- but both hosts are at exactly the same patch level.

These are the only differences in guest dmesg (other than auditd timestamps).

    --- dmesg.L5520 2010-12-03 12:22:10.000000000 -0600
    +++ dmesg.X5580 2010-12-03 12:09:18.000000000 -0600
    @@ -60,7 +60,7 @@
     PERCPU: Embedded 31 pages/cpu @ffff880004209000 s95064 r8192 d23720 u126976
     pcpu-alloc: s95064 r8192 d23720 u126976 alloc=31*4096
     pcpu-alloc: [0] 0
    -trying to map vcpu_info 0 at ffff880004214020, mfn 390f43, offset 32
    +trying to map vcpu_info 0 at ffff880004214020, mfn eea131, offset 32
     cpu 0 using vcpu_info at ffff880004214020
     Xen: using vcpu_info placement
     Built 1 zonelists in Node order, mobility grouping on. Total pages: 251938
    @@ -80,8 +80,8 @@
     please try 'cgroup_disable=memory' option if you don't want memory cgroups
     Xen: using vcpuop timer interface
     installing Xen timer for CPU 0
    -Detected 2260.998 MHz processor.
    -Calibrating delay loop (skipped), value calculated using timer frequency.. 4521.99 BogoMIPS (lpj=2260998)
    +Detected 3325.010 MHz processor.
    +Calibrating delay loop (skipped), value calculated using timer frequency.. 6650.02 BogoMIPS (lpj=3325010)
     pid_max: default: 32768 minimum: 301
     Security Framework initialized
     SELinux: Initializing.
    @@ -96,8 +96,8 @@
     Initializing cgroup subsys freezer
     Initializing cgroup subsys net_cls
     Initializing cgroup subsys blkio
    -CPU: Unsupported number of siblings 16
    -Performance Events: unsupported p6 CPU model 26 no PMU driver, software events only.
    +CPU: Unsupported number of siblings 32
    +Performance Events: unsupported p6 CPU model 44 no PMU driver, software events only.
     alternatives: switching to unfair spinlock
     SMP alternatives: switching to UP code
     Freeing SMP alternatives: 32k freed

Possibly related reports:

https://bugzilla.redhat.com/show_bug.cgi?id=613513
- but that talks about ping-ponging, and I'm seeing it on first migration after boot

http://lists.xensource.com/archives/html/xen-users/2010-12/msg00302.html
- but that alleges problems going from slower to faster CPU, reverse of my experience
- alleges problem exists on upstream Xen 3.4 and also Xen Cloud Platform 1.0

Possibly related, though repro steps are different:
https://bugzilla.redhat.com/show_bug.cgi?id=663755

Migration fails in exactly same CPU as well.

Paras.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

Can you try kernel 2.6.37-2.fc15.x86_64, available from rawhide repos?

Likely bug 663755 and this are the same problem. We still need to figure out what patches in that F15 kernel fix it.

Hi Rich,

After you've completed your testing, please let me know if this bug can be dupped to bug 663755.

Thanks,
Drew

*** This bug has been marked as a duplicate of bug 663755 *** |
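The guest dmesg comparison posted earlier in this report (fresh boot on each host, then diff) can be repeated with a short script. This is a minimal Python sketch, not the method actually used in the thread: it assumes the dmesg output has been saved to text files, strips only the bracketed printk timestamp prefix (audit timestamps are left alone), and uses the standard library's `difflib.unified_diff`. The function names and default file labels are hypothetical.

```python
import difflib
import re

# Matches the "[    0.123456] " prefix that dmesg prints when
# printk timestamps are enabled; lines without it pass through unchanged.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def normalize(lines):
    """Drop trailing newlines and printk timestamps so only message text is compared."""
    return [_TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def dmesg_diff(a_lines, b_lines, a_name="dmesg.hostA", b_name="dmesg.hostB"):
    """Return a unified diff (list of strings) between two dmesg captures."""
    return list(
        difflib.unified_diff(
            normalize(a_lines),
            normalize(b_lines),
            fromfile=a_name,
            tofile=b_name,
            lineterm="",
        )
    )
```

With two saved captures, `dmesg_diff(open(path_a).readlines(), open(path_b).readlines())` returns the hunks; an empty list means the two boots logged the same messages apart from timestamps.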