658720 – xen domU between minor CPU revs fails

Bug 658720 - xen domU between minor CPU revs fails

Summary: xen domU between minor CPU revs fails

Keywords:
Status:	CLOSED DUPLICATE of bug 663755
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.0
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Red Hat Kernel Manager
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	523117

Reported:	2010-12-01 03:59 UTC by Rich Graves
Modified:	2011-01-24 07:46 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-01-24 07:46:50 UTC
Target Upstream Version:
Embargoed:

Attachments
xend.log when booting rhel6 on L5520 (10.98 KB, text/plain) 2010-12-01 15:22 UTC, Rich Graves
xend.log on L5520 when migrating to X5680 (3.21 KB, text/plain) 2010-12-01 15:23 UTC, Rich Graves
xend.log on X5680 as rhel6 is migrating from L5520 (11.46 KB, text/plain) 2010-12-01 15:24 UTC, Rich Graves
xen dmsg output on X5680 (nothing new during migration) (9.81 KB, text/plain) 2010-12-01 16:01 UTC, Rich Graves

Description Rich Graves 2010-12-01 03:59:29 UTC

Description of problem:

RHEL6 XenU paravirt guests can't migrate across minor CPU flag differences.

RHEL4 and RHEL5 guests can.

Version-Release number of selected component (if applicable):

2.6.32-71.7.1.el6.x86_64 on RHEL6 paravirt guest
2.6.18-194.11.1.el5xen on RHEL5 dom0 host

How reproducible:

Always when migrating RHEL6 from an older to a newer CPU (L5520 to X5680).
Never when migrating RHEL5 or RHEL4 guests (they work fine).

Steps to Reproduce:
1. Install RHEL5.5 on two different hosts with slightly different CPU flags. For example, a Nehalem2 X5680 and Nehalem L5520, or Core2 Duo 5160 and Core Duo 5060. Configure xend to allow migration.

2. Install a RHEL6 paravirt guest.

3. Attempt to live-migrate from newer CPU to older CPU. Then from older to newer.
  
Actual results:

Live-migrate from an X5680 to L5520 works (/proc/cpuinfo shows that the difference is "arat," whatever that is). Live-migrate from L5520 to X5680 fails.
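
The flag comparison mentioned above can be reproduced with a small script. This is only a sketch: the function names are hypothetical, and the sample flag lists below are shortened illustrations, not the real /proc/cpuinfo contents of the two hosts.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flag_diff(cpuinfo_a, cpuinfo_b):
    """Return (only_in_a, only_in_b) as sorted lists of flag names."""
    a, b = parse_flags(cpuinfo_a), parse_flags(cpuinfo_b)
    return sorted(a - b), sorted(b - a)


# Illustrative input only (assumption): real flag lines are much longer.
sample_x5680 = "processor\t: 0\nflags\t\t: fpu vme de pse tsc arat\n"
sample_l5520 = "processor\t: 0\nflags\t\t: fpu vme de pse tsc\n"

only_first, only_second = flag_diff(sample_x5680, sample_l5520)
print("only on first host: ", " ".join(only_first))
print("only on second host:", " ".join(only_second))
```

In practice the two input strings would be the contents of /proc/cpuinfo as read inside the guest (or on dom0) on each host.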

Expected results:

There oughta be a supported way to mask CPU flags (or something) to allow live-migrate across slightly different CPUs.

Additional info:

Beginning of a possibly relevant thread at http://www.mailinglistarchive.com/html/xen-devel@lists.xensource.com/2008-07/msg05091.html

Comment 2 Andrew Jones 2010-12-01 14:47:07 UTC

Hi Rich,

How exactly does the migration fail? Any console output or logs? Have you always ping-ponged it like this in your testing (i.e. start on host A, migrate to host B, attempt to go back to A), or have you tried immediately B -> A? Which host, A or B, has the image stored locally? Or is the image accessible to both through the network from some host C?

Thanks for the additional information.
Drew

Comment 3 Rich Graves 2010-12-01 15:08:43 UTC

No console output. No xm dmesg... but on third look, yes, there is something in xend.log. Will attach separately.

At the xm level, I see state go from bp to b on the destination host and the source vm goes away cleanly, but the clock never starts ticking:

Name                                      ID Mem(MiB) VCPUs State   Time(s)
rhel6                                     36     2000     1 -b----      0.0
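
For scripted checks, a row like the one above can be decoded as follows. This is a rough sketch, not part of the xm tooling: the helper names are hypothetical, the state letters follow the usual xm list legend (r, b, p, s, c, d), and treating "blocked with zero CPU time" as stalled is only a heuristic, since a freshly started idle guest can briefly look the same.

```python
# State letters as documented for the "State" column of xm list.
XM_STATE_FLAGS = {
    "r": "running",
    "b": "blocked",
    "p": "paused",
    "s": "shutdown",
    "c": "crashed",
    "d": "dying",
}


def parse_xm_list_row(row):
    """Split one data row of 'xm list' output into typed fields."""
    name, domid, mem, vcpus, state, time_s = row.split()
    return {
        "name": name,
        "id": int(domid),
        "mem_mib": int(mem),
        "vcpus": int(vcpus),
        "states": [XM_STATE_FLAGS[ch] for ch in state if ch != "-"],
        "cpu_time_s": float(time_s),
    }


def looks_stalled(dom):
    """Heuristic (assumption): blocked and no CPU time accumulated at all."""
    return dom["states"] == ["blocked"] and dom["cpu_time_s"] == 0.0


row = "rhel6                                     36     2000     1 -b----      0.0"
dom = parse_xm_list_row(row)
print(dom["name"], dom["states"], "stalled?", looks_stalled(dom))
```

Polling this a few seconds apart and checking whether Time(s) ever increases would distinguish a wedged guest from one that is merely idle.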

The VM's disk is a raw fibre channel LUN, with the same multipath name on all hosts.

name = "rhel6"
uuid = "76c15315-f2a8-6a06-7735-2c7204c63bfe"
maxmem = 2000
memory = 2000
vcpus = 1
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [ "type=vnc,vncunused=1,keymap=en-us" ]
disk = [ "phy:/dev/mapper/rhel6,xvda,w" ]
vif = [ "mac=00:16:36:4f:fc:18,bridge=xenbr0,script=vif-bridge" ]

Comment 4 Rich Graves 2010-12-01 15:22:51 UTC

Created attachment 464008
xend.log when booting rhel6 on L5520

Comment 5 Rich Graves 2010-12-01 15:23:44 UTC

Created attachment 464011
xend.log on L5520 when migrating to X5680

Comment 6 Rich Graves 2010-12-01 15:24:51 UTC

Created attachment 464013
xend.log on X5680 as rhel6 is migrating from L5520

Comment 7 Rich Graves 2010-12-01 15:28:09 UTC

On the destination X5680, qemu-dm.26641.log says

domid: 37
Change xvda to look like hda
Watching /local/domain/37/logdirty/next-active
Watching /local/domain/0/device-model/37/command
xs_read(): vncpasswd get error. /vm/76c15315-f2a8-6a06-7735-2c7204c63bfe/vncpasswd.
Sticking to new protocol

xen-hotplug.log says simply:

Nothing to flush.
Nothing to flush.

brctl show and /var/log/messages on the X5680 show that the network got connected, but the guest is not pingable or arpable.

Dec  1 09:17:18 xen5 kernel: device vif37.0 entered promiscuous mode
Dec  1 09:17:38 xen5 kernel: blkback: ring-ref 8, event-channel 9, protocol 1 (x86_64-abi)

Comment 8 Rich Graves 2010-12-01 15:31:31 UTC

(I haven't actually verified that migration between EXACT SAME CPU works. I don't currently have any host pairs that are *exactly* identical.)

Comment 9 Andrew Jones 2010-12-01 15:42:33 UTC

After the migration fails can you grab one more thing? From the host that it's
supposed to be running on grab the output of xenctx (with a command like the
following)

# /usr/lib64/xen/bin/xenctx -s System.map-2.6.32-71.7.1.el6.x86_64 <domid>

Also please take a look at 'xm dmesg' on both hosts to see if anything interesting popped up.

thanks,
Drew

Comment 10 Rich Graves 2010-12-01 15:57:52 UTC

Nothing new has come out of xm dmesg since boot, but I'll attach the full output in case it's interesting.

Clarifying my comment #8, I can reliably live-migrate rhel6 X5680 to L5520... but it's possible I could not migrate rhel6 between two identical X5680's or two identical L5520's. rhel4 and rhel5 can move anywhere.

Anyway, xenctx says:

[root@xen5 ~]# /usr/lib64/xen/bin/xenctx -s /mnt/System.map-2.6.32-71.7.1.el6.x86_64 37
rip: ffffffff810093aa _stext+0x3aa
rsp: ffffffff8170df08
rax: 00000000   rbx: ffffffff8170c000   rcx: ffffffff810093aa   rdx: 00000000
rsi: 00000000   rdi: 00000001   rbp: ffffffff8170df20
 r8: 00000000    r9: 00000000   r10: 00000000   r11: 00000246
r12: ffffffff818a1b60   r13: 00000000   r14: ffffffffffffffff   r15: 00000000
 cs: 0000e033    ds: 00000000    fs: 00000000    gs: 00000000

Stack:
 0000000000000000 0000000000000000 ffffffff8100f3a0 ffffffff8170df38
 ffffffff8100c405 ffffffff8170dfd8 ffffffff8170df68 ffffffff81011e96
 6db6db6db6db6db7 a421597014070596 0000000000000000 6db6db6db6db6db7
 ffffffff8170df78 ffffffff814b0daa ffffffff8170dfb8 ffffffff818c1ecd

Code:
cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc 

Call Trace:
  [<ffffffff810093aa>] _stext+0x3aa <--
  [<ffffffff8100f3a0>] xen_safe_halt+0x10
  [<ffffffff8100c405>] xen_idle+0x35
  [<ffffffff81011e96>] cpu_idle+0xb6
  [<ffffffff814b0daa>] rest_init+0x7a
  [<ffffffff818c1ecd>] start_kernel+0x413
  [<ffffffff818c133a>] x86_64_start_reservations+0x125
  [<ffffffff818c50b6>] xen_start_kernel+0x578

Comment 11 Rich Graves 2010-12-01 16:01:33 UTC

Created attachment 464037
xen dmsg output on X5680 (nothing new during migration)

Comment 12 Andrew Jones 2010-12-03 14:46:08 UTC

Backtrace just shows that the guest isn't currently doing anything. Does the guest respond to xm commands such as 'xm shutdown'?
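
As a rough illustration of that reading, the Call Trace section of the xenctx output in comment 10 can be checked mechanically for the idle path. The helper names and the choice of marker symbols are assumptions made for this sketch; the symbols are simply the ones that appear in that trace.

```python
import re

# Matches frames such as "  [<ffffffff8100f3a0>] xen_safe_halt+0x10".
FRAME = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)")

# Assumption: these symbols, taken from the trace in comment 10,
# together indicate the paravirt idle loop.
IDLE_SYMBOLS = {"xen_safe_halt", "xen_idle", "cpu_idle"}


def parse_frames(trace_text):
    """Return a list of (address, symbol, offset) tuples from a call trace."""
    return [
        (int(addr, 16), sym, int(off, 16))
        for addr, sym, off in FRAME.findall(trace_text)
    ]


def is_idle_trace(trace_text):
    """True if every idle-path marker symbol appears in the trace."""
    symbols = {sym for _, sym, _ in parse_frames(trace_text)}
    return IDLE_SYMBOLS <= symbols


trace = """\
  [<ffffffff810093aa>] _stext+0x3aa <--
  [<ffffffff8100f3a0>] xen_safe_halt+0x10
  [<ffffffff8100c405>] xen_idle+0x35
  [<ffffffff81011e96>] cpu_idle+0xb6
"""
print("idle path:", is_idle_trace(trace))
```

An idle backtrace does not by itself say why the guest is unresponsive; it only rules out a spin or panic at the time of sampling.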

To work on testing the ARAT (Always Running APIC Timer) theory, we can compare dmesg output from the guest from fresh boots on each host (no migration) to see if any message exists showing that we're using it in some way. We can also try booting with nolapic_timer on the guest kernel command line, and then attempting to migrate again.

Comment 13 Rich Graves 2010-12-03 17:42:45 UTC

When the guest has been migrated from L5520 to X5680, there is no response to xm shutdown, sysrq b, console, or mem-set.

xm dump-core works. It pauses, creates a file in /var/lib/xen/dump, and unpauses, though the guest remains unresponsive.

Booting the guest with nolapic_timer does not help.

I'll collect and compare dmesg.

irqbalance is not running on either host (there is some online chatter that it can be bad).

I am slightly downrev on host kernel -- 2.6.18-194.11.1.el5xen versus 2.6.18-194.26.1.el5 -- but both hosts are at exactly the same patch level.

Comment 14 Rich Graves 2010-12-03 18:26:44 UTC

These are the only differences in guest dmesg (other than auditd timestamps).

--- dmesg.L5520    2010-12-03 12:22:10.000000000 -0600
+++ dmesg.X5580    2010-12-03 12:09:18.000000000 -0600
@@ -60,7 +60,7 @@
 PERCPU: Embedded 31 pages/cpu @ffff880004209000 s95064 r8192 d23720 u126976
 pcpu-alloc: s95064 r8192 d23720 u126976 alloc=31*4096
 pcpu-alloc: [0] 0 
-trying to map vcpu_info 0 at ffff880004214020, mfn 390f43, offset 32
+trying to map vcpu_info 0 at ffff880004214020, mfn eea131, offset 32
 cpu 0 using vcpu_info at ffff880004214020
 Xen: using vcpu_info placement
 Built 1 zonelists in Node order, mobility grouping on.  Total pages: 251938
@@ -80,8 +80,8 @@
 please try 'cgroup_disable=memory' option if you don't want memory cgroups
 Xen: using vcpuop timer interface
 installing Xen timer for CPU 0
-Detected 2260.998 MHz processor.
-Calibrating delay loop (skipped), value calculated using timer frequency.. 4521.99 BogoMIPS (lpj=2260998)
+Detected 3325.010 MHz processor.
+Calibrating delay loop (skipped), value calculated using timer frequency.. 6650.02 BogoMIPS (lpj=3325010)
 pid_max: default: 32768 minimum: 301
 Security Framework initialized
 SELinux:  Initializing.
@@ -96,8 +96,8 @@
 Initializing cgroup subsys freezer
 Initializing cgroup subsys net_cls
 Initializing cgroup subsys blkio
-CPU: Unsupported number of siblings 16
-Performance Events: unsupported p6 CPU model 26 no PMU driver, software events only.
+CPU: Unsupported number of siblings 32
+Performance Events: unsupported p6 CPU model 44 no PMU driver, software events only.
 alternatives: switching to unfair spinlock
 SMP alternatives: switching to UP code
 Freeing SMP alternatives: 32k freed
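
A comparison like the one above can be scripted so that audit timestamps do not show up as differences. This is a minimal sketch using Python's difflib; the function names are hypothetical, and the `audit(seconds.millis:serial)` pattern is an assumption about how the audit stamps appear in the guest dmesg.

```python
import difflib
import re

# Assumption: audit records carry a stamp like "audit(1291400530.123:2)".
AUDIT_STAMP = re.compile(r"audit\(\d+\.\d+:\d+\)")


def normalize(text):
    """Split dmesg text into lines and blank out audit timestamps."""
    return [AUDIT_STAMP.sub("audit(TS)", line) for line in text.splitlines()]


def dmesg_diff(a_text, b_text, a_name="dmesg.A", b_name="dmesg.B"):
    """Return unified-diff lines between two dmesg captures."""
    return list(
        difflib.unified_diff(
            normalize(a_text),
            normalize(b_text),
            fromfile=a_name,
            tofile=b_name,
            lineterm="",
        )
    )


# Tiny illustrative inputs; real captures would come from 'dmesg' in the guest.
boot_a = "installing Xen timer for CPU 0\nDetected 2260.998 MHz processor.\n"
boot_b = "installing Xen timer for CPU 0\nDetected 3325.010 MHz processor.\n"
for line in dmesg_diff(boot_a, boot_b):
    print(line)
```

Values that legitimately differ between hosts, such as the mfn in the vcpu_info line or the detected clock speed, will still appear in the output and have to be judged by eye.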

Comment 15 Rich Graves 2010-12-16 16:31:38 UTC

Possibly related reports:

https://bugzilla.redhat.com/show_bug.cgi?id=613513
 - but that talks about ping-ponging, and I'm seeing it on first migration after boot

http://lists.xensource.com/archives/html/xen-users/2010-12/msg00302.html
 - but that alleges problems going from slower to faster CPU, reverse of my experience
 - alleges problem exists on upstream Xen 3.4 and also Xen Cloud Platform 1.0

Comment 16 Rich Graves 2010-12-16 18:55:02 UTC

Possibly related, though repro steps are different https://bugzilla.redhat.com/show_bug.cgi?id=663755

Comment 17 Paras Pradhan 2011-01-03 16:09:09 UTC

Migration fails between exactly the same CPUs as well.

Paras.

Comment 18 RHEL Program Management 2011-01-07 04:21:28 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 19 Suzanne Logcher 2011-01-07 16:11:56 UTC

This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 20 Andrew Jones 2011-01-10 15:18:02 UTC

Can you try kernel 2.6.37-2.fc15.x86_64, available from rawhide repos? Likely bug 663755 and this are the same problem. We still need to figure out what patches in that F15 kernel fix it.

Comment 21 Andrew Jones 2011-01-14 07:26:18 UTC

Hi Rich,

After you've completed your testing, please let me know if this bug can be dupped to bug 663755.

Thanks,
Drew

Comment 22 Andrew Jones 2011-01-24 07:46:50 UTC


*** This bug has been marked as a duplicate of bug 663755 ***
