Description of problem:
Trying to boot Xen guests on newer hardware is always an adventure. One reason is that, by default, the Xen hypervisor only masks out features it knows it can't support. However, it can't know whether it supports a newer feature until that feature exists. Instead, it should mask out *all* features and then selectively enable the ones it knows it can support. Upstream does not currently work like this (nor does RHEL-5), so we will have to submit patches there.
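To make the difference between the two approaches concrete, here is a minimal sketch. The feature names and bit positions are purely illustrative (they are not real CPUID bits or Xen identifiers); the point is only that a blacklist lets an unknown bit through while a whitelist does not.

```python
# Illustrative bit positions within a single 32-bit CPUID feature word.
# These names and positions are hypothetical, not real CPUID assignments.
FEATURE_A = 1 << 0       # known to the hypervisor, and supported
FEATURE_B = 1 << 1       # known to the hypervisor, and supported
FEATURE_BAD = 1 << 2     # known to the hypervisor, and known to be unsupportable
FEATURE_FUTURE = 1 << 3  # shows up on newer hardware; the hypervisor has never heard of it

KNOWN_UNSUPPORTED = FEATURE_BAD
KNOWN_SUPPORTED = FEATURE_A | FEATURE_B


def blacklist_mask(host_word):
    """Hide only what is known to be bad; unknown bits pass through to the guest."""
    return host_word & ~KNOWN_UNSUPPORTED & 0xFFFFFFFF


def whitelist_mask(host_word):
    """Start from nothing and expose only what is known to be good."""
    return host_word & KNOWN_SUPPORTED


# A newer host that advertises everything, including the unknown feature.
new_host = FEATURE_A | FEATURE_B | FEATURE_BAD | FEATURE_FUTURE

leaked = blacklist_mask(new_host) & FEATURE_FUTURE   # non-zero: guest sees the new bit
hidden = whitelist_mask(new_host) & FEATURE_FUTURE   # zero: guest never sees it
```

With the whitelist, a new CPU generation needs no hypervisor change to stay safe; it only needs a change when someone wants to expose the new feature.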
Considering that upstream's handling of CPUID is completely different and done in the tools, the solution to this issue would likely have to be implemented twice: once for RHEL-5 and once for upstream.
It seems more palatable (if anything) to backport upstream's userspace handling of CPUID. The patches are relatively large but quite self-contained, and it would make it easier to tweak the defaults without requiring kernel upgrades. What do you think?
I'm OK with going with upstream's userspace implementation, though I'm not quite sure how it works. In particular, does it *always* send a list of supported flags down to the hypervisor when starting a guest? As long as there is always a whitelist (one that masks out things like GB pages, etc.), then I think doing the userspace version would be just fine. The only thing we'll have to be careful of is that, since this is (probably?) a change to the hypervisor/tools ABI, we'll need a compat mode so that a new userspace can run on an older hypervisor.
Some of the features I'd currently like to mask out haven't necessarily caused problems, but one never knows what the future holds, and of course the idea behind this bug is to guard against features that don't exist yet.
One current feature I'd like to mask is X86_FEATURE_HT. This hasn't caused problems yet, but it does cause a warning to be output on every boot of RHEL6 PV guests.
CPU: Unsupported number of siblings
This is output from detect_ht(). After that warning, the guest kernel decides to forget the whole thing and carries on fine. The warning could be avoided simply by masking the HT feature, though.
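As a rough illustration of what masking the HT feature means at the bit level: X86_FEATURE_HT corresponds to bit 28 of EDX in CPUID leaf 1. The sketch below clears that bit in a sample EDX value. The sample value and the helper names are assumptions made for the example; this is not the code from the attached patch.

```python
# X86_FEATURE_HT is CPUID leaf 1, EDX bit 28.
X86_FEATURE_HT_BIT = 28
HT_MASK = 1 << X86_FEATURE_HT_BIT


def has_ht(edx):
    """Return True if the HT bit is set in a CPUID.1:EDX value."""
    return bool(edx & HT_MASK)


def mask_ht(edx):
    """Return the EDX value with the HT bit cleared, as the guest would see it."""
    return edx & ~HT_MASK & 0xFFFFFFFF


# Sample host EDX value, chosen only for illustration.
host_edx = 0xBFEBFBFF
guest_edx = mask_ht(host_edx)

print(hex(guest_edx))  # 0xafebfbff
```

A guest that never sees the bit has no reason to go counting siblings, so it would not reach the code path that prints the warning.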
Created attachment 526923
cpuid whitelist function
It looks like this could be quite useful in upstream Xen. Why not post it there as well?
It was our understanding (... any inaccuracy in representing my colleagues' understanding is my fault ...) that upstream Xen "has a mix of white and black listing depending on guest type and does its cpuid management in userspace".
The set of whitelisted features might be useful for upstream, but then it should be specified somewhere in the VM configs or as another default setting in userspace, shouldn't it? (E.g., tools/libxc/xc_cpuid_x86.c, amd_xc_cpuid_policy() / intel_xc_cpuid_policy().)
*** Bug 711070 has been marked as a duplicate of this bug. ***
Created attachment 528802
Do not expose the X86_FEATURE_POPCNT feature, to avoid a crash on migration to a host that doesn't have it
An FC16 HVM guest will crash after migration with an invalid opcode error if it was started on a host with the X86_FEATURE_POPCNT feature and then migrated to a host without it.
The attached patch, applied on top of white-listing-V2, fixes this issue.
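The migration failure can be sketched in terms of feature words: the guest caches what it saw at boot, so anything it saw that the destination host lacks is a potential crash. POPCNT is bit 23 of ECX in CPUID leaf 1 and SSE3 is bit 0; the two host values and the helper names below are made up for the example and do not come from the patch.

```python
SSE3 = 1 << 0     # CPUID leaf 1, ECX bit 0
POPCNT = 1 << 23  # CPUID leaf 1, ECX bit 23


def missing_on_destination(guest_ecx, dest_ecx):
    """Bits the guest was told it has but the destination host cannot provide."""
    return guest_ecx & ~dest_ecx & 0xFFFFFFFF


def hide_popcnt(ecx):
    """Guest-visible ECX with POPCNT masked out."""
    return ecx & ~POPCNT & 0xFFFFFFFF


# Hypothetical hosts: A has POPCNT, B does not.
host_a_ecx = SSE3 | POPCNT
host_b_ecx = SSE3

# Without masking, the guest booted on A believes it can use POPCNT on B.
unsafe = missing_on_destination(host_a_ecx, host_b_ecx)

# With POPCNT hidden at boot, nothing the guest relies on is missing on B.
safe = missing_on_destination(hide_popcnt(host_a_ecx), host_b_ecx)
```

Here `unsafe` is non-zero (the POPCNT bit) and `safe` is zero, which is the condition a migration between these two hosts needs.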
Patch(es) available in kernel-2.6.18-294.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcome.
*** Bug 711322 has been marked as a duplicate of this bug. ***
Testing of this problem is covered by running the acceptance/functional tests with several Snapshot builds (from Snapshot1 to Snapshot4) on different CPU models.
No problems were found during testing, so this has been marked as Verified:SanityOnly.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.