Bug 443779
Summary: | Guest OS install causes host machine to crash | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | RHEL Program Management <pm-rhel>
Component: | kernel-xen | Assignee: | Chris Lalancette <clalance>
Status: | CLOSED WONTFIX | QA Contact: | Martin Jenner <mjenner>
Severity: | high | Docs Contact: |
Priority: | urgent | |
Version: | 5.1 | CC: | anton, bburns, ddomingo, dhoward, dwa, jwest, tao, xen-maint
Target Milestone: | rc | Keywords: | ZStream
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2008-06-09 09:47:02 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 294551 | |
Bug Blocks: | 449945 | |
Description
RHEL Program Management
2008-04-23 09:21:41 UTC
OK. Because we had some confusing information amongst ourselves, I went back to scratch on this one. I installed the Sun x4600 with RHEL-5 U1 and worked from there. Here are my test results:

1. 2.6.18-53 HV, U1 tools, installing FV guest - it took me a couple of installs, but I was finally able to reproduce the crash
2. Hand-built 2.6.18-53 HV, U1 tools, installing FV guest - crashed
3. Hand-built 2.6.18-53 HV w/ max_phys_cpus=64, U1 tools, installing FV guest - crashed immediately, but in a different spot; that is a different bug in U1
4. 2.6.18-92 HV, U1 tools, installing FV guest - didn't test
5. Hand-built 2.6.18-92 HV, U1 tools, installing FV guest - after 3 installs, no crash. I gave up at this point; it looks like this works.
6. Hand-built 2.6.18-92 HV, no max_phys_cpus=64, U1 tools, installing FV guest - immediate crash in svm_ctxt_switch_to, which is different from the original crash, but similar.
7. Hand-built upstream 3.1.4 HV, no max_phys_cpus=64, U1 tools, installing FV guest - after 3 installs, no crash. I gave up at this point; it looks like this works.

So it looks like 5.2 is OK because of the max_phys_cpus=64 setting, for whatever reason. The upstream hypervisor seems to be good without max_phys_cpus=64, so there is probably a fix in that upstream HV. I'm going to bisect the upstream HV and see if I can find out what fixed this.

Chris Lalancette

Got it. This is ugly. From upstream changeset 15601:

```
x86: Remove broken and unnecessary numa code from smpboot.c.
This was trampling on random memory when node==NUMA_NO_NODE.
```

And indeed, looking at xm dmesg from a 5.2 HV, for instance, built with max_phys_cpus=32:

```
(XEN) Mapping cpu 0 to node 255
(XEN) Mapping cpu 1 to node 255
(XEN) Mapping cpu 2 to node 255
(XEN) Mapping cpu 3 to node 255
(XEN) Mapping cpu 4 to node 255
(XEN) Mapping cpu 5 to node 255
(XEN) Mapping cpu 6 to node 255
(XEN) Mapping cpu 7 to node 255
(XEN) Mapping cpu 8 to node 255
(XEN) Mapping cpu 9 to node 255
(XEN) Mapping cpu 10 to node 255
(XEN) Mapping cpu 11 to node 255
(XEN) Mapping cpu 12 to node 255
(XEN) Mapping cpu 13 to node 255
(XEN) Mapping cpu 14 to node 255
(XEN) Mapping cpu 15 to node 255
```

Finally, looking at the code in arch/x86/smpboot.c:

```
static inline void map_cpu_to_node(int cpu, int node)
{
    printk("Mapping cpu %d to node %d\n", cpu, node);
    cpu_set(cpu, node_2_cpu_mask[node]);
    cpu_2_node[cpu] = node;
}
```

Oops. We just indexed the node_2_cpu_mask[] array with -1, leading to random memory corruption. The one part I don't understand here is why max_phys_cpus=64 "fixes" the issue; it probably just moves the corruption around so that it always lands somewhere that doesn't matter (or something like that). Anyway, I think we need this patch for both 5.1.z (if it's still open) and 5.2.z.

Chris Lalancette

Duh. My analysis is slightly wrong. The "node" parameter is actually an int, so 255 is not -1. However, this is still a problem: the node_2_cpu_mask array is only 64 entries long, and we are indexing it with 255, which is off the end.

Chris Lalancette
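For readers following the analysis above, here is a minimal, self-contained C sketch of the failure mode, not the actual Xen source. The 64-entry size of node_2_cpu_mask and the node value of 255 come from the comments above; the cpumask representation, the cpu_2_node size, and the "guarded" helper are simplifications assumed for illustration. Note that the actual upstream fix (changeset 15601) removed the broken NUMA code path entirely rather than adding a bounds check like the one shown here.

```c
#include <stdio.h>

#define MAX_NUMNODES  64    /* size of node_2_cpu_mask, per the comment above  */
#define NUMA_NO_NODE  255   /* the "no node" value seen in the xm dmesg output */

typedef unsigned long cpumask_t;   /* stand-in for the real Xen cpumask type */

static cpumask_t node_2_cpu_mask[MAX_NUMNODES];
static int cpu_2_node[128];        /* size assumed for illustration */

/* Mirrors the buggy pattern: index the per-node array with whatever node
 * value we were handed. With node == 255 this writes 191 slots past the
 * end of node_2_cpu_mask, corrupting whatever happens to live there. */
static void map_cpu_to_node_buggy(int cpu, int node)
{
    printf("Mapping cpu %d to node %d\n", cpu, node);
    node_2_cpu_mask[node] |= 1UL << (cpu % 64);   /* out-of-bounds when node == 255 */
    cpu_2_node[cpu] = node;
}

/* Guarded variant: refuse to map a CPU whose reported node is out of range. */
static void map_cpu_to_node_guarded(int cpu, int node)
{
    if (node < 0 || node >= MAX_NUMNODES) {
        printf("cpu %d reports no usable node (%d); skipping mapping\n", cpu, node);
        return;
    }
    printf("Mapping cpu %d to node %d\n", cpu, node);
    node_2_cpu_mask[node] |= 1UL << (cpu % 64);
    cpu_2_node[cpu] = node;
}

int main(void)
{
    /* On the affected box every CPU came up with node == 255, so each boot
     * would have triggered the out-of-bounds write in the buggy variant. */
    map_cpu_to_node_guarded(0, NUMA_NO_NODE);   /* skipped safely */
    map_cpu_to_node_guarded(1, 0);              /* normal mapping */
    (void)map_cpu_to_node_buggy;                /* kept only to show the bad pattern */
    return 0;
}
```

This also makes the second comment's point concrete: because node is an int, 255 is not interpreted as -1, but it still indexes well past the 64-entry array, which is why where the corruption lands (and hence whether the host crashes) can shift with build options such as max_phys_cpus.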