Bug 443779 - Guest OS install causes host machine to crash
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.1
Hardware: All Linux
Priority: urgent   Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Chris Lalancette
QA Contact: Martin Jenner
Keywords: ZStream
Depends On: 294551
Blocks: 449945
Reported: 2008-04-23 05:21 EDT by RHEL Product and Program Management
Modified: 2008-06-16 15:46 EDT
CC List: 8 users

Doc Type: Bug Fix
Last Closed: 2008-06-09 05:47:02 EDT


Attachments: None
Description RHEL Product and Program Management 2008-04-23 05:21:41 EDT
This bug has been copied from bug #294551 and has been proposed
to be backported to 5.1 z-stream (EUS).
Comment 7 Chris Lalancette 2008-06-03 10:50:25 EDT
OK.  Because we had some confusing information amongst ourselves, I started again
from scratch on this one.  I installed the Sun x4600 with RHEL-5 U1 and worked from
there.  Here are my test results:

1)  2.6.18-53 HV, U1 tools, installing FV guest - it took me a couple of
installs, but I finally was able to reproduce the crash

2) Hand built 2.6.18-53 HV, U1 tools, installing FV guest - crashed

3) Hand built 2.6.18-53 HV w/ max_phys_cpus=64, U1 tools, installing FV guest -
crashed immediately, but in a different spot; different bug in U1

4) 2.6.18-92 HV, U1 tools, installing FV guest - didn't test

5) Hand built 2.6.18-92 HV, U1 tools, installing FV guest - After 3 installs, no
crash.  I gave up at this point; it looks like this works.

6) Hand built 2.6.18-92 HV, no max_phys_cpus=64, U1 tools, installing FV guest -
immediate crash in svm_ctxt_switch_to, which is different from the original
crash, but similar.

7) Hand-built upstream 3.1.4 HV, no max_phys_cpus=64, U1 tools, installing FV
guest - After 3 installs, no crash.  I gave up at this point; it looks like this
works.

So, it looks like 5.2 is OK because of the max_phys_cpus=64 setting, for whatever
reason.  The upstream hypervisor seems to be fine even without max_phys_cpus=64,
so there is probably a fix in the upstream HV.  I'm going to bisect the upstream
HV and see if I can find what fixed this.

Chris Lalancette
Comment 11 Chris Lalancette 2008-06-04 08:20:32 EDT
Got it.  This is ugly.  From upstream changeset 15601:

x86: Remove broken and unnecessary numa code from smpboot.c.
This was trampling on random memory when node==NUMA_NO_NODE.

And indeed, looking at xm dmesg from a 5.2 HV, for instance, built with
max_phys_cpus=32:

(XEN) Mapping cpu 0 to node 255
(XEN) Mapping cpu 1 to node 255
(XEN) Mapping cpu 2 to node 255
(XEN) Mapping cpu 3 to node 255
(XEN) Mapping cpu 4 to node 255
(XEN) Mapping cpu 5 to node 255
(XEN) Mapping cpu 6 to node 255
(XEN) Mapping cpu 7 to node 255
(XEN) Mapping cpu 8 to node 255
(XEN) Mapping cpu 9 to node 255
(XEN) Mapping cpu 10 to node 255
(XEN) Mapping cpu 11 to node 255
(XEN) Mapping cpu 12 to node 255
(XEN) Mapping cpu 13 to node 255
(XEN) Mapping cpu 14 to node 255
(XEN) Mapping cpu 15 to node 255

Finally, looking at the code in arch/x86/smpboot.c:

static inline void map_cpu_to_node(int cpu, int node)
{
	printk("Mapping cpu %d to node %d\n", cpu, node);
	cpu_set(cpu, node_2_cpu_mask[node]);
	cpu_2_node[cpu] = node;
}

Oops.  We just indexed the node_2_cpu_mask[] array with -1, leading to random
memory corruption.

The one part I don't understand here is why max_phys_cpus=64 "fixes" the issue;
it probably just moves the corruption around so that it always lands in a place
that doesn't matter (or something like that).

Anyway, I think we need this patch for both 5.1.z (if it's still open) and 5.2.z.

Chris Lalancette
Comment 12 Chris Lalancette 2008-06-04 08:29:25 EDT
Duh.  My analysis is slightly wrong.  The "node" parameter is actually an int,
so 255 is not -1.  However, this is still a problem; the node_2_cpu_mask array
is only 64 entries long, and we are indexing with 255, which is off the end.

Chris Lalancette
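
(Illustrative sketch only, not the actual fix.)  Upstream changeset 15601 simply
removed this NUMA mapping code from smpboot.c.  If one instead wanted to guard the
lookup, a minimal sketch would look like the following, assuming the Xen-side names
NUMA_NO_NODE (255, the unmapped-node sentinel seen in the dmesg above) and
MAX_NUMNODES (the bound of node_2_cpu_mask[]):

static inline void map_cpu_to_node(int cpu, int node)
{
	/* Skip unmapped or out-of-range nodes instead of corrupting memory. */
	if (node == NUMA_NO_NODE || node >= MAX_NUMNODES)
		return;
	printk("Mapping cpu %d to node %d\n", cpu, node);
	cpu_set(cpu, node_2_cpu_mask[node]);
	cpu_2_node[cpu] = node;
}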
