Bug 961667 - xen: numa: guest crash on second restore
Summary: xen: numa: guest crash on second restore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Andrew Jones
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 514489
 
Reported: 2013-05-10 08:37 UTC by Andrew Jones
Modified: 2013-10-01 00:04 UTC (History)
9 users

Fixed In Version: kernel-2.6.18-360.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-10-01 00:04:46 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2013:1348 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux 5 kernel update 2013-10-01 00:41:39 UTC

Description Andrew Jones 2013-05-10 08:37:15 UTC
While testing save/restore on a NUMA machine[1] that was booted with numa=on on Xen's command line I noticed that the first save/restore appears to work, and the second save seems fine, but the second restore results in a crashed guest! However, Wei tested repeated save/restore on another NUMA machine[2], also with numa=on, and didn't have any problems.

[1] hp-dl580g7-01.lab.bos.redhat.com

host: 2.6.18-348.el5xen, xen-3.0.3-144.el5
guest: 2.6.18-348.el5xen

host                   : hp-dl580g7-01.lab.bos.redhat.com
release                : 2.6.18-348.el5xen
version                : #1 SMP Wed Nov 28 21:31:28 EST 2012
machine                : x86_64
nr_cpus                : 80
nr_nodes               : 1
sockets_per_node       : 4
cores_per_socket       : 10
threads_per_core       : 2
cpu_mhz                : 2397
hw_caps                : bfebfbff:2c100800:00000000:00000940:02bee3ff:00000000:00000001
total_memory           : 1048565
free_memory            : 996270
node_to_cpu            : node0:0-79
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-348.el5
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
cc_compile_by          : mockbuild
cc_compile_domain      : redhat.com
cc_compile_date        : Wed Nov 28 21:20:36 EST 2012
xend_config_format     : 2


processor	: 31
vendor_id	: GenuineIntel
cpu family	: 6
model		: 47
model name	:        Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
stepping	: 2
cpu MHz		: 2397.398
cache size	: 30720 KB
physical id	: 31
siblings	: 1
core id		: 0
cpu cores	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu tsc msr pae cx8 apic mtrr cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc pni vmx est ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips	: 5996.61
clflush size	: 64
cache_alignment	: 64
address sizes	: 44 bits physical, 48 bits virtual
power management:


[2] ibm-x3850x5-07.qe.lab.eng.nay.redhat.com

host: 2.6.18-355.el5xen, xen-3.0.3-144.el5
guest: 2.6.18-348.el5xen

host                   : ibm-x3850x5-07.qe.lab.eng.nay.redhat.com
release                : 2.6.18-355.el5xen
version                : #1 SMP Fri Apr 26 02:24:22 EDT 2013
machine                : x86_64
nr_cpus                : 32
nr_nodes               : 2
sockets_per_node       : 1
cores_per_socket       : 8
threads_per_core       : 2
cpu_mhz                : 2128
hw_caps                : bfebfbff:2c100800:00000000:00000940:02bee3ff:00000000:00000001
total_memory           : 32746
free_memory            : 0
node_to_cpu            : node0:0-7,16-23
                         node1:8-15,24-31
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-355.el5
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
cc_compile_by          : mockbuild
cc_compile_domain      : redhat.com
cc_compile_date        : Fri Apr 26 01:59:33 EDT 2013
xend_config_format     : 2


processor	: 31
vendor_id	: GenuineIntel
cpu family	: 6
model		: 47
model name	:        Intel(R) Xeon(R) CPU E7- 4830  @ 2.13GHz
stepping	: 2
cpu MHz		: 1064.000
cache size	: 24576 KB
physical id	: 31
siblings	: 1
core id		: 0
cpu cores	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu tsc msr pae cx8 apic mtrr cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc pni vmx est ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips	: 5323.16
clflush size	: 64
cache_alignment	: 64
address sizes	: 44 bits physical, 48 bits virtual
power management:

Comment 1 Andrew Jones 2013-05-10 08:42:16 UTC
Err.. copy+paste without reading first. It looks like somebody rebooted hp-dl580g7-01.lab.bos.redhat.com with numa=off, as 'xm info' now shows the single fake node. I'll reboot it again with numa=on and paste the correct 'xm info'.

Comment 2 RHEL Program Management 2013-05-10 08:58:57 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unable to address this
request at this time.

Red Hat invites you to ask your support representative to
propose this request, if appropriate, in the next release of
Red Hat Enterprise Linux.

Comment 3 Andrew Jones 2013-05-10 09:22:38 UTC
OK, the numa=on 'xm info' for hp-dl580g7-01.lab.bos.redhat.com

host                   : hp-dl580g7-01.lab.bos.redhat.com
release                : 2.6.18-355.el5xen
version                : #1 SMP Fri Apr 26 02:24:22 EDT 2013
machine                : x86_64
nr_cpus                : 80
nr_nodes               : 4
sockets_per_node       : 1
cores_per_socket       : 10
threads_per_core       : 2
cpu_mhz                : 2397
hw_caps                : bfebfbff:2c100800:00000000:00000940:02bee3ff:00000000:00000001
total_memory           : 1048565
free_memory            : 996270
node_to_cpu            : node0:0-9,40-49
                         node1:10-19,50-59
                         node2:20-29,60-69
                         node3:30-39,70-79
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-355.el5
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
cc_compile_by          : mockbuild
cc_compile_domain      : redhat.com
cc_compile_date        : Fri Apr 26 01:59:33 EDT 2013
xend_config_format     : 2

I also updated the host and guest to 2.6.18-355.el5xen, but the problem still reproduces. I'll get a backtrace from the crashed guest and paste it here next.

Comment 4 Andrew Jones 2013-05-10 09:25:11 UTC
This time it crashed on the second save. I guess it's possible it always crashed on the second save, but it wasn't noticed until the second restore. Here's the console output captured at the time of the 2nd save, with the resulting backtrace:

CPU 1 offline: Remove Rx thread
CPU 2 offline: Remove Rx thread
CPU 3 offline: Remove Rx thread
CPU 4 offline: Remove Rx thread
CPU 5 offline: Remove Rx thread
CPU 6 offline: Remove Rx thread
CPU 7 offline: Remove Rx thread
CPU 8 offline: Remove Rx thread
CPU 9 offline: Remove Rx thread
tap tap-3-51712: 2 getting info
netfront: device eth0 has copying receive path.
blktap: ring-ref 8, event-channel 7, protocol 1 (x86_64-abi)
Changing capacity of (202, 0) to 20971520 sectors
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/xen/core/smpboot.c:242
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/cpuid/cpu9/dev
CPU 0 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh scsi_mod parport_pc lp parport xennet pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3313, comm: suspend Not tainted 2.6.18-355.el5xen #1
RIP: e030:[<ffffffff803bf18e>]  [<ffffffff803bf18e>] __cpu_up+0x265/0x37b
RSP: e02b:ffff88002ba01d50  EFLAGS: 00010282
RAX: ffffffffffffffef RBX: 000000000000000a RCX: ffffffff805087e0
RDX: ffff88002b884000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff88002b884000 R08: 0000000000000001 R09: 0000000000000000
R10: ffff88002b884000 R11: 0000000000000200 R12: 0000000000000001
R13: ffff88000254e040 R14: ffffffff80505e0a R15: ffffffff8029eff4
FS:  00002ba287571710(0000) GS:ffffffff80639000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process suspend (pid: 3313, threadinfo ffff88002ba00000, task ffff88002c0667b0)
Stack:  ffff88003f995cd0  ffffffff8029be00  0000000000000007  ffffffff8051a680 
 0000000000000008  ffffffff8034905c  ffff88003f995cd0  ffffffff802c72da 
 ffffffff80533250  ffffffff805557a0 
Call Trace:
 [<ffffffff8029be00>] create_workqueue_thread+0x94/0xa9
 [<ffffffff8034905c>] __next_cpu+0x19/0x28
 [<ffffffff802c72da>] refresh_zone_stat_thresholds+0x91/0xa2
 [<ffffffff803bb396>] topology_cpu_callback+0xb/0x42
 [<ffffffff802a3da5>] cpu_up+0xc7/0x12c
 [<ffffffff803bdc95>] vcpu_hotplug+0x92/0xd4
 [<ffffffff803b5cb4>] next_device+0x9/0x1d
 [<ffffffff803b5df3>] bus_for_each_dev+0x61/0x6e
 [<ffffffff803be55c>] __do_suspend+0x0/0x600
 [<ffffffff803bdced>] smp_resume+0x16/0x2f
 [<ffffffff803beb54>] __do_suspend+0x5f8/0x600
 [<ffffffff80289433>] __wake_up_common+0x3e/0x68
 [<ffffffff8029eff4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8029eff4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233eb3>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029eff4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233db5>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12


Code: 0f 0b 68 c5 63 4c 80 c2 f2 00 eb fe 48 89 ef e8 85 c9 e4 ff 
RIP  [<ffffffff803bf18e>] __cpu_up+0x265/0x37b
 RSP <ffff88002ba01d50>
 <0>Kernel panic - not syncing: Fatal exception

Comment 5 RHEL Program Management 2013-05-10 09:29:17 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unable to address this
request at this time.

Red Hat invites you to ask your support representative to
propose this request, if appropriate, in the next release of
Red Hat Enterprise Linux.

Comment 6 Andrew Jones 2013-05-10 10:16:13 UTC
Even when changing the vm config to have a single vcpu, the 2nd save still fails, though the guest didn't crash this time. Here are some xend logs.

First successful save

[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:192) [xc_save]: /usr/lib64/xen/bin/xc_save 20 1 0 0 0 0
[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:466) suspend
[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:195) In saveInputHandler suspend
[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:197) Suspending 1 ...
[2013-05-10 06:03:22 xend.XendDomainInfo 9123] DEBUG (XendDomainInfo:1335) XendDomainInfo.handleShutdownWatch
[2013-05-10 06:03:22 xend.XendDomainInfo 9123] DEBUG (XendDomainInfo:1335) XendDomainInfo.handleShutdownWatch
[2013-05-10 06:03:22 xend.XendDomainInfo 9123] INFO (XendDomainInfo:1292) Domain has shutdown: name=migrating-vm1 id=1 reason=suspend.
[2013-05-10 06:03:22 xend 9123] INFO (XendCheckpoint:202) Domain 1 suspended.
[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:211) Written done
[2013-05-10 06:03:22 xend 9123] INFO (XendCheckpoint:515) Had 0 unexplained entries in p2m table
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) Saving memory pages: iter 1  95%^M 1: sent 262144, skipped 0, delta 4734ms, dom0 90%, target 0%, sent 1814Mb/s, dirtied 0Mb/s 0 pages
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) Total pages sent= 262144 (0.99x)
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) (of which 0 were fixups)
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) All memory is saved
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) Save exit rc=0

First successful restore

[2013-05-10 06:03:35 xend 9123] DEBUG (XendCheckpoint:359) [xc_restore]: /usr/lib64/xen/bin/xc_restore 21 2 1 2 0 0 0
[2013-05-10 06:03:35 xend 9123] INFO (XendCheckpoint:515) xc_domain_restore start: p2m_size = 40800
[2013-05-10 06:03:35 xend 9123] INFO (XendCheckpoint:515) Reloading memory pages:   0%
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:515) Received all pages (0 races)
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:5100%
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:515) Memory reloaded (80153 pages)
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:515) Domain ready to be built.
[2013-05-10 06:03:39 xend 9123] DEBUG (XendCheckpoint:466) store-mfn 201591106
[2013-05-10 06:03:39 xend 9123] DEBUG (XendCheckpoint:466) console-mfn 264496267
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:515) Restore exit with rc=0

Second failed save

[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:192) [xc_save]: /usr/lib64/xen/bin/xc_save 21 2 0 0 0 0
[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:466) suspend
[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:195) In saveInputHandler suspend
[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:197) Suspending 2 ...
[2013-05-10 06:04:22 xend.XendDomainInfo 9123] DEBUG (XendDomainInfo:1335) XendDomainInfo.handleShutdownWatch
[2013-05-10 06:04:22 xend.XendDomainInfo 9123] DEBUG (XendDomainInfo:1335) XendDomainInfo.handleShutdownWatch
[2013-05-10 06:04:22 xend.XendDomainInfo 9123] INFO (XendDomainInfo:1292) Domain has shutdown: name=migrating-vm1 id=2 reason=suspend.
[2013-05-10 06:04:22 xend 9123] INFO (XendCheckpoint:202) Domain 2 suspended.
[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:211) Written done
[2013-05-10 06:04:22 xend 9123] INFO (XendCheckpoint:515) Had 0 unexplained entries in p2m table
[2013-05-10 06:04:22 xend 9123] INFO (XendCheckpoint:515) Saving memory pages: iter 1   0%ERROR Internal error: Fatal PT race (pfn 400, type 10000000)
[2013-05-10 06:04:22 xend 9123] INFO (XendCheckpoint:515) Save exit rc=1


So the difference to highlight between the first and second save attempts is

Saving memory pages: iter 1  95%^M 1: sent 262144, skipped 0, delta 4734ms, dom0 90%, target 0%, sent 1814Mb/s, dirtied 0Mb/s 0 pages

and

Saving memory pages: iter 1   0%ERROR Internal error: Fatal PT race (pfn 400, type 10000000)

Comment 7 Andrew Jones 2013-05-10 15:39:49 UTC
I did a few different experiments and found that the 2nd save bug only occurs if the guest is booted with affinity to the cpus of the last of the four nodes (node 3). CPU affinity to these cpus gives the guest a loose affinity to the memory of node3, because the Xen hypervisor prefers memory located on the node of the currently executing processor. Therefore the guest likely had its pages mapped within the address range of that node. Xen's boot logs show the node address range mappings:

(XEN) SRAT: Node 0 PXM 0 0-c0000000
(XEN) SRAT: Node 0 PXM 0 0-4040000000
(XEN) SRAT: Node 1 PXM 1 4040000000-8040000000
(XEN) SRAT: Node 2 PXM 2 8040000000-c040000000
(XEN) SRAT: Node 3 PXM 3 c040000000-10040000000

Now, the "ERROR Internal error: Fatal PT race (pfn 400, type 10000000)" error occurs when we compute a bad mfn. To be precise, it occurs when we fail this test:

#define MFN_IS_IN_PSEUDOPHYS_MAP(_mfn)          \
    (((_mfn) < (max_mfn)) &&                    \
     ((mfn_to_pfn(_mfn) < (p2m_size)) &&        \
      (pfn_to_mfn(mfn_to_pfn(_mfn)) == (_mfn))))

I've got to run for the weekend now, but my hunch is that we computed an addr in the higher part of node3's range (which goes up to 10040000000) and that it failed a max test. Note that MFN_MASK_X86 is only ffffffffff:

10040000000
 ffffffffff

so that's consistent with the > max theory.

Comment 8 Andrew Jones 2013-05-13 11:42:15 UTC
Capping the usable memory to <= 1T using the mem= Xen command line option resolves this. I don't know what we've officially announced as the amount of memory Xen hosts support, but it looks like there's a real limit at 1T, so we should double-check that we announce 1T or less. To avoid issues like this one, we can backport the following patch and initialize availmem to our announced limit.

commit 077b04c0350e1bc20a41b06dad49a3e2a72ca8c8
Author: Keir Fraser <keir.fraser>
Date:   Wed Jul 21 08:43:35 2010 +0100

    x86: New boot option availmem= to limit usable system RAM.
    
    Unlike mem=, this specifies the limit on usable RAM, rather than a
    limit on maximum physical address of RAM.
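For reference, the mem= workaround would look something like this in a RHEL 5 host's grub.conf. This is an illustrative sketch only; the kernel version, device paths, and volume names are placeholders and will differ per host. Note mem= caps the maximum physical address Xen will use, whereas the backported availmem= limits total usable RAM.

```
# /boot/grub/grub.conf (illustrative entry; paths and versions will differ)
title Red Hat Enterprise Linux Server (2.6.18-360.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-360.el5 mem=1024G numa=on
        module /vmlinuz-2.6.18-360.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-360.el5xen.img
```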

Comment 16 RHEL Program Management 2013-06-07 14:30:05 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 18 Phillip Lougher 2013-06-11 17:52:26 UTC
Patch(es) available in kernel-2.6.18-360.el5
You can download this test kernel (or newer) from http://people.redhat.com/plougher/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.

Comment 20 Wei Shi 2013-07-15 07:10:02 UTC
Using machine hp-dl580g7-01

Reproduced:
  kernel-xen-2.6.18-359.el5

  Steps:
  0. Boot host with kernel parameter numa=on
  1. launch PV guest
  2. xm save vm1 vm1.save
  3. xm restore vm1.save
  4. rm vm1.save -f
  5. xm save vm1 vm1.save

# xm save vm1 vm1.save
Error: /usr/lib64/xen/bin/xc_save 4 14 0 0 0 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.


Verified:
  kernel-xen-2.6.18-363.el5

  Same steps as above, no error output.

Comment 22 errata-xmlrpc 2013-10-01 00:04:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1348.html

