While testing save/restore on a NUMA machine[1] that was booted with numa=on on Xen's command line I noticed that the first save/restore appears to work, and the second save seems fine, but the second restore results in a crashed guest! However, Wei tested repeated save/restore on another NUMA machine[2], also with numa=on, and didn't have any problems.

[1] hp-dl580g7-01.lab.bos.redhat.com
    host:  2.6.18-348.el5xen, xen-3.0.3-144.el5
    guest: 2.6.18-348.el5xen

host                   : hp-dl580g7-01.lab.bos.redhat.com
release                : 2.6.18-348.el5xen
version                : #1 SMP Wed Nov 28 21:31:28 EST 2012
machine                : x86_64
nr_cpus                : 80
nr_nodes               : 1
sockets_per_node       : 4
cores_per_socket       : 10
threads_per_core       : 2
cpu_mhz                : 2397
hw_caps                : bfebfbff:2c100800:00000000:00000940:02bee3ff:00000000:00000001
total_memory           : 1048565
free_memory            : 996270
node_to_cpu            : node0:0-79
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-348.el5
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
cc_compile_by          : mockbuild
cc_compile_domain      : redhat.com
cc_compile_date        : Wed Nov 28 21:20:36 EST 2012
xend_config_format     : 2

processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 47
model name      : Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
stepping        : 2
cpu MHz         : 2397.398
cache size      : 30720 KB
physical id     : 31
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu tsc msr pae cx8 apic mtrr cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc pni vmx est ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 5996.61
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:

[2] ibm-x3850x5-07.qe.lab.eng.nay.redhat.com
    host:  2.6.18-355.el5xen, xen-3.0.3-144.el5
    guest: 2.6.18-348.el5xen

host                   : ibm-x3850x5-07.qe.lab.eng.nay.redhat.com
release                : 2.6.18-355.el5xen
version                : #1 SMP Fri Apr 26 02:24:22 EDT 2013
machine                : x86_64
nr_cpus                : 32
nr_nodes               : 2
sockets_per_node       : 1
cores_per_socket       : 8
threads_per_core       : 2
cpu_mhz                : 2128
hw_caps                : bfebfbff:2c100800:00000000:00000940:02bee3ff:00000000:00000001
total_memory           : 32746
free_memory            : 0
node_to_cpu            : node0:0-7,16-23 node1:8-15,24-31
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-355.el5
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
cc_compile_by          : mockbuild
cc_compile_domain      : redhat.com
cc_compile_date        : Fri Apr 26 01:59:33 EDT 2013
xend_config_format     : 2

processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 47
model name      : Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz
stepping        : 2
cpu MHz         : 1064.000
cache size      : 24576 KB
physical id     : 31
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu tsc msr pae cx8 apic mtrr cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc pni vmx est ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 5323.16
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:
Err... that was a copy+paste without reading it first. It looks like somebody rebooted hp-dl580g7-01.lab.bos.redhat.com with numa=off, as 'xm info' now shows the single fake node. I'll reboot it again with numa=on and paste the correct 'xm info'.
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate, in the next release of Red Hat Enterprise Linux.
OK, the numa=on 'xm info' for hp-dl580g7-01.lab.bos.redhat.com:

host                   : hp-dl580g7-01.lab.bos.redhat.com
release                : 2.6.18-355.el5xen
version                : #1 SMP Fri Apr 26 02:24:22 EDT 2013
machine                : x86_64
nr_cpus                : 80
nr_nodes               : 4
sockets_per_node       : 1
cores_per_socket       : 10
threads_per_core       : 2
cpu_mhz                : 2397
hw_caps                : bfebfbff:2c100800:00000000:00000940:02bee3ff:00000000:00000001
total_memory           : 1048565
free_memory            : 996270
node_to_cpu            : node0:0-9,40-49 node1:10-19,50-59 node2:20-29,60-69 node3:30-39,70-79
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-355.el5
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
cc_compile_by          : mockbuild
cc_compile_domain      : redhat.com
cc_compile_date        : Fri Apr 26 01:59:33 EDT 2013
xend_config_format     : 2

I also updated the host and guest to 2.6.18-355.el5xen, but the problem still reproduces. I'll get a backtrace from the crashed guest and paste it here next.
This time it crashed on the second save. I guess it's possible it always crashed on the second save, but it wasn't noticed until the second restore. Here's the console output captured at the time of the second save, with the resulting backtrace:

CPU 1 offline: Remove Rx thread
CPU 2 offline: Remove Rx thread
CPU 3 offline: Remove Rx thread
CPU 4 offline: Remove Rx thread
CPU 5 offline: Remove Rx thread
CPU 6 offline: Remove Rx thread
CPU 7 offline: Remove Rx thread
CPU 8 offline: Remove Rx thread
CPU 9 offline: Remove Rx thread
tap tap-3-51712: 2 getting info
netfront: device eth0 has copying receive path.
blktap: ring-ref 8, event-channel 7, protocol 1 (x86_64-abi)
Changing capacity of (202, 0) to 20971520 sectors
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/xen/core/smpboot.c:242
invalid opcode: 0000 [1] SMP
last sysfs file: /class/cpuid/cpu9/dev
CPU 0
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh scsi_mod parport_pc lp parport xennet pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3313, comm: suspend Not tainted 2.6.18-355.el5xen #1
RIP: e030:[<ffffffff803bf18e>]  [<ffffffff803bf18e>] __cpu_up+0x265/0x37b
RSP: e02b:ffff88002ba01d50  EFLAGS: 00010282
RAX: ffffffffffffffef RBX: 000000000000000a RCX: ffffffff805087e0
RDX: ffff88002b884000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff88002b884000 R08: 0000000000000001 R09: 0000000000000000
R10: ffff88002b884000 R11: 0000000000000200 R12: 0000000000000001
R13: ffff88000254e040 R14: ffffffff80505e0a R15: ffffffff8029eff4
FS:  00002ba287571710(0000) GS:ffffffff80639000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process suspend (pid: 3313, threadinfo ffff88002ba00000, task ffff88002c0667b0)
Stack:  ffff88003f995cd0  ffffffff8029be00  0000000000000007  ffffffff8051a680
 0000000000000008  ffffffff8034905c  ffff88003f995cd0  ffffffff802c72da
 ffffffff80533250  ffffffff805557a0
Call Trace:
 [<ffffffff8029be00>] create_workqueue_thread+0x94/0xa9
 [<ffffffff8034905c>] __next_cpu+0x19/0x28
 [<ffffffff802c72da>] refresh_zone_stat_thresholds+0x91/0xa2
 [<ffffffff803bb396>] topology_cpu_callback+0xb/0x42
 [<ffffffff802a3da5>] cpu_up+0xc7/0x12c
 [<ffffffff803bdc95>] vcpu_hotplug+0x92/0xd4
 [<ffffffff803b5cb4>] next_device+0x9/0x1d
 [<ffffffff803b5df3>] bus_for_each_dev+0x61/0x6e
 [<ffffffff803be55c>] __do_suspend+0x0/0x600
 [<ffffffff803bdced>] smp_resume+0x16/0x2f
 [<ffffffff803beb54>] __do_suspend+0x5f8/0x600
 [<ffffffff80289433>] __wake_up_common+0x3e/0x68
 [<ffffffff8029eff4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8029eff4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233eb3>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029eff4>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233db5>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12

Code: 0f 0b 68 c5 63 4c 80 c2 f2 00 eb fe 48 89 ef e8 85 c9 e4 ff
RIP  [<ffffffff803bf18e>] __cpu_up+0x265/0x37b
 RSP <ffff88002ba01d50>
 <0>Kernel panic - not syncing: Fatal exception
Even when changing the VM config to have a single vcpu, the second save fails, but the guest didn't crash this time. Here are some xend logs.

First successful save:

[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:192) [xc_save]: /usr/lib64/xen/bin/xc_save 20 1 0 0 0 0
[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:466) suspend
[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:195) In saveInputHandler suspend
[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:197) Suspending 1 ...
[2013-05-10 06:03:22 xend.XendDomainInfo 9123] DEBUG (XendDomainInfo:1335) XendDomainInfo.handleShutdownWatch
[2013-05-10 06:03:22 xend.XendDomainInfo 9123] DEBUG (XendDomainInfo:1335) XendDomainInfo.handleShutdownWatch
[2013-05-10 06:03:22 xend.XendDomainInfo 9123] INFO (XendDomainInfo:1292) Domain has shutdown: name=migrating-vm1 id=1 reason=suspend.
[2013-05-10 06:03:22 xend 9123] INFO (XendCheckpoint:202) Domain 1 suspended.
[2013-05-10 06:03:22 xend 9123] DEBUG (XendCheckpoint:211) Written done
[2013-05-10 06:03:22 xend 9123] INFO (XendCheckpoint:515) Had 0 unexplained entries in p2m table
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) Saving memory pages: iter 1 95%^M 1: sent 262144, skipped 0, delta 4734ms, dom0 90%, target 0%, sent 1814Mb/s, dirtied 0Mb/s 0 pages
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) Total pages sent= 262144 (0.99x)
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) (of which 0 were fixups)
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) All memory is saved
[2013-05-10 06:03:27 xend 9123] INFO (XendCheckpoint:515) Save exit rc=0

First successful restore:

[2013-05-10 06:03:35 xend 9123] DEBUG (XendCheckpoint:359) [xc_restore]: /usr/lib64/xen/bin/xc_restore 21 2 1 2 0 0 0
[2013-05-10 06:03:35 xend 9123] INFO (XendCheckpoint:515) xc_domain_restore start: p2m_size = 40800
[2013-05-10 06:03:35 xend 9123] INFO (XendCheckpoint:515) Reloading memory pages: 0%
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:515) Received all pages (0 races)
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:5100%
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:515) Memory reloaded (80153 pages)
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:515) Domain ready to be built.
[2013-05-10 06:03:39 xend 9123] DEBUG (XendCheckpoint:466) store-mfn 201591106
[2013-05-10 06:03:39 xend 9123] DEBUG (XendCheckpoint:466) console-mfn 264496267
[2013-05-10 06:03:39 xend 9123] INFO (XendCheckpoint:515) Restore exit with rc=0

Second failed save:

[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:192) [xc_save]: /usr/lib64/xen/bin/xc_save 21 2 0 0 0 0
[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:466) suspend
[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:195) In saveInputHandler suspend
[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:197) Suspending 2 ...
[2013-05-10 06:04:22 xend.XendDomainInfo 9123] DEBUG (XendDomainInfo:1335) XendDomainInfo.handleShutdownWatch
[2013-05-10 06:04:22 xend.XendDomainInfo 9123] DEBUG (XendDomainInfo:1335) XendDomainInfo.handleShutdownWatch
[2013-05-10 06:04:22 xend.XendDomainInfo 9123] INFO (XendDomainInfo:1292) Domain has shutdown: name=migrating-vm1 id=2 reason=suspend.
[2013-05-10 06:04:22 xend 9123] INFO (XendCheckpoint:202) Domain 2 suspended.
[2013-05-10 06:04:22 xend 9123] DEBUG (XendCheckpoint:211) Written done
[2013-05-10 06:04:22 xend 9123] INFO (XendCheckpoint:515) Had 0 unexplained entries in p2m table
[2013-05-10 06:04:22 xend 9123] INFO (XendCheckpoint:515) Saving memory pages: iter 1 0%ERROR Internal error: Fatal PT race (pfn 400, type 10000000)
[2013-05-10 06:04:22 xend 9123] INFO (XendCheckpoint:515) Save exit rc=1

So the difference to highlight between the first and the second save attempts is:

Saving memory pages: iter 1 95%^M 1: sent 262144, skipped 0, delta 4734ms, dom0 90%, target 0%, sent 1814Mb/s, dirtied 0Mb/s 0 pages

versus

Saving memory pages: iter 1 0%ERROR Internal error: Fatal PT race (pfn 400, type 10000000)
I did a few different experiments and found that the second-save bug only occurs if the guest is booted with affinity to the CPUs of the last of the four nodes (node 3). Affinity to those CPUs gives the guest a loose affinity to the memory of node 3, because the Xen hypervisor prefers memory located on the node of the currently executing processor. Therefore the guest likely had its pages mapped within the address range of that node. Xen's boot logs show the node address range mappings:

(XEN) SRAT: Node 0 PXM 0 0-c0000000
(XEN) SRAT: Node 0 PXM 0 0-4040000000
(XEN) SRAT: Node 1 PXM 1 4040000000-8040000000
(XEN) SRAT: Node 2 PXM 2 8040000000-c040000000
(XEN) SRAT: Node 3 PXM 3 c040000000-10040000000

Now, the "ERROR Internal error: Fatal PT race (pfn 400, type 10000000)" error occurs when we compute a bad mfn; to be precise, when we fail this test:

#define MFN_IS_IN_PSEUDOPHYS_MAP(_mfn) \
    (((_mfn) < (max_mfn)) && \
     ((mfn_to_pfn(_mfn) < (p2m_size)) && \
      (pfn_to_mfn(mfn_to_pfn(_mfn)) == (_mfn))))

I've got to run for the weekend now, but my hunch is that we computed an addr in the higher part of node3's range (which goes up to 10040000000) and that failed a max test. Note that MFN_MASK_X86 is only ffffffffff, and 10040000000 > ffffffffff, so that's consistent with the "greater than max" theory.
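To make the failure mode concrete, here is a minimal standalone C sketch of that check. The toy p2m/m2p tables, the max_mfn value, and the helper names are made up for illustration (this is not the real libxc code); it only shows how a frame number derived from an address near the top of node3's range fails the first comparison, which is the case the save code reports as a fatal PT race.

/*
 * Simplified sketch of the MFN_IS_IN_PSEUDOPHYS_MAP check quoted above.
 * All values below are illustrative assumptions, not real host data.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12
#define P2M_SIZE    4                    /* toy guest with 4 pseudo-physical pages */
#define INVALID_PFN (~0ULL)

/* Toy pfn -> mfn table standing in for the guest's p2m. */
static const uint64_t p2m[P2M_SIZE] = { 0x100, 0x101, 0x102, 0x103 };
static const uint64_t max_mfn = 0x200;   /* made-up host maximum for illustration */

static uint64_t mfn_to_pfn(uint64_t mfn)          /* reverse (m2p) lookup */
{
    for (uint64_t pfn = 0; pfn < P2M_SIZE; pfn++)
        if (p2m[pfn] == mfn)
            return pfn;
    return INVALID_PFN;
}

static uint64_t pfn_to_mfn(uint64_t pfn)
{
    return pfn < P2M_SIZE ? p2m[pfn] : INVALID_PFN;
}

/* Same three conditions as the MFN_IS_IN_PSEUDOPHYS_MAP macro. */
static int mfn_is_in_pseudophys_map(uint64_t mfn)
{
    return mfn < max_mfn &&
           mfn_to_pfn(mfn) < P2M_SIZE &&
           pfn_to_mfn(mfn_to_pfn(mfn)) == mfn;
}

int main(void)
{
    uint64_t good_mfn = 0x102;                           /* inside the toy map */
    uint64_t high_mfn = 0x10040000000ULL >> PAGE_SHIFT;  /* frame at the top of node3's range */

    printf("mfn 0x%" PRIx64 " in map: %d\n",
           good_mfn, mfn_is_in_pseudophys_map(good_mfn));
    printf("mfn 0x%" PRIx64 " in map: %d  (this is where the save code would report a Fatal PT race)\n",
           high_mfn, mfn_is_in_pseudophys_map(high_mfn));
    return 0;
}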
Capping the usable memory to <= 1T using the mem= Xen command line option resolves this. I don't know what we've officially announced as the amount of memory Xen hosts support, but it looks like there's a real limit at 1T, so we should double-check that we announce 1T or less. To avoid issues like this one, we can backport the following patch and initialize availmem to our announced limit.

commit 077b04c0350e1bc20a41b06dad49a3e2a72ca8c8
Author: Keir Fraser <keir.fraser>
Date:   Wed Jul 21 08:43:35 2010 +0100

    x86: New boot option availmem= to limit usable system RAM.

    Unlike mem=, this specifies the limit on usable RAM, rather than a
    limit on maximum physical address of RAM.
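For reference, a rough sketch of what the mem= workaround looks like in the host's grub entry. The paths, file names, and kernel/initrd versions below are assumptions to be adjusted for the installed kernel-xen build, and the 1024G value assumes this Xen build accepts the G size suffix:

# /boot/grub/grub.conf (illustrative entry only; adjust versions and paths)
title Red Hat Enterprise Linux Server (2.6.18-355.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-355.el5 numa=on mem=1024G
        module /vmlinuz-2.6.18-355.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-355.el5xen.img

As the commit message above notes, mem= caps the maximum physical address of RAM, while the backported availmem= option caps total usable RAM, which is why availmem is proposed for the permanent fix.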
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
Patch(es) available in kernel-2.6.18-360.el5.

You can download this test kernel (or newer) from http://people.redhat.com/plougher/el5/

Detailed testing feedback is always welcome. If you require guidance regarding testing, please ask the bug assignee.
Using machine hp-dl580g7-01

Reproduced: kernel-xen-2.6.18-359.el5

Steps:
0. Boot host with kernel parameter numa=on
1. Launch PV guest
2. xm save vm1 vm1.save
3. xm restore vm1.save
4. rm vm1.save -f
5. xm save vm1 vm1.save

# xm save vm1 vm1.save
Error: /usr/lib64/xen/bin/xc_save 4 14 0 0 0 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.

Verified: kernel-xen-2.6.18-363.el5
Same steps as above, no error output.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1348.html