Description of problem:

In a PV guest, a process randomly flipping the online status of its VCPUs in a busy loop interferes with the "suspend" kthread. The latter tries to bring down VCPUs and then suspend the VM in a controlled manner. At resumption, after returning from the hypercall, the "suspend" kthread fixes up the environment, and tries to hotplug the VCPUs as the final step. The racing user-space process invalidates the "suspend" thread's assumptions about the online status of VCPUs.

Version-Release number of selected component (if applicable):

Tested with the RHEL-5.3 guest kernel (-128) and the most recent RHEL-5 build (-250 as of now), in a RHEL-5.3 guest installation.

How reproducible:

Practically always; the race manifests after a few iterations of the host loop (see below).

Steps to Reproduce:

1. Create a 32-bit RHEL-5.3 PV guest with at least 4 VCPUs. (The problem is very likely not 32-bit specific, but this is how we reproduced it.)

2. In the guest, run the (to-be-attached) "cpu-hotwalk" Python script, over "xm console". This script randomly picks a VCPU between VCPU_1 and VCPU_{N-1} in a single-threaded busy loop and flips its online status. (VCPU_0 cannot be offlined.)

3. In the host, run the following save-restore loop:

   for ((K=0; K<100; K++)); do
     echo cycle $K
     (xm save rhel53-32pv rhel53-32pv.save \
        && xm restore rhel53-32pv.save) || break
   done

Actual results:

** When the guest kernel is -128, the guest kernel console displays the following KERN_CRIT message:

Failed to take all CPUs down: -22.

On the host, the "xm save" command hangs, and xend logs

DEBUG (XendCheckpoint:175) [xc_save]: /usr/lib64/xen/bin/xc_save 4 3 0 0 0
DEBUG (XendCheckpoint:449) suspend
DEBUG (XendCheckpoint:178) In saveInputHandler suspend
DEBUG (XendCheckpoint:180) Suspending 3 ...
DEBUG (XendDomainInfo:1298) XendDomainInfo.handleShutdownWatch
DEBUG (XendDomainInfo:1298) XendDomainInfo.handleShutdownWatch

** When the guest kernel is -250, the save completes, but the restore bugs out the guest kernel. A few iterations of the host loop may go through, but the race triggers quickly. The guest crash logs for two attempts:

#1:

------------[ cut here ]------------
kernel BUG at drivers/xen/core/evtchn.c:839!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /kernel/kexec_crash_loaded
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh scsi_mod parport_pc lp parport pcspkr xennet dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0061:[<c055d474>]    Not tainted VLI
EFLAGS: 00010013   (2.6.18-250.el5xen #1)
EIP is at irq_resume+0x68/0x276
eax: c11e566c   ebx: 00a8b24c   ecx: 00000003   edx: 00000000
esi: 0000003a   edi: 00000000   ebp: 00000000   esp: c4c19efc
ds: 007b   es: 007b   ss: 0069
Process suspend (pid: 19238, ti=c4c19000 task=cd9c7aa0 task.ti=c4c19000)
Stack: c4c19f74 18722067 00000000 cf000000 c068c100 c0f4c018 cf002000 cf002000
       00000000 c1000000 c0f4d3c0 cf000000 c068c100 18722067 00000000 18722067
       00000000 cf002000 cf002000 00000002 00002000 c046f72c 00000002 00002000
Call Trace:
 [<c046f72c>] kfree+0xe/0x80
 [<c055d9f6>] gnttab_map+0xef/0xfa
 [<c055ee08>] __do_suspend+0x30f/0x46a
 [<c041a1d1>] complete+0x2b/0x3d
 [<c055eaf9>] __do_suspend+0x0/0x46a
 [<c0431e0e>] kthread+0xc0/0xee
 [<c0431d4e>] kthread+0x0/0xee
 [<c0403005>] kernel_thread_helper+0x5/0xb
 =======================
Code: 01 00 00 75 e4 b8 e0 a5 86 c0 e8 7c de f8 ff eb 52 85 c9 74 42 8b 1c 8d 80 6b 72 c0 31 d2 b8 20 a4 75 c0 01 d8 83 3c 90 ff 74 0a <0f> 0b 47 03 29 c0 65 c0 eb fe 42 83 fa 18 75 e3 b8 80 a4 75 c0
EIP: [<c055d474>] irq_resume+0x68/0x276 SS:ESP 0069:c4c19efc
<0>Kernel panic - not syncing: Fatal exception

#2, only diff to the previous:

5c5
< last sysfs file: /kernel/kexec_crash_loaded
---
> last sysfs file: /devices/system/cpu/cpu2/online
17,18c17,18
< eax: c11e566c   ebx: 00a8b24c   ecx: 00000003   edx: 00000000
< esi: 0000003a   edi: 00000000   ebp: 00000000   esp: c4c19efc
---
> eax: c11dfaa8   ebx: 00a85688   ecx: 00000002   edx: 00000000
> esi: 0000003a   edi: 00000000   ebp: 00000000   esp: c43abefc
20,22c20,22
< Process suspend (pid: 19238, ti=c4c19000 task=cd9c7aa0 task.ti=c4c19000)
< Stack: c4c19f74 18722067 00000000 cf000000 c068c100 c0f4c018 cf002000 cf002000
<        00000000 c1000000 c0f4d3c0 cf000000 c068c100 18722067 00000000 18722067
---
> Process suspend (pid: 16385, ti=c43ab000 task=cd347aa0 task.ti=c43ab000)
> Stack: c43abf74 18096067 00000000 cf000000 c068c100 c0f4c018 cf002000 cf002000
>        00000000 c1000000 c0f4d3c0 cf000000 c068c100 18096067 00000000 18096067
37c37
< EIP: [<c055d474>] irq_resume+0x68/0x276 SS:ESP 0069:c4c19efc
---
> EIP: [<c055d474>] irq_resume+0x68/0x276 SS:ESP 0069:c43abefc

Expected results:

Save-restore should work. The VCPU-fuzzing process should not interfere with suspension, and should normally continue after resumption.
Created attachment 487271 [details]
cpu-hotwalk script, turning VCPUs on and off, to be run in the PV guest

This Python script enables and disables VCPUs randomly, via the sysfs interface.
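(Aside, for readers without access to the attachment: below is a minimal C equivalent of what such a fuzzer does. It is illustrative only, not the attached Python script; the CPU-count argument and the lack of error reporting are arbitrary choices.)

/* Illustrative sketch only -- the real reproducer is the attached Python
 * script. Busy-loop flipping random VCPUs' online status via sysfs; run as
 * root inside the guest. Build: gcc -O2 -o cpu-flip cpu-flip.c */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static void flip(int cpu)
{
	char path[64], state = '1';
	int fd;

	snprintf(path, sizeof(path),
	         "/sys/devices/system/cpu/cpu%d/online", cpu);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return;
	if (read(fd, &state, 1) != 1)
		state = '1';
	close(fd);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return;
	state = (state == '0') ? '1' : '0';	/* offline -> online and back */
	(void)write(fd, &state, 1);
	close(fd);
}

int main(int argc, char **argv)
{
	int ncpus = (argc > 1) ? atoi(argv[1]) : 4;	/* VCPUs in the guest */

	srand(time(NULL));
	for (;;)
		flip(1 + rand() % (ncpus - 1));	/* VCPU0 cannot be offlined */
	return 0;
}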
Analysis of the -128 (08139c1) case.

The Python program exercising the sysfs interface enters the kernel at store_online(), in "drivers/base/cpu.c". The functions it calls are (depending on the value written) cpu_down() and cpu_up(). Let's consider cpu_down(); see "kernel/cpu.c".

cpu_down() acquires the "cpu_add_remove_lock", and under the protection of this lock calls _cpu_down(). _cpu_down(), among other things, checks whether the CPU to be downed is still online (by testing the corresponding bit in the "cpu_online_map" extern variable, defined in "drivers/xen/core/smpboot.c"). If not, -EINVAL is returned. If the VCPU is online, notifications are sent out (see the protocol in "Documentation/cpu-hotplug.txt"), and if no registered callback vetoes the downing, _cpu_down() additionally takes "cpu_bitmask_lock", and disables the VCPU by calling __stop_machine_run(). More precisely, __stop_machine_run() initiates the disabling of the VCPU asynchronously, and -- after releasing "cpu_bitmask_lock" -- _cpu_down() waits in a loop for this procedure to complete, then "actually kills the CPU" with __cpu_die(). Some notifications are done afterwards, the result code is returned to cpu_down(), which releases "cpu_add_remove_lock", and propagates the result code outwards.

This was one contestant in the race. The other contestant is the "suspend" kernel thread (= the __do_suspend() action routine), spawned by the __shutdown_handler() workqueue routine in "drivers/xen/core/reboot.c", in response to the "xm save" request. The __do_suspend() function has the following bird's-eye structure:

- first of all, it calls smp_suspend(),
- does several further "setup" steps,
- calls HYPERVISOR_suspend(), which won't return in the guest until it is restored,
- when HYPERVISOR_suspend() returns, the "setup steps" are undone in reverse order (one of these is irq_resume(), see later in the -250 analysis),
- and finally smp_resume() is called.

The smp_suspend() and smp_resume() calls intend to put the other steps in a nice "bracket", where only VCPU0 is online -- smp_suspend() downs all VCPUs except VCPU0, and smp_resume() hotplugs them all. (This second "all" means the "cpu_possible_map", which is set, I presume, via xenstore.)

If smp_suspend(), the very first step, fails, the rest of the protocol is not done, and __do_suspend() (the "suspend" kernel thread's action routine) returns smp_suspend()'s error. Since the __shutdown_handler() wq routine is not waiting for this thread, the error (and probably the thread exit status) gets lost, and the host's "xm save" request hangs, because due to the early error no callback is made to it.

The message "Failed to take all CPUs down: -22." is logged by smp_suspend(). Let's see why it fails. See "drivers/xen/core/cpu_hotplug.c":

123 int smp_suspend(void)
124 {
125 	int cpu, err;
126 
127 	for_each_online_cpu(cpu) {
128 		if (cpu == 0)
129 			continue;
130 		err = cpu_down(cpu);
131 		if (err) {
132 			printk(KERN_CRIT "Failed to take all CPUs "
133 			       "down: %d.\n", err);
134 			for_each_possible_cpu(cpu)
135 				vcpu_hotplug(cpu);
136 			return err;
137 		}
138 	}
139 
140 	return 0;
141 }

The for_each_online_cpu() macro iterates over the set bits of "cpu_online_map". The loop skips VCPU0. We then call cpu_down() (see its workings described above) for each remaining online VCPU.

The race triggers when the guest process (contestant #1) downs a VCPU *between* the for_each_online_cpu() macro testing the VCPU for online-ness, and _cpu_down() re-testing it for online-ness.
_cpu_down() returns -EINVAL (see above), and this is why -22 is logged in the message quoted earlier. (The CPU maps themselves present no visibility problem; for example, the "cpu_set()" primitive ends up as a LOCK_PREFIX btsl assembly instruction.)

See the analysis of the -250 case in the next comment.
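(Purely illustrative aside: a hypothetical user-space analogue of this check-then-act window, using pthreads and a toy bitmap rather than anything from the kernel. One thread plays the role of cpu-hotwalk, the other the role of smp_suspend(); build with gcc -O2 -pthread.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NCPUS 4
static atomic_uint online_map = (1u << NCPUS) - 1;   /* toy cpu_online_map */

/* Stand-in for the guest fuzzer: randomly flip bits 1 .. NCPUS-1. */
static void *hotwalk(void *arg)
{
	(void)arg;
	for (;;) {
		unsigned cpu = 1 + rand() % (NCPUS - 1);
		atomic_fetch_xor(&online_map, 1u << cpu);
	}
	return NULL;
}

/* Stand-in for _cpu_down(): fails with -22 if the bit is already clear. */
static int toy_cpu_down(unsigned cpu)
{
	if (!(atomic_load(&online_map) & (1u << cpu)))
		return -22;                          /* -EINVAL */
	atomic_fetch_and(&online_map, ~(1u << cpu));
	return 0;
}

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, hotwalk, NULL);

	for (unsigned iter = 0; ; iter++) {
		unsigned snapshot = atomic_load(&online_map);  /* for_each_online_cpu() */
		for (unsigned cpu = 1; cpu < NCPUS; cpu++) {
			if (!(snapshot & (1u << cpu)))
				continue;
			int err = toy_cpu_down(cpu);   /* re-checks the live map */
			if (err) {
				printf("iteration %u: failed to take CPU%u down: %d\n",
				       iter, cpu, err);
				return 0;
			}
		}
		atomic_fetch_or(&online_map, (1u << NCPUS) - 2);  /* "smp_resume()" */
	}
}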
Analysis of the -250 (c878e17) case.

The contenders are unchanged, only the race triggers in a different place.

Referring back to the "bracket" setup in comment 2, irq_resume() is called after the save has completed and the guest is resumed. Note that at this point the saved image might already be corrupt (because after smp_suspend() returned successfully, the Python VCPU fuzzer was still running on VCPU0 and could have enabled other VCPUs, violating the "bracket invariant" in __do_suspend()). Or, we may have saved a correct image, but at resumption time the Python program could re-enable (hotplug) some VCPUs much earlier than the final smp_resume() call that intends to complete the "bracket".

Either way, when irq_resume() runs deep within the bracket, on the restore path, it assumes that all VCPUs except VCPU0 are disabled. The panic:

kernel BUG at drivers/xen/core/evtchn.c:839!

is triggered by a check (an assert) that verifies whether this is in fact the case:

818 void irq_resume(void)
819 {

834 	/* Secondary CPUs must have no VIRQ or IPI bindings. */
835 	for_each_possible_cpu(cpu) {
836 		if (cpu == 0)
837 			continue;
838 		for (virq = 0; virq < NR_VIRQS; virq++)
839 			BUG_ON(per_cpu(virq_to_irq, cpu)[virq] != -1);
840 		for (ipi = 0; ipi < NR_IPIS; ipi++)
841 			BUG_ON(per_cpu(ipi_to_irq, cpu)[ipi] != -1);
842 	}

The code skips VCPU0, because it knows it's online -- the code runs *on* VCPU0.

Here is the concise path by which the concurrent Python program, upping a VCPU, violates the checks on lines 839 and 841:

drivers/base/cpu.c
    store_online() calls cpu_up()
kernel/cpu.c
    cpu_up() calls _cpu_up()
    _cpu_up() calls __cpu_up()
drivers/xen/core/smpboot.c
    __cpu_up() calls xen_smp_intr_init()

and xen_smp_intr_init() modifies, directly and indirectly, the arrays pointed to by per_cpu(...).
Therefore the problem is missing mutual exclusion.

The first idea was to fix the locking only in smp_suspend(), that is, to "lock together" the for_each_online_cpu() check and the manipulation of the online map. Since recursive mutexen are not supported (see mutex_lock() in "kernel/mutex.c"), we would have to extract the locking (at least that of "cpu_add_remove_lock") from cpu_down(), and push the onus / opportunity to acquire it to all callers. smp_suspend() would then take the lock outside of the loop.

Unfortunately, there is a problem with this idea. It is not enough to lock the maps for the duration of smp_suspend(), since smp_suspend() only starts the bracket. The lock must be held for the entire lifetime of the bracket, until after smp_resume() completes.

Indeed, the previous version (23a376d) of smp_suspend() and smp_resume() did use exactly such a (recursive) locking scheme ("drivers/xen/core/cpu_hotplug.c"):

123 int smp_suspend(void)
124 {
125 	int i, err;
126 
127 	lock_cpu_hotplug();
128 
129 	/*
130 	 * Take all other CPUs offline. We hold the hotplug mutex to
131 	 * avoid other processes bringing up CPUs under our feet.
132 	 */
133 	while (num_online_cpus() > 1) {
134 		unlock_cpu_hotplug();
135 		for_each_online_cpu(i) {
136 			if (i == 0)
137 				continue;
138 			err = cpu_down(i);
139 			if (err) {
140 				printk(KERN_CRIT "Failed to take all CPUs "
141 				       "down: %d.\n", err);
142 				for_each_possible_cpu(i)
143 					vcpu_hotplug(i);
144 				return err;
145 			}
146 		}
147 		lock_cpu_hotplug();
148 	}
149 
150 	return 0;
151 }

"lock_cpu_hotplug()" wraps recursion around acquiring "cpu_bitmask_lock" (which is the inner lock taken by cpu_down()). The above code takes the mutex only for the duration of num_online_cpus() (which counts the number of set bits in the online map), then releases the lock, since cpu_down() does not use recursive locking. All the above achieves is that if a CPU is *re-enabled* after the for_each_online_cpu() loop disabled it, smp_suspend() notices and retries the entire loop. If we succeed, we *keep* the mutex locked, so it is held across the entire bracket, even while we call the hypervisor, and when we return from that call. Then,

153 void smp_resume(void)
154 {
155 	int cpu;
156 
157 	for_each_possible_cpu(cpu)
158 		cpu_initialize_context(cpu);
159 
160 	unlock_cpu_hotplug();
161 
162 	for_each_possible_cpu(cpu)
163 		vcpu_hotplug(cpu);
164 }

So why did this not work out? See this message (partially backported to RHEL-5 as c1515ea, which is still the current version as of -250):

http://lists.xensource.com/archives/html/xen-changelog/2007-02/msg00414.html

----v----
linux: More save/restore fixes.

Fix deadlock of cpu_hotplug_lock vs workqueue_mutex. This is a new deadlock since the workqueue_mutex is acquired in the workqueue_cpu_calbback() function across CPU_UP_PREPARE->CPU_ONLINE. The fix is for us not to rudely grab the cpu_hotplug_lock() during save/restore -- it's really not necessary.

This patch is applicable to any of our 2.6 kernels, but is absolutely required from 2.6.18 onwards.
----^----

This was committed upstream as c/s 14174:c3c03089c59e, see

http://xenbits.xensource.com/hg/xen-unstable.hg/rev/c3c03089c59e

Note that in c1515ea we backported only the parts that remove the calls to [un]lock_cpu_hotplug(). In comparison, the upstream changeset also adds something in exchange (see the second hunk): it wraps smp_suspend() in a loop with preempt_disable(), and checks whether the number of online CPUs actually went down to 1. If so, it breaks out of the loop (with preempt_disable() still in effect).
Now, preempt_disable() is not in effect for the whole rest of the bracket, but presumably, if we made sure that num_online_cpus() == 1 while preemption was disabled, then after

	mm_pin_all();
	local_irq_disable();
	preempt_enable();

it might not be possible for the Python program to interfere anymore.

Additionally, "Documentation/cpu-hotplug.txt" mentions preempt_disable():

----v----
#include <linux/cpu.h>
lock_cpu_hotplug() and unlock_cpu_hotplug():

The above calls are used to inhibit cpu hotplug operations. While holding the
cpucontrol mutex, cpu_online_map will not change. If you merely need to avoid
cpus going away, you could also use preempt_disable() and preempt_enable()
for those sections. Just remember the critical section cannot call any
function that can sleep or schedule this process away. The preempt_disable()
will work as long as stop_machine_run() is used to take a cpu down.
----^----

In the end, I have no better idea than to backport this small hunk. (Note that the relative order of xenbus_suspend() and preempt_disable() is reversed between the patch and the current RHEL kernel, so that must be investigated as well.) I have no clue whether (a) it will be sufficient, and (b) it will not introduce deadlocks. I certainly could not *prove* the correctness of using preempt_disable() for this purpose.

The current status of upstream is close to useless for us; suspend seems to have been completely redesigned. Please examine whether we should risk touching this, or whether we should just "verbally discourage" this behavior in guests.

As a further point, bug 497080 ("Backport Xen suspend cancellation bits") affects precisely the same code.

Thank you.
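(For orientation only, here is a rough, non-verbatim sketch of the retry loop that the upstream hunk adds around smp_suspend(), as I read the changeset description. The upstream ordering is shown, with xenbus_suspend() / xenbus_suspend_cancel() outside the preempt-disabled region; the current RHEL code calls xenbus_suspend() inside preempt_disable(), so an actual backport would differ.)

	/* Sketch only, not the upstream code verbatim: keep taking the other
	 * VCPUs down until, with preemption disabled, VCPU0 is the only one
	 * left online; leave the loop with preemption still disabled so the
	 * rest of the suspend bracket cannot be scheduled away from. */
	for (;;) {
		err = smp_suspend();
		if (err)
			return err;

		xenbus_suspend();
		preempt_disable();

		if (num_online_cpus() == 1)
			break;	/* invariant holds; stay preempt-disabled */

		/* A racing hotplug brought a VCPU back up: back off, undo
		 * the xenbus suspension, and retry the whole thing. */
		preempt_enable();
		xenbus_suspend_cancel();
	}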
Another note: it seems to me that it is really not supported to lock the map of online VCPUs, iterate over it, and flip bits as you go. Whatever I have seen till now always locks the map only for a cursory glance. If we like what we see, we go on. If not, we release the "looking mutex", and retry the entire loop. Now suppose that we have a multi-threaded C version of the Python program -- how long will the "suspend" guest kernel thread have to try to outrun this guest process?

Perhaps we could just say:

- Write access to the sysfs CPU online/offline interface needs root privileges (or some capability).

- Being root in the guest (or having this capability in the guest) and running such an arguably malicious program qualifies as deliberately messing with the guest kernel.

- If you cripple the suspend functionality of your guest kernel on purpose, don't be surprised if it can't be suspended.

What are your thoughts? Thank you.
Long message, short answer. :)

Can you whip up a backport of what remains of c/s 14174? I believe that we should backport it even if it means that a race could delay the suspension arbitrarily.

Note that (unlike the kexec case in bug 667340, for example) delaying suspension is a bit uglier than what you say in comment 5: a guest can delay save/restore/migration arbitrarily by making sure that it cannot suspend. On the other hand, I believe that it is preferable to a guest crash.
Actually, I think this bug can be reproduced without twiddling CPUs online and offline. In fact, it seems to depend on the workload of the host! The only way I can repro it is doing the suspend/resume loop while another VM is being migrated to and from the host. I'm not 100% sure it is the exact same bug, but it looks very similar.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c04e6544
00db8000 -> *pde = 00000002:1a097027
16f63000 -> *pme = 00000002:1bf23067
150d7000 -> *pte = 00000000:00000000
Oops: 0000 [#1]
SMP
last sysfs file: /block/dm-1/range
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh scsi_mod parport_pc lp parport pcspkr xennet dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0061:[<c04e6544>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-164.el5xen #1)
EIP is at __first_cpu+0x0/0x1a
eax: 00000010   ebx: 00000001   ecx: 00000001   edx: c0771978
esi: 00000000   edi: 00000000   ebp: 00000000   esp: c037af18
ds: 007b   es: 007b   ss: 0069
Process suspend (pid: 2398, ti=c037a000 task=dfe87550 task.ti=c037a000)
Stack: c040e36d 00000001 c0771978 c0618fa8 c06870e4 00000000 c037af30 c1408e0c
       c06881d4 c06837e0 00000001 00000007 c061a9fd 00000001 00000000 dffaf550
       dffaf550 c04354f3 00000001 fffffffd 00000001 c0e77f34 c0557251 00000000
Call Trace:
 [<c040e36d>] cacheinfo_cpu_callback+0x324/0x38a
 [<c0618fa8>] _spin_lock_irq+0x8/0x18
 [<c061a9fd>] notifier_call_chain+0x19/0x29
 [<c04354f3>] cpu_down+0x146/0x1f4
 [<c0557251>] __do_suspend+0x0/0x458
 [<c0556a89>] smp_suspend+0x19/0x70
 [<c05572a4>] __do_suspend+0x53/0x458
 [<c04190c5>] complete+0x2b/0x3d
 [<c0557251>] __do_suspend+0x0/0x458
 [<c042fe35>] kthread+0xc0/0xeb
 [<c042fd75>] kthread+0x0/0xeb
 [<c0403005>] kernel_thread_helper+0x5/0xb
 =======================
Code: 83 c2 04 39 c1 7c f5 85 ff 78 0d 01 fe eb 04 46 48 74 05 3b 34 24 7c b0 8d 46 ff 89 45 00 8b 44 24 04 5a 59 5b 5e 5f 5d c3 90 90 <8b> 00 ba 20 00 00 00 85 c0 74 03 0f bc d0 83 fa 20 b8 20 00 00
EIP: [<c04e6544>] __first_cpu+0x0/0x1a SS:ESP 0069:c037af18
<0>Kernel panic - not syncing: Fatal exception
Actually, it could be a different race of the same kind? But I'm not sure what the other contender would be, or why it appears to be tickled by activity on the host and not in the guest.
Created attachment 488125 [details]
Discover CPU onlining/offlining race and retry. Attempt #1.

Backport the remaining two hunks of xen-unstable c/s 14174:c3c03089c59e (the "machine_reboot.c" changes), on top of c1515ea.

http://xenbits.xensource.com/hg/xen-unstable.hg/rev/c3c03089c59e#l2.1

The first hunk overwrites the assignment to "cpu_initialized_map" in post_suspend(). The new map will explicitly contain VCPU#0 only.

The second hunk differs a bit from upstream. Upstream has xenbus_suspend() *outside* preempt_disable(). On retry, their loop exercises xenbus_suspend() and xenbus_suspend_cancel(), both outside of preempt_disable(), as stated above.

This attempt at a backport keeps the RHEL characteristic of calling xenbus_suspend() inside preempt_disable(). We also call it only once, after the loop has completed, so there should be no need to call xenbus_suspend_cancel() at all.

This backport basically wraps the smp_suspend() call in a loop, the condition being a protected check of num_online_cpus(). When we exit the loop, this protection remains in place.

Currently brewing:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3208987

Please give it a quick glance if you will. Thanks.
Comment on attachment 488125 [details]
Discover CPU onlining/offlining race and retry. Attempt #1.

Looks good.
Unfortunately, the patch is not sufficient; the bug reproduces with it after some tries. It is probably the case that the preempt_disable() region is too small, and after it completes, the python program (already running on VCPU#0) can still re-enable other VCPUs. I'm not sure if the entire bracket can be put in preempt_disable(). "Documentation/cpu-hotplug.txt" states, "Just remember the critical section cannot call any function that can sleep or schedule this process away".

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/xen/core/evtchn.c:839
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/system/cpu/cpu1/online
CPU 0
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh scsi_mod parport_pc lp parport pcspkr xennet dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 21121, comm: suspend Not tainted 2.6.18-251.el5.cpuoff_race_bz690419_1xen #1
RIP: e030:[<ffffffff803b8104>]  [<ffffffff803b8104>] irq_resume+0x7d/0x2b8
RSP: e02b:ffff88002b5fddc0  EFLAGS: 00010013
RAX: ffff88000109bc60 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff88008090ba00
RBP: ffffffff803b99a4 R08: ffffffff8062e100 R09: 000000000000003e
R10: ffffffff808bc0c0 R11: 0000000000000200 R12: ffff88003f8e7ce0
R13: 0000000000000000 R14: ffff88003f8e7cd0 R15: ffffffff8029dfa9
FS:  00002af4afcde710(0000) GS:ffffffff8062e000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process suspend (pid: 21121, threadinfo ffff88002b5fc000, task ffff88003ec01860)
Stack:  ffffc20000002000 0000000000000000 0000000000002000 ffffffff8020bbd2
 0000000000000000 0000000000002000 0000000000000010 ffffffff803b870a
 00000002022c7ff0 ffffffff803b0000
Call Trace:
 [<ffffffff8020bbd2>] kfree+0x15/0xc7
 [<ffffffff803b870a>] gnttab_map+0x10f/0x11c
 [<ffffffff803b0000>] device_add+0x156/0x372
 [<ffffffff803b99a4>] __do_suspend+0x0/0x64e
 [<ffffffff803b9dda>] __do_suspend+0x436/0x64e
 [<ffffffff802889a4>] __wake_up_common+0x3e/0x68
 [<ffffffff80233996>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029dfa9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233898>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12

Code: 0f 0b 68 60 e3 4b 80 c2 47 03 eb fe ff c6 48 ff c1 83 fe 18
RIP  [<ffffffff803b8104>] irq_resume+0x7d/0x2b8
 RSP <ffff88002b5fddc0>
<0>Kernel panic - not syncing: Fatal exception
I think it would work to completely remove preemption at least until right after the suspend hypercall, or possibly something like

diff --git a/drivers/xen/core/reboot.c b/drivers/xen/core/reboot.c
index 060d4e5..eba4fe0 100644
--- a/drivers/xen/core/reboot.c
+++ b/drivers/xen/core/reboot.c
@@ -189,7 +189,6 @@ static int __do_suspend(void *ignore)
 
 	mm_pin_all();
 	local_irq_disable();
-	preempt_enable();
 
 	gnttab_suspend();
 
@@ -215,12 +214,14 @@ static int __do_suspend(void *ignore)
 
 	local_irq_enable();
 
+	preempt_enable();
+
+	smp_resume();
+
 	xencons_resume();
 
 	xenbus_resume();
 
-	smp_resume();
-
 	return err;
 }
 
 #endif /* CONFIG_XEN */

In fact, am I blind, or upstream linux-2.6.18.hg isn't calling preempt_enable after suspension at all (in drivers/xen/core/machine_reboot.c)?!?
(In reply to comment #12)
> In fact, am I blind, or upstream linux-2.6.18.hg isn't calling preempt_enable
> after suspension at all (in drivers/xen/core/machine_reboot.c)?!?

No, I think you're correct.
Applying the patch in comment 12 caused the following deadlock during "xm save":

----v----
BUG: soft lockup - CPU#0 stuck for 60s! [xenfb thread:258]
CPU 0:
Modules linked in: autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) xt_tcpudp(U) iptable_filter(U) ip_tables(U) ip6_tables(U) x_tables(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxgb3i(U) cxgb3(U) 8021q(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) loop(U) dm_multipath(U) scsi_dh(U) scsi_mod(U) parport_pc(U) lp(U) parport(U) xennet(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) xenblk(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 258, comm: xenfb thread Tainted: G      2.6.18.251_cpuoff_race_bz690419_2_local_xen #1
RIP: e030:[<ffffffff803c281c>]  [<ffffffff803c281c>] xenfb_thread+0x29c/0x2d1
RSP: e02b:ffff88003f5d3e10  EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffff88003f3a9720 RCX: 0000000000000320
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88003f3a9680 R08: ffff88003f5d2000 R09: 0000000000000000
R10: ffff88003f211320 R11: ffff88003f399820 R12: 0000000000000320
R13: 0000000000000250 R14: 0000000000000000 R15: 0000000000000000
FS:  00002af50cf3a710(0000) GS:ffffffff8062e000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Call Trace:
 [<ffffffff80248a57>] try_to_wake_up+0x392/0x3a4
 [<ffffffff8029e1c1>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8029dfa9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff803c2580>] xenfb_thread+0x0/0x2d1
 [<ffffffff8029dfa9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233996>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029dfa9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233898>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12
----^----

This message is repeated once per minute, with slightly varying register values.

The xend log ends with

[2011-03-29 06:27:16 xend 5422] DEBUG (XendCheckpoint:180) Suspending 3 ...
[2011-03-29 06:27:16 xend.XendDomainInfo 5422] DEBUG (XendDomainInfo:1298) XendDomainInfo.handleShutdownWatch

I will dump the core of the guest domain and try to look closer at the problem.
Running the guest without a virtual framebuffer (i.e. removing the vfb = [...] line from the config file) and repeating the test resulted in the following:

- No "soft lockup" message (comment 14).

- After the first save-resume, the second save attempt got stuck. Tail of xend.log:

  xend.XendDomainInfo 5476] DEBUG (XendDomainInfo:1298) \
      XendDomainInfo.handleShutdownWatch

- At this point the virtual machine has the name "migrating-rhel56pv". One can connect to its virtual console with "xm console", and the python program is running. However, it can't be interrupted with ^C.

- Pinging or attempting to ssh into the VM elicits no response. After a while there is no entry for the guest / bridge in the dom0 ARP cache.

- "xm shutdown" has no effect, except a further entry in xend.log.

Looks like a dead end to me. I suggest CANTFIX (--> comment 5), unless someone can come up with a brilliant, non-intrusive workaround. Thanks.
Hi Gianni,

(In reply to comment #7)
> Actually, I think this bug can be reproduced without twiddling CPUs online
> and offline. In fact, it seems to depend on the workload of the host! The
> only way I can repro it is doing the suspend/resume loop while another VM is
> being migrated to and from the host. I'm not 100% sure it is the exact same
> bug, but it looks very similar.

I believe this is an unrelated problem. The circumstances are different (a race within a single guest vs. two independent guests), and the symptoms seem to be different as well (BUG_ON() vs. NULL pointer dereference). The problems do have "suspend" in common though.

> EFLAGS: 00010246   (2.6.18-164.el5xen #1)

So this was on RHEL-5.4, right? Can you please try to reproduce the problem with a recent RHEL kernel? And if it's still there, would you please report this as a separate bug?

Till then, an idea (arch/i386/kernel/cpu/intel_cacheinfo.c):

> EIP is at __first_cpu+0x0/0x1a

> [<c040e36d>] cacheinfo_cpu_callback+0x324/0x38a
> [<c0618fa8>] _spin_lock_irq+0x8/0x18
> [<c061a9fd>] notifier_call_chain+0x19/0x29
> [<c04354f3>] cpu_down+0x146/0x1f4

cpu_down()
  --> ... --> cacheinfo_cpu_callback() (CPU_DEAD)
  --> cache_remove_dev():

	for (i = 0; i < num_cache_leaves; i++) {
		cache_remove_shared_cpu_map(cpu, i);
		/* ... */
	}

cache_remove_shared_cpu_map(cpu, index)
{
	this_leaf = CPUID4_INFO_IDX(cpu, index);
	for_each_cpu_mask(sibling, this_leaf->shared_cpu_map) {
		sibling_leaf = CPUID4_INFO_IDX(sibling, index);
		cpu_clear(cpu, sibling_leaf->shared_cpu_map);
	}
}

The for_each_cpu_mask() macro makes use of __first_cpu() in the initialization statement of the "for" loop it expands to. The faulting access is in find_first_bit().

> BUG: unable to handle kernel NULL pointer dereference at virtual address
> 00000010

struct _cpuid4_info {
	union _cpuid4_leaf_eax eax;
	union _cpuid4_leaf_ebx ebx;
	union _cpuid4_leaf_ecx ecx;
	unsigned long size;
	cpumask_t shared_cpu_map;
};

(long unsigned *)(&this_leaf->shared_cpu_map)->bits
    == (long unsigned *)this_leaf->shared_cpu_map.bits
    == (long unsigned *)(
           (char unsigned *)this_leaf
           + offsetof(struct _cpuid4_info, shared_cpu_map) /* 0x10 */
           + offsetof(cpumask_t, bits)                     /* 0x00 */
       )

So in cache_remove_shared_cpu_map(),

	this_leaf = CPUID4_INFO_IDX(cpu, index);

must have set "this_leaf" to NULL.

"cpuid4_info" is an array of pointers. It has one entry for each CPU. Each pointer points to an array of "struct _cpuid4_info":

/* pointer to _cpuid4_info array (for each cache leaf) */
static struct _cpuid4_info *cpuid4_info[NR_CPUS];
#define CPUID4_INFO_IDX(x,y)    (&((cpuid4_info[x])[y]))

A more readable macro definition would have been:

#define CPUID4_INFO_IDX(x,y)    (cpuid4_info[x] + (y))

I can see only one way this can return NULL:

- First, "y" must be 0 (so, cache_remove_shared_cpu_map() was called with "index" being 0, that is, in the first iteration of the loop in cache_remove_dev()).

- Second, cpuid4_info[x] must be NULL.

How the latter was possible should be analyzed in the new bug :)

Note that this code was extensively changed after RHEL-5.4. For example, "struct _cpuid4_info" was modified multiple times: it was extended (and re-extended) with "can_disable", among other things.

My working idea (hope) is now that this second, unrelated bug (i.e. the one in your comment 7) will go away if you try to reproduce it on RHEL-5.6. If not, please report a separate bug.

Thank you.
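(Addendum: as a quick cross-check of the 0x10 arithmetic above, here is a hypothetical stand-alone user-space program. The three unions are stubbed as plain 32-bit values, which is an assumption about their size; everything else mirrors the quoted struct. Built 32-bit, to match the i386 guest, the first offset comes out as 0x10.)

/* Sanity check for the 0x10 fault offset; user-space, not kernel code.
 * The unions are stubs: assumed to wrap one 32-bit CPUID register each.
 * Build 32-bit to match the guest: gcc -m32 -o offs offs.c && ./offs */
#include <stddef.h>
#include <stdio.h>

typedef unsigned int u32;

union _cpuid4_leaf_eax { u32 full; };	/* stub: 4 bytes assumed */
union _cpuid4_leaf_ebx { u32 full; };
union _cpuid4_leaf_ecx { u32 full; };

#define NR_CPUS 32
typedef struct {
	unsigned long bits[(NR_CPUS + 8 * sizeof(unsigned long) - 1)
	                   / (8 * sizeof(unsigned long))];
} cpumask_t;

struct _cpuid4_info {
	union _cpuid4_leaf_eax eax;
	union _cpuid4_leaf_ebx ebx;
	union _cpuid4_leaf_ecx ecx;
	unsigned long size;
	cpumask_t shared_cpu_map;
};

int main(void)
{
	/* Expect 0x10 on i386, where unsigned long is 4 bytes. */
	printf("offsetof(struct _cpuid4_info, shared_cpu_map) = %#zx\n",
	       offsetof(struct _cpuid4_info, shared_cpu_map));
	printf("offsetof(cpumask_t, bits)                     = %#zx\n",
	       offsetof(cpumask_t, bits));
	return 0;
}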
The second save is probably failing because the first resume hasn't reset the shutdown watch (so the domain doesn't react to the request to suspend). You can try xenctx to see where it is stuck.
(In reply to comment #12; for the record)

> -	preempt_enable();
> +	preempt_enable();

Turns out this makes no difference:

http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00158.html

This way the patch only makes smp_resume() happen earlier, an effect that cpu-hotwalk can produce anyway.
See also bug 589123, which has the same symptoms as "after my patch" (failure to suspend the second time) and would also be fixed by suspend cancellation.
See bug 497080 comment 7.

Additionally, suspend resumes cooperation from the guest, and a guest running "cpu-hotwalk" (which requires root in the guest) during suspend is suicidal.

WONTFIX.
(In reply to comment #20)
> suspend resumes cooperation from the guest

My brain short-circuited, apparently. I meant "suspension requires cooperation from the guest".