Bug 690419 - sysfs cpu online/offline race against suspend-resume
Summary: sysfs cpu online/offline race against suspend-resume
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Xen Maintenance List
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 497080
Blocks: 514490
 
Reported: 2011-03-24 10:13 UTC by Laszlo Ersek
Modified: 2011-06-27 10:52 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-15 12:44:02 UTC
Target Upstream Version:
Embargoed:


Attachments
cpu-hotwalk script, turning VCPUs on and off, to be run in the PV guest (1018 bytes, text/x-python)
2011-03-24 10:18 UTC, Laszlo Ersek
no flags
Discover CPU onlining/offlining race and retry. Attempt #1. (759 bytes, patch)
2011-03-28 10:27 UTC, Laszlo Ersek
pbonzini: review+

Description Laszlo Ersek 2011-03-24 10:13:03 UTC
**** Description of problem:

In a PV guest, a process randomly flipping the online status of its VCPUs in a busy loop interferes with the "suspend" kthread.

The latter tries to bring down VCPUs and then suspend the VM in a controlled manner. At resumption, after returning from the hypercall, the "suspend" kthread fixes up the environment, and tries to hotplug the VCPUs as the final step. The racing user-space process invalidates the "suspend" thread's assumptions about the online status of VCPUs.


****  Version-Release number of selected component (if applicable):

Tested with the RHEL-5.3 guest kernel (-128) and the most recent RHEL-5 kernel build (-250 as of now), both in a RHEL-5.3 guest installation.


**** How reproducible:

Practically always; the race manifests after a few iterations of the host loop (see below).


**** Steps to Reproduce:

1. Create a 32 bit RHEL-5.3 PV guest, with at least 4 VCPUs. (The problem is very likely not 32-bit specific, but it is how we reproduced it.)

2. In the guest, run the (to-be attached) "cpu-hotwalk" python script, over "xm console". In a single-threaded busy loop, the script randomly picks one of VCPU_1 .. VCPU_{N-1} and flips that VCPU's online status. (VCPU_0 cannot be offlined.)

3. In the host, run the following save-restore loop:

for ((K=0; K<100; K++)); do
  echo cycle $K
  (xm save rhel53-32pv rhel53-32pv.save \
  && xm restore rhel53-32pv.save) || break
done

  
****  Actual results:

** When the guest kernel is -128, the guest kernel console displays the following KERN_CRIT message:

    Failed to take all CPUs down: -22.

On the host, the "xm save" command hangs, and xend logs

DEBUG (XendCheckpoint:175) [xc_save]: /usr/lib64/xen/bin/xc_save 4 3 0 0 0
DEBUG (XendCheckpoint:449) suspend
DEBUG (XendCheckpoint:178) In saveInputHandler suspend
DEBUG (XendCheckpoint:180) Suspending 3 ...
DEBUG (XendDomainInfo:1298) XendDomainInfo.handleShutdownWatch
DEBUG (XendDomainInfo:1298) XendDomainInfo.handleShutdownWatch

** When the guest kernel is -250, the save completes, but the restore bugs out the guest kernel. A few iterations of the host loop may succeed, but the race triggers quickly. The guest crash logs for two attempts:

#1:

------------[ cut here ]------------
kernel BUG at drivers/xen/core/evtchn.c:839!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /kernel/kexec_crash_loaded
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh scsi_mod parport_pc lp parport pcspkr xennet dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0061:[<c055d474>]    Not tainted VLI
EFLAGS: 00010013   (2.6.18-250.el5xen #1)
EIP is at irq_resume+0x68/0x276
eax: c11e566c   ebx: 00a8b24c   ecx: 00000003   edx: 00000000
esi: 0000003a   edi: 00000000   ebp: 00000000   esp: c4c19efc
ds: 007b   es: 007b   ss: 0069
Process suspend (pid: 19238, ti=c4c19000 task=cd9c7aa0 task.ti=c4c19000)
Stack: c4c19f74 18722067 00000000 cf000000 c068c100 c0f4c018 cf002000 cf002000
       00000000 c1000000 c0f4d3c0 cf000000 c068c100 18722067 00000000 18722067
       00000000 cf002000 cf002000 00000002 00002000 c046f72c 00000002 00002000
Call Trace:
 [<c046f72c>] kfree+0xe/0x80
 [<c055d9f6>] gnttab_map+0xef/0xfa
 [<c055ee08>] __do_suspend+0x30f/0x46a
 [<c041a1d1>] complete+0x2b/0x3d
 [<c055eaf9>] __do_suspend+0x0/0x46a
 [<c0431e0e>] kthread+0xc0/0xee
 [<c0431d4e>] kthread+0x0/0xee
 [<c0403005>] kernel_thread_helper+0x5/0xb
 =======================
Code: 01 00 00 75 e4 b8 e0 a5 86 c0 e8 7c de f8 ff eb 52 85 c9 74 42 8b 1c 8d 80 6b 72 c0 31 d2 b8 20 a4 75 c0 01 d8 83 3c 90 ff 74 0a <0f> 0b 47 03 29 c0 65 c0 eb fe 42 83 fa 18 75 e3 b8 80 a4 75 c0
EIP: [<c055d474>] irq_resume+0x68/0x276 SS:ESP 0069:c4c19efc
 <0>Kernel panic - not syncing: Fatal exception

#2, only the diff against the previous:

5c5
< last sysfs file: /kernel/kexec_crash_loaded
---
> last sysfs file: /devices/system/cpu/cpu2/online
17,18c17,18
< eax: c11e566c   ebx: 00a8b24c   ecx: 00000003   edx: 00000000
< esi: 0000003a   edi: 00000000   ebp: 00000000   esp: c4c19efc
---
> eax: c11dfaa8   ebx: 00a85688   ecx: 00000002   edx: 00000000
> esi: 0000003a   edi: 00000000   ebp: 00000000   esp: c43abefc
20,22c20,22
< Process suspend (pid: 19238, ti=c4c19000 task=cd9c7aa0 task.ti=c4c19000)
< Stack: c4c19f74 18722067 00000000 cf000000 c068c100 c0f4c018 cf002000 cf002000 
<        00000000 c1000000 c0f4d3c0 cf000000 c068c100 18722067 00000000 18722067 
---
> Process suspend (pid: 16385, ti=c43ab000 task=cd347aa0 task.ti=c43ab000)
> Stack: c43abf74 18096067 00000000 cf000000 c068c100 c0f4c018 cf002000 cf002000 
>        00000000 c1000000 c0f4d3c0 cf000000 c068c100 18096067 00000000 18096067 
37c37
< EIP: [<c055d474>] irq_resume+0x68/0x276 SS:ESP 0069:c4c19efc
---
> EIP: [<c055d474>] irq_resume+0x68/0x276 SS:ESP 0069:c43abefc


**** Expected results:

Save-restore should work. The VCPU-fuzzing process should not interfere with suspension, and should continue normally after resumption.

Comment 1 Laszlo Ersek 2011-03-24 10:18:08 UTC
Created attachment 487271 [details]
cpu-hotwalk script, turning VCPUs on and off, to be run in the PV guest

This python script enables and disables VCPUs randomly, via the sysfs interface.
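For illustration, a minimal C equivalent of what the script does via sysfs might look like this (a hypothetical sketch, not the attachment itself; the NCPUS constant is an assumption and error handling is omitted):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define NCPUS 4		/* assumption: match the guest's VCPU count */

int main(void)
{
	srand(getpid());
	for (;;) {
		int cpu = 1 + rand() % (NCPUS - 1);	/* never touch VCPU0 */
		char path[64], val;
		int fd;

		snprintf(path, sizeof path,
		         "/sys/devices/system/cpu/cpu%d/online", cpu);
		fd = open(path, O_RDWR);
		if (fd < 0)
			continue;
		if (read(fd, &val, 1) == 1) {
			char flip = (val == '0') ? '1' : '0';
			lseek(fd, 0, SEEK_SET);
			write(fd, &flip, 1);	/* flip the online status */
		}
		close(fd);
	}
}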

Comment 2 Laszlo Ersek 2011-03-24 10:53:20 UTC
Analysis of the -128 (08139c1) case.

The python program exercising the sysfs interface enters the kernel at store_online(), in "drivers/base/cpu.c". Depending on the value written, it calls cpu_down() or cpu_up(). Let's consider cpu_down(); see "kernel/cpu.c".

cpu_down() acquires the "cpu_add_remove_lock", and under the protection of this lock calls _cpu_down(). _cpu_down(), among other things, checks whether the CPU to be taken down is still online (by testing the corresponding bit in the "cpu_online_map" extern variable, defined in "drivers/xen/core/smpboot.c"). If not, -EINVAL is returned. If the VCPU is online, notifications are sent out (see the protocol in "Documentation/cpu-hotplug.txt"), and if no registered callback vetoes the downing, _cpu_down() additionally takes "cpu_bitmask_lock" and disables the VCPU by calling __stop_machine_run().

More precisely, __stop_machine_run() initiates the disabling of the VCPU asynchronously, and -- after releasing "cpu_bitmask_lock" -- _cpu_down() waits in a loop for this procedure to complete, then "actually kills the CPU" with __cpu_die(). Some notifications are sent afterwards, the result code is returned to cpu_down(), which releases "cpu_add_remove_lock" and propagates the result code outwards.
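In (heavily simplified) code form, the structure just described -- a sketch only, not the actual RHEL-5 "kernel/cpu.c" source; notifier calls and the stop-machine details are reduced to comments:

static int _cpu_down(unsigned int cpu)
{
	if (!cpu_online(cpu))
		return -EINVAL;		/* this is where the -22 below comes from */

	/* CPU_DOWN_PREPARE notifier chain; any veto aborts the downing */

	mutex_lock(&cpu_bitmask_lock);
	/* __stop_machine_run() starts taking the VCPU down asynchronously */
	mutex_unlock(&cpu_bitmask_lock);

	/* wait for the teardown to complete, then __cpu_die(cpu),
	 * CPU_DEAD notifications, return the result */
	return 0;
}

int cpu_down(unsigned int cpu)
{
	int err;

	mutex_lock(&cpu_add_remove_lock);
	err = _cpu_down(cpu);
	mutex_unlock(&cpu_add_remove_lock);
	return err;
}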

This was one contestant in the race.

The other contestant is the "suspend" kernel thread (= the __do_suspend() action routine), spawned by the __shutdown_handler() workqueue routine in "drivers/xen/core/reboot.c" in response to the "xm save" request. The __do_suspend() function has the following bird's-eye view structure:

- first of all, calls smp_suspend(),
- does several further "setup" steps,
- calls HYPERVISOR_suspend(), which won't return in the guest until it is restored,
- when HYPERVISOR_suspend() returns, the "setup steps" are undone in reverse order (one of these is irq_resume(), see later in the -250 analysis),
- and finally smp_resume() is called.

The smp_suspend() and the smp_resume() calls intend to put the other steps in a nice "bracket", where only VCPU0 is online -- smp_suspend() downs all VCPUs except VCPU0, and smp_resume() hotplugs them all. (This second "all" means the "cpu_possible_map", which is set, I presume, via xenstore etc etc etc.)
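In code form, roughly (a simplified sketch assembled from the steps above, not the actual "drivers/xen/core/reboot.c" source; the HYPERVISOR_suspend() argument shown is an assumption):

static int __do_suspend(void *ignore)
{
	int err;

	err = smp_suspend();		/* opens the bracket: only VCPU0 stays online */
	if (err)
		return err;		/* early exit: "xm save" never gets its callback */

	/* several further setup steps (xenbus, grant tables, ...) */

	HYPERVISOR_suspend(virt_to_mfn(xen_start_info));	/* returns only after restore */

	/* the setup steps are undone in reverse order; one of them is irq_resume() */

	smp_resume();			/* closes the bracket: hotplug all possible VCPUs */
	return 0;
}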

If smp_suspend(), the very first step, fails, the rest of the protocol is not executed and __do_suspend() (the "suspend" kernel thread's action routine) returns smp_suspend()'s error. Since the __shutdown_handler() wq routine is not waiting for this thread, the error (and probably the thread exit status) gets lost, and the host's "xm save" request hangs, because the early error means no callback is ever made to it.

The message "Failed to take all CPUs down: -22." is logged by smp_suspend(). Let's see why it fails. See "drivers/xen/core/cpu_hotplug.c":

    123  int smp_suspend(void)
    124  {
    125    int cpu, err;
    126  
    127    for_each_online_cpu(cpu) {
    128      if (cpu == 0)
    129        continue;
    130      err = cpu_down(cpu);
    131      if (err) {
    132        printk(KERN_CRIT "Failed to take all CPUs "
    133               "down: %d.\n", err);
    134        for_each_possible_cpu(cpu)
    135          vcpu_hotplug(cpu);
    136        return err;
    137      }
    138    }
    139  
    140    return 0;
    141  }

The for_each_online_cpu() macro iterates over the set bits of "cpu_online_map". The loop skips VCPU0. We then call cpu_down() (whose workings are described above) for every other online VCPU.

The race triggers when the guest process (contestant #1) downs a VCPU *between* the for_each_online_cpu() macro testing the VCPU for online-ness, and _cpu_down() re-testing it for online-ness. _cpu_down() returns -EINVAL (see above), and this is why -22 is logged in the above message.

(The cpu maps themselves present no visibility problem; for example, the "cpu_set()" primitive ends up in a LOCK_PREFIX btsl assembly instruction.)

See the analysis of the -250 case in the next comment.

Comment 3 Laszlo Ersek 2011-03-24 11:15:54 UTC
Analysis of the -250 (c878e17) case.

The contenders are unchanged, only the race triggers in a different place.

Referring back to the "bracket" setup in comment 2, irq_resume() is called after the save has completed and the guest is resumed. Note that at this point the saved image might already be corrupt: after smp_suspend() returned successfully, the python VCPU fuzzer was still running on VCPU0 and could have enabled other VCPUs, violating the "bracket invariant" in __do_suspend(). Alternatively, we may have saved a correct image, but at resumption time the python program could re-enable (hotplug) some VCPUs much earlier than the final smp_resume() call that is meant to complete the "bracket".

Either way, when irq_resume() runs deep within the bracket, on the restore path, it assumes that all VCPUs modulo VCPU0 are disabled. The panic:

    kernel BUG at drivers/xen/core/evtchn.c:839!

is triggered by a check (an assert) that verifies if this is in fact the case:

    818  void irq_resume(void)
    819  {

    834    /* Secondary CPUs must have no VIRQ or IPI bindings. */
    835    for_each_possible_cpu(cpu) {
    836      if (cpu == 0)
    837        continue;
    838      for (virq = 0; virq < NR_VIRQS; virq++)
    839        BUG_ON(per_cpu(virq_to_irq, cpu)[virq] != -1);
    840      for (ipi = 0; ipi < NR_IPIS; ipi++)
    841        BUG_ON(per_cpu(ipi_to_irq, cpu)[ipi] != -1);
    842    }

The code skips VCPU0, because it knows it's online -- the code runs *on* VCPU0.

Here's the concise path along which the concurrent python program, upping a VCPU, violates the checks on lines 839 and 841:

drivers/base/cpu.c         store_online()      calls cpu_up()
kernel/cpu.c               cpu_up()            calls _cpu_up()
                           _cpu_up()           calls __cpu_up()
drivers/xen/core/smpboot.c __cpu_up()          calls xen_smp_intr_init()

and xen_smp_intr_init() modifies, directly and indirectly, the arrays pointed to by per_cpu(...).
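Roughly, the binding side looks like this (an illustrative sketch, not the exact "drivers/xen/core/evtchn.c" code; function and helper names are approximations, hence the "_sketch" suffix):

static int bind_ipi_to_irq_sketch(unsigned int ipi, unsigned int cpu)
{
	int irq = per_cpu(ipi_to_irq, cpu)[ipi];

	if (irq == -1) {
		/* allocate an IRQ and an event channel for this IPI ... */
		irq = find_unbound_irq();
		/* ... EVTCHNOP_bind_ipi hypercall, wire up the handler ... */
		per_cpu(ipi_to_irq, cpu)[ipi] = irq;	/* the map entry is no longer -1 */
	}
	return irq;
}

/* the VIRQ case is analogous: per_cpu(virq_to_irq, cpu)[virq] = irq */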

Comment 4 Laszlo Ersek 2011-03-24 12:06:07 UTC
Therefore the problem is missing mutual exclusion.

The first idea was to fix the locking only in smp_suspend(), that is, to "lock together" the for_each_online_cpu() check and the manipulation of the online map. Since recursive mutexes are not supported (see mutex_lock() in "kernel/mutex.c"), we would have to extract the locking (at least that of "cpu_add_remove_lock") from cpu_down(), and push the onus / opportunity to acquire it to all callers. smp_suspend() would then take the lock outside of the loop.

Unfortunately, there is a problem with this idea. It is not enough to lock the maps for the time of smp_suspend(), since smp_suspend() only starts the bracket. The lock must be held for the entire lifetime of the bracket, until after smp_resume() completes.

Indeed, the previous version (23a376d) of smp_suspend() and smp_resume() did use exactly such a (recursive) locking scheme ("drivers/xen/core/cpu_hotplug.c"):

    123  int smp_suspend(void)
    124  {
    125    int i, err;
    126  
    127    lock_cpu_hotplug();
    128  
    129    /*
    130     * Take all other CPUs offline. We hold the hotplug mutex to
    131     * avoid other processes bringing up CPUs under our feet.
    132     */
    133    while (num_online_cpus() > 1) {
    134      unlock_cpu_hotplug();
    135      for_each_online_cpu(i) {
    136        if (i == 0)
    137          continue;
    138        err = cpu_down(i);
    139        if (err) {
    140          printk(KERN_CRIT "Failed to take all CPUs "
    141                 "down: %d.\n", err);
    142          for_each_possible_cpu(i)
    143            vcpu_hotplug(i);
    144          return err;
    145        }
    146      }
    147      lock_cpu_hotplug();
    148    }
    149  
    150    return 0;
    151  }

"lock_cpu_hotplug()" wraps recursion around acquiring "cpu_bitmask_lock" (which is the inner lock taken by cpu_down()). The above code takes the mutex only for the time of num_online_cpus() (which counts the number of set bits in the online map), then releases the lock, since cpu_down() does not use recursive locking. All that the above achieves is that if we *re-enable* a CPU after the for_each_online_cpu() disabled it, it notice it, and will retry the entire loop again. If we succeed, we *keep* the mutex locked, so it is held across the entire bracket, even when we call the hypervisor, and when we return from that call. Then,

    153  void smp_resume(void)
    154  {
    155    int cpu;
    156  
    157    for_each_possible_cpu(cpu)
    158      cpu_initialize_context(cpu);
    159  
    160    unlock_cpu_hotplug();
    161  
    162    for_each_possible_cpu(cpu)
    163      vcpu_hotplug(cpu);
    164  }

So why did this not work out? See this message (partially backported to RHEL-5 as c1515ea, which is still the current version as of -250):

http://lists.xensource.com/archives/html/xen-changelog/2007-02/msg00414.html

----v----
linux: More save/restore fixes. Fix deadlock of cpu_hotplug_lock vs
workqueue_mutex. This is a new deadlock since the workqueue_mutex is
acquired in the workqueue_cpu_calbback() function across
CPU_UP_PREPARE->CPU_ONLINE.

The fix is for us not to rudely grab the cpu_hotplug_lock() during
save/restore -- it's really not necessary.

This patch is applicable to any of our 2.6 kernels, but is absolutely
required from 2.6.18 onwards.
----^----

This was committed to upstream as c/s 14174:c3c03089c59e, see

http://xenbits.xensource.com/hg/xen-unstable.hg/rev/c3c03089c59e

Note that in c1515ea we backported only the parts that remove the calls to [un]lock_cpu_hotplug(). In comparison, the upstream changeset also adds something in exchange (see the second hunk): it wraps smp_suspend() in a loop with preempt_disable(), and checks if the number of online CPUs actually went down to 1. If so, then it breaks out of the loop (with preempt_disable() in effect).
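For orientation, the shape of that second hunk, as described (a paraphrase of c/s 14174, not a verbatim quote -- see the xenbits link above for the authoritative version):

	for (;;) {
		err = smp_suspend();
		if (err)
			return err;

		xenbus_suspend();
		preempt_disable();

		if (num_online_cpus() == 1)
			break;			/* leave with preemption disabled */

		/* somebody onlined a CPU meanwhile: back off and retry */
		preempt_enable();
		xenbus_suspend_cancel();
	}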

Now, preempt_disable() is not in effect for the whole rest of the bracket, but presumably, if we made sure that num_online_cpus() == 1 while preemption was disabled, then after 

	mm_pin_all();
	local_irq_disable();
	preempt_enable();

it might not be possible for the python program to interfere anymore. Additionally, "Documentation/cpu-hotplug.txt" mentions preempt_disable():

----v----
#include <linux/cpu.h>
lock_cpu_hotplug() and unlock_cpu_hotplug():

The above calls are used to inhibit cpu hotplug operations. While holding the
cpucontrol mutex, cpu_online_map will not change. If you merely need to avoid
cpus going away, you could also use preempt_disable() and preempt_enable()
for those sections. Just remember the critical section cannot call any
function that can sleep or schedule this process away. The preempt_disable()
will work as long as stop_machine_run() is used to take a cpu down.
----^----
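In other words, the documented pattern is (a generic illustration, not code from this driver):

	preempt_disable();
	for_each_online_cpu(cpu) {
		/* 'cpu' cannot be taken down while we are in here, because
		 * stop_machine_run() cannot get this CPU to stop; we must
		 * not sleep or schedule in this section */
	}
	preempt_enable();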

In the end, I have no better idea than to backport this small hunk. (Note that the relative order of xenbus_suspend() and preempt_disable() is reversed between the patch and the current RHEL kernel, so that must be investigated as well.) I have no clue whether (a) it will be sufficient, or (b) it will avoid introducing deadlocks. I certainly could not *prove* the correctness of using preempt_disable() for this purpose. The current state of upstream is close to useless for us; suspend seems to have been completely redesigned.

Please examine if we should risk touching this, or if we should just "verbally discourage" this behavior in guests. As a further point, bug 497080 ("Backport Xen suspend cancellation bits") affects precisely the same code. Thank you.

Comment 5 Laszlo Ersek 2011-03-24 12:18:57 UTC
Another note: it seems to me that it is simply not supported to lock the map of online VCPUs, iterate over it, and flip bits along the way. Everything I have seen so far locks the map only for a cursory glance: if we like what we see, we go on; if not, we release the "looking mutex" and retry the entire loop.

Now suppose that we have a multi-threaded C version of the python program -- how long will the "suspend" guest kernel thread have to try to outrun this guest process?

Perhaps we could just say:
- Write access to the sysfs CPU online/offline interface needs root privileges (or some capability).
- Being root in the guest (or having this capability in the guest) and running such an arguably malicious program qualifies as deliberately messing with the guest kernel.
- If you cripple the suspend functionality of your guest kernel on purpose, don't be surprised if it can't be suspended.

What are your thoughts? Thank you.

Comment 6 Paolo Bonzini 2011-03-25 09:44:26 UTC
Long message, short answer. :)

Can you whip up a backport of what remains of c/s 14174?  I believe that we should backport it even if it means that a race would delay the suspension arbitrarily.

Note that (unlike the kexec case in bug 667340, for example) delaying suspension is a bit uglier than what you say in comment 5. A guest can delay save/restore/migration arbitrarily by making sure that it cannot suspend. On the other hand, I believe that it is preferable to a guest crash.

Comment 7 Gianni Tedesco 2011-03-25 17:12:28 UTC
Actually, I think this bug can be reproduced without twiddling CPUs online and offline; it seems to depend on the workload of the host! The only way I can repro it is by running the suspend/resume loop while another VM is being migrated to and from the host. I'm not 100% sure it is the exact same bug, but it looks very similar.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c04e6544
00db8000 -> *pde = 00000002:1a097027
16f63000 -> *pme = 00000002:1bf23067
150d7000 -> *pte = 00000000:00000000
Oops: 0000 [#1]
SMP 
last sysfs file: /block/dm-1/range
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh scsi_mod parport_pc lp parport pcspkr xennet dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0061:[<c04e6544>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-164.el5xen #1) 
EIP is at __first_cpu+0x0/0x1a
eax: 00000010   ebx: 00000001   ecx: 00000001   edx: c0771978
esi: 00000000   edi: 00000000   ebp: 00000000   esp: c037af18
ds: 007b   es: 007b   ss: 0069
Process suspend (pid: 2398, ti=c037a000 task=dfe87550 task.ti=c037a000)
Stack: c040e36d 00000001 c0771978 c0618fa8 c06870e4 00000000 c037af30 c1408e0c 
       c06881d4 c06837e0 00000001 00000007 c061a9fd 00000001 00000000 dffaf550 
       dffaf550 c04354f3 00000001 fffffffd 00000001 c0e77f34 c0557251 00000000 
Call Trace:
 [<c040e36d>] cacheinfo_cpu_callback+0x324/0x38a
 [<c0618fa8>] _spin_lock_irq+0x8/0x18
 [<c061a9fd>] notifier_call_chain+0x19/0x29
 [<c04354f3>] cpu_down+0x146/0x1f4
 [<c0557251>] __do_suspend+0x0/0x458
 [<c0556a89>] smp_suspend+0x19/0x70
 [<c05572a4>] __do_suspend+0x53/0x458
 [<c04190c5>] complete+0x2b/0x3d
 [<c0557251>] __do_suspend+0x0/0x458
 [<c042fe35>] kthread+0xc0/0xeb
 [<c042fd75>] kthread+0x0/0xeb
 [<c0403005>] kernel_thread_helper+0x5/0xb
 =======================
Code: 83 c2 04 39 c1 7c f5 85 ff 78 0d 01 fe eb 04 46 48 74 05 3b 34 24 7c b0 8d 46 ff 89 45 00 8b 44 24 04 5a 59 5b 5e 5f 5d c3 90 90 <8b> 00 ba 20 00 00 00 85 c0 74 03 0f bc d0 83 fa 20 b8 20 00 00 
EIP: [<c04e6544>] __first_cpu+0x0/0x1a SS:ESP 0069:c037af18
 <0>Kernel panic - not syncing: Fatal exception

Comment 8 Gianni Tedesco 2011-03-25 17:17:40 UTC
It could also be a different race of the same kind. But I'm not sure what the other contender would be, or why it appears to be tickled by activity on the host and not in the guest.

Comment 9 Laszlo Ersek 2011-03-28 10:27:10 UTC
Created attachment 488125 [details]
Discover CPU onlining/offlining race and retry. Attempt #1.

Backport the remaining two hunks of xen-unstable c/s 14174:c3c03089c59e (the "machine_reboot.c" changes), on top of c1515ea.

http://xenbits.xensource.com/hg/xen-unstable.hg/rev/c3c03089c59e#l2.1

The first hunk overwrites the assignment to "cpu_initialized_map" in post_suspend(). The new map will explicitly contain VCPU#0 only.

The second hunk differs a bit from upstream. Upstream has xenbus_suspend() *outside* preempt_disable(). Their loop, in case of retries, exercises xenbus_suspend() and xenbus_suspend_cancel(), and, as stated above, outside of preempt_disable().

This attempt at a backport keeps the RHEL characteristic of calling xenbus_suspend() inside preempt_disable(). We also call it only once, after the loop has completed, so there should be no need to call xenbus_suspend_cancel() at all.

This backport basically wraps the smp_suspend() call in a loop, the condition being a protected check of num_online_cpus(). When we exit the loop, this protection remains in place.
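In shape, the patched suspend prologue then looks roughly like this (a sketch of what the attachment does, as described above; the attached patch is the authoritative version):

	for (;;) {
		err = smp_suspend();
		if (err)
			return err;

		preempt_disable();
		if (num_online_cpus() == 1)
			break;		/* preemption stays disabled from here on */

		preempt_enable();	/* a VCPU was re-onlined meanwhile: retry */
	}

	xenbus_suspend();		/* RHEL variant: once, inside preempt_disable() */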

Currently brewing:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3208987

Please give it a quick glance if you will. Thanks.

Comment 10 Paolo Bonzini 2011-03-28 11:56:50 UTC
Comment on attachment 488125 [details]
Discover CPU onlining/offlining race and retry. Attempt #1.

looks good

Comment 11 Laszlo Ersek 2011-03-28 16:45:43 UTC
Unfortunately, the patch is not sufficient; the bug reproduces with it after some tries. It is probably the case that the preempt_disable() region is too small, and after it completes, the python program (already running on VCPU#0) can still re-enable other VCPUs.

I'm not sure if the entire bracket can be put in preempt_disable(). "Documentation/cpu-hotplug.txt" states, "Just remember the critical section cannot call any function that can sleep or schedule this process away".

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/xen/core/evtchn.c:839
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/system/cpu/cpu1/online
CPU 0 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh scsi_mod parport_pc lp parport pcspkr xennet dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 21121, comm: suspend Not tainted 2.6.18-251.el5.cpuoff_race_bz690419_1xen #1
RIP: e030:[<ffffffff803b8104>]  [<ffffffff803b8104>] irq_resume+0x7d/0x2b8
RSP: e02b:ffff88002b5fddc0  EFLAGS: 00010013
RAX: ffff88000109bc60 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff88008090ba00
RBP: ffffffff803b99a4 R08: ffffffff8062e100 R09: 000000000000003e
R10: ffffffff808bc0c0 R11: 0000000000000200 R12: ffff88003f8e7ce0
R13: 0000000000000000 R14: ffff88003f8e7cd0 R15: ffffffff8029dfa9
FS:  00002af4afcde710(0000) GS:ffffffff8062e000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process suspend (pid: 21121, threadinfo ffff88002b5fc000, task ffff88003ec01860)
Stack:  ffffc20000002000  0000000000000000  0000000000002000  ffffffff8020bbd2 
 0000000000000000  0000000000002000  0000000000000010  ffffffff803b870a 
 00000002022c7ff0  ffffffff803b0000 
Call Trace:
 [<ffffffff8020bbd2>] kfree+0x15/0xc7
 [<ffffffff803b870a>] gnttab_map+0x10f/0x11c
 [<ffffffff803b0000>] device_add+0x156/0x372
 [<ffffffff803b99a4>] __do_suspend+0x0/0x64e
 [<ffffffff803b9dda>] __do_suspend+0x436/0x64e
 [<ffffffff802889a4>] __wake_up_common+0x3e/0x68
 [<ffffffff80233996>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029dfa9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233898>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12


Code: 0f 0b 68 60 e3 4b 80 c2 47 03 eb fe ff c6 48 ff c1 83 fe 18 
RIP  [<ffffffff803b8104>] irq_resume+0x7d/0x2b8
 RSP <ffff88002b5fddc0>
 <0>Kernel panic - not syncing: Fatal exception

Comment 12 Paolo Bonzini 2011-03-28 17:02:58 UTC
I think it would work to completely remove preemption at least until right after the suspend hypercall, or possibly something like

diff --git a/drivers/xen/core/reboot.c b/drivers/xen/core/reboot.c
index 060d4e5..eba4fe0 100644
--- a/drivers/xen/core/reboot.c
+++ b/drivers/xen/core/reboot.c
@@ -189,7 +189,6 @@ static int __do_suspend(void *ignore)
 
 	mm_pin_all();
 	local_irq_disable();
-	preempt_enable();
 
 	gnttab_suspend();
 
@@ -215,12 +214,14 @@ static int __do_suspend(void *ignore)
 
 	local_irq_enable();
 
+	preempt_enable();
+
+	smp_resume();
+
 	xencons_resume();
 
 	xenbus_resume();
 
-	smp_resume();
-
 	return err;
 }
 #endif /* CONFIG_XEN */


In fact, am I blind, or upstream linux-2.6.18.hg isn't calling preempt_enable after suspension at all (in drivers/xen/core/machine_reboot.c)?!?

Comment 13 Laszlo Ersek 2011-03-29 08:33:20 UTC
(In reply to comment #12)

> In fact, am I blind, or upstream linux-2.6.18.hg isn't calling preempt_enable
> after suspension at all (in drivers/xen/core/machine_reboot.c)?!?

No, I think you're correct.

Comment 14 Laszlo Ersek 2011-03-29 10:46:13 UTC
Applying the patch in comment 12 caused the following deadlock during "xm save":

----v----
BUG: soft lockup - CPU#0 stuck for 60s! [xenfb thread:258]
CPU 0:
Modules linked in: autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) xt_tcpudp(U) iptable_filter(U) ip_tables(U) ip6_tables(U) x_tables(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) uio(U) cxgb3i(U) cxgb3(U) 8021q(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) loop(U) dm_multipath(U) scsi_dh(U) scsi_mod(U) parport_pc(U) lp(U) parport(U) xennet(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) xenblk(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 258, comm: xenfb thread Tainted: G      2.6.18.251_cpuoff_race_bz690419_2_local_xen #1
RIP: e030:[<ffffffff803c281c>]  [<ffffffff803c281c>] xenfb_thread+0x29c/0x2d1
RSP: e02b:ffff88003f5d3e10  EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffff88003f3a9720 RCX: 0000000000000320
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88003f3a9680 R08: ffff88003f5d2000 R09: 0000000000000000
R10: ffff88003f211320 R11: ffff88003f399820 R12: 0000000000000320
R13: 0000000000000250 R14: 0000000000000000 R15: 0000000000000000
FS:  00002af50cf3a710(0000) GS:ffffffff8062e000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000

Call Trace:
 [<ffffffff80248a57>] try_to_wake_up+0x392/0x3a4
 [<ffffffff8029e1c1>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8029dfa9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff803c2580>] xenfb_thread+0x0/0x2d1
 [<ffffffff8029dfa9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233996>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029dfa9>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233898>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12
----^----

This message is repeated once per minute, with slightly varying register values. The xend log ends with

[2011-03-29 06:27:16 xend 5422] DEBUG (XendCheckpoint:180) Suspending 3 ...
[2011-03-29 06:27:16 xend.XendDomainInfo 5422] DEBUG (XendDomainInfo:1298) XendDomainInfo.handleShutdownWatch

I will dump the core of the guest domain and try to look closer at the problem.

Comment 15 Laszlo Ersek 2011-04-01 20:33:20 UTC
Running the guest without a virtual framebuffer (i.e. removing the vfb = [...] line from the config file) and repeating the test resulted in the following:

- No "soft lockup" message (comment 14).
- After the first save-resume, the second save attempt got stuck. Tail of xend.log:
  xend.XendDomainInfo 5476] DEBUG (XendDomainInfo:1298) \
  XendDomainInfo.handleShutdownWatch 
- At this time the virtual machine has the name "migrating-rhel56pv". One can
  connect to its virtual console with "xm console", and the python program is
  running. However, it can't be interrupted with ^C.
- Pinging or attempting to ssh into the vm elicits no response. After a while
  there is no entry for the guest / bridge in the dom0 ARP cache.
- "xm shutdown" has no effect, except a further entry in xend.log.

Looks like a dead end to me. I suggest CANTFIX (--> comment 5), unless someone can come up with a brilliant, non-intrusive workaround. Thanks.

Comment 16 Laszlo Ersek 2011-04-04 13:36:17 UTC
Hi Gianni,

(In reply to comment #7)

> Actually, I think that this bug can be reproduced without the twiddling of
> CPU's online and offline. Actually it seems to be dependent on the workload of
> the host! The only way I can repro it is doing the suspend/resume loop while
> another vm is being migrated to and from the host. I'm not 100% sure it is the
> exact same bug but it's very similar looking.

I believe this is an unrelated problem. The circumstances are different (a race within a single guest vs. two independent guests), and the symptoms seem to be different as well (BUG_ON() vs. NULL pointer dereference). The problems do have "suspend" in common, though.


> EFLAGS: 00010246   (2.6.18-164.el5xen #1) 

So this was on RHEL-5.4, right? Can you please try to reproduce the problem with a recent RHEL kernel? And if it's still there, would you please report this as a separate bug?

Till then, an idea (arch/i386/kernel/cpu/intel_cacheinfo.c):

> EIP is at __first_cpu+0x0/0x1a

>  [<c040e36d>] cacheinfo_cpu_callback+0x324/0x38a
>  [<c0618fa8>] _spin_lock_irq+0x8/0x18
>  [<c061a9fd>] notifier_call_chain+0x19/0x29
>  [<c04354f3>] cpu_down+0x146/0x1f4

cpu_down() --> ... --> cacheinfo_cpu_callback()
(CPU_DEAD) --> cache_remove_dev():

	for (i = 0; i < num_cache_leaves; i++) {
		cache_remove_shared_cpu_map(cpu, i);
		/* ... */
	}

cache_remove_shared_cpu_map(cpu, index)
{
	this_leaf = CPUID4_INFO_IDX(cpu, index);
	for_each_cpu_mask(sibling, this_leaf->shared_cpu_map) {
		sibling_leaf = CPUID4_INFO_IDX(sibling, index);
		cpu_clear(cpu, sibling_leaf->shared_cpu_map);
	}
}

The for_each_cpu_mask() macro makes use of __first_cpu() in the initialization statement of the "for" loop that it expands to. The access is in find_first_bit().


> BUG: unable to handle kernel NULL pointer dereference at virtual address
> 00000010

struct _cpuid4_info {
	union _cpuid4_leaf_eax eax;
	union _cpuid4_leaf_ebx ebx;
	union _cpuid4_leaf_ecx ecx;
	unsigned long size;
	cpumask_t shared_cpu_map;
};

   (long unsigned *)(&this_leaf->shared_cpu_map)->bits
== (long unsigned *)this_leaf->shared_cpu_map.bits
== (long unsigned *)(
     (char unsigned *)this_leaf
     + offsetof(struct _cpuid4_info, shared_cpu_map)     /* 0x10 */
     + offsetof(cpumask_t, bits)                         /* 0x00 */
   )

So in cache_remove_shared_cpu_map(),

    this_leaf = CPUID4_INFO_IDX(cpu, index);

should set "this_leaf" to NULL. "cpuid4_info" is an array of pointers. It has one entry for each CPU. Each pointer points to an array of "struct _cpuid4_info":

/* pointer to _cpuid4_info array (for each cache leaf) */
static struct _cpuid4_info *cpuid4_info[NR_CPUS];
#define CPUID4_INFO_IDX(x,y)    (&((cpuid4_info[x])[y]))

A more readable macro definition would have been:

#define CPUID4_INFO_IDX(x,y)    (cpuid4_info[x] + (y))

I can see only one way this can return NULL:

- First, "y" must be 0 (so, cache_remove_shared_cpu_map() was called with "index" being 0, that is, the in the first iteration of the loop in cache_remove_dev()).
- Second, cpuid4_info[x] must be NULL.

How the latter was possible should be analyzed in the new bug :) Note that this code was extensively changed after RHEL-5.4. For example, "struct _cpuid4_info" was modified multiple times: it was extended (and re-extended) with "can_disable", among other things.

My working idea (hope) is now that this second, unrelated bug (i.e. the one in your comment 7) will go away if you try to reproduce it on RHEL-5.6. If not, please report a separate bug. Thank you.

Comment 17 Paolo Bonzini 2011-04-04 15:06:28 UTC
The second save is probably failing because the first resume hasn't reset the shutdown watch (so the domain doesn't react to the request to suspend).  You can try xenctx to see where it is stuck.

Comment 18 Laszlo Ersek 2011-04-05 15:29:26 UTC
(In reply to comment #12; for the record)

> - preempt_enable();
> + preempt_enable();

Turns out this makes no difference:

http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00158.html

This way, the only effect of the patch is that smp_resume() happens earlier, and that effect can be produced through cpu-hotwalk anyway.

Comment 19 Paolo Bonzini 2011-04-05 17:10:04 UTC
See also bug 589123, which has the same symptoms as "after my patch" (failure to suspend the second time) and would also be fixed by suspend cancellation.

Comment 20 Laszlo Ersek 2011-06-15 12:44:02 UTC
See bug 497080 comment 7. Additionally, suspend resumes cooperation from the guest, and a guest running "cpu-hotwalk" (which requires root in the guest) during suspend is suicidal. WONTFIX.

Comment 21 Laszlo Ersek 2011-06-15 12:45:13 UTC
(In reply to comment #20)
> suspend resumes cooperation from the guest

My brain short-circuited, apparently. I meant "suspension requires cooperation from the guest".

