Description of problem: In RHEL-4 (and possibly RHEL-5) SMP paravirtualized guests, suspend/resume sometimes fails. To reproduce this behavior, I do the following: 1. Start up a CPU-intensive job inside the PV guest (i.e. stress with 4 jobs) 2. Use the following script on the dom0: #!/bin/bash if [ $# -ne 1 ]; then echo "Usage: xen_test_save_restore.sh <domname>" exit 1 fi domname=$1 count=0 mkdir -p /var/lib/xen/save while true ; do rm -f /var/lib/xen/save/$domname-save echo "Doing save $count..." xm save $domname /var/lib/xen/save/$domname-save sleep 20 echo "Doing restore $count..." xm restore /var/lib/xen/save/$domname-save sleep 40 count=$(( $count + 1 )) done After a number of iterations (it varies), the script will get hung up on a restore. During this time, it looks like 1 of the vCPUs has been hotplugged, and it also looks like that CPU is running the jobs. However, the second vCPU will never be plugged, and the "xm save" command will never return. This happens on both i386 and x86_64.
I've taken a core-dump of the pv domain while it was "stuck". What it looks like is happening is that the "suspend" thread that is kicked off to do the suspend never completes. Looking at the backtrace, it looks like it is hung-up in unregister_xenbus_watch(), trying to acquire the "xenwatch_mutex" in the end. However, the "down" fails, which puts the suspend process to sleep, and it looks like it just never wakes up again. Because of this, we never reach smp_resume() (which explains why the other vCPUs don't get plugged back in). Besides unregister_xenbus_watch(), the other two users of xenwatch_mutex are xenbus_register_driver_common(), and the xenwatch_thread(). This certainly feels like a race with one of these two; I'll next try to see which one. Chris Lalancette
Gah. This goes still deeper. The xenwatch thread got woken up to handle a vcpu hotplug event. It is currently executing the callback for that, which is handle_vcpu_hotplug_event() -> vcpu_hotplug() -> cpu_up(). However, in cpu_up(), the first down_interruptible(&cpucontrol) failed. So, because xenwatch thread is holding the xenwatch_mutex, and waiting for the cpucontrol mutex (which never gets woken up), the suspend thread can never acquire xenwatch_mutex, which causes the whole problem. Now to figure out why cpucontrol isn't getting released. Chris Lalancette
This may be relevant: http://lists.xensource.com/archives/html/xen-changelog/2007-02/msg00414.html I'm going to test this out next. Chris Lalancette
OK, after talking with Don some more, more details about what is actually going on when we have the bug in "xm restore": 1. When booting, we register a xenbus watch to look for shutdown events, which include suspend (in drivers/xen/core/reboot.c) 2. When we get a shutdown event for suspend via xenbus, we fire shutdown_handler(), which schedules work for "shutdown_work". 3. shutdown_work eventually calls __shutdown_handler() 4. In the case of suspend, it creates a "suspend" kthread on cpu 0, using the __do_suspend() function. 5. __do_suspend() does the necessary teardown of everything, including suspending the xenbus with xenbus_suspend(). Then it calls "HYPERVISOR_suspend", which puts the guest into suspend. 6. On resume, the suspend kthread starts after the HYPERVISOR_suspend. It does things like re-enable timers, gnttab, etc. Eventually it gets to xenbus_resume() 7. xenbus_resume() ends up calling bus_for_each_dev on the xenbus_frontend; this calls resume_dev() for each of the devices on the bus. 8. In turn, resume_dev() -> talk_to_otherend() -> free_otherend_watch() -> unregister_xenbus_watch() 9. unregister_xenbus_watch() ends up unwatching the node, canceling pending watch events, and then down() on the xenwatch_mutex. Step 9 fails, because the xenwatch_mutex is being held by the xenwatch kthread. It looks like by the time we get to unregister_xenbus_watch(), the xenbus kthread may have already received a "vcpu_hotplug" event, which causes xenwatch kthread to wake up, and take the xenwatch_mutex. However, it is never letting go of the mutex. This is what happens in the xenwatch_thread(): 1. Woken up because the xenbus kthread did a wakeup on "watch_events_waitq". 2. down() on xenwatch_mutex 3. Now pulls the event off of the pending list and calls the callback for it. 4. In this case, the callback is handle_vcpu_hotplug_event() -> vcpu_hotplug() -> cpu_up() 5. cpu_up() tries to take the cpucontrol semaphore, but fails because smp_suspend() had already taken, and held onto, that lock before going to sleep. In the link mentioned previously, the solution was to slightly rewrite smp_suspend()/smp_resume() so as not to take that lock at all. I'm still testing out a modified version of this patch. Chris Lalancette
Created attachment 153275 [details] PATCH 1: Clean-up is_initial_domain First patch in a series to address this problem. This one just does some cleanup around the is_initial_domain() macro.
Created attachment 153276 [details] PATCH 2: Clean-up hotplug files Second patch in a series to address this problem. This patch cleans up the hotplug files, mostly drivers/xen/core/cpu_hotplug.c and drivers/xen/core/smpboot.c
Created attachment 153277 [details] PATCH 3: Fix cpu_hotplug_lock deadly embrace Third patch in a series to address this issue. Take the upstream Xen code to fix the issue and backport into RHEL-4; this mostly involves not taking the cpu hotplug lock (since it is not strictly necessary).
Created attachment 153278 [details] PATCH 4: Fix "hlt" instruction for i386 and x86_64 Fourth patch in a series to address this problem. Currently in RHEL-4 PV, we issue the "hlt" command when we are stopping an SMP CPU. I don't think this is actually a problem, but we are not being a good Xen citizen this way. Fix this up to be like RHEL-5; namely, make a hypercall to tell the HV that this CPU is now dead from the domU point-of-view.
Note that with the above 4 patches, the save/restore loop is still failing, although it takes a much longer time, and it only fails when using 4 vCPUs. Apparently there is still work to be done here. Chris Lalancette
Created attachment 153350 [details] PATCH 5: Fix up evtchn to not smp_call_function Fifth patch in a series to address this issue. This patch takes out an smp_call_function in the suspend path that seems entirely superflous, and was also causing problems in x86_64.
With the above 5 patches applied, I have a x86_64, 4 vCPU test that has been running for > 12 hours now. That's the longest it has ever run. Next I'm sending the build through brew and will test both x86_64 and i386. Chris Lalancette
Created attachment 153821 [details] PATCH 3 (revised): Fix cpu_hotplug_lock deadly embrace Slightly updated PATCH 3 to fix the deadly embrace. With these 5 patches applied, I no longer see the hangs. However, I am noticing that per-CPU kernel threads (like aio/1, for instance) are NOT getting properly migrated onto their appropriate CPUs. Besides the obvious performance problems, this can also cause certain CPUs to never get scheduled again, and, under heavy load (such as a -j5 kernel build), can cause all CPUs never to get scheduled. This needs to be fixed as well before this bug is busted. Chris Lalancette
Created attachment 153881 [details] PATCH 6: Fix RHEL-4 workqueue to correctly bind kernel threads Patch 6 in a series to address this issue. This patch introduces a fix into the workqueue code to re-bind per-CPU kernel threads to the appropriate processors on a CPU hotplug.
Created attachment 153962 [details] PATCH 6: Fix RHEL-4 workqueue to correctly bind kernel threads (revised) Sixth patch in a series to address this issue. This is a respin of the previous patch; after discussion with ddd, the last patch was overly complex for no reason. It also wasn't generated with -p. This just cleans it up. Chris Lalancette
Created attachment 153977 [details] PATCH 3 (revised again): Fix cpu_hotplug_lock deadly embrace Third patch in a series to address this issue. I attached the wrong patch last time; this is the right patch to fix the deadly embrace.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This changeset: http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e17224bf1d01b461ec02a60f5a9b7657a89bdd23 Might also be involved; I'm testing it out now. Chris Lalancette
That last changeset didn't seem to make a difference; here's what I've found out on the current problem: 1) We are running into a race condition where the suspend thread is trying to "cpu_down". 2) One of the things cpu_down does is cpu_attach_domain() of all of the online CPUs to the "dummy" domain. 3) cpu_attach_domain() actually sets up a completion, kicks the migration thread (via wake_up_process()), and then does a "wait_for_completion". 4) wake_up_process() attempts to wake up the migration thread; however, when the race happens, the migration thread is already in state TASK_RUNNING, although it is in "schedule". Because of this, wake_up_process immediately exits without actually kicking the migration thread, and then we will wait forever on the completion, since the migration thread will never awake to service it. I'm not quite sure how the migration thread is getting into this state; it seems that every time it goes into schedule(), it should already be in state TASK_INTERRUPTIBLE. So either that assumption is wrong, or some other thread is coming along AFTER the migration thread has scheduled() and is changing it to TASK_RUNNING. Chris Lalancette
I just posted a series of 9 patches to fix this issue. Note that the first 6 of them are similar to the ones posted here, but they are not all of them; I'll upload the newer ones at some point. Chris Lalancette
committed in stream U6 build 55.16. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html
Hi, the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained. a mockup of the RHEL5.2 release notes can be viewed at the following link: http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html please use the aforementioned link to verify if your bugzilla is already in the release notes (if it needs to be). each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number. Cheers, Don