Description of problem:
In RHEL-4 (and possibly RHEL-5) SMP paravirtualized guests, suspend/resume
sometimes fails. To reproduce this behavior, I do the following:
1. Start up a CPU-intensive job inside the PV guest (e.g. stress with 4 jobs)
2. Use the following script on the dom0:
#!/bin/sh
# Loop "xm save" / "xm restore" on one domain until a restore hangs.
if [ $# -ne 1 ]; then
    echo "Usage: xen_test_save_restore.sh <domname>"
    exit 1
fi
domname=$1
count=0
mkdir -p /var/lib/xen/save
while true ; do
    rm -f "/var/lib/xen/save/$domname-save"
    echo "Doing save $count..."
    xm save "$domname" "/var/lib/xen/save/$domname-save"
    echo "Doing restore $count..."
    xm restore "/var/lib/xen/save/$domname-save"
    count=$(( count + 1 ))
done
After a number of iterations (it varies), the script will get hung up on a
restore. During this time, it looks like 1 of the vCPUs has been hotplugged,
and it also looks like that CPU is running the jobs. However, the second vCPU
will never be plugged, and the "xm save" command will never return. This
happens on both i386 and x86_64.
I've taken a core dump of the PV domain while it was "stuck". What it looks
like is happening is that the "suspend" thread that is kicked off to do the
suspend never completes. Looking at the backtrace, it is hung up in
unregister_xenbus_watch(), trying to acquire the "xenwatch_mutex" at the end.
The down() cannot get the mutex, which puts the suspend thread to sleep, and it
looks like it just never wakes up again. Because of this, we never reach
smp_resume() (which explains why the other vCPUs don't get plugged back in).
Besides unregister_xenbus_watch(), the other two users of xenwatch_mutex are
xenbus_register_driver_common(), and the xenwatch_thread(). This certainly
feels like a race with one of these two; I'll next try to see which one.
Gah. This goes still deeper. The xenwatch thread got woken up to handle a vcpu
hotplug event. It is currently executing the callback for that, which is
handle_vcpu_hotplug_event() -> vcpu_hotplug() -> cpu_up(). However, in
cpu_up(), the first down_interruptible(&cpucontrol) could not get the semaphore
and went to sleep. So, because the xenwatch thread is holding the xenwatch_mutex
while waiting for cpucontrol (which never gets released), the suspend thread can
never acquire xenwatch_mutex, which causes the whole problem. Now to figure out
why cpucontrol isn't getting released.
This may be relevant:
I'm going to test this out next.
OK, after talking with Don some more, more details about what is actually going
on when we have the bug in "xm restore":
1. When booting, we register a xenbus watch to look for shutdown events, which
include suspend (in drivers/xen/core/reboot.c)
2. When we get a shutdown event for suspend via xenbus, we fire
shutdown_handler(), which schedules work for "shutdown_work".
3. shutdown_work eventually calls __shutdown_handler()
4. In the case of suspend, it creates a "suspend" kthread on cpu 0, using the
5. __do_suspend() does the necessary teardown of everything, including
suspending the xenbus with xenbus_suspend(). Then it calls
"HYPERVISOR_suspend", which puts the guest into suspend.
6. On resume, the suspend kthread starts after the HYPERVISOR_suspend. It does
things like re-enable timers, gnttab, etc. Eventually it gets to xenbus_resume().
7. xenbus_resume() ends up calling bus_for_each_dev on the xenbus_frontend;
this calls resume_dev() for each of the devices on the bus.
8. In turn, resume_dev() -> talk_to_otherend() -> free_otherend_watch() ->
9. unregister_xenbus_watch() ends up unwatching the node, canceling pending
watch events, and then down() on the xenwatch_mutex.
Step 9 blocks, because the xenwatch_mutex is being held by the xenwatch kthread.
It looks like by the time we get to unregister_xenbus_watch(), the xenbus
kthread may have already received a "vcpu_hotplug" event, which causes the
xenwatch kthread to wake up and take the xenwatch_mutex. However, it never lets
go of the mutex. This is what happens in xenwatch_thread():
1. Woken up because the xenbus kthread did a wakeup on "watch_events_waitq".
2. down() on xenwatch_mutex
3. Now pulls the event off of the pending list and calls the callback for it.
4. In this case, the callback is handle_vcpu_hotplug_event() -> vcpu_hotplug()
5. cpu_up() tries to take the cpucontrol semaphore, but fails because
smp_suspend() had already taken, and held onto, that lock before going to sleep.
In the link mentioned previously, the solution was to slightly rewrite
smp_suspend()/smp_resume() so as not to take that lock at all. I'm still
testing out a modified version of this patch.
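For anyone following along, the deadly embrace described in this comment can be written down as a tiny wait-for graph. This is only a toy model in Python, not kernel code: the thread and lock names come from the analysis above, while find_cycle, holders and waiting_for are names made up for this sketch.

```python
# Toy wait-for graph for the hang described above; not kernel code.

# Which thread holds each lock, per the analysis in this report.
holders = {
    "cpucontrol": "suspend",       # taken by smp_suspend(), held across the suspend
    "xenwatch_mutex": "xenwatch",  # taken in xenwatch_thread() before the callback
}

# Which lock each thread is blocked on.
waiting_for = {
    "suspend": "xenwatch_mutex",   # unregister_xenbus_watch() -> down()
    "xenwatch": "cpucontrol",      # cpu_up() -> down_interruptible()
}


def find_cycle(start, holders, waiting_for):
    """Follow thread -> wanted lock -> holder; return the path if it loops back to start."""
    path = [start]
    seen = {start}
    thread = start
    while thread in waiting_for:
        lock = waiting_for[thread]
        owner = holders.get(lock)
        path.append(lock)
        if owner is None:
            return None        # the lock is free, so this wait will end
        path.append(owner)
        if owner == start:
            return path        # back where we began: deadlock
        if owner in seen:
            return None        # a loop that does not involve start
        seen.add(owner)
        thread = owner
    return None


cycle = find_cycle("suspend", holders, waiting_for)
print(" -> ".join(cycle) if cycle else "no deadlock")

# Model of the proposed fix: smp_suspend() no longer holds cpucontrol.
fixed_holders = {"xenwatch_mutex": "xenwatch"}
print(find_cycle("suspend", fixed_holders, waiting_for))
```

With cpucontrol held by the suspend path the walk comes back to "suspend"; with that entry dropped (the approach of not taking the lock in smp_suspend()/smp_resume()), the walk ends at a free lock and no cycle is reported.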
Created attachment 153275 [details]
PATCH 1: Clean-up is_initial_domain
First patch in a series to address this problem. This one just does some
cleanup around the is_initial_domain() macro.
Created attachment 153276 [details]
PATCH 2: Clean-up hotplug files
Second patch in a series to address this problem. This patch cleans up the
hotplug files, mostly drivers/xen/core/cpu_hotplug.c and
Created attachment 153277 [details]
PATCH 3: Fix cpu_hotplug_lock deadly embrace
Third patch in a series to address this issue. Take the upstream Xen code to
fix the issue and backport into RHEL-4; this mostly involves not taking the cpu
hotplug lock (since it is not strictly necessary).
Created attachment 153278 [details]
PATCH 4: Fix "hlt" instruction for i386 and x86_64
Fourth patch in a series to address this problem. Currently in RHEL-4 PV, we
issue the "hlt" instruction when we are stopping an SMP CPU. I don't think this
is actually a problem, but we are not being a good Xen citizen this way. Fix
this up to be like RHEL-5; namely, make a hypercall to tell the HV that this CPU
is now dead from the domU point of view.
Note that with the above 4 patches, the save/restore loop is still failing,
although it takes a much longer time, and it only fails when using 4 vCPUs.
Apparently there is still work to be done here.
Created attachment 153350 [details]
PATCH 5: Fix up evtchn to not smp_call_function
Fifth patch in a series to address this issue. This patch takes out an
smp_call_function in the suspend path that seems entirely superfluous, and was
also causing problems on x86_64.
With the above 5 patches applied, I have an x86_64, 4 vCPU test that has been
running for > 12 hours now. That's the longest it has ever run. Next I'm
sending the build through brew and will test both x86_64 and i386.
Created attachment 153821 [details]
PATCH 3 (revised): Fix cpu_hotplug_lock deadly embrace
Slightly updated PATCH 3 to fix the deadly embrace. With these 5 patches
applied, I no longer see the hangs. However, I am noticing that per-CPU kernel
threads (like aio/1, for instance) are NOT getting properly migrated onto their
appropriate CPUs. Besides the obvious performance problems, this can also
cause certain CPUs to never get scheduled again, and, under heavy load (such as
a -j5 kernel build), can cause all CPUs never to get scheduled. This needs to
be fixed as well before this bug is busted.
Created attachment 153881 [details]
PATCH 6: Fix RHEL-4 workqueue to correctly bind kernel threads
Patch 6 in a series to address this issue. This patch introduces a fix into
the workqueue code to re-bind per-CPU kernel threads to the appropriate
processors on a CPU hotplug.
Created attachment 153962 [details]
PATCH 6: Fix RHEL-4 workqueue to correctly bind kernel threads (revised)
Sixth patch in a series to address this issue. This is a respin of the
previous patch; after discussion with ddd, it turned out the last patch was
overly complex for no reason. It also wasn't generated with -p. This just
cleans it up.
Created attachment 153977 [details]
PATCH 3 (revised again): Fix cpu_hotplug_lock deadly embrace
Third patch in a series to address this issue. I attached the wrong patch last
time; this is the right patch to fix the deadly embrace.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
Might also be involved; I'm testing it out now.
That last changeset didn't seem to make a difference; here's what I've found out
on the current problem:
1) We are running into a race condition where the suspend thread is trying to
2) One of the things cpu_down does is cpu_attach_domain() of all of the online
CPUs to the "dummy" domain.
3) cpu_attach_domain() actually sets up a completion, kicks the migration
thread (via wake_up_process()), and then does a "wait_for_completion".
4) wake_up_process() attempts to wake up the migration thread; however, when
the race happens, the migration thread is already in state TASK_RUNNING,
although it is in "schedule". Because of this, wake_up_process immediately
exits without actually kicking the migration thread, and then we will wait
forever on the completion, since the migration thread will never awake to
I'm not quite sure how the migration thread is getting into this state; it seems
that every time it goes into schedule(), it should already be in state
TASK_INTERRUPTIBLE. So either that assumption is wrong, or some other thread is
coming along AFTER the migration thread has called schedule() and is changing it to
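The lost wakeup in steps 3 and 4 can be illustrated with a toy model. This is a sketch in Python under the assumption stated above, namely that wake_up_process() does nothing when the target is already in TASK_RUNNING. The function names mirror the kernel's, but the bodies are drastic simplifications, and the kicked flag is a hypothetical stand-in for "the migration thread will actually run and signal the completion".

```python
# Toy model of the lost wakeup described above; not kernel code.
TASK_RUNNING = "TASK_RUNNING"
TASK_INTERRUPTIBLE = "TASK_INTERRUPTIBLE"


class Task:
    def __init__(self, state):
        self.state = state
        self.kicked = False  # hypothetical flag: did a wakeup actually take effect?


def wake_up_process(task):
    """Simplified stand-in: only a sleeping task is woken."""
    if task.state == TASK_RUNNING:
        return False         # already marked running, so nothing is done
    task.state = TASK_RUNNING
    task.kicked = True
    return True


def cpu_attach_domain(migration_thread):
    """Return True if the completion would be signalled, False if we would wait forever."""
    wake_up_process(migration_thread)
    # Only a migration thread that was really kicked goes on to complete().
    return migration_thread.kicked


print(cpu_attach_domain(Task(TASK_INTERRUPTIBLE)))  # normal case
print(cpu_attach_domain(Task(TASK_RUNNING)))        # the race
```

The first call models the expected case, where the migration thread is asleep and gets kicked. The second models the race: the wakeup is a no-op, so the wait_for_completion would never return.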
I just posted a series of 9 patches to fix this issue. Note that the first 6 of
them are similar to the ones posted here, but they are not all of them; I'll
upload the newer ones at some point.
committed in stream U6 build 55.16. A test kernel with this patch is available
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.