236929 – [RHEL4 Xen]: Suspend/resume failure under load

Summary: [RHEL4 Xen]: Suspend/resume failure under load

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Chris Lalancette
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	251013

Reported:	2007-04-18 14:36 UTC by Chris Lalancette
Modified:	2008-04-02 02:12 UTC
CC List:	2 users
Fixed In Version:	RHBA-2007-0791
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-15 16:25:18 UTC
Target Upstream Version:
Embargoed:

Attachments
PATCH 1: Clean-up is_initial_domain (1.75 KB, patch) 2007-04-23 13:02 UTC, Chris Lalancette
PATCH 2: Clean-up hotplug files (2.97 KB, patch) 2007-04-23 13:04 UTC, Chris Lalancette
PATCH 3: Fix cpu_hotplug_lock deadly embrace (4.52 KB, patch) 2007-04-23 13:06 UTC, Chris Lalancette
PATCH 4: Fix "hlt" instruction for i386 and x86_64 (1.88 KB, patch) 2007-04-23 13:09 UTC, Chris Lalancette
PATCH 5: Fix up evtchn to not smp_call_function (2.89 KB, patch) 2007-04-24 13:33 UTC, Chris Lalancette
PATCH 3 (revised): Fix cpu_hotplug_lock deadly embrace (2.97 KB, patch) 2007-04-30 20:09 UTC, Chris Lalancette
PATCH 6: Fix RHEL-4 workqueue to correctly bind kernel threads (516 bytes, patch) 2007-05-01 18:28 UTC, Chris Lalancette
PATCH 6: Fix RHEL-4 workqueue to correctly bind kernel threads (revised) (428 bytes, patch) 2007-05-02 15:49 UTC, Chris Lalancette
PATCH 3 (revised again): Fix cpu_hotplug_lock deadly embrace (4.61 KB, patch) 2007-05-02 19:17 UTC, Chris Lalancette

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2007:0791	0	normal	SHIPPED_LIVE	Updated kernel packages available for Red Hat Enterprise Linux 4 Update 6	2007-11-14 18:25:55 UTC

Description Chris Lalancette 2007-04-18 14:36:51 UTC

Description of problem:
In RHEL-4 (and possibly RHEL-5) SMP paravirtualized guests, suspend/resume
sometimes fails.  To reproduce this behavior, I do the following:

1.  Start up a CPU-intensive job inside the PV guest (e.g. stress with 4 jobs)
2.  Use the following script on the dom0:

#!/bin/bash

if [ $# -ne 1 ]; then
        echo "Usage: xen_test_save_restore.sh <domname>" >&2
        exit 1
fi

domname=$1
save_file="/var/lib/xen/save/$domname-save"
count=0
mkdir -p /var/lib/xen/save
# Loop forever: save the guest, wait, restore it, wait, repeat.
while true ; do
        rm -f "$save_file"
        echo "Doing save $count..."
        xm save "$domname" "$save_file"
        sleep 20
        echo "Doing restore $count..."
        xm restore "$save_file"
        sleep 40
        count=$(( count + 1 ))
done

After a number of iterations (it varies), the script will get hung up on a
restore.  During this time, it looks like 1 of the vCPUs has been hotplugged,
and it also looks like that CPU is running the jobs.  However, the second vCPU
will never be plugged, and the "xm save" command will never return.  This
happens on both i386 and x86_64.
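
A hang like this is easier to catch unattended if each step runs under a time
limit.  What follows is only a sketch of the same loop in Python, not the
script used for this report.  The helper names run_step and save_restore_loop
and the 600-second limit are assumptions made for illustration; the "xm save"
and "xm restore" invocations, the save path and the sleeps mirror the script
above.

```python
import os
import subprocess
import time


def run_step(cmd, timeout_s):
    """Run cmd and classify the outcome as "ok", "failed" or "hung".

    "hung" means the command did not finish within timeout_s seconds;
    subprocess.run() kills the child before raising TimeoutExpired.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def save_restore_loop(domname, timeout_s=600):
    """Repeat save/restore until a step fails or hangs.

    Returns (count, step, outcome) for the first step that did not succeed.
    timeout_s is an assumed limit, not a value taken from the report.
    """
    save_dir = "/var/lib/xen/save"
    save_file = os.path.join(save_dir, domname + "-save")
    os.makedirs(save_dir, exist_ok=True)
    count = 0
    while True:
        if os.path.exists(save_file):
            os.remove(save_file)
        steps = (
            ("save", ["xm", "save", domname, save_file], 20),
            ("restore", ["xm", "restore", save_file], 40),
        )
        for step, cmd, pause_s in steps:
            print("Doing %s %d..." % (step, count))
            outcome = run_step(cmd, timeout_s)
            if outcome != "ok":
                return count, step, outcome
            time.sleep(pause_s)
        count += 1
```

Note that the time limit only kills the xm client on the dom0.  The guest
itself is left in whatever state it was in, so it can still be inspected or
core-dumped afterwards.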

Comment 1 Chris Lalancette 2007-04-18 14:41:54 UTC

I've taken a core dump of the PV domain while it was "stuck".  What it looks
like is happening is that the "suspend" thread that is kicked off to do the
suspend never completes.  Looking at the backtrace, it is hung up in
unregister_xenbus_watch(), trying to acquire the xenwatch_mutex at the end.
The mutex is already held, so the down() blocks, which puts the suspend thread
to sleep, and it looks like it just never wakes up again.  Because of this, we
never reach smp_resume() (which explains why the other vCPUs don't get plugged
back in).  Besides unregister_xenbus_watch(), the other two users of
xenwatch_mutex are xenbus_register_driver_common() and xenwatch_thread().  This
certainly feels like a race with one of these two; I'll next try to see which
one.

Chris Lalancette

Comment 2 Chris Lalancette 2007-04-18 15:38:38 UTC

Gah.  This goes still deeper.  The xenwatch thread got woken up to handle a vcpu
hotplug event.  It is currently executing the callback for that, which is
handle_vcpu_hotplug_event() -> vcpu_hotplug() -> cpu_up().  However, in
cpu_up(), the first down_interruptible(&cpucontrol) blocked.  So, because the
xenwatch thread is holding the xenwatch_mutex while it waits for the cpucontrol
semaphore (and is never woken up), the suspend thread can never acquire
xenwatch_mutex, which causes the whole problem.  Now to figure out why
cpucontrol isn't getting released.

Chris Lalancette

Comment 3 Chris Lalancette 2007-04-18 15:54:39 UTC

This may be relevant:

http://lists.xensource.com/archives/html/xen-changelog/2007-02/msg00414.html

I'm going to test this out next.

Chris Lalancette

Comment 4 Chris Lalancette 2007-04-19 21:57:33 UTC

OK, after talking with Don some more, more details about what is actually going
on when we have the bug in "xm restore":

1.  When booting, we register a xenbus watch to look for shutdown events, which
include suspend (in drivers/xen/core/reboot.c)
2.  When we get a shutdown event for suspend via xenbus, we fire
shutdown_handler(), which schedules work for "shutdown_work".
3.  shutdown_work eventually calls __shutdown_handler()
4.  In the case of suspend, it creates a "suspend" kthread on cpu 0, using the
__do_suspend() function.
5.  __do_suspend() does the necessary teardown of everything, including
suspending the xenbus with xenbus_suspend().  Then it calls
"HYPERVISOR_suspend", which puts the guest into suspend.
6.  On resume, the suspend kthread continues right after the HYPERVISOR_suspend
call.  It does things like re-enable timers, gnttab, etc.  Eventually it gets to
xenbus_resume().
7.  xenbus_resume() ends up calling bus_for_each_dev on the xenbus_frontend;
this calls resume_dev() for each of the devices on the bus.
8.  In turn, resume_dev() -> talk_to_otherend() -> free_otherend_watch() ->
unregister_xenbus_watch()
9.  unregister_xenbus_watch() ends up unwatching the node, canceling pending
watch events, and then down() on the xenwatch_mutex.

Step 9 blocks, because the xenwatch_mutex is being held by the xenwatch kthread.
It looks like by the time we get to unregister_xenbus_watch(), the xenbus
kthread may have already received a "vcpu_hotplug" event, which causes the
xenwatch kthread to wake up and take the xenwatch_mutex.  However, it never lets
go of the mutex.  This is what happens in xenwatch_thread():

1.  Woken up because the xenbus kthread did a wakeup on "watch_events_waitq".
2.  down() on xenwatch_mutex
3.  Now pulls the event off of the pending list and calls the callback for it.
4.  In this case, the callback is handle_vcpu_hotplug_event() -> vcpu_hotplug()
-> cpu_up()
5.  cpu_up() tries to take the cpucontrol semaphore, but blocks because
smp_suspend() had already taken, and held onto, that lock before going to sleep.

In the link mentioned previously, the solution was to slightly rewrite
smp_suspend()/smp_resume() so as not to take that lock at all.  I'm still
testing out a modified version of this patch.
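
To make the circular wait explicit, here is a small user-space model in Python.
It is only a sketch of the state described above, not kernel code: the function
find_wait_cycle and the two dictionaries are hypothetical names, and the lock
and thread names are simply labels taken from this comment.

```python
def find_wait_cycle(holder, waiting_for, start):
    """Follow thread -> wanted lock -> holding thread links from start.

    holder maps a lock name to the thread holding it; waiting_for maps a
    thread to the lock it is blocked on.  Returns the chain as a list if it
    leads back to start (a deadlock involving start), otherwise None.
    """
    chain = [start]
    seen = {start}
    thread = start
    while thread in waiting_for:
        lock = waiting_for[thread]
        chain.append(lock)
        owner = holder.get(lock)
        if owner is None:
            return None      # lock is free, so this wait will end
        chain.append(owner)
        if owner == start:
            return chain     # back where we began: circular wait
        if owner in seen:
            return None      # a cycle exists, but start is not part of it
        seen.add(owner)
        thread = owner
    return None              # last thread in the chain is not blocked


# State at the moment of the hang, as described in this comment.
holder = {"cpucontrol": "suspend", "xenwatch_mutex": "xenwatch"}
waiting_for = {"suspend": "xenwatch_mutex", "xenwatch": "cpucontrol"}

# Same situation if smp_suspend() did not hold cpucontrol across the suspend.
holder_without_cpucontrol = {"xenwatch_mutex": "xenwatch"}
```

With the first state, find_wait_cycle(holder, waiting_for, "suspend") walks
suspend -> xenwatch_mutex -> xenwatch -> cpucontrol -> suspend.  With
holder_without_cpucontrol it returns None, because cpu_up() would find
cpucontrol free and the xenwatch kthread could finish its callback and release
xenwatch_mutex.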

Chris Lalancette

Comment 5 Chris Lalancette 2007-04-23 13:02:23 UTC

Created attachment 153275 [details]
PATCH 1: Clean-up is_initial_domain

First patch in a series to address this problem.  This one just does some
cleanup around the is_initial_domain() macro.

Comment 6 Chris Lalancette 2007-04-23 13:04:53 UTC

Created attachment 153276 [details]
PATCH 2: Clean-up hotplug files

Second patch in a series to address this problem.  This patch cleans up the
hotplug files, mostly drivers/xen/core/cpu_hotplug.c and
drivers/xen/core/smpboot.c

Comment 7 Chris Lalancette 2007-04-23 13:06:46 UTC

Created attachment 153277 [details]
PATCH 3: Fix cpu_hotplug_lock deadly embrace

Third patch in a series to address this issue.  This takes the upstream Xen
fix and backports it to RHEL-4; this mostly involves not taking the cpu
hotplug lock (since it is not strictly necessary).

Comment 8 Chris Lalancette 2007-04-23 13:09:15 UTC

Created attachment 153278 [details]
PATCH 4: Fix "hlt" instruction for i386 and x86_64

Fourth patch in a series to address this problem.  Currently in RHEL-4 PV, we
issue the "hlt" instruction when we are stopping an SMP CPU.  I don't think this
is actually a problem, but we are not being a good Xen citizen this way.  Fix
this up to be like RHEL-5; namely, make a hypercall to tell the HV that this CPU
is now dead from the domU point-of-view.

Comment 9 Chris Lalancette 2007-04-23 13:10:31 UTC

Note that with the above 4 patches, the save/restore loop is still failing,
although it takes a much longer time, and it only fails when using 4 vCPUs. 
Apparently there is still work to be done here.

Chris Lalancette

Comment 10 Chris Lalancette 2007-04-24 13:33:27 UTC

Created attachment 153350 [details]
PATCH 5: Fix up evtchn to not smp_call_function

Fifth patch in a series to address this issue.  This patch takes out an
smp_call_function in the suspend path that seems entirely superfluous, and was
also causing problems on x86_64.

Comment 11 Chris Lalancette 2007-04-24 14:00:07 UTC

With the above 5 patches applied, I have an x86_64, 4 vCPU test that has been
running for > 12 hours now.  That's the longest it has ever run.  Next I'm
sending the build through brew and will test both x86_64 and i386.

Chris Lalancette

Comment 12 Chris Lalancette 2007-04-30 20:09:22 UTC

Created attachment 153821 [details]
PATCH 3 (revised): Fix cpu_hotplug_lock deadly embrace

Slightly updated PATCH 3 to fix the deadly embrace.  With these 5 patches
applied, I no longer see the hangs.  However, I am noticing that per-CPU kernel
threads (like aio/1, for instance) are NOT getting properly migrated onto their
appropriate CPUs.  Besides the obvious performance problems, this can also
cause certain CPUs to never get scheduled again, and, under heavy load (such as
a -j5 kernel build), can cause all CPUs never to get scheduled.  This needs to
be fixed as well before this bug is busted.

Chris Lalancette

Comment 13 Chris Lalancette 2007-05-01 18:28:45 UTC

Created attachment 153881 [details]
PATCH 6: Fix RHEL-4 workqueue to correctly bind kernel threads

Patch 6 in a series to address this issue.  This patch introduces a fix into
the workqueue code to re-bind per-CPU kernel threads to the appropriate
processors on a CPU hotplug.

Comment 14 Chris Lalancette 2007-05-02 15:49:58 UTC

Created attachment 153962 [details]
PATCH 6: Fix RHEL-4 workqueue to correctly bind kernel threads (revised)

Sixth patch in a series to address this issue.  This is a respin of the
previous patch; after discussion with ddd, the last patch was overly complex
for no reason.  It also wasn't generated with -p.  This just cleans it up.

Chris Lalancette

Comment 15 Chris Lalancette 2007-05-02 19:17:34 UTC

Created attachment 153977 [details]
PATCH 3 (revised again): Fix cpu_hotplug_lock deadly embrace

Third patch in a series to address this issue.  I attached the wrong patch last
time; this is the right patch to fix the deadly embrace.

Comment 16 RHEL Program Management 2007-05-09 04:45:15 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Chris Lalancette 2007-05-29 21:08:09 UTC

This changeset:

http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e17224bf1d01b461ec02a60f5a9b7657a89bdd23

Might also be involved; I'm testing it out now.

Chris Lalancette

Comment 18 Chris Lalancette 2007-05-31 20:14:33 UTC

That last changeset didn't seem to make a difference; here's what I've found out
on the current problem:

1)  We are running into a race condition while the suspend thread is in
cpu_down().
2)  One of the things cpu_down() does is cpu_attach_domain() of all of the
online CPUs to the "dummy" domain.
3)  cpu_attach_domain() sets up a completion, kicks the migration thread (via
wake_up_process()), and then does a wait_for_completion().
4)  wake_up_process() attempts to wake up the migration thread; however, when
the race happens, the migration thread is already in state TASK_RUNNING,
although it is in schedule().  Because of this, wake_up_process() returns
immediately without actually kicking the migration thread, and then we wait
forever on the completion, since the migration thread never wakes up to
service it.

I'm not quite sure how the migration thread is getting into this state; it seems
that every time it goes into schedule(), it should already be in state
TASK_INTERRUPTIBLE.  So either that assumption is wrong, or some other thread is
coming along AFTER the migration thread has called schedule() and is changing
its state to TASK_RUNNING.
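
The lost wakeup in steps 3 and 4 can be shown with a toy model.  This is a
sketch in Python that mimics only the behaviour described in this comment, not
the kernel's scheduler.  The Task class and its woken flag are hypothetical,
and the model assumes (as observed, but not yet explained) that a migration
thread that was not actually woken never services the request.

```python
from dataclasses import dataclass

TASK_RUNNING = "TASK_RUNNING"
TASK_INTERRUPTIBLE = "TASK_INTERRUPTIBLE"


@dataclass
class Task:
    state: str
    woken: bool = False   # set only when a wake-up actually takes effect


def wake_up_process(task):
    """Toy wake-up: a task already marked TASK_RUNNING is left untouched."""
    if task.state == TASK_RUNNING:
        return False
    task.state = TASK_RUNNING
    task.woken = True
    return True


def cpu_attach_domain(migration_thread):
    """Return True if the completion would be signalled, False if the
    caller would wait on it forever.

    Models steps 3 and 4: kick the migration thread, then wait.  In this
    model the request is serviced only if the kick took effect.
    """
    wake_up_process(migration_thread)
    return migration_thread.woken
```

With a migration thread sleeping in TASK_INTERRUPTIBLE,
cpu_attach_domain(Task(TASK_INTERRUPTIBLE)) returns True.  With the racy state,
cpu_attach_domain(Task(TASK_RUNNING)) returns False, which corresponds to the
suspend thread sitting in wait_for_completion() indefinitely.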

Chris Lalancette

Comment 19 Chris Lalancette 2007-06-25 13:42:33 UTC

I just posted a series of 9 patches to fix this issue.  Note that the first 6 of
them are similar to the ones posted here, but the ones attached here are not the
full set; I'll upload the newer ones at some point.

Chris Lalancette

Comment 21 Jason Baron 2007-07-03 22:39:53 UTC

Committed in stream U6 build 55.16.  A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/

Comment 25 errata-xmlrpc 2007-11-15 16:25:18 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html

Comment 26 Don Domingo 2008-04-02 02:12:22 UTC

Hi,
the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at
which point no further additions or revisions will be entertained.

a mockup of the RHEL5.2 release notes can be viewed at the following link:
http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don
