Bug 456649 - xenbus suspend_mutex remains locked after transaction failure
Summary: xenbus suspend_mutex remains locked after transaction failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel-xen
Version: 4.7
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
: ---
Assignee: Andrew Jones
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 458302
 
Reported: 2008-07-25 09:52 UTC by Ian Campbell
Modified: 2011-02-16 16:03 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-16 16:03:34 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHSA-2011:0263
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update
Last Updated: 2011-02-16 15:14:55 UTC

Description Ian Campbell 2008-07-25 09:52:10 UTC
Unfortunately the fix for #250381 was reverted in 2.6.9-78.EL by
linux-2.6.9-xen-modifications-to-drivers-xen-files-for-pv-on-h.patch, which is
patch 12300 in the spec file.

linux-2.6.9-xen-xenbus-suspend_mutex-remains-locked-after-trans.patch is 12089.

I was unable to reopen #250381 and bugzilla advised me to make a clone.

+++ This bug was initially created as a clone of Bug #250381 +++

If a xenbus transaction end command fails, it is possible for the suspend_mutex
to remain locked, preventing any further xenbus traffic, e.g.
shutdown/reboot/suspend requests/notifications etc.

Kernel 2.6.9-55.0.2.EL is affected.

Upstream fix is
http://hg.uk.xensource.com/xen-unstable.hg/?cs=bbce4d115189

-- Additional comment from ijc.uk on 2007-12-14 04:00 EST --
Created an attachment (id=288741)
xen-unstable 9921:bbce4d115189 ported to 2.6.9-67.EL

We recently stopped using the rhel4x.hg port from xenbits and switched to using
a set of targeted fixes to your kernels. I have attached the patches from our
queue relevant to this issue.

-- Additional comment from ddutile on 2007-12-14 14:37 EST --
Is there a way to excite/force a transaction end failure, so
a test can be applied to show the problem, and that the fix works?


-- Additional comment from ddutile on 2007-12-14 14:39 EST --
This bug was fixed in 4.6 (in the linux-2.6.9-xen-newfiles.patch,
which included many fixes for hotplug).


-- Additional comment from ijc.uk on 2007-12-14 14:49 EST --
If I remember right you can reproduce by using xenstore-write in a tight loop in
the domU, i.e. something like "while : ; do xenstore-write foo bar ; done"

I checked 2.6.9-67.EL and it still has this problem. Is that not the 4.6 kernel?

-- Additional comment from ddutile on 2007-12-14 15:00 EST --

2.6.9-67.EL is rhel4.6.

The fix that is shown in the attachment in #1 is in 4.6.

So, either you didn't test 4.6, or... the fix isn't sufficient,
or you built a -67 kernel without doing a "make prep", which
would not apply the patch listed in comment #3 to the file (before building).

Do you have the src.rpm for 4.6 to verify (from sources) that
the fix provided is the one in 4.6 ?




-- Additional comment from ijc.uk on 2007-12-15 04:41 EST --
I got my source tree by installing the .src.rpm and running rpmbuild -bp on the
spec file which leaves a source tree in /usr/src/redhat/BUILD/something, I am
pretty certain it has the patches applied or drivers/xen/xenbus/xenbus_xs.c
wouldn't even exist.

linux-2.6.9-xen-newfiles.patch in 2.6.9-67.EL contains as part of
drivers/xen/xenbus/xenbus_xs.c:xenbus_dev_request_and_reply():
+       if ((msg->type == XS_TRANSACTION_END) ||
+           ((req_msg.type == XS_TRANSACTION_START) &&
+            (msg->type == XS_ERROR)))
+               up_read(&xs_state.suspend_mutex);
and if 9921:bbce4d115189 was applied it would contain
+       if ((req_msg.type == XS_TRANSACTION_END) ||
+           ((req_msg.type == XS_TRANSACTION_START) &&
+            (msg->type == XS_ERROR)))
+               up_read(&xs_state.suspend_mutex);

Note the first line which has changed from msg->type to req_msg.type.

-- Additional comment from ddutile on 2007-12-16 22:28 EST --
My bad; I missed the subtlety of msg->type changed to req_msg.type.

I'll post a patch for 4.7 on Monday. Thanks for the test to verify the fix.


-- Additional comment from ijc.uk on 2007-12-17 06:33 EST --
Thanks, I always have to look at that particular patch twice, it's very easy to
mis-read...

-- Additional comment from ddutile on 2007-12-17 15:14 EST --
Well, the patch is actually part rhel5 & part rhel4.

The 'mutex_unlock' is in rhel5, but not rhel4;  rhel4 uses 'up'.

The patch applies, but with a fuzz warning; I'll submit a clean rhel4 patch
that doesn't generate a patch warning.



-- Additional comment from ijc.uk on 2007-12-17 16:10 EST --
Yes, somehow quilt still applies the patch even though the context clearly
doesn't match -- I hadn't noticed that before.

-- Additional comment from bburns on 2008-01-04 14:14 EST --
Reopening for Don Dutile. Setting flags for 4.7.

-- Additional comment from vgoyal on 2008-03-03 15:39 EST --
Committed in 68.16.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

-- Additional comment from errata-xmlrpc on 2008-07-24 15:14 EST --
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

Comment 2 Don Dutile (Red Hat) 2008-07-25 16:34:37 UTC
Ian,

I ran the above test loop:
     while : ; do xenstore-write foo bar ; done

in one dom0 window, and in another dom0, ran an infinite save/restore loop on
the domU.  I could not cause the save/restore to fail/hang/stop, which is what I
would expect if xenbus transaction processing was hung due to suspend_mutex
remaining locked.

Is there some other test you can recommend ?
Without a valid regression test/cause-effect, acking the patch will be tough to
do (in 4.8).

- Don


Comment 3 Ian Campbell 2009-01-28 10:54:37 UTC
I've just noticed the old needinfo on this bug. I could have sworn I responded at the time but I must have written it and not hit send/submit or something.

My memory of this bug is very fuzzy but I think you need to run the while ... xenstore-write... loop in a domU which is being repeatedly suspended and resumed, rather than running it in the dom0 as you were doing (having a loop in both dom0 and domU can't hurt I suppose...)

Comment 5 Andrew Jones 2009-07-01 18:24:49 UTC
This is a difficult bug to recreate, but the proposed patch has been integrated into a test build at http://people.redhat.com/drjones/virttest/1-2/. The build is available for anyone who has seen the bug and would like to test the patch to see if it goes away.

Also note that the link in the description pointing to the upstream patch is out of date; it can now be found at http://xenbits.xensource.com/xen-unstable.hg?rev/bbce4d115189

Comment 7 RHEL Program Management 2010-10-12 17:51:24 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Vivek Goyal 2010-10-13 16:11:29 UTC
Committed in 89.42.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 10 Jinxin Zheng 2011-01-10 10:07:19 UTC
Confirmed the patch is in -94.EL.

Never reproduced this. A few rhel4 patches were simply integrated back then because they looked safe, and we got runtime on them by virtue of their being integrated. I guess sanity checking is the best we can do.

Comment 11 errata-xmlrpc 2011-02-16 16:03:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html

