If a xenbus transaction end command fails, the suspend_mutex can remain locked, which prevents any further xenbus traffic (e.g. shutdown/reboot/suspend requests and notifications). Kernel 2.6.9-55.0.2.EL is affected. The upstream fix is http://hg.uk.xensource.com/xen-unstable.hg/?cs=bbce4d115189
Created attachment 288741: xen-unstable 9921:bbce4d115189 ported to 2.6.9-67.EL. We recently stopped using the rhel4x.hg port from xenbits and switched to using a set of targeted fixes to your kernels. I have attached the patches from our queue relevant to this issue.
Is there a way to excite/force a transaction end failure, so a test can be applied to show the problem, and that the fix works?
This bug was fixed in 4.6, in linux-2.6.9-xen-newfiles.patch, which included many fixes for hotplug.
If I remember right, you can reproduce it by running xenstore-write in a tight loop in the domU, i.e. something like "while : ; do xenstore-write foo bar ; done". I checked 2.6.9-67.EL and it still has this problem. Is that not the 4.6 kernel?
2.6.9-67.EL is rhel4.6. The fix shown in the attachment in #1 is in 4.6. So either you didn't test 4.6, or the fix isn't sufficient, or you built a -67 kernel without doing a "make prep", in which case the patch listed in comment #3 would not have been applied to the file before building. Do you have the src.rpm for 4.6 to verify (from sources) that the fix provided is the one in 4.6?
I got my source tree by installing the .src.rpm and running rpmbuild -bp on the spec file, which leaves a source tree in /usr/src/redhat/BUILD/something. I am pretty certain it has the patches applied, or drivers/xen/xenbus/xenbus_xs.c wouldn't even exist.
linux-2.6.9-xen-newfiles.patch in 2.6.9-67.EL contains, as part of drivers/xen/xenbus/xenbus_xs.c:xenbus_dev_request_and_reply():
+	if ((msg->type == XS_TRANSACTION_END) ||
+	    ((req_msg.type == XS_TRANSACTION_START) &&
+	     (msg->type == XS_ERROR)))
+		up_read(&xs_state.suspend_mutex);
and if 9921:bbce4d115189 was applied it would contain:
+	if ((req_msg.type == XS_TRANSACTION_END) ||
+	    ((req_msg.type == XS_TRANSACTION_START) &&
+	     (msg->type == XS_ERROR)))
+		up_read(&xs_state.suspend_mutex);
Note the first line, which has changed from msg->type to req_msg.type.
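For anyone comparing the two hunks above, here is a minimal user-space sketch of why the one-word change matters. It is only a model, not the kernel code: the counter suspend_readers stands in for xs_state.suspend_mutex, the helper names request_and_reply() and leaked_after_failed_end() are made up for illustration, and the enum values are arbitrary. It assumes that msg->type holds the reply type by the time the check runs (so a failed transaction end shows up as XS_ERROR), while req_msg.type still holds the original request type.

```c
#include <assert.h>
#include <stdbool.h>

/* Message types for this model; names mirror the hunks above,
 * numeric values are arbitrary. */
enum xs_type { XS_TRANSACTION_START, XS_TRANSACTION_END, XS_WRITE, XS_ERROR };

/* Stand-in for xs_state.suspend_mutex: counts outstanding read holds. */
static int suspend_readers = 0;

/* Hypothetical helper modelling one request/reply round trip.
 * req   = type of the request being sent
 * reply = type carried by the reply that comes back
 * fixed = true to use the req_msg.type check, false for the msg->type check */
static void request_and_reply(enum xs_type req, enum xs_type reply, bool fixed)
{
    enum xs_type req_type = req;  /* models req_msg.type (saved copy) */
    enum xs_type msg_type = req;  /* models msg->type before the reply */

    if (req_type == XS_TRANSACTION_START)
        suspend_readers++;        /* models down_read() */

    msg_type = reply;             /* the reply overwrites msg->type */

    /* The only difference between the two hunks is which field is
     * compared against XS_TRANSACTION_END. */
    enum xs_type end_check = fixed ? req_type : msg_type;
    if (end_check == XS_TRANSACTION_END ||
        (req_type == XS_TRANSACTION_START && msg_type == XS_ERROR)) {
        suspend_readers--;        /* models up_read() */
        assert(suspend_readers >= 0);
    }
}

/* Runs a successful start followed by a failed end and returns how many
 * read holds are still outstanding afterwards. */
static int leaked_after_failed_end(bool fixed)
{
    suspend_readers = 0;
    request_and_reply(XS_TRANSACTION_START, XS_TRANSACTION_START, fixed);
    request_and_reply(XS_TRANSACTION_END, XS_ERROR, fixed);
    return suspend_readers;
}
```

In this model, leaked_after_failed_end(false) leaves one read hold outstanding, which corresponds to the stuck suspend_mutex, while leaked_after_failed_end(true) leaves none.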
My bad; I missed the subtlety of msg->type changed to req_msg.type. I'll post a patch for 4.7 on Monday. Thanks for the test to verify the fix.
Thanks, I always have to look at that particular patch twice, it's very easy to mis-read...
Well, the patch is actually part rhel5 & part rhel4. The 'mutex_unlock' is in rhel5, but not rhel4; rhel4 uses 'up'. The patch applies, but with a fuzz warning; I'll submit a clean rhel4 patch that doesn't generate a patch warning.
Yes, somehow quilt still applies the patch even though the context clearly doesn't match -- I hadn't noticed that before.
Reopening for Don Dutile. Setting flags for 4.7.
Committed in 68.16.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html