Unfortunately the fix for #250381 was reverted in 2.6.9-78.EL by linux-2.6.9-xen-modifications-to-drivers-xen-files-for-pv-on-h.patch, which is patch 12300 in the spec file. linux-2.6.9-xen-xenbus-suspend_mutex-remains-locked-after-trans.patch is patch 12089.

I was unable to reopen #250381 and bugzilla advised me to make a clone.

+++ This bug was initially created as a clone of Bug #250381 +++

If a xenbus transaction end command fails it is possible for the suspend_mutex to remain locked, preventing any further xenbus traffic, e.g. shutdown/reboot/suspend requests/notifications etc.

Kernel 2.6.9-55.0.2.EL is affected.

Upstream fix is http://hg.uk.xensource.com/xen-unstable.hg/?cs=bbce4d115189

-- Additional comment from ijc.uk on 2007-12-14 04:00 EST --
Created an attachment (id=288741)
xen-unstable 9921:bbce4d115189 ported to 2.6.9-67.EL

We recently stopped using the rhel4x.hg port from xenbits and switched to using a set of targeted fixes to your kernels. I have attached the patches from our queue relevant to this issue.

-- Additional comment from ddutile on 2007-12-14 14:37 EST --
Is there a way to excite/force a transaction end failure, so a test can be applied to show the problem, and that the fix works?

-- Additional comment from ddutile on 2007-12-14 14:39 EST --
This bug was fixed in 4.6 (in the linux-2.6.9-xen-newfiles.patch, which included many fixes for hotplug).

-- Additional comment from ijc.uk on 2007-12-14 14:49 EST --
If I remember right you can reproduce by using xenstore-write in a tight loop in the domU, i.e. something like

    while : ; do xenstore-write foo bar ; done

I checked 2.6.9-67.EL and it still has this problem. Is that not the 4.6 kernel?

-- Additional comment from ddutile on 2007-12-14 15:00 EST --
2.6.9-67.EL is rhel4.6. The fix that is shown in the attachment in #1 is in 4.6. So, either you didn't test 4.6, or the fix isn't sufficient, or you built a -67 kernel without doing a "make prep", which would not apply the patch listed in comment #3 to the file (before building).

Do you have the src.rpm for 4.6 to verify (from sources) that the fix provided is the one in 4.6?

-- Additional comment from ijc.uk on 2007-12-15 04:41 EST --
I got my source tree by installing the .src.rpm and running rpmbuild -bp on the spec file, which leaves a source tree in /usr/src/redhat/BUILD/something. I am pretty certain it has the patches applied, or drivers/xen/xenbus/xenbus_xs.c wouldn't even exist.

linux-2.6.9-xen-newfiles.patch in 2.6.9-67.EL contains, as part of drivers/xen/xenbus/xenbus_xs.c:xenbus_dev_request_and_reply():

    +	if ((msg->type == XS_TRANSACTION_END) ||
    +	    ((req_msg.type == XS_TRANSACTION_START) &&
    +	     (msg->type == XS_ERROR)))
    +		up_read(&xs_state.suspend_mutex);

and if 9921:bbce4d115189 was applied it would contain

    +	if ((req_msg.type == XS_TRANSACTION_END) ||
    +	    ((req_msg.type == XS_TRANSACTION_START) &&
    +	     (msg->type == XS_ERROR)))
    +		up_read(&xs_state.suspend_mutex);

Note the first line, which has changed from msg->type to req_msg.type.

-- Additional comment from ddutile on 2007-12-16 22:28 EST --
My bad; I missed the subtlety of msg->type changed to req_msg.type. I'll post a patch for 4.7 on Monday. Thanks for the test to verify the fix.

-- Additional comment from ijc.uk on 2007-12-17 06:33 EST --
Thanks, I always have to look at that particular patch twice, it's very easy to mis-read...

-- Additional comment from ddutile on 2007-12-17 15:14 EST --
Well, the patch is actually part rhel5 & part rhel4. The 'mutex_unlock' is in rhel5, but not rhel4; rhel4 uses 'up'. The patch applies, but with a fuzz warning; I'll submit a clean rhel4 patch that doesn't generate a patch warning.

-- Additional comment from ijc.uk on 2007-12-17 16:10 EST --
Yes, somehow quilt still applies the patch even though the context clearly doesn't match -- I hadn't noticed that before.
-- Additional comment from bburns on 2008-01-04 14:14 EST --
Reopening for Don Dutile. Setting flags for 4.7.

-- Additional comment from vgoyal on 2008-03-03 15:39 EST --
Committed in 68.16.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

-- Additional comment from errata-xmlrpc on 2008-07-24 15:14 EST --
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html
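For anyone trying to follow the msg->type versus req_msg.type distinction from the cloned comments, here is a minimal user-space sketch of it. This is not the kernel code. The hold counter standing in for xs_state.suspend_mutex, the enum values, and all helper names are hypothetical. It also assumes, as I believe the upstream driver does, that the read lock is taken when an XS_TRANSACTION_START request is sent and that a successful reply carries the same type as its request.

```c
#include <assert.h>
#include <stdbool.h>

/* Message types named after the xenbus ones; the values are arbitrary. */
enum xs_type { XS_TRANSACTION_START, XS_TRANSACTION_END, XS_ERROR };

/* Hypothetical stand-in for the read side of xs_state.suspend_mutex. */
static int suspend_holds;

/*
 * Model of one request/reply round trip.
 *   req   - type of the request that was sent
 *   reply - type of the reply received (XS_ERROR on failure)
 *   fixed - false: test the reply type, as in 2.6.9-67.EL
 *           true:  test the request type, as in 9921:bbce4d115189
 */
static void request_and_reply(enum xs_type req, enum xs_type reply, bool fixed)
{
    /* Assumption: the lock is taken when a transaction is started. */
    if (req == XS_TRANSACTION_START)
        suspend_holds++;

    enum xs_type end_check = fixed ? req : reply;

    if (end_check == XS_TRANSACTION_END ||
        (req == XS_TRANSACTION_START && reply == XS_ERROR))
        suspend_holds--;
}

/* Successful start followed by a failed end; returns the holds left over. */
static int holds_after_failed_end(bool fixed)
{
    suspend_holds = 0;
    request_and_reply(XS_TRANSACTION_START, XS_TRANSACTION_START, fixed);
    request_and_reply(XS_TRANSACTION_END, XS_ERROR, fixed);
    return suspend_holds;
}

/* Failed start only; the second clause of the condition covers this case. */
static int holds_after_failed_start(bool fixed)
{
    suspend_holds = 0;
    request_and_reply(XS_TRANSACTION_START, XS_ERROR, fixed);
    return suspend_holds;
}

/* Self-check of the model; call it from a main() of your own. */
static void run_model_checks(void)
{
    assert(holds_after_failed_end(false) == 1);   /* lock leaked */
    assert(holds_after_failed_end(true) == 0);    /* lock released */
    assert(holds_after_failed_start(false) == 0);
    assert(holds_after_failed_start(true) == 0);
}
```

In this model a failed transaction end comes back as XS_ERROR, so the variant that tests the reply type never sees XS_TRANSACTION_END and leaves one hold outstanding. The variant that tests the request type releases it whether or not the end succeeded.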
Ian, I ran the above test loop:

    while : ; do xenstore-write foo bar ; done

in one dom0 window, and in another dom0 window ran an infinite save/restore loop on the domU. I could not cause the save/restore to fail/hang/stop, which is what I would expect if xenbus transaction processing was hung due to suspend_mutex remaining locked.

Is there some other test you can recommend? Without a valid regression test/cause-effect, acking the patch will be tough to do (in 4.8).

- Don
I've just noticed the old needinfo on this bug. I could have sworn I responded at the time but I must have written it and not hit send/submit or something. My memory of this bug is very fuzzy but I think you need to run the while ... xenstore-write... loop in a domU which is being repeatedly suspended and resumed, rather than running it in the dom0 as you were doing (having a loop in both dom0 and domU can't hurt I suppose...)
This is a difficult bug to recreate, but the proposed patch has been integrated into a test build at http://people.redhat.com/drjones/virttest/1-2/. The build is available for anyone who has seen the bug and would like to test whether the patch makes it go away. Also note that the link in the description pointing to the upstream patch is out of date; you can now find it at http://xenbits.xensource.com/xen-unstable.hg?rev/bbce4d115189
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.42.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Confirmed the patch is in -94.EL. Never reproduced this. There were a few rhel4 patches that were simply integrated back then; this one looked safe, and we got runtime on them by virtue of their being integrated. I guess sanity checking is the best we can do.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0263.html