Bug 476158

Summary: rapid consecutive block attach/detach commands garble distributed vbd state

Product: Red Hat Enterprise Linux 5 | Reporter: Gurhan Ozen <gozen>
Component: kernel-xen | Assignee: Xen Maintenance List <xen-maint>
Status: CLOSED CANTFIX | QA Contact: Martin Jenner <mjenner>
Severity: high | Docs Contact:
Priority: high
Version: 5.3 | CC: jburke, lersek, pbonzini, xen-maint
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: | Doc Type: Bug Fix
Doc Text: | Story Points: ---
Clone Of: | Environment:
Last Closed: 2011-05-17 14:16:49 UTC | Type: ---
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: --- | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 514491
Attachments: reproducer (attachment 362527)
Description Gurhan Ozen 2008-12-12 05:24:54 UTC
Additional info: I tested this on an x86_64 host with x86_64 PV and HVM guests and an i386 HVM guest (no i386 PV guest was available on the machine because of another bug). They all crash, and xend has to be restarted before the guest can be destroyed. However, in the case of the i386 HVM guest things get worse: I could not get it destroyed at all, and ended up with errors like

[2008-12-12 00:51:37 xend 1261] DEBUG (XendDomain:208) Cannot recreate information for dying domain 87. Xend will ignore this domain from now on.

Just as an FYI, this is probably a race condition. Gurhan told me (and my own tests confirm) that adding a sleep in the for loop above allows it to work. Probably xenstore is adding and removing nodes very quickly, and we race inside the callback handlers in the guest. We can probably fix it up with a spinlock; after all, attaching a disk is neither a common nor a performance-critical operation.
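For reference, a minimal sketch of the kind of loop described above (the guest name, backing device, and frontend device name are placeholders, not taken from the actual test):

#!/bin/bash
# Hypothetical single-device loop: attach and detach in quick succession.
DOMAIN=rhel5-guest            # placeholder guest name
for i in $(seq 1 1000); do
    xm block-attach "$DOMAIN" phy:/dev/loop0 xvdd r
    # sleep 1                 # a sleep here is what masks the race, as noted above
    xm block-detach "$DOMAIN" xvdd
done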
Chris Lalancette
Created attachment 362527 [details]
reproducer
It is still possible to get errors in block-attach/detach.
The attached script is a bit more complicated than needed because it started as a Windows testcase, but it is enough to reproduce the issue. It will run attach/detach cycles of 5 devices, with decent pauses in the middle.
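In outline, each cycle does something like the following (a rough sketch rather than the attached script itself; the guest name and loop devices match the output below, while the pause lengths are only illustrative):

#!/bin/bash
# Simplified sketch of the reproducer's attach/detach cycle (not the attached script itself).
DOMAIN=RHEL5-32-HVM
FRONTENDS=(xvdc xvdd xvde xvdf xvdg)
cycle=0
while true; do
    cycle=$((cycle + 1))
    echo "Starting cycle $cycle..."
    for i in 0 1 2 3 4; do
        xm block-attach "$DOMAIN" "phy:/dev/loop$((i + 1))" "${FRONTENDS[$i]}" r
    done
    sleep 5                   # "decent pause" before detaching again
    for i in 0 1 2 3 4; do
        xm block-detach "$DOMAIN" "${FRONTENDS[$i]}"
    done
    sleep 5
done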
After 50-150 attach/detach cycles (which means 250-750 attaches and the same number of detaches) it will start producing unhelpful output:
Starting cycle 618...
xm block-attach RHEL5-32-HVM phy:/dev/loop2 xvdd r
Usage: xm block-attach <Domain> <BackDev> <FrontDev> <Mode>
Create a new virtual block device.
xm block-attach RHEL5-32-HVM phy:/dev/loop3 xvde r
Usage: xm block-attach <Domain> <BackDev> <FrontDev> <Mode>
Create a new virtual block device.
xm block-attach RHEL5-32-HVM phy:/dev/loop4 xvdf r
Usage: xm block-attach <Domain> <BackDev> <FrontDev> <Mode>
Create a new virtual block device.
xm block-attach RHEL5-32-HVM phy:/dev/loop5 xvdg r
Usage: xm block-attach <Domain> <BackDev> <FrontDev> <Mode>
In the last run I made, I noticed that one device attach failed on cycle 85, two on cycle 86, three on cycle 87, four on cycle 88, and all five from cycle 89 onwards. Unfortunately I wasn't cunning enough to save a log or a xenstore dump.
The above behavior apparently also showed up in bug 217853, which was fixed by adding -f. It's possible that fixing the guest-side (domU) bug would get rid of the problem once and for all. The protocols among the participants (xenstore, blktapctrl, dom0/domU, etc.; see http://wiki.xensource.com/xenwiki/blktap#head-47b9cac49ceb0351f57917988f1020a435c680a9 for a detailed architecture diagram) seem to be fertile ground for races. We apparently did try to patch up some of those, but I don't think they can all be fixed under this design -- there are too many agents to synchronize in incremental steps.
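For reference, the -f mentioned above is the force flag of xm block-detach (the domain and device names below are just examples):

# Force the detach even when the device teardown does not complete cleanly.
xm block-detach RHEL5-32-HVM xvdd -f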