Description of problem:

# virsh attach-disk rhel4.7_x86_64_pv_guest /var/lib/xen/images/block1 /dev/xvdbb --driver tap --subdriver aio --mode shareable
# virsh detach-disk rhel4.7_x86_64_pv_guest /dev/xvdbb
# virsh list
 Id Name                     State
----------------------------------
  0 Domain-0                 running
134 RH3_x86_64_fv            blocked
138 RH52_x86_64_pv_guest     blocked
144 RH53_x86_64_pv_guest     blocked
148 rhel4.7_x86_64_pv_guest  blocked
149 rhel4.7_x86_64_hvm_guest blocked
# virsh console rhel4.7_x86_64_pv_guest
libvir: Xen Daemon error : internal error failed to parse Xend domain information
error: failed to get domain 'rhel4.7_x86_64_pv_guest'

Xend goes haywire again and can't recover.

Version-Release number of selected component (if applicable):
xen-3.0.3-79.el5

How reproducible:
Very

Steps to Reproduce:
1. Attach a tap:aio disk to a running PV guest: virsh attach-disk rhel4.7_x86_64_pv_guest /var/lib/xen/images/block1 /dev/xvdbb --driver tap --subdriver aio --mode shareable
2. Detach it again: virsh detach-disk rhel4.7_x86_64_pv_guest /dev/xvdbb
3. Run any operation against the domain, e.g. virsh console or virsh dumpxml.

Actual results:
Every libvirt operation on the domain fails with "internal error failed to parse Xend domain information", and xend cannot recover.

Expected results:
The disk attaches and detaches cleanly, or any failure is reported without corrupting xend's view of the domain.

Additional info:
xend-debug.log:

Exception in thread Thread-482954:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1074, in maybeRestart
    {"destroy" : self.destroy,
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1861, in restart
    config = self.sxpr()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1305, in sxpr
    sxpr += disks()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1296, in disks
    config = vbd.configuration(disk)
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/blkif.py", line 119, in configuration
    result = DevController.configuration(self, devid)
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 246, in configuration
    raise VmError("Device %s not connected" % devid)
VmError: Device 51712 not connected
Hmm, can you report the following after trying the operation?

virsh dumpxml vmname
xm list --long vmname
cat /etc/xen/vmname

I assume the VM is running when you try the attach-disk? If you restart xend to fix the whole mess, can you then 'virsh start' the VM, or does it fail?
Yes, the VM is running when I do that. I am pasting the information you need, along with what I see in the guest when I try to attach the disk. Restarting xend doesn't help things; I had to reboot dom0 to tame it :(

# virsh attach-disk rhel4.7_x86_64_pv_guest /var/lib/xen/images/block1 /dev/xvdbb --driver tap --subdriver aio --mode shareable

# virsh console rhel4.7_x86_64_pv_guest
vbd vbd-268449024: 2 reading virtual-device
vbd vbd-268449024: 2 xenbus_dev_probe on device/vbd/268449024

GUEST# cat /proc/partitions
major minor  #blocks  name
 202     0   3145728  xvda
 202     1    104391  xvda1
 202     2   3036285  xvda2
 253     0   1966080  dm-0
 253     1   1015808  dm-1

GUEST# dmesg | tail
ip_tables: (C) 2000-2002 Netfilter core team
ip_tables: (C) 2000-2002 Netfilter core team
SELinux:  initialized (dev rpc_pipefs, type rpc_pipefs), uses genfs_contexts
NET: Registered protocol family 10
Disabled Privacy Extensions on device ffffffff80352e00(lo)
IPv6 over IPv4 tunneling driver
divert: not allocating divert_blk for non-ethernet device sit0
eth0: no IPv6 routers present
vbd vbd-268449024: 2 reading virtual-device
vbd vbd-268449024: 2 xenbus_dev_probe on device/vbd/268449024

# virsh detach-disk rhel4.7_x86_64_pv_guest /dev/xvdbb

# virsh dumpxml rhel4.7_x86_64_pv_guest
libvir: Xen Daemon error : internal error failed to parse Xend domain information
error: failed to get domain 'rhel4.7_x86_64_pv_guest'

# xm list --long rhel4.7_x86_64_pv_guest
Error: Device 268449024 not connected
Usage: xm list [options] [Domain, ...]

List information about all/some domains.
  -l, --long    Output all VM details in SXP
  --label       Include security labels

# cat /etc/xen/rhel4.7_x86_64_pv_guest
name = "rhel4.7_x86_64_pv_guest"
uuid = "35c91d12-45ce-e885-c4eb-04636bd72745"
maxmem = 512
memory = 512
vcpus = 1
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [ ]
disk = [ "tap:aio:/var/lib/xen/images/rhel4.7_x86_64_pv_guest.img,xvda,w" ]
vif = [ "mac=00:16:3e:3f:d3:36,bridge=xenbr0" ]

# service xend restart
restart xend: [ OK ]

# virsh list
 Id Name                     State
----------------------------------
  0 Domain-0                 running
  1 RH3_x86_64_fv            no state
  2 rhel4.7_x86_64_pv_guest  blocked

# virsh destroy rhel4.7_x86_64_pv_guest
libvir: Xen Daemon error : internal error failed to parse Xend domain information
error: failed to get domain 'rhel4.7_x86_64_pv_guest'
Just as additional info: file:-based devices show the same behavior; phy: devices attach and detach fine.
Created attachment 327352 [details] Before attach
Created attachment 327353 [details] xenstore-ls after attach
Created attachment 327354 [details] xenstore-ls after detach
Comparing the before-attach vs. after-attach xenstore logs, there's one clear issue there:

  error = ""
+ device = ""
+  vbd = ""
+   268449024 = ""
+    error = "2 xenbus_dev_probe on device/vbd/268449024"

Something went wrong when attaching the device. I suspect this will then contribute to the problems on detach, though clearly detach ought to be made robust against incomplete attach operations too.
Oh, and we should also make XenD itself robust against broken attached devices. E.g. this command:

# xm list --long rhel4.7_x86_64_pv_guest
Error: Device 268449024 not connected
Usage: xm list [options] [Domain, ...]

should never, under any circumstances, throw an error like this. If it encounters an error, XenD should ignore that device and continue. This would let us actually tear down the broken guest without having to restart XenD. A minimal sketch of the idea follows.
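To make the "ignore and continue" suggestion concrete, here is a self-contained Python sketch (Python 2, matching the python2.4 tracebacks above). VmError and the configuration() call mirror the names in the xend traceback; DeviceController, its device table, and disk_sxpr() are hypothetical stand-ins for xend's real code, not the actual patch:

# Illustrative sketch only -- not real xend code. VmError and
# configuration() are named after the traceback above; everything else
# (DeviceController, its device dict, disk_sxpr) is hypothetical.

class VmError(Exception):
    pass

class DeviceController(object):
    def __init__(self, devices):
        # devid -> config, or None for a device whose attach never completed
        self.devices = devices

    def configuration(self, devid):
        config = self.devices.get(devid)
        if config is None:
            raise VmError("Device %s not connected" % devid)
        return config

def disk_sxpr(controller):
    """Build the disk part of the domain SXP, skipping broken devices
    instead of letting one bad entry abort the whole listing."""
    sxpr = []
    for devid in sorted(controller.devices):
        try:
            sxpr.append(['device', controller.configuration(devid)])
        except VmError, err:  # Python 2.4 except syntax, as used by xend
            print "warning: ignoring device %s: %s" % (devid, err)
    return sxpr

if __name__ == '__main__':
    ctrl = DeviceController({51712: ['vbd', 'xvda'], 268449024: None})
    print disk_sxpr(ctrl)  # lists xvda, warns about the broken device

With per-device error handling like this, 'xm list --long' and guest teardown would keep working even with a half-attached device left in xenstore.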
These two error messages are key to understanding why the attach didn't complete successfully:

vbd vbd-268449024: 2 reading virtual-device
vbd vbd-268449024: 2 xenbus_dev_probe on device/vbd/268449024

In the RHEL-4 kernel driver, drivers/xen/blkfront/blkfront.c, in the blkfront_probe() method:

	/* FIXME: Use dynamic device id if this is not set. */
	err = xenbus_scanf(XBT_NIL, dev->nodename,
			   "virtual-device", "%i", &vdevice);
	if (err != 1) {
		/* go looking in the extended area instead */
		err = xenbus_scanf(XBT_NIL, dev->nodename,
				   "virtual-device-ext", "%i", &vdevice);
		if (err != 1) {
			xenbus_dev_fatal(dev, err, "reading virtual-device");
			return err;
		}
	}

So it failed to read the 'virtual-device-ext' field, but it is clearly present in xenstore:

268449024 = ""
 virtual-device-ext = "268449024"
 state = "6"
 device-type = "disk"
 protocol = "x86_64-abi"
 backend-id = "0"
 backend = "/local/domain/0/backend/tap/1/268449024"

One possible idea though: when you attached, you used '/dev/xvdbb'. Could you try again with just 'xvdbb', i.e. no /dev prefix?

Could you also see if it still has trouble when using a non-extended device node, i.e. 'xvdc'. Devices beyond xvdz hit a different codepath that was only recently added, so they have more potential failure scenarios than a non-extended device.
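For reference, the 268449024 in those messages is just the Xen extended virtual-device encoding of 'xvdbb'. A small sketch, assuming the standard extended-ID scheme ((1 << 28) | (disk << 8) | partition), which reproduces the number in the logs; the function name is ours:

# Sketch of the Xen extended virtual-device encoding, assuming the
# (1 << 28) | (disk << 8) | partition scheme used for extended devices.
EXT_SHIFT = 28

def xvd_ext_devid(name, partition=0):
    """Map an xvdXX name to its extended virtual-device number.
    xvda -> disk 0, xvdz -> 25, xvdaa -> 26, xvdbb -> 53, ..."""
    letters = name[len('xvd'):]
    disk = 0
    for ch in letters:
        disk = disk * 26 + (ord(ch) - ord('a') + 1)
    disk -= 1
    return (1 << EXT_SHIFT) | (disk << 8) | partition

print xvd_ext_devid('xvdbb')  # 268449024, the ID in the error messages

By contrast, the "Device 51712 not connected" in the xend traceback is the classic major/minor encoding of xvda (202 << 8 = 51712, matching major 202 in /proc/partitions above), i.e. the breakage eventually spread to the guest's boot disk entry as well.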
Just talked with Gurhan; his guest has 2.6.9-78.ELxenU, but support for extended devices wasn't added until:

* Mon Oct 13 2008 Vivek Goyal <vgoyal> [2.6.9-78.14]
- xen: fix blkfront to accept 16 devices (Chris Lalancette) [455756]

That explains the error upon attach.

So the core problem here is that XenD is unable to detach a device if the attach operation didn't handshake with the guest correctly. We need to fix XenD in this regard, but I don't think this is serious enough to block the release.
From bburns: Not a new bug. Clear 5.3 and blocker and move to 5.4. Danpb and Gurhan agree.
(In reply to comment #10)
> Just talked with Gurhan; his guest has 2.6.9-78.ELxenU, but support for
> extended devices wasn't added until:
>
> * Mon Oct 13 2008 Vivek Goyal <vgoyal> [2.6.9-78.14]
> - xen: fix blkfront to accept 16 devices (Chris Lalancette) [455756]
>
> That explains the error upon attach.
>
> So the core problem here is that XenD is unable to detach a device if the
> attach operation didn't handshake with the guest correctly. We need to fix
> XenD in this regard, but I don't think this is serious enough to block the
> release.

Sigh. This is nasty; we had a discussion about this type of failure long ago with XenSource, and their take on it was to add the --force flag to xm block-detach (which we now support, and which will probably work in this instance). I agree, though, that this is not a nice way to handle the problem, and we should also fix the bug with xend being unable to tear down domains when this happens.

Chris Lalancette
The main problem here, actually, is that xm block-detach used to remove some records from xenstore even if the actual detach failed, leaving xenstore in an inconsistent state which xend does not like. This is fixed by a patch proposed in RHBZ 484110, and with that patch applied everything works fine; a sketch of the ordering issue follows below.

The other thing is whether we should make xend more robust against such errors in xenstore. If so, we could probably open another bug for that.

Anyway, I'd close this bug as a duplicate of 484110 if there are no objections, as it seems we actually hit several different issues here...
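To illustrate the ordering bug described above (this is not the actual 484110 patch, which lives in xend's xenstore handling), here is a hedged Python sketch. The store dict and the helper functions are hypothetical stand-ins for xenstore and the xenbus state machine; state 6 is Closed, as in the 'state = "6"' entry earlier in this bug:

import time

# Illustrative sketch only; store, request_close(), wait_for_state() and
# remove_records() are hypothetical stand-ins for xenstore and xend code.
XENBUS_CLOSED = 6  # xenbus state 6 == Closed (cf. state = "6" above)

def request_close(store, devid):
    # Signal the backend so it starts tearing the device down.
    store[devid]['online'] = 0

def wait_for_state(store, devid, state, timeout=1.0):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if store.get(devid, {}).get('state') == state:
            return True
        time.sleep(0.05)
    return False

def remove_records(store, devid):
    del store[devid]

def detach_broken(store, devid):
    # Old behaviour (sketch): remove the records without waiting for the
    # device to close.  If the guest never completed the attach handshake,
    # xenstore is left half-torn-down and xend chokes on it.
    remove_records(store, devid)

def detach_fixed(store, devid):
    # Fixed ordering (sketch): request the close, wait for Closed, and
    # only then remove the records.
    request_close(store, devid)
    if not wait_for_state(store, devid, XENBUS_CLOSED):
        raise RuntimeError("device %s did not close; keeping records" % devid)
    remove_records(store, devid)

if __name__ == '__main__':
    store = {268449024: {'state': XENBUS_CLOSED}}
    detach_fixed(store, 268449024)
    print store  # {} -- records removed only once the device had closed

Keeping the records until the close handshake completes means a failed detach leaves xenstore consistent, so xend can still parse the domain and the detach can be retried (or forced) later.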
Jiri, I tend to agree with your last statement. As long as we have a fix that keeps xend from going crazy, that's fine. I would close this as a dup of 484110, and we can address further issues in other BZs if need be.

Chris Lalancette
*** This bug has been marked as a duplicate of bug 484110 ***