Bug 476656 - block-attach/detach'ing tap:aio issues with rhel 4.7 pv guests
Summary: block-attach/detach'ing tap:aio issues with rhel 4.7 pv guests
Keywords:
Status: CLOSED DUPLICATE of bug 484110
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 444642 492187
 
Reported: 2008-12-16 14:18 UTC by Gurhan Ozen
Modified: 2013-11-04 01:38 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-04-03 08:25:12 UTC
Target Upstream Version:
Embargoed:


Attachments
Before attach (4.93 KB, text/plain), 2008-12-18 18:40 UTC, Gurhan Ozen
xenstore-ls after attach (5.89 KB, text/plain), 2008-12-18 18:41 UTC, Gurhan Ozen
xenstore-ls after detach (5.71 KB, text/plain), 2008-12-18 18:41 UTC, Gurhan Ozen

Description Gurhan Ozen 2008-12-16 14:18:18 UTC
Description of problem:

# virsh attach-disk rhel4.7_x86_64_pv_guest /var/lib/xen/images/block1 /dev/xvdbb --driver tap --subdriver aio --mode shareable
# virsh detach-disk rhel4.7_x86_64_pv_guest /dev/xvdbb

# virsh list
 Id Name                 State
----------------------------------
  0 Domain-0             running
134 RH3_x86_64_fv        blocked
138 RH52_x86_64_pv_guest blocked
144 RH53_x86_64_pv_guest blocked
148 rhel4.7_x86_64_pv_guest blocked
149 rhel4.7_x86_64_hvm_guest blocked

# virsh console rhel4.7_x86_64_pv_guest
libvir: Xen Daemon error : internal error failed to parse Xend domain information
error: failed to get domain 'rhel4.7_x86_64_pv_guest'

Xend goes haywire again and can't recover.

Version-Release number of selected component (if applicable):
xen-3.0.3-79.el5

How reproducible:
Very


Steps to Reproduce:
1. virsh attach-disk rhel4.7_x86_64_pv_guest /var/lib/xen/images/block1 /dev/xvdbb --driver tap --subdriver aio --mode shareable
2. virsh detach-disk rhel4.7_x86_64_pv_guest /dev/xvdbb
3. Run any virsh operation against the guest, e.g. virsh console rhel4.7_x86_64_pv_guest

Actual results:
libvirt reports "internal error failed to parse Xend domain information" and the domain can no longer be managed; restarting xend does not recover it.

Expected results:
The disk attaches and detaches cleanly and the guest remains manageable.

Additional info:
xend-debug.log:
Exception in thread Thread-482954:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1074, in maybeRestart
    {"destroy"        : self.destroy,
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1861, in restart
    config = self.sxpr()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1305, in sxpr
    sxpr += disks()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1296, in disks
    config = vbd.configuration(disk)
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/blkif.py", line 119, in configuration
    result = DevController.configuration(self, devid)
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 246, in configuration
    raise VmError("Device %s not connected" % devid)
VmError: Device 51712 not connected

Comment 1 Cole Robinson 2008-12-16 15:08:16 UTC
Hmm, can you report the following after trying the operation?

virsh dumpxml vmname
xm list --long vmname
cat /etc/xen/vmname

I assume the VM is running when you try the disk attach? If you restart xend to fix the whole mess, can you then 'virsh start' the VM, or does it fail?

Comment 2 Gurhan Ozen 2008-12-16 16:37:31 UTC
Yes, the VM is running when I do that. I am pasting the information you need, along with what I see in the guest when I try the attach. Restarting xend doesn't help things; I had to reboot dom0 to tame it :(


# virsh attach-disk rhel4.7_x86_64_pv_guest /var/lib/xen/images/block1 /dev/xvdbb --driver tap --subdriver aio --mode shareable

# virsh console rhel4.7_x86_64_pv_guest
vbd vbd-268449024: 2 reading virtual-device
vbd vbd-268449024: 2 xenbus_dev_probe on device/vbd/268449024

GUEST# cat /proc/partitions 
major minor  #blocks  name

 202     0    3145728 xvda
 202     1     104391 xvda1
 202     2    3036285 xvda2
 253     0    1966080 dm-0
 253     1    1015808 dm-1
GUEST# dmesg | tail
ip_tables: (C) 2000-2002 Netfilter core team
ip_tables: (C) 2000-2002 Netfilter core team
SELinux: initialized (dev rpc_pipefs, type rpc_pipefs), uses genfs_contexts
NET: Registered protocol family 10
Disabled Privacy Extensions on device ffffffff80352e00(lo)
IPv6 over IPv4 tunneling driver
divert: not allocating divert_blk for non-ethernet device sit0
eth0: no IPv6 routers present
vbd vbd-268449024: 2 reading virtual-device
vbd vbd-268449024: 2 xenbus_dev_probe on device/vbd/268449024

# virsh detach-disk rhel4.7_x86_64_pv_guest  /dev/xvdbb

# virsh dumpxml rhel4.7_x86_64_pv_guest
libvir: Xen Daemon error : internal error failed to parse Xend domain information
error: failed to get domain 'rhel4.7_x86_64_pv_guest'

# xm list --long rhel4.7_x86_64_pv_guest
Error: Device 268449024 not connected
Usage: xm list [options] [Domain, ...]

List information about all/some domains.
  -l, --long                     Output all VM details in SXP               
  --label                        Include security labels                    

# cat /etc/xen/rhel4.7_x86_64_pv_guest
name = "rhel4.7_x86_64_pv_guest"
uuid = "35c91d12-45ce-e885-c4eb-04636bd72745"
maxmem = 512
memory = 512
vcpus = 1
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [  ]
disk = [ "tap:aio:/var/lib/xen/images/rhel4.7_x86_64_pv_guest.img,xvda,w" ]
vif = [ "mac=00:16:3e:3f:d3:36,bridge=xenbr0" ]

# service xend restart
restart xend: [  OK  ]

# virsh list
 Id Name                 State
----------------------------------
  0 Domain-0             running
  1 RH3_x86_64_fv        no state
  2 rhel4.7_x86_64_pv_guest blocked

# virsh destroy rhel4.7_x86_64_pv_guest
libvir: Xen Daemon error : internal error failed to parse Xend domain information
error: failed to get domain 'rhel4.7_x86_64_pv_guest'

Comment 3 Gurhan Ozen 2008-12-16 18:27:56 UTC
Just as additional info:
file:-based devices show the same behavior.
phy: devices attach/detach fine.
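
For comparison, a phy: variant of the same attach looks like this (the block device path here is just a hypothetical example):

# virsh attach-disk rhel4.7_x86_64_pv_guest /dev/VolGroup00/block1 /dev/xvdbb --driver phy --mode shareable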

Comment 4 Gurhan Ozen 2008-12-18 18:40:15 UTC
Created attachment 327352 [details]
Before attach

Comment 5 Gurhan Ozen 2008-12-18 18:41:11 UTC
Created attachment 327353 [details]
xenstore-ls after attach

Comment 6 Gurhan Ozen 2008-12-18 18:41:41 UTC
Created attachment 327354 [details]
xenstore-ls after detach

Comment 7 Daniel Berrangé 2008-12-18 20:29:11 UTC
Comparing the before-attach vs. after-attach xenstore logs, there's one clear issue there:

    error = ""
+    device = ""
+     vbd = ""
+      268449024 = ""
+       error = "2 xenbus_dev_probe on device/vbd/268449024"


Something went wrong when attaching the device. I suspect this will then contribute to the problems on detach, though clearly detach ought to be made robust against incomplete attach operations too.
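
For anyone reproducing this: the attachments are plain xenstore-ls dumps, so the comparison above can be regenerated with something like:

# xenstore-ls > before.txt
# virsh attach-disk rhel4.7_x86_64_pv_guest /var/lib/xen/images/block1 /dev/xvdbb --driver tap --subdriver aio --mode shareable
# xenstore-ls > after.txt
# diff -u before.txt after.txt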

Comment 8 Daniel Berrangé 2008-12-18 20:36:03 UTC
Oh, and we should also make XenD itself robust against broken attached devices.

e.g. this command:

# xm list --long rhel4.7_x86_64_pv_guest
Error: Device 268449024 not connected
Usage: xm list [options] [Domain, ...]

should never ever throw an error like this under any circumstances. If it encounters an error, XenD should ignore that device and continue. This would let us actually tear down the broken guest without having to restart XenD.
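
A rough sketch of the kind of guard this needs, around the disks() loop from the traceback above (names approximated from the traceback; this is not the actual xend source):

    # Sketch: skip devices whose configuration cannot be read, instead of
    # letting a single broken vbd abort the whole sxpr() dump.
    def disks():
        configs = []
        for devid in disk_ids:              # device IDs found in xenstore
            try:
                configs.append(vbd.configuration(devid))
            except VmError, err:            # Python 2.4 syntax, as in xend
                log.warn("Ignoring device %s: %s", devid, err)
        return configs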

Comment 9 Daniel Berrangé 2008-12-18 21:03:11 UTC
These two error messages are key to understanding why 'attach' didn't complete successfully:

vbd vbd-268449024: 2 reading virtual-device
vbd vbd-268449024: 2 xenbus_dev_probe on device/vbd/268449024


In the RHEL-4 kernel driver (drivers/xen/blkfront/blkfront.c), the blkfront_probe() method does:

        /* FIXME: Use dynamic device id if this is not set. */
        err = xenbus_scanf(XBT_NIL, dev->nodename,
                           "virtual-device", "%i", &vdevice);
        if (err != 1) {
                /* go looking in the extended area instead */
                err = xenbus_scanf(XBT_NIL, dev->nodename,
                                   "virtual-device-ext", "%i", &vdevice);
                if (err != 1) {
                        xenbus_dev_fatal(dev, err, "reading virtual-device");
                        return err;
                }
        }

So it failed to read the 'virtual-device-ext' field, but it's clearly present in xenstore:

     268449024 = ""
      virtual-device-ext = "268449024"
      state = "6"
      device-type = "disk"
      protocol = "x86_64-abi"
      backend-id = "0"
      backend = "/local/domain/0/backend/tap/1/268449024"

One possible idea though: when you attached, you used '/dev/xvdbb'. Could you try again with just 'xvdbb', i.e. no /dev prefix?

Could you also see if it still has trouble when using a non-extended device node, i.e. 'xvdc'? Devices beyond xvdz hit a different codepath that was only recently added, so they have more potential failure scenarios than a non-extended device.
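
As an aside, the numeric IDs in these messages can be decoded by hand. A small sketch of the xvd numbering scheme as I read blkfront.c (classic devices use major 202 with 16 minors per disk; extended devices set bit 28 and put the disk index in bits 8 and up -- treat the exact bit layout as an assumption):

    # Decode an xvd name into its device number (sketch, Python 2)
    def xvd_devid(name):                    # name like 'a', 'z' or 'bb'
        idx = 0
        for c in name:                      # bijective base 26: a=1 .. z=26
            idx = idx * 26 + (ord(c) - ord('a') + 1)
        idx -= 1                            # back to a 0-based disk index
        if idx < 26:                        # classic: major 202, 16 minors/disk
            return (202 << 8) | (idx << 4)
        return (1 << 28) | (idx << 8)       # extended: bit 28 | disk << 8

    print xvd_devid('a')   # 51712     -- 'Device 51712' in the traceback (xvda)
    print xvd_devid('bb')  # 268449024 -- the vbd-268449024 above (xvdbb)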

Comment 10 Daniel Berrangé 2008-12-18 21:35:51 UTC
Just talked with Gurhan; his guest has 2.6.9-78.ELxenU,

but support for extended devices wasn't added until

* Mon Oct 13 2008 Vivek Goyal <vgoyal> [2.6.9-78.14]
-xen: fix blkfront to accept 16 devices (Chris Lalancette) [455756]

That explains the error upon attach.

So the core problem here is that XenD is unable to detach a device if the attach operation didn't handshake with the guest correctly. We need to fix XenD in this regard, but I don't think this is serious enough to block the release.

Comment 11 Suzanne Logcher 2008-12-18 22:34:37 UTC
From bburns: 
  Not a new bug. 
  Clear 5.3 and blocker and move to 5.4. 
  Danpb and Gurhan agree.

Comment 12 Chris Lalancette 2008-12-19 14:26:31 UTC
(In reply to comment #10)
> Just talked with Gurhan; his guest has 2.6.9-78.ELxenU,
> 
> but support for extended devices wasn't added until
> 
> * Mon Oct 13 2008 Vivek Goyal <vgoyal> [2.6.9-78.14]
> -xen: fix blkfront to accept 16 devices (Chris Lalancette) [455756]
> 
> That explains the error upon attach.
> 
> So the core problem here is that XenD is unable to detach a device if the
> attach operation didn't handshake with the guest correctly. We need to fix
> XenD in this regard, but I don't think this is serious enough to block the
> release.

Sigh.  This is nasty; we had a discussion about these types of failures long ago with XenSource, and their take on it was to add the --force flag to xm block-detach (which we now support, and which will probably work in this instance).  I agree, though, that this is not a nice way to handle the problem, and we should also fix the bug with xend being unable to tear down domains when this happens.
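
For reference, the forced detach looks something like this (the devid is the numeric ID from the vbd messages above):

# xm block-detach rhel4.7_x86_64_pv_guest 268449024 --force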

Chris Lalancette

Comment 15 Jiri Denemark 2009-04-02 13:43:30 UTC
The main problem here actually is that xm block-detach used to remove some records from xenstore even if the actual detach failed, leaving xenstore in an inconsistent state which xend does not like. This is fixed by a patch proposed to RHBZ 484110, and with that patch applied everything works fine.
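
A minimal illustration of that pattern (hypothetical pseudocode, not the actual 484110 patch): delete the xenstore records only once the detach has actually completed.

    # Sketch: guard the xenstore cleanup behind a successful detach,
    # instead of removing the records unconditionally.
    def destroyDevice(devid):
        if not wait_for_backend_close(devid):   # hypothetical helper
            raise VmError("Device %s still connected, leaving "
                          "xenstore entries alone" % devid)
        remove_frontend_entries(devid)          # hypothetical helpers
        remove_backend_entries(devid)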

The other thing is whether we should make xend more robust against such errors in xenstore. If so, we could probably open another bug for that.

Anyway, I'd close this bug as a duplicate of 484110 if there are no objections, as it seems we actually hit several different issues here...

Comment 16 Chris Lalancette 2009-04-02 14:00:33 UTC
Jiri,
     I tend to agree with your last statement.  As long as we have a fix that keeps xend from going crazy, that's fine.  I would close this as a dup of 484110, and we can address further issues in other BZs if need be.

Chris Lalancette

Comment 17 Jiri Denemark 2009-04-03 08:25:12 UTC

*** This bug has been marked as a duplicate of bug 484110 ***

