Red Hat Bugzilla – Bug 506899
timeout with physical cdrom on a PV guest
Last modified: 2010-04-08 12:17:42 EDT
Description of problem:
With a line like
disk = ['tap:aio:/....,xvda,w', 'phy:/dev/cdrom,hdc:cdrom,r' ]
in the Xen configuration and no physical CDROM in the tray, the PV guest will wait for the full 300-second timeout to expire before booting.
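For reference, a complete minimal guest config that reproduces this would look roughly like the following; everything except the cdrom disk entry is an illustrative placeholder (name, memory, paths):

    # Hypothetical minimal PV guest config; only the cdrom entry is taken
    # from this report, the rest is illustrative.
    name       = "pv-test"
    memory     = 512
    bootloader = "/usr/bin/pygrub"
    vif        = [ '' ]
    disk       = [ 'tap:aio:/path/to/guest.img,xvda,w',
                   'phy:/dev/cdrom,hdc:cdrom,r' ]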
Version-Release number of selected component (if applicable):

Steps to Reproduce:
Add the 'phy:/dev/cdrom,hdc:cdrom,r' item to an existing PV configuration and start the VM.

Actual results:
300-second timeout before the guest boots.
The bug is probably due to the device being Closed (as shown by the timeout message) but not ready. The configuration is quite bogus, but this is a regression and, if my diagnosis is right, the fix should be easy and safe.
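To make the diagnosis concrete, here is the idea sketched as a self-contained userspace C program. This is NOT the actual kernel patch; poll_backend_state() is a hypothetical stand-in for reading the backend state from xenstore. The point is simply to stop waiting as soon as the backend reports Closed rather than sleeping out the whole 300 seconds:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Xenbus connection states, as defined by the xenbus protocol. */
    enum xenbus_state {
        XenbusStateUnknown      = 0,
        XenbusStateInitialising = 1,
        XenbusStateInitWait     = 2,
        XenbusStateInitialised  = 3,
        XenbusStateConnected    = 4,
        XenbusStateClosing      = 5,
        XenbusStateClosed       = 6,
    };

    /* Stub for demonstration: pretend the backend is an empty CD-ROM. */
    static enum xenbus_state poll_backend_state(void)
    {
        return XenbusStateClosed;
    }

    /* Wait for the backend to connect, but give up as soon as it reports
     * Closing/Closed: such a backend will never connect, so waiting out
     * the remaining minutes of the timeout is pointless. */
    static int wait_for_backend(int timeout_secs)
    {
        time_t deadline = time(NULL) + timeout_secs;

        while (time(NULL) < deadline) {
            enum xenbus_state s = poll_backend_state();

            if (s == XenbusStateConnected)
                return 0;   /* device is ready */
            if (s == XenbusStateClosing || s == XenbusStateClosed)
                return -1;  /* will never connect: fail immediately */
            sleep(1);
        }
        return -1;          /* genuine timeout */
    }

    int main(void)
    {
        if (wait_for_backend(300) != 0)
            fprintf(stderr, "backend closed or timed out, giving up\n");
        return 0;
    }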
Is this really a regression? Have you verified that this works in RHEL-5.3?
I'd be surprised if this has ever worked, since AFAIK the Xen version RHEL5 is based on has never supported CDROM devices for paravirt guests. It was only added later, when Solaris paravirt gained support for CDROM devices.
That all said, timeouts are always the wrong answer to unsupported features - there should be an explicit error message given back where required. In this case, though, since the host does not really know whether the guest supports CDROM devices or not, it would have to attach the device anyway and just hope the guest knows what to do with it.
The regression is that it takes 300 seconds instead of 10 to boot. The configuration is bogus indeed.
The fact that the length of the timeout changed does not really make this a regression. It was broken before & it is still broken now.
Ok, clearing the regression flag then. The timeout was increased to avoid problems seen with valid configs; the side effect is that broken configs can take the full time to fail. This is the tradeoff. Won't close this, though. If we can detect a broken config and avoid waiting on it, this BZ can be used to fix that.
Actually, there are 2 things going on here. The first is that we did, indeed, increase the timeout to 5 minutes. But the second part, which is just as important, is that we removed ddd's "anaconda" hack from blkfront. What that hack did was to go trawling around xenbus, looking for devices that had an unconnected backend. That was necessary earlier because anaconda was having a fit with these devices when xen-platform-pci was built into the kernel. Now that it's a module, we decided to remove it.
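For those unfamiliar with the hack, here is a rough userspace approximation of what it did, written as if run inside the guest. This is illustrative only: the real code lived in blkfront inside the kernel, and "unconnected" here means any backend state other than "4" (XenbusStateConnected). It uses the RHEL5-era libxenstore API (link with -lxenstore):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <xs.h>   /* libxenstore, RHEL5-era header */

    int main(void)
    {
        struct xs_handle *xsh = xs_daemon_open();
        if (!xsh) {
            perror("xs_daemon_open");
            return 1;
        }

        unsigned int n = 0;
        /* "device/vbd" is resolved relative to this domain's home path. */
        char **devs = xs_directory(xsh, XBT_NULL, "device/vbd", &n);

        for (unsigned int i = 0; devs && i < n; i++) {
            char path[256];
            unsigned int len;

            /* Each frontend records where its backend lives... */
            snprintf(path, sizeof(path), "device/vbd/%s/backend", devs[i]);
            char *backend = xs_read(xsh, XBT_NULL, path, &len);
            if (!backend)
                continue;

            /* ...so ask the backend for its state. */
            snprintf(path, sizeof(path), "%s/state", backend);
            char *state = xs_read(xsh, XBT_NULL, path, &len);
            if (state && strcmp(state, "4") != 0)
                printf("vbd %s: backend unconnected (state %s)\n",
                       devs[i], state);

            free(state);
            free(backend);
        }
        free(devs);
        xs_daemon_close(xsh);
        return 0;
    }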
That being said, though, we now have two new problems because of the removal of that patch: this device timeout, and spurious printk messages on shutdown. I'm wondering if we should maybe put that hack back, since it actually seemed to do more good than we originally realized.
Don (Dutile), have any thoughts about this?
Created attachment 348646
patch fixing the bug
Just to add 2 cents more to the discussion, here is the patch I just tested successfully.
The "anaconda hack" is at https://bugzilla.redhat.com/show_bug.cgi?id=477005 for the record.
(In reply to comment #5)
Well, I'd sure like to *not* put the "anaconda hack" back in.
Yes, it could fix it, but I'm wondering if it would mess up
the long-connect-under-load cases.
Paolo's patch looks promising, too (in #6).
Has it been tried under the tons-of-guest-starting-up-
causing-long-blkfront-connects case ???
> Has it been tried under the tons-of-guest-starting-up-
> causing-long-blkfront-connects case ???
Probably not, but I can do it if you expand on what the "ton-of-guest-starting-up-causing-long-blkfront-connects" case is. :-) I would say that in that case the devices stay in the Connected state (they don't go to Closing or Closed like an empty physical CD does), so my patch does not make any difference.
Should I create something like 40 minimal PV installations (I can do that with 128 MB per guest) and try to start them all at the same time? Or maybe just one guest is enough if the system is under heavy load, in which case I already tested that and it works.
(In reply to comment #9)
> > Has it been tried under the tons-of-guest-starting-up-
> > causing-long-blkfront-connects case ???
> Probably not, but I can do it if you expand on what the
> "ton-of-guest-starting-up-causing-long-blkfront-connects" case is. :-) I would
> say that in that case the devices stay in the Connected state (they don't go
> to Closing or Closed like an empty physical CD does), so my patch does not
> make any difference.
>
> Should I create something like 40 minimal PV installations (I can do that with
> 128 MB per guest) and try to start them all at the same time? Or maybe just
> one guest is enough if the system is under heavy load, in which case I already
> tested that and it works.
I think the heavy load is good enough. Starting a zillion guests at the same time is just one (common) way to generate the load, but any significant dom0 load should do the trick. In that case your patch is looking OK, although I'd still like to see how upstream Xen handles this case (if at all).
Patch posted upstream at http://thread.gmane.org/gmane.comp.emulators.xen.devel/67737
The patch was accepted upstream.
Just to be clear here, is this a regression? It seems like this was probably always broken (i.e. broken in 5.3 and earlier). As long as it is not a regression, and "just another bug", we can defer it to 5.5. If it is a regression in 5.4, though, I want to make sure we properly track it.
It was always broken. The reason why the bug was kept in RHEL5.4 was the relationship to bug 501474 and whether it would make sense to solve it with my patch + reinstating the "anaconda hack".
Given that we instead lowered the shutdown messages to DEBUG level, this patch can wait for 5.5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However, feel free
to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.