Bug 509243 - timeout with physical cdrom on a PV guest
timeout with physical cdrom on a PV guest
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel-xen (Show other bugs)
4.9
All Linux
low Severity medium
: rc
: ---
Assigned To: Paolo Bonzini
Red Hat Kernel QE team
:
Depends On: 506899
Blocks:
  Show dependency treegraph
 
Reported: 2009-07-01 16:52 EDT by Don Dutile
Modified: 2009-07-01 17:48 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 506899
Environment:
Last Closed: 2009-07-01 17:48:47 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Don Dutile 2009-07-01 16:52:10 EDT
+++ This bug was initially created as a clone of Bug #506899 +++

Description of problem:
With a line like

disk = ['tap:aio:/....,xvda,w', 'phy:/dev/cdrom,hdc:cdrom,r' ]

in the Xen configuration and no physical CDROM in the tray, the PV guest will wait for the 300 seconds timeout to occur

Version-Release number of selected component (if applicable):
5.4

How reproducible:
Add the 'phy:/dev/cdrom,hdc:cdrom,r' item to an existing PV configuration and start the VM.
  
Actual results:
300 seconds timeout

Expected results:
No timeout.

Additional info:
The bug is probably due to the device being Closed (as shown by the timeout message) but not ready.  The configuration is quite bogus, but this is a regression and, if my diagnosis is right, the fix should be easy and safe.

--- Additional comment from berrange@redhat.com on 2009-06-19 06:38:38 EDT ---

Is this really a regression - have you verified that this works in RHEL-5.3 ?  
I'm surprised if this has ever worked, since AFAIK the Xen version RHEL5 is based on has never supported CDROM devices for paravirt guests. It was only added later when Solaris paravirt support CDROm devices. 

That all said, timeouts are always the wrong answer to unsupported features - there should be explicit error message given back where required. In this case though, since the host does not really know whether the guest supports CDROM or not, it would have to attach it anyway and just hope the guest knows what todo with it.

--- Additional comment from pbonzini@redhat.com on 2009-06-19 07:04:41 EDT ---

The regression is that it takes 300 seconds instead of 10 to boot.  The configuration is bogus indeed.

--- Additional comment from berrange@redhat.com on 2009-06-19 07:10:20 EDT ---

The fact that the length of the timeout changed does not really make this is regression. It was broken before & it is still broken now.

--- Additional comment from bburns@redhat.com on 2009-06-19 07:16:01 EDT ---

Ok, clearing the regression flag then. The timeout was increased to avoid problems are seen with valid configs. Side effect is that broken configs can take the full time to fail. This is the tradeoff. Won;t close this though. If we can detect and prevent waiting for a broken config this BZ can be used to fix that.

--- Additional comment from clalance@redhat.com on 2009-06-19 07:55:24 EDT ---

Actually, there are 2 things going on here.  The first is that we did, indeed, increase the timeout to 5 minutes.  But the second part, which is just as important, is that we removed ddd's "anaconda" hack from blkfront.  What that hack did was to go trawling around xenbus, looking for devices that had an unconnected backend.  That was necessary earlier because anaconda was having a fit with these devices when xen-platform-pci was built-in to the kernel.  Now that it's a module, we decided to remove it.

That being said, though, we now have two new problems because of the removal of that patch; this device timeout, and spurious "printk" on shutdown.  I'm wondering if we should maybe put that hack back, since it actually seemed to do more good than we originally realized.

Don (Dutile), have any thoughts about this?

Chris Lalancette

--- Additional comment from pbonzini@redhat.com on 2009-06-19 08:05:40 EDT ---

Created an attachment (id=348646)
patch fixing the bug

Just to add 2 cents more to the discussion, here is the patch I just tested successfully.

--- Additional comment from pbonzini@redhat.com on 2009-06-19 09:11:10 EDT ---

The "anaconda hack" is at https://bugzilla.redhat.com/show_bug.cgi?id=477005 for the record.

--- Additional comment from ddutile@redhat.com on 2009-06-19 13:42:03 EDT ---

(In reply to comment #5)

Well, I'd sure like to *not* put the "anaconda hack" back in.

Yes, it could fix it, but I'm wondering if it would mess up
the long-connect-under-load cases.

Paolo's patch looks promising, too (in #6).
Has it been tried under the tons-of-guest-starting-up-
causing-long-blkfront-connects case ???

- Don

--- Additional comment from pbonzini@redhat.com on 2009-06-22 04:06:57 EDT ---

> Has it been tried under the tons-of-guest-starting-up-
> causing-long-blkfront-connects case ???

Probably not, but I can do it if you expand on what is the "ton-of-guest-starting-up-causing-long-blkfront-connects" case. :-)  I would say that in that case the devices stay in the Connected state (they don't go to Closing or Closed like an empty physical CD does) so my patch does not make any difference.

Should I create something like 40 minimal PV installations (I can do that with 128 MB per guest) and try to start them all at the same time?  Or maybe just one guest is enough if the system is under heavy load, in which case I already tested that and it works.

--- Additional comment from clalance@redhat.com on 2009-06-22 05:17:53 EDT ---

(In reply to comment #9)
> > Has it been tried under the tons-of-guest-starting-up-
> > causing-long-blkfront-connects case ???
> 
> Probably not, but I can do it if you expand on what is the
> "ton-of-guest-starting-up-causing-long-blkfront-connects" case. :-)  I would
> say that in that case the devices stay in the Connected state (they don't go to
> Closing or Closed like an empty physical CD does) so my patch does not make any
> difference.
> 
> Should I create something like 40 minimal PV installations (I can do that with
> 128 MB per guest) and try to start them all at the same time?  Or maybe just
> one guest is enough if the system is under heavy load, in which case I already
> tested that and it works.  

I think the heavy load is good enough.  Starting a zillion guests at the same time is just one (common) way to generate the load, but any significant dom0 load should do the trick.  In which case your patch is looking OK, although I'd still like to try and see how upstream Xen handles this (if at all) in this case.

Chris Lalancette

--- Additional comment from pbonzini@redhat.com on 2009-06-22 09:24:49 EDT ---

patch posted at http://thread.gmane.org/gmane.comp.emulators.xen.devel/67737 for upstream.

--- Additional comment from pbonzini@redhat.com on 2009-06-23 13:17:18 EDT ---

The patch was accepted upstream.

http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/865707f40531
Comment 1 Don Dutile 2009-07-01 16:54:22 EDT
With the patch for BZ396631 posted for RHEL4 to extend device timeout from 10 secs to 5 minutes, this patch needs to be backported to RHEL4 as well.
Comment 2 Paolo Bonzini 2009-07-01 17:48:47 EDT
Yes, I planned to post an updated version tomorrow (as it's already on Andrew Jones' test branch) and then I'll close this.

Note You need to log in before you can comment on or make changes to this bug.