Bug 506899 - timeout with physical cdrom on a PV guest
timeout with physical cdrom on a PV guest
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen (Show other bugs)
5.4
All Linux
low Severity medium
: rc
: ---
Assigned To: Paolo Bonzini
Red Hat Kernel QE team
:
Depends On: 396621
Blocks: 396631 509243 526775 526946
  Show dependency treegraph
 
Reported: 2009-06-19 03:57 EDT by Paolo Bonzini
Modified: 2010-04-08 12:17 EDT (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 509243 523154 (view as bug list)
Environment:
Last Closed: 2010-03-30 03:45:17 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
patch fixing the bug (2.50 KB, patch)
2009-06-19 08:05 EDT, Paolo Bonzini
no flags Details | Diff

  None (edit)
Description Paolo Bonzini 2009-06-19 03:57:05 EDT
Description of problem:
With a line like

disk = ['tap:aio:/....,xvda,w', 'phy:/dev/cdrom,hdc:cdrom,r' ]

in the Xen configuration and no physical CDROM in the tray, the PV guest will wait for the 300 seconds timeout to occur

Version-Release number of selected component (if applicable):
5.4

How reproducible:
Add the 'phy:/dev/cdrom,hdc:cdrom,r' item to an existing PV configuration and start the VM.
  
Actual results:
300 seconds timeout

Expected results:
No timeout.

Additional info:
The bug is probably due to the device being Closed (as shown by the timeout message) but not ready.  The configuration is quite bogus, but this is a regression and, if my diagnosis is right, the fix should be easy and safe.
Comment 1 Daniel Berrange 2009-06-19 06:38:38 EDT
Is this really a regression - have you verified that this works in RHEL-5.3 ?  
I'm surprised if this has ever worked, since AFAIK the Xen version RHEL5 is based on has never supported CDROM devices for paravirt guests. It was only added later when Solaris paravirt support CDROm devices. 

That all said, timeouts are always the wrong answer to unsupported features - there should be explicit error message given back where required. In this case though, since the host does not really know whether the guest supports CDROM or not, it would have to attach it anyway and just hope the guest knows what todo with it.
Comment 2 Paolo Bonzini 2009-06-19 07:04:41 EDT
The regression is that it takes 300 seconds instead of 10 to boot.  The configuration is bogus indeed.
Comment 3 Daniel Berrange 2009-06-19 07:10:20 EDT
The fact that the length of the timeout changed does not really make this is regression. It was broken before & it is still broken now.
Comment 4 Bill Burns 2009-06-19 07:16:01 EDT
Ok, clearing the regression flag then. The timeout was increased to avoid problems are seen with valid configs. Side effect is that broken configs can take the full time to fail. This is the tradeoff. Won;t close this though. If we can detect and prevent waiting for a broken config this BZ can be used to fix that.
Comment 5 Chris Lalancette 2009-06-19 07:55:24 EDT
Actually, there are 2 things going on here.  The first is that we did, indeed, increase the timeout to 5 minutes.  But the second part, which is just as important, is that we removed ddd's "anaconda" hack from blkfront.  What that hack did was to go trawling around xenbus, looking for devices that had an unconnected backend.  That was necessary earlier because anaconda was having a fit with these devices when xen-platform-pci was built-in to the kernel.  Now that it's a module, we decided to remove it.

That being said, though, we now have two new problems because of the removal of that patch; this device timeout, and spurious "printk" on shutdown.  I'm wondering if we should maybe put that hack back, since it actually seemed to do more good than we originally realized.

Don (Dutile), have any thoughts about this?

Chris Lalancette
Comment 6 Paolo Bonzini 2009-06-19 08:05:40 EDT
Created attachment 348646 [details]
patch fixing the bug

Just to add 2 cents more to the discussion, here is the patch I just tested successfully.
Comment 7 Paolo Bonzini 2009-06-19 09:11:10 EDT
The "anaconda hack" is at https://bugzilla.redhat.com/show_bug.cgi?id=477005 for the record.
Comment 8 Don Dutile 2009-06-19 13:42:03 EDT
(In reply to comment #5)

Well, I'd sure like to *not* put the "anaconda hack" back in.

Yes, it could fix it, but I'm wondering if it would mess up
the long-connect-under-load cases.

Paolo's patch looks promising, too (in #6).
Has it been tried under the tons-of-guest-starting-up-
causing-long-blkfront-connects case ???

- Don
Comment 9 Paolo Bonzini 2009-06-22 04:06:57 EDT
> Has it been tried under the tons-of-guest-starting-up-
> causing-long-blkfront-connects case ???

Probably not, but I can do it if you expand on what is the "ton-of-guest-starting-up-causing-long-blkfront-connects" case. :-)  I would say that in that case the devices stay in the Connected state (they don't go to Closing or Closed like an empty physical CD does) so my patch does not make any difference.

Should I create something like 40 minimal PV installations (I can do that with 128 MB per guest) and try to start them all at the same time?  Or maybe just one guest is enough if the system is under heavy load, in which case I already tested that and it works.
Comment 10 Chris Lalancette 2009-06-22 05:17:53 EDT
(In reply to comment #9)
> > Has it been tried under the tons-of-guest-starting-up-
> > causing-long-blkfront-connects case ???
> 
> Probably not, but I can do it if you expand on what is the
> "ton-of-guest-starting-up-causing-long-blkfront-connects" case. :-)  I would
> say that in that case the devices stay in the Connected state (they don't go to
> Closing or Closed like an empty physical CD does) so my patch does not make any
> difference.
> 
> Should I create something like 40 minimal PV installations (I can do that with
> 128 MB per guest) and try to start them all at the same time?  Or maybe just
> one guest is enough if the system is under heavy load, in which case I already
> tested that and it works.  

I think the heavy load is good enough.  Starting a zillion guests at the same time is just one (common) way to generate the load, but any significant dom0 load should do the trick.  In which case your patch is looking OK, although I'd still like to try and see how upstream Xen handles this (if at all) in this case.

Chris Lalancette
Comment 11 Paolo Bonzini 2009-06-22 09:24:49 EDT
patch posted at http://thread.gmane.org/gmane.comp.emulators.xen.devel/67737 for upstream.
Comment 12 Paolo Bonzini 2009-06-23 13:17:18 EDT
The patch was accepted upstream.

http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/865707f40531
Comment 13 Chris Lalancette 2009-07-09 07:58:02 EDT
Paolo,
     Just to be clear here, is this a regression?  It seems like this was probably always broken (i.e. broken in 5.3 and earlier).  As long as it is not a regression, and "just another bug", then we can defer to 5.5.  If it is a regression in 5.4, though, I want to make sure we properly track it.

Thanks,
Chris Lalancette
Comment 14 Paolo Bonzini 2009-07-09 08:48:19 EDT
It was always broken.  The reason why the bug was kept in RHEL5.4 was the relationship to bug 501474 and whether it would make sense to solve it with my patch + reinstating the "anaconda hack".

Given that instead we decreased the level of the shutdown messages to DEBUG level, this patch can wait for 5.5.
Comment 16 Don Zickus 2009-10-21 15:11:54 EDT
in kernel-2.6.18-170.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 19 errata-xmlrpc 2010-03-30 03:45:17 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Note You need to log in before you can comment on or make changes to this bug.