
Bug 506899

Summary: timeout with physical cdrom on a PV guest
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: low
Reporter: Paolo Bonzini <pbonzini>
Assignee: Paolo Bonzini <pbonzini>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: berrange, clalance, ddutile, dzickus, xen-maint
Target Milestone: rc
Doc Type: Bug Fix
Clones: 509243, 523154
Last Closed: 2010-03-30 07:45:17 UTC
Bug Depends On: 396621
Bug Blocks: 396631, 509243, 526775, 526946

Attachments:
patch fixing the bug

Description Paolo Bonzini 2009-06-19 07:57:05 UTC
Description of problem:
With a line like

disk = ['tap:aio:/....,xvda,w', 'phy:/dev/cdrom,hdc:cdrom,r' ]

in the Xen configuration and no physical CDROM in the tray, the PV guest waits for the full 300-second timeout to expire before it boots.

Version-Release number of selected component (if applicable):
5.4

How reproducible:
Add the 'phy:/dev/cdrom,hdc:cdrom,r' item to an existing PV configuration and start the VM.
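
For illustration, a minimal PV guest configuration of the kind described might look like the sketch below (Xen domain configs use Python syntax; the guest name, memory size, and disk image path are placeholders, not taken from this report):

    # hypothetical minimal PV guest config; everything except the
    # CD-ROM entry is a placeholder value
    name = "rhel5-pv-test"
    memory = 512
    bootloader = "/usr/bin/pygrub"
    vif = [ '' ]
    disk = [ 'tap:aio:/var/lib/xen/images/rhel5.img,xvda,w',
             'phy:/dev/cdrom,hdc:cdrom,r' ]   # the problematic entry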
  
Actual results:
The full 300-second timeout elapses.

Expected results:
No timeout.

Additional info:
The bug is probably due to the device being Closed (as shown by the timeout message) but not ready.  The configuration is quite bogus, but this is a regression and, if my diagnosis is right, the fix should be easy and safe.
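
To make the diagnosis concrete, here is a small model (plain Python, not the kernel patch itself) of the xenbus handshake states involved; the state names and values match Xen's public io/xenbus.h header, but the decision function is only an illustration of the frontend's choices:

    # xenbus states, as defined in Xen's public io/xenbus.h header
    XenbusStateUnknown      = 0
    XenbusStateInitialising = 1
    XenbusStateInitWait     = 2
    XenbusStateInitialised  = 3
    XenbusStateConnected    = 4
    XenbusStateClosing      = 5
    XenbusStateClosed       = 6

    def frontend_action(backend_state):
        """Model of the decision blkfront faces while connecting."""
        if backend_state == XenbusStateConnected:
            return "proceed"                # device is usable
        if backend_state in (XenbusStateClosing, XenbusStateClosed):
            # A Closed backend (e.g. an empty physical CD-ROM) will
            # never reach Connected; failing fast here, rather than
            # waiting, is what this diagnosis suggests a fix should do.
            return "fail fast"
        return "keep waiting"               # may burn the 300s timeout

    assert frontend_action(XenbusStateClosed) == "fail fast"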

Comment 1 Daniel Berrangé 2009-06-19 10:38:38 UTC
Is this really a regression - have you verified that this works in RHEL-5.3?
I'm surprised if this has ever worked, since AFAIK the Xen version RHEL5 is based on has never supported CDROM devices for paravirt guests. It was only added later, when Solaris paravirt started supporting CDROM devices.

That all said, timeouts are always the wrong answer to unsupported features - there should be an explicit error message given back where required. In this case, though, since the host does not really know whether the guest supports CDROM or not, it would have to attach it anyway and just hope the guest knows what to do with it.

Comment 2 Paolo Bonzini 2009-06-19 11:04:41 UTC
The regression is that it takes 300 seconds instead of 10 to boot.  The configuration is bogus indeed.

Comment 3 Daniel Berrangé 2009-06-19 11:10:20 UTC
The fact that the length of the timeout changed does not really make this a regression. It was broken before & it is still broken now.

Comment 4 Bill Burns 2009-06-19 11:16:01 UTC
Ok, clearing the regression flag then. The timeout was increased to avoid problems seen with valid configs. The side effect is that broken configs can take the full time to fail. This is the tradeoff. Won't close this, though. If we can detect and prevent waiting for a broken config, this BZ can be used to fix that.

Comment 5 Chris Lalancette 2009-06-19 11:55:24 UTC
Actually, there are 2 things going on here.  The first is that we did, indeed, increase the timeout to 5 minutes.  But the second part, which is just as important, is that we removed ddd's "anaconda" hack from blkfront.  What that hack did was to go trawling around xenbus, looking for devices that had an unconnected backend.  That was necessary earlier because anaconda was having a fit with these devices when xen-platform-pci was built into the kernel.  Now that it's a module, we decided to remove it.

That being said, though, we now have two new problems because of the removal of that patch: this device timeout, and spurious "printk"s on shutdown.  I'm wondering if we should maybe put that hack back, since it actually seemed to do more good than we originally realized.

Don (Dutile), have any thoughts about this?

Chris Lalancette

Comment 6 Paolo Bonzini 2009-06-19 12:05:40 UTC
Created attachment 348646 [details]
patch fixing the bug

Just to add 2 cents more to the discussion, here is the patch I just tested successfully.

Comment 7 Paolo Bonzini 2009-06-19 13:11:10 UTC
The "anaconda hack" is at https://bugzilla.redhat.com/show_bug.cgi?id=477005 for the record.

Comment 8 Don Dutile (Red Hat) 2009-06-19 17:42:03 UTC
(In reply to comment #5)

Well, I'd sure like to *not* put the "anaconda hack" back in.

Yes, it could fix it, but I'm wondering if it would mess up
the long-connect-under-load cases.

Paolo's patch looks promising, too (in #6).
Has it been tried under the tons-of-guest-starting-up-
causing-long-blkfront-connects case ???

- Don

Comment 9 Paolo Bonzini 2009-06-22 08:06:57 UTC
> Has it been tried under the tons-of-guest-starting-up-
> causing-long-blkfront-connects case ???

Probably not, but I can do it if you expand on what is the "ton-of-guest-starting-up-causing-long-blkfront-connects" case. :-)  I would say that in that case the devices stay in the Connected state (they don't go to Closing or Closed like an empty physical CD does) so my patch does not make any difference.

Should I create something like 40 minimal PV installations (I can do that with 128 MB per guest) and try to start them all at the same time?  Or maybe just one guest is enough if the system is under heavy load, in which case I already tested that and it works.
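
For concreteness, the mass-start variant could be driven by something like this sketch (the guest*.cfg file names and the count of 40 are assumptions taken from the numbers floated above; only 'xm create' itself is real):

    # hypothetical stress-test driver: start 40 PV guests at once to
    # generate load; guest1.cfg .. guest40.cfg are assumed to be
    # pre-created minimal PV guest configs
    import subprocess

    procs = [subprocess.Popen(["xm", "create", "guest%d.cfg" % i])
             for i in range(1, 41)]
    for p in procs:
        p.wait()   # collect the exit status of every 'xm create'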

Comment 10 Chris Lalancette 2009-06-22 09:17:53 UTC
(In reply to comment #9)
> > Has it been tried under the tons-of-guest-starting-up-
> > causing-long-blkfront-connects case ???
> 
> Probably not, but I can do it if you expand on what is the
> "ton-of-guest-starting-up-causing-long-blkfront-connects" case. :-)  I would
> say that in that case the devices stay in the Connected state (they don't go to
> Closing or Closed like an empty physical CD does) so my patch does not make any
> difference.
> 
> Should I create something like 40 minimal PV installations (I can do that with
> 128 MB per guest) and try to start them all at the same time?  Or maybe just
> one guest is enough if the system is under heavy load, in which case I already
> tested that and it works.  

I think the heavy load is good enough.  Starting a zillion guests at the same time is just one (common) way to generate the load, but any significant dom0 load should do the trick.  In which case your patch is looking OK, although I'd still like to try and see how upstream Xen handles this (if at all) in this case.

Chris Lalancette

Comment 11 Paolo Bonzini 2009-06-22 13:24:49 UTC
Patch posted at http://thread.gmane.org/gmane.comp.emulators.xen.devel/67737 for upstream inclusion.

Comment 12 Paolo Bonzini 2009-06-23 17:17:18 UTC
The patch was accepted upstream.

http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/865707f40531

Comment 13 Chris Lalancette 2009-07-09 11:58:02 UTC
Paolo,
     Just to be clear here, is this a regression?  It seems like this was probably always broken (i.e. broken in 5.3 and earlier).  As long as it is not a regression, and "just another bug", then we can defer to 5.5.  If it is a regression in 5.4, though, I want to make sure we properly track it.

Thanks,
Chris Lalancette

Comment 14 Paolo Bonzini 2009-07-09 12:48:19 UTC
It was always broken.  The reason why the bug was kept in RHEL5.4 was the relationship to bug 501474 and whether it would make sense to solve it with my patch + reinstating the "anaconda hack".

Given that we instead decreased the shutdown messages to DEBUG level, this patch can wait for 5.5.

Comment 16 Don Zickus 2009-10-21 19:11:54 UTC
in kernel-2.6.18-170.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However, feel free
to provide a comment indicating that this fix has been verified.

Comment 19 errata-xmlrpc 2010-03-30 07:45:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html