Bug 396621 - Increase timeout for device connection on boot
Summary: Increase timeout for device connection on boot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.1
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Chris Lalancette
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 506899
Reported: 2007-11-23 11:51 UTC by Ian Campbell
Modified: 2009-09-02 08:54 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:54:34 UTC
Target Upstream Version:
Embargoed:


Attachments
linux-2.6.18-xen 144:d88e59a7334a ported to 2.6.18-53.el5 (2.01 KB, patch)
2007-12-14 09:23 UTC, Ian Campbell
linux-2.6.18-xen 146:726cd201f4cd ported to 2.6.18-53.el5 (2.00 KB, patch)
2007-12-14 09:23 UTC, Ian Campbell
linux-2.6.18-xen 150:09c88868e344 ported to 2.6.18-53.el5 (1.45 KB, patch)
2007-12-14 09:24 UTC, Ian Campbell
dmesg inside the guest (33.47 KB, text/plain)
2009-02-20 15:15 UTC, Chris Lalancette
FV guest configuration file (495 bytes, text/plain)
2009-02-20 15:16 UTC, Chris Lalancette
Output from xenstore-ls, as requested by danpb (18.51 KB, text/plain)
2009-02-20 17:11 UTC, Chris Lalancette
Patch 1/2: Make sure to only recover connected devices on resume (1.09 KB, patch)
2009-03-01 13:37 UTC, Chris Lalancette
Patch 2/2: Wait for 5 minutes for the backend to connect (3.11 KB, patch)
2009-03-01 13:38 UTC, Chris Lalancette


Links
System: Red Hat Product Errata
ID: RHSA-2009:1243
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 08:53:34 UTC

Description Ian Campbell 2007-11-23 11:51:02 UTC
Description of problem:

The existing 10s delay waiting for frontend devices to connect is sometimes
insufficient under load.

This has been solved upstream by these patches:
- http://hg.uk.xensource.com/linux-2.6.18-xen.hg?cs=d88e59a7334a
- http://hg.uk.xensource.com/linux-2.6.18-xen.hg?cs=726cd201f4cd
- http://hg.uk.xensource.com/linux-2.6.18-xen.hg?cs=09c88868e344

Version-Release number of selected component (if applicable):

2.6.18-53.EL

Comment 1 Ian Campbell 2007-11-23 11:54:41 UTC
Sorry, internal URLs. For external access you need to
s/hg.uk.xensource.com/xenbits.xensource.com/g

http://xenbits.xensource.com/linux-2.6.18-xen.hg?cs=d88e59a7334a
http://xenbits.xensource.com/linux-2.6.18-xen.hg?cs=726cd201f4cd
http://xenbits.xensource.com/linux-2.6.18-xen.hg?cs=09c88868e344
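The substitution given in this comment can also be applied mechanically. A minimal Python sketch follows; the helper name to_external is made up for illustration, and the URLs are the three from the Description.

```python
# Internal changeset URLs as posted in the Description.
INTERNAL_URLS = [
    "http://hg.uk.xensource.com/linux-2.6.18-xen.hg?cs=d88e59a7334a",
    "http://hg.uk.xensource.com/linux-2.6.18-xen.hg?cs=726cd201f4cd",
    "http://hg.uk.xensource.com/linux-2.6.18-xen.hg?cs=09c88868e344",
]


def to_external(url: str) -> str:
    """Hypothetical helper: same effect as s/hg.uk.xensource.com/xenbits.xensource.com/g."""
    return url.replace("hg.uk.xensource.com", "xenbits.xensource.com")


external_urls = [to_external(url) for url in INTERNAL_URLS]
for url in external_urls:
    print(url)
```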

Comment 2 Ian Campbell 2007-12-14 09:23:27 UTC
Created attachment 288851
linux-2.6.18-xen 144:d88e59a7334a ported to 2.6.18-53.el5

Comment 3 Ian Campbell 2007-12-14 09:23:52 UTC
Created attachment 288861
linux-2.6.18-xen 146:726cd201f4cd ported to 2.6.18-53.el5

Comment 4 Ian Campbell 2007-12-14 09:24:19 UTC
Created attachment 288871
linux-2.6.18-xen 150:09c88868e344 ported to 2.6.18-53.el5

Comment 6 Chris Lalancette 2009-02-20 15:14:09 UTC
Ian,
     This patch is causing some problems with fully virtualized guests.  Basically, I built a kernel with this patch, and installed it into a 5.3 guest.  Then I booted it up.  I'll attach a full guest config and logs, but essentially if you try to boot the FV guest with the following in its config:

disk = [ "file:/var/lib/xen/images/rhel5fv_x86_64.dsk,hda,w", ",hdc:cdrom,r" ]

The FV guest will take a really long time to boot, and you'll see this in the logs:

XENBUS: Waiting for devices to initialise: 295s...290s...285s...280s...275s...55s...50s...45s...40s...35s...30s...25s...20s...15s...10s...5s...0s...
XENBUS: Timeout connecting to device: device/vbd/5632 (local state 3, remote state 2)

The FV guest will boot, but that's not nice behavior.  Have any ideas about this?

Chris Lalancette
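The timeout line quoted above can be decoded into something more readable. The sketch below is an illustration, not code from any of the patches. It assumes the state numbers follow the XenbusState enumeration from the Xen public headers (0 Unknown through 6 Closed) and that the trailing number in the device path encodes an IDE-style device as (major << 8) | minor. The function and key names are made up.

```python
import re
from enum import IntEnum


class XenbusState(IntEnum):
    # Assumed to mirror the XenbusState enum in the Xen public headers.
    UNKNOWN = 0
    INITIALISING = 1
    INIT_WAIT = 2
    INITIALISED = 3
    CONNECTED = 4
    CLOSING = 5
    CLOSED = 6


TIMEOUT_RE = re.compile(
    r"XENBUS: Timeout connecting to device: (?P<path>\S+) "
    r"\(local state (?P<local>\d+), remote state (?P<remote>\d+)\)"
)


def decode_timeout(line: str) -> dict:
    """Hypothetical helper: split a XENBUS timeout line into named fields."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        raise ValueError("not a XENBUS timeout line")
    # Last path component is the virtual device number.
    devnum = int(match.group("path").rsplit("/", 1)[1])
    return {
        "path": match.group("path"),
        "major": devnum >> 8,    # assumption: IDE-style (major << 8) | minor
        "minor": devnum & 0xFF,
        "local": XenbusState(int(match.group("local"))),
        "remote": XenbusState(int(match.group("remote"))),
    }


info = decode_timeout(
    "XENBUS: Timeout connecting to device: device/vbd/5632 "
    "(local state 3, remote state 2)"
)
print(info["major"], info["minor"], info["local"].name, info["remote"].name)
```

Under those assumptions, 5632 decodes to major 22, minor 0, which is hdc in the Linux IDE numbering. The frontend has reached Initialised while the backend is still in InitWait.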

Comment 7 Chris Lalancette 2009-02-20 15:15:44 UTC
Created attachment 332714
dmesg inside the guest

This is a dmesg from inside the FV guest when the problem happened.  Ignore the softlockup warnings and the "too much work" from the 8250 driver; that happened because I did a save/restore on this domain.  The important parts are the countdown from the XENBUS driver.

Comment 8 Chris Lalancette 2009-02-20 15:16:15 UTC
Created attachment 332715
FV guest configuration file
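
The disk line quoted in Comment 6 is valid Python, so it can be picked apart directly. The sketch below assumes each entry has three comma-separated fields (backing store, guest device, access mode) and treats an empty first field as "no backing store". The helper name and dictionary keys are made up for illustration.

```python
def parse_disk_spec(spec: str) -> dict:
    """Hypothetical helper: split one disk entry into its three fields."""
    backend, frontend, mode = spec.split(",")
    device, _, kind = frontend.partition(":")  # "hdc:cdrom" -> ("hdc", ":", "cdrom")
    return {
        "backend": backend or None,  # empty string means nothing backs the device
        "device": device,
        "type": kind or "disk",
        "mode": mode,
    }


# The disk line from the guest configuration quoted in Comment 6.
disk = ["file:/var/lib/xen/images/rhel5fv_x86_64.dsk,hda,w", ",hdc:cdrom,r"]

parsed = [parse_disk_spec(spec) for spec in disk]
no_backing = [entry["device"] for entry in parsed if entry["backend"] is None]
print(no_backing)
```

Only the hdc entry has an empty first field, which matches the device the guest ends up waiting on.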

Comment 10 Daniel Berrangé 2009-02-20 16:02:19 UTC
It would probably be useful to capture the output of 'xenstore-ls' while the guest is in this stuck state, counting down for 5 minutes. I expect one of the devices will be in some unexpected transition 'state' and xenstore should show which one.

Comment 11 Chris Lalancette 2009-02-20 16:52:44 UTC
Dan,
    The thing is, I know exactly which device it is.  It's the CD-ROM device attached to hdc.  The problem is that it doesn't *have* a backend, so we are waiting around 5 minutes for a device that will never connect.
     There is actually another interesting aspect to this.  In upstream, the only way you can get into wait_for_devices() is either through boot_wait_for_devices() or through xenbus_register_frontend().  We aren't coming through boot_wait_for_devices(), since that is only called if xenbus is compiled into the kernel, and in RHEL-5, it's a module.  Therefore, we have to be coming through xenbus_register_frontend().  And in the upstream case, if you come through xenbus_register_frontend(), wait_for_devices() is essentially a no-op; it returns immediately because "ready_to_wait_for_devices" is never set.  On the other hand, in RHEL-5, we are carrying a small patch that forces ready_to_wait_for_devices to 1 in the PV_ON_HVM case.  The thing is, I'm not quite sure why we would need that additional patch; if this isn't a boot device, we don't really need to hang around for it.  I'll get in contact with Don Dutile and see why we are carrying that patch.

Chris Lalancette
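
The difference between the two code paths described in this comment can be modelled in a few lines. This is a toy Python model, not kernel code. The method names are taken from the comment, but the class, the device_connects parameter, and the assumption that boot_wait_for_devices() is what sets the flag are illustrative only.

```python
from dataclasses import dataclass

TIMEOUT_SECONDS = 300  # the 5-minute wait seen in the log in Comment 6


@dataclass
class XenbusModel:
    """Toy model of the gating described above; not the kernel implementation."""

    ready_to_wait_for_devices: bool = False

    def wait_for_devices(self, device_connects: bool) -> int:
        """Return the worst-case number of seconds spent waiting."""
        if not self.ready_to_wait_for_devices:
            return 0  # flag never set: the call is a no-op
        return 0 if device_connects else TIMEOUT_SECONDS

    def boot_wait_for_devices(self, device_connects: bool) -> int:
        # Assumption: the built-in boot path sets the flag before waiting.
        self.ready_to_wait_for_devices = True
        return self.wait_for_devices(device_connects)

    def xenbus_register_frontend(self, device_connects: bool) -> int:
        return self.wait_for_devices(device_connects)


# Upstream with xenbus built as a module: the flag stays unset.
upstream_module = XenbusModel()
# RHEL-5 PV-on-HVM: the carried patch forces the flag to 1.
rhel5_pv_on_hvm = XenbusModel(ready_to_wait_for_devices=True)

upstream_wait = upstream_module.xenbus_register_frontend(device_connects=False)
rhel5_wait = rhel5_pv_on_hvm.xenbus_register_frontend(device_connects=False)
print(upstream_wait, rhel5_wait)
```

In this model, a device that never connects costs nothing on the upstream module path and the full timeout on the RHEL-5 PV-on-HVM path.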

Comment 12 Chris Lalancette 2009-02-20 17:11:55 UTC
Created attachment 332731
Output from xenstore-ls, as requested by danpb

This is the output from xenstore-ls while the FV guest is hanging around waiting for the timeout to happen.  The domain you are interested in is called rhel5fv_x86_64.

Comment 13 Chris Lalancette 2009-03-01 13:35:31 UTC
OK, I think I have now worked through the problems with these patches.  The first problem had to do with what I mentioned in Comment #6, where devices without a backend were hanging up.  This is going to be fixed by the patch in BZ 477005, which removes our unconditional ready_to_wait_for_devices for PV-on-HVM.  The second problem I had while testing this had to do with a crash when doing save/restore.  It basically had to do with another bug that we just never noticed before, and seems to be fixed by xen-unstable c/s 12526.  I will attach both that patch, and my rebased combination of linux-2.6.18-xen.hg c/s 144, 146, and 150 to this BZ.  Note that all of these patches must be applied *after* the patch in BZ 477005 to avoid merge conflicts.

Chris Lalancette

Comment 14 Chris Lalancette 2009-03-01 13:37:24 UTC
Created attachment 333643
Patch 1/2: Make sure to only recover connected devices on resume

Comment 15 Chris Lalancette 2009-03-01 13:38:02 UTC
Created attachment 333644
Patch 2/2: Wait for 5 minutes for the backend to connect
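
The patch itself is only attached, not quoted, in this report. The log in Comment 6 suggests a 300-second budget reported in 5-second steps. The following Python sketch reproduces that countdown on a simulated clock. The function name, its parameters, and the choice to check the device only at step boundaries are assumptions made for illustration.

```python
from typing import Optional, Tuple


def countdown(connects_at: Optional[int], timeout: int = 300, step: int = 5) -> Tuple[bool, str]:
    """Simulate the wait; connects_at is the second the device connects, or None for never.

    Returns (connected, countdown_text). No real sleeping is done.
    """
    waited = 0
    ticks = []
    while connects_at is None or waited < connects_at:
        waited += step
        ticks.append(f"{timeout - waited}s...")  # remaining time, as in the log
        if waited >= timeout:
            return False, "".join(ticks)  # budget exhausted
    return True, "".join(ticks)


never_ok, never_msg = countdown(None)
quick_ok, quick_msg = countdown(0)
print("XENBUS: Waiting for devices to initialise: " + never_msg)
```

A device that never connects produces the full run from 295s down to 0s, while one that is already connected produces no countdown at all.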

Comment 16 Ian Campbell 2009-03-02 10:09:07 UTC
Hi Chris, sorry for the delay getting back to you, I was out of the office.

Looks like you have this sorted now, do you still needinfo from me?

Ian.

Comment 17 Chris Lalancette 2009-03-02 10:56:29 UTC
Ian,
     Yeah, I think we are all sorted now.  Sorry, I forgot to remove the "needinfo" flag from the bug.

Thanks,
Chris Lalancette

Comment 22 errata-xmlrpc 2009-09-02 08:54:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

