Red Hat Bugzilla – Bug 396621
Increase timeout for device connection on boot
Last modified: 2009-09-02 04:54:34 EDT
Description of problem:
The existing 10s delay waiting for frontend devices to connect is sometimes
insufficient under load.
This has been solved upstream by these patches:
Version-Release number of selected component (if applicable):
Sorry, internal URLs. For external access you need to
Created attachment 288851 [details]
linux-2.6.18-xen 144:d88e59a7334a ported to 2.6.18-53.el5
Created attachment 288861 [details]
linux-2.6.18-xen 146:726cd201f4cd ported to 2.6.18-53.el5
Created attachment 288871 [details]
linux-2.6.18-xen 150:09c88868e344 ported to 2.6.18-53.el5
This patch is causing some problems with fully virtualized guests. I built a kernel with this patch, installed it into a 5.3 guest, and booted it. I'll attach a full guest config and logs, but essentially if you try to boot the FV guest with the following in its config:
disk = [ "file:/var/lib/xen/images/rhel5fv_x86_64.dsk,hda,w", ",hdc:cdrom,r" ]
The FV guest will take a really long time to boot, and you'll see this in the logs:
XENBUS: Waiting for devices to initialise: 295s...290s...285s...280s...275s...55s...50s...45s...40s...35s...30s...25s...20s...15s...10s...5s...0s...
XENBUS: Timeout connecting to device: device/vbd/5632 (local state 3, remote state 2)
The FV guest will boot, but that's not nice behavior. Have any ideas about this?
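For anyone reading the timeout line above: the two state numbers are XenbusState values from Xen's public io/xenbus.h header, and the vbd number appears to pack a Linux major/minor pair. Below is a minimal Python sketch that decodes the line. It assumes the classic (major << 8) | minor encoding for the vbd ID, and the helper name decode_timeout is made up for illustration.

```python
import re

# XenbusState values as defined in Xen's public header io/xenbus.h.
XENBUS_STATES = {
    0: "Unknown",
    1: "Initialising",
    2: "InitWait",
    3: "Initialised",
    4: "Connected",
    5: "Closing",
    6: "Closed",
}

# Matches the tail of a "XENBUS: Timeout connecting to device" message.
LINE_RE = re.compile(
    r"device/(?P<type>\w+)/(?P<id>\d+) "
    r"\(local state (?P<local>\d+), remote state (?P<remote>\d+)\)"
)

def decode_timeout(line):
    """Hypothetical helper: turn the log line into readable fields."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a XENBUS timeout line")
    dev_id = int(m.group("id"))
    return {
        "type": m.group("type"),
        # Assumption: vbd IDs use the classic (major << 8) | minor layout.
        "major": dev_id >> 8,
        "minor": dev_id & 0xFF,
        "local": XENBUS_STATES.get(int(m.group("local")), "?"),
        "remote": XENBUS_STATES.get(int(m.group("remote")), "?"),
    }

line = ("XENBUS: Timeout connecting to device: device/vbd/5632 "
        "(local state 3, remote state 2)")
info = decode_timeout(line)
print(info)
```

Under that assumption, 5632 decodes to major 22, minor 0, which is the block major Linux uses for the second IDE channel (hdc). The frontend is in Initialised and the backend is still in InitWait.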
Created attachment 332714 [details]
dmesg inside the guest
This is a dmesg from inside the FV guest when the problem happened. Ignore the softlockup warnings and the "too much work" from the 8250 driver; that happened because I did a save/restore on this domain. The important parts are the countdown from the XENBUS driver.
Created attachment 332715 [details]
FV guest configuration file
It would probably be useful to capture the output of 'xenstore-ls' while the guest is in this stuck state, counting down for 5 minutes. I expect one of the devices will be in some unexpected transition 'state', and xenstore should show which one.
The thing is, I know exactly which device it is. It's the CD-ROM device attached to hdc. The problem is that it doesn't *have* a backend, so we are waiting around 5 minutes for a device that will never connect.
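As a quick illustration of how to spot such a device from the guest config, here is a rough Python sketch that splits each entry of the disk list quoted earlier into its three comma-separated fields and reports the ones whose first field (the backing file or device) is empty. The function names are made up for this example, and it assumes every entry has exactly three fields.

```python
# The disk list from the FV guest config quoted earlier in this bug.
disk = ["file:/var/lib/xen/images/rhel5fv_x86_64.dsk,hda,w", ",hdc:cdrom,r"]

def split_disk_entry(entry):
    """Split one 'backing,frontend,mode' string into its three fields."""
    backing, frontend, mode = entry.split(",")
    return backing, frontend, mode

def entries_without_backing(disk_list):
    """Return the frontend names of entries whose backing field is empty."""
    result = []
    for entry in disk_list:
        backing, frontend, _mode = split_disk_entry(entry)
        if not backing:
            result.append(frontend)
    return result

missing = entries_without_backing(disk)
print(missing)  # ['hdc:cdrom']
```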
There is actually another interesting aspect to this. In upstream, the only way you can get into wait_for_devices() is either through boot_wait_for_devices() or through xenbus_register_frontend(). We aren't coming through boot_wait_for_devices(), since that is only called if xenbus is compiled into the kernel, and in RHEL-5, it's a module. Therefore, we have to be coming through xenbus_register_frontend(). And in the upstream case, if you come through xenbus_register_frontend(), wait_for_devices() is essentially a no-op; it returns immediately because "ready_to_wait_for_devices" is never set. On the other hand, in RHEL-5, we are carrying a small patch that forces ready_to_wait_for_devices to 1 in the PV_ON_HVM case. The thing is, I'm not quite sure why we would need that additional patch; if this isn't a boot device, we don't really need to hang around for it. I'll get in contact with Don Dutile and see why we are carrying that patch.
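To make the difference concrete, here is a toy Python model of the gating described above. It is not the kernel code: only the names wait_for_devices and ready_to_wait_for_devices come from the driver, while the callback, the parameter names, and the loop structure are assumptions for illustration. The 300 second limit and 5 second steps are taken from the countdown in the guest log.

```python
def wait_for_devices(ready_to_wait_for_devices, still_connecting,
                     max_delay=300, step=5):
    """Toy model: return how many seconds the caller would spend waiting.

    still_connecting(waited) is a hypothetical callback that reports
    whether any frontend device is still not connected after `waited`
    seconds.
    """
    # If the flag is not set, the call returns immediately.
    if not ready_to_wait_for_devices:
        return 0
    waited = 0
    # Otherwise keep counting down until everything connects or we time out.
    while still_connecting(waited) and waited < max_delay:
        waited += step
    return waited

# A device with no backend never connects.
never_connects = lambda waited: True
# A device that finishes connecting 20 seconds in.
connects_at_20 = lambda waited: waited < 20

upstream_module = wait_for_devices(False, never_connects)
rhel5_pv_on_hvm = wait_for_devices(True, never_connects)
normal_device = wait_for_devices(True, connects_at_20)
print(upstream_module, rhel5_pv_on_hvm, normal_device)  # 0 300 20
```

In this model, the flag alone decides whether a device that never connects costs nothing or the full five minutes, which matches the behavior seen in the FV guest.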
Created attachment 332731 [details]
Output from xenstore-ls, as requested by danpb
This is the output from xenstore-ls while the FV guest is hanging around waiting for the timeout to happen. The domain you are interested in is called rhel5fv_x86_64.
OK, I think I have now worked through the problems with these patches. The first problem had to do with what I mentioned in Comment #6, where devices without a backend were hanging up. This is going to be fixed by the patch in BZ 477005, which removes our unconditional ready_to_wait_for_devices for PV-on-HVM. The second problem I had while testing this had to do with a crash when doing save/restore. It basically had to do with another bug that we just never noticed before, and seems to be fixed by xen-unstable c/s 12526. I will attach both that patch, and my rebased combination of linux-2.6.18-xen.hg c/s 144, 146, and 150 to this BZ. Note that all of these patches must be applied *after* the patch in BZ 477005 to avoid merge conflicts.
Created attachment 333643 [details]
Patch 1/2: Make sure to only recover connected devices on resume
Created attachment 333644 [details]
Patch 2/2: Wait for 5 minutes for the backend to connect
Hi Chris, sorry for the delay getting back to you, I was out of the office.
Looks like you have this sorted now, do you still needinfo from me?
Yeah, I think we are all sorted now. Sorry, I forgot to remove the "needinfo" flag from the bug.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.