Bug 449346

Summary: SMP 32-bit RHEL5u1 and RHEL5u2 HVM domains may stop booting when starting the udev service
Product: Red Hat Enterprise Linux 5
Reporter: Adam Stokes <astokes>
Component: kernel-xen
Assignee: Rik van Riel <riel>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: high
Priority: high
Version: 5.4
CC: clalance, cward, dwa, dzickus, jburke, joe.jin, mathieu-acct, qcai, riel, rodney.mckee, tao, tn, xen-maint, yongkang.you
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Clone Of:
: 480689 513395
Last Closed: 2009-09-02 08:35:10 UTC
Bug Blocks: 460955, 480689, 513395    
Attachments (description: flags):
patch 1/3 of the tools bit: none
patch 2/3 of the tools bits: none
patch 3/3 of the tools bits: none

Description Adam Stokes 2008-06-02 10:42:31 UTC
Description of problem:
When booting fully virtualized (FV) 32-bit RHEL 5.1 or 5.2 guests on a 32-bit
RHEL 5.2 dom0, guests with more than one vcpu can pause for several minutes, or
even hang, at the "starting udev" stage of the boot sequence. On my test system
I routinely see 3-minute pauses or never-ending hangs at the udev message. Intel
suggests that setting 'timer_mode=1' in the guest configuration file under
/etc/xen will eliminate the problem, but after discussion with engineering it
appears that our xen userspace bits do not support that new upstream feature.
I booted a 5.2 guest multiple times and recorded how long the udev portion of
the boot sequence took:

udev run times in mm:ss format
------------------------------
03:05 - successful boot
00:08 - successful boot
00:08 - successful boot
18:00 - hang
03:03 - successful boot
00:08 - successful boot
00:08 - successful boot
00:08 - successful boot
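
For anyone repeating the measurement, the timings above can be tallied with a short script. This is only a sketch: the 60-second cutoff for calling a boot "delayed" is an assumption made here for illustration, not something defined in this report.

```python
# Boot timings copied from the table above: (udev time in mm:ss, outcome).
RUNS = [
    ("03:05", "successful boot"),
    ("00:08", "successful boot"),
    ("00:08", "successful boot"),
    ("18:00", "hang"),
    ("03:03", "successful boot"),
    ("00:08", "successful boot"),
    ("00:08", "successful boot"),
    ("00:08", "successful boot"),
]

# Assumption for illustration: a udev phase longer than this is "delayed".
SLOW_THRESHOLD_S = 60


def to_seconds(mmss):
    """Convert an mm:ss string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def summarize(runs, threshold=SLOW_THRESHOLD_S):
    """Count normal, delayed, and hung boots."""
    hangs = sum(1 for _, outcome in runs if outcome == "hang")
    delayed = sum(
        1
        for t, outcome in runs
        if outcome != "hang" and to_seconds(t) > threshold
    )
    normal = len(runs) - hangs - delayed
    return {"total": len(runs), "normal": normal, "delayed": delayed, "hangs": hangs}


summary = summarize(RUNS)
affected = summary["delayed"] + summary["hangs"]
print(summary)
print(f"affected boots: {affected}/{summary['total']} = {affected / summary['total']:.1%}")
```

With the eight runs recorded here, this counts five normal boots, two delayed boots, and one hang.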


Version-Release number of selected component (if applicable):
3.0.x

How reproducible:
25%

Steps to Reproduce:
To perform this testing, I installed a 32-bit xen guest over NFS. After the
initial guest bootup I set the system to boot into runlevel 3 and did a complete
"init 0" shutdown of the guest. Then I brought the guest back up, timed the udev
portion of the boot, and shut it down with "init 0" after it booted. I began
recording the udev run times with the first runlevel 3 boot, not the initial
boot after installation.

  
Actual results:
udev hangs for up to 15 minutes or, in some cases, indefinitely

Expected results:
udev starts properly each time

Additional info:
Using the timer mode implementation in newer versions of Xen addresses this issue:
http://xenbits.xensource.com/xen-3.1-testing.hg?rev/0f3055da442e
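
For reference, a guest configuration using this setting might look roughly like the fragment below, assuming a Xen userspace that understands the timer_mode platform parameter (which, per the description above, the RHEL 5.2 tools do not). xm guest configuration files use Python syntax. Only timer_mode comes from this report; the guest name and the other values are hypothetical placeholders.

```python
# Hypothetical HVM guest config fragment (xm config files use Python syntax).
# Only timer_mode is taken from this report; all other values are placeholders.
name = "rhel5u2-hvm-test"  # hypothetical guest name
builder = "hvm"            # fully virtualized guest
vcpus = 2                  # the problem is reported with more than 1 vcpu
memory = 1024              # placeholder, in MiB

# Timer modes as defined in upstream Xen (xen/include/public/hvm/params.h):
#   0 = delay_for_missed_ticks
#   1 = no_delay_for_missed_ticks
#   2 = no_missed_ticks_pending
#   3 = one_missed_tick_pending
timer_mode = 1
```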

Comment 4 Rik van Riel 2009-01-15 17:46:12 UTC
Adam, which timer mode did you use for the patch to fix your problem?

Comment 5 You, Yongkang 2009-01-16 00:59:21 UTC
Rik, on my Caneland platform I see a hang rate much higher than 20% when booting an SMP PAE RHEL5u1/2 guest. A 32e guest doesn't have this issue.
As I remember, both timer_mode 1 and 2 helped with booting.

Comment 6 Rik van Riel 2009-01-19 20:20:17 UTC
It looks like we will also need the following two changesets from xen-unstable, in addition to the changeset above:

changeset:   18554:1420a6649cfa
user:        Keir Fraser <keir.fraser>
date:        Fri Sep 26 17:09:36 2008 +0100
summary:     hvm: Default timer_mode=1 (do not delay virtual time for missed

changeset:   16764:3f26758bcc02
user:        Keir Fraser <keir.fraser>
date:        Fri Jan 18 22:27:51 2008 +0000
summary:     xend: Handle unspecified timer_mode domain platform parameter.

Comment 7 Rik van Riel 2009-01-19 21:37:28 UTC
Other timer changesets we may need for HVM time to run correctly:

changeset:   18729:16eede823854
user:        Keir Fraser <keir.fraser>
date:        Tue Oct 28 10:36:22 2008 +0000
summary:     hvm: Do not mess with APIC timer deadline if in one-shot mode.

changeset:   18695:6f74549ac4c5
user:        Keir Fraser <keir.fraser>
date:        Wed Oct 22 12:08:16 2008 +0100
summary:     x86, hvm: Allow 100us periodic virtual timers

changeset:   18694:71c15dfaa12b
user:        Keir Fraser <keir.fraser>
date:        Wed Oct 22 12:04:32 2008 +0100
summary:     Port HPET device model to vpt timer subsystem

changeset:   17716:6c4cab061af4
user:        Keir Fraser <keir.fraser>
date:        Sat May 24 09:27:03 2008 +0100
summary:     hvm: Build guest timers on monotonic system time.

changeset:   16690:01adaec882d4
user:        Keir Fraser <keir.fraser>
date:        Tue Jan 08 14:31:23 2008 +0000
summary:     hvm: time: Fixes to 'SYNC' (no_missed_ticks_pending) timer handling.

changeset:   16689:66db23ecd562
user:        Keir Fraser <keir.fraser>
date:        Tue Jan 08 13:57:45 2008 +0000
summary:     hvm: hpet: Fix per-timer enable/disable.

Comment 8 Rik van Riel 2009-01-26 23:46:04 UTC
Related to bug 307201 ?

Comment 9 Rik van Riel 2009-01-30 10:09:48 UTC
Some more HVM HPET candidates:

changeset:   17017:209512f6d89c
user:        Keir Fraser <keir.fraser>
date:        Mon Feb 11 14:45:29 2008 +0000
summary:     x86 hvm: Allow HPET to be configured as a per-domain config option.

changeset:   16707:51aa2f884f64
user:        Keir Fraser <keir.fraser>
date:        Fri Jan 11 11:01:36 2008 +0000
summary:     hvm: hpet: Tidy up hpet_to_ns_limit calculation.

changeset:   16697:1b2be7cf0b7b
user:        Keir Fraser <keir.fraser>
date:        Wed Jan 09 10:32:13 2008 +0000
summary:     hvm: hpet: Clamp period to sane values to prevent excessive looping in

changeset:   16693:9ff64d045e61
user:        Keir Fraser <keir.fraser>
date:        Tue Jan 08 16:20:04 2008 +0000
summary:     hvm: hpet: Fix overflow when converting to nanoseconds.

changeset:   16689:66db23ecd562
user:        Keir Fraser <keir.fraser>
date:        Tue Jan 08 13:57:45 2008 +0000
summary:     hvm: hpet: Fix per-timer enable/disable.

changeset:   16486:c00f31f27de6
user:        Keir Fraser <keir.fraser>
date:        Wed Nov 28 13:13:51 2007 +0000
summary:     hvm: Fix 2 type mismatches in vlapic.h and hpet.c for 32-bit build Xen

changeset:   16404:ae6f4c7f15cb
user:        Keir Fraser <keir.fraser>
date:        Wed Nov 21 09:49:09 2007 +0000
summary:     hvm: Do not crash guest if it does an unaligned access to an HPET

Comment 10 Rik van Riel 2009-02-07 02:31:42 UTC
Created attachment 331184 [details]
patch 1/3 of the tools bit

I am building a kernel now with a bunch of backported patches, CVS branch private-bz449346-branch, http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1682472

Comment 11 Rik van Riel 2009-02-07 02:33:00 UTC
Created attachment 331185 [details]
patch 2/3 of the tools bits

Comment 12 Rik van Riel 2009-02-07 02:34:03 UTC
Created attachment 331186 [details]
patch 3/3 of the tools bits

Comment 13 Rik van Riel 2009-04-10 17:00:43 UTC
This bug can be caused by a combination of two main factors:
- while doing disk IO, one VCPU of an HVM guest can miss timer ticks
- Xen did not re-deliver those missed timer ticks later on, causing clock skew between VCPUs inside an HVM guest

Both of these issues should be resolved with the backport of the AIO disk handling code and upstream Xen 'no missed-tick accounting' timer code. Please test the test RPMs from http://people.redhat.com/riel/.xenaiotime/ and let us know if those (experimental!) test packages resolve the issue.
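
To illustrate the two factors above, here is a toy model of the effect. It is not Xen's timer code; it only shows how ticks that are missed and never re-delivered turn into clock skew between VCPUs. The 1000 Hz tick rate, the stall window, and the function name are assumptions chosen for the example.

```python
TICK_MS = 1  # assumed 1000 Hz guest timer: one tick per millisecond


def run(total_ticks, blocked, redeliver):
    """Return per-VCPU guest time in ms after total_ticks timer ticks.

    blocked maps a VCPU index to the range of tick numbers during which
    that VCPU cannot take timer interrupts (for example, while it is
    stalled on disk IO). With redeliver=False, ticks missed in that
    window are lost; with redeliver=True, they are queued and delivered
    together with the first tick after the stall.
    """
    clocks = [0, 0]   # guest-visible time per VCPU
    pending = [0, 0]  # missed ticks waiting to be re-delivered
    for tick in range(total_ticks):
        for vcpu in range(len(clocks)):
            if tick in blocked.get(vcpu, range(0)):
                if redeliver:
                    pending[vcpu] += 1
                continue
            clocks[vcpu] += (1 + pending[vcpu]) * TICK_MS
            pending[vcpu] = 0
    return clocks


# VCPU 1 is stalled for ticks 200..699 out of 1000.
stall = {1: range(200, 700)}
print("no re-delivery:", run(1000, stall, redeliver=False))
print("re-delivery:   ", run(1000, stall, redeliver=True))
```

In this model, VCPU 1 ends up 500 ms behind VCPU 0 when missed ticks are dropped, and both VCPUs agree when the missed ticks are re-delivered after the stall.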

Comment 14 Rik van Riel 2009-04-13 16:17:22 UTC
*** Bug 490760 has been marked as a duplicate of this bug. ***

Comment 15 Qian Cai 2009-05-05 22:33:04 UTC
*** Bug 499276 has been marked as a duplicate of this bug. ***

Comment 16 Don Zickus 2009-05-12 17:39:43 UTC
in kernel-2.6.18-146.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 18 Chris Ward 2009-07-03 18:03:21 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 19 You, Yongkang 2009-07-09 01:49:35 UTC
I tried to reproduce this issue on RHEL 5.4 beta.

With a RHEL5u1 PAE image, I still saw guest boot stop a few times at "waiting for driver initialization".

I did not see any issue with a RHEL5u4 PAE image.

Comment 20 Jeff Burke 2009-07-13 20:44:23 UTC
*** Bug 461640 has been marked as a duplicate of this bug. ***

Comment 21 Jeff Burke 2009-07-13 20:49:14 UTC
*** Bug 307201 has been marked as a duplicate of this bug. ***

Comment 23 errata-xmlrpc 2009-09-02 08:35:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 27 Red Hat Bugzilla 2023-09-14 01:12:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days