Bug 427588 - [RHEL 5.2]: Tick divider bug when using clocksource=pit
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Chris Lalancette
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 483701
Reported: 2008-01-04 17:03 EST by Chris Lalancette
Modified: 2009-09-02 04:18 EDT (History)

Doc Type: Bug Fix
Last Closed: 2009-09-02 04:18:37 EDT

External Trackers
Tracker ID Priority Status Summary Last Updated
CentOS 2189 None None None Never

Description Chris Lalancette 2008-01-04 17:03:46 EST
Description of problem:
I've been recently working with some CentOS people using the tick divider patch:
http://bugs.centos.org/view.php?id=2189

They pointed out a bug to me: if, under VMware, you boot a RHEL-5.1 (or 5.2)
kernel with divider=10 clocksource=pit, the kernel will get softlockups and
exhibit all kinds of strange behavior.  The end result is that the kernel does
not really boot, and doesn't work properly.

I've tracked this down to arch/i386/kernel/io_apic.c: check_timer().  In there,
there is a check (timer_irq_works()) to make sure that the PIT, when routed
through the IO-APIC, actually works.  However, it does it by enabling
interrupts, mdelay((10*1000)/HZ), and then comparing the difference in the
jiffies.  In the case of a divided kernel, however, 0 jiffies may have elapsed
during the mdelay, instead of the expected 10.

I'm still working on a solution, but the simple way to go may be to just change
HZ to REAL_HZ.  This looks like it will also affect x86_64, although I haven't
confirmed it there yet.
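
The arithmetic behind the failing check can be sketched in user space. This is only a model under stated assumptions, not kernel code: it assumes HZ=1000 and that the real interrupt rate is HZ divided by the divider; the helper names are hypothetical.

```c
#include <assert.h>

/* Assumption: logical tick rate of 1000, as on a non-Xen RHEL 5 i386 kernel. */
#define HZ 1000u

/* Assumed rate at which the timer really interrupts once divider=N is applied. */
unsigned int real_hz(unsigned int tick_divider)
{
    assert(tick_divider >= 1 && HZ % tick_divider == 0);
    return HZ / tick_divider;
}

/* Length in ms of the mdelay((10*1000)/rate) wait for a given rate. */
unsigned int check_delay_ms(unsigned int rate_hz)
{
    assert(rate_hz >= 1);
    return (10u * 1000u) / rate_hz;
}

/* Timer interrupts expected to arrive during a wait of delay_ms. */
unsigned int expected_interrupts(unsigned int delay_ms, unsigned int tick_divider)
{
    return delay_ms * real_hz(tick_divider) / 1000u;
}
```

With divider=10, waiting check_delay_ms(HZ) = 10 ms leaves room for only one real interrupt, so a single late or lost interrupt (plausible under a hypervisor) means jiffies does not move at all. Waiting check_delay_ms(real_hz(10)) = 100 ms instead restores the expected ten interrupts, which is what the HZ to REAL_HZ change would do.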
Comment 2 Akemi Yagi 2008-01-04 18:18:52 EST
Regarding x86_64, the only available clocksource is jiffies.  So choosing pit is
not an option there.

Akemi
Comment 3 Dan Magenheimer 2008-02-29 13:37:17 EST
FYI, we have only observed the "can't boot" problem when combining
divider=10 with clock=pit on Xen on 32-bit kernels.  64-bit kernels boot
fine.  HOWEVER, we have just discovered that 64-bit kernels with
divider=10 and clock=pit (nohpet, nopmtimer) show bad
clock skew under Xen.  Possibly this is hidden from real users
by NTP, but I thought your debugging efforts might be easier if
you know the problem does not just affect 32-bit.
Comment 4 Akemi Yagi 2008-02-29 15:29:38 EST
If I'm not mistaken, xen kernels are set to 250Hz by default.  You might not
want to use the divider= option in this case.

Akemi
Comment 5 Dan Magenheimer 2008-02-29 16:07:39 EST
Paravirtualized kernels are 250Hz.  All of our measurements are with
fully-virtualized ("hvm") kernels for which the HZ rate was compiled-in long ago.
Comment 6 Dan Magenheimer 2008-03-28 10:09:44 EDT
I've been playing with this and I think I have a partial fix, which might make
it easier to identify a complete fix.  Test by booting with divider=10 but not
clock=pit.  Manually change clocksource to pit (via writing to sysfs).  Havoc
erupts instantaneously as the time-of-day clock starts gaining time very
quickly.   Now, in arch/i386/kernel/i8253.c, remove the line in pit_read that
multiplies count by tick_divider (following comment "Adjust to logical ticks").
 This changes the problem from instantaneous and devastating to periodic and
usable (though still unacceptable).
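
For anyone reproducing the test above, the runtime switch can be done from a small C helper instead of a shell redirect. This is a minimal sketch: set_clocksource is a hypothetical helper name, and it assumes the kernel exposes the standard current_clocksource sysfs file and registers the PIT under the name "pit".

```c
#include <assert.h>
#include <stdio.h>

/* Standard sysfs location of the active clocksource (assumed present on this kernel). */
#define CLOCKSOURCE_SYSFS \
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"

/* Write a clocksource name to the given file. Returns 0 on success, -1 on error. */
int set_clocksource(const char *path, const char *name)
{
    assert(path != NULL && name != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(name, f) >= 0;
    /* A rejected sysfs write may only be reported when the buffer is flushed. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling set_clocksource(CLOCKSOURCE_SYSFS, "pit") as root on a kernel booted with divider=10 should reproduce the switch described above; reading the same file back shows which clocksource is active.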
Comment 7 Dan Magenheimer 2008-04-02 15:18:47 EDT
Ignore the above partial fix.  I've had good luck in limited testing so far with
the one-line patch below.  No boot problems, no crazy time problems. And the
evaluated condition is the same if tick_divider=1 so no change to the normal case.

--- arch/i386/kernel/i8253.c    2008-04-02 11:28:43.000000000 -0600
+++ arch.patch/i386/kernel/i8253.c      2008-04-02 12:25:14.000000000 -0600
@@ -86,7 +86,7 @@
         * Previous attempts to handle these cases intelligently were
         * buggy, so we just do the simple thing now.
         */
-       if (count > old_count && jifs == old_jifs) {
+       if (count > old_count && (jifs - old_jifs) < tick_divider) {
                count = old_count;
        }
        old_count = count;
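
The claim that the normal case is unchanged can be checked with a small user-space comparison of the two conditions. This is a sketch only: the function names are hypothetical, and it assumes the counters are 32-bit unsigned values so that the subtraction stays correct across wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Condition before the patch: reuse old_count only if jiffies did not move. */
bool clamp_original(uint32_t count, uint32_t old_count,
                    uint32_t jifs, uint32_t old_jifs)
{
    return count > old_count && jifs == old_jifs;
}

/* Condition after the patch: reuse old_count if fewer than tick_divider
 * jiffies have passed since the previous read. */
bool clamp_patched(uint32_t count, uint32_t old_count,
                   uint32_t jifs, uint32_t old_jifs, uint32_t tick_divider)
{
    assert(tick_divider >= 1);
    return count > old_count && (uint32_t)(jifs - old_jifs) < tick_divider;
}
```

For tick_divider=1 the unsigned difference is below 1 only when it is 0, so both functions agree on every input. For tick_divider=10 the patched condition also clamps when jiffies has advanced by 1 to 9 since the previous read.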
Comment 8 Alan Cox 2008-05-03 10:26:11 EDT
Paravirt uses 250Hz fixed and full virt is all a bit weird if it isn't related
to the real Xen timing.

The patch looks sensible to me.
Comment 9 Nathan Bryant 2008-08-27 14:25:35 EDT
Tested the patch under Microsoft Virtual PC (with clocksource=pit divider=10) and it appears to fix all the kernel instability issues. gettimeofday is still drifting, but I'm chalking that up to Microsoft for now.
Comment 10 Tru Huynh 2008-09-30 03:58:47 EDT
From the CentOS bug entry:

http://kb.vmware.com/kb/1006427 lists the timekeeping best practices for a number of distributions.

See also https://bugzilla.redhat.com/show_bug.cgi?id=463573
Comment 12 RHEL Product and Program Management 2009-02-16 10:22:50 EST
Updating PM score.
Comment 15 Chris Lalancette 2009-03-01 08:42:36 EST
Dan M,
     We want to pull the patch from Comment #7 into the next RHEL kernel.  Given that it is a one-off fix and will never be upstream, I just wanted to make sure that I had your Signed-off-by before putting it into RHEL.  Just let me know.

Thanks,
Chris Lalancette
Comment 16 Dan Magenheimer 2009-03-02 13:39:09 EST
Hi Chris --
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
(Note that I myself have only done limited testing on the fix.)
Comment 17 Chris Lalancette 2009-03-03 03:19:04 EST
Great, thanks.  I'll definitely get it some QA here before we put it in.

Thanks again,
Chris Lalancette
Comment 18 Don Zickus 2009-04-27 11:57:19 EDT
in kernel-2.6.18-141.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 22 errata-xmlrpc 2009-09-02 04:18:37 EDT
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
