Bug 132439 - System clock runs faster under high (interrupt?) load...
Summary: System clock runs faster under high (interrupt?) load...
Keywords:
Status: CLOSED DUPLICATE of bug 163211
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Don Howard
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 133918
Depends On:
Blocks:
 
Reported: 2004-09-13 14:36 UTC by Phil D'Amore
Modified: 2007-11-30 22:06 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-02-06 20:15:38 UTC
Target Upstream Version:
Embargoed:



Description Phil D'Amore 2004-09-13 14:36:36 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510

Description of problem:
We have some servers whose clocks appear to run much faster than
normal when the system is under load.  At first glance, the servers do
not appear to be doing much:

[root@server1 root]# uptime
 10:09am  up 11 days, 19:14,  1 user,  load average: 0.77, 1.01, 0.98

[root@server1 root]# sar -u |tail -10
08:50:00 AM       all      2.03      0.00      9.62     88.35
08:59:57 AM       all      1.67      0.00      9.25     89.08
09:10:00 AM       all      1.37      0.00      9.09     89.54
09:20:00 AM       all      1.79      0.00      9.06     89.14
09:29:58 AM       all      1.44      0.00      8.84     89.72
09:40:00 AM       all      1.99      0.00     10.97     87.03
09:50:00 AM       all      2.12      0.00     11.32     86.56
10:00:00 AM       all      2.58      0.00     10.26     87.16
10:10:00 AM       all      1.96      0.00      9.76     88.28
Average:          all      1.58      0.00      8.40     90.02

However, looking at the interrupts, eth0 (tg3) is generating many more
interrupts than anything else in the system.  The following is part of
the output of the command "procinfo -n10 -d -S", showing just the
interrupts that are actually seeing activity:

irq  0:       100 timer                 irq 46:         0
irq 28:      4922 eth0                  irq 74:         0
irq 30:        22 aic7xxx               irq 76:         0
irq 31:        40 aic7xxx               irq 77:         0

eth0 on these hosts typically generates 4500-5000 interrupts a second
under normal load, which is when time goes nuts.  Looking at ntpd...

-> ntpq
ntpq> lpeers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*clock1.redhat.c .CDMA.           1 u    7   64   77   44.015  -779.38 174.712
+clock2.redhat.c .CDMA.           1 u   62   64   37  109.412  -702.04 173.590
+clock3.util.phx .CDMA.           1 u   62   64   37    1.102  -701.69 174.580

[root@server1 root]# cat /etc/ntp/drift
-500.000
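(That -500.000 is telling: the drift file is pegged at ntpd's maximum
slew rate of 500 ppm, i.e., ntpd is already slewing the clock as hard
as it can.)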

ntpd is slammed trying to keep the clock running properly, but the
clock is fast beyond ntpd's ability to slew it back into line, so it
steps the time on a pretty regular basis:

Sep 13 08:44:05 server1 ntpd[995]: time reset -2.054975 s
Sep 13 08:44:05 server1 ntpd[995]: synchronisation lost
Sep 13 08:59:14 server1 ntpd[995]: time reset -3.893986 s
Sep 13 08:59:14 server1 ntpd[995]: synchronisation lost
Sep 13 09:14:16 server1 ntpd[995]: time reset -3.043421 s
Sep 13 09:14:16 server1 ntpd[995]: synchronisation lost
Sep 13 09:29:27 server1 ntpd[995]: time reset -2.013207 s
Sep 13 09:29:27 server1 ntpd[995]: synchronisation lost
Sep 13 09:44:44 server1 ntpd[995]: time reset -2.025384 s
Sep 13 09:44:44 server1 ntpd[995]: synchronisation lost
Sep 13 10:00:00 server1 ntpd[995]: time reset -1.348106 s
Sep 13 10:00:00 server1 ntpd[995]: synchronisation lost

All of these adjustments are negative, which is driving one of our
applications (vsftpd) crazy, since it is seeing values returned from
gettimeofday() that actually go backwards:

23248 gettimeofday({1094664634, 860444}, NULL) = 0
23248 _llseek(7, 195174856, [195174856], SEEK_SET) = 0
23248 sendfile(8, 7, NULL, 65536)       = 65536
23248 gettimeofday({1094664634, 909594}, NULL) = 0
23248 _llseek(7, 195240392, [195240392], SEEK_SET) = 0
23248 sendfile(8, 7, NULL, 65536)       = 65536
23248 gettimeofday({1094664631, 105547}, NULL) = 0
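
For measuring intervals, a clock that ntpd can step is the wrong tool.
Here is a minimal sketch of the usual workaround (not vsftpd's actual
fix, and assuming a kernel/glibc new enough to provide CLOCK_MONOTONIC,
which 2.1-era 2.4 kernels may not be):

/* Hypothetical sketch: time a transfer with CLOCK_MONOTONIC, which
 * only moves forward, so ntpd steps can't make intervals negative.
 * Link with -lrt on older glibc. */
#include <stdio.h>
#include <time.h>

static double elapsed_sec(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... the _llseek()/sendfile() loop would go here ... */
    clock_gettime(CLOCK_MONOTONIC, &now);

    printf("interval: %.6f s (never negative)\n", elapsed_sec(&start, &now));
    return 0;
}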

As a test, I turned off ntpd on one of the servers and let it run free
for a little while.

When I started (note: monitor is a server with stable time that I'm
using as a baseline, and server1 is the machine with the problem):

[damorep@monitor:/home/saqadoc/damorep]
-> date
Mon Sep 13 10:00:46 EDT 2004

[root@server1 root]# date
Mon Sep 13 10:00:46 EDT 2004

32 minutes later: 
 
[damorep@monitor:/home/saqadoc/damorep]
-> date
Mon Sep 13 10:32:27 EDT 2004

[root@server1 root]# date
Mon Sep 13 10:32:40 EDT 2004

So when left to run free, server1's clock gained 13 seconds in 32 minutes.
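
That works out to 13 s / (32 * 60 s) = roughly 6,800 ppm of error -
more than an order of magnitude beyond the 500 ppm that ntpd can
correct by slewing, which is why it has to step instead.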

This machine is running e.49enterprise.  We also tried with
e.40enterprise and it appears to have the same issue.

Something is horribly wrong here.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. See description

Additional info:

Comment 1 Rik van Riel 2004-09-13 14:54:22 UTC
According to the openntpd maintainer (with whom I spoke at SUCON), the
adjtime system call in Linux isn't POSIX compliant and allows the time
to jump.

This could well be a bug we need to fix for RHEL4.  Maybe RHEL3 and
RHEL2.1 too...
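
For reference, a minimal sketch of the slewing contract callers expect
from adjtime() (a BSD interface exposed by glibc; per the comment
above, the Linux implementation can instead let the time jump).
Offsets are supposed to be amortized gradually:

/* Sketch of expected adjtime() behavior: request a gradual -2 s
 * correction; the kernel should slew rather than step the clock.
 * Requires root (CAP_SYS_TIME). */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval delta = { -2, 0 };    /* correction to apply */
    struct timeval olddelta;             /* any still-pending correction */

    if (adjtime(&delta, &olddelta) != 0) {
        perror("adjtime");
        return 1;
    }
    printf("previous pending adjustment: %ld.%06ld s\n",
           (long)olddelta.tv_sec, (long)olddelta.tv_usec);
    return 0;
}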

Comment 2 Jay Turner 2004-10-01 07:52:51 UTC
*** Bug 133918 has been marked as a duplicate of this bug. ***

Comment 3 Phil D'Amore 2004-10-01 13:37:53 UTC
Two comments here:

1) JFYI, according to the vsftpd maintainer, whom we've contacted
about this issue, the latest vsftpd is fixed to the point that
transfers will at least not fail.  There are still problems (I'm
guessing bandwidth limits will probably not work properly in this
scenario), and the maintainer is addressing them, but the transfers
will work.

2) While it is all well and good that adjtime isn't POSIX compliant,
the reason I opened this bug in the first place is that I believe the
root of this problem is that, under load, the system clock is running
extremely fast.  If that problem didn't happen, the rest of this would
be irrelevant, at least to this bug.

Comment 4 Everaldo de Nóbrega Peixoto 2005-08-08 16:05:35 UTC
Similar problem on FC4 x86_64.

The system clock runs more than 2 times faster than it should. The system load
is also unusually high. When you run procinfo, it aborts with a buffer
overflow, as follows:

Linux 2.6.11-1.1369_FC4 (bhcompile@dolly) (gcc 4.0.0 20050525 ) #1 1CPU [zefir]
Linux 2.6.11-1.1369_FC4 (bhcompile@dolly) (gcc 4.0.0 20050525 ) #1 1CPU [zefir]

Memory:      Total        Used        Free      Shared     Buffers
Mem:             0          24         -24           0           0
Swap:            0           0           0
Bootup: Mon Aug  8 12:30:49 2005    Load average: 0.76 0.59 0.64 2/116 3872
Bootup: Mon Aug  8 12:30:49 2005    Load average: 0.89 0.62 0.65 1/116 3872
user  :       0:01:57.74   3.4%  page in :        0
user  :       0:01:43.58   0.2%  page in :        0
nice  :       0:00:00.00   0.0%  page out:        0
system:       7:15:04.00  50.4%  swap in :        0
idle  :       7:07:17.86  49.5%  swap out:        0
uptime:       0:57:43.59         context :  8416472
irq  0:   3453095 timer                 irq 11:         1
irq  0:17936650240 timer                 irq 11:      5179
irq  1:     51793 i8042                 irq 12:    574909 i8042
irq  3:     10358                       irq 14:  71920696 ide0
irq  4:     25896                       irq 15: 149756224 ide1
irq  5:     10358                       irq169:   1704011 acpi
irq  6:      5179                       irq177:      5179 ATI IXP, yenta
irq  7:      5179                       irq193:  44060876 eth0
irq  8:         0 rtc                   irq201:2982460928 ehci_hcd:usb1, ohci_
irq  9:      5179                       irq209:     15538 ohci1394
irq 10:      5179
*** buffer overflow detected ***: procinfo terminated
======= Backtrace: =========
/lib64/libc.so.6(__chk_fail+0x2f)[0x36475dd23f]
/lib64/libc.so.6[0x36475dc819]
/lib64/libc.so.6(_IO_default_xsputn+0x86)[0x3647567d56]
/lib64/libc.so.6(_IO_vfprintf+0xd5b)[0x3647541c4b]
/lib64/libc.so.6(__vsprintf_chk+0xa9)[0x36475dc8c9]
/lib64/libc.so.6(__sprintf_chk+0x80)[0x36475dc800]
procinfo[0x40699b]
procinfo[0x40232a]
procinfo[0x40634e]
/lib64/libc.so.6(__libc_start_main+0xdc)[0x364751c4cc]
procinfo[0x401999]
======= Memory map: ========
00400000-00409000 r-xp 00000000 fd:00 5393600                           
/usr/bin/procinfo
00509000-0050a000 rw-p 00009000 fd:00 5393600                           
/usr/bin/procinfo
0050a000-0052c000 rw-p 0050a000 00:00 0                                  [heap]
3647300000-364731a000 r-xp 00000000 fd:00 1044711                       
/lib64/ld-2.3.5.so
Linux 2.6.11-1.1369_FC4 (bhcompile@dolly) (gcc 4.0.0 20050525 ) #1 1CPU [zefir]

Memory:      Total        Used        Free      Shared     Buffers      
/lib64/libc-2.3.5.soMem:             0           0           0           0     
     0       /lib64/libc-2.3.5.soSwap:            0           0          
01044712                        /lib64/libc-2.3.5.so3647731000-3647733000 rw-p
00131000 fd:00 1044712                        /lib64/libc-2.3.5.soBootup: Mon
Aug  8 12:30:49 2005    Load average: 0.75 0.60 0.64 1/116 3874
3648200000-3648203000 r-xp 00000000 fd:00 1044553                       
/lib64/libtermcap.so.2.0.8
3648203000-3648302000 ---p 00003000 fd:00 1044553                       
/lib64/libtermcap.so.2.0.8
3648302000-3648303000 rw-p 00002000 fd:00 1044553                       
/lib64/libtermcap.so.2.0.8
364dc00000-364dc0d000 r-xp 00000000 fd:00 1044717                       
/lib64/libgcc_s-4.0.0-20050520.so.1
364dc0d000-364dd0c000 ---p 0000d000 fd:00 1044717                       
/lib64/libgcc_s-4.0.0-20050520.so.1
364dd0c000-364dd0d000 rw-p 0000c000 fd:00 1044717                       
/lib64/libgcc_s-4.0.0-20050520.so.1
2aaaaaaab000-2aaaaaab3000 rw-p 2aaaaaaab000 00:00 0
2aaaaaabd000-2aaaaaac0000 rw-p 2aaaaaabd000 00:00 0
7fffffd14000-7fffffd29000 rw-p 7fffffd14000 00:00 0                      [stack]
[Aborted]f600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]


Issuing the command uptime just after the procinfo above, we have:
 uptime
 13:30:25 up 59 min,  2 users,  load average: 0.42, 0.54, 0.62

The real system uptime at that moment was only 26 minutes.

I must bring to your attention that in this case the useful system load was
very low (only the Mozilla browser filling in the comment field of this bug
report).

Another symptom is that the keyboard repeat rate is very high (I'm not sure,
but I would say more than twice as fast as it should be).


Comment 5 Anssi Johansson 2005-10-19 21:09:02 UTC
For what it's worth, I'm also seeing the symptoms described in this bug on an
FC4 x86_64 system. My clock drifts at about the same rate as the original
reporter's, but not at 2x speed as in comment #4. procinfo dies the same
horrible death as in comment #4, but that's probably something unrelated.

There's an interesting correlation between the clock drift and the amount of
data transferred via network. For every ~11MB transferred, the clock advances
one second. I'm currently running ntpd -q every five minutes to correct the
drift, and usually the correction is only a few seconds, less during the night
when the traffic is lower. I tried downloading an ISO image and the correction
amount jumped to ~20 seconds / 5 minutes.
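
A speculative back-of-envelope check: on a HZ=1000 kernel each extra
timer tick is 1 ms, so gaining one second means roughly 1,000 spurious
ticks.  One second gained per ~11 MB transferred would then be ~11 KB
per spurious tick, which is in the right ballpark for one tick per
(coalesced) network interrupt on gigabit hardware.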

This is an ABIT AV8 motherboard (VIA K8T800 chipset) with AMD Athlon 64 4600+
dual-core CPU, 250GB SATA Maxtor hard disk, 4GB of RAM and ATI Radeon 7000 as
the display adapter. The on-board network adapter is apparently "VIA
Technologies, Inc. VT6120/VT6121/VT6122 Gigabit Ethernet Adapter (rev 11)",
using the via_velocity driver, kernel 2.6.13-1.1526_FC4smp. I'm not running
cpufreq, Powernow has been disabled.

Various listings from the system can be found at
http://jaguaari.miuku.net/clock/ if they're of assistance. Unfortunately this is
a production system, located some 150km away from me, so my debugging
possibilities are rather limited in this regard.

Comment 6 Anssi Johansson 2005-10-25 01:06:49 UTC
Speaking of FC4, the new 2.6.13-1.1532_FC4smp kernel seems to behave an order of
magnitude better than 1526, although this is based on very preliminary
test results. 

Comment 7 Robert Lichtenberger 2005-11-21 10:07:09 UTC
I've experienced the same problems on a dual-core AMD64 machine, using a
vanilla 2.6.14-2 kernel. When running a disk benchmark with bonnie, the clock
is accelerated by a factor of 60 or so. I can see the minutes passing by in
seconds :-).

Using the kernel options "notsc" and "clock=pmtmr" seems to have fixed the
problem, however, at least to the point where the clock "drift" is no longer
in the range of a 60x acceleration.
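
For anyone wanting to try the same workaround: these are kernel
command-line options, so with grub they go on the kernel line in
/boot/grub/grub.conf (the path and version below are illustrative, not
taken from this report):

kernel /vmlinuz-2.6.14-2 ro root=LABEL=/ notsc clock=pmtmr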

Comment 8 Don Howard 2006-02-06 20:14:26 UTC
The original problem reported here was that the system clock on 2.1 was running
fast during periods of high network interrupt activity.  Was the netconsole
module loaded?  A 2.1 bug was recently fixed where netconsole was incorrectly
influencing the system clock. See BZ 163211 for details.  
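(Whether netconsole is loaded can be checked with "lsmod | grep netconsole".)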

There are reports of other clock issues - please see BZ 175293 for this issue
on dual-core AMD with RHEL4/FC4.

The original bug reported here is more than 6 months old and suspected to be
corrected, so I am closing this ticket.

Please re-open if the fast clock issue is still seen on current RHEL2.1. 

Comment 9 Don Howard 2006-02-06 20:15:38 UTC

*** This bug has been marked as a duplicate of 163211 ***

