From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510

Description of problem:
We have some servers whose clocks appear to run much faster than normal when the system is under load. At first glance, the servers do not appear to be doing much:

[root@server1 root]# uptime
 10:09am  up 11 days, 19:14,  1 user,  load average: 0.77, 1.01, 0.98

[root@server1 root]# sar -u | tail -10
08:50:00 AM       all      2.03      0.00      9.62     88.35
08:59:57 AM       all      1.67      0.00      9.25     89.08
09:10:00 AM       all      1.37      0.00      9.09     89.54
09:20:00 AM       all      1.79      0.00      9.06     89.14
09:29:58 AM       all      1.44      0.00      8.84     89.72
09:40:00 AM       all      1.99      0.00     10.97     87.03
09:50:00 AM       all      2.12      0.00     11.32     86.56
10:00:00 AM       all      2.58      0.00     10.26     87.16
10:10:00 AM       all      1.96      0.00      9.76     88.28
Average:          all      1.58      0.00      8.40     90.02

However, looking at the interrupts, eth0 (tg3) is generating far more interrupts than anything else in the system. The following is part of the output of the command "procinfo -n10 -d -S", showing just the interrupts that are actually seeing activity:

irq  0:       100 timer        irq 46:         0
irq 28:      4922 eth0         irq 74:         0
irq 30:        22 aic7xxx      irq 76:         0
irq 31:        40 aic7xxx      irq 77:         0

eth0 on these hosts typically generates 4500-5000 interrupts a second under normal load, which is when time goes nuts.

Looking at ntpd...

-> ntpq
ntpq> lpeers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*clock1.redhat.c .CDMA.           1 u    7   64   77   44.015  -779.38 174.712
+clock2.redhat.c .CDMA.           1 u   62   64   37  109.412  -702.04 173.590
+clock3.util.phx .CDMA.           1 u   62   64   37    1.102  -701.69 174.580

[root@server1 root]# cat /etc/ntp/drift
-500.000

ntpd is slammed trying to keep the clock running properly, but the clock is fast beyond ntpd's ability to rein it in by slewing, so it steps the time on a pretty regular basis:

Sep 13 08:44:05 server1 ntpd[995]: time reset -2.054975 s
Sep 13 08:44:05 server1 ntpd[995]: synchronisation lost
Sep 13 08:59:14 server1 ntpd[995]: time reset -3.893986 s
Sep 13 08:59:14 server1 ntpd[995]: synchronisation lost
Sep 13 09:14:16 server1 ntpd[995]: time reset -3.043421 s
Sep 13 09:14:16 server1 ntpd[995]: synchronisation lost
Sep 13 09:29:27 server1 ntpd[995]: time reset -2.013207 s
Sep 13 09:29:27 server1 ntpd[995]: synchronisation lost
Sep 13 09:44:44 server1 ntpd[995]: time reset -2.025384 s
Sep 13 09:44:44 server1 ntpd[995]: synchronisation lost
Sep 13 10:00:00 server1 ntpd[995]: time reset -1.348106 s
Sep 13 10:00:00 server1 ntpd[995]: synchronisation lost

All of these adjustments are negative, which is driving one of our applications (vsftpd) crazy, since it is seeing values returned from gettimeofday() that actually go backwards:

23248 gettimeofday({1094664634, 860444}, NULL) = 0
23248 _llseek(7, 195174856, [195174856], SEEK_SET) = 0
23248 sendfile(8, 7, NULL, 65536) = 65536
23248 gettimeofday({1094664634, 909594}, NULL) = 0
23248 _llseek(7, 195240392, [195240392], SEEK_SET) = 0
23248 sendfile(8, 7, NULL, 65536) = 65536
23248 gettimeofday({1094664631, 105547}, NULL) = 0

As a test, I turned off ntpd on one of the servers and let it run free for a little while.
When I started (note: "monitor" is a server with stable time that I'm using as a baseline, and server1 is the machine with the problem):

[damorep@monitor:/home/saqadoc/damorep]
-> date
Mon Sep 13 10:00:46 EDT 2004

[root@server1 root]# date
Mon Sep 13 10:00:46 EDT 2004

32 minutes later:

[damorep@monitor:/home/saqadoc/damorep]
-> date
Mon Sep 13 10:32:27 EDT 2004

[root@server1 root]# date
Mon Sep 13 10:32:40 EDT 2004

So when left to run free, server1's clock gained 13 seconds in 32 minutes.

This machine is running e.49enterprise. We also tried e.40enterprise, and it appears to have the same issue. Something is horribly wrong here.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. See description

Additional info:
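A side note on the gettimeofday() trace above: any code that times intervals by subtracting two wall-clock samples will see negative intervals whenever ntpd steps the time back. On kernels that provide it, CLOCK_MONOTONIC is immune to steps (though it would still run fast here, since the underlying tick accounting is what's broken). A minimal illustration of the difference, not vsftpd's actual code:

#include <stdio.h>
#include <time.h>

/* Illustrative only: measure an interval with a clock that ntpd
 * cannot step backwards.  gettimeofday() tracks wall time and jumps
 * with every "time reset" logged above; CLOCK_MONOTONIC only moves
 * forward (but still ticks too fast if tick counting is broken,
 * which is the real bug here). */
int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... do the work being timed, e.g. a sendfile() loop ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.6f s (never negative)\n", elapsed);
    return 0;
}

(Link with -lrt on older glibc. CLOCK_MONOTONIC needs a 2.6-era kernel, so this is not an option on 2.1's 2.4 kernels.)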
According to the openntpd maintainer (who I spoke with at SUCON), the adjtime system call in Linux isn't POSIX compliant and allows the time to jump. This could well be a bug we need to fix for RHEL4. Maybe RHEL3 and RHEL2.1 too...
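For what it's worth, the kernel-side state that adjtime()/ntpd drive can be read back with adjtimex(2): calling it with modes = 0 changes nothing and shows how hard the clock is being disciplined. A minimal sketch, assuming glibc's <sys/timex.h>:

#include <stdio.h>
#include <sys/timex.h>

int main(void)
{
    struct timex tx = { 0 };   /* modes = 0: query only, change nothing */

    if (adjtimex(&tx) == -1) {
        perror("adjtimex");
        return 1;
    }
    /* freq is ppm scaled by 2^16; a box pinned at ntpd's -500 ppm
     * limit, like the /etc/ntp/drift file above suggests, would show
     * freq == -500 << 16 == -32768000. */
    printf("freq:   %ld (scaled ppm)\n", tx.freq);
    printf("offset: %ld us remaining\n", tx.offset);
    printf("tick:   %ld us per tick\n", tx.tick);
    printf("status: 0x%x\n", tx.status);
    return 0;
}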
*** Bug 133918 has been marked as a duplicate of this bug. ***
2 comments here:

1) JFYI, according to the vsftpd maintainer, whom we've contacted about this issue, the latest vsftpd is fixed to the point that transfers will at least not fail. There are still problems (I'm guessing bandwidth limits will probably not work properly in this scenario; see the sketch below), and the maintainer is addressing them, but the transfers will work.

2) While it is all well and good that adjtime isn't POSIX compliant, the reason I opened this bug in the first place is that I believe the root of this problem is that, under load, the system clock is running extremely fast. If that problem didn't happen, the rest of this would be irrelevant, at least to this bug.
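To illustrate point 1: a transfer-rate or bandwidth-limit calculation that naively subtracts two gettimeofday() samples gets a negative elapsed time when ntpd steps the clock back (the strace above shows a jump back of ~3.8 s), and the resulting rate or sleep math goes haywire. A hypothetical sketch of the defensive clamp, not vsftpd's actual fix:

#include <stdio.h>
#include <sys/time.h>

/* Hypothetical helper: elapsed seconds between two gettimeofday()
 * samples, clamped so that a backwards clock step cannot produce a
 * negative interval (and thus a negative rate or a huge throttle
 * sleep). */
static double elapsed_clamped(const struct timeval *start,
                              const struct timeval *now)
{
    double d = (now->tv_sec - start->tv_sec)
             + (now->tv_usec - start->tv_usec) / 1e6;
    return d > 0.0 ? d : 0.0;
}

int main(void)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    /* ... transfer one 65536-byte chunk, as in the strace ... */
    gettimeofday(&t1, NULL);

    double secs = elapsed_clamped(&t0, &t1);
    if (secs > 0.0)
        printf("rate: %.0f bytes/s\n", 65536 / secs);
    return 0;
}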
Similar problem on FC4 x86_64. The system clock runs more than 2 times faster than it should, and the system load is also unusually high. When you run procinfo, it aborts with a buffer overflow, as follows:

Linux 2.6.11-1.1369_FC4 (bhcompile@dolly) (gcc 4.0.0 20050525 ) #1 1CPU [zefir]
Linux 2.6.11-1.1369_FC4 (bhcompile@dolly) (gcc 4.0.0 20050525 ) #1 1CPU [zefir]

Memory:      Total        Used        Free      Shared     Buffers
Mem:             0          24         -24           0           0
Swap:            0           0           0

Bootup: Mon Aug  8 12:30:49 2005    Load average: 0.76 0.59 0.64 2/116 3872
Bootup: Mon Aug  8 12:30:49 2005    Load average: 0.89 0.62 0.65 1/116 3872

user  :       0:01:57.74   3.4%  page in :        0
user  :       0:01:43.58   0.2%  page in :        0
nice  :       0:00:00.00   0.0%  page out:        0
system:       7:15:04.00  50.4%  swap in :        0
idle  :       7:07:17.86  49.5%  swap out:        0
uptime:       0:57:43.59         context :  8416472

irq  0:    3453095 timer                irq 11:          1
irq  0: 17936650240 timer               irq 11:       5179
irq  1:      51793 i8042                irq 12:     574909 i8042
irq  3:      10358                      irq 14:   71920696 ide0
irq  4:      25896                      irq 15:  149756224 ide1
irq  5:      10358                      irq169:    1704011 acpi
irq  6:       5179                      irq177:       5179 ATI IXP, yenta
irq  7:       5179                      irq193:   44060876 eth0
irq  8:          0 rtc                  irq201: 2982460928 ehci_hcd:usb1, ohci_
irq  9:       5179                      irq209:      15538 ohci1394
irq 10:       5179

*** buffer overflow detected ***: procinfo terminated
======= Backtrace: =========
/lib64/libc.so.6(__chk_fail+0x2f)[0x36475dd23f]
/lib64/libc.so.6[0x36475dc819]
/lib64/libc.so.6(_IO_default_xsputn+0x86)[0x3647567d56]
/lib64/libc.so.6(_IO_vfprintf+0xd5b)[0x3647541c4b]
/lib64/libc.so.6(__vsprintf_chk+0xa9)[0x36475dc8c9]
/lib64/libc.so.6(__sprintf_chk+0x80)[0x36475dc800]
procinfo[0x40699b]
procinfo[0x40232a]
procinfo[0x40634e]
/lib64/libc.so.6(__libc_start_main+0xdc)[0x364751c4cc]
procinfo[0x401999]
======= Memory map: ========
00400000-00409000 r-xp 00000000 fd:00 5393600   /usr/bin/procinfo
00509000-0050a000 rw-p 00009000 fd:00 5393600   /usr/bin/procinfo
0050a000-0052c000 rw-p 0050a000 00:00 0         [heap]
3647300000-364731a000 r-xp 00000000 fd:00 1044711 /lib64/ld-2.3.5.so
Linux 2.6.11-1.1369_FC4 (bhcompile@dolly) (gcc 4.0.0 20050525 ) #1 1CPU [zefir]
Memory:      Total        Used        Free      Shared     Buffers   /lib64/libc-2.3.5.so
Mem:             0           0           0           0           0   /lib64/libc-2.3.5.so
Swap:            0           0           0                 1044712   /lib64/libc-2.3.5.so
3647731000-3647733000 rw-p 00131000 fd:00 1044712 /lib64/libc-2.3.5.so
Bootup: Mon Aug  8 12:30:49 2005    Load average: 0.75 0.60 0.64 1/116 3874
3648200000-3648203000 r-xp 00000000 fd:00 1044553 /lib64/libtermcap.so.2.0.8
3648203000-3648302000 ---p 00003000 fd:00 1044553 /lib64/libtermcap.so.2.0.8
3648302000-3648303000 rw-p 00002000 fd:00 1044553 /lib64/libtermcap.so.2.0.8
364dc00000-364dc0d000 r-xp 00000000 fd:00 1044717 /lib64/libgcc_s-4.0.0-20050520.so.1
364dc0d000-364dd0c000 ---p 0000d000 fd:00 1044717 /lib64/libgcc_s-4.0.0-20050520.so.1
364dd0c000-364dd0d000 rw-p 0000c000 fd:00 1044717 /lib64/libgcc_s-4.0.0-20050520.so.1
2aaaaaaab000-2aaaaaab3000 rw-p 2aaaaaaab000 00:00 0
2aaaaaabd000-2aaaaaac0000 rw-p 2aaaaaabd000 00:00 0
7fffffd14000-7fffffd29000 rw-p 7fffffd14000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
[Aborted]

Issuing the command uptime just after the procinfo above, we have:

uptime
 13:30:25 up 59 min, 2 users, load average: 0.42, 0.54, 0.62

The real system uptime at that moment was only 26 minutes. I must bring to your attention that in this case the useful system load was very low (only the Mozilla browser, filling in the comment field of this bug report). Another symptom is that the keyboard repeat rate is very high (I'm not sure, but I would say more than twice as fast as it should be).
For what it's worth, I'm also seeing the symptoms described in this bug on an FC4 x86-64 system. My clock drifts at about the same rate as the original reporter's, though not at 2x speed as in comment #4. procinfo dies the same horrible death as in comment #4, but that's probably unrelated.

There's an interesting correlation between the clock drift and the amount of data transferred via the network: for every ~11 MB transferred, the clock advances one second. I'm currently running ntpd -q every five minutes to correct the drift, and usually the correction is only a few seconds, less during the night when traffic is lower. I tried downloading an ISO image and the correction amount jumped to ~20 seconds per 5 minutes.

This is an ABIT AV8 motherboard (VIA K8T800 chipset) with an AMD Athlon 64 4600+ dual-core CPU, a 250GB SATA Maxtor hard disk, 4GB of RAM, and an ATI Radeon 7000 as the display adapter. The on-board network adapter is apparently "VIA Technologies, Inc. VT6120/VT6121/VT6122 Gigabit Ethernet Adapter (rev 11)", using the via_velocity driver, on kernel 2.6.13-1.1526_FC4smp. I'm not running cpufreq, and PowerNow has been disabled.

Various listings from the system can be found at http://jaguaari.miuku.net/clock/ if they're of assistance. Unfortunately this is a production system, located some 150 km away from me, so my debugging possibilities are rather limited.
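A back-of-envelope on that correlation (rough numbers, assuming HZ=1000 and one interrupt per full-size frame): 1 second gained per ~11 MB is 1000 extra 1 ms ticks per ~11 MB, i.e. one spurious tick per ~11 KB of traffic, or roughly one per 7-8 full-size 1500-byte frames. That order of magnitude is consistent with the theory that bursts of network interrupts are fooling the kernel's lost-tick compensation into crediting extra timer ticks.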
Speaking of FC4, the new 2.6.13-1.1532_FC4smp kernel seems to behave an order of magnitude better than 1526, although this is based on very preliminary test results.
I've experienced the same problems on a dual-core AMD64 machine, using a vanilla 2.6.14-2 kernel. When running a disk benchmark with bonnie, the clock is accelerated by a factor of 60 or so; I can see the minutes passing by in seconds :-). Using the kernel options "notsc" and "clock=pmtmr" seems to have fixed the problem, however, at least to the point where the clock "drift" is no longer in the range of a 60x acceleration.
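For anyone wanting to try the same workaround: those options go on the kernel line of the bootloader config. A hypothetical /boot/grub/grub.conf entry (image name and root device here are illustrative, not from the reporter's system):

kernel /vmlinuz-2.6.14-2 ro root=/dev/hda2 notsc clock=pmtmr

"notsc" stops the kernel from using the per-core, frequency-sensitive TSC as a time source, and "clock=pmtmr" selects the chipset's ACPI PM timer instead, which is shared by both cores.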
The original problem reported here was that the system clock on 2.1 was running fast during periods of high network interrupt activity. Was the netconsole module loaded? A 2.1 bug was recently fixed where netconsole was incorrectly influencing the system clock; see BZ 163211 for details.

There are reports of other clock issues - please see BZ 175293 for this issue on dual-core AMD with RHEL4/FC4.

The original bug reported here is more than 6 months old and suspected to be corrected, so I am closing this ticket. Please re-open if the fast-clock issue is still seen on current RHEL2.1.
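(A quick way to check the netconsole question on an affected box is "lsmod | grep netconsole", or "grep netconsole /proc/modules".)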
*** This bug has been marked as a duplicate of 163211 ***