Red Hat Bugzilla – Bug 451824
Clock ticks slow on 220.127.116.11-10
Last modified: 2009-01-09 02:50:06 EST
Description of problem:
The system boots up properly. Clock seems to be ticking fine. After a random
amount of time, the clock starts ticking about 4-8 times slower.
I have seen it happen on more than one system (one i386 and one x86_64).
I've tried to boot with "nohz=off", but that does not help.
In all cases (nohz=off or not), there seems to be no activity on the "timer"
interrupt after boot.
0: 353 0 IO-APIC-edge timer
I would have thought that with nohz=off, there should be some timer activity.
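
One way to confirm whether the IRQ 0 count is really static is to compare two snapshots of /proc/interrupts taken a few seconds apart. The following is a minimal Python sketch, not a definitive tool: the helper names `parse_irq_counts` and `irq_advanced` are hypothetical, and the parsing assumes the usual layout of a row label followed by one count per CPU. On many SMP systems the periodic tick may be delivered through the local APIC timer, which shows up on the `LOC` row rather than on IRQ 0, so it is probably worth checking both rows.

```python
def parse_irq_counts(interrupts_text, label):
    """Return per-CPU counts for the /proc/interrupts row named `label`.

    Assumes the usual layout: "<label>:" followed by one integer per CPU,
    then free-form description fields.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != label + ":":
            continue
        counts = []
        for field in fields[1:]:
            if not field.isdigit():
                break  # the description (e.g. "IO-APIC-edge timer") starts here
            counts.append(int(field))
        return counts
    return []  # row not present


def irq_advanced(before_text, after_text, label):
    """Return how far each CPU's count moved between two snapshots."""
    before = parse_irq_counts(before_text, label)
    after = parse_irq_counts(after_text, label)
    return [a - b for b, a in zip(before, after)]
```

To use it, read /proc/interrupts into a string, wait a few seconds, read it again, and call `irq_advanced(first, second, "0")` and `irq_advanced(first, second, "LOC")`. A result of all zeros for a row means that row did not tick during the interval.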
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Boot offending kernel
2. Wait for the problem to manifest itself.
Actual results:
The clock sometimes starts to slow down after a random period of time.
Expected results:
The clock should tick at the normal rate.
ATI chipset by any chance?
Nope. However both systems have ATI graphics cards.
The i386 system is a Tyan Tiger 133 with dual Pentium III Coppermine 733 MHz.
The x86_64 is a Tyan Thunder K8W with dual Opteron 242 1.6 GHz.
Are you interested in lspci output?
This really sounds like a nohz bug. I'm quite stumped that adding nohz=off does
not make a difference in "timer" interrupts. How can I check if nohz is active
ok, the ATI thing was another bug with similar sounding symptoms.
It would be interesting to know if this is still a problem in the 2.6.26rc kernel.
Could you grab the rawhide kernel, and give that a boot test? It should install
without any problems on F8 and F9.
I'm going to give 2.6.26-0.72.rc6.git2.fc10 a spin.
Any clues/hints about my nohz=off and no timer interrupt concerns?
no idea tbh.
I forgot to mention an important detail:
- On the i386, the clock slows down, this started with 2.6.25.
- On the x86_64 box, the time does not slow down: it completely stops ticking.
This started with 2.6.24.
The symptoms for the x86_64 system where the clock stalls completely are:
* disk I/O seems to be completely stalled
* if I have a shell running on this box, it still works, as long as I do not
hit the disk
* the system eventually recovers (after anything ranging from 10 minutes to 8
* when it recovers, it prints:
Clocksource tsc unstable (delta = 3413151638326 ns)
* after "recovery", the clock is off by the amount of time spent in the wedged
* the weird thing is that on this system:
hpet acpi_pm jiffies tsc
the tsc clocksource was not selected at boot time (but is available).
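
The delta in the "Clocksource tsc unstable" message quoted in the list above is given in nanoseconds, which is hard to read at a glance. A small Python sketch for converting it follows; the function name `unstable_delta` is hypothetical and the regular expression assumes the message format exactly as quoted in this report.

```python
import re
from datetime import timedelta


def unstable_delta(message):
    """Extract the 'delta = N ns' value from a clocksource message.

    Returns a timedelta truncated to whole microseconds, or None if the
    message does not contain a delta in the assumed format.
    """
    match = re.search(r"delta = (\d+) ns", message)
    if match is None:
        return None
    nanoseconds = int(match.group(1))
    return timedelta(microseconds=nanoseconds // 1000)


delta = unstable_delta("Clocksource tsc unstable (delta = 3413151638326 ns)")
print(delta)  # 0:56:53.151638
```

For the message in this report the delta works out to roughly 57 minutes.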
Unfortunately, I was not able to capture the SysRQ-T output when the system is
Created attachment 310050 [details]
Partial sysrq-t output on a wedged x86_64 system.
One more data point.
I have currently a "wedged" x86_64 system.
It has been running 18.104.22.168-27.fc8.x86_64 for 16 hours or so.
I have seen the clock go backwards and was able to get a (partial) SysRQ-T output.
This seems very odd:
khelper R running task 0 11946 11
ffff8100051dbe10 0000000000000046 0000000000800111 0000000000000282
ffff810020042000 ffff8100051a8000 ffff810020042328 00000000051dbf34
ffff810031b5c000 0000000000002eab ffff810020042000 0000000000000246
[<ffffffff8102ad22>] ? default_wake_function+0x0/0xf
[<ffffffff8104291a>] ? ____call_usermodehelper+0x0/0x156
[<ffffffff81013c87>] ? syscall_trace_leave+0x34/0x9d
[<ffffffff8104274c>] ? wait_for_helper+0x0/0x6e
[<ffffffff8100cc6e>] ? child_rip+0x0/0x12
modprobe ? ffff81003fdc6000 0 11947 11946
ffff81001fcd7ee8 0000000000000046 ffff81001fcd7e88 ffffffff810b6ab2
ffff8100051a8000 ffff810066142000 ffff8100051a8328 000000000a540000
ffff81000a540000 0000000000000246 ffff81001fcd7ed8 ffffffff81049178
[<ffffffff810b6ab2>] ? mntput_no_expire+0x1e/0x89
[<ffffffff81049178>] ? switch_task_namespaces+0x29/0x5d
See the modprobe stuck in mntput_no_expire(), and its parent khelper in state R
while doing sys_wait4().
I've attached the partial sysrq-t output as well.
Does using "nohz=off highres=off" make any difference?
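
Before comparing results, it may help to confirm that the options actually reached the running kernel by inspecting /proc/cmdline. The following Python sketch assumes the command line is a whitespace-separated list of `key=value` tokens; the function name `boot_params` and the sample command line in the usage note are hypothetical.

```python
def boot_params(cmdline, names):
    """Return the requested parameters found on a kernel command line.

    Parameters given without "=value" are reported with a value of None.
    """
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key in names:
            found[key] = value if sep else None
    return found


def running_boot_params(names=("nohz", "highres")):
    """Read /proc/cmdline on the running system and filter it."""
    with open("/proc/cmdline") as f:
        return boot_params(f.read(), names)
```

For example, `boot_params("ro root=/dev/sda1 nohz=off highres=off quiet", ("nohz", "highres"))` returns `{"nohz": "off", "highres": "off"}`. Note that this only shows what was passed at boot, not whether the kernel honored it.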
(In reply to comment #8)
> One more data point.
> I have currently a "wedged" x86_64 system.
> It has been running 22.214.171.124-27.fc8.x86_64 for 16 hours or so.
> I have seen the clock go backwards and was able to get a (partial) SysRQ-T output.
> This seems very odd:
> khelper R running task 0 11946 11
> [<ffffffff81035b32>] do_wait+0x8c6/0xa02
> modprobe ? ffff81003fdc6000 0 11947 11946
> [<ffffffff8103641e>] do_exit+0x658/0x65c
Looks like khelper missed the wakeup when modprobe exited????
More testing showed that:
* Booting with nohz=off highres=off does not help
* Recompiling the kernel without CONFIG_NO_HZ and CONFIG_HIGH_RES_TIMERS works.
I have not found any problems in 10 days of testing after rebooting on a
kernel where NO_HZ and HIGH_RES_TIMERS are disabled.
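
For anyone repeating this test, a quick way to check how a given kernel was built is to look the two options up in its config file. The Python sketch below assumes the config is available as plain text, for example at /boot/config-<release> as Fedora kernels normally install it; the function names are hypothetical.

```python
import os


def config_option_state(config_text, option):
    """Return the value of `option` in a kernel .config style text.

    Returns the text after "=" (e.g. "y"), "n" for an explicit
    "# OPTION is not set" line, or None if the option is not mentioned.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == "# " + option + " is not set":
            return "n"
    return None


def running_kernel_config_path():
    """Assumed location of the running kernel's config on Fedora."""
    return "/boot/config-" + os.uname().release
```

Reading the file at `running_kernel_config_path()` and calling `config_option_state(text, "CONFIG_NO_HZ")` and `config_option_state(text, "CONFIG_HIGH_RES_TIMERS")` should return "n" for both on a kernel rebuilt as described above.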
I have the same problem. After a random period of between 10 minutes and 48 hours, the system clock appears to stop when both CPUs are installed. With a single CPU the problem does not occur.
HP NetServer LC3 Dual PIII 500
I have never let the system go very long in this condition without shutting down and removing the second CPU.
We also have the same problem: the clock drifts after a random period of time. This is on a dual-core x86 system, using a 2.6.25 kernel on Fedora 7. Our application software relies on the high-resolution timers. We have a periodic 25 Hz task in our application that slows to approximately 1 Hz when this problem is at its worst, which is a bad thing.
We did not see this problem with the previous 2.6.23 series kernel, but it is apparent with the 2.6.25 series kernels. Also, we are seeing a disparity in the local timer interrupt count after some random amount of time. Typically, the local timer interrupt count (in /proc/interrupts) for each CPU is very close (within 1%): nominal values are around 1000 per second for each CPU, and this is quite constant.
But after some time, CPU1 local timer interrupt count plummets to about 50 interrupts per second and stays there, while CPU0 remains at around 1000 interrupts/second, an approximate factor of 20 difference for CPU1.
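
The disparity described above can be quantified from two readings of the per-CPU local timer counts. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the inputs are per-CPU counts already extracted from the local timer row of /proc/interrupts, and the counts in the usage note are illustrative values chosen to match the rates reported here, not measured data.

```python
def per_cpu_rates(before, after, interval_s):
    """Interrupts per second for each CPU between two count snapshots."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return [(a - b) / interval_s for b, a in zip(before, after)]


def rate_imbalance(rates):
    """Ratio of the fastest CPU's rate to the slowest CPU's rate."""
    fastest = max(rates)
    slowest = min(rates)
    if slowest <= 0:
        return float("inf")  # at least one CPU received no timer interrupts
    return fastest / slowest
```

With illustrative snapshots `[120000, 118000]` and `[125000, 118250]` taken 5 seconds apart, `per_cpu_rates` gives `[1000.0, 50.0]` and `rate_imbalance` gives `20.0`, matching the factor of 20 described above. A healthy system would be expected to stay close to 1.0.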
Available clocksources shown are tsc and jiffies, even though the system has an HPET, which is not detected for some reason under this kernel. The system initially assumes tsc as the current clocksource, but for some reason does a fallback to jiffies at some point.
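
The current and available clocksources can be read from sysfs, which makes it possible to log when the fallback happens. The Python sketch below assumes the standard sysfs location under /sys/devices/system/clocksource/clocksource0; the function names and the keys of the returned dictionary are hypothetical.

```python
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def summarize_clocksources(current_text, available_text):
    """Summarize the contents of the two sysfs clocksource files."""
    current = current_text.strip()
    available = available_text.split()
    return {
        "current": current,
        "available": available,
        "hpet_listed": "hpet" in available,
        "on_jiffies": current == "jiffies",
    }


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Read both files from sysfs on the running system."""
    with open(base + "/current_clocksource") as f:
        current_text = f.read()
    with open(base + "/available_clocksource") as f:
        available_text = f.read()
    return summarize_clocksources(current_text, available_text)
```

Calling `read_clocksources()` periodically and recording the first time `on_jiffies` becomes true would give a timestamp for the fallback that can be compared against the drop in the CPU1 timer interrupt rate.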
Problem is still present in 126.96.36.199-14.
Again, recompiling the kernel without NO_HZ and HIGH_RES_TIMERS fixes the problem.
Does anybody know if this bug is still present in 188.8.131.52-49? I've been stuck on an old kernel waiting for this issue to get fixed. I see no mention of the problem in the changelogs and nothing new on the bug report.
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '8'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 8's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 8 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.
Thank you for reporting this bug and we are sorry it could not be fixed.