451824 – Clock ticks slow on 2.6.25.4-10

Summary: Clock ticks slow on 2.6.25.4-10

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	8
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-06-17 17:18 UTC by Philippe Troin
Modified:	2009-01-09 07:50 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-01-09 07:50:06 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Partial sysrq-t output on a wedged x86_64 system. (120.45 KB, text/plain) 2008-06-23 18:05 UTC, Philippe Troin	no flags

Description Philippe Troin 2008-06-17 17:18:48 UTC

Description of problem:
The system boots up properly.  Clock seems to be ticking fine.  After a random
amount of time, the clock starts ticking about 4-8 times slower.
I have seen it happen on more than one system (one i386 and one x86_64).
I've tried to boot with "nohz=off", but that does not help.

In all cases (with or without nohz=off), there seems to be no activity on the
"timer" interrupt after boot:
  0:        353          0   IO-APIC-edge      timer
I would have thought that with nohz=off, there should be some timer activity.
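
A quick way to cross-check the observation above is to look at the LOC (local APIC timer) row of /proc/interrupts next to the IRQ 0 row. The sketch below is only an illustration: the helper name parse_interrupts is made up here, and the idea that the periodic tick may be counted on the LOC row rather than on IRQ 0 is an assumption about these machines, not something established in this report.

```python
#!/usr/bin/env python3
"""Show the IRQ 0 ("timer") and LOC rows of /proc/interrupts side by side."""


def parse_interrupts(text):
    """Map each row label (e.g. "0", "LOC") to its list of per-CPU counts.

    Hypothetical helper: the first line of /proc/interrupts names the CPUs,
    every following line is "<label>: <one count per CPU> <description>".
    """
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())          # header looks like "CPU0 CPU1 ..."
    rows = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break                      # reached the description text
            counts.append(int(field))
        rows[label.strip()] = counts
    return rows


def main():
    try:
        with open("/proc/interrupts") as f:
            rows = parse_interrupts(f.read())
    except OSError as exc:
        print("could not read /proc/interrupts:", exc)
        return
    for label in ("0", "LOC"):
        print(label, rows.get(label, "row not present"))


if __name__ == "__main__":
    main()
```

If the LOC counts keep growing while the IRQ 0 count stays flat, the static "timer" row by itself would not indicate a missing tick.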

Version-Release number of selected component (if applicable):
kernel-2.6.25.4-10.i686
kernel-2.6.25.4-10.x86_64

How reproducible:
Not always.

Steps to Reproduce:
1. Boot offending kernel
2. Wait for the problem to manifest itself.
  
Actual results:
The clock sometimes starts to slow down after a random period of time.

Expected results:
Clock should be ticking at the normal rate.

Additional info:
None.

Comment 1 Dave Jones 2008-06-17 17:26:59 UTC

ATI chipset by any chance?

Comment 2 Philippe Troin 2008-06-17 17:38:29 UTC

Nope.  However both systems have ATI graphics cards.

The i386 system is a Tyan Tiger 133 with dual Pentium III Coppermine 733 MHz.
The x86_64 is a Tyan Thunder K8W with dual Opteron 242 1.6 GHz.

Are you interested in lspci output?

This really sounds like a nohz bug.  I'm quite stumped that adding nohz=off does
not make a difference in "timer" interrupts.  How can I check if nohz is active
or not?

Phil.
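
One possible way to approach the "is nohz active?" question is sketched below. It reads the boot parameters from /proc/cmdline and, where available, the per-CPU .nohz_mode lines from /proc/timer_list. Treat it as a sketch under assumptions: the function names are invented for this example, /proc/timer_list is only present on kernels built with it (and may need root), and reading a value of 0 as "inactive" is an assumption about that file's format rather than something stated in this report.

```python
#!/usr/bin/env python3
"""Report tick-related boot parameters and the per-CPU nohz mode, if exposed."""
import re


def tick_boot_params(cmdline):
    """Return the nohz=/highres= settings found on a kernel command line."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("nohz", "highres"):
            params[key] = value
    return params


def nohz_modes(timer_list_text):
    """Extract every ".nohz_mode : N" value from /proc/timer_list text.

    Assumption: one .nohz_mode line is printed per CPU and 0 means that
    tickless operation is inactive on that CPU.
    """
    return [int(n) for n in re.findall(r"\.nohz_mode\s*:\s*(\d+)", timer_list_text)]


def main():
    try:
        with open("/proc/cmdline") as f:
            print("boot parameters:", tick_boot_params(f.read()))
        with open("/proc/timer_list") as f:   # may need root, may be absent
            print("nohz_mode per CPU:", nohz_modes(f.read()))
    except OSError as exc:
        print("could not read:", exc)


if __name__ == "__main__":
    main()
```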

Comment 3 Dave Jones 2008-06-17 17:48:04 UTC

ok, the ATI thing was another bug with similar sounding symptoms.

It would be interesting to know if this is still a problem in the 2.6.26rc kernel.
Could you grab the rawhide kernel, and give that a boot test?  It should install
without any problems on F8 and F9.

Comment 4 Philippe Troin 2008-06-17 17:54:04 UTC

Ok.
I'm going to give 2.6.26-0.72.rc6.git2.fc10 a spin.

Any clues/hints about my nohz=off and no timer interrupt concerns?

Phil.

Comment 5 Dave Jones 2008-06-17 18:09:45 UTC

no idea tbh.

Comment 6 Philippe Troin 2008-06-17 18:29:02 UTC

I forgot to mention an important detail:
 - On the i386, the clock slows down, this started with 2.6.25.
 - On the x86_64 box, the time does not slow down:  it completely stops ticking.
 This started with 2.6.24.
The symptoms for the x86_64 system where the clock stalls completely are:
  * disk I/O seems to be completely stalled
  * if I have a shell running on this box, it still works, as long as I do not
hit the disk
  * the system eventually recovers (after anything ranging from 10 minutes to 8
hours)
  * when it recovers, it prints:
Clocksource tsc unstable (delta = 3413151638326 ns)
  * after "recovery", the clock is off by the amount of time spent in the wedged
state.
  * the weird thing is that on this system:
root@ceramic:~[2]#  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
root@ceramic:~[2]#  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm jiffies tsc 
     the tsc clocksource was not selected at boot time (but is available).

Unfortunately, I was not able to capture the SysRQ-T output when the system is
wedged.

Phil.
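
For anyone collecting the same data, the two sysfs reads and the kernel message above can be gathered with a short script. This is a minimal sketch: the function names are hypothetical, and the regular expression assumes the message has exactly the wording quoted in this comment. For scale, the delta quoted above (3413151638326 ns) works out to roughly 56 minutes 53 seconds, which sits inside the 10 minutes to 8 hours recovery range described.

```python
#!/usr/bin/env python3
"""Read the clocksource sysfs files and decode a "Clocksource ... unstable" line."""
import re
from datetime import timedelta

SYSFS_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksources(sysfs_dir=SYSFS_DIR):
    """Return (current, available) clocksource names from sysfs."""
    with open(sysfs_dir + "/current_clocksource") as f:
        current = f.read().strip()
    with open(sysfs_dir + "/available_clocksource") as f:
        available = f.read().split()
    return current, available


def parse_unstable(message):
    """Return (clocksource name, delta as timedelta), or None if no match.

    Assumes the message wording quoted in this report.
    """
    m = re.search(r"Clocksource (\S+) unstable \(delta = (-?\d+) ns\)", message)
    if m is None:
        return None
    # timedelta has microsecond resolution, so drop the sub-microsecond part.
    return m.group(1), timedelta(microseconds=int(m.group(2)) // 1000)


def main():
    try:
        current, available = read_clocksources()
        print("current:", current, "available:", available)
    except OSError as exc:
        print("could not read clocksource sysfs files:", exc)
    print(parse_unstable("Clocksource tsc unstable (delta = 3413151638326 ns)"))


if __name__ == "__main__":
    main()
```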

Comment 7 Philippe Troin 2008-06-23 18:05:40 UTC

Created attachment 310050 [details]
Partial sysrq-t output on a wedged x86_64 system.

Comment 8 Philippe Troin 2008-06-23 18:06:37 UTC

One more data point.

I have currently a "wedged" x86_64 system.
It has been running 2.6.25.6-27.fc8.x86_64 for 16 hours or so.
I have seen the clock go backwards and was able to get a (partial) SysRQ-T output.
This seems very odd:

khelper       R  running task        0 11946     11
 ffff8100051dbe10 0000000000000046 0000000000800111 0000000000000282
 ffff810020042000 ffff8100051a8000 ffff810020042328 00000000051dbf34
 ffff810031b5c000 0000000000002eab ffff810020042000 0000000000000246
Call Trace:
 [<ffffffff81035b32>] do_wait+0x8c6/0xa02
 [<ffffffff8102ad22>] ? default_wake_function+0x0/0xf
 [<ffffffff81035d04>] sys_wait4+0x96/0xb1
 [<ffffffff8104291a>] ? ____call_usermodehelper+0x0/0x156
 [<ffffffff8104278e>] wait_for_helper+0x42/0x6e
 [<ffffffff81013c87>] ? syscall_trace_leave+0x34/0x9d
 [<ffffffff8100cc78>] child_rip+0xa/0x12
 [<ffffffff8104274c>] ? wait_for_helper+0x0/0x6e
 [<ffffffff8100cc6e>] ? child_rip+0x0/0x12

modprobe      ? ffff81003fdc6000     0 11947  11946
 ffff81001fcd7ee8 0000000000000046 ffff81001fcd7e88 ffffffff810b6ab2
 ffff8100051a8000 ffff810066142000 ffff8100051a8328 000000000a540000
 ffff81000a540000 0000000000000246 ffff81001fcd7ed8 ffffffff81049178
Call Trace:
 [<ffffffff810b6ab2>] ? mntput_no_expire+0x1e/0x89
 [<ffffffff81049178>] ? switch_task_namespaces+0x29/0x5d
 [<ffffffff8103641e>] do_exit+0x658/0x65c
 [<ffffffff8103649d>] do_group_exit+0x7b/0x96
 [<ffffffff810364ca>] sys_exit_group+0x12/0x14
 [<ffffffff8100bfd2>] tracesys+0xd5/0xda

See the modprobe stuck in mntput_no_expire(), and its parent khelper in state R
while doing sys_wait4().

I've attached the partial sysrq-t output as well.

Phil.

Comment 9 Chuck Ebbert 2008-06-24 02:53:42 UTC

Does using "nohz=off highres=off" make any difference?

Comment 10 Chuck Ebbert 2008-06-24 02:58:06 UTC

(In reply to comment #8)
> One more data point.
> 
> I have currently a "wedged" x86_64 system.
> It has been running 2.6.25.6-27.fc8.x86_64 for 16 hours or so.
> I have seen the clock go backwards and was able to get a (partial) SysRQ-T output.
> This seems very odd:
> 
> khelper       R  running task        0 11946     11
>  [<ffffffff81035b32>] do_wait+0x8c6/0xa02


> modprobe      ? ffff81003fdc6000     0 11947  11946
>  [<ffffffff8103641e>] do_exit+0x658/0x65c

Looks like khelper missed the wakeup when modprobe exited?

Comment 11 Philippe Troin 2008-07-31 19:23:23 UTC

Okay.

More testing showed that:

 * Booting with nohz=off highres=off does not help.

 * Recompiling the kernel without CONFIG_NO_HZ and CONFIG_HIGH_RES_TIMERS works.
   I have not found any problems in 10 days of testing after rebooting on a
   kernel where NO_HZ and HIGH_RES_TIMERS are disabled.

Phil.
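
To confirm which of the two options a given kernel build actually has, the build config can be inspected. The sketch below assumes the distribution installs the config as /boot/config-<kernel release>, which is an assumption about the packaging and not something stated in this report; the function name config_state is invented for the example.

```python
#!/usr/bin/env python3
"""Check whether the running kernel was built with NO_HZ / HIGH_RES_TIMERS."""
import platform

OPTIONS = ("CONFIG_NO_HZ", "CONFIG_HIGH_RES_TIMERS")


def config_state(config_text, options=OPTIONS):
    """Map each option to its value ("y", "m", ...) or None when it is not set.

    Disabled options appear as "# CONFIG_FOO is not set", which has no "="
    and is therefore left as None.
    """
    state = {name: None for name in options}
    for line in config_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name in state:
            state[name] = value.strip()
    return state


def main():
    # Assumed location of the build config for the running kernel.
    path = "/boot/config-" + platform.release()
    try:
        with open(path) as f:
            print(config_state(f.read()))
    except OSError as exc:
        print("could not read", path, "-", exc)


if __name__ == "__main__":
    main()
```

A kernel rebuilt as described above should report None for both options.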

Comment 12 Harry Smith 2008-08-15 04:48:23 UTC

I have the same problem: after a random period of between 10 minutes and 48 hours, the system clock appears to stop when both CPUs are installed. With a single CPU the problem does not occur.

HP NetServer LC3 Dual PIII 500
Kernel 2.6.25.14-69.fc8

I have never let the system go very long in this condition without shutting down and removing the second CPU.

Comment 13 aacfhjz02 2008-08-15 18:19:00 UTC

We have the same problem: the clock drifts after a random period of time.  This is on a dual-core x86 system running a 2.6.25 kernel on Fedora 7.  Our application software relies on the high-resolution timers.  We have a periodic 25 Hz task in our application that slows to approximately 1 Hz when this problem is at its worst, and this is a bad thing.

We did not see this problem with the previous 2.6.23 series kernel, but it is apparent with the 2.6.25 series kernels.  Also, we are seeing a disparity in the local timer interrupt count after some random amount of time.  Typically, the local timer interrupt count (in /proc/interrupts) for each CPU is very close (within 1%): nominal values are around 1000 per second for each CPU, and this is quite constant.

But after some time, the CPU1 local timer interrupt count plummets to about 50 interrupts per second and stays there, while CPU0 remains at around 1000 interrupts per second, a difference of roughly a factor of 20.

The available clocksources shown are tsc and jiffies, even though the system has an HPET, which is not detected for some reason under this kernel.  The system initially selects tsc as the current clocksource, but for some reason falls back to jiffies at some point.
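
The per-CPU disparity described above can be measured by sampling the LOC row of /proc/interrupts twice and comparing the rates. This is a rough sketch with hypothetical function names. Note that on a tickless (NO_HZ) kernel an idle CPU legitimately takes fewer ticks, so a large ratio is a hint to investigate, not proof of this bug.

```python
#!/usr/bin/env python3
"""Sample the LOC row of /proc/interrupts twice and compare per-CPU tick rates."""
import time


def loc_counts(text):
    """Return the per-CPU counts from the "LOC:" row, or [] if it is absent."""
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "LOC":
            counts = []
            for field in rest.split():
                if not field.isdigit():
                    break          # description text starts here
                counts.append(int(field))
            return counts
    return []


def rates(before, after, seconds):
    """Interrupts per second for each CPU between two snapshots."""
    return [(b - a) / seconds for a, b in zip(before, after)]


def disparity(per_cpu_rates):
    """Ratio of fastest to slowest CPU rate (inf if a CPU saw no ticks)."""
    slowest = min(per_cpu_rates)
    return float("inf") if slowest == 0 else max(per_cpu_rates) / slowest


def main(interval=1.0):
    try:
        with open("/proc/interrupts") as f:
            first = loc_counts(f.read())
        time.sleep(interval)
        with open("/proc/interrupts") as f:
            second = loc_counts(f.read())
    except OSError as exc:
        print("could not read /proc/interrupts:", exc)
        return
    if not first or len(first) != len(second):
        print("no usable LOC row found")
        return
    per_cpu = rates(first, second, interval)
    print("ticks/s per CPU:", per_cpu, "ratio:", disparity(per_cpu))


if __name__ == "__main__":
    main()
```

With the figures quoted in this comment (about 1000 versus about 50 interrupts per second), the ratio reported would be about 20.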

Comment 14 Philippe Troin 2008-09-17 04:15:37 UTC

Problem is still present in 2.6.26.3-14.
Again, recompiling the kernel without NO_HZ and HIGH_RES_TIMERS fixes the problem.

Phil.

Comment 15 Harry Smith 2008-10-27 02:05:18 UTC

Does anybody know if this bug is still present in 2.6.26.6-49? I've been stuck using an old kernel waiting for this issue to get fixed. I see no indication of the problem in the changelogs and nothing new on the bug report.

Comment 16 Bug Zapper 2008-11-26 10:53:11 UTC

This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 17 Bug Zapper 2009-01-09 07:50:06 UTC

Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
