Red Hat Bugzilla – Bug 17815
system crash after long uptime
Last modified: 2008-05-01 11:37:58 EDT
I have been running a stock Red Hat 5.2 system for quite some time
now. It has been reliable and solid. I was using this machine
for my workstation and running some custom perl scripts
gathering information about the other machines on our network
and making it available via the web.
About 2 months ago, the system started acting strange. Any new
files created had the correct timestamp and all date/time functions
seemed normal except that ps showed every process starting January
18 (including the ps that was run to view the information).
Yesterday at exactly 500 days of uptime, the system crashed.
Since it is the weekend, I have not had time to determine the
full extent of this, but the system is not responding to the
network in any way, and the Xwindows system is locked (even the
Numlock and Capslock keys are not responding). The system is completely
locked. So far that is the extent of the damage. I don't know
yet if file system corruption has occurred or not.
This is a known problem with the 2.0.* kernels, which are used in Red Hat
5.2. The basic problem is that the internal kernel variable (jiffies) that
stores time since boot is only an unsigned 32-bit variable, and rolls over
back to zero around 497 days after a boot on i386 machines (where jiffies
are 1/100th of a second; the rollover happens faster on Alphas). This makes
various things doing interval arithmetic on jiffies unhappy ('wake me in
50 jiffies', for example, or 'if it has been more than 400 jiffies since
the last event, do X').
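To make the arithmetic concrete, here is a rough sketch that models a 32-bit tick counter in Python. It assumes HZ = 100 as stated above for i386. The helper names (rollover_days, naive_expired, wrap_safe_expired) are made up for illustration and are not kernel functions, and the signed-difference comparison is just one common way to survive a wrap, not a claim about what any particular kernel does.

```python
# Illustrative sketch only: models a 32-bit tick counter in Python.
# HZ = 100 matches the i386 figure given in the comment above; the
# helper names below are hypothetical and are not kernel functions.

HZ = 100                 # ticks per second on i386, per the text
WRAP = 1 << 32           # an unsigned 32-bit counter holds 2**32 values
MASK = WRAP - 1


def rollover_days(hz):
    """Days of uptime before a 32-bit tick counter wraps to zero."""
    return WRAP / hz / 86400


def naive_expired(now, deadline):
    """Plain comparison; gives wrong answers across the wrap."""
    return now >= deadline


def wrap_safe_expired(now, deadline):
    """Compare via the 32-bit difference, read as a signed value.

    Correct as long as the two values are less than 2**31 ticks apart.
    """
    diff = (now - deadline) & MASK
    return diff < (1 << 31)


if __name__ == "__main__":
    print(f"rollover after {rollover_days(HZ):.1f} days")

    # 'Wake me in 50 jiffies', requested 20 ticks before the wrap.
    start = WRAP - 20
    deadline = (start + 50) & MASK     # wraps around to 30
    now = (start + 10) & MASK          # only 10 ticks have passed

    print(naive_expired(now, deadline))      # True: fires 40 ticks early
    print(wrap_safe_expired(now, deadline))  # False: still waiting
```

With HZ = 100 the counter wraps after roughly 497.1 days, which matches the figure quoted above. The naive check shows how a timer set just before the wrap can appear to have already expired, which is the kind of interval arithmetic that goes wrong at rollover.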
Depending on the exact drivers and events happening around the rollover,
a given system may or may not be fine.
I don't believe there's any real solution for 2.0.* series kernels. The
problem was fixed for the 2.2 series of kernels, but I believe it was a
fair amount of effort to find and fix all the code that needed to cope with