Red Hat Bugzilla – Bug 77058
LONG scheduling pauses
Last modified: 2015-01-04 17:02:01 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204
Description of problem:
There are occasions (a few times a day) where the scheduler fails to schedule a
soft realtime process for up to 30 seconds for no apparent reason. This is
observed by the 'heartbeat' program which is a high-availability package for
Linux. This has been observed by three different users of heartbeat in three
different environments. Heartbeat tracks how long it takes systems to send out
heartbeats. It is normal for heartbeat times to go up somewhat under load.
However, in this case, things run along just as smoothly as you like until
suddenly one VERY long heartbeat interval occurs - without any warnings
concerning delayed heartbeats before this. This problem has been observed by
Alex Kramarov of incredimail, Steven Wilson of NCD health, and Brian Tinsley of
emageon. The kernel was kernel-2.4.18-17.7.x. Alex Kramarov reports he was
running it on an SMP machine. I do not yet know about the other two. The users
report that the systems were either completely idle or nearly so.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. I'm unsure of some of the exact initial conditions
2. Install heartbeat on two systems and configure them over
a couple of ethernets. Set keepalive to 1, deadtime to 10
and warntime to 2.
3. Run heartbeat for a few days. It will falsely report that one or both
machines have died - without any warnings about successively longer heartbeat
delays like those that occur when a machine is under heavy load. The failure may
occur within a few hours, but takes no more than a day or two.
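The configuration described in step 2 might look like the following in heartbeat's ha.cf (an illustrative sketch: the interface names, node names, and file path are assumptions based on the heartbeat package's usual layout, not taken from the reports):

```
# /etc/ha.d/ha.cf -- illustrative fragment
keepalive 1        # send a heartbeat every second
warntime 2         # warn when a heartbeat is 2 seconds late
deadtime 10        # declare the peer dead after 10 seconds of silence
udpport 694
bcast eth0 eth1    # heartbeat over two ethernets
node node1
node node2
```

With these values, a 30-second scheduling pause sails straight past deadtime without ever producing a warntime message first - which matches the "no warnings, then one VERY long interval" signature described above.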
Actual Results: The users in question report that when they upgraded to this
kernel, they began to experience these problems. They reproduced them several
times in this way, and they went away when they changed to an earlier kernel.
Expected Results: No false takeovers should occur.
Until this kernel, heartbeat has worked nicely with every kernel made by any
vendor. This kernel has a unique bug - unlike any seen before in the last 3
years or so.
This bug renders these machines unsuitable for high-availability work. The
result is equivalent to a crash - both machines shut down their HA services and
restart them. If the users have not configured everything properly, and have a
shared disk, this can lead to loss of data.
There is a long involved discussion of these problems on the linux-ha mailing
list. The topics are "Sporadic split brain on Red Hat 7.2 with 2.4.18-17.7
kernel" and "heartbeat failure". The relevant list archives can be found here:
I suspect that this bug is related to bug 76499 which causes large numbers
of clock ticks to be lost.
As email@example.com suggested, it is likely related to bug 76499.
We have had a few more emails from the victims of this bug. These emails are in
the linux-ha archives with subject lines containing "Red Hat Kernel 2.4.18-17"
and the usual Re:, Fwd:, etc. prefixes. It also appears that one
person may have seen similar problems on somewhat earlier RH kernels. There are
at least three models of computers this has been reproduced on: Some UP, some
2-CPU, and some 4-CPU.
We also observe this problem on every dual-CPU HP Kayak desktop workstation
(with or without SCSI), and on dual-CPU & 6-way HP NetServers (obsolete, but...).
Our HA mechanism (a polling layer on top of TCP) gets confused as well, causing
We observe this problem with 2.4.18-3smp, 2.4.18-1smp & 2.4.18-17.7.xsmp (same
Is there any chance of anyone getting a sysrq-t dump of a stuck state?
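For anyone trying to capture one, a sketch of how to obtain such a dump (assuming the kernel was built with magic-SysRq support, as the stock Red Hat kernels are): press Alt-SysRq-T on the console, or trigger it from a root shell:

```shell
# Enable the magic SysRq key (persist via /etc/sysctl.conf if desired)
echo 1 > /proc/sys/kernel/sysrq

# Dump the state and stack of every task, same as Alt-SysRq-T;
# the backtraces go to the kernel ring buffer / syslog
echo t > /proc/sysrq-trigger

# Read the result
dmesg | less            # or: less /var/log/messages
```

Both commands require root, and the dump only helps if it is taken while a task is actually stuck.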
Created attachment 84804 [details]
/var/log/messages section when hitting sysrq-t at lock time
I have attached the piece of my /var/log/messages that contains a sysrq-t output.
I don't know if the lock I encountered is specifically the one we are talking
about here, because an NFS server went down at this time. If you feel it is
un-related, please disregard this log.
You are using binary-only kernel modules that interact with the VFS and NFS,
so your trace is unfortunately not useful ;(
Created attachment 85444 [details]
/var/log/message extract on 2.4.18-3 with no proprietary module
Here is another /var/log/messages (with 2 Alt-SysRq) in a 2.4.18-3 kernel with
no proprietary module. The system on which it happens is relying on NFS+NIS.
In the first part of the log, the frozen application (while all others are
fine) is XEmacs. (This may also happen with LyX, but not in this log.)
The second part of the log shows the exact same situation, but it also includes
a "ps lax" command that is frozen too.
Nov 13 14:30:38 tarifa1 kernel: Call Trace: [<d09fceb2>] mvfs_rhat_rdwr [mvfs]
Ehm, please be serious when you say "no proprietary modules loaded" and stop
wasting my time.
Also, please don't use 2.4.18-3 but 2.4.18-18.7.x or 2.4.18-18.8.0.
Sorry if I gave you the impression that I was wasting your time. Please read below:
The first log (collected last week on machine tarifa1) was from a kernel using a
proprietary module. So I have uploaded another attachment for another machine
(fuerteventura), which does not use a proprietary module. Since I have no means
of removing my previous attachment, it was left attached to the bug record.
I will try to catch the problem on 2.4.18-17.7.x. I hope the updates will not
come so quickly that yet another update is out before I do...
Any chance that you could have a look at the problem in the fuerteventura log on 2.4.18-3?
Ooops! Wrong log from wrong machine... :-(
Created attachment 85450 [details]
2.4.18-3 with no proprietary module (this time...)
Looks like an NFS lockup ;(
2.4.18-18.7.x (2.4.18-17.7.x is not the latest ;) ought to have better NFS
behavior (but we know it's not perfect yet).
I do not believe that all the original people reporting this have NFS issues.
Recently, various people have reported to me that they've seen this problem in
kernel.org kernels. The reports are *very* credible, and contain what appears
to be the unique signature of this bug (at least I hope there aren't two with
the same bug).
When I was at LinuxWorld, folks from the SuSE booth told me that they had seen
this bug and fixed it. You might check with them on what the fix was. The
person I heard this from was either Marcus Rex, or Ralf Flaxa.
So I would suppose that either Rik or Andrea might have more information.
On our side, we have reproduced the problem with & without NFS. Quite a pain,
so no update for a long time. Our "scheduling pause" was due to a
mis-configured NTP daemon (default RH settings) that was triggering system
clock steps. When the clock went backward N seconds, our application was not
"scheduled" for ~N seconds.
On my side, the fix is /etc/sysconfig/ntpd :
- OPTIONS="-U ntp"
+ OPTIONS="-U ntp -x 0 -g"
 The application was invoked by the scheduler, but not doing anything, because
almost everything depends on timers in a Telco application...
 Too bad there is no man page for the ntpd startup options on Red Hat
(/usr/share/doc/*.html only)... :-(
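For context on why that change helps, an annotated version of the fixed line (flag meanings are taken from standard ntpd documentation and assumed to apply to the Red Hat RPM; note that upstream ntpd's -x takes no argument, so the stray "0" in the reporter's line may be a typo):

```
# /etc/sysconfig/ntpd (illustrative)
#   -U ntp : run ntpd as the unprivileged 'ntp' user
#   -x     : slew the clock instead of stepping it, so wall time never
#            jumps backward underneath timer-driven applications
#   -g     : still permit one large correction at startup, even beyond
#            the normal panic threshold
OPTIONS="-U ntp -x -g"
```

This matches the reported symptom exactly: with the default settings, a backward step of N seconds makes every wall-clock-based timer appear to stall for ~N seconds.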
We are also experiencing this with 2.4.20-19.7 and 2.4.20-20.7.
Whoops. Sorry for the me too. Let me add a full report:
We are also experiencing this with 2.4.20-19.7 and 2.4.20-20.7. We have done
everything from updating to the most recent HA rpms (all ultramonkeys), to
trying the two different kernels mentioned above. I am currently compiling a
plain vanilla 2.4.22 kernel to see if this has any effect.
We can easily reproduce this on a two-server cluster by setting the hwclock to
3:30am and rebooting both systems. While they are coming up, we then start 3 test
clients pounding the boxes with about 300 requests per second each (these are
Apache clusters). Within ten minutes the "long heartbeat intervals" occur,
causing both systems to assume master status, or worse.
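The recipe above, sketched as commands (the hostnames and the load generator are assumptions - the reporters do not say which test clients they used, and ab is just one common way to approximate ~300 requests/second; 'cluster-vip' is a placeholder for the cluster's virtual IP):

```shell
# On both cluster nodes: wind the hardware clock to 3:30am, then reboot
hwclock --set --date "03:30:00"
reboot

# From each of 3 client machines, while the nodes come back up:
# hammer the Apache cluster for 10 minutes at high concurrency
ab -c 30 -t 600 http://cluster-vip/
```

If the bug is present, both nodes should claim master status (split brain) within about ten minutes.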
We are interested in working with any parties to resolve this problem. Please
contact me at the above email address.