Bug 77058 - LONG scheduling pauses
Status: CLOSED WONTFIX
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i586 Linux
Priority: medium
Severity: high
Assigned To: Dave Jones
QA Contact: Brian Brock
Depends On:
Blocks:
Reported: 2002-10-31 10:46 EST by Alan Robertson
Modified: 2015-01-04 17:02 EST (History)
CC: 8 users

Doc Type: Bug Fix
Last Closed: 2004-05-27 10:09:57 EDT


Attachments
/var/log/messages section when hitting sysrq-t at lock time (110.11 KB, text/plain)
2002-11-13 08:46 EST, Francois-Xavier 'FiX' KOWALSKI
/var/log/message extract on 2.4.18-3 with no proprietary module (110.11 KB, patch)
2002-11-18 10:30 EST, Francois-Xavier 'FiX' KOWALSKI
2.4.18-3 with no proprietary module (this time...) (112.93 KB, text/plain)
2002-11-18 11:22 EST, Francois-Xavier 'FiX' KOWALSKI

Description Alan Robertson 2002-10-31 10:46:56 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204

Description of problem:
There are occasions (a few times a day) where the scheduler fails to schedule a
soft realtime process for up to 30 seconds for no apparent reason.  This is
observed by the 'heartbeat' program which is a high-availability package for
Linux.  This has been observed by three different users of heartbeat in three
different environments.  Heartbeat tracks how long it takes systems to send out
heartbeats.  It is normal for heartbeat times to go up somewhat under load.
However, in this case, things run along just as smoothly as you like until
suddenly one VERY long heartbeat interval occurs - without any warnings
concerning delayed heartbeats before this.  This problem has been observed by
Alex Kramarov of incredimail, Steven Wilson of NCD health, and Brian Tinsley of
emageon.  The kernel was kernel-2.4.18-17.7.x.  Alex Kramarov reports he was
running it on an SMP machine.  I do not yet know about the other two.  The users
report that the systems were either completely idle or nearly so.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.  I'm unsure of some of the exact initial conditions.
2.  Install heartbeat on two systems and configure them over
    a couple of ethernets.  Set keepalive to 1, deadtime to 10
    and warntime to 2.
3.  Run heartbeat for a few days.  It will falsely report that one or both
    machines have died - without any warnings about successively longer
    heartbeat delays like those which occur when the machine is under heavy
    load.  It may occur in a few hours, but no more than a day or two.
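The configuration in step 2 corresponds to a heartbeat ha.cf along these lines (an illustrative sketch; the node names and interface names are placeholders, not taken from the reports):

```
# /etc/ha.d/ha.cf - illustrative fragment; node and interface
# names (node1/node2, eth0/eth1) are placeholders.
keepalive 1       # send a heartbeat every second
warntime 2        # warn after 2 seconds without a heartbeat
deadtime 10       # declare the peer dead after 10 seconds
bcast eth0 eth1   # heartbeat over "a couple of ethernets"
node node1
node node2
```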

Actual Results:  The users in question report that when they upgraded to this
kernel, they began to experience these problems.  They reproduced them several
times in this way, and they went away when they changed to an earlier kernel.

Expected Results:  No false takeovers should occur.

Additional info:

Until this kernel, heartbeat has worked
nicely with every kernel made by any vendor.  This kernel has a unique bug -
unlike any seen before in the last 3 years or so.

This bug renders these machines unsuitable for high-availability work.  The
result is equivalent to a crash - both machines shut down their HA services and
restart them.  If the users have not configured everything properly, and have a
shared disk, this can lead to loss of data.


There is a long involved discussion of these problems on the linux-ha mailing
list.  The topics are "Sporadic split brain on Red Hat 7.2 with 2.4.18-17.7
kernel" and "heartbeat failure".  The relevant list archives can be found here:
 http://marc.theaimsgroup.com/?l=linux-ha&r=1&b=200210&w=2
Comment 1 John DeDourek 2002-11-01 17:23:38 EST
I suspect that this bug is related to bug 76499 which causes large numbers
of clock ticks to be lost.
Comment 2 Alan Robertson 2002-11-04 13:52:43 EST
As dedourek@unb.ca suggested, it is likely related to bug 76499.

We have had a few more emails from the victims of this bug.  These emails are in
the linux-ha archives with subject lines containing "Red Hat Kernel 2.4.18-17"
with the usual Re:, Fwd:, etc. prefixes.  It also appears that one
person may have seen similar problems on somewhat earlier RH kernels.  There are
at least three models of computers this has been reproduced on:  Some UP, some
2-CPU, and some 4-CPU.
Comment 3 Francois-Xavier 'FiX' KOWALSKI 2002-11-13 04:37:34 EST
We also observe this problem on all dual-CPU HP Kayak desktop workstations
(with or without SCSI), and on dual-CPU & 6-way HP NetServers (obsolete, but...).

Our HA mechanism (polling layer on top of tcp) gets confused as well, causing
system reboot.

We observe this problem with 2.4.18-3smp, 2.4.18-1-smp & 2.4.18-17.7.xsmp (same
for the bigmem variants).
Comment 4 Arjan van de Ven 2002-11-13 04:45:00 EST
is there any chance of anyone getting a sysrq-t dump of a stuck state?
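For reference, a typical way to capture such a dump (standard magic-SysRq interface; a sketch of a root console session, not taken from the thread):

```
# echo 1 > /proc/sys/kernel/sysrq    # enable magic SysRq if disabled
# echo t > /proc/sysrq-trigger       # dump all task states (or press Alt+SysRq+T)
# dmesg | tail -n 200                # traces also land in /var/log/messages
```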
Comment 5 Francois-Xavier 'FiX' KOWALSKI 2002-11-13 08:46:04 EST
Created attachment 84804 [details]
/var/log/messages section when hitting sysrq-t at lock time
Comment 6 Francois-Xavier 'FiX' KOWALSKI 2002-11-13 08:48:02 EST
I have attached the piece of my /var/log/messages that contains a sysrq-t output.

I don't know if the lock I encountered is specifically the one we are talking
about here, because an NFS server went down at this time.  If you feel it is
un-related, please disregard this log.
Comment 7 Arjan van de Ven 2002-11-13 09:02:30 EST
 francois-xavier.kowalski@hp.com:
you are using binary-only kernel modules that interact with the VFS and NFS, so
your trace is not useful ;(
Comment 8 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 10:30:18 EST
Created attachment 85444 [details]
/var/log/message extract on 2.4.18-3 with no proprietary module
Comment 9 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 10:34:11 EST
Here is another /var/log/messages (with 2 Alt-SysRq) in a 2.4.18-3 kernel with
no proprietary module.  The system on which it happens is relying on NFS+NIS.

In the first log part, the frozen application (while all others are fine) is
XEmacs.  (This may also happen with LyX, but not in this log.)

The second log part is in the exact same situation, but it also includes a "ps
lax" command that is frozen too.
Comment 10 Arjan van de Ven 2002-11-18 10:39:03 EST
Nov 13 14:30:38 tarifa1 kernel: Call Trace: [<d09fceb2>] mvfs_rhat_rdwr [mvfs]
0x122 

ehm please be serious when you say "no proprietary modules loaded" and stop
wasting my time
Comment 11 Arjan van de Ven 2002-11-18 10:42:37 EST
also please don't use 2.4.18-3 but 2.4.18-18.7.x or 2.4.18-18.8.0
Comment 12 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 11:16:33 EST
Sorry if I gave you the impression that I am wasting your time.  Please read below:

The first log (collected last week on machine tarifa1) was on a kernel using a
proprietary module.  So I have uploaded another attachment for another machine
(fuerteventura), which does not use a proprietary module.  Since I have no means
to remove my previous attachment, it was left attached to the bug record.

I will try to catch the problem on 2.4.18-17.7.x.  Hopefully another update
will not come out before I manage to catch it...

Any chance that you have a look at the problem in the fuerteventura log on 2.4.18-3?
Comment 13 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 11:21:15 EST
Oops! wrong log from wrong machine... :-(
Comment 14 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 11:22:56 EST
Created attachment 85450 [details]
2.4.18-3 with no proprietary module (this time...)
Comment 15 Arjan van de Ven 2002-11-18 17:44:19 EST
looks like an NFS lockup ;(
2.4.18-18.7.x (2.4.18-17.7.x is not the latest ;) ought to have better NFS
behavior (but we know it's not perfect yet)
Comment 16 Alan Robertson 2002-11-22 12:48:43 EST
I do not believe that all the original people reporting this have NFS issues.
Comment 17 Alan Robertson 2003-02-14 08:47:33 EST
Recently, various people have reported to me that they've seen this problem in
kernel.org kernels.  The reports are *very* credible, and contain what appears
to be the unique signature of this bug (at least I hope there aren't two with
the same bug).

When I was at LinuxWorld, folks from the SuSE booth told me that they had seen
this bug and fixed it.  You might check with them on what the fix was.  The
person I heard this from was either Marcus Rex, or Ralf Flaxa.

So I would suppose that either Rik or Andrea might have more information.
Comment 18 Francois-Xavier 'FiX' KOWALSKI 2003-02-24 13:10:57 EST
On our side, we have reproduced the problem with & without NFS.  Quite a pain,
hence no update for a long time.  Our "scheduling pause" was due to a
mis-configured NTP daemon (default RH settings) that was triggering system
clock jumps.

When the clock was going backward N seconds, our application was not
"scheduled"[1] for ~N seconds.

On my side, the fix is /etc/sysconfig/ntpd [2]:

- OPTIONS="-U ntp"
+ OPTIONS="-U ntp -x 0 -g"

[1] Invoked by the scheduler, but not doing anything, because almost everything
depends on timers in a Telco application...

[2] too bad that there is no man page for ntpd startup options on RedHat
(/usr/share/doc/*.html only)... :-(
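The effect described above can be sketched with simple wall-clock arithmetic (a toy model, not heartbeat's actual code): a timer keyed to the wall clock gains N extra seconds when ntpd steps the clock back by N.

```shell
# Toy model: a timer armed to fire 1 s after a pretend "now" of 1000 s.
deadline=$((1000 + 1))
now=1000
echo "$((deadline - now))s remaining"   # 1s remaining - normal
# ntpd steps the wall clock back 30 s:
now=$((1000 - 30))
echo "$((deadline - now))s remaining"   # 31s remaining - the app idles ~30 s
```

This is why the -x (slew the clock instead of stepping it) and -g options above avoid the pause, and why timers keyed to a monotonic clock are immune to it.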
Comment 19 Steven Boger 2003-10-02 16:23:42 EDT
We are also experiencing this with 2.4.20-19.7 and 2.4.20-20.7.
Comment 20 Steven Boger 2003-10-02 16:30:33 EDT
Whoops. Sorry for the me too.  Let me add a full report:

We are also experiencing this with 2.4.20-19.7 and 2.4.20-20.7.  We have done
everything from updating to the most recent HA rpms (all ultramonkeys), to
trying the two different kernels mentioned above.  I am currently compiling a
plain vanilla 2.4.22 kernel to see if this has any effect.

We can easily reproduce this on a two server cluster by setting the hwclock to
3:30am and rebooting both systems.  While they are coming up we then start 3 test
clients pounding the boxes with about 300 requests per second each (these are
apache clusters).  Within ten minutes the "long heartbeat intervals" occur,
causing both systems to assume master status, or worse.

We are interested in working with any parties to resolve this problem.  Please
contact me at the above email address.
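On the console, the reproduction above might look roughly like this (a sketch; the load generator and URL are assumptions - any HTTP client doing ~300 requests/second per machine will do):

```
# hwclock --set --date "03:30:00" && hwclock --hctosys   # on both nodes
# reboot
# ...then, from each of three client machines while the nodes come up:
# ab -c 30 -t 600 http://cluster-address/                # ApacheBench, assumed
```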
