Bug 207244 - [RHEL 4] idle load average climbs monotonically until reboot
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.4
Hardware: All
OS: Linux
Priority: high
Severity: high
Assigned To: Chris Snook
QA Contact: Brian Brock
Reported: 2006-09-20 02:25 EDT by Issue Tracker
Modified: 2010-10-22 02:06 EDT
CC List: 6 users
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-07 23:37:00 EDT
Description Issue Tracker 2006-09-20 02:25:31 EDT
Escalated to Bugzilla from IssueTracker
Comment 1 Issue Tracker 2006-09-20 02:25:45 EDT
This behavior appears to be the same thing as BZ 181815:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181815

Code inspection confirms the same bug exists in RHEL 4.  I am building a
kernel with a backport of the upstream 2.6.10 fix:

http://lkml.org/lkml/2004/11/16/78

I will update again when the kernel build is complete.


This event sent from IssueTracker by csnook  [Support Engineering Group]
 issue 102032
Comment 2 Chris Snook 2006-09-20 02:45:48 EDT
Summary:

There is an accounting bug in the scheduler that can cause nr_uninterruptible to
increment on one CPU without a corresponding decrement on another CPU while
migrating a task.  This causes the baseline idle load to increase monotonically
under some workloads, until load monitoring software (or the user) shuts down
the software triggering the bug, or the system is rebooted.
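
As a rough illustration of the accounting problem described above, the sketch
below models per-CPU counters whose sum is the global uninterruptible count.
It is a toy model, not the kernel code: `ToyRunqueue`, `sleep_then_wake`, and
the 1-in-100 loss rate are invented for this example. It only shows that a
single lost decrement is never recovered, so the global count can only drift
upward.

```python
class ToyRunqueue:
    """One per-CPU runqueue; only the counter relevant here is modelled."""

    def __init__(self):
        self.nr_uninterruptible = 0


def global_uninterruptible(runqueues):
    # In this model the global figure is the sum of the per-CPU counters.
    # Individual counters may go negative as long as the sum is correct.
    return sum(rq.nr_uninterruptible for rq in runqueues)


def sleep_then_wake(runqueues, sleep_cpu, wake_cpu, lose_decrement=False):
    """A task blocks uninterruptibly on sleep_cpu and is woken on wake_cpu."""
    runqueues[sleep_cpu].nr_uninterruptible += 1
    if not lose_decrement:
        runqueues[wake_cpu].nr_uninterruptible -= 1


rqs = [ToyRunqueue(), ToyRunqueue()]

# Correct accounting: 1000 cross-CPU wakeups leave the global count at 0.
for _ in range(1000):
    sleep_then_wake(rqs, sleep_cpu=0, wake_cpu=1)
balanced = global_uninterruptible(rqs)

# Buggy path: every 100th migration drops the decrement (made-up rate).
for i in range(1000):
    sleep_then_wake(rqs, 0, 1, lose_decrement=(i % 100 == 0))
leaked = global_uninterruptible(rqs)

print(balanced, leaked)  # prints "0 10"
```

Nothing in the model ever subtracts the leaked counts again, which matches the
"until reboot" behaviour in the bug title.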

Impact:

Applications which self-limit their activity under high load (such as sendmail)
will shut down when the apparent load exceeds their set threshold.  Load
monitoring software will also issue (mostly) false alarms.
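
To make the impact concrete, here is a hedged sketch of how phantom
uninterruptible tasks show up in the 1-minute load average and trip a load
limit. The fixed-point constants (`FSHIFT`, `EXP_1`, 5-second sampling) follow
the Linux load-average calculation; `idle_load_after`, `should_refuse`, and the
threshold of 8.0 are hypothetical and are not taken from sendmail or from this
report.

```python
FSHIFT = 11            # bits of fixed-point precision
FIXED_1 = 1 << FSHIFT  # 1.0 in fixed point
EXP_1 = 1884           # 1-minute decay factor per 5-second sample


def calc_load(load, active):
    """One 5-second update of the 1-minute average, in fixed point."""
    load *= EXP_1
    load += active * FIXED_1 * (FIXED_1 - EXP_1)
    return load >> FSHIFT


def idle_load_after(phantom_tasks, seconds):
    """1-minute load on an otherwise idle machine carrying phantom tasks."""
    load = 0
    for _ in range(seconds // 5):
        load = calc_load(load, phantom_tasks)
    return load / FIXED_1


REFUSE_THRESHOLD = 8.0  # hypothetical limit, for illustration only


def should_refuse(load, threshold=REFUSE_THRESHOLD):
    """A self-limiting daemon stops accepting work at or above the limit."""
    return load >= threshold


load = idle_load_after(phantom_tasks=10, seconds=600)
print(f"{load:.2f}", should_refuse(load))
```

With ten leaked counts the idle load settles just under 10, so a daemon using
this kind of check would refuse work even though no process is actually
running or blocked.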

Fix:

A backport of the 2.6.10 patch for this issue resolved the problem under the
customer's reliably reproducing workload.
Comment 5 Jason Baron 2006-09-28 11:49:28 EDT
Committed in stream U5 build 42.14. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 6 Jay Turner 2006-10-03 09:22:32 EDT
QE ack for 4.5.
Comment 7 Jason Corley 2006-10-03 09:38:18 EDT
Given that sendmail is the default MTA for RHEL and that this hard locks a
sendmail mail server, I would've expected a faster turnaround than waiting for
update 5.  I fought this for the better part of a week on our production mail
server before finding this bug report.  The kernel from jbaron's homedir fixes
the issue for us (which by the way, rendered the system unusable and locked out
remote logins).
Comment 8 Chris Snook 2006-10-03 16:19:19 EDT
(In reply to comment #7)

With the exception of security errata, new versions of packages are generally
released only with updates, as described in the RHEL Errata Support Policy:

https://www.redhat.com/security/updates/errata/

Customers requiring intermediate solutions should contact support.
Comment 9 Jason Corley 2006-10-03 16:29:50 EDT
Denial of Service is typically considered a security problem.  A fully patched
RHEL4 sendmail server a) no longer sends or receives mail after a certain period
of time and b) no longer allows remote logins (sshd kicked from RAM).  DoS in my
book, just happens to be done by the scheduler instead of a remote attacker.
Comment 10 Chris Snook 2006-10-03 17:05:07 EDT
In documented cases we have seen so far, a remote attacker could not cause
Denial of Service to occur in a situation where it otherwise would not under
normal load.  Therefore we regard this as a stability bug, like any other bug
which cannot be deliberately triggered but could cause a system to become
unavailable due to a crash or hang of some kind.  We have classified the
severity as high and prioritized it accordingly.  Customers experiencing this
problem should contact support for intermediate solutions prior to the release
of the Update containing the fix.

If you are aware of a scenario in which an attacker could deliberately trigger
this behavior on a system that is stable under normal load, please inform
security@redhat.com immediately.
Comment 11 Jason Corley 2006-10-06 11:34:00 EDT
I notice that 2.6.9-42.0.3.EL contains the following:

    "In addition, two bugfixes for the IPW-2200 wireless driver were included."

Clearly the "only security errata" policy does not apply globally.  It's puzzling why
people running sendmail on RHEL4 are continuing to be left out in the cold.
Comment 12 Jason Baron 2006-10-06 13:04:34 EDT
Jason, I appreciate your concerns. The fix for this issue is contained in the
current U5 beta kernel builds, which are available from
http://people.redhat.com/~jbaron/rhel4/

You can also contact support for an official 'hotfix' kernel, which is the same
thing as the above, to hold you over until the fix appears in the main stream.

thanks.
Comment 13 Jason Corley 2006-10-06 15:15:51 EDT
Looks like -16 suffers from the same problem as previous kernels.  About 2 hours
after installing it, load spiked to over 8.  -14 seems to be the only reliable
version so far.
Comment 14 Jason Corley 2007-01-31 10:06:47 EST
Another kernel errata without this fix.  When exactly will this be included?
Comment 15 Chris Snook 2007-01-31 14:55:55 EST
As noted in comment 5 and comment 12, this is on track for RHEL 4 Update 5.  If
you require a supported fix before then, please contact support for a hotfix.
Comment 17 Robert Scheck 2007-02-26 17:09:50 EST
Florian, can you have a look at it, please? It seems obviously strange to me
that this can't be fixed in a "normal" time...
Comment 18 Jason Corley 2007-02-26 17:13:09 EST
More kernel errata (2.6.9-42.0.8) dealing with issues that aren't this one. 
From your own errata announcement:

-----

In addition to the security issues described above, fixes for the following
bugs were included:

* initialization error of the tg3 driver with some BCM5703x network card

* a memory leak in the audit subsystem

* x86_64 nmi watchdog timeout is too short

* ext2/3 directory reads fail intermittently

-----

Again more non-security fixes.  Why is this one sitting until rhel4u5 again?
Comment 19 Chris Snook 2007-02-26 17:40:59 EST
As we have always done, the errata kernel 2.6.9-42.0.8.EL included both security
fixes and other fixes which were deemed critical, due to both severe and
widespread impact.

As this bug is not widespread, we have chosen to support it in the form of a
hotfix, as we do with many other bugs that do not impact a large user base.  A
hotfix is a package that is fully supported for the specific customer it is
released to until 30 days after the public release containing the fix.  If there
are concerns about 3rd-party support for these kernels, we can work with those
vendors to ensure that the hotfix is treated identically to a publicly released
RHEL kernel.

To request a hotfix for this issue, please go here:

https://www.redhat.com/apps/support/

Once you have logged in with your RHN ID and password, you should be able to
open a web ticket requesting the hotfix.  Please refer to BZ 207244 in the text
of the ticket, and please list the architecture and smp-variant (kernel,
kernel-smp, kernel-hugemem, kernel-largesmp), so we can provide the correct
build most expediently.
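
The details requested above can be gathered with a short script. This is only
a sketch: it assumes the RHEL 4 convention that the kernel release string ends
in the variant name (for example a release ending in `ELsmp` for kernel-smp),
and `kernel_package` and `hotfix_request_details` are names made up for this
example.

```python
import platform

# Longest suffix first: "largesmp" also ends in "smp".
VARIANTS = ("largesmp", "hugemem", "smp")


def kernel_package(release):
    """Map a kernel release string to its package name (assumed convention)."""
    for variant in VARIANTS:
        if release.endswith(variant):
            return "kernel-" + variant
    return "kernel"


def hotfix_request_details(release=None, arch=None):
    """Collect the fields to quote in the support ticket."""
    release = release or platform.release()
    arch = arch or platform.machine()
    return {
        "bug": "BZ 207244",
        "arch": arch,
        "release": release,
        "package": kernel_package(release),
    }


print(hotfix_request_details())
```

Run it on the affected machine and paste the result into the ticket text.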
Comment 20 Jason Corley 2007-02-27 10:40:33 EST
I don't see how you can claim the bug is not widespread when it affects the
default MTA of your flagship product.  Either you don't expect anyone to use
your product as a mail server or you don't want them using sendmail.  If that's
the case I would recommend you remove it from the next revision of RHEL.

From the newest kernel (2.6.9-42.0.10):

In addition to the security issues described above, a fix for the SCTP
subsystem to address a system crash which may be experienced in Telco
environments has been included.


Comment 21 Chris Snook 2007-02-27 12:13:34 EST
Manifestation of this bug requires rare pathological load conditions which
usually only occur frequently enough to drive the load to a significant level
when there are more simultaneously runnable realtime processes with the same
static priority than there are CPUs.  Since this generally makes it impossible
to satisfy realtime guarantees, realtime systems are usually designed and
configured to make this condition impossible.  As a result, we have received
very few reports of this behavior, making it preferable to support those users
with hotfixes, and allow the patch to go through the extended QA and beta
exposure of our normal update release cycle.  This is how we handle the vast
majority of bugs, including many with more severe and widespread impact than
this one.
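
The triggering condition described above can be expressed as a simple check.
The sketch below is illustrative only: `oversubscribed_priorities` and its
input format (pairs of static priority and a runnable flag for each realtime
task) are assumptions, and it does not inspect a live system.

```python
from collections import Counter


def oversubscribed_priorities(rt_tasks, n_cpus):
    """Return the static priorities with more runnable realtime tasks than CPUs.

    rt_tasks is an iterable of (static_priority, is_runnable) pairs.
    """
    runnable = Counter(prio for prio, is_runnable in rt_tasks if is_runnable)
    return sorted(prio for prio, count in runnable.items() if count > n_cpus)


# Three runnable tasks at priority 50 on a 2-CPU machine meet the condition.
tasks = [(50, True), (50, True), (50, True), (50, False), (10, True)]
print(oversubscribed_priorities(tasks, n_cpus=2))  # prints "[50]"
```

An empty result means no single priority level has more runnable realtime
tasks than there are CPUs.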

If you would like a supported hotfix, please contact support.  As a reminder,
that URL is:

https://www.redhat.com/apps/support/

If you would like to test an unsupported development kernel with this patch, you
can download it from the page linked in comment #5.  We welcome any technical
feedback you may have about it.
Comment 23 Red Hat Bugzilla 2007-05-07 23:37:00 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html
