Red Hat Bugzilla – Bug 207244
[RHEL 4] idle load average climbs monotonically until reboot
Last modified: 2010-10-22 02:06:13 EDT
Escalated to Bugzilla from IssueTracker
This behavior appears to be the same thing as BZ 181815:
Code inspection confirms the same bug exists in RHEL 4. I am building a
kernel with a backport of the upstream 2.6.10 fix:
I will update again when the kernel build is complete.
This event sent from IssueTracker by csnook [Support Engineering Group]
There is an accounting bug in the scheduler that can cause nr_uninterruptible to
increment on one CPU without a corresponding decrement on another CPU while
migrating a task. This causes the baseline idle load to increase monotonically
under some workloads, until load monitoring software (or the user) shuts down
the software triggering the bug, or the system is rebooted.
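
As a rough illustration of the accounting problem described above, here is a toy
model in Python. It is not the kernel code and not the 2.6.10 patch: the class
ToyScheduler, the skip_decrement flag, and the helper idle_load_after are
hypothetical names made up for this sketch. It assumes only what the
description says (per-CPU counters, an increment on one CPU, a decrement that
is supposed to land on another CPU), plus the fact that the Linux load average
is an exponentially decayed average of runnable plus uninterruptible tasks.

```python
import math


class ToyScheduler:
    """Toy per-CPU nr_uninterruptible counters (illustration only)."""

    def __init__(self, ncpus):
        # One counter per CPU. A single counter may go negative;
        # only the sum across CPUs is meaningful.
        self.nr_uninterruptible = [0] * ncpus

    def block(self, cpu):
        # Task enters uninterruptible sleep on this CPU.
        self.nr_uninterruptible[cpu] += 1

    def wake(self, cpu, skip_decrement=False):
        # Task wakes on a (possibly different) CPU. skip_decrement
        # models the lost decrement during migration.
        if not skip_decrement:
            self.nr_uninterruptible[cpu] -= 1

    def global_uninterruptible(self):
        return sum(self.nr_uninterruptible)


def run(cycles, buggy):
    """Sleep on CPU 0 and wake on CPU 1, `cycles` times."""
    sched = ToyScheduler(ncpus=2)
    for _ in range(cycles):
        sched.block(cpu=0)
        sched.wake(cpu=1, skip_decrement=buggy)
    return sched.global_uninterruptible()


def idle_load_after(leaked, minutes=30.0, period_s=5.0, window_s=60.0):
    """1-minute-style load average on an idle box with `leaked` phantom tasks."""
    decay = math.exp(-period_s / window_s)
    load = 0.0
    for _ in range(int(minutes * 60 / period_s)):
        load = load * decay + leaked * (1.0 - decay)
    return load


if __name__ == "__main__":
    print("correct accounting:", run(1000, buggy=False))
    print("lost decrements:   ", run(1000, buggy=True))
    print("idle load with 3 leaked tasks:", round(idle_load_after(3), 2))
```

With correct accounting the sum stays at zero even though the individual
counters drift apart. Every lost decrement leaves one phantom uninterruptible
task behind, and the idle load average settles at that leaked count.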
Applications which self-limit their activity under high load (such as sendmail)
will shut down when the apparent load exceeds their set threshold. Load
monitoring software will also issue (mostly) false alarms.
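
To make the self-limiting behavior concrete, here is a minimal sketch of a
load-threshold check. sendmail's QueueLA and RefuseLA options are the kind of
thresholds meant here, but the function below is a simplification and not
sendmail's actual logic. The name mta_action and the example threshold values
are assumptions chosen for illustration.

```python
import os


def mta_action(load1, queue_la, refuse_la):
    """Decide what a load-limited mail server would do at this load.

    load1     -- 1-minute load average
    queue_la  -- at or above this, queue mail instead of delivering
    refuse_la -- at or above this, refuse new connections
    """
    if load1 >= refuse_la:
        return "refuse"
    if load1 >= queue_la:
        return "queue"
    return "deliver"


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    load1, _, _ = os.getloadavg()
    # Threshold values here are examples only.
    print(load1, mta_action(load1, queue_la=8, refuse_la=12))
```

Because the check looks only at the reported load average, a phantom load
that has crept past the threshold stops mail on a machine whose CPUs are idle.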
A backport of the 2.6.10 patch for this issue resolved the problem under the
customer's reliably reproducing workload.
Committed in stream U5 build 42.14. A test kernel with this patch is available
QE ack for 4.5.
Given that sendmail is the default MTA for RHEL and that this hard locks a
sendmail mail server, I would've expected a faster turnaround than waiting for
update 5. I fought this for the better part of a week on our production mail
server before finding this bug report. The kernel from jbaron's homedir fixes
the issue for us (which by the way, rendered the system unusable and locked out
(In reply to comment #7)
With the exception of security errata, new versions of packages are generally
released only with updates, as described in the RHEL Errata Support Policy:
Customers requiring intermediate solutions should contact support.
Denial of Service is typically considered a security problem. A fully patched
RHEL4 sendmail server a) no longer sends or receives mail after a certain period
of time and b) no longer allows remote logins (sshd kicked from RAM). DoS in my
book, just happens to be done by the scheduler instead of a remote attacker.
In documented cases we have seen so far, a remote attacker could not cause
Denial of Service to occur in a situation where it otherwise would not under
normal load. Therefore we regard this as a stability bug, like any other bug
which cannot be deliberately triggered but could cause a system to become
unavailable due to a crash or hang of some kind. We have classified the
severity as high and prioritized it accordingly. Customers experiencing this
problem should contact support for intermediate solutions prior to the release
of the Update containing the fix.
If you are aware of a scenario in which an attacker could deliberately trigger
this behavior on a system that is stable under normal load, please inform
I notice that 2.6.9-42.0.3.EL contains the following:
"In addition, two bugfixes for the IPW-2200 wireless driver were included."
Clearly the "only security errata" does not apply globally. It's puzzling why
people running sendmail on RHEL4 are continuing to be left out in the cold.
Jason, I appreciate your concerns. The fix for this issue is contained in the
current U5 beta kernel builds which are available from
You can also contact support for an official 'hotfix' kernel, which is the same
thing as above, to hold you over until the fix appears in the main stream.
Looks like -16 suffers the same problem as previous kernels. About 2 hours
after installing it, load spiked to over 8. -14 seems the only reliable version
Another kernel errata without this fix. When exactly will this be included?
As noted in comment 5 and comment 12, this is on track for RHEL 4 Update 5. If
you require a supported fix before then, please contact support for a hotfix.
Florian, can you have a look at it, please? It seems strange to me that this
can't be fixed in a "normal" time frame...
More kernel errata (2.6.9-42.0.8) dealing with issues that aren't this one.
From your own errata announcement:
In addition to the security issues described above, fixes for the following
bugs were included:
* initialization error of the tg3 driver with some BCM5703x network card
* a memory leak in the audit subsystem
* x86_64 nmi watchdog timeout is too short
* ext2/3 directory reads fail intermittently
Again more non-security fixes. Why is this one sitting until rhel4u5 again?
As we have always done, the errata kernel 2.6.9-42.0.8.EL included both security
fixes and other fixes which were deemed critical, due to both severe and
As this bug is not widespread, we have chosen to support it in the form of a
hotfix, as we do with many other bugs that do not impact a large user base. A
hotfix is a package that is fully supported for the specific customer it is
released to until 30 days after the public release containing the fix. If there
are concerns about 3rd-party support for these kernels, we can work with those
vendors to ensure that the hotfix is treated identically to a publicly released
To request a hotfix for this issue, please go here:
Once you have logged in with your RHN ID and password, you should be able to
open a web ticket requesting the hotfix. Please refer to BZ 207244 in the text
of the ticket, and please list the architecture and smp-variant (kernel,
kernel-smp, kernel-hugemem, kernel-largesmp), so we can provide the correct
build most expediently.
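
For anyone unsure which values to put in the ticket, the following sketch
reports the architecture and guesses the kernel package variant from the
running kernel's release string. It assumes the RHEL 4 convention that
`uname -r` ends in a variant suffix (for example ...ELsmp); the function name
kernel_package is hypothetical, and variants not listed in the comment above
are not handled.

```python
import platform

# Check "largesmp" before "smp", since the former also ends in "smp".
VARIANT_SUFFIXES = ("largesmp", "hugemem", "smp")


def kernel_package(release):
    """Map a kernel release string to its package name (best guess)."""
    for suffix in VARIANT_SUFFIXES:
        if release.endswith(suffix):
            return "kernel-" + suffix
    # No variant suffix: assume the plain uniprocessor kernel package.
    return "kernel"


if __name__ == "__main__":
    release = platform.release()   # same string as `uname -r`
    arch = platform.machine()      # same string as `uname -m`
    print("architecture:", arch)
    print("running kernel:", release)
    print("package variant:", kernel_package(release))
```

Run it on the affected machine and paste the three lines into the ticket.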
I don't see how you can claim the bug is not widespread when it affects the
default MTA of your flagship product. Either you don't expect anyone to use
your product as a mail server or you don't want them using sendmail. If that's
the case I would recommend you remove it from the next revision of RHEL.
From the newest kernel (2.6.9-42.0.10):
In addition to the security issues described above, a fix for the SCTP
subsystem to address a system crash which may be experienced in Telco
environments has been included.
Manifestation of this bug requires rare pathological load conditions which
usually only occur frequently enough to drive the load to a significant level
when there are more simultaneously runnable realtime processes with the same
static priority than there are CPUs. Since this generally makes it impossible
to satisfy realtime guarantees, realtime systems are usually designed and
configured to make this condition impossible. As a result, we have received
very few reports of this behavior, making it preferable to support those users
with hotfixes, and allow the patch to go through the extended QA and beta
exposure of our normal update release cycle. This is how we handle the vast
majority of bugs, including many with more severe and widespread impact than
If you would like a supported hotfix, please contact support. As a reminder,
that URL is:
If you would like to test an unsupported development kernel with this patch, you
can download it from the page linked in comment #5. We welcome any technical
feedback you may have about it.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.