Escalated to Bugzilla from IssueTracker
This behavior appears to be the same thing as BZ 181815: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181815 Code inspection confirms the same bug exists in RHEL 4. I am building a kernel with a backport of the upstream 2.6.10 fix: http://lkml.org/lkml/2004/11/16/78 I will update again when the kernel build is complete. This event sent from IssueTracker by csnook [Support Engineering Group] issue 102032
Summary: There is an accounting bug in the scheduler that can cause nr_uninterruptible to increment on one CPU without a corresponding decrement on another CPU while migrating a task. This causes the baseline idle load to increase monotonically under some workloads, until load monitoring software (or the user) shuts down the software triggering the bug, or the system is rebooted. Impact: Applications which self-limit their activity under high load (such as sendmail) will shut down when the apparent load exceeds their set threshold. Load monitoring software will also issue (mostly) false alarms. Fix: A backport of the 2.6.10 patch for this issue resolved the problem under the customer's reliably reproducing workload.
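For anyone following along, a toy model of the accounting described in the summary may help. This is only a sketch in Python, not kernel code: the RunQueue class, the helper names, and the lose_decrement flag are all invented for illustration, and it assumes the load figure is derived from the sum of nr_running and nr_uninterruptible across all CPUs.

```python
from dataclasses import dataclass


@dataclass
class RunQueue:
    """Hypothetical per-CPU runqueue holding only the two load counters."""
    nr_running: int = 0
    nr_uninterruptible: int = 0


def active_tasks(rqs):
    # Only the sum over all CPUs matters, so a single CPU's
    # nr_uninterruptible may go negative as long as the total balances.
    return sum(rq.nr_running + rq.nr_uninterruptible for rq in rqs)


def block(rqs, cpu):
    # A running task enters uninterruptible sleep on `cpu`.
    rqs[cpu].nr_running -= 1
    rqs[cpu].nr_uninterruptible += 1


def wake(rqs, dst_cpu, lose_decrement=False):
    # The task wakes and is placed on `dst_cpu`, which may differ from the
    # CPU it slept on. lose_decrement=True models the bug: the increment
    # made in block() is never paired with a decrement on any CPU.
    if not lose_decrement:
        rqs[dst_cpu].nr_uninterruptible -= 1
    rqs[dst_cpu].nr_running += 1


def idle_baseline(cycles, lose_decrement):
    """Run `cycles` sleep/migrate/wake rounds; return (total, runqueues)."""
    rqs = [RunQueue(), RunQueue()]
    for _ in range(cycles):
        rqs[0].nr_running += 1        # task becomes runnable on CPU 0
        block(rqs, 0)                 # sleeps uninterruptibly on CPU 0
        wake(rqs, 1, lose_decrement)  # wakes up migrated to CPU 1
        rqs[1].nr_running -= 1        # runs to completion on CPU 1
    return active_tasks(rqs), rqs
```

With balanced accounting the total returns to zero once the system is idle, even though the individual per-CPU counters drift apart. With the decrement lost, every affected cycle leaves one phantom task behind, so the idle baseline only ever rises, which is what eventually trips a load threshold such as the one sendmail applies.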
Committed in stream U5 build 42.14. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
QE ack for 4.5.
Given that sendmail is the default MTA for RHEL and that this hard locks a sendmail mail server, I would've expected a faster turnaround than waiting for update 5. I fought this for the better part of a week on our production mail server before finding this bug report. The kernel from jbaron's homedir fixes the issue for us. (The bug, by the way, rendered the system unusable and locked out remote logins.)
(In reply to comment #7) With the exception of security errata, new versions of packages are generally released only with updates, as described in the RHEL Errata Support Policy: https://www.redhat.com/security/updates/errata/ Customers requiring intermediate solutions should contact support.
Denial of Service is typically considered a security problem. A fully patched RHEL4 sendmail server a) no longer sends or receives mail after a certain period of time and b) no longer allows remote logins (sshd kicked from RAM). DoS in my book, just happens to be done by the scheduler instead of a remote attacker.
In documented cases we have seen so far, a remote attacker could not cause Denial of Service to occur in a situation where it otherwise would not under normal load. Therefore we regard this as a stability bug, like any other bug which cannot be deliberately triggered but could cause a system to become unavailable due to a crash or hang of some kind. We have classified the severity as high and prioritized it accordingly. Customers experiencing this problem should contact support for intermediate solutions prior to the release of the Update containing the fix. If you are aware of a scenario in which an attacker could deliberately trigger this behavior on a system that is stable under normal load, please inform security immediately.
I notice that 2.6.9-42.0.3.EL contains the following: "In addition, two bugfixes for the IPW-2200 wireless driver were included." Clearly the "only security errata" does not apply globally. It's puzzling why people running sendmail on RHEL4 are continuing to be left out in the cold.
Jason, I appreciate your concerns. The fix for this issue is contained in the current U5 beta kernel builds, which are available from http://people.redhat.com/~jbaron/rhel4/ You can also contact support for an official 'hotfix' kernel, which is the same thing as the above, to hold you over until the fix appears in the main stream. Thanks.
Looks like -16 suffers from the same problem as previous kernels. About 2 hours after installing it, load spiked to over 8. -14 seems to be the only reliable version so far.
Another kernel errata without this fix. When exactly will this be included?
As noted in comment 5 and comment 12, this is on track for RHEL 4 Update 5. If you require a supported fix before then, please contact support for a hotfix.
Florian, can you have a look at it, please? It seems obviously strange to me that this can't be fixed in a "normal" time...
More kernel errata (2.6.9-42.0.8) dealing with issues that aren't this one. From your own errata announcement:
-----
In addition to the security issues described above, fixes for the following bugs were included:
* initialization error of the tg3 driver with some BCM5703x network card
* a memory leak in the audit subsystem
* x86_64 nmi watchdog timeout is too short
* ext2/3 directory reads fail intermittently
-----
Again more non-security fixes. Why is this one sitting until rhel4u5 again?
As we have always done, the errata kernel 2.6.9-42.0.8.EL included both security fixes and other fixes which were deemed critical, due to both severe and widespread impact. As this bug is not widespread, we have chosen to support it in the form of a hotfix, as we do with many other bugs that do not impact a large user base. A hotfix is a package that is fully supported for the specific customer it is released to until 30 days after the public release containing the fix. If there are concerns about 3rd-party support for these kernels, we can work with those vendors to ensure that the hotfix is treated identically to a publicly released RHEL kernel. To request a hotfix for this issue, please go here: https://www.redhat.com/apps/support/ Once you have logged in with your RHN ID and password, you should be able to open a web ticket requesting the hotfix. Please refer to BZ 207244 in the text of the ticket, and please list the architecture and smp-variant (kernel, kernel-smp, kernel-hugemem, kernel-largesmp), so we can provide the correct build most expediently.
I don't see how you can claim the bug is not widespread when it affects the default MTA of your flagship product. Either you don't expect anyone to use your product as a mail server or you don't want them using sendmail. If that's the case I would recommend you remove it from the next revision of RHEL. From the newest kernel (2.6.9-42.0.10): In addition to the security issues described above, a fix for the SCTP subsystem to address a system crash which may be experienced in Telco environments has been included.
Manifestation of this bug requires rare pathological load conditions which usually only occur frequently enough to drive the load to a significant level when there are more simultaneously runnable realtime processes with the same static priority than there are CPUs. Since this generally makes it impossible to satisfy realtime guarantees, realtime systems are usually designed and configured to make this condition impossible. As a result, we have received very few reports of this behavior, making it preferable to support those users with hotfixes, and allow the patch to go through the extended QA and beta exposure of our normal update release cycle. This is how we handle the vast majority of bugs, including many with more severe and widespread impact than this one. If you would like a supported hotfix, please contact support. As a reminder, that URL is: https://www.redhat.com/apps/support/ If you would like to test an unsupported development kernel with this patch, you can download it from the page linked in comment #5. We welcome any technical feedback you may have about it.
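To make the triggering condition above concrete, here is a rough sketch of a check for it. It is illustrative only: the function name and the (static_priority, is_runnable) input format are assumptions made for this example, and gathering that data from a live system is left out.

```python
from collections import Counter


def oversubscribed_priorities(rt_tasks, ncpus):
    """Return {static_priority: count} where runnable realtime tasks exceed CPUs.

    rt_tasks is a hypothetical list of (static_priority, is_runnable) pairs
    describing the realtime processes on the system.
    """
    # Count only tasks that are runnable right now, grouped by priority.
    counts = Counter(prio for prio, runnable in rt_tasks if runnable)
    # A priority level is oversubscribed when it has more simultaneously
    # runnable realtime tasks than there are CPUs to run them on.
    return {prio: n for prio, n in counts.items() if n > ncpus}
```

An empty result means no priority level has more simultaneously runnable realtime tasks than CPUs, which is the configuration realtime systems are normally designed to stay within.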
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html