Bug 181815 - Phantom escalating load due to flawed rq->nr_uninterruptible increment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Ingo Molnar
QA Contact: Brian Brock
URL: http://lkml.org/lkml/2004/11/16/78
Whiteboard:
Depends On:
Blocks: RHEL3U8CanFix 186960
 
Reported: 2006-02-16 19:49 UTC by Will DeHaan
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHSA-2006-0437
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-07-20 13:49:58 UTC
Target Upstream Version:
Embargoed:


Attachments
scheduler patch to fix escalating load issue (857 bytes, patch)
2006-02-16 19:49 UTC, Will DeHaan


Links
System: Red Hat Product Errata
ID: RHSA-2006:0437
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Updated kernel packages for Red Hat Enterprise Linux 3 Update 8
Last Updated: 2006-07-20 13:11:00 UTC

Description Will DeHaan 2006-02-16 19:49:21 UTC
Description of problem:

Load averages can creep up in a stair-step fashion, establishing a phantom
minimum load on the system. Given the right mix of process and I/O load, the
lowest load value the system returns to rises from a healthy 0 to very high
numbers.

This scheduler accounting issue has no effect beyond the reported load, so
only applications that are aware of OS load are impacted by this bug.

Ingo Molnar discovered and resolved this issue in the 2.6.10 kernel.
I've backported that fix to Red Hat's 2.4.21-37* kernels in the attached diff,
which I apply in the kernel RPMs as patch52. The URL field of this bug
references Ingo's 2.6 fix from late 2004.

Version-Release number of selected component (if applicable):

2.4.21-37.0.1.EL and earlier

How reproducible:

Readily, on SMP systems with high process counts and considerable blocking disk
or LAN I/O

Steps to Reproduce:
1. Run many processes that regularly become uninterruptible on an SMP system
2. Watch the load averages, or peek at rq->nr_uninterruptible, and see the value
increment sporadically
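
For the load-average half of step 2, a small user-space helper can make the
stair-step easier to see. The following is only a minimal sketch, assuming a
Linux /proc/loadavg whose first three fields are the 1, 5 and 15 minute
averages. The function names parse_loadavg, read_loadavg and update_floor are
hypothetical and are not part of the attached patch or of any kernel interface.

```c
#include <stdio.h>

/* Hypothetical helper: parse the three load averages from a line in
 * /proc/loadavg format, e.g. "2.00 2.01 2.05 1/612 4242".
 * Returns 0 on success, -1 if three numbers could not be read. */
int parse_loadavg(const char *line, double avg[3])
{
    if (sscanf(line, "%lf %lf %lf", &avg[0], &avg[1], &avg[2]) != 3)
        return -1;
    return 0;
}

/* Hypothetical helper: read and parse /proc/loadavg.
 * Returns 0 on success, -1 on any I/O or parse failure. */
int read_loadavg(double avg[3])
{
    char buf[128];
    FILE *f = fopen("/proc/loadavg", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_loadavg(buf, avg);
}

/* Hypothetical helper: keep the lowest sample seen so far in a window.
 * A floor that rises from one quiet window to the next is the symptom. */
double update_floor(double floor_so_far, double sample)
{
    return sample < floor_so_far ? sample : floor_so_far;
}
```

Calling read_loadavg() periodically and passing avg[0] to update_floor() over
each quiet period gives one floor value per period; on an affected system those
floors would be expected to step upward over time instead of returning to
near zero.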
  
Actual results:

Load averages increase in a stair-step fashion, over a period ranging from
almost immediately to several weeks under load. Disabling production
applications on an affected system returns its load averages to a nonzero
integer baseline rather than to 0
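
The integer baseline follows from how the load average is computed: the kernel
periodically feeds the number of runnable plus uninterruptible tasks into a
fixed-point exponential average, so every stray increment that is never undone
acts like one task that never goes away. The sketch below reproduces that
arithmetic in user space. FSHIFT, FIXED_1 and EXP_1 are the constants used with
the kernel's CALC_LOAD macro; the functions calc_load, run_ticks and
load_as_double are hypothetical names for this illustration, and the phantom
count of 2 is an assumed value, not one taken from this report.

```c
#include <stdio.h>

/* Fixed-point constants used with the kernel's CALC_LOAD macro. */
#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed point */
#define EXP_1    1884UL            /* 1/exp(5sec/1min) in fixed point */

/* One update of the 1-minute average, written as a function instead of
 * the kernel macro. `active` is the task count being averaged. */
unsigned long calc_load(unsigned long load, unsigned long active)
{
    load *= EXP_1;
    load += active * FIXED_1 * (FIXED_1 - EXP_1);
    return load >> FSHIFT;
}

/* Hypothetical driver: apply `ticks` updates (one per 5 seconds in the
 * kernel) with a constant active count, starting from `load`. */
unsigned long run_ticks(unsigned long load, unsigned long active,
                        unsigned int ticks)
{
    while (ticks-- > 0)
        load = calc_load(load, active);
    return load;
}

/* Convert the fixed-point value to a floating-point load figure. */
double load_as_double(unsigned long load)
{
    return (double)load / (double)FIXED_1;
}
```

As a usage example, run_ticks(0, 10 + 2, 1000) models a busy period with 10
real tasks plus 2 phantom counts, and passing that result to
run_ticks(load, 0 + 2, 1000) models the same system after the applications are
stopped. The second value settles at about 2 instead of decaying to 0, and only
a run with an active count of 0 brings it back to 0.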

Expected results:

Load averages should always return to near-zero on an idle system

Additional info:

Numerous Proofpoint Inc. customers have encountered this issue on production
RHEL3 U5 and U6 servers running Sendmail, Proofpoint and MySQL. Sendmail
implementations typically run 600 to 2000 children. Network utilization is
generally quite low, but disk I/O blocks processes often.

Please consider integrating my patch or passing feedback to me, thanks. 
Will DeHaan <will>

Comment 1 Will DeHaan 2006-02-16 19:49:21 UTC
Created attachment 124779
scheduler patch to fix escalating load issue

Comment 30 Ernie Petrides 2006-06-14 23:32:35 UTC
A fix for this problem was committed to the RHEL3 U8
patch pool on 9-Jun-2006 (in kernel version 2.4.21-44.EL).


Comment 38 Red Hat Bugzilla 2006-07-20 13:49:58 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html


