Bug 181815

Summary: Phantom escalating load due to flawed rq->nr_uninterruptible increment
Product: Red Hat Enterprise Linux 3 Reporter: Will DeHaan <will>
Component: kernel Assignee: Ingo Molnar <mingo>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: cww, lwang, peterm, petrides, phyllis.wendelboe, sconklin, sfolkwil, syeghiay
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
URL: http://lkml.org/lkml/2004/11/16/78
Whiteboard:
Fixed In Version: RHSA-2006-0437 Doc Type: Bug Fix
Last Closed: 2006-07-20 13:49:58 UTC Type: ---
Bug Blocks: 181405, 186960    
Attachments:
Description Flags
scheduler patch to fix escalating load issue none

Description Will DeHaan 2006-02-16 19:49:21 UTC
Description of problem:

Load averages can creep up in a stair-step fashion, leaving a phantom
minimum load on the system. Given the right process and I/O load, the lowest
load value occasionally rises from the healthy 0 to very high numbers.

This scheduler accounting issue has no effect outside of the reported load, so
only applications that are aware of OS load are affected by this bug.

This issue was discovered and resolved in 2.6.10 kernels by Ingo Molnar.
I've backported that fix to RH's 2.4.21-37* kernels in the attached diff, which I
apply to the kernel RPMs as patch52. The URL field for this bug references
Ingo's 2.6 fix from late 2004.

Version-Release number of selected component (if applicable):

2.4.21-37.0.1.EL and earlier

How reproducible:

Readily, on SMP systems with a high process count and considerable blocked disk or LAN I/O

Steps to Reproduce:
1. Run many processes that regularly become uninterruptible on an SMP system
2. Watch the load averages, or peek at rq->nr_uninterruptible, and see it
increment sporadically
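
One way to watch for the stair-step in step 2 without a kernel debugger is to record the lowest 1-minute load average seen while the machine is quiesced. The following is a minimal Python sketch, not part of the attached patch: it relies only on the standard /proc/loadavg format (three load averages followed by other fields), and the helper names parse_loadavg, idle_floor and sample are hypothetical.

```python
import time


def parse_loadavg(text):
    """Return the (1-min, 5-min, 15-min) load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def idle_floor(lines):
    """Lowest 1-minute load average seen across a series of snapshots."""
    return min(parse_loadavg(line)[0] for line in lines)


def sample(count, interval=60.0, path="/proc/loadavg"):
    """Collect `count` snapshots of /proc/loadavg, `interval` seconds apart."""
    lines = []
    for i in range(count):
        with open(path) as f:
            lines.append(f.read())
        if i + 1 < count:
            time.sleep(interval)
    return lines
```

Calling idle_floor(sample(30)) with the production applications stopped should give a value near 0 on a healthy system; on an affected system it would be expected to stay near a positive integer.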
  
Actual results:

Load averages increase in a stair-step fashion, starting anywhere from
immediately to several weeks after the system comes under load. Disabling
production applications on an affected system returns its load averages to a
baseline integer value greater than 0.
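
The integer baseline is what the load-average arithmetic would predict if each stray increment of rq->nr_uninterruptible is never undone. The following is a minimal Python sketch, not the kernel code and not the attached patch. It assumes the load is computed from running plus uninterruptible tasks using the kernel's usual fixed-point constants (FSHIFT = 11, EXP_1 = 1884, one sample every 5 seconds); the names calc_load, settle and leaked are made up for illustration.

```python
FSHIFT = 11              # bits of fixed-point precision
FIXED_1 = 1 << FSHIFT    # 1.0 in fixed point
EXP_1 = 1884             # 1-minute decay factor applied per 5-second sample


def calc_load(load, exp, n_active):
    """One damping step: move the fixed-point `load` toward `n_active` tasks."""
    load = load * exp + n_active * FIXED_1 * (FIXED_1 - exp)
    return load >> FSHIFT


def settle(leaked, start=0.0, samples=720):
    """1-minute load after `samples` steps on an otherwise idle system.

    `leaked` models the number of phantom uninterruptible tasks left in the
    counter (an assumption used here to illustrate the symptom).
    """
    load = int(start * FIXED_1)
    for _ in range(samples):
        load = calc_load(load, EXP_1, leaked)
    return load / FIXED_1
```

With leaked=0 the simulated load decays to 0 from any starting value, while each phantom task raises the floor by roughly 1.0, which matches the integer steps reported here.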

Expected results:

Load averages should always return to near zero on an idle system.

Additional info:

Numerous Proofpoint Inc. customers have encountered this issue on production
RHEL3U5,6/Sendmail/Proofpoint/MySQL servers. Sendmail implementations typically
run 600 to 2000 children. Network utilization is generally quite low, but disk
I/O blocks processes often.

Please consider integrating my patch or passing feedback to me, thanks. 
Will DeHaan <will>

Comment 1 Will DeHaan 2006-02-16 19:49:21 UTC
Created attachment 124779
scheduler patch to fix escalating load issue

Comment 30 Ernie Petrides 2006-06-14 23:32:35 UTC
A fix for this problem was committed to the RHEL3 U8
patch pool on 9-Jun-2006 (in kernel version 2.4.21-44.EL).


Comment 38 Red Hat Bugzilla 2006-07-20 13:49:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html