Bug 206957 - sar gives incorrect values for CPU utilization
Summary: sar gives incorrect values for CPU utilization
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: sysstat
Version: 4.4
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Ivana Varekova
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On: 196666
Blocks:
 
Reported: 2006-09-18 14:24 UTC by Thomas Sudbrak
Modified: 2007-11-17 01:14 UTC (History)

Fixed In Version: sysstat-5.0.5-14.rhel4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-02 10:59:19 UTC
Target Upstream Version:
Embargoed:


Attachments
use long long variables for cpu utilization within struct file_stats. (4.23 KB, patch)
2006-09-18 14:24 UTC, Thomas Sudbrak
use long long variables for cpu utilization within struct file_stats and stats_one_cpu. (5.04 KB, patch)
2006-09-19 10:04 UTC, Thomas Sudbrak

Description Thomas Sudbrak 2006-09-18 14:24:49 UTC
Description of problem:

After a long uptime, the command sar -u reports incorrect values for CPU
utilization.  The problem occurs on systems running a 2.6 kernel as soon as one
of the values in /proc/stat reaches 2^32.

Version-Release number of selected component (if applicable):

5.0.5-11.rhel4

How reproducible:

Always after the threshold of 2^32 is reached.

Steps to Reproduce:

1. Wait long enough (approx. 2 months on a machine with 8 processors, or 4
processors with HT).  This is NOT a joke; it really happened on our machines:

uptime:
  16:09:57 up >>> 103 days <<<, ...

cat /proc/stat:
  cpu  97721466 8392 15181193 >>> 7001411008 <<< 24658269 414919 1863876
  cpu0 ...
  ...
  cpu8 ...

2. Run "sar -P ALL 20 1" or similar and, for each column, compare the value in
the first line (average) with the average of the following lines.  The first
line obviously contains corrupted data.

Additional info:

The problem is caused by the size of the fields in struct file_stats, which is
defined in sa.h.  On a 2.6 kernel the values in /proc/stat increase much faster
than on a 2.4 kernel (probably due to the frequency of 1000Hz).

The attached patch fixes the problem.  Other fields of file_stats (or of other
structures) might also need to be enlarged.

Comment 1 Thomas Sudbrak 2006-09-18 14:24:49 UTC
Created attachment 136546
use long long variables for cpu utilization within struct file_stats.

Comment 2 Thomas Sudbrak 2006-09-19 10:04:02 UTC
Created attachment 136618
use long long variables for cpu utilization within struct file_stats and stats_one_cpu.

This patch was generated after rpmbuild's %prep stage, so it must be applied
AFTER all the usual patches.

Comment 3 Thomas Sudbrak 2006-09-19 10:09:29 UTC
Comment on attachment 136618
use long long variables for cpu utilization within struct file_stats and stats_one_cpu.

The patch only covers the i386 architecture; the situation on x86_64 and others
has to be examined as well.

Comment 4 Tom Sightler 2007-01-25 04:28:17 UTC
We are seeing this issue on a number of servers, especially Dell 6850s, which
are 8-core systems (16 with hyper-threading).  It only takes a few weeks for the
problem to show up.  This really should be fixed, as it is very misleading when
using sar to look at load history and has confused our DBAs several times.


Comment 5 Ivana Varekova 2007-05-02 10:59:19 UTC
This problem is fixed in sysstat-5.0.5-14.rhel4.  If the problem persists after
the upgrade to 14.rhel4, please reopen this bug.

