Bug 447637

Summary: When a monitoring alert for CPU utilization is tripped, it never reverts
Product: Red Hat Satellite 5
Component: Monitoring
Version: 510
Hardware: All
OS: Linux
Severity: low
Priority: low
Status: CLOSED NOTABUG
Reporter: Thomas Cameron <tcameron>
Assignee: Tomas Lestach <tlestach>
QA Contact: Preethi Thomas <pthomas>
CC: cperry
Doc Type: Bug Fix
Last Closed: 2009-04-20 11:12:28 UTC
Bug Blocks: 463877

Description Thomas Cameron 2008-05-20 21:54:55 UTC
Description of problem:
I am running RHEL 5.1 32-bit on my dual-core ThinkPad T60.  It's got 3GB memory
and two physical hard drives.  My laptop's hostname is tct60.redhat.com.

I have installed RHN Satellite on a paravirt RHEL 4.6 guest called
tcsatellite.redhat.com on the T60.  That guest is using the whole of the
second internal hard drive (/dev/hda1) as a raw disk.  It has about 200GB of
space.  I allocated 2 CPUs and 2GB of memory to install and sync it, then
dropped it down to 1.5GB memory to run it.

I installed RHEL 5.1 AP as a single CPU guest on the T60.  It is called
moe.redhat.com.  It is file backed and has 384MB memory and 3GB disk space.

The dom0 and both domU guests are up to date as of today.

I was using this environment to demo RHN Satellite 5.1 to a customer, and I
showed the customer monitoring.  I set up a probe suite called CPUcheck with one
probe, Linux: CPU Usage.  For demo purposes I set both the notification and
probe check intervals to 1 minute, and I set the thresholds at 10% warn and 30%
critical.  Then I loaded the guest by running the command:

cat /dev/zero > /dev/null

After a few minutes, the red circle with a white exclamation point showed up
under the Health column of the System Overview tab.  After I'd shown the
customer, I killed the cat process and the CPU went back to about 99% idle.
The red circle with a white exclamation point icon did not go away, though,
and I am still getting e-mail alerts saying that CPU utilization is at 100%.

I spoke with Mike McCune over IRC and he logged into the client domU and the
Satellite server.  He said it looked like there was a bug in the perl code. 
Below is what Mike said:

[root@moe ~]# perl cpu.pl 
results: procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0     40  49144  20372 225860    0    0    64    37   60  150  2  4 91  2  1
 0  0     40  49144  20380 225860    0    0     0    19   28   53  0  0 100  0  0
CPU: 100

Mike told me that the Perl probe runs vmstat 5 2 and parses its output, and
that the parsing goes wrong on my machine.
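The arithmetic behind the "CPU: 100" result can be seen in Mike's output above. With a newer vmstat that appends an st (steal) column, out[-3] and out[-2] of a sample line are the id (idle) and wa (wait) percentages, so summing them on an idle machine gives roughly 100. A quick illustration of the index arithmetic in Python (the probe itself is Perl):

```python
# Mike's second vmstat sample, with the 'st' (steal) column present:
#  r  b  swpd   free   buff   cache  si so bi bo  in  cs us sy  id wa st
row = [0, 0, 40, 49144, 20380, 225860, 0, 0, 0, 19, 28, 53, 0, 0, 100, 0, 0]

buggy = row[-2] + row[-3]   # wa + id: ~100 on an idle machine
fixed = row[-4] + row[-5]   # sy + us: actual CPU usage
print(buggy, fixed)         # 100 0
```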

I deleted the probe suite and rebooted moe.  I then created a new probe suite
called cpu-check with Linux: CPU Usage in it and assigned it to moe.  Without
ever loading the box, it immediately showed in alert state in System Overview.

Version-Release number of selected component (if applicable):
RHN Satellite 5.1.0

How reproducible:
1. Create a probe suite
2. Add Linux: CPU Usage to that probe suite
3. Add a system to that probe suite
4. Push scout configs
5. Load the system to 100% CPU
6. Reduce the load to normal
  
Actual results:
Alert shows the system is at high load even when the load has been reduced.

Expected results:
Icon will change when the alert condition is cleared.

Additional info:

Comment 1 Mike McCune 2008-05-20 22:24:29 UTC
Perl code is in:

trunk/eng/monitoring/PerlModules/NP/Probe/DataSource/UnixCommand.pm

test routine to run:

sub cpu {
    # Run vmstat twice, five seconds apart; the second sample reflects
    # current load.
    my $command = '/usr/bin/vmstat 5 2';
    my $results = `$command`;

    print "results: $results";

    my @lines = split("\n", $results);

    # Split the second sample (line 3) into columns.
    my @out = split(' ', $lines[3]);

    my $cpu_pct_used;

    foreach my $o (@out) {
        print $o . "|";
    }

    # Newer vmstat appends an 'st' (steal) column, which shifts the
    # negative indices by one.
    if ($lines[1] =~ /.*st$/) {
        print "out[-4]: " . $out[-4] . " out[-5]: " . $out[-5] . "\n";
        $cpu_pct_used = $out[-4] + $out[-5];    # sy + us
    } else {
        print "out[-2]: " . $out[-2] . " out[-3]: " . $out[-3] . "\n";
        $cpu_pct_used = $out[-2] + $out[-3];
    }

    # NOTE: this unconditional assignment overrides the 'st' check above.
    # With an st-aware vmstat, out[-2] and out[-3] are 'wa' and 'id', so an
    # idle machine sums to ~100 -- the "CPU: 100" shown in the description.
    $cpu_pct_used = $out[-2] + $out[-3];
    return $cpu_pct_used;
}

my $pct = cpu();
print "CPU: $pct\n";

1;
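One way to make this kind of parser immune to the optional st column is to resolve columns by header name instead of fixed negative index. The following is an illustrative Python sketch of that approach, not the actual fix applied to UnixCommand.pm:

```python
def cpu_pct_used(vmstat_output):
    """Return us + sy (percent CPU used) from the last sample of
    `vmstat 5 2` output, resolving columns by header name so the
    optional 'st' (steal) column cannot shift the indices."""
    lines = vmstat_output.strip().splitlines()
    headers = lines[1].split()       # r b swpd ... us sy id wa [st]
    values = lines[-1].split()       # second (current) sample
    cols = dict(zip(headers, values))
    return int(cols["us"]) + int(cols["sy"])

# The idle sample from the description: us=0, sy=0, id=100
sample = """procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0     40  49144  20372 225860    0    0    64    37   60  150  2  4 91  2  1
 0  0     40  49144  20380 225860    0    0     0    19   28   53  0  0 100  0  0"""
print(cpu_pct_used(sample))          # 0
```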


Comment 2 Clifford Perry 2009-04-09 17:26:17 UTC
Need to understand whether the client is not sending the right information or, as I suspect, it is a display issue in the UI. I would suggest Tomas look at reproducing this; I am not sure he will be able to fully track it down, but it is an interesting (to me) bug to track down.

Cliff

Comment 3 Tomas Lestach 2009-04-20 11:12:28 UTC
The client sends the correct information (verified using rhn-runprobe), and the UI also displays the correct information. The only thing that could be confusing is the next probe schedule:
when the probe status is not OK (e.g. when a threshold is reached), a longer delay is set before the next probe run.
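This rescheduling behaviour is what produces the 5-6 minute gaps after each 100% reading in the CSV log in this comment. A minimal sketch of the idea; the interval and back-off values here are hypothetical, not Satellite's actual configuration:

```python
def next_check_delay(status, base_interval=60, backoff_factor=5):
    """Seconds until the next probe run: OK probes are rechecked at the
    configured interval; non-OK probes get a longer delay (hypothetical
    factor of 5, matching the ~5 minute gaps seen in the log)."""
    if status == "OK":
        return base_interval
    return base_interval * backoff_factor

print(next_check_delay("OK"))        # 60
print(next_check_delay("CRITICAL"))  # 300
```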

Behaviour is correct in: Satellite-5.3.0-RHEL5-re20090403.2

Example:
Logged probe events (with "Probe Check Interval" set to 1 minute), available in the web UI in CSV format:

Id		Data	Time			Metric
1-43-pctused	0	04/17/09 03:37 PM	pctused
1-43-pctused	0	04/17/09 03:38 PM	pctused
1-43-pctused	100	04/17/09 03:39 PM	pctused
1-43-pctused	0	04/17/09 03:44 PM	pctused
1-43-pctused	0	04/17/09 03:46 PM	pctused
1-43-pctused	0	04/17/09 03:47 PM	pctused
1-43-pctused	1	04/17/09 03:48 PM	pctused
1-43-pctused	0	04/17/09 03:49 PM	pctused
1-43-pctused	1	04/17/09 03:50 PM	pctused
1-43-pctused	6	04/17/09 03:51 PM	pctused
1-43-pctused	1	04/17/09 03:53 PM	pctused
1-43-pctused	3	04/17/09 03:54 PM	pctused
1-43-pctused	100	04/17/09 03:55 PM	pctused
1-43-pctused	3	04/17/09 04:01 PM	pctused
1-43-pctused	4	04/17/09 04:02 PM	pctused
1-43-pctused	1	04/17/09 04:03 PM	pctused
1-43-pctused	3	04/17/09 04:04 PM	pctused
1-43-pctused	3	04/17/09 04:05 PM	pctused
1-43-pctused	2	04/17/09 04:06 PM	pctused
1-43-pctused	5	04/17/09 04:08 PM	pctused
1-43-pctused	100	04/17/09 04:09 PM	pctused
1-43-pctused	4	04/17/09 04:14 PM	pctused

I set probe thresholds according to the description (10% warn and 30% critical).