Description of problem:

I am running RHEL 5.1 32-bit on my dual-core ThinkPad T60. It has 3GB of memory and two physical hard drives. My laptop's hostname is tct60.redhat.com.

I have installed RHN Satellite on a paravirt RHEL 4.6 guest called tcsatellite.redhat.com on the T60. That guest uses the whole of the second internal hard drive (/dev/hda1) as a raw disk, which gives it about 200GB of space. I allocated 2 CPUs and 2GB of memory to install and sync it, then dropped it down to 1.5GB of memory to run it.

I installed RHEL 5.1 AP as a single-CPU guest on the T60. It is called moe.redhat.com. It is file backed and has 384MB of memory and 3GB of disk space.

The dom0 and both domU guests are up to date as of today.

I was using this environment to demo RHN Satellite 5.1 to a customer, and I showed the customer monitoring. I set up a probe suite called CPUcheck with one probe, Linux: CPU Usage. For demo purposes I set both the notification and probe check intervals to 1 minute, and I set the thresholds at 10% warn and 30% critical. Then I loaded the guest by running the command:

    cat /dev/zero > /dev/null

After a few minutes, the red circle with a white exclamation point showed up under the Health column of the System Overview tab. After I'd shown the customer, I killed the cat process and CPU utilization went back down to 99% idle. The red circle icon did not go away, though, and I am still getting e-mail alerts saying that CPU utilization is at 100%.

I spoke with Mike McCune over IRC and he logged into the client domU and the Satellite server. He said it looked like there was a bug in the Perl code.
Below is what Mike said:

    [root@moe ~]# perl cpu.pl
    results: procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     1  0     40  49144  20372 225860    0    0    64    37   60  150  2  4 91  2  1
     0  0     40  49144  20380 225860    0    0     0    19   28   53  0  0 100  0  0
    CPU: 100

Mike told me that the Perl probe runs vmstat 5 2 and parses the results, and that it is messing up the parsing of the results on my machine.

I deleted the probe suite and rebooted moe. I then created a new probe suite called cpu-check with Linux: CPU Usage in it and assigned it to moe. Without my ever loading the box, it immediately showed in alert state in System Overview.

Version-Release number of selected component (if applicable):
RHN Satellite 5.1.0

How reproducible:
1. Create a probe suite.
2. Add Linux: CPU Usage to that probe suite.
3. Add a system to that probe suite.
4. Push scout configs.
5. Load the system at 100%.
6. Reduce load to normal.

Actual results:
Alert shows the system is at high load even when the load has been reduced.

Expected results:
Icon will change when the alert condition is cleared.

Additional info:
Perl code is in:

    trunk/eng/monitoring/PerlModules/NP/Probe/DataSource/UnixCommand.pm

Test routine to run:

    sub cpu {
        my $command = '/usr/bin/vmstat 5 2';
        my $results = `$command`;
        print "results: $results";

        # Line 3 is the second sample, i.e. the 5-second average.
        my @lines = split("\n", $results);
        my @out;
        @out = split(' ', $lines[3]);
        my $cpu_pct_used;
        foreach my $o (@out) {
            print $o . "|";
        }

        # Newer vmstat ends the header with "st" (us sy id wa st),
        # so us and sy are the 5th and 4th fields from the end.
        if ($lines[1] =~ /.*st$/) {
            print "out[-4]: " . $out[-4] . " out[-5]: " . $out[-5] . "\n";
            $cpu_pct_used = $out[-4] + $out[-5];
        } else {
            print "out[-2]: " . $out[-2] . " out[-3]: " . $out[-3] . "\n";
            $cpu_pct_used = $out[-2] + $out[-3];
        }

        # NOTE: this unconditional assignment overwrites the value chosen
        # above. With an "st" column present, $out[-2] + $out[-3] is
        # wa + id, not us + sy.
        $cpu_pct_used = $out[-2] + $out[-3];
        return $cpu_pct_used;
    }

    my $pct = cpu();
    print "CPU: $pct\n";
    1;
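The routine above picks the CPU fields by negative index, so the result depends on how many columns the local vmstat prints. As a minimal sketch of one alternative (not the code in UnixCommand.pm), the fields could be looked up by header name instead. The helper name cpu_pct_used_from_vmstat is hypothetical, and the sketch assumes the second line of output is the column header and the last line is the sample of interest. The sample text is the vmstat output quoted in the description.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of UnixCommand.pm: compute us + sy from the
# text printed by "vmstat 5 2", locating the columns by header name rather
# than by position from the end of the line.
sub cpu_pct_used_from_vmstat {
    my ($results) = @_;
    my @lines = grep { /\S/ } split /\n/, $results;
    return if @lines < 4;                  # banner, header, two samples

    my @names  = split ' ', $lines[1];     # r b swpd ... us sy id wa [st]
    my @values = split ' ', $lines[-1];    # last sample (5-second average)
    return if @names != @values;           # columns ran together; give up

    my %col;
    @col{@names} = @values;                # map header name => value
    return unless defined $col{us} && defined $col{sy};
    return $col{us} + $col{sy};
}

# Sample output copied from the description (idle guest, "st" column present).
my $sample = <<'END';
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0     40  49144  20372 225860    0    0    64    37   60  150  2  4 91  2  1
 0  0     40  49144  20380 225860    0    0     0    19   28   53  0  0 100  0  0
END

my $pct = cpu_pct_used_from_vmstat($sample);
print "CPU: ", (defined $pct ? $pct : 'unknown'), "\n";
```

Returning nothing when the header and value counts differ is a deliberate choice in this sketch: a probe that reports "unknown" is easier to notice than one that silently adds the wrong two columns.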
We need to understand whether the client is sending the wrong information or whether, as I suspect, this is a display issue in the UI. I would suggest that Tomas try to reproduce it. I am not sure he will be able to track it down fully, but it is an interesting (to me) bug to chase.

Cliff
The client sends the correct information (verified using rhn-runprobe). The UI also displays the correct information.

The only thing that could be confusing is the next probe schedule. When the probe status is not OK (e.g. when a threshold is reached), a longer delay is set before the next probe run.

Behaviour is correct in: Satellite-5.3.0-RHEL5-re20090403.2

Example: logged probe events (with "Probe Check Interval" set to 1 minute), available in the web UI in CSV format:

    Id            Data  Time               Metric
    1-43-pctused  0     04/17/09 03:37 PM  pctused
    1-43-pctused  0     04/17/09 03:38 PM  pctused
    1-43-pctused  100   04/17/09 03:39 PM  pctused
    1-43-pctused  0     04/17/09 03:44 PM  pctused
    1-43-pctused  0     04/17/09 03:46 PM  pctused
    1-43-pctused  0     04/17/09 03:47 PM  pctused
    1-43-pctused  1     04/17/09 03:48 PM  pctused
    1-43-pctused  0     04/17/09 03:49 PM  pctused
    1-43-pctused  1     04/17/09 03:50 PM  pctused
    1-43-pctused  6     04/17/09 03:51 PM  pctused
    1-43-pctused  1     04/17/09 03:53 PM  pctused
    1-43-pctused  3     04/17/09 03:54 PM  pctused
    1-43-pctused  100   04/17/09 03:55 PM  pctused
    1-43-pctused  3     04/17/09 04:01 PM  pctused
    1-43-pctused  4     04/17/09 04:02 PM  pctused
    1-43-pctused  1     04/17/09 04:03 PM  pctused
    1-43-pctused  3     04/17/09 04:04 PM  pctused
    1-43-pctused  3     04/17/09 04:05 PM  pctused
    1-43-pctused  2     04/17/09 04:06 PM  pctused
    1-43-pctused  5     04/17/09 04:08 PM  pctused
    1-43-pctused  100   04/17/09 04:09 PM  pctused
    1-43-pctused  4     04/17/09 04:14 PM  pctused

I set the probe thresholds according to the description (10% warn and 30% critical).
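To make the longer delay after a not-OK reading easier to see in the log above, here is a small sketch that computes the gap between each logged sample and the next one, split by whether the reading was at or above the 10% warn threshold. The helper names minutes_of_day and gaps_by_state are hypothetical; the times and values are copied from the log, and the sketch assumes all samples fall on the same day.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a 12-hour clock time such as "03:37 PM" to minutes since midnight.
sub minutes_of_day {
    my ($clock) = @_;
    my ($h, $m, $ampm) = $clock =~ /^(\d+):(\d+)\s+(AM|PM)$/i
        or die "unrecognised time: $clock";
    $h %= 12;
    $h += 12 if uc($ampm) eq 'PM';
    return $h * 60 + $m;
}

# For each sample except the last, measure the minutes until the next sample
# and file the gap under "not OK" (reading >= $warn) or "OK".
sub gaps_by_state {
    my ($warn, @samples) = @_;
    my (@after_not_ok, @after_ok);
    for my $i (0 .. $#samples - 1) {
        my ($clock, $pct) = @{ $samples[$i] };
        my $gap = minutes_of_day($samples[$i + 1][0]) - minutes_of_day($clock);
        if ($pct >= $warn) { push @after_not_ok, $gap }
        else               { push @after_ok,     $gap }
    }
    return (\@after_not_ok, \@after_ok);
}

# [time, pctused] pairs copied from the CSV log (all dated 04/17/09).
my @samples = (
    ['03:37 PM', 0],   ['03:38 PM', 0],   ['03:39 PM', 100], ['03:44 PM', 0],
    ['03:46 PM', 0],   ['03:47 PM', 0],   ['03:48 PM', 1],   ['03:49 PM', 0],
    ['03:50 PM', 1],   ['03:51 PM', 6],   ['03:53 PM', 1],   ['03:54 PM', 3],
    ['03:55 PM', 100], ['04:01 PM', 3],   ['04:02 PM', 4],   ['04:03 PM', 1],
    ['04:04 PM', 3],   ['04:05 PM', 3],   ['04:06 PM', 2],   ['04:08 PM', 5],
    ['04:09 PM', 100], ['04:14 PM', 4],
);

my ($after_not_ok, $after_ok) = gaps_by_state(10, @samples);
print "minutes until next sample after a not-OK reading: @$after_not_ok\n";
print "minutes until next sample after an OK reading:    @$after_ok\n";
```

On this log the three 100% readings are followed by gaps of 5, 6 and 5 minutes, while every OK reading is followed by the next sample within 1 or 2 minutes.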