632204 – monitoring probes doesn't work, Monitoring command did not complete

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Satellite 5
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	540
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	---
Assignee:	Miroslav Suchý
QA Contact:	Red Hat Satellite QA List
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	sat540-blockers 625708

Reported:	2010-09-09 12:10 UTC by Petr Sklenar
Modified:	2010-09-15 11:25 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-09-15 11:25:07 UTC
Target Upstream Version:
Embargoed:


Description Petr Sklenar 2010-09-09 12:10:34 UTC

Description of problem:
Monitoring probes don't work. Every monitoring probe is inactive. The error is "Monitoring command did not complete".

Version-Release number of selected component (if applicable):
sat540 = Satellite-5.4.0-RHEL5-re20100903.1
client = rhnmd-5.3.0-5.el5sat

How reproducible:
always

Steps to Reproduce:
1. set up monitoring on sat540 + client
# iptables, selinux off; ssh via nocpulse user works

2. set up some of the probes
3. push scout config
  
Actual results:


State|Probe Description|Status String|Type

Probe(s) assigned to system have an UNKNOWN status|Linux: CPU Usage|Monitoring command did not complete within 15 seconds

Probe(s) assigned to system have an UNKNOWN status|Linux: Load|Monitoring command did not complete within 15 seconds

Probe(s) assigned to system have a CRITICAL status|Network Services: SSH|SSH port 22: connect: timeout

Expected results:


Additional info:
I set up ssh keys and restarted rhnmd. I can connect to the client from the satellite with this command:

[root@SATELLITE_FQDN ~]# /usr/bin/ssh -l nocpulse -p 4545 -i /var/lib/nocpulse/.ssh/nocpulse-identity -o StrictHostKeyChecking=no -o BatchMode=yes <IP.CLIENT.IP.CLIENT> "cd ~;pwd;hostname"
/var/lib/nocpulse
<HOSTNAME_OF_CLIENT>
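
The manual check above can be scripted. The following is only a sketch: the function names are hypothetical, the options are copied from the command in this report, and the 15-second limit is borrowed from the probe status string rather than from any known scout setting. Note that it connects to whatever address is passed in by hand, so it does not show which address the satellite has stored for the client.

```python
import shlex
import subprocess


def build_probe_ssh_command(client, remote_command,
                            user="nocpulse", port=4545,
                            identity="/var/lib/nocpulse/.ssh/nocpulse-identity"):
    """Return the argv list for the ssh check shown in the report."""
    return [
        "/usr/bin/ssh",
        "-l", user,
        "-p", str(port),
        "-i", identity,
        "-o", "StrictHostKeyChecking=no",
        "-o", "BatchMode=yes",
        client,
        remote_command,
    ]


def run_check(client, remote_command="cd ~;pwd;hostname", timeout=15):
    """Run the check; raises subprocess.TimeoutExpired after `timeout` seconds."""
    argv = build_probe_ssh_command(client, remote_command)
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)


if __name__ == "__main__":
    # Print the command line only; client.example.com is a placeholder.
    print(shlex.join(build_probe_ssh_command("client.example.com",
                                             "cd ~;pwd;hostname")))
```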

Comment 1 Clifford Perry 2010-09-09 13:36:50 UTC

Petr - can you provide login details for the monitoring system you're using? That may help to quickly figure out why it is not working.

Cliff

Comment 3 Miroslav Suchý 2010-09-10 10:14:13 UTC

notes for myself, client is ufo-3.brq.redhat.com

Comment 4 Miroslav Suchý 2010-09-13 13:05:58 UTC

notes for myself:
It seems that the scout is trying to connect to the wrong IP. I created a Network:Ping probe,
and with the default value the scout tries to connect to 10.34.28.111, which is not the IP of ufo-3. If I fill in the optional IP address parameter, it starts working.
So far I have tracked it down to the PL/SQL function rhn_server.get_ip_address, which returns the wrong result:
SQL> select rhn_server.get_ip_address(1000010004) from dual;

RHN_SERVER.GET_IP_ADDRESS(1000010004)
--------------------------------------------------------------------------------
10.34.28.111

The correct IP is:
# host ufo-3.brq.redhat.com
ufo-3.brq.redhat.com has address 10.34.26.49

SQL> select  ipaddr ip_addr from    rhnServerNetwork where   server_id = 1000010004 and ipaddr != '127.0.0.1';

IP_ADDR
----------------
10.34.28.111

So the question is why the rhnServerNetwork table holds the wrong IP address. I even tried running rhn-profile-sync, but the entry stayed unchanged.

Comment 5 Miroslav Suchý 2010-09-14 19:30:25 UTC

I tested whether this is a regression against sat530, suspecting that it happens when the client changes its IP address.
I had a hard time changing the IP address on a single machine, so I did the following:

I registered machine A to satellite X. On machine B I changed serverURL to satellite X and copied the systemid from machine A to machine B. This should closely simulate a change of IP address (and even hostname).

In more detail:
1. register machine A to satellite X
2. create ping probe
3. push scout config
<---- probe is ok here
4. on machine B change server URL and copy systemid from machine A
5. suspend machine A
<----- probe shows 100% packet loss
6. run rhn_check on machine B
<----- still 100% packet loss, but in SDC I see checkins
7. run rhn-profile-sync
<----- still 100% packet loss, still trying old IP
8. push scout config
<----- probe is ok here, trying new IP. 

I tried these setups:
A:RHEL5.1, B:RHEL6, X: sat530
B:RHEL6, A:RHEL5.1  X: sat530
A:RHEL5.1, B:RHEL6, X: sat540

The behavior was the same on sat530 and sat540.
That is surprising, because on ufo-3 even running rhn-profile-sync followed by a scout config push on the satellite still does not update that IP.
The only difference I see so far is that ufo-3 is RHEL5.5.


Hmm

Comment 6 Miroslav Suchý 2010-09-14 20:00:41 UTC

I just tested it with RHEL5.5 and it still works as I described, i.e. after rhn-profile-sync and a scout config push it works as expected (with a small glitch, which I reported as BZ 633975).
So something has to be special about the ufo-3 machine.

Comment 7 Miroslav Suchý 2010-09-14 20:09:07 UTC

Oh crap, on those RHEL5.1 and 5.5 machines I had packages from the spacewalk client repo.
I will have to retest it tomorrow with 5.5 without those spacewalk clients.

Comment 8 Miroslav Suchý 2010-09-15 07:49:16 UTC

OK, so even with a proper RHEL 5.5 (without client tools from the spacewalk repo) it works on both sat530 and sat540.
So I would say that this bug happens only on ufo-3. Why is unknown to me right now.

Comment 9 Miroslav Suchý 2010-09-15 10:12:52 UTC

I found that it is the client (ufo-3) that sends the wrong IP address. The backend stores it correctly. The problem is the incorrect network setup of ufo-3:

[root@ufo-3 ~]# python
Python 2.4.3 (#1, Jun 11 2009, 14:09:37)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from socket import gethostname
>>> from socket import gethostbyname
>>> gethostname()
'ufo-3'
>>> gethostbyname(gethostname())
'10.34.28.111'
>>>
[root@ufo-3 ~]# hostname
ufo-3
[root@ufo-3 ~]# host ufo-3
ufo-3.brq.redhat.com has address 10.34.26.49

I'm closing this as NOTABUG
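
A client-side check for this kind of mismatch might look like the sketch below. It is an illustration, not part of the client tools: the function names are hypothetical, and it compares the gethostbyname(gethostname()) result from the session above with the source address the kernel would pick for traffic towards a peer such as the satellite. Connecting a UDP socket sends no packets; it only selects a route, so nothing needs to be listening on the peer. The port number is an arbitrary assumption.

```python
import socket
import sys


def hostname_lookup_ip():
    """Address the host resolves its own name to (may come from /etc/hosts)."""
    return socket.gethostbyname(socket.gethostname())


def outbound_ip(peer_host, peer_port=443):
    """Source address the kernel would use to reach peer_host.

    connect() on a UDP socket only fixes the route and local address;
    no datagram is sent.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((peer_host, peer_port))
        return sock.getsockname()[0]


def describe_mismatch(lookup_ip, routed_ip):
    """Return a warning string if the two addresses differ, else None."""
    if lookup_ip == routed_ip:
        return None
    return ("hostname resolves to %s but traffic leaves from %s; "
            "check /etc/hosts" % (lookup_ip, routed_ip))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python check_ip.py <satellite hostname>
    warning = describe_mismatch(hostname_lookup_ip(), outbound_ip(sys.argv[1]))
    print(warning or "addresses agree")
```

With the values seen on ufo-3, describe_mismatch("10.34.28.111", "10.34.26.49") would return the warning, while equal addresses return None.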

Comment 10 Miroslav Suchý 2010-09-15 10:18:50 UTC

It is caused by:
[root@ufo-3 ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
10.34.28.111            ufo-3
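
A stale entry like this one can be spotted mechanically. The sketch below is a hypothetical helper, not an existing tool: it parses hosts-file text and lists the non-loopback addresses mapped to a hostname that differ from the address DNS reports. The sample text is the file shown above, and the current address is the one returned by `host ufo-3` earlier in this report.

```python
import ipaddress

SAMPLE_HOSTS = """\
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
10.34.28.111            ufo-3
"""


def addresses_for(hosts_text, hostname):
    """Return the addresses that hosts_text maps to hostname, in file order."""
    found = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if hostname in names:
            found.append(address)
    return found


def stale_entries(hosts_text, hostname, current_ip):
    """Non-loopback addresses for hostname that differ from current_ip."""
    return [
        address
        for address in addresses_for(hosts_text, hostname)
        if address != current_ip
        and not ipaddress.ip_address(address).is_loopback
    ]


if __name__ == "__main__":
    print(stale_entries(SAMPLE_HOSTS, "ufo-3", "10.34.26.49"))
    # prints ['10.34.28.111']
```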

Comment 11 Petr Sklenar 2010-09-15 10:59:03 UTC

(In reply to comment #10)
That was the original IP of this machine; it seems that DHCP changed it. I filed a bug against python:

Bug 634147  - gethostbyname(gethostname()) is wrong when IP is changed

Comment 13 Miroslav Suchý 2010-09-15 11:25:07 UTC

This is nonsense: the client does not need to have a single port open, so there is no port we could try to connect to.
And working around someone else's bug or setup is nonsense.
