1079289 – When Memory Utilization reaches CRITICAL, all the other services also go to CRITICAL and Status Information shows "CHECK_NRPE: Socket timeout after 10 seconds."

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nagios-server-addons
Sub Component:
Version:	rhgs-3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Darshan
QA Contact:	RHS-C QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1087818

Reported:	2014-03-21 11:15 UTC by Prasanth
Modified:	2015-05-14 03:25 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	When the memory utilization is very high, some or all services go to critical state and display the message "CHECK_NRPE: Socket timeout after 10 seconds", depending on the memory utilization.
Clone Of:
Environment:
Last Closed:	2015-02-17 09:18:22 UTC
Embargoed:
Dependent Products:

Attachments
sosreport from nagios server (10.39 MB, application/x-xz) 2014-03-21 11:21 UTC, Prasanth

Description Prasanth 2014-03-21 11:15:29 UTC

Description of problem:

When Memory Utilization reaches CRITICAL, all the other services also go to CRITICAL, and Status Information shows "CHECK_NRPE: Socket timeout after 10 seconds."

----------
[1395399281] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399401] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;2;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399511] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;OK;SOFT;3;CPU Status OK: Total CPU:0.19% Idle CPU:99.81%
[1395399821] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399831] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399831] SERVICE EVENT HANDLER: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;CRITICAL;SOFT;1;host_service_handler
[1395399831] Warning: Attempting to execute the command "/usr/lib64/nagios/plugins/gluster_host_service_handler.py -s CRITICAL -t SOFT -a 1 -l 10.70.36.52 -n Disk Utilization" resulted in a return code of 127.  Make sure the script or binary you are trying to execute actually exists...
[1395399931] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;OK;SOFT;2;CPU Status OK: Total CPU:0.19% Idle CPU:99.81%
[1395399941] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;OK;SOFT;2;OK :  disks:mounts:(/dev/sdb1:/boot,lv_root:/,lv_home:/home)
[1395399941] SERVICE EVENT HANDLER: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;OK;SOFT;2;host_service_handler
[1395399941] Warning: Attempting to execute the command "/usr/lib64/nagios/plugins/gluster_host_service_handler.py -s OK -t SOFT -a 2 -l 10.70.36.52 -n Disk Utilization" resulted in a return code of 127.  Make sure the script or binary you are trying to execute actually exists...
------------
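
To separate alerts caused by NRPE timing out from alerts where a check actually returned a value, the log excerpt above can be filtered with a small script. This is a minimal sketch, assuming the `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output` layout seen in the excerpt; the names `parse_alert` and `timeouts_per_service` are hypothetical helpers, not part of Nagios or gluster-nagios.

```python
import re
from collections import Counter

# Layout assumed from the excerpt:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)

TIMEOUT_MARKER = "CHECK_NRPE: Socket timeout"


def parse_alert(line):
    """Return a dict for a SERVICE ALERT line, or None for any other line."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    alert = match.groupdict()
    alert["ts"] = int(alert["ts"])
    alert["attempt"] = int(alert["attempt"])
    # True when the alert text is the NRPE timeout message rather than
    # output produced by the check itself.
    alert["nrpe_timeout"] = alert["output"].startswith(TIMEOUT_MARKER)
    return alert


def timeouts_per_service(lines):
    """Count NRPE-timeout alerts per (host, service) pair."""
    counts = Counter()
    for line in lines:
        alert = parse_alert(line)
        if alert and alert["nrpe_timeout"]:
            counts[(alert["host"], alert["service"])] += 1
    return counts
```

Passing the lines of the Nagios log to `timeouts_per_service` gives a per-service count of timeout alerts, which makes it easier to see which services went CRITICAL only because no data came back.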

Version-Release number of selected component (if applicable): 
gluster-nagios-1.1-1.noarch.rpm 
gluster-nrpe-1.1-1.x86_64.rpm

How reproducible: Always


Steps to Reproduce:
1. Stress the memory of any node so that it reaches CRITICAL state. 
Ex: # stress --vm 3 --vm-bytes 7G -v
2. Check the service status; Memory Utilization should now show as CRITICAL.
3. Now look at the status of the other services. Some or all of them also show as CRITICAL, with the following in the Status Information:
"CHECK_NRPE: Socket timeout after 10 seconds"

Actual results: When memory reaches CRITICAL and NRPE is therefore unable to fetch data, the status of all the other services is also changed to CRITICAL, which shouldn't be the case.


Expected results: I think the other services should not be shown as CRITICAL unless that status is valid.
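
The expected behaviour could be approximated by treating an NRPE socket timeout as "no data" instead of as a failed check. The following is only an illustrative sketch of that idea, not how the gluster-nagios plugins behave; `reclassify` is a hypothetical helper, and the only facts assumed are the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the timeout message quoted in this report.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

TIMEOUT_MARKER = "CHECK_NRPE: Socket timeout"


def reclassify(return_code, output):
    """Hypothetical helper: report an NRPE socket timeout as UNKNOWN.

    Any other result is passed through unchanged, so a check that really
    measured a critical value still shows as CRITICAL.
    """
    if return_code == CRITICAL and output.startswith(TIMEOUT_MARKER):
        return UNKNOWN, "UNKNOWN: no data from NRPE (" + output + ")"
    return return_code, output
```

With this mapping, a service whose check could not run because the node was out of memory would show as UNKNOWN, while Memory Utilization itself would still be reported as CRITICAL when its check returns a value.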


Additional info: Logs will be attached.

Comment 1 Prasanth 2014-03-21 11:21:56 UTC

Created attachment 877232
sosreport from nagios server

Comment 3 Dusmant 2014-04-10 08:17:55 UTC

Prasanth will revisit and update.

Comment 4 Nagaprasad Sathyanarayana 2014-05-06 11:43:39 UTC

Dev ack to 3.0 RHS BZs

Comment 5 Dusmant 2014-05-12 10:04:50 UTC

Discussion on this bug:
Alok: Memory state critical shouldn't cause other unrelated services to go to critical.
Dusmant: Alok, I would agree with you theoretically. I will put the bug in for RHS 3.0, but I am not sure if it can be fixed for a real resource crunch situation. The dev team will work on it, and if we run into a limitation, we will get back to you.

Comment 6 Dusmant 2014-05-30 04:12:30 UTC

As discussed on 29-May-2014: This is a resource crunch issue and cannot be avoided as such, so it has been removed from the list. We should document this behaviour.

Comment 8 Shalaka 2014-06-24 17:13:29 UTC

Please review and sign off on the edited doc text.

Comment 9 Darshan 2014-06-25 04:38:01 UTC

Can you make a small change as follows:
When the memory utilization is very high, some or all services may go to critical state and display the message "CHECK_NRPE: Socket timeout after 10 seconds", because of lack of memory.

Comment 10 Sahina Bose 2015-02-17 09:18:22 UTC

Based on Comment 6, closing this.

Please reopen if you think otherwise.
