Bug 1079289 - When Memory Utilization reached CRITICAL, all the other services also goes to CRITICAL and Status Information shows "CHECK_NRPE: Socket timeout after 10 seconds."
Status: CLOSED CANTFIX
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nagios-server-addons
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Assigned To: Darshan
QA Contact: RHS-C QE
Depends On:
Blocks: 1087818
Reported: 2014-03-21 07:15 EDT by Prasanth
Modified: 2015-05-13 23:25 EDT

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
When the memory utilization is very high, some or all services go to critical state and display the message "CHECK_NRPE: Socket timeout after 10 seconds" based on the memory utilization.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-17 04:18:22 EST
Type: Bug

Attachments
sosreport from nagios server (10.39 MB, application/x-xz)
2014-03-21 07:21 EDT, Prasanth

Description Prasanth 2014-03-21 07:15:29 EDT
Description of problem:

When Memory Utilization reaches CRITICAL, all the other services also go to CRITICAL and Status Information shows "CHECK_NRPE: Socket timeout after 10 seconds."

----------
[1395399281] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399401] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;2;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399511] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;OK;SOFT;3;CPU Status OK: Total CPU:0.19% Idle CPU:99.81%
[1395399821] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399831] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399831] SERVICE EVENT HANDLER: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;CRITICAL;SOFT;1;host_service_handler
[1395399831] Warning: Attempting to execute the command "/usr/lib64/nagios/plugins/gluster_host_service_handler.py -s CRITICAL -t SOFT -a 1 -l 10.70.36.52 -n Disk Utilization" resulted in a return code of 127.  Make sure the script or binary you are trying to execute actually exists...
[1395399931] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;OK;SOFT;2;CPU Status OK: Total CPU:0.19% Idle CPU:99.81%
[1395399941] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;OK;SOFT;2;OK :  disks:mounts:(/dev/sdb1:/boot,lv_root:/,lv_home:/home)
[1395399941] SERVICE EVENT HANDLER: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;OK;SOFT;2;host_service_handler
[1395399941] Warning: Attempting to execute the command "/usr/lib64/nagios/plugins/gluster_host_service_handler.py -s OK -t SOFT -a 2 -l 10.70.36.52 -n Disk Utilization" resulted in a return code of 127.  Make sure the script or binary you are trying to execute actually exists...
------------
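
To see at a glance which services were affected, the SERVICE ALERT lines in an excerpt like the one above can be tallied. The following is only a rough sketch: the function name `nrpe_timeouts` and the regular expression are illustrative and are not part of gluster-nagios or Nagios itself, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches lines of the form seen in the log above:
# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def nrpe_timeouts(lines):
    """Count CRITICAL alerts per (host, service) caused by an NRPE socket timeout."""
    counts = Counter()
    for line in lines:
        m = ALERT_RE.match(line.strip())
        if not m:
            continue  # skip event handler lines, warnings, and anything else
        if m["state"] == "CRITICAL" and m["output"].startswith("CHECK_NRPE: Socket timeout"):
            counts[(m["host"], m["service"])] += 1
    return counts


# Sample lines copied from the excerpt above.
sample = [
    "[1395399281] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.",
    "[1395399401] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;2;CHECK_NRPE: Socket timeout after 10 seconds.",
    "[1395399511] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;OK;SOFT;3;CPU Status OK: Total CPU:0.19% Idle CPU:99.81%",
    "[1395399831] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.",
    "[1395399831] SERVICE EVENT HANDLER: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;CRITICAL;SOFT;1;host_service_handler",
]

counts = nrpe_timeouts(sample)
for (host, service), n in sorted(counts.items()):
    print(f"{host} {service}: {n} NRPE timeout(s)")
```

In this sample, both Cpu Utilization and Disk Utilization are flagged CRITICAL purely because of the timeout, not because of their own thresholds.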

Version-Release number of selected component (if applicable): 
gluster-nagios-1.1-1.noarch.rpm 
gluster-nrpe-1.1-1.x86_64.rpm

How reproducible: Always


Steps to Reproduce:
1. Stress the memory of any node so that it reaches the CRITICAL state.
Ex: # stress --vm 3 --vm-bytes 7G -v
2. Check the service status. Memory Utilization should now show as CRITICAL.
3. Now look at the status of the other services. Some or all of them also show as CRITICAL, with the following in the Status Information:
"CHECK_NRPE: Socket timeout after 10 seconds"

Actual results: When memory reaches CRITICAL and NRPE is therefore unable to fetch data, the status of all the other services also changes to CRITICAL, which shouldn't be the case.


Expected results: I think the other services should not be shown as CRITICAL unless that state is valid.
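
One possible way to express this expectation, purely as a sketch: treat an NRPE socket timeout as "result unavailable" rather than as a real CRITICAL reading. The helper `effective_state` below is hypothetical and is not how the gluster-nagios plugins behave; it only uses the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Prefix of the status text reported in this bug.
TIMEOUT_PREFIX = "CHECK_NRPE: Socket timeout"


def effective_state(return_code, output):
    """Map a CRITICAL caused by an NRPE timeout to UNKNOWN.

    Hypothetical helper: the check never reached the plugin on the node,
    so the CRITICAL says nothing about the service being checked.
    """
    if return_code == CRITICAL and output.startswith(TIMEOUT_PREFIX):
        return UNKNOWN
    return return_code


# A timeout is reported as UNKNOWN; genuine results pass through unchanged.
print(effective_state(CRITICAL, "CHECK_NRPE: Socket timeout after 10 seconds."))
print(effective_state(OK, "CPU Status OK: Total CPU:0.19% Idle CPU:99.81%"))
```

This does not remove the underlying resource crunch; it only separates "could not check" from "checked and found critical".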


Additional info: Logs will be attached.
Comment 1 Prasanth 2014-03-21 07:21:56 EDT
Created attachment 877232
sosreport from nagios server
Comment 3 Dusmant 2014-04-10 04:17:55 EDT
Prasanth will revisit and update.
Comment 4 Nagaprasad Sathyanarayana 2014-05-06 07:43:39 EDT
Dev ack to 3.0 RHS BZs
Comment 5 Dusmant 2014-05-12 06:04:50 EDT
Discussion on this bug:
Alok: A critical memory state shouldn't cause other, unrelated services to go to critical.
Dusmant: Alok, I would agree with you theoretically. I will target the bug for RHS 3.0, but I am not sure whether it can be fixed for a real resource-crunch situation. The dev team will work on it, and if we run into a limitation, we will get back to you.
Comment 6 Dusmant 2014-05-30 00:12:30 EDT
As discussed on 29-May-2014: This is a resource-crunch issue and cannot be avoided as such. Hence it has been removed from the list. We should document this behaviour.
Comment 8 Shalaka 2014-06-24 13:13:29 EDT
Please review and sign off on the edited doc text.
Comment 9 Darshan 2014-06-25 00:38:01 EDT
Can you make a small change as follows:
When the memory utilization is very high, some or all services may go to critical state and display the message "CHECK_NRPE: Socket timeout after 10 seconds", because of lack of memory.
Comment 10 Sahina Bose 2015-02-17 04:18:22 EST
Based on Comment 6, closing this.

Please reopen if you think otherwise.
