Description of problem:
When Memory Utilization reaches CRITICAL, all the other services also go to CRITICAL, and Status Information shows "CHECK_NRPE: Socket timeout after 10 seconds."

----------
[1395399281] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399401] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;2;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399511] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;OK;SOFT;3;CPU Status OK: Total CPU:0.19% Idle CPU:99.81%
[1395399821] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399831] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1395399831] SERVICE EVENT HANDLER: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;CRITICAL;SOFT;1;host_service_handler
[1395399831] Warning: Attempting to execute the command "/usr/lib64/nagios/plugins/gluster_host_service_handler.py -s CRITICAL -t SOFT -a 1 -l 10.70.36.52 -n Disk Utilization" resulted in a return code of 127. Make sure the script or binary you are trying to execute actually exists...
[1395399931] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Cpu Utilization;OK;SOFT;2;CPU Status OK: Total CPU:0.19% Idle CPU:99.81%
[1395399941] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;OK;SOFT;2;OK : disks:mounts:(/dev/sdb1:/boot,lv_root:/,lv_home:/home)
[1395399941] SERVICE EVENT HANDLER: rhs-client28.lab.eng.blr.redhat.com;Disk Utilization;OK;SOFT;2;host_service_handler
[1395399941] Warning: Attempting to execute the command "/usr/lib64/nagios/plugins/gluster_host_service_handler.py -s OK -t SOFT -a 2 -l 10.70.36.52 -n Disk Utilization" resulted in a return code of 127. Make sure the script or binary you are trying to execute actually exists...
------------

Version-Release number of selected component (if applicable):
gluster-nagios-1.1-1.noarch.rpm
gluster-nrpe-1.1-1.x86_64.rpm

How reproducible:
Always

Steps to Reproduce:
1. Stress the memory of any node so that it reaches the CRITICAL state. Ex: # stress --vm 3 --vm-bytes 7G -v
2. Check the service status; Memory Utilization should now show as CRITICAL.
3. Look at the status of the other services: some or all of them also show as CRITICAL, with the following in Status Information: "CHECK_NRPE: Socket timeout after 10 seconds"

Actual results:
When memory reaches CRITICAL and NRPE is therefore unable to fetch data, the status of all the other services also changes to CRITICAL, which shouldn't be the case.

Expected results:
I think the other services should not be shown as CRITICAL unless that state is valid for them.

Additional info:
Logs will be attached.
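For triage, it may help to separate CRITICAL alerts caused by an NRPE socket timeout from alerts where a plugin actually reported a critical value. The following is a minimal sketch in Python, not part of gluster-nagios. It assumes the SERVICE ALERT line layout seen in the log excerpt in the description (host;service;state;state type;attempt;plugin output), and the function names and the SAMPLE list are hypothetical, introduced only for illustration.

```python
import re
from collections import defaultdict

# Assumed line layout, inferred from the log excerpt in this report:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)

# Prefix of the status text that check_nrpe reports in the excerpt.
TIMEOUT_MARKER = "CHECK_NRPE: Socket timeout"


def parse_alert(line):
    """Return a dict for a SERVICE ALERT line, or None for any other line."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    alert = match.groupdict()
    alert["ts"] = int(alert["ts"])
    alert["attempt"] = int(alert["attempt"])
    return alert


def is_nrpe_timeout(alert):
    """True when a CRITICAL alert carries the NRPE timeout text."""
    return (alert["state"] == "CRITICAL"
            and alert["output"].startswith(TIMEOUT_MARKER))


def timeouts_by_service(lines):
    """Count NRPE-timeout alerts per (host, service) pair."""
    counts = defaultdict(int)
    for line in lines:
        alert = parse_alert(line)
        if alert is not None and is_nrpe_timeout(alert):
            counts[(alert["host"], alert["service"])] += 1
    return dict(counts)


# A few lines copied from the excerpt in the description.
SAMPLE = [
    "[1395399281] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;"
    "Cpu Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.",
    "[1395399511] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;"
    "Cpu Utilization;OK;SOFT;3;CPU Status OK: Total CPU:0.19% Idle CPU:99.81%",
    "[1395399831] SERVICE ALERT: rhs-client28.lab.eng.blr.redhat.com;"
    "Disk Utilization;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.",
    "[1395399831] SERVICE EVENT HANDLER: rhs-client28.lab.eng.blr.redhat.com;"
    "Disk Utilization;CRITICAL;SOFT;1;host_service_handler",
]

if __name__ == "__main__":
    for (host, service), count in sorted(timeouts_by_service(SAMPLE).items()):
        print(f"{host} {service}: {count} NRPE timeout alert(s)")
```

Services that appear in this count went CRITICAL because the check could not reach NRPE in time, not because the plugin measured a critical value, which is the distinction this report is asking about.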
Created attachment 877232 [details] sosreport from nagios server
Prasanth will revisit and update.
Dev ack to 3.0 RHS BZs
Discussion on this bug:
Alok: Memory state critical shouldn't cause other unrelated services to go to critical.
Dusmant: Alok, I would agree with you theoretically. I will put the bug for RHS 3.0, but I am not sure if it can be fixed for a real crunch situation. The dev team will work on it, and if we run into a limitation, we will get back to you.
As discussed on 29-May-2014: This is a resource crunch issue and cannot be avoided as such. Hence it has been removed from the list. We should document this behaviour.
Please review and sign off on the edited doc text.
Can you make a small change as follows: When the memory utilization is very high, some or all services may go to critical state and display the message "CHECK_NRPE: Socket timeout after 10 seconds", because of lack of memory.
Based on Comment 6, closing this. Please reopen if you think otherwise.