Description of problem:

time:           Mon 22 Feb 2016 11:22:26 PM EST
cmdline:        /usr/bin/python /usr/lib64/nagios/plugins/gluster/network.py -e lo -e ;vdsmdummy; -t 2
uid:            993 (nrpe)
abrt_version:   2.1.11
comment:
event_log:
executable:     /usr/lib64/nagios/plugins/gluster/network.py
kernel:         3.10.0-229.20.1.el7.x86_64
last_occurrence: 1460744654
pid:            31747
pkg_arch:       x86_64
pkg_epoch:      0
pkg_name:       gluster-nagios-addons
pkg_release:    1.el7rhgs
pkg_version:    0.2.5
runlevel:       N 3
username:       nrpe

dead.letter:    Text file, 759349 bytes
sosreport.tar.xz: Binary file, 39280640 bytes

backtrace:
:network.py:84:_getStatMessage:TypeError: 'NoneType' object has no attribute '__getitem__'
:
:Traceback (most recent call last):
:  File "/usr/lib64/nagios/plugins/gluster/network.py", line 138, in <module>
:    main()
:  File "/usr/lib64/nagios/plugins/gluster/network.py", line 126, in main
:    excludes=args.exclude)
:  File "/usr/lib64/nagios/plugins/gluster/network.py", line 84, in _getStatMessage
:    for info in stat['network']['net-dev']:
:TypeError: 'NoneType' object has no attribute '__getitem__'
:
:Local variables in innermost frame:
:excludes: ['lo', ';vdsmdummy;']
:all: False
:excludeList: ['lo', 'lo', ';vdsmdummy;']
:interfaces: {'ovirtmgmt': {'flags': 4163, 'ipaddr': '10.10.117.197'}, 'lo': {'flags': 73, 'ipaddr': '127.0.0.1'}, 'eth0': {'flags': None, 'ipaddr': None}}
:stat: None
:includes: None
:perfLines: []
:rc: 'OK'
:devNames: []
environ:
:LANG=en_US.UTF-8
:NRPE_SSL_OPT=
:SHLVL=1
:NRPE_MULTILINESUPPORT=1
:PWD=/
:LOGNAME=nrpe
:USER=nrpe
:HOME=/var/run/nrpe
:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
:NRPE_PROGRAMVERSION=2.15
:_=/usr/lib64/nagios/plugins/gluster/network.py
machineid:
:systemd=ef167f21fc94402c8cbb584b5d00dbdb
:sosreport_uploader-dmidecode=e08687cda8f27b9797b3adc3009cec982ff9e7a250d15f82041ed8ac9e06d71f
reported_to:
:uReport: BTHASH=b753aebaffb52d6a8639b09b1554846367b6c0d0
:ABRT Server: URL=https://api.access.redhat.com/rs/telemetry/abrt/reports/bthash/b753aebaffb52d6a8639b09b1554846367b6c0d0

Version-Release number of selected component (if applicable):

Installed Gluster Packages
--------------------------
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64          Thu Oct 15 19:41:34 2015
gluster-nagios-common-0.2.2-1.el7rhgs.noarch          Thu Oct 15 19:40:35 2015
glusterfs-3.7.1-16.el7rhgs.x86_64                     Thu Oct 15 19:42:03 2015
glusterfs-api-3.7.1-16.el7rhgs.x86_64                 Thu Oct 15 19:42:03 2015
glusterfs-cli-3.7.1-16.el7rhgs.x86_64                 Thu Oct 15 19:42:04 2015
glusterfs-client-xlators-3.7.1-16.el7rhgs.x86_64      Thu Oct 15 19:42:03 2015
glusterfs-fuse-3.7.1-16.el7rhgs.x86_64                Thu Oct 15 19:42:04 2015
glusterfs-ganesha-3.7.1-16.el7rhgs.x86_64             Thu Oct 15 19:42:25 2015
glusterfs-geo-replication-3.7.1-16.el7rhgs.x86_64     Thu Oct 15 19:42:06 2015
glusterfs-libs-3.7.1-16.el7rhgs.x86_64                Thu Oct 15 19:42:03 2015
glusterfs-rdma-3.7.1-16.el7rhgs.x86_64                Thu Oct 15 19:42:07 2015
glusterfs-server-3.7.1-16.el7rhgs.x86_64              Thu Oct 15 19:42:04 2015
nfs-ganesha-gluster-2.2.0-9.el7rhgs.x86_64            Thu Oct 15 19:42:06 2015
python-gluster-3.7.1-16.el7rhgs.x86_64                Thu Oct 15 19:40:45 2015
vdsm-gluster-4.16.20-1.3.el7rhgs.noarch               Thu Oct 15 19:42:20 2015

How reproducible:
N/A

Steps to Reproduce:
1. N/A
2.
3.

Actual results:
Customer saw that the node was unresponsive and had to reboot this cluster node.

Expected results:
The nagios plugin should not crash and cause the node to be unresponsive.

Additional info:
The customer's answer regarding the unresponsiveness of the node: Unresponsive = not accessible by SSH nor Virtual Machine console = not working = what the heck happened here?
The customer has corrected my understanding. By unresponsive, it means not accessible by SSH or Virtual Machine console or the FUSE/Gluster clients. So mainly the node is unresponsive. And things are back to normal after the reboot of the node.
I am not sure how this nagios plugin crash is related to the node unresponsiveness. network.py uses the sadf command to get the network statistics (network statistics are collected and stored periodically by sar). It looks like the sadf command is not returning any valid data. I feel there is something wrong with the network setup, and the reason for the node unresponsiveness is something else.
Answers from the customer: Exactly, that is what I am telling you. You are seeing the same thing I saw: the Gluster client was not able to connect to the Gluster servers, and the failover never happened. We have not changed anything related to the Gluster server configuration, not even related to the network. In any case I have checked the configuration files and they look fine to me. On top of that, on the day this problem happened I also asked the network team about any ongoing issue related to the network; they checked and nothing weird was happening. A lot of servers are in the same network, and none except this one was failing. If you say there is something wrong with the network setup, please help me to check that part; you already have the sosreport. Let me know if you need additional files to dig into this.
We can analyze the nagios plugin crash if we have the following info: How frequently is this crash happening? What are the steps to reproduce? Regards, Ramesh
I understood the reason for this crash. We always query the statistics for the last one minute. If no network statistics were collected during the last minute, then this can happen. Especially in this case, since the VMs were not functioning properly, there may not have been any statistics collected during the last minute, and that can lead to the crash. We could add a check before processing the performance data so that we do not crash anymore.
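The check proposed above could look roughly like the sketch below. It is only an illustration, not the code that shipped in the fix: get_stat_message is a hypothetical stand-in for _getStatMessage, and the 'iface', 'rxpck' and 'txpck' keys are assumed names for the fields parsed from the sadf output. The exit codes follow the usual Nagios plugin convention (0 = OK, 3 = UNKNOWN).

```python
import logging

# Standard Nagios plugin exit codes.
STATUS_OK = 0
STATUS_UNKNOWN = 3


def get_stat_message(stat, interval=1):
    """Hypothetical stand-in for _getStatMessage in network.py.

    `stat` mimics the parsed sadf output. It is None (or lacks the
    'net-dev' list) when sar collected no sample in the interval.
    """
    # Guard: tolerate a missing result instead of indexing into None.
    net_dev = ((stat or {}).get('network') or {}).get('net-dev')
    if not net_dev:
        logging.error(
            "unable to get network status for the given interval: %s",
            interval)
        return STATUS_UNKNOWN, "UNKNOWN"

    # Build the performance data only when samples are present.
    # The key names below are assumptions made for this sketch.
    perf = []
    for info in net_dev:
        perf.append("%s.rxpck=%s %s.txpck=%s" % (
            info['iface'], info['rxpck'], info['iface'], info['txpck']))
    return STATUS_OK, "OK |" + " ".join(perf)
```

With this guard, get_stat_message(None) returns the UNKNOWN status instead of raising a TypeError, so NRPE still gets readable plugin output when no statistics are available.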
Hello, Question from the customer: Can you please expand on "since the VMs were not functioning properly"?
(In reply to Oonkwee Lim_ from comment #17)
> Hello,
>
> Question from the customer:
>
> Can you please expand "since the VMs were not functioning properly"?

Please check your comment in comment 12 https://bugzilla.redhat.com/show_bug.cgi?id=1328191#c12.
Moving this out to 3.2.0 - Please note that the gluster nagios plugin crash is not responsible for the host/VM unresponsiveness. But the nagios plugin crash is due to data not being available in network statistics.
Steps to reproduce:
==================
1. On a system where nagios is set up and is functioning well, run the command '/usr/lib64/nagios/plugins/gluster/network.py' and observe the output (this is the command Nagios runs internally before displaying the output on the web UI).
2. Stop the crond service using 'systemctl stop crond.service'.
3. Verify using the command 'sadf -x -- -n DEV 1' the statistics collected and the time at which they were collected.
4. Wait for a complete minute (60 seconds), run the command '/usr/lib64/nagios/plugins/gluster/network.py' again and observe that it fails with a traceback. Also observe the message in /var/log/messages.

Reproduced this on the build gluster-nagios-addons-0.2.7-1.el7rhgs.x86_64. Hit the traceback. Saw the messages in the Nagios web UI going to UNKNOWN and Network Utilization go to WARNING with the status: "NRPE: Unable to read output. "

Updated the package to gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64 and was _not_ able to see a traceback. The services including Network Utilization in the Nagios web UI show UNKNOWN.

Pasting the detailed output below. Moving this BZ to verified in 3.2.

gluster-nagios-addons 0.2.7-1
=============================

/var/log/messages snippet

12755 Nov 21 16:21:12 dhcp46-239 systemd: Stopped Command Scheduler.
12756 Nov 21 16:22:08 dhcp46-239 python: detected unhandled Python exception in '/usr/lib64/nagios/plugins/gluster/network.py'
12757 Nov 21 16:22:13 dhcp46-239 python: communication with ABRT daemon failed: timed out
12758 Nov 21 16:22:21 dhcp46-239 kernel: nr_pdflush_threads exported in /proc is scheduled for removal

[root@dhcp46-239 ~]# /usr/lib64/nagios/plugins/gluster/network.py
Traceback (most recent call last):
  File "/usr/lib64/nagios/plugins/gluster/network.py", line 138, in <module>
    main()
  File "/usr/lib64/nagios/plugins/gluster/network.py", line 126, in main
    excludes=args.exclude)
  File "/usr/lib64/nagios/plugins/gluster/network.py", line 84, in _getStatMessage
    for info in stat['network']['net-dev']:
TypeError: 'NoneType' object has no attribute '__getitem__'
[root@dhcp46-239 ~]#

gluster-nagios-addons 0.2.8-1
===============================

[root@dhcp46-239 ~]# /usr/lib64/nagios/plugins/gluster/network.py
ERROR:root:unable to get network status for the given interval: 1
UNKNOWN
[root@dhcp46-239 ~]#

After starting crond service

[root@dhcp46-239 ~]# /usr/lib64/nagios/plugins/gluster/network.py
OK: ens3:UP |ens3.rxpck=32.80 ens3.txpck=25.16 ens3.rxkB=3.22 ens3.txkB=7.57
[root@dhcp46-239 ~]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0491.html