Bug 1328191 - [abrt] gluster-nagios-addons-0.2.5-1.el7rhgs: network.py:84:_getStatMessage:TypeError: 'NoneType' object has no attribute '__getitem__'
Summary: [abrt] gluster-nagios-addons-0.2.5-1.el7rhgs: network.py:84:_getStatMessage:T...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-nagios-addons
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Ramesh N
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On:
Blocks: 1351515
Reported: 2016-04-18 16:05 UTC by Oonkwee Lim
Modified: 2019-10-10 11:54 UTC (History)

Fixed In Version: gluster-nagios-addons-0.2.8-1.
Doc Type: Bug Fix
Doc Text:
Nagios expected and attempted to process performance data even when performance data had not yet been collected. This caused a crash in the Nagios plugin. This update ensures that Nagios checks whether data exists before attempting to process it, and displays an appropriate error message instead of crashing.
Clone Of:
Environment:
Last Closed: 2017-03-23 05:14:44 UTC
Embargoed:


Links
System: Red Hat Product Errata
ID: RHBA-2017:0491
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Gluster Storage 3.2.0 gluster-nagios-addon bug fix update
Last Updated: 2017-03-23 09:07:38 UTC

Description Oonkwee Lim 2016-04-18 16:05:49 UTC
Description of problem:
time:           Mon 22 Feb 2016 11:22:26 PM EST
cmdline:        /usr/bin/python /usr/lib64/nagios/plugins/gluster/network.py -e lo -e ;vdsmdummy; -t 2
uid:            993 (nrpe)
abrt_version:   2.1.11
comment:        
event_log:      
executable:     /usr/lib64/nagios/plugins/gluster/network.py
kernel:         3.10.0-229.20.1.el7.x86_64
last_occurrence: 1460744654
pid:            31747
pkg_arch:       x86_64
pkg_epoch:      0
pkg_name:       gluster-nagios-addons
pkg_release:    1.el7rhgs
pkg_version:    0.2.5
runlevel:       N 3
username:       nrpe

dead.letter:    Text file, 759349 bytes
sosreport.tar.xz: Binary file, 39280640 bytes

backtrace:
:network.py:84:_getStatMessage:TypeError: 'NoneType' object has no attribute '__getitem__'
:
:Traceback (most recent call last):
:  File "/usr/lib64/nagios/plugins/gluster/network.py", line 138, in <module>
:    main()
:  File "/usr/lib64/nagios/plugins/gluster/network.py", line 126, in main
:    excludes=args.exclude)
:  File "/usr/lib64/nagios/plugins/gluster/network.py", line 84, in _getStatMessage
:    for info in stat['network']['net-dev']:
:TypeError: 'NoneType' object has no attribute '__getitem__'
:
:Local variables in innermost frame:
:excludes: ['lo', ';vdsmdummy;']
:all: False
:excludeList: ['lo', 'lo', ';vdsmdummy;']
:interfaces: {'ovirtmgmt': {'flags': 4163, 'ipaddr': '10.10.117.197'}, 'lo': {'flags': 73, 'ipaddr': '127.0.0.1'}, 'eth0': {'flags': None, 'ipaddr': None}}
:stat: None
:includes: None
:perfLines: []
:rc: 'OK'
:devNames: []

environ:
:LANG=en_US.UTF-8
:NRPE_SSL_OPT=
:SHLVL=1
:NRPE_MULTILINESUPPORT=1
:PWD=/
:LOGNAME=nrpe
:USER=nrpe
:HOME=/var/run/nrpe
:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
:NRPE_PROGRAMVERSION=2.15
:_=/usr/lib64/nagios/plugins/gluster/network.py

machineid:
:systemd=ef167f21fc94402c8cbb584b5d00dbdb
:sosreport_uploader-dmidecode=e08687cda8f27b9797b3adc3009cec982ff9e7a250d15f82041ed8ac9e06d71f

reported_to:
:uReport: BTHASH=b753aebaffb52d6a8639b09b1554846367b6c0d0
:ABRT Server: URL=https://api.access.redhat.com/rs/telemetry/abrt/reports/bthash/b753aebaffb52d6a8639b09b1554846367b6c0d0

Version-Release number of selected component (if applicable):
Installed Gluster Packages
--------------------------
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64                Thu Oct 15 19:41:34 2015
gluster-nagios-common-0.2.2-1.el7rhgs.noarch                Thu Oct 15 19:40:35 2015
glusterfs-3.7.1-16.el7rhgs.x86_64                           Thu Oct 15 19:42:03 2015
glusterfs-api-3.7.1-16.el7rhgs.x86_64                       Thu Oct 15 19:42:03 2015
glusterfs-cli-3.7.1-16.el7rhgs.x86_64                       Thu Oct 15 19:42:04 2015
glusterfs-client-xlators-3.7.1-16.el7rhgs.x86_64            Thu Oct 15 19:42:03 2015
glusterfs-fuse-3.7.1-16.el7rhgs.x86_64                      Thu Oct 15 19:42:04 2015
glusterfs-ganesha-3.7.1-16.el7rhgs.x86_64                   Thu Oct 15 19:42:25 2015
glusterfs-geo-replication-3.7.1-16.el7rhgs.x86_64           Thu Oct 15 19:42:06 2015
glusterfs-libs-3.7.1-16.el7rhgs.x86_64                      Thu Oct 15 19:42:03 2015
glusterfs-rdma-3.7.1-16.el7rhgs.x86_64                      Thu Oct 15 19:42:07 2015
glusterfs-server-3.7.1-16.el7rhgs.x86_64                    Thu Oct 15 19:42:04 2015
nfs-ganesha-gluster-2.2.0-9.el7rhgs.x86_64                  Thu Oct 15 19:42:06 2015
python-gluster-3.7.1-16.el7rhgs.x86_64                      Thu Oct 15 19:40:45 2015
vdsm-gluster-4.16.20-1.3.el7rhgs.noarch                     Thu Oct 15 19:42:20 2015


How reproducible:
N/A

Steps to Reproduce:
1. N/A
2.
3.

Actual results:
Customer saw that the node was unresponsive and had to reboot this cluster node.

Expected results:
The nagios plugin should not crash and cause the node to be unresponsive.

Additional info:

Comment 3 Oonkwee Lim 2016-04-18 21:10:58 UTC
The customer's answer regarding the unresponsiveness of the node:

Unresponsive = not accessible by SSH nor Virtual Machine console = not working = what the heck happened here?

Comment 4 Oonkwee Lim 2016-04-19 05:52:11 UTC
The customer has corrected my understanding.

By unresponsive, it means not accessible by SSH or Virtual Machine console or the FUSE/Gluster clients.

So mainly the node is unresponsive.

And things are back to normal after the reboot of the node.

Comment 5 Ramesh N 2016-04-19 07:29:08 UTC
I am not sure how this nagios plugin crash is related to the node unresponsiveness. network.py uses the sadf command to get the network statistics (which are collected and stored periodically by sar). It looks like the sadf command is not returning any valid data. I think there is something wrong with the network setup, and the reason for the node unresponsiveness is something else.

Comment 6 Oonkwee Lim 2016-04-19 15:28:18 UTC
Answers from the customer:

Exactly, that is what I am telling you. You are seeing the same thing I saw, the Gluster client was not able to connect to the Gluster servers, and the failover never happened.

We have not changed anything related to the Gluster server configuration, not even related to the network. In any case, I have checked the configuration files and they look fine to me.

On top of that, on the day this problem happened I also asked the network team about any ongoing network issues. They checked and nothing unusual was happening. A lot of servers are in the same network, and none except this one was failing.

If you say there is something wrong with the network setup please help me to check that part, you already have the sosreport. Let me know if you need additional files to dig into this.

Comment 11 Ramesh N 2016-04-21 04:30:42 UTC
We can analyze the nagios plugin crash if we have the following info:

How frequently is this crash happening?
What are the steps to reproduce it?

Regards,
Ramesh

Comment 16 Ramesh N 2016-04-25 13:38:05 UTC
I understand the reason for this crash. We always query the statistics for the last one minute. If no network statistics were collected during that minute, this crash can happen. In this case in particular, since the VMs were not functioning properly, there may not have been any statistics collected during the last minute, which can lead to the crash. We could add a check before processing the performance data so that the plugin does not crash anymore.
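
A minimal sketch of the kind of check described above, not the actual patch. The function name `summarize_net_stats` and the per-device keys `iface`, `rxpck` and `txpck` are assumptions made for illustration; only the `stat['network']['net-dev']` lookup and the `UNKNOWN` result come from this report.

```python
import logging


def summarize_net_stats(stat, excludes=()):
    """Return (status, perf_lines) for a parsed statistics dict.

    Hypothetical helper: `stat` mimics the value indexed in the traceback
    (stat['network']['net-dev']) and is None when no sample was collected
    for the requested interval.
    """
    # Walk the nested structure defensively so that a missing level
    # yields None instead of raising TypeError.
    devices = ((stat or {}).get('network') or {}).get('net-dev')
    if not devices:
        # No data for the interval: report UNKNOWN instead of crashing.
        logging.error("unable to get network status for the given interval")
        return 'UNKNOWN', []

    perf_lines = []
    for info in devices:
        # Key names below are assumed for illustration only.
        if info['iface'] in excludes:
            continue
        perf_lines.append("%s.rxpck=%s %s.txpck=%s" % (
            info['iface'], info['rxpck'], info['iface'], info['txpck']))
    return 'OK', perf_lines
```

With this guard, a `stat` of `None` (the value shown in the local variables of the crash) produces an `UNKNOWN` status rather than a traceback.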

Comment 17 Oonkwee Lim 2016-04-25 16:44:51 UTC
Hello,

Question from the customer:

Can you please expand on "since the VMs were not functioning properly"?

Comment 18 Ramesh N 2016-04-25 18:00:40 UTC
(In reply to Oonkwee Lim from comment #17)
> Hello,
> 
> Question from the customer:
> 
> Can you please expand on "since the VMs were not functioning properly"?

Please see your comment 12: https://bugzilla.redhat.com/show_bug.cgi?id=1328191#c12.

Comment 19 Sahina Bose 2016-04-26 05:31:55 UTC
Moving this out to 3.2.0 - 
Please note that the gluster nagios plugin crash is not responsible for the host/VM unresponsiveness. But the nagios plugin crash is due to data not being available in network statistics.

Comment 24 Sweta Anandpara 2016-11-21 11:10:58 UTC
Steps to reproduce:
==================
1. On a system where nagios is set up and is functioning well, run the command '/usr/lib64/nagios/plugins/gluster/network.py' and observe the output
(This is the command Nagios runs internally before displaying the output on the web UI)
2. Stop crond service using 'systemctl stop crond.service'
3. Using the command 'sadf -x -- -n DEV 1', verify the statistics collected and the time at which they were collected.
4. Wait for a complete minute (60 seconds) and run the command '/usr/lib64/nagios/plugins/gluster/network.py' again and observe that it fails with a traceback. Also observe the message in /var/log/messages

Reproduced this on the build gluster-nagios-addons-0.2.7-1.el7rhgs.x86_64. Hit the traceback. Saw the messages in the Nagios web UI go to UNKNOWN and Network Utilization go to WARNING with the status: "NRPE: Unable to read output."

Updated the package to gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64 and was _not_ able to see a traceback. The services including Network Utilization in Nagios web UI show UNKNOWN.

Pasting the detailed output below. Moving this BZ to verified in 3.2.


gluster-nagios-addons 0.2.7-1
=============================

/var/log/messages snippet

Nov 21 16:21:12 dhcp46-239 systemd: Stopped Command Scheduler.
Nov 21 16:22:08 dhcp46-239 python: detected unhandled Python exception in '/usr/lib64/nagios/plugins/gluster/network.py'
Nov 21 16:22:13 dhcp46-239 python: communication with ABRT daemon failed: timed out
Nov 21 16:22:21 dhcp46-239 kernel: nr_pdflush_threads exported in /proc is scheduled for removal

[root@dhcp46-239 ~]# /usr/lib64/nagios/plugins/gluster/network.py
Traceback (most recent call last):
  File "/usr/lib64/nagios/plugins/gluster/network.py", line 138, in <module>
    main()
  File "/usr/lib64/nagios/plugins/gluster/network.py", line 126, in main
    excludes=args.exclude)
  File "/usr/lib64/nagios/plugins/gluster/network.py", line 84, in _getStatMessage
    for info in stat['network']['net-dev']:
TypeError: 'NoneType' object has no attribute '__getitem__'
[root@dhcp46-239 ~]#


gluster-nagios-addons 0.2.8-1
===============================
[root@dhcp46-239 ~]# /usr/lib64/nagios/plugins/gluster/network.py
ERROR:root:unable to get network status for the given interval: 1
UNKNOWN
[root@dhcp46-239 ~]# 


After starting crond service

[root@dhcp46-239 ~]# /usr/lib64/nagios/plugins/gluster/network.py
OK: ens3:UP |ens3.rxpck=32.80 ens3.txpck=25.16 ens3.rxkB=3.22 ens3.txkB=7.57
[root@dhcp46-239 ~]#
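
The outputs above follow the usual Nagios plugin convention: one status line, optional performance data after a `|`, and an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below illustrates that convention only and is not the code in network.py; `format_result` and `run_check` are hypothetical names. It shows one way a plugin could turn an unexpected exception into an UNKNOWN result so that NRPE always has output to read.

```python
# Standard Nagios plugin exit codes.
EXIT_CODES = {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}


def format_result(status, message='', perf_lines=()):
    """Build the plugin status line and look up its exit code."""
    line = status
    if message:
        line += ": " + message
    if perf_lines:
        # Performance data follows the '|' separator.
        line += " |" + " ".join(perf_lines)
    return line, EXIT_CODES.get(status, EXIT_CODES['UNKNOWN'])


def run_check(check):
    """Run a check callable returning (status, message, perf_lines).

    Any unexpected exception is reported as UNKNOWN instead of
    escaping as a traceback.
    """
    try:
        status, message, perf_lines = check()
    except Exception as exc:
        status, message, perf_lines = 'UNKNOWN', 'plugin error: %s' % exc, ()
    return format_result(status, message, perf_lines)
```

A caller would print the returned line and pass the returned code to `sys.exit()`.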

Comment 26 errata-xmlrpc 2017-03-23 05:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0491.html

