Bug 1519856

Summary: collectd: KeyError: 'split_brain_cnt' or ValueError: invalid literal for int() with base 10 in /var/log/messages
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Daniel Horák <dahorak>
Component: web-admin-tendrl-node-agent
Assignee: Shubhendu Tripathi <shtripat>
Status: CLOSED ERRATA
QA Contact: Martin Kudlej <mkudlej>
Severity: unspecified
Priority: unspecified
Version: rhgs-3.3
CC: dahorak, mkudlej, nthomas, rhs-bugs, sanandpa, sankarshan, shtripat
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: tendrl-node-agent-1.5.4-14.el7rhgs
Last Closed: 2017-12-18 04:38:26 UTC
Type: Bug

Description Daniel Horák 2017-12-01 14:54:06 UTC
Description of problem:
  I've spotted the following traceback repeated multiple times in /var/log/messages on one of my Gluster Storage servers:

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Nov 26 12:52:15 node1 collectd: Exception in thread TendrlHealInfoAndProfileInfoPlugin:
  Nov 26 12:52:15 node1 collectd: Traceback (most recent call last):
  Nov 26 12:52:15 node1 collectd: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  Nov 26 12:52:15 node1 collectd: self.run()
  Nov 26 12:52:15 node1 collectd: File "/usr/lib64/python2.7/threading.py", line 765, in run
  Nov 26 12:52:15 node1 collectd: self.__target(*self.__args, **self.__kwargs)
  Nov 26 12:52:15 node1 collectd: File "/usr/lib64/collectd/gluster/tendrl_gluster.py", line 97, in run
  Nov 26 12:52:15 node1 collectd: metrics = self.get_metrics()
  Nov 26 12:52:15 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_glusterfs_profile_info.py", line 463, in get_metrics
  Nov 26 12:52:15 node1 collectd: self.CONFIG
  Nov 26 12:52:15 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 134, in get_metrics
  Nov 26 12:52:15 node1 collectd: get_heal_info(volume, CONFIG['integration_id'])
  Nov 26 12:52:15 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 106, in get_heal_info
  Nov 26 12:52:15 node1 collectd: ] = brick_heal_info['split_brain_cnt']
  Nov 26 12:52:15 node1 collectd: KeyError: 'split_brain_cnt'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The problem on the tendrl-node-agent side is this:
  If the `gluster volume heal VOLUME statistics` command (executed by the get_volume_heal_info function [1]) doesn't return statistics for one of the bricks (because of a problem on the glusterfsd side), the dictionary returned for that brick by the function '_parse_self_heal_stats(op)' [2] doesn't contain all the expected keys.

  For example, the output structure from _parse_self_heal_stats (extracted during debugging):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  [
    {
      'brick_index': 1,
      'heal_failed_cnt': 0,
      'healed_cnt': 0,
      'host_name': 'node1.example.com',
      'split_brain_cnt': 0
    },
    {
      'brick_index': 2,
      'host_name': 'node2.example.com'
    },
    ...
  ]
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
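
  Purely to illustrate the failure mode (this is not the actual Tendrl code or fix), a minimal Python sketch built on the structure above; a guarded lookup via dict.get() avoids the KeyError, and the default value of 0 is my assumption:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Hypothetical sketch, mirroring the _parse_self_heal_stats output above.
  bricks = [
      {'brick_index': 1, 'heal_failed_cnt': 0, 'healed_cnt': 0,
       'host_name': 'node1.example.com', 'split_brain_cnt': 0},
      # Brick whose statistics section is missing (glusterfsd down):
      {'brick_index': 2, 'host_name': 'node2.example.com'},
  ]

  for brick in bricks:
      # brick['split_brain_cnt'] raises KeyError for the second brick;
      # brick.get() with a default (0 is an assumed choice) does not.
      cnt = brick.get('split_brain_cnt', 0)
      print("%s: split_brain_cnt=%s" % (brick['host_name'], cnt))
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~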


  It was initially spotted on a cluster with a few Gluster volumes that had been running untouched for a few days, so the surrounding logs are huge and it is not possible to find the exact root cause of this issue on the glusterfsd side. But I was able to reproduce it again on a fresh cluster by killing one of the glusterfsd processes.

Version-Release number of selected component (if applicable):
  RHGS-WA Server:
  tendrl-ansible-1.5.4-2.el7rhgs.noarch
  tendrl-api-1.5.4-3.el7rhgs.noarch
  tendrl-api-httpd-1.5.4-3.el7rhgs.noarch
  tendrl-commons-1.5.4-5.el7rhgs.noarch
  tendrl-grafana-plugins-1.5.4-8.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
  tendrl-monitoring-integration-1.5.4-8.el7rhgs.noarch
  tendrl-node-agent-1.5.4-8.el7rhgs.noarch
  tendrl-notifier-1.5.4-5.el7rhgs.noarch
  tendrl-selinux-1.5.4-1.el7rhgs.noarch
  tendrl-ui-1.5.4-4.el7rhgs.noarch
  
  Gluster Storage Node:
  tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
  tendrl-commons-1.5.4-5.el7rhgs.noarch
  tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch
  tendrl-node-agent-1.5.4-8.el7rhgs.noarch
  tendrl-selinux-1.5.4-1.el7rhgs.noarch

How reproducible:
  100%

Steps to Reproduce:
1. Prepare Gluster cluster with one Distributed-Replicated volume
   (volume_alpha_distrep_6x2[3] in my case)

2. Prepare RHGS-WA Server and configure RHGS-WA nodes on the Gluster Storage nodes.

3. Import the Gluster Cluster into RHGS-WA.

4. Look for the Gluster server used (tagged) as "provisioner" (in the etcd database on the RHGS-WA Server):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # etcdctl --endpoints https://${HOSTNAME}:2379 ls /indexes/tags/provisioner
  /indexes/tags/provisioner/90badd2d-e835-4298-94a3-1cb4b775d14d

  # etcdctl --endpoints https://${HOSTNAME}:2379 get /indexes/tags/provisioner/90badd2d-e835-4298-94a3-1cb4b775d14d
  ["69cf038f-8260-49c7-a8f2-416b088a4f8f"]

  # etcdctl --endpoints https://${HOSTNAME}:2379 get /nodes/69cf038f-8260-49c7-a8f2-416b088a4f8f/NodeContext/fqdn
  node1.example.com
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

5. On the "provisioner" server check, that there are volume heal statistic for each brick:
  # gluster volume heal volume_alpha_distrep_6x2 statistics

  There should be a number of lines under each brick's section (sections are separated by a dashed line):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ------------------------------------------------

  Crawl statistics for brick no 1
  Hostname of brick node1.example.com

  Starting time of crawl: Fri Dec  1 06:38:34 2017

  Ending time of crawl: Fri Dec  1 06:38:34 2017

  Type of crawl: INDEX
  No. of entries healed: 0
  No. of entries in split-brain: 0
  No. of heal failed entries: 0
  <truncated>
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

6. Watch /var/log/messages on the "provisioner" for KeyError exceptions (the matching lines will be coloured when using the following command):
  # tail -F /var/log/messages | grep --color -Ei '.*KeyError.*|'

7. On some Gluster Storage node (it might be a different one than the provisioner), kill one of the /usr/sbin/glusterfsd processes.
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # ps aux | grep '[g]lusterfsd'
  root     15155  0.0  1.3 1171312 25732 ?       Ssl  06:38   0:00 /usr/sbin/glusterfsd -s node1.example.com <truncated>
  root     15174  0.0  1.4 1104736 26796 ?       Ssl  06:38   0:00 /usr/sbin/glusterfsd -s node1.example.com <truncated>
  
  # kill 15155
  
  # ps aux | grep '[g]lusterfsd'
  root     15174  0.0  1.4 1170532 27880 ?       Ssl  06:38   0:02 /usr/sbin/glusterfsd -s node1.example.com <truncated>
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

8. Back on the provisioner, repeatedly check the heal info and statistics for each brick:

  # gluster volume heal volume_alpha_distrep_6x2 info
  # gluster volume heal volume_alpha_distrep_6x2 statistics

  The "info" command will show quite immediately something like this for one of the bricks:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Brick node1.example.com:/mnt/brick_alpha_distrep_1/1
  Status: Transport endpoint is not connected
  Number of entries: -
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  In the output of the "statistics" command, the problem will probably be reflected only after a few minutes.
  It will then show only these lines for the affected brick, with no statistics:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ------------------------------------------------

  Crawl statistics for brick no 1
  Hostname of brick node1.example.com
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

9. Now check /var/log/messages.

Actual results:
  When there is a problem fetching heal statistics from Gluster, collectd crashes with a KeyError: 'split_brain_cnt' exception.

Expected results:
  collectd should not crash, and there should be no traceback in /var/log/messages.

Additional info:
[1] /usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py line 47
[2] /usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py line 12-39
[3] https://github.com/usmqe/usmqe-setup/blob/master/gdeploy_config/volume_alpha_distrep_6x2.create.conf

Comment 3 Daniel Horák 2017-12-04 08:27:06 UTC
I've retested the same scenario on the newest builds, and now there is a different traceback:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Dec  4 03:16:49 node1 collectd: Failed to collect volume heal info. Error Traceback (most recent call last):
  Dec  4 03:16:49 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 106, in get_volume_heal_info_stats
  Dec  4 03:16:49 node1 collectd: vol_heal_op
  Dec  4 03:16:49 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 23, in _parse_self_heal_info_stats
  Dec  4 03:16:49 node1 collectd: heal_pending_cnt = int(line.split(": ")[1])
  Dec  4 03:16:49 node1 collectd: ValueError: invalid literal for int() with base 10: '-'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is caused by the following output from the `gluster volume heal <volname> info split-brain` command:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # gluster volume heal volume_alpha_distrep_6x2 info split-brain
  Brick node1.example.com:/mnt/brick_alpha_distrep_1/1
  Status: Transport endpoint is not connected
  Number of entries in split-brain: -

  Brick node2.example.com:/mnt/brick_alpha_distrep_1/1
  Status: Connected
  Number of entries in split-brain: 0

  Brick node3.example.com:/mnt/brick_alpha_distrep_1/1
  Status: Connected
  Number of entries in split-brain: 0

  <truncated>
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
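
To illustrate one possible way of tolerating such output (I haven't verified how the actual fix handles it), a minimal sketch, assuming heal counts are non-negative so a simple isdigit() check suffices; returning None for the '-' value is my assumption:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Hypothetical sketch: the traceback shows int(line.split(": ")[1])
  # failing on '-'. Heal counts are non-negative, so isdigit() suffices.
  def parse_count(line):
      value = line.split(": ")[1].strip()
      # Treat a non-numeric value (brick not connected) as "not available";
      # returning None here is an assumption, not the actual fix.
      return int(value) if value.isdigit() else None

  print(parse_count("Number of entries in split-brain: -"))  # None
  print(parse_count("Number of entries in split-brain: 0"))  # 0
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~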

Version-Release number of selected components:
  RHGS-WA Server:
  tendrl-ansible-1.5.4-2.el7rhgs.noarch
  tendrl-api-1.5.4-4.el7rhgs.noarch
  tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
  tendrl-commons-1.5.4-6.el7rhgs.noarch
  tendrl-grafana-plugins-1.5.4-11.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
  tendrl-monitoring-integration-1.5.4-11.el7rhgs.noarch
  tendrl-node-agent-1.5.4-9.el7rhgs.noarch
  tendrl-notifier-1.5.4-6.el7rhgs.noarch
  tendrl-selinux-1.5.4-1.el7rhgs.noarch
  tendrl-ui-1.5.4-5.el7rhgs.noarch
  
  Gluster Storage Node:
  tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
  tendrl-commons-1.5.4-6.el7rhgs.noarch
  tendrl-gluster-integration-1.5.4-8.el7rhgs.noarch
  tendrl-node-agent-1.5.4-9.el7rhgs.noarch
  tendrl-selinux-1.5.4-1.el7rhgs.noarch

It is a different error, but it is caused by exactly the same scenario, so I think we can deal with it in this bug.

Comment 5 Shubhendu Tripathi 2017-12-04 09:01:28 UTC
@Daniel, sorry, my bad, I didn't see the output in the previous comment.
Is it the case that a few bricks are down on the volume? If so, we might need to handle this scenario, because instead of a `0` value for split-brain it's showing `-`.

Comment 6 Shubhendu Tripathi 2017-12-04 09:11:06 UTC
@Daniel, sent https://github.com/Tendrl/node-agent/pull/694 to handle the non-numeric case of heal values.

Comment 7 Daniel Horák 2017-12-04 09:13:29 UTC
@Shubhendu, yes, I've simulated it by killing one of the glusterfsd processes (the reproduction scenario is the same as in Comment 0).

(Initially, when I spotted the error described in the Description of this bug on a longer-running cluster, I wasn't able to find the reason for the glusterfsd failure, but I found that these steps reproduce the error in Tendrl.)

Comment 9 Martin Kudlej 2017-12-08 11:41:22 UTC
Tested with
etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.5.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
python-gluster-3.8.4-52.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-5.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-8.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-13.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-13.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-13.el7rhgs.noarch
tendrl-node-agent-1.5.4-14.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-6.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch

and it works.
--> VERIFIED

Comment 11 errata-xmlrpc 2017-12-18 04:38:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478