Description of problem:

I've spotted the following Traceback repeated multiple times in /var/log/messages on one of my Gluster Storage servers:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nov 26 12:52:15 node1 collectd: Exception in thread TendrlHealInfoAndProfileInfoPlugin:
Nov 26 12:52:15 node1 collectd: Traceback (most recent call last):
Nov 26 12:52:15 node1 collectd: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Nov 26 12:52:15 node1 collectd: self.run()
Nov 26 12:52:15 node1 collectd: File "/usr/lib64/python2.7/threading.py", line 765, in run
Nov 26 12:52:15 node1 collectd: self.__target(*self.__args, **self.__kwargs)
Nov 26 12:52:15 node1 collectd: File "/usr/lib64/collectd/gluster/tendrl_gluster.py", line 97, in run
Nov 26 12:52:15 node1 collectd: metrics = self.get_metrics()
Nov 26 12:52:15 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_glusterfs_profile_info.py", line 463, in get_metrics
Nov 26 12:52:15 node1 collectd: self.CONFIG
Nov 26 12:52:15 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 134, in get_metrics
Nov 26 12:52:15 node1 collectd: get_heal_info(volume, CONFIG['integration_id'])
Nov 26 12:52:15 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 106, in get_heal_info
Nov 26 12:52:15 node1 collectd: ] = brick_heal_info['split_brain_cnt']
Nov 26 12:52:15 node1 collectd: KeyError: 'split_brain_cnt'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The problem on the tendrl-node-agent side is this: if the `gluster volume heal VOLUME statistics` command (executed by the get_volume_heal_info function [1]) doesn't return statistics for one of the bricks (because of a problem on the glusterfsd side), the dictionary returned for that brick by the _parse_self_heal_stats(op) function [2] doesn't contain all the expected keys.

For example, here is the output structure from _parse_self_heal_stats (extracted during the debugging process):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[
  {
    'brick_index': 1,
    'heal_failed_cnt': 0,
    'healed_cnt': 0,
    'host_name': 'node1.example.com',
    'split_brain_cnt': 0
  },
  {
    'brick_index': 2,
    'host_name': 'node2.example.com'
  },
  ...
]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It was initially spotted on a cluster with a few Gluster volumes that had been running untouched for a few days, so the surrounding logs are huge and it is not possible to find the exact root cause of this issue on the glusterfsd side. But I was able to simulate/reproduce it again on a fresh cluster by killing one of the glusterfsd processes.

Version-Release number of selected component (if applicable):

RHGS-WA Server:
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-3.el7rhgs.noarch
tendrl-api-httpd-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-8.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-notifier-1.5.4-5.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch

Gluster Storage Node:
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Prepare a Gluster cluster with one Distributed-Replicated volume (volume_alpha_distrep_6x2 [3] in my case).
2. Prepare the RHGS-WA Server and configure RHGS-WA nodes on the Gluster Storage nodes.
3. Import the Gluster cluster into RHGS-WA.
4. Look for the Gluster server used (tagged) as "provisioner" (in the etcd database on the RHGS-WA Server):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# etcdctl --endpoints https://${HOSTNAME}:2379 ls /indexes/tags/provisioner
/indexes/tags/provisioner/90badd2d-e835-4298-94a3-1cb4b775d14d

# etcdctl --endpoints https://${HOSTNAME}:2379 get /indexes/tags/provisioner/90badd2d-e835-4298-94a3-1cb4b775d14d
["69cf038f-8260-49c7-a8f2-416b088a4f8f"]

# etcdctl --endpoints https://${HOSTNAME}:2379 get /nodes/69cf038f-8260-49c7-a8f2-416b088a4f8f/NodeContext/fqdn
node1.example.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

5. On the "provisioner" server check that there are volume heal statistics for each brick:

# gluster volume heal volume_alpha_distrep_6x2 statistics

There should be a number of lines under each per-brick section (separated by the dashed line):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------------------------
Crawl statistics for brick no 1
Hostname of brick node1.example.com

Starting time of crawl: Fri Dec 1 06:38:34 2017
Ending time of crawl: Fri Dec 1 06:38:34 2017
Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0
<truncated>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

6. Watch /var/log/messages on the "provisioner" for KeyError exceptions (the relevant lines are coloured when using the following command):

# tail -F /var/log/messages | grep --color -Ei '.*KeyError.*|'

7. On some Gluster Storage node (which might be different from the provisioner) kill one of the /usr/sbin/glusterfsd processes.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ps aux | grep '[g]lusterfsd'
root     15155  0.0  1.3 1171312 25732 ?  Ssl  06:38  0:00 /usr/sbin/glusterfsd -s node1.example.com <truncated>
root     15174  0.0  1.4 1104736 26796 ?  Ssl  06:38  0:00 /usr/sbin/glusterfsd -s node1.example.com <truncated>

# kill 15155

# ps aux | grep '[g]lusterfsd'
root     15174  0.0  1.4 1170532 27880 ?  Ssl  06:38  0:02 /usr/sbin/glusterfsd -s node1.example.com <truncated>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

8. Back on the provisioner, repeatedly check the heal info and statistics for each brick:

# gluster volume heal volume_alpha_distrep_6x2 info
# gluster volume heal volume_alpha_distrep_6x2 statistics

The "info" command will almost immediately show something like this for one of the bricks:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Brick node1.example.com:/mnt/brick_alpha_distrep_1/1
Status: Transport endpoint is not connected
Number of entries: -
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the output of the "statistics" command it will probably be reflected only after a few minutes. It will then show only these lines for the affected brick, with no statistics below them:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------------------------
Crawl statistics for brick no 1
Hostname of brick node1.example.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

9. Now check /var/log/messages.

Actual results:
When there is a problem with fetching heal statistics from Gluster, collectd crashes with a KeyError: 'split_brain_cnt' exception.
Expected results:
collectd should not crash, and there should be no Traceback in /var/log/messages.

Additional info:
[1] /usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py, line 47
[2] /usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py, lines 12-39
[3] https://github.com/usmqe/usmqe-setup/blob/master/gdeploy_config/volume_alpha_distrep_6x2.create.conf
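For illustration only, here is a minimal Python sketch of the kind of defensive lookup that would avoid the KeyError when a brick's statistics are missing; the safe_heal_counts helper is hypothetical and this is not the actual Tendrl code:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def safe_heal_counts(brick_heal_info):
    """Return heal counters for one brick, defaulting missing values to 0
    instead of raising KeyError when glusterfsd returned no statistics."""
    return {
        'healed_cnt': brick_heal_info.get('healed_cnt', 0),
        'split_brain_cnt': brick_heal_info.get('split_brain_cnt', 0),
        'heal_failed_cnt': brick_heal_info.get('heal_failed_cnt', 0),
    }


if __name__ == '__main__':
    # Structure as observed during debugging: the second brick has no counters.
    parsed = [
        {'brick_index': 1, 'host_name': 'node1.example.com',
         'healed_cnt': 0, 'split_brain_cnt': 0, 'heal_failed_cnt': 0},
        {'brick_index': 2, 'host_name': 'node2.example.com'},
    ]
    for brick in parsed:
        print("%s %s" % (brick['host_name'], safe_heal_counts(brick)))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~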
I've retested it with the same scenario on the newest builds and now there is a different Traceback:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dec 4 03:16:49 node1 collectd: Failed to collect volume heal info. Error Traceback (most recent call last):
Dec 4 03:16:49 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 106, in get_volume_heal_info_stats
Dec 4 03:16:49 node1 collectd: vol_heal_op
Dec 4 03:16:49 node1 collectd: File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 23, in _parse_self_heal_info_stats
Dec 4 03:16:49 node1 collectd: heal_pending_cnt = int(line.split(": ")[1])
Dec 4 03:16:49 node1 collectd: ValueError: invalid literal for int() with base 10: '-'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is caused by the following output of the `gluster volume heal <volname> info split-brain` command:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# gluster volume heal volume_alpha_distrep_6x2 info split-brain
Brick node1.example.com:/mnt/brick_alpha_distrep_1/1
Status: Transport endpoint is not connected
Number of entries in split-brain: -

Brick node2.example.com:/mnt/brick_alpha_distrep_1/1
Status: Connected
Number of entries in split-brain: 0

Brick node3.example.com:/mnt/brick_alpha_distrep_1/1
Status: Connected
Number of entries in split-brain: 0
<truncated>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected components:

RHGS-WA Server:
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-commons-1.5.4-6.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-11.el7rhgs.noarch
tendrl-node-agent-1.5.4-9.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-5.el7rhgs.noarch

Gluster Storage Node:
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-6.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-8.el7rhgs.noarch
tendrl-node-agent-1.5.4-9.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch

It is a different error, but it is caused by exactly the same scenario, so I think we can deal with it in this bug.
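For illustration only, a minimal sketch of tolerating the '-' value when parsing the per-brick counts; the parse_heal_count helper is hypothetical and not necessarily what the eventual fix does:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def parse_heal_count(line):
    """Parse lines like 'Number of entries in split-brain: N'.
    Non-numeric values such as '-' (brick not connected) become 0."""
    value = line.split(": ")[1].strip()
    try:
        return int(value)
    except ValueError:
        return 0


if __name__ == '__main__':
    print(parse_heal_count("Number of entries in split-brain: 0"))  # -> 0
    print(parse_heal_count("Number of entries in split-brain: -"))  # -> 0, no ValueError
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~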
@Daniel, sorry, my bad, I didn't see the output in the previous comment. Is it the case that a few bricks are down on the volume? If so, we might need to handle this scenario, as instead of a `0` value for split-brain it shows `-`.
Daniel sent https://github.com/Tendrl/node-agent/pull/694 to handle the non-numeric case of heal values.
@Shubhendu, yes, I've simulated it by killing one of the glusterfsd processes (the reproduction scenario is the same as in Comment 0). (Initially, when I spotted the error described in the Description of this bug on a longer-running cluster, I wasn't able to find the reason for the glusterfsd failure, but I found these to be the steps to reproduce the error in Tendrl.)
Tested with:

etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.5.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
python-gluster-3.8.4-52.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-5.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-8.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-13.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-13.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-13.el7rhgs.noarch
tendrl-node-agent-1.5.4-14.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-6.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch

and it works. --> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3478