Bug 1519856
| Summary: | collectd: KeyError: 'split_brain_cnt' or ValueError: invalid literal for int() with base 10 in /var/log/messages | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Daniel Horák <dahorak> |
| Component: | web-admin-tendrl-node-agent | Assignee: | Shubhendu Tripathi <shtripat> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Kudlej <mkudlej> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.3 | CC: | dahorak, mkudlej, nthomas, rhs-bugs, sanandpa, sankarshan, shtripat |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tendrl-node-agent-1.5.4-14.el7rhgs | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-18 04:38:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Daniel Horák 2017-12-01 14:54:06 UTC
I've retested it with the same scenario on the newest builds and now there is a different traceback:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dec  4 03:16:49 node1 collectd: Failed to collect volume heal info. Error Traceback (most recent call last):
Dec  4 03:16:49 node1 collectd:   File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 106, in get_volume_heal_info_stats
Dec  4 03:16:49 node1 collectd:     vol_heal_op
Dec  4 03:16:49 node1 collectd:   File "/usr/lib64/collectd/gluster/heavy_weight/tendrl_gluster_heal_info.py", line 23, in _parse_self_heal_info_stats
Dec  4 03:16:49 node1 collectd:     heal_pending_cnt = int(line.split(": ")[1])
Dec  4 03:16:49 node1 collectd: ValueError: invalid literal for int() with base 10: '-'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Caused by the following output from the `gluster volume heal <volname> info split-brain` command:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# gluster volume heal volume_alpha_distrep_6x2 info split-brain
Brick node1.example.com:/mnt/brick_alpha_distrep_1/1
Status: Transport endpoint is not connected
Number of entries in split-brain: -

Brick node2.example.com:/mnt/brick_alpha_distrep_1/1
Status: Connected
Number of entries in split-brain: 0

Brick node3.example.com:/mnt/brick_alpha_distrep_1/1
Status: Connected
Number of entries in split-brain: 0
<truncated>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected components:

RHGS-WA Server:
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-commons-1.5.4-6.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-11.el7rhgs.noarch
tendrl-node-agent-1.5.4-9.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-5.el7rhgs.noarch

Gluster Storage Node:
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-6.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-8.el7rhgs.noarch
tendrl-node-agent-1.5.4-9.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch

It is a different error, but caused by exactly the same scenario, so I think we can deal with it in this bug.

@Daniel, sorry, my bad, I didn't see the output in the previous comment. Is it that a few bricks are down on the volume? If so, we might need to handle this scenario, since instead of a `0` value for split-brain the output shows `-`.

@Daniel, sent https://github.com/Tendrl/node-agent/pull/694 for handling the non-numeric case of heal values.

@Shubhendu, yes, I simulated it by killing one of the glusterfsd processes (the reproduction scenario is the same as in Comment 0). (Initially, when I spotted the error described in the Description of this bug on a longer-running cluster, I wasn't able to find the reason for the glusterfsd failure, but I found that these steps reproduce the error in Tendrl.)
Tested with:

etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.5.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
python-gluster-3.8.4-52.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-5.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-8.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-13.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-13.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-13.el7rhgs.noarch
tendrl-node-agent-1.5.4-14.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-6.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch

and it works. --> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478