Bug 1647910
| Summary: | Provisioner tag in collectd claimed by more than one storage nodes | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | gowtham <gshanmug> |
| Component: | web-admin-tendrl-node-agent | Assignee: | gowtham <gshanmug> |
| Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | rhgs-3.4 | CC: | dahorak, nthomas, rhs-bugs, sankarshan |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.4.z Batch Update 4 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | tendrl-node-agent-1.6.3-18.el7rhgs | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-03-27 03:49:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1656822 | ||
|
Description
gowtham
2018-11-08 14:18:37 UTC
PR is under review: https://github.com/Tendrl/node-agent/pull/857
This bug has been taken out of BU3.
There is one issue in the changes related to this bug.
In the file /usr/lib64/collectd/gluster/tendrl_gluster.py, line 256 catches
the exception etcd.KeyNotFound, but the etcd module has no such exception
(it should probably be etcd.EtcdKeyNotFound).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
256 except (etcd.KeyNotFound, etcd.EtcdConnectionFailed, SyntaxError) as ex:
257     collectd.error('Failed to find provisioner node. Error %s' % str(ex))
258     continue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This issue leads to the following error in the logs (and possibly to other
consequences):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Feb 15 17:51:15 node1 collectd: Unhandled python exception in read callback: AttributeError: 'module' object has no attribute 'KeyNotFound'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Spotted in tendrl-node-agent-1.6.3-17.el7rhgs.noarch
The corresponding line upstream is slightly different because of another
commit, but the issue is the same:
https://github.com/Tendrl/node-agent/blob/master/tendrl/node_agent/monitoring/collectd/collectors/gluster/tendrl_gluster.py#L257
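For illustration, here is a minimal sketch of why the misnamed exception shows up as an AttributeError instead of being caught. Python evaluates the names in an `except` clause only when an exception actually reaches it, so the missing attribute goes unnoticed until the first etcd error occurs. The `fake_etcd` module and both functions are hypothetical stand-ins so the sketch runs without python-etcd installed; this is not the plugin's actual code.

```python
import types

# Hypothetical stand-in for the etcd module, so the sketch is self-contained.
fake_etcd = types.ModuleType("etcd")


class EtcdKeyNotFound(Exception):
    """Stand-in for the exception the real client raises for a missing key."""


fake_etcd.EtcdKeyNotFound = EtcdKeyNotFound


def read_with_wrong_name():
    try:
        raise fake_etcd.EtcdKeyNotFound("key missing")
    except (fake_etcd.KeyNotFound, SyntaxError) as ex:
        # Never reached: looking up fake_etcd.KeyNotFound raises AttributeError.
        return "handled: %s" % ex


def read_with_right_name():
    try:
        raise fake_etcd.EtcdKeyNotFound("key missing")
    except (fake_etcd.EtcdKeyNotFound, SyntaxError) as ex:
        return "handled: %s" % ex
```

Calling `read_with_wrong_name()` raises AttributeError, which matches the "Unhandled python exception in read callback" message, while `read_with_right_name()` handles the error as intended.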
>> ASSIGNED
There is another issue just a few lines above the previously mentioned one.
In the same file /usr/lib64/collectd/gluster/tendrl_gluster.py, on line 253,
there is the condition `CONFIG["node_id"] not in eval(provisioner)`. The problem
is that under some conditions the CONFIG object might not contain the key "node_id".
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
252 if (
253     CONFIG["node_id"] not in eval(provisioner)
254 ):
255     continue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This issue leads to the following error (under some conditions):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unhandled python exception in read callback: KeyError: 'node_id'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
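One way to avoid the KeyError is to look the key up defensively and treat a missing "node_id" as "this node is not the provisioner". The sketch below is only an assumption about how such a guard could look, not the fix that was shipped. The helper name `is_provisioner` is hypothetical, and the sketch assumes the provisioner value is the string form of a Python list; it parses that string with `ast.literal_eval` rather than `eval`.

```python
import ast


def is_provisioner(config, provisioner_raw):
    """Return True only if this node's id is listed in provisioner_raw."""
    node_id = config.get("node_id")  # None instead of KeyError when missing
    if node_id is None:
        return False
    try:
        # Assumption: the stored value is a literal such as "['id-1', 'id-2']".
        candidates = ast.literal_eval(provisioner_raw)
    except (ValueError, SyntaxError, TypeError):
        return False
    if not isinstance(candidates, (list, tuple, set)):
        return False
    return node_id in candidates
```

With a guard like this, a node whose configuration lacks "node_id" simply skips the provisioner-only work instead of failing the whole read callback.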
If I traced the source of the data for the CONFIG object correctly, it depends
on the content of the /etc/collectd.d/tendrl_gluster.conf file, which on some
nodes doesn't contain the "node_id" key. On such a node the file looks like this:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# cat /etc/collectd.d/tendrl_gluster.conf
<Plugin "python">
    ModulePath "/usr/lib64/collectd/gluster"
    Import "tendrl_gluster"
    <Module "tendrl_gluster">
        integration_id "ffe0d070-bdbd-4aee-b217-0049b6d68e41"
        graphite_host "rhsqa6.lab.eng.blr.redhat.com"
        graphite_port "2003"
        peer_name "rhocs-node1.lab.eng.blr.redhat.com"
        provisioner False
        etcd_host "rhsqa6.lab.eng.blr.redhat.com"
        etcd_port 2379
        etcd_ca_cert_file ""
        etcd_cert_file ""
        etcd_key_file ""
    </Module>
</Plugin>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
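To check whether a node's generated file defines "node_id", the options inside the module block can be collected and inspected. This is a rough sketch that assumes the simple one-option-per-line layout shown above; the function name `module_options` is hypothetical, and this is not a full collectd configuration parser.

```python
import shlex


def module_options(conf_text, module_name="tendrl_gluster"):
    """Collect 'key value' options found inside <Module "module_name">."""
    options = {}
    inside = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("<Module"):
            inside = module_name in line
            continue
        if line.startswith("</Module"):
            inside = False
            continue
        if not inside or not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # strips the surrounding double quotes
        if len(parts) >= 2:
            options[parts[0]] = parts[1]
    return options


sample = '''
<Plugin "python">
<Module "tendrl_gluster">
peer_name "rhocs-node1.lab.eng.blr.redhat.com"
provisioner False
etcd_port 2379
</Module>
</Plugin>
'''
options = module_options(sample)
print("node_id" in options)  # False for a file like the one above
```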
There seems to be a difference between a freshly installed 3.4.4 cluster
and a cluster updated from a previous version (more precisely, it depends on
the version on which the cluster was imported).
* If the cluster was imported on the new (3.4.4) version, the tendrl_gluster.conf
file on all storage nodes contains the key "node_id".
* If the cluster was imported on an older version and then updated to the new one,
the "node_id" key is in the tendrl_gluster.conf file on only one storage node.
In "tendrl-node-agent-1.6.3-17.el7rhgs", I added a new node_id variable to the collectd configuration file. The problem is that the collectd configuration file is generated during the import cluster flow, so it works fine for a newly created machine. In the upgrade scenario, however, the customer has already imported the cluster, so the "node_id" variable is not present and collectd fails.
Fixed in PR: https://github.com/Tendrl/node-agent/pull/874
In the current version, the way the collectd tendrl plugins refer to the provisioner node has changed. The reference in the /etc/collectd.d/tendrl_gluster.conf configuration file is no longer used; the value stored in etcd is used instead.
--> VERIFIED
During testing, bz 1685153 was reported, related to the behaviour of the configuration file during update.
Tested with:
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:0660 |