Description of problem:
WA collectd heavy_weight plugins are supposed to be executed on the provisioner node only. When the provisioner changes, another node sets its provisioner tag to true and starts executing all heavy_weight plugins, but the provisioner tag on the previous node is not cleared. As a result, both nodes execute the heavy_weight plugins, which increases the number of etcd read requests and carbon_cache metrics requests. Over time, all nodes end up executing the heavy_weight plugins, because the tag is never cleared on any node.

Version-Release number of selected component (if applicable):
tendrl-node-agent-1.6.3-11.el7rhgs

How reproducible:
After cluster import, open the file with "vi /etc/collectd.d/tendrl_gluster.conf" on all nodes. The provisioner tag is marked as true on one of the nodes. Then stop the tendrl-node-agent service and wait for 300 seconds. Then start the tendrl-node-agent service and check the file "/etc/collectd.d/tendrl_gluster.conf" again on all nodes: the provisioner tag is now marked as true on more than one node.

Steps to Reproduce:
1.
2.
3.

Actual results:
WA heavy_weight collectd plugins are also executed by non-provisioner nodes.

Expected results:
WA heavy_weight collectd plugins should be executed by the provisioner node only.

Additional info:
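The check described under "How reproducible" (reading the provisioner line of /etc/collectd.d/tendrl_gluster.conf on every node) can be scripted. The following is only a minimal sketch: it assumes the file contains a line of the form `provisioner True` or `provisioner False`, the function names and the sample node names are hypothetical, and collecting the file contents from each node is left out.

```python
import re

# Matches a config line such as:  provisioner False
PROVISIONER_RE = re.compile(
    r'^\s*provisioner\s+"?(True|False)"?\s*$',
    re.MULTILINE | re.IGNORECASE,
)


def provisioner_flag(conf_text):
    """Return True/False for the provisioner line, or None if it is absent."""
    match = PROVISIONER_RE.search(conf_text)
    if match is None:
        return None
    return match.group(1).lower() == "true"


def nodes_marked_provisioner(conf_by_node):
    """Return the sorted node names whose config marks them as provisioner."""
    return sorted(
        node for node, text in conf_by_node.items() if provisioner_flag(text)
    )


if __name__ == "__main__":
    # Hypothetical sample data standing in for the files read from each node.
    sample = {
        "node1": 'peer_name "node1"\nprovisioner True\n',
        "node2": 'peer_name "node2"\nprovisioner True\n',
        "node3": 'peer_name "node3"\nprovisioner False\n',
    }
    marked = nodes_marked_provisioner(sample)
    if len(marked) > 1:
        print("more than one provisioner: %s" % ", ".join(marked))
```

If the list returned by `nodes_marked_provisioner` has more than one entry after the node-agent restart, the stale tag described above is present.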
PR is under review: https://github.com/Tendrl/node-agent/pull/857
https://github.com/Tendrl/commons/pull/1063
This bug has been taken out of BU3.
There is one issue in the changes related to this bug.

In the file /usr/lib64/collectd/gluster/tendrl_gluster.py on line 256, there is a catch for the exception etcd.KeyNotFound, but the etcd module doesn't have such an exception (it should probably be etcd.EtcdKeyNotFound).

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
256         except (etcd.KeyNotFound, etcd.EtcdConnectionFailed, SyntaxError) as ex:
257             collectd.error('Failed to find provisioner node. Error %s' % str(ex))
258             continue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This issue leads to the following error in the logs (and maybe to some other consequences):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Feb 15 17:51:15 node1 collectd: Unhandled python exception in read callback: AttributeError: 'module' object has no attribute 'KeyNotFound'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Spotted in tendrl-node-agent-1.6.3-17.el7rhgs.noarch

The particular line in upstream is slightly different because of some other commit, but the issue is the same:
https://github.com/Tendrl/node-agent/blob/master/tendrl/node_agent/monitoring/collectd/collectors/gluster/tendrl_gluster.py#L257

>> ASSIGNED
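A note on why this shows up as an unhandled AttributeError rather than at import time: Python evaluates the expression in an `except` clause only when an exception actually reaches it. The sketch below illustrates that behaviour. It uses a hypothetical stand-in module (`fake_etcd`) instead of the real etcd client so that it runs without etcd installed, and the function names are made up for the illustration.

```python
import types

# Hypothetical stand-in for the etcd client module. Only the exception
# name suggested in this comment (EtcdKeyNotFound) is defined on it.
fake_etcd = types.ModuleType("fake_etcd")


class EtcdKeyNotFound(Exception):
    pass


fake_etcd.EtcdKeyNotFound = EtcdKeyNotFound


def read_with_wrong_name(raise_error):
    try:
        if raise_error:
            raise fake_etcd.EtcdKeyNotFound("key missing")
        return "ok"
    except (fake_etcd.KeyNotFound, SyntaxError):  # attribute does not exist
        return "handled"


def read_with_right_name(raise_error):
    try:
        if raise_error:
            raise fake_etcd.EtcdKeyNotFound("key missing")
        return "ok"
    except (fake_etcd.EtcdKeyNotFound, SyntaxError):
        return "handled"


if __name__ == "__main__":
    print(read_with_wrong_name(False))  # no exception, except clause never evaluated
    print(read_with_right_name(True))   # exception is caught as intended
    try:
        read_with_wrong_name(True)
    except AttributeError as err:
        # The lookup of the missing attribute fails while handling the error.
        print("AttributeError: %s" % err)
```

So the wrong name stays invisible as long as the lookup succeeds, and only surfaces once a key is missing or the connection fails.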
There is another issue just a few lines above the previously mentioned one.

In the same file /usr/lib64/collectd/gluster/tendrl_gluster.py on line 253, there is the condition `CONFIG["node_id"] not in eval(provisioner)`. The problem is that under some conditions the CONFIG object might not contain the key "node_id".

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
252             if (
253                 CONFIG["node_id"] not in eval(provisioner)
254             ):
255                 continue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This issue leads to the following error (under some conditions):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unhandled python exception in read callback: KeyError: 'node_id'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If I have identified the source of the data for the CONFIG object correctly, it depends on the content of the /etc/collectd.d/tendrl_gluster.conf file, which on some nodes doesn't contain the "node_id" key. So the file looks like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# cat /etc/collectd.d/tendrl_gluster.conf
<Plugin "python">
    ModulePath "/usr/lib64/collectd/gluster"
    Import "tendrl_gluster"
    <Module "tendrl_gluster">
        integration_id "ffe0d070-bdbd-4aee-b217-0049b6d68e41"
        graphite_host "rhsqa6.lab.eng.blr.redhat.com"
        graphite_port "2003"
        peer_name "rhocs-node1.lab.eng.blr.redhat.com"
        provisioner False
        etcd_host "rhsqa6.lab.eng.blr.redhat.com"
        etcd_port 2379
        etcd_ca_cert_file ""
        etcd_cert_file ""
        etcd_key_file ""
    </Module>
</Plugin>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There seems to be a difference between a freshly installed 3.4.4 cluster and a cluster updated from a previous version (or, more precisely, it depends on which version the cluster was imported on).

* If the cluster was imported on the new (3.4.4) version, the tendrl_gluster.conf file on all storage nodes contains the key "node_id".
* If the cluster was imported on an older version and updated to the new one, the "node_id" key is in the tendrl_gluster.conf file on only one storage node.
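For illustration, a defensive version of the check could treat a missing "node_id" as "not the provisioner" instead of raising. This is only a sketch and not necessarily what the actual fix does: the function name is hypothetical, it assumes the provisioner value is a string holding a Python literal (for example a list of node IDs), and it swaps `eval` for `ast.literal_eval` from the standard library.

```python
import ast


def node_is_provisioner(config, provisioner_raw):
    """Return True only if config has a node_id listed in provisioner_raw."""
    node_id = config.get("node_id")  # None instead of KeyError when absent
    if node_id is None:
        return False
    try:
        # literal_eval only accepts Python literals, unlike eval
        provisioner_ids = ast.literal_eval(provisioner_raw)
    except (ValueError, SyntaxError, TypeError):
        return False
    if isinstance(provisioner_ids, str):
        provisioner_ids = [provisioner_ids]
    if not isinstance(provisioner_ids, (list, tuple, set)):
        return False
    return node_id in provisioner_ids


if __name__ == "__main__":
    # Hypothetical values standing in for CONFIG and the stored provisioner data.
    print(node_is_provisioner({"node_id": "abc"}, '["abc", "def"]'))
    print(node_is_provisioner({}, '["abc", "def"]'))
```

With a helper like this, a config file generated before the "node_id" key existed would make the plugin skip the heavy_weight work on that node rather than abort the read callback.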
In "tendrl-node-agent-1.6.3-17.el7rhgs" I added a new node_id variable to the collectd configuration file. The problem is that the collectd configuration file is generated during the import cluster flow, so it works fine for a newly created machine. In the upgrade scenario, however, the customer has already imported the cluster, so the "node_id" variable is not present and collectd fails.

Fixed in PR: https://github.com/Tendrl/node-agent/pull/874
In the current version, the way the collectd tendrl plugins refer to the provisioner node has changed. The reference in the /etc/collectd.d/tendrl_gluster.conf configuration file is no longer used; the value stored in etcd is used instead.

--> VERIFIED

During testing, bz 1685153 was reported, related to the behaviour of the configuration file during update.

Tested with:
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0660