1647910 – Provisioner tag in collectd claimed by more than one storage nodes

Summary: Provisioner tag in collectd claimed by more than one storage nodes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	web-admin-tendrl-node-agent
Sub Component:
Version:	rhgs-3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.4.z Batch Update 4
Assignee:	gowtham
QA Contact:	Filip Balák
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1656822

Reported:	2018-11-08 14:18 UTC by gowtham
Modified:	2019-03-27 03:51 UTC (History)
CC List:	4 users (show)
Fixed In Version:	tendrl-node-agent-1.6.3-18.el7rhgs
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-03-27 03:49:38 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	Tendrl commons issues 1066	None	closed	Removing provisioner tag from collectd plugin configuration	2020-01-30 15:50:17 UTC
Github	Tendrl node-agent issues 859	None	closed	Collectd heavy weight plugins are executed by non-provisioner nodes also	2020-01-30 15:50:17 UTC
Red Hat Bugzilla	1685153	unspecified	CLOSED	/etc/collectd.d/tendrl_gluster.conf file is changed only on provisioner node after update	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2019:0660	None	None	None	2019-03-27 03:51:10 UTC

Description gowtham 2018-11-08 14:18:37 UTC

Description of problem:
The WA collectd heavy_weight plugins are meant to run on the provisioner node only. When the provisioner changes, another node sets its provisioner tag to true and starts executing all heavy_weight plugins, but the tag on the previous provisioner node is not cleared, so both nodes execute the heavy_weight plugins. As a result, the number of etcd read requests and carbon_cache metrics requests increases. Over time, all nodes end up executing the heavy_weight plugins because the tag is never cleared on any node.
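
The accumulation of stale tags can be illustrated with a small model. This is only a sketch: the node names and the two helper functions are hypothetical and are not taken from the Tendrl code. It only shows why a tag that is set but never cleared eventually ends up on every node.

```python
# Hypothetical model: each node keeps its own local "provisioner" flag,
# mirroring the per-node collectd configuration described in this report.

def promote_without_clearing(flags, new_provisioner):
    """Reported behaviour: set the flag on the new node, leave old flags alone."""
    updated = dict(flags)
    updated[new_provisioner] = True
    return updated


def promote_and_clear(flags, new_provisioner):
    """Expected behaviour: exactly one node carries the flag afterwards."""
    return {node: node == new_provisioner for node in flags}


def heavy_weight_nodes(flags):
    """Nodes that would run the heavy_weight plugins."""
    return sorted(node for node, is_prov in flags.items() if is_prov)


flags = {"node1": True, "node2": False, "node3": False}

# The provisioner role moves node1 -> node2 -> node3.
buggy = promote_without_clearing(flags, "node2")
buggy = promote_without_clearing(buggy, "node3")

fixed = promote_and_clear(flags, "node2")
fixed = promote_and_clear(fixed, "node3")

print("without clearing:", heavy_weight_nodes(buggy))
print("with clearing:   ", heavy_weight_nodes(fixed))
```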

Version-Release number of selected component (if applicable):
tendrl-node-agent-1.6.3-11.el7rhgs

How reproducible:
After cluster import, open the file /etc/collectd.d/tendrl_gluster.conf ("vi /etc/collectd.d/tendrl_gluster.conf") on all the nodes. The provisioner tag is marked as true on one of the nodes. Then stop the tendrl-node-agent service and wait for 300 seconds. Then start the tendrl-node-agent service and check the file /etc/collectd.d/tendrl_gluster.conf again on all nodes: the provisioner tag is now marked as true on more than one node.
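
Instead of opening the file by hand on every node, the check can be scripted. The following is a minimal sketch, assuming the content of /etc/collectd.d/tendrl_gluster.conf has already been collected from each node into a dictionary (how it is collected, e.g. over ssh, is left out). The function names and the shortened sample configuration are illustrative only; the `provisioner True/False` line format is the one shown in the configuration file quoted in this report.

```python
import re

# Matches a line such as `provisioner False` inside the <Module> section.
PROVISIONER_RE = re.compile(r'^\s*provisioner\s+"?(\w+)"?\s*$', re.MULTILINE)


def provisioner_flag(conf_text):
    """Return True/False for the provisioner option, or None if it is absent."""
    match = PROVISIONER_RE.search(conf_text)
    if match is None:
        return None
    return match.group(1).lower() == "true"


def nodes_claiming_provisioner(conf_by_node):
    """Return the sorted names of nodes whose config has provisioner set to true."""
    return sorted(
        node for node, text in conf_by_node.items() if provisioner_flag(text)
    )


# Shortened, made-up sample of the file content on three nodes.
CONF_TEMPLATE = """<Plugin "python">
    <Module "tendrl_gluster">
        peer_name "{peer}"
        provisioner {flag}
    </Module>
</Plugin>
"""

conf_by_node = {
    "node1": CONF_TEMPLATE.format(peer="node1", flag="True"),
    "node2": CONF_TEMPLATE.format(peer="node2", flag="True"),
    "node3": CONF_TEMPLATE.format(peer="node3", flag="False"),
}

claimed = nodes_claiming_provisioner(conf_by_node)
if len(claimed) > 1:
    print("provisioner tag claimed by more than one node:", claimed)
```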

Steps to Reproduce:
1.
2.
3.

Actual results:
WA heavy_weight collectd plugins are also executed by non-provisioner nodes.

Expected results:
WA heavy_weight collectd plugins should be executed by the provisioner node only.

Additional info:

Comment 2 gowtham 2018-12-03 07:55:14 UTC

PR is under review: https://github.com/Tendrl/node-agent/pull/857

Comment 3 gowtham 2018-12-03 08:25:40 UTC

https://github.com/Tendrl/commons/pull/1063

Comment 4 Nishanth Thomas 2018-12-13 10:08:47 UTC

This bug is taken out from BU3

Comment 6 Daniel Horák 2019-02-15 12:26:18 UTC

There is one issue in the changes related to this bug.

In the file /usr/lib64/collectd/gluster/tendrl_gluster.py on line 256, there is
an except clause for etcd.KeyNotFound, but the etcd module doesn't have such an
exception (it should probably be etcd.EtcdKeyNotFound).

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  256             except (etcd.KeyNotFound, etcd.EtcdConnectionFailed, SyntaxError) as ex:
  257                 collectd.error('Failed to find provisioner node. Error %s' % str(ex))
  258                 continue
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This issue leads to the following error in the logs (and maybe to some other
consequences):
  
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Feb 15 17:51:15 node1 collectd: Unhandled python exception in read callback: AttributeError: 'module' object has no attribute 'KeyNotFound'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Spotted in tendrl-node-agent-1.6.3-17.el7rhgs.noarch

The particular line in upstream is slightly different because of some other
commit, but the issue is the same:
  https://github.com/Tendrl/node-agent/blob/master/tendrl/node_agent/monitoring/collectd/collectors/gluster/tendrl_gluster.py#L257
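
The AttributeError only shows up when an exception actually reaches the except clause, because Python evaluates the names listed in an except clause lazily. A minimal sketch of that behaviour follows. It uses a stand-in module (`fake_etcd`, purely hypothetical) so that it runs without python-etcd installed; the two reader functions are illustrative and are not the plugin code.

```python
import types

# Stand-in for the etcd module (hypothetical, for illustration only).
# It defines EtcdKeyNotFound but has no attribute named KeyNotFound,
# which is the situation described in this comment.
fake_etcd = types.ModuleType("fake_etcd")


class EtcdKeyNotFound(Exception):
    pass


fake_etcd.EtcdKeyNotFound = EtcdKeyNotFound


def read_with_wrong_name(fail):
    try:
        if fail:
            raise fake_etcd.EtcdKeyNotFound("key missing")
        return "ok"
    # The attribute lookup below happens only when an exception is raised.
    except (fake_etcd.KeyNotFound, SyntaxError):
        return "handled"


def read_with_right_name(fail):
    try:
        if fail:
            raise fake_etcd.EtcdKeyNotFound("key missing")
        return "ok"
    except (fake_etcd.EtcdKeyNotFound, SyntaxError):
        return "handled"


print(read_with_wrong_name(False))   # no exception, so the bad name is never looked up
print(read_with_right_name(True))    # exception is caught as intended
try:
    read_with_wrong_name(True)
except AttributeError as ex:
    print("AttributeError:", ex)     # the lookup of KeyNotFound fails here
```

This would explain why the plugin can run without errors until the first time the provisioner lookup fails.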

>> ASSIGNED

Comment 7 Daniel Horák 2019-02-15 13:33:25 UTC

There is another issue just a few lines above the previously mentioned one.

In the same file /usr/lib64/collectd/gluster/tendrl_gluster.py on line 253,
there is the condition `CONFIG["node_id"] not in eval(provisioner)`. The problem
is that under some conditions the CONFIG object might not contain the key "node_id".

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  252                 if (
  253                     CONFIG["node_id"] not in eval(provisioner)
  254                 ):
  255                     continue
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This issue leads to the following error (under some conditions):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Unhandled python exception in read callback: KeyError: 'node_id'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If I have identified the source of the data for the CONFIG object correctly, it
depends on the content of the /etc/collectd.d/tendrl_gluster.conf file, which on
some nodes doesn't contain the "node_id" key. On such nodes the file looks like this:

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # cat /etc/collectd.d/tendrl_gluster.conf
  <Plugin "python">
      ModulePath "/usr/lib64/collectd/gluster"
  
      Import "tendrl_gluster"
  
      <Module "tendrl_gluster">
          integration_id "ffe0d070-bdbd-4aee-b217-0049b6d68e41"
          graphite_host "rhsqa6.lab.eng.blr.redhat.com"
          graphite_port "2003"
          peer_name "rhocs-node1.lab.eng.blr.redhat.com"
          provisioner False
          etcd_host "rhsqa6.lab.eng.blr.redhat.com"
          etcd_port 2379
          
              etcd_ca_cert_file ""
              etcd_cert_file ""
              etcd_key_file ""
          
  
      </Module>
  </Plugin>
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There seems to be a difference between a freshly installed 3.4.4 cluster and a
cluster updated from a previous version (or more precisely, it depends on which
version the cluster was imported on).
* If the cluster was imported on the new (3.4.4) version, the tendrl_gluster.conf
  file on all storage nodes contains the key "node_id".
* If the cluster was imported on an older version and updated to the new one,
  the "node_id" key is in the tendrl_gluster.conf file on only one storage node.

Comment 8 gowtham 2019-02-19 05:09:11 UTC

In "tendrl-node-agent-1.6.3-17.el7rhgs", I added a new node_id variable to the collectd configuration file. The problem is that the collectd configuration file is generated during the import cluster flow, so it works fine for newly created machines. In the upgrade scenario, however, the customer has already imported the cluster, so the "node_id" variable is not present and collectd fails.

Fixed in PR: https://github.com/Tendrl/node-agent/pull/874
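
For illustration, the KeyError on a missing "node_id" could be avoided with a defensive check along the following lines. This is a sketch only and not necessarily what the PR does. It assumes the provisioner value read from etcd is a string holding a Python list literal of node ids (suggested by the `eval(provisioner)` call quoted in this report). The function name is hypothetical, and treating a missing node_id as "not the provisioner" is an assumption.

```python
import ast


def is_provisioner(config, provisioner_raw):
    """Return True only if this node is listed as a provisioner.

    `config` stands in for the plugin's CONFIG dict; `provisioner_raw` is
    assumed to be the string stored in etcd, e.g. '["<node-id>"]'.
    """
    # .get() avoids the KeyError on configs generated by an older import.
    node_id = config.get("node_id")
    if node_id is None:
        return False
    try:
        # literal_eval only accepts Python literals, unlike eval().
        provisioners = ast.literal_eval(provisioner_raw)
    except (ValueError, SyntaxError):
        return False
    if not isinstance(provisioners, (list, tuple, set)):
        return False
    return node_id in provisioners


print(is_provisioner({"node_id": "node-a"}, '["node-a"]'))
print(is_provisioner({}, '["node-a"]'))
```

With this check, a node whose configuration file lacks "node_id" skips the heavy_weight plugins instead of raising an unhandled exception in the read callback.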

Comment 9 Filip Balák 2019-03-04 16:26:39 UTC

In the current version, the way the collectd tendrl plugins refer to the provisioner node has changed. The reference in the /etc/collectd.d/tendrl_gluster.conf configuration file is no longer used; the value stored in etcd is used instead. --> VERIFIED
During testing, bz 1685153 was reported, related to the behaviour of the configuration file during update.

Tested with:
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch

Comment 11 errata-xmlrpc 2019-03-27 03:49:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0660
