Bug 1647910 - Provisioner tag in collectd claimed by more than one storage node
Summary: Provisioner tag in collectd claimed by more than one storage node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-node-agent
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 4
Assignee: gowtham
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks: 1656822
Reported: 2018-11-08 14:18 UTC by gowtham
Modified: 2019-03-27 03:51 UTC (History)

Fixed In Version: tendrl-node-agent-1.6.3-18.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-27 03:49:38 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | Tendrl commons issues 1066 | 0 | None | closed | Removing provisioner tag from collectd plugin configuration | 2020-01-30 15:50:17 UTC
Github | Tendrl node-agent issues 859 | 0 | None | closed | Collectd heavy weight plugins are executed by non-provisioner nodes also | 2020-01-30 15:50:17 UTC
Red Hat Bugzilla | 1685153 | 0 | unspecified | CLOSED | /etc/collectd.d/tendrl_gluster.conf file is changed only on provisioner node after update | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHBA-2019:0660 | 0 | None | None | None | 2019-03-27 03:51:10 UTC

Description gowtham 2018-11-08 14:18:37 UTC
Description of problem:
WA collectd heavy_weight plugins are meant to be executed on the provisioner node only. When the provisioner changes, the new node marks its provisioner tag as true and starts executing all heavy_weight plugins, but the old provisioner tag on the previous node is not cleared, so both nodes execute the heavy_weight plugins. The impact is that etcd read requests and carbon_cache metrics requests increase. Because the tag is never cleared on any node, after some time all nodes end up executing the heavy_weight plugins.
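
The accumulation described above can be illustrated with a toy model. This is only a sketch and not Tendrl code: the node names and both helper functions are made up for illustration, and the real tag lives in the collectd configuration on each node.

```python
# Toy model of the provisioner tag; not Tendrl code. Node names are made up.

def promote_without_clearing(tags, new_provisioner):
    """Behaviour reported in this bug: set the new tag, leave old ones set."""
    tags[new_provisioner] = True


def promote_and_clear(tags, new_provisioner):
    """Expected behaviour: exactly one node holds the tag afterwards."""
    for node in tags:
        tags[node] = (node == new_provisioner)


def tagged_nodes(tags):
    """Return the sorted list of nodes whose provisioner tag is set."""
    return sorted(node for node, is_set in tags.items() if is_set)


buggy = {"node1": True, "node2": False, "node3": False}
fixed = dict(buggy)

# The provisioner role moves to node2 and later to node3.
for new_provisioner in ("node2", "node3"):
    promote_without_clearing(buggy, new_provisioner)
    promote_and_clear(fixed, new_provisioner)

print(tagged_nodes(buggy))  # ['node1', 'node2', 'node3']
print(tagged_nodes(fixed))  # ['node3']
```

In the first case every node that was ever provisioner keeps running the heavy_weight plugins, which matches the growth in etcd and carbon_cache requests reported here.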

Version-Release number of selected component (if applicable):
tendrl-node-agent-1.6.3-11.el7rhgs

How reproducible:
After cluster import, open the file /etc/collectd.d/tendrl_gluster.conf on all the nodes ("vi /etc/collectd.d/tendrl_gluster.conf"). The provisioner tag is marked as true on one of the nodes. Then stop the tendrl-node-agent service and wait for 300 seconds. Then start the tendrl-node-agent service and check the file again on all nodes: the provisioner tag is now marked as true on more than one node.
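
The manual check of the file on every node could be scripted along these lines. This is a sketch under assumptions: the contents of /etc/collectd.d/tendrl_gluster.conf are assumed to have been collected from each node already, the tag is assumed to appear as a line of the form `provisioner True` or `provisioner False`, and the function names and sample data are hypothetical.

```python
import re

# Matches a line such as `provisioner False` inside the <Module> block.
PROVISIONER_RE = re.compile(r"^\s*provisioner\s+(True|False)\s*$", re.MULTILINE)


def claims_provisioner(conf_text):
    """Return True/False for the provisioner tag, or None if it is absent."""
    match = PROVISIONER_RE.search(conf_text)
    if match is None:
        return None
    return match.group(1) == "True"


def nodes_claiming_provisioner(conf_by_node):
    """Return the sorted names of nodes whose config has `provisioner True`."""
    return sorted(
        node for node, text in conf_by_node.items() if claims_provisioner(text)
    )


# Hypothetical sample data: file contents already collected from each node.
sample = {
    "node1": '<Module "tendrl_gluster">\n    provisioner True\n</Module>\n',
    "node2": '<Module "tendrl_gluster">\n    provisioner True\n</Module>\n',
    "node3": '<Module "tendrl_gluster">\n    provisioner False\n</Module>\n',
}

claimants = nodes_claiming_provisioner(sample)
print(claimants)  # ['node1', 'node2']
if len(claimants) > 1:
    print("provisioner tag is claimed by more than one node")
```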

Actual results:
WA heavy_weight collectd plugins are also executed by non-provisioner nodes

Expected results:
WA heavy_weight collectd plugins should be executed by the provisioner node only

Additional info:

Comment 2 gowtham 2018-12-03 07:55:14 UTC
PR is under review: https://github.com/Tendrl/node-agent/pull/857

Comment 3 gowtham 2018-12-03 08:25:40 UTC
https://github.com/Tendrl/commons/pull/1063

Comment 4 Nishanth Thomas 2018-12-13 10:08:47 UTC
This bug has been taken out of BU3

Comment 6 Daniel Horák 2019-02-15 12:26:18 UTC
There is one issue in the changes related to this bug.

In the file /usr/lib64/collectd/gluster/tendrl_gluster.py, line 256 catches
the exception etcd.KeyNotFound, but the etcd module doesn't have such an
exception (it should probably be etcd.EtcdKeyNotFound).

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  256             except (etcd.KeyNotFound, etcd.EtcdConnectionFailed, SyntaxError) as ex:
  257                 collectd.error('Failed to find provisioner node. Error %s' % str(ex))
  258                 continue
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This issue leads to the following error in the logs (and maybe to some other
consequences):
  
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Feb 15 17:51:15 node1 collectd: Unhandled python exception in read callback: AttributeError: 'module' object has no attribute 'KeyNotFound'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Spotted in tendrl-node-agent-1.6.3-17.el7rhgs.noarch

The particular line in upstream is slightly different because of some other
commit, but the issue is the same:
  https://github.com/Tendrl/node-agent/blob/master/tendrl/node_agent/monitoring/collectd/collectors/gluster/tendrl_gluster.py#L257
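
A side note on why this only shows up at runtime: Python evaluates the expression in an except clause only when an exception actually reaches it, so the wrong name goes unnoticed until etcd raises something. The sketch below demonstrates that with a stand-in object instead of the real etcd module, so that it runs without python-etcd installed. The stand-in and both function names are hypothetical, and the message text printed on Python 3 differs from the Python 2 message in the log above.

```python
import types

# Stand-in for the real `etcd` module so the sketch runs without python-etcd.
# It defines only the two exception names used in the corrected clause.
fake_etcd = types.SimpleNamespace(
    EtcdKeyNotFound=type("EtcdKeyNotFound", (Exception,), {}),
    EtcdConnectionFailed=type("EtcdConnectionFailed", (Exception,), {}),
)


def read_with_wrong_name(key_exists):
    """Mimics line 256: the except clause names a non-existent attribute."""
    try:
        if not key_exists:
            raise fake_etcd.EtcdKeyNotFound("key missing")
        return "ok"
    except (fake_etcd.KeyNotFound, fake_etcd.EtcdConnectionFailed) as ex:
        return "handled: %s" % ex


def read_with_right_name(key_exists):
    """Same logic with the exception name suggested in this comment."""
    try:
        if not key_exists:
            raise fake_etcd.EtcdKeyNotFound("key missing")
        return "ok"
    except (fake_etcd.EtcdKeyNotFound, fake_etcd.EtcdConnectionFailed) as ex:
        return "handled: %s" % ex


print(read_with_wrong_name(True))    # ok (the except clause is never evaluated)
try:
    read_with_wrong_name(False)
except AttributeError as err:
    print("AttributeError: %s" % err)
print(read_with_right_name(False))   # handled: key missing
```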

>> ASSIGNED

Comment 7 Daniel Horák 2019-02-15 13:33:25 UTC
There is another issue just a few lines above the previously mentioned one.

In the same file /usr/lib64/collectd/gluster/tendrl_gluster.py, line 253
contains the condition `CONFIG["node_id"] not in eval(provisioner)`. The problem
is that under some conditions the CONFIG object might not contain the key "node_id".

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  252                 if (
  253                     CONFIG["node_id"] not in eval(provisioner)
  254                 ):
  255                     continue
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This issue leads to the following error (under some conditions):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Unhandled python exception in read callback: KeyError: 'node_id'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
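
One way the check on line 253 could tolerate a missing key is sketched below. This is not the shipped fix. The assumptions are: `config` stands in for the plugin's CONFIG dict, the provisioner value is the string form of a list of node ids (as the eval() call suggests), and a node without "node_id" is treated as a non-provisioner. The function name is hypothetical, and ast.literal_eval is swapped in for eval() because it only parses Python literals.

```python
import ast


def is_provisioner(config, provisioner_raw):
    """Return True only if this node's id is in the provisioner list.

    Assumption: `provisioner_raw` is a string such as "['<node id>']".
    """
    node_id = config.get("node_id")  # None when the key is missing
    if node_id is None:
        return False
    try:
        # literal_eval replaces eval() here; the original code uses eval().
        provisioner_ids = ast.literal_eval(provisioner_raw)
    except (ValueError, SyntaxError):
        return False
    if not isinstance(provisioner_ids, (list, tuple, set)):
        return False
    return node_id in provisioner_ids


# Hypothetical configs: one without node_id, one with it.
old_config = {"peer_name": "node1"}
new_config = {"peer_name": "node1", "node_id": "abc"}

print(is_provisioner(old_config, "['abc']"))  # False
print(is_provisioner(new_config, "['abc']"))  # True
print(is_provisioner(new_config, "['xyz']"))  # False
```

Whether a node without "node_id" should silently skip the heavy_weight plugins or log an error is a design choice that this sketch does not settle.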

If I have identified the source of the data for the CONFIG object correctly, it
depends on the content of the /etc/collectd.d/tendrl_gluster.conf file, which on
some nodes doesn't contain the "node_id" key. On those nodes the file looks like this:

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # cat /etc/collectd.d/tendrl_gluster.conf
  <Plugin "python">
      ModulePath "/usr/lib64/collectd/gluster"
  
      Import "tendrl_gluster"
  
      <Module "tendrl_gluster">
          integration_id "ffe0d070-bdbd-4aee-b217-0049b6d68e41"
          graphite_host "rhsqa6.lab.eng.blr.redhat.com"
          graphite_port "2003"
          peer_name "rhocs-node1.lab.eng.blr.redhat.com"
          provisioner False
          etcd_host "rhsqa6.lab.eng.blr.redhat.com"
          etcd_port 2379
          
              etcd_ca_cert_file ""
              etcd_cert_file ""
              etcd_key_file ""
          
  
      </Module>
  </Plugin>
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There seems to be a difference between a freshly installed 3.4.4 cluster
and a cluster updated from a previous version (or, more precisely, it depends
on the version on which the cluster was imported).
* If the cluster was imported on the new (3.4.4) version, the tendrl_gluster.conf
  file on all storage nodes contains the key "node_id".
* If the cluster was imported on an older version and updated to the new one,
  the "node_id" key is in the tendrl_gluster.conf file on only one storage node.

Comment 8 gowtham 2019-02-19 05:09:11 UTC
In "tendrl-node-agent-1.6.3-17.el7rhgs" I added a new node_id variable to the collectd configuration file. The problem is that the collectd configuration file is generated during the import cluster flow, so it works fine for a newly created machine. In the upgrade scenario, however, the customer has already imported the cluster, so the "node_id" variable is not present and collectd fails.

Fixed in PR: https://github.com/Tendrl/node-agent/pull/874

Comment 9 Filip Balák 2019-03-04 16:26:39 UTC
In the current version, the way the collectd tendrl plugins refer to the provisioner node has changed. The reference in the /etc/collectd.d/tendrl_gluster.conf configuration file is no longer used; the value stored in etcd is used instead. --> VERIFIED
During testing, bz 1685153 was reported, which relates to the behaviour of the configuration file during update.

Tested with:
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch

Comment 11 errata-xmlrpc 2019-03-27 03:49:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0660

