Bug 1647910

Summary: Provisioner tag in collectd claimed by more than one storage node
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: gowtham <gshanmug>
Component: web-admin-tendrl-node-agent
Assignee: gowtham <gshanmug>
Status: CLOSED ERRATA
QA Contact: Filip Balák <fbalak>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.4
CC: dahorak, nthomas, rhs-bugs, sankarshan
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.z Batch Update 4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tendrl-node-agent-1.6.3-18.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-27 03:49:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1656822

Description gowtham 2018-11-08 14:18:37 UTC
Description of problem:
The WA collectd heavy_weight plugins are meant to run on the provisioner node only. When the provisioner changes, the new provisioner node sets its provisioner tag to true and starts executing all heavy_weight plugins, but the provisioner tag on the previous node is not cleared. As a result, both nodes execute the heavy_weight plugins, which increases the number of etcd read requests and carbon_cache metrics requests. Because the tag is never cleared on any node, after some time all nodes end up executing the heavy_weight plugins.
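
The accumulation described above can be illustrated with a toy model. This is only a sketch and not the tendrl code: the functions `switch_provisioner_buggy`, `switch_provisioner_fixed` and `claimants`, and the node names, are hypothetical and exist only for illustration.

```python
def switch_provisioner_buggy(tags, new_node):
    """Set the tag on the new provisioner without clearing the old one."""
    tags[new_node] = True


def switch_provisioner_fixed(tags, new_node):
    """Set the tag on the new provisioner and clear it everywhere else."""
    for node in tags:
        tags[node] = (node == new_node)


def claimants(tags):
    """Return the sorted names of nodes that currently claim the tag."""
    return sorted(node for node, is_set in tags.items() if is_set)


# Hypothetical three-node cluster, node1 is the initial provisioner.
tags = {"node1": True, "node2": False, "node3": False}
switch_provisioner_buggy(tags, "node2")
switch_provisioner_buggy(tags, "node3")
print(claimants(tags))  # every node now claims the tag

switch_provisioner_fixed(tags, "node3")
print(claimants(tags))  # only the current provisioner claims it
```

In the toy model each provisioner change adds one more claimant, which matches the observation that eventually all nodes run the heavy_weight plugins.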

Version-Release number of selected component (if applicable):
tendrl-node-agent-1.6.3-11.el7rhgs

How reproducible:
After cluster import, open the file with "vi /etc/collectd.d/tendrl_gluster.conf" on all the nodes. The provisioner tag is marked as true on one of the nodes. Then stop the tendrl-node-agent service and wait for 300 seconds. Then start the tendrl-node-agent service and check the file "vi /etc/collectd.d/tendrl_gluster.conf" again on all nodes: the provisioner tag is now marked as true on more than one node.
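
The manual check of the configuration files could be scripted along the following lines. This is a minimal sketch which assumes that the contents of /etc/collectd.d/tendrl_gluster.conf have already been collected from each node into a dictionary; the helper names `provisioner_flag` and `nodes_claiming_provisioner` and the sample contents are hypothetical.

```python
import re

# Matches a line such as `provisioner False` inside the <Module> block.
_PROVISIONER_RE = re.compile(r'^\s*provisioner\s+"?(\w+)"?\s*$', re.MULTILINE)


def provisioner_flag(conf_text):
    """Return True if the config text marks this node as provisioner."""
    match = _PROVISIONER_RE.search(conf_text)
    return bool(match) and match.group(1).lower() == "true"


def nodes_claiming_provisioner(conf_by_node):
    """Return the sorted names of nodes whose config sets provisioner to true."""
    return sorted(
        node for node, text in conf_by_node.items() if provisioner_flag(text)
    )


# Hypothetical input: file contents already fetched from each node.
confs = {
    "node1": '<Module "tendrl_gluster">\n    provisioner True\n</Module>\n',
    "node2": '<Module "tendrl_gluster">\n    provisioner False\n</Module>\n',
    "node3": '<Module "tendrl_gluster">\n    provisioner True\n</Module>\n',
}
print(nodes_claiming_provisioner(confs))
```

A result with more than one node name corresponds to the faulty state described in this report.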

Steps to Reproduce:
1.
2.
3.

Actual results:
WA heavy_weight collectd plugins are also executed by non-provisioner nodes

Expected results:
WA heavy_weight collectd plugins should be executed by the provisioner node only

Additional info:

Comment 2 gowtham 2018-12-03 07:55:14 UTC
PR is under review: https://github.com/Tendrl/node-agent/pull/857

Comment 3 gowtham 2018-12-03 08:25:40 UTC
https://github.com/Tendrl/commons/pull/1063

Comment 4 Nishanth Thomas 2018-12-13 10:08:47 UTC
This bug has been taken out of BU3.

Comment 6 Daniel Horák 2019-02-15 12:26:18 UTC
There is one issue in the changes related to this Bug.

In the file /usr/lib64/collectd/gluster/tendrl_gluster.py on line 256, the
except clause catches etcd.KeyNotFound, but the etcd module doesn't have such
an exception (it should probably be etcd.EtcdKeyNotFound).

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  256             except (etcd.KeyNotFound, etcd.EtcdConnectionFailed, SyntaxError) as ex:
  257                 collectd.error('Failed to find provisioner node. Error %s' % str(ex))
  258                 continue
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This issue leads to the following error in the logs (and maybe to some other
consequences):
  
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Feb 15 17:51:15 node1 collectd: Unhandled python exception in read callback: AttributeError: 'module' object has no attribute 'KeyNotFound'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Spotted in tendrl-node-agent-1.6.3-17.el7rhgs.noarch

The particular line in upstream is slightly different because of some other
commit, but the issue is the same:
  https://github.com/Tendrl/node-agent/blob/master/tendrl/node_agent/monitoring/collectd/collectors/gluster/tendrl_gluster.py#L257
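
The reason this surfaces as an AttributeError only at runtime is that Python evaluates the expression in an except clause lazily, when an exception actually reaches it. The following sketch shows this with a stand-in module, so it runs without python-etcd installed; `fake_etcd`, `read_provisioner`, `buggy` and `fixed` are hypothetical names, and the exact wording of the AttributeError message depends on the Python version.

```python
import types

# Hypothetical stand-in for the etcd module; it defines EtcdKeyNotFound
# but, like the real module according to this report, no KeyNotFound.
fake_etcd = types.ModuleType("fake_etcd")


class EtcdKeyNotFound(Exception):
    pass


fake_etcd.EtcdKeyNotFound = EtcdKeyNotFound


def read_provisioner(fail):
    """Pretend to read the provisioner key; raise when `fail` is true."""
    if fail:
        raise fake_etcd.EtcdKeyNotFound("key not found")
    return "['node-1']"


def buggy(fail):
    try:
        return read_provisioner(fail)
    # The attribute lookup below only happens once an exception is raised.
    except (fake_etcd.KeyNotFound, SyntaxError):
        return None


def fixed(fail):
    try:
        return read_provisioner(fail)
    except (fake_etcd.EtcdKeyNotFound, SyntaxError):
        return None


print(buggy(False))  # no exception raised, the bad name is never evaluated
print(fixed(True))   # the exception is caught as intended
```

Calling `buggy(True)` in this sketch raises AttributeError instead of handling the missing key, which is the same pattern as the log message above.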

>> ASSIGNED

Comment 7 Daniel Horák 2019-02-15 13:33:25 UTC
There is another issue just a few lines above the previously mentioned one.

In the same file /usr/lib64/collectd/gluster/tendrl_gluster.py on line 253,
there is the condition `CONFIG["node_id"] not in eval(provisioner)`. The problem
is that under some conditions the CONFIG object might not contain the key "node_id".

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  252                 if (
  253                     CONFIG["node_id"] not in eval(provisioner)
  254                 ):
  255                     continue
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This issue leads to the following error (under some conditions):
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Unhandled python exception in read callback: KeyError: 'node_id'
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If I traced the source of the data for the CONFIG object correctly, it depends on
the content of the /etc/collectd.d/tendrl_gluster.conf file, which on some nodes
doesn't contain the "node_id" key. On such a node the file looks like this:

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # cat /etc/collectd.d/tendrl_gluster.conf
  <Plugin "python">
      ModulePath "/usr/lib64/collectd/gluster"
  
      Import "tendrl_gluster"
  
      <Module "tendrl_gluster">
          integration_id "ffe0d070-bdbd-4aee-b217-0049b6d68e41"
          graphite_host "rhsqa6.lab.eng.blr.redhat.com"
          graphite_port "2003"
          peer_name "rhocs-node1.lab.eng.blr.redhat.com"
          provisioner False
          etcd_host "rhsqa6.lab.eng.blr.redhat.com"
          etcd_port 2379
          
              etcd_ca_cert_file ""
              etcd_cert_file ""
              etcd_key_file ""
          
  
      </Module>
  </Plugin>
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It seems there is a difference between a freshly installed 3.4.4 cluster
and a cluster updated from a previous version (or, more precisely, it depends
on which version the cluster was imported on).
* If the cluster was imported on the new (3.4.4) version, the tendrl_gluster.conf
  file on all storage nodes contains the key "node_id".
* If the cluster was imported on an older version and updated to the new one,
  the "node_id" key is in the tendrl_gluster.conf file on only one storage node.

Comment 8 gowtham 2019-02-19 05:09:11 UTC
In "tendrl-node-agent-1.6.3-17.el7rhgs", I added a new node_id variable to the collectd configuration file. The problem is that the collectd configuration file is generated during the import cluster flow, so it works fine on a newly created machine. In the upgrade scenario, however, the customer has already imported the cluster, so the "node_id" variable is not present and collectd fails.

Fixed in PR: https://github.com/Tendrl/node-agent/pull/874
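
For illustration only, a check that tolerates a missing "node_id" key might look like the sketch below. It is not taken from the PR: the function name `is_provisioner` is hypothetical, and `ast.literal_eval` is swapped in for the `eval` call quoted in comment 7 on the assumption that the provisioner value is a string holding a Python list literal.

```python
import ast


def is_provisioner(config, provisioner_value):
    """Return True only if config has a node_id listed in provisioner_value."""
    node_id = config.get("node_id")  # None when the key is absent (upgrade case)
    if node_id is None:
        return False
    try:
        # Parse the literal without executing arbitrary code.
        provisioner_ids = ast.literal_eval(provisioner_value)
    except (ValueError, SyntaxError):
        return False
    if not isinstance(provisioner_ids, (list, tuple, set)):
        return False
    return node_id in provisioner_ids


# Config generated before the upgrade: no node_id, so no KeyError is raised.
print(is_provisioner({"provisioner": False}, "['abc-123']"))
# Config generated by a fresh import: node_id present and listed.
print(is_provisioner({"node_id": "abc-123"}, "['abc-123']"))
```

With `dict.get` the missing key yields a plain "not provisioner" answer rather than an unhandled KeyError in the read callback.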

Comment 9 Filip Balák 2019-03-04 16:26:39 UTC
In the current version, the way the collectd tendrl plugins refer to the provisioner node has changed. The reference in the /etc/collectd.d/tendrl_gluster.conf configuration file is no longer used; the value stored in etcd is used instead. --> VERIFIED
During testing, bz 1685153 was reported; it relates to the behaviour of the configuration file during update.

Tested with:
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch

Comment 11 errata-xmlrpc 2019-03-27 03:49:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0660