Description of problem:

During the import cluster flow, WA configures collectd and executes the low weight collectd plugins (/usr/lib64/collectd/gluster/low_weight/) on each storage node. Because of a logic problem, each low weight plugin is executed twice in each sync. The impact of this problem is:

1. Resource consumption on the storage node side is high.
2. The number of etcd requests is high.
3. The number of metrics pushed into carbon-cache is also high.

Version-Release number of selected component (if applicable):

tendrl-node-agent-1.6.3-11.el7rhgs

How reproducible:

The reproducer needs a small code change. Put the code below after this line:
https://github.com/Tendrl/node-agent/blob/master/tendrl/node_agent/monitoring/collectd/collectors/gluster/tendrl_gluster.py#L213

with open("file_path", "a") as f:
    f.write(str(TendrlGlusterfsMonitoringBase.plugins))

You can see that both the low_weight and the heavy_weight pass contain all the low_weight plugins, although low weight plugins should not appear under heavy weight. Because of this, redundant threads for each low_weight plugin are started twice. The problem is that TendrlGlusterfsMonitoringBase.plugins should be cleared after the low_weight plugin execution; otherwise, the heavy_weight plugins are appended on top of the list and all plugins are executed again.

Steps to Reproduce:
1.
2.
3.

Actual results:
The low_weight collectd plugins are executed twice in each sync.

Expected results:
Each plugin should be executed only once per sync.

Additional info:
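The accumulation described above can be illustrated with a minimal, self-contained simulation. This is not the actual tendrl_gluster.py code; the class and plugin names are stand-ins that mirror its structure, and the commented-out fix reflects the clearing suggested in the description.

```python
# Minimal sketch of the accumulation bug (hypothetical names that
# mirror tendrl_gluster.py; a simulation, not the real plugin code).
class MonitoringBase(object):
    plugins = []  # class-level list shared across all sync passes


def load_plugins(names):
    # Each sync pass appends its plugins without clearing the list first.
    for name in names:
        MonitoringBase.plugins.append(name)


executed = []


def read_callback(names):
    load_plugins(names)
    # Iterates over the whole accumulated list, so plugins loaded by an
    # earlier pass run again here.
    for plugin in MonitoringBase.plugins:
        executed.append(plugin)
    # Suggested fix per the description: clear the list after execution
    # so the next pass only runs its own plugins.
    # MonitoringBase.plugins = []


# First the low_weight pass, then the heavy_weight pass, as in one sync:
read_callback(["low_weight.brick_utilization", "low_weight.health_counters"])
read_callback(["heavy_weight.profile_info"])

# Without the clearing, each low_weight plugin was executed twice:
print(executed.count("low_weight.brick_utilization"))  # 2
print(executed.count("heavy_weight.profile_info"))     # 1
```

Uncommenting the clearing line makes every plugin run exactly once per sync.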
PR is under review: https://github.com/Tendrl/node-agent/pull/856
How could we verify this? What should we measure on etcd and carbon-cache side?
This can be verified only by comparing CPU and memory utilization of the storage nodes, and carbon-cache utilization on the server node, between machines that have the old tendrl packages and machines that have the new tendrl packages.
(In reply to gowtham from comment #4)
> This is possible only via comparing CPU and memory utilization of storage
> nodes and carbon-cache utilization on the server node with the machines
> which have an old tendrl packages and new tendrl packages,

What is the range of the expected improvement? Is it 10%? 50%? To be honest, I would rather understand what data to check in the graphite/carbon database.
This bug is taken out from BU3
I've performed some basic performance measurements comparing the previous and new versions on storage nodes with 4 vCPUs, 8 GB RAM, and a higher number of storage devices (24, divided into ~160 partitions), bricks (55) and Gluster volumes (33). I've imported the cluster into WA and let it run for 2 days.

On the older version:
* the average load was between 2 and 2.5,
* CPU utilization of the collectd service was around 6%.

On the new version:
* the average load was around 0.6,
* CPU utilization of the collectd service was around 2.3%.

Note: the higher load was mainly because of the tendrl-gluster-integration service, see bug 1637977 comment 9.

Version-Release number of selected component:

Previous version:
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Red Hat Gluster Storage Server 3.4
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-15.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-13.el7rhgs.noarch
tendrl-node-agent-1.6.3-15.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch

New version:
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Red Hat Gluster Storage Server 3.4
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
I've also tried the case suggested in comment 0, only with the module "q" instead of a manually created log file. The module can be easily installed via the command `easy_install q`, as suggested in the doc[1]. The changes are as follows (for the new version the line numbers are slightly different):

# diff -c /usr/lib64/collectd/gluster/tendrl_gluster.py_ORIGINAL /usr/lib64/collectd/gluster/tendrl_gluster.py
*** /usr/lib64/collectd/gluster/tendrl_gluster.py_ORIGINAL	2019-03-18 09:41:58.342077042 +0100
--- /usr/lib64/collectd/gluster/tendrl_gluster.py	2019-03-18 09:42:05.873028178 +0100
***************
*** 211,217 ****
--- 211,220 ----
  def read_callback(pkg_path, pkg):
      global threads
      load_plugins(pkg_path, pkg)
+     import q
+     q(TendrlGlusterfsMonitoringBase.plugins)
      for gfsmon_plugin in TendrlGlusterfsMonitoringBase.plugins:
+         q(gfsmon_plugin)
          # If the plugin is marked as provisioner_only_plugin(heavy-weight)
          # Execute such a plugin only on the current node if and only if the
          # the current node is marked as provisioner in collectd's conf file.

Then, after restarting the collectd service, I've checked the created /tmp/q file.
The output for one interval is as follows.

For OLD version:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
62.9s read_callback: TendrlGlusterfsMonitoringBase.plugins=[<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>, <low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>]
62.9s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
62.9s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.3s read_callback: TendrlGlusterfsMonitoringBase.plugins=[<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>, <low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>, <heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>]
64.3s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
64.3s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.3s read_callback: gfsmon_plugin=<heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For NEW version:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
63.0s read_callback: TendrlGlusterfsMonitoringBase.plugins=[<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>, <low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>]
63.0s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
63.0s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.4s read_callback: TendrlGlusterfsMonitoringBase.plugins=[<heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>]
64.4s read_callback: gfsmon_plugin=<heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As visible from the log output above, on the older version each low_weight plugin is executed twice (first together with the other low_weight plugins and then again together with the heavy_weight plugin).

Verifying this bug based on the observation above and also based on comment 8.

[1] https://pypi.org/project/q/

>> VERIFIED
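For anyone repeating this verification, the per-interval execution counts can be extracted from the /tmp/q output mechanically rather than by eye. This is a hedged sketch: the sample lines below mimic the log format shown above, and the regular expression assumes the `gfsmon_plugin=<... object>` shape produced by the q calls.

```python
# Count how many times each plugin ran in one interval, based on
# sample lines mimicking the /tmp/q output shown above.
import re
from collections import Counter

sample = """\
62.9s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
62.9s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.3s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
64.3s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.3s read_callback: gfsmon_plugin=<heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>
"""

counts = Counter(
    m.group(1)
    for m in re.finditer(r"gfsmon_plugin=<(\S+) object>", sample)
)

# On the fixed version every plugin should show a count of 1 per interval;
# a count of 2 for a low_weight plugin indicates the bug is still present.
for plugin, n in sorted(counts.items()):
    print(plugin, n)
```

Running the same scan over a real /tmp/q file (e.g. reading it instead of the `sample` string) gives a quick pass/fail signal per interval.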
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0660