Description of problem:

During the import cluster flow, WA configures collectd and executes the low weight collectd plugins (/usr/lib64/collectd/gluster/low_weight/) on each storage node. Because of a logic problem, each low weight plugin is executed twice in each sync. The impact of this problem is:

1. Resource consumption on the storage node side is high.
2. The number of etcd requests is high.
3. The number of metrics pushed into carbon-cache is also high.

Version-Release number of selected component (if applicable):

tendrl-node-agent-1.6.3-11.el7rhgs

How reproducible:

The reproducer needs a small code change. Put the code below after this line:
https://github.com/Tendrl/node-agent/blob/master/tendrl/node_agent/monitoring/collectd/collectors/gluster/tendrl_gluster.py#L213

with open("file_path", "a") as f:
    f.write(str(TendrlGlusterfsMonitoringBase.plugins))

You can see that both the low_weight and the heavy_weight pass contain all the low_weight plugins, although low weight plugins should not appear under heavy weight. Because of this, redundant threads for each low_weight plugin are started twice. The problem is that TendrlGlusterfsMonitoringBase.plugins should be cleared after the low_weight plugin execution; otherwise, the heavy_weight plugins are appended on top of the list and all plugins are executed again.

Steps to Reproduce:
1.
2.
3.

Actual results:
The low_weight collectd plugins are executed twice in each sync.

Expected results:
Each plugin should be executed only once per sync.

Additional info:
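The accumulation described above can be illustrated with a minimal, self-contained simulation. This is not the actual tendrl_gluster.py code; the class and plugin names are stand-ins that mirror its structure, and the commented-out fix reflects the clearing suggested in the description.

```python
# Minimal sketch of the accumulation bug (hypothetical names that
# mirror tendrl_gluster.py; a simulation, not the real plugin code).
class MonitoringBase(object):
    plugins = []  # class-level list shared across all sync passes


def load_plugins(names):
    # Each sync pass appends its plugins without clearing the list first.
    for name in names:
        MonitoringBase.plugins.append(name)


executed = []


def read_callback(names):
    load_plugins(names)
    # Iterates over the whole accumulated list, so plugins loaded by an
    # earlier pass run again here.
    for plugin in MonitoringBase.plugins:
        executed.append(plugin)
    # Suggested fix per the description: clear the list after execution
    # so the next pass only runs its own plugins.
    # MonitoringBase.plugins = []


# First the low_weight pass, then the heavy_weight pass, as in one sync:
read_callback(["low_weight.brick_utilization", "low_weight.health_counters"])
read_callback(["heavy_weight.profile_info"])

# Without the clearing, each low_weight plugin was executed twice:
print(executed.count("low_weight.brick_utilization"))  # 2
print(executed.count("heavy_weight.profile_info"))     # 1
```

Uncommenting the clearing line makes every plugin run exactly once per sync.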
PR is under review: https://github.com/Tendrl/node-agent/pull/856
How could we verify this? What should we measure on etcd and carbon-cache side?
This can be verified only by comparing CPU and memory utilization of the storage nodes, and carbon-cache utilization on the server node, between machines that have the old tendrl packages and machines that have the new tendrl packages.
(In reply to gowtham from comment #4)
> This is possible only via comparing CPU and memory utilization of storage
> nodes and carbon-cache utilization on the server node with the machines
> which have an old tendrl packages and new tendrl packages,

What is the range of the expected improvement? Is it 10%? 50%? To be honest, I would rather understand what data to check in the graphite/carbon database.
This bug is taken out from BU3
I've performed some basic performance measurements comparing the previous and new versions on storage nodes with 4 vCPUs, 8 GB RAM, and a higher number of storage devices (24, divided into ~160 partitions), bricks (55) and Gluster volumes (33). I've imported the cluster into WA and let it run for 2 days.

On the older version:
* the average load was between 2 and 2.5,
* CPU utilization of the collectd service was around 6%.

On the new version:
* the average load was around 0.6,
* CPU utilization of the collectd service was around 2.3%.

Note: the higher load was mainly because of the tendrl-gluster-integration service, see bug 1637977 comment 9.

Version-Release number of selected component:

Previous version:
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Red Hat Gluster Storage Server 3.4
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-15.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-13.el7rhgs.noarch
tendrl-node-agent-1.6.3-15.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch

New version:
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Red Hat Gluster Storage Server 3.4
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
I've also tried the case suggested in comment 0, only with the module "q" instead of a manually created log file. The module can be easily installed via the command `easy_install q`, as suggested in the doc[1]. The changes are as follows (for the new version the line numbers are slightly different):

# diff -c /usr/lib64/collectd/gluster/tendrl_gluster.py_ORIGINAL /usr/lib64/collectd/gluster/tendrl_gluster.py
*** /usr/lib64/collectd/gluster/tendrl_gluster.py_ORIGINAL	2019-03-18 09:41:58.342077042 +0100
--- /usr/lib64/collectd/gluster/tendrl_gluster.py	2019-03-18 09:42:05.873028178 +0100
***************
*** 211,217 ****
--- 211,220 ----
  def read_callback(pkg_path, pkg):
      global threads
      load_plugins(pkg_path, pkg)
+     import q
+     q(TendrlGlusterfsMonitoringBase.plugins)
      for gfsmon_plugin in TendrlGlusterfsMonitoringBase.plugins:
+         q(gfsmon_plugin)
          # If the plugin is marked as provisioner_only_plugin(heavy-weight)
          # Execute such a plugin only on the current node if and only if the
          # the current node is marked as provisioner in collectd's conf file.

Then, after restarting the collectd service, I've checked the created /tmp/q file.
The output for one interval is as follows.

For OLD version:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
62.9s read_callback: TendrlGlusterfsMonitoringBase.plugins=[<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>, <low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>]
62.9s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
62.9s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.3s read_callback: TendrlGlusterfsMonitoringBase.plugins=[<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>, <low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>, <heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>]
64.3s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
64.3s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.3s read_callback: gfsmon_plugin=<heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For NEW version:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
63.0s read_callback: TendrlGlusterfsMonitoringBase.plugins=[<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>, <low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>]
63.0s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
63.0s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.4s read_callback: TendrlGlusterfsMonitoringBase.plugins=[<heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>]
64.4s read_callback: gfsmon_plugin=<heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As visible from the log output above, on the older version each low_weight plugin is executed twice (first together with the other low_weight plugins and then again together with the heavy_weight plugin).

Verifying this bug based on the observation above and also based on comment 8.

[1] https://pypi.org/project/q/

>> VERIFIED
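For anyone repeating this verification, the per-interval execution counts can be extracted from the /tmp/q output mechanically rather than by eye. This is a hedged sketch: the sample lines below mimic the log format shown above, and the regular expression assumes the `gfsmon_plugin=<... object>` shape produced by the q calls.

```python
# Count how many times each plugin ran in one interval, based on
# sample lines mimicking the /tmp/q output shown above.
import re
from collections import Counter

sample = """\
62.9s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
62.9s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.3s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_brick_utilization.TendrlBrickUtilizationPlugin object>
64.3s read_callback: gfsmon_plugin=<low_weight.tendrl_glusterfs_health_counters.TendrlGlusterfsHealthCounters object>
64.3s read_callback: gfsmon_plugin=<heavy_weight.tendrl_glusterfs_profile_info.TendrlHealInfoAndProfileInfoPlugin object>
"""

counts = Counter(
    m.group(1)
    for m in re.finditer(r"gfsmon_plugin=<(\S+) object>", sample)
)

# On the fixed version every plugin should show a count of 1 per interval;
# a count of 2 for a low_weight plugin indicates the bug is still present.
for plugin, n in sorted(counts.items()):
    print(plugin, n)
```

Running the same scan over a real /tmp/q file (e.g. reading it instead of the `sample` string) gives a quick pass/fail signal per interval.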
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0660