Description of problem:

Daemon tendrl-node-agent has quite high CPU usage on a longer-running cluster.

Version-Release number of selected component (if applicable):

# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.1-1.el7rhgs.noarch
tendrl-gluster-integration-1.6.1-1.el7rhgs.noarch
tendrl-node-agent-1.6.1-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install, configure, and import a Gluster cluster into RHGS WA (Tendrl).
2. Let it run for a few days and monitor the CPU usage of the Tendrl components (on the Gluster storage nodes and also on the RHGS WA server), e.g. with the one-liner below or the sampling loop sketched after this report:

# ps -p $(echo $(ps aux | grep [t]endrl-node-agent | awk '{print $2}') | sed 's/ /,/') -o %cpu,%mem,cmd -h

(the first number is %CPU usage, the second number is %MEM usage)

Actual results:
After a few days (maybe earlier), tendrl-node-agent consumes quite a high percentage of CPU. On a machine with 2 vCPUs it is close to 50%, on a machine with 4 vCPUs it is close to 25%. This means that it constantly fully utilizes one CPU core.

Expected results:
What is the expected CPU utilization of the Tendrl components?

Additional info:
Please check the CPU utilization of the other Tendrl components as well - for example, tendrl-gluster-integration has a similar "problem".
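To track the growth over time rather than taking one-off snapshots, a minimal sampling loop can be left running on a node. This is only a sketch: the log path, the 10-minute interval, and the use of pgrep are illustrative choices, not part of the original reproduction steps.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Append a timestamped CPU/MEM reading for the daemon every 10 minutes;
# pgrep -d, joins multiple matching PIDs with commas, as ps -p expects.
while true; do
    printf '%s %s\n' "$(date -Is)" \
        "$(ps -o %cpu,%mem,cmd -h -p "$(pgrep -d, -f tendrl-node-agent)")" \
        >> /tmp/tendrl-node-agent-cpu.log
    sleep 600
done
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~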
Output from a Gluster storage server with 2 vCPUs running for 1 day:

%CPU %MEM
17.1  2.9 /usr/bin/python /usr/bin/tendrl-node-agent

Output from a Gluster storage server with 2 vCPUs running for 12 days:

%CPU %MEM
42.4  3.1 /usr/bin/python /usr/bin/tendrl-node-agent

And output from the RHGS WA server with 4 vCPUs running for 1 day:

%CPU %MEM
22.1  0.3 /usr/bin/python /usr/bin/tendrl-node-agent

The overall system load is also quite high, despite the fact that there is no data load on the Gluster volumes and no other tasks are being performed.
What is the value of the config option "sync_interval" in /etc/tendrl/node-agent/node-agent.conf.yaml?
We didn't update 'sync_interval', so it contains the default value:

# grep sync_interval /etc/tendrl/node-agent/node-agent.conf.yaml
sync_interval: 60
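If the sync interval is suspected as the cause, one way to experiment is to raise it and watch whether the CPU usage drops accordingly. A minimal sketch, assuming the agent runs as a systemd unit named tendrl-node-agent and that a 120-second interval (an arbitrary value, doubled from the default) is acceptable for monitoring granularity:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Raise sync_interval from the default 60 to 120 seconds, keeping any
# indentation intact, then restart the agent to pick up the change.
sed -i 's/^\(\s*sync_interval:\).*/\1 120/' \
    /etc/tendrl/node-agent/node-agent.conf.yaml
systemctl restart tendrl-node-agent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~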
@Daniel, can you re-check this with the latest build?
At first look it seems to be OK, but I'll have to keep the cluster running for a few days to be sure. I'll post an update early next week.
There seems to be a noticeable improvement between the last two versions: tendrl-node-agent-1.6.3-2.el7rhgs.noarch and tendrl-node-agent-1.6.3-3.el7rhgs.noarch. With the newer version, the CPU usage of tendrl-node-agent is below 3% on a cluster running for 2 days.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ps aux | grep -E "[t]endrl-node-agent" | awk "{print \$2}" | sed "s/ /,/" | xargs -n1 ps -o %cpu,%mem,cmd -h -p
2.8 0.8 /usr/bin/python /usr/bin/tendrl-node-agent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'll watch it further on more cluster variants and send another status update in a few days.
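One caveat when reading these numbers: ps reports %CPU averaged over the whole lifetime of the process, so a long-running daemon can keep showing an inflated value after a fix until it is restarted. For an instantaneous reading, something like pidstat from the sysstat package can be used instead (a sketch; sysstat being installed on the node is an assumption):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Five one-minute samples of current CPU usage, plus an average line at
# the end; pgrep -d, joins matching PIDs with commas for pidstat -p.
pidstat -u -p "$(pgrep -d, -f tendrl-node-agent)" 60 5
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~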
I have reduced the CPU usage percentage in the latest release: https://github.com/Tendrl/commons/pull/931. Please verify whether it is still happening.
(In reply to Nishanth Thomas from comment #6)
> @Daniel, can you re-check this with the latest build?

On a cluster running for (nearly) 5 days, the CPU usage of the tendrl-node-agent service is still below 3%, so I can confirm that it is fixed in the latest builds (tendrl-node-agent-1.6.3-3.el7rhgs.noarch).
Tested and verified on a few clusters with various configurations, for example:

* a cluster with 6 storage nodes, running for 6 days, with 2 vCPUs and 8 GB RAM per storage node
* a cluster with 24 storage nodes, running for 6 days, with 4 vCPUs and 6 GB RAM per storage node

The tendrl-node-agent CPU utilization is around 1-3%, for example (the first value is CPU utilization, the second value is memory utilization):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ ansible -i ci-usm4-gluster.hosts gluster_servers:tendrl_server -m shell \
    -a 'ps aux | grep -E "[t]endrl-node-agent" | \
        awk "{print \$2}" | sed "s/ /,/" | \
        xargs -n1 ps -o %cpu,%mem,cmd -h -p'
ci-usm4-gl2.usmqe.example.com | SUCCESS | rc=0 >>
2.0 0.9 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl5.usmqe.example.com | SUCCESS | rc=0 >>
1.8 0.8 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl1.usmqe.example.com | SUCCESS | rc=0 >>
2.0 0.9 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl3.usmqe.example.com | SUCCESS | rc=0 >>
1.8 0.8 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl4.usmqe.example.com | SUCCESS | rc=0 >>
1.8 0.9 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl6.usmqe.example.com | SUCCESS | rc=0 >>
1.8 0.8 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-server.usmqe.example.com | SUCCESS | rc=0 >>
1.7 0.4 /usr/bin/python /usr/bin/tendrl-node-agent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected component:

Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-api-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-cli-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-events-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-fuse-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-libs-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-rdma-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-server-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
python2-gluster-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616