Description of problem:

Daemon tendrl-node-agent has quite high CPU usage on a longer-running cluster.

Version-Release number of selected component (if applicable):

# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.1-1.el7rhgs.noarch
tendrl-gluster-integration-1.6.1-1.el7rhgs.noarch
tendrl-node-agent-1.6.1-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install, configure, and import a Gluster cluster into RHGS WA (Tendrl).
2. Let it run for a few days and monitor the CPU usage of the Tendrl components (on the Gluster storage nodes and also on the RHGS WA server), e.g. with the one-liner below or the sampling loop sketched after this report:

# ps -p $(echo $(ps aux | grep [t]endrl-node-agent | awk '{print $2}') | sed 's/ /,/') -o %cpu,%mem,cmd -h

(the first number is %CPU usage, the second number is %MEM usage)

Actual results:
After a few days (maybe earlier), tendrl-node-agent consumes quite a high percentage of CPU. On a machine with 2 vCPUs it is close to 50%, on a machine with 4 vCPUs it is close to 25%. This means that it constantly fully utilizes one CPU core.

Expected results:
What is the expected CPU utilization of the Tendrl components?

Additional info:
Please check the CPU utilization of the other Tendrl components as well - for example, tendrl-gluster-integration has a similar "problem".
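To track the growth over time rather than taking one-off snapshots, a minimal sampling loop can be left running on a node. This is only a sketch: the log path, the 10-minute interval, and the use of pgrep are illustrative choices, not part of the original reproduction steps.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Append a timestamped CPU/MEM reading for the daemon every 10 minutes;
# pgrep -d, joins multiple matching PIDs with commas, as ps -p expects.
while true; do
    printf '%s %s\n' "$(date -Is)" \
        "$(ps -o %cpu,%mem,cmd -h -p "$(pgrep -d, -f tendrl-node-agent)")" \
        >> /tmp/tendrl-node-agent-cpu.log
    sleep 600
done
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~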
Output from a Gluster storage server with 2 vCPUs running for 1 day:

%CPU %MEM
17.1  2.9 /usr/bin/python /usr/bin/tendrl-node-agent

Output from a Gluster storage server with 2 vCPUs running for 12 days:

%CPU %MEM
42.4  3.1 /usr/bin/python /usr/bin/tendrl-node-agent

And output from the RHGS WA server with 4 vCPUs running for 1 day:

%CPU %MEM
22.1  0.3 /usr/bin/python /usr/bin/tendrl-node-agent

The overall system load is also quite high, despite the fact that there is no data load on the Gluster volumes and no other tasks are being performed.
What is the value of the config option "sync_interval" in /etc/tendrl/node-agent/node-agent.conf.yaml?
We didn't update 'sync_interval', so it contains the default value:

# grep sync_interval /etc/tendrl/node-agent/node-agent.conf.yaml
sync_interval: 60
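If the sync interval is suspected as the cause, one way to experiment is to raise it and watch whether the CPU usage drops accordingly. A minimal sketch, assuming the agent runs as a systemd unit named tendrl-node-agent and that a 120-second interval (an arbitrary value, doubled from the default) is acceptable for monitoring granularity:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Raise sync_interval from the default 60 to 120 seconds, keeping any
# indentation intact, then restart the agent to pick up the change.
sed -i 's/^\(\s*sync_interval:\).*/\1 120/' \
    /etc/tendrl/node-agent/node-agent.conf.yaml
systemctl restart tendrl-node-agent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~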
@Daniel, can you re-check this with the latest build?
At first look it seems to be OK, but I'll have to keep the cluster running for a few days to be sure. I'll post an update early next week.
There seems to be a noticeable improvement between the last two versions: tendrl-node-agent-1.6.3-2.el7rhgs.noarch and tendrl-node-agent-1.6.3-3.el7rhgs.noarch. With the newer version, the CPU usage of tendrl-node-agent is below 3% on a cluster running for 2 days.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ps aux | grep -E "[t]endrl-node-agent" | awk "{print \$2}" | sed "s/ /,/" | xargs -n1 ps -o %cpu,%mem,cmd -h -p
2.8 0.8 /usr/bin/python /usr/bin/tendrl-node-agent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'll watch it further on more cluster variants and send another status update in a few days.
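One caveat when reading these numbers: ps reports %CPU averaged over the whole lifetime of the process, so a long-running daemon can keep showing an inflated value after a fix until it is restarted. For an instantaneous reading, something like pidstat from the sysstat package can be used instead (a sketch; sysstat being installed on the node is an assumption):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Five one-minute samples of current CPU usage, plus an average line at
# the end; pgrep -d, joins matching PIDs with commas for pidstat -p.
pidstat -u -p "$(pgrep -d, -f tendrl-node-agent)" 60 5
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~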
I have reduced the CPU usage percentage in the latest release: https://github.com/Tendrl/commons/pull/931. Please verify whether it is still happening.
(In reply to Nishanth Thomas from comment #6)
> @Daniel, can you re-check this with the latest build?

On a cluster running for (nearly) 5 days, the CPU usage of the tendrl-node-agent service is still below 3%, so I can confirm that it is fixed in the latest builds (tendrl-node-agent-1.6.3-3.el7rhgs.noarch).
Tested and verified on a few clusters with various configurations, for example:

* a cluster with 6 storage nodes, running for 6 days, with 2 vCPUs and 8 GB RAM per storage node
* a cluster with 24 storage nodes, running for 6 days, with 4 vCPUs and 6 GB RAM per storage node

The tendrl-node-agent CPU utilization is around 1-3%, for example (the first value is CPU utilization, the second value is memory utilization):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ ansible -i ci-usm4-gluster.hosts gluster_servers:tendrl_server -m shell \
    -a 'ps aux | grep -E "[t]endrl-node-agent" | \
        awk "{print \$2}" | sed "s/ /,/" | \
        xargs -n1 ps -o %cpu,%mem,cmd -h -p'
ci-usm4-gl2.usmqe.example.com | SUCCESS | rc=0 >>
2.0 0.9 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl5.usmqe.example.com | SUCCESS | rc=0 >>
1.8 0.8 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl1.usmqe.example.com | SUCCESS | rc=0 >>
2.0 0.9 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl3.usmqe.example.com | SUCCESS | rc=0 >>
1.8 0.8 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl4.usmqe.example.com | SUCCESS | rc=0 >>
1.8 0.9 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-gl6.usmqe.example.com | SUCCESS | rc=0 >>
1.8 0.8 /usr/bin/python /usr/bin/tendrl-node-agent
ci-usm4-server.usmqe.example.com | SUCCESS | rc=0 >>
1.7 0.4 /usr/bin/python /usr/bin/tendrl-node-agent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected component:

Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-api-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-cli-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-events-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-fuse-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-libs-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-rdma-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
glusterfs-server-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
python2-gluster-3.12.2-8.6.gite12fa69.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616