Bug 1692440

Summary: Neutron SRIOV agent loop iteration exceeding polling interval, causing high CPU usage in neutron-rootwrap-daemon
Product: Red Hat OpenStack Reporter: Bernard Cafarelli <bcafarel>
Component: openstack-tripleo-heat-templatesAssignee: Rodolfo Alonso <ralonsoh>
Status: CLOSED ERRATA QA Contact: Roee Agiman <ragiman>
Severity: high Docs Contact:
Priority: high    
Version: 13.0 (Queens)CC: bdeschen, beagles, cfontain, ekuris, fbaudin, jjoyce, jschluet, mburns, mmethot, njohnston, ragiman, slinaber, tvignaud
Target Milestone: ---Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.3.1-35.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-07-10 13:03:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Bernard Cafarelli 2019-03-25 15:43:05 UTC
On a 13z5 deployment with SRIOV, the neutron SRIOV agent on compute node will keep appearing in CPU usage via the rootwrap calls to "ip link". Sample log:
2019-03-25 13:09:02.345 59675 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-1ec83b30-7aa4-440f-91eb-26ad1b84561f - - - - -] Loop iteration exceeded interval (2 vs. 2.27140307426)! daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:412
2019-03-25 13:09:02.345 59675 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-1ec83b30-7aa4-440f-91eb-26ad1b84561f - - - - -] Agent rpc_loop - iteration:2 started daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:374
2019-03-25 13:09:02.345 59675 DEBUG neutron.agent.linux.utils [req-1ec83b30-7aa4-440f-91eb-26ad1b84561f - - - - -] Running command (rootwrap daemon): ['ip', 'link', 'show'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-03-25 13:09:03.096 59675 DEBUG neutron.agent.linux.utils [req-1ec83b30-7aa4-440f-91eb-26ad1b84561f - - - - -] Running command (rootwrap daemon): ['ip', 'link', 'show'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-03-25 13:09:03.847 59675 DEBUG neutron.agent.linux.utils [req-1ec83b30-7aa4-440f-91eb-26ad1b84561f - - - - -] Running command (rootwrap daemon): ['ip', 'link', 'show'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-03-25 13:09:04.598 59675 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-1ec83b30-7aa4-440f-91eb-26ad1b84561f - - - - -] Loop iteration exceeded interval (2 vs. 2.25298190117)! daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:412
2019-03-25 13:09:04.598 59675 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-1ec83b30-7aa4-440f-91eb-26ad1b84561f - - - - -] Agent rpc_loop - iteration:3 started daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:374

Based on the "loop iteration exceeded" message, I modified /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/sriov_agent.ini to add "polling_interval=5" to [agent] section, and after restarting the neutron_sriov_agent, activity returned to normal (1 iteration doing some ip link rootwrap calls, and pausing until next iteration).

Now, the question is, is the agent loop exceeding 2 seconds on a small deployment normal? Should we configure this parameter to longer value?

Comment 6 Bernard Cafarelli 2019-04-04 15:52:12 UTC
To extend the loop interval, on the compute node, this can be edited in /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/sriov_agent.ini

In [agent] section, set:
polling_interval=5

Then restart the container:
# docker restart neutron_sriov_agent

To see if iterations take too much time, watch for similar lines to:
2019-04-04 15:18:31.366 59965 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-1263b780-bfad-408b-a2fa-8d89af87dc95 - - - - -] Loop iteration exceeded interval (2 vs. 3.00258302689)! daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:412
in /var/log/containers/neutron/sriov-nic-agent.log

Comment 15 Boris Deschenes 2019-04-15 13:57:38 UTC
another impact of this issue is the high (around 80% for us) CPU usage even on inactive SR-IOV compute nodes, this disappears completely if the container is started with the ulimit nofiles=16384 (as proposed in https://review.openstack.org/#/c/650418/)

Comment 25 errata-xmlrpc 2019-07-10 13:03:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1738