1641175 – neutron-server has high CPU utilization after workload completes

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-neutron
Sub Component:
Version:	14.0 (Rocky)
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Bernard Cafarelli
QA Contact:	Roee Agiman
Docs Contact:
URL:
Whiteboard:	scale_lab
Depends On:
Blocks:

Reported:	2018-10-19 19:18 UTC by Joe Talerico
Modified:	2018-12-10 15:09 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-12-10 15:09:24 UTC
Target Upstream Version:
Embargoed:

Description Joe Talerico 2018-10-19 19:18:00 UTC

Description of problem:
After a set of Rally scenarios completes, neutron-server does not return to its previous idle utilization. See [1] for CPU utilization and [2] for memory. In [1], note 10:40 and 15:50: at 15:50 the workload has completed, but neutron-server utilization never returns to its idle level.

ML2OVN is experiencing this. Trying the exact same tests with OVSML2.

[1] http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/snapshot/xJjA6dF7KoNPSdCud7VdzhF4M8lZTdDS
[2] http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/snapshot/pttXNtlRM2ZQ3ofg1oeLALTyVS9kqQFL

Version-Release number of selected component (if applicable):
OSP14 2018-10-08.4

How reproducible:
N/A

Steps to Reproduce:
1. Run the set of Rally scenarios defined at [ https://gist.github.com/jtaleric/2909dd9a8c0bed13cc93b655a6b3b875 ]

Restarting the containers fixes the problem.

Comment 1 Marian Krcmarik 2018-10-19 20:10:49 UTC

I can confirm that I am seeing high CPU usage caused by neutron on an idle OSP14 deployment as well. The CPU usage increased over time on an idle system; in my case I did not even run any heavy workload, although a workload may make the problem faster to reproduce.

Comment 2 Joe Talerico 2018-10-22 14:14:06 UTC

Recreated with OVSML2

Comment 4 Joe Talerico 2018-11-01 23:24:02 UTC

Changing state back to New; tested with puppet-tripleo-9.3.1-0.20181010034745.157eaab.el7ost.noarch (in puddle: 2018-10-25.3).

Ran a set of Rally tests, and saw controller-0 still burning CPU even though Neutron had nothing to work on.

All that is left over from my tests:
(.browbeat-venv) (overcloud) [stack@c09-h11-r630 browbeat]$ neutron net-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------------------------------------------+-----------+-------------------------------------------------------+
| id                                   | name                                               | tenant_id | subnets                                               |
+--------------------------------------+----------------------------------------------------+-----------+-------------------------------------------------------+
| 8cdf70ab-2a00-4395-a32f-4f12ede8f8bd | HA network tenant c4ce3b3558394963b425d6ab21bae058 |           | a9d6106a-251b-4459-b4a6-b32e456c2708 169.254.192.0/18 |
+--------------------------------------+----------------------------------------------------+-----------+-------------------------------------------------------+

I restarted Neutron and CPU utilization returned to its idle level; you can see the before and after here [1].

Interesting note: looking at a different controller (where I didn't restart Neutron), I can see in /var/log/container/neutron/server.log:
2018-11-01 23:11:21.818 29 INFO neutron.wsgi [-] 172.16.0.11 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0013459
2018-11-01 23:11:22.548 33 INFO neutron.wsgi [-] 172.16.0.32 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0010641
2018-11-01 23:11:23.816 35 INFO neutron.wsgi [-] 172.16.0.25 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0010581
2018-11-01 23:11:23.821 35 INFO neutron.wsgi [-] 172.16.0.11 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0008490
2018-11-01 23:11:24.552 38 INFO neutron.wsgi [-] 172.16.0.32 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0011449

^ This happens constantly, as if something is querying neutron. However, looking at controller-0, I see the same thing (after a restart).

Since Neutron runs under wsgi, I figured I would look there too, but neutron is the only service for which it is empty:
./neutron-api:
total 0
drwxr-xr-x.  2 root root   6 Nov  1 01:20 .
drwxr-xr-x. 14 root root 221 Nov  1 01:21 ..

GMR might help us figure out what Neutron is chewing on?[2]

[1] http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/snapshot/kAlDmd16lvs4ccc2qfEBOv08iv6GOZxw
[2] https://wiki.openstack.org/wiki/GuruMeditationReport

Comment 6 Bernard Cafarelli 2018-11-23 17:20:29 UTC

The request traffic seen on the controllers is normal health-check traffic from haproxy. The relevant configuration (grabbed from the haproxy container) is:
listen neutron
  bind 10.0.0.113:9696 transparent
  bind 172.17.1.33:9696 transparent
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk
  option httplog
  server controller-0.internalapi.localdomain 172.17.1.12:9696 check fall 5 inter 2000 rise 2
  server controller-1.internalapi.localdomain 172.17.1.16:9696 check fall 5 inter 2000 rise 2
  server controller-2.internalapi.localdomain 172.17.1.31:9696 check fall 5 inter 2000 rise 2

So this matches the regular OPTIONS requests in the logs (httpchk without parameters means an OPTIONS request).
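
As a rough cross-check of that explanation, the server.log excerpt from comment 4 can be grouped by source address and the gap between consecutive OPTIONS requests compared with the 2000 ms `inter` value in the haproxy configuration. The following is only a sketch: the function name `options_intervals` and the regular expression are illustrative assumptions, not part of neutron or haproxy, and the input is just the five lines quoted in comment 4.

```python
import re
from collections import defaultdict
from datetime import datetime

# Log excerpt copied verbatim from comment 4 (neutron server.log).
LOG = """\
2018-11-01 23:11:21.818 29 INFO neutron.wsgi [-] 172.16.0.11 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0013459
2018-11-01 23:11:22.548 33 INFO neutron.wsgi [-] 172.16.0.32 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0010641
2018-11-01 23:11:23.816 35 INFO neutron.wsgi [-] 172.16.0.25 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0010581
2018-11-01 23:11:23.821 35 INFO neutron.wsgi [-] 172.16.0.11 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0008490
2018-11-01 23:11:24.552 38 INFO neutron.wsgi [-] 172.16.0.32 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0011449
"""

# Hypothetical pattern for the neutron.wsgi access lines shown above:
# timestamp, worker pid, then the client address and the OPTIONS request.
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \d+ INFO '
    r'neutron\.wsgi \[-\] (?P<src>\S+) "OPTIONS / HTTP/1\.0"'
)


def options_intervals(log_text):
    """Map each source address to the gaps (in seconds) between its
    consecutive OPTIONS requests, in log order."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            seen[match.group("src")].append(stamp)
    return {
        src: [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]
        for src, times in seen.items()
    }


if __name__ == "__main__":
    for src, gaps in options_intervals(LOG).items():
        print(src, gaps)
```

For the two source addresses that appear twice in the excerpt, the gaps come out at roughly two seconds, which is consistent with `check ... inter 2000`. The third address appears only once, so no interval can be computed for it from this sample.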
