Bug 1641175 - neutron-server has high CPU utilization after workload completes
Summary: neutron-server has high CPU utilization after workload completes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 14.0 (Rocky)
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Bernard Cafarelli
QA Contact: Roee Agiman
URL:
Whiteboard: scale_lab
Depends On:
Blocks:
Reported: 2018-10-19 19:18 UTC by Joe Talerico
Modified: 2018-12-10 15:09 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-10 15:09:24 UTC
Target Upstream Version:
Embargoed:



Description Joe Talerico 2018-10-19 19:18:00 UTC
Description of problem:
After a set of Rally scenarios completes, neutron-server does not return to its previous idle utilization. See [1] for CPU utilization (at 10:40) and [2] for memory. At 15:50 the workload has completed, but neutron-server utilization never returns to the idle level.

ML2OVN is experiencing this. Trying the exact same tests with OVSML2.

[1] http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/snapshot/xJjA6dF7KoNPSdCud7VdzhF4M8lZTdDS
[2] http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/snapshot/pttXNtlRM2ZQ3ofg1oeLALTyVS9kqQFL

Version-Release number of selected component (if applicable):
OSP14 2018-10-08.4

How reproducible:
N/A

Steps to Reproduce:
1. Run the set of Rally scenarios defined in [ https://gist.github.com/jtaleric/2909dd9a8c0bed13cc93b655a6b3b875 ]

Restarting the containers fixes the problem.

Comment 1 Marian Krcmarik 2018-10-19 20:10:49 UTC
I can confirm that I am also seeing high CPU usage caused by neutron on an idle OSP14 deployment. The CPU usage increased over time on an idle system; in my case I did not even run any heavy workload. A heavy workload may simply make it reproduce faster.

Comment 2 Joe Talerico 2018-10-22 14:14:06 UTC
Recreated with OVSML2

Comment 4 Joe Talerico 2018-11-01 23:24:02 UTC
Changing state back to NEW. Tested with puppet-tripleo-9.3.1-0.20181010034745.157eaab.el7ost.noarch (in puddle: 2018-10-25.3).

Ran a set of Rally tests, and saw controller-0 still burning CPU even though Neutron had nothing to work on.

All that is left over from my tests:
(.browbeat-venv) (overcloud) [stack@c09-h11-r630 browbeat]$ neutron net-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------------------------------------------+-----------+-------------------------------------------------------+
| id                                   | name                                               | tenant_id | subnets                                               |
+--------------------------------------+----------------------------------------------------+-----------+-------------------------------------------------------+
| 8cdf70ab-2a00-4395-a32f-4f12ede8f8bd | HA network tenant c4ce3b3558394963b425d6ab21bae058 |           | a9d6106a-251b-4459-b4a6-b32e456c2708 169.254.192.0/18 |
+--------------------------------------+----------------------------------------------------+-----------+-------------------------------------------------------+

I restarted Neutron and CPU utilization returned to its idle level; you can see the before and after here [1].

An interesting note: looking at a different controller (where I didn't restart Neutron), I can see this in /var/log/container/neutron/server.log:
2018-11-01 23:11:21.818 29 INFO neutron.wsgi [-] 172.16.0.11 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0013459
2018-11-01 23:11:22.548 33 INFO neutron.wsgi [-] 172.16.0.32 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0010641
2018-11-01 23:11:23.816 35 INFO neutron.wsgi [-] 172.16.0.25 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0010581
2018-11-01 23:11:23.821 35 INFO neutron.wsgi [-] 172.16.0.11 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0008490
2018-11-01 23:11:24.552 38 INFO neutron.wsgi [-] 172.16.0.32 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0011449

^ This happens constantly, as if something is querying neutron. However, looking at controller-0, I see the same thing (after a restart).
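One way to check whether these requests follow a fixed schedule is to group them by client address and measure the gap between consecutive requests. The following is only a sketch: the regular expression is an assumption based on the five log lines quoted above, and `request_intervals` is a hypothetical helper name, not part of neutron.

```python
import re
from collections import defaultdict
from datetime import datetime

# The five neutron.wsgi lines quoted above from server.log.
LOG_LINES = [
    '2018-11-01 23:11:21.818 29 INFO neutron.wsgi [-] 172.16.0.11 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0013459',
    '2018-11-01 23:11:22.548 33 INFO neutron.wsgi [-] 172.16.0.32 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0010641',
    '2018-11-01 23:11:23.816 35 INFO neutron.wsgi [-] 172.16.0.25 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0010581',
    '2018-11-01 23:11:23.821 35 INFO neutron.wsgi [-] 172.16.0.11 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0008490',
    '2018-11-01 23:11:24.552 38 INFO neutron.wsgi [-] 172.16.0.32 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0011449',
]

# Assumed layout: timestamp, worker PID, level, logger, request context,
# client address, then the quoted request line.
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \d+ INFO neutron\.wsgi '
    r'\[.*?\] (?P<client>\S+) "(?P<method>[A-Z]+) '
)


def request_intervals(lines, method="OPTIONS"):
    """Map each client address to the gaps (in seconds) between its requests."""
    stamps = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("method") == method:
            stamps[match.group("client")].append(
                datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            )
    return {
        client: [round((b - a).total_seconds(), 3) for a, b in zip(ts, ts[1:])]
        for client, ts in stamps.items()
    }


for client, gaps in sorted(request_intervals(LOG_LINES).items()):
    print(client, gaps)
```

For the excerpt above, the two clients that appear twice (172.16.0.11 and 172.16.0.32) each repeat after roughly two seconds, which suggests a periodic poller rather than real API traffic.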

Since Neutron runs under wsgi, I figured I would look there too, but neutron is the only service whose directory is empty:
./neutron-api:
total 0
drwxr-xr-x.  2 root root   6 Nov  1 01:20 .
drwxr-xr-x. 14 root root 221 Nov  1 01:21 ..

GMR might help us figure out what Neutron is chewing on?[2]
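As a rough sketch of how a GMR could be requested: the code below assumes neutron-server has GMR support enabled through oslo.reports and that the report is triggered by SIGUSR2 (the oslo.reports default, as far as I know). `request_gmr` is a hypothetical helper, and the PID has to be looked up separately. Note that a process with no SIGUSR2 handler installed is terminated by this signal, so it should only be sent to a process known to support GMR.

```python
import os
import signal

# Assumption: oslo.reports registers its report handler on SIGUSR2 by default.
GMR_SIGNAL = signal.SIGUSR2


def request_gmr(pid, sig=GMR_SIGNAL):
    """Signal the given PID to emit a Guru Meditation Report.

    The report is written wherever the service is configured to send it
    (by default its stderr, as far as I know); this function only sends
    the signal and returns the PID it signalled.
    """
    os.kill(pid, sig)
    return pid
```

Usage would be `request_gmr(<pid of a neutron-server worker>)`, run as a user allowed to signal that process.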

[1] http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/snapshot/kAlDmd16lvs4ccc2qfEBOv08iv6GOZxw
[2] https://wiki.openstack.org/wiki/GuruMeditationReport

Comment 6 Bernard Cafarelli 2018-11-23 17:20:29 UTC
The request traffic seen on the controllers is normal health-check traffic from haproxy. The relevant configuration (grabbed from the haproxy container) is:
listen neutron
  bind 10.0.0.113:9696 transparent
  bind 172.17.1.33:9696 transparent
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk
  option httplog
  server controller-0.internalapi.localdomain 172.17.1.12:9696 check fall 5 inter 2000 rise 2
  server controller-1.internalapi.localdomain 172.17.1.16:9696 check fall 5 inter 2000 rise 2
  server controller-2.internalapi.localdomain 172.17.1.31:9696 check fall 5 inter 2000 rise 2

So this matches the regular OPTIONS requests in the logs (httpchk without parameters means OPTIONS).
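To make the arithmetic explicit: `inter 2000` is a 2000 ms check interval, so each haproxy instance probes each backend every two seconds. The sketch below derives the expected request rate per neutron-server backend. The count of three haproxy instances is an assumption taken from the three distinct source addresses in the server.log excerpt in Comment 4, and both function names are hypothetical.

```python
# Server lines copied from the haproxy configuration above.
HAPROXY_SERVER_LINES = [
    "server controller-0.internalapi.localdomain 172.17.1.12:9696 check fall 5 inter 2000 rise 2",
    "server controller-1.internalapi.localdomain 172.17.1.16:9696 check fall 5 inter 2000 rise 2",
    "server controller-2.internalapi.localdomain 172.17.1.31:9696 check fall 5 inter 2000 rise 2",
]


def check_interval_seconds(server_line):
    """Return the health-check interval in seconds, or None if checks are off."""
    tokens = server_line.split()
    if "check" not in tokens:
        return None
    if "inter" not in tokens:
        return 2.0  # haproxy falls back to 2000 ms when "inter" is not set
    # A bare integer is in milliseconds; unit suffixes such as "2s" are
    # not handled in this sketch.
    return int(tokens[tokens.index("inter") + 1]) / 1000.0


def expected_checks_per_second(server_line, haproxy_instances):
    """Expected health-check requests per second arriving at one backend."""
    interval = check_interval_seconds(server_line)
    if interval is None:
        return 0.0
    return haproxy_instances / interval


# Assumption: one haproxy instance per controller, three controllers.
print(expected_checks_per_second(HAPROXY_SERVER_LINES[0], haproxy_instances=3))
```

Under those assumptions each neutron-server sees about 1.5 OPTIONS requests per second, which is consistent with the five requests logged in under three seconds.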

