Bug 1376592 - Periodic CPU spikes on neutron nodes (CPU usage goes to 100% for many cores)
Summary: Periodic CPU spikes on neutron nodes (CPU usage goes to 100% for many cores)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: async
Target Release: 7.0 (Kilo)
Assignee: Assaf Muller
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-15 20:44 UTC by Jeremy
Modified: 2019-12-16 06:47 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-16 18:28:48 UTC
Target Upstream Version:



Description Jeremy 2016-09-15 20:44:20 UTC
Description of problem:  
Every 1.5-2 hours, one of our neutron nodes (there are 3 neutron nodes) experiences a massive CPU spike on the cores that process network traffic (we use RFS/RPS to define which cores handle this), and every one of those cores goes to 100% usage. When this happens, the offending processes are 'ksoftirqd' on each core. During this time, network throughput on the node drops substantially because the cores are all busy with ksoftirqd. The CPU spike lasts roughly 1-3 minutes and then subsides.


Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.2-14.el7ost.noarch 
openstack-neutron-openvswitch-2015.1.2-14.el7ost.noarch 
python-neutronclient-2.4.0-2.el7ost.noarch 


How reproducible:
100%

Steps to Reproduce:
1. wait 1.5 to 2 hours and notice cpu spikes
2.
3.

Actual results:
cpu 100% usage on network nodes

Expected results:
normal cpu usage

Additional info:

One important thing to note is that the issue is not specific to one particular neutron node.

For example: if neutron node 1 is experiencing the problem and we fail the routers over off of neutron node 1, then the next spike will occur on one of the two remaining neutron nodes. But if we leave neutron node 1 up, the spikes will always happen on neutron node 1. This leads us to believe that the ksoftirqd/CPU spike is caused by a heavy influx of traffic through one of the neutron routers, and that when the router moves, the CPU spike moves with it. What is confusing, however, is that when the ksoftirqd/CPU spike happens, throughput drops substantially. If the spike were caused by a burst of traffic, we would expect to see a spike in throughput on that neutron node coinciding with the CPU spike - but that is not what we are seeing.
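
To correlate the spikes with a specific router, the routers currently hosted as master on a node could be listed with something like this (rough sketch; it assumes the default l3-agent state path /var/lib/neutron/ha_confs, where keepalived writes each HA router's master/backup state):

 # run on each neutron node around the time of a spike; prints router IDs in master state
 grep -l master /var/lib/neutron/ha_confs/*/state 2>/dev/null | awk -F/ '{print $(NF-1)}'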

Here are the times the spikes occurred on each node over the last couple of days:

ttcosneutron1001: 
2016-09-14  14:02 CDT
2016-09-14  16:30 CDT
2016-09-14  18:22 CDT

ttcosneutron1002: 
2016-09-14 09:10 CDT
2016-09-14 10:24 CDT
2016-09-14 12:04 CDT
2016-09-15 04:42 CDT
2016-09-15 05:53 CDT
2016-09-15 07:21 CDT

ttcosneutron1003 - no spikes have occurred on this node

You will notice that first ttcosneutron1002 gets the spikes, then ttcosneutron1001, then back to ttcosneutron1002. This is because during the 2016-09-14 12:04 CDT spike we tried to run 'perf record -a -g' on ttcosneutron1002, which caused a panic and crashed the server. We kept ttcosneutron1002 offline for the rest of the day, so ttcosneutron1001 started experiencing the spikes. We brought ttcosneutron1002 back online last night at around 21:00:00 CDT and it started experiencing the spikes again.

One thing we notice is that the spikes seem to be more severe when they occur on ttcosneutron1002 and less severe when they occur on ttcosneutron1001 (severity measured by total time cores are at 100%).  We're not sure why this is either.

Comment 5 Faiaz Ahmed 2016-09-16 01:49:28 UTC
Looking into  l3-agent.logs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2016-09-15 10:14:48.961 53971 INFO neutron.agent.l3.ha [-] Router a3f658be-cb61-419f-bc79-c0ad8196668e transitioned to master
2016-09-15 10:14:49.030 53971 ERROR neutron.agent.linux.utils [-] 
Command: ['ip', 'netns', 'exec', u'qrouter-a3f658be-cb61-419f-bc79-c0ad8196668e', 'sysctl', '-w', u'net.ipv6.conf.qg-e2ce1ff3-15.accept_ra=2']
Exit code: 255
Stdin: 
Stdout: 
Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/qg-e2ce1ff3-15/accept_ra: No such file or directory

2016-09-15 10:14:49.030 53971 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 454, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 59, in __call__
    self.enqueue(router_id, state)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 65, in enqueue
    self.agent.enqueue_state_change(router_id, state)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 125, in enqueue_state_change
    self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 140, in _configure_ipv6_ra_on_ext_gw_port_if_necessary
    ri.driver.configure_ipv6_ra(ri.ns_name, interface_name)
  File "/usr/lib/python2.7/site-packages/neutron/agent/linux/interface.py", line 179, in configure_ipv6_ra
    ['sysctl', '-w', 'net.ipv6.conf.%s.accept_ra=2' % dev_name])
  File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 580, in execute
    extra_ok_codes=extra_ok_codes, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 137, in execute
    raise RuntimeError(m)
RuntimeError: 
Command: ['ip', 'netns', 'exec', u'qrouter-a3f658be-cb61-419f-bc79-c0ad8196668e', 'sysctl', '-w', u'net.ipv6.conf.qg-e2ce1ff3-15.accept_ra=2']
Exit code: 255
Stdin: 
Stdout: 
Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/qg-e2ce1ff3-15/accept_ra: No such file or directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a known bug, https://bugzilla.redhat.com/show_bug.cgi?id=1262647, which has been addressed by https://access.redhat.com/errata/RHBA-2015:1866


Looking into customer installed versions

[fahmed@collab-shell ttcosneutron1002]$ pwd
/cases/01704538/sosreport-20160915-104140/ttcosneutron1002
[fahmed@collab-shell ttcosneutron1002]$ grep neutron installed-rpms |gawk ' {print $1} '
openstack-neutron-2015.1.2-14.el7ost.noarch
openstack-neutron-common-2015.1.2-14.el7ost.noarch
openstack-neutron-fwaas-2015.1.2-1.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-2.el7ost.noarch
openstack-neutron-ml2-2015.1.2-14.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-14.el7ost.noarch
python-neutron-2015.1.2-14.el7ost.noarch
python-neutron-fwaas-2015.1.2-1.el7ost.noarch
python-neutron-lbaas-2015.1.2-2.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch
[fahmed@collab-shell ttcosneutron1002]$

Comment 6 Assaf Muller 2016-09-16 01:52:49 UTC
Good find Faiaz. @Jeremy, please take a look at comment 5. Can you have the customer update to latest OSP 7 and report back?

Comment 7 Faiaz Ahmed 2016-09-16 01:55:40 UTC
Assaf, the customer already has newer *-neutron-* versions than the ones mentioned in https://access.redhat.com/errata/RHBA-2015:1866.

Comment 10 Flavio Leitner 2016-09-16 13:18:39 UTC
When the spike happens, attach perf with 'perf record -C <cpu having spike> sleep 3',
which will record what ksoftirqd is doing on that CPU for 3 seconds.

Then you can use the command below to see what has been recorded:
# perf report -f --stdio

That will give us some clue of what is going on.
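
If catching the spike by hand is awkward, a small watcher loop can trigger the capture automatically. This is only a sketch (the 90% threshold, 1-second poll and file naming are arbitrary choices): it compares per-CPU softirq jiffies from /proc/stat one second apart and records 3 seconds of perf data on the first CPU that spent most of that second in softirq.

 #!/bin/bash
 # field 8 of each cpuN line in /proc/stat is softirq jiffies (~100 jiffies per second)
 while true; do
     a=$(grep '^cpu[0-9]' /proc/stat)
     sleep 1
     b=$(grep '^cpu[0-9]' /proc/stat)
     cpu=$(awk 'NR==FNR {soft[$1]=$8; next}
                ($8 - soft[$1]) > 90 {sub(/^cpu/, "", $1); print $1; exit}' \
              <(echo "$a") <(echo "$b"))
     if [ -n "$cpu" ]; then
         # grab a 3-second sample from the busy CPU, one file per event
         perf record -C "$cpu" -o "perf.data.cpu${cpu}.$(date +%s)" sleep 3
     fi
 done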

Comment 11 Assaf Muller 2016-09-16 13:41:21 UTC
Please see comment 10.

Comment 18 Jamie Bainbridge 2016-09-18 07:46:51 UTC
The high CPU usage appears to be caused by inefficient conntrack hashing.

A typical ksoftirqd thread in the perf data looked like:

-    4.26%     4.26%  ksoftirqd/10
   - 73.78% nf_conntrack_tuple_taken
      - nf_nat_used_tuple
        nf_nat_l4proto_unique_tuple
        tcp_unique_tuple
        get_unique_tuple
        nf_nat_setup_info
        xt_snat_target_v0
        ipt_do_table
        iptable_nat_do_chain
        nf_nat_ipv4_fn
        nf_nat_ipv4_out
        iptable_nat_ipv4_out
        nf_iterate
        nf_hook_slow
        ip_output
        ip_forward_finish
        ip_forward
        ip_rcv_finish
        ip_rcv
        __netif_receive_skb_core
        __netif_receive_skb
        process_backlog
        net_rx_action
      + __do_softirq

Conntrack was hashing at roughly 61 entries per bucket (61:1):

 $ cat sys/module/nf_conntrack/parameters/hashsize 
 16384

 $ cat ./proc/sys/net/netfilter/nf_conntrack_max
 1000000

1000000/16384 = 61

We set this down to a more reasonable 8:1

1000000 / 8 = 125000
(this rounded to 125400 once the parameter was set)

 echo 'options nf_conntrack hashsize=125400' > /etc/modprobe.d/nf_conntrack.conf
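
The modprobe.d option only takes effect when the module is reloaded. As far as I know the bucket count can also be changed on a live system by writing the module parameter directly, which makes it easier to try different ratios without a reboot (sketch, run as root):

 # apply the new bucket count immediately
 echo 125400 > /sys/module/nf_conntrack/parameters/hashsize

 # re-check the live entry count against the bucket count
 cat /sys/module/nf_conntrack/parameters/hashsize
 cat /proc/sys/net/netfilter/nf_conntrack_count
 cat /proc/sys/net/netfilter/nf_conntrack_max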

However I'm not sure how this operates when neutron/netns is involved. Do we need to set each netns with its own value?

The customer has reported that one system did still encounter high CPU usage, so we are checking that the correct config is applied and looking at the perf data for that system.

FYI, the customer also has RPS and RFS configured on these systems; this apparently drastically improves their GRE throughput:

 echo "65536" > /proc/sys/net/core/rps_sock_flow_entries

 echo "16384" > /sys/class/net/$DEV/queues/rx-$QUEUE/rps_flow_cnt
 echo "3fff0003fff" > /sys/class/net/$DEV/queues/rx-$QUEUE/rps_cpus

NICs are enic with 5 rx queues and 5 tx queues (not combined rxtx queues). irqbalance runs once in oneshot mode then never again, so there is still RSS across the same CPU mask as above, which is NUMA Node 0.

Comment 28 Jamie Bainbridge 2016-09-22 05:55:48 UTC
After increasing the hashsize even further, perf still showed time being spent in the same nf symbol.

We're now wondering whether a misbehaving netns is generating a lot of badly-hashing traffic leading to inefficiently long linked lists, or if we're tracking connections for too long and should reduce a timeout.

The customer indicated one particular namespace had ~200k connections during the last spike, so we're concentrating on collecting that ns' conntrack table during the next event.
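
To narrow that down during the next spike, the conntrack entries can be counted per namespace with something like this (sketch; assumes the conntrack tool is installed and that entries are accounted per-netns on this kernel):

 # count entries in each namespace
 for ns in $(ip netns list | awk '{print $1}'); do
     printf '%s: ' "$ns"
     ip netns exec "$ns" conntrack -C 2>/dev/null || echo '?'
 done

 # dump the full table from the suspect namespace for offline analysis
 ip netns exec qrouter-<router_id> conntrack -L > /tmp/conntrack.$(date +%s).txt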

I discussed ways to add more transmit SNAT IPs (to balance hashing better) with Dan Sneddon from Neutron Eng and with OpenStack support, but that appears to require a non-trivial redesign.

Comment 29 Assaf Muller 2017-01-16 18:28:48 UTC
Closing as this is not a Neutron issue, but a platform issue that was eventually resolved by changing the client that was generating the traffic.

