Description of problem:

Every 1.5-2 hours, one of our neutron nodes (there are 3 neutron nodes) experiences a massive CPU spike on the cores that process network traffic (we use RPS/RFS to define which cores handle this), and every one of those cores goes to 100% usage. When this happens, the offending processes are the 'ksoftirqd' threads on each core. During this time the network throughput on the node drops substantially because the cores are all busy with ksoftirqd. The CPU spike lasts roughly 1-3 minutes and then subsides.

Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.2-14.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-14.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Wait 1.5 to 2 hours and observe the CPU spikes.

Actual results:
100% CPU usage on the network nodes.

Expected results:
Normal CPU usage.

Additional info:

One important thing to note is that the issue is not specific to one particular neutron node. For example: if neutron node 1 is experiencing the problem and we fail the routers over off of neutron node 1, the next spike will occur on one of the two remaining neutron nodes. But if we leave neutron node 1 up, the spikes will always happen on neutron node 1. This leads us to believe that the ksoftirqd / CPU spike is caused by a heavy influx of traffic through one of the neutron routers, and when that router moves, the CPU spike moves with it. However, what is confusing is that when the ksoftirqd / CPU spike happens, throughput drops substantially. If the spike were caused by a spike in traffic, we would expect to see a corresponding spike in throughput on that neutron node - but that is not what we are seeing.

Here are the times the spikes occurred on each node over the last couple of days:

ttcosneutron1001:
2016-09-14 14:02 CDT
2016-09-14 16:30 CDT
2016-09-14 18:22 CDT

ttcosneutron1002:
2016-09-14 09:10 CDT
2016-09-14 10:24 CDT
2016-09-14 12:04 CDT
2016-09-15 04:42 CDT
2016-09-15 05:53 CDT
2016-09-15 07:21 CDT

ttcosneutron1003:
no spikes have occurred on this node

You will notice that first ttcosneutron1002 got the spikes, then ttcosneutron1001, then back to ttcosneutron1002. This is because during the 2016-09-14 12:04 CDT spike we tried to run a 'perf record -a -g' on ttcosneutron1002, which caused a panic and the server crashed. We kept ttcosneutron1002 offline for the rest of the day, so ttcosneutron1001 started experiencing the spikes. We brought ttcosneutron1002 back online last night at around 21:00:00 CDT and it started experiencing the spikes again. One thing we notice is that the spikes seem to be more severe when they occur on ttcosneutron1002 and less severe when they occur on ttcosneutron1001 (severity measured by total time the cores stay at 100%). We are not sure why this is either.
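For what it's worth, a quick way to confirm during a spike that the affected cores are burning time in NET_RX softirq (ksoftirqd) rather than in a userspace process is something like the sketch below. This is only a diagnostic suggestion; the sampling interval and the NET_RX filter are my own choices, not part of the report, and it assumes the sysstat package (for mpstat) is installed on the node.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/bash
# Diagnostic sketch: watch per-CPU softirq load during a spike.

# Per-CPU breakdown; during a spike the RPS target cores should show %soft near 100.
mpstat -P ALL 1 5

# Raw NET_RX softirq counters per CPU; the columns that grow fastest between
# the two snapshots identify the cores doing the packet processing.
grep NET_RX /proc/softirqs
sleep 1
grep NET_RX /proc/softirqs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~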
Looking into l3-agent.log:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2016-09-15 10:14:48.961 53971 INFO neutron.agent.l3.ha [-] Router a3f658be-cb61-419f-bc79-c0ad8196668e transitioned to master
2016-09-15 10:14:49.030 53971 ERROR neutron.agent.linux.utils [-] Command: ['ip', 'netns', 'exec', u'qrouter-a3f658be-cb61-419f-bc79-c0ad8196668e', 'sysctl', '-w', u'net.ipv6.conf.qg-e2ce1ff3-15.accept_ra=2']
Exit code: 255
Stdin:
Stdout:
Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/qg-e2ce1ff3-15/accept_ra: No such file or directory
2016-09-15 10:14:49.030 53971 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 454, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 59, in __call__
    self.enqueue(router_id, state)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 65, in enqueue
    self.agent.enqueue_state_change(router_id, state)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 125, in enqueue_state_change
    self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 140, in _configure_ipv6_ra_on_ext_gw_port_if_necessary
    ri.driver.configure_ipv6_ra(ri.ns_name, interface_name)
  File "/usr/lib/python2.7/site-packages/neutron/agent/linux/interface.py", line 179, in configure_ipv6_ra
    ['sysctl', '-w', 'net.ipv6.conf.%s.accept_ra=2' % dev_name])
  File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 580, in execute
    extra_ok_codes=extra_ok_codes, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 137, in execute
    raise RuntimeError(m)
RuntimeError: Command: ['ip', 'netns', 'exec', u'qrouter-a3f658be-cb61-419f-bc79-c0ad8196668e', 'sysctl', '-w', u'net.ipv6.conf.qg-e2ce1ff3-15.accept_ra=2']
Exit code: 255
Stdin:
Stdout:
Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/qg-e2ce1ff3-15/accept_ra: No such file or directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a known bug, https://bugzilla.redhat.com/show_bug.cgi?id=1262647, which was addressed by https://access.redhat.com/errata/RHBA-2015:1866.

Looking into the customer's installed versions:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[fahmed@collab-shell ttcosneutron1002]$ pwd
/cases/01704538/sosreport-20160915-104140/ttcosneutron1002
[fahmed@collab-shell ttcosneutron1002]$ grep neutron installed-rpms | gawk ' {print $1} '
openstack-neutron-2015.1.2-14.el7ost.noarch
openstack-neutron-common-2015.1.2-14.el7ost.noarch
openstack-neutron-fwaas-2015.1.2-1.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-2.el7ost.noarch
openstack-neutron-ml2-2015.1.2-14.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-14.el7ost.noarch
python-neutron-2015.1.2-14.el7ost.noarch
python-neutron-fwaas-2015.1.2-1.el7ost.noarch
python-neutron-lbaas-2015.1.2-2.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch
[fahmed@collab-shell ttcosneutron1002]$
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
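If it helps to check the failing sysctl by hand, a minimal sketch is below, using the namespace and interface names exactly as they appear in the traceback above. On a healthy node the proc entry should exist; exit code 255 on the last command reproduces the error in the log.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/bash
# Sketch: verify the IPv6 accept_ra sysctl for the external gateway port inside
# the HA router namespace from the traceback above.
NS=qrouter-a3f658be-cb61-419f-bc79-c0ad8196668e
DEV=qg-e2ce1ff3-15

# Does the interface exist in the namespace, and does its ipv6 proc entry exist?
ip netns exec "$NS" ip link show "$DEV"
ip netns exec "$NS" ls /proc/sys/net/ipv6/conf/"$DEV"/accept_ra

# The same write the l3-agent attempts.
ip netns exec "$NS" sysctl -w net.ipv6.conf."$DEV".accept_ra=2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~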
Good find, Faiaz. @Jeremy, please take a look at comment 5. Can you have the customer update to the latest OSP 7 and report back?
Assaf, the customer already has newer *-neutron-* versions than the ones mentioned in https://access.redhat.com/errata/RHBA-2015:1866.
When the spike happens, run 'perf record -C <cpu having spike> sleep 3', which will record what ksoftirqd is doing on that CPU for 3 seconds. Then use the command below to see what was recorded:

# perf report -f --stdio

That should give us some clue about what is going on.
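In case it helps to have this ready the next time a spike hits, here is a small wrapper for that workflow. The 3-second window and the report command are the ones from this comment; the CPU-list argument and the output path are my own assumptions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/bash
# Sketch: capture a short perf sample on the CPUs showing 100% ksoftirqd.
# Usage: ./perf-spike.sh 10,11,12   (comma-separated CPU list)
CPUS=${1:?usage: $0 <cpu-list>}
OUT=/tmp/perf-spike-$(date +%s).data   # assumed output location

# Record only the listed CPUs for 3 seconds, as suggested above.
perf record -C "$CPUS" -o "$OUT" sleep 3

# Text report of what was recorded.
perf report -f --stdio -i "$OUT"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~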
Please see comment 10.
The high CPU usage appears to be caused by inefficient conntrack hashing. A typical ksoftirqd sample in the perf data looked like:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-    4.26%     4.26%  ksoftirqd/10
   - 73.78% nf_conntrack_tuple_taken
      - nf_nat_used_tuple
        nf_nat_l4proto_unique_tuple
        tcp_unique_tuple
        get_unique_tuple
        nf_nat_setup_info
        xt_snat_target_v0
        ipt_do_table
        iptable_nat_do_chain
        nf_nat_ipv4_fn
        nf_nat_ipv4_out
        iptable_nat_ipv4_out
        nf_iterate
        nf_hook_slow
        ip_output
        ip_forward_finish
        ip_forward
        ip_rcv_finish
        ip_rcv
        __netif_receive_skb_core
        __netif_receive_skb
        process_backlog
        net_rx_action
      + __do_softirq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Conntrack was hashing at about 61 entries per bucket:

$ cat sys/module/nf_conntrack/parameters/hashsize
16384
$ cat ./proc/sys/net/netfilter/nf_conntrack_max
1000000

1000000 / 16384 = ~61

We set this down to a more reasonable 8:1:

1000000 / 8 = 125000 (this rounded to 125400 once the parameter was set)

echo 'options nf_conntrack hashsize=125400' > /etc/modprobe.d/nf_conntrack.conf

However, I'm not sure how this behaves when neutron/netns is involved. Do we need to set each netns with its own value? The customer has reported that one system still encountered high CPU usage, so we are checking that the correct config was applied there and looking at the perf data for that system.

FYI, the customer also has RPS and RFS configured on these systems, which apparently drastically improves their GRE throughput:

echo "65536" > /proc/sys/net/core/rps_sock_flow_entries
echo "16384" > /sys/class/net/$DEV/queues/rx-$QUEUE/rps_flow_cnt
echo "3fff0003fff" > /sys/class/net/$DEV/queues/rx-$QUEUE/rps_cpus

The NICs are enic with 5 rx queues and 5 tx queues (not combined rx/tx queues). irqbalance runs once in oneshot mode and then never again, so there is still RSS across the same CPU mask as above, which is NUMA node 0.
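For reference, a sketch of how the ratio check and the change above can be applied on a node follows. The 8:1 target is the ratio chosen in this comment; only the modprobe.d entry is what was actually done here, and the runtime write to the module parameter is an assumption on my part (verify that the hashsize parameter is writable on this kernel before relying on it).

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/bash
# Sketch: check the conntrack bucket ratio and apply the 8:1 target discussed above.
MAX=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
HASH=$(cat /sys/module/nf_conntrack/parameters/hashsize)
echo "nf_conntrack_max=$MAX hashsize=$HASH ratio=$((MAX / HASH)):1"

# Target roughly 8 entries per bucket.
NEW_HASH=$((MAX / 8))

# Persist across module reloads/reboots (this is what was done in this case).
echo "options nf_conntrack hashsize=$NEW_HASH" > /etc/modprobe.d/nf_conntrack.conf

# Optionally apply at runtime without reloading the module (assumption: writable
# on this kernel).
echo "$NEW_HASH" > /sys/module/nf_conntrack/parameters/hashsize
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~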
After increasing the hashsize even further, perf still showed the time being spent in the same nf_* symbol. We're now wondering whether a misbehaving netns is generating a lot of badly-hashing traffic, leading to inefficiently long linked lists, or whether we're tracking connections for too long and should reduce a timeout. The customer indicated that one particular namespace had ~200k connections during the last spike, so we're concentrating on collecting that namespace's conntrack table during the next event. I discussed ways to add more transmit SNAT IPs (to balance the hashing better) with Dan Sneddon from Neutron Eng and with OpenStack support, but that appears to require a non-trivial redesign.
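A sketch of how that per-namespace collection might be done during the next event is below. It assumes the conntrack-tools package is installed on the node and that conntrack entries are tracked per network namespace on this kernel; the output directory is arbitrary.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/bash
# Sketch: snapshot conntrack usage per qrouter namespace to find the one holding
# ~200k connections during a spike. Assumes conntrack-tools is installed.
OUTDIR=/var/tmp/conntrack-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUTDIR"

for NS in $(ip netns list | awk '{print $1}' | grep '^qrouter-'); do
    # Connection count for this namespace.
    COUNT=$(ip netns exec "$NS" conntrack -C 2>/dev/null)
    echo "$NS $COUNT" >> "$OUTDIR/counts.txt"
    # Full table dump for later analysis of the source/destination tuples.
    ip netns exec "$NS" conntrack -L > "$OUTDIR/$NS.txt" 2>/dev/null
done

# Namespaces sorted by connection count, busiest first.
sort -k2 -rn "$OUTDIR/counts.txt" | head
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~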
Closing as this is not a Neutron issue, but a platform issue that was eventually resolved by changing the client that was generating the traffic.