| Summary: | Periodic CPU spikes on neutron nodes (CPU usage goes to 100% for many cores) | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jeremy <jmelvin> |
| Component: | openstack-neutron | Assignee: | Assaf Muller <amuller> |
| Status: | CLOSED NOTABUG | QA Contact: | Toni Freger <tfreger> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 (Kilo) | CC: | amuller, chrisw, fahmed, fleitner, jbainbri, jmelvin, mfuruta, nyechiel, sgaddam, srevivo |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-16 18:28:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description (Jeremy, 2016-09-15 20:44:20 UTC)
Looking into the l3-agent logs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2016-09-15 10:14:48.961 53971 INFO neutron.agent.l3.ha [-] Router a3f658be-cb61-419f-bc79-c0ad8196668e transitioned to master
2016-09-15 10:14:49.030 53971 ERROR neutron.agent.linux.utils [-]
Command: ['ip', 'netns', 'exec', u'qrouter-a3f658be-cb61-419f-bc79-c0ad8196668e', 'sysctl', '-w', u'net.ipv6.conf.qg-e2ce1ff3-15.accept_ra=2']
Exit code: 255
Stdin:
Stdout:
Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/qg-e2ce1ff3-15/accept_ra: No such file or directory
2016-09-15 10:14:49.030 53971 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 454, in handle_one_response
result = self.application(self.environ, start_response)
File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
resp = self.call_func(req, *args, **self.kwargs)
File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
return self.func(req, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 59, in __call__
self.enqueue(router_id, state)
File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 65, in enqueue
self.agent.enqueue_state_change(router_id, state)
File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 125, in enqueue_state_change
self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 140, in _configure_ipv6_ra_on_ext_gw_port_if_necessary
ri.driver.configure_ipv6_ra(ri.ns_name, interface_name)
File "/usr/lib/python2.7/site-packages/neutron/agent/linux/interface.py", line 179, in configure_ipv6_ra
['sysctl', '-w', 'net.ipv6.conf.%s.accept_ra=2' % dev_name])
File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 580, in execute
extra_ok_codes=extra_ok_codes, **kwargs)
File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 137, in execute
raise RuntimeError(m)
RuntimeError:
Command: ['ip', 'netns', 'exec', u'qrouter-a3f658be-cb61-419f-bc79-c0ad8196668e', 'sysctl', '-w', u'net.ipv6.conf.qg-e2ce1ff3-15.accept_ra=2']
Exit code: 255
Stdin:
Stdout:
Stderr: sysctl: cannot stat /proc/sys/net/ipv6/conf/qg-e2ce1ff3-15/accept_ra: No such file or directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
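The traceback shows the l3-agent writing the accept_ra sysctl before the gateway device's ipv6 proc tree exists in the router namespace. A minimal way to reproduce the failing check by hand, as a sketch: the helper names are mine, not neutron code, and it must run as root on the affected node.

```shell
# Hypothetical helpers to reproduce the failing sysctl write by hand.
accept_ra_path() {
    # proc path that 'sysctl -w net.ipv6.conf.<dev>.accept_ra=2' needs
    echo "/proc/sys/net/ipv6/conf/$1/accept_ra"
}
check_accept_ra() {
    # $1 = router namespace, $2 = gateway device; run as root on the node
    if ip netns exec "$1" test -e "$(accept_ra_path "$2")"; then
        ip netns exec "$1" sysctl -w "net.ipv6.conf.$2.accept_ra=2"
    else
        echo "$2 has no ipv6 proc tree (yet) in $1" >&2
        return 1
    fi
}
# e.g.: check_accept_ra qrouter-a3f658be-cb61-419f-bc79-c0ad8196668e qg-e2ce1ff3-15
```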
There is a known bug, https://bugzilla.redhat.com/show_bug.cgi?id=1262647, which was addressed by https://access.redhat.com/errata/RHBA-2015:1866.
Looking into the customer's installed versions:
[fahmed@collab-shell ttcosneutron1002]$ pwd
/cases/01704538/sosreport-20160915-104140/ttcosneutron1002
[fahmed@collab-shell ttcosneutron1002]$ grep neutron installed-rpms |gawk ' {print $1} '
openstack-neutron-2015.1.2-14.el7ost.noarch
openstack-neutron-common-2015.1.2-14.el7ost.noarch
openstack-neutron-fwaas-2015.1.2-1.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-2.el7ost.noarch
openstack-neutron-ml2-2015.1.2-14.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-14.el7ost.noarch
python-neutron-2015.1.2-14.el7ost.noarch
python-neutron-fwaas-2015.1.2-1.el7ost.noarch
python-neutron-lbaas-2015.1.2-2.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch
[fahmed@collab-shell ttcosneutron1002]$
Good find Faiaz. @Jeremy, please take a look at comment 5. Can you have the customer update to the latest OSP 7 and report back?

Assaf, the customer already has higher versions of the *-neutron-* packages than those mentioned in https://access.redhat.com/errata/RHBA-2015:1866.

When the spike happens, capture a profile with 'perf record -C <cpu having spike> sleep 3', which records what ksoftirqd is doing for 3 seconds. Then use the command below to see what was recorded:

# perf report -f --stdio

That will give us some clue of what is going on.

Please see comment 10. The high CPU usage appears to be inefficient conntrack hashing.
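The capture step can be scripted; a minimal sketch, where the helper names are mine rather than anything from the case, and -g is an addition so the report shows call chains like the one below:

```shell
# perf_cmd only builds the command line, so it can be checked without
# perf installed; run_capture actually records (needs perf and root).
perf_cmd() {
    # $1 = CPU having the spike, $2 = seconds to sample (default 3)
    echo "perf record -C $1 -g -- sleep ${2:-3}"
}
run_capture() {
    $(perf_cmd "$1" "$2")       # sample the spiking CPU
    perf report -f --stdio      # then read back what was recorded
}
# e.g.: run_capture 10
```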
A typical ksoftirqd entry in the perf report looked like:
- 4.26% 4.26% ksoftirqd/10
- 73.78% nf_conntrack_tuple_taken
- nf_nat_used_tuple
nf_nat_l4proto_unique_tuple
tcp_unique_tuple
get_unique_tuple
nf_nat_setup_info
xt_snat_target_v0
ipt_do_table
iptable_nat_do_chain
nf_nat_ipv4_fn
nf_nat_ipv4_out
iptable_nat_ipv4_out
nf_iterate
nf_hook_slow
ip_output
ip_forward_finish
ip_forward
ip_rcv_finish
ip_rcv
__netif_receive_skb_core
__netif_receive_skb
process_backlog
net_rx_action
+ __do_softirq
Conntrack was hashing at about 61:1 (entries per hash bucket):

$ cat sys/module/nf_conntrack/parameters/hashsize
16384
$ cat ./proc/sys/net/netfilter/nf_conntrack_max
1000000

1000000 / 16384 ≈ 61

We set this down to a more reasonable 8:1:

1000000 / 8 = 125000

(this rounded to 125400 once the parameter was set)
echo 'options nf_conntrack hashsize=125400' > /etc/modprobe.d/nf_conntrack.conf
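The ratio arithmetic above can be sketched as a tiny shell helper; the function name is mine, and the numbers are the ones observed in the sosreport:

```shell
# Average conntrack hash chain length = nf_conntrack_max / hashsize.
conntrack_ratio() {
    echo $(( $1 / $2 ))   # integer entries per hash bucket
}
conntrack_ratio 1000000 16384    # observed: ~61 entries per bucket
conntrack_ratio 1000000 125000   # target: 8 entries per bucket
```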
However I'm not sure how this operates when neutron/netns is involved. Do we need to set each netns with its own value?
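One way to investigate that empirically (a sketch of mine, not an established procedure) is to count tracked connections inside each qrouter namespace; it assumes the conntrack-tools package and root on the neutron node:

```shell
# qrouter_namespaces filters 'ip netns list' output down to qrouter-* names;
# count_per_ns then asks conntrack for each namespace's entry count.
qrouter_namespaces() {
    awk '$1 ~ /^qrouter-/ {print $1}'
}
count_per_ns() {
    for ns in $(ip netns list | qrouter_namespaces); do
        # conntrack -C prints the number of entries seen from this netns
        echo "$ns: $(ip netns exec "$ns" conntrack -C) connections"
    done
}
```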
The customer has reported that one system did still encounter high CPU usage, so we are checking that the correct config is applied and looking at the perf data for that system.
FYI, the customer also has RPS and RFS configured on these systems; this apparently drastically improves their GRE speed:
echo "65536" > /proc/sys/net/core/rps_sock_flow_entries
echo "16384" > /sys/class/net/$DEV/queues/rx-$QUEUE/rps_flow_cnt
echo "3fff0003fff" > /sys/class/net/$DEV/queues/rx-$QUEUE/rps_cpus
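The rps_cpus value is a hexadecimal CPU bitmask; a sketch (hypothetical helper, good for CPUs 0-62 with 64-bit shell arithmetic) of how such a mask is built from a CPU list:

```shell
# Build the hex bitmask written to rps_cpus from a list of CPU numbers.
cpus_to_mask() {
    local mask=0 c
    for c in "$@"; do
        mask=$(( mask | (1 << c) ))   # set this CPU's bit
    done
    printf '%x\n' "$mask"
}
cpus_to_mask 0 1 2 3    # a 4-CPU example; prints f
```

The customer's 3fff0003fff mask would have been derived the same way from the CPU set of NUMA node 0 on their hardware.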
NICs are enic with 5 rx queues and 5 tx queues (not combined rx/tx queues). irqbalance runs once in oneshot mode and then never again, so there is still RSS across the same CPU mask as above, which is NUMA node 0.

After increasing hashsize even further, perf still showed time being spent in the same nf symbol. We're now wondering whether a misbehaving netns is generating a lot of badly-hashing traffic, leading to inefficiently long linked lists, or whether we're tracking connections for too long and should reduce a timeout. The customer indicated one particular namespace had ~200k connections during the last spike, so we're concentrating on collecting that namespace's conntrack table during the next event.

I discussed ways to add more transmit SNAT IPs (to balance hashing better) with Dan Sneddon from Neutron Engineering and with OpenStack support, but that appears to require a non-trivial redesign.

Closing as this is not a Neutron issue, but a platform issue that was eventually resolved by changing the client that was generating the traffic.
After changing hashsize even larger, perf still showed the spending time in the same nf symbol. We're now wondering whether a misbehaving netns is generating a lot of badly-hashing traffic leading to inefficiently long linked lists, or if we're tracking connections for too long and should reduce a timeout. The customer indicated one particular namespace had ~200k connections during the last spike, so we're concentrating on collecting that ns' conntrack table during the next event. I discussed ways to add more transmit SNAT IPs (to balance hashing better) with Dan Sneddon from Neutron Eng and with OpenStack support, but that appears to require a non-trivial redesign. Closing as this is not a Neutron issue, but a platform issue that was eventually resolved by changing the client that was generating the traffic. |