Bug 1185521 - ovs-vswitchd crash
Summary: ovs-vswitchd crash
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z1
Target Release: 6.0 (Juno)
Assignee: Jiri Benc
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On: 1186492 1187257 1191633
Blocks: 1191918 1191922
 
Reported: 2015-01-24 07:36 UTC by Fabio Massimo Di Nitto
Modified: 2016-04-26 13:48 UTC
CC List: 19 users

Fixed In Version: openvswitch-2.1.2-2.el7_0.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1186492 1187257 1191918 1191922
Environment:
Last Closed: 2015-03-05 18:22:54 UTC
Target Upstream Version:
Embargoed:


Attachments
abrt coredumps and sosreports (12.60 MB, application/x-xz)
2015-01-24 07:36 UTC, Fabio Massimo Di Nitto
abrt email (63.18 KB, application/mbox)
2015-01-24 07:39 UTC, Fabio Massimo Di Nitto
logs of CPU spinning (164.87 KB, application/x-xz)
2015-02-01 07:37 UTC, Fabio Massimo Di Nitto
strace and logs from another node (3.77 MB, application/x-xz)
2015-02-01 08:20 UTC, Fabio Massimo Di Nitto
conf.db (43.81 KB, text/plain)
2015-02-01 08:45 UTC, Fabio Massimo Di Nitto


Links
Red Hat Product Errata RHBA-2015:0640 (normal, SHIPPED_LIVE): Red Hat Enterprise Linux OpenStack Platform Bug Fix and Enhancement Advisory, last updated 2015-03-05 23:17:40 UTC

Description Fabio Massimo Di Nitto 2015-01-24 07:36:39 UTC
Created attachment 983634 [details]
abrt coredumps and sosreports

Description of problem:

I started testing OSP6 on top of the latest RHEL7.1 compose, and recently (2/3 weeks ago perhaps?), after the latest 7.1 upgrade, ovs-vswitchd started crashing when neutron starts configuring its own bits.

That said, I am neither an ovs nor a neutron expert, but I was able to capture some data. I have added neutron experts in CC and am very happy to provide access to the systems for debugging at any time.

[root@rhos6-node1 ~(keystone_admin)]$ uname -a
Linux rhos6-node1.vmnet.mpc.lab.eng.bos.redhat.com 3.10.0-224.el7.x86_64
#1 SMP Mon Jan 19 16:18:15 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

[root@rhos6-node1 ~(keystone_admin)]$ rpm -q -i openvswitch
Name        : openvswitch
Version     : 2.1.2
Release     : 2.el7_0.1
Architecture: x86_64
Install Date: Fri 23 Jan 2015 11:02:37 AM CET

Also note that:
- the VM is running RHEL7.1
- a quick test using the openvswitch 2.3.1-2.gitXX brew build didn't show the problem, but that build isn't tagged for either 7.1 or osp6?

The reason is that if the problem is fixed in the newer package, then we
need to make sure it ships at the same time RHEL7.1 goes out the door. It's also entirely possible that this is a kernel regression that's triggering a crash in userland (bad response from the kernel?).

To reproduce:

systemctl enable openvswitch
systemctl start openvswitch

ovs-vsctl add-br br-int
ovs-vsctl add-br br-ex

ovs-vsctl add-port br-ex eth0

systemctl start neutron-openvswitch-agent

This is from the neutron log, but at times we were able to reproduce it
without neutron as well, via manual looping (it happens less often; a sketch of such a loop follows the log below).

Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf',
'ovs-vsctl', '--timeout=10', 'add-port', 'br-tun', 'patch-int', '--',
'set', 'Interface', 'patch-int', 'type=patch', 'options:peer=patch-tun']
Exit code: 242
Stdout: ''
Stderr: '2015-01-23T12:42:22Z|00002|fatal_signal|WARN|terminating with
signal 14 (Alarm clock)\n'
2015-01-23 13:42:22.430 3880 ERROR neutron.agent.linux.ovs_lib
[req-0fff5bfc-978f-4ef2-bf99-a5c8f824961a None] Unable to execute
['ovs-vsctl', '--timeout=10', 'add-port', 'br-tun', 'patch-int', '--',
'set', 'Interface', 'patch-int', 'type=patch',
'options:peer=patch-tun']. Exception:
Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf',
'ovs-vsctl', '--timeout=10', 'add-port', 'br-tun', 'patch-int', '--',
'set', 'Interface', 'patch-int', 'type=patch', 'options:peer=patch-tun']
Exit code: 242
Stdout: ''
Stderr: '2015-01-23T12:42:22Z|00002|fatal_signal|WARN|terminating with
signal 14 (Alarm clock)\n'
2015-01-23 13:42:22.684 3880 ERROR
neutron.plugins.openvswitch.agent.ovs_neutron_agent
[req-0fff5bfc-978f-4ef2-bf99-a5c8f824961a None] Failed to create OVS
patch port. Cannot have tunneling enabled on this agent, since this
version of OVS does not support tunnels or patch ports. Agent terminated!
2015-01-23 13:42:22.891 3880 ERROR neutron.agent.linux.utils
[req-0fff5bfc-978f-4ef2-bf99-a5c8f824961a None]
Command: ['ps', '--ppid', '5756', '-o', 'pid=']
Exit code: 1
Stdout: ''
Stderr: ''

 systemctl stop neutron-openvswitch-agent
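
For reference, a minimal stand-in for the "manual looping" mentioned above (hypothetical; the exact loop we used is not recorded here) just exercises ovs-vsctl repeatedly, so that a crash of ovs-vswitchd shows up as the same signal 14 timeout:

while true; do
    # each round-trip talks to ovsdb-server and waits for ovs-vswitchd;
    # if ovs-vswitchd has crashed, the command dies with signal 14 (Alarm clock)
    ovs-vsctl --timeout=10 add-port br-int dummy0 -- \
        set Interface dummy0 type=internal || break
    ovs-vsctl --timeout=10 del-port br-int dummy0 || break
done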

Comment 1 Fabio Massimo Di Nitto 2015-01-24 07:39:18 UTC
Created attachment 983635 [details]
abrt email

email from abrt

Comment 2 Fabio Massimo Di Nitto 2015-01-24 07:40:43 UTC
The timeouts reported by neutron are due to ovs-vswitchd crashing behind the scenes.

Comment 7 Jiri Benc 2015-01-26 16:21:21 UTC
This is a bug in openvswitch, already fixed upstream:

546953509095 lib/odp-util: Do not use mask if it doesn't exist.

It's even part of v2.1.3 upstream:

fc60782dd679 lib/odp-util: Do not use mask if it doesn't exist.

but the ovs version used is openvswitch-2.1.2-2.el7_0.1.

Comment 8 Jiri Benc 2015-01-26 16:22:59 UTC
Version 2.3 has the fix and is not affected.

Comment 9 Jiri Benc 2015-01-26 16:32:24 UTC
The bug is in the handling of new/unknown netlink attributes in the old user space. The new kernel datapath includes attributes that are unknown to the old user space. This is okay and should be handled without problems, but due to a bug in the code dealing with unknown attributes in versions <2.1.3, ovs-vswitchd crashes on a NULL pointer dereference.

This can't be fixed on the kernel side; we need the new attributes to support the post-2.1 features.

This can be fixed e.g. by releasing openvswitch-2.1.3 (or just openvswitch-2.1.2 with the single patch added) and requiring customers to update to it first before updating to the RHEL7 kernel.
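
In practical terms that means the update order on a host matters; a sketch of the resulting procedure (illustrative yum/systemctl invocations, package names as used in this bug):

yum update openvswitch          # pick up the 2.1.x build carrying the odp-util fix first
systemctl restart openvswitch   # make sure the fixed ovs-vswitchd is running
yum update kernel               # only then move to the newer RHEL7 kernel
reboot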

Comment 10 Jiri Benc 2015-01-26 17:05:12 UTC
(In reply to Jiri Benc from comment #7)
> It's even part of v2.1.3 upstream:

Correcting myself, it's only part of upcoming 2.1.4, it's not in 2.1.3.

Comment 26 Fabio Massimo Di Nitto 2015-02-01 07:37:56 UTC
Created attachment 986627 [details]
logs of CPU spinning


Linux mrg-03.mpc.lab.eng.bos.redhat.com 3.10.0-229.el7.x86_64 #1 SMP Thu
Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

I am using the latest official kernel build as announced by Linda; it's
targeting the 7.1 RC and includes the latest ovs kernel fix.

[root@mrg-03 ~]# rpm -q -i openvswitch
Name        : openvswitch
Version     : 2.1.2
Release     : 2.test.el7.1

^^^^ this is the build with the userland fix that Jiri did.


What I see so far (and it's random to some extent):

- machine boot -> no cpu spinning, or cpu spinning right away

- if the cpu spinning does NOT start at boot, it randomly starts
  after a while. Please note that, besides a couple of configured
  interfaces, there are no ovs users (OSP is NOT running in any
  of those tests). openvswitch is started at boot.

- stopping the openvswitch userland stops the cpu spinning of the
  daemon.

- once the userland cpu spinning is off/over, I see one ksoftirqd
  thread that goes to 100% cpu at regular intervals; interestingly
  enough it's always ksoftirqd/3 on all 6 machines where I am
  experiencing this problem.

- when the cpu is spinning I can see the ovs logs filling up fast and
  skipping literally 15-20K entries due to log flooding. Logs are in
  the attachment.

I was able to reproduce this issue on both baremetal and in VMs. It's
virtually impossible to debug in VMs, but baremetal is still
responsive under load.
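
A convenient way to watch the ovs-vswitchd and ksoftirqd threads side by side (standard procps tooling, shown for illustration):

watch -n1 'ps -eLo pid,tid,comm,psr,pcpu --sort=-pcpu | head -20'
# tid/comm identify the individual threads, psr the CPU each last ran on;
# this is the kind of view in which ksoftirqd/3 stands out at 100% cpu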

Comment 28 Fabio Massimo Di Nitto 2015-02-01 08:20:33 UTC
Created attachment 986640 [details]
strace and logs from another node

Comment 30 Fabio Massimo Di Nitto 2015-02-01 08:45:30 UTC
Created attachment 986642 [details]
conf.db

Reproducer for the strace:

1) install the latest RHEL7.1 and make sure the kernel and userland versions are as in comment #29

2) Use the attached conf.db. It was generated by:


systemctl enable openvswitch
systemctl start openvswitch

ovs-vsctl add-br br-int

and used by neutron on a compute node for a few instances.

3) stop openstack, stop everything.

4) systemctl start openvswitch

5) wait...

after some time, the ovs process will spin at max CPU on all CPUs.
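
Once it does, the spin is easy to confirm per thread (illustrative, standard tooling):

top -H -p $(pidof ovs-vswitchd)
# -H lists individual threads; in the bad state one or more ovs-vswitchd
# threads sit at or near 100% CPU (the revalidators, per comment 35)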

Comment 31 Fabio Massimo Di Nitto 2015-02-02 05:30:22 UTC
One more piece of information.

It is possible to trigger the same issue using the http://download.devel.redhat.com/brewroot/packages/openvswitch/2.3.1/2.git20150113.el7/x86_64/openvswitch-2.3.1-2.git20150113.el7.x86_64.rpm build.

There are slightly different symptoms (slightly less CPU spinning) and a different trigger (it appears to start spinning only when neutron is used), but it eventually makes the machine unusable.

Comment 33 Jiri Benc 2015-02-02 15:38:50 UTC
Fabio, I did not reproduce the 100% CPU usage. Can I have access to your machine?

Comment 34 Fabio Massimo Di Nitto 2015-02-02 15:47:24 UTC
(In reply to Jiri Benc from comment #33)
> Fabio, I did not reproduce the 100% CPU usage. Can I have access to your
> machine?

Yes, please send me your ssh public key via email.

Comment 35 Jiri Benc 2015-02-05 14:07:59 UTC
Recording my observation (on Fabio's machines with the problem reproduced):

The leader revalidator thread is dumping flows from the kernel in a tight loop. The other revalidator threads and the ovs-vswitchd thread are spending a lot of time on various locks (which is logical). The flow revalidation should not happen more often than every 500 msec, but for some reason that's not the case here.

It seems the problem is seq_wait in udpif_revalidator, which either sets poll_immediate_wake or causes poll_block to return immediately because of latch_wait(&waiter->thread->latch) in seq_wait__.
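
For context, each pass of that loop is equivalent to a manual kernel datapath flow dump (standard OVS tooling, shown only to illustrate what the thread keeps repeating):

ovs-dpctl dump-flows
# one pass over the kernel flow table; the spinning revalidator
# effectively re-runs this back to back instead of at the intended
# 500 msec cadence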

Comment 39 Jiri Benc 2015-02-11 15:46:08 UTC
After gathering more data, doing some tests, and consulting with Fabio and Miguel, the root cause turns out to be OpenStack (Neutron), not ovs. What happens is this:

Under certain circumstances, openflow rules are not reconfigured on the br-tun bridges on the nodes (those "certain circumstances" are things like an ovs agent restart or a restart of ovs-vswitchd on a node, etc.), leaving the ovs bridge with its default behavior (NORMAL, i.e. a MAC-learning switch). As the setup contains loops, it's prone to broadcast storms under such conditions.
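
A quick way to check whether a node is in that state (standard OVS tooling; br-tun as in the logs above) is to dump the bridge's flow table:

ovs-ofctl dump-flows br-tun
# with the agent's rules missing, only a default entry remains, e.g.:
#   cookie=0x0, ... priority=0 actions=NORMAL
# i.e. the bridge behaves as a plain MAC-learning switch and happily
# forwards broadcasts around any loop in the tunnel mesh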

It's not completely clear why ovs starts dumping flows in a tight loop under a broadcast storm, and it's certainly something to look into (I'll file a separate bug for that), but that's not the root cause.

Comment 45 Ofer Blaut 2015-02-17 11:22:40 UTC
The issues and the log flooding are no longer seen with openvswitch-2.1.2-2.el7_0.2.x86_64.

Comment 50 errata-xmlrpc 2015-03-05 18:22:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0640.html

Comment 51 August Simonelli 2015-03-31 02:40:47 UTC
This affects RHELOSP5 on 7.1 too, and I can confirm that openvswitch-2.1.2-2.el7_0.2.x86_64.rpm fixes it. :-)

