Bug 1637926 - [CI] openvswitch 2.10 dies after BFD reports failure connecting to a bridge controller (i.e.: Opendaylight)
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 14.0 (Rocky)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Numan Siddique
QA Contact: Waldemar Znoinski
Duplicates: 1640045
Blocks: 1629430
 
Reported: 2018-10-10 10:48 UTC by Waldemar Znoinski
Modified: 2019-01-11 11:54 UTC (History)
CC List: 14 users

Fixed In Version: openvswitch2.10-2.10.0-21.el7fdn
Last Closed: 2019-01-11 11:53:52 UTC


Attachments
OVS LOG (30.26 KB, text/plain), 2018-10-10 10:48 UTC, Waldemar Znoinski
ovs core dump (1.01 MB, application/x-gzip), 2018-10-12 10:44 UTC, Waldemar Znoinski
new ovs vswitchd.log (12.84 KB, application/x-gzip), 2018-10-12 10:55 UTC, Waldemar Znoinski
journal (2.74 MB, application/x-gzip), 2018-10-12 10:55 UTC, Waldemar Znoinski


Links
Red Hat Product Errata RHEA-2019:0045, last updated 2019-01-11 11:54:00 UTC

Description Waldemar Znoinski 2018-10-10 10:48:44 UTC
Created attachment 1492509 [details]
OVS LOG

Description of problem:
As part of our downstream OSP + ODL testing we bring the opendaylight service down and then back up (roughly 10 minutes pass between the stop and the start).
During those ~10 minutes while opendaylight is down, ovs logs:

2018-10-10T09:03:37.631Z|01744|rconn|INFO|br-int<->tcp:172.17.1.21:6653: connection closed by peer
2018-10-10T09:03:37.881Z|01745|rconn|INFO|br-int<->tcp:172.17.1.21:6653: connecting...
2018-10-10T09:03:37.881Z|01746|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:03:37.881Z|01747|rconn|INFO|br-int<->tcp:172.17.1.21:6653: waiting 2 seconds before reconnect
2018-10-10T09:03:39.881Z|01748|rconn|INFO|br-int<->tcp:172.17.1.21:6653: connecting...
2018-10-10T09:03:39.882Z|01749|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:03:39.882Z|01750|rconn|INFO|br-int<->tcp:172.17.1.21:6653: waiting 4 seconds before reconnect
2018-10-10T09:03:43.881Z|01751|rconn|INFO|br-int<->tcp:172.17.1.21:6653: connecting...
2018-10-10T09:03:43.882Z|01752|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:03:43.882Z|01753|rconn|INFO|br-int<->tcp:172.17.1.21:6653: continuing to retry connections in the background but suppressing further logging
2018-10-10T09:03:59.881Z|01768|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:04:07.882Z|01770|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:04:15.881Z|01771|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:04:23.882Z|01772|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:04:31.886Z|01773|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:04:39.881Z|01774|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:04:47.884Z|01775|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:04:55.882Z|01776|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:05:03.881Z|01777|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:05:11.881Z|01778|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:05:19.881Z|01779|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:05:27.881Z|01780|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:05:28.201Z|00001|bfd(handler26)|INFO|tun70020c4c630: BFD state change: up->down "No Diagnostic"->"Neighbor Signaled Session Down".
  Forwarding: true
  Detect Multiplier: 3
  Concatenated Path Down: false
  TX Interval: Approx 1000ms
  RX Interval: Approx 1000ms
  Detect Time: now +3000ms
  Next TX Time: now +160ms
  Last TX Time: now -800ms

  Local Flags: none
  Local Session State: up
  Local Diagnostic: No Diagnostic
  Local Discriminator: 0x3bc62f48
  Local Minimum TX Interval: 1000ms
  Local Minimum RX Interval: 1000ms

  Remote Flags: none
  Remote Session State: down
  Remote Diagnostic: No Diagnostic
  Remote Discriminator: 0x62f329d8
  Remote Minimum TX Interval: 1000ms
  Remote Minimum RX Interval: 1000ms
  Remote Detect Multiplier: 3
2018-10-10T09:05:28.201Z|00002|bfd(handler26)|INFO|tun70020c4c630: Remote signaled STATE_DOWN.
  vers:1 diag:"No Diagnostic" state:down mult:3 length:24
  flags: none
  my_disc:0x62f329d8 your_disc:0x0
  min_tx:1000000us (1000ms)
  min_rx:1000000us (1000ms)
  min_rx_echo:0us (0ms)  Forwarding: true
  Detect Multiplier: 3
  Concatenated Path Down: false
  TX Interval: Approx 1000ms
  RX Interval: Approx 1000ms
  Detect Time: now +2999ms
  Next TX Time: now +160ms
  Last TX Time: now -800ms

  Local Flags: none
  Local Session State: down
  Local Diagnostic: Neighbor Signaled Session Down
  Local Discriminator: 0x3bc62f48
  Local Minimum TX Interval: 1000ms
  Local Minimum RX Interval: 1000ms

  Remote Flags: none
  Remote Session State: down
  Remote Diagnostic: No Diagnostic
  Remote Discriminator: 0x0
  Remote Minimum TX Interval: 0ms
  Remote Minimum RX Interval: 1ms
  Remote Detect Multiplier: 3
2018-10-10T09:05:29.052Z|00003|bfd(handler26)|INFO|tun70020c4c630: New remote min_rx.
  vers:1 diag:"No Diagnostic" state:init mult:3 length:24
  flags: none
  my_disc:0x62f329d8 your_disc:0x3bc62f48
  min_tx:1000000us (1000ms)
  min_rx:1000000us (1000ms)
  min_rx_echo:0us (0ms)  Forwarding: true
  Detect Multiplier: 3
  Concatenated Path Down: false
  TX Interval: Approx 1000ms
  RX Interval: Approx 1000ms
  Detect Time: now +2148ms
  Next TX Time: now +69ms
  Last TX Time: now -691ms

  Local Flags: none
  Local Session State: down
  Local Diagnostic: Neighbor Signaled Session Down
  Local Discriminator: 0x3bc62f48
  Local Minimum TX Interval: 1000ms
  Local Minimum RX Interval: 1000ms

  Remote Flags: none
  Remote Session State: init
  Remote Diagnostic: No Diagnostic
  Remote Discriminator: 0x62f329d8
  Remote Minimum TX Interval: 0ms
  Remote Minimum RX Interval: 1000ms
  Remote Detect Multiplier: 3
2018-10-10T09:05:29.052Z|00004|bfd(handler26)|INFO|tun70020c4c630: BFD state change: down->up "Neighbor Signaled Session Down"->"Neighbor Signaled Session Down".
  Forwarding: true
  Detect Multiplier: 3
  Concatenated Path Down: false
  TX Interval: Approx 1000ms
  RX Interval: Approx 1000ms
  Detect Time: now +3000ms
  Next TX Time: now +69ms
  Last TX Time: now -691ms

  Local Flags: none
  Local Session State: down
  Local Diagnostic: Neighbor Signaled Session Down
  Local Discriminator: 0x3bc62f48
  Local Minimum TX Interval: 1000ms
  Local Minimum RX Interval: 1000ms

  Remote Flags: none
  Remote Session State: init
  Remote Diagnostic: No Diagnostic
  Remote Discriminator: 0x62f329d8
  Remote Minimum TX Interval: 1000ms
  Remote Minimum RX Interval: 1000ms
  Remote Detect Multiplier: 3
2018-10-10T09:05:35.882Z|01781|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:05:42.933Z|01782|connmgr|INFO|br-int<->tcp:172.17.1.11:6653: 1 flow_mods 10 s ago (1 adds)
2018-10-10T09:05:43.881Z|01783|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:05:51.882Z|01784|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:05:59.881Z|01785|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:06:07.882Z|01786|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:06:15.882Z|01787|rconn|WARN|br-int<->tcp:172.17.1.21:6653: connection failed (Connection refused)
2018-10-10T09:06:16.455Z|00002|bfd(monitor34)|INFO|tunf301e6505ab: BFD state change: up->down "No Diagnostic"->"Control Detection Time Expired".
  Forwarding: true
  Detect Multiplier: 3
  Concatenated Path Down: false
  TX Interval: Approx 1000ms
  RX Interval: Approx 1000ms
  Detect Time: now +0ms
  Next TX Time: now +970ms
  Last TX Time: now +0ms

  Local Flags: none
  Local Session State: up
  Local Diagnostic: No Diagnostic
  Local Discriminator: 0xa18d8eda
  Local Minimum TX Interval: 1000ms
  Local Minimum RX Interval: 1000ms

  Remote Flags: none
  Remote Session State: up
  Remote Diagnostic: No Diagnostic
  Remote Discriminator: 0x4f0a3226
  Remote Minimum TX Interval: 1000ms
  Remote Minimum RX Interval: 1000ms
  Remote Detect Multiplier: 3


There are no further entries in the ovs log after that.
systemctl/journal shows:


Oct 10 10:06:17 controller-1 ovs-ctl[911167]: 2018-10-10T09:06:17Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.1020.ctl
Oct 10 10:06:17 controller-1 ovs-appctl[911206]: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.1020.ctl
Oct 10 10:06:17 controller-1 ovs-ctl[911167]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.1020.ctl" (Connection refused)

[root@controller-1 opendaylight]# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Wed 2018-10-10 10:06:17 BST; 32min ago
  Process: 911167 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)

Oct 10 10:06:17 controller-1 ovs-ctl[911167]: 2018-10-10T09:06:17Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.1020.ctl
Oct 10 10:06:17 controller-1 ovs-ctl[911167]: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.1020.ctl" (Connection refused)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.


The previous OSP (13) with ovs 2.9 was tested in exactly the same way, but its ovs log does not show these BFD messages:

2018-10-09T03:59:13.367Z|00415|rconn|INFO|br-int<->tcp:172.17.1.15:6653: connection closed by peer
2018-10-09T03:59:13.846Z|00416|rconn|INFO|br-int<->tcp:172.17.1.15:6653: connecting...
2018-10-09T03:59:13.847Z|00417|rconn|WARN|br-int<->tcp:172.17.1.15:6653: connection dropped (Connection refused)
2018-10-09T03:59:13.847Z|00418|rconn|INFO|br-int<->tcp:172.17.1.15:6653: waiting 2 seconds before reconnect
2018-10-09T03:59:15.845Z|00419|rconn|INFO|br-int<->tcp:172.17.1.15:6653: connecting...
2018-10-09T03:59:15.846Z|00420|rconn|WARN|br-int<->tcp:172.17.1.15:6653: connection dropped (Connection refused)
2018-10-09T03:59:15.846Z|00421|rconn|INFO|br-int<->tcp:172.17.1.15:6653: waiting 4 seconds before reconnect
2018-10-09T03:59:19.846Z|00422|rconn|INFO|br-int<->tcp:172.17.1.15:6653: connecting...
2018-10-09T03:59:19.846Z|00423|rconn|WARN|br-int<->tcp:172.17.1.15:6653: connection dropped (Connection refused)
2018-10-09T03:59:19.846Z|00424|rconn|INFO|br-int<->tcp:172.17.1.15:6653: continuing to retry connections in the background but suppressing further logging
2018-10-09T03:59:27.848Z|00425|rconn|WARN|br-int<->tcp:172.17.1.15:6653: connection dropped (Connection refused)
...
2018-10-09T04:05:59.846Z|00554|rconn|WARN|br-int<->tcp:172.17.1.15:6653: connection dropped (Connection refused)
2018-10-09T04:06:07.846Z|00557|rconn|WARN|br-int<->tcp:172.17.1.15:6653: connection dropped (Connection refused)
2018-10-09T04:06:15.948Z|00559|rconn|INFO|br-int<->tcp:172.17.1.15:6653: connected



Version-Release number of selected component (if applicable):
osp14, openvswitch2.10-2.10.0-4

How reproducible:
100%

Steps to Reproduce:
1. deploy OSP + ovs2.10 + ODL
2. stop opendaylight on any overcloud controller for >6mins (see the sketch below)
3. observe ovs dying
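
For reference, a minimal shell sketch of steps 2-3; the opendaylight container name (opendaylight_api) and the wait time are assumptions, not taken from this report:

# On one overcloud controller, stop the opendaylight container
sudo docker stop opendaylight_api

# Wait well past the BFD detect time and the ~6 minute mark, then check
# ovs-vswitchd on the other controllers
sleep 400
sudo systemctl status ovs-vswitchd
sudo tail -n 50 /var/log/openvswitch/ovs-vswitchd.log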

Actual results:
ovs dies after BFD reports the session towards one of the opendaylight controllers as down

Expected results:
ovs to survive and not become a zombie! 

Additional info:

Comment 1 Aaron Conole 2018-10-10 13:48:42 UTC
According to the logs, it looks like ovs-vswitchd was stopped?  Did I understand the timeline correctly?

Comment 2 Waldemar Znoinski 2018-10-10 14:01:21 UTC
yes, ovs was stopped but not by hand... must have died

Comment 3 Eran Kuris 2018-10-10 14:08:53 UTC
@Lucas, @Numan, is there any chance that this will reproduce on / affect OVN?

Comment 4 Eran Kuris 2018-10-10 14:10:11 UTC
Numan see comment 3

Comment 5 Waldemar Znoinski 2018-10-10 14:13:45 UTC
My 2 cents...
Since OVN manages OVS in a different way than ODL does (I learned that through https://bugzilla.redhat.com/show_bug.cgi?id=1626488, which was happening on ODL but not on OVN instances), I think you may not be able to reproduce this bug (1637926) on OVN, but it's worth a shot.

Other than that, I have a downstream machine showing these symptoms that I'm happy to share with developers to troubleshoot/debug the issue.

Comment 6 Numan Siddique 2018-10-10 14:45:41 UTC
(In reply to Eran Kuris from comment #4)
> Numan see comment 3

Eran - to see if it affects OVN, maybe you can bring down one of the controller nodes, wait for about 10 minutes, and check the status of ovs-vswitchd on the other controller nodes.
Before that, create a neutron router and attach a gateway interface to it so that OVN configures BFD on the tunnel interfaces (and maybe create a VM as well); a rough sequence is sketched below.
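
A hedged openstack CLI sketch of that setup; the network and resource names below (public, private, private-subnet, cirros, m1.small) are placeholders, not taken from this deployment:

# Router with an external gateway so OVN configures BFD on the tunnel interfaces
openstack router create bfd-test-router
openstack router set --external-gateway public bfd-test-router
openstack router add subnet bfd-test-router private-subnet

# Optionally boot a VM on the internal network
openstack server create --flavor m1.small --image cirros --network private bfd-test-vm

# Then shut down one controller node, wait ~10 minutes, and on the remaining
# controllers check:
systemctl status ovs-vswitchd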

Comment 7 Aaron Conole 2018-10-10 15:42:29 UTC
(In reply to Waldemar Znoinski from comment #0)
...
> 
> [root@controller-1 opendaylight]# systemctl status ovs-vswitchd
> ● ovs-vswitchd.service - Open vSwitch Forwarding Unit
>    Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static;
> vendor preset: disabled)
>    Active: inactive (dead) since Wed 2018-10-10 10:06:17 BST; 32min ago
>   Process: 911167 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl
> --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
> 
> Oct 10 10:06:17 controller-1 ovs-ctl[911167]:
> 2018-10-10T09:06:17Z|00001|unixctl|WARN|failed to connect to
> /var/run/openvswitch/ovs-vswitchd.1020.ctl
> Oct 10 10:06:17 controller-1 ovs-ctl[911167]: ovs-appctl: cannot connect to
> "/var/run/openvswitch/ovs-vswitchd.1020.ctl" (Connection refused)
> Warning: Journal has been rotated since unit was started. Log output is
> incomplete or unavailable.

Usually, when something is killed, it won't have an ExecStop= set, it will report that it was killed.  Something here seems like it stopped ovs-vswitchd (if I'm reading the output correctly).  Can you attach the full systemd journal (or somehow get it)?
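
For reference, one hedged way to capture that journal (the time window below is only an example):

# Full journal for the ovs-vswitchd unit around the crash
journalctl -u ovs-vswitchd.service --since "2018-10-10 09:00" --until "2018-10-10 10:30" > ovs-vswitchd-journal.txt
# Or the whole journal for the current boot, in case the unit filter hides context
journalctl -b > full-journal.txt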

Comment 8 Waldemar Znoinski 2018-10-12 08:34:04 UTC
(In reply to Aaron Conole from comment #7)
> (In reply to Waldemar Znoinski from comment #0)
> ...
> > 
> > [root@controller-1 opendaylight]# systemctl status ovs-vswitchd
> > ● ovs-vswitchd.service - Open vSwitch Forwarding Unit
> >    Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static;
> > vendor preset: disabled)
> >    Active: inactive (dead) since Wed 2018-10-10 10:06:17 BST; 32min ago
> >   Process: 911167 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl
> > --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
> > 
> > Oct 10 10:06:17 controller-1 ovs-ctl[911167]:
> > 2018-10-10T09:06:17Z|00001|unixctl|WARN|failed to connect to
> > /var/run/openvswitch/ovs-vswitchd.1020.ctl
> > Oct 10 10:06:17 controller-1 ovs-ctl[911167]: ovs-appctl: cannot connect to
> > "/var/run/openvswitch/ovs-vswitchd.1020.ctl" (Connection refused)
> > Warning: Journal has been rotated since unit was started. Log output is
> > incomplete or unavailable.
> 
> Usually, when something is killed, it won't have an ExecStop= set, it will
> report that it was killed.  Something here seems like it stopped
> ovs-vswitchd (if I'm reading the output correctly).  Can you attach the full
> systemd journal (or somehow get it)?


Aaron, I think it works differently than you described, because I:

1. had ovs running
2. killed ovs-vswitchd directly
3. and systemctl status still shows the successful ExecStop (note: that ExecStop is ovs-ctl --no-ovsdb-server stop, i.e. for ovs-vswitchd, not for ovsdb)

[root@controller-1 openvswitch2.10-2.10.0-10]# sudo systemctl status ovs-vswitchd                                                                                                                                  
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2018-10-12 09:30:53 BST; 17s ago
  Process: 469088 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 25807 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 25802 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 25800 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
    Tasks: 11
   Memory: 87.5M
   CGroup: /system.slice/ovs-vswitchd.service
           └─25850 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --...

Oct 12 09:30:52 controller-1 systemd[1]: Starting Open vSwitch Forwarding Unit...
Oct 12 09:30:53 controller-1 ovs-ctl[25807]: Starting ovs-vswitchd [  OK  ]
Oct 12 09:30:53 controller-1 ovs-vsctl[25985]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=controller-1.localdomain
Oct 12 09:30:53 controller-1 ovs-ctl[25807]: Enabling remote OVSDB managers [  OK  ]
Oct 12 09:30:53 controller-1 systemd[1]: Started Open vSwitch Forwarding Unit.
[root@controller-1 openvswitch2.10-2.10.0-10]# ps aux | grep -i ovs-vswitch
openvsw+   25850  2.0  0.3 913216 106728 ?       S<Lsl 09:30   0:00 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
root       28699  0.0  0.0 112704  1008 pts/1    S+   09:31   0:00 grep --color=auto -i ovs-vswitch
[root@controller-1 openvswitch2.10-2.10.0-10]# kill 25850
[root@controller-1 openvswitch2.10-2.10.0-10]# sudo systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Fri 2018-10-12 09:31:27 BST; 5s ago
  Process: 29139 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 25807 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 25802 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 25800 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)

Oct 12 09:30:52 controller-1 systemd[1]: Starting Open vSwitch Forwarding Unit...
Oct 12 09:30:53 controller-1 ovs-ctl[25807]: Starting ovs-vswitchd [  OK  ]
Oct 12 09:30:53 controller-1 ovs-vsctl[25985]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . external-ids:hostname=controller-1.localdomain
Oct 12 09:30:53 controller-1 ovs-ctl[25807]: Enabling remote OVSDB managers [  OK  ]
Oct 12 09:30:53 controller-1 systemd[1]: Started Open vSwitch Forwarding Unit.
Oct 12 09:31:27 controller-1 ovs-ctl[29139]: ovs-vswitchd is not running.
[root@controller-1 openvswitch2.10-2.10.0-10]# 



Also,
I've installed the newest fast datapath OVS version 2.10-10 on all overcloud nodes and started ovs; when I brought the opendaylight docker container down, OVS on two overcloud nodes died after a couple of minutes.

[root@controller-2 openvswitch2.10-2.10.0-10]# sudo systemctl status ovs-vswitchd                                                                                                                                  
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2018-10-11 14:16:35 BST; 19h ago
  Process: 335143 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 305530 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 305526 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 305524 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
 Main PID: 967023 (code=killed, signal=TERM)

Comment 9 Waldemar Znoinski 2018-10-12 10:44:38 UTC
Created attachment 1493244 [details]
ovs core dump

Comment 11 Waldemar Znoinski 2018-10-12 10:55:18 UTC
Created attachment 1493247 [details]
new ovs vswitchd.log

Comment 12 Waldemar Znoinski 2018-10-12 10:55:55 UTC
Created attachment 1493248 [details]
journal

Comment 13 Waldemar Znoinski 2018-10-12 10:57:16 UTC
Aaron, I'm attaching a newer ovs log and journal as requested.
Note I started the ovs-vswitchd systemd service manually at 11:33.

Comment 15 Waldemar Znoinski 2018-10-15 10:40:20 UTC
I've tried setting
ovs-appctl bfd/set-forwarding false

on all overcloud nodes, then ran the integration tests (which create VMs, bring the odl containers down, and check the VMs are still pingable)... ovs died again.
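
A rough sketch of how such a setting can be pushed to every node; the hostnames and the heat-admin user are placeholders, and the real run may have used ansible instead:

for node in controller-0 controller-1 controller-2 compute-0 compute-1; do
    ssh heat-admin@$node "sudo ovs-appctl bfd/set-forwarding false"
done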

Comment 17 Eran Kuris 2018-10-16 10:31:56 UTC
So I checked the same scenario on an OVN deployment and I did not hit this issue.

Comment 18 Eelco Chaudron 2018-10-17 11:16:23 UTC
*** Bug 1640045 has been marked as a duplicate of this bug. ***

Comment 19 Eelco Chaudron 2018-10-17 12:34:45 UTC
Adding the backtrace for the crash, so it's easy to see it's a duplicate of bug 1640045.

#0  0x00007f3c6524b207 in raise () from /lib64/libc.so.6
#1  0x00007f3c6524c8f8 in abort () from /lib64/libc.so.6
#2  0x00007f3c66c06cb7 in ofputil_encode_flow_removed (fr=fr@entry=0x7f3c59ff9b80, protocol=<optimized out>)
    at lib/ofp-monitor.c:293
#3  0x00007f3c671b1db3 in connmgr_send_flow_removed (mgr=mgr@entry=0x56197f5a4800, fr=fr@entry=0x7f3c59ff9b80)
    at ofproto/connmgr.c:1702
#4  0x00007f3c671b7464 in ofproto_rule_send_removed (rule=0x56197f69db80) at ofproto/ofproto.c:5729
#5  0x00007f3c671bdc3d in rule_destroy_cb (rule=0x56197f69db80) at ofproto/ofproto.c:2839
#6  0x00007f3c66c1e88e in ovsrcu_call_postponed () at lib/ovs-rcu.c:342
#7  0x00007f3c66c1ea94 in ovsrcu_postpone_thread (arg=<optimized out>) at lib/ovs-rcu.c:357
#8  0x00007f3c66c20d2f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:354
#9  0x00007f3c66000dd5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3c65313b3d in clone () from /lib64/libc.so.6
(gdb) 
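
For anyone re-checking the attached core dump, a backtrace like the one above can usually be regenerated with gdb; the archive and core file names below are examples, and the matching openvswitch debuginfo package needs to be installed:

# Unpack the attached core dump and open it against the ovs-vswitchd binary
tar xzf ovs-core-dump.tar.gz
gdb /usr/sbin/ovs-vswitchd core.* -batch -ex "thread apply all bt" > backtrace.txt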

Re-assigning BZ to duplicate BZ owner, as he has a fix posted upstream.

https://patchwork.ozlabs.org/patch/985340/

Comment 20 jamo luhrsen 2018-10-17 13:38:40 UTC
Tomas, Waldek,

can you attach the karaf.log from when this bug occurs? There are also
some upstream jobs with messy things going on with ovs; I'm wondering if we are
seeing the same thing. But the karaf.log will also show a lot of reconnects,
I think.

Comment 21 Waldemar Znoinski 2018-10-18 12:24:30 UTC
jamo, the setup I have this problem on was changed plenty (diff ovs, many restarts, many pulling and tearing) so I don't think karaf will be a good example for you to compare vs. upstream, I'll redeploy and give you the karaf.log then (today)

Comment 22 jamo luhrsen 2018-10-18 15:53:44 UTC
(In reply to Waldemar Znoinski from comment #21)
> jamo, the setup I have this problem on was changed plenty (diff ovs, many
> restarts, many pulling and tearing) so I don't think karaf will be a good
> example for you to compare vs. upstream, I'll redeploy and give you the
> karaf.log then (today)

oh, this wasn't from a CI job where you can just go pull the karaf.log files?

Comment 23 Waldemar Znoinski 2018-10-18 20:28:51 UTC
jamoluhrsen, hey, re https://bugzilla.redhat.com/show_bug.cgi?id=1637926 you're right, it happens in CI too, e.g.: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/28/robot/report/log.html — in that job's 'build artifacts', inside controller-0/1/2.tar.gz, the karaf log is under /var/log/containers/opendaylight.

Comment 34 errata-xmlrpc 2019-01-11 11:53:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

