Description of problem:
During a recent update to OSP13, we noticed that we have a 10 minute outage on the data plane of each compute server.

Looking through the logs at the time of the outage, we first see yum updating packages; here is a small example snippet:

<<
May 11 11:28:58 overcloud-dev-compute-0 yum[150069]: Installed: python-openvswitch2.11-2.11.0-35.el7fdp.x86_64
May 11 11:28:58 overcloud-dev-compute-0 yum[150069]: Installed: python-rhosp-openvswitch-2.11-0.6.el7ost.noarch
May 11 11:28:58 overcloud-dev-compute-0 yum[150069]: Updated: python2-ovsdbapp-0.10.4-2.el7ost.noarch
May 11 11:29:00 overcloud-dev-compute-0 yum[150069]: Updated: 1:python-neutron-12.1.1-6.el7ost.noarch
May 11 11:29:00 overcloud-dev-compute-0 yum[150069]: Updated: 1:openstack-neutron-common-12.1.1-6.el7ost.noarch
>>

Shortly after, we see ovsdb being shut down:

<<
May 11 11:29:37 overcloud-dev-compute-0 systemd: Started Kernel Samepage Merging (KSM) Tuning Daemon.
May 11 11:29:37 overcloud-dev-compute-0 systemd: Stopped Open vSwitch Forwarding Unit.
May 11 11:29:37 overcloud-dev-compute-0 systemd: Stopping Open vSwitch Database Unit...
May 11 11:29:37 overcloud-dev-compute-0 ovs-ctl: Exiting ovsdb-server (26416) [ OK ]
May 11 11:29:38 overcloud-dev-compute-0 systemd: Stopped Open vSwitch Database Unit.
May 11 11:29:38 overcloud-dev-compute-0 ntpd[42884]: Deleting interface #18 vxlan_sys_4789, fe80::5418:efff:fec1:d1d4#123, interface stats: received=0, sent=0, dropped=0, active_time=10439 secs
May 11 11:29:41 overcloud-dev-compute-0 dracut: dracut-033-564.el7
>>

We see errors in the neutron-openvswitch-agent logs from this time period:

<<
2020-05-11 11:29:37.983 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: ovsdb-client: tcp:127.0.0.1:6640: receive failed (End of file)
2020-05-11 11:29:37.983 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: ovsdb-client: tcp:127.0.0.1:6640: receive failed (End of file)
2020-05-11 11:29:37.984 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None
2020-05-11 11:29:38.003 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: None
2020-05-11 11:29:39.022 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:29:39.023 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:29:41.024 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:29:41.025 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:29:45.029 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:29:45.029 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:29:53.033 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:29:53.034 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:30:01.036 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:30:01.037 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:30:08.075 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: ovsdb-client: failed to connect to "tcp:127.0.0.1:6640" (Connection refused)
>>

Data plane downtime persists until ovsdb restarts 10 minutes later:

<<
May 11 11:39:29 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:39:29 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:39:29 overcloud-dev-compute-0 systemd: Starting Open vSwitch Database Unit...
May 11 11:39:29 overcloud-dev-compute-0 ovs-ctl: Backing up database to /etc/openvswitch/conf.db.backup7.15.1-3682332033 [ OK ]
May 11 11:39:29 overcloud-dev-compute-0 ovs-ctl: Compacting database [ OK ]
May 11 11:39:29 overcloud-dev-compute-0 ovs-ctl: Converting database schema [ OK ]
May 11 11:39:29 overcloud-dev-compute-0 ovs-ctl: Starting ovsdb-server [ OK ]
May 11 11:39:29 overcloud-dev-compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.16.1
May 11 11:39:29 overcloud-dev-compute-0 kernel: IPv4: martian source 192.168.1.20 from 192.168.1.30, on dev qbra04eae17-24
>>

Then all is well.

Note that the outage start and stop correlate with the following Ansible tasks during the director run:

Start:
TASK [Update all packages] *****************************************************
Monday 11 May 2020  11:28:28 +0200 (0:00:00.542)       0:30:52.270 ************
^^ this task ends at 11:39:20

End:
TASK [ensure openvswitch service is enabled] ***********************************
Monday 11 May 2020  11:39:28 +0200 (0:00:00.674)       0:41:52.788 ************

Version-Release number of selected component (if applicable):
ovs 2.9.0-117, being upgraded to ovs 2.11-0.6, using tripleo templates 8.5.1-3
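(For completeness, the package and service state on the compute node around the outage window can be cross-checked with standard commands; the unit names below are the ones shipped by the openvswitch package:)

rpm -qa | grep -i openvswitch
systemctl status openvswitch ovs-vswitchd ovsdb-server
journalctl -u ovs-vswitchd -u ovsdb-server --since "2020-05-11 11:28" --until "2020-05-11 11:40"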
Actually, I just noticed that it doesn't just stop ovsdb - it also stops openvswitch itself:

<<
May 11 11:29:36 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: openvswitch-ovn-central-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:36 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: openvswitch-ovn-host-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: openvswitch-ovn-common-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:36 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:36 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: python-openvswitch-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Stopping Open vSwitch...
May 11 11:29:36 overcloud-dev-compute-0 systemd: Stopped Open vSwitch.
May 11 11:29:36 overcloud-dev-compute-0 systemd: Stopping Open vSwitch Forwarding Unit...
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: openvswitch-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:36 overcloud-dev-compute-0 ovs-ctl: Exiting ovs-vswitchd (26525) [ OK ]
May 11 11:29:37 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:37 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:37 overcloud-dev-compute-0 systemd: Stopping Kernel Samepage Merging (KSM) Tuning Daemon...
May 11 11:29:37 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
>>

Could this be a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1763902 ?
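(Side note for triage: the stop/erase sequence above looks like it is driven by the old package's own scriptlets and triggers rather than by anything the update playbook does explicitly; on a node that has not been updated yet they can be inspected with standard rpm queries, e.g.:)

rpm -q --scripts openvswitch    # %preun/%postun scriptlets that stop/restart the services
rpm -q --triggers openvswitch   # trigger scripts fired by related package updates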
Did you use director to update the nodes, or did you just run yum directly? To avoid data plane downtime you need to update via director, where we have tooling around this specific issue.
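(For reference, the director-driven minor update flow for OSP13 looks roughly like the following, run from the undercloud as the stack user; the environment files and role name below are placeholders for your deployment, and the exact options are in the OSP13 minor update documentation:)

# 1. Refresh the overcloud plan with the updated templates and container parameters
openstack overcloud update prepare --templates -e <your environment files>
# 2. Update nodes role by role; the Compute role carries the data plane
openstack overcloud update run --roles Compute
# 3. Finalize the update once all roles are done
openstack overcloud update converge --templates -e <your environment files>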
Yes, we performed the update via director. Thanks, Elf
Can you please attach a sosreport from the overcloud-dev-compute-0 node? We need to troubleshoot why the mechanism that avoids running the OVS package scripts during the update wasn't executed.
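(For reference, a sosreport can be generated on the compute node with the stock tooling; on RHEL 7 the archive ends up under /var/tmp:)

# Run as root on overcloud-dev-compute-0; --batch skips the interactive prompts
sosreport --batch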