Description of problem:
During a recent update to OSP13, we noticed that we have a 10 minute outage on the data plane of each compute server.

Looking through the logs at the time of the outage, we first see yum updating packages; here is a small example snippet:

<<
May 11 11:28:58 overcloud-dev-compute-0 yum[150069]: Installed: python-openvswitch2.11-2.11.0-35.el7fdp.x86_64
May 11 11:28:58 overcloud-dev-compute-0 yum[150069]: Installed: python-rhosp-openvswitch-2.11-0.6.el7ost.noarch
May 11 11:28:58 overcloud-dev-compute-0 yum[150069]: Updated: python2-ovsdbapp-0.10.4-2.el7ost.noarch
May 11 11:29:00 overcloud-dev-compute-0 yum[150069]: Updated: 1:python-neutron-12.1.1-6.el7ost.noarch
May 11 11:29:00 overcloud-dev-compute-0 yum[150069]: Updated: 1:openstack-neutron-common-12.1.1-6.el7ost.noarch
>>

Shortly after, we see ovsdb being shut down:

<<
May 11 11:29:37 overcloud-dev-compute-0 systemd: Started Kernel Samepage Merging (KSM) Tuning Daemon.
May 11 11:29:37 overcloud-dev-compute-0 systemd: Stopped Open vSwitch Forwarding Unit.
May 11 11:29:37 overcloud-dev-compute-0 systemd: Stopping Open vSwitch Database Unit...
May 11 11:29:37 overcloud-dev-compute-0 ovs-ctl: Exiting ovsdb-server (26416) [ OK ]
May 11 11:29:38 overcloud-dev-compute-0 systemd: Stopped Open vSwitch Database Unit.
May 11 11:29:38 overcloud-dev-compute-0 ntpd[42884]: Deleting interface #18 vxlan_sys_4789, fe80::5418:efff:fec1:d1d4#123, interface stats: received=0, sent=0, dropped=0, active_time=10439 secs
May 11 11:29:41 overcloud-dev-compute-0 dracut: dracut-033-564.el7
>>

We see errors in the neutron-openvswitch-agent logs from this time period:

<<
2020-05-11 11:29:37.983 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: ovsdb-client: tcp:127.0.0.1:6640: receive failed (End of file)
2020-05-11 11:29:37.983 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: ovsdb-client: tcp:127.0.0.1:6640: receive failed (End of file)
2020-05-11 11:29:37.984 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None
2020-05-11 11:29:38.003 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: None
2020-05-11 11:29:39.022 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:29:39.023 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:29:41.024 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:29:41.025 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:29:45.029 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:29:45.029 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:29:53.033 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:29:53.034 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:30:01.036 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2020-05-11 11:30:01.037 67510 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2020-05-11 11:30:08.075 67510 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: ovsdb-client: failed to connect to "tcp:127.0.0.1:6640" (Connection refused)
>>

Data plane downtime persists until ovsdb restarts 10 minutes later:

<<
May 11 11:39:29 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:39:29 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:39:29 overcloud-dev-compute-0 systemd: Starting Open vSwitch Database Unit...
May 11 11:39:29 overcloud-dev-compute-0 ovs-ctl: Backing up database to /etc/openvswitch/conf.db.backup7.15.1-3682332033 [ OK ]
May 11 11:39:29 overcloud-dev-compute-0 ovs-ctl: Compacting database [ OK ]
May 11 11:39:29 overcloud-dev-compute-0 ovs-ctl: Converting database schema [ OK ]
May 11 11:39:29 overcloud-dev-compute-0 ovs-ctl: Starting ovsdb-server [ OK ]
May 11 11:39:29 overcloud-dev-compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.16.1
May 11 11:39:29 overcloud-dev-compute-0 kernel: IPv4: martian source 192.168.1.20 from 192.168.1.30, on dev qbra04eae17-24
>>

Then all is well.

Note that the outage start and stop correlate with the following Ansible tasks during the director run:

Start:
TASK [Update all packages] *****************************************************
Monday 11 May 2020  11:28:28 +0200 (0:00:00.542)       0:30:52.270 ************
^^ this task ends at 11:39:20

End:
TASK [ensure openvswitch service is enabled] ***********************************
Monday 11 May 2020  11:39:28 +0200 (0:00:00.674)       0:41:52.788 ************

Version-Release number of selected component (if applicable):
ovs 2.9.0-117, being upgraded to ovs 2.11-0.6, using tripleo templates 8.5.1-3
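(For completeness, the package and service state on the compute node around the outage window can be cross-checked with standard commands; the unit names below are the ones shipped by the openvswitch package:)

rpm -qa | grep -i openvswitch
systemctl status openvswitch ovs-vswitchd ovsdb-server
journalctl -u ovs-vswitchd -u ovsdb-server --since "2020-05-11 11:28" --until "2020-05-11 11:40"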
Actually, I just noticed that it doesn't just stop ovsdb - it also stops openvswitch itself:

<<
May 11 11:29:36 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: openvswitch-ovn-central-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:36 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: openvswitch-ovn-host-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: openvswitch-ovn-common-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:36 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:36 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: python-openvswitch-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Stopping Open vSwitch...
May 11 11:29:36 overcloud-dev-compute-0 systemd: Stopped Open vSwitch.
May 11 11:29:36 overcloud-dev-compute-0 systemd: Stopping Open vSwitch Forwarding Unit...
May 11 11:29:36 overcloud-dev-compute-0 yum[150069]: Erased: openvswitch-2.9.0-117.bz1733374.1.el7ost.x86_64
May 11 11:29:36 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:36 overcloud-dev-compute-0 ovs-ctl: Exiting ovs-vswitchd (26525) [ OK ]
May 11 11:29:37 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
May 11 11:29:37 overcloud-dev-compute-0 systemd: Reloading.
May 11 11:29:37 overcloud-dev-compute-0 systemd: Stopping Kernel Samepage Merging (KSM) Tuning Daemon...
May 11 11:29:37 overcloud-dev-compute-0 systemd: Started Flexible Branding Service.
>>

Could this be a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1763902 ?
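(Side note for triage: the stop/erase sequence above looks like it is driven by the old package's own scriptlets and triggers rather than by anything the update playbook does explicitly; on a node that has not been updated yet they can be inspected with standard rpm queries, e.g.:)

rpm -q --scripts openvswitch    # %preun/%postun scriptlets that stop/restart the services
rpm -q --triggers openvswitch   # trigger scripts fired by related package updates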
Did you use director to update the nodes, or did you just run yum directly? To avoid data plane downtime you need to update via director, where we have tooling around this specific issue.
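(For reference, the director-driven minor update flow for OSP13 looks roughly like the following, run from the undercloud as the stack user; the environment files and role name below are placeholders for your deployment, and the exact options are in the OSP13 minor update documentation:)

# 1. Refresh the overcloud plan with the updated templates and container parameters
openstack overcloud update prepare --templates -e <your environment files>
# 2. Update nodes role by role; the Compute role carries the data plane
openstack overcloud update run --roles Compute
# 3. Finalize the update once all roles are done
openstack overcloud update converge --templates -e <your environment files>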
Yes, we performed the update via director. Thanks, Elf
Can you please attach a sosreport from the overcloud-dev-compute-0 node? We need to troubleshoot why the mechanism that avoids running the OVS package scripts during the update wasn't executed.
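(For reference, a sosreport can be generated on the compute node with the stock tooling; on RHEL 7 the archive ends up under /var/tmp:)

# Run as root on overcloud-dev-compute-0; --batch skips the interactive prompts
sosreport --batch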