Bug 1837235 - [updates] 16 to 16.1, 80% packet loss on stop L3 connectivity check test
Summary: [updates] 16 to 16.1, 80% packet loss on stop L3 connectivity check test
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: RHOS Maint
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-19 07:12 UTC by Ronnie Rasouli
Modified: 2020-07-29 07:53 UTC
CC: 18 users

Fixed In Version: puppet-tripleo-11.5.0-0.20200610124245.68291df.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-29 07:52:55 UTC
Target Upstream Version:
Embargoed:


Attachments
db double mount point hack. (1.07 KB, patch) - 2020-06-09 13:24 UTC, Sofer Athlan-Guyot


Links
OpenStack gerrit 734619 (MERGED): ovn-dbs-bundle: Prepare for supporting new OVN version with separarte run dirs (last updated 2020-10-05 04:08:29 UTC)
Red Hat Product Errata RHBA-2020:3148 (last updated 2020-07-29 07:53:16 UTC)

Description Ronnie Rasouli 2020-05-19 07:12:24 UTC
Description of problem:

The composable roles overcloud update failed after updating the overcloud nodes.

TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16-to-16.1-from-passed_phase1-composable-ipv6/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
Monday 18 May 2020  16:52:12 +0000 (0:23:14.751)       4:59:01.463 ************ 
fatal: [undercloud-0]: FAILED! => {
    "changed": true,
    "cmd": "source /home/stack/qe-Cloud-0rc\n /home/stack/l3_agent_stop_ping.sh",
    "delta": "0:00:00.093769",
    "end": "2020-05-18 16:52:13.700489",
    "rc": 1,
    "start": "2020-05-18 16:52:13.606720"
}

STDOUT:

16402 packets transmitted, 3256 received, +12940 errors, 80.1488% packet loss, time 17551ms
rtt min/avg/max/mdev = 0.482/1.284/18.393/0.741 ms, pipe 4
Ping loss higher than 1% detected


MSG:

non-zero return code
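
For context, the check itself is simple: a "start" script launches a long-running ping against a floating IP before the update, and the "stop" script kills it and fails if the reported loss exceeds 1%. Below is a minimal sketch of that logic only; the real l3_agent_stop_ping.sh ships with the tripleo-upgrade role, and the paths and variable names here are assumptions for illustration.

    #!/bin/bash
    # Hypothetical sketch of the loss check -- not the actual tripleo-upgrade script.
    PING_PID_FILE=${PING_PID_FILE:-/var/tmp/l3_agent_ping.pid}
    PING_LOG=${PING_LOG:-/home/stack/ping_results.log}

    # Stop the background ping started before the update; SIGINT makes ping
    # flush its final statistics block into the log.
    kill -INT "$(cat "$PING_PID_FILE")" && sleep 2

    # Extract the loss percentage from the summary line, e.g.
    # "16402 packets transmitted, 3256 received, +12940 errors, 80.1488% packet loss, time 17551ms"
    LOSS=$(grep -oE '[0-9.]+% packet loss' "$PING_LOG" | tail -1 | cut -d'%' -f1)

    # Fail the task when more than 1% of the probes were lost during the update.
    if awk -v loss="${LOSS:-100}" 'BEGIN { exit !(loss > 1) }'; then
        echo "Ping loss higher than 1% detected"
        exit 1
    fi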

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200513.n.1

How reproducible:
most likely

Steps to Reproduce:
1. Deploy RHOS 16 with 3 nodes each of the controller, networker, database, Ceph, and messaging roles, plus 2 computes
2. Update the undercloud from 16 to 16.1 (note: this includes the RHEL upgrade from 8.1 to 8.2)
3. Update the overcloud nodes
4. Run the L3 ping test

Actual results:
80+% ping loss

Expected results:
Less than 1% ping loss

Additional info:

Comment 9 Sofer Athlan-Guyot 2020-05-26 16:49:18 UTC
So, back to this bug, which is about DFG-upgrades-updates-16-to-16.1-from-z1-HA-ipv4.


So first, we have no sidecar container issues:

find . -type f -iname 'containers_allinfo.log' -exec grep -hH 'Exited ([1-9])' "{}" \;

comes back empty.

Then, from undercloud-0/home/stack/ping_results_202005241409.log, with the timestamps converted
(cat  undercloud-0/home/stack/ping_results_202005241409.log | perl -pe 's/([\d]{10}\.[\d]{3})/localtime $1/eg;' | less )  
we have an idea of when it starts to fail. 

[Sun May 24 16:09:16 2020139] 64 bytes from 10.0.0.212: icmp_seq=10 ttl=63 time=3.71 ms
[Sun May 24 16:09:17 2020576] 64 bytes from 10.0.0.212: icmp_seq=11 ttl=63 time=2.25 ms
[Sun May 24 16:09:18 2020853] 64 bytes from 10.0.0.212: icmp_seq=12 ttl=63 time=1.03 ms
...

[Sun May 24 17:56:54 2020341] 64 bytes from 10.0.0.212: icmp_seq=6431 ttl=63 time=1.16 ms
[Sun May 24 17:56:55 2020427] 64 bytes from 10.0.0.212: icmp_seq=6432 ttl=63 time=1.22 ms
[Sun May 24 17:57:20 2020471] From 10.0.0.11 icmp_seq=6454 Destination Host Unreachable
[Sun May 24 17:57:20 2020564] From 10.0.0.11 icmp_seq=6455 Destination Host Unreachable

....
[Sun May 24 19:27:55 2020548] From 10.0.0.11 icmp_seq=11767 Destination Host Unreachable
[Sun May 24 19:27:58 2020499] From 10.0.0.11 icmp_seq=11768 Destination Host Unreachable
[Sun May 24 19:27:58 2020521] From 10.0.0.11 icmp_seq=11769 Destination Host Unreachable
[Sun May 24 19:27:58 2020526] From 10.0.0.11 icmp_seq=11770 Destination Host Unreachable

--- 10.0.0.212 ping statistics ---
11773 packets transmitted, 6355 received, +5259 errors, 46.0206% packet loss, time 12616ms

At that time (I assumed a +2h offset relative to the other logs) we were updating ctl-2, making
sure that the ovndb resource and the VIPs were banned before removing the node from the cluster.


2020-05-24 15:56:32 | TASK [Clear ovndb cluster pacemaker error] *************************************
2020-05-24 15:56:32 | Sunday 24 May 2020  15:56:07 +0000 (0:00:00.164)       1:46:42.651 ************
2020-05-24 15:56:32 | changed: [controller-2] => {"changed": true, "cmd": "pcs resource cleanup ovn-dbs-bundle", "

2020-05-24 15:56:32 | TASK [Ban ovndb resource on the current node.] *********************************
2020-05-24 15:56:32 | Sunday 24 May 2020  15:56:09 +0000 (0:00:01.166)       1:46:43.818 ************
2020-05-24 15:56:32 | changed: [controller-2] => {"changed": true, "cmd": "pcs resource ban ovn-dbs-bundle $(hostname | cut -d. -f1)", "delta": "0:00:00.630408", "end": "2020-05-24 15:56:10.027388", "rc": 0, "start": "2020-05-24 15:56:09.396980", "stderr": "", "stderr_lines": [], "stdout": "Warning: Creating locatio
n constraint 'cli-ban-ovn-dbs-bundle-on-controller-2' with a score of -INFINITY for resource ovn-dbs-bundle on controller-2.\n\tThis will prevent ovn-dbs-bundle from running on controller-2 until the constraint is removed\n\tThis will be the case even if controller-2 is the last node in the cluster", "stdout_lines":

2020-05-24 15:56:32 | TASK [Move virtual IPs to another node before stopping pacemaker] **************
2020-05-24 15:56:32 | Sunday 24 May 2020  15:56:12 +0000 (0:00:00.170)       1:46:46.780 ************
2020-05-24 15:56:32 | changed: [controller-2] => {"changed": true, "cmd": "CLUSTER_NODE=$(crm_node -n)\necho \"Retrieving all t

2020-05-24 15:57:51 | TASK [Stop pacemaker cluster] **************************************************
2020-05-24 15:57:51 | Sunday 24 May 2020  15:56:32 +0000 (0:00:01.070)       1:47:07.270 ************
2020-05-24 15:57:51 | changed: [controller-2] => {"changed": true, "out": "offline"}


Then, on the flow side:

    controller-0/var/log/openvswitch/ovs-vswitchd.log

2020-05-24T15:26:57.909Z|00078|connmgr|INFO|br-int<->unix#2: 4 flow_mods 10 s ago (4 adds)
2020-05-24T15:56:55.424Z|00079|bridge|INFO|bridge br-ex: deleted interface patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int on port 4
2020-05-24T15:56:55.425Z|00080|bridge|INFO|bridge br-int: deleted interface patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc on port 7
2020-05-24T15:56:55.425Z|00003|ofproto_dpif_monitor(monitor27)|INFO|monitor thread terminated
2020-05-24T15:56:55.425Z|00081|dpif|WARN|system@ovs-system: failed to query port patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int: Invalid argument
2020-05-24T15:56:55.425Z|00082|dpif|WARN|system@ovs-system: failed to query port patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc: Invalid argument
2020-05-24T15:57:05.406Z|00083|connmgr|INFO|br-int<->unix#2: 345 flow_mods 10 s ago (345 deletes)

    controller-1/var/log/openvswitch/ovs-vswitchd.log


2020-05-24T15:26:56.993Z|00052|memory|INFO|72556 kB peak resident set size after 10.0 seconds
2020-05-24T15:26:56.994Z|00053|memory|INFO|handlers:5 ofconns:2 ports:15 revalidators:3 rules:366 udpif keys:73
2020-05-24T15:27:04.967Z|00054|connmgr|INFO|br-int<->unix#0: 355 flow_mods in the 9 s starting 10 s ago (354 adds, 1 deletes)
2020-05-24T15:52:02.675Z|00055|bridge|INFO|bridge br-ex: deleted interface patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int on port 3
2020-05-24T15:52:02.675Z|00056|bridge|INFO|bridge br-int: deleted interface patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc on port 10
2020-05-24T15:52:02.676Z|00057|dpif|WARN|Dropped 4 log messages in last 1516 seconds (most recently, 1516 seconds ago) due to excessive rate
2020-05-24T15:52:02.676Z|00058|dpif|WARN|system@ovs-system: failed to query port patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int: Invalid argument
2020-05-24T15:52:02.676Z|00059|dpif|WARN|system@ovs-system: failed to query port patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc: Invalid argument
2020-05-24T15:52:02.703Z|00060|bridge|INFO|bridge br-ex: added interface patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int on port 2
2020-05-24T15:52:02.703Z|00061|bridge|INFO|bridge br-int: added interface patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc on port 1
2020-05-24T15:52:12.673Z|00062|connmgr|INFO|br-int<->unix#2: 701 flow_mods in the 9 s starting 10 s ago (700 adds, 1 deletes)
2020-05-24T15:56:55.442Z|00063|bridge|INFO|bridge br-ex: deleted interface patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int on port 2
2020-05-24T15:56:55.442Z|00064|bridge|INFO|bridge br-int: deleted interface patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc on port 1
2020-05-24T15:56:55.442Z|00002|ofproto_dpif_monitor(monitor26)|INFO|monitor thread terminated

  controller-2/var/log/openvswitch/ovs-vswitchd.log

2020-05-24T16:09:24.672Z|00046|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.13.0
2020-05-24T16:09:34.622Z|00047|memory|INFO|70376 kB peak resident set size after 10.0 seconds
2020-05-24T16:09:34.622Z|00048|memory|INFO|handlers:5 ofconns:2 ports:13 revalidators:3 rules:23 udpif keys:58
2020-05-24T16:09:41.111Z|00049|connmgr|INFO|br-int<->unix#0: 10 flow_mods 10 s ago (9 adds, 1 deletes)


2020-05-24T16:31:53.022Z|00050|connmgr|INFO|br-int<->unix#2: 18 flow_mods 10 s ago (17 adds, 1 deletes)
2020-05-24T15:56:55.443Z|00065|dpif|WARN|system@ovs-system: failed to query port patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int: Invalid argument
2020-05-24T15:56:55.443Z|00066|dpif|WARN|system@ovs-system: failed to query port patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc: Invalid argument
2020-05-24T15:57:05.439Z|00067|connmgr|INFO|br-int<->unix#2: 345 flow_mods 10 s ago (345 deletes)


So it looks to me like this: as we update ctl-0, it fails to come back online, but that doesn't matter because ctl-1 takes the load; then ctl-1 is updated and also fails to come back online, which again doesn't matter because ctl-2 takes the load. Then we reach ctl-2, shut it down, and bam! No more network.

Fun fact: the ovn-db cluster seems fine after that period:

  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Master controller-0
    * ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave controller-1
    * ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Slave controller-2

Again, we only detect that ping test failure after a *successful* update run of ctl-0,1,2, cpt-0,1 and ceph-0,1,2,
i.e. the overall process doesn't trigger any error that would come back to the user (which, in itself, is an issue).

To conclude: ovs-vswitchd doesn't seem to come back properly after the update of the controllers, and connectivity
fails when we reach the last controller and switch it off to update it.

It reminds me of why we put the ovndb ban in place to begin with: database incompatibilities were preventing
the cluster from reforming properly during the rolling update of the controllers.  Except that now it seems
this doesn't help (this is just a wild guess based on old memory).

Adding dalvarez here, as he helped on that case back in the day :)

@Networking, could you help take the debugging further, and does the current analysis make sense?

This is definitely a blocker: at the end of the "update run" stage we have lost all North-South connectivity, and it doesn't come back.

Thanks,

Comment 14 Sofer Athlan-Guyot 2020-06-03 16:16:18 UTC
Hi, according to dalvarez and dhill this bears some resemblance to https://bugzilla.redhat.com/show_bug.cgi?id=1828287

18:05:49 @dalvarez:   chem: hey yeah i joined the tmux and saw that the dbs are empty, prolly not empty themselves but ovsdb server wont start and wont say anything

Comment 17 Jakub Libosvar 2020-06-08 17:19:25 UTC
The problem is in how ovn2.11 and ovn2.13 use /etc: OVN 2.11 uses /etc/openvswitch to store the database file, while OVN 2.13 uses /etc/ovn/.

After the image was upgraded, the OVN NB DB is started with the new /etc/ovn path:

ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers --remote=ptcp:6641:172.17.1.146 --sync-from=tcp:192.0.2.254:6641 /etc/ovn/ovnnb_db.db

but the container is still configured to mount /var/lib/openvswitch/ovn into /etc/openvswitch:
        "Mounts": [
            {
                "Type": "bind",
                "Name": "",
                "Source": "/var/lib/openvswitch/ovn",
                "Destination": "/etc/openvswitch",
                "Driver": "",
                "Mode": "",
                "Options": [
                    "rbind"
                ],
                "RW": true,
                "Propagation": "rprivate"
            },

This can also happen when updating the ovn packages outside of containers.
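
A quick way to confirm this on an affected controller (a sketch only; the container name is taken from the pcs output above, and the exact paths may differ per deployment):

    # The populated 2.11-era database file is still behind the old bind mount,
    # while the 2.13 ovsdb-server (see the command line above) reads /etc/ovn.
    sudo podman exec ovn-dbs-bundle-podman-0 ls -l /etc/openvswitch /etc/ovn

    # On the host, the real data lives under the single bind-mount source:
    sudo ls -l /var/lib/openvswitch/ovn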

Comment 18 Daniel Alvarez Sanchez 2020-06-09 12:38:33 UTC
So the problem is that docker/podman can't mount the same dir into two different destinations?
We have '/var/lib/openvswitch/ovn' supposed to be mounted into 4 different locations [0]; is it only being mounted into '/etc/openvswitch'?

[0] https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ovn/ovn-dbs-container-puppet.yaml#L139..L142
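
(For what it's worth, podman itself can bind-mount one source at several destinations; a throwaway check along these lines, using an arbitrary UBI image and a scratch directory, would show it -- illustrative only:)

    # One host directory mounted at two container paths -- both show the same content.
    sudo mkdir -p /var/tmp/ovn-mount-demo
    sudo podman run --rm \
        -v /var/tmp/ovn-mount-demo:/etc/openvswitch \
        -v /var/tmp/ovn-mount-demo:/etc/ovn \
        registry.access.redhat.com/ubi8/ubi ls -ld /etc/openvswitch /etc/ovn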

Comment 19 Sofer Athlan-Guyot 2020-06-09 13:24:23 UTC
Created attachment 1696332 [details]
db double mount point hack.

Comment 20 Sofer Athlan-Guyot 2020-06-09 13:28:34 UTC
Hi,

@Daniel,

Nope, the source of the problem is that the new version of ovn changes the
default location of the database without offering an update path.

In a non-containerized environment the issue would be horrible: update
your ovn db server related packages and all of a sudden you don't have
any flows anymore, because now it's looking at /etc/ovn/ovnnb_db.db and
not at /etc/openvswitch/ovnnb_db.db anymore.

The code you're showing is a workaround in the container context, but
the issue should be fixed in the ovn packaging, which should offer a
smoother upgrade path by symlinking the previous location when the new
location is empty and the previous one exists.
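
Something along these lines in the ovn package scriptlets (or a pre-start hook) is what I have in mind -- just a sketch, with the two paths taken from the discussion above:

    # Hypothetical migration snippet: if the new 2.13 location is still empty
    # but a pre-2.13 database exists, point the new location at the old files
    # so an updated ovsdb-server keeps serving its data.
    mkdir -p /etc/ovn
    for db in ovnnb_db.db ovnsb_db.db; do
        if [ ! -e "/etc/ovn/${db}" ] && [ -e "/etc/openvswitch/${db}" ]; then
            ln -s "/etc/openvswitch/${db}" "/etc/ovn/${db}"
        fi
    done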

Now, as you showed, it can be worked around in the container context.
The problem here is that the code you're showing only applies to
standalone deployments, not HA/pacemaker ones.

The source of truth for ovn in the pacemaker context seems to be in
puppet-tripleo[1].

Those mount points are definitely not in sync with what we have in
the templates.

This offers us another way to fix it, but I would really like to see
it fixed in the packaging even if we do the double mount point hack.

That being said, I'm currently testing the attached patch on a
deployment.  It's really just a wild "grep" result.  Adding Michele for
patch review.

I will let you know the result.

[1] https://github.com/openstack/puppet-tripleo/blob/stable/train/manifests/profile/pacemaker/ovn_dbs_bundle.pp#L153..L180

Comment 21 Jakub Libosvar 2020-06-09 13:34:35 UTC
(In reply to Sofer Athlan-Guyot from comment #20)
> Hi,
> 
> @Daniel,
> 
> Nope, the source of the problem is that the new version of ovn changes the
> default location of the database without offering an update path.
> 
> In a non-containerized environment the issue would be horrible: update
> your ovn db server related packages and all of a sudden you don't have
> any flows anymore, because now it's looking at /etc/ovn/ovnnb_db.db and
> not at /etc/openvswitch/ovnnb_db.db anymore.

We figured out that fixing it at the packaging level will not fix this particular issue, because images are built from scratch, so installing an OVN package there won't detect that it's being updated.

> 
> The code you're showing is a workaround in the container context, but
> the issue should be fixed in the ovn packaging, which should offer a
> smoother upgrade path by symlinking the previous location when the new
> location is empty and the previous one exists.

The problem here is that we already have these mount points in the THT and they are ignored. Do we understand why the mount points are not obeyed?


Comment 22 Sofer Athlan-Guyot 2020-06-09 13:57:32 UTC
(In reply to Jakub Libosvar from comment #21)
> (In reply to Sofer Athlan-Guyot from comment #20)
> > Hi,
> > 
> > @Daniel,
> > 
> > Nope, the source of the problem is that the new version of ovn changes the
> > default location of the database without offering an update path.
> > 
> > In a non-containerized environment the issue would be horrible: update
> > your ovn db server related packages and all of a sudden you don't have
> > any flows anymore, because now it's looking at /etc/ovn/ovnnb_db.db and
> > not at /etc/openvswitch/ovnnb_db.db anymore.
> 
> We figured out that fixing it at the packaging level will not fix this
> particular issue, because images are built from scratch, so installing an
> OVN package there won't detect that it's being updated.

oki, then.

> 
> > 
> > The code you're showing is a workaround in the container context, but
> > the issue should be fixed in the ovn packaging, which should offer a
> > smoother upgrade path by symlinking the previous location when the new
> > location is empty and the previous one exists.
> 
> The problem here is that we already have these mount points in the THT and
> they are ignored. Do we understand why the mount points are not obeyed?

As I said, the template you're showing doesn't seem to be the one defining the mount points in the *pacemaker* context.
The one that defines them in the pacemaker context is in puppet-tripleo, in the definition of the bundle: https://github.com/openstack/puppet-tripleo/blob/stable/train/manifests/profile/pacemaker/ovn_dbs_bundle.pp#L153..L180
Michele can confirm or refute that assertion (until my test finishes running).


Comment 23 Sofer Athlan-Guyot 2020-06-09 14:42:23 UTC
I tested the attached patch and it's working:

(undercloud) [stack@undercloud-0 ~]$ cat patch.diff
From 33009cd4cb607b63ac7401f35243792b9d99814e Mon Sep 17 00:00:00 2001
From: Sofer Athlan-Guyot <sathlang>
Date: Tue, 9 Jun 2020 15:10:02 +0200
Subject: [PATCH] Hack for db location change.

Change-Id: Ib389b0c264b16128a3d9ec11a52124e6bf6216cf
---
manifests/profile/pacemaker/ovn_dbs_bundle.pp | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/manifests/profile/pacemaker/ovn_dbs_bundle.pp b/manifests/profile/pacemaker/ovn_dbs_bundle.pp
index f4986fff..6316f5c7 100644
--- a/manifests/profile/pacemaker/ovn_dbs_bundle.pp
+++ b/manifests/profile/pacemaker/ovn_dbs_bundle.pp
@@ -176,6 +176,11 @@ class tripleo::profile::pacemaker::ovn_dbs_bundle (
           'target-dir' => '/etc/openvswitch',
           'options'    => 'rw',
         },
+        'ovn-dbs-db-path-new'   => {
+          'source-dir' => '/var/lib/openvswitch/ovn',
+          'target-dir' => '/etc/ovn',
+          'options'    => 'rw',
+        },
       }
       if (hiera('ovn_dbs_short_node_names_override', undef)) {
         $ovn_dbs_short_node_names = hiera('ovn_dbs_short_node_names_override')
--
2.25.4


tripleo-ansible-inventory     --plan "qe-Cloud-0"     --ansible_ssh_user heat-admin     --static-yaml-inventory     inventory.yaml
ansible -b -i inventory.yaml 'Controller' -m patch -a 'src=patch.diff basedir=/usr/share/openstack-puppet/modules/tripleo strip=1'

[heat-admin@controller-0 ~]$ sudo podman inspect ovn-dbs-bundle-podman-0 | jq '.[]|.Mounts[]|.Source + " -> " + .Destination'

"/etc/pacemaker/authkey -> /etc/pacemaker/authkey"
"/var/log/pacemaker/bundles/ovn-dbs-bundle-0 -> /var/log"
"/var/lib/kolla/config_files/ovn_dbs.json -> /var/lib/kolla/config_files/config.json"
"/lib/modules -> /lib/modules"
"/var/lib/openvswitch/ovn -> /run/openvswitch"
"/var/log/containers/openvswitch -> /var/log/openvswitch"
"/var/lib/openvswitch/ovn -> /etc/openvswitch"
"/var/lib/openvswitch/ovn -> /etc/ovn"

We can see that the new mount point exists.

That's when I discovered that the patch exists upstream, but only in master and ussuri.  I've triggered the backport to train.

Comment 24 Sofer Athlan-Guyot 2020-06-09 14:45:01 UTC
Now, I'm not sure how this works for upstream, because they don't have the 16.0/16.1 split, so I'm not sure it's relevant there.

@Networking, can you analyse the whole situation: should this be a downstream-only backport, in 16.1 only?

Comment 25 Daniel Alvarez Sanchez 2020-06-09 15:19:17 UTC
(In reply to Sofer Athlan-Guyot from comment #24)
> Now, I'm not sure how this works for upstream, because they don't have the
> 16.0/16.1 split, so I'm not sure it's relevant there.
> 
> @Networking, can you analyse the whole situation: should this be a
> downstream-only backport, in 16.1 only?

I think it makes sense to have it in upstream TripleO as well, right? The thing is that I don't think we're testing the ovn dbs bundle upstream in TripleO CI. Am I wrong?
Otherwise we would've hit this issue when we bumped OVN from 2.11 to 2.12/20.03.

Comment 26 Sofer Athlan-Guyot 2020-06-09 17:15:31 UTC
(In reply to Daniel Alvarez Sanchez from comment #25)
> (In reply to Sofer Athlan-Guyot from comment #24)
> > Now, I'm not sure how this works for upstream, because they don't have the
> > 16.0/16.1 split, so I'm not sure it's relevant there.
> > 
> > @Networking, can you analyse the whole situation: should this be a
> > downstream-only backport, in 16.1 only?
> 
> I think it makes sense to have it in upstream TripleO as well, right? The
> thing is that I don't think we're testing the ovn dbs bundle upstream in
> TripleO CI. Am I wrong?
> Otherwise we would've hit this issue when we bumped OVN from 2.11 to
> 2.12/20.03.

One needs a continuous ping test to see the failure.  The ovndb cluster just runs fine,
but with an empty database.  A stateless tempest test doesn't cut it either,
as everything would run fine if you ran tempest after the update.
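
A cheap post-update sanity check would also have caught it (a sketch; the container name matches the pcs output above, and it only tells you something on a cloud that had workloads before the update):

    # A bundle that pcs reports as healthy can still be serving an empty NB DB.
    # If this prints no switches or routers on a previously populated cloud,
    # the database did not survive the update.
    sudo podman exec ovn-dbs-bundle-podman-0 ovn-nbctl show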

Comment 34 Ronnie Rasouli 2020-06-25 10:40:58 UTC
Ping test is working with 0 percent packet loss.

Comment 36 errata-xmlrpc 2020-07-29 07:52:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

