Description of problem:

The composable roles overcloud update failed after updating the overcloud nodes.

TASK [tripleo-upgrade : stop l3 agent connectivity check] **********************
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16-to-16.1-from-passed_phase1-composable-ipv6/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/common/l3_agent_connectivity_check_stop_script.yml:2
Monday 18 May 2020 16:52:12 +0000 (0:23:14.751) 4:59:01.463 ************
fatal: [undercloud-0]: FAILED! => {
    "changed": true,
    "cmd": "source /home/stack/qe-Cloud-0rc\n /home/stack/l3_agent_stop_ping.sh",
    "delta": "0:00:00.093769",
    "end": "2020-05-18 16:52:13.700489",
    "rc": 1,
    "start": "2020-05-18 16:52:13.606720"
}

STDOUT:
16402 packets transmitted, 3256 received, +12940 errors, 80.1488% packet loss, time 17551ms
rtt min/avg/max/mdev = 0.482/1.284/18.393/0.741 ms, pipe 4
Ping loss higher than 1% detected

MSG:
non-zero return code

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200513.n.1

How reproducible:
most likely

Steps to Reproduce:
1. Deploy RHOS 16 with 3 each of controllers, networkers, DB, Ceph and messaging nodes, plus 2 computes
2. Update the undercloud from 16 to 16.1 (note: RHEL upgrade from 8.1 to 8.2)
3. Update the overcloud nodes
4. Run the l3 ping test

Actual results:
80+% ping loss

Expected results:
Less than 1% ping loss

Additional info:
So back to that bug, which is about DFG-upgrades-updates-16-to-16.1-from-z1-HA-ipv4.

First, we have no side-car container issues:

find . -type f -iname 'containers_allinfo.log' -exec grep -hH 'Exited ([1-9])' "{}" \;

comes back empty.

Then, from undercloud-0/home/stack/ping_results_202005241409.log, with the timestamps converted:

cat undercloud-0/home/stack/ping_results_202005241409.log | perl -pe 's/([\d]{10}\.[\d]{3})/localtime $1/eg;' | less

we have an idea of when it starts to fail:

[Sun May 24 16:09:16 2020139] 64 bytes from 10.0.0.212: icmp_seq=10 ttl=63 time=3.71 ms
[Sun May 24 16:09:17 2020576] 64 bytes from 10.0.0.212: icmp_seq=11 ttl=63 time=2.25 ms
[Sun May 24 16:09:18 2020853] 64 bytes from 10.0.0.212: icmp_seq=12 ttl=63 time=1.03 ms
...
[Sun May 24 17:56:54 2020341] 64 bytes from 10.0.0.212: icmp_seq=6431 ttl=63 time=1.16 ms
[Sun May 24 17:56:55 2020427] 64 bytes from 10.0.0.212: icmp_seq=6432 ttl=63 time=1.22 ms
[Sun May 24 17:57:20 2020471] From 10.0.0.11 icmp_seq=6454 Destination Host Unreachable
[Sun May 24 17:57:20 2020564] From 10.0.0.11 icmp_seq=6455 Destination Host Unreachable
...
[Sun May 24 19:27:55 2020548] From 10.0.0.11 icmp_seq=11767 Destination Host Unreachable
[Sun May 24 19:27:58 2020499] From 10.0.0.11 icmp_seq=11768 Destination Host Unreachable
[Sun May 24 19:27:58 2020521] From 10.0.0.11 icmp_seq=11769 Destination Host Unreachable
[Sun May 24 19:27:58 2020526] From 10.0.0.11 icmp_seq=11770 Destination Host Unreachable

--- 10.0.0.212 ping statistics ---
11773 packets transmitted, 6355 received, +5259 errors, 46.0206% packet loss, time 12616ms

At that time (I assumed a +2h offset relative to the other logs) we were updating ctl-2, making sure that ovndb and the VIPs were banned before removing the node from the cluster.
2020-05-24 15:56:32 | TASK [Clear ovndb cluster pacemaker error] *************************************
2020-05-24 15:56:32 | Sunday 24 May 2020 15:56:07 +0000 (0:00:00.164) 1:46:42.651 ************
2020-05-24 15:56:32 | changed: [controller-2] => {"changed": true, "cmd": "pcs resource cleanup ovn-dbs-bundle", "

2020-05-24 15:56:32 | TASK [Ban ovndb resource on the current node.] *********************************
2020-05-24 15:56:32 | Sunday 24 May 2020 15:56:09 +0000 (0:00:01.166) 1:46:43.818 ************
2020-05-24 15:56:32 | changed: [controller-2] => {"changed": true, "cmd": "pcs resource ban ovn-dbs-bundle $(hostname | cut -d. -f1)", "delta": "0:00:00.630408", "end": "2020-05-24 15:56:10.027388", "rc": 0, "start": "2020-05-24 15:56:09.396980", "stderr": "", "stderr_lines": [], "stdout": "Warning: Creating location constraint 'cli-ban-ovn-dbs-bundle-on-controller-2' with a score of -INFINITY for resource ovn-dbs-bundle on controller-2.\n\tThis will prevent ovn-dbs-bundle from running on controller-2 until the constraint is removed\n\tThis will be the case even if controller-2 is the last node in the cluster", "stdout_lines":

2020-05-24 15:56:32 | TASK [Move virtual IPs to another node before stopping pacemaker] **************
2020-05-24 15:56:32 | Sunday 24 May 2020 15:56:12 +0000 (0:00:00.170) 1:46:46.780 ************
2020-05-24 15:56:32 | changed: [controller-2] => {"changed": true, "cmd": "CLUSTER_NODE=$(crm_node -n)\necho \"Retrieving all t

2020-05-24 15:57:51 | TASK [Stop pacemaker cluster] **************************************************
2020-05-24 15:57:51 | Sunday 24 May 2020 15:56:32 +0000 (0:00:01.070) 1:47:07.270 ************
2020-05-24 15:57:51 | changed: [controller-2] => {"changed": true, "out": "offline"}

Then on the flow side:

controller-0/var/log/openvswitch/ovs-vswitchd.log

2020-05-24T15:26:57.909Z|00078|connmgr|INFO|br-int<->unix#2: 4 flow_mods 10 s ago (4 adds)
2020-05-24T15:56:55.424Z|00079|bridge|INFO|bridge br-ex: deleted interface patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int on port 4
2020-05-24T15:56:55.425Z|00080|bridge|INFO|bridge br-int: deleted interface patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc on port 7
2020-05-24T15:56:55.425Z|00003|ofproto_dpif_monitor(monitor27)|INFO|monitor thread terminated
2020-05-24T15:56:55.425Z|00081|dpif|WARN|system@ovs-system: failed to query port patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int: Invalid argument
2020-05-24T15:56:55.425Z|00082|dpif|WARN|system@ovs-system: failed to query port patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc: Invalid argument
2020-05-24T15:57:05.406Z|00083|connmgr|INFO|br-int<->unix#2: 345 flow_mods 10 s ago (345 deletes)

controller-1/var/log/openvswitch/ovs-vswitchd.log

2020-05-24T15:26:56.993Z|00052|memory|INFO|72556 kB peak resident set size after 10.0 seconds
2020-05-24T15:26:56.994Z|00053|memory|INFO|handlers:5 ofconns:2 ports:15 revalidators:3 rules:366 udpif keys:73
2020-05-24T15:27:04.967Z|00054|connmgr|INFO|br-int<->unix#0: 355 flow_mods in the 9 s starting 10 s ago (354 adds, 1 deletes)
2020-05-24T15:52:02.675Z|00055|bridge|INFO|bridge br-ex: deleted interface patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int on port 3
2020-05-24T15:52:02.675Z|00056|bridge|INFO|bridge br-int: deleted interface patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc on port 10
2020-05-24T15:52:02.676Z|00057|dpif|WARN|Dropped 4 log messages in last 1516 seconds (most recently, 1516 seconds ago) due to excessive rate
2020-05-24T15:52:02.676Z|00058|dpif|WARN|system@ovs-system: failed to query port patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int: Invalid argument
2020-05-24T15:52:02.676Z|00059|dpif|WARN|system@ovs-system: failed to query port patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc: Invalid argument
2020-05-24T15:52:02.703Z|00060|bridge|INFO|bridge br-ex: added interface patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int on port 2
2020-05-24T15:52:02.703Z|00061|bridge|INFO|bridge br-int: added interface patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc on port 1
2020-05-24T15:52:12.673Z|00062|connmgr|INFO|br-int<->unix#2: 701 flow_mods in the 9 s starting 10 s ago (700 adds, 1 deletes)
2020-05-24T15:56:55.442Z|00063|bridge|INFO|bridge br-ex: deleted interface patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int on port 2
2020-05-24T15:56:55.442Z|00064|bridge|INFO|bridge br-int: deleted interface patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc on port 1
2020-05-24T15:56:55.442Z|00002|ofproto_dpif_monitor(monitor26)|INFO|monitor thread terminated
2020-05-24T15:56:55.443Z|00065|dpif|WARN|system@ovs-system: failed to query port patch-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc-to-br-int: Invalid argument
2020-05-24T15:56:55.443Z|00066|dpif|WARN|system@ovs-system: failed to query port patch-br-int-to-provnet-1a64876c-5ffa-49b2-8a76-fc00bd6c08bc: Invalid argument
2020-05-24T15:57:05.439Z|00067|connmgr|INFO|br-int<->unix#2: 345 flow_mods 10 s ago (345 deletes)

controller-2/var/log/openvswitch/ovs-vswitchd.log

2020-05-24T16:09:24.672Z|00046|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.13.0
2020-05-24T16:09:34.622Z|00047|memory|INFO|70376 kB peak resident set size after 10.0 seconds
2020-05-24T16:09:34.622Z|00048|memory|INFO|handlers:5 ofconns:2 ports:13 revalidators:3 rules:23 udpif keys:58
2020-05-24T16:09:41.111Z|00049|connmgr|INFO|br-int<->unix#0: 10 flow_mods 10 s ago (9 adds, 1 deletes)
2020-05-24T16:31:53.022Z|00050|connmgr|INFO|br-int<->unix#2: 18 flow_mods 10 s ago (17 adds, 1 deletes)

So it looks to me like, as we updated ctl-0, it failed to come back online properly, but it didn't matter because ctl-1 took the load; then ctl-1 was updated and also failed to come back online, but it didn't matter because ctl-2 was taking the load. Then we reached ctl-2, shut it down, and bam! No more network.
Fun fact: the ovn-db cluster seems fine after that period:

  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2

Again, we detect that ping test failure after a *successful* update /run/ of ctl-0,1,2, cpt-0,1 and ceph-0,1,2, i.e. the overall process doesn't trigger any error that would come back to the user (which, in itself, is an issue).

To conclude: ovs-vswitchd doesn't seem to come back properly after the update of the controllers, and connectivity fails when we reach the last ctl and switch it off to update it. It reminds me of why we put the ovndb ban in place to begin with, when database incompatibilities were preventing the cluster from reforming properly during the rolling update of the controllers. Except that now it seems this doesn't help (this is just a wild guess based on old memory).

Adding dalvarez here, as he helped on that case back in the days :)

@Networking, could you help to go further in the debugging, and does the current analysis make sense?

This is definitively a blocker: at the end of the "update run" stage we have lost all North-South connectivity, and it doesn't come back.

Thanks,
Hi, according to dalvarez and dhill this bears some resemblance to https://bugzilla.redhat.com/show_bug.cgi?id=1828287

18:05:49 @dalvarez: chem: hey yeah i joined the tmux and saw that the dbs are empty, prolly not empty themselves but ovsdb server wont start and wont say anything
The problem is the difference in how OVN 2.11 and OVN 2.13 use /etc: OVN 2.11 uses /etc/openvswitch to store the database files, while OVN 2.13 uses /etc/ovn/.

After the image was upgraded, the OVN NB DB is started with the new /etc/ovn path:

ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers --remote=ptcp:6641:172.17.1.146 --sync-from=tcp:192.0.2.254:6641 /etc/ovn/ovnnb_db.db

but the container is still configured to mount /var/lib/openvswitch/ovn only into /etc/openvswitch:

"Mounts": [
    {
        "Type": "bind",
        "Name": "",
        "Source": "/var/lib/openvswitch/ovn",
        "Destination": "/etc/openvswitch",
        "Driver": "",
        "Mode": "",
        "Options": [
            "rbind"
        ],
        "RW": true,
        "Propagation": "rprivate"
    },

This can also happen when updating OVN packages outside of containers.
So the problem is that docker/podman can't mount the same dir into two different destinations? We have '/var/lib/openvswitch/ovn' mounted into 4 different locations [0]; is it only being mounted into '/etc/openvswitch'?

[0] https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ovn/ovn-dbs-container-puppet.yaml#L139..L142
Created attachment 1696332 [details]
db double mount point hack
Hi,

@Daniel,

nope, the source of the problem is that the new version of OVN changed the default location of the database without offering an update path.

In a non-containerized environment the issue would be horrible: update your OVN DB server related packages and all of a sudden you don't have any flows anymore, because it is now looking in /etc/ovn/ovnnb_db.db and not in /etc/openvswitch/ovnnb_db.db anymore.

The code you're showing is a workaround in the container context. But the issue should be fixed in the OVN packaging, which should offer a smoother upgrade path by symlinking the previous location when the new location is empty and the previous one exists.

Now, as you showed, it can be worked around in the container context. The problem here is that the code you're showing only applies to standalone deployments, not HA/pacemaker ones. The source of truth for OVN in the pacemaker context seems to be in puppet-tripleo[1]. Those mount points are definitely not in sync with what we have in the templates.

This offers us another way to fix this, but I would really like to see it fixed in the packaging even if we do the double mount point hack. That being said, I'm currently testing the attached patch on a deployment. It's really a wild "grep" result. Adding Michele for patch review. I will let you know the result.

[1] https://github.com/openstack/puppet-tripleo/blob/stable/train/manifests/profile/pacemaker/ovn_dbs_bundle.pp#L153..L180
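For the record, the packaging-level fix suggested here could be as simple as the sketch below. The db paths and file names are the ones discussed in this bug; the scriptlet itself is hypothetical (e.g. something an RPM %post could run), not actual OVN packaging code:

```shell
#!/bin/sh
# Hypothetical migration step: if the new OVN db directory holds no
# databases while the old one does, symlink them so an updated
# ovsdb-server still finds its data under /etc/ovn.
OLD_DB_DIR="${OLD_DB_DIR:-/etc/openvswitch}"
NEW_DB_DIR="${NEW_DB_DIR:-/etc/ovn}"

migrate_ovn_db_dir() {
    # Old layout never used: nothing to migrate.
    [ -f "$OLD_DB_DIR/ovnnb_db.db" ] || return 0
    # Never clobber databases already present in the new location.
    [ -e "$NEW_DB_DIR/ovnnb_db.db" ] && return 0
    mkdir -p "$NEW_DB_DIR"
    for db in ovnnb_db.db ovnsb_db.db; do
        if [ -f "$OLD_DB_DIR/$db" ]; then
            ln -s "$OLD_DB_DIR/$db" "$NEW_DB_DIR/$db"
        fi
    done
}

# migrate_ovn_db_dir   # would run once at package update time
```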
(In reply to Sofer Athlan-Guyot from comment #20)
> Hi,
>
> @Daniel,
>
> nope the source of problem is the the new version of ovn change the
> default location of the database without offering an update path.
>
> In a non containerized environment the issue will be horrible. Update
> your ovn db server related package and all the sudden you don't have
> any flow anymore because now it's looking in /etc/ovn/ovnnb_db.db and
> not in /etc/openvswitch/ovnnb_db.db anymore.

We figured out fixing it on the packaging level will not fix this particular issue because images are built from scratch, thus installing an OVN package won't detect it's being updated.

> The code you're showing is a workaround in the container context. But
> the issue should be fixed in the ovn packaging that should offer a
> smoother upgrade path by symlink the previous location if the new
> location is empty and the previous exists.

The problem here is that we have mount points already in the THT and they are ignored. Do we understand why the mount points are not obeyed?

> Now as you showed it can be workarounded in the container context.
> The problem here is that the code you're showing only apply to
> standalone deployment, not HA/pacemaker one.
>
> The source of truth for ovn in pacemaker context seems to be in
> puppet-tripleo[1].
>
> Those mount point are definitively not in sync with what we have in
> the templates.
>
> This offer us another way to fix this, but I really would like to see
> it fixed in the packaging even if we do the double mount point hack.
>
> That being said I'm currently testing the attached patch on a
> deployment. It's really a wild "grep" result. Adding Michele for
> patch review.
>
> I will let you know the result.
>
> [1]
> https://github.com/openstack/puppet-tripleo/blob/stable/train/manifests/
> profile/pacemaker/ovn_dbs_bundle.pp#L153..L180
(In reply to Jakub Libosvar from comment #21)
> (In reply to Sofer Athlan-Guyot from comment #20)
> > nope the source of problem is the the new version of ovn change the
> > default location of the database without offering an update path.
> >
> > In a non containerized environment the issue will be horrible. Update
> > your ovn db server related package and all the sudden you don't have
> > any flow anymore because now it's looking in /etc/ovn/ovnnb_db.db and
> > not in /etc/openvswitch/ovnnb_db.db anymore.
>
> We figured out fixing it on packaging level will not fix this particular
> issue because images are built from scratch, thus installing an OVN package
> won't detect it's being updated.

oki, then.

> > The code you're showing is a workaround in the container context. But
> > the issue should be fixed in the ovn packaging that should offer a
> > smoother upgrade path by symlink the previous location if the new
> > location is empty and the previous exists.
>
> The problem here is that we have mount points already in the THT and they
> are ignored. Do we understand why the mountpoints are not obeyed?

As I said, the template you show doesn't seem to be the one defining the mount points in the *pacemaker* context. The one that defines them in the pacemaker context is in puppet-tripleo, here:

https://github.com/openstack/puppet-tripleo/blob/stable/train/manifests/profile/pacemaker/ovn_dbs_bundle.pp#L153..L180

in the definition of the bundle. Michele can confirm/infirm that assertion (until my test finishes running).

> > Now as you showed it can be workarounded in the container context.
> > The problem here is that the code you're showing only apply to
> > standalone deployment, not HA/pacemaker one.
> >
> > The source of truth for ovn in pacemaker context seems to be in
> > puppet-tripleo[1].
> >
> > Those mount point are definitively not in sync with what we have in
> > the templates.
> >
> > This offer us another way to fix this, but I really would like to see
> > it fixed in the packaging even if we do the double mount point hack.
> >
> > That being said I'm currently testing the attached patch on a
> > deployment. It's really a wild "grep" result. Adding Michele for
> > patch review.
> >
> > I will let you know the result.
> >
> > [1]
> > https://github.com/openstack/puppet-tripleo/blob/stable/train/manifests/
> > profile/pacemaker/ovn_dbs_bundle.pp#L153..L180
I tested the attached patch and it's working:

(undercloud) [stack@undercloud-0 ~]$ cat patch.diff
From 33009cd4cb607b63ac7401f35243792b9d99814e Mon Sep 17 00:00:00 2001
From: Sofer Athlan-Guyot <sathlang>
Date: Tue, 9 Jun 2020 15:10:02 +0200
Subject: [PATCH] Hack for db location change.

Change-Id: Ib389b0c264b16128a3d9ec11a52124e6bf6216cf
---
 manifests/profile/pacemaker/ovn_dbs_bundle.pp | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/manifests/profile/pacemaker/ovn_dbs_bundle.pp b/manifests/profile/pacemaker/ovn_dbs_bundle.pp
index f4986fff..6316f5c7 100644
--- a/manifests/profile/pacemaker/ovn_dbs_bundle.pp
+++ b/manifests/profile/pacemaker/ovn_dbs_bundle.pp
@@ -176,6 +176,11 @@ class tripleo::profile::pacemaker::ovn_dbs_bundle (
       'target-dir' => '/etc/openvswitch',
       'options'    => 'rw',
     },
+    'ovn-dbs-db-path-new' => {
+      'source-dir' => '/var/lib/openvswitch/ovn',
+      'target-dir' => '/etc/ovn',
+      'options'    => 'rw',
+    },
   }
   if (hiera('ovn_dbs_short_node_names_override', undef)) {
     $ovn_dbs_short_node_names = hiera('ovn_dbs_short_node_names_override')
--
2.25.4

tripleo-ansible-inventory --plan "qe-Cloud-0" --ansible_ssh_user heat-admin --static-yaml-inventory inventory.yaml
ansible -b -i inventory.yaml 'Controller' -m patch -a 'src=patch.diff basedir=/usr/share/openstack-puppet/modules/tripleo strip=1'

[heat-admin@controller-0 ~]$ sudo podman inspect ovn-dbs-bundle-podman-0 | jq '.[]|.Mounts[]|.Source + " -> " + .Destination'
"/etc/pacemaker/authkey -> /etc/pacemaker/authkey"
"/var/log/pacemaker/bundles/ovn-dbs-bundle-0 -> /var/log"
"/var/lib/kolla/config_files/ovn_dbs.json -> /var/lib/kolla/config_files/config.json"
"/lib/modules -> /lib/modules"
"/var/lib/openvswitch/ovn -> /run/openvswitch"
"/var/log/containers/openvswitch -> /var/log/openvswitch"
"/var/lib/openvswitch/ovn -> /etc/openvswitch"
"/var/lib/openvswitch/ovn -> /etc/ovn"

We can see that the new mount point exists. That's when I discovered that the patch exists upstream, but only in master and ussuri.
I've triggered the backport to train.
Now, I'm not sure how it works for upstream, because they don't have the 16.0/16.1 split, so I'm not sure it's relevant there.

@Networking, can you analyse the whole situation: should this be a downstream-only backport in 16.1 only?
(In reply to Sofer Athlan-Guyot from comment #24)
> Now, not sure how it works for upstream, because they don't have the
> 16.0/16.1 split so not sure it's relevant there.
>
> @Networking can you analyse the whole situation: should this be an
> downstream only backport in 16.1 only ?

I think it makes sense to have it upstream in TripleO as well, right? The thing is that I don't think we're testing the ovn dbs bundle upstream in TripleO CI. Am I wrong? Otherwise we would've hit the issue in this case when we bumped OVN from 2.11 to 2.12/20.03.
(In reply to Daniel Alvarez Sanchez from comment #25)
> (In reply to Sofer Athlan-Guyot from comment #24)
> > Now, not sure how it works for upstream, because they don't have the
> > 16.0/16.1 split so not sure it's relevant there.
> >
> > @Networking can you analyse the whole situation: should this be an
> > downstream only backport in 16.1 only ?
>
> I think it makes sense to have it upstream TripleO as well right? The thing
> is that I don't think we're testing the ovn dbs bundle upstream in Tripleo
> CI. Am I wrong?
> Otherwise we would've hit the issue in this case when we bumped OVN from
> 2.11 to 2.12/20.03

One needs a continuous ping test to see the failure. The ovndb cluster runs just fine, but with an empty database. Stateless tempest tests don't cut it either, as everything would run fine if you ran a tempest test after the update.
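To make that point concrete, a continuous check in the spirit of the job's l3_agent ping scripts could look like the sketch below: ping a floating IP for the whole update window, then fail if loss crossed the threshold. The FIP, the 1% threshold and the "Ping loss higher than" message mirror the job output earlier in this bug; the file names and helper names are illustrative, not the real script's:

```shell
#!/bin/sh
# Continuous connectivity check sketch: start a long-running ping before
# the update, stop it afterwards, and fail on excessive packet loss.
FIP="${FIP:-10.0.0.212}"
THRESHOLD="${THRESHOLD:-1}"   # max tolerated packet loss, in percent

start_ping() {
    # -D prefixes each reply with an epoch timestamp, which is what lets
    # us correlate the outage with the update logs afterwards.
    ping -D "$FIP" > "/home/stack/ping_results_$(date +%Y%m%d%H%M).log" 2>&1 &
    echo $! > /tmp/l3_ping.pid
}

# Parse ping's final statistics line and fail when loss exceeds the
# threshold, mimicking the "Ping loss higher than 1% detected" message.
check_loss() {
    stats_line=$1
    loss=$(printf '%s\n' "$stats_line" | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')
    if awk -v l="$loss" -v t="$THRESHOLD" 'BEGIN { exit (l > t) ? 0 : 1 }'; then
        echo "Ping loss higher than ${THRESHOLD}% detected (${loss}%)"
        return 1
    fi
    echo "Ping loss OK (${loss}%)"
}
```

A stateless tempest run after the update would never exercise this: only the statistics of an uninterrupted ping expose the outage window.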
Ping test is working with 0 percent packet loss.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148