Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.

Bug 2052494

Summary: Experiencing ping loss on VM started on previous version of ovn/ovs when updating the ovn db.
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Sofer Athlan-Guyot <sathlang>
Component: ovn-2021
Assignee: OVN Team <ovnteam>
Status: CLOSED WONTFIX
QA Contact: Jianlin Shi <jishi>
Severity: urgent
Priority: urgent
Version: FDP 22.A
CC: ctrautma, dceara, egarciar, ekuris, i.maximets, jiji, jpretori, mmichels, nusiddiq, smooney
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-21 08:26:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2050154

Description Sofer Athlan-Guyot 2022-02-09 11:55:51 UTC
Description of problem:

Hi,

initially reported at https://bugzilla.redhat.com/show_bug.cgi?id=2050154, in the context of an update from OSP 16.1 to OSP 16.2.

During that update we move from:

ovn2.13-20.12.0-189.el8fdp.x86_64 to ovn-2021-21.12.0-11.el8fdp.x86_64

and

openvswitch2.13-2.13.0-124.el8fdp.x86_64 to openvswitch2.15-2.15.0-57.el8fdp.x86_64
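(Aside: when comparing the package moves above, the version part can be pulled out of an NVRA string with plain POSIX shell; the helper name below is made up for illustration.)

```shell
#!/bin/sh
# Illustrative helper: extract the version field from an RPM NVRA string such
# as the package names above (name-version-release.dist.arch).
nvra_version() {
    rest="${1%-*}"                # drop "-release.dist.arch" -> "name-version"
    printf '%s\n' "${rest##*-}"   # keep the text after the last remaining dash
}

nvra_version "ovn-2021-21.12.0-11.el8fdp.x86_64"        # -> 21.12.0
nvra_version "openvswitch2.15-2.15.0-57.el8fdp.x86_64"  # -> 2.15.0
```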


The workflow is:

1. start vm1 in the overcloud (so under ovn2.13-20... and openvswitch2.13-2....);
2. ping vm1 continuously;
3. update the ovndb containers on the OSP controller nodes (so the Compute nodes are not yet updated);
4. observe the ping failing.

After the last controller has been updated, the ping still fails.
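Step 2 above can be sketched as follows (a hypothetical helper, not an OSP tool): pipe a continuous ping at vm1 through awk and count gaps in icmp_seq, where each gap corresponds to replies lost during the ovndb update.

```shell
#!/bin/sh
# Count lost ping replies by looking for gaps in the icmp_seq counter.
count_gaps() {
    awk 'match($0, /icmp_seq=[0-9]+/) {
             seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
             if (prev != "" && seq > prev + 1)
                 lost += seq - prev - 1      # each skipped seq is a lost reply
             prev = seq
         }
         END { print lost + 0 }'
}

# Real use would be: ping -i 0.2 "$VM1_IP" | count_gaps
# Demo on canned ping output (seq 3 and 4 missing):
printf '64 bytes from vm1: icmp_seq=1 ttl=64\n64 bytes from vm1: icmp_seq=2 ttl=64\n64 bytes from vm1: icmp_seq=5 ttl=64\n' | count_gaps
```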

How reproducible: always.

Comment 8 Sofer Athlan-Guyot 2022-02-11 17:54:43 UTC
Hi,

I would like to share more about the OSP update process and to understand better what the consequences of setting external_ids:ovn-match-northd-version to true would be. I would also like to emphasize that, until recently, we didn't need that parameter to have successful 16.1->16.2 and 16.2->16.2 OSP updates.

The OSP/tripleo upgrade process is:

 - upgrade of the OSP controllers, which run these services:

8d6548183046  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-server-ovn:16.2_20220201.1  kolla_start           8 days ago          Up 8 days ago                  neutron_api
e1a7b08f0724  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-novncproxy:16.2_20220201.1     kolla_start           8 days ago          Up 8 days ago                  nova_vnc_proxy
bf7a8d4d09ae  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:16.2_20220201.1      kolla_start           8 days ago          Up 8 days ago                  ovn_controller
56d8a7505b33  cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest                                            /bin/bash /usr/lo...  7 minutes ago       Up 7 minutes ago               ovn-dbs-bundle-podman-0

 - later (possibly even days later, on very big OSP cloud environments with hundreds of compute nodes, or depending on the cloud operator's policy) the computes get updated. They run these services (for instance):

5abb4b9d066c  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:16.1_20211111.1  /bin/bash -c HAPR...  8 days ago  Up 8 days ago         neutron-haproxy-ovnmeta-3cce3842-61c3-4cc3-abd3-aea6c3434313
3ad2093dcc0a  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:16.1_20211111.1  kolla_start           9 days ago  Up 9 days ago         ovn_metadata_agent
46624d9b2125  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:16.1_20211111.1              kolla_start           9 days ago  Up 9 days ago         ovn_controller

So I understand that the OVN manual says to update ovn_controller before updating ovn-dbs, but that's not really practical/possible given the current OSP architecture.

Now, until recently OVN's "backward" compatibility was maintained across everything we call OSP 16:
 - 16.1->16.2 was working (no ping loss);
 - 16.2 GA->16.2 latest was working (no ping loss);

and the parameter external_ids:ovn-match-northd-version wasn't set (so false by default). So maybe we could look at which OVN change triggered the "backward" incompatibility. Maybe it could/should be possible to keep the current flows working within an OSP 16.x update.

I'm asking because I'm worried about the exact consequences, from a customer perspective, of setting that parameter to true in OSP. It sounds to me like, when set, it prevents any new network-related object from being created, so until all of the OSP computes get updated the customer cannot create new VMs on the cloud.

Could someone elaborate on the consequences of setting this to true, knowing that the OSP compute nodes can be updated days after the OSP controller nodes?

In parallel I've triggered tests of an update using that parameter; details in https://bugzilla.redhat.com/show_bug.cgi?id=2050154#c8 and https://bugzilla.redhat.com/show_bug.cgi?id=2050154#c9.

Comment 9 Sofer Athlan-Guyot 2022-02-14 12:45:03 UTC
Hi @dceara ,

the test of the patch that sets external_ids:ovn-match-northd-version to true doesn't work correctly in the OSP framework because of the way the update works. It would certainly work otherwise (with the caveat mentioned earlier: a cut in control plane connectivity during the OSP update).

We're going to look into other ways to deliver this, but in the meantime I've made a table of what was working, so maybe you could find what patch caused such a change:

- updating from ovn-2021-21.06.0-24/openvswitch2.15.x86_64 2.15.0-26 to ovn-2021-21.09.0-20/openvswitch2.15.x86_64 2.15.0-55 was working (i.e. updating ovn-northd before ovn-controller didn't cause any trouble);
- when updating to ovn-2021-21.12.0-11/openvswitch2.15.x86_64 2.15.0-26, updating ovn-northd before ovn-controller causes a cut in the flows.

So something between ovn-2021-21.06.0-24 and ovn-2021-21.12.0-11 causes this. Identifying it could maybe provide a short-term solution while we work out the details of how to proceed in the future.
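For anyone triaging similar reports, the observations above can be turned into a tiny helper (purely illustrative; the boundary versions are just the data points from this comment, not an official support matrix):

```shell
#!/bin/sh
# Classify an ovn-2021 version-release string against the observed window:
# 21.09.0-20 was the last release seen to tolerate a northd-first update,
# 21.12.0-11 the first release seen to cut flows.
is_at_least() {
    # True when $1 sorts at or above $2 under version ordering (sort -V).
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

classify() {
    if is_at_least "$1" "21.12.0-11"; then
        echo "northd-first update known to cut flows"
    elif is_at_least "$1" "21.09.0-20"; then
        echo "untested window between 21.09.0-20 and 21.12.0-11"
    else
        echo "northd-first update observed to work"
    fi
}

classify "21.06.0-24"
classify "21.12.0-11"
```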

> Does connectivity restore once you update the compute nodes?

Yes. It has to be noted though that on a real site, this could be days later.

Comment 11 Dumitru Ceara 2022-02-14 13:32:43 UTC
(In reply to Sofer Athlan-Guyot from comment #9)
> Hi @dceara ,
> 

Hi Sofer,

CC: Numan.

> the test of the patch delivery the external_ids:ovn-match-northd-version to
> true don't work correctly in the OSP framework because of the way the update
> work.  This would certainly work (with the caveats mentioned earlier, cut in
> control plane connectivity during OSP update).
> 

The only supported and recommended OVN upgrade procedure is to upgrade
ovn-controller (computes) first and then central components (ovn-northd
and NB/SB databases on controller nodes).

If the CMS doesn't ensure this then there might be packet drops because
OVN cannot ensure forward-compatibility in ovn-controller.

To *partially alleviate* the impact, OVN recently started providing the
ovn-match-northd-version option.  Obviously, the best way to upgrade is
still to follow the only supported and recommended procedure, i.e.,
ovn-controller (computes) first.

Quoting from the commit that introduced "ovn-match-northd-version":

    OVN recommends updating/upgrading ovn-controllers first and then
    ovn-northd and OVN DB ovsdb-servers.  This is to ensure that any
    new functionality specified by the database or logical flows created
    by ovn-northd is understood by ovn-controller.

This doesn't change in any way the recommended upgrade procedure.

> We're going to look into other ways to deliver this, but in the meantime
> I've made a table of what was working, so maybe you could find what patch
> caused such a change:
> 
> - updating from ovn-2021-21.06.0-24/openvswitch2.15.x86_64 2.15.0-26 to
> ovn-2021-21.09.0-20/openvswitch2.15.x86_64 2.15.0-55 was working (ie
> updating ovn-northd before ovn-controller didn't cause any trouble)
> - when updating to ovn-2021-21.12.0-11/openvswitch2.15.x86_64 2.15.0-26
> updating ovn-northd before ovn-controller cause a cut in the flows.
> 
> So something between ovn-2021-21.06.0-24 and ovn-2021-21.12.0-11 causes
> this. Identifying this could maybe provide a short term solution while we're
> working out the detail of how to proceed in the future.
> 

The commit that added a new OVN action (internal action) in that
interval is:
https://github.com/ovn-org/ovn/commit/4deac4509abbedd6ffaecf27eed01ddefccea40a

Just to stress this: it is not a regression, and this commit doesn't do
anything wrong; I still think it's the CMS's responsibility to ensure
that components are upgraded in the correct order.  Wrt. a short-term
solution, I don't see a way to ensure that this doesn't happen again in
the future.  OVN just cannot provide forward compatibility.

Numan, do you maybe have alternative suggestions?

> > Does connectivity restore once you update the compute nodes?
> 
> Yes. It has to be noted though that on real site, this could be days later.

Ack.

Regards,
Dumitru

Comment 12 Dumitru Ceara 2022-02-14 16:23:16 UTC
There might be ways to remove the current restriction that specifies
ovn-controllers must be upgraded before central components.  That,
however, is not something that can be implemented as a bug fix and
should be treated as an RFE.  One thing that comes to mind is having
northd check what feature set is supported by the running
ovn-controllers.  This needs detailed scoping though because it opens
up the possibility for different scenarios (e.g., stale chassis records
might force northd to not use new features; development complexity
increase).

In the meantime, the CMS will have to:
- upgrade ovn-controller first
OR (at least)
- set ovn-match-northd-version=true in all compute nodes before
  performing the upgrade.  This can be a one-time operational change.
  It will get persisted across OVS/OVN upgrades (unless the DBs are
  removed).
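For reference, the one-time change described above would look roughly like this on the nodes (a sketch only; how the commands are reached depends on how the OSP containers expose ovs-vsctl/ovn-sbctl):

```shell
# On each compute node, before the upgrade starts: tell ovn-controller to stop
# processing SB logical flows whenever its internal version no longer matches
# the version ovn-northd wrote into the Southbound database.
ovs-vsctl set open_vswitch . external_ids:ovn-match-northd-version=true

# Verify the setting took effect:
ovs-vsctl get open_vswitch . external_ids:ovn-match-northd-version

# On a controller node, the version northd advertises can be inspected in the
# Southbound DB (the field exists only in releases that carry this feature):
ovn-sbctl get SB_Global . options:northd_internal_version
```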

Comment 13 smooney 2022-02-23 12:57:29 UTC
the only supported way to upgrade OpenStack is to upgrade the controllers first and the computes second, so this is a major problem for using OVN with OpenStack.

within a minor z-stream release you are technically allowed to upgrade the computes first, but that is untested, and you are never allowed to do a major release upgrade of the computes first. the controllers must always be upgraded before any other host in a major release.

so unless this is addressed you cannot do a major upgrade sanely.

you would have to first upgrade the ovn container on the computes with a container from a newer rhel, which rhel does not actually support. then you upgrade the controllers, then you upgrade all the other containers on the computes, then you do the rhel upgrade on the computes.

that is not tested or supported by our tooling today.

the supported upgrade path for major upgrades is:
1. upgrade all the controllers
2. upgrade the containers on all the computes
3. upgrade rhel on all the computes

the ovn restriction would force a step 0 to upgrade just the ovn container, and it would also require that ovn to work with the older ovs supplied by the not-yet-upgraded rhel.

for osp 16 to osp 17 that is a rhel 8.4 to rhel 9.0 version split.

for 17 and 18 they will at least be on the same major version of rhel, but I see the current requirement to upgrade ovn on the computes first as a potential osp 17.0 release blocker, since it will block 17 upgrades, and as a blocker for supporting osp 18 controllers with osp 17 computes as part of the osp 18 roadmap.

Comment 14 Ilya Maximets 2022-02-23 13:26:34 UTC
(In reply to smooney from comment #13)
> you would have to first upgrade the ovn container on the compute with a
> container that will be from a newer rhel which rhel does not actually support

Don't steps 2 and 3 below already imply that we're running newer containers
on older rhel for some period of time?

> then you have upgraded the controllers. then you have to upgrade all the
> other containers on the computers then you have to do the rhel upgrade on
> the computer.
> 
> that is not tested or supported by our tooling today.
> 
> the supported upgrade path for major upgrades is 
> 1 upgrade all the contolers
> 2 upgrade the container on all the computes
> 3 upgrade the rhel on all the computes
> 
> the ovn restriction would force a step 0 to upgrade just the ovn container
> and would also require that ovn to work with the older ovs supplied by the
> unupgraded rhel.

OVN is not tied to a specific version of OVS, so that should not be a problem.

Comment 15 smooney 2022-02-24 18:10:43 UTC
yes it does
we already do that today for the osp 13 rhel 7 to osp 16 rhel 8 FFU.

so in the hybrid state, which is defined as running fully upgraded osp 16 controllers on rhel 8 with osp 16 containers on rhel 7 computes, we already run the osp 16 (rhel 8) nova compute container on a rhel 7 host.

in this hybrid mode the nova-libvirt container is kept back to the osp 13/rhel 7 container, and it's just the python nova-compute agent container that is from the newer rhel. we only package one version of each container per release today.

we do not currently allow that hybrid state to be used for extended periods, and the only operation that is supported without a support exception is live migration.
so this transitionary state must be entered and exited in the same maintenance window, which typically does not last more than 48 hours.

so steps 2 and 3, in the context of the osp 16 to 17 upgrade, imply running the ovn rhel 9 container on a rhel 8 host for a short period of time, but in 16-17 all hosts will be running the same version of ovn once they are in the hybrid state.
there is a goal to support the hybrid state effectively indefinitely in 17.1: https://bugzilla.redhat.com/show_bug.cgi?id=2006966

for the osp 17-18 upgrade (rhel 9.0 to 9.x)
the intent is to allow you to upgrade the OpenStack controllers without touching the computes at all,
and then later upgrade one compute at a time.

in practice we expect that customers would upgrade their central components in one maintenance window, then in successive short maintenance windows upgrade batches of compute nodes.
part of the 18 goal however is to not need to touch every host in the deployment to do that first step of upgrading the controllers.

ovn's current design would require that all the agents on the distributed compute nodes be upgraded before the OSP upgrade could start,
or we would have to not upgrade ovn on the controllers as part of the initial step and only upgrade it at the very end.

that would effectively be the following:
* pin ovn to the osp 17 version and upgrade the controllers.
* upgrade the computes in batches to 18.
* after all computes are fully upgraded to 18, upgrade ovn on the controllers.

since we expect ^ to take place over a period of 6-12 months or potentially more, that basically
means neutron in osp 18 would have to be fully functional with the osp 17 version of ovn for an extended period of time.
it might be possible to do that, but it would significantly increase the testing burden for the network dfg.


so ideally, if ovn could provide a way to upgrade the centralised part first and the distributed part second, that would simplify integration with layered products like OpenStack or possibly OpenShift.
I'm not familiar with their upgrade mechanisms, but I would guess they also want to upgrade the centralised components first, as that tends to be how distributed systems choose to address this,
e.g. the server can support older clients rather than clients supporting older servers (though both can be done).
ceph also allows the centralised parts to be upgraded before the ceph clients are updated.

Comment 17 Elvira 2023-08-21 08:26:40 UTC
Since this issue was closed for 17 and 18 (BZ#2057568), I don't think it makes sense to have it filed against ovn-2021, which is equivalent to the older OSP 16. Please reach out or reopen the other BZ as needed.

Comment 18 Red Hat Bugzilla 2023-12-20 04:25:07 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days