Bug 1895220 - After updating the controller nodes to RHOSP 16.1.2, there was a problem with VM network communication.
Summary: After updating the controller nodes to RHOSP 16.1.2, there was a problem with VM network communication.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-containers
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z3
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Sofer Athlan-Guyot
QA Contact: Sofer Athlan-Guyot
URL:
Whiteboard:
Duplicates: 1919884
Depends On: 1899936
Blocks: 1900484
 
Reported: 2020-11-06 05:26 UTC by Young Kim
Modified: 2024-03-25 16:58 UTC
CC List: 27 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1899936 1900484
Environment:
Last Closed: 2020-12-15 18:37:35 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-14522 0 None None None 2022-04-05 15:15:44 UTC
Red Hat Knowledge Base (Solution) 5554371 0 None None None 2020-11-06 15:23:44 UTC
Red Hat Product Errata RHEA-2020:5413 0 None None None 2020-12-15 18:37:58 UTC

Comment 15 Sofer Athlan-Guyot 2020-11-12 13:09:32 UTC
Hi,

So, I now know why our internal CI didn't see that error. It wasn't related to L2 vs L3 testing.

1. Before we start the update run on all roles, we start the ping test.
2. We then update all roles, starting with the Controller.
3. On the Compute nodes, ovn-controller reports the same error as described here, but the existing ping test is not affected.
4. After the update run, we check the ping for lost packets and there are none.
5. We then start another ping test for the rest of the update, and it works fine, as at that point the Controller and the Compute nodes are updated.

Now, if during the update run I start a *new* ping test after the Controller update but before the Compute update, then I get total packet loss.

So the crux here is that existing flows keep working even with the error, but *new* flows don't work until the Compute nodes are updated.
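
For reference, a minimal sketch of the kind of ping-based packet-loss check described above (hypothetical: the helper name, the target IP, and the plain ping/subprocess approach are illustrative assumptions, not the actual CI code):

# Hypothetical sketch of the ping-based packet-loss check described in this
# comment; not the actual CI tooling. Assumes a Linux `ping` binary.
import re
import subprocess

def ping_loss_percent(target: str, count: int = 100) -> float:
    """Ping `target` and return the packet-loss percentage ping reports."""
    result = subprocess.run(
        ["ping", "-c", str(count), target],
        capture_output=True, text=True, check=False,
    )
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    if match is None:
        raise RuntimeError(f"could not parse ping output for {target}")
    return float(match.group(1))

# A ping session started *before* the Controller update (an existing flow)
# reports ~0% loss, while one started *after* it but before the Compute
# update (a new flow) reports ~100%.
print(ping_loss_percent("192.0.2.10"))  # hypothetical VM floating IP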

@nusiddiq does this behaviour match your expectations?

To overcome this, the only way is to start the update with the Compute role. Any other way is a major change. Numan, you're proposing a solution inside the OVN code. Do you think it has any chance of landing soon? Currently we're facing this for every update within 16.1, and certainly for updates from 16.0 to 16.1.

Should I go and try to implement the major change, or should we wait until we get an OVN fix?

Comment 16 Numan Siddique 2020-11-12 14:19:45 UTC
(In reply to Sofer Athlan-Guyot from comment #15)
> Hi,
> 
> So, I now know why our internal CI didn't see that error. It wasn't
> related to L2 vs L3 testing.
> 
> 1. Before we start the update run on all roles, we start the ping test.
> 2. We then update all roles, starting with the Controller.
> 3. On the Compute nodes, ovn-controller reports the same error as described
> here, but the existing ping test is not affected.
> 4. After the update run, we check the ping for lost packets and there are none.
> 5. We then start another ping test for the rest of the update, and it works
> fine, as at that point the Controller and the Compute nodes are updated.
> 
> Now, if during the update run I start a *new* ping test after the
> Controller update but before the Compute update, then I get total packet loss.
> 
> So the crux here is that existing flows keep working even with the error,
> but *new* flows don't work until the Compute nodes are updated.
> 
> @nusiddiq does this behaviour match your expectations?

Yes

> 
> To overcome this, the only way is to start the update with the Compute
> role. Any other way is a major change. Numan, you're proposing a solution
> inside the OVN code. Do you think it has any chance of landing soon?
> Currently we're facing this for every update within 16.1, and certainly
> for updates from 16.0 to 16.1.
> 
> Should I go and try to implement the major change, or should we wait until
> we get an OVN fix?

The proposal was discussed this week, but we still need to evaluate whether it's possible. It also needs to be discussed upstream. Before I confirm anything, I at least need to do a quick PoC to see whether it is feasible. How about I come back on this by the end of next week?

Does this sound good?

Thanks
Numan

Comment 17 Jakub Libosvar 2020-11-16 09:00:56 UTC
Just thinking out loud. Fixing this in core OVN will not fix updates from older versions, only updates from a version that already contains the fix to a newer one, right? Because the old ovn-controllers will still not be able to wait for the rest of the deployment to be updated.

It means that from the OpenStack standpoint, we'll still need to figure out updates from 16.1.2 to 16.1.3, or from any 16.1.z to 16.1.3.

Comment 18 Numan Siddique 2020-11-16 09:24:41 UTC
(In reply to Jakub Libosvar from comment #17)
> Just thinking out loud. Fixing this in core OVN will not fix updates from
> older versions, only updates from a version that already contains the fix
> to a newer one, right? Because the old ovn-controllers will still not be
> able to wait for the rest of the deployment to be updated.

That's correct.


> 
> It means that from the OpenStack standpoint, we'll still need to figure
> out updates from 16.1.2 to 16.1.3, or from any 16.1.z to 16.1.3.

I think so.

FYI - the initial patch (without tests and documentation for now) is up for review - https://patchwork.ozlabs.org/project/ovn/patch/20201113183615.1962024-1-numans@ovn.org/
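
To illustrate why an OVN-side fix only helps once every node runs it, here is a generic sketch of the capability-gating idea (illustrative only, with made-up names; it is not the actual patch linked above):

# Illustrative capability-gating sketch, not the actual OVN patch: the
# central component enables a new behaviour only once every agent
# advertises support for it. Agents running the old code never advertise,
# which is why the fix cannot help updates *from* versions that predate it.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    features: frozenset  # capabilities this agent advertises

def can_enable(feature: str, agents: list) -> bool:
    """Enable `feature` only when every agent supports it."""
    return all(feature in agent.features for agent in agents)

agents = [
    Agent("compute-0", frozenset({"new-flow-format"})),  # already updated
    Agent("compute-1", frozenset()),                     # still old: blocks it
]
print(can_enable("new-flow-format", agents))  # False until all are updated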

Comment 20 Sofer Athlan-Guyot 2020-11-17 15:00:26 UTC
Hi,

the first easy workaround cannot work because OVS needs it the other way round: the Controller (neutron-server) must be updated before the Compute nodes (any agent).

So we have OVN, which currently requires Compute first and then Controller, and OVS, which requires the opposite. The current update workflow cannot handle such a set of constraints.

In any case, we need a way to deliver the new OVN code to the overcloud without breaking the network, so this has to happen outside of the usual update workflow.
 
We're following two leads from the update perspective:

 1. a knowledge base solution describing how to update ovn_controller on the Compute nodes before running the update (sketched below);
 2. create a new "pre-update" stage in the update process:
    - this will be the place to add such code in the future, and it will start with the ovn_controller delivery;
    - it needs a significant amount of changes and testing.

So: first, a KB solution for this case, and "2." as a long-term solution that provides an easy hook for this kind of code in the future.
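
For lead "1.", a hypothetical orchestration sketch of the workaround: refresh ovn_controller on every Compute node before the regular update run. The inventory, user, and per-host refresh command are placeholders; the real steps are in the knowledge base article linked from this bug (Solution 5554371).

# Hypothetical orchestration sketch for lead "1."; the inventory and the
# per-host refresh command are placeholders, not the actual KB procedure.
import subprocess

COMPUTES = ["compute-0", "compute-1"]                       # placeholder inventory
REFRESH_CMD = "sudo /usr/local/bin/refresh-ovn-controller"  # placeholder step

for host in COMPUTES:
    # Run the refresh on each Compute node over ssh, stopping on failure so
    # a broken node is noticed before the Controller update starts.
    subprocess.run(["ssh", f"heat-admin@{host}", REFRESH_CMD], check=True)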

Thanks,

Comment 38 errata-xmlrpc 2020-12-15 18:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5413

Comment 39 Jakub Libosvar 2021-01-27 15:54:02 UTC
*** Bug 1919884 has been marked as a duplicate of this bug. ***

Comment 43 Sofer Athlan-Guyot 2022-04-20 14:37:21 UTC
Hi,

We no longer need the workaround because, starting with 16.1.7, the product includes the necessary code to update OVN[0].

We have updated the documentation to reflect that new required step in the update process[1].

Whether we can remove that constraint from the product later will depend on how the RFE is received by the OVN team[2].
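
For anyone scripting the new flow, a minimal sketch of invoking that separate OVN update step before the regular update run (assuming it is exposed as an external-update tag; treat the stack name and this wrapper as assumptions and take the exact command from the documentation linked as [1]):

# Minimal sketch of running the separate ovn-controller update step; the
# stack name and this wrapper are assumptions, and the exact command should
# be taken from the documentation linked as [1].
import subprocess

def update_ovn_controllers(stack: str = "overcloud") -> None:
    """Run the OVN-tagged external update for the given overcloud stack."""
    subprocess.run(
        ["openstack", "overcloud", "external-update", "run",
         "--stack", stack, "--tags", "ovn"],
        check=True,
    )

update_ovn_controllers()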

Thanks,

[0] See https://bugzilla.redhat.com/show_bug.cgi?id=2052411 for details.
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_keeping-updated
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2057568

