Summary: | After updating the controller nodes to RHOSP 16.1.2, there was a problem with VM network communication. | |||
---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Young Kim <youngkim> | |
Component: | openstack-containers | Assignee: | Sofer Athlan-Guyot <sathlang> | |
Status: | CLOSED ERRATA | QA Contact: | Sofer Athlan-Guyot <sathlang> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 16.1 (Train) | CC: | akaris, amuller, astupnik, atragler, bdobreli, bhaley, bshephar, chrisw, ealcaniz, ekuris, ffernand, jlibosva, jpretori, knoha, ltamagno, m.andre, nusiddiq, sathlang, scohen, sgolovat, slinaber, smooney, spower, sputhenp, supadhya, tkajinam, yocha | |
Target Milestone: | z3 | Keywords: | Triaged | |
Target Release: | 16.1 (Train on RHEL 8.2) | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1899936 1900484 | Environment: | |
Last Closed: | 2020-12-15 18:37:35 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Bug Depends On: | 1899936 | |||
Bug Blocks: | 1900484 |
Comment 15
Sofer Athlan-Guyot
2020-11-12 13:09:32 UTC
(In reply to Sofer Athlan-Guyot from comment #15)
> Hi,
>
> so I know why our internal CI didn't see that error. It wasn't related to L2
> vs L3 testing.
>
> 1. Before we start the update run on all roles, we start the ping test;
> 2. we then update all roles, starting with the controller;
> 3. in the compute ovn controller we get the same error as described there,
> but the existing ping test is not affected;
> 4. after the update run we check the ping for lost packets and there are none.
>
> 5. we start another ping test for the rest of the update and it's working
> fine, as at that point the controller and the compute are updated.
>
> Now, if during the update run I start a *new* ping test after the Controller
> update but before the Compute update, then I get total packet loss.
>
> So the crux here is that existing flows are still working even with the
> error, but *new* flows don't work until we have the compute updated.
>
> @nusiddiq does this behaviour match your expectation?

Yes

>
> So to overcome this, the only way is to start the update with the compute role.
> Any other way is a major change. Numan, you're proposing
> a solution inside ovn code. Do you think it has any chance to land soon?
> Currently we're facing this for all updates in 16.1 and certainly updates from
> 16.0 to 16.1.
>
> Should I go and try to implement the major change or should we wait until we
> get an ovn fix?

The proposal was discussed this week, but we still need to evaluate whether it's possible. It also needs to be discussed upstream.

Before I confirm anything, I at least need to do a quick POC to see whether it is possible.

How about I come back on this by the end of next week? Does this sound good?

Thanks
Numan

Just thinking out loud. Fixing this in core OVN will not fix the update from older versions, but only from the new version containing the fix to a newer one, right? Because the old ovn-controllers will still not be able to wait for them to be updated.
It means that, from the OpenStack standpoint, we'll still need to figure out updates from 16.1.2 to 16.1.3 or from any 16.1.z to 16.1.3.

(In reply to Jakub Libosvar from comment #17)
> Just thinking out loud. Fixing this in core OVN will not fix the update from
> older version, but only from the new version containing the fix to the
> newer, right? Because the old ovn-controllers will still not be able to wait
> for them to be updated.

That's correct.

>
> It means from the OpenStack standpoint, we'll still need to figure out
> updates from 16.1.2 to 16.1.3 or from any 16.1.z to 16.1.3

I think so.

FYI - the initial patch (without tests and documentation for now) is up for review - https://patchwork.ozlabs.org/project/ovn/patch/20201113183615.1962024-1-numans@ovn.org/

Hi,

the first easy workaround cannot work because ovs needs it the other way round: the Controller (neutron-server) must be updated before the Compute (any agent).

So we have ovn, which currently requires Compute first and then Controller, and ovs, which requires the opposite. The current update workflow cannot handle such a set of constraints.

In any case we need a way to deliver the new ovn code to the overcloud without breaking the network, so this has to be outside of the usual update workflow.

We're following two leads from the update perspective:

1. a knowledge base solution on how to update ovn_controller on compute before running the update;
2. create a new "pre-update" stage in the update process
   - this will be the place to add such code in the future, and will start with that ovn_controller delivery;
   - needs a significant amount of changes and testing;

So first a kb solution for this case, and "2." as a long-term solution to provide an easy hook for this kind of code in the future.

Thanks,

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5413

*** Bug 1919884 has been marked as a duplicate of this bug. ***

Hi,

The reason we don't need the workaround anymore is that, starting with 16.1.7, we have the necessary code to update OVN in the product[0]. We have updated the documentation to reflect that necessary new step in the update process[1].

Whether we can remove that constraint from the product later will depend on how the RFE is taken by OVN members[2].

Thanks,

[0] See https://bugzilla.redhat.com/show_bug.cgi?id=2052411 for details.
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_keeping-updated
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2057568
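
For readers skimming the thread, the ordering conflict discussed above (ovn wants Compute before Controller, ovs wants Controller before Compute) can be illustrated with a small dependency check. This is only a sketch, assuming Python 3.9+ for the standard-library `graphlib` module. The `update_order` helper and the step labels are hypothetical and are not part of any TripleO or OVN tooling. It shows why no role-level ordering satisfies both constraints, and why splitting the ovn_controller delivery into a separate "pre-update" step removes the cycle.

```python
from graphlib import TopologicalSorter, CycleError


def update_order(constraints):
    """Return one valid step order for (before, after) pairs, or None on conflict.

    Hypothetical helper for illustration only.
    """
    sorter = TopologicalSorter()
    for before, after in constraints:
        # 'after' may only run once 'before' has completed.
        sorter.add(after, before)
    try:
        return list(sorter.static_order())
    except CycleError:
        # The constraints contradict each other; no ordering exists.
        return None


# Role-level constraints as described in the thread.
role_level = [
    ("Compute", "Controller"),  # ovn: ovn_controller on Compute first
    ("Controller", "Compute"),  # ovs: neutron-server on Controller first
]

# Same constraints, with ovn_controller delivery pulled out into its own step
# (the step label is an assumption made for this sketch).
with_pre_update = [
    ("pre-update: ovn_controller on Compute", "Controller"),
    ("Controller", "Compute"),
]

print(update_order(role_level))       # None
print(update_order(with_pre_update))
```

The second call yields the pre-update step first, then Controller, then Compute, which matches the shape of the "pre-update" stage proposed above.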