Bug 1955538
Summary: [update] Slight cut in rabbitmq connectivity triggered a data plane loss after a full sync.

Product: Red Hat OpenStack
Component: openstack-neutron
Version: 13.0 (Queens)
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Reporter: Sofer Athlan-Guyot <sathlang>
Assignee: Rodolfo Alonso <ralonsoh>
QA Contact: Eran Kuris <ekuris>
CC: ccamposr, chrisw, michele, ralonsoh, scohen, vgrosu
Keywords: Triaged, ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-07-19 14:30:50 UTC
Description (Sofer Athlan-Guyot, 2021-04-30 10:45:15 UTC)
We would like some guidance on the following points, and are asking DFG:Networking for it. Depending on the frequency of the issue, it would be useful to know whether something can be done:

1. to mitigate the issue (issue some commands before the update, or something delivered as a KB article?);
2. to understand why a sync would destroy the connectivity;
3. to recover (reboot of the compute, of the VM?).

Hi, one thing to note is that this seems to affect mainly composable jobs, relative to HA jobs. The main difference in the context of this issue is the order in which the roles are updated. In HA, the controllers are updated first, then the computes. In composable, we start with the controller role (which does not have the rabbitmq server), and the messaging role (which does have the rabbitmq server) is updated *after* the compute. This should not be a problem, as order shouldn't matter, but it is certainly not the most realistic scenario: in the field, the compute role would usually be the last thing updated.

In the context of this bugzilla, this means that at the time of the full sync on the compute, OVS is still running the old, not-updated version in memory (only the binaries are updated; a reboot, i.e. a restart of OVS, would be needed to get the new OVS), while the Python networking agent container is already at the latest version. So it may be that if the computes are updated after the rabbitmq servers, the full sync is harmless (which would explain why we don't seem to hit the issue on the HA architecture), and that the sync is fatal only when the computes are in this "mixed" mode. This is only a theory, but it fits nicely with the data we currently have. If it proves correct, that would be good, because, as said, on the customer side the compute would usually be the last thing updated.

So we need:

1. to confirm or refute this theory;
2.
to adjust the composable role CI testing so that the role update sequence is closer to real life, i.e. the compute role comes after the entire control plane (controller, db, messaging);
3. if the theory is correct, to make sure we add this constraint to the documentation (though that would not be a big new constraint).

In the light of this new information, DFG:Networking may be able to better root-cause the issue. That would still be useful, especially to check what an easy way out of this would be (restart of OVS, reboot of the compute node, something else?).

Note: I'm currently on PTO with limited internet access. I wanted to capture this while it's still fresh, but I won't work further on this before next Monday.

Sofer Athlan-Guyot:

Hi Rodolfo, I'll set up a reproducer so that you can look into it. I'll post the details in the BZ when the environment is available. Concurrently, I'll validate the theory of the "bad" sequence in the update (i.e. testing with the compute role nodes coming last).

Sofer Athlan-Guyot (comment #15):

Hi @vgrosu, we need a new "Warning" section in the OSP13 update page, in the same vein as the one in OSP16.1 about OVN. Basically, if the deployment is <z10, users need to consult the KBS and plan for it before doing the update. This is the delivery of a hotfix that will prevent a data plane cut during the update. The cut does not happen every time, but if the hotfix is not applied it may happen.

Vlada (vgrosu):

Hi Sofer, apologies for the delay, I'm just getting around to this ticket now. I can see the KBS is not published and its status is Solution in Progress. Based on the comments in this BZ, it looks like it can move to verified and be published. Shall I give it an editorial review and then publish it? Can you please confirm? Also, I'll create a draft for the "1.2. Known issues that might block an update" section [1] of the OSP13 doc to describe the issue and link to the workaround, and I'll share the details of that shortly. Is that what you had in mind?
Many thanks,
Vlada

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index#known_issues_that_might_block_an_update

Vlada (vgrosu):

(In reply to Sofer Athlan-Guyot from comment #15)
> Hi @vgrosu
>
> we need a new "Warning" section in the OSP13 update page, on the same vein
> as the one in osp16.1 about ovn.
>
> Basically if the deployment is <z10 they need to consult the KBS and plan
> for it before doing the update. This is the delivery of an hotfix that will
> prevent data plane cut during update. The cut is not happening all the
> time but if the hotfix is not applied it may happen.

I've published the doc update here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index?lb_target=production#known_issues_that_might_block_an_update

And I've published the Knowledgebase solution here: https://access.redhat.com/solutions/6068071