Bug 1955538
Summary: [update] Slight cut in rabbitmq connectivity triggered a data plane loss after a full sync.

Product: Red Hat OpenStack
Component: openstack-neutron
Version: 13.0 (Queens)
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Reporter: Sofer Athlan-Guyot <sathlang>
Assignee: Rodolfo Alonso <ralonsoh>
QA Contact: Eran Kuris <ekuris>
CC: ccamposr, chrisw, michele, ralonsoh, scohen, vgrosu
Keywords: Triaged, ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-07-19 14:30:50 UTC
Description (Sofer Athlan-Guyot, 2021-04-30 10:45:15 UTC)
We would like some guidance on the following points, and are asking DFG:Networking for it. Depending on the frequency of the issue, it would be useful to know whether something can be done:

1. to mitigate the issue (issue some commands before the update, or something delivered as a KB article?);
2. to understand why a sync would destroy the connectivity;
3. to recover (reboot of the compute, of the VM?).

Hi, one thing to note is that this seems to affect mainly composable jobs, relative to HA jobs. The main difference in the context of this issue is the order in which the roles are updated. In HA, the controllers are updated first, then the computes. In composable, we start with the controller role (which does not have the rabbitmq server), and the messaging role (which does have the rabbitmq server) is updated *after* the compute. This should not be a problem, as order shouldn't matter, but it is certainly not the most realistic scenario: in the field, the compute role would usually be the last thing updated.

In the context of this bugzilla, this means that at the time of the full sync on the compute, OVS is still running the old, not-updated version in memory (only the binaries are updated; a reboot, i.e. a restart of OVS, would be needed to get the new OVS), while the Python networking agent container is already at the latest version. So it may be that if the computes are updated after the rabbitmq servers, the full sync is harmless (which would explain why we don't seem to hit the issue on the HA architecture), and that the sync is fatal only when the computes are in this "mixed" mode. This is only a theory, but it fits nicely with the data we currently have. If it proves correct, that would be good, because, as said, on the customer side the compute would usually be the last thing updated.

So we need:

1. to confirm or refute this theory;
2.
to adjust the composable role CI testing so that the role update sequence is closer to real life, i.e. the compute role comes after the entire control plane (controller, db, messaging);
3. if the theory is correct, to make sure we add this constraint to the documentation (though that would not be a big new constraint).

In the light of this new information, DFG:Networking may be able to better root-cause the issue. That would still be useful, especially to check what an easy way out of this would be (restart of OVS, reboot of the compute node, something else?).

Note: I'm currently on PTO with limited internet access. I wanted to capture this while it's still fresh, but I won't work further on this before next Monday.

Sofer Athlan-Guyot:

Hi Rodolfo, I'll set up a reproducer so that you can look into it. I'll post the details in the BZ when the environment is available. Concurrently, I'll validate the theory of the "bad" sequence in the update (i.e. testing with the compute role nodes coming last).

Sofer Athlan-Guyot (comment #15):

Hi @vgrosu, we need a new "Warning" section in the OSP13 update page, in the same vein as the one in OSP16.1 about OVN. Basically, if the deployment is <z10, users need to consult the KBS and plan for it before doing the update. This is the delivery of a hotfix that will prevent a data plane cut during the update. The cut does not happen every time, but if the hotfix is not applied it may happen.

Vlada (vgrosu):

Hi Sofer, apologies for the delay, I'm just getting around to this ticket now. I can see the KBS is not published and its status is Solution in Progress. Based on the comments in this BZ, it looks like it can move to verified and be published. Shall I give it an editorial review and then publish it? Can you please confirm? Also, I'll create a draft for the "1.2. Known issues that might block an update" section [1] of the OSP13 doc to describe the issue and link to the workaround, and I'll share the details of that shortly. Is that what you had in mind?
Many thanks,
Vlada

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index#known_issues_that_might_block_an_update

Vlada (vgrosu):

(In reply to Sofer Athlan-Guyot from comment #15)
> Hi @vgrosu
>
> we need a new "Warning" section in the OSP13 update page, on the same vein
> as the one in osp16.1 about ovn.
>
> Basically if the deployment is <z10 they need to consult the KBS and plan
> for it before doing the update. This is the delivery of an hotfix that will
> prevent data plane cut during update. The cut is not happening all the
> time but if the hotfix is not applied it may happen.

I've published the doc update here: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index?lb_target=production#known_issues_that_might_block_an_update

And I've published the Knowledgebase solution here: https://access.redhat.com/solutions/6068071