Bug 2094265 - Data plane disruption during update from 16.2.1, 16.2.0, or any 16.1 release to 16.2.2 or later in ML2/OVN deployments
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: z4
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Terry Wilson
QA Contact: Fiorella Yanac
URL:
Whiteboard:
Duplicates: 2127166
Depends On: 2089416 2141873
Blocks:
Reported: 2022-06-07 10:14 UTC by Ujey J
Modified: 2023-04-12 19:19 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20221010235135.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2141873
Environment:
Last Closed: 2022-12-07 19:23:13 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 863171 | 0 | None | MERGED | Support setting ovn-ofctrl-wait-before-clear | 2022-11-11 06:07:48 UTC
Red Hat Issue Tracker | OSP-15557 | 0 | None | None | None | 2022-06-07 10:16:16 UTC
Red Hat Product Errata | RHBA-2022:8794 | 0 | None | None | None | 2022-12-07 19:23:47 UTC

Internal Links: 2098208 2117544

Description Ujey J 2022-06-07 10:14:27 UTC
Description of problem:
The customer upgraded one of their RHOSP deployments from 16.2.1 to 16.2.2, and during the procedure the VMs running on it were impacted. It seems this happened during the ovn-controller container refresh, when at least some of the VMs experienced connection timeouts.

The issue appears to have been triggered exactly while following the step below and, as mentioned, everything recovered automatically after 60-90 seconds.

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_updating-overcloud

Version-Release number of selected component (if applicable):
RHOSP 16.2.1

How reproducible:
Upgrade from RHOSP 16.2.1 to 16.2.2

Steps to Reproduce:
1. Upgrade 16.2.1 to 16.2.2



Actual results:
The upgrade succeeded, but VM connectivity dropped for 60-90 seconds during the data plane upgrade.

Expected results:
Successful upgrade without any downtime.

Additional info:

Comment 1 ldenny 2022-06-09 04:45:48 UTC
Hi Ujey, 

I confirmed that the correct process was followed: the OVN controller was updated on the compute nodes first, before the controllers:

Compute:
"StartedAt": "2022-05-18T10:58:15.253471631Z"

Controller: 
"StartedAt": "2022-05-18T20:22:58.251640909Z"
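
For anyone repeating this check, here is a minimal Python sketch that compares the two StartedAt values quoted above. The helper name parse_started_at is hypothetical, and the sketch assumes the timestamps are UTC with nanosecond precision, as they appear in the container inspect output.

```python
from datetime import datetime, timezone


def parse_started_at(value: str) -> datetime:
    """Parse a UTC timestamp such as 2022-05-18T10:58:15.253471631Z.

    datetime only stores microseconds, so the fractional part is
    truncated (or padded) to six digits before parsing.
    """
    base, _, frac = value.rstrip("Z").partition(".")
    frac = (frac + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


# Values quoted in this comment for the ovn_controller containers.
compute_started = parse_started_at("2022-05-18T10:58:15.253471631Z")
controller_started = parse_started_at("2022-05-18T20:22:58.251640909Z")

gap = controller_started - compute_started
print(f"compute restarted before controller: {compute_started < controller_started}")
print(f"time between the two restarts: {gap}")
```

With these values the compute container started roughly nine and a half hours before the controller one, which matches the documented order.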

I also confirmed that the container versions match, with both compute and controller nodes using version 16.2.2-15.1651564647 [1].

However, it seems the relevant OVS and OVN logs are missing from the sos reports. For example, the earliest entries are:
OVS:
2022-05-19T03:32:01.810Z|01027|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log

OVN controller:
2022-05-19T00:22:14.527Z|00014|pinctrl(ovn_pinctrl0)|INFO|DHCPACK $MAC $IP

Even the messages log starts only at `May 18 18:19:27`.

Based on the start time of ovn_controller on the compute node and on what the customer has told us, we should assume the issue occurred around 2022-05-18T10:58:15.
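
To make the gap explicit, here is a small Python sketch that checks whether each log's earliest quoted entry predates the assumed incident time. The names INCIDENT, first_entries and covers_incident are hypothetical, and the incident time is the assumption stated above, not a measured value.

```python
from datetime import datetime, timezone

# Assumed incident time (UTC): the ovn_controller start on the compute node.
INCIDENT = datetime(2022, 5, 18, 10, 58, 15, tzinfo=timezone.utc)

# Earliest entries found in the sos report, as quoted in this comment.
first_entries = {
    "ovs-vswitchd": "2022-05-19T03:32:01.810Z",
    "ovn-controller": "2022-05-19T00:22:14.527Z",
}


def parse_log_ts(value: str) -> datetime:
    """Parse the UTC timestamp prefix used by OVS/OVN log lines."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def covers_incident(first_entry: str, incident: datetime = INCIDENT) -> bool:
    """Return True if the log starts at or before the incident time."""
    return parse_log_ts(first_entry) <= incident


for name, ts in first_entries.items():
    status = "covers" if covers_incident(ts) else "starts after"
    print(f"{name}: {status} the incident")
```

Both logs start on 2022-05-19, after the assumed incident time, so neither covers it.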

I assume the customer hasn't provided the full set of rotated logs from the system. Could you please confirm this? I may have missed something.

If we don't have the logs from the incident, please check with the customer. They may not have been rotated off the server yet, and we could capture them in a tarball of /var/log/.

[1]https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-ovn-northd/5de6c2b4d70cc51644a57382?architecture=amd64&tag=16.2.2-15.1651564647&push_date=1651858373000

Comment 2 Ujey J 2022-06-13 07:32:15 UTC
Hello,

The customer has attached the ovs-vswitchd logs from that specific compute node to the case. The ovn-controller logs from the same compute node are already attached to the case as well.

Please check the attached logs and share your findings.

Also, the customer's concern is:

"we will have this same issue on the next minor update so what we are trying to confirm for now is that using one of your reference rhosp deployments(or a lab one) you still don't see any impact during a minor update to the data plane when running that intermediate step of refreshing the ovn-controllers."

Thanks,
Ujey J

Comment 5 Ujey J 2022-06-15 11:16:39 UTC
Hi Jakub,

Please find the answers to your questions below.

What do you mean by "It seems that it happened during the ovn-controller container refresh"?

Ans: The customer mentioned that they hit the issue after performing the steps in the following document:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_updating-overcloud

Do you have rough timestamps in which the downtime occurred?

Ans: the impact time is: 10:57-10:58 UTC 18.05.2022

Were the full logs requested in comment 1 provided?
Ans: Unfortunately, some logs have already been rotated out, so the customer cannot provide the full logs. They shared the ovs-vswitchd logs from that specific compute node on the case itself, and the OVN controller logs were attached earlier, on May 25.

The customer ticket mentions "upgrade" but the procedure they did was an "update" from 16.2.1 to 16.2.2, is that correct?
Ans: Yes, the customer performed an update from 16.2.1 to 16.2.2.

The customer wants to know the root cause, or whether the documentation can be fixed, since it does not mention the downtime they experienced.

Could you please look into this and provide an update?

Thanks,
Ujey J

Comment 19 Jakub Libosvar 2022-10-04 18:53:30 UTC
*** Bug 2127166 has been marked as a duplicate of this bug. ***

Comment 40 errata-xmlrpc 2022-12-07 19:23:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8794

Comment 41 Sofer Athlan-Guyot 2022-12-08 17:49:04 UTC
Hi,
The follow-up BZ for the update is here: https://bugzilla.redhat.com/show_bug.cgi?id=2151958

