Bug 1660880 - 3.10.14 to v3.10.72 upgrade - control plane upgrade upgrades the ovs and sdn pods on all node network causing downtime on the nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 3.10.z
Assignee: Russell Teague
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1667441
Depends On:
Blocks: 1657019
Reported: 2018-12-19 13:41 UTC by Luis Martinho
Modified: 2022-03-13 16:34 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When the OpenShift SDN/OVS DaemonSets were upgraded during control plane upgrades with an updateStrategy of RollingUpdate, an upgrade of the pods in the entire cluster was performed. This caused unexpected network and application outages on nodes. This patch updates the following:
* Changed the updateStrategy for SDN/OVS pods to OnDelete in the template, affects new installs.
* Added control plane upgrade task to modify SDN/OVS daemonsets to use OnDelete updateStrategy
* Added node upgrade task to delete all SDN/OVS pods while nodes are drained
Network outages for nodes should only occur during the node upgrade when nodes are drained.
Clone Of:
Environment:
Last Closed: 2019-02-20 10:11:10 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3759411 0 None None None 2018-12-21 16:27:20 UTC
Red Hat Product Errata RHBA-2019:0328 0 None None None 2019-02-20 10:11:17 UTC

Internal Links: 1667441

Comment 11 Scott Dodson 2019-01-02 13:42:29 UTC
https://github.com/openshift/openshift-ansible/pull/10910 is the proposed fix. It defers the SDN update to the node upgrade phase, which will ensure control plane availability.

The feedback from QE is that removal of the image change trigger did not address the problem. Back to networking team for now as the PR is merging. Once it is merged and verified by QE, we can clone this bug to backport the fix to 3.10.

Comment 13 Luis Martinho 2019-01-02 14:13:13 UTC
Adding to the above question, can you provide an estimate of when the backport will be implemented in 3.10?

Comment 29 Stephen Cuppett 2019-01-18 15:37:41 UTC
*** Bug 1667441 has been marked as a duplicate of this bug. ***

Comment 35 Scott Dodson 2019-01-21 18:12:49 UTC
In order to address requirement 2, we can change the updateStrategy from RollingUpdate to OnDelete. When the node is drained, we delete the pods in the openshift-sdn namespace for that node, which triggers the SDN upgrade on the drained node.
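As a rough illustration of the updateStrategy change described above, the sketch below prints an `oc patch` command for each DaemonSet. The DaemonSet names `sdn` and `ovs`, the helper name `ondelete_patch_cmd`, and the use of a merge patch are assumptions for illustration; this is not necessarily how the openshift-ansible task is implemented.

```shell
#!/bin/sh
# Hypothetical helper: print the oc command that would switch a DaemonSet's
# updateStrategy to OnDelete. It only prints the command; it does not run it.
ondelete_patch_cmd() {
    ds_name=$1
    namespace=${2:-openshift-sdn}   # namespace taken from the comment above
    printf "oc patch daemonset %s -n %s --type merge -p '%s'\n" \
        "$ds_name" "$namespace" '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
}

# The DaemonSet names below are assumptions; check `oc get ds -n openshift-sdn`.
for ds in sdn ovs; do
    ondelete_patch_cmd "$ds"
done
```

Printing the commands rather than running them allows a review first; the output can be piped to `sh` once it looks right. With OnDelete, existing pods keep running the old image until they are deleted.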

So the outstanding work to be done is as follows:

1) Update the updateStrategy for SDN pods to OnDelete in the template, affects new installs.
2) Add control plane update task to mutate the SDN daemonsets to use OnDelete updateStrategy, must happen prior to 3)
3) Leave the SDN upgrade in the control plane in 3.10 (move it back in 3.11).
4) During the node drain, upgrade, restart process, delete all the SDN pods for a given node. Unfortunately you cannot select on nodeName, so this must be scripted. Something like:

oc delete pod -n openshift-sdn $(oc get pods -n openshift-sdn -o wide --sort-by="{.spec.nodeName}" | grep {{ openshift.node.name }} | cut -f 1 -d ' ')

Maybe the oc_obj module is smarter and can do this for us? I doubt it.
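A slightly more defensive variant of the scripted delete might look like the sketch below. It matches the node name exactly rather than with grep, so that a node named `node1` does not also match `node10`. The function names are hypothetical, and the sketch assumes the SDN and OVS pods all live in the openshift-sdn namespace.

```shell
#!/bin/sh
# Read "NAME NODE" pairs on stdin and print the names of the pods that are
# scheduled on exactly the given node.
pods_on_node() {
    awk -v node="$1" '$2 == node { print $1 }'
}

# Hypothetical wrapper: delete every pod in openshift-sdn on one node.
delete_sdn_pods_on_node() {
    node_name=$1
    pods=$(oc get pods -n openshift-sdn --no-headers \
        -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName |
        pods_on_node "$node_name")
    # Skip the delete when nothing matched, so oc is never called without names.
    if [ -n "$pods" ]; then
        # Word splitting on $pods is intentional: one argument per pod name.
        oc delete pod -n openshift-sdn $pods
    fi
}
```

In a playbook this would be called once per drained node, for example `delete_sdn_pods_on_node "{{ openshift.node.name }}"` after templating.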


Regarding Q1, the changes in https://github.com/openshift/openshift-ansible/pull/11021 will not address Requirement 2. It would only make sense to address Requirement 1 without addressing Requirement 2 if the customer is willing to forgo node upgrades until we can deliver a more complete fix. I'd prefer we work on what's described above and see if that gets us a complete solution.

Comment 39 Russell Teague 2019-01-23 18:01:28 UTC
New PR submitted to limit OVS pod restarts to the node upgrade, when the node is drained:

release-3.10: https://github.com/openshift/openshift-ansible/pull/11050

Limited 3.10.n upgrade testing in progress.

Comment 45 Russell Teague 2019-01-29 20:35:04 UTC
Fixed in build openshift-ansible-3.10.106-1

Comment 56 errata-xmlrpc 2019-02-20 10:11:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0328

