Bug 1660880
Summary: | 3.10.14 to 3.10.72 upgrade - the control plane upgrade also upgrades the OVS and SDN pods on every node, causing network downtime on the nodes | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Luis Martinho <lmartinh> |
Component: | Cluster Version Operator | Assignee: | Russell Teague <rteague> |
Status: | CLOSED ERRATA | QA Contact: | zhaozhanqi <zzhao> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 3.10.0 | CC: | aos-bugs, bmeng, cdc, jokerman, lmartinh, mirollin, mmccomas, ocasalsa, openshift-bugs-escalate, rhowe, rteague, scuppett, steven.barre, vrutkovs, wsun, yoliynyk, zzhao |
Target Milestone: | --- | Keywords: | NeedsTestCase |
Target Release: | 3.10.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
When the OpenShift SDN/OVS DaemonSets used an updateStrategy of RollingUpdate, a control plane upgrade rolled the pods across the entire cluster at once. This caused unexpected network and application outages on nodes.
This fix makes the following changes (see the sketch after this table):
* Changed the updateStrategy for the SDN/OVS DaemonSets to OnDelete in the template; affects new installs.
* Added a control plane upgrade task that modifies existing SDN/OVS DaemonSets to use the OnDelete updateStrategy.
* Added a node upgrade task that deletes all SDN/OVS pods on a node while it is drained.
Network outages for nodes should now occur only during the node upgrade, while nodes are drained.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-02-20 10:11:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1657019 |
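For illustration, here is a minimal manual sketch of the updateStrategy change described in the Doc Text above. The DaemonSet names `sdn` and `ovs` are assumed from the 3.10-era openshift-sdn namespace layout; the actual fix performs this mutation inside the openshift-ansible control plane upgrade tasks.

```
# Switch both DaemonSets to OnDelete so their pods are no longer rolled
# cluster-wide whenever the DaemonSet template is updated.
# (DaemonSet names "sdn" and "ovs" are assumptions based on the 3.10 layout.)
oc patch daemonset sdn -n openshift-sdn \
  -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
oc patch daemonset ovs -n openshift-sdn \
  -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'

# Verify the strategy on both DaemonSets.
oc get daemonset -n openshift-sdn \
  -o custom-columns=NAME:.metadata.name,STRATEGY:.spec.updateStrategy.type
```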
Comment 11
Scott Dodson
2019-01-02 13:42:29 UTC
Adding to the above question, can you provide an estimate of when the backport will be implemented in 3.10?

*** Bug 1667441 has been marked as a duplicate of this bug. ***

In order to address requirement 2, we can change the updateStrategy from RollingUpdate to OnDelete. When a node is drained, we delete that node's pods in the openshift-sdn namespace, which triggers the SDN upgrade on the drained node. The outstanding work is as follows:

1) Update the updateStrategy for SDN pods to OnDelete in the template; affects new installs.
2) Add a control plane upgrade task that mutates the SDN DaemonSets to use the OnDelete updateStrategy; must happen prior to 3).
3) Leave the SDN upgrade in the control plane in 3.10 (move it back in 3.11).
4) During the node drain, upgrade, and restart process, delete all the SDN pods for the given node. Unfortunately you cannot select on nodeName, so this must be scripted, something like:

oc delete pod -n openshift-sdn $(oc get pods -n openshift-sdn -o wide --sort-by="{.spec.nodeName}" | grep {{ openshift.node.name }} | cut -f 1 -d ' ')

Maybe the oc_obj module is smarter and can do this for us? I doubt it. (A fuller per-node sketch follows at the end of this comment.)

Regarding Q1, the changes in https://github.com/openshift/openshift-ansible/pull/11021 will not address Requirement 2. It would only make sense to address Requirement 1 without addressing Requirement 2 if the customer is willing to forgo node upgrades until we can deliver a more complete fix. I'd prefer we work on what's described above and see if that gets us a complete solution.

New PR submitted to limit the OVS pod restart to the node upgrade, when the node is drained:
release-3.10: https://github.com/openshift/openshift-ansible/pull/11050
Limited 3.10.n upgrade testing is in progress.

Fixed in build openshift-ansible-3.10.106-1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0328
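For illustration, a minimal sketch of step 4 as a per-node shell sequence, assuming a hypothetical NODE variable and the 3.10-era openshift-sdn namespace; the actual implementation lives in the openshift-ansible node upgrade tasks.

```
#!/bin/bash
# Hypothetical per-node sequence; NODE is the node being upgraded.
NODE=node1.example.com

# Drain the node so workloads are rescheduled before networking restarts.
oc adm drain "$NODE" --ignore-daemonsets --delete-local-data

# With updateStrategy: OnDelete, the SDN/OVS pods are only replaced when
# deleted, so delete just this node's pods to trigger their upgrade.
# (nodeName is not selectable in 3.10, hence the grep over `-o wide` output.)
oc delete pod -n openshift-sdn \
  $(oc get pods -n openshift-sdn -o wide | grep "$NODE" | cut -f 1 -d ' ')

# ... node package/config upgrade happens here ...

# Return the node to service once the new SDN/OVS pods are running.
oc adm uncordon "$NODE"
```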