Bug 1660880
Summary: | 3.10.14 to 3.10.72 upgrade - the control plane upgrade also upgrades the OVS and SDN pods on every node, causing network downtime on the nodes | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Luis Martinho <lmartinh> |
Component: | Cluster Version Operator | Assignee: | Russell Teague <rteague> |
Status: | CLOSED ERRATA | QA Contact: | zhaozhanqi <zzhao> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 3.10.0 | CC: | aos-bugs, bmeng, cdc, jokerman, lmartinh, mirollin, mmccomas, ocasalsa, openshift-bugs-escalate, rhowe, rteague, scuppett, steven.barre, vrutkovs, wsun, yoliynyk, zzhao |
Target Milestone: | --- | Keywords: | NeedsTestCase |
Target Release: | 3.10.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
When the OpenShift SDN/OVS DaemonSets used an updateStrategy of RollingUpdate, a control plane upgrade rolled the pods across the entire cluster at once. This caused unexpected network and application outages on nodes.
This fix makes the following changes (see the sketch after this table):
* Changed the updateStrategy for the SDN/OVS DaemonSets to OnDelete in the template; affects new installs.
* Added a control plane upgrade task that modifies existing SDN/OVS DaemonSets to use the OnDelete updateStrategy.
* Added a node upgrade task that deletes all SDN/OVS pods on a node while it is drained.
Network outages for nodes should now occur only during the node upgrade, while nodes are drained.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-02-20 10:11:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1657019 |
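For illustration, here is a minimal manual sketch of the updateStrategy change described in the Doc Text above. The DaemonSet names `sdn` and `ovs` are assumed from the 3.10-era openshift-sdn namespace layout; the actual fix performs this mutation inside the openshift-ansible control plane upgrade tasks.

```
# Switch both DaemonSets to OnDelete so their pods are no longer rolled
# cluster-wide whenever the DaemonSet template is updated.
# (DaemonSet names "sdn" and "ovs" are assumptions based on the 3.10 layout.)
oc patch daemonset sdn -n openshift-sdn \
  -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
oc patch daemonset ovs -n openshift-sdn \
  -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'

# Verify the strategy on both DaemonSets.
oc get daemonset -n openshift-sdn \
  -o custom-columns=NAME:.metadata.name,STRATEGY:.spec.updateStrategy.type
```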
Comment 11
Scott Dodson
2019-01-02 13:42:29 UTC
Adding to the above question, can you provide an estimate of when the backport will be implemented in 3.10?

*** Bug 1667441 has been marked as a duplicate of this bug. ***

In order to address requirement 2, we can change the updateStrategy from RollingUpdate to OnDelete. When a node is drained, we delete that node's pods in the openshift-sdn namespace, which triggers the SDN upgrade on the drained node. The outstanding work is as follows:

1) Update the updateStrategy for SDN pods to OnDelete in the template; affects new installs.
2) Add a control plane upgrade task that mutates the SDN DaemonSets to use the OnDelete updateStrategy; must happen prior to 3).
3) Leave the SDN upgrade in the control plane in 3.10 (move it back in 3.11).
4) During the node drain, upgrade, and restart process, delete all the SDN pods for the given node. Unfortunately you cannot select on nodeName, so this must be scripted, something like:

oc delete pod -n openshift-sdn $(oc get pods -n openshift-sdn -o wide --sort-by="{.spec.nodeName}" | grep {{ openshift.node.name }} | cut -f 1 -d ' ')

Maybe the oc_obj module is smarter and can do this for us? I doubt it. (A fuller per-node sketch follows at the end of this comment.)

Regarding Q1, the changes in https://github.com/openshift/openshift-ansible/pull/11021 will not address Requirement 2. It would only make sense to address Requirement 1 without addressing Requirement 2 if the customer is willing to forgo node upgrades until we can deliver a more complete fix. I'd prefer we work on what's described above and see if that gets us a complete solution.

New PR submitted to limit the OVS pod restart to the node upgrade, when the node is drained:
release-3.10: https://github.com/openshift/openshift-ansible/pull/11050
Limited 3.10.n upgrade testing is in progress.

Fixed in build openshift-ansible-3.10.106-1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0328
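For illustration, a minimal sketch of step 4 as a per-node shell sequence, assuming a hypothetical NODE variable and the 3.10-era openshift-sdn namespace; the actual implementation lives in the openshift-ansible node upgrade tasks.

```
#!/bin/bash
# Hypothetical per-node sequence; NODE is the node being upgraded.
NODE=node1.example.com

# Drain the node so workloads are rescheduled before networking restarts.
oc adm drain "$NODE" --ignore-daemonsets --delete-local-data

# With updateStrategy: OnDelete, the SDN/OVS pods are only replaced when
# deleted, so delete just this node's pods to trigger their upgrade.
# (nodeName is not selectable in 3.10, hence the grep over `-o wide` output.)
oc delete pod -n openshift-sdn \
  $(oc get pods -n openshift-sdn -o wide | grep "$NODE" | cut -f 1 -d ' ')

# ... node package/config upgrade happens here ...

# Return the node to service once the new SDN/OVS pods are running.
oc adm uncordon "$NODE"
```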