Bug 1657019
Summary: | 3.11 control plane upgrade upgrades the ovs and sdn pods on all nodes, causing network downtime on the nodes | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
Component: | Cluster Version Operator | Assignee: | Russell Teague <rteague> |
Status: | CLOSED ERRATA | QA Contact: | Weihua Meng <wmeng> |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.11.0 | CC: | akrastin, aos-bugs, cdc, erich, jdesousa, jiajliu, jokerman, jolee, lmartinh, maupadhy, mmccomas, mrobson, openshift-bugs-escalate, parmstro, pdwyer, rteague, scuppett, sgarciam, steven.barre, yoliynyk |
Target Milestone: | --- | ||
Target Release: | 3.11.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
When the OpenShift SDN/OVS DaemonSets were upgraded during a control plane upgrade with an updateStrategy of RollingUpdate, the pods on every node in the cluster were upgraded at once. This caused unexpected network and application outages on nodes.
This patch makes the following changes:
* Changed the updateStrategy for the SDN/OVS pods to OnDelete in the template (affects new installs).
* Added a control plane upgrade task to modify the SDN/OVS DaemonSets to use the OnDelete updateStrategy.
* Added a node upgrade task to delete all SDN/OVS pods while nodes are drained.
Network outages on a node should now occur only during the node upgrade, while that node is drained.
|
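The OnDelete behavior described in the Doc Text can be sketched as a DaemonSet spec fragment. This is an illustration only, not the actual openshift-ansible template; the field names follow the Kubernetes `apps/v1` DaemonSet API, while the `openshift-sdn` namespace and the `app: sdn` label are assumptions for this sketch:

```yaml
# Fragment of an SDN DaemonSet spec. With type: OnDelete, the controller
# applies an updated pod template to a node only after that node's pod is
# deleted (e.g. while the node is drained during the node upgrade), instead
# of rolling every pod in the cluster as RollingUpdate does.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sdn
  namespace: openshift-sdn   # assumed namespace for this sketch
spec:
  selector:
    matchLabels:
      app: sdn               # assumed label for this sketch
  updateStrategy:
    type: OnDelete           # was RollingUpdate before this fix
```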
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-02-20 14:11:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1660880 | ||
Bug Blocks: |
Description
Ryan Howe
2018-12-06 20:45:42 UTC
While it may be unexpected that the SDN daemonsets are updated as part of the control plane upgrade, the fact that updating the daemonset causes downtime is the real problem that needs to be resolved here. Moving to the SDN team for their consideration.

We also noticed the restart issues as part of the 4.0 effort, and have started working on more resiliency for the SDN. Beyond those improvements, do you think there's anything left to do here, Scott? I don't think we'll ever split the SDN into multiple daemonsets, so while it is surprising that a "control plane" upgrade touches the nodes, it makes technical sense.

No, I think that's it, thanks.

The initial implementation that moved the SDN upgrade from the control plane upgrade to the node upgrade was found to be insufficient. Please see this comment for an explanation of the new approach being taken: https://bugzilla.redhat.com/show_bug.cgi?id=1660880#c35

Those changes are forward-ported, and the earlier change that moved the SDN upgrade around is reverted, in this PR: https://github.com/openshift/openshift-ansible/pull/11075

Fixed in build openshift-ansible-3.11.76-1

Fixed. openshift-ansible-3.11.82-1.git.0.f29227a.el7.noarch

After /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml executed, only the masters are touched. All nodes are in Ready status.
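The two upgrade tasks added by the fix amount to roughly the following commands. This is a hedged sketch, not the actual Ansible tasks; the `openshift-sdn` namespace, the `sdn`/`ovs` daemonset names, and the `app` pod labels are assumptions, and the `--field-selector` flag requires a reasonably recent `oc`/`kubectl`:

```shell
# Sketch of the fix's approach (not the actual openshift-ansible tasks).
# 1) Control plane upgrade: switch the SDN/OVS daemonsets to OnDelete so that
#    updating the pod template no longer restarts pods cluster-wide.
oc -n openshift-sdn patch daemonset/sdn \
  --type merge -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
oc -n openshift-sdn patch daemonset/ovs \
  --type merge -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'

# 2) Node upgrade: while a node is drained, delete its sdn/ovs pods so the
#    daemonset controller recreates them from the new template. The network
#    outage is therefore confined to the already-drained node.
NODE=ip-172-18-13-244.ec2.internal   # example node name
oc -n openshift-sdn delete pod \
  --field-selector "spec.nodeName=${NODE}" -l 'app in (sdn, ovs)'
```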
```
[root@ip-172-18-2-246 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-0-37.ec2.internal     Ready     infra     2h        v1.10.0+b81c8f8
ip-172-18-12-206.ec2.internal   Ready     infra     2h        v1.10.0+b81c8f8
ip-172-18-13-244.ec2.internal   Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-14-241.ec2.internal   Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-15-49.ec2.internal    Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-2-246.ec2.internal    Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-6-25.ec2.internal     Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-9-142.ec2.internal    Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-9-234.ec2.internal    Ready     infra     2h        v1.10.0+b81c8f8

[root@ip-172-18-2-246 ~]# oc get all
NAME            READY     STATUS    RESTARTS   AGE
pod/ovs-4hpqn   1/1       Running   0          2h
pod/ovs-7czbj   1/1       Running   0          36m
pod/ovs-fkxws   1/1       Running   0          2h
pod/ovs-hhfjd   1/1       Running   0          37m
pod/ovs-hwgcp   1/1       Running   0          2h
pod/ovs-jqlw6   1/1       Running   0          2h
pod/ovs-lfbfw   1/1       Running   0          2h
pod/ovs-q8cm5   1/1       Running   0          2h
pod/ovs-z6w85   1/1       Running   1          39m
pod/sdn-4lj7z   1/1       Running   0          2h
pod/sdn-5cwf5   1/1       Running   0          2h
pod/sdn-75hxz   1/1       Running   0          37m
pod/sdn-82s6l   1/1       Running   0          39m
pod/sdn-d4tjr   1/1       Running   0          2h
pod/sdn-gj7xx   1/1       Running   0          2h
pod/sdn-k9bhm   1/1       Running   1          36m
pod/sdn-rr2rr   1/1       Running   0          2h
pod/sdn-srfcv   1/1       Running   0          2h

NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   9         9         9         3            9           <none>          2h
daemonset.apps/sdn   9         9         9         3            9           <none>          2h

NAME                                  DOCKER REPO                                           TAGS          UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.11,v3.10   About an hour ago
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326