Description of problem:
When upgrading from 3.10 to 3.11 and running just the control plane upgrade playbook, a task is run that upgrades the sdn and ovs daemonsets on all nodes. This can lead to unwanted network downtime on the nodes during the update.

Version-Release number of selected component (if applicable):
3.11
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
ansible-2.6.10-1.el7ae.noarch
openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. When upgrading from 3.10 to 3.11, run the control plane upgrade playbook:
2. $ ansible-playbook -i </path/to/inventory/file> \
     /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml

Actual results:
Rolling redeploy of all sdn and ovs pods, causing a network outage on the nodes due to slow image pulls.

Expected results:
Only the masters are touched.

Additional info:
- Can images be pre-pulled by the playbooks to avoid downtime waiting on image pulls?
- Running the control plane playbook should only upgrade the masters and not touch the nodes.

Task that is run:
https://github.com/openshift/openshift-ansible/blob/release-3.11/playbooks/common/openshift-cluster/upgrades/v3_11/upgrade_control_plane_part2.yml#L82-L86
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/tasks/main.yml#L50

Basically the following is run, piping FILES to stdin:
# oc apply -f -
FILES = https://github.com/openshift/openshift-ansible/tree/release-3.11/roles/openshift_sdn/files

1. It adds v3.11 tags to the imagestream.
2. It changes the daemonset annotation image.openshift.io/triggers so the imagestreamtag triggers point to v3.11:
   https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/files/sdn-ovs.yaml#L10
   https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/files/sdn.yaml#L12

This triggers a rolling redeploy of the daemonsets on all nodes in the cluster.
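For illustration, the trigger annotation on the sdn daemonset looks roughly like the sketch below. This is only a paraphrase: the authoritative contents are in the sdn.yaml linked above, and the container name "sdn" is assumed from the daemonset name.

  metadata:
    annotations:
      # assumed layout of the image trigger annotation; see the linked sdn.yaml for the real one
      image.openshift.io/triggers: |
        [{"from":{"kind":"ImageStreamTag","name":"node:v3.11"},
          "fieldPath":"spec.template.spec.containers[?(@.name==\"sdn\")].image"}]

Once oc apply rewrites the annotation to reference the v3.11 imagestreamtag, the image trigger controller updates the daemonset's container image field, and the daemonset's rolling update then restarts the sdn pod on every node.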
While it may be unexpected that the SDN daemonsets are updated as part of the control plane upgrade, the fact that updating the daemonset causes downtime is the real problem that needs to be resolved here. Moving to SDN team for their consideration.
We also noticed the restart issues as part of the 4.0 effort and have started working on making the SDN more resilient. Beyond those improvements, do you think there's anything left to do here, Scott? I don't think we'll ever split the SDN into multiple daemonsets, so while it is surprising that a "control plane" upgrade touches the nodes, it makes technical sense.
No, I think that's it, thanks.
The initial implementation, which moved the SDN upgrade from the control plane upgrade to the node upgrade, was found to be insufficient. Please see this comment for an explanation of the new approach being taken: https://bugzilla.redhat.com/show_bug.cgi?id=1660880#c35 Those changes are being forward-ported, and the earlier change that moved the SDN upgrade around is being reverted, in this PR: https://github.com/openshift/openshift-ansible/pull/11075
Fixed in build openshift-ansible-3.11.76-1
Fixed in openshift-ansible-3.11.82-1.git.0.f29227a.el7.noarch.

After /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml executed, only the masters were touched. All nodes are in Ready status.

[root@ip-172-18-2-246 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-0-37.ec2.internal     Ready     infra     2h        v1.10.0+b81c8f8
ip-172-18-12-206.ec2.internal   Ready     infra     2h        v1.10.0+b81c8f8
ip-172-18-13-244.ec2.internal   Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-14-241.ec2.internal   Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-15-49.ec2.internal    Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-2-246.ec2.internal    Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-6-25.ec2.internal     Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-9-142.ec2.internal    Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-9-234.ec2.internal    Ready     infra     2h        v1.10.0+b81c8f8

[root@ip-172-18-2-246 ~]# oc get all
NAME            READY     STATUS    RESTARTS   AGE
pod/ovs-4hpqn   1/1       Running   0          2h
pod/ovs-7czbj   1/1       Running   0          36m
pod/ovs-fkxws   1/1       Running   0          2h
pod/ovs-hhfjd   1/1       Running   0          37m
pod/ovs-hwgcp   1/1       Running   0          2h
pod/ovs-jqlw6   1/1       Running   0          2h
pod/ovs-lfbfw   1/1       Running   0          2h
pod/ovs-q8cm5   1/1       Running   0          2h
pod/ovs-z6w85   1/1       Running   1          39m
pod/sdn-4lj7z   1/1       Running   0          2h
pod/sdn-5cwf5   1/1       Running   0          2h
pod/sdn-75hxz   1/1       Running   0          37m
pod/sdn-82s6l   1/1       Running   0          39m
pod/sdn-d4tjr   1/1       Running   0          2h
pod/sdn-gj7xx   1/1       Running   0          2h
pod/sdn-k9bhm   1/1       Running   1          36m
pod/sdn-rr2rr   1/1       Running   0          2h
pod/sdn-srfcv   1/1       Running   0          2h

NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   9         9         9         3            9           <none>          2h
daemonset.apps/sdn   9         9         9         3            9           <none>          2h

NAME                                  DOCKER REPO                                            TAGS          UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.11,v3.10   About an hour ago
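For anyone re-verifying, a minimal sketch of the same check (assuming the daemonsets live in the openshift-sdn namespace, as the imagestream output above indicates):

  # oc -n openshift-sdn get ds sdn ovs
  # oc -n openshift-sdn get pods -o wide

An UP-TO-DATE count that matches only the number of masters (3 of 9 here), together with the unchanged pod ages on the compute and infra nodes, indicates the control plane playbook did not roll the SDN pods cluster-wide.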
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0326