Description of problem:
When upgrading from 3.10 to 3.11 and running just the control plane upgrade playbook, a task is run that upgrades the sdn and ovs daemonsets on all nodes.
This can lead to unwanted network downtime on the nodes during the update.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run the control plane upgrade playbook when upgrading from 3.10 to 3.11
2. $ ansible-playbook -i </path/to/inventory/file> \
Actual results:
Rolling redeploy of all sdn and ovs pods, causing a network outage on nodes due to slow image pulls.
Expected results:
Only the masters are touched.
- Can images be pre-pulled by the playbooks to avoid downtime waiting on image pulls?
- Running the control plane playbook should only upgrade masters and not touch nodes.
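To make the first suggestion concrete, here is a hedged sketch of what a pre-pull play could look like. This is not part of the shipped openshift-ansible playbooks; the image name is taken from the imagestream shown later in this report, and the `nodes` host group is an assumption:

```yaml
# Hypothetical pre-pull play (not part of openshift-ansible): pull the
# node image on every node ahead of the DaemonSet rollout, so that the
# sdn/ovs pod restarts do not block on a slow registry pull.
- hosts: nodes
  tasks:
    - name: Pre-pull the v3.11 sdn/ovs node image
      command: >
        docker pull docker-registry.default.svc:5000/openshift-sdn/node:v3.11
```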
Task that is run:
Essentially, the following is executed:
# oc apply -f <FILES>
FILES = https://github.com/openshift/openshift-ansible/tree/release-3.11/roles/openshift_sdn/files
1. It adds v3.11 tags to the imagestream.
2. It changes the DaemonSets' image.openshift.io/triggers annotation to point at the v3.11 imagestreamtags.
This triggers a rolling redeploy of the DaemonSets on all nodes in the cluster.
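For reference, the image.openshift.io/triggers annotation on the DaemonSet has roughly the shape below (the exact fieldPath and container name here are illustrative assumptions, not copied from the role). When the referenced imagestreamtag is re-tagged, the trigger controller patches the container image at that fieldPath, which starts the rolling redeploy:

```shell
# Illustrative shape of the image.openshift.io/triggers annotation on the
# sdn DaemonSet; the fieldPath and container name are assumptions.
TRIGGERS='[{"from":{"kind":"ImageStreamTag","name":"node:v3.11","namespace":"openshift-sdn"},"fieldPath":"spec.template.spec.containers[?(@.name==\"sdn\")].image"}]'
# Re-tagging node:v3.11 makes the trigger controller rewrite this
# fieldPath with the new image, rolling every sdn pod in the cluster.
echo "$TRIGGERS" | grep -o '"name":"node:v3.11"'
```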
While it may be unexpected that the SDN daemonsets are updated as part of the control plane upgrade, the fact that updating the daemonset causes downtime is the real problem that needs to be resolved here. Moving to SDN team for their consideration.
We also noticed the restart issues as part of the 4.0 effort, and have started working on more resiliency for the SDN.
Beyond those improvements, do you think there's anything left to do here, Scott? I don't think we'll ever split the SDN to multiple daemonsets, so while it is surprising that a "control plane" upgrade touches the nodes, it makes technical sense.
No, I think that's it, thanks.
The initial implementation, which moved the SDN upgrade from the control plane upgrade to the node upgrade, was found to be insufficient.
Please see this comment for an explanation of the new approach that's being taken.
Forward-porting those changes, and reverting the earlier change that moved the SDN upgrade around, in this PR.
Fixed in build openshift-ansible-3.11.76-1
After /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml was executed,
only the masters were touched.
All nodes are in Ready status.
[root@ip-172-18-2-246 ~]# oc get node
NAME STATUS ROLES AGE VERSION
ip-172-18-0-37.ec2.internal Ready infra 2h v1.10.0+b81c8f8
ip-172-18-12-206.ec2.internal Ready infra 2h v1.10.0+b81c8f8
ip-172-18-13-244.ec2.internal Ready compute 2h v1.10.0+b81c8f8
ip-172-18-14-241.ec2.internal Ready master 2h v1.11.0+d4cacc0
ip-172-18-15-49.ec2.internal Ready compute 2h v1.10.0+b81c8f8
ip-172-18-2-246.ec2.internal Ready master 2h v1.11.0+d4cacc0
ip-172-18-6-25.ec2.internal Ready master 2h v1.11.0+d4cacc0
ip-172-18-9-142.ec2.internal Ready compute 2h v1.10.0+b81c8f8
ip-172-18-9-234.ec2.internal Ready infra 2h v1.10.0+b81c8f8
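As a quick sanity check (a sketch, not part of the original verification), one can confirm from `oc get node` output that only the master role carries the upgraded kubelet version. Here two lines from the listing above stand in for a live `oc get node` pipe:

```shell
# Sketch: list the node roles running the upgraded v1.11 kubelet.
# Against a live cluster, replace the saved text below with the output
# of `oc get node --no-headers`.
nodes='ip-172-18-14-241.ec2.internal Ready master 2h v1.11.0+d4cacc0
ip-172-18-13-244.ec2.internal Ready compute 2h v1.10.0+b81c8f8'
echo "$nodes" | awk '$5 ~ /^v1\.11/ {print $3}' | sort -u
# → master
```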
[root@ip-172-18-2-246 ~]# oc get all
NAME READY STATUS RESTARTS AGE
pod/ovs-4hpqn 1/1 Running 0 2h
pod/ovs-7czbj 1/1 Running 0 36m
pod/ovs-fkxws 1/1 Running 0 2h
pod/ovs-hhfjd 1/1 Running 0 37m
pod/ovs-hwgcp 1/1 Running 0 2h
pod/ovs-jqlw6 1/1 Running 0 2h
pod/ovs-lfbfw 1/1 Running 0 2h
pod/ovs-q8cm5 1/1 Running 0 2h
pod/ovs-z6w85 1/1 Running 1 39m
pod/sdn-4lj7z 1/1 Running 0 2h
pod/sdn-5cwf5 1/1 Running 0 2h
pod/sdn-75hxz 1/1 Running 0 37m
pod/sdn-82s6l 1/1 Running 0 39m
pod/sdn-d4tjr 1/1 Running 0 2h
pod/sdn-gj7xx 1/1 Running 0 2h
pod/sdn-k9bhm 1/1 Running 1 36m
pod/sdn-rr2rr 1/1 Running 0 2h
pod/sdn-srfcv 1/1 Running 0 2h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/ovs 9 9 9 3 9 <none> 2h
daemonset.apps/sdn 9 9 9 3 9 <none> 2h
NAME DOCKER REPO TAGS UPDATED
imagestream.image.openshift.io/node docker-registry.default.svc:5000/openshift-sdn/node v3.11,v3.10 About an hour ago
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.