https://github.com/openshift/openshift-ansible/pull/10910 is the proposed fix. It defers the SDN update to the node upgrade phase of the upgrade, which will ensure control plane availability. The feedback from QE is that removing the image change trigger did not address the problem. Moving this back to the networking team for now, as the PR is merging. Once it is merged and verified by QE, we can clone this bug for 3.10 to backport the fix.
Adding to the above question, can you provide an estimate of when the backport will be implemented in 3.10?
*** Bug 1667441 has been marked as a duplicate of this bug. ***
In order to address Requirement 2 we can change the updateStrategy from RollingUpdate to OnDelete. When the node is drained, we delete the pods in the openshift-sdn namespace for that node, which triggers the SDN upgrade on the drained node.

So the outstanding work to be done is as follows:

1) Update the updateStrategy for SDN pods to OnDelete in the template. This affects new installs.
2) Add a control plane update task to mutate the SDN daemonsets to use the OnDelete updateStrategy. This must happen prior to 3).
3) Leave the SDN upgrade in the control plane in 3.10 (move it back in 3.11).
4) During the node drain, upgrade, restart process, delete all the SDN pods for a given node. Unfortunately you cannot select on nodeName, so this must be scripted. Something like:

    oc delete pod -n openshift-sdn $(oc get pods -n openshift-sdn -o wide --sort-by="{.spec.nodeName}" | grep {{ openshift.node.name }} | cut -f 1 -d ' ')

Maybe the oc_obj module is smarter and can do this for us? I doubt it.

Regarding Q1, the changes in https://github.com/openshift/openshift-ansible/pull/11021 will not address Requirement 2. It would only make sense to address Requirement 1 without addressing Requirement 2 if the customer is willing to forego node upgrades until we can deliver a more complete fix. I'd prefer we work on what's described above and see if that gets us a complete solution.
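
For illustration only, the per-node deletion in step 4 could be scripted roughly as below. This is a sketch, not what the playbooks actually do. The function name sdn_pods_on_node and the NODE_NAME variable are hypothetical, the daemonset names sdn and ovs in the commented patch line are an assumption, and xargs -r assumes GNU xargs. The function only filters `oc get pods -o wide` output read from stdin, so it can be tried without a cluster.

```shell
# Print the names of pods scheduled on node "$1".
# Input: `oc get pods -o wide` style output on stdin (header line first).
sdn_pods_on_node() {
  awk -v node="$1" '
    # Locate the NODE column from the header instead of hard-coding its position.
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "NODE") { col = i; break }
      next
    }
    # Exact match on the node name, so "node1" does not also match "node10".
    col && $col == node { print $1 }
  '
}

# Hypothetical usage against a live cluster (left commented out on purpose):
#
# Switch the daemonsets to OnDelete first (daemonset names are an assumption):
#   oc patch ds sdn -n openshift-sdn -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
#   oc patch ds ovs -n openshift-sdn -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
#
# After draining the node, delete its SDN pods so they are recreated from the new spec:
#   oc get pods -n openshift-sdn -o wide \
#     | sdn_pods_on_node "$NODE_NAME" \
#     | xargs -r oc delete pod -n openshift-sdn
```

Compared with the grep in the one-liner above, matching the NODE column exactly avoids picking up pods from a node whose name merely contains the target name.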
New PR submitted to restrict OVS pod restarts to the node upgrade, when the node is drained: release-3.10: https://github.com/openshift/openshift-ansible/pull/11050 Limited 3.10.n upgrade testing is in progress.
Fixed in build openshift-ansible-3.10.106-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0328