Bug 1657019

Summary: 3.11 control plane upgrade upgrades the OVS and SDN pods on all nodes, causing network downtime on the nodes
Product: OpenShift Container Platform
Reporter: Ryan Howe <rhowe>
Component: Cluster Version Operator
Assignee: Russell Teague <rteague>
Status: CLOSED ERRATA
QA Contact: Weihua Meng <wmeng>
Severity: high
Docs Contact:
Priority: urgent
Version: 3.11.0
CC: akrastin, aos-bugs, cdc, erich, jdesousa, jiajliu, jokerman, jolee, lmartinh, maupadhy, mmccomas, mrobson, openshift-bugs-escalate, parmstro, pdwyer, rteague, scuppett, sgarciam, steven.barre, yoliynyk
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When the OpenShift SDN/OVS DaemonSets were upgraded during a control plane upgrade with an updateStrategy of RollingUpdate, the pods were upgraded across the entire cluster at once. This caused unexpected network and application outages on nodes. This fix makes the following changes:
* Changed the updateStrategy for the SDN/OVS pods to OnDelete in the template, which affects new installs.
* Added a control plane upgrade task that modifies the SDN/OVS DaemonSets to use the OnDelete updateStrategy.
* Added a node upgrade task that deletes all SDN/OVS pods while nodes are drained.
Network outages for nodes should now only occur during the node upgrade, while nodes are drained.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-20 14:11:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1660880    
Bug Blocks:    

Description Ryan Howe 2018-12-06 20:45:42 UTC
Description of problem:
When upgrading from 3.10 to 3.11 and running just the control plane upgrade playbook, a task is run that upgrades the SDN and OVS daemonsets on all nodes.

This can lead to unwanted network downtime on the nodes during the update.

Version-Release number of selected component (if applicable):
3.11
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
ansible-2.6.10-1.el7ae.noarch
openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run the control plane upgrade playbook when upgrading from 3.10 to 3.11
2. $ ansible-playbook -i </path/to/inventory/file> \
     /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml

Actual results:
A rolling redeploy of all SDN and OVS pods is triggered, causing a network outage on nodes due to slow image pulls.

Expected results:
Only the masters should be touched.

Additional info:
- Can images be prepulled by the playbooks to avoid downtime while waiting on image pulls? (See the sketch below.)
- Running the control plane playbook should only upgrade the masters and not touch the nodes.
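
For illustration only, pre-pulling could look something like the ad-hoc command below, run against the node group in the inventory before the upgrade (the image name/tag and the "nodes" group name are assumptions; this is not an existing playbook):

 # ansible nodes -i </path/to/inventory/file> -m command \
     -a "docker pull registry.redhat.io/openshift3/ose-node:v3.11"

With the image already present on every node, a daemonset rollout would only pay the container restart cost rather than a full image pull.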


Task that is run:

https://github.com/openshift/openshift-ansible/blob/release-3.11/playbooks/common/openshift-cluster/upgrades/v3_11/upgrade_control_plane_part2.yml#L82-L86
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/tasks/main.yml#L50

Basically, the following is run for each of the files in the openshift_sdn role:

 # oc apply -f <file>

FILES = https://github.com/openshift/openshift-ansible/tree/release-3.11/roles/openshift_sdn/files

1. It adds v3.11 tags to the imagestream.
2. It changes the DaemonSet annotation image.openshift.io/triggers to reference the v3.11 imagestreamtags.
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/files/sdn-ovs.yaml#L10
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/files/sdn.yaml#L12

This triggers a rolling redeploy of the DaemonSet pods on all nodes in the cluster.
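
For reference, the resulting trigger can be inspected on a live cluster; a sketch (daemonset name and namespace taken from the linked files, dots in the annotation key escaped for jsonpath):

 # oc -n openshift-sdn get daemonset sdn \
     -o jsonpath='{.metadata.annotations.image\.openshift\.io/triggers}'

Once that annotation points at the v3.11 ImageStreamTag and the tag is imported, the image trigger controller rewrites the pod template image, and with a RollingUpdate strategy the daemonset controller then replaces every pod in the cluster.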

Comment 1 Scott Dodson 2018-12-06 20:57:26 UTC
While it may be unexpected that the SDN daemonsets are updated as part of the control plane upgrade, the fact that updating the daemonset causes downtime is the real problem that needs to be resolved here. Moving to SDN team for their consideration.

Comment 2 Casey Callendrello 2018-12-07 13:42:37 UTC
We also noticed the restart issues as part of the 4.0 effort, and have started working on making the SDN more resilient.

Beyond those improvements, do you think there's anything left to do here, Scott? I don't think we'll ever split the SDN into multiple daemonsets, so while it is surprising that a "control plane" upgrade touches the nodes, it makes technical sense.

Comment 3 Scott Dodson 2018-12-07 19:19:51 UTC
No, I think that's it, thanks.

Comment 16 Scott Dodson 2019-01-25 20:17:40 UTC
The initial implementation that moved the SDN upgrade from Control Plane to Node upgrade was found to be insufficient.

Please see this comment for an explanation of the new approach being taken.

https://bugzilla.redhat.com/show_bug.cgi?id=1660880#c35

Those changes are being forward-ported, and the earlier change that moved the SDN upgrade around is being reverted, in this PR:

https://github.com/openshift/openshift-ansible/pull/11075
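
In plain oc terms, the new approach roughly amounts to the following (a sketch only; the real work is done by openshift-ansible tasks, and the pod names are placeholders):

 # oc -n openshift-sdn patch daemonset sdn -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
 # oc -n openshift-sdn patch daemonset ovs -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'

Then, during the node upgrade while a node is drained:

 # oc -n openshift-sdn get pods -o wide | grep <node-name>
 # oc -n openshift-sdn delete pod <sdn-pod> <ovs-pod>

With OnDelete, the daemonset controller only recreates pods that are explicitly deleted, so the new image is rolled out one node at a time while that node is already drained.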

Comment 18 Russell Teague 2019-01-30 13:57:11 UTC
Fixed in build openshift-ansible-3.11.76-1

Comment 22 Weihua Meng 2019-02-11 08:48:55 UTC
Fixed.

openshift-ansible-3.11.82-1.git.0.f29227a.el7.noarch

After /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml was executed,
only the masters were touched.

All nodes are in Ready status.

[root@ip-172-18-2-246 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-0-37.ec2.internal     Ready     infra     2h        v1.10.0+b81c8f8
ip-172-18-12-206.ec2.internal   Ready     infra     2h        v1.10.0+b81c8f8
ip-172-18-13-244.ec2.internal   Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-14-241.ec2.internal   Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-15-49.ec2.internal    Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-2-246.ec2.internal    Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-6-25.ec2.internal     Ready     master    2h        v1.11.0+d4cacc0
ip-172-18-9-142.ec2.internal    Ready     compute   2h        v1.10.0+b81c8f8
ip-172-18-9-234.ec2.internal    Ready     infra     2h        v1.10.0+b81c8f8

[root@ip-172-18-2-246 ~]# oc get all
NAME            READY     STATUS    RESTARTS   AGE
pod/ovs-4hpqn   1/1       Running   0          2h
pod/ovs-7czbj   1/1       Running   0          36m
pod/ovs-fkxws   1/1       Running   0          2h
pod/ovs-hhfjd   1/1       Running   0          37m
pod/ovs-hwgcp   1/1       Running   0          2h
pod/ovs-jqlw6   1/1       Running   0          2h
pod/ovs-lfbfw   1/1       Running   0          2h
pod/ovs-q8cm5   1/1       Running   0          2h
pod/ovs-z6w85   1/1       Running   1          39m
pod/sdn-4lj7z   1/1       Running   0          2h
pod/sdn-5cwf5   1/1       Running   0          2h
pod/sdn-75hxz   1/1       Running   0          37m
pod/sdn-82s6l   1/1       Running   0          39m
pod/sdn-d4tjr   1/1       Running   0          2h
pod/sdn-gj7xx   1/1       Running   0          2h
pod/sdn-k9bhm   1/1       Running   1          36m
pod/sdn-rr2rr   1/1       Running   0          2h
pod/sdn-srfcv   1/1       Running   0          2h

NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   9         9         9         3            9           <none>          2h
daemonset.apps/sdn   9         9         9         3            9           <none>          2h

NAME                                  DOCKER REPO                                           TAGS          UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.11,v3.10   About an hour ago
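
Not part of the original verification output, but the strategy switch itself can also be confirmed directly; a sketch:

 # oc -n openshift-sdn get daemonset sdn ovs \
     -o custom-columns=NAME:.metadata.name,STRATEGY:.spec.updateStrategy.type

Both daemonsets should report OnDelete after the control plane upgrade.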

Comment 27 errata-xmlrpc 2019-02-20 14:11:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326