Bug 1881938

Summary:	migrator deployment doesn't tolerate masters
Product:	OpenShift Container Platform	Reporter:	Raif Ahmed <rahmed>
Component:	kube-storage-version-migrator	Assignee:	Luis Sanchez <sanchezl>
Status:	CLOSED ERRATA	QA Contact:	Ke Wang <kewang>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.5	CC:	aos-bugs, cblecker, sanchezl, surbania, wking
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 22:33:30 UTC	Type:	Bug

Description Raif Ahmed 2020-09-23 12:38:30 UTC

Description of problem:

The openshift-kube-storage-version-migrator-operator creates the migrator deployment without a toleration for master nodes:

https://github.com/openshift/cluster-kube-storage-version-migrator-operator/blob/release-4.5/bindata/kube-storage-version-migrator/deployment.yaml

The operator's own deployment, however, does have such a toleration:

https://github.com/openshift/cluster-kube-storage-version-migrator-operator/blob/51972754a030b5e9ed9df617de276f5deaad5066/manifests/0000_40_kube-storage-version-migrator-operator_07_deployment.yaml#L63-L65

This means that if a customer applies taints to the worker nodes, the migrator pods fail to schedule.
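The scheduling failure described above can be illustrated with a minimal sketch of toleration matching. It is a simplified model of how a toleration is compared against a node taint, not the scheduler's actual code. The `dedicated=infra` worker taint is a hypothetical example of a customer-applied taint; the master taint and the `Exists` toleration are the ones named in this report.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str            # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""          # an empty key matches every taint key
    operator: str = "Equal"  # "Equal" is the default when the field is unset
    value: str = ""
    effect: str = ""       # an empty effect matches every taint effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def can_schedule(tolerations: Iterable[Toleration], taints: Iterable[Taint]) -> bool:
    """A pod fits a node only if every hard taint on the node is tolerated."""
    tolerations = list(tolerations)
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )


if __name__ == "__main__":
    master = Taint("node-role.kubernetes.io/master", "NoSchedule")
    worker = Taint("dedicated", "NoSchedule", "infra")  # hypothetical customer taint

    # Migrator pod without the master toleration (the reported behavior).
    before = [Toleration("node.kubernetes.io/not-ready", "Exists", effect="NoExecute")]
    # Migrator pod with the master toleration added.
    after = before + [
        Toleration("node-role.kubernetes.io/master", "Exists", effect="NoSchedule")
    ]

    print("before, master node:", can_schedule(before, [master]))
    print("before, tainted worker:", can_schedule(before, [worker]))
    print("after, master node:", can_schedule(after, [master]))
```

In this model the pod without the master toleration fits neither the tainted workers nor the masters, which matches the reported symptom. Once the toleration is added, the masters become eligible.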

Comment 2 W. Trevor King 2021-03-15 22:28:01 UTC

*** Bug 1935347 has been marked as a duplicate of this bug. ***

Comment 4 Ke Wang 2021-03-17 10:46:47 UTC

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-16-221720   True        False         151m    Cluster version is 4.8.0-0.nightly-2021-03-16-221720

Check which master node the kube-storage pods are running on:
$ oc get pod -A -o wide | grep kube-storage
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-564cdcc96c-xzprc               1/1     Running       0          4h7m    10.130.0.63    ip-10-0-176-115.us-east-2.compute.internal   <none>           <none>
openshift-kube-storage-version-migrator            migrator-8bdb5f65f-22prn                                              1/1     Running       0          4h7m    10.130.0.62    ip-10-0-176-115.us-east-2.compute.internal   <none>           <none>

The new deployment has been applied to the pods:
$ oc describe pod -n openshift-kube-storage-version-migrator-operator kube-storage-version-migrator-operator-564cdcc96c-xzprc
Name:                 kube-storage-version-migrator-operator-564cdcc96c-xzprc
Namespace:            openshift-kube-storage-version-migrator-operator
...
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:          <none>


----------

$ oc describe pod -n openshift-kube-storage-version-migrator migrator-8bdb5f65f-22prn
Name:         migrator-8bdb5f65f-22prn
Namespace:    openshift-kube-storage-version-migrator
...
Node-Selectors:  <none>
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:          <none>
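
The manual check of the `Tolerations:` block above could be scripted. The following is a rough sketch that parses only the line format visible in this `oc describe pod` output (`key:effect op=Exists for 120s`); other output layouts are not handled. The function names `parse_tolerations` and `has_master_toleration` are hypothetical helpers, not part of any `oc` tooling.

```python
import re
from typing import Dict, List, Optional

# Matches entries such as:
#   node-role.kubernetes.io/master:NoSchedule op=Exists
#   node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
_TOLERATION_RE = re.compile(
    r"^(?P<key>[^:=\s]+)"
    r"(?:=(?P<value>[^:\s]+))?"
    r"(?::(?P<effect>\S+))?"
    r"(?:\s+op=(?P<operator>\S+))?"
    r"(?:\s+for\s+(?P<seconds>\d+)s)?$"
)


def parse_tolerations(describe_output: str) -> List[Dict[str, Optional[str]]]:
    """Extract the Tolerations block from `oc describe pod` text."""
    tolerations: List[Dict[str, Optional[str]]] = []
    in_block = False
    for line in describe_output.splitlines():
        if line.startswith("Tolerations:"):
            in_block = True
            entry = line[len("Tolerations:"):].strip()
        elif in_block and line[:1].isspace() and line.strip():
            # Indented continuation line of the Tolerations block.
            entry = line.strip()
        else:
            in_block = False
            continue
        if entry == "<none>":
            continue
        match = _TOLERATION_RE.match(entry)
        if match:
            tolerations.append(match.groupdict())
    return tolerations


def has_master_toleration(tolerations: List[Dict[str, Optional[str]]]) -> bool:
    """True if the pod tolerates the master NoSchedule taint via op=Exists."""
    return any(
        t["key"] == "node-role.kubernetes.io/master"
        and t["effect"] in (None, "NoSchedule")
        and t["operator"] == "Exists"
        for t in tolerations
    )
```

Feeding it the captured output of `oc describe pod -n openshift-kube-storage-version-migrator <pod>` and asserting `has_master_toleration(...)` would give a repeatable version of this verification step.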

Stop the kubelet service on the master node where the kube-storage pods are located:
$ oc debug node/ip-10-0-176-115.us-east-2.compute.internal
Starting pod/ip-10-0-176-115us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.176.115
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# systemctl stop kubelet

Removing debug pod ...

About 2 minutes after the master with the stopped kubelet service went NotReady, the kube-storage pods running on that master changed to Terminating and replacement pods were scheduled on other nodes.

$ oc get no
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-140-229.us-east-2.compute.internal   Ready      master   5h11m   v1.20.0+e1bc274
ip-10-0-157-114.us-east-2.compute.internal   Ready      worker   5h2m    v1.20.0+e1bc274
ip-10-0-176-115.us-east-2.compute.internal   NotReady   master   5h6m    v1.20.0+e1bc274
ip-10-0-185-50.us-east-2.compute.internal    Ready      worker   5h2m    v1.20.0+e1bc274
ip-10-0-221-102.us-east-2.compute.internal   Ready      master   5h7m    v1.20.0+e1bc274

$ date;echo;oc get pod -A -o wide | grep kube-storage
Wed Mar 17 05:19:24 EDT 2021

openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-564cdcc96c-8hv7z               1/1     Running       0          31s     10.129.0.89    ip-10-0-221-102.us-east-2.compute.internal   <none>           <none>
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-564cdcc96c-xzprc               1/1     Terminating   0          4h42m   10.130.0.63    ip-10-0-176-115.us-east-2.compute.internal   <none>           <none>
openshift-kube-storage-version-migrator            migrator-8bdb5f65f-22prn                                              1/1     Terminating   0          4h42m   10.130.0.62    ip-10-0-176-115.us-east-2.compute.internal   <none>           <none>
openshift-kube-storage-version-migrator            migrator-8bdb5f65f-cq26s                                              1/1     Running       0          31s     10.128.2.155   ip-10-0-157-114.us-east-2.compute.internal   <none>           <none>
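
The delay of roughly two minutes is consistent with the `for 120s` shown on the not-ready and unreachable tolerations. The sketch below is a simplified model of how a `NoExecute` eviction delay follows from a pod's tolerations. It assumes only `op=Exists` tolerations, as in the output above, and it does not model the time the cluster takes to notice the stopped kubelet. The function name and the dict layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

TaintT = Tuple[str, str]  # (key, effect)


def _matches(tol: Dict[str, object], key: str, effect: str) -> bool:
    """Simplified match: only op=Exists tolerations are considered."""
    return (
        tol.get("operator") == "Exists"
        and tol.get("key") in (None, "", key)
        and tol.get("effect") in (None, "", effect)
    )


def no_execute_eviction_delay(
    node_taints: List[TaintT], tolerations: List[Dict[str, object]]
) -> Optional[int]:
    """Seconds until eviction: 0 means immediately, None means never."""
    delays: List[Optional[int]] = []
    for key, effect in node_taints:
        if effect != "NoExecute":
            continue  # only NoExecute taints evict running pods
        matching = [t for t in tolerations if _matches(t, key, effect)]
        if not matching:
            return 0  # an untolerated NoExecute taint evicts right away
        delays.append(matching[0].get("seconds"))  # None = tolerate forever
    bounded = [d for d in delays if d is not None]
    if not bounded:
        return None
    return max(0, min(bounded))


# Tolerations as shown for the migrator pod in this comment.
MIGRATOR_TOLERATIONS = [
    {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule", "operator": "Exists"},
    {"key": "node.kubernetes.io/memory-pressure", "effect": "NoSchedule", "operator": "Exists"},
    {"key": "node.kubernetes.io/not-ready", "effect": "NoExecute", "operator": "Exists", "seconds": 120},
    {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute", "operator": "Exists", "seconds": 120},
]

if __name__ == "__main__":
    taints = [("node.kubernetes.io/unreachable", "NoExecute")]
    print(no_execute_eviction_delay(taints, MIGRATOR_TOLERATIONS))
```

Under these assumptions a node carrying the unreachable `NoExecute` taint yields a delay of 120 seconds for both pods, in line with the timing seen here.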

The results above are as expected, so moving the bug to VERIFIED.

Comment 7 errata-xmlrpc 2021-07-27 22:33:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438