Bug 1705649
Summary: | [reliability] Cluster with halted master did not reschedule operators after 5m of being down | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
Component: | Node | Assignee: | ravig <rgudimet> |
Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.1.0 | CC: | aos-bugs, decarr, eparis, gblomqui, gklein, jokerman, mmccomas, rgudimet, sjenning |
Target Milestone: | --- | ||
Target Release: | 4.1.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-06-04 10:48:19 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Clayton Coleman
2019-05-02 16:33:56 UTC
The specific "tolerate down node for X seconds" should probably be set on operators to 1-2m instead of using the default. That would ensure progress. Also, this could be zone thresholds.

```
$ oc get machine -A
NAMESPACE               NAME                                                INSTANCE              STATE     TYPE       REGION     ZONE         AGE
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-master-0                  i-0b8ba53a09e86b120   running   m4.xlarge  us-east-1  us-east-1a   122m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-master-1                  i-06ef695e0b5d5d8f3   running   m4.xlarge  us-east-1  us-east-1b   122m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-master-2                  i-019e54566efae9ccc   running   m4.xlarge  us-east-1  us-east-1a   122m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-worker-us-east-1a-6k5cz   i-071d957aabcec3819   running   m4.large   us-east-1  us-east-1a   121m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-worker-us-east-1a-vzvhw   i-0730151cbbc6b5c6f   running   m4.large   us-east-1  us-east-1a   121m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-worker-us-east-1b-vprmg   i-0fc1d784a6c99335f   running   m4.large   us-east-1  us-east-1b   121m

$ oc get nodes -o wide -L failure-domain.beta.kubernetes.io/zone
NAME                           STATUS     ROLES    AGE    VERSION             INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                   KERNEL-VERSION         CONTAINER-RUNTIME                          ZONE
ip-10-0-129-154.ec2.internal   Ready      master   124m   v1.13.4+48f1990d7   10.0.129.154   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-129-97.ec2.internal    NotReady   master   124m   v1.13.4+48f1990d7   10.0.129.97    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-131-44.ec2.internal    Ready      master   42m    v1.13.4+48f1990d7   10.0.131.44    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-135-184.ec2.internal   Ready      worker   119m   v1.13.4+48f1990d7   10.0.135.184   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-141-8.ec2.internal     Ready      worker   119m   v1.13.4+48f1990d7   10.0.141.8     <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-152-43.ec2.internal    Ready      worker   119m   v1.13.4+48f1990d7   10.0.152.43    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1b
ip-10-0-159-171.ec2.internal   Ready      master   124m   v1.13.4+48f1990d7   10.0.159.171   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1b
```

Ok, there are a few issues here:

1) Components tolerating all taints, or the unreachable taint for infinite time.

I wrote an e2e test to find these components: https://github.com/openshift/origin/pull/22752

The current list of offenders is:

openshift-apiserver-operator/openshift-apiserver-operator
openshift-authentication-operator/authentication-operator
openshift-authentication/integrated-oauth-server
openshift-cloud-credential-operator/cloud-credential-operator
openshift-cluster-machine-approver/machine-approver
openshift-cluster-node-tuning-operator/cluster-node-tuning-operator
openshift-cluster-storage-operator/cluster-storage-operator
openshift-cluster-version/cluster-version-operator
openshift-console/downloads
openshift-controller-manager-operator/openshift-controller-manager-operator
openshift-dns-operator/dns-operator
openshift-ingress-operator/ingress-operator
openshift-kube-apiserver-operator/kube-apiserver-operator
openshift-kube-controller-manager-operator/kube-controller-manager-operator
openshift-kube-scheduler-operator/openshift-kube-scheduler-operator
openshift-machine-config-operator/etcd-quorum-guard
openshift-marketplace/marketplace-operator
openshift-monitoring/cluster-monitoring-operator
openshift-operator-lifecycle-manager/catalog-operator
openshift-operator-lifecycle-manager/olm-operator
openshift-operator-lifecycle-manager/olm-operators
openshift-operator-lifecycle-manager/packageserver
openshift-service-ca-operator/service-ca-operator
openshift-service-ca/apiservice-cabundle-injector
openshift-service-ca/configmap-cabundle-injector
openshift-service-ca/service-serving-cert-signer
openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator
openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator

2) Mutating admission plugins adding tolerations to pods.

https://github.com/openshift/origin/tree/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/defaulttolerationseconds

Adds a not-ready and unreachable toleration for 300s. This is supposed to align with the 5m default pod-eviction-timeout on the kube-controller-manager. Fragile and must stay in sync.

https://github.com/openshift/origin/tree/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/podtolerationrestriction

Adds a memory-pressure toleration (infinite duration) for non-besteffort pods.

This is a huge problem in my mind in that it steals eviction policy away from the kubelet.

Consider the kubelet coming under disk pressure. It will set the disk-pressure condition. TaintNodeByCondition will taint the node with the disk-pressure taint. All pods that don't have a toleration with NoExecute effect for that taint are evicted by the TaintBasedEviction controller. Meanwhile, the kubelet is also evicting pods, more intelligently ranked on disk usage.

I think we might have to disable TaintNodeByCondition and TaintBasedEviction until we can fix some of these glaring issues. Ravi, what do you think?

> This is a huge problem in my mind in that it steals eviction policy away from the kubelet.
>
> Consider the kubelet coming under disk pressure. It will set the disk-pressure condition. TaintNodeByCondition will taint the node with the disk-pressure taint. All pods that don't have a toleration with NoExecute effect for that taint are evicted by the TaintBasedEviction controller. Meanwhile, the kubelet is also evicting pods, more intelligently ranked on disk usage.

Ok, Ravi pointed out to me this isn't an issue, because the taint added by TaintNodeByCondition is a NoSchedule taint, not a NoExecute taint. So that is fine.

> I think we might have to disable TaintNodeByCondition and TaintBasedEviction until we can fix some of these glaring issues.

The issue isn't as bad as I first thought, but this is still an escape hatch if we need it.

The tl;dr here is that the pod-eviction-timeout on the KCM no longer has any effect; the defaulttolerationseconds admission plugin now controls the default pod eviction timeout. Also, the timeout is now a PER-POD setting via the tolerationSeconds on the unreachable taint, with the default set by the defaulttolerationseconds admission plugin. That plugin has flags to control the default tolerationSeconds: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/defaulttolerationseconds/admission.go#L34-L40 The default is 300s (5m) to align with the historical default for pod-eviction-timeout on the KCM.
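To make the per-pod behavior described above concrete, here is a rough Python sketch of what the defaulttolerationseconds admission plugin does, based only on the description in this bug (the real plugin is Go code in k8s.io/kubernetes; the function and constant names here are illustrative, and the sketch simplifies the plugin's toleration-matching logic):

```python
# Illustrative sketch of the defaulttolerationseconds admission behavior:
# if a pod does not already tolerate the not-ready or unreachable taint,
# add a NoExecute toleration with a 300s default (matching the historical
# 5m pod-eviction-timeout on the KCM). This is a simplified model, not
# the actual Go plugin.

DEFAULT_TOLERATION_SECONDS = 300  # the plugin's flag-controlled default

NOT_READY = "node.kubernetes.io/not-ready"
UNREACHABLE = "node.kubernetes.io/unreachable"

def default_toleration_seconds(pod_spec):
    """Mutate a pod spec (a plain dict) the way the plugin would."""
    tolerations = pod_spec.setdefault("tolerations", [])
    present = {t.get("key") for t in tolerations}
    for key in (NOT_READY, UNREACHABLE):
        if key not in present:
            tolerations.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": DEFAULT_TOLERATION_SECONDS,
            })
    return pod_spec

# A pod with no tolerations picks up both 300s defaults; a pod that
# already sets its own tolerationSeconds (e.g. 120) keeps it for that key.
spec = default_toleration_seconds({})
```

Because the timeout is now per-pod, an operator can opt into the shorter 1-2m window suggested in the description simply by declaring its own not-ready/unreachable tolerations with a smaller tolerationSeconds, and the plugin leaves those alone.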
Current list of pending PRs:

https://github.com/openshift/cluster-monitoring-operator/pull/342
https://github.com/openshift/machine-api-operator/pull/310
https://github.com/openshift/cluster-samples-operator/pull/142
https://github.com/openshift/cloud-credential-operator/pull/64
https://github.com/openshift/cluster-kube-apiserver-operator/pull/454
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/246
https://github.com/openshift/cluster-openshift-apiserver-operator/pull/200
https://github.com/openshift/cluster-storage-operator/pull/31
https://github.com/openshift/cluster-node-tuning-operator/pull/50
https://github.com/openshift/cluster-machine-approver/pull/25
https://github.com/openshift/console-operator/pull/224
https://github.com/openshift/cluster-ingress-operator/pull/226
https://github.com/operator-framework/operator-lifecycle-manager/pull/843
https://github.com/openshift/service-ca-operator/pull/56

One more: https://github.com/openshift/cluster-kube-scheduler-operator/pull/124

Merged so far:

https://github.com/openshift/cluster-samples-operator/pull/142
https://github.com/openshift/cluster-storage-operator/pull/31
https://github.com/openshift/console-operator/pull/224

12 left...

No additional PRs merged today. Ravi, can you provide color as to why these aren't merging? CI? Unresponsive component owners?
More have merged:

https://github.com/openshift/cluster-monitoring-operator/pull/342
https://github.com/openshift/machine-api-operator/pull/310
https://github.com/openshift/cloud-credential-operator/pull/64
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/246
https://github.com/openshift/cluster-ingress-operator/pull/226
https://github.com/openshift/service-ca-operator/pull/56

These remain:

https://github.com/openshift/cluster-kube-apiserver-operator/pull/454
https://github.com/openshift/cluster-kube-scheduler-operator/pull/124
https://github.com/openshift/cluster-openshift-apiserver-operator/pull/200
https://github.com/openshift/cluster-node-tuning-operator/pull/50
https://github.com/openshift/cluster-machine-approver/pull/25
https://github.com/operator-framework/operator-lifecycle-manager/pull/843

Appears only https://github.com/operator-framework/operator-lifecycle-manager/pull/843 is outstanding. Thank you Ravi! So close.

Whoops, 2 left:

https://github.com/openshift/cluster-kube-apiserver-operator/pull/454
https://github.com/operator-framework/operator-lifecycle-manager/pull/843

As of 4.1.0-0.ci-2019-05-07-132218 the e2e test shows:

openshift-apiserver-operator/openshift-apiserver-operator tolerates all taints
openshift-authentication/integrated-oauth-server tolerates all taints
openshift-authentication/oauth-openshift tolerates all taints
openshift-cloud-credential-operator/cloud-credential-operator tolerates all taints
openshift-cluster-machine-approver/machine-approver tolerates all taints
openshift-cluster-node-tuning-operator/cluster-node-tuning-operator tolerates all taints
openshift-cluster-version/cluster-version-operator tolerates all taints
openshift-dns-operator/dns-operator tolerates all taints
openshift-ingress-operator/ingress-operator tolerates all taints
openshift-kube-apiserver-operator/kube-apiserver-operator tolerates all taints
openshift-kube-controller-manager-operator/kube-controller-manager-operator tolerates all taints
openshift-kube-scheduler-operator/openshift-kube-scheduler-operator tolerates all taints
openshift-machine-config-operator/etcd-quorum-guard tolerates node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with no tolerationSeconds
openshift-marketplace/marketplace-operator tolerates all taints
openshift-network-operator/network-operator tolerates node.kubernetes.io/not-ready with no tolerationSeconds
openshift-operator-lifecycle-manager/catalog-operator tolerates all taints
openshift-operator-lifecycle-manager/olm-operator tolerates all taints
openshift-operator-lifecycle-manager/olm-operators tolerates all taints
openshift-operator-lifecycle-manager/packageserver tolerates all taints
openshift-service-ca-operator/service-ca-operator tolerates all taints

There is still some weirdness going on somewhere: https://github.com/openshift/cloud-credential-operator/blob/master/manifests/01_deployment.yaml#L172-L183

However, in my cluster, the tolerations are:

```
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 120
- operator: Exists    # <----
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 120
- effect: NoSchedule
  key: node.kubernetes.io/memory-pressure
  operator: Exists
```

Is there yet another admission plugin doing something?

Clayton, hit some fun here. Apparently, whatever the CVO does to apply the new operator deployment manifests merges the new toleration list with the old.

tl;dr There would be no way to remove tolerations from an operator deployment going forward.
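The merge problem described above can be shown with a toy sketch. This is not the CVO's actual merge code (that lives in its Go resource-merge logic); it is a simplified model of "union the new toleration list with the old", which is enough to reproduce the symptom:

```python
# Toy model of the merge behavior described above: applying a new
# manifest unions its tolerations with whatever is already on the
# deployment, so a toleration removed from the manifest (here the
# blanket {"operator": "Exists"}) survives the apply.
# Hypothetical helper; not the real CVO implementation.

def naive_merge_tolerations(existing, desired):
    """Union-merge: keep every existing entry, append new ones."""
    merged = list(existing)
    for tol in desired:
        if tol not in merged:
            merged.append(tol)
    return merged

# What the deployment already had (the stale tolerate-everything entry).
old = [{"operator": "Exists"}]

# What the fixed manifest now declares instead.
new = [{"key": "node.kubernetes.io/not-ready", "operator": "Exists",
        "effect": "NoExecute", "tolerationSeconds": 120}]

merged = naive_merge_tolerations(old, new)
# The blanket toleration is still present after the "apply".
```

This matches what the cloud-credential-operator pod showed: the manifest's new 120s tolerations were added, but the old bare `operator: Exists` entry was never removed.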
New list of unaddressed components (on a fresh cluster now):

openshift-authentication-operator/authentication-operator tolerates all taints
openshift-authentication/oauth-openshift tolerates all taints
openshift-dns-operator/dns-operator tolerates all taints
openshift-machine-config-operator/etcd-quorum-guard tolerates node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with no tolerationSeconds
openshift-marketplace/marketplace-operator tolerates all taints

Now merged:

https://github.com/openshift/cluster-kube-apiserver-operator/pull/454
https://github.com/operator-framework/operator-lifecycle-manager/pull/843

> openshift-authentication-operator/authentication-operator tolerates all taints
> openshift-authentication/oauth-openshift tolerates all taints

https://github.com/openshift/cluster-authentication-operator/pull/128

> openshift-dns-operator/dns-operator tolerates all taints

https://github.com/openshift/cluster-dns-operator/pull/107

> openshift-machine-config-operator/etcd-quorum-guard tolerates node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with no tolerationSeconds

https://github.com/openshift/machine-config-operator/pull/700

> openshift-marketplace/marketplace-operator tolerates all taints

https://github.com/operator-framework/operator-marketplace/pull/178 (merged, but not seeing the effect in 4.1.0-0.ci-2019-05-07-180607)

I was able to confirm marketplace-operator is fixed on 4.1.0-0.ci-2019-05-07-213620. https://github.com/openshift/machine-config-operator/pull/700 and https://github.com/openshift/cluster-authentication-operator/pull/128 are merged but not in a release yet. Only https://github.com/openshift/cluster-dns-operator/pull/107 remains unmerged.

Last one through. Let's go, new release payloads!

As of 4.1.0-0.ci-2019-05-08-025712, only one left.
openshift-operator-lifecycle-manager/olm-operators tolerates all taints

Must have missed it in https://github.com/operator-framework/operator-lifecycle-manager/pull/843 :-/

https://github.com/operator-framework/operator-lifecycle-manager/pull/850 merged; just waiting for a release build.

Abandoning the attempt to change olm-operators. It is actually not an operator. Adding it to the e2e test whitelist.

This should pass now, and we can move to MODIFIED after it does: https://github.com/openshift/origin/pull/22752

I ran the e2e locally against 4.1.0-0.ci-2019-05-08-172654 and it passed.

I'm moving this to MODIFIED as the code is fixed. We should continue to track whether https://github.com/openshift/origin/pull/22752 merges, since that is an e2e test to prevent this mess from coming back. But QE should be able to validate now.

https://github.com/openshift/origin/pull/22752 merged, so we don't lose this ground.

Note to QE: The validation for this is to ensure that all _operator_ pods do not 1) tolerate all taints or 2) tolerate the not-ready or unreachable taint with no tolerationSeconds (i.e. indefinitely).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
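The QE validation note above can be expressed as a small check. This is a hypothetical helper that mirrors the intent of the e2e test, not the test's actual code; it flags the two failure modes named in the note, given a pod's toleration list (in Kubernetes, a toleration with `operator: Exists` and no key matches every taint):

```python
# Flags the two problems QE is asked to check for:
#   1) a blanket toleration (operator: Exists with no key), which matches all taints
#   2) a not-ready/unreachable toleration with no tolerationSeconds (i.e. indefinite)
# Hypothetical helper, mirroring the e2e test's intent rather than its code.

EVICTION_TAINTS = {"node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable"}

def toleration_problems(tolerations):
    """Return a list of human-readable findings for a pod's tolerations."""
    problems = []
    for tol in tolerations:
        if tol.get("operator") == "Exists" and not tol.get("key"):
            problems.append("tolerates all taints")
        elif tol.get("key") in EVICTION_TAINTS and "tolerationSeconds" not in tol:
            problems.append(f"tolerates {tol['key']} with no tolerationSeconds")
    return problems

# A fixed operator pod (bounded tolerationSeconds) produces no findings;
# a pod with the stale blanket toleration is flagged.
ok = [{"key": "node.kubernetes.io/unreachable", "operator": "Exists",
       "effect": "NoExecute", "tolerationSeconds": 120}]
bad = [{"operator": "Exists"}]
```

In practice QE would feed this the tolerations pulled from each operator pod (e.g. via `oc get pods -A -o json`) and expect no findings outside the whitelist.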