Description of problem:
The RHEL7 worker scaleup job fails at the step:

TASK [DEBUG - Worker config rollout failed]

The worker config is still updating:

    "lastTransitionTime": "2020-04-24T15:03:47Z",
    "message": "All nodes are updating to rendered-worker-3867bb78787415f48206beac222f2629",
    "reason": "",
    "status": "True",
    "type": "Updating"

and one node never becomes available:

    "degradedMachineCount": 0,
    "machineCount": 6,
    "observedGeneration": 2,
    "readyMachineCount": 5,
    "unavailableMachineCount": 1,
    "updatedMachineCount": 5

The pods.json artifact shows the MCD pod on that node is not schedulable:

    "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-04-24T15:04:22Z",
                "message": "0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.",
                "reason": "Unschedulable",
                "status": "False",
                "type": "PodScheduled"
            }
        ],
        "phase": "Pending",

How reproducible:
Example CI job failure:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3406/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/5858

CI search for affected jobs:
https://search.svc.ci.openshift.org/?search=msg%3A+Worker+config+rollout+failed&maxAge=48h&context=1&type=build-log&name=rhel7&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
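For context on why these counters constitute a failure: the scaleup job is effectively waiting for the worker MachineConfigPool to report every machine updated and ready. Below is a minimal Go sketch of that condition, using the counter names from the status quoted above (an illustration only, not the actual playbook check):

```
package main

import "fmt"

// poolStatus mirrors the MachineConfigPool status counters quoted above.
type poolStatus struct {
	MachineCount            int
	ReadyMachineCount       int
	UpdatedMachineCount     int
	UnavailableMachineCount int
	DegradedMachineCount    int
}

// rolledOut is roughly the condition the scaleup job waits for before the
// "Worker config rollout failed" debug task fires: every machine updated,
// ready, and none unavailable or degraded. (Sketch only; the real playbook
// polls the pool's Updated condition.)
func rolledOut(s poolStatus) bool {
	return s.UpdatedMachineCount == s.MachineCount &&
		s.ReadyMachineCount == s.MachineCount &&
		s.UnavailableMachineCount == 0 &&
		s.DegradedMachineCount == 0
}

func main() {
	// Counter values from the failing run in this report.
	s := poolStatus{MachineCount: 6, ReadyMachineCount: 5, UpdatedMachineCount: 5, UnavailableMachineCount: 1}
	fmt.Println("worker pool rolled out:", rolledOut(s)) // false: one node never becomes ready
}
```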
Still observing this in CI. Three related failures in the last 8 hours.
Ok, to give a relatively conclusive summary of what we've determined is happening with the CI job (a big thanks to Russell, who was able to reproduce the issue and pinpoint errors): when the RHEL scaleup happens, roughly 10% of the time one node fails scheduling for the DNS and MCD pods almost indefinitely. For example, the MCD pod failure is logged as:

I0521 18:32:05.801811 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-f9cf2: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting

The DNS error is logged as:

Unable to schedule openshift-dns/dns-default-tzkt6: no fit: 0/9 nodes are available: 8 node(s) didn't match node selector.; waiting

with no other indication from the scheduler of why it cannot schedule onto the remaining node. The node itself is perfectly fine and has no notable differences from the nodes that were able to take the pods. The errors above loop continuously as scheduling is retried, but the pods never get scheduled.

The interesting part is that after ~4 hours in our reproduced environment, the `openshift-kube-scheduler` pod that had been trying to schedule the MCD/DNS pods failed leader election:

E0521 18:39:53.314878 1 reflector.go:382] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:233: Failed to watch *v1.Pod: Get https://localhost:6443/api/v1/pods?allowWatchBookmarks=true&fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=145175&timeoutSeconds=514&watch=true: dial tcp [::1]:6443: connect: connection refused
E0521 18:39:53.314949 1 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?allowWatchBookmarks=true&resourceVersion=21510&timeout=5m36s&timeoutSeconds=336&watch=true: dial tcp [::1]:6443: connect: connection refused
E0521 18:39:53.315033 1 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?allowWatchBookmarks=true&resourceVersion=21511&timeout=6m46s&timeoutSeconds=406&watch=true: dial tcp [::1]:6443: connect: connection refused
E0521 18:39:53.315102 1 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?allowWatchBookmarks=true&resourceVersion=21510&timeout=8m46s&timeoutSeconds=526&watch=true: dial tcp [::1]:6443: connect: connection refused
I0521 18:39:53.907043 1 leaderelection.go:277] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition
F0521 18:39:53.907081 1 server.go:244] leaderelection lost

and was therefore restarted. Another `openshift-kube-scheduler` pod took over the scheduling and was immediately able to schedule the pods. The errors above then disappear once the MCD is able to reconcile the node. This leads me to believe that the actual pod specs for MCD/DNS are fine.

So the question is: why is the `openshift-kube-scheduler` unable to schedule pods onto the node? Unfortunately the kube-scheduler does not log the reason ("8 node(s) didn't have free ports for the requested pod ports" just means that the other 8 nodes already have MCDs scheduled and running on them; it is not the actual failure).
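To make that last point concrete, here is a minimal Go sketch (not the scheduler's actual code; the hostPort number is purely illustrative) of the kind of host-port filter that produces the "didn't have free ports" reason. The 8 nodes already running an MCD fail it because the daemon claims a host port; the new node should pass it, so the reasons the scheduler prints describe every node except the one that actually matters:

```
package main

import "fmt"

// node models only what matters for the host-port filter: which host
// ports are already claimed by pods running on it.
type node struct {
	name      string
	usedPorts map[int]bool
}

// fitsHostPorts mimics the scheduler's "free ports" filter: a pod that
// requests a hostPort can only land on a node where that port is unused.
func fitsHostPorts(n node, wantPort int) bool {
	return !n.usedPorts[wantPort]
}

func main() {
	const mcdHostPort = 9001 // illustrative; the real MCD hostPort may differ

	// 8 nodes already run an MCD (host port taken); the 9th is the new
	// RHEL7 node, which still has the port free.
	nodes := make([]node, 0, 9)
	for i := 0; i < 8; i++ {
		nodes = append(nodes, node{
			name:      fmt.Sprintf("worker-%d", i),
			usedPorts: map[int]bool{mcdHostPort: true},
		})
	}
	nodes = append(nodes, node{name: "new-rhel7-node", usedPorts: map[int]bool{}})

	fit := 0
	for _, n := range nodes {
		if fitsHostPorts(n, mcdHostPort) {
			fit++
		}
	}
	// Prints "1/9": the new node passes this filter, so whatever rejects
	// it is a different, unlogged reason.
	fmt.Printf("%d/%d nodes pass the host-port filter\n", fit, len(nodes))
}
```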
Created attachment 1690779 [details] openshift-kube-scheduler-ip-10-0-159-94.us-east-2.compute.internal.oc_logs
Created attachment 1690780 [details] openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal.oc_logs
Created attachment 1690781 [details] openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal.oc_logs_previous
Created attachment 1690782 [details] openshift-kube-scheduler-ip-10-0-192-24.us-east-2.compute.internal.oc_logs
This failure behavior was observed on a dev cluster as follows. Three RHEL7 nodes were added to an existing 4.5 cluster running 1.18.2. The MCD pod (openshift-machine-config-operator/machine-config-daemon-f9cf2) was not being scheduled on one of the RHEL7 nodes (ip-10-0-180-19.us-east-2.compute.internal), and a DNS pod (openshift-dns/dns-default-tzkt6) was also failing to schedule on the same node:

I0521 13:04:05.617917 1 factory.go:462] Unable to schedule openshift-dns/dns-default-tzkt6: no fit: 0/9 nodes are available: 8 node(s) didn't match node selector.; waiting
I0521 13:04:05.621952 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-f9cf2: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting

Attempts to discover why the pods were failing to be scheduled did not reveal anything conclusive. After about 6 hours, a cert rotation event appears to have caused a restart of the openshift-kube-scheduler pod (openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal) that had been waiting to schedule the MCD pod. Another openshift-kube-scheduler pod (openshift-kube-scheduler-ip-10-0-159-94.us-east-2.compute.internal) was then able to successfully schedule the two pods on the node mentioned above.

The attached logs cover the three openshift-kube-scheduler pods and include the previous container logs for the pod that restarted.
Should this BZ be moved to a different component?
Moving to kube-scheduler to see if there's more insight there
The first explanation for this behavior that comes to my mind is that the nodeInfo cache in the openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal pod was not updated. Unfortunately (and as Yu Qi Zhang already mentioned),
```
I0521 18:26:05.795361 1 factory.go:462] Unable to schedule openshift-dns/dns-default-tzkt6: no fit: 0/9 nodes are available: 8 node(s) didn't match node selector.; waiting
I0521 18:26:05.800622 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-f9cf2: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting
```
does not say why the 9th node (likely the ip-10-0-180-19.us-east-2.compute.internal node) is not suitable for scheduling.

I also checked the artifacts in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3406/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/5858 around the time the same errors are reported (I0424 15:05:56.613413 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-hkvkm: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting). Checking the kubelet logs, all worker and master nodes are registered and there is no "Error updating node status" error message in the journal logs.

One observation worth mentioning: grepping the master journal logs for "Recording NodeHasSufficientMemory event message for node", the message stops being logged around 14:29:24. The same message for the worker nodes stops being logged around 15:03:59. The first "Unable to schedule openshift-machine-config-operator/machine-config-daemon-" error message in the scheduler logs starts to appear around 15:04:22.

Timestamps of worker nodes getting successfully registered (from the workers-journal):
```
Apr 24 14:42:32.440919 ip-10-0-129-145 hyperkube[1455]: I0424 14:42:32.440869 1455 kubelet_node_status.go:73] Successfully registered node ip-10-0-129-145.ec2.internal
Apr 24 14:42:38.431563 ip-10-0-128-242 hyperkube[1429]: I0424 14:42:38.431529 1429 kubelet_node_status.go:73] Successfully registered node ip-10-0-128-242.ec2.internal
Apr 24 14:43:00.134236 ip-10-0-144-89 hyperkube[1426]: I0424 14:43:00.134201 1426 kubelet_node_status.go:73] Successfully registered node ip-10-0-144-89.ec2.internal
Apr 24 15:03:42.617502 ip-10-0-128-155.ec2.internal hyperkube[1576]: I0424 15:03:42.617158 1576 kubelet_node_status.go:73] Successfully registered node ip-10-0-128-155.ec2.internal
Apr 24 15:03:44.475164 ip-10-0-141-206.ec2.internal hyperkube[1640]: I0424 15:03:44.474858 1640 kubelet_node_status.go:73] Successfully registered node ip-10-0-141-206.ec2.internal
Apr 24 15:03:53.834886 ip-10-0-145-69.ec2.internal hyperkube[1593]: I0424 15:03:53.834594 1593 kubelet_node_status.go:73] Successfully registered node ip-10-0-145-69.ec2.internal
```

Timestamps of individual nodes getting added to the scheduler's cache:
```
I0424 14:45:08.306485 1 node_tree.go:86] Added node "ip-10-0-140-211.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 14:45:08.306589 1 node_tree.go:86] Added node "ip-10-0-143-86.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 14:45:08.306649 1 node_tree.go:86] Added node "ip-10-0-144-89.ec2.internal" in group "us-east-1:\x00:us-east-1b" to NodeTree
I0424 14:45:08.306692 1 node_tree.go:86] Added node "ip-10-0-147-236.ec2.internal" in group "us-east-1:\x00:us-east-1b" to NodeTree
I0424 14:45:08.306757 1 node_tree.go:86] Added node "ip-10-0-128-242.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 14:45:08.306793 1 node_tree.go:86] Added node "ip-10-0-129-145.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 15:03:42.621349 1 node_tree.go:86] Added node "ip-10-0-128-155.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 15:03:44.478163 1 node_tree.go:86] Added node "ip-10-0-141-206.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 15:03:53.836196 1 node_tree.go:86] Added node "ip-10-0-145-69.ec2.internal" in group "us-east-1:\x00:us-east-1b" to NodeTree
```

Two of the MCO daemon pods got scheduled:
```
I0424 15:04:24.678088 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-xkpd4 is bound successfully on node "ip-10-0-141-206.ec2.internal", 8 nodes evaluated, 1 nodes were found feasible.
I0424 15:04:34.070190 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-mzmnm is bound successfully on node "ip-10-0-145-69.ec2.internal", 9 nodes evaluated, 2 nodes were found feasible.
I0424 15:04:35.820095 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-hkvkm: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting
```
Notice the second "pod is bound successfully on node" message, which says 2 nodes were found feasible. And then, ~1.75s later, no node was found feasible.

From the openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal.oc_logs_previous.txt logs:
```
I0521 12:48:37.783263 1 node_tree.go:86] Added node "ip-10-0-135-130.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2a" to NodeTree
I0521 12:48:37.783345 1 node_tree.go:86] Added node "ip-10-0-159-94.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2a" to NodeTree
I0521 12:48:37.783444 1 node_tree.go:86] Added node "ip-10-0-165-255.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2b" to NodeTree
I0521 12:48:37.783498 1 node_tree.go:86] Added node "ip-10-0-186-242.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2b" to NodeTree
I0521 12:48:37.783567 1 node_tree.go:86] Added node "ip-10-0-192-24.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2c" to NodeTree
I0521 12:48:37.783619 1 node_tree.go:86] Added node "ip-10-0-194-124.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2c" to NodeTree
I0521 13:01:39.848636 1 node_tree.go:86] Added node "ip-10-0-180-19.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2b" to NodeTree
I0521 13:01:40.947714 1 node_tree.go:86] Added node "ip-10-0-204-91.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2c" to NodeTree
I0521 13:01:54.345662 1 node_tree.go:86] Added node "ip-10-0-147-219.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2a" to NodeTree
```
```
I0521 13:02:31.233266 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-z9t56 is bound successfully on node "ip-10-0-204-91.us-east-2.compute.internal", 8 nodes evaluated, 1 nodes were found feasible.
I0521 13:02:44.847864 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-v9cpg is bound successfully on node "ip-10-0-147-219.us-east-2.compute.internal", 9 nodes evaluated, 2 nodes were found feasible.
I0521 13:02:53.795277 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-f9cf2: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting
```
One can see the same pattern: 2 nodes were found feasible, but then no node was left for the last machine-config-daemon pod.
Going back to https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3406/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/5858, each machine-config-daemon pod has an affinity to a specific node:

machine-config-daemon-xkpd4 to ip-10-0-141-206.ec2.internal
machine-config-daemon-mzmnm to ip-10-0-145-69.ec2.internal
machine-config-daemon-hkvkm to ip-10-0-128-155.ec2.internal

So the
```
I0424 15:04:34.070190 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-mzmnm is bound successfully on node "ip-10-0-145-69.ec2.internal", 9 nodes evaluated, 2 nodes were found feasible.
```
log message is actually nonsensical, since there is supposed to be only 1 feasible node.
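For background on why only one node should ever be feasible for each of these pods: once DaemonSet pods started being placed by the default scheduler, the DaemonSet controller pins each pod to its target node with a required node affinity on the metadata.name field. Below is a minimal Go sketch of that affinity shape, using the k8s.io/api/core/v1 types (the node name is taken from the logs above; this is an illustration, not the MCO's or the DaemonSet controller's actual code):

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// daemonSetNodeAffinity builds the kind of affinity the DaemonSet controller
// attaches to each of its pods: a required matchFields term on metadata.name
// that allows exactly one node, so at most one node can ever be feasible.
func daemonSetNodeAffinity(nodeName string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

func main() {
	// Node name taken from the machine-config-daemon-mzmnm case above.
	a := daemonSetNodeAffinity("ip-10-0-145-69.ec2.internal")
	fmt.Printf("%+v\n", a.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution)
}
```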
Related upstream issues:
- FailedScheduling doesn't report all reasons for nodes failing: https://github.com/kubernetes/kubernetes/issues/91340
- Daemonset does not provision to all nodes, 0 nodes available: https://github.com/kubernetes/kubernetes/issues/91601
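To illustrate the first issue, which is what this report keeps hitting: the FailedScheduling message is built by aggregating per-node failure reasons, so a node that gets rejected without a recorded reason silently drops out of the breakdown and the counts no longer add up to the node total. A toy Go reconstruction of that aggregation (not the scheduler's actual code):

```
package main

import (
	"fmt"
	"sort"
	"strings"
)

// failureReason is what a filter reports when it rejects a node; an empty
// string stands in for a rejection that never produces a message (the
// situation this bug seems to hit on the 9th node).
type failureReason string

// summarize builds a FailedScheduling-style message by counting and joining
// per-node reasons, so a node rejected without a reason is simply missing
// from the breakdown.
func summarize(total int, reasons []failureReason) string {
	counts := map[failureReason]int{}
	for _, r := range reasons {
		if r != "" {
			counts[r]++
		}
	}
	parts := make([]string, 0, len(counts))
	for r, n := range counts {
		parts = append(parts, fmt.Sprintf("%d node(s) %s", n, r))
	}
	sort.Strings(parts)
	return fmt.Sprintf("0/%d nodes are available: %s.", total, strings.Join(parts, ", "))
}

func main() {
	// 8 nodes rejected because the host port is taken, 1 node rejected
	// with no recorded reason.
	reasons := make([]failureReason, 0, 9)
	for i := 0; i < 8; i++ {
		reasons = append(reasons, "didn't have free ports for the requested pod ports")
	}
	reasons = append(reasons, "")
	// Prints the same shape as the message seen in this bug: the breakdown
	// only accounts for 8 of the 9 nodes.
	fmt.Println(summarize(9, reasons))
}
```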
I’m adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available.
> I will revisit this bug when the information is available.

Sounds like this should be NEEDINFO... somebody. What information are you missing, and how do you expect it to become available?
> Sounds like this should be NEEDINFO... somebody. What information are you missing, and how do you expect it to become available?

A stable reproducer is required. The scheduler logs currently do not provide sufficient information, and code analysis has not turned up any hints about where to look. Thus, sharing more scheduler logs will not help.
No failures observed in the last 14 days. Will open a new bug if this issue becomes a problem again. Thanks.