Description of problem:
The RHEL7 worker scaleup job fails at the step:

TASK [DEBUG - Worker config rollout failed]

The worker config is still updating:

    "lastTransitionTime": "2020-04-24T15:03:47Z",
    "message": "All nodes are updating to rendered-worker-3867bb78787415f48206beac222f2629",
    "reason": "",
    "status": "True",
    "type": "Updating"

and one node never becomes available:

    "degradedMachineCount": 0,
    "machineCount": 6,
    "observedGeneration": 2,
    "readyMachineCount": 5,
    "unavailableMachineCount": 1,
    "updatedMachineCount": 5

The pods.json artifact shows the MCD pod on that node is not schedulable:

    "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-04-24T15:04:22Z",
                "message": "0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.",
                "reason": "Unschedulable",
                "status": "False",
                "type": "PodScheduled"
            }
        ],
        "phase": "Pending",

How reproducible:
Example CI job failure:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3406/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/5858

CI search for affected jobs:
https://search.svc.ci.openshift.org/?search=msg%3A+Worker+config+rollout+failed&maxAge=48h&context=1&type=build-log&name=rhel7&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
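For context on why these counters constitute a failure: the scaleup job is effectively waiting for the worker MachineConfigPool to report every machine updated and ready. Below is a minimal Go sketch of that condition, using the counter names from the status quoted above (an illustration only, not the actual playbook check):

```
package main

import "fmt"

// poolStatus mirrors the MachineConfigPool status counters quoted above.
type poolStatus struct {
	MachineCount            int
	ReadyMachineCount       int
	UpdatedMachineCount     int
	UnavailableMachineCount int
	DegradedMachineCount    int
}

// rolledOut is roughly the condition the scaleup job waits for before the
// "Worker config rollout failed" debug task fires: every machine updated,
// ready, and none unavailable or degraded. (Sketch only; the real playbook
// polls the pool's Updated condition.)
func rolledOut(s poolStatus) bool {
	return s.UpdatedMachineCount == s.MachineCount &&
		s.ReadyMachineCount == s.MachineCount &&
		s.UnavailableMachineCount == 0 &&
		s.DegradedMachineCount == 0
}

func main() {
	// Counter values from the failing run in this report.
	s := poolStatus{MachineCount: 6, ReadyMachineCount: 5, UpdatedMachineCount: 5, UnavailableMachineCount: 1}
	fmt.Println("worker pool rolled out:", rolledOut(s)) // false: one node never becomes ready
}
```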
Still observing this in CI. Three related failures in the last 8 hours.
Ok, to give a relatively conclusive summary of what we've determined is happening with the CI job (a big thanks to Russell, who was able to reproduce the issue and pinpoint errors): when the RHEL scaleup happens, roughly 10% of the time one node fails scheduling for the DNS and MCD pods almost indefinitely. For example, the MCD pod failure is logged as:

I0521 18:32:05.801811 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-f9cf2: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting

The DNS error is logged as:

Unable to schedule openshift-dns/dns-default-tzkt6: no fit: 0/9 nodes are available: 8 node(s) didn't match node selector.; waiting

with no other indication from the scheduler of why it cannot schedule onto the remaining node. The node itself is perfectly fine and has no notable differences from the nodes that were able to take the pods. The errors above loop continuously as scheduling is retried, but the pods never get scheduled.

The interesting part is that after ~4 hours in our reproduced environment, the `openshift-kube-scheduler` pod that had been trying to schedule the MCD/DNS pods failed leader election:

E0521 18:39:53.314878 1 reflector.go:382] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:233: Failed to watch *v1.Pod: Get https://localhost:6443/api/v1/pods?allowWatchBookmarks=true&fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=145175&timeoutSeconds=514&watch=true: dial tcp [::1]:6443: connect: connection refused
E0521 18:39:53.314949 1 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?allowWatchBookmarks=true&resourceVersion=21510&timeout=5m36s&timeoutSeconds=336&watch=true: dial tcp [::1]:6443: connect: connection refused
E0521 18:39:53.315033 1 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?allowWatchBookmarks=true&resourceVersion=21511&timeout=6m46s&timeoutSeconds=406&watch=true: dial tcp [::1]:6443: connect: connection refused
E0521 18:39:53.315102 1 reflector.go:382] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?allowWatchBookmarks=true&resourceVersion=21510&timeout=8m46s&timeoutSeconds=526&watch=true: dial tcp [::1]:6443: connect: connection refused
I0521 18:39:53.907043 1 leaderelection.go:277] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition
F0521 18:39:53.907081 1 server.go:244] leaderelection lost

and was therefore restarted. Another `openshift-kube-scheduler` pod took over the scheduling and was immediately able to schedule the pods. The errors above then disappear once the MCD is able to reconcile the node. This leads me to believe that the actual pod specs for MCD/DNS are fine.

So the question is: why is the `openshift-kube-scheduler` unable to schedule pods onto the node? Unfortunately the kube-scheduler does not log the reason ("8 node(s) didn't have free ports for the requested pod ports" just means that the other 8 nodes already have MCDs scheduled and running on them; it is not the actual failure).
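To make that last point concrete, here is a minimal Go sketch (not the scheduler's actual code; the hostPort number is purely illustrative) of the kind of host-port filter that produces the "didn't have free ports" reason. The 8 nodes already running an MCD fail it because the daemon claims a host port; the new node should pass it, so the reasons the scheduler prints describe every node except the one that actually matters:

```
package main

import "fmt"

// node models only what matters for the host-port filter: which host
// ports are already claimed by pods running on it.
type node struct {
	name      string
	usedPorts map[int]bool
}

// fitsHostPorts mimics the scheduler's "free ports" filter: a pod that
// requests a hostPort can only land on a node where that port is unused.
func fitsHostPorts(n node, wantPort int) bool {
	return !n.usedPorts[wantPort]
}

func main() {
	const mcdHostPort = 9001 // illustrative; the real MCD hostPort may differ

	// 8 nodes already run an MCD (host port taken); the 9th is the new
	// RHEL7 node, which still has the port free.
	nodes := make([]node, 0, 9)
	for i := 0; i < 8; i++ {
		nodes = append(nodes, node{
			name:      fmt.Sprintf("worker-%d", i),
			usedPorts: map[int]bool{mcdHostPort: true},
		})
	}
	nodes = append(nodes, node{name: "new-rhel7-node", usedPorts: map[int]bool{}})

	fit := 0
	for _, n := range nodes {
		if fitsHostPorts(n, mcdHostPort) {
			fit++
		}
	}
	// Prints "1/9": the new node passes this filter, so whatever rejects
	// it is a different, unlogged reason.
	fmt.Printf("%d/%d nodes pass the host-port filter\n", fit, len(nodes))
}
```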
Created attachment 1690779 [details] openshift-kube-scheduler-ip-10-0-159-94.us-east-2.compute.internal.oc_logs
Created attachment 1690780 [details] openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal.oc_logs
Created attachment 1690781 [details] openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal.oc_logs_previous
Created attachment 1690782 [details] openshift-kube-scheduler-ip-10-0-192-24.us-east-2.compute.internal.oc_logs
This failure behavior was observed on a dev cluster as follows. Three RHEL7 nodes were added to an existing 4.5 cluster running 1.18.2. The MCD pod (openshift-machine-config-operator/machine-config-daemon-f9cf2) was not being scheduled on one of the RHEL7 nodes (ip-10-0-180-19.us-east-2.compute.internal), and a DNS pod (openshift-dns/dns-default-tzkt6) was also failing to schedule on the same node:

I0521 13:04:05.617917 1 factory.go:462] Unable to schedule openshift-dns/dns-default-tzkt6: no fit: 0/9 nodes are available: 8 node(s) didn't match node selector.; waiting
I0521 13:04:05.621952 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-f9cf2: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting

Attempts to discover why the pods were failing to be scheduled did not reveal anything conclusive. After about 6 hours, a cert rotation event appears to have caused a restart of the openshift-kube-scheduler pod (openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal) that had been waiting to schedule the MCD pod. Another openshift-kube-scheduler pod (openshift-kube-scheduler-ip-10-0-159-94.us-east-2.compute.internal) was then able to successfully schedule the two pods on the node mentioned above.

The attached logs cover the three openshift-kube-scheduler pods and include the previous container logs for the pod that restarted.
Should this BZ be moved to a different component?
Moving to kube-scheduler to see if there's more insight there
The first explanation for this behavior that comes to my mind is that the nodeInfo cache in the openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal pod was not updated. Unfortunately (and as Yu Qi Zhang already mentioned),
```
I0521 18:26:05.795361 1 factory.go:462] Unable to schedule openshift-dns/dns-default-tzkt6: no fit: 0/9 nodes are available: 8 node(s) didn't match node selector.; waiting
I0521 18:26:05.800622 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-f9cf2: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting
```
does not say why the 9th node (likely the ip-10-0-180-19.us-east-2.compute.internal node) is not suitable for scheduling.

I also checked the artifacts in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3406/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/5858 around the time the same errors are reported (I0424 15:05:56.613413 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-hkvkm: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting). Checking the kubelet logs, all worker and master nodes are registered and there is no "Error updating node status" error message in the journal logs.

One observation worth mentioning: grepping the master journal logs for "Recording NodeHasSufficientMemory event message for node", the message stops being logged around 14:29:24. The same message for the worker nodes stops being logged around 15:03:59. The first "Unable to schedule openshift-machine-config-operator/machine-config-daemon-" error message in the scheduler logs starts to appear around 15:04:22.

Timestamps of worker nodes getting successfully registered (from the workers-journal):
```
Apr 24 14:42:32.440919 ip-10-0-129-145 hyperkube[1455]: I0424 14:42:32.440869 1455 kubelet_node_status.go:73] Successfully registered node ip-10-0-129-145.ec2.internal
Apr 24 14:42:38.431563 ip-10-0-128-242 hyperkube[1429]: I0424 14:42:38.431529 1429 kubelet_node_status.go:73] Successfully registered node ip-10-0-128-242.ec2.internal
Apr 24 14:43:00.134236 ip-10-0-144-89 hyperkube[1426]: I0424 14:43:00.134201 1426 kubelet_node_status.go:73] Successfully registered node ip-10-0-144-89.ec2.internal
Apr 24 15:03:42.617502 ip-10-0-128-155.ec2.internal hyperkube[1576]: I0424 15:03:42.617158 1576 kubelet_node_status.go:73] Successfully registered node ip-10-0-128-155.ec2.internal
Apr 24 15:03:44.475164 ip-10-0-141-206.ec2.internal hyperkube[1640]: I0424 15:03:44.474858 1640 kubelet_node_status.go:73] Successfully registered node ip-10-0-141-206.ec2.internal
Apr 24 15:03:53.834886 ip-10-0-145-69.ec2.internal hyperkube[1593]: I0424 15:03:53.834594 1593 kubelet_node_status.go:73] Successfully registered node ip-10-0-145-69.ec2.internal
```

Timestamps of individual nodes getting added to the scheduler's cache:
```
I0424 14:45:08.306485 1 node_tree.go:86] Added node "ip-10-0-140-211.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 14:45:08.306589 1 node_tree.go:86] Added node "ip-10-0-143-86.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 14:45:08.306649 1 node_tree.go:86] Added node "ip-10-0-144-89.ec2.internal" in group "us-east-1:\x00:us-east-1b" to NodeTree
I0424 14:45:08.306692 1 node_tree.go:86] Added node "ip-10-0-147-236.ec2.internal" in group "us-east-1:\x00:us-east-1b" to NodeTree
I0424 14:45:08.306757 1 node_tree.go:86] Added node "ip-10-0-128-242.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 14:45:08.306793 1 node_tree.go:86] Added node "ip-10-0-129-145.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 15:03:42.621349 1 node_tree.go:86] Added node "ip-10-0-128-155.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 15:03:44.478163 1 node_tree.go:86] Added node "ip-10-0-141-206.ec2.internal" in group "us-east-1:\x00:us-east-1a" to NodeTree
I0424 15:03:53.836196 1 node_tree.go:86] Added node "ip-10-0-145-69.ec2.internal" in group "us-east-1:\x00:us-east-1b" to NodeTree
```

Two of the MCO daemon pods got scheduled:
```
I0424 15:04:24.678088 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-xkpd4 is bound successfully on node "ip-10-0-141-206.ec2.internal", 8 nodes evaluated, 1 nodes were found feasible.
I0424 15:04:34.070190 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-mzmnm is bound successfully on node "ip-10-0-145-69.ec2.internal", 9 nodes evaluated, 2 nodes were found feasible.
I0424 15:04:35.820095 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-hkvkm: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting
```
Notice the second "pod is bound successfully on node" message, which says 2 nodes were found feasible. And then, ~1.75s later, no node was found feasible.

From the openshift-kube-scheduler-ip-10-0-186-242.us-east-2.compute.internal.oc_logs_previous.txt logs:
```
I0521 12:48:37.783263 1 node_tree.go:86] Added node "ip-10-0-135-130.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2a" to NodeTree
I0521 12:48:37.783345 1 node_tree.go:86] Added node "ip-10-0-159-94.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2a" to NodeTree
I0521 12:48:37.783444 1 node_tree.go:86] Added node "ip-10-0-165-255.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2b" to NodeTree
I0521 12:48:37.783498 1 node_tree.go:86] Added node "ip-10-0-186-242.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2b" to NodeTree
I0521 12:48:37.783567 1 node_tree.go:86] Added node "ip-10-0-192-24.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2c" to NodeTree
I0521 12:48:37.783619 1 node_tree.go:86] Added node "ip-10-0-194-124.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2c" to NodeTree
I0521 13:01:39.848636 1 node_tree.go:86] Added node "ip-10-0-180-19.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2b" to NodeTree
I0521 13:01:40.947714 1 node_tree.go:86] Added node "ip-10-0-204-91.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2c" to NodeTree
I0521 13:01:54.345662 1 node_tree.go:86] Added node "ip-10-0-147-219.us-east-2.compute.internal" in group "us-east-2:\x00:us-east-2a" to NodeTree
```
```
I0521 13:02:31.233266 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-z9t56 is bound successfully on node "ip-10-0-204-91.us-east-2.compute.internal", 8 nodes evaluated, 1 nodes were found feasible.
I0521 13:02:44.847864 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-v9cpg is bound successfully on node "ip-10-0-147-219.us-east-2.compute.internal", 9 nodes evaluated, 2 nodes were found feasible.
I0521 13:02:53.795277 1 factory.go:462] Unable to schedule openshift-machine-config-operator/machine-config-daemon-f9cf2: no fit: 0/9 nodes are available: 8 node(s) didn't have free ports for the requested pod ports.; waiting
```
One can see the same pattern: 2 nodes were found feasible, but then no node was left for the last machine-config-daemon pod.
Going back to https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3406/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/5858, each machine-config-daemon pod has an affinity to a specific node:

machine-config-daemon-xkpd4 to ip-10-0-141-206.ec2.internal
machine-config-daemon-mzmnm to ip-10-0-145-69.ec2.internal
machine-config-daemon-hkvkm to ip-10-0-128-155.ec2.internal

So the
```
I0424 15:04:34.070190 1 scheduler.go:728] pod openshift-machine-config-operator/machine-config-daemon-mzmnm is bound successfully on node "ip-10-0-145-69.ec2.internal", 9 nodes evaluated, 2 nodes were found feasible.
```
log message is actually nonsensical, since there is supposed to be only 1 feasible node.
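For background on why only one node should ever be feasible for each of these pods: once DaemonSet pods started being placed by the default scheduler, the DaemonSet controller pins each pod to its target node with a required node affinity on the metadata.name field. Below is a minimal Go sketch of that affinity shape, using the k8s.io/api/core/v1 types (the node name is taken from the logs above; this is an illustration, not the MCO's or the DaemonSet controller's actual code):

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// daemonSetNodeAffinity builds the kind of affinity the DaemonSet controller
// attaches to each of its pods: a required matchFields term on metadata.name
// that allows exactly one node, so at most one node can ever be feasible.
func daemonSetNodeAffinity(nodeName string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

func main() {
	// Node name taken from the machine-config-daemon-mzmnm case above.
	a := daemonSetNodeAffinity("ip-10-0-145-69.ec2.internal")
	fmt.Printf("%+v\n", a.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution)
}
```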
Related upstream issues:
- FailedScheduling doesn't report all reasons for nodes failing: https://github.com/kubernetes/kubernetes/issues/91340
- Daemonset does not provision to all nodes, 0 nodes available: https://github.com/kubernetes/kubernetes/issues/91601
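To illustrate the first issue, which is what this report keeps hitting: the FailedScheduling message is built by aggregating per-node failure reasons, so a node that gets rejected without a recorded reason silently drops out of the breakdown and the counts no longer add up to the node total. A toy Go reconstruction of that aggregation (not the scheduler's actual code):

```
package main

import (
	"fmt"
	"sort"
	"strings"
)

// failureReason is what a filter reports when it rejects a node; an empty
// string stands in for a rejection that never produces a message (the
// situation this bug seems to hit on the 9th node).
type failureReason string

// summarize builds a FailedScheduling-style message by counting and joining
// per-node reasons, so a node rejected without a reason is simply missing
// from the breakdown.
func summarize(total int, reasons []failureReason) string {
	counts := map[failureReason]int{}
	for _, r := range reasons {
		if r != "" {
			counts[r]++
		}
	}
	parts := make([]string, 0, len(counts))
	for r, n := range counts {
		parts = append(parts, fmt.Sprintf("%d node(s) %s", n, r))
	}
	sort.Strings(parts)
	return fmt.Sprintf("0/%d nodes are available: %s.", total, strings.Join(parts, ", "))
}

func main() {
	// 8 nodes rejected because the host port is taken, 1 node rejected
	// with no recorded reason.
	reasons := make([]failureReason, 0, 9)
	for i := 0; i < 8; i++ {
		reasons = append(reasons, "didn't have free ports for the requested pod ports")
	}
	reasons = append(reasons, "")
	// Prints the same shape as the message seen in this bug: the breakdown
	// only accounts for 8 of the 9 nodes.
	fmt.Println(summarize(9, reasons))
}
```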
I’m adding UpcomingSprint, because I lack the information to properly root cause the bug. I will revisit this bug when the information is available.
> I will revisit this bug when the information is available.

Sounds like this should be NEEDINFO... somebody. What information are you missing, and how do you expect it to become available?
> Sounds like this should be NEEDINFO... somebody. What information are you missing, and how do you expect it to become available?

A stable reproducer is required. The scheduler logs currently do not provide sufficient information, and code analysis has not turned up any hints about where to look. Thus, sharing more scheduler logs will not help.
No failures observed in the last 14 days. Will open a new bug if this issue becomes a problem again. Thanks.