Description of problem:
An upgrade from 4.7.23 to 4.8.4 failed because 1 of 2 worker nodes became unavailable on a cluster loaded with 400 services.

Version-Release number of selected component (if applicable): 4.7.23

How reproducible:
Install a 4.6.42 cluster with 2 n1-standard-4 worker nodes. Load it with 400 services (400 pods created). Upgrade the cluster from 4.6.42 to 4.7.23 and then to 4.8.4.

Steps to Reproduce:
1. Create a GCP 4.6.42 sdn cluster with 2 n1-standard-4 worker nodes.
2. Run the max-services scale-ci test to generate 400 services (400 pods created) on the cluster. https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/max-services/77
3. Upgrade the cluster from 4.6.42 to 4.7.23.
4. Upgrade the cluster from 4.7.23 to 4.8.4.

Actual results:
The upgrade from 4.7.23 to 4.8.4 did not finish after more than 12 hours. The unavailable node's conditions:

Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----                 ------    -----------------                 ------------------                ------             -------
  NetworkUnavailable   False     Mon, 01 Jan 0001 00:00:00 +0000   Tue, 10 Aug 2021 10:58:09 +0800   RouteCreated       openshift-sdn cleared kubelet-set NoRouteCreated
  MemoryPressure       Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure          Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready                Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.

Expected results:
The upgrade from 4.7.23 to 4.8.4 should succeed.
Additional info:
----------------------------------------
% oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.23    True        True          24h     Unable to apply 4.8.4: an unknown error has occurred: MultipleErrors
----------------------------------------
% oc get clusterversion -o json | jq ".items[0].status.history"
[
  {
    "completionTime": null,
    "image": "registry.ci.openshift.org/ocp/release@sha256:841535acc09ca8412cd17e8f7702eceda1cac688ccc281278f108675c30de270",
    "startedTime": "2021-08-10T06:09:49Z",
    "state": "Partial",
    "verified": true,
    "version": "4.8.4"
  },
  {
    "completionTime": "2021-08-10T05:56:35Z",
    "image": "registry.ci.openshift.org/ocp/release@sha256:fb00f5e16a2092c3f15113ad8de0d2e841abdb43c9c39794522fc79784a3efb0",
    "startedTime": "2021-08-10T04:51:19Z",
    "state": "Completed",
    "verified": true,
    "version": "4.7.23"
  },
  {
    "completionTime": "2021-08-10T03:09:19Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:59e2e85f5d1bcb4440765c310b6261387ffc3f16ed55ca0a79012367e15b558b",
    "startedTime": "2021-08-10T02:46:37Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.42"
  }
]
----------------------------------------
% oc get machineset -A
NAMESPACE               NAME                     DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   qili46s-z4896-worker-a   1         1         1       1           27h
openshift-machine-api   qili46s-z4896-worker-b   1         1                             27h
openshift-machine-api   qili46s-z4896-worker-c   0         0                             27h
openshift-machine-api   qili46s-z4896-worker-f   0         0                             27h
----------------------------------------
% oc get machine -n openshift-machine-api
NAME                           PHASE     TYPE            REGION        ZONE            AGE
qili46s-z4896-worker-a-4ctgx   Running   n1-standard-4   us-central1   us-central1-a   27h
qili46s-z4896-worker-b-mw5f9   Running   n1-standard-4   us-central1   us-central1-b   27h
----------------------------------------
% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.4     True        False         False      19h
baremetal                                  4.8.4     True        False         False      19h
cloud-credential                           4.8.4     True        False         False      22h
cluster-autoscaler                         4.8.4     True        False         False      22h
config-operator                            4.8.4     True        False         False      22h
console                                    4.8.4     True        False         False      18h
csi-snapshot-controller                    4.8.4     True        False         False      19h
dns                                        4.7.23    True        False         False      22h
etcd                                       4.8.4     True        False         False      22h
image-registry                             4.7.23    True        True          True       19h
ingress                                    4.8.4     True        False         True       18h
insights                                   4.8.4     True        False         False      22h
kube-apiserver                             4.8.4     True        False         False      22h
kube-controller-manager                    4.8.4     True        False         False      22h
kube-scheduler                             4.8.4     True        False         False      22h
kube-storage-version-migrator              4.8.4     True        False         False      18h
machine-api                                4.8.4     True        False         False      22h
machine-approver                           4.8.4     True        False         False      22h
machine-config                             4.7.23    False       False         True       19h
marketplace                                4.8.4     True        False         False      19h
monitoring                                 4.7.23    False       True          True       18h
network                                    4.7.23    True        True          True       22h
node-tuning                                4.8.4     True        True          False      18h
openshift-apiserver                        4.8.4     True        False         False      19h
openshift-controller-manager               4.8.4     True        False         False      22h
openshift-samples                          4.8.4     True        False         False      18h
operator-lifecycle-manager                 4.8.4     True        False         False      22h
operator-lifecycle-manager-catalog         4.8.4     True        False         False      22h
operator-lifecycle-manager-packageserver   4.8.4     True        False         False      19h
service-ca                                 4.8.4     True        False         False      22h
storage                                    4.8.4     True        True          False      19h
----------------------------------------
% oc get pod -A | wc -l
868
% oc get pod -A | egrep -v "Completed|Running" | wc -l
427
----------------------------------------
% oc get pod -A | egrep -v "Completed|Running"
NAMESPACE                                           NAME                                                            READY   STATUS            RESTARTS   AGE
benchmark-operator                                  benchmark-controller-manager-575cc4768b-2qxn4                   0/2     Pending           0          19h
benchmark-operator                                  benchmark-controller-manager-575cc4768b-rxbql                   2/2     Terminating       0          19h
max-services-cf852503-8ef9-4622-9a1f-142618f7f23b   max-serv-1-1-c6f7b9b49-9gqwn                                    1/1     Terminating       0          19h
max-services-cf852503-8ef9-4622-9a1f-142618f7f23b   max-serv-1-1-c6f7b9b49-mj8bw                                    0/1     Pending           0          19h
:
:
:
max-services-cf852503-8ef9-4622-9a1f-142618f7f23b   max-serv-99-1-54d9d895cd-lmq9g                                  0/1     Pending           0          19h
openshift-cluster-csi-drivers                       gcp-pd-csi-driver-node-t4pnk                                    3/3     Terminating       0          20h
openshift-cluster-node-tuning-operator              tuned-9x9hn                                                     1/1     Terminating       0          20h
openshift-image-registry                            image-registry-74476bb9b6-j84z5                                 0/1     Pending           0          18h
openshift-image-registry                            image-registry-74476bb9b6-s96tk                                 0/1     Pending           0          18h
openshift-image-registry                            image-registry-9d7f47dd4-8bfmh                                  1/1     Terminating       0          19h
openshift-image-registry                            node-ca-vcmjt                                                   1/1     Terminating       0          20h
openshift-ingress-canary                            ingress-canary-vn6vk                                            1/1     Terminating       0          20h
openshift-ingress                                   router-default-5ffbcbb44f-67cpx                                 1/1     Terminating       0          19h
openshift-ingress                                   router-default-7dcc5499f9-p9lxn                                 0/1     Pending           0          18h
openshift-kube-apiserver                            kube-apiserver-qili46s-z4896-master-1.c.openshift-qe.internal   0/5     PodInitializing   0          54s
openshift-kube-storage-version-migrator             migrator-f58676cd4-92cjw                                        1/1     Terminating       0          19h
openshift-marketplace                               certified-operators-h4g57                                       0/1     Pending           0          18h
openshift-marketplace                               certified-operators-q8f4d                                       1/1     Terminating       0          19h
openshift-marketplace                               certified-operators-zrfgs                                       0/1     Pending           0          19h
openshift-marketplace                               community-operators-2vhc6                                       0/1     Pending           0          18h
openshift-marketplace                               community-operators-k7lgb                                       0/1     Pending           0          19h
openshift-marketplace                               community-operators-phj6h                                       1/1     Terminating       0          19h
openshift-marketplace                               redhat-marketplace-4vl6f                                        0/1     Pending           0          18h
openshift-marketplace                               redhat-marketplace-cxgwb                                        0/1     Pending           0          19h
openshift-marketplace                               redhat-marketplace-h5mrw                                        1/1     Terminating       0          19h
openshift-marketplace                               redhat-operators-gmj2r                                          0/1     Pending           0          19h
openshift-marketplace                               redhat-operators-mcmc7                                          1/1     Terminating       0          19h
openshift-monitoring                                alertmanager-main-0                                             5/5     Terminating       0          19h
openshift-monitoring                                alertmanager-main-1                                             5/5     Terminating       0          19h
openshift-monitoring                                alertmanager-main-2                                             5/5     Terminating       0          19h
openshift-monitoring                                grafana-578596d89-lpwtr                                         2/2     Terminating       0          19h
openshift-monitoring                                kube-state-metrics-d956df775-gxfcv                              3/3     Terminating       0          19h
openshift-monitoring                                node-exporter-g8kkq                                             2/2     Terminating       0          20h
openshift-monitoring                                node-exporter-hxm4f                                             0/2     Pending           0          18h
openshift-monitoring                                openshift-state-metrics-74b58f578c-6m4fl                        3/3     Terminating       0          19h
openshift-monitoring                                prometheus-adapter-79db6db5fd-8f524                             1/1     Terminating       0          19h
openshift-monitoring                                prometheus-adapter-79db6db5fd-bdm8w                             1/1     Terminating       0          19h
openshift-monitoring                                prometheus-k8s-0                                                7/7     Terminating       1          19h
openshift-monitoring                                prometheus-k8s-1                                                7/7     Terminating       1          19h
openshift-monitoring                                telemeter-client-668bc5dd49-vwfqk                               3/3     Terminating       0          19h
openshift-monitoring                                thanos-querier-69f5f7979-cjl58                                  5/5     Terminating       0          19h
openshift-monitoring                                thanos-querier-69f5f7979-zcnfx                                  5/5     Terminating       0          19h
openshift-monitoring                                thanos-querier-9b769975-92q6s                                   0/5     Pending           0          18h
openshift-monitoring                                thanos-querier-9b769975-lnhlk                                   0/5     Pending           0          18h
openshift-network-diagnostics                       network-check-source-6cd65cf589-bxzff                           1/1     Terminating       0          19h
openshift-network-diagnostics                       network-check-source-6cd65cf589-hp8bh                           0/1     Pending           0          19h
----------------------------------------
% oc get events -n max-services-cf852503-8ef9-4622-9a1f-142618f7f23b
….
13m   Warning   FailedScheduling   pod/max-serv-99-1-54d9d895cd-lmq9g   0/5 nodes are available: 1 Too many pods, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

% oc get events -n openshift-ingress
LAST SEEN   TYPE      REASON             OBJECT                               MESSAGE
16m         Warning   FailedScheduling   pod/router-default-7dcc5499f9-p9lxn   0/5 nodes are available: 1 Too many pods, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
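The "1 Too many pods" events above mean the one remaining Ready worker is already at the kubelet max-pods limit (250 by default in OpenShift, visible under `pods:` in `oc describe node`). A quick hedged way to check this per node is to count the NODE column of `oc get pods -A -o wide`; the helper and sample input below are illustrative, not from this cluster:

```shell
# Count scheduled pods per node from "oc get pods -A -o wide" output and
# flag any node at or above the max-pods limit passed as the 2nd argument
# (250 is the OpenShift default). Pending pods show NODE as "<none>" and
# are skipped. The sample input is hypothetical.
count_pods_per_node() {
  awk -v max="$2" '
    NR > 1 && $8 != "<none>" { count[$8]++ }   # $8 = NODE column
    END {
      for (n in count)
        printf "%s %d%s\n", n, count[n], (count[n] >= max ? " (at max-pods)" : "")
    }' "$1"
}
cat > /tmp/pods.txt <<'EOF'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED-NODE READINESS-GATES
ns1 pod-a 1/1 Running 0 19h 10.1.0.5 worker-a <none> <none>
ns1 pod-b 0/1 Pending 0 19h <none> <none> <none> <none>
ns2 pod-c 1/1 Terminating 0 19h 10.1.0.9 worker-b <none> <none>
ns2 pod-d 1/1 Running 0 19h 10.1.1.2 worker-b <none> <none>
EOF
count_pods_per_node /tmp/pods.txt 250 | sort
```

On a live cluster the input would come from `oc get pods -A -o wide` directly; with 400 workload pods spread over 2 workers plus daemonsets, each worker sits close to the 250-pod ceiling, so losing one node leaves nowhere to reschedule.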
----------------------------------------
% oc get nodes
NAME                                                   STATUS     ROLES    AGE   VERSION
qili46s-z4896-master-0.c.openshift-qe.internal         Ready      master   24h   v1.20.0+558d959
qili46s-z4896-master-1.c.openshift-qe.internal         Ready      master   24h   v1.20.0+558d959
qili46s-z4896-master-2.c.openshift-qe.internal         Ready      master   24h   v1.20.0+558d959
qili46s-z4896-worker-a-4ctgx.c.openshift-qe.internal   Ready      worker   24h   v1.20.0+558d959
qili46s-z4896-worker-b-mw5f9.c.openshift-qe.internal   NotReady   worker   24h   v1.20.0+558d959
----------------------------------------
% oc describe node qili46s-z4896-worker-b-mw5f9.c.openshift-qe.internal
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----                 ------    -----------------                 ------------------                ------             -------
  NetworkUnavailable   False     Mon, 01 Jan 0001 00:00:00 +0000   Tue, 10 Aug 2021 10:58:09 +0800   RouteCreated       openshift-sdn cleared kubelet-set NoRouteCreated
  MemoryPressure       Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure          Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready                Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
Non-terminated Pods:  (250 in total)
  Namespace                                           Name                                            CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------                                           ----                                            ------------   ----------   ---------------   -------------   ---
  benchmark-operator                                  benchmark-controller-manager-575cc4768b-rxbql   2 (57%)        2 (57%)      0 (0%)            0 (0%)          22h
  max-services-cf852503-8ef9-4622-9a1f-142618f7f23b   max-serv-1-1-c6f7b9b49-9gqwn                    0 (0%)         0 (0%)       0 (0%)            0 (0%)          22h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        2795m (79%)    2 (57%)
  memory                     6059Mi (43%)   512Mi (3%)
  ephemeral-storage          0 (0%)         0 (0%)
  hugepages-1Gi              0 (0%)         0 (0%)
  hugepages-2Mi              0 (0%)         0 (0%)
  attachable-volumes-gce-pd  0              0
Events:              <none>
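For reference, the percentages in "Allocated resources" are the summed pod requests divided by the node's allocatable capacity, not total capacity. A small sketch of that arithmetic follows; the 3500m allocatable figure is an assumption for an n1-standard-4 worker after kubelet system reservations (read the real value from the Allocatable section of `oc describe node`):

```shell
# Reproduce the "cpu 2795m (79%)" line above: percentage = requests /
# allocatable. The allocatable value (3500m) is assumed for this node
# type; 2795 / 3500 = 79.8%, truncated to 79% in the describe output.
cpu_request_millicores=2795
cpu_allocatable_millicores=3500
awk -v req="$cpu_request_millicores" -v alloc="$cpu_allocatable_millicores" \
  'BEGIN { printf "cpu %dm (%d%%)\n", req, int(req * 100 / alloc) }'
```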
Because of the condition of the cluster, I can't run must-gather.

[must-gather-czqlh] OUT gather logs unavailable: http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
[must-gather-czqlh] OUT waiting for gather to complete
[must-gather-czqlh] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-x2hlp deleted
[must-gather      ] OUT namespace/openshift-must-gather-z6l9x deleted
error: gather never finished for pod must-gather-czqlh: timed out waiting for the condition
Is this on a cluster I can log into to debug?
Sorry, I was on PTO over the past few days. Unfortunately I don't have the cluster anymore. I attached a grafana screenshot of the average worker-node CPU usage. The query behind the grafana graph was 'uuid.keyword: $uuid AND metricName: "nodeCPU-AggregatedWorkers"'. The full grafana link is here: http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz123/kube-burner-report-aggregated?orgId=1&from=1628566252651&to=1628566652544&var-Datasource=SVTQE-kube-burner&var-sdn=openshift-sdn&var-job=All&var-uuid=cf852503-8ef9-4622-9a1f-142618f7f23b&var-master=qili46s-z4896-master-0.c.openshift-qe.internal&var-namespace=All&var-verb=All&var-resource=All&var-flowschema=All&var-priority_level=All. I think the load was too big for the cluster, which put the worker node's CPU under pressure.
Roshni Pattath told me she tried 4.6.42 -> 4.7.24 -> 4.8.5 and was not able to reproduce the issue. 4.7.24 and 4.8.5 were the latest stable builds, and the upgrade completed in less than 3 hours. That means this issue is not reproduced (or at least not reproduced every time) on the upgrade path 4.6.42 -> 4.7.24 -> 4.8.5.
@rphillips Thanks for the analysis. Why can 'two prometheus pods running on the same NotReady node' make a node not ready? Is that expected behavior? How do we make users aware of this and help them avoid it?
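(Editorial note on the question above: in 4.8-era releases the prometheus-k8s StatefulSet ships, to my understanding, with only soft (`preferredDuringSchedulingIgnoredDuringExecution`) pod anti-affinity, so the scheduler may legally place both replicas on one node; whether that co-location is what overloads the node here is the hypothesis under discussion. For illustration only, a hard anti-affinity stanza of the kind that would force the two replicas onto different nodes looks like this; it is a hypothetical sketch, not an override the 4.8 cluster monitoring operator is known to support.)

```yaml
# Hypothetical hard pod anti-affinity for a two-replica prometheus-k8s
# StatefulSet. requiredDuringScheduling... makes co-location a scheduling
# failure rather than a preference; with soft anti-affinity the scheduler
# may still pack both replicas onto the same node under pressure.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus
        topologyKey: kubernetes.io/hostname
```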
@rphillips Hi Ryan, sorry, the test environment has already been torn down. I will rerun the test to see if I can reproduce the issue. How can I tell which build the patches are in?
@rphillips I hit a similar issue today on a 4.8.10 AWS OVN IPI cluster. After install, I scaled the worker nodes up from 3 to 120. Afterwards I found one node NotReady, with both prometheus pods on it; the pods on this node were stuck in the 'Terminating' state and could not be scheduled onto other nodes, which left some cluster operators in an abnormal state. You mentioned 'There are two prometheus pods running on the same NotReady node. This is the most likely cause why the node is notready'. Is this a known issue? Can you tell me more about how it happens?
----------------------------
% oc get clusterversions
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.11    True        False         78m     Cluster version is 4.8.11
----------------------------
% oc get nodes | grep NotReady
ip-10-0-142-90.us-east-2.compute.internal   NotReady   worker   118m   v1.21.1+9807387
----------------------------
% oc get pods -A -o wide | egrep -v "Completed| Running"
NAMESPACE                       NAME                                              READY   STATUS        RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
openshift-debug-node-rbs4v      ip-10-0-142-90.us-east-2.compute.internal-debug   0/1     Terminating   0          13m    <none>        ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-ingress               router-default-845cc4d4dc-p5xzs                   1/1     Terminating   0          120m   10.131.0.13   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-marketplace           certified-operators-f8r65                         1/1     Terminating   0          122m   10.131.0.8    ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-marketplace           community-operators-j2mc5                         1/1     Terminating   0          122m   10.131.0.15   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-marketplace           redhat-operators-r2bfm                            1/1     Terminating   0          122m   10.131.0.16   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            alertmanager-main-0                               5/5     Terminating   0          119m   10.131.0.18   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            alertmanager-main-1                               5/5     Terminating   0          119m   10.131.0.19   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            alertmanager-main-2                               5/5     Terminating   0          119m   10.131.0.20   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            grafana-dd56d9c7f-frght                           2/2     Terminating   0          119m   10.131.0.21   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            kube-state-metrics-7485cb5695-8j7m9               3/3     Terminating   0          124m   10.131.0.9    ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            openshift-state-metrics-65c6597c7-8tzcq           3/3     Terminating   0          124m   10.131.0.14   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            prometheus-adapter-748974bc4b-qfmhq               1/1     Terminating   0          119m   10.131.0.17   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            prometheus-k8s-0                                  7/7     Terminating   2          119m   10.131.0.23   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            prometheus-k8s-1                                  7/7     Terminating   1          119m   10.131.0.24   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            telemeter-client-74b79579b-4dfh7                  3/3     Terminating   0          124m   10.131.0.11   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            thanos-querier-65dc8646f7-5tnkz                   5/5     Terminating   0          119m   10.131.0.22   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-network-diagnostics   network-check-source-6ccd7c5589-kbwt6             1/1     Terminating   0          127m   10.131.0.10   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
----------------------------
% oc describe node ip-10-0-142-90.us-east-2.compute.internal
Name:               ip-10-0-142-90.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-142-90
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-07f5b9600cf909f5c"}
                    k8s.ovn.org/host-addresses: ["10.0.142.90"]
                    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-142-90.us-east-2.compute.internal","mac-address":"02:c2:e9:ee:89:92","ip-address...
                    k8s.ovn.org/node-chassis-id: 453f11fa-2b24-4ebc-acb5-79cbd635fdfd
                    k8s.ovn.org/node-mgmt-port-mac-address: b2:aa:93:1f:8b:9c
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.142.90/19"}
                    k8s.ovn.org/node-subnets: {"default":"10.131.0.0/23"}
                    machine.openshift.io/machine: openshift-machine-api/qili-48-aws-0914-sdct2-worker-us-east-2a-2m9z4
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-85b1a4bcca322341743ffdc8480f170c
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-85b1a4bcca322341743ffdc8480f170c
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 14 Sep 2021 09:36:02 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-142-90.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Tue, 14 Sep 2021 11:05:18 +0800
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----             ------    -----------------                 ------------------                ------             -------
  MemoryPressure   Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready            Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.0.142.90
  Hostname:     ip-10-0-142-90.us-east-2.compute.internal
  InternalDNS:  ip-10-0-142-90.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           125293548Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16106300Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         3500m
  ephemeral-storage:           115470533646
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      14955324Ki
  pods:                        250
System Info:
  Machine ID:                 ec2b9c749de9fa592bffff2e7b8243af
  System UUID:                ec2b9c74-9de9-fa59-2bff-ff2e7b8243af
  Boot ID:                    b8f66e17-4193-4b19-9ddb-3ae653014828
  Kernel Version:             4.18.0-305.19.1.el8_4.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 48.84.202109090400-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
  Kubelet Version:            v1.21.1+9807387
  Kube-Proxy Version:         v1.21.1+9807387
ProviderID:                   aws:///us-east-2a/i-07f5b9600cf909f5c
Non-terminated Pods:          (30 in total)
  Namespace                                Name                                              CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------                                ----                                              ------------   ----------   ---------------   -------------   ---
  openshift-cluster-csi-drivers            aws-ebs-csi-driver-node-ljkln                     30m (0%)       0 (0%)       150Mi (1%)        0 (0%)          118m
  openshift-cluster-node-tuning-operator   tuned-wkjm9                                       10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          118m
  openshift-debug-node-rbs4v               ip-10-0-142-90.us-east-2.compute.internal-debug   0 (0%)         0 (0%)       0 (0%)            0 (0%)          11m
  openshift-dns                            dns-default-h87pk                                 60m (1%)       0 (0%)       110Mi (0%)        0 (0%)          117m
  openshift-dns                            node-resolver-94zpn                               5m (0%)        0 (0%)       21Mi (0%)         0 (0%)          118m
  openshift-image-registry                 node-ca-4fpp2                                     10m (0%)       0 (0%)       10Mi (0%)         0 (0%)          118m
  openshift-ingress-canary                 ingress-canary-bvsh4                              10m (0%)       0 (0%)       20Mi (0%)         0 (0%)          117m
  openshift-ingress                        router-default-845cc4d4dc-p5xzs                   100m (2%)      0 (0%)       256Mi (1%)        0 (0%)          118m
  openshift-machine-config-operator        machine-config-daemon-zlbhc                       40m (1%)       0 (0%)       100Mi (0%)        0 (0%)          118m
  openshift-marketplace                    certified-operators-f8r65                         10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          120m
  openshift-marketplace                    community-operators-j2mc5                         10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          120m
  openshift-marketplace                    redhat-operators-r2bfm                            10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          120m
  openshift-monitoring                     alertmanager-main-0                               8m (0%)        0 (0%)       105Mi (0%)        0 (0%)          117m
  openshift-monitoring                     alertmanager-main-1                               8m (0%)        0 (0%)       105Mi (0%)        0 (0%)          117m
  openshift-monitoring                     alertmanager-main-2                               8m (0%)        0 (0%)       105Mi (0%)        0 (0%)          117m
  openshift-monitoring                     grafana-dd56d9c7f-frght                           5m (0%)        0 (0%)       84Mi (0%)         0 (0%)          117m
  openshift-monitoring                     kube-state-metrics-7485cb5695-8j7m9               4m (0%)        0 (0%)       110Mi (0%)        0 (0%)          121m
  openshift-monitoring                     node-exporter-nkn4f                               9m (0%)        0 (0%)       47Mi (0%)         0 (0%)          118m
  openshift-monitoring                     openshift-state-metrics-65c6597c7-8tzcq           3m (0%)        0 (0%)       72Mi (0%)         0 (0%)          121m
  openshift-monitoring                     prometheus-adapter-748974bc4b-qfmhq               1m (0%)        0 (0%)       40Mi (0%)         0 (0%)          117m
  openshift-monitoring                     prometheus-k8s-0                                  76m (2%)       0 (0%)       1119Mi (7%)       0 (0%)          117m
  openshift-monitoring                     prometheus-k8s-1                                  76m (2%)       0 (0%)       1119Mi (7%)       0 (0%)          117m
  openshift-monitoring                     telemeter-client-74b79579b-4dfh7                  3m (0%)        0 (0%)       70Mi (0%)         0 (0%)          121m
  openshift-monitoring                     thanos-querier-65dc8646f7-5tnkz                   14m (0%)       0 (0%)       77Mi (0%)         0 (0%)          117m
  openshift-multus                         multus-additional-cni-plugins-knbdn               10m (0%)       0 (0%)       10Mi (0%)         0 (0%)          118m
  openshift-multus                         multus-trtjx                                      10m (0%)       0 (0%)       65Mi (0%)         0 (0%)          118m
  openshift-multus                         network-metrics-daemon-s4pvr                      20m (0%)       0 (0%)       120Mi (0%)        0 (0%)          118m
  openshift-network-diagnostics            network-check-source-6ccd7c5589-kbwt6             10m (0%)       0 (0%)       40Mi (0%)         0 (0%)          125m
  openshift-network-diagnostics            network-check-target-b62jb                        10m (0%)       0 (0%)       15Mi (0%)         0 (0%)          118m
  openshift-ovn-kubernetes                 ovnkube-node-8ms5f                                40m (1%)       0 (0%)       640Mi (4%)        0 (0%)          118m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         610m (17%)     0 (0%)
  memory                      4810Mi (32%)   0 (0%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  Starting                 118m                 kubelet  Starting kubelet.
  Normal  NodeHasSufficientMemory  118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  118m                 kubelet  Updated Node Allocatable limit across pods
  Normal  NodeReady                117m                 kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeReady
----------------------------
% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.11    True        False         False      79m
baremetal                                  4.8.11    True        False         False      105m
cloud-credential                           4.8.11    True        False         False      116m
cluster-autoscaler                         4.8.11    True        False         False      105m
config-operator                            4.8.11    True        False         False      106m
console                                    4.8.11    True        False         False      93m
csi-snapshot-controller                    4.8.11    True        False         False      106m
dns                                        4.8.11    True        True          False      105m
etcd                                       4.8.11    True        False         False      104m
image-registry                             4.8.11    True        False         False      102m
ingress                                    4.8.11    True        False         False      100m
insights                                   4.8.11    True        False         False      100m
kube-apiserver                             4.8.11    True        False         False      102m
kube-controller-manager                    4.8.11    True        False         False      104m
kube-scheduler                             4.8.11    True        False         False      104m
kube-storage-version-migrator              4.8.11    True        False         False      106m
machine-api                                4.8.11    True        False         False      102m
machine-approver                           4.8.11    True        False         False      106m
machine-config                             4.8.11    False       False         True       3m19s
marketplace                                4.8.11    True        False         False      105m
monitoring                                 4.8.11    False       True          True       6m42s
network                                    4.8.11    True        True          False      107m
node-tuning                                4.8.11    True        False         False      105m
openshift-apiserver                        4.8.11    True        False         False      103m
openshift-controller-manager               4.8.11    True        False         False      98m
openshift-samples                          4.8.11    True        False         False      103m
operator-lifecycle-manager                 4.8.11    True        False         False      106m
operator-lifecycle-manager-catalog         4.8.11    True        False         False      105m
operator-lifecycle-manager-packageserver   4.8.11    True        False         False      103m
service-ca                                 4.8.11    True        False         False      106m
storage                                    4.8.11    True        True          False      105m
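(Editorial note: the co-location Ryan pointed out can be spotted quickly from `oc get pods -n openshift-monitoring -o wide` output by grouping the prometheus-k8s rows by their NODE column. A hedged helper follows; the sample input below is hypothetical, modeled on the pasted output above.)

```shell
# Flag nodes hosting more than one prometheus-k8s replica. On a live
# cluster the input would be:
#   oc get pods -n openshift-monitoring -o wide | grep '^prometheus-k8s'
# For namespaced "-o wide" output, NODE is column 7. The sample input
# below is illustrative only.
find_colocated() {
  awk '$1 ~ /^prometheus-k8s/ { count[$7]++ }
       END {
         for (n in count)
           if (count[n] > 1)
             printf "%s hosts %d prometheus replicas\n", n, count[n]
       }' "$1"
}
cat > /tmp/prom.txt <<'EOF'
prometheus-k8s-0 7/7 Running 2 119m 10.131.0.23 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
prometheus-k8s-1 7/7 Running 1 119m 10.131.0.24 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
EOF
find_colocated /tmp/prom.txt
```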
@rphillips I reproduced this issue multiple times on different providers (GCP/Azure/AWS), network types (sdn/ovn), and OCP versions (4.6/4.7/4.8). The common thing I can see is, as you've mentioned, 'There are two prometheus pods running on the same NotReady node. This is the most likely cause why the node is notready'. So I updated the bug title to better describe this issue. Is it a known issue that two prometheus pods running on the same node can make it NotReady? Can you tell me more about how that happens? Also, given that the node is broken, whether because two prometheus pods are running on the same node or for some other reason, why couldn't the pods stuck in the 'Terminating' state on it be detected and rescheduled to other healthy nodes?
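(Editorial note on the rescheduling question: pods on an unreachable node stay Terminating because the API server will not consider the deletion complete until the kubelet confirms it, and StatefulSet pods like prometheus-k8s are deliberately not replaced until then, to avoid running two copies of the same identity. The usual manual workaround, once the node is confirmed dead, is a force delete. A hedged sketch that only generates the commands, without running them, follows; the sample input is hypothetical.)

```shell
# Generate (but do not execute) force-delete commands for pods stuck in
# Terminating. Force deletion removes the API object without kubelet
# confirmation; for StatefulSet pods this is only safe once the node is
# confirmed down, otherwise two pods with the same identity could run.
# Real input would come from:  oc get pods -A | grep Terminating
cat > /tmp/stuck.txt <<'EOF'
openshift-monitoring prometheus-k8s-0 7/7 Terminating 1 19h
openshift-monitoring prometheus-k8s-1 7/7 Terminating 1 19h
openshift-ingress router-default-5ffbcbb44f-67cpx 1/1 Terminating 0 19h
EOF
awk '$4 == "Terminating" {
       printf "oc delete pod -n %s %s --force --grace-period=0\n", $1, $2
     }' /tmp/stuck.txt
```

Reviewing the generated commands before running them is the point of the two-step design; piping straight into `sh` would skip that safety check.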
@rphillips I can reproduce this when
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days