Description of problem:
An upgrade from 4.7.23 to 4.8.4 failed because 1 of 2 worker nodes became unavailable on a cluster loaded with 400 services.

Version-Release number of selected component (if applicable): 4.7.23

How reproducible:
Install a 4.6.42 cluster with 2 n1-standard-4 worker nodes. Load it with 400 services (400 pods created). Upgrade the cluster from 4.6.42 to 4.7.23 and then to 4.8.4.

Steps to Reproduce:
1. Create a GCP 4.6.42 sdn cluster with 2 n1-standard-4 worker nodes.
2. Run the max-services scale-ci test to generate 400 services (400 pods created) on the cluster. https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/max-services/77
3. Upgrade the cluster from 4.6.42 to 4.7.23.
4. Upgrade the cluster from 4.7.23 to 4.8.4.

Actual results:
The upgrade from 4.7.23 to 4.8.4 did not finish after more than 12 hours. The unavailable node's conditions:

Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----                 ------    -----------------                 ------------------                ------             -------
  NetworkUnavailable   False     Mon, 01 Jan 0001 00:00:00 +0000   Tue, 10 Aug 2021 10:58:09 +0800   RouteCreated       openshift-sdn cleared kubelet-set NoRouteCreated
  MemoryPressure       Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure          Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready                Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.

Expected results:
The upgrade from 4.7.23 to 4.8.4 should succeed.
Additional info:
----------------------------------------
% oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.23    True        True          24h     Unable to apply 4.8.4: an unknown error has occurred: MultipleErrors
----------------------------------------
% oc get clusterversion -o json | jq ".items[0].status.history"
[
  {
    "completionTime": null,
    "image": "registry.ci.openshift.org/ocp/release@sha256:841535acc09ca8412cd17e8f7702eceda1cac688ccc281278f108675c30de270",
    "startedTime": "2021-08-10T06:09:49Z",
    "state": "Partial",
    "verified": true,
    "version": "4.8.4"
  },
  {
    "completionTime": "2021-08-10T05:56:35Z",
    "image": "registry.ci.openshift.org/ocp/release@sha256:fb00f5e16a2092c3f15113ad8de0d2e841abdb43c9c39794522fc79784a3efb0",
    "startedTime": "2021-08-10T04:51:19Z",
    "state": "Completed",
    "verified": true,
    "version": "4.7.23"
  },
  {
    "completionTime": "2021-08-10T03:09:19Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:59e2e85f5d1bcb4440765c310b6261387ffc3f16ed55ca0a79012367e15b558b",
    "startedTime": "2021-08-10T02:46:37Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.42"
  }
]
----------------------------------------
% oc get machineset -A
NAMESPACE               NAME                     DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   qili46s-z4896-worker-a   1         1         1       1           27h
openshift-machine-api   qili46s-z4896-worker-b   1         1                             27h
openshift-machine-api   qili46s-z4896-worker-c   0         0                             27h
openshift-machine-api   qili46s-z4896-worker-f   0         0                             27h
----------------------------------------
% oc get machine -n openshift-machine-api
NAME                           PHASE     TYPE            REGION        ZONE            AGE
qili46s-z4896-worker-a-4ctgx   Running   n1-standard-4   us-central1   us-central1-a   27h
qili46s-z4896-worker-b-mw5f9   Running   n1-standard-4   us-central1   us-central1-b   27h
----------------------------------------
% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.4     True        False         False      19h
baremetal                                  4.8.4     True        False         False      19h
cloud-credential                           4.8.4     True        False         False      22h
cluster-autoscaler                         4.8.4     True        False         False      22h
config-operator                            4.8.4     True        False         False      22h
console                                    4.8.4     True        False         False      18h
csi-snapshot-controller                    4.8.4     True        False         False      19h
dns                                        4.7.23    True        False         False      22h
etcd                                       4.8.4     True        False         False      22h
image-registry                             4.7.23    True        True          True       19h
ingress                                    4.8.4     True        False         True       18h
insights                                   4.8.4     True        False         False      22h
kube-apiserver                             4.8.4     True        False         False      22h
kube-controller-manager                    4.8.4     True        False         False      22h
kube-scheduler                             4.8.4     True        False         False      22h
kube-storage-version-migrator              4.8.4     True        False         False      18h
machine-api                                4.8.4     True        False         False      22h
machine-approver                           4.8.4     True        False         False      22h
machine-config                             4.7.23    False       False         True       19h
marketplace                                4.8.4     True        False         False      19h
monitoring                                 4.7.23    False       True          True       18h
network                                    4.7.23    True        True          True       22h
node-tuning                                4.8.4     True        True          False      18h
openshift-apiserver                        4.8.4     True        False         False      19h
openshift-controller-manager               4.8.4     True        False         False      22h
openshift-samples                          4.8.4     True        False         False      18h
operator-lifecycle-manager                 4.8.4     True        False         False      22h
operator-lifecycle-manager-catalog         4.8.4     True        False         False      22h
operator-lifecycle-manager-packageserver   4.8.4     True        False         False      19h
service-ca                                 4.8.4     True        False         False      22h
storage                                    4.8.4     True        True          False      19h
----------------------------------------
% oc get pod -A | wc -l
868
% oc get pod -A | egrep -v "Completed|Running" | wc -l
427
----------------------------------------
% oc get pod -A | egrep -v "Completed|Running"
NAMESPACE                                           NAME                                                            READY   STATUS            RESTARTS   AGE
benchmark-operator                                  benchmark-controller-manager-575cc4768b-2qxn4                   0/2     Pending           0          19h
benchmark-operator                                  benchmark-controller-manager-575cc4768b-rxbql                   2/2     Terminating       0          19h
max-services-cf852503-8ef9-4622-9a1f-142618f7f23b   max-serv-1-1-c6f7b9b49-9gqwn                                    1/1     Terminating       0          19h
max-services-cf852503-8ef9-4622-9a1f-142618f7f23b   max-serv-1-1-c6f7b9b49-mj8bw                                    0/1     Pending           0          19h
:
:
:
max-services-cf852503-8ef9-4622-9a1f-142618f7f23b   max-serv-99-1-54d9d895cd-lmq9g                                  0/1     Pending           0          19h
openshift-cluster-csi-drivers                       gcp-pd-csi-driver-node-t4pnk                                    3/3     Terminating       0          20h
openshift-cluster-node-tuning-operator              tuned-9x9hn                                                     1/1     Terminating       0          20h
openshift-image-registry                            image-registry-74476bb9b6-j84z5                                 0/1     Pending           0          18h
openshift-image-registry                            image-registry-74476bb9b6-s96tk                                 0/1     Pending           0          18h
openshift-image-registry                            image-registry-9d7f47dd4-8bfmh                                  1/1     Terminating       0          19h
openshift-image-registry                            node-ca-vcmjt                                                   1/1     Terminating       0          20h
openshift-ingress-canary                            ingress-canary-vn6vk                                            1/1     Terminating       0          20h
openshift-ingress                                   router-default-5ffbcbb44f-67cpx                                 1/1     Terminating       0          19h
openshift-ingress                                   router-default-7dcc5499f9-p9lxn                                 0/1     Pending           0          18h
openshift-kube-apiserver                            kube-apiserver-qili46s-z4896-master-1.c.openshift-qe.internal   0/5     PodInitializing   0          54s
openshift-kube-storage-version-migrator             migrator-f58676cd4-92cjw                                        1/1     Terminating       0          19h
openshift-marketplace                               certified-operators-h4g57                                       0/1     Pending           0          18h
openshift-marketplace                               certified-operators-q8f4d                                       1/1     Terminating       0          19h
openshift-marketplace                               certified-operators-zrfgs                                       0/1     Pending           0          19h
openshift-marketplace                               community-operators-2vhc6                                       0/1     Pending           0          18h
openshift-marketplace                               community-operators-k7lgb                                       0/1     Pending           0          19h
openshift-marketplace                               community-operators-phj6h                                       1/1     Terminating       0          19h
openshift-marketplace                               redhat-marketplace-4vl6f                                        0/1     Pending           0          18h
openshift-marketplace                               redhat-marketplace-cxgwb                                        0/1     Pending           0          19h
openshift-marketplace                               redhat-marketplace-h5mrw                                        1/1     Terminating       0          19h
openshift-marketplace                               redhat-operators-gmj2r                                          0/1     Pending           0          19h
openshift-marketplace                               redhat-operators-mcmc7                                          1/1     Terminating       0          19h
openshift-monitoring                                alertmanager-main-0                                             5/5     Terminating       0          19h
openshift-monitoring                                alertmanager-main-1                                             5/5     Terminating       0          19h
openshift-monitoring                                alertmanager-main-2                                             5/5     Terminating       0          19h
openshift-monitoring                                grafana-578596d89-lpwtr                                         2/2     Terminating       0          19h
openshift-monitoring                                kube-state-metrics-d956df775-gxfcv                              3/3     Terminating       0          19h
openshift-monitoring                                node-exporter-g8kkq                                             2/2     Terminating       0          20h
openshift-monitoring                                node-exporter-hxm4f                                             0/2     Pending           0          18h
openshift-monitoring                                openshift-state-metrics-74b58f578c-6m4fl                        3/3     Terminating       0          19h
openshift-monitoring                                prometheus-adapter-79db6db5fd-8f524                             1/1     Terminating       0          19h
openshift-monitoring                                prometheus-adapter-79db6db5fd-bdm8w                             1/1     Terminating       0          19h
openshift-monitoring                                prometheus-k8s-0                                                7/7     Terminating       1          19h
openshift-monitoring                                prometheus-k8s-1                                                7/7     Terminating       1          19h
openshift-monitoring                                telemeter-client-668bc5dd49-vwfqk                               3/3     Terminating       0          19h
openshift-monitoring                                thanos-querier-69f5f7979-cjl58                                  5/5     Terminating       0          19h
openshift-monitoring                                thanos-querier-69f5f7979-zcnfx                                  5/5     Terminating       0          19h
openshift-monitoring                                thanos-querier-9b769975-92q6s                                   0/5     Pending           0          18h
openshift-monitoring                                thanos-querier-9b769975-lnhlk                                   0/5     Pending           0          18h
openshift-network-diagnostics                       network-check-source-6cd65cf589-bxzff                           1/1     Terminating       0          19h
openshift-network-diagnostics                       network-check-source-6cd65cf589-hp8bh                           0/1     Pending           0          19h
----------------------------------------
% oc get events -n max-services-cf852503-8ef9-4622-9a1f-142618f7f23b
….
13m   Warning   FailedScheduling   pod/max-serv-99-1-54d9d895cd-lmq9g   0/5 nodes are available: 1 Too many pods, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

% oc get events -n openshift-ingress
LAST SEEN   TYPE      REASON             OBJECT                               MESSAGE
16m         Warning   FailedScheduling   pod/router-default-7dcc5499f9-p9lxn   0/5 nodes are available: 1 Too many pods, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
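The "1 Too many pods" events above mean the one remaining Ready worker is already at the kubelet max-pods limit (250 by default in OpenShift, visible under `pods:` in `oc describe node`). A quick hedged way to check this per node is to count the NODE column of `oc get pods -A -o wide`; the helper and sample input below are illustrative, not from this cluster:

```shell
# Count scheduled pods per node from "oc get pods -A -o wide" output and
# flag any node at or above the max-pods limit passed as the 2nd argument
# (250 is the OpenShift default). Pending pods show NODE as "<none>" and
# are skipped. The sample input is hypothetical.
count_pods_per_node() {
  awk -v max="$2" '
    NR > 1 && $8 != "<none>" { count[$8]++ }   # $8 = NODE column
    END {
      for (n in count)
        printf "%s %d%s\n", n, count[n], (count[n] >= max ? " (at max-pods)" : "")
    }' "$1"
}
cat > /tmp/pods.txt <<'EOF'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED-NODE READINESS-GATES
ns1 pod-a 1/1 Running 0 19h 10.1.0.5 worker-a <none> <none>
ns1 pod-b 0/1 Pending 0 19h <none> <none> <none> <none>
ns2 pod-c 1/1 Terminating 0 19h 10.1.0.9 worker-b <none> <none>
ns2 pod-d 1/1 Running 0 19h 10.1.1.2 worker-b <none> <none>
EOF
count_pods_per_node /tmp/pods.txt 250 | sort
```

On a live cluster the input would come from `oc get pods -A -o wide` directly; with 400 workload pods spread over 2 workers plus daemonsets, each worker sits close to the 250-pod ceiling, so losing one node leaves nowhere to reschedule.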
----------------------------------------
% oc get nodes
NAME                                                   STATUS     ROLES    AGE   VERSION
qili46s-z4896-master-0.c.openshift-qe.internal         Ready      master   24h   v1.20.0+558d959
qili46s-z4896-master-1.c.openshift-qe.internal         Ready      master   24h   v1.20.0+558d959
qili46s-z4896-master-2.c.openshift-qe.internal         Ready      master   24h   v1.20.0+558d959
qili46s-z4896-worker-a-4ctgx.c.openshift-qe.internal   Ready      worker   24h   v1.20.0+558d959
qili46s-z4896-worker-b-mw5f9.c.openshift-qe.internal   NotReady   worker   24h   v1.20.0+558d959
----------------------------------------
% oc describe node qili46s-z4896-worker-b-mw5f9.c.openshift-qe.internal
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----                 ------    -----------------                 ------------------                ------             -------
  NetworkUnavailable   False     Mon, 01 Jan 0001 00:00:00 +0000   Tue, 10 Aug 2021 10:58:09 +0800   RouteCreated       openshift-sdn cleared kubelet-set NoRouteCreated
  MemoryPressure       Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure          Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready                Unknown   Tue, 10 Aug 2021 13:59:56 +0800   Tue, 10 Aug 2021 14:05:26 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
Non-terminated Pods:  (250 in total)
  Namespace                                           Name                                            CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------                                           ----                                            ------------   ----------   ---------------   -------------   ---
  benchmark-operator                                  benchmark-controller-manager-575cc4768b-rxbql   2 (57%)        2 (57%)      0 (0%)            0 (0%)          22h
  max-services-cf852503-8ef9-4622-9a1f-142618f7f23b   max-serv-1-1-c6f7b9b49-9gqwn                    0 (0%)         0 (0%)       0 (0%)            0 (0%)          22h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        2795m (79%)    2 (57%)
  memory                     6059Mi (43%)   512Mi (3%)
  ephemeral-storage          0 (0%)         0 (0%)
  hugepages-1Gi              0 (0%)         0 (0%)
  hugepages-2Mi              0 (0%)         0 (0%)
  attachable-volumes-gce-pd  0              0
Events:              <none>
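For reference, the percentages in "Allocated resources" are the summed pod requests divided by the node's allocatable capacity, not total capacity. A small sketch of that arithmetic follows; the 3500m allocatable figure is an assumption for an n1-standard-4 worker after kubelet system reservations (read the real value from the Allocatable section of `oc describe node`):

```shell
# Reproduce the "cpu 2795m (79%)" line above: percentage = requests /
# allocatable. The allocatable value (3500m) is assumed for this node
# type; 2795 / 3500 = 79.8%, truncated to 79% in the describe output.
cpu_request_millicores=2795
cpu_allocatable_millicores=3500
awk -v req="$cpu_request_millicores" -v alloc="$cpu_allocatable_millicores" \
  'BEGIN { printf "cpu %dm (%d%%)\n", req, int(req * 100 / alloc) }'
```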
Because of the condition of the cluster, I can't run must-gather.

[must-gather-czqlh] OUT gather logs unavailable: http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
[must-gather-czqlh] OUT waiting for gather to complete
[must-gather-czqlh] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-x2hlp deleted
[must-gather      ] OUT namespace/openshift-must-gather-z6l9x deleted
error: gather never finished for pod must-gather-czqlh: timed out waiting for the condition
Is this on a cluster I can log into to debug?
Sorry, I was on PTO over the past few days. Unfortunately I don't have the cluster anymore. I attached a grafana screenshot of the average worker-node CPU usage. The query behind the grafana graph was 'uuid.keyword: $uuid AND metricName: "nodeCPU-AggregatedWorkers"'. The full grafana link is here: http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz123/kube-burner-report-aggregated?orgId=1&from=1628566252651&to=1628566652544&var-Datasource=SVTQE-kube-burner&var-sdn=openshift-sdn&var-job=All&var-uuid=cf852503-8ef9-4622-9a1f-142618f7f23b&var-master=qili46s-z4896-master-0.c.openshift-qe.internal&var-namespace=All&var-verb=All&var-resource=All&var-flowschema=All&var-priority_level=All. I think the load was too big for the cluster, which put the worker node's CPU under pressure.
Roshni Pattath told me she tried 4.6.42 -> 4.7.24 -> 4.8.5 and was not able to reproduce the issue. 4.7.24 and 4.8.5 were the latest stable builds, and the upgrade completed in less than 3 hours. That means this issue is not reproduced (or at least not reproduced every time) on the upgrade path 4.6.42 -> 4.7.24 -> 4.8.5.
@rphillips Thanks for the analysis. Why can 'two prometheus pods running on the same NotReady node' make a node not ready? Is that expected behavior? How do we make users aware of this and help them avoid it?
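(Editorial note on the question above: in 4.8-era releases the prometheus-k8s StatefulSet ships, to my understanding, with only soft (`preferredDuringSchedulingIgnoredDuringExecution`) pod anti-affinity, so the scheduler may legally place both replicas on one node; whether that co-location is what overloads the node here is the hypothesis under discussion. For illustration only, a hard anti-affinity stanza of the kind that would force the two replicas onto different nodes looks like this; it is a hypothetical sketch, not an override the 4.8 cluster monitoring operator is known to support.)

```yaml
# Hypothetical hard pod anti-affinity for a two-replica prometheus-k8s
# StatefulSet. requiredDuringScheduling... makes co-location a scheduling
# failure rather than a preference; with soft anti-affinity the scheduler
# may still pack both replicas onto the same node under pressure.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus
        topologyKey: kubernetes.io/hostname
```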
@rphillips Hi Ryan, sorry, the test environment has already been torn down. I will rerun the test to see if I can reproduce the issue. How can I tell which build the patches are in?
@rphillips I hit a similar issue today on a 4.8.10 AWS OVN IPI cluster. After install, I scaled the worker nodes up from 3 to 120. Afterwards I found one node NotReady, with both prometheus pods on it; the pods on this node were stuck in the 'Terminating' state and could not be scheduled onto other nodes, which left some cluster operators in an abnormal state. You mentioned 'There are two prometheus pods running on the same NotReady node. This is the most likely cause why the node is notready'. Is this a known issue? Can you tell me more about how it happens?
----------------------------
% oc get clusterversions
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.11    True        False         78m     Cluster version is 4.8.11
----------------------------
% oc get nodes | grep NotReady
ip-10-0-142-90.us-east-2.compute.internal   NotReady   worker   118m   v1.21.1+9807387
----------------------------
% oc get pods -A -o wide | egrep -v "Completed| Running"
NAMESPACE                       NAME                                              READY   STATUS        RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
openshift-debug-node-rbs4v      ip-10-0-142-90.us-east-2.compute.internal-debug   0/1     Terminating   0          13m    <none>        ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-ingress               router-default-845cc4d4dc-p5xzs                   1/1     Terminating   0          120m   10.131.0.13   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-marketplace           certified-operators-f8r65                         1/1     Terminating   0          122m   10.131.0.8    ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-marketplace           community-operators-j2mc5                         1/1     Terminating   0          122m   10.131.0.15   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-marketplace           redhat-operators-r2bfm                            1/1     Terminating   0          122m   10.131.0.16   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            alertmanager-main-0                               5/5     Terminating   0          119m   10.131.0.18   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            alertmanager-main-1                               5/5     Terminating   0          119m   10.131.0.19   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            alertmanager-main-2                               5/5     Terminating   0          119m   10.131.0.20   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            grafana-dd56d9c7f-frght                           2/2     Terminating   0          119m   10.131.0.21   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            kube-state-metrics-7485cb5695-8j7m9               3/3     Terminating   0          124m   10.131.0.9    ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            openshift-state-metrics-65c6597c7-8tzcq           3/3     Terminating   0          124m   10.131.0.14   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            prometheus-adapter-748974bc4b-qfmhq               1/1     Terminating   0          119m   10.131.0.17   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            prometheus-k8s-0                                  7/7     Terminating   2          119m   10.131.0.23   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            prometheus-k8s-1                                  7/7     Terminating   1          119m   10.131.0.24   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            telemeter-client-74b79579b-4dfh7                  3/3     Terminating   0          124m   10.131.0.11   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-monitoring            thanos-querier-65dc8646f7-5tnkz                   5/5     Terminating   0          119m   10.131.0.22   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
openshift-network-diagnostics   network-check-source-6ccd7c5589-kbwt6             1/1     Terminating   0          127m   10.131.0.10   ip-10-0-142-90.us-east-2.compute.internal   <none>           <none>
----------------------------
% oc describe node ip-10-0-142-90.us-east-2.compute.internal
Name:               ip-10-0-142-90.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-142-90
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-07f5b9600cf909f5c"}
                    k8s.ovn.org/host-addresses: ["10.0.142.90"]
                    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-142-90.us-east-2.compute.internal","mac-address":"02:c2:e9:ee:89:92","ip-address...
                    k8s.ovn.org/node-chassis-id: 453f11fa-2b24-4ebc-acb5-79cbd635fdfd
                    k8s.ovn.org/node-mgmt-port-mac-address: b2:aa:93:1f:8b:9c
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.142.90/19"}
                    k8s.ovn.org/node-subnets: {"default":"10.131.0.0/23"}
                    machine.openshift.io/machine: openshift-machine-api/qili-48-aws-0914-sdct2-worker-us-east-2a-2m9z4
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-85b1a4bcca322341743ffdc8480f170c
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-85b1a4bcca322341743ffdc8480f170c
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 14 Sep 2021 09:36:02 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-142-90.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Tue, 14 Sep 2021 11:05:18 +0800
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----             ------    -----------------                 ------------------                ------             -------
  MemoryPressure   Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready            Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.0.142.90
  Hostname:     ip-10-0-142-90.us-east-2.compute.internal
  InternalDNS:  ip-10-0-142-90.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           125293548Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16106300Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         3500m
  ephemeral-storage:           115470533646
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      14955324Ki
  pods:                        250
System Info:
  Machine ID:                 ec2b9c749de9fa592bffff2e7b8243af
  System UUID:                ec2b9c74-9de9-fa59-2bff-ff2e7b8243af
  Boot ID:                    b8f66e17-4193-4b19-9ddb-3ae653014828
  Kernel Version:             4.18.0-305.19.1.el8_4.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 48.84.202109090400-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
  Kubelet Version:            v1.21.1+9807387
  Kube-Proxy Version:         v1.21.1+9807387
ProviderID:                   aws:///us-east-2a/i-07f5b9600cf909f5c
Non-terminated Pods:          (30 in total)
  Namespace                                Name                                              CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------                                ----                                              ------------   ----------   ---------------   -------------   ---
  openshift-cluster-csi-drivers            aws-ebs-csi-driver-node-ljkln                     30m (0%)       0 (0%)       150Mi (1%)        0 (0%)          118m
  openshift-cluster-node-tuning-operator   tuned-wkjm9                                       10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          118m
  openshift-debug-node-rbs4v               ip-10-0-142-90.us-east-2.compute.internal-debug   0 (0%)         0 (0%)       0 (0%)            0 (0%)          11m
  openshift-dns                            dns-default-h87pk                                 60m (1%)       0 (0%)       110Mi (0%)        0 (0%)          117m
  openshift-dns                            node-resolver-94zpn                               5m (0%)        0 (0%)       21Mi (0%)         0 (0%)          118m
  openshift-image-registry                 node-ca-4fpp2                                     10m (0%)       0 (0%)       10Mi (0%)         0 (0%)          118m
  openshift-ingress-canary                 ingress-canary-bvsh4                              10m (0%)       0 (0%)       20Mi (0%)         0 (0%)          117m
  openshift-ingress                        router-default-845cc4d4dc-p5xzs                   100m (2%)      0 (0%)       256Mi (1%)        0 (0%)          118m
  openshift-machine-config-operator        machine-config-daemon-zlbhc                       40m (1%)       0 (0%)       100Mi (0%)        0 (0%)          118m
  openshift-marketplace                    certified-operators-f8r65                         10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          120m
  openshift-marketplace                    community-operators-j2mc5                         10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          120m
  openshift-marketplace                    redhat-operators-r2bfm                            10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          120m
  openshift-monitoring                     alertmanager-main-0                               8m (0%)        0 (0%)       105Mi (0%)        0 (0%)          117m
  openshift-monitoring                     alertmanager-main-1                               8m (0%)        0 (0%)       105Mi (0%)        0 (0%)          117m
  openshift-monitoring                     alertmanager-main-2                               8m (0%)        0 (0%)       105Mi (0%)        0 (0%)          117m
  openshift-monitoring                     grafana-dd56d9c7f-frght                           5m (0%)        0 (0%)       84Mi (0%)         0 (0%)          117m
  openshift-monitoring                     kube-state-metrics-7485cb5695-8j7m9               4m (0%)        0 (0%)       110Mi (0%)        0 (0%)          121m
  openshift-monitoring                     node-exporter-nkn4f                               9m (0%)        0 (0%)       47Mi (0%)         0 (0%)          118m
  openshift-monitoring                     openshift-state-metrics-65c6597c7-8tzcq           3m (0%)        0 (0%)       72Mi (0%)         0 (0%)          121m
  openshift-monitoring                     prometheus-adapter-748974bc4b-qfmhq               1m (0%)        0 (0%)       40Mi (0%)         0 (0%)          117m
  openshift-monitoring                     prometheus-k8s-0                                  76m (2%)       0 (0%)       1119Mi (7%)       0 (0%)          117m
  openshift-monitoring                     prometheus-k8s-1                                  76m (2%)       0 (0%)       1119Mi (7%)       0 (0%)          117m
  openshift-monitoring                     telemeter-client-74b79579b-4dfh7                  3m (0%)        0 (0%)       70Mi (0%)         0 (0%)          121m
  openshift-monitoring                     thanos-querier-65dc8646f7-5tnkz                   14m (0%)       0 (0%)       77Mi (0%)         0 (0%)          117m
  openshift-multus                         multus-additional-cni-plugins-knbdn               10m (0%)       0 (0%)       10Mi (0%)         0 (0%)          118m
  openshift-multus                         multus-trtjx                                      10m (0%)       0 (0%)       65Mi (0%)         0 (0%)          118m
  openshift-multus                         network-metrics-daemon-s4pvr                      20m (0%)       0 (0%)       120Mi (0%)        0 (0%)          118m
  openshift-network-diagnostics            network-check-source-6ccd7c5589-kbwt6             10m (0%)       0 (0%)       40Mi (0%)         0 (0%)          125m
  openshift-network-diagnostics            network-check-target-b62jb                        10m (0%)       0 (0%)       15Mi (0%)         0 (0%)          118m
  openshift-ovn-kubernetes                 ovnkube-node-8ms5f                                40m (1%)       0 (0%)       640Mi (4%)        0 (0%)          118m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         610m (17%)     0 (0%)
  memory                      4810Mi (32%)   0 (0%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  Starting                 118m                 kubelet  Starting kubelet.
  Normal  NodeHasSufficientMemory  118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  118m                 kubelet  Updated Node Allocatable limit across pods
  Normal  NodeReady                117m                 kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeReady
----------------------------
% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.11    True        False         False      79m
baremetal                                  4.8.11    True        False         False      105m
cloud-credential                           4.8.11    True        False         False      116m
cluster-autoscaler                         4.8.11    True        False         False      105m
config-operator                            4.8.11    True        False         False      106m
console                                    4.8.11    True        False         False      93m
csi-snapshot-controller                    4.8.11    True        False         False      106m
dns                                        4.8.11    True        True          False      105m
etcd                                       4.8.11    True        False         False      104m
image-registry                             4.8.11    True        False         False      102m
ingress                                    4.8.11    True        False         False      100m
insights                                   4.8.11    True        False         False      100m
kube-apiserver                             4.8.11    True        False         False      102m
kube-controller-manager                    4.8.11    True        False         False      104m
kube-scheduler                             4.8.11    True        False         False      104m
kube-storage-version-migrator              4.8.11    True        False         False      106m
machine-api                                4.8.11    True        False         False      102m
machine-approver                           4.8.11    True        False         False      106m
machine-config                             4.8.11    False       False         True       3m19s
marketplace                                4.8.11    True        False         False      105m
monitoring                                 4.8.11    False       True          True       6m42s
network                                    4.8.11    True        True          False      107m
node-tuning                                4.8.11    True        False         False      105m
openshift-apiserver                        4.8.11    True        False         False      103m
openshift-controller-manager               4.8.11    True        False         False      98m
openshift-samples                          4.8.11    True        False         False      103m
operator-lifecycle-manager                 4.8.11    True        False         False      106m
operator-lifecycle-manager-catalog         4.8.11    True        False         False      105m
operator-lifecycle-manager-packageserver   4.8.11    True        False         False      103m
service-ca                                 4.8.11    True        False         False      106m
storage                                    4.8.11    True        True          False      105m
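(Editorial note: the co-location Ryan pointed out can be spotted quickly from `oc get pods -n openshift-monitoring -o wide` output by grouping the prometheus-k8s rows by their NODE column. A hedged helper follows; the sample input below is hypothetical, modeled on the pasted output above.)

```shell
# Flag nodes hosting more than one prometheus-k8s replica. On a live
# cluster the input would be:
#   oc get pods -n openshift-monitoring -o wide | grep '^prometheus-k8s'
# For namespaced "-o wide" output, NODE is column 7. The sample input
# below is illustrative only.
find_colocated() {
  awk '$1 ~ /^prometheus-k8s/ { count[$7]++ }
       END {
         for (n in count)
           if (count[n] > 1)
             printf "%s hosts %d prometheus replicas\n", n, count[n]
       }' "$1"
}
cat > /tmp/prom.txt <<'EOF'
prometheus-k8s-0 7/7 Running 2 119m 10.131.0.23 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
prometheus-k8s-1 7/7 Running 1 119m 10.131.0.24 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
EOF
find_colocated /tmp/prom.txt
```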
@rphillips I reproduced this issue multiple times on different providers (GCP/Azure/AWS), network types (sdn/ovn), and OCP versions (4.6/4.7/4.8). The common thing I can see is, as you've mentioned, 'There are two prometheus pods running on the same NotReady node. This is the most likely cause why the node is notready'. So I updated the bug title to better describe this issue. Is it a known issue that two prometheus pods running on the same node can make it NotReady? Can you tell me more about how that happens? Also, given that the node is broken, whether because two prometheus pods are running on the same node or for some other reason, why couldn't the pods stuck in the 'Terminating' state on it be detected and rescheduled to other healthy nodes?
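(Editorial note on the rescheduling question: pods on an unreachable node stay Terminating because the API server will not consider the deletion complete until the kubelet confirms it, and StatefulSet pods like prometheus-k8s are deliberately not replaced until then, to avoid running two copies of the same identity. The usual manual workaround, once the node is confirmed dead, is a force delete. A hedged sketch that only generates the commands, without running them, follows; the sample input is hypothetical.)

```shell
# Generate (but do not execute) force-delete commands for pods stuck in
# Terminating. Force deletion removes the API object without kubelet
# confirmation; for StatefulSet pods this is only safe once the node is
# confirmed down, otherwise two pods with the same identity could run.
# Real input would come from:  oc get pods -A | grep Terminating
cat > /tmp/stuck.txt <<'EOF'
openshift-monitoring prometheus-k8s-0 7/7 Terminating 1 19h
openshift-monitoring prometheus-k8s-1 7/7 Terminating 1 19h
openshift-ingress router-default-5ffbcbb44f-67cpx 1/1 Terminating 0 19h
EOF
awk '$4 == "Terminating" {
       printf "oc delete pod -n %s %s --force --grace-period=0\n", $1, $2
     }' /tmp/stuck.txt
```

Reviewing the generated commands before running them is the point of the two-step design; piping straight into `sh` would skip that safety check.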
@rphillips I can reproduce this when
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days