Gathered information (URL: https://bit.ly/3nHADSB):

On the test cluster, nodes go NotReady all the time because of OOM, even on the infra nodes. I have never seen this issue on 4.5, and there has been no significant change in the number of pods since then. The issue also occurs on the infra nodes ocpinfra6911 and ocpinfra6912, which run only OpenShift pods (4 cores, 32 GB RAM). The only difference between the lab and test clusters is the number of pods.

$ oc get node
NAME                      STATUS   ROLES          AGE   VERSION
ocpinfra6911.rbgooe.at    Ready    infra,worker   82d   v1.19.0+7070803
ocpinfra6912.rbgooe.at    Ready    infra,worker   81d   v1.19.0+7070803
ocpmaster6911.rbgooe.at   Ready    master         82d   v1.19.0+7070803
ocpmaster6912.rbgooe.at   Ready    master         82d   v1.19.0+7070803
ocpmaster6913.rbgooe.at   Ready    master         82d   v1.19.0+7070803
ocpnode6923.rbgooe.at     Ready    worker         79d   v1.19.0+7070803
ocpnode6924.rbgooe.at     Ready    worker         79d   v1.19.0+7070803
ocpnode6925.rbgooe.at     Ready    worker         75d   v1.19.0+7070803
ocpnode6926.rbgooe.at     Ready    worker         75d   v1.19.0+7070803
ocpnode6927.rbgooe.at     Ready    worker         75d   v1.19.0+7070803
ocpnode6928.rbgooe.at     Ready    worker         75d   v1.19.0+7070803
ocpnode6929.rbgooe.at     Ready    worker         56d   v1.19.0+7070803
ocpnode6930.rbgooe.at     Ready    worker         55d   v1.19.0+7070803
ocprouter6911.rbgooe.at   Ready    worker         77d   v1.19.0+7070803
ocprouter6912.rbgooe.at   Ready    worker         77d   v1.19.0+7070803

$ oc describe node ocpinfra6911.rbgooe.at | grep -i cpu:
  cpu:  4
  cpu:  3500m

[RBGOOE\lrzgwia_p@ocpjump6901 tmp]$ oc describe node ocpinfra6911.rbgooe.at | grep -i memory:
  memory:  32907804Ki
  memory:  31756828Ki

$ oc adm top node --as=system:admin
Error from server: Error while fetching node metrics for selector : unable to fetch node CPU metrics: unable to execute query: Get "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22ocpinfra6911.rbgooe.at%7Cocpinfra6912.rbgooe.at%7Cocpmaster6911.rbgooe.at%7Cocpmaster6912.rbgooe.at%7Cocpmaster6913.rbgooe.at%7Cocpnode6923.rbgooe.at%7Cocpnode6924.rbgooe.at%7Cocpnode6925.rbgooe.at%7Cocpnode6926.rbgooe.at%7Cocpnode6927.rbgooe.at%7Cocpnode6928.rbgooe.at%7Cocpnode6929.rbgooe.at%7Cocpnode6930.rbgooe.at%7Cocprouter6911.rbgooe.at%7Cocprouter6912.rbgooe.at%22%7D%29+by+%28node%29&time=1609760996.489": dial tcp 10.95.63.220:9091: i/o timeout

$ ssh core@ocpinfra6911
Red Hat Enterprise Linux CoreOS 46.82.202012051820-0
  Part of OpenShift 4.6, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.6/architecture/architecture-rhcos.html

---
Last login: Mon Jan  4 09:14:30 2021 from 10.96.76.10
[systemd]
Failed Units: 2
  rpc-statd.service
  vmtoolsd.service

$ oc get po --all-namespaces -o wide | grep ocpinfra
dynatrace                                oneagent-hfntq                                              1/1   Running   0   63m     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
dynatrace                                oneagent-k2xmv                                              1/1   Running   0   64m     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-8jrp5                                                 1/1   Running   0   18d     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-gp5fq                                                 1/1   Running   0   18d     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
openshift-dns                            dns-default-6lmwb                                           3/3   Running   0   17d     10.94.8.4     ocpinfra6912.rbgooe.at   <none>   <none>
openshift-dns                            dns-default-x8gr8                                           3/3   Running   0   17d     10.94.6.11    ocpinfra6911.rbgooe.at   <none>   <none>
openshift-image-registry                 image-registry-d7c7b9c4d-bjb9m                              1/1   Running   0   17m     10.94.6.14    ocpinfra6911.rbgooe.at   <none>   <none>
openshift-image-registry                 image-registry-d7c7b9c4d-kjt89                              1/1   Running   0   17m     10.94.6.4     ocpinfra6911.rbgooe.at   <none>   <none>
openshift-image-registry                 node-ca-ffln9                                               1/1   Running   0   18d     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
openshift-image-registry                 node-ca-rdc6b                                               1/1   Running   0   18d     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
openshift-ingress                        router-default-6d8dd4fd9f-h9zpw                             1/1   Running   0   15m     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
openshift-ingress                        router-default-6d8dd4fd9f-zwwx9                             1/1   Running   0   17m     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
openshift-machine-config-operator        machine-config-daemon-q5jvf                                 2/2   Running   0   17d     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
openshift-machine-config-operator        machine-config-daemon-s9m2d                                 2/2   Running   0   17d     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
openshift-marketplace                    redhat-marketplace-c4h6d                                    1/1   Running   0   5m14s   10.94.8.2     ocpinfra6912.rbgooe.at   <none>   <none>
openshift-monitoring                     alertmanager-main-0                                         5/5   Running   0   5m14s   10.94.8.3     ocpinfra6912.rbgooe.at   <none>   <none>
openshift-monitoring                     alertmanager-main-1                                         5/5   Running   0   5m14s   10.94.8.5     ocpinfra6912.rbgooe.at   <none>   <none>
openshift-monitoring                     alertmanager-main-2                                         5/5   Running   0   5m20s   10.94.6.17    ocpinfra6911.rbgooe.at   <none>   <none>
openshift-monitoring                     grafana-57d6cd6d77-mdfxf                                    2/2   Running   0   17m     10.94.6.13    ocpinfra6911.rbgooe.at   <none>   <none>
openshift-monitoring                     kube-state-metrics-6657f49659-7kth6                         3/3   Running   0   17m     10.94.6.6     ocpinfra6911.rbgooe.at   <none>   <none>
openshift-monitoring                     node-exporter-gl67w                                         2/2   Running   0   18d     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
openshift-monitoring                     node-exporter-q2bfs                                         2/2   Running   0   18d     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
openshift-monitoring                     openshift-state-metrics-698f8c857b-lzlsc                    3/3   Running   0   17m     10.94.6.10    ocpinfra6911.rbgooe.at   <none>   <none>
openshift-monitoring                     prometheus-adapter-678848b675-gldkg                         1/1   Running   0   17m     10.94.6.5     ocpinfra6911.rbgooe.at   <none>   <none>
openshift-monitoring                     prometheus-adapter-678848b675-v9qpw                         1/1   Running   0   17m     10.94.6.3     ocpinfra6911.rbgooe.at   <none>   <none>
openshift-monitoring                     prometheus-operator-5fcfd84995-46njn                        2/2   Running   0   17m     10.94.6.12    ocpinfra6911.rbgooe.at   <none>   <none>
openshift-monitoring                     telemeter-client-6fb9cbdf5b-llh4w                           3/3   Running   0   17m     10.94.6.2     ocpinfra6911.rbgooe.at   <none>   <none>
openshift-multus                         multus-kbhks                                                1/1   Running   0   17d     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
openshift-multus                         multus-phql2                                                1/1   Running   0   17d     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
openshift-multus                         network-metrics-daemon-svr8z                                2/2   Running   0   18d     10.94.8.6     ocpinfra6912.rbgooe.at   <none>   <none>
openshift-multus                         network-metrics-daemon-vhc9s                                2/2   Running   0   18d     10.94.6.8     ocpinfra6911.rbgooe.at   <none>   <none>
openshift-sdn                            ovs-dgffs                                                   1/1   Running   0   17d     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
openshift-sdn                            ovs-jzlv8                                                   1/1   Running   0   17d     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
openshift-sdn                            sdn-52vh8                                                   2/2   Running   1   18d     10.96.76.13   ocpinfra6912.rbgooe.at   <none>   <none>
openshift-sdn                            sdn-jwp9n                                                   2/2   Running   1   18d     10.96.76.12   ocpinfra6911.rbgooe.at   <none>   <none>
splunk-connect                           splunk-kubernetes-logging-mxpgd                             1/1   Running   0   21d     10.94.8.7     ocpinfra6912.rbgooe.at   <none>   <none>
splunk-connect                           splunk-kubernetes-logging-qcwwv                             1/1   Running   0   21d     10.94.6.15    ocpinfra6911.rbgooe.at   <none>   <none>
splunk-connect                           splunk-kubernetes-logging-splunk-kubernetes-metrics-8fmdc   1/1   Running   0   21d     10.94.6.16    ocpinfra6911.rbgooe.at   <none>   <none>
splunk-connect                           splunk-kubernetes-logging-splunk-kubernetes-metrics-vwp75   1/1   Running   0   21d     10.94.8.8     ocpinfra6912.rbgooe.at   <none>   <none>

=============================================

Some additional information requested from the customer:

The LimitRange was removed to give you all the necessary information.
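As a side note on the `oc describe node ocpinfra6911` output above: the gap between the node's memory capacity and its allocatable memory is what the kubelet holds back for the system (system/kube reservation), roughly 1.1 GiB here. A quick sketch of the arithmetic:

```python
# Values taken from the 'oc describe node ocpinfra6911.rbgooe.at' output above.
capacity_ki = 32907804     # memory: 32907804Ki (node capacity)
allocatable_ki = 31756828  # memory: 31756828Ki (allocatable for pods)

# The difference is the memory reserved away from pods on this node.
reserved_mib = (capacity_ki - allocatable_ki) / 1024
print(f"reserved for system: {reserved_mib:.0f} MiB")  # prints: reserved for system: 1124 MiB
```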
$ oc get node
NAME                      STATUS     ROLES    AGE   VERSION
ocpinfra5130.rbgooe.at    NotReady   infra    90d   v1.19.0+7070803
ocpinfra5131.rbgooe.at    NotReady   infra    90d   v1.19.0+7070803
ocpmaster5130.rbgooe.at   Ready      master   90d   v1.19.0+7070803
ocpmaster5131.rbgooe.at   Ready      master   90d   v1.19.0+7070803
ocpmaster5132.rbgooe.at   Ready      master   90d   v1.19.0+7070803
ocpnode5133.rbgooe.at     Ready      worker   53d   v1.19.0+7070803
ocpnode5134.rbgooe.at     Ready      worker   53d   v1.19.0+7070803
ocprouter5130.rbgooe.at   Ready      worker   90d   v1.19.0+7070803
ocprouter5131.rbgooe.at   Ready      worker   90d   v1.19.0+7070803

The infra nodes went NotReady a few minutes after I removed the LimitRange.

$ oc get po -o wide
NAME                                          READY   STATUS    RESTARTS   AGE    IP            NODE                      NOMINATED NODE   READINESS GATES
alertmanager-main-0                           5/5     Running   0          178m   10.94.8.11    ocpinfra5131.rbgooe.at    <none>           <none>
alertmanager-main-1                           5/5     Running   0          179m   10.94.6.10    ocpinfra5130.rbgooe.at    <none>           <none>
alertmanager-main-2                           5/5     Running   0          3h     10.94.8.13    ocpinfra5131.rbgooe.at    <none>           <none>
cluster-monitoring-operator-f85f7bcb5-txdkd   2/2     Running   0          98m    10.94.0.12    ocpmaster5130.rbgooe.at   <none>           <none>
grafana-7dbfc78d6-scr6m                       2/2     Running   0          3h1m   10.94.6.4     ocpinfra5130.rbgooe.at    <none>           <none>
kube-state-metrics-5f77599f58-c5lpl           3/3     Running   0          3h1m   10.94.6.23    ocpinfra5130.rbgooe.at    <none>           <none>
node-exporter-4zvm5                           2/2     Running   0          3h1m   10.96.91.22   ocpmaster5132.rbgooe.at   <none>           <none>
node-exporter-6q8zw                           2/2     Running   0          178m   10.96.91.20   ocpmaster5130.rbgooe.at   <none>           <none>
node-exporter-bgn4x                           2/2     Running   0          3h     10.96.89.67   ocprouter5130.rbgooe.at   <none>           <none>
node-exporter-fh9tc                           2/2     Running   0          179m   10.96.89.12   ocpnode5133.rbgooe.at     <none>           <none>
node-exporter-hw929                           2/2     Running   0          179m   10.96.91.24   ocpinfra5131.rbgooe.at    <none>           <none>
node-exporter-mk5wd                           2/2     Running   0          178m   10.96.91.23   ocpinfra5130.rbgooe.at    <none>           <none>
node-exporter-t8rgz                           2/2     Running   0          3h1m   10.96.91.21   ocpmaster5131.rbgooe.at   <none>           <none>
node-exporter-x6kvv                           2/2     Running   0          3h     10.96.89.13   ocpnode5134.rbgooe.at     <none>           <none>
node-exporter-ztkhb                           2/2     Running   0          179m   10.96.89.68   ocprouter5131.rbgooe.at   <none>           <none>
openshift-state-metrics-56d57cf485-sggdl      3/3     Running   0          3h1m   10.94.6.5     ocpinfra5130.rbgooe.at    <none>           <none>
prometheus-adapter-667656bd67-f4lq4           1/1     Running   0          156m   10.94.8.10    ocpinfra5131.rbgooe.at    <none>           <none>
prometheus-adapter-667656bd67-z2rqv           1/1     Running   0          154m   10.94.6.11    ocpinfra5130.rbgooe.at    <none>           <none>
prometheus-k8s-0                              6/6     Running   0          173m   10.94.6.26    ocpinfra5130.rbgooe.at    <none>           <none>
prometheus-k8s-1                              6/6     Running   0          23m    10.94.8.5     ocpinfra5131.rbgooe.at    <none>           <none>
prometheus-operator-7c79b69968-bsqsg          2/2     Running   0          3h1m   10.94.6.7     ocpinfra5130.rbgooe.at    <none>           <none>
telemeter-client-59b89d4cd7-64h28             3/3     Running   0          3h1m   10.94.6.14    ocpinfra5130.rbgooe.at    <none>           <none>
thanos-querier-54f4796ddd-ksw6g               5/5     Running   0          15m    10.94.6.24    ocpinfra5130.rbgooe.at    <none>           <none>
thanos-querier-54f4796ddd-rcdfk               5/5     Running   0          15m    10.94.8.23    ocpinfra5131.rbgooe.at    <none>           <none>

$ oc adm top node
NAME                      CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
ocpinfra5130.rbgooe.at    3951m        112%        23886Mi         105%
ocpmaster5130.rbgooe.at   3189m        91%         8527Mi          57%
ocpmaster5131.rbgooe.at   3986m        113%        8526Mi          57%
ocpmaster5132.rbgooe.at   3355m        95%         6996Mi          47%
ocpnode5133.rbgooe.at     7541m        107%        17531Mi         59%
ocpnode5134.rbgooe.at     6268m        89%         7950Mi          26%
ocprouter5130.rbgooe.at   421m         42%         1928Mi          35%
ocprouter5131.rbgooe.at   250m         25%         2129Mi          39%
ocpinfra5131.rbgooe.at    <unknown>    <unknown>   <unknown>       <unknown>

System memory on the ocpinfra nodes was already increased from 16 to 24 GB RAM, with no effect. Only the LimitRange helps.

`oc adm top po -n openshift-monitoring` is not working:

$ oc adm top po -n openshift-monitoring
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
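For context on "only the LimitRange helps": the removed object would be a LimitRange along the following lines (a hypothetical sketch; the customer's actual names, namespace, and values were not shared in this report). With container defaults in place, pods that set no requests/limits of their own are capped, which bounds total memory consumption on the node:

```yaml
# Hypothetical LimitRange sketch; the customer's real names/values are unknown.
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits   # assumed name
  namespace: my-project   # assumed namespace
spec:
  limits:
  - type: Container
    default:              # limit applied to containers that set none
      cpu: 500m
      memory: 1Gi
    defaultRequest:       # request applied to containers that set none
      cpu: 100m
      memory: 256Mi
```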
*** This bug has been marked as a duplicate of bug 1906496 ***