Bug 1916645 - OpenShift node goes into NotReady state since 4.6.8 upgrade
Keywords:
Status: CLOSED DUPLICATE of bug 1906496
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-15 10:35 UTC by mchebbi@redhat.com
Modified: 2024-03-25 17:51 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-22 19:34:27 UTC
Target Upstream Version:
Embargoed:


Description mchebbi@redhat.com 2021-01-15 10:35:30 UTC
Gathered information URL: https://bit.ly/3nHADSB

On the test cluster, nodes keep going NotReady because of OOM, even the infra nodes. I have never seen this issue on 4.5, and there has been no significant change in the number of pods since then. The issue also occurs on infra nodes ocpinfra6911 and ocpinfra6912, which run only OpenShift pods (4 cores, 32 GB RAM).

The only difference between lab and test is the number of pods.

$ oc get node
NAME                      STATUS    ROLES          AGE       VERSION
ocpinfra6911.rbgooe.at    Ready     infra,worker   82d       v1.19.0+7070803
ocpinfra6912.rbgooe.at    Ready     infra,worker   81d       v1.19.0+7070803
ocpmaster6911.rbgooe.at   Ready     master         82d       v1.19.0+7070803
ocpmaster6912.rbgooe.at   Ready     master         82d       v1.19.0+7070803
ocpmaster6913.rbgooe.at   Ready     master         82d       v1.19.0+7070803
ocpnode6923.rbgooe.at     Ready     worker         79d       v1.19.0+7070803
ocpnode6924.rbgooe.at     Ready     worker         79d       v1.19.0+7070803
ocpnode6925.rbgooe.at     Ready     worker         75d       v1.19.0+7070803
ocpnode6926.rbgooe.at     Ready     worker         75d       v1.19.0+7070803
ocpnode6927.rbgooe.at     Ready     worker         75d       v1.19.0+7070803
ocpnode6928.rbgooe.at     Ready     worker         75d       v1.19.0+7070803
ocpnode6929.rbgooe.at     Ready     worker         56d       v1.19.0+7070803
ocpnode6930.rbgooe.at     Ready     worker         55d       v1.19.0+7070803
ocprouter6911.rbgooe.at   Ready     worker         77d       v1.19.0+7070803
ocprouter6912.rbgooe.at   Ready     worker         77d       v1.19.0+7070803

$ oc describe node ocpinfra6911.rbgooe.at |grep -i cpu:
 cpu:                4
 cpu:                3500m
$ oc describe node ocpinfra6911.rbgooe.at |grep -i memory:
 memory:             32907804Ki
 memory:             31756828Ki
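
For reference, the two values printed by each grep are presumably the node's Capacity and Allocatable figures, in that order. This is an assumption based on the usual `oc describe node` layout; the output above does not label them. Under that assumption, a minimal sketch of how much is held back on the node:

```python
# Values copied from the grep output above. Reading the first value of each
# pair as Capacity and the second as Allocatable is an assumption.
capacity_cpu_m = 4 * 1000          # 4 cores expressed in millicores
allocatable_cpu_m = 3500           # "3500m"
capacity_mem_ki = 32907804         # "32907804Ki"
allocatable_mem_ki = 31756828      # "31756828Ki"

# Whatever is not allocatable is reserved on the node and unavailable to pods.
reserved_cpu_m = capacity_cpu_m - allocatable_cpu_m
reserved_mem_ki = capacity_mem_ki - allocatable_mem_ki
reserved_mem_mi = reserved_mem_ki / 1024

print(f"reserved CPU:    {reserved_cpu_m}m")
print(f"reserved memory: {reserved_mem_ki}Ki ({reserved_mem_mi:.0f}Mi)")
```

If that reading is right, only about 1.1 GiB of the 32 GB is held back from pods on this node.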

$ oc adm top node --as=system:admin
Error from server: Error while fetching node metrics for selector : unable to fetch node CPU metrics: unable to execute query: Get "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29+%2A+on%28namespace%2C+pod%29+group_left%28node%29+node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22ocpinfra6911.rbgooe.at%7Cocpinfra6912.rbgooe.at%7Cocpmaster6911.rbgooe.at%7Cocpmaster6912.rbgooe.at%7Cocpmaster6913.rbgooe.at%7Cocpnode6923.rbgooe.at%7Cocpnode6924.rbgooe.at%7Cocpnode6925.rbgooe.at%7Cocpnode6926.rbgooe.at%7Cocpnode6927.rbgooe.at%7Cocpnode6928.rbgooe.at%7Cocpnode6929.rbgooe.at%7Cocpnode6930.rbgooe.at%7Cocprouter6911.rbgooe.at%7Cocprouter6912.rbgooe.at%22%7D%29+by+%28node%29&time=1609760996.489": dial tcp 10.95.63.220:9091: i/o timeout
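
The URL-encoded query in the error above is hard to read. A small sketch that decodes it with the Python standard library; the node list is shortened to two hosts here for readability (that shortening is mine and is not part of the original request):

```python
from datetime import datetime, timezone
from urllib.parse import unquote_plus

# Query parameter copied from the error message, node list shortened to two
# hosts (assumption for readability; the real request lists all 15 nodes).
encoded = (
    "sum%281+-+irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29"
    "+%2A+on%28namespace%2C+pod%29+group_left%28node%29+"
    "node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22"
    "ocpinfra6911.rbgooe.at%7Cocpinfra6912.rbgooe.at%22%7D%29+by+%28node%29"
)

# unquote_plus converts '+' to spaces and %XX escapes to characters.
promql = unquote_plus(encoded)
print(promql)

# The "time" parameter is a Unix timestamp; convert it to UTC.
when = datetime.fromtimestamp(1609760996.489, tz=timezone.utc)
when_text = when.strftime("%Y-%m-%d %H:%M:%S")
print(when_text, "UTC")
```

The decoded timestamp falls on 2021-01-04, the same day as the "Last login" line in the SSH session further down, so this output appears to predate the report by several days.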

$ ssh core@ocpinfra6911
Red Hat Enterprise Linux CoreOS 46.82.202012051820-0
  Part of OpenShift 4.6, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.6/architecture/architecture-rhcos.html

---
Last login: Mon Jan  4 09:14:30 2021 from 10.96.76.10
[systemd]
Failed Units: 2
  rpc-statd.service
  vmtoolsd.service

$ oc get po --all-namespaces -o wide |grep ocpinfra
dynatrace                                          oneagent-hfntq                                                    1/1       Running     0          63m       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
dynatrace                                          oneagent-k2xmv                                                    1/1       Running     0          64m       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
openshift-cluster-node-tuning-operator             tuned-8jrp5                                                       1/1       Running     0          18d       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
openshift-cluster-node-tuning-operator             tuned-gp5fq                                                       1/1       Running     0          18d       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
openshift-dns                                      dns-default-6lmwb                                                 3/3       Running     0          17d       10.94.8.4      ocpinfra6912.rbgooe.at    <none>           <none>
openshift-dns                                      dns-default-x8gr8                                                 3/3       Running     0          17d       10.94.6.11     ocpinfra6911.rbgooe.at    <none>           <none>
openshift-image-registry                           image-registry-d7c7b9c4d-bjb9m                                    1/1       Running     0          17m       10.94.6.14     ocpinfra6911.rbgooe.at    <none>           <none>
openshift-image-registry                           image-registry-d7c7b9c4d-kjt89                                    1/1       Running     0          17m       10.94.6.4      ocpinfra6911.rbgooe.at    <none>           <none>
openshift-image-registry                           node-ca-ffln9                                                     1/1       Running     0          18d       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
openshift-image-registry                           node-ca-rdc6b                                                     1/1       Running     0          18d       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
openshift-ingress                                  router-default-6d8dd4fd9f-h9zpw                                   1/1       Running     0          15m       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
openshift-ingress                                  router-default-6d8dd4fd9f-zwwx9                                   1/1       Running     0          17m       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
openshift-machine-config-operator                  machine-config-daemon-q5jvf                                       2/2       Running     0          17d       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
openshift-machine-config-operator                  machine-config-daemon-s9m2d                                       2/2       Running     0          17d       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
openshift-marketplace                              redhat-marketplace-c4h6d                                          1/1       Running     0          5m14s     10.94.8.2      ocpinfra6912.rbgooe.at    <none>           <none>
openshift-monitoring                               alertmanager-main-0                                               5/5       Running     0          5m14s     10.94.8.3      ocpinfra6912.rbgooe.at    <none>           <none>
openshift-monitoring                               alertmanager-main-1                                               5/5       Running     0          5m14s     10.94.8.5      ocpinfra6912.rbgooe.at    <none>           <none>
openshift-monitoring                               alertmanager-main-2                                               5/5       Running     0          5m20s     10.94.6.17     ocpinfra6911.rbgooe.at    <none>           <none>
openshift-monitoring                               grafana-57d6cd6d77-mdfxf                                          2/2       Running     0          17m       10.94.6.13     ocpinfra6911.rbgooe.at    <none>           <none>
openshift-monitoring                               kube-state-metrics-6657f49659-7kth6                               3/3       Running     0          17m       10.94.6.6      ocpinfra6911.rbgooe.at    <none>           <none>
openshift-monitoring                               node-exporter-gl67w                                               2/2       Running     0          18d       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
openshift-monitoring                               node-exporter-q2bfs                                               2/2       Running     0          18d       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
openshift-monitoring                               openshift-state-metrics-698f8c857b-lzlsc                          3/3       Running     0          17m       10.94.6.10     ocpinfra6911.rbgooe.at    <none>           <none>
openshift-monitoring                               prometheus-adapter-678848b675-gldkg                               1/1       Running     0          17m       10.94.6.5      ocpinfra6911.rbgooe.at    <none>           <none>
openshift-monitoring                               prometheus-adapter-678848b675-v9qpw                               1/1       Running     0          17m       10.94.6.3      ocpinfra6911.rbgooe.at    <none>           <none>
openshift-monitoring                               prometheus-operator-5fcfd84995-46njn                              2/2       Running     0          17m       10.94.6.12     ocpinfra6911.rbgooe.at    <none>           <none>
openshift-monitoring                               telemeter-client-6fb9cbdf5b-llh4w                                 3/3       Running     0          17m       10.94.6.2      ocpinfra6911.rbgooe.at    <none>           <none>
openshift-multus                                   multus-kbhks                                                      1/1       Running     0          17d       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
openshift-multus                                   multus-phql2                                                      1/1       Running     0          17d       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
openshift-multus                                   network-metrics-daemon-svr8z                                      2/2       Running     0          18d       10.94.8.6      ocpinfra6912.rbgooe.at    <none>           <none>
openshift-multus                                   network-metrics-daemon-vhc9s                                      2/2       Running     0          18d       10.94.6.8      ocpinfra6911.rbgooe.at    <none>           <none>
openshift-sdn                                      ovs-dgffs                                                         1/1       Running     0          17d       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
openshift-sdn                                      ovs-jzlv8                                                         1/1       Running     0          17d       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
openshift-sdn                                      sdn-52vh8                                                         2/2       Running     1          18d       10.96.76.13    ocpinfra6912.rbgooe.at    <none>           <none>
openshift-sdn                                      sdn-jwp9n                                                         2/2       Running     1          18d       10.96.76.12    ocpinfra6911.rbgooe.at    <none>           <none>
splunk-connect                                     splunk-kubernetes-logging-mxpgd                                   1/1       Running     0          21d       10.94.8.7      ocpinfra6912.rbgooe.at    <none>           <none>
splunk-connect                                     splunk-kubernetes-logging-qcwwv                                   1/1       Running     0          21d       10.94.6.15     ocpinfra6911.rbgooe.at    <none>           <none>
splunk-connect                                     splunk-kubernetes-logging-splunk-kubernetes-metrics-8fmdc         1/1       Running     0          21d       10.94.6.16     ocpinfra6911.rbgooe.at    <none>           <none>
splunk-connect                                     splunk-kubernetes-logging-splunk-kubernetes-metrics-vwp75         1/1       Running     0          21d       10.94.8.8      ocpinfra6912.rbgooe.at    <none>           <none>
=============================================

Additional information requested from the customer:


The LimitRange was removed in order to provide all the necessary information.

NAME                      STATUS     ROLES     AGE       VERSION
ocpinfra5130.rbgooe.at    NotReady   infra     90d       v1.19.0+7070803
ocpinfra5131.rbgooe.at    NotReady   infra     90d       v1.19.0+7070803
ocpmaster5130.rbgooe.at   Ready      master    90d       v1.19.0+7070803
ocpmaster5131.rbgooe.at   Ready      master    90d       v1.19.0+7070803
ocpmaster5132.rbgooe.at   Ready      master    90d       v1.19.0+7070803
ocpnode5133.rbgooe.at     Ready      worker    53d       v1.19.0+7070803
ocpnode5134.rbgooe.at     Ready      worker    53d       v1.19.0+7070803
ocprouter5130.rbgooe.at   Ready      worker    90d       v1.19.0+7070803
ocprouter5131.rbgooe.at   Ready      worker    90d       v1.19.0+7070803

The infra nodes went NotReady a few minutes after I removed the LimitRange.

$ oc get po -o wide
NAME                                          READY   STATUS    RESTARTS   AGE    IP            NODE                      NOMINATED NODE   READINESS GATES
alertmanager-main-0                           5/5     Running   0          178m   10.94.8.11    ocpinfra5131.rbgooe.at    <none>           <none>
alertmanager-main-1                           5/5     Running   0          179m   10.94.6.10    ocpinfra5130.rbgooe.at    <none>           <none>
alertmanager-main-2                           5/5     Running   0          3h     10.94.8.13    ocpinfra5131.rbgooe.at    <none>           <none>
cluster-monitoring-operator-f85f7bcb5-txdkd   2/2     Running   0          98m    10.94.0.12    ocpmaster5130.rbgooe.at   <none>           <none>
grafana-7dbfc78d6-scr6m                       2/2     Running   0          3h1m   10.94.6.4     ocpinfra5130.rbgooe.at    <none>           <none>
kube-state-metrics-5f77599f58-c5lpl           3/3     Running   0          3h1m   10.94.6.23    ocpinfra5130.rbgooe.at    <none>           <none>
node-exporter-4zvm5                           2/2     Running   0          3h1m   10.96.91.22   ocpmaster5132.rbgooe.at   <none>           <none>
node-exporter-6q8zw                           2/2     Running   0          178m   10.96.91.20   ocpmaster5130.rbgooe.at   <none>           <none>
node-exporter-bgn4x                           2/2     Running   0          3h     10.96.89.67   ocprouter5130.rbgooe.at   <none>           <none>
node-exporter-fh9tc                           2/2     Running   0          179m   10.96.89.12   ocpnode5133.rbgooe.at     <none>           <none>
node-exporter-hw929                           2/2     Running   0          179m   10.96.91.24   ocpinfra5131.rbgooe.at    <none>           <none>
node-exporter-mk5wd                           2/2     Running   0          178m   10.96.91.23   ocpinfra5130.rbgooe.at    <none>           <none>
node-exporter-t8rgz                           2/2     Running   0          3h1m   10.96.91.21   ocpmaster5131.rbgooe.at   <none>           <none>
node-exporter-x6kvv                           2/2     Running   0          3h     10.96.89.13   ocpnode5134.rbgooe.at     <none>           <none>
node-exporter-ztkhb                           2/2     Running   0          179m   10.96.89.68   ocprouter5131.rbgooe.at   <none>           <none>
openshift-state-metrics-56d57cf485-sggdl      3/3     Running   0          3h1m   10.94.6.5     ocpinfra5130.rbgooe.at    <none>           <none>
prometheus-adapter-667656bd67-f4lq4           1/1     Running   0          156m   10.94.8.10    ocpinfra5131.rbgooe.at    <none>           <none>
prometheus-adapter-667656bd67-z2rqv           1/1     Running   0          154m   10.94.6.11    ocpinfra5130.rbgooe.at    <none>           <none>
prometheus-k8s-0                              6/6     Running   0          173m   10.94.6.26    ocpinfra5130.rbgooe.at    <none>           <none>
prometheus-k8s-1                              6/6     Running   0          23m    10.94.8.5     ocpinfra5131.rbgooe.at    <none>           <none>
prometheus-operator-7c79b69968-bsqsg          2/2     Running   0          3h1m   10.94.6.7     ocpinfra5130.rbgooe.at    <none>           <none>
telemeter-client-59b89d4cd7-64h28             3/3     Running   0          3h1m   10.94.6.14    ocpinfra5130.rbgooe.at    <none>           <none>
thanos-querier-54f4796ddd-ksw6g               5/5     Running   0          15m    10.94.6.24    ocpinfra5130.rbgooe.at    <none>           <none>
thanos-querier-54f4796ddd-rcdfk               5/5     Running   0          15m    10.94.8.23    ocpinfra5131.rbgooe.at    <none>           <none>

$ oc adm top node
NAME                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ocpinfra5130.rbgooe.at    3951m        112%   23886Mi         105%
ocpmaster5130.rbgooe.at   3189m        91%    8527Mi          57%
ocpmaster5131.rbgooe.at   3986m        113%   8526Mi          57%
ocpmaster5132.rbgooe.at   3355m        95%    6996Mi          47%
ocpnode5133.rbgooe.at     7541m        107%   17531Mi         59%
ocpnode5134.rbgooe.at     6268m        89%    7950Mi          26%
ocprouter5130.rbgooe.at   421m         42%    1928Mi          35%
ocprouter5131.rbgooe.at   250m         25%    2129Mi          39%
ocpinfra5131.rbgooe.at    <unknown>                           <unknown>               <unknown>               <unknown>
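
A note on the percentages above: values over 100% are possible because, as far as I know, `oc adm top node` reports usage relative to the node's allocatable resources rather than its raw capacity, and truncates the result to a whole number. The allocatable CPU of these lab nodes is not shown in this report. The sketch below assumes 3500m, the figure shown for ocpinfra6911 on the test cluster, and checks whether that reproduces the CPU% column for the infra and master nodes:

```python
# CPU usage in millicores, copied from the `oc adm top node` output above.
usage_m = {
    "ocpinfra5130": 3951,
    "ocpmaster5130": 3189,
    "ocpmaster5131": 3986,
    "ocpmaster5132": 3355,
}

# Assumption: allocatable CPU is 3500m, as on the test-cluster infra node.
# The lab nodes' real allocatable values are not part of this report.
ASSUMED_ALLOCATABLE_M = 3500

# Integer division mirrors a truncated (not rounded) percentage.
cpu_percent = {
    node: used * 100 // ASSUMED_ALLOCATABLE_M for node, used in usage_m.items()
}

for node, pct in cpu_percent.items():
    print(f"{node}: {pct}%")
```

Under that assumption the computed values match the reported 112%, 91%, 113% and 95%, which suggests these nodes are using more CPU than is allocatable to pods.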

System memory on the ocpinfra nodes was already increased from 16 GB to 24 GB of RAM, with no effect. Only the LimitRange helps.

Pod metrics are not available either:

$ oc adm top po -n openshift-monitoring
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)

Comment 3 Sergiusz Urbaniak 2021-01-22 19:34:27 UTC

*** This bug has been marked as a duplicate of bug 1906496 ***

