Bug 1992446
| Summary: | Node NotReady during upgrade or scaleup or perf test happens on multiple providers & network types & versions | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qiujie Li <qili> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED DUPLICATE | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | alegrand, amuller, anpicker, aos-bugs, erooth, nagrawal, rphillips |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-12 12:42:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Qiujie Li
2021-08-11 06:34:05 UTC
Because of the condition of the cluster, I can't run must-gather:

[must-gather-czqlh] OUT gather logs unavailable: http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
[must-gather-czqlh] OUT waiting for gather to complete
[must-gather-czqlh] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-x2hlp deleted
[must-gather      ] OUT namespace/openshift-must-gather-z6l9x deleted
error: gather never finished for pod must-gather-czqlh: timed out waiting for the condition

Is this on a cluster I can log into to debug?

Sorry, I was on PTO in the past days. Unfortunately I don't have the cluster now. I attached the Grafana screenshot of the average worker nodes' CPU usage. The query behind the graph was 'uuid.keyword: $uuid AND metricName: "nodeCPU-AggregatedWorkers"'. The full Grafana link is here: http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz123/kube-burner-report-aggregated?orgId=1&from=1628566252651&to=1628566652544&var-Datasource=SVTQE-kube-burner&var-sdn=openshift-sdn&var-job=All&var-uuid=cf852503-8ef9-4622-9a1f-142618f7f23b&var-master=qili46s-z4896-master-0.c.openshift-qe.internal&var-namespace=All&var-verb=All&var-resource=All&var-flowschema=All&var-priority_level=All. I think the load was too big for the cluster, which put the worker nodes' CPU under pressure.

Roshni Pattath told me she tried 4.6.42 -> 4.7.24 -> 4.8.5 and was not able to reproduce the issue. 4.7.24 and 4.8.5 were the latest stable builds, and the upgrade completed in less than 3 hours. That means this issue is not reproduced (or not reproduced every time) on the upgrade path 4.6.42 -> 4.7.24 -> 4.8.5.

@rphillips Thanks for the analysis. Why can 'two prometheus pods running on the same NotReady node' make a node NotReady? Is that expected behavior? How do we make users aware of this and help them avoid it?

@rphillips Hi Ryan, sorry, the test environment had already been cleared. I will rerun the test to see if I can reproduce the issue. How can I know which build the patches are in?

@rphillips I hit a similar issue today. On a 4.8.10 AWS OVN IPI cluster, after install, I scaled the worker nodes up from 3 to 120. After that I found one node NotReady, with both prometheus pods on that node; the pods on the node were stuck in the 'Terminating' state and could not be scheduled on other nodes, which left some cluster operators in an abnormal state.
You've mentioned 'There are two prometheus pods running on the same NotReady node. This is the most likely cause why the node is notready'. Is this a known issue? Can you tell more about how it happens?
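To check whether both Prometheus replicas really landed on the same worker, and whether the StatefulSet's pod anti-affinity is only a soft preference (which would let the scheduler co-locate them when the cluster is busy), something like the commands below can be used. This is a minimal sketch with standard oc commands; whether this particular 4.8 build ships prometheus-k8s with preferred (soft) rather than required (hard) anti-affinity is an assumption to verify from the output.

# Where are the prometheus-k8s replicas scheduled?
oc -n openshift-monitoring get pods -o wide | grep prometheus-k8s

# Is the pod anti-affinity only "preferredDuringScheduling..." (soft)?
oc -n openshift-monitoring get statefulset prometheus-k8s -o yaml | grep -B2 -A12 'podAntiAffinity'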
----------------------------
% oc get clusterversions
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.11 True False 78m Cluster version is 4.8.11
----------------------------
% oc get nodes | grep NotReady
ip-10-0-142-90.us-east-2.compute.internal NotReady worker 118m v1.21.1+9807387
----------------------------
% oc get pods -A -o wide | egrep -v "Completed| Running"
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openshift-debug-node-rbs4v ip-10-0-142-90.us-east-2.compute.internal-debug 0/1 Terminating 0 13m <none> ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-ingress router-default-845cc4d4dc-p5xzs 1/1 Terminating 0 120m 10.131.0.13 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-marketplace certified-operators-f8r65 1/1 Terminating 0 122m 10.131.0.8 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-marketplace community-operators-j2mc5 1/1 Terminating 0 122m 10.131.0.15 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-marketplace redhat-operators-r2bfm 1/1 Terminating 0 122m 10.131.0.16 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring alertmanager-main-0 5/5 Terminating 0 119m 10.131.0.18 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring alertmanager-main-1 5/5 Terminating 0 119m 10.131.0.19 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring alertmanager-main-2 5/5 Terminating 0 119m 10.131.0.20 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring grafana-dd56d9c7f-frght 2/2 Terminating 0 119m 10.131.0.21 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring kube-state-metrics-7485cb5695-8j7m9 3/3 Terminating 0 124m 10.131.0.9 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring openshift-state-metrics-65c6597c7-8tzcq 3/3 Terminating 0 124m 10.131.0.14 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring prometheus-adapter-748974bc4b-qfmhq 1/1 Terminating 0 119m 10.131.0.17 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring prometheus-k8s-0 7/7 Terminating 2 119m 10.131.0.23 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring prometheus-k8s-1 7/7 Terminating 1 119m 10.131.0.24 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring telemeter-client-74b79579b-4dfh7 3/3 Terminating 0 124m 10.131.0.11 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-monitoring thanos-querier-65dc8646f7-5tnkz 5/5 Terminating 0 119m 10.131.0.22 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
openshift-network-diagnostics network-check-source-6ccd7c5589-kbwt6 1/1 Terminating 0 127m 10.131.0.10 ip-10-0-142-90.us-east-2.compute.internal <none> <none>
----------------------------
% oc describe node ip-10-0-142-90.us-east-2.compute.internal
Name: ip-10-0-142-90.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2a
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-142-90
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m5.xlarge
node.openshift.io/os_id=rhcos
topology.ebs.csi.aws.com/zone=us-east-2a
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2a
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-07f5b9600cf909f5c"}
k8s.ovn.org/host-addresses: ["10.0.142.90"]
k8s.ovn.org/l3-gateway-config:
{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-142-90.us-east-2.compute.internal","mac-address":"02:c2:e9:ee:89:92","ip-address...
k8s.ovn.org/node-chassis-id: 453f11fa-2b24-4ebc-acb5-79cbd635fdfd
k8s.ovn.org/node-mgmt-port-mac-address: b2:aa:93:1f:8b:9c
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.142.90/19"}
k8s.ovn.org/node-subnets: {"default":"10.131.0.0/23"}
machine.openshift.io/machine: openshift-machine-api/qili-48-aws-0914-sdct2-worker-us-east-2a-2m9z4
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-85b1a4bcca322341743ffdc8480f170c
machineconfiguration.openshift.io/desiredConfig: rendered-worker-85b1a4bcca322341743ffdc8480f170c
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 14 Sep 2021 09:36:02 +0800
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-10-0-142-90.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Tue, 14 Sep 2021 11:05:18 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Tue, 14 Sep 2021 11:03:08 +0800 Tue, 14 Sep 2021 11:06:02 +0800 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Tue, 14 Sep 2021 11:03:08 +0800 Tue, 14 Sep 2021 11:06:02 +0800 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Tue, 14 Sep 2021 11:03:08 +0800 Tue, 14 Sep 2021 11:06:02 +0800 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Tue, 14 Sep 2021 11:03:08 +0800 Tue, 14 Sep 2021 11:06:02 +0800 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.0.142.90
Hostname: ip-10-0-142-90.us-east-2.compute.internal
InternalDNS: ip-10-0-142-90.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 4
ephemeral-storage: 125293548Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16106300Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 3500m
ephemeral-storage: 115470533646
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 14955324Ki
pods: 250
System Info:
Machine ID: ec2b9c749de9fa592bffff2e7b8243af
System UUID: ec2b9c74-9de9-fa59-2bff-ff2e7b8243af
Boot ID: b8f66e17-4193-4b19-9ddb-3ae653014828
Kernel Version: 4.18.0-305.19.1.el8_4.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 48.84.202109090400-0 (Ootpa)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
Kubelet Version: v1.21.1+9807387
Kube-Proxy Version: v1.21.1+9807387
ProviderID: aws:///us-east-2a/i-07f5b9600cf909f5c
Non-terminated Pods: (30 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-ljkln 30m (0%) 0 (0%) 150Mi (1%) 0 (0%) 118m
openshift-cluster-node-tuning-operator tuned-wkjm9 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 118m
openshift-debug-node-rbs4v ip-10-0-142-90.us-east-2.compute.internal-debug 0 (0%) 0 (0%) 0 (0%) 0 (0%) 11m
openshift-dns dns-default-h87pk 60m (1%) 0 (0%) 110Mi (0%) 0 (0%) 117m
openshift-dns node-resolver-94zpn 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 118m
openshift-image-registry node-ca-4fpp2 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 118m
openshift-ingress-canary ingress-canary-bvsh4 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 117m
openshift-ingress router-default-845cc4d4dc-p5xzs 100m (2%) 0 (0%) 256Mi (1%) 0 (0%) 118m
openshift-machine-config-operator machine-config-daemon-zlbhc 40m (1%) 0 (0%) 100Mi (0%) 0 (0%) 118m
openshift-marketplace certified-operators-f8r65 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 120m
openshift-marketplace community-operators-j2mc5 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 120m
openshift-marketplace redhat-operators-r2bfm 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 120m
openshift-monitoring alertmanager-main-0 8m (0%) 0 (0%) 105Mi (0%) 0 (0%) 117m
openshift-monitoring alertmanager-main-1 8m (0%) 0 (0%) 105Mi (0%) 0 (0%) 117m
openshift-monitoring alertmanager-main-2 8m (0%) 0 (0%) 105Mi (0%) 0 (0%) 117m
openshift-monitoring grafana-dd56d9c7f-frght 5m (0%) 0 (0%) 84Mi (0%) 0 (0%) 117m
openshift-monitoring kube-state-metrics-7485cb5695-8j7m9 4m (0%) 0 (0%) 110Mi (0%) 0 (0%) 121m
openshift-monitoring node-exporter-nkn4f 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 118m
openshift-monitoring openshift-state-metrics-65c6597c7-8tzcq 3m (0%) 0 (0%) 72Mi (0%) 0 (0%) 121m
openshift-monitoring prometheus-adapter-748974bc4b-qfmhq 1m (0%) 0 (0%) 40Mi (0%) 0 (0%) 117m
openshift-monitoring prometheus-k8s-0 76m (2%) 0 (0%) 1119Mi (7%) 0 (0%) 117m
openshift-monitoring prometheus-k8s-1 76m (2%) 0 (0%) 1119Mi (7%) 0 (0%) 117m
openshift-monitoring telemeter-client-74b79579b-4dfh7 3m (0%) 0 (0%) 70Mi (0%) 0 (0%) 121m
openshift-monitoring thanos-querier-65dc8646f7-5tnkz 14m (0%) 0 (0%) 77Mi (0%) 0 (0%) 117m
openshift-multus multus-additional-cni-plugins-knbdn 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 118m
openshift-multus multus-trtjx 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 118m
openshift-multus network-metrics-daemon-s4pvr 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 118m
openshift-network-diagnostics network-check-source-6ccd7c5589-kbwt6 10m (0%) 0 (0%) 40Mi (0%) 0 (0%) 125m
openshift-network-diagnostics network-check-target-b62jb 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 118m
openshift-ovn-kubernetes ovnkube-node-8ms5f 40m (1%) 0 (0%) 640Mi (4%) 0 (0%) 118m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 610m (17%) 0 (0%)
memory 4810Mi (32%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 118m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 118m (x2 over 118m) kubelet Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 118m (x2 over 118m) kubelet Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 118m (x2 over 118m) kubelet Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 118m kubelet Updated Node Allocatable limit across pods
Normal NodeReady 117m kubelet Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeReady
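For context, and assuming standard Kubernetes node-lifecycle behavior rather than anything specific to this bug: once the kubelet stops posting status, the node controller sets the conditions above to Unknown and applies the node.kubernetes.io/unreachable taints; taint-based eviction then marks the pods for deletion, but the kubelet on the unreachable node can never acknowledge termination, so the pods stay in 'Terminating' indefinitely. A quick way to confirm a pod is only awaiting kubelet confirmation is to check that its deletionTimestamp is set:

# deletionTimestamp is set, but the pod object is never removed because the
# kubelet on the unreachable node cannot report the containers as stopped
oc -n openshift-monitoring get pod prometheus-k8s-0 -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'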
----------------------------
% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.11 True False False 79m
baremetal 4.8.11 True False False 105m
cloud-credential 4.8.11 True False False 116m
cluster-autoscaler 4.8.11 True False False 105m
config-operator 4.8.11 True False False 106m
console 4.8.11 True False False 93m
csi-snapshot-controller 4.8.11 True False False 106m
dns 4.8.11 True True False 105m
etcd 4.8.11 True False False 104m
image-registry 4.8.11 True False False 102m
ingress 4.8.11 True False False 100m
insights 4.8.11 True False False 100m
kube-apiserver 4.8.11 True False False 102m
kube-controller-manager 4.8.11 True False False 104m
kube-scheduler 4.8.11 True False False 104m
kube-storage-version-migrator 4.8.11 True False False 106m
machine-api 4.8.11 True False False 102m
machine-approver 4.8.11 True False False 106m
machine-config 4.8.11 False False True 3m19s
marketplace 4.8.11 True False False 105m
monitoring 4.8.11 False True True 6m42s
network 4.8.11 True True False 107m
node-tuning 4.8.11 True False False 105m
openshift-apiserver 4.8.11 True False False 103m
openshift-controller-manager 4.8.11 True False False 98m
openshift-samples 4.8.11 True False False 103m
operator-lifecycle-manager 4.8.11 True False False 106m
operator-lifecycle-manager-catalog 4.8.11 True False False 105m
operator-lifecycle-manager-packageserver 4.8.11 True False False 103m
service-ca 4.8.11 True False False 106m
storage 4.8.11 True True False 105m
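To see why monitoring and machine-config report Available=False/Degraded=True here, the operator conditions can be dumped directly; this is a generic oc query, not something specific to this bug:

# Print each condition of the monitoring ClusterOperator with its message
oc get clusteroperator monitoring -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'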
@rphillips I reproduced this issue multiple times on different providers (GCP/Azure/AWS), network types (sdn/ovn), and OCP versions (4.6/4.7/4.8). The common thing I can see is what you've mentioned: 'There are two prometheus pods running on the same NotReady node. This is the most likely cause why the node is notready'. So I updated the bug title to better describe this issue. Is it a known issue that two prometheus pods running on the same node can make it NotReady? Can you tell more about how it happens? And whether the node is broken because two prometheus pods are running on the same node or for some other reason, why couldn't the pods stuck in the 'Terminating' state on it be discovered and rescheduled to other healthy nodes?

@rphillips I can reproduce this when

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
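Regarding the question above about why the 'Terminating' pods are never rescheduled: one possible manual recovery path, sketched here as an assumption rather than the assignee's recommended fix, is to force-delete the stuck pods so their controllers recreate them on healthy nodes, or to delete the backing Machine so machine-api replaces the node entirely. Force-deleting StatefulSet pods (prometheus-k8s, alertmanager-main) carries the usual risk of briefly running duplicates if the node comes back.

# Force-remove the stuck prometheus pods so the StatefulSet can reschedule them
oc -n openshift-monitoring delete pod prometheus-k8s-0 prometheus-k8s-1 --force --grace-period=0

# Or replace the whole node via machine-api (machine name taken from the node's
# machine.openshift.io/machine annotation shown in the describe output above)
oc -n openshift-machine-api delete machine qili-48-aws-0914-sdct2-worker-us-east-2a-2m9z4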