Bug 1992446
Summary:           Node NotReady during upgrade or scaleup or perf test happens on multiple providers & network types & versions
Product:           OpenShift Container Platform
Component:         Monitoring
Version:           4.7
Status:            CLOSED DUPLICATE
Severity:          high
Priority:          unspecified
Reporter:          Qiujie Li <qili>
Assignee:          Simon Pasquier <spasquie>
QA Contact:        Junqi Zhao <juzhao>
CC:                alegrand, amuller, anpicker, aos-bugs, erooth, nagrawal, rphillips
Target Milestone:  ---
Target Release:    ---
Hardware:          Unspecified
OS:                Unspecified
Type:              Bug
Last Closed:       2021-10-12 12:42:11 UTC
Description
Qiujie Li
2021-08-11 06:34:05 UTC
Because of the condition of the cluster, I can't run must-gather:

[must-gather-czqlh] OUT gather logs unavailable: http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
[must-gather-czqlh] OUT waiting for gather to complete
[must-gather-czqlh] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-x2hlp deleted
[must-gather      ] OUT namespace/openshift-must-gather-z6l9x deleted
error: gather never finished for pod must-gather-czqlh: timed out waiting for the condition

Is this on a cluster I can log into to debug?

Sorry, I was on PTO in the past days. Unfortunately I don't have the cluster now. I attached a Grafana screenshot of the average worker nodes' CPU usage. The query behind the Grafana graph was 'uuid.keyword: $uuid AND metricName: "nodeCPU-AggregatedWorkers"'. The full Grafana link is here: http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz123/kube-burner-report-aggregated?orgId=1&from=1628566252651&to=1628566652544&var-Datasource=SVTQE-kube-burner&var-sdn=openshift-sdn&var-job=All&var-uuid=cf852503-8ef9-4622-9a1f-142618f7f23b&var-master=qili46s-z4896-master-0.c.openshift-qe.internal&var-namespace=All&var-verb=All&var-resource=All&var-flowschema=All&var-priority_level=All. I think the load was too big for the cluster, which put the worker nodes' CPU under pressure.

Roshni Pattath told me she tried 4.6.42 -> 4.7.24 -> 4.8.5 and was not able to reproduce the issue. 4.7.24 and 4.8.5 were the latest stable builds, and the upgrade completed in less than 3 hours. That means this issue is not reproduced (or not reproduced every time) on the upgrade path 4.6.42 -> 4.7.24 -> 4.8.5.

@rphillips Thanks for the analysis. Why can 'two prometheus pods running on the same NotReady node' make a node not ready? Is that expected behavior? How do we make users aware of this and help them avoid it?
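For context on why two prometheus pods on one node is suspicious: the monitoring operator prefers to spread the two prometheus-k8s replicas across workers via (soft) pod anti-affinity, so both replicas landing on the same node concentrates the monitoring load there. As a minimal sketch of how one might spot that from saved `oc get pods -A -o wide` output (the sample rows below are abbreviated from this report; the script only parses text and does not talk to a cluster):

```shell
#!/bin/sh
# Check whether both prometheus-k8s replicas landed on the same node,
# using saved `oc get pods -A -o wide` output (column 8 is NODE).
# Sample rows abbreviated from this bug report.
cat > /tmp/pods.txt <<'EOF'
openshift-monitoring  prometheus-k8s-0  7/7  Terminating  2  119m  10.131.0.23  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  prometheus-k8s-1  7/7  Terminating  1  119m  10.131.0.24  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  prometheus-adapter-748974bc4b-qfmhq  1/1  Terminating  0  119m  10.131.0.17  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
EOF
# More than one prometheus-k8s replica per node means the soft
# anti-affinity preference was not honored during scheduling.
awk '$2 ~ /^prometheus-k8s-/ { count[$8]++ }
     END { for (n in count) print n, count[n] }' /tmp/pods.txt
```

During a mass scale-up this can happen legitimately: soft (preferred) anti-affinity is only a scheduler hint, and it is not re-evaluated after pods are placed.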
@rphillips Hi Ryan, sorry, the test environment had already been cleared. I will rerun the test to see if I can reproduce the issue. How can I know which build the patches are in?

@rphillips I hit a similar issue today. On a 4.8.10 AWS OVN IPI cluster, after install, I scaled the worker nodes up from 3 to 120. After that I found one node NotReady; both prometheus pods were on this node, and the pods on this node were stuck in 'Terminating' state and could not be scheduled onto other nodes, which left some cluster operators in an abnormal state. You mentioned 'There are two prometheus pods running on the same NotReady node. This is the most likely cause why the node is notready'. Is this a known issue? Can you tell more about how it happens?

----------------------------
% oc get clusterversions
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.11    True        False         78m     Cluster version is 4.8.11
----------------------------
% oc get nodes | grep NotReady
ip-10-0-142-90.us-east-2.compute.internal   NotReady   worker   118m   v1.21.1+9807387
----------------------------
% oc get pods -A -o wide | egrep -v "Completed| Running"
NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE  IP  NODE  NOMINATED NODE  READINESS GATES
openshift-debug-node-rbs4v  ip-10-0-142-90.us-east-2.compute.internal-debug  0/1  Terminating  0  13m  <none>  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-ingress  router-default-845cc4d4dc-p5xzs  1/1  Terminating  0  120m  10.131.0.13  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-marketplace  certified-operators-f8r65  1/1  Terminating  0  122m  10.131.0.8  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-marketplace  community-operators-j2mc5  1/1  Terminating  0  122m  10.131.0.15  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-marketplace  redhat-operators-r2bfm  1/1  Terminating  0  122m  10.131.0.16  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  alertmanager-main-0  5/5  Terminating  0  119m  10.131.0.18  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  alertmanager-main-1  5/5  Terminating  0  119m  10.131.0.19  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  alertmanager-main-2  5/5  Terminating  0  119m  10.131.0.20  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  grafana-dd56d9c7f-frght  2/2  Terminating  0  119m  10.131.0.21  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  kube-state-metrics-7485cb5695-8j7m9  3/3  Terminating  0  124m  10.131.0.9  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  openshift-state-metrics-65c6597c7-8tzcq  3/3  Terminating  0  124m  10.131.0.14  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  prometheus-adapter-748974bc4b-qfmhq  1/1  Terminating  0  119m  10.131.0.17  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  prometheus-k8s-0  7/7  Terminating  2  119m  10.131.0.23  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  prometheus-k8s-1  7/7  Terminating  1  119m  10.131.0.24  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  telemeter-client-74b79579b-4dfh7  3/3  Terminating  0  124m  10.131.0.11  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  thanos-querier-65dc8646f7-5tnkz  5/5  Terminating  0  119m  10.131.0.22  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-network-diagnostics  network-check-source-6ccd7c5589-kbwt6  1/1  Terminating  0  127m  10.131.0.10  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
----------------------------
% oc describe node ip-10-0-142-90.us-east-2.compute.internal
Name:               ip-10-0-142-90.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-142-90
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-07f5b9600cf909f5c"}
                    k8s.ovn.org/host-addresses: ["10.0.142.90"]
                    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-142-90.us-east-2.compute.internal","mac-address":"02:c2:e9:ee:89:92","ip-address...
                    k8s.ovn.org/node-chassis-id: 453f11fa-2b24-4ebc-acb5-79cbd635fdfd
                    k8s.ovn.org/node-mgmt-port-mac-address: b2:aa:93:1f:8b:9c
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.142.90/19"}
                    k8s.ovn.org/node-subnets: {"default":"10.131.0.0/23"}
                    machine.openshift.io/machine: openshift-machine-api/qili-48-aws-0914-sdct2-worker-us-east-2a-2m9z4
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-85b1a4bcca322341743ffdc8480f170c
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-85b1a4bcca322341743ffdc8480f170c
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 14 Sep 2021 09:36:02 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-142-90.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Tue, 14 Sep 2021 11:05:18 +0800
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----             ------    -----------------                 ------------------                ------             -------
  MemoryPressure   Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready            Unknown   Tue, 14 Sep 2021 11:03:08 +0800   Tue, 14 Sep 2021 11:06:02 +0800   NodeStatusUnknown  Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.0.142.90
  Hostname:     ip-10-0-142-90.us-east-2.compute.internal
  InternalDNS:  ip-10-0-142-90.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           125293548Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16106300Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         3500m
  ephemeral-storage:           115470533646
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      14955324Ki
  pods:                        250
System Info:
  Machine ID:                 ec2b9c749de9fa592bffff2e7b8243af
  System UUID:                ec2b9c74-9de9-fa59-2bff-ff2e7b8243af
  Boot ID:                    b8f66e17-4193-4b19-9ddb-3ae653014828
  Kernel Version:             4.18.0-305.19.1.el8_4.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 48.84.202109090400-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
  Kubelet Version:            v1.21.1+9807387
  Kube-Proxy Version:         v1.21.1+9807387
ProviderID:                   aws:///us-east-2a/i-07f5b9600cf909f5c
Non-terminated Pods: (30 in total)
  Namespace  Name  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------  ----  ------------  ----------  ---------------  -------------  ---
  openshift-cluster-csi-drivers  aws-ebs-csi-driver-node-ljkln  30m (0%)  0 (0%)  150Mi (1%)  0 (0%)  118m
  openshift-cluster-node-tuning-operator  tuned-wkjm9  10m (0%)  0 (0%)  50Mi (0%)  0 (0%)  118m
  openshift-debug-node-rbs4v  ip-10-0-142-90.us-east-2.compute.internal-debug  0 (0%)  0 (0%)  0 (0%)  0 (0%)  11m
  openshift-dns  dns-default-h87pk  60m (1%)  0 (0%)  110Mi (0%)  0 (0%)  117m
  openshift-dns  node-resolver-94zpn  5m (0%)  0 (0%)  21Mi (0%)  0 (0%)  118m
  openshift-image-registry  node-ca-4fpp2  10m (0%)  0 (0%)  10Mi (0%)  0 (0%)  118m
  openshift-ingress-canary  ingress-canary-bvsh4  10m (0%)  0 (0%)  20Mi (0%)  0 (0%)  117m
  openshift-ingress  router-default-845cc4d4dc-p5xzs  100m (2%)  0 (0%)  256Mi (1%)  0 (0%)  118m
  openshift-machine-config-operator  machine-config-daemon-zlbhc  40m (1%)  0 (0%)  100Mi (0%)  0 (0%)  118m
  openshift-marketplace  certified-operators-f8r65  10m (0%)  0 (0%)  50Mi (0%)  0 (0%)  120m
  openshift-marketplace  community-operators-j2mc5  10m (0%)  0 (0%)  50Mi (0%)  0 (0%)  120m
  openshift-marketplace  redhat-operators-r2bfm  10m (0%)  0 (0%)  50Mi (0%)  0 (0%)  120m
  openshift-monitoring  alertmanager-main-0  8m (0%)  0 (0%)  105Mi (0%)  0 (0%)  117m
  openshift-monitoring  alertmanager-main-1  8m (0%)  0 (0%)  105Mi (0%)  0 (0%)  117m
  openshift-monitoring  alertmanager-main-2  8m (0%)  0 (0%)  105Mi (0%)  0 (0%)  117m
  openshift-monitoring  grafana-dd56d9c7f-frght  5m (0%)  0 (0%)  84Mi (0%)  0 (0%)  117m
  openshift-monitoring  kube-state-metrics-7485cb5695-8j7m9  4m (0%)  0 (0%)  110Mi (0%)  0 (0%)  121m
  openshift-monitoring  node-exporter-nkn4f  9m (0%)  0 (0%)  47Mi (0%)  0 (0%)  118m
  openshift-monitoring  openshift-state-metrics-65c6597c7-8tzcq  3m (0%)  0 (0%)  72Mi (0%)  0 (0%)  121m
  openshift-monitoring  prometheus-adapter-748974bc4b-qfmhq  1m (0%)  0 (0%)  40Mi (0%)  0 (0%)  117m
  openshift-monitoring  prometheus-k8s-0  76m (2%)  0 (0%)  1119Mi (7%)  0 (0%)  117m
  openshift-monitoring  prometheus-k8s-1  76m (2%)  0 (0%)  1119Mi (7%)  0 (0%)  117m
  openshift-monitoring  telemeter-client-74b79579b-4dfh7  3m (0%)  0 (0%)  70Mi (0%)  0 (0%)  121m
  openshift-monitoring  thanos-querier-65dc8646f7-5tnkz  14m (0%)  0 (0%)  77Mi (0%)  0 (0%)  117m
  openshift-multus  multus-additional-cni-plugins-knbdn  10m (0%)  0 (0%)  10Mi (0%)  0 (0%)  118m
  openshift-multus  multus-trtjx  10m (0%)  0 (0%)  65Mi (0%)  0 (0%)  118m
  openshift-multus  network-metrics-daemon-s4pvr  20m (0%)  0 (0%)  120Mi (0%)  0 (0%)  118m
  openshift-network-diagnostics  network-check-source-6ccd7c5589-kbwt6  10m (0%)  0 (0%)  40Mi (0%)  0 (0%)  125m
  openshift-network-diagnostics  network-check-target-b62jb  10m (0%)  0 (0%)  15Mi (0%)  0 (0%)  118m
  openshift-ovn-kubernetes  ovnkube-node-8ms5f  40m (1%)  0 (0%)  640Mi (4%)  0 (0%)  118m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         610m (17%)    0 (0%)
  memory                      4810Mi (32%)  0 (0%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  Starting                 118m                 kubelet  Starting kubelet.
  Normal  NodeHasSufficientMemory  118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     118m (x2 over 118m)  kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  118m                 kubelet  Updated Node Allocatable limit across pods
  Normal  NodeReady                117m                 kubelet  Node ip-10-0-142-90.us-east-2.compute.internal status is now: NodeReady
----------------------------
% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.11    True        False         False      79m
baremetal                                  4.8.11    True        False         False      105m
cloud-credential                           4.8.11    True        False         False      116m
cluster-autoscaler                         4.8.11    True        False         False      105m
config-operator                            4.8.11    True        False         False      106m
console                                    4.8.11    True        False         False      93m
csi-snapshot-controller                    4.8.11    True        False         False      106m
dns                                        4.8.11    True        True          False      105m
etcd                                       4.8.11    True        False         False      104m
image-registry                             4.8.11    True        False         False      102m
ingress                                    4.8.11    True        False         False      100m
insights                                   4.8.11    True        False         False      100m
kube-apiserver                             4.8.11    True        False         False      102m
kube-controller-manager                    4.8.11    True        False         False      104m
kube-scheduler                             4.8.11    True        False         False      104m
kube-storage-version-migrator              4.8.11    True        False         False      106m
machine-api                                4.8.11    True        False         False      102m
machine-approver                           4.8.11    True        False         False      106m
machine-config                             4.8.11    False       False         True       3m19s
marketplace                                4.8.11    True        False         False      105m
monitoring                                 4.8.11    False       True          True       6m42s
network                                    4.8.11    True        True          False      107m
node-tuning                                4.8.11    True        False         False      105m
openshift-apiserver                        4.8.11    True        False         False      103m
openshift-controller-manager               4.8.11    True        False         False      98m
openshift-samples                          4.8.11    True        False         False      103m
operator-lifecycle-manager                 4.8.11    True        False         False      106m
operator-lifecycle-manager-catalog         4.8.11    True        False         False      105m
operator-lifecycle-manager-packageserver   4.8.11    True        False         False      103m
service-ca                                 4.8.11    True        False         False      106m
storage                                    4.8.11    True        True          False      105m

@rphillips I reproduced this issue multiple times on different providers (GCP/Azure/AWS), network types (sdn/ovn), and OCP versions (4.6/4.7/4.8). The common thing I can see is, as you mentioned, 'There are two prometheus pods running on the same NotReady node. This is the most likely cause why the node is notready'. So I updated the bug title to better describe this issue. Is this a known issue, that two prometheus pods running on the same node can make it NotReady? Can you tell more about how it happens?

Given that the node is broken, whether because two prometheus pods are running on the same node or for some other reason, why couldn't the pods stuck in 'Terminating' state on it be discovered and rescheduled to other healthy nodes?

@rphillips I can reproduce this when

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
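On why the Terminating pods never move: as I understand the general Kubernetes behavior (this is standard kubelet/controller semantics, not a conclusion confirmed in this bug), a pod on an unreachable node stays in Terminating because the kubelet can never acknowledge the deletion, and the StatefulSet controller deliberately does not create a replacement prometheus-k8s pod until the old API object is gone, to guarantee at-most-one instance per ordinal. A minimal sketch of a manual workaround, which only prints the force-delete commands rather than running them (sample rows abbreviated from this report):

```shell
#!/bin/sh
# Generate (not run) force-delete commands for pods stuck in Terminating
# on an unreachable node. `oc delete --grace-period=0 --force` removes the
# pod object from the API without waiting for the dead kubelet, which lets
# the StatefulSet controller reschedule prometheus-k8s-* on a healthy node.
NODE=ip-10-0-142-90.us-east-2.compute.internal
cat > /tmp/stuck.txt <<'EOF'
openshift-monitoring  prometheus-k8s-0  7/7  Terminating  2  119m  10.131.0.23  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
openshift-monitoring  prometheus-k8s-1  7/7  Terminating  1  119m  10.131.0.24  ip-10-0-142-90.us-east-2.compute.internal  <none>  <none>
EOF
# Column 4 is STATUS, column 8 is NODE in `oc get pods -A -o wide` output.
awk -v node="$NODE" '$4 == "Terminating" && $8 == node {
  printf "oc delete pod -n %s %s --grace-period=0 --force\n", $1, $2
}' /tmp/stuck.txt
```

Force deletion of StatefulSet pods has its own risks (it skips graceful shutdown), so deleting the Node object, which removes all its pods at once, is the other common way to unstick this state.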