Bug 1998260
| Summary: | machine-config operator not available after node crash test | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon <skordas> |
| Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, jkyros, kgarriso, mkrejci |
| Version: | 4.10 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-10 15:23:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Simon
2021-08-26 17:28:30 UTC
The kubelet status from your node describe makes it look like you have more problems than just the MCO being degraded, and that node ip-10-0-149-240.us-east-2.compute.internal might not be okay:

```
Conditions:
  Type            Status   LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----            ------   -----------------                 ------------------                ------             -------
  MemoryPressure  Unknown  Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure    Unknown  Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure     Unknown  Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready           Unknown  Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown  Kubelet stopped posting node status.
```

Also, with the exception of tuned, it looks like none of your pods on ip-10-0-149-240.us-east-2.compute.internal are ready:

```
[jkyros@jkyros-t590 masters]$ omg get pods -A -o wide | grep 'ip-10-0-149-240.us-east-2.compute.internal'
default                                  ip-10-0-149-240us-east-2computeinternal-debug  0/1  Pending  0  35m    10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-cluster-csi-drivers            aws-ebs-csi-driver-node-wch9k                  0/3  Running  2  5h41m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-cluster-node-tuning-operator   tuned-jjvtw                                    1/1  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-dns                            dns-default-dcxtb                              0/2  Running  2  5h39m                ip-10-0-149-240.us-east-2.compute.internal
openshift-dns                            node-resolver-pvbxx                            0/1  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-image-registry                 image-registry-568957b9d6-rdkdx                0/1  Running  2  5h40m                ip-10-0-149-240.us-east-2.compute.internal
openshift-image-registry                 node-ca-875ck                                  0/1  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-ingress-canary                 ingress-canary-8zj6c                           0/1  Running  2  5h38m                ip-10-0-149-240.us-east-2.compute.internal
openshift-ingress                        router-default-65bdc775fd-dwrbr                0/1  Running  2  5h38m                ip-10-0-149-240.us-east-2.compute.internal
openshift-machine-config-operator        machine-config-daemon-ngrhx                    0/2  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-marketplace                    certified-operators-f6smg                      0/1  Running  2  55m                  ip-10-0-149-240.us-east-2.compute.internal
openshift-marketplace                    community-operators-rrmq2                      0/1  Running  2  5h47m                ip-10-0-149-240.us-east-2.compute.internal
openshift-marketplace                    redhat-marketplace-vck2g                       0/1  Running  2  5h47m                ip-10-0-149-240.us-east-2.compute.internal
openshift-marketplace                    redhat-operators-n2kvh                         0/1  Running  2  1h20m                ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     alertmanager-main-2                            0/5  Running  2  5h37m                ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     kube-state-metrics-59b87859b8-kgw7r            0/3  Running  2  5h47m                ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     node-exporter-qm46v                            0/2  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     openshift-state-metrics-66585c8c7c-hdv4x       0/3  Running  2  5h47m                ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     telemeter-client-6bd9bc5f84-557hq              0/3  Running  2  5h47m                ip-10-0-149-240.us-east-2.compute.internal
openshift-multus                         multus-additional-cni-plugins-pg4qm            0/1  Pending  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-multus                         multus-bx467                                   0/1  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-multus                         network-metrics-daemon-c447r                   0/2  Running  2  5h40m                ip-10-0-149-240.us-east-2.compute.internal
openshift-network-diagnostics            network-check-source-75749bc6b4-c8jrf          0/1  Running  2  5h51m                ip-10-0-149-240.us-east-2.compute.internal
openshift-network-diagnostics            network-check-target-tm2fs                     0/1  Running  2  5h40m                ip-10-0-149-240.us-east-2.compute.internal
openshift-ovn-kubernetes                 ovnkube-node-p2bmb                             0/4  Running  3  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
```

This doesn't look like the MCO is the cause -- it looks more like the MCO is a victim here and is complaining about it.

Can you still get into the "problem node" ip-10-0-149-240.us-east-2.compute.internal? Would you be able to upload the journal logs or take a sosreport of that node so we can see what's going on in there?
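For reference, assuming the node is still reachable, the journal and a sosreport can typically be gathered along these lines. This is a sketch of the standard OpenShift node-debug flow, not output attached to this bug; exact flags vary by release:

```
# Kubelet journal via the API server (only works while the kubelet responds):
$ oc adm node-logs ip-10-0-149-240.us-east-2.compute.internal -u kubelet

# Or get a shell on the node itself and read the journal directly:
$ oc debug node/ip-10-0-149-240.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# journalctl -b -u kubelet -u crio --no-pager

# A sosreport on RHCOS is run from the toolbox container
# (older sos versions use `sosreport` instead of `sos report`):
sh-4.4# toolbox
# sos report -k crio.all=on -k crio.logs=on
```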
---

This hit in the middle of whatever stability issues were happening with [1], and it appeared to resolve with later nightlies. I suspect that the root cause here was similar (and regardless, outside the MCO). I'm going to close this, as I believe the underlying problems have been resolved, but if you manage to reproduce it on a current nightly, please reopen it. Thanks!

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1997905

---

Unfortunately I'm getting the same results with the 4.10.0-0.nightly-2022-03-09-224546 version.

```
$ oc version
Client Version: 4.9.0-0.nightly-2021-07-20-014024
Server Version: 4.10.0-0.nightly-2022-03-09-224546
Kubernetes Version: v1.23.3+e419edf
```

---

Simon,

Can you clarify what exactly the bug is? Is it that the MCO is degraded when a node crashes and is unavailable (which would be expected behaviour), or something else? Secondly, can you explain what you are doing to your clusters, i.e. this is a crash test and it seems like the node crashed as intended? We need some extra details here so we can figure out what exactly is going on and whether this is a bug or not. Thanks!

---

Sorry for this one. That was a retest. Next time when I see the same issue I'll start from the node...
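For anyone retesting later: the bug does not record the exact crash-test procedure Simon used, but a common way to simulate a hard node crash and then verify that the node and the MCO recover looks roughly like this (a sketch; the node name is simply the one from this report):

```
# Trigger a kernel panic on the node (may require kernel.sysrq to permit it,
# e.g. sysctl -w kernel.sysrq=1, depending on the host configuration):
$ oc debug node/ip-10-0-149-240.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# echo c > /proc/sysrq-trigger

# After the node reboots, verify it re-registers and the operators settle:
$ oc get nodes
$ oc get co machine-config
$ oc get mcp
```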