Bug 1986375
| Summary: | Avoid CMO being degraded when some nodes aren't available | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Prashant Balachandran <pnair> |
| Component: | Monitoring | Assignee: | Prashant Balachandran <pnair> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | amuller, anpicker, aos-bugs, arajkuma, erooth, jgato |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:41:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
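As a side note (not part of the original report), the summary refers to the Degraded condition of the `monitoring` ClusterOperator. One way to read just that condition and its message from the CLI is a jsonpath query such as the following; the exact expression is my own sketch:

```bash
# Show only the Degraded condition (status + message) of the monitoring ClusterOperator.
# Illustrative query, not taken from the bug report.
oc get co monitoring \
  -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.status}{"\t"}{.message}{"\n"}{end}'
```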
Description
Prashant Balachandran
2021-07-27 12:04:02 UTC
Tested with 4.9.0-0.nightly-2021-07-28-181504: monitoring is not reported as DEGRADED due to offline/unavailable nodes. Steps below.

Set one node to SchedulingDisabled so other pods are not affected:

```
# oc adm cordon ip-10-0-217-156.us-east-2.compute.internal
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   4h45m   v1.21.1+8268f88
```

Scale down cluster-version-operator and cluster-monitoring-operator, and remove the node-exporter daemonset:

```
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=0
# oc -n openshift-monitoring delete daemonset node-exporter
```

Make sure the other pods are normal:

```
# oc -n openshift-monitoring get pod -o wide
NAME                                       READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
alertmanager-main-0                        5/5     Running   0          5h8m    10.129.2.8     ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
alertmanager-main-1                        5/5     Running   0          25m     10.129.2.16    ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
alertmanager-main-2                        5/5     Running   0          5h8m    10.131.0.17    ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
grafana-6c679c5748-vct2g                   2/2     Running   0          5h8m    10.129.2.9     ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
kube-state-metrics-59f44f65fb-qgghv        3/3     Running   0          5h12m   10.131.0.15    ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
openshift-state-metrics-78c5465bcd-bkndb   3/3     Running   0          5h12m   10.131.0.7     ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
prometheus-adapter-7d6b95dd6-cbv7h         1/1     Running   0          58m     10.131.0.122   ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
prometheus-adapter-7d6b95dd6-zgflb         1/1     Running   0          5h8m    10.129.2.7     ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
prometheus-k8s-0                           7/7     Running   0          26m     10.131.0.137   ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
prometheus-k8s-1                           7/7     Running   0          5h7m    10.129.2.11    ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
prometheus-operator-cd5899dbc-trcpx        2/2     Running   1          5h14m   10.128.0.40    ip-10-0-176-171.us-east-2.compute.internal   <none>           <none>
telemeter-client-567dc564fd-pvpcp          3/3     Running   0          5h12m   10.131.0.16    ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
thanos-querier-865d44b845-58cnf            5/5     Running   0          4h11m   10.129.2.13    ip-10-0-134-137.us-east-2.compute.internal   <none>           <none>
thanos-querier-865d44b845-hrxvb            5/5     Running   0          4h11m   10.131.0.40    ip-10-0-176-74.us-east-2.compute.internal    <none>           <none>
```

Use the following script to stop kubelet, sleep for 20 minutes, then start it again:

```
# oc debug node/ip-10-0-217-156.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# chmod +x /tmp/run.sh
sh-4.4# /tmp/run.sh &

cat /tmp/run.sh
**************
systemctl stop kubelet
sleep 20m
systemctl start kubelet
**************
```

Scale cluster-version-operator and cluster-monitoring-operator back up:

```
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=1
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=1
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   5h12m   v1.21.1+8268f88
```

Make sure the only abnormal pod is the node-exporter pod scheduled on the NotReady node:

```
# oc -n openshift-monitoring get pod -o wide | grep -Ev "Running|Completed"
NAME                  READY   STATUS    RESTARTS   AGE     IP       NODE                                         NOMINATED NODE   READINESS GATES
node-exporter-l884h   0/2     Pending   0          2m58s   <none>   ip-10-0-217-156.us-east-2.compute.internal   <none>           <none>
```

Watch for a while; monitoring is not reported as DEGRADED:

```
# node="ip-10-0-217-156.us-east-2.compute.internal"; while true; do oc get node ${node}; oc -n openshift-monitoring get pod -o wide | grep node-exporter | grep ${node}; oc get co monitoring; oc -n openshift-monitoring get ds;sleep 20s; done
...
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-217-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   5h19m   v1.21.1+8268f88
node-exporter-l884h   0/2   Pending   0   7m32s   <none>   ip-10-0-217-156.us-east-2.compute.internal   <none>   <none>
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.9.0-0.nightly-2021-07-28-181504   True        False         False      35m
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         5       6            5           kubernetes.io/os=linux   7m42s
...
```
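The behaviour verified above (missing node-exporter pods that map to cordoned or NotReady nodes do not flip the operator to Degraded) can be approximated from the CLI. The following is a minimal sketch of that comparison, assuming `jq` is available; the counter names come from the standard DaemonSet status, but the script itself is illustrative and is not the operator's actual logic:

```bash
# Illustrative check (not the operator's code): is the node-exporter shortfall
# fully explained by the number of unavailable (cordoned or NotReady) nodes?
desired=$(oc -n openshift-monitoring get ds node-exporter \
  -o jsonpath='{.status.desiredNumberScheduled}')
available=$(oc -n openshift-monitoring get ds node-exporter \
  -o jsonpath='{.status.numberAvailable}')
unavailable_nodes=$(oc get nodes -o json | jq '[.items[]
  | select((.spec.unschedulable == true)
      or ([.status.conditions[] | select(.type=="Ready" and .status!="True")] | length > 0))]
  | length')
missing=$(( desired - ${available:-0} ))
if [ "$missing" -le "$unavailable_nodes" ]; then
  echo "shortfall of $missing pod(s) is explained by $unavailable_nodes unavailable node(s)"
else
  echo "node-exporter is missing on otherwise available nodes"
fi
```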
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

We have found the same issue on 4.8.44. Was this backported to 4.8? Thanks.
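Regarding the 4.8.44 question above: one way to see which cluster-monitoring-operator commit a given z-stream release ships (and therefore whether a particular fix is included) is to inspect the release image. The pull spec below is the conventional published location and is an assumption here, not quoted from the bug:

```bash
# List the cluster-monitoring-operator commit included in the 4.8.44 release image,
# then compare it against the fix in the openshift/cluster-monitoring-operator repo.
oc adm release info quay.io/openshift-release-dev/ocp-release:4.8.44-x86_64 \
  --commits | grep cluster-monitoring-operator
```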