Bug 1986375
| Summary: | Avoid CMO being degraded when some nodes aren't available | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Prashant Balachandran <pnair> |
| Component: | Monitoring | Assignee: | Prashant Balachandran <pnair> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | amuller, anpicker, aos-bugs, arajkuma, erooth, jgato |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:41:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Prashant Balachandran 2021-07-27 12:04:02 UTC
tested with 4.9.0-0.nightly-2021-07-28-181504, monitoring won't be reported as DEGRADED due to offline/unavailable nodes, see the steps below
set one node to SchedulingDisabled so that other pods are not affected
# oc adm cordon ip-10-0-217-156.us-east-2.compute.internal
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME STATUS ROLES AGE VERSION
ip-10-0-217-156.us-east-2.compute.internal Ready,SchedulingDisabled worker 4h45m v1.21.1+8268f88
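(not part of the original steps: once the whole verification is finished, the node can be made schedulable again with an uncordon)
# oc adm uncordon ip-10-0-217-156.us-east-2.compute.internal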
scale down cluster-version-operator/cluster-monitoring-operator so they do not recreate it, then remove the node-exporter daemonset
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=0
# oc -n openshift-monitoring delete daemonset node-exporter
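(a quick extra check, not in the original steps: confirm the operator is scaled down and the daemonset is gone; the second command should return NotFound after the delete)
# oc -n openshift-monitoring get deploy cluster-monitoring-operator
# oc -n openshift-monitoring get ds node-exporter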
make sure other pods are normal
# oc -n openshift-monitoring get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-main-0 5/5 Running 0 5h8m 10.129.2.8 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
alertmanager-main-1 5/5 Running 0 25m 10.129.2.16 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
alertmanager-main-2 5/5 Running 0 5h8m 10.131.0.17 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
grafana-6c679c5748-vct2g 2/2 Running 0 5h8m 10.129.2.9 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
kube-state-metrics-59f44f65fb-qgghv 3/3 Running 0 5h12m 10.131.0.15 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
openshift-state-metrics-78c5465bcd-bkndb 3/3 Running 0 5h12m 10.131.0.7 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-adapter-7d6b95dd6-cbv7h 1/1 Running 0 58m 10.131.0.122 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-adapter-7d6b95dd6-zgflb 1/1 Running 0 5h8m 10.129.2.7 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
prometheus-k8s-0 7/7 Running 0 26m 10.131.0.137 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
prometheus-k8s-1 7/7 Running 0 5h7m 10.129.2.11 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
prometheus-operator-cd5899dbc-trcpx 2/2 Running 1 5h14m 10.128.0.40 ip-10-0-176-171.us-east-2.compute.internal <none> <none>
telemeter-client-567dc564fd-pvpcp 3/3 Running 0 5h12m 10.131.0.16 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
thanos-querier-865d44b845-58cnf 5/5 Running 0 4h11m 10.129.2.13 ip-10-0-134-137.us-east-2.compute.internal <none> <none>
thanos-querier-865d44b845-hrxvb 5/5 Running 0 4h11m 10.131.0.40 ip-10-0-176-74.us-east-2.compute.internal <none> <none>
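(the same filter used in a later step also works here to confirm nothing is abnormal before stopping kubelet)
# oc -n openshift-monitoring get pod -o wide | grep -Ev "Running|Completed"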
use the following script to stop kubelet, sleep for 20m, then start it again
# oc debug node/ip-10-0-217-156.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# chmod +x /tmp/run.sh
sh-4.4# /tmp/run.sh &
cat /tmp/run.sh
**************
systemctl stop kubelet
sleep 20m
systemctl start kubelet
**************
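(how /tmp/run.sh got onto the node is not shown above; one way to create it from the chrooted debug shell is a heredoc, for example)
sh-4.4# cat <<'EOF' > /tmp/run.sh
systemctl stop kubelet
sleep 20m
systemctl start kubelet
EOF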
scale up cluster-version-operator/cluster-monitoring-operator
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=1
# oc -n openshift-monitoring scale deploy cluster-monitoring-operator --replicas=1
# oc get node ip-10-0-217-156.us-east-2.compute.internal
NAME STATUS ROLES AGE VERSION
ip-10-0-217-156.us-east-2.compute.internal NotReady,SchedulingDisabled worker 5h12m v1.21.1+8268f88
make sure only the node-exporter pod scheduled on the NotReady node is abnormal
# oc -n openshift-monitoring get pod -o wide | grep -Ev "Running|Completed"
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-l884h 0/2 Pending 0 2m58s <none> ip-10-0-217-156.us-east-2.compute.internal <none> <none>
watch for a while, monitoring won't be reported as DEGRADED
# node="ip-10-0-217-156.us-east-2.compute.internal"; while true; do oc get node ${node}; oc -n openshift-monitoring get pod -o wide | grep node-exporter | grep ${node}; oc get co monitoring; oc -n openshift-monitoring get ds;sleep 20s; done
...
NAME STATUS ROLES AGE VERSION
ip-10-0-217-156.us-east-2.compute.internal NotReady,SchedulingDisabled worker 5h19m v1.21.1+8268f88
node-exporter-l884h 0/2 Pending 0 7m32s <none> ip-10-0-217-156.us-east-2.compute.internal <none> <none>
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
monitoring 4.9.0-0.nightly-2021-07-28-181504 True False False 35m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
node-exporter 6 6 5 6 5 kubernetes.io/os=linux 7m42s
...
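the Degraded condition can also be inspected directly (an extra check, not part of the original loop); it should keep returning False while the node stays NotReady
# oc get co monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'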
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

We have found the same issue on 4.8.44. Was this backported to 4.8? Thanks.