Bug 1967514
| Summary: | Unable to apply 4.6.30: the cluster operator monitoring is degraded | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Petr Balogh <pbalogh> |
| Component: | Networking | Assignee: | Alexander Constantinescu <aconstan> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | aconstan, alegrand, anpicker, aos-bugs, erooth, kakkoyun, mmasters, pkrupa, pnair, prubenda, wking |
| Version: | 4.6.z | Keywords: | Upgrades |
| Target Milestone: | --- | ||
| Target Release: | 4.9.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-08-20 15:53:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Petr Balogh
2021-06-03 09:30:35 UTC
As far as I can tell from the logs, this seems to be a networking issue with the `compute-1` node, since both degraded pods are scheduled on this node and are having networking issues.

For the grafana pod, the grafana-proxy container is reported unready, although from its logs everything seems to be going well for this container. When trying to access the route attached to the Grafana UI that is behind the proxy, I got an HTTP 503, meaning that the service was indeed unavailable.

As for prometheus-adapter, one of the replicas is scheduled on `compute-0` and is healthy/ready, while the other one is scheduled on `compute-1` and is crash looping because it can't reach the Kubernetes service:

```
F0603 12:48:02.610478 1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host
```

It is also worth noting that the dns-default pod from the `openshift-dns` namespace scheduled on `compute-1` is crash looping as well, and thus might be the root cause of the problem. Sending this bug over to the DNS team to further investigate the `dns-default` pod crash looping on the `compute-1` node.

I tried to drain compute-1 with the command:

oc adm drain --force --delete-local-data --ignore-daemonsets compute-1

and finally had to reboot the node, as I saw this (http://pastebin.test.redhat.com/968739) repeating:

error when evicting pod "rook-ceph-osd-0-5fbf89f9bc-wccp5" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-0-5fbf89f9bc-wccp5
error when evicting pod "rook-ceph-osd-0-5fbf89f9bc-wccp5" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-0-5fbf89f9bc-wccp5

Maybe it would have finished if I had waited longer, but after about 5 minutes I force rebooted the node. After that everything came back to normal:

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.30 True False 6m55s Cluster version is 4.6.30

And the cluster got upgraded. But something had to lead to this issue, so it is worth finding the root cause. I will continue with the other OCS bug verification on this cluster, so please do not modify it. I will also do one more upgrade to OCP 4.7, so I hope I will not hit the same issue there as well. I went from OCP 4.3 to 4.4 and 4.5 and didn't hit any issue before.

I am seeing the same issue upgrading from 4.3.40 -> 4.4.33 -> 4.5.40 -> 4.6.34 on Azure IPI:
% oc get nodes
NAME STATUS ROLES AGE VERSION
qe-pr-az43-pf5z4-master-0 Ready master 7h22m v1.19.0+c3e2e69
qe-pr-az43-pf5z4-master-1 Ready master 7h22m v1.19.0+c3e2e69
qe-pr-az43-pf5z4-master-2 Ready master 7h22m v1.19.0+c3e2e69
qe-pr-az43-pf5z4-worker-northcentralus-b78nn Ready worker 6h3m v1.19.0+c3e2e69
qe-pr-az43-pf5z4-worker-northcentralus-w5s72 Ready,SchedulingDisabled worker 7h13m v1.18.3+d8ef5ad
qe-pr-az43-pf5z4-worker-northcentralus-zbhdm Ready worker 7h12m v1.18.3+d8ef5ad
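For triage, the node listing above can be summarized offline. The following is a minimal sketch, not part of the original report: it parses pasted `oc get nodes` text (it does not query the cluster), and the names `NODES_OUTPUT` and `summarize_nodes` are hypothetical. It assumes the five-column layout shown above.

```python
from collections import defaultdict

# Sample copied from the `oc get nodes` output above (assumed 5 columns).
NODES_OUTPUT = """\
NAME STATUS ROLES AGE VERSION
qe-pr-az43-pf5z4-master-0 Ready master 7h22m v1.19.0+c3e2e69
qe-pr-az43-pf5z4-master-1 Ready master 7h22m v1.19.0+c3e2e69
qe-pr-az43-pf5z4-master-2 Ready master 7h22m v1.19.0+c3e2e69
qe-pr-az43-pf5z4-worker-northcentralus-b78nn Ready worker 6h3m v1.19.0+c3e2e69
qe-pr-az43-pf5z4-worker-northcentralus-w5s72 Ready,SchedulingDisabled worker 7h13m v1.18.3+d8ef5ad
qe-pr-az43-pf5z4-worker-northcentralus-zbhdm Ready worker 7h12m v1.18.3+d8ef5ad
"""


def summarize_nodes(text):
    """Group node names by kubelet version and collect cordoned nodes."""
    by_version = defaultdict(list)
    cordoned = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, status, _roles, _age, version = line.split()
        by_version[version].append(name)
        # STATUS is a comma-separated list such as "Ready,SchedulingDisabled".
        if "SchedulingDisabled" in status.split(","):
            cordoned.append(name)
    return dict(by_version), cordoned
```

On this sample, two workers are still on the v1.18.3 kubelet and one of them is cordoned, which would be consistent with the worker rollout being stuck while that node is drained.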
% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.6.34 True False False 97m
cloud-credential 4.6.34 True False False 7h22m
cluster-autoscaler 4.6.34 True False False 7h16m
config-operator 4.6.34 True False False 3h44m
console 4.6.34 True False False 100m
csi-snapshot-controller 4.6.34 True False False 99m
dns 4.6.34 True False False 120m
etcd 4.6.34 True False False 5h45m
image-registry 4.6.34 True True False 106m
ingress 4.6.34 True False True 137m
insights 4.6.34 True False False 7h18m
kube-apiserver 4.6.34 True False False 7h20m
kube-controller-manager 4.6.34 True False False 5h41m
kube-scheduler 4.6.34 True False False 5h42m
kube-storage-version-migrator 4.6.34 True False False 106m
machine-api 4.6.34 True False False 7h18m
machine-approver 4.6.34 True False False 3h39m
machine-config 4.6.34 True False False 97m
marketplace 4.6.34 True False False 100m
monitoring 4.6.34 False True True 103m
network 4.6.34 True False False 7h21m
node-tuning 4.6.34 True False False 136m
openshift-apiserver 4.6.34 True False False 123m
openshift-controller-manager 4.6.34 True False False 155m
openshift-samples 4.6.34 True False False 135m
operator-lifecycle-manager 4.6.34 True False False 7h17m
operator-lifecycle-manager-catalog 4.6.34 True False False 7h17m
operator-lifecycle-manager-packageserver 4.6.34 True False False 136m
service-ca 4.6.34 True False False 7h21m
storage 4.6.34 True False False 137m
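To pick the unhealthy operators out of a long `oc get co` listing like the one above, a small offline filter can help. This is a sketch under assumptions: it parses pasted text with the six columns shown above, treats `Available=True, Progressing=False, Degraded=False` as the steady state, and uses the hypothetical names `CO_OUTPUT` and `unhealthy_operators`. The sample holds only a few of the rows from the listing.

```python
# A few rows copied from the `oc get co` output above (assumed 6 columns).
CO_OUTPUT = """\
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
dns 4.6.34 True False False 120m
image-registry 4.6.34 True True False 106m
ingress 4.6.34 True False True 137m
monitoring 4.6.34 False True True 103m
network 4.6.34 True False False 7h21m
"""

STEADY_STATE = ("True", "False", "False")  # Available, Progressing, Degraded


def unhealthy_operators(text):
    """Map operator name to its (available, progressing, degraded) triple
    for every operator that is not in the steady state."""
    unhealthy = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 6-column row.
        if len(fields) != 6 or fields[0] == "NAME":
            continue
        name, _version, available, progressing, degraded, _since = fields
        state = (available, progressing, degraded)
        if state != STEADY_STATE:
            unhealthy[name] = state
    return unhealthy
```

On the sample rows this flags image-registry (progressing), ingress (degraded), and monitoring (unavailable, progressing, and degraded), matching what is visible in the listing.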
% oc describe co monitoring
Name: monitoring
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
Creation Timestamp: 2021-06-23T12:40:18Z
Generation: 1
Resource Version: 240683
Self Link: /apis/config.openshift.io/v1/clusteroperators/monitoring
UID: bc02b67f-6bc6-4684-b686-c38d43bb51f7
Spec:
Status:
Conditions:
Last Transition Time: 2021-06-23T18:15:28Z
Status: False
Type: Available
Last Transition Time: 2021-06-23T19:55:45Z
Message: Rolling out the stack.
Reason: RollOutInProgress
Status: True
Type: Progressing
Last Transition Time: 2021-06-23T18:27:54Z
Message: Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/grafana: got 1 unavailable replicas
Reason: UpdatingGrafanaFailed
Status: True
Type: Degraded
Last Transition Time: 2021-06-23T19:55:45Z
Message: Rollout of the monitoring stack is in progress. Please wait until it finishes.
Reason: RollOutInProgress
Status: True
Type: Upgradeable
Extension: <nil>
Related Objects:
Group:
Name: openshift-monitoring
Resource: namespaces
Group:
Name: openshift-user-workload-monitoring
Resource: namespaces
Group: monitoring.coreos.com
Name:
Resource: servicemonitors
Group: monitoring.coreos.com
Name:
Resource: podmonitors
Group: monitoring.coreos.com
Name:
Resource: prometheusrules
Group: monitoring.coreos.com
Name:
Resource: alertmanagers
Group: monitoring.coreos.com
Name:
Resource: prometheuses
Group: monitoring.coreos.com
Name:
Resource: thanosrulers
Versions:
Name: operator
Version: 4.6.34
% oc get pods -n openshift-monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 5/5 Running 0 101m
alertmanager-main-1 5/5 Running 0 131m
alertmanager-main-2 5/5 Running 0 101m
cluster-monitoring-operator-6dd74db54f-bfq6d 2/2 Running 0 96m
grafana-7ff876c957-xphcn 1/2 Running 0 103m
kube-state-metrics-659c7b865d-4b5z6 2/3 CrashLoopBackOff 24 103m
node-exporter-98zks 2/2 Running 0 132m
node-exporter-9n9lf 2/2 Running 0 132m
node-exporter-brj9v 2/2 Running 0 132m
node-exporter-r2jfs 2/2 Running 0 132m
node-exporter-tvqtd 2/2 Running 0 132m
node-exporter-zthfh 2/2 Running 0 132m
openshift-state-metrics-7cf4dc694b-q8g28 3/3 Running 0 101m
prometheus-adapter-7bb644ff67-l8bst 0/1 CrashLoopBackOff 24 103m
prometheus-adapter-7bb644ff67-pnlbg 1/1 Running 0 101m
prometheus-k8s-0 6/6 Running 7 131m
prometheus-k8s-1 6/6 Running 1 101m
prometheus-operator-7fdc685b8d-xqs57 2/2 Running 0 96m
telemeter-client-7f9f75f5f8-h6qmw 3/3 Running 0 103m
thanos-querier-7c8844f896-22r9j 5/5 Running 0 101m
thanos-querier-7c8844f896-v8sc8 5/5 Running 0 103m
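The pods that are not fully ready can be extracted from the listing above in the same way. A minimal sketch, assuming the five-column `oc get pods` layout shown above and parsing pasted text only; `PODS_OUTPUT` and `not_fully_ready` are hypothetical names, and the sample is a subset of the rows.

```python
# A few rows copied from the `oc get pods -n openshift-monitoring` output above.
PODS_OUTPUT = """\
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 5/5 Running 0 101m
grafana-7ff876c957-xphcn 1/2 Running 0 103m
kube-state-metrics-659c7b865d-4b5z6 2/3 CrashLoopBackOff 24 103m
prometheus-adapter-7bb644ff67-l8bst 0/1 CrashLoopBackOff 24 103m
prometheus-adapter-7bb644ff67-pnlbg 1/1 Running 0 101m
"""


def not_fully_ready(text):
    """Return (name, ready, status, restarts) for pods whose READY column
    shows fewer ready containers than total containers."""
    flagged = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 5-column row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, ready, status, restarts, _age = fields
        ready_count, total = (int(part) for part in ready.split("/"))
        if ready_count < total:
            flagged.append((name, ready, status, int(restarts)))
    return flagged
```

Note that the grafana pod is flagged even though its STATUS is `Running`, which is why the READY column is checked rather than STATUS alone.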
Grafana logs show:
t=2021-06-23T18:10:12+0000 lvl=info msg="Created default admin" logger=sqlstore user=WHAT_YOU_ARE_DOING_IS_VOIDING_SUPPORT_0000000000000000000000000000000000000000000000000000000000000000
t=2021-06-23T18:10:12+0000 lvl=info msg="Starting plugin search" logger=plugins
t=2021-06-23T18:10:12+0000 lvl=warn msg="[Deprecated] the use of basicAuthPassword field is deprecated. Please use secureJsonData.basicAuthPassword" logger=provisioning.datasources datasource name=prometheus
t=2021-06-23T18:10:12+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=prometheus uid=
t=2021-06-23T18:10:12+0000 lvl=eror msg="Failed to read plugin provisioning files from directory" logger=provisioning.plugins path=/etc/grafana/provisioning/plugins error="open /etc/grafana/provisioning/plugins: no such file or directory"
t=2021-06-23T18:10:12+0000 lvl=eror msg="Can't read alert notification provisioning files from directory" logger=provisioning.notifiers path=/etc/grafana/provisioning/notifiers error="open /etc/grafana/provisioning/notifiers: no such file or directory"
t=2021-06-23T18:10:12+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=127.0.0.1:3001 protocol=http subUrl= socket=
kube-state-metrics logs show:
I0623 19:59:28.080659 1 main.go:86] Using default collectors
I0623 19:59:28.080810 1 main.go:98] Using all namespace
I0623 19:59:28.080829 1 main.go:139] metric white-blacklisting: blacklisting the following items: kube_secret_labels
W0623 19:59:28.080887 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0623 19:59:28.084325 1 main.go:186] Testing communication with server
F0623 19:59:31.154189 1 main.go:149] Failed to create client: error while trying to communicate with apiserver: Get "https://172.30.0.1:443/version?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host
prometheus adapter logs:
I0623 19:59:21.128533 1 adapter.go:94] successfully using in-cluster auth
F0623 19:59:24.242392 1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host
Uploading must gather to next comment.
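Both failing containers above die on the same `no route to host` error against the same address. When more pod logs are collected, the following sketch can tally which destinations are unreachable. It is an illustration only: the regular expression is an assumption based on the two fatal log lines above, and `NO_ROUTE`, `LOG_LINES`, and `no_route_targets` are hypothetical names.

```python
import re
from collections import Counter

# Matches the Go dial error seen in the logs above, e.g.
# "dial tcp 172.30.0.1:443: connect: no route to host" (IPv4 only).
NO_ROUTE = re.compile(
    r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+): connect: no route to host"
)

# Lines copied from the kube-state-metrics and prometheus-adapter logs above.
LOG_LINES = [
    'I0623 19:59:28.084325 1 main.go:186] Testing communication with server',
    'F0623 19:59:31.154189 1 main.go:149] Failed to create client: error while trying to communicate with apiserver: Get "https://172.30.0.1:443/version?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host',
    'F0623 19:59:24.242392 1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host',
]


def no_route_targets(lines):
    """Count 'no route to host' failures per (ip, port) destination."""
    counts = Counter()
    for line in lines:
        match = NO_ROUTE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

If every hit points at the same service IP and the affected pods share a node, that supports treating this as a node-level networking problem rather than a problem in the individual components.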
The failing dns-default was unable to contact the API server, which seems to be the same issue the other pods on compute-1 were seeing:

reflector.go:127] github.com/coredns/coredns/plugin/kubernetes/controller.go:333: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host

Since all the pods that can't access the API server seem to be on compute-1, I think this is most likely a networking issue with that node specifically. The cluster was using OpenShift SDN, so I'm passing this to the SDN team.

UpgradeBlocker keyword semantics are defined in [1]. 4.6 is now so old and quiet that I would be very surprised if this ends up being a recent, product-side networking regression. I'm going to pull the keyword off to remove the bug from my triage queue, but feel free to add it back if further investigation does turn up a 4.6 regression that seems like grounds for removing 4.6.z update recommendations. I dunno what the status is of 4.5 -> 4.6 edge recommendations now that 4.5 is end-of-life [2].

[1]: https://github.com/openshift/enhancements/pull/475
[2]: https://access.redhat.com/support/policy/updates/openshift/#dates

Oh, wait, why are you still updating to 4.6.30? We stopped recommending updates to it based on bug 1953518 back on June 4th [1] (~a day after you opened this bug). Maybe this bug can be closed as a dup of bug 1953518?

[1]: https://github.com/openshift/cincinnati-graph-data/pull/837#event-4844091800

Marking as a dupe. If the issue reproduces upgrading from 4.5.40 to the latest 4.6.z, please re-open.

*** This bug has been marked as a duplicate of bug 1953518 ***
|