Description of problem:
Unable to upgrade 4.5.40 to 4.6.30 with the error in $SUBJECT.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.40    True        True          12h     Unable to apply 4.6.30: the cluster operator monitoring is degraded

I see these 2 problematic pods; yesterday the prometheus-adapter was OK, this morning I see it in CrashLoopBackOff state.

$ oc get pod -n openshift-monitoring
NAME                                  READY   STATUS             RESTARTS   AGE
grafana-7977459cb-wx9p5               1/2     Running            0          12h
prometheus-adapter-795f9f4cf6-kmgfg   0/1     CrashLoopBackOff   32         144m

Some logs I see in the grafana container:

t=2021-06-02T20:52:26+0000 lvl=warn msg="[Deprecated] the use of basicAuthPassword field is deprecated. Please use secureJsonData.basicAuthPassword" logger=provisioning.datasources datasource name=prometheus
t=2021-06-02T20:52:26+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=prometheus uid=
t=2021-06-02T20:52:26+0000 lvl=eror msg="Failed to read plugin provisioning files from directory" logger=provisioning.plugins path=/etc/grafana/provisioning/plugins error="open /etc/grafana/provisioning/plugins: no such file or directory"
t=2021-06-02T20:52:26+0000 lvl=eror msg="Can't read alert notification provisioning files from directory" logger=provisioning.notifiers path=/etc/grafana/provisioning/notifiers error="open /etc/grafana/provisioning/notifiers: no such file or directory"
t=2021-06-02T20:52:26+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=127.0.0.1:3001 protocol=http subUrl= socket=

From the adapter logs I see:

I0603 08:37:40.394777   1 adapter.go:94] successfully using in-cluster auth
F0603 08:37:43.474386   1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host

Will attach must-gather as well.

Version-Release number of selected component (if applicable):
Upgrade 4.5.40 to 4.6.30 - cluster originally deployed with OCP 4.3.

How reproducible:
Ran upgrade from 4.3 to 4.4 to 4.5 and 4.6.

Steps to Reproduce:
1. Described above

Actual results:
Unable to apply 4.6.30: the cluster operator monitoring is degraded

Expected results:
Successful upgrade

Additional info:
Must gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/must-gather.tar.gz
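For reference, the "no route to host" error can be re-checked from a node itself; a minimal sketch, using the service IP from the adapter logs above (`<node>` is a placeholder for the node hosting the failing pod):

```
# Open a debug shell on the node and probe the Kubernetes service IP.
# -k skips certificate verification; we only care about reachability here,
# so any HTTP response (even a 403) means the path works, while
# "no route to host" points at the node's service-proxy/SDN setup.
$ oc debug node/<node>
sh-4.4# chroot /host
sh-4.4# curl -k https://172.30.0.1:443/api
```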
As far as I can tell from the logs, this seems to be a networking issue with the `compute-1` node, since both degraded pods are scheduled on that node and are having networking issues.

For the grafana pod, the grafana-proxy container is reported unready, although from the logs everything seems to be going well for it. When trying to access the route attached to the Grafana UI that sits behind the proxy, I got an HTTP 503, meaning that the service was indeed unavailable.

As for prometheus-adapter, one of the replicas is scheduled on `compute-0` and is healthy/ready, while the other is scheduled on `compute-1` and is crashlooping because it can't reach the Kubernetes service:

```
F0603 12:48:02.610478   1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host
```

It is also worth noting that the dns-default pod from the `openshift-dns` namespace scheduled on `compute-1` is crashlooping as well, and might thus be the root cause of the problem.
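The node placement mentioned above can be confirmed with `-o wide`, which adds a NODE column (a sketch):

```
# Show which node each monitoring pod landed on; here the two degraded
# pods are both on compute-1.
$ oc get pods -n openshift-monitoring -o wide
```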
Sending this bug over to the DNS team to further investigate the `dns-default` pod crashlooping on the `compute-1` node.
I tried to drain compute-1 with:

$ oc adm drain --force --delete-local-data --ignore-daemonsets compute-1

and in the end had to reboot it, as I saw this: http://pastebin.test.redhat.com/968739

Repeating:

error when evicting pod "rook-ceph-osd-0-5fbf89f9bc-wccp5" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-0-5fbf89f9bc-wccp5
error when evicting pod "rook-ceph-osd-0-5fbf89f9bc-wccp5" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-0-5fbf89f9bc-wccp5

Maybe it would have finished if I had waited longer, but after about 5 minutes I force rebooted the node. After that everything came back to normal:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.30    True        False         6m55s   Cluster version is 4.6.30

And the cluster got upgraded. But something had to lead to this issue, so it is worth finding out the root cause. I will continue with the other OCS bug verification on this cluster, so please do not touch it. I will also do one more upgrade, to OCP 4.7, and I hope I will not hit the same issue there. Note that I went from OCP 4.3 to 4.4 and 4.5 and didn't hit any issue before.
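For the record, a sketch of how the blocked eviction could have been inspected before resorting to a reboot (namespace taken from the drain output above):

```
# The eviction was blocked by a PodDisruptionBudget: an ALLOWED DISRUPTIONS
# value of 0 means the drain keeps retrying until the budget allows it.
$ oc get pdb -n openshift-storage
# Describe the budgets to see which selector matches the OSD pod.
$ oc describe pdb -n openshift-storage
```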
I am seeing the same issue upgrading from 4.3.40 -> 4.4.33 -> 4.5.40 -> 4.6.34 on Azure IPI.

% oc get nodes
NAME                                           STATUS                     ROLES    AGE     VERSION
qe-pr-az43-pf5z4-master-0                      Ready                      master   7h22m   v1.19.0+c3e2e69
qe-pr-az43-pf5z4-master-1                      Ready                      master   7h22m   v1.19.0+c3e2e69
qe-pr-az43-pf5z4-master-2                      Ready                      master   7h22m   v1.19.0+c3e2e69
qe-pr-az43-pf5z4-worker-northcentralus-b78nn   Ready                      worker   6h3m    v1.19.0+c3e2e69
qe-pr-az43-pf5z4-worker-northcentralus-w5s72   Ready,SchedulingDisabled   worker   7h13m   v1.18.3+d8ef5ad
qe-pr-az43-pf5z4-worker-northcentralus-zbhdm   Ready                      worker   7h12m   v1.18.3+d8ef5ad

% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.34    True        False         False      97m
cloud-credential                           4.6.34    True        False         False      7h22m
cluster-autoscaler                         4.6.34    True        False         False      7h16m
config-operator                            4.6.34    True        False         False      3h44m
console                                    4.6.34    True        False         False      100m
csi-snapshot-controller                    4.6.34    True        False         False      99m
dns                                        4.6.34    True        False         False      120m
etcd                                       4.6.34    True        False         False      5h45m
image-registry                             4.6.34    True        True          False      106m
ingress                                    4.6.34    True        False         True       137m
insights                                   4.6.34    True        False         False      7h18m
kube-apiserver                             4.6.34    True        False         False      7h20m
kube-controller-manager                    4.6.34    True        False         False      5h41m
kube-scheduler                             4.6.34    True        False         False      5h42m
kube-storage-version-migrator              4.6.34    True        False         False      106m
machine-api                                4.6.34    True        False         False      7h18m
machine-approver                           4.6.34    True        False         False      3h39m
machine-config                             4.6.34    True        False         False      97m
marketplace                                4.6.34    True        False         False      100m
monitoring                                 4.6.34    False       True          True       103m
network                                    4.6.34    True        False         False      7h21m
node-tuning                                4.6.34    True        False         False      136m
openshift-apiserver                        4.6.34    True        False         False      123m
openshift-controller-manager               4.6.34    True        False         False      155m
openshift-samples                          4.6.34    True        False         False      135m
operator-lifecycle-manager                 4.6.34    True        False         False      7h17m
operator-lifecycle-manager-catalog         4.6.34    True        False         False      7h17m
operator-lifecycle-manager-packageserver   4.6.34    True        False         False      136m
service-ca                                 4.6.34    True        False         False      7h21m
storage                                    4.6.34    True        False         False      137m

% oc describe co monitoring
Name:         monitoring
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-06-23T12:40:18Z
  Generation:          1
  Resource Version:    240683
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/monitoring
  UID:                 bc02b67f-6bc6-4684-b686-c38d43bb51f7
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-06-23T18:15:28Z
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-06-23T19:55:45Z
    Message:               Rolling out the stack.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-06-23T18:27:54Z
    Message:               Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/grafana: got 1 unavailable replicas
    Reason:                UpdatingGrafanaFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-06-23T19:55:45Z
    Message:               Rollout of the monitoring stack is in progress. Please wait until it finishes.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:
    Name:      openshift-monitoring
    Resource:  namespaces
    Group:
    Name:      openshift-user-workload-monitoring
    Resource:  namespaces
    Group:     monitoring.coreos.com
    Name:
    Resource:  servicemonitors
    Group:     monitoring.coreos.com
    Name:
    Resource:  podmonitors
    Group:     monitoring.coreos.com
    Name:
    Resource:  prometheusrules
    Group:     monitoring.coreos.com
    Name:
    Resource:  alertmanagers
    Group:     monitoring.coreos.com
    Name:
    Resource:  prometheuses
    Group:     monitoring.coreos.com
    Name:
    Resource:  thanosrulers
  Versions:
    Name:     operator
    Version:  4.6.34

% oc get pods -n openshift-monitoring
NAME                                           READY   STATUS             RESTARTS   AGE
alertmanager-main-0                            5/5     Running            0          101m
alertmanager-main-1                            5/5     Running            0          131m
alertmanager-main-2                            5/5     Running            0          101m
cluster-monitoring-operator-6dd74db54f-bfq6d   2/2     Running            0          96m
grafana-7ff876c957-xphcn                       1/2     Running            0          103m
kube-state-metrics-659c7b865d-4b5z6            2/3     CrashLoopBackOff   24         103m
node-exporter-98zks                            2/2     Running            0          132m
node-exporter-9n9lf                            2/2     Running            0          132m
node-exporter-brj9v                            2/2     Running            0          132m
node-exporter-r2jfs                            2/2     Running            0          132m
node-exporter-tvqtd                            2/2     Running            0          132m
node-exporter-zthfh                            2/2     Running            0          132m
openshift-state-metrics-7cf4dc694b-q8g28       3/3     Running            0          101m
prometheus-adapter-7bb644ff67-l8bst            0/1     CrashLoopBackOff   24         103m
prometheus-adapter-7bb644ff67-pnlbg            1/1     Running            0          101m
prometheus-k8s-0                               6/6     Running            7          131m
prometheus-k8s-1                               6/6     Running            1          101m
prometheus-operator-7fdc685b8d-xqs57           2/2     Running            0          96m
telemeter-client-7f9f75f5f8-h6qmw              3/3     Running            0          103m
thanos-querier-7c8844f896-22r9j                5/5     Running            0          101m
thanos-querier-7c8844f896-v8sc8                5/5     Running            0          103m

Grafana logs show:

t=2021-06-23T18:10:12+0000 lvl=info msg="Created default admin" logger=sqlstore user=WHAT_YOU_ARE_DOING_IS_VOIDING_SUPPORT_0000000000000000000000000000000000000000000000000000000000000000
t=2021-06-23T18:10:12+0000 lvl=info msg="Starting plugin search" logger=plugins
t=2021-06-23T18:10:12+0000 lvl=warn msg="[Deprecated] the use of basicAuthPassword field is deprecated. Please use secureJsonData.basicAuthPassword" logger=provisioning.datasources datasource name=prometheus
t=2021-06-23T18:10:12+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=prometheus uid=
t=2021-06-23T18:10:12+0000 lvl=eror msg="Failed to read plugin provisioning files from directory" logger=provisioning.plugins path=/etc/grafana/provisioning/plugins error="open /etc/grafana/provisioning/plugins: no such file or directory"
t=2021-06-23T18:10:12+0000 lvl=eror msg="Can't read alert notification provisioning files from directory" logger=provisioning.notifiers path=/etc/grafana/provisioning/notifiers error="open /etc/grafana/provisioning/notifiers: no such file or directory"
t=2021-06-23T18:10:12+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=127.0.0.1:3001 protocol=http subUrl= socket=

kube-state-metrics logs show:

I0623 19:59:28.080659   1 main.go:86] Using default collectors
I0623 19:59:28.080810   1 main.go:98] Using all namespace
I0623 19:59:28.080829   1 main.go:139] metric white-blacklisting: blacklisting the following items: kube_secret_labels
W0623 19:59:28.080887   1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0623 19:59:28.084325   1 main.go:186] Testing communication with server
F0623 19:59:31.154189   1 main.go:149] Failed to create client: error while trying to communicate with apiserver: Get "https://172.30.0.1:443/version?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host

prometheus-adapter logs:

I0623 19:59:21.128533   1 adapter.go:94] successfully using in-cluster auth
F0623 19:59:24.242392   1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host

Uploading must-gather to next comment.
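(The crash-loop messages above are from the current container instances; logs from the previously crashed instance can be pulled with `--previous` — pod and container names here are the ones from this cluster:)

```
# Fetch logs from the last terminated instance of each crashlooping container.
% oc logs -n openshift-monitoring prometheus-adapter-7bb644ff67-l8bst --previous
% oc logs -n openshift-monitoring kube-state-metrics-659c7b865d-4b5z6 -c kube-state-metrics --previous
```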
The failing dns-default was unable to contact the API server, which seems to be the same issue the other pods on compute-1 were seeing:

reflector.go:127] github.com/coredns/coredns/plugin/kubernetes/controller.go:333: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
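The crashlooping replica and its logs can be located like this (a sketch; `<dns-default-pod>` is a placeholder for the replica on compute-1):

```
# dns-default is a daemonset with one pod per node; -o wide shows which
# replica runs on compute-1.
$ oc get pods -n openshift-dns -o wide
# Pull the CoreDNS container logs from that replica.
$ oc logs -n openshift-dns <dns-default-pod> -c dns
```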
Since all the pods that can't access the API server seem to be on compute-1, I think this is most likely a networking issue with that node specifically. The cluster was using OpenShift SDN, so I'm passing this to the SDN team.
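A sketch of the first checks on the SDN side (`openshift-sdn` is the standard namespace for OpenShift SDN clusters; `<sdn-pod>` is a placeholder for the daemonset pod on compute-1):

```
# Find the sdn daemonset pod running on the affected node...
$ oc get pods -n openshift-sdn -o wide | grep compute-1
# ...and check its logs for errors programming the service-proxy rules.
$ oc logs -n openshift-sdn <sdn-pod> -c sdn
```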
UpgradeBlocker keyword semantics are defined in [1]. 4.6 is now so old and quiet that I would be very surprised if this ends up being a recent, product-side networking regression. I'm going to pull the keyword off to remove the bug from my triage queue, but feel free to add it back if further investigation does turn up a 4.6 regression that seems like grounds for removing 4.6.z update recommendations. I don't know what the status is of 4.5 -> 4.6 edge recommendations now that 4.5 is end-of-life [2].

[1]: https://github.com/openshift/enhancements/pull/475
[2]: https://access.redhat.com/support/policy/updates/openshift/#dates
Oh, wait, why are you still updating to 4.6.30? We stopped recommending updates to it based on bug 1953518 back on June 4th [1] (~a day after you opened this bug). Maybe this bug can be closed as a dup of bug 1953518?

[1]: https://github.com/openshift/cincinnati-graph-data/pull/837#event-4844091800
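(For context, whether a given 4.6.z target is still recommended from a cluster's current version can be checked with `oc adm upgrade`, which prints the update recommendations the cluster currently sees:)

```
# Shows the current version, channel, and currently recommended updates.
$ oc adm upgrade
```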
Marking as a dupe; if the issue reproduces when upgrading from 4.5.40 to the latest 4.6.z, please re-open.

*** This bug has been marked as a duplicate of bug 1953518 ***