Bug 1967514 - Unable to apply 4.6.30: the cluster operator monitoring is degraded
Summary: Unable to apply 4.6.30: the cluster operator monitoring is degraded
Keywords:
Status: CLOSED DUPLICATE of bug 1953518
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-03 09:30 UTC by Petr Balogh
Modified: 2021-08-20 15:53 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-20 15:53:26 UTC
Target Upstream Version:
Embargoed:



Description Petr Balogh 2021-06-03 09:30:35 UTC
Description of problem:
Unable to upgrade from 4.5.40 to 4.6.30; the upgrade fails with the error in $SUBJECT.

 $ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.40    True        True          12h     Unable to apply 4.6.30: the cluster operator monitoring is degraded

I see these two problematic pods; yesterday the prometheus-adapter pod was OK, but this morning it is in CrashLoopBackOff state.
$ oc get pod -n openshift-monitoring
NAME                                           READY   STATUS             RESTARTS   AGE
grafana-7977459cb-wx9p5                        1/2     Running            0          12h
prometheus-adapter-795f9f4cf6-kmgfg            0/1     CrashLoopBackOff   32         12h

Some logs I see in the grafana container:
t=2021-06-02T20:52:26+0000 lvl=warn msg="[Deprecated] the use of basicAuthPassword field is deprecated. Please use secureJsonData.basicAuthPassword" logger=provisioning.datasources datasource name=prometheus
t=2021-06-02T20:52:26+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=prometheus uid=
t=2021-06-02T20:52:26+0000 lvl=eror msg="Failed to read plugin provisioning files from directory" logger=provisioning.plugins path=/etc/grafana/provisioning/plugins error="open /etc/grafana/provisioning/plugins: no such file or directory"
t=2021-06-02T20:52:26+0000 lvl=eror msg="Can't read alert notification provisioning files from directory" logger=provisioning.notifiers path=/etc/grafana/provisioning/notifiers error="open /etc/grafana/provisioning/notifiers: no such file or directory"
t=2021-06-02T20:52:26+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=127.0.0.1:3001 protocol=http subUrl= socket=


From the prometheus-adapter logs I see:
I0603 08:37:40.394777       1 adapter.go:94] successfully using in-cluster auth
F0603 08:37:43.474386       1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host
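
For reference, an illustrative way to verify from the node hosting this pod that the service VIP 172.30.0.1 is unreachable (the node name is a placeholder):
```
# Find which node hosts the crashlooping adapter pod, then probe the
# kube API service VIP from that node (illustrative only).
oc -n openshift-monitoring get pod prometheus-adapter-795f9f4cf6-kmgfg -o wide
oc debug node/<node-name> -- chroot /host curl -sk --max-time 5 https://172.30.0.1:443/api
```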

I will attach the must-gather as well.


Version-Release number of selected component (if applicable):
Upgrade from 4.5.40 to 4.6.30; the cluster was originally deployed with OCP 4.3.

How reproducible:
Ran upgrades from 4.3 to 4.4, then 4.5, and then 4.6.

Steps to Reproduce:
1. Described above


Actual results:
Unable to apply 4.6.30: the cluster operator monitoring is degraded

Expected results:
Successful upgrade 

Additional info:
Must gather:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/must-gather.tar.gz

Comment 2 Damien Grisonnet 2021-06-03 12:57:36 UTC
As far as I can tell from the logs, this seems to be a networking issue with the `compute-1` node, since both degraded pods are scheduled on that node and both are showing connectivity problems.

For the grafana pod, the grafana-proxy container is reported unready, although judging from the logs everything seems fine for this container. When trying to access the route attached to the Grafana UI behind the proxy, I got an HTTP 503, meaning the service was indeed unavailable.

As for prometheus-adapter, one of the replicas is scheduled on `compute-0` and is healthy/ready, while the other is scheduled on `compute-1` and is crash-looping because it can't reach the Kubernetes service:
```
F0603 12:48:02.610478       1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host
```

It is also worth noting that the dns-default pod from the `openshift-dns` namespace scheduled on `compute-1` is crash-looping as well, and thus might be the root cause of the problem.
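
One way to narrow this down further, assuming the cluster uses OpenShift SDN (the SDN pod name below is a placeholder):
```
# List pods scheduled on compute-1 that are not Running/Completed, then
# check the SDN pod serving that node (illustrative commands).
oc get pods -A -o wide --field-selector spec.nodeName=compute-1 | grep -vE 'Running|Completed'
oc -n openshift-sdn get pods -o wide | grep compute-1
oc -n openshift-sdn logs <sdn-pod-on-compute-1> -c sdn --tail=50
```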

Comment 3 Damien Grisonnet 2021-06-03 13:05:17 UTC
Sending this bug over to the DNS team to further investigate the `dns-default` pod crashlooping on the `compute-1` node.

Comment 4 Petr Balogh 2021-06-03 14:17:53 UTC
I've tried to drain compute-1 with this command:

oc adm drain --force --delete-local-data --ignore-daemonsets  compute-1

and in the end I had to reboot it, because I saw this:

http://pastebin.test.redhat.com/968739

The following kept repeating:
error when evicting pod "rook-ceph-osd-0-5fbf89f9bc-wccp5" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-0-5fbf89f9bc-wccp5
error when evicting pod "rook-ceph-osd-0-5fbf89f9bc-wccp5" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-0-5fbf89f9bc-wccp5


Maybe it would have finished if I had waited longer, but after about 5 minutes I force-rebooted the node.
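
For the record, an illustrative way to inspect the disruption budget that blocks the eviction (namespace taken from the eviction message; the PDB name is a placeholder):
```
# Show the PodDisruptionBudgets in the namespace of the blocked pod and
# inspect the one covering the OSD pods (illustrative commands).
oc -n openshift-storage get pdb
oc -n openshift-storage describe pdb <pdb-name-from-previous-output>
```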

After that, everything came back to normal:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.30    True        False         6m55s   Cluster version is 4.6.30

And cluster got upgraded.

But something must have led to this issue, so it is worth finding out the root cause.

I will continue with the other OCS bug verification on this cluster, so please do not touch it. I will also do one more upgrade, to OCP 4.7, and hope I will not hit the same issue there. Note that I went from OCP 4.3 to 4.4 and 4.5 without hitting any issue before this.

Comment 5 Paige Rubendall 2021-06-23 20:02:01 UTC
I am seeing the same issue upgrading from 4.3.40 -> 4.4.33 -> 4.5.40 -> 4.6.34 on Azure IPI.

% oc get nodes  
NAME                                           STATUS                     ROLES    AGE     VERSION
qe-pr-az43-pf5z4-master-0                      Ready                      master   7h22m   v1.19.0+c3e2e69
qe-pr-az43-pf5z4-master-1                      Ready                      master   7h22m   v1.19.0+c3e2e69
qe-pr-az43-pf5z4-master-2                      Ready                      master   7h22m   v1.19.0+c3e2e69
qe-pr-az43-pf5z4-worker-northcentralus-b78nn   Ready                      worker   6h3m    v1.19.0+c3e2e69
qe-pr-az43-pf5z4-worker-northcentralus-w5s72   Ready,SchedulingDisabled   worker   7h13m   v1.18.3+d8ef5ad
qe-pr-az43-pf5z4-worker-northcentralus-zbhdm   Ready                      worker   7h12m   v1.18.3+d8ef5ad

% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.34    True        False         False      97m
cloud-credential                           4.6.34    True        False         False      7h22m
cluster-autoscaler                         4.6.34    True        False         False      7h16m
config-operator                            4.6.34    True        False         False      3h44m
console                                    4.6.34    True        False         False      100m
csi-snapshot-controller                    4.6.34    True        False         False      99m
dns                                        4.6.34    True        False         False      120m
etcd                                       4.6.34    True        False         False      5h45m
image-registry                             4.6.34    True        True          False      106m
ingress                                    4.6.34    True        False         True       137m
insights                                   4.6.34    True        False         False      7h18m
kube-apiserver                             4.6.34    True        False         False      7h20m
kube-controller-manager                    4.6.34    True        False         False      5h41m
kube-scheduler                             4.6.34    True        False         False      5h42m
kube-storage-version-migrator              4.6.34    True        False         False      106m
machine-api                                4.6.34    True        False         False      7h18m
machine-approver                           4.6.34    True        False         False      3h39m
machine-config                             4.6.34    True        False         False      97m
marketplace                                4.6.34    True        False         False      100m
monitoring                                 4.6.34    False       True          True       103m
network                                    4.6.34    True        False         False      7h21m
node-tuning                                4.6.34    True        False         False      136m
openshift-apiserver                        4.6.34    True        False         False      123m
openshift-controller-manager               4.6.34    True        False         False      155m
openshift-samples                          4.6.34    True        False         False      135m
operator-lifecycle-manager                 4.6.34    True        False         False      7h17m
operator-lifecycle-manager-catalog         4.6.34    True        False         False      7h17m
operator-lifecycle-manager-packageserver   4.6.34    True        False         False      136m
service-ca                                 4.6.34    True        False         False      7h21m
storage                                    4.6.34    True        False         False      137m


 % oc describe co monitoring
Name:         monitoring
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-06-23T12:40:18Z
  Generation:          1
  Resource Version:    240683
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/monitoring
  UID:                 bc02b67f-6bc6-4684-b686-c38d43bb51f7
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-06-23T18:15:28Z
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-06-23T19:55:45Z
    Message:               Rolling out the stack.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-06-23T18:27:54Z
    Message:               Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/grafana: got 1 unavailable replicas
    Reason:                UpdatingGrafanaFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-06-23T19:55:45Z
    Message:               Rollout of the monitoring stack is in progress. Please wait until it finishes.
    Reason:                RollOutInProgress
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     
    Name:      openshift-monitoring
    Resource:  namespaces
    Group:     
    Name:      openshift-user-workload-monitoring
    Resource:  namespaces
    Group:     monitoring.coreos.com
    Name:      
    Resource:  servicemonitors
    Group:     monitoring.coreos.com
    Name:      
    Resource:  podmonitors
    Group:     monitoring.coreos.com
    Name:      
    Resource:  prometheusrules
    Group:     monitoring.coreos.com
    Name:      
    Resource:  alertmanagers
    Group:     monitoring.coreos.com
    Name:      
    Resource:  prometheuses
    Group:     monitoring.coreos.com
    Name:      
    Resource:  thanosrulers
  Versions:
    Name:     operator
    Version:  4.6.34

% oc get pods -n openshift-monitoring
NAME                                           READY   STATUS             RESTARTS   AGE
alertmanager-main-0                            5/5     Running            0          101m
alertmanager-main-1                            5/5     Running            0          131m
alertmanager-main-2                            5/5     Running            0          101m
cluster-monitoring-operator-6dd74db54f-bfq6d   2/2     Running            0          96m
grafana-7ff876c957-xphcn                       1/2     Running            0          103m
kube-state-metrics-659c7b865d-4b5z6            2/3     CrashLoopBackOff   24         103m
node-exporter-98zks                            2/2     Running            0          132m
node-exporter-9n9lf                            2/2     Running            0          132m
node-exporter-brj9v                            2/2     Running            0          132m
node-exporter-r2jfs                            2/2     Running            0          132m
node-exporter-tvqtd                            2/2     Running            0          132m
node-exporter-zthfh                            2/2     Running            0          132m
openshift-state-metrics-7cf4dc694b-q8g28       3/3     Running            0          101m
prometheus-adapter-7bb644ff67-l8bst            0/1     CrashLoopBackOff   24         103m
prometheus-adapter-7bb644ff67-pnlbg            1/1     Running            0          101m
prometheus-k8s-0                               6/6     Running            7          131m
prometheus-k8s-1                               6/6     Running            1          101m
prometheus-operator-7fdc685b8d-xqs57           2/2     Running            0          96m
telemeter-client-7f9f75f5f8-h6qmw              3/3     Running            0          103m
thanos-querier-7c8844f896-22r9j                5/5     Running            0          101m
thanos-querier-7c8844f896-v8sc8                5/5     Running            0          103m


Grafana logs show:
t=2021-06-23T18:10:12+0000 lvl=info msg="Created default admin" logger=sqlstore user=WHAT_YOU_ARE_DOING_IS_VOIDING_SUPPORT_0000000000000000000000000000000000000000000000000000000000000000
t=2021-06-23T18:10:12+0000 lvl=info msg="Starting plugin search" logger=plugins
t=2021-06-23T18:10:12+0000 lvl=warn msg="[Deprecated] the use of basicAuthPassword field is deprecated. Please use secureJsonData.basicAuthPassword" logger=provisioning.datasources datasource name=prometheus
t=2021-06-23T18:10:12+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=prometheus uid=
t=2021-06-23T18:10:12+0000 lvl=eror msg="Failed to read plugin provisioning files from directory" logger=provisioning.plugins path=/etc/grafana/provisioning/plugins error="open /etc/grafana/provisioning/plugins: no such file or directory"
t=2021-06-23T18:10:12+0000 lvl=eror msg="Can't read alert notification provisioning files from directory" logger=provisioning.notifiers path=/etc/grafana/provisioning/notifiers error="open /etc/grafana/provisioning/notifiers: no such file or directory"
t=2021-06-23T18:10:12+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=127.0.0.1:3001 protocol=http subUrl= socket=


kube-state-metrics logs show:
I0623 19:59:28.080659       1 main.go:86] Using default collectors
I0623 19:59:28.080810       1 main.go:98] Using all namespace
I0623 19:59:28.080829       1 main.go:139] metric white-blacklisting: blacklisting the following items: kube_secret_labels
W0623 19:59:28.080887       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0623 19:59:28.084325       1 main.go:186] Testing communication with server
F0623 19:59:31.154189       1 main.go:149] Failed to create client: error while trying to communicate with apiserver: Get "https://172.30.0.1:443/version?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host


prometheus adapter logs:
I0623 19:59:21.128533       1 adapter.go:94] successfully using in-cluster auth
F0623 19:59:24.242392       1 adapter.go:286] unable to install resource metrics API: unable to construct dynamic discovery mapper: unable to populate initial set of REST mappings: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: connect: no route to host
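
The pattern looks identical to comment 2; an illustrative way to confirm whether the unhealthy pods all land on the same worker (the node name is a placeholder):
```
# Compare the NODE column for the unhealthy monitoring pods, then list all
# non-Running pods on the suspect worker (illustrative commands).
oc -n openshift-monitoring get pods -o wide
oc get pods -A -o wide --field-selector spec.nodeName=<suspect-worker> | grep -v Running
```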

Uploading must gather to next comment.

Comment 7 Ryan Fredette 2021-07-16 20:37:17 UTC
The failing dns-default pod was unable to contact the API server, which seems to be the same issue the other pods on compute-1 were seeing:

reflector.go:127] github.com/coredns/coredns/plugin/kubernetes/controller.go:333: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host

Comment 8 Ryan Fredette 2021-07-20 16:42:39 UTC
Since all the pods that can't access the API server seem to be on compute-1, I think this is most likely a networking issue with that node specifically. The cluster was using OpenShift SDN, so I'm passing this to the SDN team.

Comment 9 W. Trevor King 2021-08-18 21:25:49 UTC
UpgradeBlocker keyword semantics are defined in [1].  4.6 is now so old and quiet that I would be very surprised if this ends up being a recent, product-side networking regression.  I'm going to pull the keyword off to remove the bug from my triage queue, but feel free to add it back if further investigation does turn up a 4.6 regression that seems like grounds for removing 4.6.z update recommendations.  I dunno what the status of 4.5 -> 4.6 edge recommendations is now that 4.5 is end-of-life [2].

[1]: https://github.com/openshift/enhancements/pull/475
[2]: https://access.redhat.com/support/policy/updates/openshift/#dates

Comment 10 W. Trevor King 2021-08-18 21:34:51 UTC
Oh, wait, why are you still updating to 4.6.30?  We stopped recommending updates to it based on bug 1953518 back on June 4th [1] (~a day after you opened this bug).  Maybe this bug can be closed as a dup of bug 1953518?

[1]: https://github.com/openshift/cincinnati-graph-data/pull/837#event-4844091800
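
An illustrative way to check which 4.6.z targets are currently recommended from the cluster's version (the channel name below is an assumption):
```
# Show the update recommendations the cluster currently sees.
oc adm upgrade
# Or fetch the update graph directly for manual inspection (channel assumed).
curl -sH 'Accept: application/json' \
  'https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.6&arch=amd64'
```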

Comment 11 Scott Dodson 2021-08-20 15:53:26 UTC
Marking as a dupe. If the issue reproduces when upgrading from 4.5.40 to the latest 4.6.z, please re-open.

*** This bug has been marked as a duplicate of bug 1953518 ***

