Description of problem:

This is also reported upstream here: https://github.com/openshift/local-storage-operator/issues/319

Prometheus cannot scrape metrics from the local-storage-operator pod after upgrading to OCP 4.9:

"lastError": "Get "http://<local storage operator ip>:8383/metrics": dial tcp <local storage operator ip>:8383: connect: connection refused",
"lastError": "Get "http://<local storage operator ip>:8686/metrics": dial tcp <local storage operator ip>:8686: connect: connection refused",

Checking the config, I can verify the IP address is exactly the one Prometheus cannot connect to:

local-storage-operator-76f878db87-qngn4   1/1   Running   0   11h   <local storage operator ip>   <some node>

The ServiceMonitor shows:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2021-11-04T08:33:19Z"
  generation: 1
  labels:
    name: local-storage-operator
  name: local-storage-operator-metrics
  namespace: openshift-local-storage
spec:
  endpoints:
  - bearerTokenSecret:
      key: ""
    port: http-metrics
  - bearerTokenSecret:
      key: ""
    port: cr-metrics
  namespaceSelector: {}
  selector:
    matchLabels:
      name: local-storage-operator

The Service shows:

spec:
  clusterIP:
  clusterIPs:
  -
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http-metrics
    port: 8383
    protocol: TCP
    targetPort: 8383
  - name: cr-metrics
    port: 8686
    protocol: TCP
    targetPort: 8686
  selector:
    name: local-storage-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

But the pod is not listening on those ports at all: 8383 / 8686.
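A quick way to confirm that nothing in the pod listens on the metrics ports (a sketch; it assumes the default deployment name that LSO creates and that curl is available on the workstation):

$ oc -n openshift-local-storage port-forward deploy/local-storage-operator 8383:8383 &
$ curl -sS --max-time 5 http://localhost:8383/metrics
# in the broken state this fails with "connection refused" / "error forwarding port";
# the same happens for port 8686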
Metrics in LSO 4.8 are misconfigured.

In 4.8:
1. LSO 4.8 creates a ServiceMonitor to scrape metrics from the LSO pod itself (ports 8383, 8686).
2. LSO 4.8 does not give Prometheus permission to see Pods / Services in the openshift-local-storage namespace.

In 4.9:
3. LSO does not expose ports 8383 and 8686; we rewrote LSO to a new operator-sdk version that no longer provides these metrics.

Because of 2., Prometheus does not scrape anything in 4.8, without any obvious error exposed to the user. During the upgrade to 4.9, LSO *gives* Prometheus permission to see LSO's Pods and Services, Prometheus sees the ServiceMonitor from 1. and tries to scrape ports 8383 and 8686, but due to 3. it gets "connection refused" -> the targets are reported as "down".

Solution: delete the local-storage-operator-metrics ServiceMonitor from step 1 (a verification sketch follows below):

oc -n openshift-local-storage delete servicemonitor local-storage-operator-metrics

German, can you please create a knowledge base article for this?
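A minimal verification sketch after the deletion (the remaining ServiceMonitor names depend on the LSO version: diskmaker in 4.9, diskmaker + discovery in 4.11):

$ oc -n openshift-local-storage delete servicemonitor local-storage-operator-metrics
$ oc -n openshift-local-storage get servicemonitor
# local-storage-operator-metrics should no longer be listed, and the
# corresponding Prometheus targets should disappear after a few minutes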
Thanks Jan, I will write a knowledge base article. Will removing the ServiceMonitor cause the operator to re-create it?
No, 4.9 LSO does not need that particular ServiceMonitor at all and thus it won't re-create it.
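A quick way to confirm it stays deleted (a sketch; the error output shown is the typical NotFound message):

$ oc -n openshift-local-storage get servicemonitor local-storage-operator-metrics
Error from server (NotFound): servicemonitors.monitoring.coreos.com "local-storage-operator-metrics" not found
# re-running this a few minutes later should still return NotFound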
Tested using this:

1. Install OCP 4.8 + LSO 4.8.
2. Add this label to the namespace openshift-local-storage: openshift.io/cluster-monitoring: "true"
3. Wait a few minutes for Prometheus to re-check ServiceMonitors in the openshift-local-storage namespace:

$ kubectl -n openshift-monitoring logs prometheus-k8s-0 -f
...
level=error ts=2022-03-08T10:08:25.124Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:431: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-local-storage\""

-> Prometheus knows about the ServiceMonitor, but it does not have the RBAC to actually collect any metrics. This is a separate bug :-).

4. Update LSO to 4.9 and wait at least 2 minutes.

Actual result:
* The local-storage-operator-metrics target is down (Prometheus now has RBAC permissions to read LSO objects, but LSO no longer emits the metrics).

Expected result:
* LSO's ServiceMonitor local-storage-operator-metrics is deleted:

$ oc -n openshift-local-storage get servicemonitor
NAME                              AGE
local-storage-diskmaker-metrics   19m

* Prometheus does not report any target as down (a sketch for checking this via the Prometheus API follows below). Especially this step may take quite some time.
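One hedged way to check the target state from the command line, using the in-cluster Prometheus route (assumes you are logged in as a user with monitoring access and that oc whoami -t returns a token; depending on the cluster version the route is fronted by an auth proxy that may require additional permissions):

$ HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
$ curl -skH "Authorization: Bearer $(oc whoami -t)" "https://$HOST/api/v1/targets" | grep -c local-storage-operator-metrics
# expect 0 once the stale ServiceMonitor is gone; a non-zero count means the target is still registered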
Fixing 4.11 first; backports will follow. To test this, you need to update LSO from 4.8 to 4.11. Only the operator needs to be updated; the OCP cluster itself can stay on 4.8.
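For reference, a hedged sketch of updating only the operator via its Subscription (the subscription name and the channel are assumptions; check oc -n openshift-local-storage get subscription and oc get packagemanifest local-storage-operator for the actual values in your catalog):

$ oc -n openshift-local-storage get subscription
$ oc -n openshift-local-storage patch subscription local-storage-operator --type merge -p '{"spec":{"channel":"stable"}}'
$ oc -n openshift-local-storage get csv -w
# wait until the local-storage-operator 4.11 CSV reports PHASE=Succeeded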
Verified starting from local-storage-operator.4.8.0-202203102349:

1. oc label namespace openshift-local-storage openshift.io/cluster-monitoring=true

2. oc -n openshift-monitoring logs prometheus-k8s-0 -f
level=error ts=2022-03-16T03:07:40.967Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:431: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-local-storage\""

3. oc -n openshift-local-storage get servicemonitor
NAME                             AGE
local-storage-operator-metrics   5m21s

4. Upgrade to the LSO 4.11 version:
oc get csv
NAME                                         DISPLAY                            VERSION               REPLACES                                    PHASE
elasticsearch-operator.5.2.9-11              OpenShift Elasticsearch Operator   5.2.9-11                                                          Succeeded
local-storage-operator.4.11.0-202203141858   Local Storage                      4.11.0-202203141858   local-storage-operator.4.8.0-202203102349   Succeeded

5. oc -n openshift-local-storage get servicemonitor
NAME                              AGE
local-storage-discovery-metrics   84s
local-storage-diskmaker-metrics   84s

6. No errors found as in step 2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069