Bug 2052071 - local storage operator metrics target down after upgrade
Summary: local storage operator metrics target down after upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jan Safranek
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks: 2109096
 
Reported: 2022-02-08 16:11 UTC by German Parente
Modified: 2023-01-25 09:55 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Upgrading the Local Storage Operator (LSO) from OpenShift 4.8 to 4.9 left behind an orphaned ServiceMonitor object. Consequence: Prometheus reported the metrics target represented by that ServiceMonitor as not reachable. Fix: We remove the ServiceMonitor during upgrade. Result: Prometheus does not report any unreachable metrics targets.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:48:33 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
  Github openshift/local-storage-operator pull 332 (open): Bug 2052071: Delete obsolete ServiceMonitor (last updated 2022-03-08 11:25:00 UTC)
  Red Hat Knowledge Base (Solution) 6803881 (last updated 2022-03-29 14:15:24 UTC)
  Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:49:09 UTC)

Description German Parente 2022-02-08 16:11:02 UTC
Description of problem:

this is also reported upstream here:

https://github.com/openshift/local-storage-operator/issues/319

Prometheus cannot scrape metrics from the local-storage-operator pod after upgrading to OCP 4.9:

"lastError": "Get "http://<local storage operator ip>:8383/metrics": dial tcp <local storage operator ip>:8383: connect: connection refused",

"lastError": "Get "http://<local storage operator ip>:8686/metrics": dial tcp <local storage operator ip>:8686: connect: connection refused",

Checking the config, I can verify the IP address is exactly the one where Prometheus cannot connect:

local-storage-operator-76f878db87-qngn4 1/1 Running 0 11h <local storage operator ip> <some node>

 
The ServiceMonitor is showing:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      creationTimestamp: "2021-11-04T08:33:19Z"
      generation: 1
      labels:
        name: local-storage-operator
      name: local-storage-operator-metrics
      namespace: openshift-local-storage
    spec:
      endpoints:
      - bearerTokenSecret:
          key: ""
        port: http-metrics
      - bearerTokenSecret:
          key: ""
        port: cr-metrics
      namespaceSelector: {}
      selector:
        matchLabels:
          name: local-storage-operator

The service is showing:

spec:
  clusterIP:
  clusterIPs:
  -
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http-metrics
    port: 8383
    protocol: TCP
    targetPort: 8383
  - name: cr-metrics
    port: 8686
    protocol: TCP
    targetPort: 8686
  selector:
    name: local-storage-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

But the pod is not listening on those ports at all: 8383 / 8686.
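
A minimal way to double-check that, assuming the deployment is named local-storage-operator: forward one of the metrics ports and try to hit it locally.

  # With LSO 4.9 nothing listens on 8383/8686 in the operator pod, so the forwarded
  # connection is expected to fail ("connection refused").
  $ oc -n openshift-local-storage port-forward deploy/local-storage-operator 8383:8383
  $ curl -s http://localhost:8383/metrics    # in a second terminal; expect the request to fail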

Comment 2 Jan Safranek 2022-03-07 14:05:33 UTC
Metrics in LSO 4.8 are misconfigured.

In 4.8:
1. LSO in 4.8 creates a ServiceMonitor to scrape metrics from the LSO Pod itself (ports 8383, 8686).
2. LSO in 4.8 does not give Prometheus permissions to see Pods / Services in the openshift-local-storage namespace.

In 4.9:
3. LSO does not expose ports 8383 and 8686; we rewrote LSO against a new operator-sdk version that no longer provides these metrics.

Because of 2., Prometheus does not scrape anything in 4.8, and no obvious error is exposed to the user. During the upgrade to 4.9, LSO *gives* Prometheus permissions to see LSO's Pods and Services; Prometheus then sees the ServiceMonitor from 1. and tries to scrape ports 8383 and 8686, but due to 3. it gets "connection refused" -> the targets are reported as "down".
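
A quick way to see the RBAC difference from 2. on a live cluster (a hedged sketch using oc's built-in access check and impersonation):

  # Expected to print "no" on 4.8 and "yes" after the upgrade to 4.9, per the explanation above.
  $ oc -n openshift-local-storage auth can-i list pods \
      --as=system:serviceaccount:openshift-monitoring:prometheus-k8s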

Solution: delete the local-storage-operator-metrics ServiceMonitor created in step 1:

  oc -n openshift-local-storage delete servicemonitor local-storage-operator-metrics

German, can you please create a knowledge base article for this?

Comment 3 German Parente 2022-03-07 14:20:29 UTC
Thanks Jan,

I will write a knowledge base article. 

Will removing the ServiceMonitor cause the operator to re-create it?

Comment 4 Jan Safranek 2022-03-07 14:28:16 UTC
No, 4.9 LSO does not need that particular ServiceMonitor at all and thus it won't re-create it.
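
If you want to confirm that on a live cluster, a simple (hedged) check is to watch the namespace's ServiceMonitors for a while after the delete:

  # The deleted local-storage-operator-metrics object should not reappear.
  $ oc -n openshift-local-storage get servicemonitor -w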

Comment 5 Jan Safranek 2022-03-08 11:09:01 UTC
Tested using this:

1. Install OCP 4.8 + LSO 4.8

2. Add this label to the openshift-local-storage namespace (a one-line oc command for this is sketched after step 4):
   openshift.io/cluster-monitoring: "true"

3. Wait a few minutes for Prometheus to re-check ServiceMonitors in the openshift-local-storage namespace:

  $ kubectl -n openshift-monitoring logs prometheus-k8s-0 -f
  ...
  level=error ts=2022-03-08T10:08:25.124Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:431: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-local-storage\""

  -> Prometheus knows about the ServiceMonitor, but it does not have RBAC to actually collect any metrics. This is a separate bug :-).

4. Update LSO to 4.9 and wait at least 2 minutes.
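
Step 2 as a single command, for copy/paste (the same label, applied with oc, as also used in comment 8):

  $ oc label namespace openshift-local-storage openshift.io/cluster-monitoring="true"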

Actual result:
* local-storage-operator-metrics target is down (Prometheus now has RBAC permissions to read LSO objects, but LSO no longer emits the metrics)

Expected result:
* LSO's ServiceMonitor local-storage-operator-metrics is deleted:
  
  $ oc -n openshift-local-storage get servicemonitor

  NAME                              AGE
  local-storage-diskmaker-metrics   19m

* Prometheus does not report any target as down. This step in particular may take quite some time.

Comment 6 Jan Safranek 2022-03-08 11:24:30 UTC
Fixing 4.11 first; backports will follow. To test this, you need to update LSO from 4.8 to 4.11. Only the operator needs to be updated; the OCP cluster itself can still be 4.8.
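
A hedged sketch of "only the operator needs to be updated": switch the LSO Subscription to a newer channel and let OLM upgrade it. The subscription and channel names below are assumptions; check "oc -n openshift-local-storage get subscription -o yaml" for the real ones.

  # Subscription name and channel are placeholders; adjust to what the cluster actually uses.
  $ oc -n openshift-local-storage patch subscription local-storage-operator \
      --type merge -p '{"spec":{"channel":"stable"}}'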

Comment 8 Chao Yang 2022-03-16 03:22:42 UTC
Starting from local-storage-operator.4.8.0-202203102349:
1. oc label namespace openshift-local-storage openshift.io/cluster-monitoring=true
2. oc -n openshift-monitoring logs prometheus-k8s-0 -f
level=error ts=2022-03-16T03:07:40.967Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:431: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"openshift-local-storage\""
3. oc -n openshift-local-storage get servicemonitor
NAME                             AGE
local-storage-operator-metrics   5m21s
4. Upgrade LSO to 4.11:
oc get csv
NAME                                         DISPLAY                            VERSION               REPLACES                                    PHASE
elasticsearch-operator.5.2.9-11              OpenShift Elasticsearch Operator   5.2.9-11                                                          Succeeded
local-storage-operator.4.11.0-202203141858   Local Storage                      4.11.0-202203141858   local-storage-operator.4.8.0-202203102349   Succeeded

5. oc -n openshift-local-storage get servicemonitor
NAME                              AGE
local-storage-discovery-metrics   84s
local-storage-diskmaker-metrics   84s

6. The error from step 2 is no longer seen.

Comment 10 errata-xmlrpc 2022-08-10 10:48:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

