Bug 2075757 - monitoring targets are down after the cluster runs for more than 1 day
Summary: monitoring targets are down after the cluster runs for more than 1 day
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 2033575
Blocks:
 
Reported: 2022-04-15 07:12 UTC by Simon Pasquier
Modified: 2023-01-06 02:23 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2033575
Environment:
Last Closed: 2022-05-11 10:31:47 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Status | Summary | Last Updated
Github openshift/cluster-monitoring-operator pull 1641 | Merged | Bug 2075757: use bearer token as fall-back authn method | 2022-04-28 18:02:21 UTC
Github openshift/cluster-monitoring-operator pull 1653 | open | Bug 2075757: UWM: add SAR capabilities to prometheus cluster role | 2022-04-29 09:48:47 UTC
Red Hat Knowledge Base (Solution) 6956725 | 2022-05-06 08:51:44 UTC
Red Hat Product Errata RHBA-2022:1690 | 2022-05-11 10:32:12 UTC
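
For reference, PR 1641 makes the scrape fall back to bearer-token authentication when client-certificate authentication fails. One way to check which method the rendered scrape configuration actually carries is to inspect the generated config secret; this is a sketch, assuming the standard prometheus-operator layout where the config lives in the secret prometheus-k8s under the key prometheus.yaml.gz:

# oc -n openshift-monitoring get secret prometheus-k8s -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip | grep -B2 -A4 -E 'authorization:|bearer_token_file|tls_config:'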

Comment 3 Junqi Zhao 2022-04-27 04:43:03 UTC
Tested with:
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-04-25-155819   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-04-25-155819

The prometheus-user-workload and thanos-sidecar targets are down after the cluster has been running for 1 day; there is no such issue on 4.11.
# oc get clusterversion version -o jsonpath="{.spec.clusterID}"
b7d1271c-ed03-4b18-8735-b6112064a091

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "prometheus-user-workload",
          "namespace": "openshift-user-workload-monitoring",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-user-workload",
          "severity": "warning"
        },
        "value": [
          1651033241.909,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "prometheus-user-workload-thanos-sidecar",
          "namespace": "openshift-user-workload-monitoring",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-user-workload-thanos-sidecar",
          "severity": "warning"
        },
        "value": [
          1651033241.909,
          "1"
        ]
      }
    ]
  }
}
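
The same query can be narrowed with jq to just the firing jobs, which is handy when checking quickly (same endpoint and token as above):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -s -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq -r '.data.result[].metric.job'
prometheus-user-workload
prometheus-user-workload-thanos-sidecar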


# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
prometheus-operator-6c5df8b7cc-82kfj   2/2     Running   0          26h   10.129.0.59   ip-10-0-152-192.us-east-2.compute.internal   <none>           <none>
prometheus-user-workload-0             5/5     Running   0          26h   10.129.2.13   ip-10-0-128-15.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-1             5/5     Running   0          26h   10.128.2.13   ip-10-0-208-88.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   0          26h   10.129.2.15   ip-10-0-128-15.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   0          26h   10.131.0.16   ip-10-0-182-244.us-east-2.compute.internal   <none>           <none>

Checked the targets API (/api/v1/targets); the affected targets all report "server returned HTTP status 401 Unauthorized":
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/prometheus-user-workload/0",
        "scrapeUrl": "https://10.129.2.13:9091/metrics",
        "globalUrl": "https://10.129.2.13:9091/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:16.692991498Z",
        "lastScrapeDuration": 0.005716866,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/thanos-sidecar/0",
        "scrapeUrl": "https://10.128.2.13:10902/metrics",
        "globalUrl": "https://10.128.2.13:10902/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:14.038719427Z",
        "lastScrapeDuration": 0.00319329,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/thanos-sidecar/0",
        "scrapeUrl": "https://10.129.2.13:10902/metrics",
        "globalUrl": "https://10.129.2.13:10902/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:17.719803208Z",
        "lastScrapeDuration": 0.003442965,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      }
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/prometheus-user-workload/0",
        "scrapeUrl": "https://10.128.2.13:9091/metrics",
        "globalUrl": "https://10.128.2.13:9091/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:30.98821663Z",
        "lastScrapeDuration": 0.00465935,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },
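
For completeness, the excerpts above can be produced in one shot by filtering the targets API for unhealthy targets; a sketch, assuming the prometheus container serves plain HTTP on 127.0.0.1:9090 inside the pod (the default listen address):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/targets' | jq '.data.activeTargets[] | select(.health=="down") | {scrapePool, scrapeUrl, lastError}'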

Using either the prometheus-k8s or the prometheus-user-workload service account token, the endpoints return Unauthorized:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics https://10.129.2.13:9091/metrics https://10.129.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head; echo -e "\n"; done
https://10.128.2.13:9091/metrics
Unauthorized


https://10.128.2.13:10902/metrics
Unauthorized


https://10.129.2.13:9091/metrics
Unauthorized


https://10.129.2.13:10902/metrics
Unauthorized


# token=`oc sa get-token prometheus-user-workload -n openshift-user-workload-monitoring`
# for i in https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics https://10.129.2.13:9091/metrics https://10.129.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head -n 2; echo -e "\n"; done
https://10.128.2.13:9091/metrics
Unauthorized


https://10.128.2.13:10902/metrics
Unauthorized


https://10.129.2.13:9091/metrics
Unauthorized


https://10.129.2.13:10902/metrics
Unauthorized
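
The 401 appears to come from the kube-rbac-proxy sidecar that fronts the metrics ports on the UWM pods, so its logs should show why authentication fails. The sidecar container name varies by release, hence the lookup first; <kube-rbac-proxy-container> below is a placeholder to be filled in from the first command's output:

# oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -o jsonpath='{.spec.containers[*].name}{"\n"}'
# oc -n openshift-user-workload-monitoring logs prometheus-user-workload-0 -c <kube-rbac-proxy-container> --tail=20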

No such issue on 4.11, which is why there are no TargetDown alerts there:
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-26-181148   True        False         3h31m   Cluster version is 4.11.0-0.nightly-2022-04-26-181148

# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
prometheus-operator-7c84c769b7-2bslv   2/2     Running   0          174m   10.129.0.47   ip-10-0-193-51.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-0             6/6     Running   0          174m   10.129.2.14   ip-10-0-153-61.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-1             6/6     Running   0          174m   10.128.2.13   ip-10-0-183-174.us-east-2.compute.internal   <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   0          174m   10.129.2.13   ip-10-0-153-61.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   0          174m   10.131.0.21   ip-10-0-218-71.us-east-2.compute.internal    <none>           <none>

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in https://10.129.2.14:9091/metrics https://10.129.2.14:10902/metrics https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head -n 2; echo -e "\n"; done
https://10.129.2.14:9091/metrics
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter



https://10.129.2.14:10902/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary


https://10.128.2.13:9091/metrics
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter


https://10.128.2.13:10902/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
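
Running the same ALERTS query from comment 3 against this 4.11 cluster should return an empty result vector, consistent with the healthy scrapes above:

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -s -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq '.data.result'
[]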

Comment 9 Junqi Zhao 2022-05-02 00:22:04 UTC
Tested with 4.10.0-0.nightly-2022-04-29-234633: enabled user workload monitoring and watched the cluster for 30 hours; no TargetDown alerts fired for openshift-monitoring or openshift-user-workload-monitoring (see the attached picture).
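
For soak tests like this, one way to spot the alert as soon as it fires is to loop the same instant query; a sketch reusing the token and endpoint from comment 3, with an arbitrary 30-minute interval:

# while true; do date -u; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -s -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq '.data.result | length'; sleep 1800; done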

Comment 14 errata-xmlrpc 2022-05-11 10:31:47 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.13 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1690

