Bug 2075757 - monitoring targets are down after the cluster runs for more than 1 day
Summary: monitoring targets are down after the cluster runs for more than 1 day
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 2033575
Blocks:
 
Reported: 2022-04-15 07:12 UTC by Simon Pasquier
Modified: 2023-01-06 02:23 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2033575
Environment:
Last Closed: 2022-05-11 10:31:47 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Status | Summary | Last Updated
Github openshift/cluster-monitoring-operator pull 1641 | Merged | Bug 2075757: use bearer token as fall-back authn method | 2022-04-28 18:02:21 UTC
Github openshift/cluster-monitoring-operator pull 1653 | open | Bug 2075757: UWM: add SAR capabilities to prometheus cluster role | 2022-04-29 09:48:47 UTC
Red Hat Knowledge Base (Solution) 6956725 | 2022-05-06 08:51:44 UTC
Red Hat Product Errata RHBA-2022:1690 | 2022-05-11 10:32:12 UTC
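
For reference, PR 1641 makes the scrape fall back to bearer-token authentication when client-certificate authentication fails. One way to check which method the rendered scrape configuration actually carries is to inspect the generated config secret; this is a sketch, assuming the standard prometheus-operator layout where the config lives in the secret prometheus-k8s under the key prometheus.yaml.gz:

# oc -n openshift-monitoring get secret prometheus-k8s -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip | grep -B2 -A4 -E 'authorization:|bearer_token_file|tls_config:'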

Comment 3 Junqi Zhao 2022-04-27 04:43:03 UTC
Tested with:
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-04-25-155819   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-04-25-155819

The prometheus-user-workload and thanos-sidecar targets are down after the cluster has been running for 1 day; there is no such issue on 4.11.
# oc get clusterversion version -o jsonpath="{.spec.clusterID}"
b7d1271c-ed03-4b18-8735-b6112064a091

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "prometheus-user-workload",
          "namespace": "openshift-user-workload-monitoring",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-user-workload",
          "severity": "warning"
        },
        "value": [
          1651033241.909,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "prometheus-user-workload-thanos-sidecar",
          "namespace": "openshift-user-workload-monitoring",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-user-workload-thanos-sidecar",
          "severity": "warning"
        },
        "value": [
          1651033241.909,
          "1"
        ]
      }
    ]
  }
}
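
The same query can be narrowed with jq to just the firing jobs, which is handy when checking quickly (same endpoint and token as above):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -s -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq -r '.data.result[].metric.job'
prometheus-user-workload
prometheus-user-workload-thanos-sidecar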


# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
prometheus-operator-6c5df8b7cc-82kfj   2/2     Running   0          26h   10.129.0.59   ip-10-0-152-192.us-east-2.compute.internal   <none>           <none>
prometheus-user-workload-0             5/5     Running   0          26h   10.129.2.13   ip-10-0-128-15.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-1             5/5     Running   0          26h   10.128.2.13   ip-10-0-208-88.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   0          26h   10.129.2.15   ip-10-0-128-15.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   0          26h   10.131.0.16   ip-10-0-182-244.us-east-2.compute.internal   <none>           <none>

Checked the targets API (/api/v1/targets); the affected targets all report "server returned HTTP status 401 Unauthorized":
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/prometheus-user-workload/0",
        "scrapeUrl": "https://10.129.2.13:9091/metrics",
        "globalUrl": "https://10.129.2.13:9091/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:16.692991498Z",
        "lastScrapeDuration": 0.005716866,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/thanos-sidecar/0",
        "scrapeUrl": "https://10.128.2.13:10902/metrics",
        "globalUrl": "https://10.128.2.13:10902/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:14.038719427Z",
        "lastScrapeDuration": 0.00319329,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/thanos-sidecar/0",
        "scrapeUrl": "https://10.129.2.13:10902/metrics",
        "globalUrl": "https://10.129.2.13:10902/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:17.719803208Z",
        "lastScrapeDuration": 0.003442965,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      }
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/prometheus-user-workload/0",
        "scrapeUrl": "https://10.128.2.13:9091/metrics",
        "globalUrl": "https://10.128.2.13:9091/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:30.98821663Z",
        "lastScrapeDuration": 0.00465935,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },
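
For completeness, the excerpts above can be produced in one shot by filtering the targets API for unhealthy targets; a sketch, assuming the prometheus container serves plain HTTP on 127.0.0.1:9090 inside the pod (the default listen address):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/targets' | jq '.data.activeTargets[] | select(.health=="down") | {scrapePool, scrapeUrl, lastError}'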

Using either the prometheus-k8s or the prometheus-user-workload service account token, the endpoints return Unauthorized:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics https://10.129.2.13:9091/metrics https://10.129.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head; echo -e "\n"; done
https://10.128.2.13:9091/metrics
Unauthorized


https://10.128.2.13:10902/metrics
Unauthorized


https://10.129.2.13:9091/metrics
Unauthorized


https://10.129.2.13:10902/metrics
Unauthorized


# token=`oc sa get-token prometheus-user-workload -n openshift-user-workload-monitoring`
# for i in https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics https://10.129.2.13:9091/metrics https://10.129.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head -n 2; echo -e "\n"; done
https://10.128.2.13:9091/metrics
Unauthorized


https://10.128.2.13:10902/metrics
Unauthorized


https://10.129.2.13:9091/metrics
Unauthorized


https://10.129.2.13:10902/metrics
Unauthorized
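
The 401 appears to come from the kube-rbac-proxy sidecar that fronts the metrics ports on the UWM pods, so its logs should show why authentication fails. The sidecar container name varies by release, hence the lookup first; <kube-rbac-proxy-container> below is a placeholder to be filled in from the first command's output:

# oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -o jsonpath='{.spec.containers[*].name}{"\n"}'
# oc -n openshift-user-workload-monitoring logs prometheus-user-workload-0 -c <kube-rbac-proxy-container> --tail=20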

No such issue on 4.11, which is why there are no TargetDown alerts there:
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-26-181148   True        False         3h31m   Cluster version is 4.11.0-0.nightly-2022-04-26-181148

# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
prometheus-operator-7c84c769b7-2bslv   2/2     Running   0          174m   10.129.0.47   ip-10-0-193-51.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-0             6/6     Running   0          174m   10.129.2.14   ip-10-0-153-61.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-1             6/6     Running   0          174m   10.128.2.13   ip-10-0-183-174.us-east-2.compute.internal   <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   0          174m   10.129.2.13   ip-10-0-153-61.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   0          174m   10.131.0.21   ip-10-0-218-71.us-east-2.compute.internal    <none>           <none>

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in https://10.129.2.14:9091/metrics https://10.129.2.14:10902/metrics https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head -n 2; echo -e "\n"; done
https://10.129.2.14:9091/metrics
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter



https://10.129.2.14:10902/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary


https://10.128.2.13:9091/metrics
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter


https://10.128.2.13:10902/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
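
Running the same ALERTS query from comment 3 against this 4.11 cluster should return an empty result vector, consistent with the healthy scrapes above:

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -s -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq '.data.result'
[]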

Comment 9 Junqi Zhao 2022-05-02 00:22:04 UTC
Tested with 4.10.0-0.nightly-2022-04-29-234633: enabled user workload monitoring and watched the cluster for 30 hours; no TargetDown alerts fired for openshift-monitoring or openshift-user-workload-monitoring (see the attached picture).
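
For soak tests like this, one way to spot the alert as soon as it fires is to loop the same instant query; a sketch reusing the token and endpoint from comment 3, with an arbitrary 30-minute interval:

# while true; do date -u; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -s -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq '.data.result | length'; sleep 1800; done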

Comment 14 errata-xmlrpc 2022-05-11 10:31:47 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.13 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1690

