Bug 2075757

Summary: monitoring targets are down after the cluster has run for more than 1 day
Product: OpenShift Container Platform
Component: Monitoring
Reporter: Simon Pasquier <spasquie>
Assignee: Simon Pasquier <spasquie>
QA Contact: Junqi Zhao <juzhao>
Status: CLOSED ERRATA
Severity: medium
Priority: high
Version: 4.10
Target Release: 4.10.z
Target Milestone: ---
Keywords: Upgrades
Hardware: Unspecified
OS: Unspecified
CC: amuller, anpicker, erooth, hongyli, juzhao, kgordeev, llopezmo, sdodson, spasquie, wking
Clone Of: 2033575
Bug Depends On: 2033575
Last Closed: 2022-05-11 10:31:47 UTC

Comment 3 Junqi Zhao 2022-04-27 04:43:03 UTC
Tested with
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-04-25-155819   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-04-25-155819

The prometheus-user-workload and thanos-sidecar targets are down after the cluster has been running for 1 day; there is no such issue on 4.11.
# oc get clusterversion version -o jsonpath="{.spec.clusterID}"
b7d1271c-ed03-4b18-8735-b6112064a091

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "prometheus-user-workload",
          "namespace": "openshift-user-workload-monitoring",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-user-workload",
          "severity": "warning"
        },
        "value": [
          1651033241.909,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "prometheus-user-workload-thanos-sidecar",
          "namespace": "openshift-user-workload-monitoring",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-user-workload-thanos-sidecar",
          "severity": "warning"
        },
        "value": [
          1651033241.909,
          "1"
        ]
      }
    ]
  }
}


# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
prometheus-operator-6c5df8b7cc-82kfj   2/2     Running   0          26h   10.129.0.59   ip-10-0-152-192.us-east-2.compute.internal   <none>           <none>
prometheus-user-workload-0             5/5     Running   0          26h   10.129.2.13   ip-10-0-128-15.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-1             5/5     Running   0          26h   10.128.2.13   ip-10-0-208-88.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   0          26h   10.129.2.15   ip-10-0-128-15.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   0          26h   10.131.0.16   ip-10-0-182-244.us-east-2.compute.internal   <none>           <none>

Checked the Prometheus targets API; the prometheus-user-workload and thanos-sidecar scrape targets report 401 Unauthorized.
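The exact query was not recorded here; the entries below can be retrieved from the platform Prometheus targets API with something along these lines (the endpoint and the jq filter are assumptions, not necessarily the command that was used):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets' | jq '.data.activeTargets[] | select(.health=="down")'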
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/prometheus-user-workload/0",
        "scrapeUrl": "https://10.129.2.13:9091/metrics",
        "globalUrl": "https://10.129.2.13:9091/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:16.692991498Z",
        "lastScrapeDuration": 0.005716866,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/thanos-sidecar/0",
        "scrapeUrl": "https://10.128.2.13:10902/metrics",
        "globalUrl": "https://10.128.2.13:10902/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:14.038719427Z",
        "lastScrapeDuration": 0.00319329,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/thanos-sidecar/0",
        "scrapeUrl": "https://10.129.2.13:10902/metrics",
        "globalUrl": "https://10.129.2.13:10902/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:17.719803208Z",
        "lastScrapeDuration": 0.003442965,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      }
...
        "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/prometheus-user-workload/0",
        "scrapeUrl": "https://10.128.2.13:9091/metrics",
        "globalUrl": "https://10.128.2.13:9091/metrics",
        "lastError": "server returned HTTP status 401 Unauthorized",
        "lastScrape": "2022-04-27T04:01:30.98821663Z",
        "lastScrapeDuration": 0.00465935,
        "health": "down",
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      },

Using either the prometheus-k8s or the prometheus-user-workload service account token, the scrape endpoints return Unauthorized:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics https://10.129.2.13:9091/metrics https://10.129.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head; echo -e "\n"; done
https://10.128.2.13:9091/metrics
Unauthorized


https://10.128.2.13:10902/metrics
Unauthorized


https://10.129.2.13:9091/metrics
Unauthorized


https://10.129.2.13:10902/metrics
Unauthorized


# token=`oc sa get-token prometheus-user-workload -n openshift-user-workload-monitoring`
# for i in https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics https://10.129.2.13:9091/metrics https://10.129.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head -n 2; echo -e "\n"; done
https://10.128.2.13:9091/metrics
Unauthorized


https://10.128.2.13:10902/metrics
Unauthorized


https://10.129.2.13:9091/metrics
Unauthorized


https://10.129.2.13:10902/metrics
Unauthorized
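
To narrow down whether the 401 comes from an invalid token or from the authorizing proxy in front of the targets, the same token can also be checked against the API server; this is a debugging sketch, not part of the original verification (the use of the /apis discovery endpoint is an assumption):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -s -o /dev/null -w '%{http_code}\n' -H "Authorization: Bearer $token" https://kubernetes.default.svc/apis

A 200 here while the scrapes still return Unauthorized would point at the authorization layer on the target side rather than at the token itself.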

No such issue on 4.11, which is why there are no TargetDown alerts there:
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-26-181148   True        False         3h31m   Cluster version is 4.11.0-0.nightly-2022-04-26-181148

# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
prometheus-operator-7c84c769b7-2bslv   2/2     Running   0          174m   10.129.0.47   ip-10-0-193-51.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-0             6/6     Running   0          174m   10.129.2.14   ip-10-0-153-61.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-1             6/6     Running   0          174m   10.128.2.13   ip-10-0-183-174.us-east-2.compute.internal   <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   0          174m   10.129.2.13   ip-10-0-153-61.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   0          174m   10.131.0.21   ip-10-0-218-71.us-east-2.compute.internal    <none>           <none>

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in https://10.129.2.14:9091/metrics https://10.129.2.14:10902/metrics https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head -n 2; echo -e "\n"; done
https://10.129.2.14:9091/metrics
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter



https://10.129.2.14:10902/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary


https://10.128.2.13:9091/metrics
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter


https://10.128.2.13:10902/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary

Comment 9 Junqi Zhao 2022-05-02 00:22:04 UTC
Tested with 4.10.0-0.nightly-2022-04-29-234633: enabled user workload monitoring and monitored the cluster for 30 hours; no TargetDown alerts fired for openshift-monitoring or openshift-user-workload-monitoring (see the attached picture).
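
For reference, user workload monitoring is typically enabled through the cluster-monitoring-config ConfigMap; a minimal sketch of that setting is below (the exact steps used for this verification were not recorded):

# cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF

The ALERTS query from comment 3 can then be re-run against thanos-querier; an empty result for alertname="TargetDown" in these namespaces confirms the targets stay up.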

Comment 14 errata-xmlrpc 2022-05-11 10:31:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.13 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1690