Tested with:

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-04-25-155819   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-04-25-155819

The prometheus-user-workload and thanos-sidecar targets go down after the cluster has been running for about 1 day; 4.11 does not have this issue.

# oc get clusterversion version -o jsonpath="{.spec.clusterID}"
b7d1271c-ed03-4b18-8735-b6112064a091

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "prometheus-user-workload",
          "namespace": "openshift-user-workload-monitoring",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-user-workload",
          "severity": "warning"
        },
        "value": [
          1651033241.909,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "prometheus-user-workload-thanos-sidecar",
          "namespace": "openshift-user-workload-monitoring",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-user-workload-thanos-sidecar",
          "severity": "warning"
        },
        "value": [
          1651033241.909,
          "1"
        ]
      }
    ]
  }
}

# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
prometheus-operator-6c5df8b7cc-82kfj   2/2     Running   0          26h   10.129.0.59   ip-10-0-152-192.us-east-2.compute.internal   <none>           <none>
prometheus-user-workload-0             5/5     Running   0          26h   10.129.2.13   ip-10-0-128-15.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-1             5/5     Running   0          26h   10.128.2.13   ip-10-0-208-88.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   0          26h   10.129.2.15   ip-10-0-128-15.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   0          26h   10.131.0.16   ip-10-0-182-244.us-east-2.compute.internal   <none>           <none>

Checked the targets API (a sketch of the query is given at the end of this comment); every affected target reports "server returned HTTP status 401 Unauthorized":

      "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/prometheus-user-workload/0",
      "scrapeUrl": "https://10.129.2.13:9091/metrics",
      "globalUrl": "https://10.129.2.13:9091/metrics",
      "lastError": "server returned HTTP status 401 Unauthorized",
      "lastScrape": "2022-04-27T04:01:16.692991498Z",
      "lastScrapeDuration": 0.005716866,
      "health": "down",
      "scrapeInterval": "30s",
      "scrapeTimeout": "10s"
    },
...
      "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/thanos-sidecar/0",
      "scrapeUrl": "https://10.128.2.13:10902/metrics",
      "globalUrl": "https://10.128.2.13:10902/metrics",
      "lastError": "server returned HTTP status 401 Unauthorized",
      "lastScrape": "2022-04-27T04:01:14.038719427Z",
      "lastScrapeDuration": 0.00319329,
      "health": "down",
      "scrapeInterval": "30s",
      "scrapeTimeout": "10s"
    },
...
      "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/thanos-sidecar/0",
      "scrapeUrl": "https://10.129.2.13:10902/metrics",
      "globalUrl": "https://10.129.2.13:10902/metrics",
      "lastError": "server returned HTTP status 401 Unauthorized",
      "lastScrape": "2022-04-27T04:01:17.719803208Z",
      "lastScrapeDuration": 0.003442965,
      "health": "down",
      "scrapeInterval": "30s",
      "scrapeTimeout": "10s"
    }
...
      "scrapePool": "serviceMonitor/openshift-user-workload-monitoring/prometheus-user-workload/0",
      "scrapeUrl": "https://10.128.2.13:9091/metrics",
      "globalUrl": "https://10.128.2.13:9091/metrics",
      "lastError": "server returned HTTP status 401 Unauthorized",
      "lastScrape": "2022-04-27T04:01:30.98821663Z",
      "lastScrapeDuration": 0.00465935,
      "health": "down",
      "scrapeInterval": "30s",
      "scrapeTimeout": "10s"
    },

Scraping the endpoints directly with either the prometheus-k8s or the prometheus-user-workload service account token also returns Unauthorized:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics https://10.129.2.13:9091/metrics https://10.129.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head; echo -e "\n"; done
https://10.128.2.13:9091/metrics
Unauthorized

https://10.128.2.13:10902/metrics
Unauthorized

https://10.129.2.13:9091/metrics
Unauthorized

https://10.129.2.13:10902/metrics
Unauthorized

# token=`oc sa get-token prometheus-user-workload -n openshift-user-workload-monitoring`
# for i in https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics https://10.129.2.13:9091/metrics https://10.129.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head -n 2; echo -e "\n"; done
https://10.128.2.13:9091/metrics
Unauthorized

https://10.128.2.13:10902/metrics
Unauthorized

https://10.129.2.13:9091/metrics
Unauthorized

https://10.129.2.13:10902/metrics
Unauthorized

There is no such issue on 4.11, which is why no TargetDown alerts fire there:

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-26-181148   True        False         3h31m   Cluster version is 4.11.0-0.nightly-2022-04-26-181148

# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
prometheus-operator-7c84c769b7-2bslv   2/2     Running   0          174m   10.129.0.47   ip-10-0-193-51.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-0             6/6     Running   0          174m   10.129.2.14   ip-10-0-153-61.us-east-2.compute.internal    <none>           <none>
prometheus-user-workload-1             6/6     Running   0          174m   10.128.2.13   ip-10-0-183-174.us-east-2.compute.internal   <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   0          174m   10.129.2.13   ip-10-0-153-61.us-east-2.compute.internal    <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   0          174m   10.131.0.21   ip-10-0-218-71.us-east-2.compute.internal    <none>           <none>

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# for i in https://10.129.2.14:9091/metrics https://10.129.2.14:10902/metrics https://10.128.2.13:9091/metrics https://10.128.2.13:10902/metrics; do echo $i; oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" $i | head -n 2; echo -e "\n"; done
https://10.129.2.14:9091/metrics
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter

https://10.129.2.14:10902/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary

https://10.128.2.13:9091/metrics
# HELP go_gc_cycles_automatic_gc_cycles_total Count of completed GC cycles generated by the Go runtime.
# TYPE go_gc_cycles_automatic_gc_cycles_total counter

https://10.128.2.13:10902/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
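For reference, the target entries quoted above come from the Prometheus targets API of the platform Prometheus (the TargetDown alerts carry the prometheus="openshift-monitoring/k8s" label). A command along these lines reproduces the listing; this is only a sketch, since the exact command used for the check above was not captured, and the prometheus-k8s service URL and the jq filter are illustrative assumptions:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets' | jq '.data.activeTargets[] | {scrapePool, scrapeUrl, lastError, health}'

Adding | select(.health == "down") to the jq filter narrows the output to the failing scrapes only.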
Verified with 4.10.0-0.nightly-2022-04-29-234633: enabled user workload monitoring and monitored the cluster for 30 hours; no TargetDown alerts fired for openshift-monitoring or openshift-user-workload-monitoring (see the attached picture).
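One way to confirm the absence of the alert from the command line is to reuse the Thanos Querier query from the original report; a sketch, with the trailing jq filter on the result array added here for readability:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="TargetDown"}' | jq '.data.result'

An empty array ([]) means no TargetDown alert is firing for either namespace.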
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.13 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1690