Description of problem:
TargetDown alert is firing in 4.5 clusters and in CI.

{"metric":{"alertname":"TargetDown","alertstate":"firing","job":"multus-admission-controller-monitor-service","namespace":"openshift-multus","service":"multus-admission-controller-monitor-service","severity":"warning"},"value":[1585130732.425,"34"]}

Version-Release number of selected component (if applicable):
current 4.5 CI and 4.5.0-0.nightly-2020-03-26-031938

Expected results:
Alert should not be firing.
This is killing CI:

$ curl -s 'https://search.svc.ci.openshift.org/search?search=promQL+query:+count_over_time.*reported+incorrect+results&type=build-log&maxAge=24h&context=0' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' | sed 's|\\||g' | jq -r '.[] | select(.metric.alertname == "TargetDown").metric | .namespace + " " + .job' | sort | uniq -c | sort -n | tail -n5
      1 openshift-service-catalog-apiserver-operator metrics
      1 openshift-service-catalog-controller-manager-operator metrics
      2 openshift-authentication-operator metrics
     13 openshift-console-operator metrics
    175 openshift-multus multus-admission-controller-monitor-service
Example promotion job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/923
@aputtur, could you include your PR in this bug? Thanks!
Tested and verified in 4.5.0-0.nightly-2020-04-21-103613

[weliang@weliang networking]$ token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
[weliang@weliang networking]$ oc get routes -A | grep prometheus
openshift-monitoring   prometheus-k8s   prometheus-k8s-openshift-monitoring.apps.qe-weliangsdn2.qe.devcluster.openshift.com   prometheus-k8s   web   reencrypt/Redirect   None
[weliang@weliang networking]$ curl -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.qe-weliangsdn2.qe.devcluster.openshift.com/api/v1/alerts | grep TargetDown
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8128    0  8128    0     0  20125      0 --:--:-- --:--:-- --:--:-- 20118
[weliang@weliang networking]$ curl -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.qe-weliangsdn2.qe.devcluster.openshift.com/api/v1/alerts | grep TargetDown | grep multus
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8124    0  8124    0     0  17594      0 --:--:-- --:--:-- --:--:-- 17584
[weliang@weliang networking]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-21-103613   True        False         5h32m   Cluster version is 4.5.0-0.nightly-2020-04-21-103613
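For anyone re-running this verification: grepping the raw /api/v1/alerts body only shows that the string "TargetDown" is absent, and the curl progress meter makes the empty result easy to misread. Below is a minimal sketch of a stricter check that parses the response with jq and counts firing TargetDown alerts for one namespace. It assumes jq is installed and that the response has the usual Prometheus shape (.data.alerts[] with .labels and .state). The function name count_targetdown is made up for illustration and is not part of oc or Prometheus.

```shell
#!/bin/sh
# Sketch only: count firing TargetDown alerts for a single namespace.
# Input:  JSON body of Prometheus' /api/v1/alerts on stdin.
# Arg 1:  namespace to check (for this bug: openshift-multus).
# Output: the number of matching alerts, one integer on stdout.
# count_targetdown is a hypothetical helper name.
count_targetdown() {
    jq --arg ns "$1" '
        [ .data.alerts[]
          | select(.labels.alertname == "TargetDown"
                   and .labels.namespace == $ns
                   and .state == "firing") ]
        | length'
}

# Possible usage against a live cluster (host and token as in the session above):
#   curl -sk -H "Authorization: Bearer $token" "https://$host/api/v1/alerts" \
#       | count_targetdown openshift-multus
# A result of 0 would mean no firing TargetDown alert for that namespace.
```

Passing -s to curl suppresses the progress meter, so the only output is the count.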
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409