Bug 1956768 - aws-ebs-csi-driver-controller-metrics TargetDown
Summary: aws-ebs-csi-driver-controller-metrics TargetDown
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jan Safranek
QA Contact: Qin Ping
Depends On:
Reported: 2021-05-04 11:37 UTC by Petr Muller
Modified: 2021-05-07 01:41 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
test: openshift-tests.[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Last Closed:
Target Upstream Version:

Attachments

System: GitHub
ID: openshift csi-external-provisioner pull 41
Status: open
Summary: Bug 1956768: UPSTREAM: 620: Fix migration metric registration
Last Updated: 2021-05-04 18:59:14 UTC

Description Petr Muller 2021-05-04 11:37:14 UTC
Since https://amd64.ocp.releases.ci.openshift.org/releasestream/4.8.0-0.ci/release/4.8.0-0.ci-2021-05-04-055526 (which contains the csi-external-provisioner rebase https://bugzilla.redhat.com/show_bug.cgi?id=1924439 as the likely culprit), the following tests have started to fail in AWS serial jobs:


[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] (1m10s)
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:493]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: promQL query returned unexpected results:
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "aws-ebs-csi-driver-controller-metrics",
          "namespace": "openshift-cluster-csi-drivers",
          "prometheus": "openshift-monitoring/k8s",
          "service": "aws-ebs-csi-driver-controller-metrics",
          "severity": "warning"
        },
        "value": [
          1620123216.319,
          "1"
        ]
      }
    ]

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] (20s)
fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: May  4 11:10:18.079: Unexpected alerts fired or pending after the test run:

alert TargetDown fired for 3461 seconds with labels: {job="aws-ebs-csi-driver-controller-metrics", namespace="openshift-cluster-csi-drivers", service="aws-ebs-csi-driver-controller-metrics", severity="warning"}

Comment 1 Petr Muller 2021-05-04 11:38:16 UTC
Setting to high; this blocks CI release acceptance.

Comment 2 Jan Safranek 2021-05-04 16:22:46 UTC
It's indeed related to the rebase; the provisioner returns 500 to the metrics request:

sh-4.4# curl -v localhost:8202/metrics
*   Trying ::1...
* connect to ::1 port 8202 failed: Connection refused
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8202 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8202
> User-Agent: curl/7.61.1
> Accept: */*
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 04 May 2021 16:20:47 GMT
< Content-Length: 243
An error has occurred while serving metrics:

gathered metric family process_start_time_seconds has help "[ALPHA] Start time of the process since unix epoch in seconds." but should have "Start time of the process since unix epoch in seconds."
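The failure mode can be reproduced in miniature: Prometheus-style gatherers reject a scrape when two registered collectors report the same metric family name with conflicting help strings. A minimal stdlib-only Python sketch of that consistency check (the function and data layout are illustrative, not the provisioner's actual Go code):

```python
def gather(collectors):
    """Merge metric families; fail if one name carries conflicting help text."""
    seen = {}  # family name -> help string
    for family in collectors:
        name, help_text = family["name"], family["help"]
        if name in seen and seen[name] != help_text:
            raise ValueError(
                f'gathered metric family {name} has help "{help_text}" '
                f'but should have "{seen[name]}"'
            )
        seen[name] = help_text
    return seen

# Two registrations of process_start_time_seconds with different help
# strings (one prefixed "[ALPHA]") reproduce the 500's error message.
collectors = [
    {"name": "process_start_time_seconds",
     "help": "Start time of the process since unix epoch in seconds."},
    {"name": "process_start_time_seconds",
     "help": "[ALPHA] Start time of the process since unix epoch in seconds."},
]
err_msg = None
try:
    gather(collectors)
except ValueError as e:
    err_msg = str(e)
    print(err_msg)
```

The printed message matches the shape of the error above: the second collector registers the same family under a different (here "[ALPHA]"-prefixed) help string, so the gather aborts and the metrics endpoint serves a 500 instead of a scrape.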

Comment 3 Jan Safranek 2021-05-04 17:14:35 UTC
I am unable to find the root cause: the same container works locally; it only breaks in a real OCP cluster.

Comment 4 Jan Safranek 2021-05-04 18:03:45 UTC
It's related to CSI migration: the provisioner behaves differently for migratable CSI drivers and registers the wrong metrics. The registration should include the "metrics.WithProcessStartTime(false)" option to prevent double registration of the startup-time metric.
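The fix pattern can be sketched as follows. This is a hedged stdlib-only Python illustration of what an opt-out like metrics.WithProcessStartTime(false) accomplishes, not the provisioner's actual Go code; the class and parameter names are invented for the example:

```python
import time

class MetricsManager:
    """Toy metrics manager: optionally auto-registers the process start time."""
    def __init__(self, registry, with_process_start_time=True):
        self.registry = registry
        if with_process_start_time:
            if "process_start_time_seconds" in registry:
                # This is the collision hit by the migrated driver's manager.
                raise ValueError("duplicate registration of process_start_time_seconds")
            registry["process_start_time_seconds"] = time.time()

registry = {}
MetricsManager(registry)  # first manager registers the startup-time metric
try:
    MetricsManager(registry)  # second manager without the opt-out: collision
except ValueError as e:
    print("without option:", e)

# The fix: the second manager (for the migrated in-tree driver) opts out,
# analogous to passing metrics.WithProcessStartTime(false).
MetricsManager(registry, with_process_start_time=False)
print("with option: metrics =", sorted(registry))
```

With the opt-out, process_start_time_seconds is registered exactly once, so the gather no longer fails and the /metrics endpoint serves scrapes again.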


Comment 6 Qin Ping 2021-05-07 01:41:27 UTC
Verified with: 4.8.0-0.nightly-2021-05-06-210840
