Bug 1956768 - aws-ebs-csi-driver-controller-metrics TargetDown
Summary: aws-ebs-csi-driver-controller-metrics TargetDown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jan Safranek
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-04 11:37 UTC by Petr Muller
Modified: 2021-07-27 23:06 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
test: openshift-tests.[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Last Closed: 2021-07-27 23:06:09 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift csi-external-provisioner pull 41 0 None open Bug 1956768: UPSTREAM: 620: Fix migration metric registration 2021-05-04 18:59:14 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:06:25 UTC

Description Petr Muller 2021-05-04 11:37:14 UTC
Since https://amd64.ocp.releases.ci.openshift.org/releasestream/4.8.0-0.ci/release/4.8.0-0.ci-2021-05-04-055526 (which contains the csi-external-provisioner rebase https://bugzilla.redhat.com/show_bug.cgi?id=1924439 as likely culprit), the following tests started to fail in AWS serial jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1389458816164171776
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1389471478365294592

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] (1m10s)
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:493]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "promQL query returned unexpected results:\nALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1\n[\n  {\n    \"metric\": {\n      \"__name__\": \"ALERTS\",\n      \"alertname\": \"TargetDown\",\n      \"alertstate\": \"firing\",\n      \"job\": \"aws-ebs-csi-driver-controller-metrics\",\n      \"namespace\": \"openshift-cluster-csi-drivers\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"service\": \"aws-ebs-csi-driver-controller-metrics\",\n      \"severity\": \"warning\"\n    },\n    \"value\": [\n      1620123216.319,\n      \"1\"\n    ]\n  }\n]",
        },
    ]
    promQL query returned unexpected results:
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "aws-ebs-csi-driver-controller-metrics",
          "namespace": "openshift-cluster-csi-drivers",
          "prometheus": "openshift-monitoring/k8s",
          "service": "aws-ebs-csi-driver-controller-metrics",
          "severity": "warning"
        },
        "value": [
          1620123216.319,
          "1"
        ]
      }
    ]


[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] (20s)
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: May  4 11:10:18.079: Unexpected alerts fired or pending after the test run:

alert TargetDown fired for 3461 seconds with labels: {job="aws-ebs-csi-driver-controller-metrics", namespace="openshift-cluster-csi-drivers", service="aws-ebs-csi-driver-controller-metrics", severity="warning"}
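
The filtering that the first failing test's promQL query performs can be sketched in Go. This is only an illustration of the selector semantics (firing state, severity other than "info", alert name not in the allowed set), not the code used by openshift/origin. The `alert` type and the `unexpectedFiring` helper are hypothetical names introduced here. PromQL regex matchers are fully anchored, so the pattern below is anchored explicitly.

```go
package main

import (
	"fmt"
	"regexp"
)

// alert is a hypothetical, stripped-down view of one ALERTS series.
type alert struct {
	name     string
	state    string
	severity string
}

// allowedAlerts mirrors the alertname!~"..." matcher from the query.
// PromQL anchors regex matchers, so ^(?:...)$ is spelled out here.
var allowedAlerts = regexp.MustCompile(
	`^(?:Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards)$`)

// unexpectedFiring returns the alerts the query would report: firing,
// not severity "info", and not one of the allowed alert names.
func unexpectedFiring(alerts []alert) []alert {
	var out []alert
	for _, a := range alerts {
		if a.state == "firing" && a.severity != "info" && !allowedAlerts.MatchString(a.name) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	alerts := []alert{
		{name: "Watchdog", state: "firing", severity: "none"},
		{name: "TargetDown", state: "firing", severity: "warning"},
	}
	for _, a := range unexpectedFiring(alerts) {
		fmt.Printf("unexpected alert: %s (%s)\n", a.name, a.severity)
	}
}
```

With the sample input, only the TargetDown alert is reported, which matches the shape of the failure above.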

Comment 1 Petr Muller 2021-05-04 11:38:16 UTC
Setting to high, this blocks CI release acceptance

Comment 2 Jan Safranek 2021-05-04 16:22:46 UTC
It is indeed related to the rebase. The provisioner returns HTTP 500 to metrics requests:

sh-4.4# curl -v localhost:8202/metrics
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 8202 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8202 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8202
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 04 May 2021 16:20:47 GMT
< Content-Length: 243
< 
An error has occurred while serving metrics:

gathered metric family process_start_time_seconds has help "[ALPHA] Start time of the process since unix epoch in seconds." but should have "Start time of the process since unix epoch in seconds."

Comment 3 Jan Safranek 2021-05-04 17:14:35 UTC
I am unable to find the root cause. The same container works locally; it breaks only in a real OCP cluster.

Comment 4 Jan Safranek 2021-05-04 18:03:45 UTC
It's related to CSI migration: the provisioner behaves differently for migratable CSI drivers and registers the wrong metrics. This registration should include the "metrics.WithProcessStartTime(false)" option to prevent double registration of the process start time metric:

https://github.com/kubernetes-csi/external-provisioner/blob/5d1b62aaa38b309e2c845e97efdc24c944fb66d8/cmd/csi-provisioner/csi-provisioner.go#L225
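
The double registration can be illustrated with a simplified, standard-library-only model. It does not use the real Prometheus or csi-lib-utils APIs: the `family` type, `mergeFamilies`, and `migrationMetrics` are hypothetical names, and the migration metric name is made up. The sketch assumes that two metric sources are merged for one /metrics response and that a family name may carry only one help string, which is the condition the error message in Comment 2 describes.

```go
package main

import "fmt"

// family is a stripped-down stand-in for a Prometheus metric family.
type family struct {
	name string
	help string
}

// mergeFamilies combines families from several sources and rejects a
// family name that shows up with two different help strings.
func mergeFamilies(sources ...[]family) (map[string]string, error) {
	merged := make(map[string]string)
	for _, src := range sources {
		for _, f := range src {
			if existing, ok := merged[f.name]; ok && existing != f.help {
				return nil, fmt.Errorf(
					"gathered metric family %s has help %q but should have %q",
					f.name, f.help, existing)
			}
			merged[f.name] = f.help
		}
	}
	return merged, nil
}

// migrationMetrics models the extra registration done for migratable
// drivers. withProcessStartTime plays the role of the option named in
// the comment above; the migration metric itself is invented.
func migrationMetrics(withProcessStartTime bool) []family {
	fams := []family{{
		name: "csi_migration_example_total",
		help: "[ALPHA] Hypothetical migration metric.",
	}}
	if withProcessStartTime {
		fams = append(fams, family{
			name: "process_start_time_seconds",
			help: "[ALPHA] Start time of the process since unix epoch in seconds.",
		})
	}
	return fams
}

func main() {
	base := []family{{
		name: "process_start_time_seconds",
		help: "Start time of the process since unix epoch in seconds.",
	}}

	_, err := mergeFamilies(base, migrationMetrics(true))
	fmt.Println("with start time:   ", err)

	_, err = mergeFamilies(base, migrationMetrics(false))
	fmt.Println("without start time:", err)
}
```

In this model, leaving the start time metric out of the second source is enough to make the merge succeed, which is the effect the proposed option is meant to have.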

Comment 6 Qin Ping 2021-05-07 01:41:27 UTC
Verified with: 4.8.0-0.nightly-2021-05-06-210840

Comment 9 errata-xmlrpc 2021-07-27 23:06:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

