Bug 1956768

Summary:	aws-ebs-csi-driver-controller-metrics TargetDown
Product:	OpenShift Container Platform	Reporter:	Petr Muller <pmuller>
Component:	Storage	Assignee:	Jan Safranek <jsafrane>
Storage sub component:	Storage	QA Contact:	Qin Ping <piqin>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	aos-bugs, jsafrane, wking
Version:	4.8
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:	test: openshift-tests.[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Last Closed:	2021-07-27 23:06:09 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Petr Muller 2021-05-04 11:37:14 UTC

Since release https://amd64.ocp.releases.ci.openshift.org/releasestream/4.8.0-0.ci/release/4.8.0-0.ci-2021-05-04-055526 (which contains the csi-external-provisioner rebase https://bugzilla.redhat.com/show_bug.cgi?id=1924439, the likely culprit), the following tests have been failing in AWS serial jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1389458816164171776
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1389471478365294592

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] 1m10s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:493]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "promQL query returned unexpected results:\nALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1\n[\n  {\n    \"metric\": {\n      \"__name__\": \"ALERTS\",\n      \"alertname\": \"TargetDown\",\n      \"alertstate\": \"firing\",\n      \"job\": \"aws-ebs-csi-driver-controller-metrics\",\n      \"namespace\": \"openshift-cluster-csi-drivers\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"service\": \"aws-ebs-csi-driver-controller-metrics\",\n      \"severity\": \"warning\"\n    },\n    \"value\": [\n      1620123216.319,\n      \"1\"\n    ]\n  }\n]",
        },
    ]
    promQL query returned unexpected results:
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "aws-ebs-csi-driver-controller-metrics",
          "namespace": "openshift-cluster-csi-drivers",
          "prometheus": "openshift-monitoring/k8s",
          "service": "aws-ebs-csi-driver-controller-metrics",
          "severity": "warning"
        },
        "value": [
          1620123216.319,
          "1"
        ]
      }
    ]


[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] 20s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: May  4 11:10:18.079: Unexpected alerts fired or pending after the test run:

alert TargetDown fired for 3461 seconds with labels: {job="aws-ebs-csi-driver-controller-metrics", namespace="openshift-cluster-csi-drivers", service="aws-ebs-csi-driver-controller-metrics", severity="warning"}

Comment 1 Petr Muller 2021-05-04 11:38:16 UTC

Setting severity to high; this blocks CI release acceptance.

Comment 2 Jan Safranek 2021-05-04 16:22:46 UTC

It is indeed related to the rebase. The provisioner returns HTTP 500 to a metrics request:

sh-4.4# curl -v localhost:8202/metrics
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 8202 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8202 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8202
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 04 May 2021 16:20:47 GMT
< Content-Length: 243
< 
An error has occurred while serving metrics:

gathered metric family process_start_time_seconds has help "[ALPHA] Start time of the process since unix epoch in seconds." but should have "Start time of the process since unix epoch in seconds."

Comment 3 Jan Safranek 2021-05-04 17:14:35 UTC

I am unable to find the root cause. The same container works locally; it breaks only in a real OCP cluster.

Comment 4 Jan Safranek 2021-05-04 18:03:45 UTC

It is related to CSI migration: the provisioner behaves differently for migratable CSI drivers and registers the wrong metrics. This registration should include the "metrics.WithProcessStartTime(false)" option to prevent double registration of the process start time metric:

https://github.com/kubernetes-csi/external-provisioner/blob/5d1b62aaa38b309e2c845e97efdc24c944fb66d8/cmd/csi-provisioner/csi-provisioner.go#L225
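
To illustrate why a double registration ends in a 500, here is a simplified Go model of the consistency check that produces the error in comment 2. It is not the Prometheus client code: the family type and mergeFamilies are hypothetical, and which registration carries the "[ALPHA]" prefix in the provisioner is an assumption made only to reproduce the message.

```go
package main

import "fmt"

// family is a simplified stand-in for a gathered metric family.
type family struct {
	Name string
	Help string
}

// mergeFamilies combines families from several sources. It fails when the
// same metric name shows up again with a different help text.
func mergeFamilies(sources ...[]family) ([]family, error) {
	helpByName := map[string]string{}
	var merged []family
	for _, src := range sources {
		for _, f := range src {
			help, seen := helpByName[f.Name]
			if !seen {
				helpByName[f.Name] = f.Help
				merged = append(merged, f)
				continue
			}
			if help != f.Help {
				return nil, fmt.Errorf(
					"gathered metric family %s has help %q but should have %q",
					f.Name, f.Help, help)
			}
		}
	}
	return merged, nil
}

func main() {
	// Two sources that both expose the process start time metric
	// (assumption: the second one adds the "[ALPHA]" stability prefix).
	plain := []family{{
		Name: "process_start_time_seconds",
		Help: "Start time of the process since unix epoch in seconds.",
	}}
	alpha := []family{{
		Name: "process_start_time_seconds",
		Help: "[ALPHA] Start time of the process since unix epoch in seconds.",
	}}

	_, err := mergeFamilies(plain, alpha)
	fmt.Println(err)
}
```

With only one of the two sources registering process_start_time_seconds, which is the intent of WithProcessStartTime(false), the merge has nothing to conflict on and succeeds.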

Comment 6 Qin Ping 2021-05-07 01:41:27 UTC

Verified with: 4.8.0-0.nightly-2021-05-06-210840

Comment 9 errata-xmlrpc 2021-07-27 23:06:09 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438