Since https://amd64.ocp.releases.ci.openshift.org/releasestream/4.8.0-0.ci/release/4.8.0-0.ci-2021-05-04-055526 (which contains the csi-external-provisioner rebase, https://bugzilla.redhat.com/show_bug.cgi?id=1924439, as the likely culprit), the following tests have been failing in AWS serial jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1389458816164171776
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1389471478365294592

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] (1m10s)

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:493]: Unexpected error:
    <errors.aggregate | len:1, cap:1>:
    promQL query returned unexpected results:
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "aws-ebs-csi-driver-controller-metrics",
          "namespace": "openshift-cluster-csi-drivers",
          "prometheus": "openshift-monitoring/k8s",
          "service": "aws-ebs-csi-driver-controller-metrics",
          "severity": "warning"
        },
        "value": [
          1620123216.319,
          "1"
        ]
      }
    ]

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] (20s)

fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: May 4 11:10:18.079: Unexpected alerts fired or pending after the test run:
    alert TargetDown fired for 3461 seconds with labels: {job="aws-ebs-csi-driver-controller-metrics", namespace="openshift-cluster-csi-drivers", service="aws-ebs-csi-driver-controller-metrics", severity="warning"}
Setting to high, since this blocks CI release acceptance.
It is indeed related to the rebase. The provisioner returns 500 to metrics requests:

sh-4.4# curl -v localhost:8202/metrics
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 8202 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8202 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8202
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 04 May 2021 16:20:47 GMT
< Content-Length: 243
<
An error has occurred while serving metrics:

gathered metric family process_start_time_seconds has help "[ALPHA] Start time of the process since unix epoch in seconds." but should have "Start time of the process since unix epoch in seconds."
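For triage, it can help to pull the metric family and the two help strings out of the error body and compare them. The following is a minimal sketch, not part of any tool used in this bug: the helper name parse_help_mismatch and the regular expression are hypothetical and written only against the one message quoted above.

```python
import re

# Error body copied from the curl output in this comment.
ERROR_BODY = (
    "An error has occurred while serving metrics: gathered metric family "
    'process_start_time_seconds has help "[ALPHA] Start time of the process '
    'since unix epoch in seconds." but should have "Start time of the process '
    'since unix epoch in seconds."'
)

# Hypothetical pattern, written against this single message format only.
PATTERN = re.compile(
    r'gathered metric family (?P<family>\S+) has help "(?P<got>.*?)" '
    r'but should have "(?P<want>.*?)"'
)


def parse_help_mismatch(body):
    """Return (family, gathered_help, expected_help), or None if no match."""
    match = PATTERN.search(body)
    if match is None:
        return None
    return match.group("family"), match.group("got"), match.group("want")


family, got, want = parse_help_mismatch(ERROR_BODY)
# The gathered help ends with the expected help, so whatever precedes it
# is the part that differs between the two registrations.
extra_prefix = got[: len(got) - len(want)] if got.endswith(want) else None

print("family:", family)
print("extra prefix in gathered help:", repr(extra_prefix))
```

In this message the two help strings differ only by a leading "[ALPHA] " marker, which suggests the same metric name is being described by two different registrations rather than by one corrupted one.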
I am unable to find the root cause. The same container works locally; it breaks only in a real OCP cluster.
It's related to CSI migration: the provisioner behaves differently for migratable CSI drivers and registers the wrong metrics. The registration at the link below should include the "metrics.WithProcessStartTime(false)" option to prevent double registration of the process start time metric: https://github.com/kubernetes-csi/external-provisioner/blob/5d1b62aaa38b309e2c845e97efdc24c944fb66d8/cmd/csi-provisioner/csi-provisioner.go#L225
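The failure mode can be illustrated with a toy model. This is a sketch under stated assumptions, not the Prometheus client or the provisioner code: ToyRegistry, build_registry and serve_metrics are hypothetical names, the flag only mimics the effect of "metrics.WithProcessStartTime(false)", and which component emits which help string is assumed for illustration.

```python
class InconsistentHelpError(Exception):
    """Raised when one metric family is gathered with two help strings."""


class ToyRegistry:
    """Toy stand-in for a metrics registry (not the real Prometheus client)."""

    def __init__(self):
        self._collectors = []

    def register(self, name, help_text, value):
        self._collectors.append((name, help_text, value))

    def gather(self):
        families = {}
        for name, help_text, value in self._collectors:
            family = families.setdefault(name, {"help": help_text, "samples": []})
            # Same consistency rule as in the error above: one family, one help.
            if family["help"] != help_text:
                raise InconsistentHelpError(
                    f'gathered metric family {name} has help "{help_text}" '
                    f'but should have "{family["help"]}"'
                )
            family["samples"].append(value)
        return families


def build_registry(register_process_start_time):
    """Model the provisioner's metrics setup for a migratable driver."""
    registry = ToyRegistry()
    # Assumption: some other code path already exposes the start time metric.
    registry.register(
        "process_start_time_seconds",
        "Start time of the process since unix epoch in seconds.",
        1620123216.0,
    )
    # Assumption: the metrics manager adds its own copy unless told not to,
    # which is what the proposed option would switch off.
    if register_process_start_time:
        registry.register(
            "process_start_time_seconds",
            "[ALPHA] Start time of the process since unix epoch in seconds.",
            1620123216.0,
        )
    return registry


def serve_metrics(registry):
    """Return (status, body) the way a /metrics handler might."""
    try:
        families = registry.gather()
    except InconsistentHelpError as err:
        return 500, f"An error has occurred while serving metrics: {err}"
    return 200, "\n".join(sorted(families))


print(serve_metrics(build_registry(register_process_start_time=True))[0])
print(serve_metrics(build_registry(register_process_start_time=False))[0])
```

With the second registration enabled, the toy handler fails the whole scrape, which is consistent with the TargetDown alert for aws-ebs-csi-driver-controller-metrics. With it disabled, the remaining registration is served normally. The model checks only help text and ignores the other duplicate checks a real registry performs.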
Verified with: 4.8.0-0.nightly-2021-05-06-210840
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438