Description of problem:

The TargetDown alert is firing against job=shared-resource-csi-driver-node-metrics:

  fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Dec 15 07:15:12.856: Unexpected alerts fired or pending after the test run:

  alert TargetDown fired for 1304 seconds with labels: {job="shared-resource-csi-driver-node-metrics", namespace="openshift-cluster-csi-drivers", service="shared-resource-csi-driver-node-metrics", severity="warning"}

As an example, I checked out the must-gather from job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1470999572703612928 and couldn't find an obvious reason for the TargetDown; this may have been a temporary condition. The occurrence of this alert is causing a large percentage of some jobs to fail.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Consistently in CI.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

https://search.ci.openshift.org/?search=shared-resource-csi-driver-node-metrics&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Recording current stats behind "a large percentage of some jobs":

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+TargetDown+fired.*shared-resource-csi-driver-node-metrics&maxAge=48h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview-serial (all) - 5 runs, 40% failed, 100% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview-serial (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-techpreview (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-techpreview-serial (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview (all) - 9 runs, 89% failed, 88% of failures match = 78% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview-serial (all) - 9 runs, 100% failed, 78% of failures match = 78% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-techpreview-parallel (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-techpreview-serial (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-cluster-capi-operator-main-e2e-aws-capi-techpreview (all) - 19 runs, 89% failed, 6% of failures match = 5% impact
rehearse-20433-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-20433-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-24537-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
rehearse-24537-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial (all) - 4 runs, 25% failed, 100% of failures match = 25% impact

So: all 4.10, and all TechPreview. I'd guess there is a ServiceMonitor, but some issue (missing target pods, busted auth, something) keeps the scrape attempt from succeeding in TechPreview clusters.
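As a side note on reading that table, the "impact" column appears to be simply (share of runs that failed) times (share of those failures matching the search), rounded to integer percent. A quick sketch, using the vsphere-techpreview row's numbers:

```shell
# Impact = failed% x match%, e.g. the vsphere-techpreview row above:
# 89% of runs failed, and 88% of those failures match the search.
failed=89
match=88
impact=$(( failed * match / 100 ))   # integer math, matching the rounded output
echo "${impact}% impact"
# prints: 78% impact
```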
[1] seems to have regressed around the 11th, with [2]:

: [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Run #0: Failed 1m3s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:791]: Unexpected error:
<errors.aggregate | len:1, cap:1>
promQL query returned unexpected results:
ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|TechPreviewNoUpgrade|ClusterNotUpgradeable",alertstate="firing",severity!="info"} >= 1
[
  {
    "metric": {
      "__name__": "ALERTS",
      "alertname": "TargetDown",
      "alertstate": "firing",
      "job": "shared-resource-csi-driver-node-metrics",
      "namespace": "openshift-cluster-csi-drivers",
      "prometheus": "openshift-monitoring/k8s",
      "service": "shared-resource-csi-driver-node-metrics",
      "severity": "warning"
    },
    "value": [
      1639241135.401,
      "1"
    ]
  }
]

and [3]:

: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Run #0: Failed 17s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Dec 11 11:24:03.437: Unexpected alerts fired or pending after the test run:
alert TargetDown fired for 1368 seconds with labels: {job="shared-resource-csi-driver-node-metrics", namespace="openshift-cluster-csi-drivers", service="shared-resource-csi-driver-node-metrics", severity="warning"}

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1469699317257211904
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1469606244649013248
Comparing that transition around the 11th with [1], I suspect [2] may have introduced this issue. Poking around in the must-gather from the comment 0 job:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1470999572703612928/artifacts/e2e-vsphere-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xz --strip-components=1
$ grep -r shared-resource-csi-driver-node-metrics namespaces/openshift-cluster-csi-drivers | tail -n1
namespaces/openshift-cluster-csi-drivers/pods/shared-resource-csi-driver-operator-58b9547d79-lb6g8/shared-resource-csi-driver-operator/shared-resource-csi-driver-operator/logs/current.log:2021-12-15T06:35:28.938987478Z E1215 06:35:28.938945 1 base_controller.go:272] StaticResourceController reconciliation failed: [
  "csidriver.yaml" (string): Get "https://172.30.0.1:443/apis/storage.k8s.io/v1/csidrivers/csi.sharedresource.openshift.io": dial tcp 172.30.0.1:443: connect: connection refused,
  "node_sa.yaml" (string): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-cluster-csi-drivers/serviceaccounts/csi-driver-shared-resource-plugin": dial tcp 172.30.0.1:443: connect: connection refused,
  "service.yaml" (string): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-cluster-csi-drivers/services/shared-resource-csi-driver-node": dial tcp 172.30.0.1:443: connect: connection refused,
  "metrics_service.yaml" (string): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-cluster-csi-drivers/services/shared-resource-csi-driver-node-metrics": dial tcp 172.30.0.1:443: connect: connection refused,
  "servicemonitor.yaml" (string): Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-csi-drivers/servicemonitors/shared-resource-csi-driver-node-monitor": dial tcp 172.30.0.1:443: connect: connection refused,
  "rbac/privileged_role.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterroles/shared-resource-privileged-role": dial tcp 172.30.0.1:443: connect: connection refused,
  "rbac/node_role.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterroles/shared-resource-secret-configmap-share-watch-sar-create": dial tcp 172.30.0.1:443: connect: connection refused,
  "rbac/node_privileged_binding.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/shared-resource-node-privileged-binding": dial tcp 172.30.0.1:443: connect: connection refused,
  "rbac/node_binding.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/shared-resource-secret-configmap-share-watch-sar-create": dial tcp 172.30.0.1:443: connect: connection refused,
  "rbac/prometheus_role.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-cluster-csi-drivers/roles/shared-resource-prometheus": dial tcp 172.30.0.1:443: connect: connection refused,
  "rbac/prometheus_rolebinding.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-cluster-csi-drivers/rolebindings/shared-resource-prometheus": dial tcp 172.30.0.1:443: connect: connection refused,
  Put "https://172.30.0.1:443/apis/operator.openshift.io/v1/clustercsidrivers/csi.sharedresource.openshift.io/status": dial tcp 172.30.0.1:443: connect: connection refused]

That doesn't look good.
Also walking the ServiceMonitor chain:

$ yaml2json <namespaces/openshift-cluster-csi-drivers/monitoring.coreos.com/servicemonitors/shared-resource-csi-driver-node-monitor.yaml | jq -r '.metadata.creationTimestamp + "\n" + (.spec.selector | tostring)'
2021-12-15T06:31:01Z
{"matchLabels":{"app":"shared-resource-csi-driver-node-metrics"}}
$ yaml2json <namespaces/openshift-cluster-csi-drivers/core/services.yaml | jq -r '.items[] | select(.metadata.labels.app == "shared-resource-csi-driver-node-metrics") | .metadata.creationTimestamp + "\n" + (.spec.selector | tostring)'
2021-12-15T06:31:01Z
{"app":"shared-resource-csi-driver-node"}
$ yaml2json <namespaces/openshift-cluster-csi-drivers/core/pods.yaml | jq -r '.items[] | select(.metadata.labels.app == "shared-resource-csi-driver-node").metadata | .name + " " + .creationTimestamp'
shared-resource-csi-driver-node-48t89 2021-12-15T06:37:11Z
shared-resource-csi-driver-node-vdwr8 2021-12-15T06:37:24Z
shared-resource-csi-driver-node-ztqzb 2021-12-15T06:37:31Z

Hmm, so there's a bit of a window between the 6:31 Service(Monitor) creation and the 6:37 Pod creation. Checking:

$ yaml2json <namespaces/openshift-cluster-csi-drivers/core/events.yaml | jq -r '.items[] | select(.involvedObject.name | startswith("shared-resource-csi-driver-node-")) | .metadata.creationTimestamp + " " + .reason + ": " + .message'
...nothing before 2021-12-15T06:37:11Z...

confirms that those are the original pods. I dunno. Feels like must-gather should include something fairly straightforward about why scrapes are failing, but if it's in there, I'm not finding it.

[1]: https://github.com/openshift/csi-driver-shared-resource-operator/commits/release-4.10
[2]: https://github.com/openshift/csi-driver-shared-resource-operator/pull/37
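The chain walk above boils down to two subset checks: the ServiceMonitor's matchLabels must be a subset of the Service's labels, and the Service's selector must be a subset of the Pods' labels. A small standalone sketch of that test (the label values from this must-gather are inlined here, so it runs without the artifacts or yaml2json; the `subset` helper is mine, not part of any tooling):

```shell
# Hypothetical helper: is every key/value in the selector present in the labels?
subset() {  # usage: subset '<selector json>' '<labels json>'
  jq -n --argjson want "$1" --argjson have "$2" \
    '$want | to_entries | all(.value == $have[.key])'
}

# ServiceMonitor.spec.selector.matchLabels -> Service.metadata.labels
subset '{"app":"shared-resource-csi-driver-node-metrics"}' \
       '{"app":"shared-resource-csi-driver-node-metrics"}'
# Service.spec.selector -> Pod labels
subset '{"app":"shared-resource-csi-driver-node"}' \
       '{"app":"shared-resource-csi-driver-node"}'
# both print: true
```

Both checks pass here, which is consistent with the conclusion above that label selection isn't the problem, only the 6-minute window before the pods existed.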
Simon found the error messages in gather-extra:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1470999572703612928/artifacts/e2e-vsphere-techpreview/gather-extra/artifacts/metrics/prometheus-targets.json | jq '.data.activeTargets[] | select(.health == "down") | {scrapePool, lastError}'
{
  "scrapePool": "serviceMonitor/openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor/0",
  "lastError": "Get \"https://10.128.2.5:6000/metrics\": http: server gave HTTP response to HTTPS client"
}
{
  "scrapePool": "serviceMonitor/openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor/0",
  "lastError": "Get \"https://10.129.2.2:6000/metrics\": http: server gave HTTP response to HTTPS client"
}
{
  "scrapePool": "serviceMonitor/openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor/0",
  "lastError": "Get \"https://10.131.0.15:6000/metrics\": http: server gave HTTP response to HTTPS client"
}

So [1] looks plausible as part of this issue (and, as comment 3 pointed out, the timing for #37 looks right for this regression). Still not clear to me why #37's ListenAndServeTLS doesn't seem to be kicking in, or why the alerts are restricted to TechPreview jobs.

[1]: https://github.com/openshift/csi-driver-shared-resource-operator/pull/37/files#diff-c680ad5533affa111b33d8f4f252b642af5391109065b19fbb9a1c0069a77251L43-R48
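That lastError means the driver pods answered plain HTTP on :6000 while Prometheus scraped HTTPS. For anyone re-checking later runs, the same jq filter works against any saved copy of that gather-extra artifact; here is a runnable stand-in with one trimmed-down target of each health state (the file path and the "up" target are illustrative, not from the job):

```shell
# Trimmed stand-in for gather-extra's prometheus-targets.json, to show the
# filter's behavior without fetching the real artifact.
cat >/tmp/prometheus-targets.json <<'EOF'
{"data":{"activeTargets":[
  {"health":"up",
   "scrapePool":"serviceMonitor/openshift-monitoring/kubelet/0",
   "lastError":""},
  {"health":"down",
   "scrapePool":"serviceMonitor/openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor/0",
   "lastError":"Get \"https://10.128.2.5:6000/metrics\": http: server gave HTTP response to HTTPS client"}
]}}
EOF

# Same query as above: only the down targets, with their last scrape error.
jq '.data.activeTargets[] | select(.health == "down") | {scrapePool, lastError}' \
  /tmp/prometheus-targets.json
```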
Adam - so after Alice's https://github.com/openshift/cluster-storage-operator/pull/243 merged, this alert seems to have cleared up for https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview, as evidenced by https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1471136844929306624

That said, after viewing https://search.ci.openshift.org/?search=shared-resource-csi-driver-node-metrics&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job just now, I'm seeing some recent failures since her PR merged. Looking at the must-gather on one of those recent ones, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-techpreview-parallel/1471422849846611968, the shared resource operator pod is missing the changes from https://github.com/openshift/cluster-storage-operator/pull/243

So I think we can eventually close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2030364, but we should probably wait until that same search shows some of those other periodics pick up those changes and pass.
I'll take this one and monitor the periodics for the remainder of this week. If we see the alerts even when the https://github.com/openshift/cluster-storage-operator/pull/243 changes are present on the SRO, I may send this over to Adam to address while Alice and I are on PTO.
OK, the latest run of rehearse-20841-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/20841/rehearse-20841-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial/1471575530078736384 no longer has this alert, nor does our metric show up in any of the flaky items. That run was from 47 minutes ago as of this comment. I'm calling that sufficient validation that the https://bugzilla.redhat.com/show_bug.cgi?id=2030364 changes fully address this across the various periodics, once they run on a level that contains those changes.

*** This bug has been marked as a duplicate of bug 2030364 ***