Bug 2033057 - shared-resource-csi-driver-node-metrics TargetDown alert is firing and resulting in numerous TechPreview jobs failing
Summary: shared-resource-csi-driver-node-metrics TargetDown alert is firing and resulting in numerous TechPreview jobs failing
Keywords:
Status: CLOSED DUPLICATE of bug 2030364
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Gabe Montero
QA Contact: Priti Kumar
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-12-15 19:10 UTC by rvanderp
Modified: 2021-12-16 22:48 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-16 22:48:38 UTC
Target Upstream Version:
Embargoed:



Description rvanderp 2021-12-15 19:10:11 UTC
Description of problem:

TargetDown alert is firing against job=shared-resource-csi-driver-node-metrics

fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Dec 15 07:15:12.856: Unexpected alerts fired or pending after the test run:

alert TargetDown fired for 1304 seconds with labels: {job="shared-resource-csi-driver-node-metrics", namespace="openshift-cluster-csi-drivers", service="shared-resource-csi-driver-node-metrics", severity="warning"}

As an example, I checked the must-gather from this job [https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1470999572703612928] and couldn't find an obvious reason for the TargetDown alert; this may have been a temporary condition.

This alert is causing a large percentage of runs of some jobs to fail.
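
For triage on a live cluster, a quick way to see which scrape targets are behind the alert is to ask Prometheus for them directly. A minimal sketch (it assumes the usual openshift-monitoring pod/container names and that curl is available in the prometheus container):

$ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
    curl -s http://localhost:9090/api/v1/query \
    --data-urlencode 'query=up{job="shared-resource-csi-driver-node-metrics"} == 0'   # any series returned is a target Prometheus cannot scrape

Targets stuck at up == 0 are what ultimately drive TargetDown.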

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Consistently in 4.10 TechPreview CI jobs (see the search link under Additional info).

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
https://search.ci.openshift.org/?search=shared-resource-csi-driver-node-metrics&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 W. Trevor King 2021-12-16 08:14:27 UTC
Recording current stats behind "a large percentage of some jobs":

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=alert+TargetDown+fired.*shared-resource-csi-driver-node-metrics&maxAge=48h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-ocp-e2e-aws-arm64-techpreview-serial (all) - 5 runs, 40% failed, 100% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview-serial (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-techpreview (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-techpreview-serial (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview (all) - 9 runs, 89% failed, 88% of failures match = 78% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview-serial (all) - 9 runs, 100% failed, 78% of failures match = 78% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-techpreview-parallel (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-techpreview-serial (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-cluster-capi-operator-main-e2e-aws-capi-techpreview (all) - 19 runs, 89% failed, 6% of failures match = 5% impact
rehearse-20433-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-20433-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-24537-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
rehearse-24537-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial (all) - 4 runs, 25% failed, 100% of failures match = 25% impact

So these are all 4.10 and all TechPreview.  I'd guess there is a ServiceMonitor, but some issue (missing target pods, broken auth, something) keeps the scrape attempt from succeeding in TechPreview clusters.
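
One way to sanity-check that guess on a live TechPreview cluster is to look at the ServiceMonitor and the Endpoints it should resolve to (a sketch; the resource names are taken from the alert labels and from the must-gather later in this bug):

$ oc -n openshift-cluster-csi-drivers get servicemonitor shared-resource-csi-driver-node-monitor -o yaml   # does it exist, and which port/scheme does it scrape?
$ oc -n openshift-cluster-csi-drivers get endpoints shared-resource-csi-driver-node-metrics -o yaml        # does the metrics Service have ready addresses behind it?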

Comment 2 W. Trevor King 2021-12-16 08:21:09 UTC
[1] seems to have regressed around the 11th, with [2]:

  : [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
  Run #0: Failed	1m3s
  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:791]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "promQL query returned unexpected results:\nALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|TechPreviewNoUpgrade|ClusterNotUpgradeable\",alertstate=\"firing\",severity!=\"info\"} >= 1\n[\n  {\n    \"metric\": {\n      \"__name__\": \"ALERTS\",\n      \"alertname\": \"TargetDown\",\n      \"alertstate\": \"firing\",\n      \"job\": \"shared-resource-csi-driver-node-metrics\",\n      \"namespace\": \"openshift-cluster-csi-drivers\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"service\": \"shared-resource-csi-driver-node-metrics\",\n      \"severity\": \"warning\"\n    },\n    \"value\": [\n      1639241135.401,\n      \"1\"\n    ]\n  }\n]",
        },
    ]
    promQL query returned unexpected results:
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|TechPreviewNoUpgrade|ClusterNotUpgradeable",alertstate="firing",severity!="info"} >= 1
    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "shared-resource-csi-driver-node-metrics",
          "namespace": "openshift-cluster-csi-drivers",
          "prometheus": "openshift-monitoring/k8s",
          "service": "shared-resource-csi-driver-node-metrics",
          "severity": "warning"
        },
        "value": [
          1639241135.401,
          "1"
        ]
      }
    ]

and [3]:

  : [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
  Run #0: Failed	17s
  fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Dec 11 11:24:03.437: Unexpected alerts fired or pending after the test run:

alert TargetDown fired for 1368 seconds with labels: {job="shared-resource-csi-driver-node-metrics", namespace="openshift-cluster-csi-drivers", service="shared-resource-csi-driver-node-metrics", severity="warning"}

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1469699317257211904
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1469606244649013248

Comment 3 W. Trevor King 2021-12-16 09:29:07 UTC
Comparing that transition around the 11th with [1], I suspect [2] may have introduced this issue.  Poking around in the must-gather from the comment 0 job:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1470999572703612928/artifacts/e2e-vsphere-techpreview/gather-must-gather/artifacts/must-gather.tar | tar xz --strip-components=1
$ grep -r shared-resource-csi-driver-node-metrics namespaces/openshift-cluster-csi-drivers | tail -n1
namespaces/openshift-cluster-csi-drivers/pods/shared-resource-csi-driver-operator-58b9547d79-lb6g8/shared-resource-csi-driver-operator/shared-resource-csi-driver-operator/logs/current.log:2021-12-15T06:35:28.938987478Z E1215 06:35:28.938945       1 base_controller.go:272] StaticResourceController reconciliation failed: ["csidriver.yaml" (string): Get "https://172.30.0.1:443/apis/storage.k8s.io/v1/csidrivers/csi.sharedresource.openshift.io": dial tcp 172.30.0.1:443: connect: connection refused, "node_sa.yaml" (string): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-cluster-csi-drivers/serviceaccounts/csi-driver-shared-resource-plugin": dial tcp 172.30.0.1:443: connect: connection refused, "service.yaml" (string): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-cluster-csi-drivers/services/shared-resource-csi-driver-node": dial tcp 172.30.0.1:443: connect: connection refused, "metrics_service.yaml" (string): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-cluster-csi-drivers/services/shared-resource-csi-driver-node-metrics": dial tcp 172.30.0.1:443: connect: connection refused, "servicemonitor.yaml" (string): Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-csi-drivers/servicemonitors/shared-resource-csi-driver-node-monitor": dial tcp 172.30.0.1:443: connect: connection refused, "rbac/privileged_role.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterroles/shared-resource-privileged-role": dial tcp 172.30.0.1:443: connect: connection refused, "rbac/node_role.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterroles/shared-resource-secret-configmap-share-watch-sar-create": dial tcp 172.30.0.1:443: connect: connection refused, "rbac/node_privileged_binding.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/shared-resource-node-privileged-binding": dial tcp 172.30.0.1:443: connect: connection refused, "rbac/node_binding.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/shared-resource-secret-configmap-share-watch-sar-create": dial tcp 172.30.0.1:443: connect: connection refused, "rbac/prometheus_role.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-cluster-csi-drivers/roles/shared-resource-prometheus": dial tcp 172.30.0.1:443: connect: connection refused, "rbac/prometheus_rolebinding.yaml" (string): Get "https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-cluster-csi-drivers/rolebindings/shared-resource-prometheus": dial tcp 172.30.0.1:443: connect: connection refused, Put "https://172.30.0.1:443/apis/operator.openshift.io/v1/clustercsidrivers/csi.sharedresource.openshift.io/status": dial tcp 172.30.0.1:443: connect: connection refused]

That doesn't look good.  Also walking the ServiceMonitor chain:

$ yaml2json <namespaces/openshift-cluster-csi-drivers/monitoring.coreos.com/servicemonitors/shared-resource-csi-driver-node-monitor.yaml | jq -r '.metadata.creationTimestamp + "\n" + (.spec.selector | tostring)'
2021-12-15T06:31:01Z
{"matchLabels":{"app":"shared-resource-csi-driver-node-metrics"}}
$ yaml2json <namespaces/openshift-cluster-csi-drivers/core/services.yaml | jq -r '.items[] | select(.metadata.labels.app == "shared-resource-csi-driver-node-metrics") | .metadata.creationTimestamp + "\n" + (.spec.selector | tostring)' 
2021-12-15T06:31:01Z
{"app":"shared-resource-csi-driver-node"}
$ yaml2json <namespaces/openshift-cluster-csi-drivers/core/pods.yaml | jq -r '.items[] | select(.metadata.labels.app == "shared-resource-csi-driver-node").metadata | .name + " " + .creationTimestamp'
shared-resource-csi-driver-node-48t89 2021-12-15T06:37:11Z
shared-resource-csi-driver-node-vdwr8 2021-12-15T06:37:24Z
shared-resource-csi-driver-node-ztqzb 2021-12-15T06:37:31Z

Hmm, so there's a bit of a window between the 6:31 Service(Monitor) creation and the 6:37 Pod creation.  Checking the events:

  $ yaml2json <namespaces/openshift-cluster-csi-drivers/core/events.yaml | jq -r '.items[] | select(.involvedObject.name | startswith("shared-resource-csi-driver-node-")) | .metadata.creationTimestamp + " " + .reason + ": " + .message'
  ...nothing before 2021-12-15T06:37:11Z...

confirms that those are the original pods.  It feels like the must-gather should include something fairly straightforward about why scrapes are failing, but if it's in there, I'm not finding it.

[1]: https://github.com/openshift/csi-driver-shared-resource-operator/commits/release-4.10
[2]: https://github.com/openshift/csi-driver-shared-resource-operator/pull/37
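
For completeness, the Service-to-Pod half of that chain can also be checked from the same must-gather, assuming it includes the namespace's Endpoints objects in the usual layout (a sketch):

$ yaml2json <namespaces/openshift-cluster-csi-drivers/core/endpoints.yaml | jq -r '.items[] | select(.metadata.name == "shared-resource-csi-driver-node-metrics") | .subsets'

If that shows ready addresses, Prometheus had something to scrape, which points at the scrape itself failing rather than at missing pods.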

Comment 4 W. Trevor King 2021-12-16 09:58:37 UTC
Simon found the error messages in gather-extra:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1470999572703612928/artifacts/e2e-vsphere-techpreview/gather-extra/artifacts/metrics/prometheus-targets.json | jq '.data.activeTargets[] | select(.health == "down") | {scrapePool, lastError}'
{
  "scrapePool": "serviceMonitor/openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor/0",
  "lastError": "Get \"https://10.128.2.5:6000/metrics\": http: server gave HTTP response to HTTPS client"
}
{
  "scrapePool": "serviceMonitor/openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor/0",
  "lastError": "Get \"https://10.129.2.2:6000/metrics\": http: server gave HTTP response to HTTPS client"
}
{
  "scrapePool": "serviceMonitor/openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor/0",
  "lastError": "Get \"https://10.131.0.15:6000/metrics\": http: server gave HTTP response to HTTPS client"
}

So [1] looks plausible as part of this issue (and as comment 3 pointed out, the timing for #37 looks right for this regression).  Still not clear to me why #37's ListenAndServeTLS doesn't seem to be kicking in, or why the alerts are restricted to TechPreview jobs.

[1]: https://github.com/openshift/csi-driver-shared-resource-operator/pull/37/files#diff-c680ad5533affa111b33d8f4f252b642af5391109065b19fbb9a1c0069a77251L43-R48
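
One direct way to confirm the plain-HTTP hypothesis on a live cluster is to hit one of those endpoints both ways from a node (a sketch; the IP and port come from the scrape errors above, <any-worker> is a placeholder, and curl is assumed to be available on an RHCOS debug host):

$ oc debug node/<any-worker> -- chroot /host curl -sk https://10.128.2.5:6000/metrics   # expected to fail if the server only speaks plain HTTP
$ oc debug node/<any-worker> -- chroot /host curl -s http://10.128.2.5:6000/metrics     # expected to succeed, matching "server gave HTTP response to HTTPS client"

As for the TechPreview restriction: the Shared Resource CSI driver is only deployed when the TechPreviewNoUpgrade feature set is enabled, so non-TechPreview jobs never have this scrape target in the first place.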

Comment 5 Gabe Montero 2021-12-16 17:01:38 UTC
Adam - after Alice's https://github.com/openshift/cluster-storage-operator/pull/243 merged, this alert seems to have cleared up for https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview

As evidenced by https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1471136844929306624.

That said, after viewing https://search.ci.openshift.org/?search=shared-resource-csi-driver-node-metrics&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job just now, I'm seeing some recent failures since her PR merged.


Looking at the must-gather from one of those recent runs, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-techpreview-parallel/1471422849846611968, the shared resource operator pod is missing the changes from https://github.com/openshift/cluster-storage-operator/pull/243.
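
A rough way to check which images a given run actually used is to pull them out of its must-gather (a sketch; whether a particular PR made it into a given image still has to be cross-checked against the release payload):

$ yaml2json <namespaces/openshift-cluster-csi-drivers/core/pods.yaml | jq -r '.items[] | select(.metadata.name | startswith("shared-resource-csi-driver-operator")) | .spec.containers[].image'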

So I think we can eventually close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2030364, but we should probably wait until https://search.ci.openshift.org/?search=shared-resource-csi-driver-node-metrics&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job shows some of those other periodics picking up those changes and passing.

Comment 6 Gabe Montero 2021-12-16 18:26:55 UTC
I'll take this one and monitor the periodics for the remainder of this week.

If we still see the alerts even when the https://github.com/openshift/cluster-storage-operator/pull/243 changes are present on the SRO, I may send this over to Adam to address while Alice and I are on PTO.

Comment 7 Gabe Montero 2021-12-16 22:48:38 UTC
OK, the latest run of rehearse-20841-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial, at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/20841/rehearse-20841-periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial/1471575530078736384, no longer has this alert, nor does our metric show up in any of the flaky items.

That run was from 47 minutes before this comment.

I'm calling that sufficient validation that the https://bugzilla.redhat.com/show_bug.cgi?id=2030364 changes fully address this across the various periodics, once they run on a payload that contains those changes.

*** This bug has been marked as a duplicate of bug 2030364 ***

