Bug 1926943 - vsphere-problem-detector: Alerts in CI jobs
Summary: vsphere-problem-detector: Alerts in CI jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jan Safranek
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks: 2092811 2092814
Reported: 2021-02-09 17:51 UTC by Jan Safranek
Modified: 2022-06-02 10:11 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2092811 2092814
Environment:
Last Closed: 2021-07-27 22:43:10 UTC
Target Upstream Version:
Embargoed:




Links
System: Github | ID: openshift vsphere-problem-detector pull 30 | Status: Waiting on Customer | Summary: CEPH replication issue | Last Updated: 2022-06-27 14:54:21 UTC
System: Red Hat Product Errata | ID: RHSA-2021:2438 | Last Updated: 2021-07-27 22:43:32 UTC

Description Jan Safranek 2021-02-09 17:51:12 UTC
VMware CI jobs fail with:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel] 

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:399]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "query failed: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"VSphereOpenshiftClusterHealthFail\",\"alertstate\":\"firing\",\"check\":\"CheckDefaultDatastore\",\"container\":\"vsphere-problem-detector-operator\",\"endpoint\":\"vsphere-metrics\",\"instance\":\"10.129.0.7:8444\",\"job\":\"vsphere-problem-detector-metrics\",\"namespace\":\"openshift-cluster-storage-operator\",\"pod\":\"vsphere-problem-detector-operator-7b59444849-rf47f\",\"service\":\"vsphere-problem-detector-metrics\",\"severity\":\"warning\"},\"value\":[1612885400.871,\"1\"]}]",
        },
    ]
    query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results:
    [{"metric":{"__name__":"ALERTS","alertname":"VSphereOpenshiftClusterHealthFail","alertstate":"firing","check":"CheckDefaultDatastore","container":"vsphere-problem-detector-operator","endpoint":"vsphere-metrics","instance":"10.129.0.7:8444","job":"vsphere-problem-detector-metrics","namespace":"openshift-cluster-storage-operator","pod":"vsphere-problem-detector-operator-7b59444849-rf47f","service":"vsphere-problem-detector-metrics","severity":"warning"},"value":[1612885400.871,"1"]}]
occurred

The reason is the failing CheckDefaultDatastore check: the cluster ID is too long, so the escaped volume path exceeds the 255-character limit.

2021-02-09T15:10:52.571208419Z I0209 15:10:52.571182       1 operator.go:208] CheckDefaultDatastore failed: defaultDatastore "WorkloadDatastore" in vSphere configuration: datastore WorkloadDatastore: datastore name is too long: escaped volume path "-var-lib-kubelet-plugins-kubernetes.io-vsphere\\x2dvolume-mounts-\\x5bWorkloadDatastore\\x5d\\x2000000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000-ci\\x2dop\\x2dcvx98rqr\\x2dd1161\\x2d28xzd\\x2ddynamic\\x2dpvc\\x2d00000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000.vmdk" must be under 255 characters, got 255
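
For illustration, here is a rough Go sketch of the length problem the check reports. This is not the detector's actual code: the escaping is a simplified stand-in for systemd path escaping, and the path is assembled from the log line above.

package main

import (
	"fmt"
	"strings"
)

// Simplified stand-in for systemd path escaping: '/' becomes '-',
// anything outside [a-zA-Z0-9_.] becomes \xNN.
func systemdEscape(path string) string {
	var b strings.Builder
	for _, c := range []byte(path) {
		switch {
		case c == '/':
			b.WriteByte('-')
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z',
			c >= '0' && c <= '9', c == '_', c == '.':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, `\x%02x`, c)
		}
	}
	return b.String()
}

func main() {
	// Worst-case mount path taken from the log above: datastore name,
	// placeholder UUIDs and a CI-generated PV name.
	path := "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/" +
		"[WorkloadDatastore] 00000000-0000-0000-0000-000000000000/" +
		"ci-op-cvx98rqr-d1161-28xzd-dynamic-pvc-00000000-0000-0000-0000-000000000000.vmdk"

	escaped := systemdEscape(path)
	fmt.Printf("escaped path length: %d\n", len(escaped)) // 255 for this path, matching the error above
	if len(escaped) >= 255 {
		fmt.Println("escaped volume path must be under 255 characters")
	}
}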


Sample: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/17/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere/1359151454819979264

Comment 1 Jan Safranek 2021-02-09 18:18:30 UTC
Submitted https://github.com/openshift/vsphere-problem-detector/pull/31 to disable the datastore checks.

To fix this issue properly, we need to shorten the volume names even more in Kubernetes: https://github.com/kubernetes/kubernetes/blob/93d288e2a47fa6d497b50d37c8b3a04e91da4228/pkg/volume/vsphere_volume/vsphere_volume_util.go#L100
Right now the name is cut at 90 characters, but that does not account for the characters added by systemd escaping. Shortening it to ~63 characters would be better: the following 63-character string escapes to 90 characters (the extra byte reported by wc -c is the trailing newline).

$ systemd-escape "ci-op-cvx98rqr-dynamic-pvc-00000000-0000-0000-0000-000000000000" | wc -c 
91
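
A minimal sketch of the proposed truncation (package, constant and function names here are illustrative, not the actual Kubernetes or vsphere-problem-detector change):

package volumeutil

// Illustrative only: cut the generated volume name so that its
// systemd-escaped form stays well under the 255-character unit limit.
// 63 is the value suggested above; today the cut happens at 90.
const maxVolumeNameLength = 63 // hypothetical constant name

func truncateVolumeName(name string) string {
	if len(name) > maxVolumeNameLength {
		return name[:maxVolumeNameLength]
	}
	return name
}

With a 63-character name such as the one above, the escaped form is 90 characters, which should bring the full escaped mount path in this CI environment back under the 255-character limit.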

Comment 3 Jan Safranek 2021-02-10 08:09:44 UTC
The final solution in this BZ was to shorten the volume name to 63 characters, even before the same shortening lands in upstream Kubernetes. This fixes CI, and once it is released to customers we can gather data from customer clusters to confirm that 63 is the right value.

Comment 11 errata-xmlrpc 2021-07-27 22:43:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

