Bug 1926943

Summary: vsphere-problem-detector: Alerts in CI jobs
Product: OpenShift Container Platform
Component: Storage
Sub Component: Operators
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Jan Safranek <jsafrane>
Assignee: Jan Safranek <jsafrane>
QA Contact: Qin Ping <piqin>
Docs Contact:
CC: adeshpan, agudi, aos-bugs, pmagotra, ssadhale, ssonigra
Target Milestone: ---
Target Release: 4.8.0
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 2092811, 2092814
Environment:
Last Closed: 2021-07-27 22:43:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2092811, 2092814

Description Jan Safranek 2021-02-09 17:51:12 UTC
VMware CI jobs fail with:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel] 

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:399]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "query failed: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"VSphereOpenshiftClusterHealthFail\",\"alertstate\":\"firing\",\"check\":\"CheckDefaultDatastore\",\"container\":\"vsphere-problem-detector-operator\",\"endpoint\":\"vsphere-metrics\",\"instance\":\"10.129.0.7:8444\",\"job\":\"vsphere-problem-detector-metrics\",\"namespace\":\"openshift-cluster-storage-operator\",\"pod\":\"vsphere-problem-detector-operator-7b59444849-rf47f\",\"service\":\"vsphere-problem-detector-metrics\",\"severity\":\"warning\"},\"value\":[1612885400.871,\"1\"]}]",
        },
    ]
    query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results:
    [{"metric":{"__name__":"ALERTS","alertname":"VSphereOpenshiftClusterHealthFail","alertstate":"firing","check":"CheckDefaultDatastore","container":"vsphere-problem-detector-operator","endpoint":"vsphere-metrics","instance":"10.129.0.7:8444","job":"vsphere-problem-detector-metrics","namespace":"openshift-cluster-storage-operator","pod":"vsphere-problem-detector-operator-7b59444849-rf47f","service":"vsphere-problem-detector-metrics","severity":"warning"},"value":[1612885400.871,"1"]}]
occurred
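
For reference, a rough sketch (not the origin test code; PROM_URL and PROM_TOKEN are hypothetical environment variables pointing at a reachable Prometheus endpoint and a bearer token allowed to query it) of running the same ALERTS query that the conformance test uses:

// A rough sketch, not the origin test code: PROM_URL and PROM_TOKEN are
// hypothetical environment variables pointing at a reachable Prometheus
// endpoint and a bearer token with permission to query it.
package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "os"
)

func main() {
    // Same query the conformance test runs: any firing, non-info alert other
    // than the whitelisted ones fails the test.
    query := `ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1`

    u := os.Getenv("PROM_URL") + "/api/v1/query?query=" + url.QueryEscape(query)
    req, err := http.NewRequest(http.MethodGet, u, nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Authorization", "Bearer "+os.Getenv("PROM_TOKEN"))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // a non-empty result set means an unexpected alert is firing
}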

The reason is a failing CheckDefaultDatastore check: the cluster ID is so long that the escaped volume path exceeds the 255-character limit:

2021-02-09T15:10:52.571208419Z I0209 15:10:52.571182       1 operator.go:208] CheckDefaultDatastore failed: defaultDatastore "WorkloadDatastore" in vSphere configuration: datastore WorkloadDatastore: datastore name is too long: escaped volume path "-var-lib-kubelet-plugins-kubernetes.io-vsphere\\x2dvolume-mounts-\\x5bWorkloadDatastore\\x5d\\x2000000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000-ci\\x2dop\\x2dcvx98rqr\\x2dd1161\\x2d28xzd\\x2ddynamic\\x2dpvc\\x2d00000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000.vmdk" must be under 255 characters, got 255
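
For illustration, a minimal Go sketch of why the escaped path gets so long (the path layout is reconstructed from the log line above, the escape function only approximates systemd-escape, and none of this is the actual vsphere-problem-detector code):

// A rough sketch, assuming the path layout from the log line above.
package main

import (
    "fmt"
    "strings"
)

// systemdEscape approximates how systemd escapes a path for a mount unit name:
// "/" becomes "-", and anything that is not an ASCII letter, digit, "." or "_"
// becomes a \xNN hex escape (e.g. "-" -> \x2d, "[" -> \x5b, " " -> \x20).
func systemdEscape(path string) string {
    var b strings.Builder
    for i := 0; i < len(path); i++ {
        c := path[i]
        switch {
        case c == '/':
            b.WriteByte('-')
        case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z',
            c >= '0' && c <= '9', c == '.', c == '_':
            b.WriteByte(c)
        default:
            fmt.Fprintf(&b, `\x%02x`, c)
        }
    }
    return b.String()
}

func main() {
    // Mount path modeled on the failing CI job; the long PVC name comes from
    // the cluster ID ("ci-op-cvx98rqr-d1161-28xzd") plus a UUID.
    path := "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/" +
        "[WorkloadDatastore] 00000000-0000-0000-0000-000000000000/" +
        "ci-op-cvx98rqr-d1161-28xzd-dynamic-pvc-00000000-0000-0000-0000-000000000000.vmdk"

    escaped := systemdEscape(path)
    // The escaped unit name is what must stay under 255 characters; each
    // escaped character costs 4 bytes instead of 1.
    fmt.Printf("escaped length: %d\n", len(escaped))
}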


Sample: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/17/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere/1359151454819979264

Comment 1 Jan Safranek 2021-02-09 18:18:30 UTC
Submitted https://github.com/openshift/vsphere-problem-detector/pull/31 to disable the datastore checks.

To fix this issue properly, we need to shorten the volume names even more in Kubernetes, https://github.com/kubernetes/kubernetes/blob/93d288e2a47fa6d497b50d37c8b3a04e91da4228/pkg/volume/vsphere_volume/vsphere_volume_util.go#L100
Right now the name is cut at 90 characters, but that limit does not account for the extra characters added by systemd escaping. Shortening it to ~63 characters would be better. The following 63-character string escapes to 90 characters (wc -c reports 91 because it also counts the trailing newline):

$ systemd-escape "ci-op-cvx98rqr-dynamic-pvc-00000000-0000-0000-0000-000000000000" | wc -c 
91

Comment 3 Jan Safranek 2021-02-10 08:09:44 UTC
The final solution in this BZ was to shorten the volume name to 63 characters, even before such shortening lands in upstream Kubernetes. This fixes CI, and once the change is released to customers we can gather data from their clusters to confirm whether 63 is the right value.
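
As a sketch of that approach (the function name and constant are illustrative, not the actual code in Kubernetes or the detector), the generated name is simply cut before it is used in the datastore path:

// Illustrative sketch of the 63-character shortening described above.
package main

import "fmt"

// maxVolNameLen mirrors the 63-character limit chosen in this BZ; upstream
// Kubernetes cut at 90 characters at the time, which left no room for the
// characters added by systemd escaping.
const maxVolNameLen = 63

func shortenVolumeName(name string) string {
    if len(name) <= maxVolNameLen {
        return name
    }
    return name[:maxVolNameLen]
}

func main() {
    long := "ci-op-cvx98rqr-d1161-28xzd-dynamic-pvc-00000000-0000-0000-0000-000000000000"
    fmt.Println(shortenVolumeName(long)) // truncated to 63 characters
}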

Comment 11 errata-xmlrpc 2021-07-27 22:43:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438