Bug 1926943

Summary: vsphere-problem-detector: Alerts in CI jobs
Product: OpenShift Container Platform
Component: Storage
Sub Component: Operators
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Jan Safranek <jsafrane>
Assignee: Jan Safranek <jsafrane>
QA Contact: Qin Ping <piqin>
Docs Contact:
CC: adeshpan, agudi, aos-bugs, pmagotra, ssadhale, ssonigra
Target Milestone: ---
Target Release: 4.8.0
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 2092811, 2092814
Environment:
Last Closed: 2021-07-27 22:43:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2092811, 2092814

Description Jan Safranek 2021-02-09 17:51:12 UTC
VMware CI jobs fail with:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel] 

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:399]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "query failed: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"VSphereOpenshiftClusterHealthFail\",\"alertstate\":\"firing\",\"check\":\"CheckDefaultDatastore\",\"container\":\"vsphere-problem-detector-operator\",\"endpoint\":\"vsphere-metrics\",\"instance\":\"10.129.0.7:8444\",\"job\":\"vsphere-problem-detector-metrics\",\"namespace\":\"openshift-cluster-storage-operator\",\"pod\":\"vsphere-problem-detector-operator-7b59444849-rf47f\",\"service\":\"vsphere-problem-detector-metrics\",\"severity\":\"warning\"},\"value\":[1612885400.871,\"1\"]}]",
        },
    ]
    query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results:
    [{"metric":{"__name__":"ALERTS","alertname":"VSphereOpenshiftClusterHealthFail","alertstate":"firing","check":"CheckDefaultDatastore","container":"vsphere-problem-detector-operator","endpoint":"vsphere-metrics","instance":"10.129.0.7:8444","job":"vsphere-problem-detector-metrics","namespace":"openshift-cluster-storage-operator","pod":"vsphere-problem-detector-operator-7b59444849-rf47f","service":"vsphere-problem-detector-metrics","severity":"warning"},"value":[1612885400.871,"1"]}]
occurred
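
For reference, a rough sketch (not the origin test code; PROM_URL and PROM_TOKEN are hypothetical environment variables pointing at a reachable Prometheus endpoint and a bearer token allowed to query it) of running the same ALERTS query that the conformance test uses:

// A rough sketch, not the origin test code: PROM_URL and PROM_TOKEN are
// hypothetical environment variables pointing at a reachable Prometheus
// endpoint and a bearer token with permission to query it.
package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "os"
)

func main() {
    // Same query the conformance test runs: any firing, non-info alert other
    // than the whitelisted ones fails the test.
    query := `ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1`

    u := os.Getenv("PROM_URL") + "/api/v1/query?query=" + url.QueryEscape(query)
    req, err := http.NewRequest(http.MethodGet, u, nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Authorization", "Bearer "+os.Getenv("PROM_TOKEN"))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // a non-empty result set means an unexpected alert is firing
}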

The reason is a failing CheckDefaultDatastore check: the cluster ID is so long that the escaped volume path exceeds the 255-character limit:

2021-02-09T15:10:52.571208419Z I0209 15:10:52.571182       1 operator.go:208] CheckDefaultDatastore failed: defaultDatastore "WorkloadDatastore" in vSphere configuration: datastore WorkloadDatastore: datastore name is too long: escaped volume path "-var-lib-kubelet-plugins-kubernetes.io-vsphere\\x2dvolume-mounts-\\x5bWorkloadDatastore\\x5d\\x2000000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000-ci\\x2dop\\x2dcvx98rqr\\x2dd1161\\x2d28xzd\\x2ddynamic\\x2dpvc\\x2d00000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000.vmdk" must be under 255 characters, got 255
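
For illustration, a minimal Go sketch of why the escaped path gets so long (the path layout is reconstructed from the log line above, the escape function only approximates systemd-escape, and none of this is the actual vsphere-problem-detector code):

// A rough sketch, assuming the path layout from the log line above.
package main

import (
    "fmt"
    "strings"
)

// systemdEscape approximates how systemd escapes a path for a mount unit name:
// "/" becomes "-", and anything that is not an ASCII letter, digit, "." or "_"
// becomes a \xNN hex escape (e.g. "-" -> \x2d, "[" -> \x5b, " " -> \x20).
func systemdEscape(path string) string {
    var b strings.Builder
    for i := 0; i < len(path); i++ {
        c := path[i]
        switch {
        case c == '/':
            b.WriteByte('-')
        case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z',
            c >= '0' && c <= '9', c == '.', c == '_':
            b.WriteByte(c)
        default:
            fmt.Fprintf(&b, `\x%02x`, c)
        }
    }
    return b.String()
}

func main() {
    // Mount path modeled on the failing CI job; the long PVC name comes from
    // the cluster ID ("ci-op-cvx98rqr-d1161-28xzd") plus a UUID.
    path := "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/" +
        "[WorkloadDatastore] 00000000-0000-0000-0000-000000000000/" +
        "ci-op-cvx98rqr-d1161-28xzd-dynamic-pvc-00000000-0000-0000-0000-000000000000.vmdk"

    escaped := systemdEscape(path)
    // The escaped unit name is what must stay under 255 characters; each
    // escaped character costs 4 bytes instead of 1.
    fmt.Printf("escaped length: %d\n", len(escaped))
}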


Sample: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/17/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere/1359151454819979264

Comment 1 Jan Safranek 2021-02-09 18:18:30 UTC
Submitted https://github.com/openshift/vsphere-problem-detector/pull/31 to disable the datastore checks.

To fix this issue properly, we need to shorten the volume names even more in Kubernetes, https://github.com/kubernetes/kubernetes/blob/93d288e2a47fa6d497b50d37c8b3a04e91da4228/pkg/volume/vsphere_volume/vsphere_volume_util.go#L100
Right now the name is cut at 90 characters, but that limit does not account for the extra characters added by systemd escaping. Shortening it to ~63 characters would be better. The following 63-character string escapes to 90 characters (wc -c reports 91 because it also counts the trailing newline):

$ systemd-escape "ci-op-cvx98rqr-dynamic-pvc-00000000-0000-0000-0000-000000000000" | wc -c 
91

Comment 3 Jan Safranek 2021-02-10 08:09:44 UTC
The final solution in this BZ was to shorten the volume name to 63 characters, even before such shortening lands in upstream Kubernetes. This fixes CI, and once the change is released to customers we can gather data from their clusters to confirm whether 63 is the right value.
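
As a sketch of that approach (the function name and constant are illustrative, not the actual code in Kubernetes or the detector), the generated name is simply cut before it is used in the datastore path:

// Illustrative sketch of the 63-character shortening described above.
package main

import "fmt"

// maxVolNameLen mirrors the 63-character limit chosen in this BZ; upstream
// Kubernetes cut at 90 characters at the time, which left no room for the
// characters added by systemd escaping.
const maxVolNameLen = 63

func shortenVolumeName(name string) string {
    if len(name) <= maxVolNameLen {
        return name
    }
    return name[:maxVolNameLen]
}

func main() {
    long := "ci-op-cvx98rqr-d1161-28xzd-dynamic-pvc-00000000-0000-0000-0000-000000000000"
    fmt.Println(shortenVolumeName(long)) // truncated to 63 characters
}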

Comment 11 errata-xmlrpc 2021-07-27 22:43:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438