Bug 1926943 - vsphere-problem-detector: Alerts in CI jobs
Summary: vsphere-problem-detector: Alerts in CI jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jan Safranek
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks: 2092811 2092814
Reported: 2021-02-09 17:51 UTC by Jan Safranek
Modified: 2022-06-02 10:11 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2092811 2092814
Environment:
Last Closed: 2021-07-27 22:43:10 UTC
Target Upstream Version:
Embargoed:




Links
System: Github | ID: openshift vsphere-problem-detector pull 30 | Status: Waiting on Customer | Summary: CEPH replication issue | Last Updated: 2022-06-27 14:54:21 UTC
System: Red Hat Product Errata | ID: RHSA-2021:2438 | Last Updated: 2021-07-27 22:43:32 UTC

Description Jan Safranek 2021-02-09 17:51:12 UTC
VMware CI jobs fail with:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel] 

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:399]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "query failed: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"VSphereOpenshiftClusterHealthFail\",\"alertstate\":\"firing\",\"check\":\"CheckDefaultDatastore\",\"container\":\"vsphere-problem-detector-operator\",\"endpoint\":\"vsphere-metrics\",\"instance\":\"10.129.0.7:8444\",\"job\":\"vsphere-problem-detector-metrics\",\"namespace\":\"openshift-cluster-storage-operator\",\"pod\":\"vsphere-problem-detector-operator-7b59444849-rf47f\",\"service\":\"vsphere-problem-detector-metrics\",\"severity\":\"warning\"},\"value\":[1612885400.871,\"1\"]}]",
        },
    ]
    query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results:
    [{"metric":{"__name__":"ALERTS","alertname":"VSphereOpenshiftClusterHealthFail","alertstate":"firing","check":"CheckDefaultDatastore","container":"vsphere-problem-detector-operator","endpoint":"vsphere-metrics","instance":"10.129.0.7:8444","job":"vsphere-problem-detector-metrics","namespace":"openshift-cluster-storage-operator","pod":"vsphere-problem-detector-operator-7b59444849-rf47f","service":"vsphere-problem-detector-metrics","severity":"warning"},"value":[1612885400.871,"1"]}]
occurred

The reason is the failing CheckDefaultDatastore check: the cluster ID is too long, so the escaped volume path exceeds the 255-character limit.

2021-02-09T15:10:52.571208419Z I0209 15:10:52.571182       1 operator.go:208] CheckDefaultDatastore failed: defaultDatastore "WorkloadDatastore" in vSphere configuration: datastore WorkloadDatastore: datastore name is too long: escaped volume path "-var-lib-kubelet-plugins-kubernetes.io-vsphere\\x2dvolume-mounts-\\x5bWorkloadDatastore\\x5d\\x2000000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000-ci\\x2dop\\x2dcvx98rqr\\x2dd1161\\x2d28xzd\\x2ddynamic\\x2dpvc\\x2d00000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000.vmdk" must be under 255 characters, got 255
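
For illustration, here is a rough Go sketch of the length problem the check reports. This is not the detector's actual code: the escaping is a simplified stand-in for systemd path escaping, and the path is assembled from the log line above.

package main

import (
	"fmt"
	"strings"
)

// Simplified stand-in for systemd path escaping: '/' becomes '-',
// anything outside [a-zA-Z0-9_.] becomes \xNN.
func systemdEscape(path string) string {
	var b strings.Builder
	for _, c := range []byte(path) {
		switch {
		case c == '/':
			b.WriteByte('-')
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z',
			c >= '0' && c <= '9', c == '_', c == '.':
			b.WriteByte(c)
		default:
			fmt.Fprintf(&b, `\x%02x`, c)
		}
	}
	return b.String()
}

func main() {
	// Worst-case mount path taken from the log above: datastore name,
	// placeholder UUIDs and a CI-generated PV name.
	path := "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/" +
		"[WorkloadDatastore] 00000000-0000-0000-0000-000000000000/" +
		"ci-op-cvx98rqr-d1161-28xzd-dynamic-pvc-00000000-0000-0000-0000-000000000000.vmdk"

	escaped := systemdEscape(path)
	fmt.Printf("escaped path length: %d\n", len(escaped)) // 255 for this path, matching the error above
	if len(escaped) >= 255 {
		fmt.Println("escaped volume path must be under 255 characters")
	}
}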


Sample: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/17/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere/1359151454819979264

Comment 1 Jan Safranek 2021-02-09 18:18:30 UTC
Submitted https://github.com/openshift/vsphere-problem-detector/pull/31 to disable the datastore checks.

To fix this issue properly, we need to shorten the volume names even more in Kubernetes: https://github.com/kubernetes/kubernetes/blob/93d288e2a47fa6d497b50d37c8b3a04e91da4228/pkg/volume/vsphere_volume/vsphere_volume_util.go#L100
Right now the name is cut at 90 characters, but that does not account for the characters added by systemd escaping. Shortening it to ~63 characters would be better: the following 63-character string escapes to 90 characters (the extra byte reported by wc -c is the trailing newline).

$ systemd-escape "ci-op-cvx98rqr-dynamic-pvc-00000000-0000-0000-0000-000000000000" | wc -c 
91
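
A minimal sketch of the proposed truncation (package, constant and function names here are illustrative, not the actual Kubernetes or vsphere-problem-detector change):

package volumeutil

// Illustrative only: cut the generated volume name so that its
// systemd-escaped form stays well under the 255-character unit limit.
// 63 is the value suggested above; today the cut happens at 90.
const maxVolumeNameLength = 63 // hypothetical constant name

func truncateVolumeName(name string) string {
	if len(name) > maxVolumeNameLength {
		return name[:maxVolumeNameLength]
	}
	return name
}

With a 63-character name such as the one above, the escaped form is 90 characters, which should bring the full escaped mount path in this CI environment back under the 255-character limit.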

Comment 3 Jan Safranek 2021-02-10 08:09:44 UTC
The final solution in this BZ was to shorten the volume name to 63 characters, even before the same shortening lands in upstream Kubernetes. This fixes CI, and once it is released to customers we can gather data from customer clusters to confirm that 63 is the right value.

Comment 11 errata-xmlrpc 2021-07-27 22:43:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

