Bug 1926943
| Summary: | vsphere-problem-detector: Alerts in CI jobs | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Safranek <jsafrane> | |
| Component: | Storage | Assignee: | Jan Safranek <jsafrane> | |
| Storage sub component: | Operators | QA Contact: | Qin Ping <piqin> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | high | |||
| Priority: | high | CC: | adeshpan, agudi, aos-bugs, pmagotra, ssadhale, ssonigra | |
| Version: | 4.7 | |||
| Target Milestone: | --- | |||
| Target Release: | 4.8.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | | Doc Type: | No Doc Update | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | ||||
| : | 2092811 2092814 (view as bug list) | Environment: | ||
| Last Closed: | 2021-07-27 22:43:10 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2092811, 2092814 | |||
Submitted https://github.com/openshift/vsphere-problem-detector/pull/31 to disable the datastore checks.

To fix this issue properly, we need to shorten the volume names even more in Kubernetes: https://github.com/kubernetes/kubernetes/blob/93d288e2a47fa6d497b50d37c8b3a04e91da4228/pkg/volume/vsphere_volume/vsphere_volume_util.go#L100. The name is currently cut at 90 characters, but that does not account for the extra characters added by systemd escaping. Shortening it to ~63 characters would be better: the following 63-character string escapes to 90 characters (wc -c reports 91 because it also counts the trailing newline); see the Go sketch below for the escaping arithmetic.

$ systemd-escape "ci-op-cvx98rqr-dynamic-pvc-00000000-0000-0000-0000-000000000000" | wc -c
91

The final solution in this BZ was to shorten the volume name to 63 characters, even before such shortening lands in Kubernetes. This fixes CI, and once the change is released to customers we can gather data from customer clusters to confirm whether 63 is the right value.

The CI job https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/28/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere/1359532360546127872 has a cluster ID of the same length, and the previously failing test case passed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
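For illustration, here is a minimal Go sketch of the escaping arithmetic referenced above. It assumes the generated volume names contain only lowercase alphanumerics and dashes, so the only expansion systemd escaping performs is `-` → `\x2d`; it is not the Kubernetes or detector code.

```go
package main

import (
	"fmt"
	"strings"
)

// systemdEscapedLen approximates the length of a generated volume name
// after systemd escaping. Assumption: the name contains only [a-z0-9-],
// so each '-' expands to the 4-character sequence `\x2d` and everything
// else is kept as-is.
func systemdEscapedLen(name string) int {
	return len(name) + 3*strings.Count(name, "-")
}

func main() {
	// 63-character name from the systemd-escape example above:
	// it contains 9 dashes, so the escaped form is 63 + 9*3 = 90 characters.
	name := "ci-op-cvx98rqr-dynamic-pvc-00000000-0000-0000-0000-000000000000"
	fmt.Println(len(name), systemdEscapedLen(name)) // prints: 63 90
}
```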
VMware vSphere CI jobs fail with:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]

fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:399]: Unexpected error:
<errors.aggregate | len:1, cap:1>: [
    {
        s: "query failed: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"VSphereOpenshiftClusterHealthFail\",\"alertstate\":\"firing\",\"check\":\"CheckDefaultDatastore\",\"container\":\"vsphere-problem-detector-operator\",\"endpoint\":\"vsphere-metrics\",\"instance\":\"10.129.0.7:8444\",\"job\":\"vsphere-problem-detector-metrics\",\"namespace\":\"openshift-cluster-storage-operator\",\"pod\":\"vsphere-problem-detector-operator-7b59444849-rf47f\",\"service\":\"vsphere-problem-detector-metrics\",\"severity\":\"warning\"},\"value\":[1612885400.871,\"1\"]}]",
    },
]
query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results: [{"metric":{"__name__":"ALERTS","alertname":"VSphereOpenshiftClusterHealthFail","alertstate":"firing","check":"CheckDefaultDatastore","container":"vsphere-problem-detector-operator","endpoint":"vsphere-metrics","instance":"10.129.0.7:8444","job":"vsphere-problem-detector-metrics","namespace":"openshift-cluster-storage-operator","pod":"vsphere-problem-detector-operator-7b59444849-rf47f","service":"vsphere-problem-detector-metrics","severity":"warning"},"value":[1612885400.871,"1"]}]
occurred

The reason is a failing datastore check, because the cluster ID is too long:

2021-02-09T15:10:52.571208419Z I0209 15:10:52.571182 1 operator.go:208] CheckDefaultDatastore failed: defaultDatastore "WorkloadDatastore" in vSphere configuration: datastore WorkloadDatastore: datastore name is too long: escaped volume path "-var-lib-kubelet-plugins-kubernetes.io-vsphere\\x2dvolume-mounts-\\x5bWorkloadDatastore\\x5d\\x2000000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000-ci\\x2dop\\x2dcvx98rqr\\x2dd1161\\x2d28xzd\\x2ddynamic\\x2dpvc\\x2d00000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000.vmdk" must be under 255 characters, got 255

Sample: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/17/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere/1359151454819979264
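To make the 255-character limit concrete, below is a hedged Go sketch (not the detector's actual implementation) that reconstructs the systemd-escaped mount path from the log line above and shows it comes out to exactly 255 characters, which fails the "must be under 255" check.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeMountPath approximates systemd unit-name escaping for the mount
// path in the log above: '/' becomes '-', alphanumerics, '_' and '.' are
// kept, and everything else becomes a \xNN hex escape. Real systemd
// escaping has more rules; this only covers this particular path.
func escapeMountPath(path string) string {
	var b strings.Builder
	for _, c := range path {
		switch {
		case c == '/':
			b.WriteByte('-')
		case (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
			(c >= '0' && c <= '9') || c == '_' || c == '.':
			b.WriteRune(c)
		default:
			fmt.Fprintf(&b, `\x%02x`, c)
		}
	}
	return b.String()
}

func main() {
	// Mount path reconstructed from the failed CheckDefaultDatastore log:
	// "[datastore] <folder>/<volume name>.vmdk" under the kubelet
	// vsphere-volume mounts directory.
	path := "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/" +
		"[WorkloadDatastore] 00000000-0000-0000-0000-000000000000/" +
		"ci-op-cvx98rqr-d1161-28xzd-dynamic-pvc-00000000-0000-0000-0000-000000000000.vmdk"
	fmt.Println(len(escapeMountPath(path))) // prints: 255 -> not under 255, so the check fails
}
```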