vsphere-problem-detector should emit alerts when it finds a serious issue with the vSphere configuration.

* Find what to alert on (how long a check must be failing before alerting, and when to clear the event).
* Update AlertManager.
* Make sure the events are documented, so users know what to do and where to find details when the alert fires.
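The first item (how long a check must fail before alerting, and when to clear) can be illustrated with a small sketch. It mimics the general idea of the `for` clause in a Prometheus alerting rule: the alert fires only once the check has been failing continuously for a hold period, and clears as soon as the check passes again. The class name, method name, and the 600-second threshold are hypothetical and chosen only for illustration; they are not taken from the operator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckAlertState:
    """Tracks one health check and decides whether its alert should fire."""

    # Hypothetical threshold: seconds a check must fail continuously before alerting.
    hold_seconds: float
    # Timestamp of the first failure in the current failing streak, if any.
    failing_since: Optional[float] = None

    def observe(self, passed: bool, now: float) -> bool:
        """Record one check result at time `now`; return True if the alert should fire."""
        if passed:
            # A passing check ends the failing streak and clears the alert.
            self.failing_since = None
            return False
        if self.failing_since is None:
            # First failure of a new streak: start the clock.
            self.failing_since = now
        return now - self.failing_since >= self.hold_seconds


if __name__ == "__main__":
    state = CheckAlertState(hold_seconds=600)  # assumed value, for illustration only
    for t, passed in [(0, False), (300, False), (600, False), (700, True), (800, False)]:
        print(t, "firing" if state.observe(passed, t) else "not firing")
```

Under these assumptions a single transient failure never alerts, and a recovered check clears the alert on the next observation.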
Verified with: 4.7.0-0.nightly-2021-01-21-235301

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep VSphereOpenshiftClusterHealthFail -A 15
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7106    0  7106    0     0   408k      0 --:--:-- --:--:-- --:--:--  385k
          "alertname": "VSphereOpenshiftClusterHealthFail",
          "check": "CheckPVs",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.130.0.75:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-f46b5cfb-xclx6",
          "service": "vsphere-problem-detector-metrics",
          "severity": "warning"
        },
        "annotations": {
          "message": "VSphere cluster health checks are failing with CheckPVs"
        },
        "state": "firing",
        "activeAt": "2021-01-22T13:20:52.396347327Z",
--
          "alertname": "VSphereOpenshiftClusterHealthFail",
          "check": "CheckStorageClasses",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.130.0.75:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-f46b5cfb-xclx6",
          "service": "vsphere-problem-detector-metrics",
          "severity": "warning"
        },
        "annotations": {
          "message": "VSphere cluster health checks are failing with CheckStorageClasses"
        },
        "state": "firing",
        "activeAt": "2021-01-22T13:20:52.396347327Z",
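As an alternative to `grep -A 15`, which depends on the line layout of the pretty-printed JSON, the same verification can be done by parsing the response. The sketch below assumes the standard Prometheus `/api/v1/alerts` response shape (`data.alerts[]` with `labels`, `annotations`, and `state`). The function names are hypothetical, and the sample payload is trimmed from the output above rather than being a full response.

```python
import json
from typing import Any, Dict, List

ALERT_NAME = "VSphereOpenshiftClusterHealthFail"


def firing_alerts(response_text: str, alertname: str = ALERT_NAME) -> List[Dict[str, Any]]:
    """Return alerts with the given alertname that are in the 'firing' state."""
    payload = json.loads(response_text)
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == alertname
        and alert.get("state") == "firing"
    ]


def failing_checks(response_text: str, alertname: str = ALERT_NAME) -> List[str]:
    """Return the sorted, de-duplicated 'check' label values of the firing alerts."""
    checks = {
        alert["labels"]["check"]
        for alert in firing_alerts(response_text, alertname)
        if "check" in alert["labels"]
    }
    return sorted(checks)


# Trimmed sample built from the fields shown in the verification output.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": ALERT_NAME, "check": "CheckPVs", "severity": "warning"},
                "annotations": {"message": "VSphere cluster health checks are failing with CheckPVs"},
                "state": "firing",
                "activeAt": "2021-01-22T13:20:52.396347327Z",
            },
            {
                "labels": {"alertname": ALERT_NAME, "check": "CheckStorageClasses", "severity": "warning"},
                "annotations": {"message": "VSphere cluster health checks are failing with CheckStorageClasses"},
                "state": "firing",
                "activeAt": "2021-01-22T13:20:52.396347327Z",
            },
        ]
    },
})

if __name__ == "__main__":
    print(failing_checks(SAMPLE_RESPONSE))  # ['CheckPVs', 'CheckStorageClasses']
```

In practice the `curl` output from the command above would be passed to `failing_checks` in place of `SAMPLE_RESPONSE`.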
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633