Bug 1904503

Summary: vsphere-problem-detector: emit alerts
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Operators
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Version: 4.7
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Reporter: Jan Safranek <jsafrane>
Assignee: Hemant Kumar <hekumar>
QA Contact: Qin Ping <piqin>
CC: aos-bugs
Last Closed: 2021-02-24 15:38:11 UTC
Type: Bug

Description Jan Safranek 2020-12-04 16:08:43 UTC
vsphere-problem-detector should emit alerts when it finds a serious issue with vSphere configuration.

* Find what to alert on (how long a check must be failing, when to clear the event).
* Update AlertManager.
* Make sure the events are documented, so users know what to do and where to find details when the alert fires.

Comment 2 Qin Ping 2021-01-22 13:31:42 UTC
Verified with: 4.7.0-0.nightly-2021-01-21-235301

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq | grep VSphereOpenshiftClusterHealthFail -A 15
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7106    0  7106    0     0   408k      0 --:--:-- --:--:-- --:--:--  385k
          "alertname": "VSphereOpenshiftClusterHealthFail",
          "check": "CheckPVs",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.130.0.75:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-f46b5cfb-xclx6",
          "service": "vsphere-problem-detector-metrics",
          "severity": "warning"
        },
        "annotations": {
          "message": "VSphere cluster health checks are failing with CheckPVs"
        },
        "state": "firing",
        "activeAt": "2021-01-22T13:20:52.396347327Z",
--
          "alertname": "VSphereOpenshiftClusterHealthFail",
          "check": "CheckStorageClasses",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.130.0.75:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-f46b5cfb-xclx6",
          "service": "vsphere-problem-detector-metrics",
          "severity": "warning"
        },
        "annotations": {
          "message": "VSphere cluster health checks are failing with CheckStorageClasses"
        },
        "state": "firing",
        "activeAt": "2021-01-22T13:20:52.396347327Z",
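
The grep -A 15 filter above depends on how jq happens to lay out the pretty-printed JSON. A minimal sketch of a more targeted check follows, assuming the response has the standard Prometheus /api/v1/alerts shape (a "data.alerts" array whose entries carry "labels", "state" and "activeAt"). The function name summarize_vsphere_alerts is a hypothetical helper introduced for illustration; it is not part of oc, jq, or the operator.

```shell
#!/bin/sh
# Sketch: read a Prometheus /api/v1/alerts response on stdin and print one
# tab-separated line per VSphereOpenshiftClusterHealthFail alert:
#   check <TAB> severity <TAB> state <TAB> activeAt
# Assumption: the response shape is {"data":{"alerts":[{"labels":{...},...}]}}.
# summarize_vsphere_alerts is a hypothetical helper name.
summarize_vsphere_alerts() {
  jq -r '
    .data.alerts[]
    | select(.labels.alertname == "VSphereOpenshiftClusterHealthFail")
    | [.labels.check, .labels.severity, .state, .activeAt]
    | @tsv
  '
}

# Possible usage against a live cluster (commented out; needs a valid $token):
#   oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
#     curl -sk -H "Authorization: Bearer $token" \
#     'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' \
#     | summarize_vsphere_alerts
```

Selecting on the alertname label keeps the output independent of line layout, and curl -s suppresses the progress meter seen in the transcript above.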

Comment 5 errata-xmlrpc 2021-02-24 15:38:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633