Description of problem:
When creating a cluster (platform == VSphere), I deliberately entered wrong vSphere credentials (and a wrong vCenter address), and no event or alert appeared in the console.

Version-Release number of selected component (if applicable):

How reproducible:
Consistently

Steps to Reproduce:
1. Install a cluster with wrong credentials

======== Install config ========
...
platform:
  vsphere:
    vCenter: vcenterplaceholder
    username: usernameplaceholder
    password: passwordplaceholder
    datacenter: datacenterplaceholder
    defaultDatastore: defaultdatastoreplaceholder
    network: networkplaceholder
    cluster: clusterplaceholder
    apiVIP: 10.19.115.210
    ingressVIP: 10.19.115.212
...
======== ======== ======== ======

Actual results:
- After installation, no relevant alerts or events exist in the console.
- On PVC creation, an event is created: "Failed to provision volume with StorageClass "thin": Post "https://placeholder:443/sdk": dial tcp: lookup placeholder on 10.19.115.106:53: no such host".

After updating the vCenter address (in vsphere-creds and in cloud-provider-config), the following event appears: "Failed to provision volume with StorageClass "thin": ServerFaultCode: Cannot complete login due to an incorrect user name or password."

Expected results:
An alert on the console home page

Additional info:
Related bugs: https://bugzilla.redhat.com/show_bug.cgi?id=1959546
Can you please attach a must-gather? It should have all the logs we need.

Looking at the alert definition in https://github.com/openshift/cluster-storage-operator/blob/release-4.7/assets/vsphere_problem_detector/12_prometheusrules.yaml#L37, an alert fires when vsphere-problem-detector can't connect to vCenter for ~75 minutes. The reason why it can't connect does not really matter (wrong IP, firewall in the way, wrong credentials, wrong protocol, wrong TLS config...).

A detailed error message can be found in the output of `oc get clusteroperator storage -o yaml`; look for the `Available` condition in the status. That message is set immediately after a connection error, so there is no need to wait 75 minutes.
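The lookup of the `Available` condition can also be scripted. This is a minimal sketch, assuming the ClusterOperator object has been obtained as JSON (for example via `oc get clusteroperator storage -o json`). The helper name `available_condition` and the sample object, including its message text, are hypothetical and only illustrate the shape of the data.

```python
import json
from typing import Optional


def available_condition(cluster_operator: dict) -> Optional[dict]:
    """Return the Available condition of a ClusterOperator object, or None."""
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    for condition in conditions:
        if condition.get("type") == "Available":
            return condition
    return None


# Hypothetical sample, shaped like `oc get clusteroperator storage -o json`.
# The message text is made up for illustration.
SAMPLE = json.loads("""
{
  "kind": "ClusterOperator",
  "metadata": {"name": "storage"},
  "status": {
    "conditions": [
      {"type": "Degraded", "status": "False", "message": ""},
      {"type": "Available", "status": "True",
       "message": "example: failed to connect to vCenter"}
    ]
  }
}
""")

if __name__ == "__main__":
    condition = available_condition(SAMPLE)
    print(condition["message"] if condition else "no Available condition found")
```

On a real cluster, the same function could be fed with `json.load(sys.stdin)` and the `oc` output piped in.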
After waiting ~75 minutes I was able to see the alert created by vsphere-problem-detector. I'm wondering why the connection to vCenter isn't validated right after installation. A user who entered wrong credentials will only find out 75 minutes after installation.
We should probably reduce the alert time to something like ~10 minutes.
Verified: passed on 4.9.0-0.nightly-2021-08-07-175228
The alert became "firing" 15 minutes after the "activeAt" time.

{
  "labels": {
    "alertname": "VSphereOpenshiftConnectionFailure",
    "container": "vsphere-problem-detector-operator",
    "endpoint": "vsphere-metrics",
    "instance": "10.129.0.43:8444",
    "job": "vsphere-problem-detector-metrics",
    "namespace": "openshift-cluster-storage-operator",
    "pod": "vsphere-problem-detector-operator-f446f6f7d-bhv98",
    "reason": "InvalidCredentials",
    "service": "vsphere-problem-detector-metrics",
    "severity": "warning"
  },
  "annotations": {
    "description": "vsphere-problem-detector cannot access vCenter. As consequence, other OCP components,\nsuch as storage or machine API, may not be able to access vCenter too and provide\ntheir services. Detailed error message can be found in Available condition of\nClusterOperator \"storage\", either in console\n(Administration -> Cluster settings -> Cluster operators tab -> storage) or on\ncommand line: oc get clusteroperator storage -o jsonpath='{.status.conditions[?(@.type==\"Available\")].message}'\n",
    "summary": "vsphere-problem-detector is unable to connect to vSphere vCenter."
  },
  "state": "firing",
  "activeAt": "2021-08-11T08:45:22.396347327Z",
  "value": "1e+00"
},
{
  "labels": {
    "alertname": "VSphereOpenshiftConnectionFailure",
    "container": "vsphere-problem-detector-operator",
    "endpoint": "vsphere-metrics",
    "instance": "10.129.0.43:8444",
    "job": "vsphere-problem-detector-metrics",
    "namespace": "openshift-cluster-storage-operator",
    "pod": "vsphere-problem-detector-operator-f446f6f7d-bhv98",
    "reason": "SyncError",
    "service": "vsphere-problem-detector-metrics",
    "severity": "warning"
  },
  "annotations": {
    "description": "vsphere-problem-detector cannot access vCenter. As consequence, other OCP components,\nsuch as storage or machine API, may not be able to access vCenter too and provide\ntheir services. Detailed error message can be found in Available condition of\nClusterOperator \"storage\", either in console\n(Administration -> Cluster settings -> Cluster operators tab -> storage) or on\ncommand line: oc get clusteroperator storage -o jsonpath='{.status.conditions[?(@.type==\"Available\")].message}'\n",
    "summary": "vsphere-problem-detector is unable to connect to vSphere vCenter."
  },
  "state": "firing",
  "activeAt": "2021-08-11T08:45:22.396347327Z",
  "value": "1e+00"
},
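For anyone repeating this verification, the alert list can be filtered programmatically. This is a minimal sketch, assuming the alerts are available as a list of objects with the same fields as the output above (`labels`, `state`, `activeAt`). The function name `firing_reasons` is hypothetical, the sample is trimmed to the fields the function reads, and the third entry (a "pending" alert with a made-up reason) is added only to show the filtering.

```python
def firing_reasons(alerts, alertname="VSphereOpenshiftConnectionFailure"):
    """Return the sorted, de-duplicated `reason` labels of firing alerts with the given name."""
    reasons = {
        alert["labels"].get("reason", "")
        for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("alertname") == alertname
    }
    return sorted(reasons)


# Trimmed sample: the first two entries mirror the alerts in this comment,
# the third is a hypothetical pending alert that must be ignored.
SAMPLE_ALERTS = [
    {
        "labels": {"alertname": "VSphereOpenshiftConnectionFailure",
                   "reason": "InvalidCredentials", "severity": "warning"},
        "state": "firing",
        "activeAt": "2021-08-11T08:45:22.396347327Z",
    },
    {
        "labels": {"alertname": "VSphereOpenshiftConnectionFailure",
                   "reason": "SyncError", "severity": "warning"},
        "state": "firing",
        "activeAt": "2021-08-11T08:45:22.396347327Z",
    },
    {
        "labels": {"alertname": "VSphereOpenshiftConnectionFailure",
                   "reason": "HypotheticalReason", "severity": "warning"},
        "state": "pending",
    },
]

if __name__ == "__main__":
    for reason in firing_reasons(SAMPLE_ALERTS):
        print(reason)
```

Only alerts in the "firing" state are reported, so an alert that is still waiting out its hold period does not count as verified.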
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759