Bug 1943719 - storage-operator/vsphere-problem-detector causing upgrades to fail that would have succeeded in past versions
Summary: storage-operator/vsphere-problem-detector causing upgrades to fail that would...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jan Safranek
QA Contact: Wei Duan
URL:
Whiteboard:
Duplicates: 1955260
Depends On:
Blocks: 1959546
Reported: 2021-03-26 21:47 UTC by Luke Stanton
Modified: 2021-11-03 08:46 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1959546
Environment:
Last Closed: 2021-07-27 22:56:00 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-storage-operator pull 166 | 0 | None | closed | Bug 1943719: Add alert about vsphere-problem-detector unable to connect | 2021-05-17 15:12:14 UTC
Github | openshift vsphere-problem-detector pull 39 | 0 | None | closed | Bug 1943719: Don't degrade cluster on connection error | 2021-05-17 15:12:17 UTC
Github | openshift vsphere-problem-detector pull 42 | 0 | None | closed | Bug 1943719: Save error in Available message on every sync | 2021-05-24 15:30:48 UTC
Red Hat Knowledge Base (Solution) | 4618011 | 0 | None | None | None | 2021-05-05 06:53:47 UTC
Red Hat Knowledge Base (Solution) | 6005261 | 0 | None | None | None | 2021-05-05 06:53:47 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:56:36 UTC

Description Luke Stanton 2021-03-26 21:47:32 UTC
Description of problem:

The vsphere-problem-detector feature is causing upgrades that previously worked to stall, forcing users to update their configuration solely to get around the problem detector. Depending on the policies around configuration updates, this can be a major hindrance for a user who needs the upgrade to complete and wants to keep the current vSphere settings, since they have worked in the past.


Version-Release number of selected component (if applicable):

4.7


How reproducible:

Consistently


Steps to Reproduce:
1. Attempt to upgrade a cluster to 4.7 with invalid vSphere credentials


Actual results:

The upgrade hangs since the storage operator is degraded due to the vsphere-problem-detector indicating a config problem


Expected results:

Provide a way to opt out of or bypass the vsphere-problem-detector if the user doesn't want to make a config change, since the setup is working and upgrades like this succeeded for the user prior to 4.7.

Comment 10 Hemant Kumar 2021-04-29 19:54:02 UTC
*** Bug 1955260 has been marked as a duplicate of this bug. ***

Comment 15 pmoses 2021-05-12 18:00:09 UTC
Additional info:

This also takes place if there is network segmentation blocking access back to the diesore host:port. 

Upgrades were able to complete by switching the operator between unmanaged and managed at several points during the upgrade; however, after the upgrade completed, the operator continued to show as degraded.

Comment 17 Jan Safranek 2021-05-17 18:17:41 UTC
I found an issue where the message on the Available condition is sometimes cleared.

Comment 19 Wei Duan 2021-05-18 11:47:11 UTC
Verified with 4.8.0-0.nightly-2021-05-18-033553.

After changing to an invalid password with:
$ oc -n kube-system edit secret vsphere-creds

Then check that the storage clusteroperator is AVAILABLE and not DEGRADED:
$ oc get co storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.8.0-0.nightly-2021-05-18-033553   True        False         False      92m
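The same check can be scripted when verifying several clusters. The following is only a minimal sketch: it assumes the tabular output of `oc get co storage` has been captured as text with the column headers shown above, and that no column value contains spaces (a MESSAGE column, if present, would break this). The function name `parse_clusteroperator_table` is hypothetical and not part of any OpenShift tooling.

```python
def parse_clusteroperator_table(text):
    """Parse the whitespace-separated table printed by `oc get co <name>`.

    Returns one dict per data row, keyed by the header names
    (NAME, VERSION, AVAILABLE, ...). Assumes values contain no spaces.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Sample output copied from the verification above.
sample_table = """\
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.8.0-0.nightly-2021-05-18-033553   True        False         False      92m
"""

row = parse_clusteroperator_table(sample_table)[0]
# The fix is verified when the operator stays Available and is not Degraded.
healthy = row["AVAILABLE"] == "True" and row["DEGRADED"] == "False"
print(row["NAME"], "healthy" if healthy else "needs attention")
```

With the sample above, the operator is reported as healthy even though the vSphere password is invalid, which is the behavior this bug asks for.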

Message from the clusteroperator:
$ oc get clusteroperator storage -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com: ServerFaultCode: Cannot complete login due to an incorrect user name or password.

Check the vsphere_sync_errors metric and the alert raised:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_sync_errors",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.128.0.44:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb",
          "service": "vsphere-problem-detector-metrics"
        },
        "value": [
          1621335304.464,
          "1"
        ]
      }
    ]
  }
}
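To read the metric programmatically, the response above can be parsed as a standard Prometheus instant-query result. This is a sketch under assumptions: the response body has already been fetched and is available as a string, the sample below is trimmed to a few labels from the output above, and the helper name `sync_error_samples` is hypothetical.

```python
import json


def sync_error_samples(response_text):
    """Return (pod, value) pairs from a Prometheus instant-query response.

    Each vector sample carries its value as [timestamp, "string"], so the
    string is converted to float here.
    """
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query did not succeed")
    samples = []
    for series in body["data"]["result"]:
        _timestamp, raw_value = series["value"]
        samples.append((series["metric"].get("pod"), float(raw_value)))
    return samples


# Trimmed copy of the response shown above.
sample_response = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_sync_errors",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb"
        },
        "value": [1621335304.464, "1"]
      }
    ]
  }
}
"""

for pod, value in sync_error_samples(sample_response):
    print(pod, value)
```

A non-zero value for vsphere_sync_errors is what the alert below is raised on in this verification.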

    "alerts": [
      {
        "labels": {
          "alertname": "VSphereOpenshiftConnectionFailure",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.128.0.44:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb",
          "service": "vsphere-problem-detector-metrics",
          "severity": "warning"
        },
        "annotations": {
          "description": "vsphere-problem-detector cannot access vCenter. As consequence, other OCP components,\nsuch as storage or machine API, may not be able to access vCenter too and provide\ntheir services. Detailed error message can be found in Available condition of\nClusterOperator \"storage\", either in console\n(Administration -> Cluster settings -> Cluster operators tab -> storage) or on\ncommand line: oc get clusteroperator storage -o jsonpath='{.status.conditions[?(@.type==\"Available\")].message}'\n",
          "summary": "vsphere-problem-detector is unable to connect to vSphere vCenter."
        },
        "state": "firing",
        "activeAt": "2021-05-18T10:08:52.396347327Z",
        "value": "1e+00"
      },
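The alert fragment above is truncated, so it is not valid JSON on its own. As a sketch only, assuming the fragment is wrapped in an enclosing object with an "alerts" list and reduced to the fields needed here, firing alerts can be filtered by name as follows. The helper name `firing_alert_names` is hypothetical.

```python
def firing_alert_names(payload, alertname=None):
    """Return the names of alerts in state "firing".

    If alertname is given, only alerts with that name are returned.
    """
    names = []
    for alert in payload.get("alerts", []):
        if alert.get("state") != "firing":
            continue
        name = alert.get("labels", {}).get("alertname")
        if alertname is None or name == alertname:
            names.append(name)
    return names


# Reduced version of the fragment above, wrapped to make it a complete object.
sample_alerts = {
    "alerts": [
        {
            "labels": {
                "alertname": "VSphereOpenshiftConnectionFailure",
                "namespace": "openshift-cluster-storage-operator",
                "severity": "warning",
            },
            "state": "firing",
            "activeAt": "2021-05-18T10:08:52.396347327Z",
            "value": "1e+00",
        }
    ]
}

print(firing_alert_names(sample_alerts, "VSphereOpenshiftConnectionFailure"))
```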

Marked as VERIFIED.

Comment 22 errata-xmlrpc 2021-07-27 22:56:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

