Bug 1943719

Summary: storage-operator/vsphere-problem-detector causing upgrades to fail that would have succeeded in past versions
Product: OpenShift Container Platform
Reporter: Luke Stanton <lstanton>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Storage sub component: Operators
QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: unspecified
CC: aos-bugs, dmoessne, hekumar, jcallen, jsafrane, lmohanty, nchoudhu, palshure, pmoses, vrutkovs, WilliamC.Elliott, wking
Version: 4.7
Keywords: Upgrades
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1959546
Environment:
Last Closed: 2021-07-27 22:56:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1959546

Description Luke Stanton 2021-03-26 21:47:32 UTC
Description of problem:

The vsphere-problem-detector feature is stalling upgrades that previously worked, forcing users to change configuration solely to get around the problem detector. Depending on the policies around configuration changes, this can be a major hindrance for a user who needs the upgrade to complete and wants to keep the current vSphere settings because they have worked in the past.


Version-Release number of selected component (if applicable):

4.7


How reproducible:

Consistently


Steps to Reproduce:
1. Attempt to upgrade a cluster to 4.7 with invalid vSphere credentials


Actual results:

The upgrade hangs because the storage operator is degraded: the vsphere-problem-detector reports a config problem.


Expected results:

A way to opt out of or bypass the vsphere-problem-detector when the user doesn't want to make a config change, since the setup is working and upgrades like this succeeded for the user prior to 4.7.

Comment 10 Hemant Kumar 2021-04-29 19:54:02 UTC
*** Bug 1955260 has been marked as a duplicate of this bug. ***

Comment 15 pmoses 2021-05-12 18:00:09 UTC
Additional info:

This also takes place if there is network segmentation blocking access back to the diesore host:port. 

Upgrades were able to complete by switching the operator to unmanaged/managed at several points of the upgrade; however, after the upgrade completed, the operator continued to show as degraded.

Comment 17 Jan Safranek 2021-05-17 18:17:41 UTC
I found an issue where the message on the Available condition is sometimes cleared.

Comment 19 Wei Duan 2021-05-18 11:47:11 UTC
Verified with 4.8.0-0.nightly-2021-05-18-033553.

After changing to an invalid password with:
$ oc -n kube-system edit secret vsphere-creds

Then check that the storage clusteroperator is AVAILABLE and not DEGRADED:
$ oc get co storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.8.0-0.nightly-2021-05-18-033553   True        False         False      92m

Message from the clusteroperator:
$ oc get clusteroperator storage -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
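
For scripted checks, the same lookup can be sketched in Python against the JSON form of the ClusterOperator (for example, saved output of `oc get clusteroperator storage -o json`). This is only an illustrative sketch: the helper name `condition_message` and the trimmed sample object are assumptions made for the example, not part of any OpenShift tooling.

```python
import json


def condition_message(clusteroperator, cond_type="Available"):
    """Return the message of the first condition with the given type, or None."""
    conditions = clusteroperator.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond.get("message")
    return None


# Hypothetical, heavily trimmed stand-in for `oc get clusteroperator storage -o json`.
sample = json.loads("""
{
  "kind": "ClusterOperator",
  "metadata": {"name": "storage"},
  "status": {
    "conditions": [
      {"type": "Degraded", "status": "False"},
      {"type": "Available", "status": "True",
       "message": "VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com: ServerFaultCode: Cannot complete login due to an incorrect user name or password."}
    ]
  }
}
""")

message = condition_message(sample)
print(message)
```

Unlike the jsonpath filter, which returns every matching condition, this helper stops at the first match; that should be equivalent as long as each condition type appears once.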

Check the vsphere_sync_errors metric and the alert raised:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_sync_errors",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.128.0.44:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb",
          "service": "vsphere-problem-detector-metrics"
        },
        "value": [
          1621335304.464,
          "1"
        ]
      }
    ]
  }
}
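
A response of this shape can be checked programmatically. The following is a minimal Python sketch, assuming the JSON above is a Prometheus instant-query result that has already been fetched; the function name `failing_series` and the trimmed sample are hypothetical and only illustrate the parsing.

```python
import json


def failing_series(response, metric_name="vsphere_sync_errors"):
    """Return (pod, value) pairs for samples of metric_name with a value above 0."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    data = response.get("data", {})
    if data.get("resultType") != "vector":
        raise ValueError("expected an instant vector result")
    hits = []
    for sample in data.get("result", []):
        labels = sample.get("metric", {})
        if labels.get("__name__") != metric_name:
            continue
        # Each sample value is a [timestamp, "string value"] pair.
        _timestamp, raw_value = sample["value"]
        value = float(raw_value)
        if value > 0:
            hits.append((labels.get("pod"), value))
    return hits


# Trimmed copy of the response shown above (only the labels used here are kept).
response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_sync_errors",
          "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb"
        },
        "value": [1621335304.464, "1"]
      }
    ]
  }
}
""")

print(failing_series(response))
```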

    "alerts": [
      {
        "labels": {
          "alertname": "VSphereOpenshiftConnectionFailure",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.128.0.44:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb",
          "service": "vsphere-problem-detector-metrics",
          "severity": "warning"
        },
        "annotations": {
          "description": "vsphere-problem-detector cannot access vCenter. As consequence, other OCP components,\nsuch as storage or machine API, may not be able to access vCenter too and provide\ntheir services. Detailed error message can be found in Available condition of\nClusterOperator \"storage\", either in console\n(Administration -> Cluster settings -> Cluster operators tab -> storage) or on\ncommand line: oc get clusteroperator storage -o jsonpath='{.status.conditions[?(@.type==\"Available\")].message}'\n",
          "summary": "vsphere-problem-detector is unable to connect to vSphere vCenter."
        },
        "state": "firing",
        "activeAt": "2021-05-18T10:08:52.396347327Z",
        "value": "1e+00"
      },
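
The alert check can be scripted in a similar way. This Python sketch assumes the fragment above is one element of a list of alert objects that each carry `labels` and `state` fields; the helper `firing_alerts` and the second, pending entry in the sample list are made up for illustration.

```python
def firing_alerts(alerts, alertname):
    """Return the alerts with the given name that are currently firing."""
    return [
        alert
        for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("alertname") == alertname
    ]


# First entry is trimmed from the fragment above; the second is hypothetical.
alerts = [
    {
        "labels": {
            "alertname": "VSphereOpenshiftConnectionFailure",
            "severity": "warning",
        },
        "state": "firing",
        "activeAt": "2021-05-18T10:08:52.396347327Z",
        "value": "1e+00",
    },
    {
        "labels": {"alertname": "SomeOtherAlert", "severity": "info"},
        "state": "pending",
        "value": "0e+00",
    },
]

matches = firing_alerts(alerts, "VSphereOpenshiftConnectionFailure")
for alert in matches:
    # The alert value is serialized as a string such as "1e+00".
    print(alert["labels"]["severity"], float(alert["value"]))
```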

Marked as VERIFIED.

Comment 22 errata-xmlrpc 2021-07-27 22:56:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438