1943719 – storage-operator/vsphere-problem-detector causing upgrades to fail that would have succeeded in past versions

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Jan Safranek
QA Contact:	Wei Duan
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1955260
Depends On:
Blocks:	1959546

Reported:	2021-03-26 21:47 UTC by Luke Stanton
Modified:	2024-06-14 01:03 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1959546
Environment:
Last Closed:	2021-07-27 22:56:00 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-storage-operator pull 166	None	closed	Bug 1943719: Add alert about vsphere-problem-detector unable to connect	2021-05-17 15:12:14 UTC
Github	openshift vsphere-problem-detector pull 39	None	closed	Bug 1943719: Don't degrade cluster on connection error	2021-05-17 15:12:17 UTC
Github	openshift vsphere-problem-detector pull 42	None	closed	Bug 1943719: Save error in Available message on every sync	2021-05-24 15:30:48 UTC
Red Hat Knowledge Base (Solution)	4618011	None	None	None	2021-05-05 06:53:47 UTC
Red Hat Knowledge Base (Solution)	6005261	None	None	None	2021-05-05 06:53:47 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:56:36 UTC

Description Luke Stanton 2021-03-26 21:47:32 UTC

Description of problem:

The vsphere-problem-detector feature is causing upgrades that previously worked to stall, forcing users to update their configuration solely to get around the problem detector. Depending on the policies around configuration updates, this can be a major hindrance for a user who needs the upgrade to complete and wants to keep the current vSphere settings, since they have worked in the past.


Version-Release number of selected component (if applicable):

4.7


How reproducible:

Consistently


Steps to Reproduce:
1. Attempt to upgrade a cluster to 4.7 with invalid vSphere credentials


Actual results:

The upgrade hangs because the storage operator is degraded: the vsphere-problem-detector reports a config problem.


Expected results:

A way to opt out of or bypass the vsphere-problem-detector if the user doesn't want to make a config change, since the setup is working and upgrades like this succeeded for users prior to 4.7.

Comment 10 Hemant Kumar 2021-04-29 19:54:02 UTC

*** Bug 1955260 has been marked as a duplicate of this bug. ***

Comment 15 pmoses 2021-05-12 18:00:09 UTC

Additional info:

This also takes place if there is network segmentation blocking access back to the diesore host:port. 

Upgrades were able to complete by switching the operator between unmanaged and managed at several points during the upgrade. However, after the upgrade completed, the operator continued to show as degraded.

Comment 17 Jan Safranek 2021-05-17 18:17:41 UTC

I found an issue where the message on the Available condition is sometimes cleared.

Comment 19 Wei Duan 2021-05-18 11:47:11 UTC

Verified with 4.8.0-0.nightly-2021-05-18-033553.

After changing to an invalid password with:
$ oc -n kube-system edit secret vsphere-creds

Then check that the storage clusteroperator is AVAILABLE and not DEGRADED:
$ oc get co storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.8.0-0.nightly-2021-05-18-033553   True        False         False      92m

Message from the clusteroperator:
$ oc get clusteroperator storage -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
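
The two checks above can also be scripted. The following is a minimal sketch, assuming the output of `oc get clusteroperator storage -o json` has already been parsed into a dict. The function names and the sample object are hypothetical and only illustrate the shape of `.status.conditions`.

```python
from typing import Dict, List


def summarize_conditions(cluster_operator: dict) -> Dict[str, dict]:
    """Map each condition type to its status and message.

    Expects a ClusterOperator object as a dict, for example parsed from
    `oc get clusteroperator storage -o json`.
    """
    conditions: List[dict] = cluster_operator.get("status", {}).get("conditions", [])
    return {
        c["type"]: {"status": c.get("status"), "message": c.get("message", "")}
        for c in conditions
    }


def available_and_not_degraded(cluster_operator: dict) -> bool:
    """True when Available is "True" and Degraded is not "True"."""
    summary = summarize_conditions(cluster_operator)
    available = summary.get("Available", {}).get("status") == "True"
    degraded = summary.get("Degraded", {}).get("status") == "True"
    return available and not degraded


# Hypothetical sample shaped like the verified 4.8 behaviour: the connection
# error is reported in the Available message without degrading the operator.
sample = {
    "status": {
        "conditions": [
            {
                "type": "Available",
                "status": "True",
                "message": "VSphereProblemDetectorControllerAvailable: failed to connect to vcenter",
            },
            {"type": "Progressing", "status": "False"},
            {"type": "Degraded", "status": "False"},
        ]
    }
}

if __name__ == "__main__":
    print(available_and_not_degraded(sample))
    print(summarize_conditions(sample)["Available"]["message"])
```

Keeping the message lookup separate from the boolean check mirrors the verification: the operator stays Available while the error text is still retrievable.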

Check the vsphere_sync_errors metric and the alert raised:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_sync_errors",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.128.0.44:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb",
          "service": "vsphere-problem-detector-metrics"
        },
        "value": [
          1621335304.464,
          "1"
        ]
      }
    ]
  }
}
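
A response like the one above can be checked programmatically. This is a minimal sketch, assuming the JSON is a Prometheus instant-query response already parsed into a dict. The helper name `failing_pods` is hypothetical, and the sample is trimmed from the output above.

```python
from typing import List, Tuple


def failing_pods(response: dict, metric_name: str = "vsphere_sync_errors") -> List[Tuple[str, float]]:
    """Return (pod, value) pairs for samples of `metric_name` with value > 0."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    data = response.get("data", {})
    if data.get("resultType") != "vector":
        raise ValueError("expected an instant vector result")

    failing = []
    for sample in data.get("result", []):
        labels = sample.get("metric", {})
        if labels.get("__name__") != metric_name:
            continue
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _timestamp, raw_value = sample["value"]
        value = float(raw_value)
        if value > 0:
            failing.append((labels.get("pod", ""), value))
    return failing


# Trimmed copy of the response shown above.
response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "vsphere_sync_errors",
                    "namespace": "openshift-cluster-storage-operator",
                    "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb",
                },
                "value": [1621335304.464, "1"],
            }
        ],
    },
}

if __name__ == "__main__":
    print(failing_pods(response))
```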

    "alerts": [
      {
        "labels": {
          "alertname": "VSphereOpenshiftConnectionFailure",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.128.0.44:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-958d9f68c-w74tb",
          "service": "vsphere-problem-detector-metrics",
          "severity": "warning"
        },
        "annotations": {
          "description": "vsphere-problem-detector cannot access vCenter. As consequence, other OCP components,\nsuch as storage or machine API, may not be able to access vCenter too and provide\ntheir services. Detailed error message can be found in Available condition of\nClusterOperator \"storage\", either in console\n(Administration -> Cluster settings -> Cluster operators tab -> storage) or on\ncommand line: oc get clusteroperator storage -o jsonpath='{.status.conditions[?(@.type==\"Available\")].message}'\n",
          "summary": "vsphere-problem-detector is unable to connect to vSphere vCenter."
        },
        "state": "firing",
        "activeAt": "2021-05-18T10:08:52.396347327Z",
        "value": "1e+00"
      },
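
To confirm from a script that the alert is firing, the alert list can be filtered by name and state. This is a minimal sketch, assuming the "alerts" array above has been parsed into a list of dicts. The helper name `firing_alerts` is hypothetical, and the sample is trimmed from the output above.

```python
from typing import List


def firing_alerts(alerts: List[dict], alertname: str) -> List[dict]:
    """Return alerts named `alertname` whose state is "firing"."""
    return [
        alert
        for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("alertname") == alertname
    ]


# Trimmed copy of the alert shown above.
alerts = [
    {
        "labels": {
            "alertname": "VSphereOpenshiftConnectionFailure",
            "namespace": "openshift-cluster-storage-operator",
            "severity": "warning",
        },
        "annotations": {
            "summary": "vsphere-problem-detector is unable to connect to vSphere vCenter."
        },
        "state": "firing",
        "activeAt": "2021-05-18T10:08:52.396347327Z",
        "value": "1e+00",
    }
]

if __name__ == "__main__":
    for alert in firing_alerts(alerts, "VSphereOpenshiftConnectionFailure"):
        print(alert["labels"]["severity"], alert["annotations"]["summary"])
```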

Marked as VERIFIED.

Comment 22 errata-xmlrpc 2021-07-27 22:56:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
