Bug 2060213 - [vSphere CSI driver Operator] hw_version_total metric update wrong value after upgrade nodes hardware version from `vmx-13` to `vmx-15`
Summary: [vSphere CSI driver Operator] hw_version_total metric update wrong value afte...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.10
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.z
Assignee: Hemant Kumar
QA Contact: Penghao Wang
URL:
Whiteboard:
Depends On: 2053104
Blocks:
 
Reported: 2022-03-03 00:49 UTC by Hemant Kumar
Modified: 2023-07-10 23:12 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2053104
Environment:
Last Closed: 2022-03-28 12:03:18 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift vmware-vsphere-csi-driver-operator pull 78 | 0 | None | open | Bug 2060213: Fix node check result cache between checks | 2022-03-03 14:56:28 UTC
Red Hat Product Errata | RHBA-2022:1026 | 0 | None | None | None | 2022-03-28 12:03:40 UTC

Description Hemant Kumar 2022-03-03 00:49:01 UTC
+++ This bug was initially created as a clone of Bug #2053104 +++

Description of problem:
[vSphere CSI driver Operator] hw_version_total metric update wrong value after upgrade nodes hardware version from `vmx-13` to `vmx-15`

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-09-020826   True        False         28h     Cluster version is 4.10.0-0.nightly-2022-02-09-020826

How reproducible: 
  Always

Steps to Reproduce:

1. Install a 4.9.0 vSphere cluster with 3 master nodes and 2 worker nodes, all at hardware version vmx-13.
2. Upgrade the cluster to a 4.10 nightly, wait for the upgrade to complete, then check that the storage cluster operator (co) is not degraded and that the cluster reports Upgradeable=False.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0     True        False         20m     Cluster version is 4.9.0
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-09-020826 --force=true --allow-explicit-upgrade=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-09-020826

$ oc get node
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   82m   v1.23.3+759c22b
compute-1         Ready    worker   83m   v1.23.3+759c22b
control-plane-0   Ready    master   92m   v1.23.3+759c22b
control-plane-1   Ready    master   92m   v1.23.3+759c22b
control-plane-2   Ready    master   92m   v1.23.3+759c22b
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-09-020826   True        False         61s     Cluster version is 4.10.0-0.nightly-2022-02-09-020826
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2022-02-09-020826   True        False         False      15m
baremetal                                  4.10.0-0.nightly-2022-02-09-020826   True        False         False      90m
cloud-controller-manager                   4.10.0-0.nightly-2022-02-09-020826   True        False         False      92m
cloud-credential                           4.10.0-0.nightly-2022-02-09-020826   True        False         False      92m
cluster-autoscaler                         4.10.0-0.nightly-2022-02-09-020826   True        False         False      90m
config-operator                            4.10.0-0.nightly-2022-02-09-020826   True        False         False      92m
console                                    4.10.0-0.nightly-2022-02-09-020826   True        False         False      15m
csi-snapshot-controller                    4.10.0-0.nightly-2022-02-09-020826   True        False         False      91m
dns                                        4.10.0-0.nightly-2022-02-09-020826   True        False         False      90m
etcd                                       4.10.0-0.nightly-2022-02-09-020826   True        False         False      90m
image-registry                             4.10.0-0.nightly-2022-02-09-020826   True        False         False      13m
ingress                                    4.10.0-0.nightly-2022-02-09-020826   True        False         False      12m
insights                                   4.10.0-0.nightly-2022-02-09-020826   True        False         False      85m
kube-apiserver                             4.10.0-0.nightly-2022-02-09-020826   True        False         False      84m
kube-controller-manager                    4.10.0-0.nightly-2022-02-09-020826   True        False         False      89m
kube-scheduler                             4.10.0-0.nightly-2022-02-09-020826   True        False         False      90m
kube-storage-version-migrator              4.10.0-0.nightly-2022-02-09-020826   True        False         False      13m
machine-api                                4.10.0-0.nightly-2022-02-09-020826   True        False         False      87m
machine-approver                           4.10.0-0.nightly-2022-02-09-020826   True        False         False      91m
machine-config                             4.10.0-0.nightly-2022-02-09-020826   True        False         False      82m
marketplace                                4.10.0-0.nightly-2022-02-09-020826   True        False         False      90m
monitoring                                 4.10.0-0.nightly-2022-02-09-020826   True        False         False      81m
network                                    4.10.0-0.nightly-2022-02-09-020826   True        False         False      92m
node-tuning                                4.10.0-0.nightly-2022-02-09-020826   True        False         False      35m
openshift-apiserver                        4.10.0-0.nightly-2022-02-09-020826   True        False         False      86m
openshift-controller-manager               4.10.0-0.nightly-2022-02-09-020826   True        False         False      33m
openshift-samples                          4.10.0-0.nightly-2022-02-09-020826   True        False         False      36m
operator-lifecycle-manager                 4.10.0-0.nightly-2022-02-09-020826   True        False         False      91m
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2022-02-09-020826   True        False         False      91m
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2022-02-09-020826   True        False         False      87m
service-ca                                 4.10.0-0.nightly-2022-02-09-020826   True        False         False      91m
storage                                    4.10.0-0.nightly-2022-02-09-020826   True        False         False      36m
$ oc adm upgrade
Cluster version is 4.10.0-0.nightly-2022-02-09-020826

Upgradeable=False

  Reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_check_deprecated_hw_version
  Message: Cluster operator storage should not be upgraded between minor versions: VSphereCSIDriverOperatorCRUpgradeable: VMwareVSphereControllerUpgradeable: node control-plane-0 has hardware version vmx-13, which is below the minimum required version 15

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.9
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.10.0-0.nightly-2022-02-09-020826 not found in the "stable-4.9" channel
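
The Upgradeable=False message above compares each node's hardware version with a minimum of 15. The comparison can be sketched in Python as follows. This is only an illustration, not the operator's code: the function names are hypothetical, and it assumes every version string has the form `vmx-<number>`.

```python
import re
from typing import Dict, List

# Minimum taken from the Upgradeable=False message quoted above.
MIN_HW_VERSION = 15


def hw_version_number(hw_version: str) -> int:
    """Return the numeric part of a string such as 'vmx-13' (hypothetical helper)."""
    match = re.fullmatch(r"vmx-(\d+)", hw_version)
    if match is None:
        raise ValueError(f"unrecognized hardware version: {hw_version!r}")
    return int(match.group(1))


def nodes_below_minimum(node_versions: Dict[str, str],
                        minimum: int = MIN_HW_VERSION) -> List[str]:
    """Return the sorted names of nodes whose hardware version is below the minimum."""
    return sorted(
        name
        for name, version in node_versions.items()
        if hw_version_number(version) < minimum
    )
```

Comparing the parsed integers rather than the raw strings avoids ordering surprises such as "vmx-9" sorting after "vmx-15".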

3. Upgrade the hardware version of all nodes to vmx-15, one node at a time.
https://docs.openshift.com/container-platform/4.9/updating/updating-hardware-on-nodes-running-on-vsphere.html

4. After all nodes have been upgraded successfully, wait for the alert "CSIWithOldVSphereHWVersion" to disappear.

5. Check that the cluster becomes Upgradeable=True and that the CSI driver is installed automatically.

Actual results:
In step 4: After waiting about 24 hours, the alert "CSIWithOldVSphereHWVersion" still exists.
The hw_version_total metric still reports a vmx-13 node, although the vsphere-problem-detector-operator log shows all nodes at vmx-15.
The vmware-vsphere-csi-driver-operator log still reports that a worker has hardware version vmx-13, which is below the minimum required version 15.
$ oc project openshift-cluster-storage-operator
Now using project "openshift-cluster-storage-operator" on server "https://api.pewang0209vmc-13.qe.devcluster.openshift.com:6443".
$ oc exec deployment/vsphere-problem-detector-operator -- curl -ks -H "Authorization: Bearer $token" https://vsphere-problem-detector-metrics:8444/metrics | grep -i "hw_version_total"
# HELP vsphere_node_hw_version_total [ALPHA] Number of vSphere nodes with given HW version.
# TYPE vsphere_node_hw_version_total gauge
vsphere_node_hw_version_total{hw_version="vmx-13"} 1
vsphere_node_hw_version_total{hw_version="vmx-15"} 5
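
The gauge above adds up to 6 nodes, while the cluster in this report has only 5, which is one way to spot the stale vmx-13 entry. The following Python sketch parses the metric text and computes that difference. The function names are hypothetical, and the parser only handles the single-label lines shown above, not the full Prometheus exposition format.

```python
import re
from typing import Dict

# Matches only the single-label sample lines shown in the output above.
METRIC_LINE = re.compile(
    r'^vsphere_node_hw_version_total\{hw_version="([^"]+)"\}\s+(\S+)$'
)

SAMPLE_METRICS = """\
# HELP vsphere_node_hw_version_total [ALPHA] Number of vSphere nodes with given HW version.
# TYPE vsphere_node_hw_version_total gauge
vsphere_node_hw_version_total{hw_version="vmx-13"} 1
vsphere_node_hw_version_total{hw_version="vmx-15"} 5
"""


def parse_hw_version_metric(text: str) -> Dict[str, int]:
    """Map each hw_version label to its reported node count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(float(match.group(2)))
    return counts


def extra_nodes_reported(counts: Dict[str, int], actual_node_count: int) -> int:
    """Return how many more nodes the gauge reports than the cluster has."""
    return sum(counts.values()) - actual_node_count


if __name__ == "__main__":
    counts = parse_hw_version_metric(SAMPLE_METRICS)
    print(counts, extra_nodes_reported(counts, actual_node_count=5))
```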

$ oc logs vsphere-problem-detector-operator-6d56b44dd9-trtxt | grep "vmx-15"
I0209 16:58:26.702083       1 node_hw_version.go:46] Node control-plane-2 has HW version vmx-15
I0209 16:58:26.703180       1 node_hw_version.go:46] Node control-plane-1 has HW version vmx-15
I0209 16:58:26.703509       1 node_hw_version.go:46] Node control-plane-0 has HW version vmx-15
I0209 17:58:27.414917       1 node_hw_version.go:46] Node compute-0 has HW version vmx-15
I0209 17:58:27.416061       1 node_hw_version.go:46] Node compute-1 has HW version vmx-15
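
To compare the detector log with the metric, the log lines can be tallied per hardware version. The Python sketch below does that for lines in the format shown above. The names are hypothetical, and the sample holds only the five lines from this report, which were already filtered with `grep "vmx-15"`.

```python
import re
from collections import Counter
from typing import Dict

# Matches the message part of the node_hw_version.go log lines shown above.
NODE_LINE = re.compile(r"Node (\S+) has HW version (\S+)")

SAMPLE_LOG = """\
I0209 16:58:26.702083       1 node_hw_version.go:46] Node control-plane-2 has HW version vmx-15
I0209 16:58:26.703180       1 node_hw_version.go:46] Node control-plane-1 has HW version vmx-15
I0209 16:58:26.703509       1 node_hw_version.go:46] Node control-plane-0 has HW version vmx-15
I0209 17:58:27.414917       1 node_hw_version.go:46] Node compute-0 has HW version vmx-15
I0209 17:58:27.416061       1 node_hw_version.go:46] Node compute-1 has HW version vmx-15
"""


def latest_versions(log_text: str) -> Dict[str, str]:
    """Return the most recently logged hardware version for each node."""
    versions: Dict[str, str] = {}
    for line in log_text.splitlines():
        match = NODE_LINE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last report wins.
            versions[match.group(1)] = match.group(2)
    return versions


def tally(log_text: str) -> Counter:
    """Count nodes per hardware version, based on each node's latest log line."""
    return Counter(latest_versions(log_text).values())
```

For the sample, the tally contains only vmx-15 entries, so any vmx-13 series in the gauge does not correspond to a node the detector currently sees.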

$ oc -n openshift-cluster-csi-drivers logs deployment/vmware-vsphere-csi-driver-operator
I0210 14:04:19.192457       1 vspherecontroller.go:229] Scheduled the next check in 1h0m27.0258479s
W0210 14:04:19.192469       1 vspherecontroller.go:423] Marking cluster un-upgradeable because node compute-1 has hardware version vmx-13, which is below the minimum required version 15
I0210 14:14:04.176417       1 vmware.go:273] Found existing profile with same name: openshift-storage-policy-pewang0209vmc-13-dtxts
I0210 14:14:07.362694       1 vmware.go:273] Found existing profile with same name: openshift-storage-policy-pewang0209vmc-13-dtxts


In step 5: The cluster is still Upgradeable=False and the vSphere CSI driver is not installed automatically.


Expected results:
In step 4: the alert "CSIWithOldVSphereHWVersion" disappears.
In step 5: the cluster becomes Upgradeable=True and the CSI driver is installed automatically.

--- Additional comment from Penghao Wang on 2022-02-10 14:34:48 UTC ---

I am not sure whether the root cause is the same as in this customer-scenario bug:
https://bugzilla.redhat.com/show_bug.cgi?id=2042446

--- Additional comment from Jan Safranek on 2022-02-23 15:16:27 UTC ---

Good catch! I am going to close this in favor of a similar BZ from a customer.

--- Additional comment from Hemant Kumar on 2022-02-28 02:01:34 UTC ---

While the metric bug is the same as the linked one, the Upgradeable=False condition not being cleared is a different error and needs a fix, especially for 4.10.
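
The linked pull request is titled "Fix node check result cache between checks". The Python sketch below illustrates that general class of bug and is not taken from the operator: the class, its flag, and the way results are stored are all hypothetical. It assumes a checker that keeps failing nodes in memory and shows how a stale entry keeps the cluster un-upgradeable unless the stored results are cleared before each run.

```python
from typing import Dict, Set

MIN_HW_VERSION = 15  # minimum quoted in the operator log


class HardwareVersionCheck:
    """Hypothetical checker that remembers which nodes failed the check."""

    def __init__(self, clear_between_checks: bool) -> None:
        self.clear_between_checks = clear_between_checks
        self._old_nodes: Set[str] = set()

    def run(self, node_versions: Dict[str, str]) -> bool:
        """Return True when no remembered node is below the minimum."""
        if self.clear_between_checks:
            # Start every check from a clean slate.
            self._old_nodes.clear()
        for name, version in node_versions.items():
            # Assumes versions look like 'vmx-<number>'.
            if int(version.split("-")[1]) < MIN_HW_VERSION:
                self._old_nodes.add(name)
        return not self._old_nodes


if __name__ == "__main__":
    before = {"compute-1": "vmx-13"}
    after = {"compute-1": "vmx-15"}

    stale = HardwareVersionCheck(clear_between_checks=False)
    print(stale.run(before), stale.run(after))

    fixed = HardwareVersionCheck(clear_between_checks=True)
    print(fixed.run(before), fixed.run(after))
```

In this sketch the stale variant keeps reporting compute-1 after it has been upgraded, because nothing ever removes it from the remembered set. Restarting the process would also empty that in-memory state.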

--- Additional comment from OpenShift Automated Release Tooling on 2022-03-02 07:09:15 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release created.

Comment 1 Hemant Kumar 2022-03-03 15:00:45 UTC
Since it is easy to work around the bug by deleting and recreating the vmware-vsphere-csi-driver-operator pod, I am going to mark this as a non-blocker for 4.10.

Comment 10 errata-xmlrpc 2022-03-28 12:03:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1026
