Bug 2060214 - CSIWithOldVSphereHWVersion alert recurring despite upgrade to vmx-15
Summary: CSIWithOldVSphereHWVersion alert recurring despite upgrade to vmx-15
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.9
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: Hemant Kumar
QA Contact: Penghao Wang
URL:
Whiteboard:
Depends On: 2059255
Blocks:
 
Reported: 2022-03-03 00:51 UTC by Hemant Kumar
Modified: 2023-09-15 01:52 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2059255
Environment:
Last Closed: 2022-04-13 05:10:40 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift vsphere-problem-detector pull 82 0 None open Bug 2060214: Fix stale metrics 2022-03-29 18:04:43 UTC
Red Hat Product Errata RHBA-2022:1245 0 None None None 2022-04-13 05:10:49 UTC

Description Hemant Kumar 2022-03-03 00:51:57 UTC
+++ This bug was initially created as a clone of Bug #2059255 +++

+++ This bug was initially created as a clone of Bug #2042446 +++

Description of problem:
After upgrading clusters running on vSphere UPI from 4.8 to 4.9, the alert CSIWithOldVSphereHWVersion appears because the node VMs are still on vmx-13. After upgrading the node VMs to vmx-15, the alert does not go away.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
so far, always

Steps to Reproduce:
1. upgrade cluster from 4.8 to 4.9
2. see the CSIWithOldVSphereHWVersion alert
3. upgrade all VMs to vmx-15
4. still see the same alert

Actual results:
alert stays

Expected results:
alert should clear

Additional info:

--- Additional comment from Kai-Uwe Rommel on 2022-01-19 14:30:42 UTC ---

I checked the logs of the vSphere problem detector and found this:
C:\Work>oc logs vsphere-problem-detector-operator-5bd788bf59-rrzb5 | fgrep vmx-                            
I0118 14:02:38.246614       1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.259371       1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.263086       1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.283624       1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.566010       1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.623278       1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.774871       1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.296778       1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.297563       1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.300421       1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.307970       1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.309944       1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.312096       1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.317984       1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.241360       1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.255881       1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.257634       1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.274839       1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.285984       1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.302934       1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.304449       1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.308002       1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.308707       1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.312121       1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.329824       1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.335922       1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.359976       1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.360943       1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15  

You can see vmx-13 being reported in the afternoon. I then upgraded all VMs, and all later log entries show vmx-15, but the alert does not clear even after 24 hours.

--- Additional comment from Kai-Uwe Rommel on 2022-01-19 14:36:14 UTC ---

A couple of days ago I installed a new cluster from scratch with 4.9 on vSphere UPI, with the VMs being vmx-15 from the beginning, and do not see the alert. (This fresh cluster is an OKD cluster, but I see the problem reported above also on OKD clusters upgraded from 4.8 to 4.9.)

--- Additional comment from Sudha Ponnaganti on 2022-01-27 20:03:29 UTC ---

Is this only an OKD issue?

--- Additional comment from Kai-Uwe Rommel on 2022-01-27 20:13:01 UTC ---

When I deleted the vsphere-problem-detector-operator pod, its log was gone and the new one started from scratch, and the alert disappeared. So the problem is that the alert is not cleared when all newer checks detect vmx-15.
Instead, it seems to keep firing as long as any vmx-13 appears anywhere earlier in the log.

--- Additional comment from Adam Kaplan on 2022-02-08 19:47:28 UTC ---

vsphere-problem-detector belongs in the "Operators" subcomponent.

--- Additional comment from Jan Safranek on 2022-02-23 15:16:27 UTC ---



--- Additional comment from Hemant Kumar on 2022-02-25 18:12:41 UTC ---

This is happening because of the semantics of Prometheus metric emission. We need to fix the way the metrics are emitted, or ensure that we emit an additional metric series with "hw-13: 0" once all the hardware versions have been upgraded.
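The stale-series behavior can be sketched with a minimal gauge model. This is plain Python with hypothetical names standing in for the labeled vsphere_node_hw_version_total gauge (the real detector uses the Prometheus Go client); it is an illustration of the failure mode, not the actual detector code:

```python
# Sketch of the stale-metric problem behind CSIWithOldVSphereHWVersion.
# A labeled gauge keeps one time series per hw_version label. If a check
# round only sets the versions it currently sees, a series written by an
# earlier round (vmx-13) survives with its old value, and any alert that
# matches "old HW series > 0" keeps firing forever.

from collections import Counter

gauge = {}  # hw_version label -> value; models vsphere_node_hw_version_total

def emit_naive(node_versions):
    # Buggy behavior: only update the labels seen in this round.
    for version, count in Counter(node_versions).items():
        gauge[version] = count

def emit_fixed(node_versions):
    # Fixed behavior: first reset every previously known series to 0,
    # so versions that disappeared now report 0 and the alert can clear.
    for version in gauge:
        gauge[version] = 0
    for version, count in Counter(node_versions).items():
        gauge[version] = count

def alert_firing():
    # Rough stand-in for the alert expression: any old-HW series > 0.
    return gauge.get("vmx-13", 0) > 0

emit_naive(["vmx-13"] * 6 + ["vmx-15"])   # before the VM upgrade
emit_naive(["vmx-15"] * 7)                # after: the vmx-13 series is stale
print(alert_firing())                     # True -- alert never clears

emit_fixed(["vmx-15"] * 7)                # with the fix, vmx-13 reports 0
print(alert_firing())                     # False
```

The same effect can be achieved in client_golang by resetting or deleting the stale label series before re-emitting, which is the general shape of the fix in the linked pull requests.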

--- Additional comment from Hemant Kumar on 2022-02-25 20:48:25 UTC ---

Opened https://github.com/openshift/vsphere-problem-detector/pull/77 to fix it. Will open backport requests for 4.10 and 4.9.

--- Additional comment from OpenShift Automated Release Tooling on 2022-02-27 03:14:34 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

--- Additional comment from OpenShift BugZilla Robot on 2022-02-27 18:23:53 UTC ---

Bugfix included in accepted release 4.11.0-0.nightly-2022-02-27-122819
Bug will not be automatically moved to VERIFIED for the following reasons:
- PR openshift/vsphere-problem-detector#77 not approved by QA contact

This bug must now be manually moved to VERIFIED by wduan

--- Additional comment from Penghao Wang on 2022-02-28 10:22:35 UTC ---

Verified: passed. Cluster version is 4.11.0-0.nightly-2022-02-27-122819

Verify Steps:
1. Install one 4.9 nightly vSphere cluster with 6 nodes (3 masters and 3 workers) with hardware version 13
2. Check the vsphere_node_hw_version_total metric and the CSIWithOldVSphereHWVersion alert
3. Upgrade the cluster to 4.11.0-0.nightly-2022-02-27-122819
4. Upgrade all the nodes' hardware version to 15 serially
5. Check that the vsphere_node_hw_version_total metric is correct and the CSIWithOldVSphereHWVersion alert disappeared

 wangpenghao@MacBook-Pro  ~  oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-27-122819   True        False         3h50m   Cluster version is 4.11.0-0.nightly-2022-02-27-122819
 wangpenghao@MacBook-Pro  ~  oc get node
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   7h56m   v1.23.3+7478cf2
compute-1         Ready    worker   7h56m   v1.23.3+7478cf2
compute-2         Ready    worker   7h56m   v1.23.3+7478cf2
control-plane-0   Ready    master   8h      v1.23.3+7478cf2
control-plane-1   Ready    master   8h      v1.23.3+7478cf2
control-plane-2   Ready    master   8h      v1.23.3+7478cf2
 wangpenghao@MacBook-Pro  ~  oc project openshift-cluster-storage-operator
Now using project "openshift-cluster-storage-operator" on server "https://api.pewang-0228vmc49.qe.devcluster.openshift.com:6443".
 wangpenghao@MacBook-Pro  ~  token=`oc sa get-token prometheus-k8s -n openshift-monitoring`

 ✘ wangpenghao@MacBook-Pro  ~  oc exec deployment/vsphere-problem-detector-operator -- curl -ks -H "Authorization: Bearer $token" https://vsphere-problem-detector-metrics:8444/metrics |grep "node_hw"
# HELP vsphere_node_hw_version_total [ALPHA] Number of vSphere nodes with given HW version.
# TYPE vsphere_node_hw_version_total gauge
vsphere_node_hw_version_total{hw_version="vmx-15"} 6

Comment 8 errata-xmlrpc 2022-04-13 05:10:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.28 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1245

Comment 10 Red Hat Bugzilla 2023-09-15 01:52:28 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

