Bug 2042446

Summary: CSIWithOldVSphereHWVersion alert recurring despite upgrade to vmx-15
Product: OpenShift Container Platform Reporter: Kai-Uwe Rommel <kai-uwe.rommel>
Component: Storage Assignee: Hemant Kumar <hekumar>
Storage sub component: Operators QA Contact: Penghao Wang <pewang>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: chdeshpa, eparis, hekumar, jsafrane, mmahmoud, pewang, sdharma, snetting
Version: 4.9   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2059255 (view as bug list) Environment:
Last Closed: 2022-08-10 10:43:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2059255    

Description Kai-Uwe Rommel 2022-01-19 14:28:39 UTC
Description of problem:
After upgrading clusters running on vSphere UPI from 4.8 to 4.9, the alert CSIWithOldVSphereHWVersion appears because the node VMs are still on vmx-13. After upgrading the node VMs to vmx-15, the alert does not go away.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
so far, always

Steps to Reproduce:
1. upgrade cluster from 4.8 to 4.9
2. see the CSIWithOldVSphereHWVersion alert
3. upgrade all VMs to vmx-15
4. still see the same alert

Actual results:
alert stays

Expected results:
alert should clear

Additional info:

Comment 1 Kai-Uwe Rommel 2022-01-19 14:30:42 UTC
I checked the logs of the vSphere problem detector and found this:
C:\Work>oc logs vsphere-problem-detector-operator-5bd788bf59-rrzb5 | fgrep vmx-                            
I0118 14:02:38.246614       1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.259371       1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.263086       1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.283624       1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.566010       1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.623278       1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-13  
I0118 14:02:38.774871       1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.296778       1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.297563       1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.300421       1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.307970       1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.309944       1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.312096       1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15  
I0118 22:02:37.317984       1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.241360       1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.255881       1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.257634       1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.274839       1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.285984       1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.302934       1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15  
I0119 06:02:37.304449       1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.308002       1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.308707       1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.312121       1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.329824       1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.335922       1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.359976       1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15  
I0119 14:02:37.360943       1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15  

The 14:02 run on Jan 18 still reports vmx-13 for six of the seven nodes. I then upgraded all VMs, and every later run shows vmx-15 for all nodes, but the alert does not clear even after 24 hours.
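
To double-check a log excerpt like the one above, the most recent HW version per node can be extracted with a short script. This is only a minimal sketch: it assumes the line format shown in the excerpt and that the lines are in chronological order. The helper names `latest_hw_versions` and `nodes_below` are made up for illustration and are not part of the detector.

```python
import re

# Assumed line format, taken from the excerpt above:
# I0118 14:02:38.246614       1 node_hw_version.go:54] Node <name> has HW version vmx-NN
LINE_RE = re.compile(
    r"^I\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ node_hw_version\.go:\d+\] "
    r"Node (?P<node>\S+) has HW version (?P<hw>vmx-\d+)$"
)


def latest_hw_versions(log_text):
    """Return {node: hw_version}, keeping the last matching line per node."""
    latest = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Later lines overwrite earlier ones, so the newest check wins.
            latest[match.group("node")] = match.group("hw")
    return latest


def nodes_below(latest, minimum=15):
    """Return the sorted node names whose latest HW version is below `minimum`."""
    return sorted(
        node for node, hw in latest.items()
        if int(hw.split("-")[1]) < minimum
    )
```

Feeding it the output of `oc logs <pod>` should return an empty list from `nodes_below` once every node's latest check reports vmx-15, which is the state the log above shows from the 22:02 run onwards.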

Comment 2 Kai-Uwe Rommel 2022-01-19 14:36:14 UTC
A couple of days ago I installed a new cluster from scratch with 4.9 on vSphere UPI, with the VMs on vmx-15 from the beginning, and I do not see the alert there. (That fresh cluster is an OKD cluster, but I also see the problem reported above on OKD clusters upgraded from 4.8 to 4.9.)

Comment 4 Kai-Uwe Rommel 2022-01-27 20:13:01 UTC
When I deleted the vsphere-problem-detector-operator, its log was gone and the new one started from scratch.
And the alert disappeared. So the problem is that the alert is not cleared even when all newer checks report vmx-15.
Instead it seems to trigger as long as any vmx-13 is anywhere earlier in the log.

Comment 5 Adam Kaplan 2022-02-08 19:47:28 UTC
vsphere-problem-detector belongs in the "Operators" subcomponent.

Comment 6 Jan Safranek 2022-02-23 15:16:27 UTC
*** Bug 2053104 has been marked as a duplicate of this bug. ***

Comment 7 Hemant Kumar 2022-02-25 18:12:41 UTC
This is happening because of the semantics of Prometheus metric emission. We need to fix the way metrics are emitted, or ensure that we emit an additional metric with "hw-13:0" once all the hardware versions are upgraded.
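
The effect can be illustrated with a toy model of a labelled gauge. This is a simplified sketch, not the detector's code and not the actual fix. `GaugeVec`, the `report_*` functions and `alert_fires` are hypothetical names, and the alert condition (any old-version series above zero) is an assumption. The point is that a series written once for "vmx-13" keeps its last value until the process restarts, the gauge is reset, or a zero is written explicitly.

```python
from collections import Counter


class GaugeVec:
    """Toy stand-in for a labelled gauge: one stored sample per label value."""

    def __init__(self):
        self.samples = {}

    def set(self, label, value):
        self.samples[label] = value

    def reset(self):
        self.samples.clear()


def report_naive(gauge, node_versions):
    """Only writes labels seen in this run; older labels keep their last value."""
    for hw, count in Counter(node_versions.values()).items():
        gauge.set(hw, count)


def report_with_reset(gauge, node_versions):
    """Drops all previous series before writing the current counts."""
    gauge.reset()
    report_naive(gauge, node_versions)


def report_with_zero(gauge, node_versions, known=("vmx-13",)):
    """Writes an explicit 0 for known old versions that are no longer present."""
    counts = Counter(node_versions.values())
    for hw in set(known) | set(counts):
        gauge.set(hw, counts.get(hw, 0))


def alert_fires(gauge, old=("vmx-13",)):
    """Assumed alert condition: any old-version series is above zero."""
    return any(gauge.samples.get(hw, 0) > 0 for hw in old)
```

In this model, calling `report_naive` first with a vmx-13 node and then with all nodes on vmx-15 leaves the alert firing. Either variant that resets the gauge or writes an explicit zero clears it. It also matches the observation in comment 4 that restarting the operator makes the alert disappear, since a new process starts with no stored series.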

Comment 8 Hemant Kumar 2022-02-25 20:48:25 UTC
Opened https://github.com/openshift/vsphere-problem-detector/pull/77 to fix it. I will open backport requests for 4.10 and 4.9.

Comment 24 errata-xmlrpc 2022-08-10 10:43:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 25 Red Hat Bugzilla 2023-09-15 01:51:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days