+++ This bug was initially created as a clone of Bug #2059255 +++
+++ This bug was initially created as a clone of Bug #2042446 +++

Description of problem:

After upgrading clusters running on vSphere UPI from 4.8 to 4.9, the alert CSIWithOldVSphereHWVersion appears because the node VMs are still on vmx-13. After upgrading the node VMs to vmx-15, the alert does not go away.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
So far, always.

Steps to Reproduce:
1. Upgrade cluster from 4.8 to 4.9.
2. See the CSIWithOldVSphereHWVersion alert.
3. Upgrade all VMs to vmx-15.
4. Still see the same alert.

Actual results:
Alert stays.

Expected results:
Alert should clear.

Additional info:

--- Additional comment from Kai-Uwe Rommel on 2022-01-19 14:30:42 UTC ---

I checked the logs of the vSphere problem detector and found this:

C:\Work>oc logs vsphere-problem-detector-operator-5bd788bf59-rrzb5 | fgrep vmx-
I0118 14:02:38.246614 1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-13
I0118 14:02:38.259371 1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-13
I0118 14:02:38.263086 1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-13
I0118 14:02:38.283624 1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-13
I0118 14:02:38.566010 1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-13
I0118 14:02:38.623278 1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-13
I0118 14:02:38.774871 1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15
I0118 22:02:37.296778 1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15
I0118 22:02:37.297563 1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15
I0118 22:02:37.300421 1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15
I0118 22:02:37.307970 1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15
I0118 22:02:37.309944 1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15
I0118 22:02:37.312096 1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15
I0118 22:02:37.317984 1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15
I0119 06:02:37.241360 1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15
I0119 06:02:37.255881 1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15
I0119 06:02:37.257634 1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15
I0119 06:02:37.274839 1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15
I0119 06:02:37.285984 1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15
I0119 06:02:37.302934 1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15
I0119 06:02:37.304449 1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15
I0119 14:02:37.308002 1 node_hw_version.go:54] Node worker-03.ocp-demo.ars.de has HW version vmx-15
I0119 14:02:37.308707 1 node_hw_version.go:54] Node master-01.ocp-demo.ars.de has HW version vmx-15
I0119 14:02:37.312121 1 node_hw_version.go:54] Node worker-01.ocp-demo.ars.de has HW version vmx-15
I0119 14:02:37.329824 1 node_hw_version.go:54] Node master-03.ocp-demo.ars.de has HW version vmx-15
I0119 14:02:37.335922 1 node_hw_version.go:54] Node master-02.ocp-demo.ars.de has HW version vmx-15
I0119 14:02:37.359976 1 node_hw_version.go:54] Node worker-04.ocp-demo.ars.de has HW version vmx-15
I0119 14:02:37.360943 1 node_hw_version.go:54] Node worker-02.ocp-demo.ars.de has HW version vmx-15

You can see vmx-13 being reported in the afternoon. I then upgraded all VMs, and all later logs show vmx-15, but the alert does not clear even after 24 hours.

--- Additional comment from Kai-Uwe Rommel on 2022-01-19 14:36:14 UTC ---

A couple of days ago I installed a new cluster from scratch with 4.9 on vSphere UPI, with the VMs being vmx-15 from the beginning, and I do not see the alert. (This fresh cluster is an OKD cluster, but I see the problem reported above also on OKD clusters upgraded from 4.8 to 4.9.)

--- Additional comment from Sudha Ponnaganti on 2022-01-27 20:03:29 UTC ---

Is this only an OKD issue?

--- Additional comment from Kai-Uwe Rommel on 2022-01-27 20:13:01 UTC ---

When I deleted the vsphere-problem-detector-operator pod, its log was gone and the new one started from scratch. And the alert disappeared. So the problem is that the alert is not cleared when all newer checks report vmx-15. Instead, it seems to keep firing as long as any vmx-13 appears anywhere earlier in the log.

--- Additional comment from Adam Kaplan on 2022-02-08 19:47:28 UTC ---

vsphere-problem-detector belongs in the "Operators" subcomponent.

--- Additional comment from Jan Safranek on 2022-02-23 15:16:27 UTC ---

--- Additional comment from Hemant Kumar on 2022-02-25 18:12:41 UTC ---

This is happening because of the semantics of metric emission in Prometheus. We need to fix the way the metrics are emitted, or ensure that we emit an additional metric with "hw-13:0" (the old hardware version with a value of 0) once all the hardware versions are upgraded.

--- Additional comment from Hemant Kumar on 2022-02-25 20:48:25 UTC ---

Opened https://github.com/openshift/vsphere-problem-detector/pull/77 to fix it. Will open backport requests for 4.10 and 4.9.
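For context on the metric-emission semantics Hemant mentions: in the Go Prometheus client, a GaugeVec remembers every label combination it has ever been given, so a series such as {hw_version="vmx-13"}, once set, keeps its last value across later collection cycles unless the collector explicitly deletes or zeroes it. The sketch below is a minimal illustration of that failure mode and one possible remedy; the names and the Reset() approach are assumptions for illustration, not the operator's actual code and not necessarily the exact fix taken in PR #77.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative stand-in for the detector's vsphere_node_hw_version_total gauge.
var hwVersionTotal = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "vsphere_node_hw_version_total",
		Help: "Number of vSphere nodes with given HW version.",
	},
	[]string{"hw_version"},
)

// recordScan publishes the node counts from one detector run. Without the
// Reset() call, a series like {hw_version="vmx-13"} that was set once keeps
// its last value forever, even when later scans no longer see any vmx-13
// node. That matches the behavior in this bug: the alert keeps firing until
// the operator pod (and with it the in-memory series) is recreated.
func recordScan(counts map[string]int) {
	hwVersionTotal.Reset() // drop series left over from earlier scans
	for version, n := range counts {
		hwVersionTotal.WithLabelValues(version).Set(float64(n))
	}
}

func main() {
	prometheus.MustRegister(hwVersionTotal)

	recordScan(map[string]int{"vmx-13": 6, "vmx-15": 1}) // before the VM upgrade
	recordScan(map[string]int{"vmx-15": 7})              // after the VM upgrade

	// After the second scan, only the vmx-15 series remains exported.
	fmt.Println("stale vmx-13 series removed")
}
```

Either clearing stale series on every scan, or (as suggested above) continuing to emit the outdated version with a value of 0, lets the alert expression stop matching once no node reports vmx-13. It would also explain why deleting the operator pod cleared the alert: the stale in-memory series were lost on restart.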
--- Additional comment from OpenShift Automated Release Tooling on 2022-02-27 03:14:34 UTC ---

Elliott changed bug status from MODIFIED to ON_QA. This bug is expected to ship in the next 4.11 release.

--- Additional comment from OpenShift BugZilla Robot on 2022-02-27 18:23:53 UTC ---

Bugfix included in accepted release 4.11.0-0.nightly-2022-02-27-122819
Bug will not be automatically moved to VERIFIED for the following reasons:
- PR openshift/vsphere-problem-detector#77 not approved by QA contact

This bug must now be manually moved to VERIFIED by wduan

--- Additional comment from Penghao Wang on 2022-02-28 10:22:35 UTC ---

Verified: pass

Cluster version is 4.11.0-0.nightly-2022-02-27-122819

Verify Steps:
1. Install one 4.9 nightly vSphere cluster with 6 nodes (3 masters and 3 workers) with hardware version 13.
2. Check the vsphere_node_hw_version_total metric and the CSIWithOldVSphereHWVersion alert.
3. Upgrade the cluster to 4.11.0-0.nightly-2022-02-27-122819.
4. Upgrade all the nodes' hardware versions to 15 serially.
5. Check that the vsphere_node_hw_version_total metric is correct and the CSIWithOldVSphereHWVersion alert disappeared.

wangpenghao@MacBook-Pro ~ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-27-122819   True        False         3h50m   Cluster version is 4.11.0-0.nightly-2022-02-27-122819

wangpenghao@MacBook-Pro ~ oc get node
NAME              STATUS   ROLES    AGE     VERSION
compute-0         Ready    worker   7h56m   v1.23.3+7478cf2
compute-1         Ready    worker   7h56m   v1.23.3+7478cf2
compute-2         Ready    worker   7h56m   v1.23.3+7478cf2
control-plane-0   Ready    master   8h      v1.23.3+7478cf2
control-plane-1   Ready    master   8h      v1.23.3+7478cf2
control-plane-2   Ready    master   8h      v1.23.3+7478cf2

wangpenghao@MacBook-Pro ~ oc project openshift-cluster-storage-operator
Now using project "openshift-cluster-storage-operator" on server "https://api.pewang-0228vmc49.qe.devcluster.openshift.com:6443".

wangpenghao@MacBook-Pro ~ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`

✘ wangpenghao@MacBook-Pro ~ oc exec deployment/vsphere-problem-detector-operator -- curl -ks -H "Authorization: Bearer $token" https://vsphere-problem-detector-metrics:8444/metrics |grep "node_hw"
# HELP vsphere_node_hw_version_total [ALPHA] Number of vSphere nodes with given HW version.
# TYPE vsphere_node_hw_version_total gauge
vsphere_node_hw_version_total{hw_version="vmx-15"} 6
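As a side note for anyone scripting this verification: the curl output above is in the Prometheus text exposition format, which can be parsed programmatically instead of grepped. Below is a small Go sketch using the expfmt parser from prometheus/common; the plain-HTTP localhost endpoint is a simplifying assumption, since the real detector endpoint is HTTPS on port 8444 and requires a bearer token, as the curl command above shows.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Assumed plain-HTTP endpoint for illustration only; the in-cluster
	// endpoint is HTTPS and needs an Authorization header.
	resp, err := http.Get("http://localhost:8080/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the text exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Print every hw_version series; after a successful upgrade, only
	// vmx-15 (or newer) should appear.
	if mf, ok := families["vsphere_node_hw_version_total"]; ok {
		for _, m := range mf.GetMetric() {
			for _, label := range m.GetLabel() {
				if label.GetName() == "hw_version" {
					fmt.Printf("%s: %v node(s)\n", label.GetValue(), m.GetGauge().GetValue())
				}
			}
		}
	}
}
```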
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.28 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1245