Description of problem: When running smoke tests, there's a test called "sanity check" which checks HCO's conditions. These "sanity checks" are triggered before the test run and between the tests. The smoke tests are failing in the middle because, after a test for deleting a VM, HCO is reconciling. 10:17:34 ! _pytest.outcomes.Exit: HCO is unhealthy. Expected {'Available': 'True', 'Progressing': 'False', 'ReconcileComplete': 'True', 'Degraded': 'False', 'Upgradeable': 'True'}, Current: [{'lastTransitionTime': '2022-06-21T06:51:27Z', 10:17:34 'message': 'Reconcile completed successfully', 10:17:34 'observedGeneration': 2, 10:17:34 'reason': 'ReconcileCompleted', 10:17:34 'status': 'True', 10:17:34 'type': 'ReconcileComplete'}, {'lastTransitionTime': '2022-06-21T07:14:00Z', 10:17:34 'message': 'KubeVirt is not available: Deploying version ' 10:17:34 'sha256:2aac338ae0c7d554f7a66e26157a63595574744aa9230b9718f9d820556921bf ' 10:17:34 'with registry registry.redhat.io/container-native-virtualization', 10:17:34 'observedGeneration': 2, 10:17:34 'reason': 'KubeVirtNotAvailable', 10:17:34 'status': 'False', 10:17:34 'type': 'Available'}, {'lastTransitionTime': '2022-06-21T07:14:00Z', 10:17:34 'message': 'KubeVirt is progressing: Deploying version ' 10:17:34 'sha256:2aac338ae0c7d554f7a66e26157a63595574744aa9230b9718f9d820556921bf ' 10:17:34 'with registry registry.redhat.io/container-native-virtualization', 10:17:34 'observedGeneration': 2, 10:17:34 'reason': 'KubeVirtProgressing', 10:17:34 'status': 'True', 10:17:34 'type': 'Progressing'}, {'lastTransitionTime': '2022-06-21T06:55:38Z', 10:17:34 'message': 'Reconcile completed successfully', 10:17:34 'observedGeneration': 2, 10:17:34 'reason': 'ReconcileCompleted', 10:17:34 'status': 'False', 10:17:34 'type': 'Degraded'}, {'lastTransitionTime': '2022-06-21T07:14:00Z', 10:17:34 'message': 'KubeVirt is progressing: Deploying version ' 10:17:34 'sha256:2aac338ae0c7d554f7a66e26157a63595574744aa9230b9718f9d820556921bf ' 10:17:34 'with registry registry.redhat.io/container-native-virtualization', 10:17:34 'observedGeneration': 2, 10:17:34 'reason': 'KubeVirtProgressing', 10:17:34 'status': 'False', 10:17:34 'type': 'Upgradeable'}] ! Version-Release number of selected component (if applicable): v4.11.0-496, v4.11.0-505, v4.11.0-506, v4.11.0-507 How reproducible: 100 Steps to Reproduce: 1. Deploy CNV. 2. Run smoke tests. 3. Actual results: Expected results: Additional info: Some logs from the hco-operator pod: {"level":"error","ts":1655795579.6553278,"logger":"controller_hyperconverged","msg":"Failed to update HCO Status","Request.Namespace":"openshift-c nv","Request.Name":"kubevirt-hyperconverged","error":"Operation cannot be fulfilled on hyperconvergeds.hco.kubevirt.io \"kubevirt-hyperconverged\" : the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/kubevirt/hyperconverged-cl uster-operator/controllers/hyperconverged.(*ReconcileHyperConverged).updateHyperConverged\n\t/remote-source/app/controllers/hyperconverged/hyperco nverged_controller.go:573\ngithub.com/kubevirt/hyperconverged-cluster-operator/controllers/hyperconverged.(*ReconcileHyperConverged).Reconcile\n\t /remote-source/app/controllers/hyperconverged/hyperconverged_controller.go:321\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controlle r).Reconcile\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime /pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/contro ller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/con troller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/ remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"} {"level":"error","ts":1655795579.6554117,"logger":"controller.hyperconverged-controller","msg":"Reconciler error","name":"hco-controlled-cr-6e97b2 42-5873-4124-943f-6694ef9ae236","namespace":"openshift-cnv","error":"Operation cannot be fulfilled on hyperconvergeds.hco.kubevirt.io \"kubevirt-h yperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller -runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/contro ller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.i o/controller-runtime/pkg/internal/controller/controller.go:227"} {"level":"info","ts":1655795579.6609025,"logger":"controller_hyperconverged","msg":"The reconciliation got triggered by a secondary CR object","Re quest.Namespace":"openshift-cnv","Request.Name":"hco-controlled-cr-6e97b242-5873-4124-943f-6694ef9ae236"}
I can see the exact same issue in a new build - 532
Following the comment above, I could reproduce it in 2 out of 3 times that I have tried.
Could we get the update on this bug fix? Thanks.
Might be related to https://bugzilla.redhat.com/show_bug.cgi?id=2079916#c6
@kmajcher, We intentionally check HCO status once, instead of waiting for it to come to stable test. Reason being: 1) if a test is causing HCO to go to unhealthy, it should do the right thing and wait for it to come back to healthy state 2) if HCO is unhealthy we should not be running any tests in such cluster and use such results as a gating criteria. cnv-tests-run-in-progress is a recent addition to our test framework. This is created at the beginning of a run and is cleaned up at the end of a run. This ensures, no one can accidentally start a test against a cluster that is in use. This call happened as after running test tests/compute/ssp/supported_os/common_templates/rhel/test_rhel_os_support.py, before starting next test tests/compute/virt/general/test_container_disk_vm.py, we ran cluster sanity and found Kubevirt to be in deploying state. So we exited the run and before that cleaned up the configmap to release lock on the cluster. This is expected behavior. I also want to point, in this specific case, `KubeVirt is progressing: Deploying version ` indicates the deploy is not complete. This is NOT caused by any test and in such cases when deploy is in progress, if we run tests against such cluster, we can't use the test for qualifying a build, as the cluster was not fully deployed. Please let me know if you need more information.
Assigning to Virt team to continue with the investigation.
Suggestion: It's probably due to my lack of experience with these tests, but the errors logged were very difficult to understand for me, namely where the error happened and what caused it. If possible, I think it would be a good idea to have more logging in the tests, for example, in this case, stating that the error happened while waiting for the cluster to be stable, and therefore the tests were exiting and cleaning up some related resources.
*** This bug has been marked as a duplicate of bug 2079916 ***