Bug 2099635 - HCO reconciling during smoke tests
Summary: HCO reconciling during smoke tests
Keywords:
Status: CLOSED DUPLICATE of bug 2079916
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.11.0
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: 4.11.0
Assignee: lpivarc
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-06-21 11:34 UTC by Mor Cohen
Modified: 2022-06-30 08:19 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-30 08:19:29 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Mor Cohen 2022-06-21 11:34:44 UTC
Description of problem:
When running smoke tests, there's a test called "sanity check" which checks HCO's conditions. These "sanity checks" are triggered before the test run and between the tests.

The smoke tests are failing in the middle because, after a test for deleting a VM, HCO is reconciling.

10:17:34  ! _pytest.outcomes.Exit: HCO is unhealthy. Expected {'Available': 'True', 'Progressing': 'False', 'ReconcileComplete': 'True', 'Degraded': 'False', 'Upgradeable': 'True'}, Current: [{'lastTransitionTime': '2022-06-21T06:51:27Z',
10:17:34   'message': 'Reconcile completed successfully',
10:17:34   'observedGeneration': 2,
10:17:34   'reason': 'ReconcileCompleted',
10:17:34   'status': 'True',
10:17:34   'type': 'ReconcileComplete'}, {'lastTransitionTime': '2022-06-21T07:14:00Z',
10:17:34   'message': 'KubeVirt is not available: Deploying version '
10:17:34              'sha256:2aac338ae0c7d554f7a66e26157a63595574744aa9230b9718f9d820556921bf '
10:17:34              'with registry registry.redhat.io/container-native-virtualization',
10:17:34   'observedGeneration': 2,
10:17:34   'reason': 'KubeVirtNotAvailable',
10:17:34   'status': 'False',
10:17:34   'type': 'Available'}, {'lastTransitionTime': '2022-06-21T07:14:00Z',
10:17:34   'message': 'KubeVirt is progressing: Deploying version '
10:17:34              'sha256:2aac338ae0c7d554f7a66e26157a63595574744aa9230b9718f9d820556921bf '
10:17:34              'with registry registry.redhat.io/container-native-virtualization',
10:17:34   'observedGeneration': 2,
10:17:34   'reason': 'KubeVirtProgressing',
10:17:34   'status': 'True',
10:17:34   'type': 'Progressing'}, {'lastTransitionTime': '2022-06-21T06:55:38Z',
10:17:34   'message': 'Reconcile completed successfully',
10:17:34   'observedGeneration': 2,
10:17:34   'reason': 'ReconcileCompleted',
10:17:34   'status': 'False',
10:17:34   'type': 'Degraded'}, {'lastTransitionTime': '2022-06-21T07:14:00Z',
10:17:34   'message': 'KubeVirt is progressing: Deploying version '
10:17:34              'sha256:2aac338ae0c7d554f7a66e26157a63595574744aa9230b9718f9d820556921bf '
10:17:34              'with registry registry.redhat.io/container-native-virtualization',
10:17:34   'observedGeneration': 2,
10:17:34   'reason': 'KubeVirtProgressing',
10:17:34   'status': 'False',
10:17:34   'type': 'Upgradeable'}] !

Version-Release number of selected component (if applicable):
v4.11.0-496, v4.11.0-505, v4.11.0-506, v4.11.0-507

How reproducible:
100

Steps to Reproduce:
1. Deploy CNV.
2. Run smoke tests.
3.

Actual results:


Expected results:


Additional info:

Some logs from the hco-operator pod:

{"level":"error","ts":1655795579.6553278,"logger":"controller_hyperconverged","msg":"Failed to update HCO Status","Request.Namespace":"openshift-c
nv","Request.Name":"kubevirt-hyperconverged","error":"Operation cannot be fulfilled on hyperconvergeds.hco.kubevirt.io \"kubevirt-hyperconverged\"
: the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/kubevirt/hyperconverged-cl
uster-operator/controllers/hyperconverged.(*ReconcileHyperConverged).updateHyperConverged\n\t/remote-source/app/controllers/hyperconverged/hyperco
nverged_controller.go:573\ngithub.com/kubevirt/hyperconverged-cluster-operator/controllers/hyperconverged.(*ReconcileHyperConverged).Reconcile\n\t
/remote-source/app/controllers/hyperconverged/hyperconverged_controller.go:321\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controlle
r).Reconcile\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime
/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/contro
ller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/con
troller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/
remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":1655795579.6554117,"logger":"controller.hyperconverged-controller","msg":"Reconciler error","name":"hco-controlled-cr-6e97b2
42-5873-4124-943f-6694ef9ae236","namespace":"openshift-cnv","error":"Operation cannot be fulfilled on hyperconvergeds.hco.kubevirt.io \"kubevirt-h
yperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller
-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/contro
ller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.i
o/controller-runtime/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":1655795579.6609025,"logger":"controller_hyperconverged","msg":"The reconciliation got triggered by a secondary CR object","Re
quest.Namespace":"openshift-cnv","Request.Name":"hco-controlled-cr-6e97b242-5873-4124-943f-6694ef9ae236"}

Comment 1 Mor Cohen 2022-06-26 07:16:35 UTC
I can see the exact same issue in a new build - 532

Comment 2 Mor Cohen 2022-06-26 07:46:40 UTC
Following the comment above,
I could reproduce it in 2 out of 3 times that I have tried.

Comment 3 Ying Cui 2022-06-27 08:01:00 UTC
Could we get the update on this bug fix? Thanks.

Comment 4 Krzysztof Majcher 2022-06-27 13:25:29 UTC
Might be related to https://bugzilla.redhat.com/show_bug.cgi?id=2079916#c6

Comment 5 Krzysztof Majcher 2022-06-27 13:25:49 UTC
Might be related to https://bugzilla.redhat.com/show_bug.cgi?id=2079916#c6

Comment 8 Debarati Basu-Nag 2022-06-27 15:36:03 UTC
@kmajcher, We intentionally check HCO status once, instead of waiting for it to come to stable test. Reason being: 1) if a test is causing HCO to go to unhealthy, it should do the right thing and wait for it to come back to healthy state 2) if HCO is unhealthy we should not be running any tests in such cluster and use such results as a gating criteria.

cnv-tests-run-in-progress is a recent addition to our test framework. This is created at the beginning of a run and is cleaned up at the end of a run. This ensures, no one can accidentally start a test against a cluster that is in use. This call happened as after running test tests/compute/ssp/supported_os/common_templates/rhel/test_rhel_os_support.py, before starting next test tests/compute/virt/general/test_container_disk_vm.py, we ran cluster sanity and found Kubevirt to be in deploying state. So we exited the run and before that cleaned up the configmap to release lock on the cluster. This is expected behavior.

I also want to point, in this specific case, `KubeVirt is progressing: Deploying version ` indicates the deploy is not complete. This is NOT caused by any test and in such cases when deploy is in progress, if we run tests against such cluster, we can't use the test for qualifying a build, as the cluster was not fully deployed.


Please let me know if you need more information.

Comment 9 Krzysztof Majcher 2022-06-27 15:47:21 UTC
Assigning to Virt team to continue with the investigation.

Comment 10 João Vilaça 2022-06-27 15:55:50 UTC
Suggestion: It's probably due to my lack of experience with these tests, but
the errors logged were very difficult to understand for me, namely where the error
happened and what caused it. If possible, I think it would be a good idea to have more
logging in the tests, for example, in this case, stating that the error happened while
waiting for the cluster to be stable, and therefore the tests were exiting and
cleaning up some related resources.

Comment 11 lpivarc 2022-06-30 08:19:29 UTC

*** This bug has been marked as a duplicate of bug 2079916 ***


Note You need to log in before you can comment on or make changes to this bug.