Created attachment 1863175
ssp log

Description of problem:
After deploying the bare-metal (BM) cluster bm02-cnvqe2-rdu2, I noticed that HCO is in a degraded state because SSP is not available. The SSP operator log shows that it is continuously attempting to reconcile and failing.

Version-Release number of selected component (if applicable):
4.10.0 - 686

How reproducible:
Not sure.

Steps to Reproduce:
1. Not sure.
2.
3.

Actual results:
HCO status conditions:
====================
{
  "lastTransitionTime": "2022-02-23T16:59:54Z",
  "message": "Reconcile completed successfully",
  "observedGeneration": 73,
  "reason": "ReconcileCompleted",
  "status": "True",
  "type": "ReconcileComplete"
},
{
  "lastTransitionTime": "2022-02-24T00:24:04Z",
  "message": "SSP is not available: Reconciling SSP resources",
  "observedGeneration": 73,
  "reason": "SSPNotAvailable",
  "status": "False",
  "type": "Available"
},
{
  "lastTransitionTime": "2022-02-24T00:24:04Z",
  "message": "SSP is progressing: Reconciling SSP resources",
  "observedGeneration": 73,
  "reason": "SSPProgressing",
  "status": "True",
  "type": "Progressing"
},
{
  "lastTransitionTime": "2022-02-24T00:46:30Z",
  "message": "SSP is degraded: Reconciling SSP resources",
  "observedGeneration": 73,
  "reason": "SSPDegraded",
  "status": "True",
  "type": "Degraded"
},
{
  "lastTransitionTime": "2022-02-24T00:24:04Z",
  "message": "SSP is progressing: Reconciling SSP resources",
  "observedGeneration": 73,
  "reason": "SSPProgressing",
  "status": "False",
  "type": "Upgradeable"
}
===================

SSP status:
===================
{
  "conditions": [
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Available",
      "status": "False",
      "type": "Available"
    },
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Progressing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastHeartbeatTime": "2022-02-24T00:49:32Z",
      "lastTransitionTime": "2022-02-24T00:49:32Z",
      "message": "Reconciling SSP resources",
      "reason": "Degraded",
      "status": "True",
      "type": "Degraded"
    }
  ],
  "observedGeneration": 6,
  "observedVersion": "4.10.0",
  "operatorVersion": "4.10.0",
  "phase": "Deploying",
  "targetVersion": "4.10.0"
}
===================

In the SSP operator log, this message shows up again and again:
=================
{
  "level": "error",
  "ts": 1645655734.9615152,
  "logger": "controller-runtime.manager.controller.ssp",
  "msg": "Reconciler error",
  "reconciler group": "ssp.kubevirt.io",
  "reconciler kind": "SSP",
  "name": "ssp-kubevirt-hyperconverged",
  "namespace": "openshift-cnv",
  "error": "Operation cannot be fulfilled on ssps.ssp.kubevirt.io \"ssp-kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"
}
================

The HCO operator log and the SSP operator log are attached.

Expected results:

Additional info:
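The logged error is the Kubernetes API server's optimistic-concurrency conflict: an update was submitted with a resourceVersion that is no longer current, because something else wrote the object between the read and the write. The Python sketch below is a toy in-memory model of that check, not client-go or controller-runtime; `FakeAPIServer`, `StoredObject`, `retry_on_conflict` and `demo_race` are hypothetical names. In Go, client-go offers `retry.RetryOnConflict` (package `k8s.io/client-go/util/retry`) for the same re-read, re-apply, retry pattern.

```python
"""Toy model of Kubernetes optimistic concurrency (resourceVersion checks).

Assumption: a simplified in-memory stand-in for the API server, not the
real client libraries. All names here are hypothetical.
"""
from dataclasses import dataclass, field
from typing import Callable, Dict


class ConflictError(Exception):
    """Mimics the 'the object has been modified' conflict error."""


@dataclass
class StoredObject:
    resource_version: int = 1
    labels: Dict[str, str] = field(default_factory=dict)


class FakeAPIServer:
    def __init__(self) -> None:
        self.obj = StoredObject()

    def get(self) -> StoredObject:
        # Return a copy, as a client reading from the server would get.
        return StoredObject(self.obj.resource_version, dict(self.obj.labels))

    def update(self, new: StoredObject) -> None:
        # Reject any write that is based on a stale resourceVersion.
        if new.resource_version != self.obj.resource_version:
            raise ConflictError(
                "the object has been modified; please apply your changes "
                "to the latest version and try again")
        self.obj = StoredObject(new.resource_version + 1, dict(new.labels))


def retry_on_conflict(server: FakeAPIServer,
                      mutate: Callable[[StoredObject], None],
                      attempts: int = 5) -> int:
    """Re-read, re-apply the change, and retry; return attempts used."""
    for attempt in range(1, attempts + 1):
        obj = server.get()
        mutate(obj)
        try:
            server.update(obj)
            return attempt
        except ConflictError:
            continue  # someone else won the race; start again from a fresh read
    raise ConflictError("gave up after %d attempts" % attempts)


def demo_race() -> int:
    """A second writer updates the object between our read and our write."""
    server = FakeAPIServer()
    raced = {"done": False}

    def mutate(obj: StoredObject) -> None:
        if not raced["done"]:
            raced["done"] = True
            other = server.get()      # a second writer reads the same version...
            other.labels["x"] = "y"
            server.update(other)      # ...and writes first
        obj.labels["owner"] = "a"

    return retry_on_conflict(server, mutate)
```

Retrying on conflict handles an occasional race, but it cannot help when two writers keep changing the same object indefinitely: every retry simply loses again, and the reconcile error keeps reappearing.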
Moving to Storage, since triage by Oren indicated that this is a CDI issue.
Moved to SSP after a debug session with @akrejcir.
I reproduced this on my development cluster; it is not related to bare metal. The problem is that SSP and CDI modify the same labels in a loop. This is the update done by SSP:

@ ["metadata","labels","app.kubernetes.io/component"]
- "storage"
+ "templating"
@ ["metadata","labels","app.kubernetes.io/managed-by"]
- "cdi-controller"
+ "ssp-operator"

CDI then reverts it. I will post a PR to SSP to break the loop.
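The loop can be modelled as two reconcilers that each "correct" the same label keys toward different values: neither ever reaches a stable state, so every reconcile writes and the write triggers the other controller. The Python sketch below is a toy model that uses the label keys and values from the diff above, starting from the CDI values (the "-" side). The `reconcile` and `run` functions and the `respect_other_manager` guard are hypothetical; the guard is only one possible way to break the loop, not necessarily what the SSP PR does.

```python
"""Toy simulation of two controllers fighting over the same labels.

Assumptions: controller behaviour is reduced to "set these labels if they
differ"; the respect_other_manager guard is a hypothetical loop breaker.
"""
from typing import Dict

MANAGED_BY = "app.kubernetes.io/managed-by"

SSP_LABELS = {
    "app.kubernetes.io/component": "templating",
    MANAGED_BY: "ssp-operator",
}
CDI_LABELS = {
    "app.kubernetes.io/component": "storage",
    MANAGED_BY: "cdi-controller",
}


def reconcile(labels: Dict[str, str], desired: Dict[str, str],
              me: str, respect_other_manager: bool) -> bool:
    """Apply the desired labels; return True if an update was written."""
    manager = labels.get(MANAGED_BY)
    if respect_other_manager and manager not in (None, me):
        # Hypothetical guard: leave labels owned by another controller alone.
        return False
    if all(labels.get(k) == v for k, v in desired.items()):
        return False  # already in the desired state, nothing to write
    labels.update(desired)
    return True


def run(rounds: int, ssp_guard: bool) -> int:
    """Alternate SSP and CDI reconciles; return the total number of writes."""
    labels = dict(CDI_LABELS)  # start from the values CDI sets
    writes = 0
    for _ in range(rounds):
        writes += int(reconcile(labels, SSP_LABELS, "ssp-operator", ssp_guard))
        writes += int(reconcile(labels, CDI_LABELS, "cdi-controller", False))
    return writes
```

Without the guard, every round produces two writes (SSP overwrites, CDI reverts), so the write count grows without bound; with the guard on one side, the object settles after zero writes. Breaking the cycle on only one side is enough, which matches fixing it in SSP alone.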
Verified on kubevirt-ssp-operator-container-v4.10.0-50.
Note: The fix mentions that the labels are now being set when auto-update is disabled. This could mean the bug will reappear when auto-update is disabled, but I didn't manage to reproduce it even then, so I can't say for sure.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947