Description of problem:
After deleting the CDI CR (cdi-kubevirt-hyperconverged), it is not re-created by HCO.

Version-Release number of selected component (if applicable):
OCP 4.4 + CNV 2.3
container-native-virtualization-hyperconverged-cluster-operator@sha256:35c779c436c984baa0d6279d936053911396c03334e8dba6c89aa45859ab47fa

How reproducible:

Steps to Reproduce:
1. Delete the CDI CR
2. Check that the CDI CR is re-created

Actual results:
2.
[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get cdi
NAME                          AGE   PHASE
cdi-kubevirt-hyperconverged   10m   Error

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get cdi cdi-kubevirt-hyperconverged -o yaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: CDI
metadata:
  creationTimestamp: "2020-03-04T04:24:02Z"
  generation: 8
  labels:
    app: kubevirt-hyperconverged
  name: cdi-kubevirt-hyperconverged
  resourceVersion: "1372363"
  selfLink: /apis/cdi.kubevirt.io/v1alpha1/cdis/cdi-kubevirt-hyperconverged
  uid: 96b24c7c-4f58-48d7-b09a-0ba7ae3c6d31
spec:
  uninstallStrategy: BlockUninstallIfWorkloadsExist
status:
  conditions:
  - lastHeartbeatTime: "2020-03-04T04:24:53Z"
    lastTransitionTime: "2020-03-04T04:24:02Z"
    status: "False"
    type: Available
  - lastHeartbeatTime: "2020-03-04T04:24:53Z"
    lastTransitionTime: "2020-03-04T04:24:02Z"
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2020-03-04T04:24:53Z"
    lastTransitionTime: "2020-03-04T04:24:02Z"
    message: Reconciling to error state, no configmap
    reason: ConfigError
    status: "True"
    type: Degraded
  phase: Error

Expected results:
2. cdi-kubevirt-hyperconverged should be created again.

Additional info:
I noticed that after deletion the re-created CR was not always in the Error phase. When it did stay in Error, deleting it again made it start deploying:

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc delete cdi cdi-kubevirt-hyperconverged
cdi.cdi.kubevirt.io "cdi-kubevirt-hyperconverged" deleted
[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get cdi -w
NAME                          AGE   PHASE
cdi-kubevirt-hyperconverged   62s   Error
[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc delete cdi cdi-kubevirt-hyperconverged
cdi.cdi.kubevirt.io "cdi-kubevirt-hyperconverged" deleted
[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get cdi -w
NAME                          AGE   PHASE
cdi-kubevirt-hyperconverged   4s    Deploying
cdi-kubevirt-hyperconverged   14s   Deployed
I tried this a significant number of times and it can be reproduced occasionally. With oc delete cdi cdi-kubevirt-hyperconverged we can delete the CDI CR, and HCO always tries to create a new one. In all of my tests HCO did create the CDI CR, but sometimes the CDI operator failed the reconciliation with the error emitted here:
https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/operator/controller/controller.go#L193

That error is emitted when it fails to find a ConfigMap named cdi-config in the openshift-cnv namespace. Indeed, in all the failures in my tests, when the CDI CR is in the Error state the cdi-config ConfigMap is really missing. But that ConfigMap is supposed to be created by the CDI operator itself here:
https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/operator/controller/controller.go#L256

So something on the CDI operator side is flaky and "sometimes" fails to recognize that it is reconciling a new CR and therefore has to create a new ConfigMap.

I'd suggest changing this line:
https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/operator/controller/controller.go#L191
reqLogger.Info("Reconciling to error state, no configmap")
to also log the value of cr.Status.Phase, because I suspect there is a transient phase that can cause this on freshly created CRs.

Moving to CDI for further investigation.
I fear this could be a race condition that flags the new CR as Deploying/Deleting/Upgrading/... before cdi-config is actually created, and hence the error. I'd suggest also logging the incoming cr.Status.Phase before setting it to Error, in order to better understand this.
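Something along these lines should be enough (the first line is the existing one quoted from the linked controller.go; the "incomingPhase" key name is just an example, assuming the logger there accepts key/value pairs as the other structured log entries suggest):

-	reqLogger.Info("Reconciling to error state, no configmap")
+	reqLogger.Info("Reconciling to error state, no configmap", "incomingPhase", cr.Status.Phase)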
When it fails, we see this in the CDI operator logs:

{"level":"info","ts":1583947088.1778286,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947088.1778526,"logger":"cdi-operator","msg":"CDI CR no longer exists","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947088.1856332,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947088.1858122,"logger":"cdi-operator","msg":"CDI CR no longer exists","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947088.2012565,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947088.2013493,"logger":"cdi-operator","msg":"Reconciling to error state, unwanted CDI object","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947088.6882749,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947088.6883676,"logger":"cdi-operator","msg":"Reconciling to error state, unwanted CDI object","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"error","ts":1583947088.9515827,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"cdi-operator-controller","request":"/cdi-kubevirt-hyperconverged","error":"Operation cannot be fulfilled on cdis.cdi.kubevirt.io \"cdi-kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"kubevirt.io/containerized-data-importer/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\tvendor/github.com/go-logr/zapr/zapr.go:128\nkubevirt.io/containerized-data-importer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tvendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nkubevirt.io/containerized-data-importer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tvendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nkubevirt.io/containerized-data-importer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\tvendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nkubevirt.io/containerized-data-importer/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\tvendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nkubevirt.io/containerized-data-importer/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tvendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nkubevirt.io/containerized-data-importer/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\tvendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"info","ts":1583947089.9519157,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947089.9522083,"logger":"cdi-operator","msg":"Reconciling to error state, no configmap","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947090.0441933,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1583947090.044272,"logger":"cdi-operator","msg":"Reconciling to error state, no configmap","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}

So first (1583947088.2013493) the CDI operator marked the new CR as being in the Error phase on:
https://github.com/kubevirt/containerized-data-importer/blob/release-v1.13/pkg/operator/controller/controller.go#L203
and only later (1583947089.9522083) marked it in the Error phase again due to the missing ConfigMap, the CR already being in the Error phase at that point (so phase != "").

I suspect that the CDI operator detects the new CR and enters the reconciliation loop for it. Then, if we are unlucky, it finds the previous ConfigMap that is still there but flagged for deletion, and so it skips creating a new one here:
https://github.com/kubevirt/containerized-data-importer/blob/release-v1.13/pkg/operator/controller/controller.go#L189
But the new CR is not controlled by the old ConfigMap, which is about to be deleted, so on this line:
https://github.com/kubevirt/containerized-data-importer/blob/release-v1.13/pkg/operator/controller/controller.go#L201
the CDI operator flags the new CR as being in the error state. On the next round it skips creating a new ConfigMap because the CR is already in the error state:
https://github.com/kubevirt/containerized-data-importer/blob/release-v1.13/pkg/operator/controller/controller.go#L190
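To make the suspected sequence easier to follow, here is a small, self-contained toy model of the flow I am describing (this is NOT the actual CDI operator code; the phases and helper names are simplified, only the log messages match):

package main

import "fmt"

// Toy model of the suspected reconcile behaviour, NOT the real CDI operator code.
// It only illustrates why, once the new CR has been pushed to the Error phase,
// cdi-config is never created and the CR stays in Error forever.

type cr struct {
	phase string // "" means a freshly created CR with no phase yet
}

func reconcile(c *cr, oldConfigMapStillTerminating bool) string {
	if oldConfigMapStillTerminating {
		// The old cdi-config is still around (with a deletionTimestamp) but is not
		// controlled by the new CR, so the new CR is treated as "unwanted".
		c.phase = "Error"
		return "Reconciling to error state, unwanted CDI object"
	}
	if c.phase == "" {
		// Brand-new CR and no ConfigMap at all: this is the only branch that would
		// create cdi-config and start deploying.
		c.phase = "Deploying"
		return "creating cdi-config, deploying"
	}
	// The ConfigMap is gone, but the CR already has a phase (Error from the
	// previous round), so the "create ConfigMap" branch is skipped.
	c.phase = "Error"
	return "Reconciling to error state, no configmap"
}

func main() {
	c := &cr{}
	fmt.Println(reconcile(c, true))  // old cdi-config still terminating
	fmt.Println(reconcile(c, true))  // still terminating on the next reconcile
	fmt.Println(reconcile(c, false)) // cdi-config finally gone, but phase is already Error
	fmt.Println(reconcile(c, false)) // ...and it never recovers
}

With this flow the only way out is to delete the CR again once the old ConfigMap is fully gone, which matches what we see when a second delete lands the CR in Deploying/Deployed.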
Tested with virt-cdi-operator-container-v2.3.0-37; the bug can still be reproduced.

[cloud-user@ocp-psi-executor ~]$ oc delete cdi cdi-kubevirt-hyperconverged; oc get cdi -w
cdi.cdi.kubevirt.io "cdi-kubevirt-hyperconverged" deleted
NAME                          AGE   PHASE
cdi-kubevirt-hyperconverged   0s    Error
cdi-kubevirt-hyperconverged   1s    Error
cdi-kubevirt-hyperconverged   2s    Error

[cloud-user@ocp-psi-executor ~]$ oc get cdi -w
NAME                          AGE     PHASE
cdi-kubevirt-hyperconverged   5m57s   Error

[cloud-user@ocp-psi-executor ~]$ oc delete cdi cdi-kubevirt-hyperconverged; oc get cdi -w
cdi.cdi.kubevirt.io "cdi-kubevirt-hyperconverged" deleted
NAME                          AGE   PHASE
cdi-kubevirt-hyperconverged   0s    Deploying
cdi-kubevirt-hyperconverged   17s   Deployed

[cloud-user@ocp-psi-executor ~]$ oc delete cdi cdi-kubevirt-hyperconverged; oc get cdi -w
cdi.cdi.kubevirt.io "cdi-kubevirt-hyperconverged" deleted
NAME                          AGE   PHASE
cdi-kubevirt-hyperconverged   1s    Error
cdi-kubevirt-hyperconverged   1s    Error
cdi-kubevirt-hyperconverged   2s    Error
@Michael, this bug failed QA. Can you please take a look? It seems the CDI operator might be depending on the existence of a resource it is supposed to create for itself (cdi-config).
In order to ensure that there is only one "active" CDI CR, the cdi-config ConfigMap (which tracks the active CR) is created when the first CDI CR is reconciled. If that CR is reconciled again and cdi-config does not exist, that of course indicates an error. It would appear that Simone's fix needs to be revisited.
Michael, the PR is merged; do you feel it's ready for QE?
If we must have this for 2.3, the PR has to be cherry-picked and a new downstream build created.
@Michael, this has blocker+, so let's get on that cherry-pick ASAP.
Tested with virt-cdi-operator-container-v2.3.0-39, the bug has been fixed, thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:2011