Description of problem:
We got a case where cdi-deployment wasn't able to start due to PSA. The deployment controller was clearly stating this in the conditions on the deployment status:

$ oc get deployments -n openshift-cnv cdi-deployment -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2022-09-21T21:21:28Z",
    "lastUpdateTime": "2022-09-21T21:21:28Z",
    "message": "Deployment does not have minimum availability.",
    "reason": "MinimumReplicasUnavailable",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-09-21T21:21:28Z",
    "lastUpdateTime": "2022-09-21T21:21:28Z",
    "message": "pods \"cdi-deployment-6f4888b5cb-r9f5h\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"cdi-controller\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"cdi-controller\" must set securityContext.capabilities.drop=[\"ALL\"]), seccompProfile (pod or container \"cdi-controller\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")",
    "reason": "FailedCreate",
    "status": "True",
    "type": "ReplicaFailure"
  },
  {
    "lastTransitionTime": "2022-09-21T21:31:29Z",
    "lastUpdateTime": "2022-09-21T21:31:29Z",
    "message": "ReplicaSet \"cdi-deployment-6f4888b5cb\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]

although cdi-operator was still reporting Progressing=True, Degraded=False on its CR:

$ oc get cdi cdi-kubevirt-hyperconverged -o yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  annotations:
    cdi.kubevirt.io/configAuthority: ""
  creationTimestamp: "2022-09-21T21:21:24Z"
  finalizers:
  - operator.cdi.kubevirt.io
  generation: 2
  labels:
    app: kubevirt-hyperconverged
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: hco-operator
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.11.0
  name: cdi-kubevirt-hyperconverged
  resourceVersion: "39138"
  uid: 8cf09f52-9ad4-48d0-8831-d97923fdfe29
spec:
  certConfig:
    ca:
      duration: 48h0m0s
      renewBefore: 24h0m0s
    server:
      duration: 24h0m0s
      renewBefore: 12h0m0s
  config:
    featureGates:
    - HonorWaitForFirstConsumer
  infra: {}
  uninstallStrategy: BlockUninstallIfWorkloadsExist
  workload: {}
status:
  conditions:
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Available
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    message: Started Deployment
    reason: DeployStarted
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Degraded
  operatorVersion: 4.11.0
  phase: Deploying
  targetVersion: 4.11.0

so HCO is going to report Progressing forever as well (HCO reads only the conditions on the CDI CR, not the ones on its operands).

Version-Release number of selected component (if applicable):
4.11.0

How reproducible:
100%

Steps to Reproduce:
1. Try a deployment where cdi-deployment gets stuck (for instance, CNV 4.11.0 on OCP 4.12.0 with the restricted PSA profile enforced on the openshift-cnv namespace)
2. Check `oc get deployments -n openshift-cnv cdi-deployment -o json | jq .status.conditions`
3. Compare with `oc get cdi cdi-kubevirt-hyperconverged -o json | jq .status.conditions`

Actual results:
A mismatch between cdi-deployment and the CDI CR.

On the CDI deployment:

status:
  conditions:
  - lastTransitionTime: "2022-09-21T21:21:28Z"
    lastUpdateTime: "2022-09-21T21:21:28Z"
    message: 'pods "cdi-deployment-6f4888b5cb-r9f5h" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "cdi-controller" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "cdi-controller" must set securityContext.capabilities.drop=["ALL"]), seccompProfile (pod or container "cdi-controller" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  - lastTransitionTime: "2022-09-21T21:31:29Z"
    lastUpdateTime: "2022-09-21T21:31:29Z"
    message: ReplicaSet "cdi-deployment-6f4888b5cb" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing

On the CDI CR:

status:
  conditions:
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Available
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    message: Started Deployment
    reason: DeployStarted
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Degraded

Expected results:
The CDI CR conditions correctly aggregate the ones from its operands.

Additional info:
Please notice that correctly reporting a failed install/upgrade is actually a prerequisite for the unsafe fail forward upgrades feature:
https://olm.operatorframework.io/docs/advanced-tasks/unsafe-fail-forward-upgrades/
Currently we are still not opting in, but that feature will eventually let customers try to recover from stuck upgrades.
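For the enforcement in step 1 of the reproduction, one way to get there is labeling the namespace with the standard Pod Security Admission labels (this is an assumption about how the cluster in this case was configured; on a managed CNV install the namespace labels may be set or reconciled for you):

$ oc label namespace openshift-cnv \
    pod-security.kubernetes.io/enforce=restricted \
    pod-security.kubernetes.io/enforce-version=latest \
    --overwrite

The "restricted:latest" string in the admission error corresponds to the values of those two labels.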
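The admission error itself spells out exactly which fields the restricted profile checks. A minimal sketch of a pod template satisfying those three checks (illustrative only, not taken from CDI sources; other restricted-profile requirements such as runAsNonRoot are omitted because the error does not flag them):

# Illustrative sketch only: the three settings flagged by the admission
# error, expressed on the Deployment's pod template.
spec:
  template:
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault   # "pod or container" may set this
      containers:
      - name: cdi-controller
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL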
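To make the expected result concrete, a hypothetical sketch of how the CDI CR conditions could look once the operator propagates the Deployment's ReplicaFailure condition; the reason and message strings below are invented for illustration and are not CDI's actual output:

# Hypothetical aggregated conditions on the CDI CR (sketch, not real output):
status:
  conditions:
  - status: "False"
    type: Available
  - message: 'Deployment cdi-deployment is failing: pods forbidden by PodSecurity "restricted:latest"'
    reason: DeploymentFailed     # hypothetical reason string
    status: "False"
    type: Progressing
  - message: 'Deployment cdi-deployment is failing: pods forbidden by PodSecurity "restricted:latest"'
    reason: DeploymentFailed
    status: "True"
    type: Degraded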
Alex, please have a look.
Alex, could you please provide an update?
Tracking this at https://issues.redhat.com/browse/CNV-31135