Description of problem:
Under certain conditions, when HCO encounters an error reconciling the CR of a component operator, HCO correctly reports it in its conditions, but this is not reflected in the ready status of the pod, which is (currently) the only information consumed by the OLM.

Example:
$ oc get hco -n kubevirt-hyperconverged kubevirt-hyperconverged -o yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
...
status:
  conditions:
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:04:14Z"
    message: 'Error while reconciling: Internal error occurred: failed calling webhook "vssp.kb.io": Post "https://ssp-operator-service.kubevirt-hyperconverged.svc:9443/validate-ssp-kubevirt-io-v1beta1-ssp?timeout=15s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:29Z"
    message: Unknown Status
    status: Unknown
    type: Available
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:06:13Z"
    message: Unknown Status
    status: Unknown
    type: Progressing
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:28Z"
    message: Unknown Status
    status: Unknown
    type: Degraded
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:06:13Z"
    message: Unknown Status
    status: Unknown
    type: Upgradeable

while:
$ oc get pods -n kubevirt-hyperconverged | grep hco-operator
hco-operator-bbdcbf74-cqvfd   1/1   Running   18   3h17m

and so:
$ oc get csvs -n kubevirt-hyperconverged
NAME                                                   DISPLAY                                    VERSION              REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.4.0-202101261208   KubeVirt HyperConverged Cluster Operator   1.4.0-202101261208   kubevirt-hyperconverged-operator.v1.3.0   Succeeded

Version-Release number of selected component (if applicable):
2.6.0

How reproducible:
Pretty difficult.

Steps to Reproduce:
1. We don't have a clear reproduction process.

Actual results:
We see:
status:
  conditions:
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:04:14Z"
    message: 'Error while reconciling: Internal error occurred: failed calling webhook "vssp.kb.io": Post "https://ssp-operator-service.kubevirt-hyperconverged.svc:9443/validate-ssp-kubevirt-io-v1beta1-ssp?timeout=15s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:29Z"
    message: Unknown Status
    status: Unknown
    type: Available

but at the same time:
hco-operator-bbdcbf74-cqvfd   1/1

Expected results:
If the conditions are not positive, the HCO pod should not be ready.

Additional info:
We seldom see this. It is a side effect of https://bugzilla.redhat.com/1907290: due to 1907290, HCO is not able to create/update the CR for SSP, which is validated by a webhook configured (incorrectly) by the OLM.
This bug is not about fixing the real issue (https://bugzilla.redhat.com/1907290) but only about properly communicating the status to the OLM, and so to the user, when we hit that issue.
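The expected behaviour can be illustrated with a minimal Go sketch that derives pod readiness from the reported conditions. This is not HCO's actual code: the `Condition` type, the `expected` table and the `isReady` function are hypothetical names, and the policy (ReconcileComplete and Available must be "True", Progressing and Degraded must be "False", Upgradeable is ignored) is an assumption made only for illustration.

```go
package main

import "fmt"

// Condition mirrors the fields shown in the HyperConverged status above.
// Hypothetical type, not the one used by HCO.
type Condition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Reason  string
	Message string
}

// expected lists, in a fixed order, the status each condition must have
// for the pod to be considered ready. Assumed policy, for illustration only.
var expected = []struct{ Type, Status string }{
	{"ReconcileComplete", "True"},
	{"Available", "True"},
	{"Progressing", "False"},
	{"Degraded", "False"},
}

// isReady returns true only when every expected condition is present with
// the expected status; otherwise it returns false and the first mismatch.
func isReady(conds []Condition) (bool, string) {
	seen := make(map[string]string, len(conds))
	for _, c := range conds {
		seen[c.Type] = c.Status
	}
	for _, e := range expected {
		got, ok := seen[e.Type]
		if !ok {
			return false, fmt.Sprintf("condition %s is missing", e.Type)
		}
		if got != e.Status {
			return false, fmt.Sprintf("condition %s is %s, want %s", e.Type, got, e.Status)
		}
	}
	return true, ""
}

func main() {
	// Conditions as reported in this bug.
	reported := []Condition{
		{Type: "ReconcileComplete", Status: "False", Reason: "ReconcileFailed"},
		{Type: "Available", Status: "Unknown"},
		{Type: "Progressing", Status: "Unknown"},
		{Type: "Degraded", Status: "Unknown"},
		{Type: "Upgradeable", Status: "Unknown"},
	}
	ready, why := isReady(reported)
	fmt.Printf("ready=%v reason=%q\n", ready, why)
}
```

With the conditions from this report, `isReady` returns false because ReconcileComplete is "False", so a readiness probe built on such a check would show the pod as 0/1 instead of 1/1.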
Verified upstream using unit tests.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0799