Bug 1920576

Summary: HCO can report ready=true when it failed to create a CR for a component operator
Product: Container Native Virtualization (CNV)
Reporter: Simone Tiraboschi <stirabos>
Component: Installation
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA
QA Contact: Inbar Rose <irose>
Severity: high
Docs Contact:
Priority: high
Version: 2.6.0
CC: cnv-qe-bugs, stirabos
Target Milestone: ---
Keywords: EasyFix
Target Release: 2.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: hco-bundle-registry-container-v2.6.0-519
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-10 11:23:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Simone Tiraboschi 2021-01-26 16:19:15 UTC
Description of problem:
Under certain conditions,
when HCO encounters an error while reconciling the CR of a component operator,
it correctly reports the error in its conditions, but the error is not reflected in the ready status of the HCO pod, which is (currently) the only information consumed by the OLM.

Example:
$ oc get hco -n kubevirt-hyperconverged kubevirt-hyperconverged -o yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
...
status:
  conditions:
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:04:14Z"
    message: 'Error while reconciling: Internal error occurred: failed calling webhook "vssp.kb.io": Post "https://ssp-operator-service.kubevirt-hyperconverged.svc:9443/validate-ssp-kubevirt-io-v1beta1-ssp?timeout=15s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:29Z"
    message: Unknown Status
    status: Unknown
    type: Available
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:06:13Z"
    message: Unknown Status
    status: Unknown
    type: Progressing
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:28Z"
    message: Unknown Status
    status: Unknown
    type: Degraded
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:06:13Z"
    message: Unknown Status
    status: Unknown
    type: Upgradeable

while:

$ oc get pods -n kubevirt-hyperconverged  | grep hco-operator
hco-operator-bbdcbf74-cqvfd                           1/1     Running   18         3h17m

and so:

$ oc get csvs -n kubevirt-hyperconverged
NAME                                                   DISPLAY                                    VERSION              REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.4.0-202101261208   KubeVirt HyperConverged Cluster Operator   1.4.0-202101261208   kubevirt-hyperconverged-operator.v1.3.0   Succeeded


Version-Release number of selected component (if applicable):
2.6.0

How reproducible:
pretty difficult

Steps to Reproduce:
1. We don't have a clear reproduction process.

Actual results:
We see:
status:
  conditions:
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:04:14Z"
    message: 'Error while reconciling: Internal error occurred: failed calling webhook "vssp.kb.io": Post "https://ssp-operator-service.kubevirt-hyperconverged.svc:9443/validate-ssp-kubevirt-io-v1beta1-ssp?timeout=15s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:29Z"
    message: Unknown Status
    status: Unknown
    type: Available

but at the same time:
hco-operator-bbdcbf74-cqvfd                           1/1

Expected results:
If the conditions are not positive, the HCO pod should not be ready.
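
A possible reading of this expectation, as a hedged sketch rather than the actual fix: derive the pod's readiness from the HCO conditions. The `POSITIVE` mapping and the function name `pod_should_be_ready` are assumptions made for illustration; they are not taken from the HCO source code, and the real operator may weigh the conditions differently.

```python
# Assumed "positive" value for each condition type (illustrative only,
# not taken from the HCO source code).
POSITIVE = {
    "ReconcileComplete": "True",
    "Available": "True",
    "Progressing": "False",
    "Degraded": "False",
    "Upgradeable": "True",
}


def pod_should_be_ready(conditions: list) -> bool:
    """Return True only if every known condition has its assumed positive value."""
    by_type = {c["type"]: c["status"] for c in conditions}
    # A missing or "Unknown" condition is treated as not positive.
    return all(by_type.get(ctype) == want for ctype, want in POSITIVE.items())


# Conditions as reported in this bug: reconcile failed, everything else Unknown.
reported = [
    {"type": "ReconcileComplete", "status": "False"},
    {"type": "Available", "status": "Unknown"},
    {"type": "Progressing", "status": "Unknown"},
    {"type": "Degraded", "status": "Unknown"},
    {"type": "Upgradeable", "status": "Unknown"},
]

print(pod_should_be_ready(reported))
```

Under these assumptions the conditions reported in this bug would yield a not-ready pod, so the OLM would not treat the deployment as healthy.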

Additional info:
We seldom see it, as a side effect of https://bugzilla.redhat.com/1907290: due to 1907290, HCO is not able to create/update the CR for SSP, which is validated by a webhook configured (incorrectly) by the OLM.

This is not about fixing the real issue (https://bugzilla.redhat.com/1907290) but just about properly communicating the status to the OLM, and so to the user, when we hit that issue.

Comment 2 Inbar Rose 2021-02-08 09:27:06 UTC
Verified upstream using unit tests.

Comment 5 errata-xmlrpc 2021-03-10 11:23:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799