Bug 1920576 - HCO can report ready=true when it failed to create a CR for a component operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 2.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.6.0
Assignee: Simone Tiraboschi
QA Contact: Inbar Rose
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-26 16:19 UTC by Simone Tiraboschi
Modified: 2021-03-10 11:24 UTC (History)

Fixed In Version: hco-bundle-registry-container-v2.6.0-519
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-10 11:23:40 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt hyperconverged-cluster-operator pull 1097 | 0 | None | closed | Fix a bug where HCO is ready while there is handling one of the operand returns error | 2021-02-04 13:36:11 UTC
Github | kubevirt hyperconverged-cluster-operator pull 1098 | 0 | None | closed | [release-1.3] Fix a bug where HCO is ready while there is handling one of the operand returns error | 2021-02-04 13:36:13 UTC
Red Hat Product Errata | RHSA-2021:0799 | 0 | None | None | None | 2021-03-10 11:24:48 UTC

Description Simone Tiraboschi 2021-01-26 16:19:15 UTC
Description of problem:
Under certain conditions, when HCO encounters an error while reconciling the CR of a component operator, it correctly reports the error in its conditions, but the error is not reflected in the ready status of the HCO pod, which is (currently) the only information consumed by OLM.

Example:
$ oc get hco -n kubevirt-hyperconverged kubevirt-hyperconverged -o yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
...
status:
  conditions:
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:04:14Z"
    message: 'Error while reconciling: Internal error occurred: failed calling webhook "vssp.kb.io": Post "https://ssp-operator-service.kubevirt-hyperconverged.svc:9443/validate-ssp-kubevirt-io-v1beta1-ssp?timeout=15s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:29Z"
    message: Unknown Status
    status: Unknown
    type: Available
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:06:13Z"
    message: Unknown Status
    status: Unknown
    type: Progressing
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:28Z"
    message: Unknown Status
    status: Unknown
    type: Degraded
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:06:13Z"
    message: Unknown Status
    status: Unknown
    type: Upgradeable

while:

$ oc get pods -n kubevirt-hyperconverged  | grep hco-operator
hco-operator-bbdcbf74-cqvfd                           1/1     Running   18         3h17m

and so:

$ oc get csvs -n kubevirt-hyperconverged
NAME                                                   DISPLAY                                    VERSION              REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.4.0-202101261208   KubeVirt HyperConverged Cluster Operator   1.4.0-202101261208   kubevirt-hyperconverged-operator.v1.3.0   Succeeded
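
The inconsistency in the outputs above can also be checked mechanically. The following is only a minimal Python sketch, not part of HCO: the helper name `unhealthy_conditions` is hypothetical, the conditions are assumed to have been loaded as a list of dicts (for example from `oc get hco ... -o json`), and the "healthy" value assumed for each condition type follows the usual operator condition conventions rather than anything stated in this report.

```python
# Hypothetical helper, not part of HCO: lists the conditions that are not
# in their "healthy" state. The expected values are an assumption based on
# the usual operator condition conventions.
EXPECTED = {
    "ReconcileComplete": "True",
    "Available": "True",
    "Progressing": "False",
    "Degraded": "False",
    "Upgradeable": "True",
}


def unhealthy_conditions(conditions):
    """Return (type, status, reason) for every condition off its expected value."""
    found = {c["type"]: c for c in conditions}
    result = []
    for ctype, expected in EXPECTED.items():
        cond = found.get(ctype)
        status = cond["status"] if cond else "Missing"
        if status != expected:
            reason = cond.get("reason", "") if cond else ""
            result.append((ctype, status, reason))
    return result


# Conditions from the example above (messages and timestamps omitted).
conditions = [
    {"type": "ReconcileComplete", "status": "False", "reason": "ReconcileFailed"},
    {"type": "Available", "status": "Unknown"},
    {"type": "Progressing", "status": "Unknown"},
    {"type": "Degraded", "status": "Unknown"},
    {"type": "Upgradeable", "status": "Unknown"},
]

pod_ready = True  # the pod shows 1/1 in the `oc get pods` output above

bad = unhealthy_conditions(conditions)
if bad and pod_ready:
    print("mismatch: pod is ready but", len(bad), "condition(s) are not healthy")
```

With the data from this report, all five conditions are flagged while the pod is still ready, which is exactly the situation OLM cannot see.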


Version-Release number of selected component (if applicable):
2.6.0

How reproducible:
pretty difficult

Steps to Reproduce:
1. We don't have a clear reproduction process.

Actual results:
We see:
status:
  conditions:
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:04:14Z"
    message: 'Error while reconciling: Internal error occurred: failed calling webhook "vssp.kb.io": Post "https://ssp-operator-service.kubevirt-hyperconverged.svc:9443/validate-ssp-kubevirt-io-v1beta1-ssp?timeout=15s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:29Z"
    message: Unknown Status
    status: Unknown
    type: Available

but at the same time:
hco-operator-bbdcbf74-cqvfd                           1/1

Expected results:
If the conditions are not positive, the HCO pod should not be ready.
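
As an illustration of the expected behavior only, here is a minimal Python sketch in which readiness follows the outcome of the latest reconciliation pass. It is not the HCO implementation: the `Readiness` class, the `reconcile` function and the two `ensure_*` handlers are hypothetical names made up for this example.

```python
# Hypothetical sketch, not the HCO code: readiness follows the outcome of
# the most recent reconciliation pass.
class Readiness:
    def __init__(self):
        self.ready = False

    def set(self, value):
        self.ready = value


def reconcile(operands, readiness):
    """Run every operand handler; report not-ready if any of them fails."""
    errors = []
    for name, ensure in operands:
        try:
            ensure()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    # Ready only when every operand CR was created or updated successfully.
    readiness.set(not errors)
    return errors


def ensure_kubevirt():
    pass  # stands in for a successful create/update of a component CR


def ensure_ssp():
    # stands in for the rejected webhook call described in this report
    raise RuntimeError('failed calling webhook "vssp.kb.io"')


readiness = Readiness()
errors = reconcile([("kubevirt", ensure_kubevirt), ("ssp", ensure_ssp)], readiness)
print(readiness.ready, errors)
```

A readiness probe that reads `readiness.ready` would then fail as soon as one operand handler returns an error, so the pod would no longer be reported as ready.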

Additional info:
We see it infrequently, as a side effect of https://bugzilla.redhat.com/1907290: because of that bug, HCO is not able to create or update the CR for SSP, which is validated by a webhook configured (incorrectly) by OLM.

This bug is not about fixing the underlying issue (https://bugzilla.redhat.com/1907290), only about properly communicating the status to OLM, and therefore to the user, when we hit that issue.

Comment 2 Inbar Rose 2021-02-08 09:27:06 UTC
Verified upstream using unit tests.

Comment 5 errata-xmlrpc 2021-03-10 11:23:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799

