1920576 – HCO can report ready=true when it failed to create a CR for a component operator

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Installation
Sub Component:
Version:	2.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	2.6.0
Assignee:	Simone Tiraboschi
QA Contact:	Inbar Rose
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-01-26 16:19 UTC by Simone Tiraboschi
Modified:	2021-03-10 11:24 UTC
CC List:	2 users
Fixed In Version:	hco-bundle-registry-container-v2.6.0-519
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-10 11:23:40 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubevirt hyperconverged-cluster-operator pull 1097	None	closed	Fix a bug where HCO is ready while there is handling one of the operand returns error	2021-02-04 13:36:11 UTC
Github	kubevirt hyperconverged-cluster-operator pull 1098	None	closed	[release-1.3] Fix a bug where HCO is ready while there is handling one of the operand returns error	2021-02-04 13:36:13 UTC
Red Hat Product Errata	RHSA-2021:0799	None	None	None	2021-03-10 11:24:48 UTC

Description Simone Tiraboschi 2021-01-26 16:19:15 UTC

Description of problem:
Under certain conditions, when HCO encounters an error while reconciling the CR of a component operator, it correctly reports the error in its conditions, but the error is not reflected in the ready status of the HCO pod, which is (currently) the only information consumed by the OLM.

Example:
$ oc get hco -n kubevirt-hyperconverged kubevirt-hyperconverged -o yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
...
status:
  conditions:
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:04:14Z"
    message: 'Error while reconciling: Internal error occurred: failed calling webhook "vssp.kb.io": Post "https://ssp-operator-service.kubevirt-hyperconverged.svc:9443/validate-ssp-kubevirt-io-v1beta1-ssp?timeout=15s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:29Z"
    message: Unknown Status
    status: Unknown
    type: Available
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:06:13Z"
    message: Unknown Status
    status: Unknown
    type: Progressing
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:28Z"
    message: Unknown Status
    status: Unknown
    type: Degraded
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:06:13Z"
    message: Unknown Status
    status: Unknown
    type: Upgradeable

while:

$ oc get pods -n kubevirt-hyperconverged  | grep hco-operator
hco-operator-bbdcbf74-cqvfd                           1/1     Running   18         3h17m

and so:

$ oc get csvs -n kubevirt-hyperconverged
NAME                                                   DISPLAY                                    VERSION              REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.4.0-202101261208   KubeVirt HyperConverged Cluster Operator   1.4.0-202101261208   kubevirt-hyperconverged-operator.v1.3.0   Succeeded


Version-Release number of selected component (if applicable):
2.6.0

How reproducible:
Difficult to reproduce.

Steps to Reproduce:
1. We don't have a clear reproduction procedure.

Actual results:
We see:
status:
  conditions:
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T13:04:14Z"
    message: 'Error while reconciling: Internal error occurred: failed calling webhook "vssp.kb.io": Post "https://ssp-operator-service.kubevirt-hyperconverged.svc:9443/validate-ssp-kubevirt-io-v1beta1-ssp?timeout=15s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2021-01-26T16:12:57Z"
    lastTransitionTime: "2021-01-26T15:43:29Z"
    message: Unknown Status
    status: Unknown
    type: Available

but at the same time:
hco-operator-bbdcbf74-cqvfd                           1/1

Expected results:
If the conditions are not positive, the HCO pod should not be ready.
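
The expected behaviour can be sketched as a rule that derives readiness from the reported conditions. This is only an illustration in Python, not the HCO implementation (the actual fix is in the pull requests linked above). The function name `should_be_ready` and the table of healthy statuses are assumptions: ReconcileComplete and Available are taken to be healthy at "True", Progressing and Degraded at "False", and Upgradeable is ignored.

```python
# Sketch only: the healthy-status table and the function name are assumptions,
# not taken from the HCO source code.
EXPECTED_STATUS = {
    "ReconcileComplete": "True",
    "Available": "True",
    "Progressing": "False",
    "Degraded": "False",
}


def should_be_ready(conditions):
    """Return True only if every tracked condition is present with its healthy status."""
    reported = {c["type"]: c["status"] for c in conditions}
    # A missing or "Unknown" condition never matches, so it counts as not ready.
    return all(reported.get(ctype) == status for ctype, status in EXPECTED_STATUS.items())


# Conditions from this report, reduced to type and status.
reported_conditions = [
    {"type": "ReconcileComplete", "status": "False"},
    {"type": "Available", "status": "Unknown"},
    {"type": "Progressing", "status": "Unknown"},
    {"type": "Degraded", "status": "Unknown"},
    {"type": "Upgradeable", "status": "Unknown"},
]

print(should_be_ready(reported_conditions))  # False
```

With the conditions shown in this report the rule yields not ready, whereas the pod was reporting 1/1.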

Additional info:
We occasionally see it as a side effect of https://bugzilla.redhat.com/1907290: because of that bug, HCO is not able to create or update the CR for SSP, which is validated by a webhook configured (incorrectly) by the OLM.

This bug is not about fixing the underlying issue (https://bugzilla.redhat.com/1907290), only about properly communicating the status to the OLM, and therefore to the user, when we hit that issue.

Comment 2 Inbar Rose 2021-02-08 09:27:06 UTC

Verified upstream using unit tests.

Comment 5 errata-xmlrpc 2021-03-10 11:23:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799
