Bug 1868229 - [deploy cnv] hco operator runs then terminates in a loop
Summary: [deploy cnv] hco operator runs then terminates in a loop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 2.5.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 2.5.0
Assignee: Simone Tiraboschi
QA Contact: Tareq Alayan
URL:
Whiteboard:
Duplicates: 1867493
Depends On: 1868712 1876908
Blocks:
 
Reported: 2020-08-12 04:50 UTC by Tareq Alayan
Modified: 2020-11-17 13:24 UTC
CC: 6 users

Fixed In Version: hyperconverged-cluster-operator:v2.5.0-25
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-17 13:24:21 UTC
Target Upstream Version:
Embargoed:


Attachments
SSP operator logs (4.99 MB, text/plain)
2020-09-08 12:36 UTC, Denis Ollier


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt hyperconverged-cluster-operator pull 751 0 None closed Bump vm-import-operator to v0.2.0 and consume CRD v1beta1 2020-12-12 21:36:17 UTC
Github kubevirt hyperconverged-cluster-operator pull 765 0 None closed Bump vmimport v0.2.0 2020-12-12 21:36:17 UTC
Github kubevirt vm-import-operator pull 378 0 None closed Update types 2020-12-12 21:36:17 UTC
Github kubevirt vm-import-operator pull 383 0 None closed Permissions issue during HCO deployment 2020-12-12 21:36:45 UTC
Red Hat Bugzilla 1867493 0 urgent CLOSED [deploy cnv] olm failed to deploy the operator because of issue with vm import component 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2020:5127 0 None None None 2020-11-17 13:24:32 UTC

Description Tareq Alayan 2020-08-12 04:50:17 UTC
Description of problem:
The hco-operator pod is not staying in the Running state.
It keeps going into the Terminating state.


Version-Release number of selected component (if applicable):
2.5
IIB_IMAGE="registry-proxy.engineering.redhat.com/rh-osbs/iib:1785"

How reproducible:
always

Steps to Reproduce:
1. Deploy HCO

Additional info:
oc logs hco-operator-5b46c9c99b-5dbnz -n openshift-cnv
{"level":"info","ts":1597207467.414534,"logger":"cmd","msg":"Go Version: go1.13.4"}
{"level":"info","ts":1597207467.414588,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1597207467.4145932,"logger":"cmd","msg":"Version of operator-sdk: v0.17.0"}
{"level":"info","ts":1597207467.4152842,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1597207470.3352704,"logger":"leader","msg":"Found existing lock with my name. I was likely restarted."}
{"level":"info","ts":1597207470.335304,"logger":"leader","msg":"Continuing as the leader."}
{"level":"info","ts":1597207473.2450678,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"0.0.0.0:8383"}
{"level":"info","ts":1597207473.2678356,"logger":"cmd","msg":"Cluster type = openshift"}
{"level":"info","ts":1597207473.2740977,"logger":"cmd","msg":"Found Pod","Pod.Namespace":"openshift-cnv","Pod.Name":"hco-operator-5b46c9c99b-5dbnz"}
{"level":"error","ts":1597207473.2922847,"logger":"cmd","msg":"Failed to get HCO CSV","error":"no kind is registered for the type v1alpha1.ClusterServiceVersion in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:67\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubevirt/hyperconverged-cluster-operator/pkg/util.GetCSVfromPod\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/pkg/util/util.go:122\ngithub.com/kubevirt/hyperconverged-cluster-operator/pkg/util.(*eventEmitter).UpdateClient\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/pkg/util/event_emmiter.go:69\ngithub.com/kubevirt/hyperconverged-cluster-operator/pkg/util.(*eventEmitter).Init\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/pkg/util/event_emmiter.go:39\nmain.main\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/cmd/hyperconverged-cluster-operator/main.go:169\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:203"}
{"level":"error","ts":1597207473.2924006,"logger":"cmd","msg":"Can't get CSV","error":"no kind is registered for the type v1alpha1.ClusterServiceVersion in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:67\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubevirt/hyperconverged-cluster-operator/pkg/util.(*eventEmitter).UpdateClient\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/pkg/util/event_emmiter.go:71\ngithub.com/kubevirt/hyperconverged-cluster-operator/pkg/util.(*eventEmitter).Init\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/pkg/util/event_emmiter.go:39\nmain.main\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/cmd/hyperconverged-cluster-operator/main.go:169\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:203"}
{"level":"info","ts":1597207473.2924278,"logger":"cmd","msg":"Registering Components."}

Comment 1 Lukas Bednar 2020-08-12 06:20:10 UTC
I also see the following log for HCO, with the same symptoms (HCO is failing and looping between the creating and terminating states):

{"level":"error","ts":1597212472.774559,"logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"VMImportConfig.v2v.kubevirt.io","error":"no matches for kind \"VMImportConfig\" in version \"v2v.kubevirt.io/v1alpha1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:104\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:165\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElectionRunnables.func1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:473"}
{"level":"error","ts":1597212472.7748973,"logger":"cmd","msg":"Manager exited non-zero","error":"no matches for kind \"VMImportConfig\" in version \"v2v.kubevirt.io/v1alpha1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nmain.main\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/cmd/hyperconverged-cluster-operator/main.go:246\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:203"}

Comment 2 Simone Tiraboschi 2020-08-12 08:08:39 UTC
It's a side effect of bug 1867493:
the vm-import-operator CRD should also keep the v1alpha1 version for backward compatibility.

Meanwhile, the HCO code is still using v1alpha1 of vm-import-operator:
https://github.com/kubevirt/hyperconverged-cluster-operator/tree/master/vendor/github.com/kubevirt/vm-import-operator

because we are pinned to v0.1.0, which is the latest version available there.
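To keep serving v1alpha1 for backward compatibility, the vm-import-operator CRD would need to list both versions. A rough sketch of what that looks like (the group and kind mirror vm-import-operator; the schema and other field values are assumptions, not the actual CRD):

```yaml
# Illustrative sketch only: a CRD serving two API versions at once.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: vmimportconfigs.v2v.kubevirt.io
spec:
  group: v2v.kubevirt.io
  names:
    kind: VMImportConfig
    listKind: VMImportConfigList
    plural: vmimportconfigs
    singular: vmimportconfig
  scope: Cluster
  versions:
  - name: v1beta1
    served: true
    storage: true    # exactly one version may be the storage version
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v1alpha1   # kept served so older clients (like the pinned HCO vendoring) still work
    served: true
    storage: false
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
```

With only v1beta1 served, any client still requesting v1alpha1 gets exactly the "no matches for kind ... in version v2v.kubevirt.io/v1alpha1" error seen in comment 1.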

Piotr, can you please create an upstream pre/rc release so that we can move forward?

Comment 3 Piotr Kliczewski 2020-08-12 10:36:29 UTC
@Simone here is the PR -> https://github.com/kubevirt/vm-import-operator/pull/370

We are about to merge it

Comment 4 Simone Tiraboschi 2020-08-17 10:46:02 UTC
Upstream v0.2.0 still seems broken.

Comment 5 Piotr Kliczewski 2020-08-17 11:11:32 UTC
I see a "VMimport is not 'Available'" message. Is this related to the API version bump?

Comment 6 Simone Tiraboschi 2020-08-17 16:06:08 UTC
(In reply to Piotr Kliczewski from comment #5)
> I see "VMimport is not 'Available'","Request.Namespace" message. Is this
> related to api version bump?

In HCO logs I see:
{"level":"info","ts":1597655572.3726707,"logger":"controller_hyperconverged","msg":"VM import exists","Request.Namespace":"kubevirt-hyperconverged","Request.Name":"kubevirt-hyperconverged","vmImport.Namespace":"","vmImport.Name":"vmimport-kubevirt-hyperconverged"}
{"level":"info","ts":1597655572.37268,"logger":"controller_hyperconverged","msg":"VMimport's resource is not reporting Conditions on it's Status","Request.Namespace":"kubevirt-hyperconverged","Request.Name":"kubevirt-hyperconverged"}

I'll try to reproduce it locally to better understand the root cause.

Comment 7 Simone Tiraboschi 2020-08-17 19:02:57 UTC
The v1beta1 vm-import-operator CR is completely missing its .status:

+ oc get -n kubevirt-hyperconverged VMImportConfig vmimport-kubevirt-hyperconverged -o yaml
apiVersion: v2v.kubevirt.io/v1beta1
kind: VMImportConfig
metadata:
  creationTimestamp: "2020-08-17T17:36:45Z"
  generation: 1
  labels:
    app: kubevirt-hyperconverged
  managedFields:
  - apiVersion: v2v.kubevirt.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"d4d4640a-ac09-40cc-8215-6b444cb76e8e"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec: {}
      f:status: {}
    manager: hyperconverged-cluster-operator
    operation: Update
    time: "2020-08-17T17:36:45Z"
  name: vmimport-kubevirt-hyperconverged
  ownerReferences:
  - apiVersion: hco.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: HyperConverged
    name: kubevirt-hyperconverged
    uid: d4d4640a-ac09-40cc-8215-6b444cb76e8e
  resourceVersion: "35427"
  selfLink: /apis/v2v.kubevirt.io/v1beta1/vmimportconfigs/vmimport-kubevirt-hyperconverged
  uid: faf13a16-8c84-4a32-8393-540f74b71cb0
spec: {}
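For comparison, a healthy CR would be expected to report a populated status stanza with conditions, roughly like the following (a hypothetical example; the exact condition types and values come from the operator, not from a real cluster):

```yaml
# Hypothetical: what .status should look like once the operator reconciles the CR.
status:
  conditions:
  - type: Available
    status: "True"
    reason: OperatorAvailable
    lastTransitionTime: "2020-08-17T17:40:00Z"
  - type: Progressing
    status: "False"
    reason: ReconcileCompleted
    lastTransitionTime: "2020-08-17T17:40:00Z"
```

Since .status is empty here, HCO has no conditions to inspect, which matches the "VMimport's resource is not reporting Conditions on it's Status" log line in comment 6.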

Comment 8 Piotr Kliczewski 2020-08-18 07:21:47 UTC
Yes, I see the issue as well. Investigating...

Comment 9 Inbar Rose 2020-08-19 12:59:27 UTC
*** Bug 1867493 has been marked as a duplicate of this bug. ***

Comment 10 Simone Tiraboschi 2020-08-24 16:48:30 UTC
Found another issue in vm-import-operator, we need also https://github.com/kubevirt/vm-import-operator/pull/383

Comment 11 Tareq Alayan 2020-08-30 10:52:32 UTC
trying with 2.5.0-124 rh-osbs/iib:5741

Failed: no matches for kind "OperatorSource" in version "operators.coreos.com/v1"


when applying:

apiVersion: operators.coreos.com/v1
kind: OperatorSource
metadata:
  name: kubevirt-hyperconverged
spec:
  registryNamespace: rh-verified-operators
  publisher: Red Hat

Comment 12 Piotr Kliczewski 2020-08-31 08:36:36 UTC
@Simone please take a look

Comment 13 Simone Tiraboschi 2020-08-31 10:44:34 UTC
OperatorSource was deprecated in OCP 4.5 and has probably already been removed in OCP 4.6: we should directly use a CatalogSource that points to the index image.
Oren is going to fix it in the kustomize template.

Please also notice that the index images built by CVP are currently not consumable on OCP 4.6; see https://bugzilla.redhat.com/show_bug.cgi?id=1871234#c18 for a temporary workaround.
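A minimal CatalogSource pointing directly at an index image would look roughly like this (the image reference reuses one from this bug; the name, namespace, and display fields are assumptions):

```yaml
# Sketch: replaces the deprecated OperatorSource with a grpc CatalogSource.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry-proxy.engineering.redhat.com/rh-osbs/iib:5741  # index image to consume
  displayName: KubeVirt HyperConverged
  publisher: Red Hat
```

OLM then serves the bundles from that index image without going through an AppRegistry.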

Comment 14 Denis Ollier 2020-08-31 13:59:40 UTC
When deploying from a catalog image, in addition to the CatalogSource, QE's deploy_kustomize.sh was unexpectedly creating an OperatorSource from an AppRegistry, although it shouldn't.

Since the OperatorSource API was removed with OCP 4.6.0-fc.1, I removed support for deploying from an AppRegistry in QE's deploy_kustomize.sh.

Comment 15 Denis Ollier 2020-08-31 16:36:55 UTC
For the record, I managed to deploy CNV 2.5 on an OCP *4.5* cluster using:

- CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:5809
- NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255

The die/restart loop of the hco-operator is still present.

Will retry on an OCP 4.6 cluster with the opm workaround.

Comment 16 Simone Tiraboschi 2020-08-31 17:06:43 UTC
(In reply to Denis Ollier from comment #15)
> For the record, I managed to deploy CNV 2.5 on an OCP *4.5* cluster using:
> 
> - CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:5809
> - NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255
> 
> The die/restart loop of the hco-operator is still present.

There is also a bug in POST against OLM about a similar issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1868712

As for this bug, HCO should now (sooner or later) reach ready status,
but in the meantime OLM will still kill and restart it more than once.

Comment 17 Denis Ollier 2020-09-01 08:24:11 UTC
(In reply to Simone Tiraboschi from comment #16)
> 
> As for this bug, HCO should now (sooner or later) reach ready status,
> but in the meantime OLM will still kill and restart it more than once.

Even after a whole night, HCO is still dying/restarting in a loop.

Comment 18 Denis Ollier 2020-09-01 11:17:49 UTC
With the opm workaround, I finally managed to deploy CNV 2.5 on an OCP 4.6 cluster using:

- CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:6136 (hco-bundle-registry:v2.5.0-135)
- NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255

I don't see the die/restart loop of the hco-operator anymore.

Comment 19 Denis Ollier 2020-09-01 19:23:11 UTC
With registry-proxy.engineering.redhat.com/rh-osbs/iib:6196 the issue is happening again.

Note that the CSV seems older with iib:6196 than with iib:6136.

- OK: registry-proxy.engineering.redhat.com/rh-osbs/iib:6136 (hco-bundle-registry:v2.5.0-135, CSV createdAt: 2020-09-01 07:44:46)
- Not OK: registry-proxy.engineering.redhat.com/rh-osbs/iib:6196 (hco-bundle-registry:v2.5.0-???, CSV createdAt: 2020-08-28 14:53:33)

Comment 20 Denis Ollier 2020-09-08 11:54:49 UTC
I deployed CNV 2.5.0 on an OCP 4.6.0-fc.4 cluster using registry-proxy.engineering.redhat.com/rh-osbs/iib:7848 (hco-bundle-registry:v2.5.0-160) and the issue is still present.

Comment 21 Denis Ollier 2020-09-08 12:20:37 UTC
Status of KubevirtMetricsAggregation CR:

> kubectl get kubevirtmetricsaggregations.ssp.kubevirt.io metrics-aggregation-kubevirt-hyperconverged -o yaml
> 
> apiVersion: ssp.kubevirt.io/v1
> kind: KubevirtMetricsAggregation
> metadata:
>   creationTimestamp: "2020-09-08T10:33:16Z"
>   generation: 1
>   labels:
>     app: kubevirt-hyperconverged
>   name: metrics-aggregation-kubevirt-hyperconverged
>   namespace: openshift-cnv
>   ownerReferences:
>   - apiVersion: hco.kubevirt.io/v1beta1
>     blockOwnerDeletion: true
>     controller: true
>     kind: HyperConverged
>     name: kubevirt-hyperconverged
>     uid: 10153ff9-90a5-454d-9d33-9791357156e6
>   resourceVersion: "94052"
>   selfLink: /apis/ssp.kubevirt.io/v1/namespaces/openshift-cnv/kubevirtmetricsaggregations/metrics-aggregation-kubevirt-hyperconverged
>   uid: 6fd008c3-c6dd-4550-be24-d3d88f7fc8d2
> spec: {}
> status:
>   conditions:
>   - ansibleResult:
>       changed: 2
>       completion: 2020-09-08T10:33:25.964362
>       failures: 0
>       ok: 4
>       skipped: 0
>     lastTransitionTime: "2020-09-08T10:33:16Z"
>     message: Awaiting next reconciliation
>     reason: Successful
>     status: "True"
>     type: Running
>   operatorVersion: v2.5.0
>   targetVersion: v2.5.0

Comment 22 Simone Tiraboschi 2020-09-08 12:25:02 UTC
The issue is that KubevirtMetricsAggregation is reporting Running=True but never reaches Available=True, so probably something is now stuck in the SSP operator.
Can you please also attach the logs of the SSP operator?
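As a rough illustration of the readiness check involved here (a sketch in Python, not HCO's actual Go code; condition shapes follow the CR dump in comment 21):

```python
def is_component_ready(conditions):
    """Return True only when the component reports Available=True.

    A component that only reports Running=True (as KubevirtMetricsAggregation
    does above) is treated as not ready, which keeps HCO from ever reaching
    its own ready state and lets OLM keep killing and restarting it.
    """
    by_type = {c["type"]: c["status"] for c in conditions}
    return by_type.get("Available") == "True"

# Conditions as reported in comment 21: Running=True, no Available condition.
stuck = [{"type": "Running", "status": "True"}]
ready = [{"type": "Running", "status": "True"},
         {"type": "Available", "status": "True"}]

print(is_component_ready(stuck))   # False: never becomes Available
print(is_component_ready(ready))   # True
```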

Comment 23 Denis Ollier 2020-09-08 12:27:37 UTC
Relevant logs:

> TASK [Inject owner references for KubevirtNodeLabellerBundle] ******************************** 
> fatal: [localhost]: FAILED! => {"msg": "template error while templating string: no filter named 'k8s_inject_ownership'. String: {{ objects | k8s_inject_ownership(cr_info) }}"}

=> template error while templating string: no filter named 'k8s_inject_ownership'

Comment 24 Denis Ollier 2020-09-08 12:29:25 UTC
Note that the node-maintenance-operator is also looping.

Comment 25 Denis Ollier 2020-09-08 12:36:03 UTC
Created attachment 1714098 [details]
SSP operator logs

SSP operator logs

Comment 26 Simone Tiraboschi 2020-09-08 12:55:35 UTC
(In reply to Denis Ollier from comment #24)
> Note that the node-maintenance-operator is also looping.

Yes, this is now also expected as a side effect of https://bugzilla.redhat.com/1868712 because NMO (>= 0.7.0) now also includes an OLM-based admission webhook.

Comment 27 Simone Tiraboschi 2020-09-21 11:58:25 UTC
https://bugzilla.redhat.com/1868712 is on MODIFIED and its bits can be consumed from OCP 4.6 nightly builds; moving this to ON_QA for further verification.

Comment 28 Tareq Alayan 2020-09-21 12:00:30 UTC
2.5 is deployable.
Verified on:
HCO: v2.5.0-209
HCO image: registry.redhat.io/container-native-virtualization/hyperconverged-cluster-operator@sha256:bec6349f6f98faae85fa7ee91c49c20522d2ce955e70e2d04e75e14822f2562d
CSV creation time: 2020-09-21 07:30:25

Comment 31 errata-xmlrpc 2020-11-17 13:24:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5127

