Bug 1868229
| Field | Value |
| --- | --- |
| Summary | [deploy cnv] HCO operator runs then terminates in a loop |
| Product | Container Native Virtualization (CNV) |
| Component | Installation |
| Version | 2.5.0 |
| Target Release | 2.5.0 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Keywords | AutomationBlocker, Regression |
| Reporter | Tareq Alayan <talayan> |
| Assignee | Simone Tiraboschi <stirabos> |
| QA Contact | Tareq Alayan <talayan> |
| CC | cnv-qe-bugs, dollierp, lbednar, ncredi, pkliczew, stirabos |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | hyperconverged-cluster-operator:v2.5.0-25 |
| Type | Bug |
| Bug Depends On | 1868712, 1876908 |
| Last Closed | 2020-11-17 13:24:21 UTC |
Description (Tareq Alayan, 2020-08-12 04:50:17 UTC)
I also see the following log for HCO, with the same symptoms (HCO is failing and looping between creating and terminating states):

{"level":"error","ts":1597212472.774559,"logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"VMImportConfig.v2v.kubevirt.io","error":"no matches for kind \"VMImportConfig\" in version \"v2v.kubevirt.io/v1alpha1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:104\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:165\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElectionRunnables.func1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:473"}

{"level":"error","ts":1597212472.7748973,"logger":"cmd","msg":"Manager exited non-zero","error":"no matches for kind \"VMImportConfig\" in version \"v2v.kubevirt.io/v1alpha1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nmain.main\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/cmd/hyperconverged-cluster-operator/main.go:246\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:203"}

It's a side effect of bug 1867493: the vm-import-operator CRD should also keep the v1alpha1 version for backward compatibility. In the HCO code we are still using v1alpha1 of vm-import-operator (https://github.com/kubevirt/hyperconverged-cluster-operator/tree/master/vendor/github.com/kubevirt/vm-import-operator) because we are pinned to v0.1.0, which is the latest version available there. Piotr, can you please create an upstream pre/rc release so that we can move forward?

@Simone here is the PR -> https://github.com/kubevirt/vm-import-operator/pull/370 We are about to merge it

Upstream v0.2.0 still seems broken.

I see a "VMimport is not 'Available'","Request.Namespace" message. Is this related to the API version bump?

(In reply to Piotr Kliczewski from comment #5)
> I see "VMimport is not 'Available'","Request.Namespace" message. Is this
> related to api version bump?

In the HCO logs I see:

{"level":"info","ts":1597655572.3726707,"logger":"controller_hyperconverged","msg":"VM import exists","Request.Namespace":"kubevirt-hyperconverged","Request.Name":"kubevirt-hyperconverged","vmImport.Namespace":"","vmImport.Name":"vmimport-kubevirt-hyperconverged"}

{"level":"info","ts":1597655572.37268,"logger":"controller_hyperconverged","msg":"VMimport's resource is not reporting Conditions on it's Status","Request.Namespace":"kubevirt-hyperconverged","Request.Name":"kubevirt-hyperconverged"}

I'll try to reproduce it locally to better understand the root cause.
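The "no matches for kind" failure above can be illustrated with a simplified sketch of the version-matching the API machinery performs: a request for a GroupVersionKind only resolves if the CRD actually serves that version. This is not the real controller-runtime RESTMapper, just an illustrative model of why HCO's pinned v1alpha1 client fails against a CRD that only serves v1beta1.

```python
def find_kind(crd, group, version, kind):
    """Return True if the CRD serves the requested group/version/kind
    (simplified model of API discovery, not the real RESTMapper)."""
    if crd["spec"]["group"] != group or crd["spec"]["names"]["kind"] != kind:
        return False
    return any(v["name"] == version and v.get("served", False)
               for v in crd["spec"]["versions"])


# A CRD shaped like the post-1867493 vm-import-operator CRD,
# which no longer serves v1alpha1:
vmimportconfig_crd = {
    "spec": {
        "group": "v2v.kubevirt.io",
        "names": {"kind": "VMImportConfig"},
        "versions": [{"name": "v1beta1", "served": True}],
    }
}

# HCO, still vendoring vm-import-operator v0.1.0, asks for v1alpha1:
print(find_kind(vmimportconfig_crd, "v2v.kubevirt.io", "v1alpha1", "VMImportConfig"))  # False
print(find_kind(vmimportconfig_crd, "v2v.kubevirt.io", "v1beta1", "VMImportConfig"))   # True
```

When the lookup fails, controller-runtime refuses to start the watch and the manager exits, which is exactly the restart loop reported above.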
The vm-import-operator CR v1beta1 completely misses .status:

> + oc get -n kubevirt-hyperconverged VMImportConfig vmimport-kubevirt-hyperconverged -o yaml
> apiVersion: v2v.kubevirt.io/v1beta1
> kind: VMImportConfig
> metadata:
>   creationTimestamp: "2020-08-17T17:36:45Z"
>   generation: 1
>   labels:
>     app: kubevirt-hyperconverged
>   managedFields:
>   - apiVersion: v2v.kubevirt.io/v1beta1
>     fieldsType: FieldsV1
>     fieldsV1:
>       f:metadata:
>         f:labels:
>           .: {}
>           f:app: {}
>         f:ownerReferences:
>           .: {}
>           k:{"uid":"d4d4640a-ac09-40cc-8215-6b444cb76e8e"}:
>             .: {}
>             f:apiVersion: {}
>             f:blockOwnerDeletion: {}
>             f:controller: {}
>             f:kind: {}
>             f:name: {}
>             f:uid: {}
>       f:spec: {}
>       f:status: {}
>     manager: hyperconverged-cluster-operator
>     operation: Update
>     time: "2020-08-17T17:36:45Z"
>   name: vmimport-kubevirt-hyperconverged
>   ownerReferences:
>   - apiVersion: hco.kubevirt.io/v1beta1
>     blockOwnerDeletion: true
>     controller: true
>     kind: HyperConverged
>     name: kubevirt-hyperconverged
>     uid: d4d4640a-ac09-40cc-8215-6b444cb76e8e
>   resourceVersion: "35427"
>   selfLink: /apis/v2v.kubevirt.io/v1beta1/vmimportconfigs/vmimport-kubevirt-hyperconverged
>   uid: faf13a16-8c84-4a32-8393-540f74b71cb0
> spec: {}

Yes, I see the issue as well. Investigating...

*** Bug 1867493 has been marked as a duplicate of this bug. ***

Found another issue in vm-import-operator; we also need https://github.com/kubevirt/vm-import-operator/pull/383

Trying with 2.5.0-124 (rh-osbs/iib:5741). Failed with: no matches for kind "OperatorSource" in version "operators.coreos.com/v1" when applying:

> apiVersion: operators.coreos.com/v1
> kind: OperatorSource
> metadata:
>   name: kubevirt-hyperconverged
> spec:
>   registryNamespace: rh-verified-operators
>   publisher: Red Hat

@Simone please take a look

OperatorSource got deprecated in OCP 4.5 and was probably already removed in OCP 4.6: we should directly use a CatalogSource that points to the index image. Oren is going to fix it in the kustomize template.
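As a sketch of the replacement Simone describes, a CatalogSource pointing directly at one of the index images from this thread would look roughly like the manifest below. The metadata.name, namespace, and displayName here are illustrative placeholders, not the exact manifest added to the kustomize template; only the general CatalogSource mechanism (sourceType: grpc plus spec.image) is standard OLM usage.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: kubevirt-hyperconverged   # placeholder name
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  # Index image consumed directly, replacing the removed OperatorSource/AppRegistry path:
  image: registry-proxy.engineering.redhat.com/rh-osbs/iib:6136
  displayName: OpenShift Virtualization Index   # illustrative
  publisher: Red Hat
```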
Then please notice that the index images built by CVP are currently not consumable on OCP 4.6; see https://bugzilla.redhat.com/show_bug.cgi?id=1871234#c18 for a temporary workaround.

When deploying from a catalog image, in addition to the CatalogSource, QE's deploy_kustomize.sh was unexpectedly creating an OperatorSource from an AppRegistry although it shouldn't. Since the OperatorSource API got removed with OCP 4.6.0-fc.1, I removed support for deploying from AppRegistry in QE's deploy_kustomize.sh.

For the record, I managed to deploy CNV 2.5 on an OCP *4.5* cluster using:

- CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:5809
- NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255

The die/restart loop of the hco-operator is still present. Will retry on an OCP 4.6 cluster with the opm workaround.

(In reply to Denis Ollier from comment #15)
> For the record, I managed to deploy CNV 2.5 on an OCP *4.5* cluster using:
>
> - CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:5809
> - NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255
>
> The die/restart loop of the hco-operator is still present.

There is also a bug on POST against OLM about a similar issue: https://bugzilla.redhat.com/show_bug.cgi?id=1868712

As for this bug, HCO should now (sooner or later) reach ready status, but in the meantime OLM will still kill and restart it more than once.

(In reply to Simone Tiraboschi from comment #16)
> As for this bug, HCO should now (sooner or later) reach ready status,
> but in the meantime OLM will still kill and restart it more than once.

Even after a whole night, HCO is still dying/restarting in a loop.
With the opm workaround, I finally managed to deploy CNV 2.5 on an OCP 4.6 cluster using:

- CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:6136 (hco-bundle-registry:v2.5.0-135)
- NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255

I don't see the die/restart loop of the hco-operator anymore.

With registry-proxy.engineering.redhat.com/rh-osbs/iib:6196 the issue is happening again. Note that the CSV seems older with iib:6196 than with iib:6136:

- OK: registry-proxy.engineering.redhat.com/rh-osbs/iib:6136 (hco-bundle-registry:v2.5.0-135, CSV createdAt: 2020-09-01 07:44:46)
- Not OK: registry-proxy.engineering.redhat.com/rh-osbs/iib:6196 (hco-bundle-registry:v2.5.0-???, CSV createdAt: 2020-08-28 14:53:33)

I deployed CNV 2.5.0 on an OCP 4.6.0-fc.4 cluster using registry-proxy.engineering.redhat.com/rh-osbs/iib:7848 (hco-bundle-registry:v2.5.0-160) and the issue is still present. Status of the KubevirtMetricsAggregation CR:
> kubectl get kubevirtmetricsaggregations.ssp.kubevirt.io metrics-aggregation-kubevirt-hyperconverged -o yaml
>
> apiVersion: ssp.kubevirt.io/v1
> kind: KubevirtMetricsAggregation
> metadata:
>   creationTimestamp: "2020-09-08T10:33:16Z"
>   generation: 1
>   labels:
>     app: kubevirt-hyperconverged
>   name: metrics-aggregation-kubevirt-hyperconverged
>   namespace: openshift-cnv
>   ownerReferences:
>   - apiVersion: hco.kubevirt.io/v1beta1
>     blockOwnerDeletion: true
>     controller: true
>     kind: HyperConverged
>     name: kubevirt-hyperconverged
>     uid: 10153ff9-90a5-454d-9d33-9791357156e6
>   resourceVersion: "94052"
>   selfLink: /apis/ssp.kubevirt.io/v1/namespaces/openshift-cnv/kubevirtmetricsaggregations/metrics-aggregation-kubevirt-hyperconverged
>   uid: 6fd008c3-c6dd-4550-be24-d3d88f7fc8d2
> spec: {}
> status:
>   conditions:
>   - ansibleResult:
>       changed: 2
>       completion: 2020-09-08T10:33:25.964362
>       failures: 0
>       ok: 4
>       skipped: 0
>     lastTransitionTime: "2020-09-08T10:33:16Z"
>     message: Awaiting next reconciliation
>     reason: Successful
>     status: "True"
>     type: Running
>   operatorVersion: v2.5.0
>   targetVersion: v2.5.0
The issue is that KubevirtMetricsAggregation is reporting Running=True and never gets to Available=True, so probably something is now stuck in the SSP operator. Can you please also attach the SSP operator logs?

Relevant logs:
> TASK [Inject owner references for KubevirtNodeLabellerBundle] ********************************
> fatal: [localhost]: FAILED! => {"msg": "template error while templating string: no filter named 'k8s_inject_ownership'. String: {{ objects | k8s_inject_ownership(cr_info) }}"}
=> template error while templating string: no filter named 'k8s_inject_ownership'
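The readiness pattern described above (a component stuck on Running=True without ever reporting Available=True) can be sketched with a small condition check. The function name here is illustrative, not HCO's actual code, but it models why HCO never reaches ready while the SSP operator's ansible task keeps failing.

```python
def is_component_available(conditions):
    """A component CR counts as ready only when it reports an
    Available condition with status "True" (illustrative model)."""
    return any(c.get("type") == "Available" and c.get("status") == "True"
               for c in conditions)


# The KubevirtMetricsAggregation CR above only carries Running=True,
# so the readiness check never passes:
print(is_component_available([{"type": "Running", "status": "True"}]))    # False

# Once the SSP operator's reconcile succeeds, the CR should also report:
print(is_component_available([{"type": "Running", "status": "True"},
                              {"type": "Available", "status": "True"}]))  # True
```

With the `k8s_inject_ownership` filter missing, the ansible task above fails on every reconcile, the Available condition is never set, and HCO keeps waiting.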
Note that the node-maintenance-operator is also looping.

Created attachment 1714098 [details]: SSP operator logs
(In reply to Denis Ollier from comment #24)
> Note that the node-maintenance-operator is also looping.

Yes, this is now also expected as a side effect of https://bugzilla.redhat.com/1868712, because now (NMO >= 0.7.0) NMO also includes an OLM-based admission webhook.

https://bugzilla.redhat.com/1868712 was on MODIFIED and its bits can be consumed from OCP 4.6 nightly builds; moving this to ON_QA for further verification.

2.5 is deployable. Verified on HCO v2.5.0-209:

- HCO image: registry.redhat.io/container-native-virtualization/hyperconverged-cluster-operator@sha256:bec6349f6f98faae85fa7ee91c49c20522d2ce955e70e2d04e75e14822f2562d
- CSV creation time: 2020-09-21 07:30:25

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5127