Bug 1868229
| Field | Value |
| --- | --- |
| Summary | [deploy cnv] HCO operator runs then terminates in a loop |
| Product | Container Native Virtualization (CNV) |
| Component | Installation |
| Version | 2.5.0 |
| Target Release | 2.5.0 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Keywords | AutomationBlocker, Regression |
| Reporter | Tareq Alayan <talayan> |
| Assignee | Simone Tiraboschi <stirabos> |
| QA Contact | Tareq Alayan <talayan> |
| CC | cnv-qe-bugs, dollierp, lbednar, ncredi, pkliczew, stirabos |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | hyperconverged-cluster-operator:v2.5.0-25 |
| Type | Bug |
| Bug Depends On | 1868712, 1876908 |
| Last Closed | 2020-11-17 13:24:21 UTC |
Description (Tareq Alayan, 2020-08-12 04:50:17 UTC)
I also see the following log for HCO, with the same symptoms (HCO is failing and looping between creating and terminating states):

{"level":"error","ts":1597212472.774559,"logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"VMImportConfig.v2v.kubevirt.io","error":"no matches for kind \"VMImportConfig\" in version \"v2v.kubevirt.io/v1alpha1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:104\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:165\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElectionRunnables.func1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:473"}

{"level":"error","ts":1597212472.7748973,"logger":"cmd","msg":"Manager exited non-zero","error":"no matches for kind \"VMImportConfig\" in version \"v2v.kubevirt.io/v1alpha1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nmain.main\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/cmd/hyperconverged-cluster-operator/main.go:246\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:203"}

It's a side effect of bug 1867493: the vm-import-operator CRD should also keep the v1alpha1 version for backward compatibility. In the HCO code we are still using v1alpha1 of vm-import-operator (https://github.com/kubevirt/hyperconverged-cluster-operator/tree/master/vendor/github.com/kubevirt/vm-import-operator) because we are pinned to v0.1.0, which is the latest version available there. Piotr, can you please create an upstream pre/rc release so that we can move forward?

@Simone here is the PR -> https://github.com/kubevirt/vm-import-operator/pull/370 We are about to merge it

Upstream v0.2.0 still seems broken.

I see a "VMimport is not 'Available'","Request.Namespace" message. Is this related to the API version bump?

(In reply to Piotr Kliczewski from comment #5)
> I see "VMimport is not 'Available'","Request.Namespace" message. Is this
> related to api version bump?

In the HCO logs I see:

{"level":"info","ts":1597655572.3726707,"logger":"controller_hyperconverged","msg":"VM import exists","Request.Namespace":"kubevirt-hyperconverged","Request.Name":"kubevirt-hyperconverged","vmImport.Namespace":"","vmImport.Name":"vmimport-kubevirt-hyperconverged"}

{"level":"info","ts":1597655572.37268,"logger":"controller_hyperconverged","msg":"VMimport's resource is not reporting Conditions on it's Status","Request.Namespace":"kubevirt-hyperconverged","Request.Name":"kubevirt-hyperconverged"}

I'll try to reproduce it locally to better understand the root cause.
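The "no matches for kind" failure above can be illustrated with a simplified sketch of the version-matching the API machinery performs: a request for a GroupVersionKind only resolves if the CRD actually serves that version. This is not the real controller-runtime RESTMapper, just an illustrative model of why HCO's pinned v1alpha1 client fails against a CRD that only serves v1beta1.

```python
def find_kind(crd, group, version, kind):
    """Return True if the CRD serves the requested group/version/kind
    (simplified model of API discovery, not the real RESTMapper)."""
    if crd["spec"]["group"] != group or crd["spec"]["names"]["kind"] != kind:
        return False
    return any(v["name"] == version and v.get("served", False)
               for v in crd["spec"]["versions"])


# A CRD shaped like the post-1867493 vm-import-operator CRD,
# which no longer serves v1alpha1:
vmimportconfig_crd = {
    "spec": {
        "group": "v2v.kubevirt.io",
        "names": {"kind": "VMImportConfig"},
        "versions": [{"name": "v1beta1", "served": True}],
    }
}

# HCO, still vendoring vm-import-operator v0.1.0, asks for v1alpha1:
print(find_kind(vmimportconfig_crd, "v2v.kubevirt.io", "v1alpha1", "VMImportConfig"))  # False
print(find_kind(vmimportconfig_crd, "v2v.kubevirt.io", "v1beta1", "VMImportConfig"))   # True
```

When the lookup fails, controller-runtime refuses to start the watch and the manager exits, which is exactly the restart loop reported above.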
The vm-import-operator CR v1beta1 completely misses .status:

> + oc get -n kubevirt-hyperconverged VMImportConfig vmimport-kubevirt-hyperconverged -o yaml
> apiVersion: v2v.kubevirt.io/v1beta1
> kind: VMImportConfig
> metadata:
>   creationTimestamp: "2020-08-17T17:36:45Z"
>   generation: 1
>   labels:
>     app: kubevirt-hyperconverged
>   managedFields:
>   - apiVersion: v2v.kubevirt.io/v1beta1
>     fieldsType: FieldsV1
>     fieldsV1:
>       f:metadata:
>         f:labels:
>           .: {}
>           f:app: {}
>         f:ownerReferences:
>           .: {}
>           k:{"uid":"d4d4640a-ac09-40cc-8215-6b444cb76e8e"}:
>             .: {}
>             f:apiVersion: {}
>             f:blockOwnerDeletion: {}
>             f:controller: {}
>             f:kind: {}
>             f:name: {}
>             f:uid: {}
>       f:spec: {}
>       f:status: {}
>     manager: hyperconverged-cluster-operator
>     operation: Update
>     time: "2020-08-17T17:36:45Z"
>   name: vmimport-kubevirt-hyperconverged
>   ownerReferences:
>   - apiVersion: hco.kubevirt.io/v1beta1
>     blockOwnerDeletion: true
>     controller: true
>     kind: HyperConverged
>     name: kubevirt-hyperconverged
>     uid: d4d4640a-ac09-40cc-8215-6b444cb76e8e
>   resourceVersion: "35427"
>   selfLink: /apis/v2v.kubevirt.io/v1beta1/vmimportconfigs/vmimport-kubevirt-hyperconverged
>   uid: faf13a16-8c84-4a32-8393-540f74b71cb0
> spec: {}

Yes, I see the issue as well. Investigating...

*** Bug 1867493 has been marked as a duplicate of this bug. ***

Found another issue in vm-import-operator; we also need https://github.com/kubevirt/vm-import-operator/pull/383

Trying with 2.5.0-124 (rh-osbs/iib:5741). Failed with: no matches for kind "OperatorSource" in version "operators.coreos.com/v1" when applying:

> apiVersion: operators.coreos.com/v1
> kind: OperatorSource
> metadata:
>   name: kubevirt-hyperconverged
> spec:
>   registryNamespace: rh-verified-operators
>   publisher: Red Hat

@Simone please take a look

OperatorSource got deprecated in OCP 4.5 and was probably already removed in OCP 4.6: we should directly use a CatalogSource that points to the index image. Oren is going to fix it in the kustomize template.
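As a sketch of the replacement Simone describes, a CatalogSource pointing directly at one of the index images from this thread would look roughly like the manifest below. The metadata.name, namespace, and displayName here are illustrative placeholders, not the exact manifest added to the kustomize template; only the general CatalogSource mechanism (sourceType: grpc plus spec.image) is standard OLM usage.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: kubevirt-hyperconverged   # placeholder name
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  # Index image consumed directly, replacing the removed OperatorSource/AppRegistry path:
  image: registry-proxy.engineering.redhat.com/rh-osbs/iib:6136
  displayName: OpenShift Virtualization Index   # illustrative
  publisher: Red Hat
```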
Then please notice that the index images built by CVP are currently not consumable on OCP 4.6; see https://bugzilla.redhat.com/show_bug.cgi?id=1871234#c18 for a temporary workaround.

When deploying from a catalog image, in addition to the CatalogSource, QE's deploy_kustomize.sh was unexpectedly creating an OperatorSource from an AppRegistry although it shouldn't. Since the OperatorSource API got removed with OCP 4.6.0-fc.1, I removed support for deploying from AppRegistry in QE's deploy_kustomize.sh.

For the record, I managed to deploy CNV 2.5 on an OCP *4.5* cluster using:

- CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:5809
- NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255

The die/restart loop of the hco-operator is still present. Will retry on an OCP 4.6 cluster with the opm workaround.

(In reply to Denis Ollier from comment #15)
> For the record, I managed to deploy CNV 2.5 on an OCP *4.5* cluster using:
>
> - CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:5809
> - NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255
>
> The die/restart loop of the hco-operator is still present.

There is also a bug on POST against OLM about a similar issue: https://bugzilla.redhat.com/show_bug.cgi?id=1868712

As for this bug, HCO should now (sooner or later) reach ready status, but in the meantime OLM will still kill and restart it more than once.

(In reply to Simone Tiraboschi from comment #16)
> As for this bug, HCO should now (sooner or later) reach ready status,
> but in the meantime OLM will still kill and restart it more than once.

Even after a whole night, HCO is still dying/restarting in a loop.
With the opm workaround, I finally managed to deploy CNV 2.5 on an OCP 4.6 cluster using:

- CNV 2.5 from registry-proxy.engineering.redhat.com/rh-osbs/iib:6136 (hco-bundle-registry:v2.5.0-135)
- NMO 4.6 from registry-proxy.engineering.redhat.com/rh-osbs/iib:4255

I don't see the die/restart loop of the hco-operator anymore.

With registry-proxy.engineering.redhat.com/rh-osbs/iib:6196 the issue is happening again. Note that the CSV seems older with iib:6196 than with iib:6136:

- OK: registry-proxy.engineering.redhat.com/rh-osbs/iib:6136 (hco-bundle-registry:v2.5.0-135, CSV createdAt: 2020-09-01 07:44:46)
- Not OK: registry-proxy.engineering.redhat.com/rh-osbs/iib:6196 (hco-bundle-registry:v2.5.0-???, CSV createdAt: 2020-08-28 14:53:33)

I deployed CNV 2.5.0 on an OCP 4.6.0-fc.4 cluster using registry-proxy.engineering.redhat.com/rh-osbs/iib:7848 (hco-bundle-registry:v2.5.0-160) and the issue is still present. Status of the KubevirtMetricsAggregation CR:
> kubectl get kubevirtmetricsaggregations.ssp.kubevirt.io metrics-aggregation-kubevirt-hyperconverged -o yaml
>
> apiVersion: ssp.kubevirt.io/v1
> kind: KubevirtMetricsAggregation
> metadata:
>   creationTimestamp: "2020-09-08T10:33:16Z"
>   generation: 1
>   labels:
>     app: kubevirt-hyperconverged
>   name: metrics-aggregation-kubevirt-hyperconverged
>   namespace: openshift-cnv
>   ownerReferences:
>   - apiVersion: hco.kubevirt.io/v1beta1
>     blockOwnerDeletion: true
>     controller: true
>     kind: HyperConverged
>     name: kubevirt-hyperconverged
>     uid: 10153ff9-90a5-454d-9d33-9791357156e6
>   resourceVersion: "94052"
>   selfLink: /apis/ssp.kubevirt.io/v1/namespaces/openshift-cnv/kubevirtmetricsaggregations/metrics-aggregation-kubevirt-hyperconverged
>   uid: 6fd008c3-c6dd-4550-be24-d3d88f7fc8d2
> spec: {}
> status:
>   conditions:
>   - ansibleResult:
>       changed: 2
>       completion: 2020-09-08T10:33:25.964362
>       failures: 0
>       ok: 4
>       skipped: 0
>     lastTransitionTime: "2020-09-08T10:33:16Z"
>     message: Awaiting next reconciliation
>     reason: Successful
>     status: "True"
>     type: Running
>   operatorVersion: v2.5.0
>   targetVersion: v2.5.0
The issue is that KubevirtMetricsAggregation is reporting Running=True and never gets to Available=True, so probably something is now stuck in the SSP operator. Can you please also attach the SSP operator logs?

Relevant logs:
> TASK [Inject owner references for KubevirtNodeLabellerBundle] ********************************
> fatal: [localhost]: FAILED! => {"msg": "template error while templating string: no filter named 'k8s_inject_ownership'. String: {{ objects | k8s_inject_ownership(cr_info) }}"}
=> template error while templating string: no filter named 'k8s_inject_ownership'
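The readiness pattern described above (a component stuck on Running=True without ever reporting Available=True) can be sketched with a small condition check. The function name here is illustrative, not HCO's actual code, but it models why HCO never reaches ready while the SSP operator's ansible task keeps failing.

```python
def is_component_available(conditions):
    """A component CR counts as ready only when it reports an
    Available condition with status "True" (illustrative model)."""
    return any(c.get("type") == "Available" and c.get("status") == "True"
               for c in conditions)


# The KubevirtMetricsAggregation CR above only carries Running=True,
# so the readiness check never passes:
print(is_component_available([{"type": "Running", "status": "True"}]))    # False

# Once the SSP operator's reconcile succeeds, the CR should also report:
print(is_component_available([{"type": "Running", "status": "True"},
                              {"type": "Available", "status": "True"}]))  # True
```

With the `k8s_inject_ownership` filter missing, the ansible task above fails on every reconcile, the Available condition is never set, and HCO keeps waiting.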
Note that the node-maintenance-operator is also looping.

Created attachment 1714098 [details]: SSP operator logs
(In reply to Denis Ollier from comment #24)
> Note that the node-maintenance-operator is also looping.

Yes, this is now also expected as a side effect of https://bugzilla.redhat.com/1868712, because now (NMO >= 0.7.0) NMO also includes an OLM-based admission webhook.

https://bugzilla.redhat.com/1868712 was on MODIFIED and its bits can be consumed from OCP 4.6 nightly builds; moving this to ON_QA for further verification.

2.5 is deployable. Verified on HCO v2.5.0-209:

- HCO image: registry.redhat.io/container-native-virtualization/hyperconverged-cluster-operator@sha256:bec6349f6f98faae85fa7ee91c49c20522d2ce955e70e2d04e75e14822f2562d
- CSV creation time: 2020-09-21 07:30:25

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5127