Bug 2082912

Summary: [CNV-4.11] HCO Being Unable to Reconcile State
Product: Container Native Virtualization (CNV) Reporter: Mor Cohen <mocohen>
Component: SSPAssignee: Andrej Krejcir <akrejcir>
Status: CLOSED ERRATA QA Contact: Geetika Kapoor <gkapoor>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.11.0CC: akrejcir, bkorren, cnv-qe-bugs, dbasunag, dholler, kmajcher, stirabos, ycui
Target Milestone: ---Keywords: AutomationBlocker, Regression, Reopened, TestBlocker
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kubevirt-ssp-operator-container-v4.11.0-43 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2089661 (view as bug list) Environment:
Last Closed: 2022-09-14 19:31:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2089661    

Description Mor Cohen 2022-05-08 13:25:56 UTC
Description of problem:
When installing CNV: v4.11.0-304, v4.11.0-306, and v4.11.0-307 HCO is failing to reconcile its state.

Version-Release number of selected component (if applicable):
CNV: v4.11.0-304
CNV: v4.11.0-307

How reproducible:
100

Steps to Reproduce:
1. Install CNV.
2. Check HCO.
3.

More Info:

HCO Condition Message:
      KubeVirt is degraded: An error occurred during deployment: unable to create ClusterRole &ClusterRole{ObjectMeta:{kubevirt.io:admin      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[app.kubernetes.io/component:compute app.kubernetes.io/managed-by:virt-operator app.kubernetes.io/part-of:hyperconverged-cluster app.kubernetes.io/version:4.11.0 kubevirt.io: rbac.authorization.k8s.io/aggregate-to-admin:true] map[kubevirt.io/generation:1 kubevirt.io/install-strategy-identifier:4436e2203a4eb79f2547251abb4dc86dd6521959 kubevirt.io/install-strategy-registry:registry.redhat.io/container-native-virtualization kubevirt.io/install-strategy-version:sha256:af751d79e58a5d37d6510670970bcd81d6a7e384dc7441b27d3c2a8307e6e0a0] [] []  []},Rules:[]PolicyRule{PolicyRule{Verbs:[get],APIGroups:[subresources.kubevirt.io],Resources:[virtualmachineinstances/console virtualmachineinstances/vnc virtualmachineinstances/guestosinfo virtualmachineinstances/filesystemlist virtualmachineinstances/userlist],ResourceNames:[],NonResourceURLs:[],},PolicyRule{Verbs:[update],APIGroups:[subresources.kubevirt.io],Resources:[virtualmachineinstances/pause virtualmachineinstances/unpause virtualmachineinstances/addvolume virtualmachineinstances/removevolume virtualmachineinstances/freeze virtualmachineinstances/unfreeze virtualmachineinstances/softreboot virtualmachineinstances/portforward],ResourceNames:[],NonResourceURLs:[],},PolicyRule{Verbs:[update],APIGroups:[subresources.kubevirt.io],Resources:[virtualmachines/start virtualmachines/stop virtualmachines/restart virtualmachines/addvolume virtualmachines/removevolume],ResourceNames:[],NonResourceURLs:[],},PolicyRule{Verbs:[get delete create update patch list watch deletecollection],APIGroups:[kubevirt.io],Resources:[virtualmachines virtualmachineinstances virtualmachineinstancepresets virtualmachineinstancereplicasets virtualmachineinstancemigrations],ResourceNames:[],NonResourceURLs:[],},PolicyRule{Verbs:[get delete create update patch list watch deletecollection],APIGroups:[snapshot.kubevirt.io],Resources:[virtualmachinesnapshots virtualmachinesnapshotcontents virtualmachinerestores],ResourceNames:[],NonResourceURLs:[],},PolicyRule{Verbs:[get delete create update patch list watch deletecollection],APIGroups:[flavor.kubevirt.io],Resources:[virtualmachineflavors virtualmachineclusterflavors],ResourceNames:[],NonResourceURLs:[],},PolicyRule{Verbs:[get delete create update patch list watch deletecollection],APIGroups:[pool.kubevirt.io],Resources:[virtualmachinepools],ResourceNames:[],NonResourceURLs:[],},PolicyRule{Verbs:[get list watch],APIGroups:[migrations.kubevirt.io],Resources:[migrationpolicies],ResourceNames:[],NonResourceURLs:[],},},AggregationRule:nil,}: clusterroles.rbac.authorization.k8s.io "kubevirt.io:admin" is forbidden: user "system:serviceaccount:openshift-cnv:kubevirt-operator" (groups=["system:serviceaccounts" "system:serviceaccounts:openshift-cnv" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:

Logs from hco-operator pod:

{"level":"error","ts":1652016328.255472,"logger":"controller_hyperconverged","msg":"Failed to update HCO Status","Request.Namespace":"openshift-cnv","Request.Name":"kubevirt-hyperconverged","error":"Operation cannot be fulfilled on hyperconvergeds.hco.kubevirt.io \"kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/kubevirt/hyperconverged-cluster-operator/controllers/hyperconverged.(*ReconcileHyperConverged).updateHyperConverged\n\t/remote-source/app/controllers/hyperconverged/hyperconverged_controller.go:548\ngithub.com/kubevirt/hyperconverged-cluster-operator/controllers/hyperconverged.(*ReconcileHyperConverged).Reconcile\n\t/remote-source/app/controllers/hyperconverged/hyperconverged_controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}


{"level":"error","ts":1652016328.2555847,"logger":"controller.hyperconverged-controller","msg":"Reconciler error","name":"hco-controlled-cr-bda5152f-a88e-4ebc-92e1-0ad512c4bc7b","namespace":"openshift-cnv","error":"Operation cannot be fulfilled on hyperconvergeds.hco.kubevirt.io \"kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

Comment 1 Mor Cohen 2022-05-09 07:17:39 UTC
I get the same behavior on the following CBV builds:
v4.11.0-309
v4.11.0-310
v4.11.0-313

Comment 2 Simone Tiraboschi 2022-05-09 11:50:50 UTC
bundle v4.11.0-314 correctly passed the smoke test and bundle v4.11.0-315 is also correctly deployable

Comment 3 Debarati Basu-Nag 2022-05-09 13:57:47 UTC
We are hitting this issue both with 4.10.0->4.10.1 upgrade and fresh 4.11 install. 

==================
[cnv-qe-jenkins@verify411-8f8hz-executor ~]$ kubectl get csv -n openshift-cnv kubevirt-hyperconverged-operator.v4.11.0 -o json | jq ".spec.relatedImages" | grep ssp
    "image": "registry.redhat.io/container-native-virtualization/kubevirt-ssp-operator@sha256:9c2c478ed7ae920d4e7ea351bb393bccd77d68c3f355e7786952add9ce958c5b",
    "name": "registry.redhat.io/container-native-virtualization/kubevirt-ssp-operator:v4.11.0-29"
    "image": "registry.redhat.io/openshift4/ose-csi-livenessprobe@sha256:6578cc878df9c7339cc8207ae8cbe185361b1fcc8ed69da7ec0346bb94d4a588",
    "name": "registry.redhat.io/openshift4/ose-csi-livenessprobe:v4.10"
[cnv-qe-jenkins@verify411-8f8hz-executor ~]$

====================
[cnv-qe-jenkins@verify411-8f8hz-executor ~]$ kubectl get ssp ssp-kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status"
{
  "conditions": [
    {
      "lastHeartbeatTime": "2022-05-09T13:39:02Z",
      "lastTransitionTime": "2022-05-09T13:39:02Z",
      "message": "Reconciling SSP resources",
      "reason": "Available",
      "status": "False",
      "type": "Available"
    },
    {
      "lastHeartbeatTime": "2022-05-09T13:39:02Z",
      "lastTransitionTime": "2022-05-09T13:39:02Z",
      "message": "Reconciling SSP resources",
      "reason": "Progressing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastHeartbeatTime": "2022-05-09T13:39:02Z",
      "lastTransitionTime": "2022-05-09T13:39:02Z",
      "message": "Reconciling SSP resources",
      "reason": "Degraded",
      "status": "True",
      "type": "Degraded"
    }
  ],
  "observedGeneration": 1,
  "observedVersion": "4.11.0",
  "operatorVersion": "4.11.0",
  "phase": "Deploying",
  "targetVersion": "4.11.0"
}
[cnv-qe-jenkins@verify411-8f8hz-executor ~]$ 

HCO.status.versions shows
========================
 "versions": [
    {
      "name": "operator",
      "version": "4.11.0"
    }
  ]
========================

It is not very clear on what is causing this.

Will attach ssp and hco log.

Comment 6 Simone Tiraboschi 2022-05-09 14:40:14 UTC
{"level":"info","ts":1652102943.1696093,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652102947.14979,"logger":"controllers.SSP","msg":"Reconciling operands...","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652102950.0840895,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652102950.3263416,"logger":"controllers.SSP","msg":"Reconciling operands...","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
...
{"level":"info","ts":1652103640.047834,"logger":"controllers.SSP","msg":"Reconciling operands...","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652103642.159796,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652103642.3250394,"logger":"controllers.SSP","msg":"Reconciling operands...","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652103644.900461,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652103645.1145089,"logger":"controllers.SSP","msg":"Reconciling operands...","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652103647.6483648,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652103647.7232087,"logger":"controllers.SSP","msg":"Reconciling operands...","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}

SSP is definitively hot-looping here but the logs are not that explicative.

@dholler can you please take a look?

Comment 7 Andrej Krejcir 2022-05-10 10:53:21 UTC
I have tried enabling debug logging for SSP using steps:

- Scale HCO deployment to 0, so it does not reconcile SSP
- Modify SSP deployment by adding command line parameter "--zap-log-level debug"

But after SSP restarted, it was not looping anymore.

Then I tried scaling HCO back to 1, which reverted SSP to its original definition, but it is still not looping.


I will create a PR to add the reason for reconciliation into SSP log.

Comment 8 Mor Cohen 2022-05-11 07:42:54 UTC
Hello, just updating that I can see this "unable to create ClusterRole" issue again on CNV-v4.11.0-331.
So far, I didn't see this issue on later builds (332, 333).

Comment 9 Mor Cohen 2022-05-16 07:41:38 UTC
Hey,
Updating that I didn't see this issue for a while.

Can I close this bug?

Comment 10 Mor Cohen 2022-05-19 11:25:48 UTC
Hello again,
I'm seeing something similar:

SSP state is looping between Deploying and Deployed states for a long time until the reconciliation eventually completed.

some logs from the ssp-operator log:

{"level":"info","ts":1652957658.8573565,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/centos-7-image-cron"}
{"level":"info","ts":1652957658.8573945,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/centos-7-image-cron"}
{"level":"info","ts":1652957659.195631,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/centos-stream8-image-cron"}
{"level":"info","ts":1652957659.1956854,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/centos-stream8-image-cron"}
{"level":"info","ts":1652957659.3559809,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/rhel8-image-cron"}
{"level":"info","ts":1652957659.3560662,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/rhel8-image-cron"}
{"level":"info","ts":1652957659.719083,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/centos-7-image-cron"}
{"level":"info","ts":1652957659.7191408,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/centos-7-image-cron"}
{"level":"info","ts":1652957660.2953465,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/centos-7-image-cron"}
{"level":"info","ts":1652957660.2954412,"logger":"controllers.SSP","msg":"Reconciliation event received","ssp":{"namespace":"openshift-cnv","name":"ssp-kubevirt-hyperconverged"},"cause_type":"DataImportCron","cause_object":"openshift-virtualization-os-images/centos-7-image-cron"}
{"level":"info","ts":1652957661.2083428,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957661.2084458,"logger":"controllers.SSP","msg":"Starting reconciliation","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957661.3715484,"logger":"controllers.SSP","msg":"Reconciling operands...","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}


And it loops like that, for example:
[cnv-qe-jenkins@verify-411-hx422-executor mocohen]$ oc logs -n openshift-cnv ssp-operator-bb9c945f8-5hxvr  | grep "CR status updated"
{"level":"info","ts":1652956654.7442966,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652956659.1829374,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652956662.8284736,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652956666.4708722,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652956670.1044624,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652956673.72668,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652956676.98464,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652956681.7213516,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
.....
{"level":"info","ts":1652957639.081216,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957642.684177,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957645.4238188,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957648.6668222,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957652.0845282,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957654.97079,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957658.388607,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957661.2083428,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}
{"level":"info","ts":1652957665.176946,"logger":"controllers.SSP","msg":"CR status updated","ssp":"openshift-cnv/ssp-kubevirt-hyperconverged"}

Until it stops.

Build v4.11.0-373

Comment 11 Mor Cohen 2022-05-20 07:09:11 UTC
Same with CNV build 4.11.0-380.

Comment 12 Mor Cohen 2022-05-30 11:06:04 UTC
Hey,
Since kubevirt/ssp-operator/pull/361 has been merged, I don't see this issue anymore. I will update here if it comes up again for some reason.

Thanks for everyone.

Comment 15 errata-xmlrpc 2022-09-14 19:31:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526