Bug 2046554

Summary: infrastructure-operator pod crashes due to insufficient privileges in ACM 2.5
Product: Red Hat Advanced Cluster Management for Kubernetes
Reporter: Thuy Nguyen <thnguyen>
Component: Cluster Lifecycle
Assignee: Jian Qiu <jqiu>
Status: CLOSED ERRATA
QA Contact: Hui Chen <huichen>
Severity: high
Docs Contact: Christopher Dawson <cdawson>
Priority: unspecified
Version: rhacm-2.5
CC: akrzos, asegurap, ccrum, dhuynh, jagray, mfilanov, mhrivnak, smiron, yuhe
Target Milestone: ---
Flags: bot-tracker-sync: rhacm-2.5+
Target Release: rhacm-2.5
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-06-09 02:08:48 UTC
Type: Bug

Description Thuy Nguyen 2022-01-26 23:01:38 UTC
Description of the problem: infrastructure-operator pod crashes due to insufficient privileges in ACM 2.5

Release version: ACM 2.5.0

Operator snapshot version: 2.5.0-DOWNSTREAM-2022-01-19-20-35-27 (Final Sprint 0)

OCP version: 4.9.11

Browser Info: Firefox 91.5.0esr (64-bit)

Steps to reproduce:
1. Install ACM 2.5 into the ocm namespace
2. Check the infrastructure-operator pod

Actual results:
The infrastructure-operator pod crashes

# oc get po -n ocm | grep Crash
infrastructure-operator-694dfdf9f6-qtr57                          0/1     CrashLoopBackOff   5 (2m11s ago)   19m

Expected results:
All pods are up and running

Additional info:

infrastructure-operator pod logs:

I0126 22:56:20.692369       1 request.go:668] Waited for 1.034635988s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/imageregistry.open-cluster-management.io/v1alpha1?timeout=32s
{"level":"info","ts":1643237784.394883,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1643237789.1602755,"logger":"setup","msg":"starting manager"}
I0126 22:56:29.160639       1 leaderelection.go:243] attempting to acquire leader lease ocm/86f835c3.agent-install.openshift.io...
{"level":"info","ts":1643237789.160651,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0126 22:56:46.196847       1 leaderelection.go:253] successfully acquired lease ocm/86f835c3.agent-install.openshift.io
{"level":"info","ts":1643237806.1971397,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.1971996,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.1972117,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.19722,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.1972282,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.1972363,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.197244,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.197251,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.1972582,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.1972685,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.1972754,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.197282,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.1972885,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting EventSource","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","source":"kind source: /, Kind="}
{"level":"info","ts":1643237806.197295,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Starting Controller","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig"}
E0126 22:56:46.210237       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E0126 22:56:46.228175       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.APIService: failed to list *v1.APIService: apiservices.apiregistration.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
E0126 22:56:47.590934       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E0126 22:56:47.685548       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.APIService: failed to list *v1.APIService: apiservices.apiregistration.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
E0126 22:56:49.993785       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E0126 22:56:50.113672       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.APIService: failed to list *v1.APIService: apiservices.apiregistration.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
E0126 22:56:53.837276       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.APIService: failed to list *v1.APIService: apiservices.apiregistration.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
E0126 22:56:54.188915       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E0126 22:57:03.254215       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E0126 22:57:03.656575       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.APIService: failed to list *v1.APIService: apiservices.apiregistration.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
E0126 22:57:22.132234       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.APIService: failed to list *v1.APIService: apiservices.apiregistration.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
E0126 22:57:22.812778       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E0126 22:58:03.239809       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.APIService: failed to list *v1.APIService: apiservices.apiregistration.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
E0126 22:58:11.510740       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E0126 22:58:36.529824       1 reflector.go:138] pkg/mod/k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.APIService: failed to list *v1.APIService: apiservices.apiregistration.k8s.io is forbidden: User "system:serviceaccount:ocm:assisted-service" cannot list resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
{"level":"error","ts":1643237926.1978145,"logger":"controller-runtime.manager.controller.agentserviceconfig","msg":"Could not wait for Cache to sync","reconciler group":"agent-install.openshift.io","reconciler kind":"AgentServiceConfig","error":"failed to wait for agentserviceconfig caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/remote-source/assisted-service/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:195\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/remote-source/assisted-service/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/internal/controller/controller.go:221\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1\n\t/remote-source/assisted-service/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.0/pkg/manager/internal.go:697"}
{"level":"error","ts":1643237926.19815,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"leader election lost"}
{"level":"error","ts":1643237926.1982,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"context canceled"}
{"level":"error","ts":1643237926.1982052,"logger":"setup","msg":"problem running manager","error":"failed to wait for agentserviceconfig caches to sync: timed out waiting for cache to be synced","stacktrace":"main.main\n\t/remote-source/assisted-service/app/cmd/operator/main.go:164\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:255"}
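The fifteen reflector errors above reduce to two missing cluster-scoped permissions for the operator's service account, which in turn keep the AgentServiceConfig cache from syncing and the manager from starting. When triaging a similar crash loop, a small filter like the one below can collapse such a log into the distinct denied requests. It is only a sketch: the function name `summarize_forbidden` is hypothetical, and it assumes the exact client-go message wording shown in this log.

```shell
# Hypothetical helper: condense repeated "forbidden" reflector errors into one
# line per denied (user, verb, resource.group). Assumes the client-go wording
# seen above: cannot <verb> resource "<res>" in API group "<group>".
# Core-group resources (API group "") would print with a trailing dot.
summarize_forbidden() {
  sed -n 's/.*User "\([^"]*\)" cannot \([a-z]*\) resource "\([^"]*\)" in API group "\([^"]*\)".*/\1 \2 \3.\4/p' |
    sort -u
}
```

For example, `oc logs -n ocm --previous infrastructure-operator-694dfdf9f6-qtr57 | summarize_forbidden` applied to the log above would print just two entries: `list` on `apiservices.apiregistration.k8s.io` and `list` on `clusterrolebindings.rbac.authorization.k8s.io`, both for `system:serviceaccount:ocm:assisted-service`.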

Comment 1 ian zhang 2022-02-04 14:32:36 UTC
@mfilanov 

Hi Michael, 

Do you know how we set up the RBAC for AI (Assisted Installer) on ACM? If so, can you please let me know? I will try to follow the previous steps to add the permissions above.

Comment 2 Michael Filanov 2022-02-06 07:53:32 UTC
Hey, all the RBAC config is located at https://github.com/openshift/assisted-service/tree/master/config/rbac 
@asegurap do you know how it is being deployed or how we can debug it?

Comment 3 Michael Hrivnak 2022-02-07 14:28:56 UTC
Wasn't there some automation to convert the CSV (https://github.com/openshift/assisted-service/blob/master/deploy/olm-catalog/manifests/assisted-service-operator.clusterserviceversion.yaml) into manifests for ACM's helm chart?

I understood that the artifact being released and handed off would be an OLM bundle, and that ACM could then consume that for integration into ACM via a helm chart. Is that what's happening, or did we settle into some other process?

Comment 4 Chad Crum 2022-02-07 15:55:48 UTC
*** Bug 2050363 has been marked as a duplicate of this bug. ***

Comment 5 ian zhang 2022-02-07 16:21:49 UTC
I believe ACM uses https://github.com/stolostron/assisted-service-chart to convert the CSV into ACM deployment manifests.

It seems AI added some new ClusterRole entries, such as access to clusterrolebindings, in https://github.com/openshift/assisted-service/blob/master/deploy/olm-catalog/manifests/assisted-service-operator.clusterserviceversion.yaml; however, the ACM-side converter output was not updated with these new entries.

@jagray can you please help us update ACM's converter code? Also, @mhrivnak, can you please let us know if there's a way to identify all the new changes to the CSV? Do we just diff the commits?
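To confirm whether a given build actually grants the cluster-scoped permissions named in the reflector errors, rather than only watching whether the pod stops crashlooping, one option is `oc auth can-i` with impersonation. This is a minimal sketch under stated assumptions: the operator runs as `system:serviceaccount:ocm:assisted-service` (as in the log), the caller is allowed to impersonate that service account, and the function name `check_operator_rbac` and the `OC` override variable are hypothetical.

```shell
# Assumption: the operator's identity, taken from the forbidden errors above.
SA="system:serviceaccount:ocm:assisted-service"

# Hypothetical helper: report whether $SA may list/watch each cluster-scoped
# resource the operator's informers need. OC can be overridden (e.g. for tests).
check_operator_rbac() {
  local res verb
  for res in clusterrolebindings.rbac.authorization.k8s.io \
             apiservices.apiregistration.k8s.io; do
    # An informer needs both list and watch to sync its cache.
    for verb in list watch; do
      printf '%s %s: ' "$verb" "$res"
      # `oc auth can-i` prints yes or no and exits non-zero on a denial;
      # ignore the exit code so every check is reported.
      ${OC:-oc} auth can-i "$verb" "$res" --as="$SA" || true
    done
  done
}
```

Run as a cluster administrator, one would expect `no` for these checks on the affected 2.5.0 snapshot and `yes` on a build that includes the regenerated chart.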

Comment 6 Jakob 2022-02-07 21:35:52 UTC
I've updated the automation. This seems to have picked up some role changes: https://github.com/stolostron/assisted-service-chart/commit/103d0e9c4c73cc9ae87e1e2984640fbd4afb559f. The workflow had been disabled because there had been no activity in the repo for 60 days, which I wasn't aware could happen.

Comment 7 ian zhang 2022-02-09 21:13:39 UTC
Hi @thnguyen 

Can you please try our latest image to see whether the changes made by Jakob above are working?

Comment 8 Michael Hrivnak 2022-02-14 16:14:33 UTC
The automation is best, but a diff of the CSV would of course show any changes. Looks like this should be resolved now.

Comment 9 Thuy Nguyen 2022-02-14 16:33:51 UTC
@izhang, please change the status to ON_QA if the fix is already in and ready to test. Please also include the upstream/downstream build that contains the fix. Thank you.

Comment 10 ian zhang 2022-02-15 14:33:35 UTC
@jagray 

Hi Jakob, 

Do you know which build contains the fix for this issue?

Comment 11 Jakob 2022-02-15 14:46:15 UTC
I only know the change was committed on Feb 7, 2022. Any build after that date will have the change.

Comment 12 Alex Krzos 2022-02-15 16:59:18 UTC
I have version 2.5.0-DOWNSTREAM-2022-02-10-07-31-45 of ACM deployed and I no longer see the infrastructure operator crashlooping FWIW

Comment 13 Thuy Nguyen 2022-02-16 14:48:14 UTC
Validated on 2.5.0-DOWNSTREAM-2022-02-14-13-53-12.

Comment 18 errata-xmlrpc 2022-06-09 02:08:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Advanced Cluster Management 2.5 security updates, images, and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4956

Comment 20 Red Hat Bugzilla 2023-09-18 04:30:52 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days