Bug 1912077 - helm operator's default rbac forbidden
Summary: helm operator's default rbac forbidden
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Operator SDK
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Varsha
QA Contact: Fan Jia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-03 09:32 UTC by Fan Jia
Modified: 2021-02-24 15:49 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:49:31 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:49:54 UTC)

Comment 1 Varsha 2021-01-07 18:51:21 UTC
I'm working on this, will have an update shortly.

Comment 2 Varsha 2021-01-07 20:48:36 UTC
This issue is occurring because the helm-operator-1.3.0 base image does not have the liveness and readiness probes enabled. You may have to test with the released 1.3.0 tag instead of building from master.

Please test with the following (a sketch for extracting the binary from the downstream image follows the list):
* binary extracted from the downstream image: registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-operator-sdk:v4.7.0-202101071845.p0 (or wherever you typically get test images)
* the v1.3.0 released binary: https://github.com/operator-framework/operator-sdk/releases/tag/v1.3.0
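For the downstream image, a minimal extraction sketch; the in-image path /usr/local/bin/operator-sdk is an assumption and may differ in the actual build:

```
IMG=registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-operator-sdk:v4.7.0-202101071845.p0

# Option A: extract the binary directly with oc (needs pull access to the registry).
oc image extract "$IMG" --path /usr/local/bin/operator-sdk:.
chmod +x ./operator-sdk && ./operator-sdk version

# Option B: create a throwaway container and copy the binary out.
cid=$(podman create "$IMG")
podman cp "$cid":/usr/local/bin/operator-sdk ./operator-sdk
podman rm "$cid"
```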

Comment 4 Varsha 2021-01-08 20:41:50 UTC
@Jia Fan, I tried running the operator with these 2 base images:
1. quay.io/operator-framework/helm-operator:v1.3.0 and
2. registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-helm-operator:v4.7.0-202101081525.p0

I'm not able to reproduce this error. After creating the helm-operator project, I ran:
1. make install
2. make deploy 
3. Applied the CR

Output:
```
{"level":"info","ts":1610135963.618943,"logger":"cmd","msg":"Version","Go Version":"go1.15.5","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.3.0","commit":"1abf57985b43bf6a59dcd18147b3c574fa57d3f6"}
{"level":"info","ts":1610135963.6198597,"logger":"cmd","msg":"WATCH_NAMESPACE environment variable not set. Watching all namespaces.","Namespace":""}
I0108 19:59:24.670513       1 request.go:645] Throttling request took 1.034612949s, request: GET:https://172.30.0.1:443/apis/ingress.operator.openshift.io/v1?timeout=32s
{"level":"info","ts":1610135966.2819138,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1610135966.2827315,"logger":"helm.controller","msg":"Watching resource","apiVersion":"demo.my.domain/v1","kind":"Nginx","namespace":"","reconcilePeriod":"1m0s"}
I0108 19:59:26.283017       1 leaderelection.go:243] attempting to acquire leader lease  nginx-operator-system/nginx-operator...
{"level":"info","ts":1610135966.2835867,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0108 19:59:43.699785       1 leaderelection.go:253] successfully acquired lease nginx-operator-system/nginx-operator
{"level":"info","ts":1610135983.700022,"logger":"controller-runtime.manager.controller.nginx-controller","msg":"Starting EventSource","source":"kind source: demo.my.domain/v1, Kind=Nginx"}
{"level":"info","ts":1610135983.8005197,"logger":"controller-runtime.manager.controller.nginx-controller","msg":"Starting Controller"}
{"level":"info","ts":1610135983.8005683,"logger":"controller-runtime.manager.controller.nginx-controller","msg":"Starting workers","worker count":4}
I0108 19:59:44.890134       1 request.go:645] Throttling request took 1.044493238s, request: GET:https://172.30.0.1:443/apis/image.openshift.io/v1?timeout=32s
```

I also modified the CR and applied it again to check whether the manager is controlling the pods. I changed the replica count from 2 to 1 (a sketch of the change follows the pod listing):
```
➜  nginx-operator oc get pods -n default
NAME                            READY   STATUS        RESTARTS   AGE
nginx-sample-646f977b4f-6l67r   0/1     Terminating   0          63s
nginx-sample-646f977b4f-wzht4   1/1     Running       0          63s
```
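For reference, a hypothetical version of the CR change applied above; the group/version/kind, name, and namespace come from the logs and pod listing, while the `replicaCount` key is an assumption based on the default scaffolded chart, not taken from this report:

```
# Apply the modified CR with the replica count lowered from 2 to 1.
cat <<'EOF' | oc apply -f -
apiVersion: demo.my.domain/v1
kind: Nginx
metadata:
  name: nginx-sample
  namespace: default
spec:
  replicaCount: 1
EOF
```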

Since the logs say `failed to install release: rendered manifests contain a resource that already exists`, maybe the role.yaml was not applied correctly. Can you try removing all the previous operator deployments and check whether the error persists?
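A rough cleanup sketch, assuming the standard scaffold layout; `make undeploy`/`make uninstall` may not exist in this scaffold version, hence the kustomize fallbacks:

```
# Delete the sample CR first so the operator uninstalls the Helm release.
oc delete -f config/samples/ --ignore-not-found

# Tear down the operator Deployment and RBAC.
make undeploy || kustomize build config/default | oc delete -f - --ignore-not-found

# Remove the CRDs installed by `make install`.
make uninstall || kustomize build config/crd | oc delete -f - --ignore-not-found
```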

Comment 5 Varsha 2021-01-08 20:48:25 UTC
Can you also attach the project that is hitting this error? It would help us check whether we are missing something in the scaffolding.

Comment 6 Fan Jia 2021-01-11 11:34:02 UTC
Varsha, I tested this with 3 versions of operator-sdk:
version 1. operator-sdk version: "v1.3.0-8-g10fd87b", commit: "10fd87b3fd53198070a8f7c76acf06cc9d4cf84b", kubernetes version: "v1.19.4", go version: "go1.14", GOOS: "linux", GOARCH: "amd64" (the latest upstream master)

version 2. operator-sdk version: "v4.7.0-202101071845.p0-dirty", commit: "4f9bb0e23307fe974ed4e8784bc3b527e8ebdac3", kubernetes version: "v1.18.8", go version: "go1.15.5", GOOS: "linux", GOARCH: "amd64" (copied from the downstream release image: registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-operator-sdk:v4.7.0-202101071845.p0)

version 3. operator-sdk version: "v1.3.0-5-g50fcdb8", commit: "50fcdb8dd189ffe00d34a077adfcbee7309d5fec", kubernetes version: "v1.19.4", go version: "go1.14", GOOS: "linux", GOARCH: "amd64" (downstream, from https://github.com/openshift/ocp-release-operator-sdk/pull/87)

Versions 2 and 3 generate the default rbac/role.yaml without `serviceaccounts` in the resources, so creating the CR is forbidden by the RBAC policy. Version 1 generates the default rbac/role.yaml with `serviceaccounts` included, so it runs well and reconciles the CR successfully, since the CR needs to create a service account.

1) The rbac/role.yaml generated by the version 1 operator-sdk (v1.3.0-8-g10fd87b):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: manager-role
rules:
##
## Base operator rules
##
# We need to get namespaces so the operator can read namespaces to ensure they exist
- apiGroups:
........

##
## Rules for apps.my.domain/v1beta1, Kind: Nginx
##
- apiGroups:
  - apps.my.domain
  resources:
  - nginxes
  - nginxes/status
  - nginxes/finalizers
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- verbs:
  - "*"
  apiGroups:
  - ""
  resources:
  - "serviceaccounts"
  - "services"
- verbs:
  - "*"
  apiGroups:
  - "apps"
  resources:
  - "deployments"

# +kubebuilder:scaffold:rules

2) The rbac/role.yaml generated by the version 2 / version 3 operator-sdk:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: manager-role
rules:
##
## Base operator rules
##
# We need to get namespaces so the operator can read namespaces to ensure they exist
- apiGroups:
......

##
## Rules for demo.my.domain/v1, Kind: Nginx
##
- apiGroups:
  - demo.my.domain
  resources:
  - nginxes
  - nginxes/status
  - nginxes/finalizers
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - services/finalizers
  - endpoints
  - persistentvolumeclaims
  - events
  - configmaps
  - secrets
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps
  resources:
  - deployments
  - daemonsets
  - replicasets
  - statefulsets
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch

# +kubebuilder:scaffold:rules

Comment 7 Varsha 2021-01-11 20:35:08 UTC
@Jia Fan, yes, the error does occur if the right permissions for the service account are not scaffolded. I locally checked out and tested the PR (https://github.com/openshift/ocp-release-operator-sdk/pull/87); the service account seems to have been scaffolded when run on OCP 4.7. `rbac/role.yaml` looks like this:

```
##
## Rules for demo.my.domain/v1, Kind: Nginx
##
- apiGroups:
  - demo.my.domain
  resources:
  - nginxes
  - nginxes/status
  - nginxes/finalizers
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- verbs:
  - "*"
  apiGroups:
  - ""
  resources:
  # - "serviceaccounts"
  - "services"
- verbs:
  - "*"
  apiGroups:
  - "apps"
  resources:
  - "deployments"

# +kubebuilder:scaffold:rules
```

To be specific, the commit of the PR that I tested is `e534ae0bf45356506fc823f22667b36e572ee90e`. Was the version 2 commit (50fcdb8dd189ffe00d34a077adfcbee7309d5fec) from the downstream SDK (ocp-release-operator-sdk)? I am not able to find it in the logs.

Comment 8 Varsha 2021-01-11 23:49:17 UTC
@Jia Fan, updating the findings here. All three versions of the images seem to be right, as this is expected behavior.

When `operator-sdk create api` is run for a helm-operator project:
1. If $KUBECONFIG is set, the resources required to run the Helm chart in the cluster are scaffolded with the required permissions.
2. If $KUBECONFIG is not set, the user has to add the required permissions manually.

In this case, if not running against a cluster, we need to manually add `serviceaccounts` to role.yaml to provide the necessary permissions. This is expected behavior of the helm-operator scaffolder.
This is `not a bug` in the implementation; instead, it is a bug against the documentation - https://bugzilla.redhat.com/show_bug.cgi?id=1915095
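For illustration, a minimal sketch of that manual addition, mirroring the rule the cluster-connected scaffold produces; the wildcard verbs can be narrowed as needed:

```
# Append a serviceaccounts rule to the manager ClusterRole; you can also place it
# next to the other chart rules in config/rbac/role.yaml by hand.
cat <<'EOF' >> config/rbac/role.yaml
- verbs:
  - "*"
  apiGroups:
  - ""
  resources:
  - "serviceaccounts"
EOF

# Re-apply the manifests so the updated ClusterRole reaches the cluster.
make deploy IMG=<your-operator-image>
```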

Comment 9 Fan Jia 2021-01-12 01:32:02 UTC
I checked the 3 operator-sdk versions from comment 6 after setting KUBECONFIG; all the generated rbac/role.yaml files include the `serviceaccounts` resource, and the operator runs well. The documentation bug can track the remaining problem.
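For reference, a sketch of that verification flow, assuming a reachable cluster and the standard Helm plugin flags; the group/version/kind mirror the project from the earlier comments:

```
# Point the SDK at a live cluster so `create api` can discover the chart's resources.
export KUBECONFIG=$HOME/.kube/config

mkdir nginx-operator && cd nginx-operator
operator-sdk init --plugins=helm --domain=my.domain
operator-sdk create api --group demo --version v1 --kind Nginx

# The generated role should now include the serviceaccounts rule.
grep -A 3 serviceaccounts config/rbac/role.yaml
```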
Verified this bug.

Comment 12 errata-xmlrpc 2021-02-24 15:49:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

