Bug 1852865 - [AWS/VSPHERE]: ocs-operator.v4.5.0-463 is in Pending state in Latest OCP nightly builds
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.5.z
Assignee: Evan Cordell
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1853022
Depends On: 1853022
Blocks: 1851328
Reported: 2020-07-01 13:40 UTC by Jose A. Rivera
Modified: 2022-01-19 03:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1851328
Environment:
Last Closed: 2020-07-13 17:44:51 UTC
Target Upstream Version:
Embargoed:
scuppett: needinfo-
vdinh: needinfo-
vdinh: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:45:14 UTC

Comment 1 Evan Cordell 2020-07-01 16:05:22 UTC
Can we see the

Comment 2 Evan Cordell 2020-07-01 16:10:20 UTC
Can we see the installplans in the openshift-storage namespace?  It would also help to see all of the roles/rolebindings in that namespace.

Comment 4 Jose A. Rivera 2020-07-01 17:31:04 UTC
Petr, could you create another cluster (probably tomorrow) for someone from the OLM team to look at?

Comment 5 Jian Zhang 2020-07-02 03:18:00 UTC
Hi Jose,

I couldn't find this OCS package in the QE app registry: "openshift-qe-optional-operators", as follows:
 packages: openshifttemplateservicebroker,openshiftansibleservicebroker,local-storage-operator,metering-ocp,ptp-operator,cluster-kube-descheduler-operator,cluster-logging,sriov-network-operator,clusterresourceoverride,elasticsearch-operator,nfd

So I could not test your latest OCS operator. Could you provide the app registry that stores this OCS operator? Thanks!
Anyway, I tested the released OCS 4.4 on the latest 4.5, and it works well.

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-07-01-231224   True        False         53m     Cluster version is 4.5.0-0.nightly-2020-07-01-231224

mac:~ jianzhang$ oc get og -n openshift-storage
NAME                      AGE
openshift-storage-hbd4c   26m

mac:~ jianzhang$ oc get sub -n openshift-storage
NAME                                                                      PACKAGE                   SOURCE                CHANNEL
awss3-operator-registry-alpha-community-operators-openshift-marketplace   awss3-operator-registry   community-operators   alpha
ocs-operator                                                              ocs-operator              redhat-operators      stable-4.4

mac:~ jianzhang$ oc get csv -n openshift-storage
NAME                                           DISPLAY                       VERSION                 REPLACES              PHASE
awss3operator.1.0.1                            AWS S3 Operator               1.0.1                   awss3operator.1.0.0   Succeeded
elasticsearch-operator.4.5.0-202007011712.p0   Elasticsearch Operator        4.5.0-202007011712.p0                         Succeeded
ocs-operator.v4.4.0                            OpenShift Container Storage   4.4.0                                         Succeeded

mac:~ jianzhang$ oc get pods -n openshift-storage
NAME                                  READY   STATUS    RESTARTS   AGE
aws-s3-provisioner-676b5767c-spnjh    1/1     Running   0          5m32s
noobaa-operator-8689b6588-9d7sc       1/1     Running   0          5m40s
ocs-operator-6b4b95bc9b-jft7g         1/1     Running   0          5m41s
rook-ceph-operator-6f4b4b889c-vk2c2   1/1     Running   0          5m41s

Comment 6 Petr Balogh 2020-07-02 09:10:18 UTC
Hey Jose,

I can create the cluster, but only once someone needs it and is able to take it right away.

Yesterday I provided a cluster in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1851328 and notified here: https://chat.google.com/room/AAAAREGEba8/ReaDJLO4Urg but no one took it.

In the end the cluster ran for 20 hours and then I destroyed it.


So if someone needs it, please send me a direct message on Hangouts chat so I can run the job and have a cluster ready for you in about 50 minutes.

Comment 10 Evan Cordell 2020-07-06 19:57:25 UTC
There appears to be an issue with the way the CSV is being generated. I'm not intimately familiar with the OCS deploy pipeline, but I can see here: https://github.com/openshift/ocs-operator/commit/fa6e21558113b5da9b9e006ad274133ec900220 that there are supposed to be two roles/bindings for prometheus-k8s as separate bundle files. The CSV in this commit looks correct.

However, on-cluster in the CSV I see this block:


```
      permissions:
        - rules:
            - apiGroups:
                - ''
              resources:
                - services
                - endpoints
                - pods
              verbs:
                - get
                - list
                - watch
          serviceAccountName: noobaa-metrics
        - rules:
            - apiGroups:
                - ''
              resources:
                - services
                - endpoints
                - pods
              verbs:
                - get
                - list
                - watch
          serviceAccountName: rook-ceph-metrics
```

that is not present in the CSV manifest linked in the commit. This block tells OLM to look for bindings for those service accounts with those names, not prometheus-k8s. Because the bundle also included the "real" rolebinding, OLM preferred that when generating the InstallPlan, but the CSV still says that it needs a serviceaccount called "rook-ceph-metrics" to exist with these permissions.

I'm guessing it's an issue in the csv-merger here: https://github.com/openshift/ocs-operator/blob/master/tools/csv-merger/csv-merger.go#L418

If both of these sets of permissions are needed, it should be enough to simply name them so that they do not conflict, i.e. `rook-ceph-metrics-prom` and `rook-ceph-metrics`

Removing these definitions from the OCS CSV in a 4.5.0 cluster resulted in the operator successfully deploying (I didn't further verify any of OCS's functionality)
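
To make the failure mode concrete, here is a minimal sketch of the check described above. It is a simplified model, not OLM's actual implementation: it assumes a CSV's permission requirements are met only if every `serviceAccountName` it lists exists in the namespace. The function name and the `existing` list are hypothetical.

```python
# Simplified model (an assumption, not OLM's real code): a CSV's permission
# requirements are satisfied only when every serviceAccountName it lists
# exists in the target namespace.

def missing_service_accounts(csv_permissions, existing_service_accounts):
    """Return the service account names the CSV requires but the namespace lacks."""
    existing = set(existing_service_accounts)
    return [
        entry["serviceAccountName"]
        for entry in csv_permissions
        if entry["serviceAccountName"] not in existing
    ]


# The two entries from the on-cluster CSV block quoted above (rules omitted).
csv_permissions = [
    {"serviceAccountName": "noobaa-metrics"},
    {"serviceAccountName": "rook-ceph-metrics"},
]

# Hypothetical namespace contents, for illustration only.
existing = ["ocs-operator"]

print(missing_service_accounts(csv_permissions, existing))
# ['noobaa-metrics', 'rook-ceph-metrics']
```

Under this model, a non-empty result corresponds to the unmet requirement that keeps the CSV in Pending. Dropping the two entries from `csv_permissions` gives an empty result, which matches the successful deploy after the definitions were removed.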

Comment 11 Jose A. Rivera 2020-07-06 22:42:10 UTC
So I found the following lines in the operator-sdk: https://github.com/operator-framework/operator-sdk/blob/master/internal/generate/clusterserviceversion/clusterserviceversion_updaters.go#L86-L91

In short, when generating the "permissions" section of the CSV, it looks through all roles and fills the "serviceAccountName" with the role's name. Offhand I'm not sure this is correct. Shouldn't it be the ServiceAccount referenced in the RoleBinding?

Regardless, the question is why are we hitting this now and haven't hit it before? Also, why only in OCP 4.5? I tried looking through the operator-sdk history but it seems like this has been the behavior since the beginning. Maybe something about moving to operator bundles triggered something new?
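
The difference between the two naming strategies can be sketched as follows. This is not the operator-sdk code (which is Go). The manifests are hypothetical, reduced dicts, and both function names are made up for illustration. The first function mirrors the behaviour described in this comment. The second shows the alternative of resolving the ServiceAccount through the RoleBinding.

```python
# Hypothetical, heavily reduced manifests for illustration only.
role = {"kind": "Role", "metadata": {"name": "rook-ceph-metrics"}}
role_binding = {
    "kind": "RoleBinding",
    "metadata": {"name": "rook-ceph-metrics"},
    "roleRef": {"kind": "Role", "name": "rook-ceph-metrics"},
    "subjects": [{"kind": "ServiceAccount", "name": "prometheus-k8s"}],
}


def service_account_from_role_name(role):
    """Behaviour described above: reuse the Role's own name."""
    return role["metadata"]["name"]


def service_accounts_from_bindings(role, bindings):
    """Alternative: ServiceAccount subjects of the bindings that reference the Role."""
    names = []
    for binding in bindings:
        ref = binding["roleRef"]
        if ref["kind"] != "Role" or ref["name"] != role["metadata"]["name"]:
            continue  # binding points at a different role
        for subject in binding["subjects"]:
            if subject["kind"] == "ServiceAccount":
                names.append(subject["name"])
    return names


print(service_account_from_role_name(role))
print(service_accounts_from_bindings(role, [role_binding]))
```

With these inputs the first strategy yields the role's name, while the second yields the account that is actually bound, which is the mismatch discussed in Comment 10.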

Comment 12 umanga 2020-07-07 10:58:08 UTC
In Comment 10, Evan saw this CSV: https://github.com/openshift/ocs-operator/pull/520/commits/fa6e21558113b5da9b9e006ad274133ec900220e?file-filters%5B%5D=.yaml#diff-937daae8888bb33097cb0d21de61026e
which is the old one (same as OCS 4.4).

But, right after this commit (on same Pull Request) we have https://github.com/openshift/ocs-operator/pull/520/commits/8564fba9c0b7782ee4d38a0c707a6a0757b15a9c#diff-5b9a95d7bfb28864d4dfc053b3f8b6b6
which is the new one (done to move to new bundle format and operator sdk 0.17.1) and seen on the OCS 4.5 clusters.

The permissions block is not present in older CSVs but is somehow added to the new ones.
Since we do not manually edit the CSV and rely on automation that uses operator-sdk commands to generate it, this is interesting (and we never noticed).

- Did something change in operator-sdk 0.17.z which started adding the role/rolebinding in a way that's seen here?

- If nothing changed, why did it not add those permissions block to older CSVs? (We did not change anything in that part of code)

Comment 13 umanga 2020-07-07 11:37:46 UTC
(In reply to umanga from comment #12)

Tried a few things and it seems things did change in the way CSV is generated.

Starting from SDK 0.17.0, a few flags (--deploy-dir, --output-dir, etc.) were added.
Now the SDK looks recursively into all the files inside --deploy-dir (which is deploy/ by default).
Release Note: https://github.com/operator-framework/operator-sdk/releases/tag/v0.17.0 (2nd point under *Added* and 3rd point under *Removed*, which is also marked as a breaking change)

Unluckily for us, under deploy/ we have bundlemanifests/ which holds the role/rolebindings for metrics.
So, because of the recursive check, the SDK finds those roles under deploy/bundlemanifests/ and adds the problematic permissions (as noted in Comment 10) to the CSV.
I tried SDK 0.16 and it didn't look inside deploy/bundlemanifests/ and so the older CSVs didn't have this permission block.
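
The change in behaviour can be sketched as a top-level scan versus a recursive scan of the deploy directory. This is only a model: the SDK is written in Go, treating the 0.16 behaviour as a top-level-only scan is an assumption, and the file names below are hypothetical.

```python
import tempfile
from pathlib import Path


def manifests_top_level(deploy_dir):
    """Assumed model of the pre-0.17 behaviour: only files directly in deploy/."""
    root = Path(deploy_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.glob("*.yaml"))


def manifests_recursive(deploy_dir):
    """Model of the 0.17 behaviour: every file under deploy/, at any depth."""
    root = Path(deploy_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.yaml"))


def build_demo_tree(base):
    """Create a hypothetical deploy/ layout with a nested bundlemanifests/ dir."""
    deploy = Path(base) / "deploy"
    (deploy / "bundlemanifests").mkdir(parents=True)
    (deploy / "role.yaml").touch()
    (deploy / "bundlemanifests" / "metrics_role.yaml").touch()
    return deploy


with tempfile.TemporaryDirectory() as tmp:
    deploy = build_demo_tree(tmp)
    print(manifests_top_level(deploy))
    print(manifests_recursive(deploy))
```

Only the recursive scan picks up the role under bundlemanifests/, which is how the extra permissions entries would end up in the generated CSV.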

As a workaround for this I modified our generator script to remove bundlemanifests/ while CSV is getting generated (and restore it afterwards).
https://github.com/openshift/ocs-operator/pull/613
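
The shape of that workaround, hiding a directory for the duration of a step and restoring it afterwards, can be sketched like this. It is not the script from the PR. The `hidden` helper is hypothetical, and the CSV generation step is represented only by a comment.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def hidden(directory):
    """Temporarily move `directory` away and restore it on exit, even on error."""
    directory = Path(directory)
    stash = Path(tempfile.mkdtemp()) / directory.name
    shutil.move(str(directory), str(stash))
    try:
        yield
    finally:
        shutil.move(str(stash), str(directory))
        stash.parent.rmdir()  # remove the now-empty temporary holder


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "deploy" / "bundlemanifests"
    bundle.mkdir(parents=True)
    with hidden(bundle):
        # CSV generation would run here, with bundlemanifests/ out of sight.
        print(bundle.exists())  # False
    print(bundle.exists())  # True
```

Using try/finally means the directory is put back even if the generation step fails, so a broken run does not leave the repository without its bundle manifests.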

So, if the permissions block is the only issue, this PR should fix it in OCS.

Comment 14 Vijay Avuthu 2020-07-10 09:07:19 UTC
Looks like the above PR fixed the issue.

Verified the combinations below:

1) OCP 4.4 + OCS 4.5 - vSphere

ocs-operator.v4.5.0-482.ci
openshift installer (4.4.0-0.nightly-2020-07-09-063156)

Job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9641/console 

2) OCP 4.5 + OCS 4.5 - vSphere

ocs-operator.v4.5.0-484.ci
openshift installer (4.5.0-0.nightly-2020-07-07-210042)


Job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9696/console


Marking as Verified.

Comment 16 errata-xmlrpc 2020-07-13 17:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 17 Nick Hale 2020-07-28 20:17:24 UTC
*** Bug 1853022 has been marked as a duplicate of this bug. ***

