Can we see the installplans in the openshift-storage namespace? It would also help to see all of the roles/rolebindings in that namespace.
Petr, could you create another cluster (probably tomorrow) for someone from the OLM team to look at?
Hi Jose, I couldn't find the OCS package in the QE app registry "openshift-qe-optional-operators". It lists the following packages:
openshifttemplateservicebroker,openshiftansibleservicebroker,local-storage-operator,metering-ocp,ptp-operator,cluster-kube-descheduler-operator,cluster-logging,sriov-network-operator,clusterresourceoverride,elasticsearch-operator,nfd

So I could not test your latest OCS operator. Could you provide the app registry that stores this OCS operator? Thanks!

Anyway, I tested the released OCS 4.4 on the latest 4.5, and it works well.

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-07-01-231224   True        False         53m     Cluster version is 4.5.0-0.nightly-2020-07-01-231224

mac:~ jianzhang$ oc get og -n openshift-storage
NAME                      AGE
openshift-storage-hbd4c   26m

mac:~ jianzhang$ oc get sub -n openshift-storage
NAME                                                                      PACKAGE                   SOURCE                CHANNEL
awss3-operator-registry-alpha-community-operators-openshift-marketplace   awss3-operator-registry   community-operators   alpha
ocs-operator                                                              ocs-operator              redhat-operators      stable-4.4

mac:~ jianzhang$ oc get csv -n openshift-storage
NAME                                           DISPLAY                       VERSION                 REPLACES              PHASE
awss3operator.1.0.1                            AWS S3 Operator               1.0.1                   awss3operator.1.0.0   Succeeded
elasticsearch-operator.4.5.0-202007011712.p0   Elasticsearch Operator        4.5.0-202007011712.p0                         Succeeded
ocs-operator.v4.4.0                            OpenShift Container Storage   4.4.0                                         Succeeded

mac:~ jianzhang$ oc get pods -n openshift-storage
NAME                                  READY   STATUS    RESTARTS   AGE
aws-s3-provisioner-676b5767c-spnjh    1/1     Running   0          5m32s
noobaa-operator-8689b6588-9d7sc       1/1     Running   0          5m40s
ocs-operator-6b4b95bc9b-jft7g         1/1     Running   0          5m41s
rook-ceph-operator-6f4b4b889c-vk2c2   1/1     Running   0          5m41s
Hey Jose, I can create the cluster, but only once someone needs it and can take it over right away. Yesterday I provided a cluster in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1851328 and announced it here: https://chat.google.com/room/AAAAREGEba8/ReaDJLO4Urg but no one took it. In the end the cluster ran for 20 hours and I destroyed it. So if someone needs it, please direct message me on Hangouts chat so I can run the job and have a cluster ready for you in about 50 minutes.
There appears to be an issue with the way the CSV is being generated. I'm not intimately familiar with the OCS deploy pipeline, but I can see here: https://github.com/openshift/ocs-operator/commit/fa6e21558113b5da9b9e006ad274133ec900220 that there are supposed to be two roles/bindings for prometheus-k8s as separate bundle files. The CSV in this commit looks correct.

However, on-cluster in the CSV I see this block:

```
permissions:
- rules:
  - apiGroups:
    - ''
    resources:
    - services
    - endpoints
    - pods
    verbs:
    - get
    - list
    - watch
  serviceAccountName: noobaa-metrics
- rules:
  - apiGroups:
    - ''
    resources:
    - services
    - endpoints
    - pods
    verbs:
    - get
    - list
    - watch
  serviceAccountName: rook-ceph-metrics
```

that is not present in the CSV manifest linked in the commit. This block tells OLM to look for bindings for service accounts with those names, not prometheus-k8s. Because the bundle also included the "real" rolebinding, OLM preferred that when generating the InstallPlan, but the CSV still says that it needs a serviceaccount called "rook-ceph-metrics" to exist with these permissions.

I'm guessing it's an issue in the csv-merger here: https://github.com/openshift/ocs-operator/blob/master/tools/csv-merger/csv-merger.go#L418

If both of these sets of permissions are needed, it should be enough to simply name them so that they do not conflict, i.e. `rook-ceph-metrics-prom` and `rook-ceph-metrics`.

Removing these definitions from the OCS CSV in a 4.5.0 cluster resulted in the operator deploying successfully (I didn't further verify any of OCS's functionality).
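To make the mismatch described above easier to check, here is a minimal sketch in Python. It assumes the CSV has already been loaded into a plain dict and that the list of ServiceAccounts in the namespace is known; the helper name `missing_service_accounts` and the `example-operator` account are hypothetical and are not part of OLM or the OCS tooling.

```python
def missing_service_accounts(csv, existing_service_accounts):
    """Return serviceAccountName values from the CSV's permissions block
    that have no matching ServiceAccount in the given collection."""
    permissions = csv["spec"]["install"]["spec"].get("permissions", [])
    required = {entry["serviceAccountName"] for entry in permissions}
    return sorted(required - set(existing_service_accounts))


# Rules copied from the permissions block quoted in the comment above.
METRICS_RULES = [
    {
        "apiGroups": [""],
        "resources": ["services", "endpoints", "pods"],
        "verbs": ["get", "list", "watch"],
    }
]

example_csv = {
    "spec": {
        "install": {
            "spec": {
                "permissions": [
                    {"serviceAccountName": "noobaa-metrics", "rules": METRICS_RULES},
                    {"serviceAccountName": "rook-ceph-metrics", "rules": METRICS_RULES},
                ]
            }
        }
    }
}

# Hypothetical set of ServiceAccounts that exist in the namespace.
existing = ["example-operator"]

print(missing_service_accounts(example_csv, existing))
```

With the example data this reports both `noobaa-metrics` and `rook-ceph-metrics` as required by the CSV but absent, which is the situation described in the comment.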
So I found the following lines in the operator-sdk: https://github.com/operator-framework/operator-sdk/blob/master/internal/generate/clusterserviceversion/clusterserviceversion_updaters.go#L86-L91 In short, when generating the "permissions" section of the CSV, it looks through all roles and fills the "serviceAccountName" with the role's name. Offhand I'm not sure this is correct; shouldn't it be the ServiceAccount referenced in the RoleBinding? Regardless, the question is why we are hitting this now and haven't hit it before. Also, why only in OCP 4.5? I tried looking through the operator-sdk history, but it seems like this has been the behavior since the beginning. Maybe something about moving to operator bundles triggered something new?
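The difference between the two naming strategies can be illustrated with a small Python model. This is only a sketch of the behavior as described in the comment above, not the operator-sdk code: both function names are hypothetical, and the example RoleBinding (a role named rook-ceph-metrics bound to the prometheus-k8s service account) is an assumption based on the earlier discussion.

```python
def permissions_from_role_names(roles):
    """Model of the described behavior: serviceAccountName is the Role's name."""
    return [
        {"serviceAccountName": role["metadata"]["name"], "rules": role["rules"]}
        for role in roles
    ]


def permissions_from_bindings(roles, bindings):
    """Alternative: take serviceAccountName from the RoleBinding subjects."""
    roles_by_name = {role["metadata"]["name"]: role for role in roles}
    permissions = []
    for binding in bindings:
        role = roles_by_name.get(binding["roleRef"]["name"])
        if role is None:
            continue  # binding points at a role that is not in the bundle
        for subject in binding.get("subjects", []):
            if subject.get("kind") == "ServiceAccount":
                permissions.append(
                    {"serviceAccountName": subject["name"], "rules": role["rules"]}
                )
    return permissions


# Assumed example objects, trimmed to the fields used here.
roles = [
    {
        "metadata": {"name": "rook-ceph-metrics"},
        "rules": [
            {
                "apiGroups": [""],
                "resources": ["services", "endpoints", "pods"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
]
bindings = [
    {
        "metadata": {"name": "rook-ceph-metrics"},
        "roleRef": {"kind": "Role", "name": "rook-ceph-metrics"},
        "subjects": [{"kind": "ServiceAccount", "name": "prometheus-k8s"}],
    }
]

print([p["serviceAccountName"] for p in permissions_from_role_names(roles)])
print([p["serviceAccountName"] for p in permissions_from_bindings(roles, bindings)])
```

The first strategy yields a service account named after the role, while the second yields the account that the binding actually grants the role to.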
In Comment 10, Evan saw this CSV - https://github.com/openshift/ocs-operator/pull/520/commits/fa6e21558113b5da9b9e006ad274133ec900220e?file-filters%5B%5D=.yaml#diff-937daae8888bb33097cb0d21de61026e which is the old one (same as OCS 4.4). But right after this commit (on the same Pull Request) we have https://github.com/openshift/ocs-operator/pull/520/commits/8564fba9c0b7782ee4d38a0c707a6a0757b15a9c#diff-5b9a95d7bfb28864d4dfc053b3f8b6b6 which is the new one (done to move to the new bundle format and operator-sdk 0.17.1) and is the one seen on the OCS 4.5 clusters. The permissions block is not present in older CSVs but is somehow added to the new ones. Since we do not manually edit the CSV and rely on automation that uses operator-sdk commands to generate it, this is surprising (and we never noticed). - Did something change in operator-sdk 0.17.z which started adding the role/rolebinding in the way seen here? - If nothing changed, why did it not add that permissions block to older CSVs? (We did not change anything in that part of the code.)
(In reply to umanga from comment #12)
> In Comment 10, Evan saw this CSV
> -https://github.com/openshift/ocs-operator/pull/520/commits/
> fa6e21558113b5da9b9e006ad274133ec900220e?file-filters%5B%5D=.yaml#diff-
> 937daae8888bb33097cb0d21de61026e
> which is the old one (same as OCS 4.4).
>
> But, right after this commit (on same Pull Request) we have
> https://github.com/openshift/ocs-operator/pull/520/commits/
> 8564fba9c0b7782ee4d38a0c707a6a0757b15a9c#diff-
> 5b9a95d7bfb28864d4dfc053b3f8b6b6
> which is the new one (done to move to new bundle format and operator sdk
> 0.17.1) and seen on the OCS 4.5 clusters.
>
> The permissions block is not present in older CSVs but is somehow added to
> the new ones.
> Since we do not manually edit CSV and rely on automation that uses
> opertor-sdk commands to generate CSV, it seems interesting (and we never
> noticed).
>
> - Did something change in operator-sdk 0.17.z which started adding the
> role/rolebinding in a way that's seen here?
>
> - If nothing changed, why did it not add those permissions block to older
> CSVs? (We did not change anything in that part of code)

I tried a few things, and the way the CSV is generated did change. Starting with SDK 0.17.0, a few flags (--deploy-dir, --output-dir, etc.) were added. The SDK now looks recursively into all the files inside --deploy-dir (which is deploy/ by default).

Release Note: https://github.com/operator-framework/operator-sdk/releases/tag/v0.17.0 (2nd point under *Added* and 3rd point under *Removed*, which is also marked as blocking change)

Unluckily for us, under deploy/ we have bundlemanifests/, which holds the role/rolebindings for metrics. Because of the recursive check, the SDK finds them under deploy/bundlemanifests/ and adds the problematic permissions (as noted in Comment 10) to the CSV.

I tried SDK 0.16 and it didn't look inside deploy/bundlemanifests/, so the older CSVs didn't have this permissions block.
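The effect of switching from a top-level scan to a recursive one can be sketched with Python's pathlib. This is only a model of the difference described above, not how either SDK version is implemented; the file names `role.yaml` and `metrics_role.yaml` are hypothetical, and only the deploy/bundlemanifests/ layout comes from the comment.

```python
import tempfile
from pathlib import Path


def top_level_manifests(deploy_dir):
    """List YAML files directly inside deploy_dir (no subdirectories)."""
    return sorted(
        p.relative_to(deploy_dir).as_posix() for p in deploy_dir.glob("*.yaml")
    )


def recursive_manifests(deploy_dir):
    """List YAML files anywhere below deploy_dir, including subdirectories."""
    return sorted(
        p.relative_to(deploy_dir).as_posix() for p in deploy_dir.rglob("*.yaml")
    )


def build_example_tree(root):
    """Create deploy/ with one role at the top level and one under bundlemanifests/."""
    deploy = Path(root) / "deploy"
    (deploy / "bundlemanifests").mkdir(parents=True)
    (deploy / "role.yaml").write_text("kind: Role\n")
    (deploy / "bundlemanifests" / "metrics_role.yaml").write_text("kind: Role\n")
    return deploy


with tempfile.TemporaryDirectory() as tmp:
    deploy = build_example_tree(tmp)
    print("top level:", top_level_manifests(deploy))
    print("recursive:", recursive_manifests(deploy))
```

Only the recursive listing includes the role under bundlemanifests/, which is how an extra role could end up in the generated permissions block.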
As a workaround for this, I modified our generator script to remove bundlemanifests/ while the CSV is being generated (and restore it afterwards). https://github.com/openshift/ocs-operator/pull/613 So, if the permissions block is the only issue, this PR should fix it in OCS.
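The move-aside-and-restore idea behind this workaround can be sketched as a Python context manager. This is not the script from the PR; the name `moved_aside` and the temporary holding directory are assumptions made for illustration, and the generator call is left as a placeholder comment.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def moved_aside(directory):
    """Temporarily move `directory` out of its parent, restoring it on exit."""
    directory = Path(directory)
    holding = Path(tempfile.mkdtemp(prefix="csv-gen-"))
    parked = holding / directory.name
    shutil.move(str(directory), str(parked))
    try:
        yield parked
    finally:
        # Restore even if CSV generation raised an exception.
        shutil.move(str(parked), str(directory))
        shutil.rmtree(holding, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        manifests = Path(tmp) / "deploy" / "bundlemanifests"
        manifests.mkdir(parents=True)
        (manifests / "metrics_role.yaml").write_text("kind: Role\n")

        with moved_aside(manifests):
            # Placeholder: run the CSV generator here while the directory is hidden.
            print("hidden during generation:", not manifests.exists())

        print("restored afterwards:", manifests.exists())
```

Using `try`/`finally` means the directory is put back even when generation fails, so a failed run does not leave the repository without its bundle manifests.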
Looks like the above PR fixed the issue.

Verified the following combinations:

1) OCP 4.4 + OCS 4.5 - vSphere
   ocs-operator.v4.5.0-482.ci
   openshift installer (4.4.0-0.nightly-2020-07-09-063156)
   Job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9641/console

2) OCP 4.5 + OCS 4.5 - vSphere
   ocs-operator.v4.5.0-484.ci
   openshift installer (4.5.0-0.nightly-2020-07-07-210042)
   https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9696/console

Marking as Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
*** Bug 1853022 has been marked as a duplicate of this bug. ***