Description of problem:

Upgrading the Performance Addon Operator from version 4.6 to 4.7 fails with the error:

```
Warning  TooManyOperatorGroups  11m  operator-lifecycle-manager  csv created in namespace with multiple operatorgroups, can't pick one automatically
```

```
$ oc get csv
NAME                                DISPLAY                      VERSION   REPLACES                            PHASE
performance-addon-operator.v4.6.0   Performance Addon Operator   4.6.0                                         Pending
performance-addon-operator.v4.7.0   Performance Addon Operator   4.7.0     performance-addon-operator.v4.6.0   Pending
```

Version-Release number of selected component (if applicable):

Performance Addon Operator 4.7

How reproducible:

1. Set up OCP 4.7.

2. Configure PAO as described in https://docs.openshift.com/container-platform/4.6/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#inst[…]aster

3. Create a performance profile as shown below:

```yaml
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: hugepages
spec:
  cpu:
    reserved: "0-3"
    isolated: "4-31"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
      - size: "1G"
        count: 1
        node: 0
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
```

4. Verify the performance profile is applied.

5. Upgrade procedure:

5.1. Update the registry certificates for master and worker:

```
$ oc apply -f 99-master-registries.yaml
$ oc apply -f 99-worker-registries.yaml
```

5.2. Patch the OperatorHub and set "/spec/disableAllDefaultSources" to true:

```
$ oc patch OperatorHub cluster --type json -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'
```

5.3. Specify the registries:

```yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: brew-registry
spec:
  repositoryDigestMirrors:
    - mirrors:
        - brew.registry.redhat.io
      source: registry.redhat.io
    - mirrors:
        - brew.registry.redhat.io
      source: registry.stage.redhat.io
    - mirrors:
        - brew.registry.redhat.io
      source: registry-proxy.engineering.redhat.com
```

```
$ oc apply -f imageContentSecurityPolicy.yaml
```

5.4. Git clone cnf-internal-deploy and specify the catalog source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: iib-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: brew.registry.redhat.io/rh-osbs/iib-pub-pending:v4.7
  displayName: IIB Operator Catalog
```

```
$ oc apply -f ./catalog_source.yaml
```

5.5. Patch the subscription to update from 4.6 to 4.7:

```
$ oc patch subscriptions openshift-performance-addon-operator-subscription --type json -p '[{"op": "replace", "path": "/spec/channel", "value": "4.7"}]'
$ oc patch subscriptions openshift-performance-addon-operator-subscription --type json -p '[{"op": "replace", "path": "/spec/source", "value": "iib-operator-catalog"}]'
```

Wait for the update. `oc describe csv/performance-addon-operator.v4.7.0` shows:

```
Events:
  Type     Reason                    Age                From                        Message
  ----     ------                    ---                ----                        -------
  Warning  UnsupportedOperatorGroup  87m (x2 over 87m)  operator-lifecycle-manager  OwnNamespace InstallModeType not supported, cannot configure to watch own namespace
  Warning  TooManyOperatorGroups     63m                operator-lifecycle-manager  csv created in namespace with multiple operatorgroups, can't pick one automatically
  Normal   RequirementsUnknown       57m                operator-lifecycle-manager  InstallModes now support target namespaces
```

```
$ oc get csv
NAME                                DISPLAY                      VERSION   REPLACES                            PHASE
performance-addon-operator.v4.6.0   Performance Addon Operator   4.6.0                                         Pending
performance-addon-operator.v4.7.0   Performance Addon Operator   4.7.0     performance-addon-operator.v4.6.0   Pending
```
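On a stuck cluster, the duplicate OperatorGroups behind the `TooManyOperatorGroups` event can be confirmed directly. A minimal check, assuming the default install namespace from the documentation:

```
# More than one OperatorGroup in the namespace triggers the warning above:
$ oc get operatorgroups -n openshift-performance-addon-operator

# Show which targetNamespaces each group selects:
$ oc get operatorgroups -n openshift-performance-addon-operator -o yaml
```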
Actual results:

The update to Performance Addon Operator 4.7 fails; both CSVs stay in the Pending phase.

Expected results:

The update to Performance Addon Operator 4.7 should succeed.

Additional info:
Workaround exists: before starting the performance-addon-operator upgrade per the documented steps, edit the performance-addon-operator's OperatorGroup object and remove the `targetNamespaces` entry from the spec, so that it reads:

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: performance-addon-operator
  namespace: openshift-performance-addon-operator
```

Make sure to wait for OLM to process the change; you can be sure it has been processed once:

1. the *status* field of the OperatorGroup lists all the namespaces (`namespaces: ""`), and
2. the performance-addon-operator CSV is deployed in all namespaces (automatically by OLM, no action needed).

Both need to be true before moving forward; #2 is a byproduct of how the new OperatorGroup+CSV combination is set up. A sketch of these checks follows below.
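A minimal sketch of the edit and the two checks above; the object name and namespace assume the default install from the documentation:

```
# Open the OperatorGroup and delete the spec.targetNamespaces entry:
$ oc edit operatorgroup performance-addon-operator -n openshift-performance-addon-operator

# Check 1: status.namespaces should contain the all-namespaces selector (""):
$ oc get operatorgroup performance-addon-operator \
    -n openshift-performance-addon-operator -o jsonpath='{.status.namespaces}'

# Check 2: the CSV should be copied into all namespaces:
$ oc get csv --all-namespaces | grep performance-addon-operator
```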
We verified that:

1. (as expected) the 4.6 operator is still functional while the upgrade is stuck;
2. the moment we apply the fix to the OperatorGroup per https://bugzilla.redhat.com/show_bug.cgi?id=1913826#c1, the upgrade completes successfully.
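Progress can be watched while applying the fix; a minimal check, assuming the default install namespace:

```
# The 4.7 CSV should move from Pending to Succeeded, after which the 4.6 CSV
# is removed:
$ oc get csv -n openshift-performance-addon-operator -w
```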
I removed the UpgradeBlocker keyword since this bug appears to be specific to a performance add-on upgrade as opposed to a general cluster upgrade. There also appears to be a workaround. If I've misunderstood, please re-apply the UpgradeBlocker keyword. I've also attached the UpgradeBlocker evaluation questions below in case more information is needed.

=========================================================================================

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
(In reply to Jack Ottofaro from comment #3)
> I removed the UpgradeBlocker keyword since this bug appears to be specific
> to a performance add-on upgrade as opposed to general cluster upgrade. There
> also appears to be a workaround. [...]

I think that made sense and we can keep it this way. The confusion is between marking bugs that are upgrade blockers for our component (the performance operator), which this issue arguably qualifies as, albeit with a workaround, and bugs that are *platform* upgrade blockers. If the flag is indeed for *platform* blockers, then removing it is most likely correct.

cc @msivak