Bug 1913826 - Upgrading Performance addon operator from 4.6 to 4.7 fails
Summary: Upgrading Performance addon operator from 4.6 to 4.7 fails
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Performance Addon Operator
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Francesco Romani
QA Contact: Niranjan Mallapadi Raghavender
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-07 17:05 UTC by Niranjan Mallapadi Raghavender
Modified: 2021-11-26 14:26 UTC
CC: 8 users

Fixed In Version:
Doc Type: Release Note
Doc Text:
Issue: Upgrading the Performance Addon Operator from 4.6 to 4.7 fails with the error:
"Warning TooManyOperatorGroups 11m operator-lifecycle-manager csv created in namespace with multiple operatorgroups, can't pick one automatically"
Follow the procedure below to modify the OperatorGroup object, removing the targetNamespaces entry, before upgrading.
Procedure:
1. Edit the Performance Addon Operator OperatorGroup CR and remove the spec element that contains the targetNamespaces entry by running the following command:
   $ oc patch operatorgroup -n openshift-performance-addon-operator performance-addon-operator --type json -p '[{ "op": "remove", "path": "/spec" }]'
2. Wait until the Operator Lifecycle Manager (OLM) processes the change.
3. Verify that the OperatorGroup CR change has been successfully applied; check that the OperatorGroup CR spec element does not contain target namespaces:
   $ oc describe -n openshift-performance-addon-operator og openshift-performance-addon-operator
4. Proceed with the upgrade.
Clone Of:
Environment:
Last Closed: 2021-02-10 14:58:59 UTC
Target Upstream Version:
Embargoed:



Description Niranjan Mallapadi Raghavender 2021-01-07 17:05:55 UTC
Description of problem:
Upgrading the Performance Addon Operator from version 4.6 to 4.7 fails with the error:

  "Warning  TooManyOperatorGroups     11m                operator-lifecycle-manager  csv created in namespace with multiple operatorgroups, can't pick one automatically" 


[root@dell-r730-009 performance]# oc get csv
NAME                                DISPLAY                      VERSION   REPLACES                            PHASE
performance-addon-operator.v4.6.0   Performance Addon Operator   4.6.0                                         Pending
performance-addon-operator.v4.7.0   Performance Addon Operator   4.7.0     performance-addon-operator.v4.6.0   Pending
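The OLM events behind the Pending phase can be listed with, for example (the namespace here is assumed to be the one PAO was installed into; --sort-by is standard oc/kubectl usage):

$ oc get events -n openshift-performance-addon-operator --sort-by=.lastTimestamp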

Version-Release number of selected component (if applicable):
Performance Addon Operator 4.7

How reproducible:

1. Set up OCP 4.7.
2. Configure PAO as described in https://docs.openshift.com/container-platform/4.6/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#inst[…]aster (a sketch of the OperatorGroup this creates follows below).
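For context, the linked 4.6 install procedure creates an OperatorGroup scoped to the operator's own namespace. A sketch of what that object typically looks like (reconstructed from the docs; the exact manifest may differ):

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: performance-addon-operator
  namespace: openshift-performance-addon-operator
spec:
  targetNamespaces:
  - openshift-performance-addon-operator

This spec.targetNamespaces entry is what the workaround in comment #1 removes.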
3. Create a performance profile as shown below:

apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: hugepages
spec:
  cpu:
    reserved: "0-3"
    isolated: "4-31"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 1
      node: 0
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

4. Verify the performance profile is applied
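One way to check this (standard oc commands; the MCP name is inferred from the nodeSelector above and may differ in your setup):

$ oc get performanceprofile hugepages
$ oc get mcp worker-cnf    # wait until UPDATED is True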

5. Upgrade procedure:

5.1. Update registry certificates for master and worker
oc apply -f 99-master-registries.yaml 
oc apply -f 99-worker-registries.yaml
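Assuming 99-master-registries.yaml and 99-worker-registries.yaml are MachineConfig manifests, the Machine Config Operator rolls the change out to the nodes; the rollout can be watched with:

$ oc get mcp -w    # wait until UPDATED is True for both master and worker pools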

5.2 Patch the OperatorHub and set "/spec/disableAllDefaultSources" to true:
oc patch OperatorHub cluster --type json -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'
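To confirm the patch took effect and the default catalog sources are gone (standard oc commands):

$ oc get operatorhub cluster -o jsonpath='{.spec.disableAllDefaultSources}'
true
$ oc get catalogsource -n openshift-marketplace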

5.3 Specify the registry mirrors:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: brew-registry
spec:
  repositoryDigestMirrors:
  - mirrors:
    - brew.registry.redhat.io
    source: registry.redhat.io
  - mirrors:
    - brew.registry.redhat.io
    source: registry.stage.redhat.io
  - mirrors:
    - brew.registry.redhat.io
    source: registry-proxy.engineering.redhat.com

oc apply -f imageContentSourcePolicy.yaml
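One way to confirm the mirror configuration landed on a node (the node name is illustrative):

$ oc debug node/<node-name> -- chroot /host cat /etc/containers/registries.conf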

5.4: Git clone cnf-internal-deploy and specify the catalog source:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: iib-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: brew.registry.redhat.io/rh-osbs/iib-pub-pending:v4.7
  displayName: IIB Operator Catalog

$ oc apply -f ./catalog_source.yaml
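Before patching the subscription, the catalog source should report READY, for example:

$ oc get catalogsource iib-operator-catalog -n openshift-marketplace -o jsonpath='{.status.connectionState.lastObservedState}'
READY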

5.5: Patch subscriptions to update from 4.6 to 4.7:
oc patch subscriptions openshift-performance-addon-operator-subscription  --type json -p '[{"op": "replace", "path": "/spec/channel", "value": "4.7"}]'


oc patch subscriptions openshift-performance-addon-operator-subscription --type json -p '[{"op": "replace", "path": "/spec/source", "value":"iib-operator-catalog"}]'

Wait for the update to complete; one way to monitor it is sketched below.
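For example (the subscription's namespace is assumed here; the patch commands above presumably ran with that project already selected):

$ oc get subscription openshift-performance-addon-operator-subscription -n openshift-performance-addon-operator -o jsonpath='{.status.state}'
AtLatestKnown
$ oc get csv -n openshift-performance-addon-operator -w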

$ oc describe csv/performance-addon-operator.v4.7.0 shows the following events:
Events:
  Type     Reason                    Age                From                        Message
  ----     ------                    ----               ----                        -------
  Warning  UnsupportedOperatorGroup  87m (x2 over 87m)  operator-lifecycle-manager  OwnNamespace InstallModeType not supported, cannot configure to watch own namespace
  Warning  TooManyOperatorGroups     63m                operator-lifecycle-manager  csv created in namespace with multiple operatorgroups, can't pick one automatically
  Normal   RequirementsUnknown       57m                operator-lifecycle-manager  InstallModes now support target namespaces
[root@dell-r730-009 performance]# 

$ oc get csv
NAME                                DISPLAY                      VERSION   REPLACES                            PHASE
performance-addon-operator.v4.6.0   Performance Addon Operator   4.6.0                                         Pending
performance-addon-operator.v4.7.0   Performance Addon Operator   4.7.0     performance-addon-operator.v4.6.0   Pending


Actual results:
Update to Performance Addon Operator 4.7 fails 

Expected results:
Update to Performance Addon Operator 4.7 should succeed.

Additional info:

Comment 1 Francesco Romani 2021-01-08 15:08:30 UTC
A workaround exists: before kicking off the Performance Addon Operator upgrade per the documented steps, edit the performance-addon-operator's OperatorGroup object and remove the `targetNamespaces` entry from the spec, so that it reads:
```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: performance-addon-operator
  namespace: openshift-performance-addon-operator
```
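The same edit can be applied non-interactively with the patch command from the release note text above:

$ oc patch operatorgroup -n openshift-performance-addon-operator performance-addon-operator --type json -p '[{ "op": "remove", "path": "/spec" }]'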

Please make sure to wait for OLM to process the changes; you can be sure OLM has processed them when:
1. the *status* field of the OperatorGroup lists all the namespaces (namespaces: "")
2. the Performance Addon Operator CSV is deployed (automatically by OLM, no action needed) in all the namespaces

Both need to be true before moving forward; #2 is a byproduct of how the new OperatorGroup+CSV is set up. Example checks are sketched below.
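For instance (a sketch; the grep filter is illustrative):

$ oc get operatorgroup -n openshift-performance-addon-operator performance-addon-operator -o jsonpath='{.status.namespaces}'
$ oc get csv --all-namespaces | grep performance-addon-operator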

Comment 2 Francesco Romani 2021-01-11 14:52:48 UTC
We verified that:
1. (as expected) the 4.6 operator is still functional while the upgrade is stuck
2. the moment we apply the fix to the OperatorGroup per https://bugzilla.redhat.com/show_bug.cgi?id=1913826#c1, the upgrade completes successfully.

Comment 3 Jack Ottofaro 2021-01-25 20:20:33 UTC
I removed the UpgradeBlocker keyword since this bug appears to be specific to a performance add-on upgrade as opposed to general cluster upgrade. There also appears to be a workaround. If I've misunderstood please re-apply the UpgradeBlocker keyword. I've also attached the UpgradeBlocker evaluation questions below in case more information is needed.

=========================================================================================

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 5 Francesco Romani 2021-01-29 08:22:00 UTC
(In reply to Jack Ottofaro from comment #3)
> I removed the UpgradeBlocker keyword since this bug appears to be specific
> to a performance add-on upgrade as opposed to general cluster upgrade. There
> also appears to be a workaround. If I've misunderstood please re-apply the
> UpgradeBlocker keyword. I've also attached the UpgradeBlocker evaluation
> questions below in case more information is needed.
> 
> [... UpgradeBlocker evaluation questions snipped; see comment #3 ...]

I think that made sense and we can keep it this way. The confusion is between marking things that are upgrade blockers for our component (the Performance Addon Operator), which this issue arguably was, albeit with a workaround, and things that are *platform* upgrade blockers. If the flag is indeed for *platform* blockers, then removing it is most likely correct. cc @msivak

