Description of the problem: Attempting to upgrade the RHACM operator from 2.3.2 to 2.4.0 on an OCP 4.8 bare-metal hub (IPv4), but it fails.

Release version:
Operator snapshot versions:
- 2.3.2-DOWNSTREAM-2021-08-24-15-33-59
- 2.4.0-DOWNSTREAM-2021-08-25-05-45-31

OCP version: Cluster version is 4.8.4

Browser Info: N/A

Steps to reproduce:
1. Deploy an OCP 4.8.4 hub, bare-metal IPI (IPv4)
2. Create a catalogsource from 2.3.2-DOWNSTREAM-2021-08-24-15-33-59
3. Create a 2nd catalogsource from 2.4.0-DOWNSTREAM-2021-08-25-05-45-31
4. Install ACM 2.3 and the MCH from the 2.3 catalogsource
5. Edit the ACM operator subscription and set:
   spec->channel: release-2.4
   spec->source: acm-2.4-snapshot-catalogsource-name
6. Monitor events in the rhacm namespace

Actual results: Upgrade fails - `oc get events` shows:

```
4m13s   Normal    AllRequirementsMet      clusterserviceversion/advanced-cluster-management.v2.4.0   all requirements found, attempting install
34m     Warning   InstallComponentFailed  clusterserviceversion/advanced-cluster-management.v2.4.0   install strategy failed: Deployment.apps "multiclusterhub-operator" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"control-plane":"multiclusterhub-operator", "name":"multiclusterhub-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
9m14s   Normal    NeedsReinstall          clusterserviceversion/advanced-cluster-management.v2.4.0   installing: missing deployment with name=multiclusterhub-operator
54m     Warning   AppliedWithWarnings     installplan/install-jjwxl                                  1 warning(s) generated during installation of operator "advanced-cluster-management.v2.4.0" (CustomResourceDefinition "multiclusterobservabilities.observability.open-cluster-management.io"): observability.open-cluster-management.io/v1beta1 MultiClusterObservability is deprecated in v2.3+, unavailable in v2.6+; use observability.open-cluster-management.io/v1beta2 MultiClusterObservability
```

Expected results: Upgrade succeeds

Additional info:
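For reference, step 5 amounts to pointing the existing Subscription at the new channel and catalogsource. A minimal sketch of what the edited Subscription might look like; the metadata name, namespace, and catalogsource name here are illustrative placeholders, not taken from this report:

```yaml
# Hypothetical Subscription after the edit in step 5.
# Names (acm-subscription, open-cluster-management, acm-custom-snapshot-2-4)
# are assumptions; use the actual names in your cluster.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: acm-subscription
  namespace: open-cluster-management
spec:
  name: advanced-cluster-management
  channel: release-2.4                # was release-2.3
  source: acm-custom-snapshot-2-4     # the 2.4 snapshot catalogsource
  sourceNamespace: openshift-marketplace
```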
Thank you for opening this issue. I see where the problem lies and luckily this should be a quick and easy fix to get in. I will report back when we have made the necessary changes.
The necessary update to the multiclusterhub-operator deployment spec has been made and should be available in the next downstream build.
(In reply to Jakob from comment #3)
> The necessary update to the multiclusterhub-operator deployment spec has
> been made and should be available in the next downstream build

Will you post here the next build that includes it, or should we assume that anything dated '27-08' already has it? Thanks!
Answering myself... I've tested with 2.4.0-SNAPSHOT-2021-08-27-06-14-03. The script ends with the MCH in the following state: Updating. The full MCH status is as follows:

```
COMPONENT                STATUS    TYPE        REASON
application-chart-sub    False     Available   WrongVersion
assisted-service-sub     True      Deployed    InstallSuccessful
cluster-lifecycle-sub    False     Available   WrongVersion
cluster-manager-cr       True      Applied     ClusterManagerApplied
console-chart-sub        False     Available   WrongVersion
discovery-operator-sub   False     Available   WrongVersion
grc-sub                  True      Deployed    UpgradeSuccessful
local-cluster            Unknown   Unknown     No conditions available
management-ingress-sub   False     Available   WrongVersion
multiclusterhub-repo     True      Available   MinimumReplicasAvailable
ocm-controller           True      Available   MinimumReplicasAvailable
ocm-proxyserver          True      Available   MinimumReplicasAvailable
ocm-webhook              True      Available   MinimumReplicasAvailable
policyreport-sub         False     Available   WrongVersion
search-prod-sub          False     Available   WrongVersion
```

On the console, it reports as 2.4.0 with status Succeeded.
If it's made it that far, then it has gotten past the issue in the initial post: the multiclusterhub-operator is getting deployed and is updating the MCH to 2.4. If the reason is displaying as WrongVersion, that means those appsubs are still on their 2.3.x version and are waiting for the standalone subscription operator to reconcile to the current version.
I'm having better luck - in my env it looks successful.

```
$ oc get mch
NAME              STATUS    AGE
multiclusterhub   Running   43h

$ oc get csv
NAME                                 DISPLAY                                      VERSION   REPLACES                             PHASE
advanced-cluster-management.v2.4.0   Advanced Cluster Management for Kubernetes   2.4.0     advanced-cluster-management.v2.3.2   Succeeded
```

I used the following two snapshots to upgrade from:

```
$ oc get catalogsource -A
NAMESPACE               NAME                      DISPLAY                                TYPE   PUBLISHER   AGE
openshift-marketplace   acm-custom-snapshot       2.3.2-DOWNSTREAM-2021-08-25-17-16-16   grpc               43h
openshift-marketplace   acm-custom-snapshot-2-4   2.4.0-DOWNSTREAM-2021-08-27-05-05-14   grpc               102m
```

Hub = Cluster version 4.8.4, IPv4, connected. All pods Running as well. Attached logs with more output.
Also, I did not use the upgrade.sh method - I just updated the operator subscription to the 2.4 channel and the 2.4 catalogsource. I looked at the upgrade script and it does basically the same thing.

@Pablo - Is your env different from mine based on what I listed above?

=================================

Also, here are the MCH component statuses:

```
$ oc get multiclusterhub --all-namespaces -o json | jq -r '.items[].status.components'
{
  "application-chart-sub": {
    "lastTransitionTime": "2021-08-25T20:20:48Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "assisted-service-sub": {
    "lastTransitionTime": "2021-08-25T20:20:52Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "cluster-lifecycle-sub": {
    "lastTransitionTime": "2021-08-25T20:20:53Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "cluster-manager-cr": {
    "lastTransitionTime": "2021-08-27T15:36:10Z",
    "message": "Components of cluster manager is applied",
    "reason": "ClusterManagerApplied",
    "status": "True",
    "type": "Applied"
  },
  "console-chart-sub": {
    "lastTransitionTime": "2021-08-25T20:20:49Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "discovery-operator-sub": {
    "lastTransitionTime": "2021-08-25T20:20:50Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "grc-sub": {
    "lastTransitionTime": "2021-08-25T20:20:52Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "local-cluster": {
    "lastTransitionTime": "2021-08-27T15:36:10Z",
    "message": "ManagedCluster is accepted, joined, and available",
    "reason": "ManagedClusterImported",
    "status": "True",
    "type": "ManagedClusterImportSuccess"
  },
  "management-ingress-sub": {
    "lastTransitionTime": "2021-08-25T20:20:50Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "multiclusterhub-repo": {
    "lastTransitionTime": "2021-08-25T20:19:51Z",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  "ocm-controller": {
    "lastTransitionTime": "2021-08-25T20:20:51Z",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  "ocm-proxyserver": {
    "lastTransitionTime": "2021-08-25T20:22:31Z",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  "ocm-webhook": {
    "lastTransitionTime": "2021-08-25T20:20:24Z",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  "policyreport-sub": {
    "lastTransitionTime": "2021-08-25T20:20:50Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "search-prod-sub": {
    "lastTransitionTime": "2021-08-25T20:20:55Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  }
}
```
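Updating the subscription in place, as described above, can also be done non-interactively. A hedged sketch (the subscription name and namespace are assumptions, and the catalogsource name is taken from the `oc get catalogsource` output earlier in this thread; verify both with `oc get subscription -n <acm-namespace>` before running):

```shell
# Sketch only: switch the ACM operator subscription to the 2.4 channel
# and the 2.4 snapshot catalogsource in a single patch.
oc patch subscription advanced-cluster-management \
  -n open-cluster-management \
  --type merge \
  -p '{"spec":{"channel":"release-2.4","source":"acm-custom-snapshot-2-4"}}'
```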
Now I'm getting failures intermittently, similar to Pablo where certain components in MCH fail to upgrade. Trying to reproduce again.
If the intermittent failures are caused by a WrongVersion Reason in the mch status then I would expect something like that to be temporary and the upgrade would eventually progress. Next time you encounter that error on an upgrade could you leave it alone and see if it works itself out and how long that takes?
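One way to "leave it alone and see if it works itself out" while still capturing timing data is to poll the MCH component statuses until nothing reports WrongVersion. A sketch, assuming `oc` is logged in to the hub and `jq` is installed (the 30-second interval is an arbitrary choice):

```shell
# Poll until no MCH component reports reason=WrongVersion,
# printing a timestamped count each iteration.
while true; do
  remaining=$(oc get multiclusterhub --all-namespaces -o json \
    | jq '[.items[].status.components[] | select(.reason == "WrongVersion")] | length')
  echo "$(date): ${remaining} component(s) still on WrongVersion"
  [ "${remaining}" = "0" ] && break
  sleep 30
done
```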
Yes, that's what I'm planning to do. I'll monitor it, keeping track of time, and leave it alone until it (hopefully) upgrades.
I just ran it again and it eventually completed. It took about 1 hour and 20 minutes. Is that normal for an upgrade?
That's not normal, and not desirable. I think it is due to the low reconcile rate of the HelmRepo subscription channel. I will see if I can improve that so it reconciles sooner.
Jakob - If it helps, I repeated the command below every 10 seconds for the duration of the upgrade. Output is attached as 2-3-24-upgrade.log.

```
date; oc get multiclusterhub --all-namespaces -o json | jq -r '.items[].status.components'
```
Thanks for the upgrade log. It looks to me like there could be something else going on that's causing the CSV upgrade to fail for some time. ``` NAME DISPLAY VERSION REPLACES PHASE advanced-cluster-management.v2.3.2 Advanced Cluster Management for Kubernetes 2.3.2 Replacing advanced-cluster-management.v2.4.0 Advanced Cluster Management for Kubernetes 2.4.0 advanced-cluster-management.v2.3.2 Failed ``` The long time it takes for the multiclusterhub upgrade to finish shouldn't affect the subscription from completing its replacement.
I made a change to the mch operator last week to increase the reconciliation frequency on the appsubs. Downstream builds from 9/03 and on should include the change.
My CI ran earlier and it shows the upgrade took 10 minutes with 2.4.0-DOWNSTREAM-2021-09-07-03-25-30. I'll run through one more time manually, but that is much better.
It sounds like the upgrade is being more consistent. Can this issue be closed?