Description of the problem: Attempting to upgrade the RHACM operator from 2.3.2 to 2.4.0 on an OCP 4.8 bare-metal hub (IPv4), but it fails.

Release version:
Operator snapshot versions:
- 2.3.2-DOWNSTREAM-2021-08-24-15-33-59
- 2.4.0-DOWNSTREAM-2021-08-25-05-45-31

OCP version: Cluster version is 4.8.4

Browser Info: N/A

Steps to reproduce:
1. Deploy an OCP 4.8.4 hub, bare-metal IPI (IPv4)
2. Create a catalogsource from 2.3.2-DOWNSTREAM-2021-08-24-15-33-59
3. Create a 2nd catalogsource from 2.4.0-DOWNSTREAM-2021-08-25-05-45-31
4. Install ACM 2.3 and the MCH from the 2.3 catalogsource
5. Edit the ACM operator subscription and set:
   spec->channel: release-2.4
   spec->source: acm-2.4-snapshot-catalogsource-name
6. Monitor events in the rhacm namespace

Actual results: Upgrade fails - `oc get events` shows:

```
4m13s   Normal    AllRequirementsMet      clusterserviceversion/advanced-cluster-management.v2.4.0   all requirements found, attempting install
34m     Warning   InstallComponentFailed  clusterserviceversion/advanced-cluster-management.v2.4.0   install strategy failed: Deployment.apps "multiclusterhub-operator" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"control-plane":"multiclusterhub-operator", "name":"multiclusterhub-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
9m14s   Normal    NeedsReinstall          clusterserviceversion/advanced-cluster-management.v2.4.0   installing: missing deployment with name=multiclusterhub-operator
54m     Warning   AppliedWithWarnings     installplan/install-jjwxl                                  1 warning(s) generated during installation of operator "advanced-cluster-management.v2.4.0" (CustomResourceDefinition "multiclusterobservabilities.observability.open-cluster-management.io"): observability.open-cluster-management.io/v1beta1 MultiClusterObservability is deprecated in v2.3+, unavailable in v2.6+; use observability.open-cluster-management.io/v1beta2 MultiClusterObservability
```

Expected results: Upgrade succeeds

Additional info:
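For reference, step 5 amounts to pointing the existing Subscription at the new channel and catalogsource. A minimal sketch of what the edited Subscription might look like; the metadata name, namespace, and catalogsource name here are illustrative placeholders, not taken from this report:

```yaml
# Hypothetical Subscription after the edit in step 5.
# Names (acm-subscription, open-cluster-management, acm-custom-snapshot-2-4)
# are assumptions; use the actual names in your cluster.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: acm-subscription
  namespace: open-cluster-management
spec:
  name: advanced-cluster-management
  channel: release-2.4                # was release-2.3
  source: acm-custom-snapshot-2-4     # the 2.4 snapshot catalogsource
  sourceNamespace: openshift-marketplace
```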
Thank you for opening this issue. I see where the problem lies and luckily this should be a quick and easy fix to get in. I will report back when we have made the necessary changes.
The necessary update to the multiclusterhub-operator deployment spec has been made and should be available in the next downstream build.
(In reply to Jakob from comment #3)
> The necessary update to the multiclusterhub-operator deployment spec has
> been made and should be available in the next downstream build

Will you post here the next build that includes it, or should we assume that anything dated '27-08' already has it? Thanks!
Answering myself... I've tested with 2.4.0-SNAPSHOT-2021-08-27-06-14-03. The script ends with the MCH in the following state: Updating. The full MCH status is as follows:

```
COMPONENT                STATUS    TYPE        REASON
application-chart-sub    False     Available   WrongVersion
assisted-service-sub     True      Deployed    InstallSuccessful
cluster-lifecycle-sub    False     Available   WrongVersion
cluster-manager-cr       True      Applied     ClusterManagerApplied
console-chart-sub        False     Available   WrongVersion
discovery-operator-sub   False     Available   WrongVersion
grc-sub                  True      Deployed    UpgradeSuccessful
local-cluster            Unknown   Unknown     No conditions available
management-ingress-sub   False     Available   WrongVersion
multiclusterhub-repo     True      Available   MinimumReplicasAvailable
ocm-controller           True      Available   MinimumReplicasAvailable
ocm-proxyserver          True      Available   MinimumReplicasAvailable
ocm-webhook              True      Available   MinimumReplicasAvailable
policyreport-sub         False     Available   WrongVersion
search-prod-sub          False     Available   WrongVersion
```

On the console, it reports as 2.4.0 with status Succeeded.
If it's made it that far, then it has gotten past the issue in the initial post: the multiclusterhub-operator is getting deployed and is updating the MCH to 2.4. If the reason is displaying as WrongVersion, that means those appsubs are still on their 2.3.x version and are waiting for the standalone subscription operator to reconcile to the current version.
I'm having better luck - in my env it looks successful.

```
$ oc get mch
NAME              STATUS    AGE
multiclusterhub   Running   43h

$ oc get csv
NAME                                 DISPLAY                                      VERSION   REPLACES                             PHASE
advanced-cluster-management.v2.4.0   Advanced Cluster Management for Kubernetes   2.4.0     advanced-cluster-management.v2.3.2   Succeeded
```

I used the following two snapshots to upgrade from:

```
$ oc get catalogsource -A
NAMESPACE               NAME                      DISPLAY                                TYPE   PUBLISHER   AGE
openshift-marketplace   acm-custom-snapshot       2.3.2-DOWNSTREAM-2021-08-25-17-16-16   grpc               43h
openshift-marketplace   acm-custom-snapshot-2-4   2.4.0-DOWNSTREAM-2021-08-27-05-05-14   grpc               102m
```

Hub = Cluster version 4.8.4, IPv4, connected. All pods Running as well. Attached logs with more output.
Also, I did not use the upgrade.sh method - I just updated the operator subscription to the 2.4 channel and the 2.4 catalogsource. I looked at the upgrade script and it does basically the same thing.

@Pablo - Is your env different from mine based on what I listed above?

=================================

Also, here are the MCH component statuses:

```
$ oc get multiclusterhub --all-namespaces -o json | jq -r '.items[].status.components'
{
  "application-chart-sub": {
    "lastTransitionTime": "2021-08-25T20:20:48Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "assisted-service-sub": {
    "lastTransitionTime": "2021-08-25T20:20:52Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "cluster-lifecycle-sub": {
    "lastTransitionTime": "2021-08-25T20:20:53Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "cluster-manager-cr": {
    "lastTransitionTime": "2021-08-27T15:36:10Z",
    "message": "Components of cluster manager is applied",
    "reason": "ClusterManagerApplied",
    "status": "True",
    "type": "Applied"
  },
  "console-chart-sub": {
    "lastTransitionTime": "2021-08-25T20:20:49Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "discovery-operator-sub": {
    "lastTransitionTime": "2021-08-25T20:20:50Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "grc-sub": {
    "lastTransitionTime": "2021-08-25T20:20:52Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "local-cluster": {
    "lastTransitionTime": "2021-08-27T15:36:10Z",
    "message": "ManagedCluster is accepted, joined, and available",
    "reason": "ManagedClusterImported",
    "status": "True",
    "type": "ManagedClusterImportSuccess"
  },
  "management-ingress-sub": {
    "lastTransitionTime": "2021-08-25T20:20:50Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "multiclusterhub-repo": {
    "lastTransitionTime": "2021-08-25T20:19:51Z",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  "ocm-controller": {
    "lastTransitionTime": "2021-08-25T20:20:51Z",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  "ocm-proxyserver": {
    "lastTransitionTime": "2021-08-25T20:22:31Z",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  "ocm-webhook": {
    "lastTransitionTime": "2021-08-25T20:20:24Z",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  "policyreport-sub": {
    "lastTransitionTime": "2021-08-25T20:20:50Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  },
  "search-prod-sub": {
    "lastTransitionTime": "2021-08-25T20:20:55Z",
    "reason": "UpgradeSuccessful",
    "status": "True",
    "type": "Deployed"
  }
}
```
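Updating the subscription in place, as described above, can also be done non-interactively. A hedged sketch (the subscription name and namespace are assumptions, and the catalogsource name is taken from the `oc get catalogsource` output earlier in this thread; verify both with `oc get subscription -n <acm-namespace>` before running):

```shell
# Sketch only: switch the ACM operator subscription to the 2.4 channel
# and the 2.4 snapshot catalogsource in a single patch.
oc patch subscription advanced-cluster-management \
  -n open-cluster-management \
  --type merge \
  -p '{"spec":{"channel":"release-2.4","source":"acm-custom-snapshot-2-4"}}'
```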
Now I'm getting failures intermittently, similar to Pablo where certain components in MCH fail to upgrade. Trying to reproduce again.
If the intermittent failures are caused by a WrongVersion Reason in the mch status then I would expect something like that to be temporary and the upgrade would eventually progress. Next time you encounter that error on an upgrade could you leave it alone and see if it works itself out and how long that takes?
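One way to "leave it alone and see if it works itself out" while still capturing timing data is to poll the MCH component statuses until nothing reports WrongVersion. A sketch, assuming `oc` is logged in to the hub and `jq` is installed (the 30-second interval is an arbitrary choice):

```shell
# Poll until no MCH component reports reason=WrongVersion,
# printing a timestamped count each iteration.
while true; do
  remaining=$(oc get multiclusterhub --all-namespaces -o json \
    | jq '[.items[].status.components[] | select(.reason == "WrongVersion")] | length')
  echo "$(date): ${remaining} component(s) still on WrongVersion"
  [ "${remaining}" = "0" ] && break
  sleep 30
done
```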
Yes, that's what I'm planning to do. I'll monitor it, keeping track of time, and leave it alone until it (hopefully) upgrades.
I just ran it again and it eventually completed. It took about 1 hour and 20 minutes. Is that normal for an upgrade?
That's not normal, and not desirable. I think it is due to the low reconcile rate of the HelmRepo subscription channel. I will see if I can improve that so it reconciles sooner.
Jakob - If it helps, I repeated the command below every 10 seconds for the duration of the upgrade. Output is attached as 2-3-24-upgrade.log.

```
date; oc get multiclusterhub --all-namespaces -o json | jq -r '.items[].status.components'
```
Thanks for the upgrade log. It looks to me like there could be something else going on that's causing the CSV upgrade to fail for some time. ``` NAME DISPLAY VERSION REPLACES PHASE advanced-cluster-management.v2.3.2 Advanced Cluster Management for Kubernetes 2.3.2 Replacing advanced-cluster-management.v2.4.0 Advanced Cluster Management for Kubernetes 2.4.0 advanced-cluster-management.v2.3.2 Failed ``` The long time it takes for the multiclusterhub upgrade to finish shouldn't affect the subscription from completing its replacement.
I made a change to the mch operator last week to increase the reconciliation frequency on the appsubs. Downstream builds from 9/03 and on should include the change.
My CI ran earlier and it shows the upgrade took 10 minutes with 2.4.0-DOWNSTREAM-2021-09-07-03-25-30. I'll run through one more time manually, but that is much better.
It sounds like the upgrade is being more consistent. Can this issue be closed?