Description of problem (please be as detailed as possible and provide log snippets):

Upgrade of a 4.7 cluster to a 4.8 internal build is getting stuck.

NAME                         DISPLAY                       VERSION        REPLACES              PHASE
ocs-operator.v4.7.0          OpenShift Container Storage   4.7.0                                Replacing
ocs-operator.v4.8.0-399.ci   OpenShift Container Storage   4.8.0-399.ci   ocs-operator.v4.7.0   Pending

Version of all relevant components (if applicable):
OCS: ocs-operator.v4.8.0-399.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the cluster is not upgraded.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Haven't tried yet

Can this issue be reproduced from the UI?
Haven't tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install OCS 4.7 GA cluster in external mode
2. Upgrade to 4.8 internal build

Actual results:
Upgrade is stuck in the Pending state

Expected results:
Successful upgrade to 4.8

Additional info:
Must gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j008vu1ce33-ua/j008vu1ce33-ua_20210520T004158/logs/failed_testcase_ocs_logs_1621474062/test_upgrade_ocs_logs/
Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/901/
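For anyone triaging this, a quick way to inspect a CSV stuck like this (a minimal sketch; namespace and CSV name are taken from the output above) is:

# Inspect why the new CSV stays in Pending and which InstallPlan/Subscription it belongs to.
$ oc get csv,subscription,installplan -n openshift-storage
$ oc describe csv ocs-operator.v4.8.0-399.ci -n openshift-storage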
The rook-ceph-operator has logs like:

2021-05-20T01:25:55.475079285Z E0520 01:25:55.475014 7 reflector.go:138] pkg/mod/k8s.io/client-go.0/tools/cache/reflector.go:167: Failed to watch *v1.Secret: the server has received too many requests and has asked us to try again later (get secrets)
2021-05-20T01:25:55.475079285Z E0520 01:25:55.475028 7 reflector.go:138] pkg/mod/k8s.io/client-go.0/tools/cache/reflector.go:167: Failed to watch *v1.CephNFS: the server has received too many requests and has asked us to try again later (get cephnfses.ceph.rook.io)
2021-05-20T01:25:55.475079285Z E0520 01:25:55.475049 7 reflector.go:138] pkg/mod/k8s.io/client-go.0/tools/cache/reflector.go:167: Failed to watch *v1.CephClient: the server has received too many requests and has asked us to try again later (get cephclients.ceph.rook.io)
2021-05-20T01:25:55.475079285Z E0520 01:25:55.475056 7 reflector.go:138] pkg/mod/k8s.io/client-go.0/tools/cache/reflector.go:167: Failed to watch *v1.Deployment: the server has received too many requests and has asked us to try again later (get deployments.apps)
2021-05-20T01:25:55.475079285Z E0520 01:25:55.475067 7 reflector.go:138] pkg/mod/k8s.io/client-go.0/tools/cache/reflector.go:167: Failed to watch *v1.CephFilesystem: the server has received too many requests and has asked us to try again later (get cephfilesystems.ceph.rook.io)

Then it stopped logging anything. Could it be that the cluster is under heavy load during the upgrade and needs more time to complete? Thanks
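To help rule the heavy-load theory in or out, a couple of quick checks (a sketch, not taken from this must-gather) would be whether the operator pod is still healthy and whether it eventually resumed logging:

# Check the operator pod itself and whether reconciliation resumed after the throttling errors.
$ oc get pods -n openshift-storage -l app=rook-ceph-operator
$ oc logs -n openshift-storage deploy/rook-ceph-operator --since=30m | tail -n 50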
Re-triggered the job here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-external-3m-3w-upgrade-ocs-auto/9/
Failed again. The OCS 4.7 deployment on this external cluster passed without issues.

Must gather after deployment: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009vu1ce33-ua/j009vu1ce33-ua_20210602T204421/logs/deployment_1622667481/
Must gather after unsuccessful upgrade: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009vu1ce33-ua/j009vu1ce33-ua_20210602T204421/logs/failed_testcase_ocs_logs_1622671982/test_upgrade_ocs_logs/

So it looks like the issue is still consistently reproducible.
After a second analysis, I didn't find anything suspicious in the Rook logs. After discussing with Petr, the issue seems to be more on the CSV side. Moving to OCS-op for better analysis. Thanks.
Moving back to Rook. Thanks Umanga for looking.
Scheduled a new verification job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-external-3m-3w-upgrade-ocs-auto/14 It is currently waiting for PSI OpenStack to come back up before the job is actually scheduled.
ocs_ci.ocs.exceptions.ResourceWrongStatusException: Resource Resource: ocs-operator.v4.8.0-417.ci is not in expected phase: Succeeded

I still see the upgrade test failing:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1138/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j014vu1ce33-ua/j014vu1ce33-ua_20210616T201836/logs/failed_testcase_ocs_logs_1623879680/test_upgrade_ocs_logs/

FAILED QE
Can this issue be related to this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1971593 and the missing rgw-admin-ops-user value in the config, as mentioned by Sidhanth in comment https://bugzilla.redhat.com/show_bug.cgi?id=1971593#c11?
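A quick way to confirm that theory would be to check whether the currently applied external cluster details contain the rgw-admin-ops-user entry (a sketch; the secret name appears later in this bug, while the external_cluster_details data key is my assumption about how the JSON is stored):

# Dump the applied external cluster details and look for the rgw-admin-ops-user entry.
# The "external_cluster_details" data key is an assumption, not taken from the must-gather.
$ oc get secret rook-ceph-external-cluster-details -n openshift-storage \
    -o jsonpath='{.data.external_cluster_details}' | base64 -d | python3 -m json.tool | grep -B1 -A5 rgw-admin-ops-user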
Yes Petr, this is now most likely the case. I opened https://bugzilla.redhat.com/show_bug.cgi?id=1974441 yesterday to document the upgrade.
A few queries regarding the upgrade of an external cluster:

1. How will customers download the new script? The download option is only available when we select external storage while creating the storage cluster (since we already have a storage cluster in the upgrade scenario, we can't download the new script).

2. Let's say we have the new script and the new JSON output: at what stage of the upgrade do we need to apply it? Before starting the upgrade, or post upgrade once we hit failures?
(In reply to Vijay Avuthu from comment #17)
> 1. How will customers download the new script? The download option is only
> available when we select external storage while creating the storage cluster
> (since we already have a storage cluster in the upgrade scenario, we can't
> download the new script).

I don't know, but for the CI you can introspect the rook-ceph operator image, read the /etc/ceph-csv-templates/rook-ceph-ocp.vVERSION.clusterserviceversion.yaml.in file, parse the externalClusterScript key, and decode the base64.

> 2. Let's say we have the new script and the new JSON output: at what stage of
> the upgrade do we need to apply it? Before starting the upgrade, or post
> upgrade once we hit failures?

We need to run the script **before** the upgrade.
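For the CI, that introspection could look roughly like this (a hedged sketch: the image reference, the grep/awk parsing and the output filename are assumptions; only the template path and the externalClusterScript key come from the comment above):

# Copy the CSV templates out of the rook-ceph operator image (image reference is a placeholder).
$ mkdir -p /tmp/ceph-csv-templates
$ oc image extract <rook-ceph-operator-image>:<tag> --path /etc/ceph-csv-templates/:/tmp/ceph-csv-templates/

# Pull the embedded script out of the template and decode it.
# Assumes the value sits on a single "externalClusterScript: <base64>" line in the template.
$ grep -h externalClusterScript /tmp/ceph-csv-templates/rook-ceph-ocp.v*.clusterserviceversion.yaml.in \
    | awk '{print $2}' | tr -d "'\"" | base64 -d > external-cluster-exporter.py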
If we need to run the script **before** the upgrade: I am afraid this will break the cluster - at least Vijay tried to deploy a new external cluster with OCS 4.7 and the newly generated file, and it actually failed. So I have a concern that if we apply this new config, the cluster will be in bad shape even before the upgrade - I haven't tested this - so we need to test this scenario to see if it works or not. But if we see that the deployment is not working with the new config, my guess is it will not have a good impact on a running cluster either.

@vavuthu can you please elaborate more on the error you saw during deployment of 4.7 with the new config?

@shan can this perhaps be automated as part of the upgrade? Or can the structure of the config be made such that it will not break a 4.7 deployment if we apply the config from the 4.8 script in 4.7?

Petr
Honestly, I don't see what could break a 4.7 cluster if you inject the new JSON. The format is the same; we just append a new entry, and all the rest is identical. So indeed, @vavuthu can you please elaborate on the errors you saw? Thanks.
> Before upgrade, ran new external secret ( json o/p ) generated

$ oc apply -f external_secret_new_48
Warning: resource secrets/rook-ceph-external-cluster-details is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
secret/rook-ceph-external-cluster-details configured
$

> unpause the upgrade job after checking external secret is updated
> ocs-operator.v4.8.0-432.ci is in pending state

11:09:16 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.8.0-432.ci is in phase: Pending!
11:09:16 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration

> CephObjectStore is connected

$ oc describe CephObjectStore | grep -i phase -B3
          .:
          f:bucketStatus:
            f:info:
          f:phase:
--
    Last Checked:  2021-06-29T13:53:57Z
    Info:
      Endpoint:  http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080
    Phase:       Connected
[vavuthu@vavuthu ext]$

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES              PHASE
ocs-operator.v4.7.1          OpenShift Container Storage   4.7.1                                Replacing
ocs-operator.v4.8.0-432.ci   OpenShift Container Storage   4.8.0-432.ci   ocs-operator.v4.7.1   Pending

$ oc describe csv ocs-operator.v4.8.0-432.ci
Name:       ocs-operator.v4.8.0-432.ci
Namespace:  openshift-storage
Status:
  Cleanup:
  Conditions:
    Last Transition Time: 2021-06-29T10:55:25Z   Last Update Time: 2021-06-29T10:55:25Z   Message: requirements not yet checked                 Phase: Pending   Reason: RequirementsUnknown
    Last Transition Time: 2021-06-29T10:55:25Z   Last Update Time: 2021-06-29T10:55:25Z   Message: one or more requirements couldn't be found   Phase: Pending   Reason: RequirementsNotMet
    Last Transition Time: 2021-06-29T10:55:25Z   Last Update Time: 2021-06-29T10:55:25Z   Message: one or more requirements couldn't be found   Phase: Pending   Reason: RequirementsNotMet
  Requirement Status:
    Group: operators.coreos.com   Kind: ClusterServiceVersion      Name: ocs-operator.v4.8.0-432.ci                                 Status: Present               Version: v1alpha1   Message: CSV minKubeVersion (1.16.0) less than server version (v1.21.0-rc.0+766a5fe)
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: backingstores.noobaa.io                                    Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: bucketclasses.noobaa.io                                    Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephblockpools.ceph.rook.io                                Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephclients.ceph.rook.io                                   Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephclusters.ceph.rook.io                                  Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephfilesystemmirrors.ceph.rook.io                         Status: NotPresent            Version: v1   Message: CRD is not present
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephfilesystems.ceph.rook.io                               Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephnfses.ceph.rook.io                                     Status: Present               Version: v1   Uuid: 6bab4f9c-b004-4fde-8f08-a0c02b1b2a84   Message: CRD is present and Established condition is true
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephobjectrealms.ceph.rook.io                              Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephobjectstores.ceph.rook.io                              Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephobjectstoreusers.ceph.rook.io                          Status: Present               Version: v1   Uuid: d360e42e-b863-4e4a-a94b-0cfc1f2d2966   Message: CRD is present and Established condition is true
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephobjectzonegroups.ceph.rook.io                          Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephobjectzones.ceph.rook.io                               Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: cephrbdmirrors.ceph.rook.io                                Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: namespacestores.noobaa.io                                  Status: Present               Version: v1   Uuid: a9b0c1c2-4b8c-4f41-847a-209eac43bafa   Message: CRD is present and Established condition is true
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: noobaas.noobaa.io                                          Status: Present               Version: v1   Uuid: 622c7b3f-b10c-483f-b708-c4738d6f4ec8   Message: CRD is present and Established condition is true
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: objectbucketclaims.objectbucket.io                         Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: objectbuckets.objectbucket.io                              Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: ocsinitializations.ocs.openshift.io                        Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: storageclusters.ocs.openshift.io                           Status: PresentNotSatisfied   Version: v1   Message: CRD installed alongside other CSV(s): ocs-operator.v4.7.1
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: volumereplicationclasses.replication.storage.openshift.io  Status: NotPresent            Version: v1   Message: CRD is not present
    Group: apiextensions.k8s.io   Kind: CustomResourceDefinition   Name: volumereplications.replication.storage.openshift.io        Status: Present               Version: v1   Uuid: 92647e2c-56da-41ae-8ea8-9bf9a4c035dc   Message: CRD is present and Established condition is true
    Group:                        Kind: ServiceAccount             Name: noobaa-endpoint                                            Status: NotPresent            Version: v1   Message: Service account does not exist
    Group:                        Kind: ServiceAccount             Name: rook-ceph-admission-controller-role                        Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: rook-csi-cephfs-plugin-sa                                  Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: rook-ceph-osd                                              Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: rook-ceph-mgr                                              Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: rook-csi-cephfs-provisioner-sa                             Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: rook-csi-rbd-provisioner-sa                                Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: noobaa                                                     Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: rook-ceph-system                                           Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: rook-ceph-cmd-reporter                                     Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: ocs-operator                                               Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: rook-csi-rbd-plugin-sa                                     Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
    Group:                        Kind: ServiceAccount             Name: ocs-metrics-exporter                                       Status: PresentNotSatisfied   Version: v1   Message: Service account is owned by another ClusterServiceVersion
Events:
  Type    Reason               Age                  From                        Message
  ----    ------               ----                 ----                        -------
  Normal  RequirementsUnknown  160m (x3 over 160m)  operator-lifecycle-manager  requirements not yet checked
  Normal  RequirementsNotMet   160m                 operator-lifecycle-manager  one or more requirements couldn't be found

Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j021vu1ce33-ua/j021vu1ce33-ua_20210629T081528/logs/failed_testcase_ocs_logs_1624962838/test_upgrade_ocs_logs/
This looks more like an OLM / upgrade problem than a rook-ceph upgrade issue. From the CSV conditions:

    Message: waiting for install components to report healthy
    Phase:   Installing
    Reason:  InstallSucceeded

    Last Transition Time: 2021-06-29T09:06:51Z
    Last Update Time:     2021-06-29T09:06:51Z
    Message: installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
    Phase:   Installing
    Reason:  InstallWaiting

    Last Transition Time: 2021-06-29T09:08:45Z
    Last Update Time:     2021-06-29T09:08:45Z
    Message: install strategy completed with no errors
    Phase:   Succeeded
    Reason:  InstallSucceeded

    Last Transition Time: 2021-06-29T10:55:25Z
    Last Update Time:     2021-06-29T10:55:25Z
    Message: being replaced by csv: ocs-operator.v4.8.0-432.ci
    Phase:   Replacing
    Reason:  BeingReplaced

    Last Transition Time: 2021-06-29T10:55:25Z
    Last Update Time:     2021-06-29T10:55:25Z
    Message: being replaced by csv: ocs-operator.v4.8.0-432.ci
    Phase:   Replacing
    Reason:  BeingReplaced

Umanga, any idea? Moving to ocs for further investigation.
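For a CSV stuck in Pending like this, the OLM operator logs usually say which requirement or validation check is failing. A hedged sketch of where to look (standard OLM namespace and deployments, not taken from this must-gather):

# OLM's own operators log the requirement/validation failures that keep a CSV in Pending.
$ oc logs -n openshift-operator-lifecycle-manager deploy/olm-operator --since=1h | grep -i ocs-operator
$ oc logs -n openshift-operator-lifecycle-manager deploy/catalog-operator --since=1h | grep -iE 'error|ocs'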
Error: failed: error validating existing CRs against new CRD's schema: cephobjectstores.ceph.rook.io: error validating custom resource against new schema for CephObjectStore openshift-storage/ocs-external-storagecluster-cephobjectstore: [].spec.gateway.instances: Invalid value: 0: spec.gateway.instances in body should be greater than or equal to 1

Thanks Umanga for looking.
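In other words, the upgrade blocks because the existing external-mode CephObjectStore has spec.gateway.instances set to 0, which the new 4.8 CRD schema no longer allows. A quick way to confirm the current value on an affected cluster (a sketch; the resource name is taken from the error above):

# Show the gateway instance count on the object store named in the error.
$ oc get cephobjectstore ocs-external-storagecluster-cephobjectstore -n openshift-storage \
    -o jsonpath='{.spec.gateway.instances}{"\n"}'

Judging from the following comments, the fix landed on the operator side in the 4.8.0-433.ci build, so patching the CR by hand should not be necessary.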
Vijay, please give it another try with 4.8.0-433.ci
Update:
=========
> Install OCS 4.7.2
> Before upgrade, apply the external secret generated from the script downloaded from an OCS 4.8 installation
> Continue with the upgrade (to ocs-registry:4.8.0-433.ci)
> Upgrade succeeded without any issue
> Post-upgrade verification failed while verifying storage classes

Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1278/consoleFull

Will raise a separate bug for the missing storage class after upgrade.
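A quick manual check for that follow-up issue (a sketch; the expected class names mentioned in the comment are assumptions, not taken from this job's logs):

# List the OCS storage classes after the upgrade; in external mode one would typically expect
# classes such as ocs-external-storagecluster-ceph-rbd, -cephfs and -ceph-rgw (assumed names).
$ oc get storageclass | grep -i ocs-external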
logs:
=====

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES              PHASE
ocs-operator.v4.8.0-433.ci   OpenShift Container Storage   4.8.0-433.ci   ocs-operator.v4.7.2   Succeeded
$

11:25:52 - MainThread - ocs_ci.ocs.resources.storage_cluster - INFO - Verifying OCS installation
11:25:52 - MainThread - ocs_ci.ocs.resources.storage_cluster - INFO - verifying ocs csv
11:25:52 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-marketplace get CatalogSource ocs-catalogsource -n openshift-marketplace -o yaml
11:25:54 - Thread-2 - ocs_ci.utility.retry - WARNING - list index out of range, Retrying in 5 seconds...
11:25:57 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-marketplace get packagemanifest -n openshift-marketplace --selector=ocs-operator-internal=true -o yaml
11:26:02 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Check if OCS operator: ocs-operator.v4.8.0-433.ci is in Succeeded phase.
11:26:02 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get csv ocs-operator.v4.8.0-433.ci -n openshift-storage -o yaml
11:26:04 - Thread-2 - ocs_ci.utility.retry - WARNING - list index out of range, Retrying in 5 seconds...
11:26:08 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.8.0-433.ci is in phase: Succeeded!
11:26:08 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get csv ocs-operator.v4.8.0-433.ci -n openshift-storage -o yaml
11:26:14 - Thread-2 - ocs_ci.utility.retry - WARNING - list index out of range, Retrying in 5 seconds...
11:26:14 - MainThread - ocs_ci.ocs.resources.storage_cluster - INFO - Check if OCS version: 4.8 matches with CSV: 4.8.0-433.ci
11:26:14 - MainThread - ocs_ci.ocs.resources.storage_cluster - INFO - Check if OCS registry image: 4.8.0-433.ci matches with CSV: 4.8.0-433.ci
11:26:14 - MainThread - ocs_ci.ocs.resources.storage_cluster - INFO - Verifying status of storage cluster: ocs-external-storagecluster
11:26:14 - MainThread - ocs_ci.ocs.resources.storage_cluster - INFO - Check if StorageCluster: ocs-external-storagecluster is inSucceeded phase
Changing status to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003