Description of problem (please be as detailed as possible and provide log snippets):
In external mode, upgrade fails while connecting to the external Ceph cluster.

Version of all relevant components (if applicable):
Upgrade from ocs-operator.v4.9.3 to ocs-registry:4.10.0-189

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Not able to upgrade ocs-operator.

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
2/2

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Upgrade ODF from ocs-operator.v4.9.3 to ocs-registry:4.10.0-189
2. Check all operators
3.

Actual results:
ocs-operator.v4.10.0 is in Pending state

Expected results:
ocs-operator.v4.10.0 should be in Succeeded state

Additional info:

> csv status

$ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.10.0              NooBaa Operator               4.10.0    mcg-operator.v4.9.3   Succeeded
ocs-operator.v4.10.0              OpenShift Container Storage   4.10.0    ocs-operator.v4.9.3   Pending
ocs-operator.v4.9.3               OpenShift Container Storage   4.9.3     ocs-operator.v4.9.2   Replacing
odf-csi-addons-operator.v4.10.0   CSI Addons                    4.10.0                          Succeeded
odf-operator.v4.10.0              OpenShift Data Foundation     4.10.0    odf-operator.v4.9.3   Succeeded

$ oc describe csv ocs-operator.v4.10.0
Status:
  Cleanup:
  Conditions:
    Last Transition Time:  2022-03-14T15:01:14Z
    Last Update Time:      2022-03-14T15:01:14Z
    Message:               requirements not yet checked
    Phase:                 Pending
    Reason:                RequirementsUnknown
    Last Transition Time:  2022-03-14T15:01:14Z
    Last Update Time:      2022-03-14T15:01:14Z
    Message:               operator is not upgradeable: The operator is not upgradeable: StorageCluster is not ready.
    Phase:                 Pending
    Reason:                OperatorConditionNotUpgradeable
    Last Transition Time:  2022-03-14T15:01:14Z
    Last Update Time:      2022-03-14T15:01:14Z
    Message:               operator is not upgradeable: The operator is not upgradeable: StorageCluster is not ready.
    Phase:                 Pending
    Reason:                OperatorConditionNotUpgradeable
Events:  <none>

$ oc get storagecluster
NAME                          AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-external-storagecluster   13h   Ready   true       2022-03-14T14:44:07Z   4.9.0

$ oc describe storagecluster ocs-external-storagecluster
Name:         ocs-external-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  storagesystem.odf.openshift.io/watched-by: ocs-external-storagecluster-storagesystem
              uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
Status:
  Conditions:
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T14:44:09Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T14:49:55Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T15:06:59Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T14:44:08Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T14:49:55Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  Upgradeable
    Last Heartbeat Time:   2022-03-14T14:44:20Z
    Last Transition Time:  2022-03-14T14:44:10Z
    Message:               External CephCluster is trying to connect: Attempting to connect to an external Ceph cluster
    Reason:                ExternalClusterStateConnecting
    Status:                True
    Type:                  ExternalClusterConnecting
    Last Heartbeat Time:   2022-03-14T14:44:20Z
    Last Transition Time:  2022-03-14T14:44:10Z
    Message:               External CephCluster is trying to connect: Attempting to connect to an external Ceph cluster
    Reason:                ExternalClusterStateConnecting
    Status:                False
    Type:                  ExternalClusterConnected
  External Secret Hash:    51db4eab08fed8ea264bed93c58e2ae2b60a828c41076b59450ac037be0e1dcccd423bcbc74f842cabab36504ce77c2a6582c854fd92495b094f6cf09a84dac1

$ oc get cephcluster
NAME                                      DATADIRHOSTPATH   MONCOUNT   AGE   PHASE       MESSAGE                          HEALTH      EXTERNAL
ocs-external-storagecluster-cephcluster                                13h   Connected   Cluster connected successfully   HEALTH_OK   true

$ oc describe cephcluster ocs-external-storagecluster-cephcluster
Name:         ocs-external-storagecluster-cephcluster
Namespace:    openshift-storage
Labels:       app=ocs-external-storagecluster
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephCluster
Status:
  Ceph:
    Capacity:
      Bytes Available:  12405755543552
      Bytes Total:      14399255347200
      Bytes Used:       1993499803648
      Last Updated:     2022-03-15T04:36:02Z
    Health:        HEALTH_OK
    Last Checked:  2022-03-15T04:36:02Z
    Versions:
      Mds:      ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable): 2
      Mgr:      ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable): 3
      Mon:      ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable): 3
      Osd:      ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable): 12
      Overall:  ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable): 21
      Rgw:      ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable): 1
  Conditions:
    Last Heartbeat Time:   2022-03-14T14:44:10Z
    Last Transition Time:  2022-03-14T14:44:10Z
    Message:               Attempting to connect to an external Ceph cluster
    Reason:                ClusterConnecting
    Status:                True
    Type:                  Connecting
    Last Heartbeat Time:   2022-03-15T04:36:08Z
    Last Transition Time:  2022-03-14T14:44:33Z
    Message:               Cluster connected successfully
    Reason:                ClusterConnected
    Status:                True
    Type:                  Connected
  Message:  Cluster connected successfully
  Phase:    Connected
  State:    Connected
  Version:
    Version:  14.2.11-147
Events:       <none>

> job link: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3518/console

> must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-046vu1ce33-ua/j-046vu1ce33-ua_20220314T141536/logs/failed_testcase_ocs_logs_1647269882/test_upgrade_ocs_logs/
> cephobjectstore is in Connected state

$ oc describe cephobjectstore ocs-external-storagecluster-cephobjectstore
Name:         ocs-external-storagecluster-cephobjectstore
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephObjectStore
Status:
  Bucket Status:
    Health:        Connected
    Last Changed:  2022-03-14T20:07:49Z
    Last Checked:  2022-03-15T04:50:51Z
    Info:
      Endpoint:  http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080
  Phase:         Connected
Events:          <none>

> pods are in Running state

$ oc get pods
NAME                                               READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-57966b99d-ltm8n      2/2     Running   0             13h
csi-cephfsplugin-dxh4d                             3/3     Running   0             14h
csi-cephfsplugin-m4ns5                             3/3     Running   0             14h
csi-cephfsplugin-provisioner-688699dc8-tbrnf       6/6     Running   0             14h
csi-cephfsplugin-provisioner-688699dc8-wh8vz       6/6     Running   0             14h
csi-cephfsplugin-t9fzj                             3/3     Running   0             14h
csi-rbdplugin-8l6wl                                3/3     Running   0             14h
csi-rbdplugin-kchwc                                3/3     Running   0             14h
csi-rbdplugin-provisioner-58cb769786-5rg56         6/6     Running   0             14h
csi-rbdplugin-provisioner-58cb769786-wdjbb         6/6     Running   0             14h
csi-rbdplugin-vjpkj                                3/3     Running   0             14h
noobaa-core-0                                      1/1     Running   0             14h
noobaa-db-pg-0                                     1/1     Running   4 (13h ago)   13h
noobaa-endpoint-6d46d6875b-6cfv8                   1/1     Running   0             14h
noobaa-operator-644bf9bb74-hqckj                   1/1     Running   0             13h
ocs-metrics-exporter-7d87cdfcd-rlpsw               1/1     Running   0             14h
ocs-operator-5d7b75df8f-tdf6t                      1/1     Running   0             14h
odf-console-74f4bc8b99-2f8ws                       1/1     Running   0             13h
odf-operator-controller-manager-5f5db6bf97-x4snd   2/2     Running   0             13h
rook-ceph-operator-5758fc4894-jtwrs                1/1     Running   0             14h
rook-ceph-tools-external-65cfc7bbd5-jmzs7          1/1     Running   0             14h

> odf-operator-controller-manager-5f5db6bf97-x4snd log

2022-03-15T04:59:26.184Z INFO controllers.StorageSystem vendor CSV is installed and ready {"instance": "openshift-storage/ocs-external-storagecluster-storagesystem", "ClusterServiceVersion": "odf-csi-addons-operator.v4.10.0"}
2022-03-15T04:59:26.189Z ERROR controller-runtime.manager.controller.storagesystem Reconciler error {"reconciler group": "odf.openshift.io", "reconciler kind": "StorageSystem", "name": "ocs-external-storagecluster-storagesystem", "namespace": "openshift-storage", "error": "CSV is not successfully installed"}

$ oc describe storagesystem
Name:         ocs-external-storagecluster-storagesystem
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  odf.openshift.io/v1alpha1
Kind:         StorageSystem
Status:
  Conditions:
    Last Heartbeat Time:   2022-03-15T04:59:26Z
    Last Transition Time:  2022-03-14T15:00:53Z
    Message:               Reconcile is in progress
    Reason:                Reconciling
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2022-03-15T04:59:26Z
    Last Transition Time:  2022-03-14T15:00:53Z
    Message:               Reconcile is in progress
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2022-03-15T04:59:26Z
    Last Transition Time:  2022-03-14T14:44:07Z
    Message:               StorageSystem CR is valid
    Reason:                Valid
    Status:                False
    Type:                  StorageSystemInvalid
    Last Heartbeat Time:   2022-03-15T04:59:26Z
    Last Transition Time:  2022-03-14T15:00:53Z
    Message:               CSV is not successfully installed
    Reason:                NotReady
    Status:                False
    Type:                  VendorCsvReady
    Last Heartbeat Time:   2022-03-14T14:44:07Z
    Last Transition Time:  2022-03-14T14:44:07Z
    Reason:                Found
    Status:                True
    Type:                  VendorSystemPresent
Events:
  Type     Reason           Age                 From                      Message
  ----     ------           ---                 ----                      -------
  Warning  ReconcileFailed  39m (x13 over 14h)  StorageSystem controller  CSV is not successfully installed

> cluster still exists in case it is needed for live debugging
Is this just happening with external clusters only? If yes, have we tested upgrade previously for external clusters?
(In reply to Mudit Agarwal from comment #4)
> Is this just happening with external clusters only?
> If yes, have we tested upgrade previously for external clusters?

It happened only on an external cluster, and the previous version upgrade passed (v4.8.4 to ocs-registry:4.9.0-249.ci).

Old run log file:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-037vu1ce33-ua/j-037vu1ce33-ua_20211125T130856/logs/ocs-ci-logs-1637848904/tests/ecosystem/upgrade/test_upgrade.py/test_upgrade/logs

Old job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-external-3m-3w-upgrade-ocs-auto/37/console
Thanks, but I want to know whether this was ever run for a 4.9 to 4.10 upgrade in external mode or not.
(In reply to Mudit Agarwal from comment #6)
> Thanks, but I want to know whether this was ever run for a 4.9 to 4.10
> upgrade in external mode or not.

No, this was the first time we ran a 4.9 to 4.10 upgrade in external mode.
Looks like the change in https://github.com/red-hat-storage/ocs-operator/pull/1291, which was merged into ocs-operator v4.9.z, is causing this issue, and only for external mode clusters. When external mode clusters were upgraded from 4.8 to 4.9 (with this change), they were marked as "Not Upgradeable". So all future upgrades will be blocked, including upgrades to newer 4.9.z releases. We'd have to somehow fix this in 4.9 and then upgrade to 4.10.

Is it possible to get an external mode cluster upgraded from 4.8 to 4.9 (latest) to verify this?
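For context, a minimal sketch of the gating mechanism being described: the operator reports an "Upgradeable" OperatorCondition, and while it is False, OLM keeps the replacement CSV in Pending with reason OperatorConditionNotUpgradeable (as seen in the `oc describe csv` output above). The type and function names below are illustrative assumptions, not actual ocs-operator or OLM code.

```go
package main

import "fmt"

// Toy model of the upgrade gate. The real mechanism is the OperatorCondition
// CR that ocs-operator writes and OLM consults before replacing a CSV; this
// struct only mirrors the fields relevant to the failure above.
type operatorCondition struct {
	Type    string // e.g. "Upgradeable"
	Status  string // "True" or "False"
	Message string
}

// canReplaceCSV mirrors the decision seen in the CSV status: while
// Upgradeable is False, the replacement CSV stays Pending with reason
// OperatorConditionNotUpgradeable.
func canReplaceCSV(cond operatorCondition) (bool, string) {
	if cond.Type == "Upgradeable" && cond.Status == "False" {
		return false, "operator is not upgradeable: " + cond.Message
	}
	return true, ""
}

func main() {
	// If the 4.8 -> 4.9 upgrade left this condition at False for external
	// mode clusters and nothing later flips it back to True, every
	// subsequent upgrade (newer 4.9.z or 4.10) is blocked with the message
	// captured in the CSV status.
	stuck := operatorCondition{
		Type:    "Upgradeable",
		Status:  "False",
		Message: "The operator is not upgradeable: StorageCluster is not ready.",
	}
	ok, reason := canReplaceCSV(stuck)
	fmt.Printf("replace allowed: %v, reason: %s\n", ok, reason)
}
```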
This issue wasn't reproduced on a cluster upgraded from 4.8 to 4.9, so that cluster can be further upgraded to 4.10. But there is one similarity in the CephCluster status. On both clusters I see this status:

```
- lastHeartbeatTime: "2022-03-14T14:44:10Z"
  lastTransitionTime: "2022-03-14T14:44:10Z"
  message: Attempting to connect to an external Ceph cluster
  reason: ClusterConnecting
  status: "True"
  type: Connecting
- lastHeartbeatTime: "2022-03-14T15:15:20Z"
  lastTransitionTime: "2022-03-14T14:44:33Z"
  message: Cluster connected successfully
  reason: ClusterConnected
  status: "True"
  type: Connected
```

The "Connected" and "Connecting" conditions can't both be True at the same time. This, along with the change pointed out in comment #8, might be causing this issue. Still trying to pinpoint when and why exactly this happens.
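To make the concern concrete, here is a small illustrative sketch (hypothetical check functions, not Rook or OCS code) of how a readiness check keyed off the stale Connecting=True condition reaches a different conclusion than one keyed off Connected:

```go
package main

import "fmt"

// The two entries mirror the CephCluster conditions captured above:
// Connecting=True was never cleared after Connected became True.
type condition struct {
	Type   string
	Status string
}

// stillConnecting treats any Connecting=True entry as "not yet connected",
// even when a Connected=True entry is also present.
func stillConnecting(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Connecting" && c.Status == "True" {
			return true
		}
	}
	return false
}

// isConnected only trusts the Connected condition itself.
func isConnected(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Connected" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	conds := []condition{
		{Type: "Connecting", Status: "True"}, // stale: never flipped back to False
		{Type: "Connected", Status: "True"},
	}
	fmt.Println("check keyed on Connecting says still connecting:", stillConnecting(conds)) // true
	fmt.Println("check keyed on Connected says connected:", isConnected(conds))             // true
}
```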
Since this issue was reproduced multiple times (for 4.9 to 4.10 upgrades), adding devel_ack+.

Posted a fix here: https://github.com/red-hat-storage/ocs-operator/pull/1597. Once we test this, it needs to be backported to 4.10 and 4.9.
Despite this PR being merged into release-4.10, I don't believe it is an adequate solution. As per the original design, an external StorageCluster should report a Phase of "Connected" instead of "Ready" if reconciliation completed without any errors. @vavuthu, when you next test this, please check whether the StorageCluster is showing "Ready" after the upgrade.

That said, this is probably a regression introduced in ODF 4.9 that we've just never caught until now. As such, I do think it would be fair to open another BZ against it, but I will leave it up to QE whether they want to consider it a blocker or not.
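For illustration, a minimal sketch of the phase convention described above, using an assumed helper name (this is not actual ocs-operator code or its real phase constants):

```go
package main

import "fmt"

// expectedPhase is a hypothetical helper that captures the convention from
// the comment above: after a clean reconcile, an external-mode
// StorageCluster is expected to report "Connected", while an internal-mode
// cluster reports "Ready".
func expectedPhase(externalMode bool, reconcileErr error) string {
	if reconcileErr != nil {
		return "Error"
	}
	if externalMode {
		return "Connected"
	}
	return "Ready"
}

func main() {
	fmt.Println("external, clean reconcile:", expectedPhase(true, nil))  // Connected
	fmt.Println("internal, clean reconcile:", expectedPhase(false, nil)) // Ready
}
```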
Upgrade from 4.9.5-4 to ocs-registry:4.10.0-210 is successful:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3629/console
(In reply to Jose A. Rivera from comment #18)
> Despite this PR being merged into release-4.10, I don't believe it is an
> adequate solution. As per the original design, an external StorageCluster
> should report a Phase of "Connected" instead of "Ready" if reconciliation
> completed without any errors. @vavuthu, when you next test this, please
> check whether the StorageCluster is showing "Ready" after the upgrade.
>
> That said, this is probably a regression introduced in ODF 4.9 that we've
> just never caught until now. As such, I do think it would be fair to open
> another BZ against it, but I will leave it up to QE whether they want to
> consider it a blocker or not.

Is it the StorageCluster or the CephCluster? I can see the StorageCluster status as Ready and the CephCluster as Connected.

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthuext-upgrade/vavuthuext-upgrade_20220325T061303/logs/testcases_1648191511/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2657d3206a2ff2a742043b6432387fe63fba26c056bf8afd0296ffc42cf29fe1/namespaces/openshift-storage/oc_output/storagecluster

Name:         ocs-external-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  storagesystem.odf.openshift.io/watched-by: ocs-external-storagecluster-storagesystem
              uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
Images:
  Ceph:
    Desired Image:  quay.io/rhceph-dev/rhceph@sha256:82acc1ae5b6ee7f4c9100dbc803054b8edd7b77c64461966ecba621b3380f14b
  Noobaa Core:
    Actual Image:   quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:805be8933e81fa2b037a843f9bf521f46ce25b828b28854c1d377dedc23f7f08
    Desired Image:  quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:805be8933e81fa2b037a843f9bf521f46ce25b828b28854c1d377dedc23f7f08
  Noobaa DB:
    Actual Image:   quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e
    Desired Image:  quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e
Phase: Ready

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthuext-upgrade/vavuthuext-upgrade_20220325T061303/logs/testcases_1648191511/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2657d3206a2ff2a742043b6432387fe63fba26c056bf8afd0296ffc42cf29fe1/ceph/namespaces/openshift-storage/ceph.rook.io/cephclusters/

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  creationTimestamp: "2022-03-25T06:47:08Z"
  finalizers:
  - cephcluster.ceph.rook.io
  generation: 1
  labels:
    app: ocs-external-storagecluster
  name: ocs-external-storagecluster-cephcluster
  namespace: openshift-storage
  - lastHeartbeatTime: "2022-03-25T07:12:30Z"
    lastTransitionTime: "2022-03-25T06:47:17Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected
  message: Cluster connected successfully
  phase: Connected
  state: Connected
Since the fix is in 4.9.5-4, we need to upgrade from 4.9.5-4 to ocs-registry:4.10.0-210.

1. Upgrade from 4.9.5-4 to ocs-registry:4.10.0-210
   Result: PASS
   https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3629/console

Marking as verified.