Bug 2064107 - [External Mode] upgrade fails with connecting to external Ceph cluster [NEEDINFO]
Summary: [External Mode] upgrade fails with connecting to external Ceph cluster
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: umanga
QA Contact: Vijay Avuthu
URL:
Whiteboard:
Depends On:
Blocks: 2066997
 
Reported: 2022-03-15 04:39 UTC by Vijay Avuthu
Modified: 2023-08-09 17:00 UTC (History)

Fixed In Version: 4.10.0-210
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2066997
Environment:
Last Closed: 2022-04-21 09:12:53 UTC
Embargoed:
vavuthu: needinfo? (jrivera)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage ocs-operator pull 1597 | 0 | None | open | set upgrade conditions to true in external mode deployment | 2022-03-23 14:10:32 UTC
Github | red-hat-storage ocs-operator pull 1601 | 0 | None | open | Bug 2064107: [release-4.10] set upgrade conditions to true in external mode deployment | 2022-03-23 14:26:49 UTC

Description Vijay Avuthu 2022-03-15 04:39:58 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

In external mode, the upgrade fails: the ocs-operator CSV stays in Pending while the StorageCluster still reports that it is connecting to the external Ceph cluster.

Version of all relevant components (if applicable):

upgrade from ocs-operator.v4.9.3 to ocs-registry:4.10.0-189

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Not able to upgrade ocs-operator

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
2/2

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Upgrade ODF from ocs-operator.v4.9.3 to ocs-registry:4.10.0-189.
2. Check the status of all operators.


Actual results:

ocs-operator.v4.10.0 is stuck in the Pending phase.


Expected results:

ocs-operator.v4.10.0 should be in the Succeeded phase.


Additional info:

> csv status

$ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.10.0              NooBaa Operator               4.10.0    mcg-operator.v4.9.3   Succeeded
ocs-operator.v4.10.0              OpenShift Container Storage   4.10.0    ocs-operator.v4.9.3   Pending
ocs-operator.v4.9.3               OpenShift Container Storage   4.9.3     ocs-operator.v4.9.2   Replacing
odf-csi-addons-operator.v4.10.0   CSI Addons                    4.10.0                          Succeeded
odf-operator.v4.10.0              OpenShift Data Foundation     4.10.0    odf-operator.v4.9.3   Succeeded
$ 
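A quick way to spot the stuck CSV in output like the above is to filter on the PHASE column. The following is a minimal Python sketch, assuming the text of `oc get csv` has been captured into a string. The helper name `non_succeeded_csvs` is hypothetical, and the sketch relies on NAME being the first whitespace-separated token and PHASE the last one on each row.

```python
# Sample copied from the `oc get csv` output in this report.
SAMPLE = """\
NAME                              DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.10.0              NooBaa Operator               4.10.0    mcg-operator.v4.9.3   Succeeded
ocs-operator.v4.10.0              OpenShift Container Storage   4.10.0    ocs-operator.v4.9.3   Pending
ocs-operator.v4.9.3               OpenShift Container Storage   4.9.3     ocs-operator.v4.9.2   Replacing
odf-csi-addons-operator.v4.10.0   CSI Addons                    4.10.0                          Succeeded
odf-operator.v4.10.0              OpenShift Data Foundation     4.10.0    odf-operator.v4.9.3   Succeeded
"""


def non_succeeded_csvs(oc_get_csv_output: str) -> dict:
    """Map CSV name -> phase for every row whose phase is not Succeeded."""
    rows = [line.split() for line in oc_get_csv_output.splitlines() if line.strip()]
    # rows[0] is the header; NAME is the first token and PHASE the last one,
    # so blank REPLACES cells and multi-word DISPLAY values do not matter.
    return {row[0]: row[-1] for row in rows[1:] if row[-1] != "Succeeded"}


print(non_succeeded_csvs(SAMPLE))
```

For the sample above this reports ocs-operator.v4.10.0 (Pending) and ocs-operator.v4.9.3 (Replacing), matching the failure described in this report.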

> 

$ oc describe csv ocs-operator.v4.10.0

Status:
  Cleanup:
  Conditions:
    Last Transition Time:  2022-03-14T15:01:14Z
    Last Update Time:      2022-03-14T15:01:14Z
    Message:               requirements not yet checked
    Phase:                 Pending
    Reason:                RequirementsUnknown
    Last Transition Time:  2022-03-14T15:01:14Z
    Last Update Time:      2022-03-14T15:01:14Z
    Message:               operator is not upgradeable: The operator is not upgradeable: StorageCluster is not ready.
    Phase:                 Pending
    Reason:                OperatorConditionNotUpgradeable
  Last Transition Time:    2022-03-14T15:01:14Z
  Last Update Time:        2022-03-14T15:01:14Z
  Message:                 operator is not upgradeable: The operator is not upgradeable: StorageCluster is not ready.
  Phase:                   Pending
  Reason:                  OperatorConditionNotUpgradeable
Events:                    <none>

> 
$ oc get storagecluster
NAME                          AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-external-storagecluster   13h   Ready   true       2022-03-14T14:44:07Z   4.9.0
$ 


$ oc describe storagecluster ocs-external-storagecluster
Name:         ocs-external-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  storagesystem.odf.openshift.io/watched-by: ocs-external-storagecluster-storagesystem
              uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster


Status:
  Conditions:
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T14:44:09Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T14:49:55Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T15:06:59Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T14:44:08Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2022-03-15T04:33:58Z
    Last Transition Time:  2022-03-14T14:49:55Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  Upgradeable
    Last Heartbeat Time:   2022-03-14T14:44:20Z
    Last Transition Time:  2022-03-14T14:44:10Z
    Message:               External CephCluster is trying to connect: Attempting to connect to an external Ceph cluster
    Reason:                ExternalClusterStateConnecting
    Status:                True
    Type:                  ExternalClusterConnecting
    Last Heartbeat Time:   2022-03-14T14:44:20Z
    Last Transition Time:  2022-03-14T14:44:10Z
    Message:               External CephCluster is trying to connect: Attempting to connect to an external Ceph cluster
    Reason:                ExternalClusterStateConnecting
    Status:                False
    Type:                  ExternalClusterConnected
  External Secret Hash:    51db4eab08fed8ea264bed93c58e2ae2b60a828c41076b59450ac037be0e1dcccd423bcbc74f842cabab36504ce77c2a6582c854fd92495b094f6cf09a84dac1
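In the conditions above, the two ExternalCluster* entries last received a heartbeat at 2022-03-14T14:44:20Z, while every other condition was refreshed at 2022-03-15T04:33:58Z. The following minimal Python sketch flags such stale conditions. It assumes the conditions have been loaded as dicts with `type`, `status`, and `lastHeartbeatTime` keys (for example from `oc get storagecluster -o json`); the helper names and the one-hour threshold are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    # Kubernetes-style timestamps look like 2022-03-15T04:33:58Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def stale_conditions(conditions, max_lag=timedelta(hours=1)):
    """Return the types of conditions whose heartbeat lags the newest one by more than max_lag."""
    newest = max(parse_ts(c["lastHeartbeatTime"]) for c in conditions)
    return [
        c["type"]
        for c in conditions
        if newest - parse_ts(c["lastHeartbeatTime"]) > max_lag
    ]


# Values copied from the StorageCluster status in this report.
FRESH = "2022-03-15T04:33:58Z"
OLD = "2022-03-14T14:44:20Z"
conditions = [
    {"type": "ReconcileComplete", "status": "True", "lastHeartbeatTime": FRESH},
    {"type": "Available", "status": "True", "lastHeartbeatTime": FRESH},
    {"type": "Progressing", "status": "False", "lastHeartbeatTime": FRESH},
    {"type": "Degraded", "status": "False", "lastHeartbeatTime": FRESH},
    {"type": "Upgradeable", "status": "True", "lastHeartbeatTime": FRESH},
    {"type": "ExternalClusterConnecting", "status": "True", "lastHeartbeatTime": OLD},
    {"type": "ExternalClusterConnected", "status": "False", "lastHeartbeatTime": OLD},
]

print(stale_conditions(conditions))
```

With the values from this report, only ExternalClusterConnecting and ExternalClusterConnected are flagged, which suggests these two conditions were not updated after the initial connection attempt.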

> 

$ oc get cephcluster
NAME                                      DATADIRHOSTPATH   MONCOUNT   AGE   PHASE       MESSAGE                          HEALTH      EXTERNAL
ocs-external-storagecluster-cephcluster                                13h   Connected   Cluster connected successfully   HEALTH_OK   true
$ 

$ oc describe cephcluster ocs-external-storagecluster-cephcluster
Name:         ocs-external-storagecluster-cephcluster
Namespace:    openshift-storage
Labels:       app=ocs-external-storagecluster
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephCluster

Status:
  Ceph:
    Capacity:
      Bytes Available:  12405755543552
      Bytes Total:      14399255347200
      Bytes Used:       1993499803648
      Last Updated:     2022-03-15T04:36:02Z
    Health:             HEALTH_OK
    Last Checked:       2022-03-15T04:36:02Z
    Versions:
      Mds:
        ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable):  2
      Mgr:
        ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable):  3
      Mon:
        ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable):  3
      Osd:
        ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable):  12
      Overall:
        ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable):  21
      Rgw:
        ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable):  1
  Conditions:
    Last Heartbeat Time:   2022-03-14T14:44:10Z
    Last Transition Time:  2022-03-14T14:44:10Z
    Message:               Attempting to connect to an external Ceph cluster
    Reason:                ClusterConnecting
    Status:                True
    Type:                  Connecting
    Last Heartbeat Time:   2022-03-15T04:36:08Z
    Last Transition Time:  2022-03-14T14:44:33Z
    Message:               Cluster connected successfully
    Reason:                ClusterConnected
    Status:                True
    Type:                  Connected
  Message:                 Cluster connected successfully
  Phase:                   Connected
  State:                   Connected
  Version:
    Version:  14.2.11-147
Events:       <none>


> job link: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3518/console

> must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-046vu1ce33-ua/j-046vu1ce33-ua_20220314T141536/logs/failed_testcase_ocs_logs_1647269882/test_upgrade_ocs_logs/

Comment 3 Vijay Avuthu 2022-03-15 05:09:58 UTC
> cephobjectstore is in connected state

$ oc describe cephobjectstore ocs-external-storagecluster-cephobjectstore
Name:         ocs-external-storagecluster-cephobjectstore
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephObjectStore

Status:
  Bucket Status:
    Health:        Connected
    Last Changed:  2022-03-14T20:07:49Z
    Last Checked:  2022-03-15T04:50:51Z
  Info:
    Endpoint:  http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc:8080
  Phase:       Connected
Events:        <none>


> pods are in running state

$ oc get pods
NAME                                               READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-57966b99d-ltm8n      2/2     Running   0             13h
csi-cephfsplugin-dxh4d                             3/3     Running   0             14h
csi-cephfsplugin-m4ns5                             3/3     Running   0             14h
csi-cephfsplugin-provisioner-688699dc8-tbrnf       6/6     Running   0             14h
csi-cephfsplugin-provisioner-688699dc8-wh8vz       6/6     Running   0             14h
csi-cephfsplugin-t9fzj                             3/3     Running   0             14h
csi-rbdplugin-8l6wl                                3/3     Running   0             14h
csi-rbdplugin-kchwc                                3/3     Running   0             14h
csi-rbdplugin-provisioner-58cb769786-5rg56         6/6     Running   0             14h
csi-rbdplugin-provisioner-58cb769786-wdjbb         6/6     Running   0             14h
csi-rbdplugin-vjpkj                                3/3     Running   0             14h
noobaa-core-0                                      1/1     Running   0             14h
noobaa-db-pg-0                                     1/1     Running   4 (13h ago)   13h
noobaa-endpoint-6d46d6875b-6cfv8                   1/1     Running   0             14h
noobaa-operator-644bf9bb74-hqckj                   1/1     Running   0             13h
ocs-metrics-exporter-7d87cdfcd-rlpsw               1/1     Running   0             14h
ocs-operator-5d7b75df8f-tdf6t                      1/1     Running   0             14h
odf-console-74f4bc8b99-2f8ws                       1/1     Running   0             13h
odf-operator-controller-manager-5f5db6bf97-x4snd   2/2     Running   0             13h
rook-ceph-operator-5758fc4894-jtwrs                1/1     Running   0             14h
rook-ceph-tools-external-65cfc7bbd5-jmzs7          1/1     Running   0             14h

> odf-operator-controller-manager-5f5db6bf97-x4snd log

2022-03-15T04:59:26.184Z	INFO	controllers.StorageSystem	vendor CSV is installed and ready	{"instance": "openshift-storage/ocs-external-storagecluster-storagesystem", "ClusterServiceVersion": "odf-csi-addons-operator.v4.10.0"}
2022-03-15T04:59:26.189Z	ERROR	controller-runtime.manager.controller.storagesystem	Reconciler error	{"reconciler group": "odf.openshift.io", "reconciler kind": "StorageSystem", "name": "ocs-external-storagecluster-storagesystem", "namespace": "openshift-storage", "error": "CSV is not successfully installed"}

$ oc describe storagesystem 
Name:         ocs-external-storagecluster-storagesystem
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  odf.openshift.io/v1alpha1
Kind:         StorageSystem

Status:
  Conditions:
    Last Heartbeat Time:   2022-03-15T04:59:26Z
    Last Transition Time:  2022-03-14T15:00:53Z
    Message:               Reconcile is in progress
    Reason:                Reconciling
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2022-03-15T04:59:26Z
    Last Transition Time:  2022-03-14T15:00:53Z
    Message:               Reconcile is in progress
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2022-03-15T04:59:26Z
    Last Transition Time:  2022-03-14T14:44:07Z
    Message:               StorageSystem CR is valid
    Reason:                Valid
    Status:                False
    Type:                  StorageSystemInvalid
    Last Heartbeat Time:   2022-03-15T04:59:26Z
    Last Transition Time:  2022-03-14T15:00:53Z
    Message:               CSV is not successfully installed
    Reason:                NotReady
    Status:                False
    Type:                  VendorCsvReady
    Last Heartbeat Time:   2022-03-14T14:44:07Z
    Last Transition Time:  2022-03-14T14:44:07Z
    Reason:                Found
    Status:                True
    Type:                  VendorSystemPresent
Events:
  Type     Reason           Age                 From                      Message
  ----     ------           ----                ----                      -------
  Warning  ReconcileFailed  39m (x13 over 14h)  StorageSystem controller  CSV is not successfully installed

> The cluster still exists in case it is needed for live debugging.

Comment 4 Mudit Agarwal 2022-03-15 06:39:30 UTC
Is this happening only with external clusters?
If yes, have we previously tested upgrades for external clusters?

Comment 5 Vijay Avuthu 2022-03-15 07:48:15 UTC
(In reply to Mudit Agarwal from comment #4)
> Is this happening only with external clusters?
> If yes, have we previously tested upgrades for external clusters?

It happened only on the external cluster. In the previous version, the upgrade passed (v4.8.4 to ocs-registry:4.9.0-249.ci).

old run log file: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-037vu1ce33-ua/j-037vu1ce33-ua_20211125T130856/logs/ocs-ci-logs-1637848904/tests/ecosystem/upgrade/test_upgrade.py/test_upgrade/logs

old job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-external-3m-3w-upgrade-ocs-auto/37/console

Comment 6 Mudit Agarwal 2022-03-15 07:49:50 UTC
Thanks but I want to know if this was ever run for 4.9 to 4.10 for external mode or not?

Comment 7 Vijay Avuthu 2022-03-15 07:52:27 UTC
(In reply to Mudit Agarwal from comment #6)
> Thanks but I want to know if this was ever run for 4.9 to 4.10 for external
> mode or not?

No, this was the first time we ran an upgrade from 4.9 to 4.10 in external mode.

Comment 8 umanga 2022-03-15 08:51:33 UTC
It looks like the change in https://github.com/red-hat-storage/ocs-operator/pull/1291, which was merged in ocs-operator v4.9.z, is causing this issue, and only for external mode clusters.
When external mode clusters were upgraded from 4.8 to 4.9 (with this change), they were marked as "Not Upgradable", so all future upgrades will be blocked, including upgrades to newer 4.9.z releases.
We'd have to somehow fix this in 4.9 and then upgrade to 4.10.

Is it possible to get an external mode cluster upgraded from 4.8 to 4.9(latest) to verify this?

Comment 10 umanga 2022-03-16 11:32:07 UTC
This issue wasn't reproduced on a cluster upgraded from 4.8 to 4.9,
so that cluster can be further upgraded to 4.10.

But there is one similarity in the CephCluster status. On both clusters I see this status:
```
  - lastHeartbeatTime: "2022-03-14T14:44:10Z"
    lastTransitionTime: "2022-03-14T14:44:10Z"
    message: Attempting to connect to an external Ceph cluster
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2022-03-14T15:15:20Z"
    lastTransitionTime: "2022-03-14T14:44:33Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected
```
The "Connected" and "Connecting" conditions can't both be true at the same time.
This, along with the change pointed out in comment #8, might be causing this issue.
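The contradiction can be detected mechanically. The following is a minimal Python sketch, assuming the CephCluster conditions have been loaded as a list of dicts with `type` and `status` keys as in the YAML above. The helper name `conflicting_true` and the default list of mutually exclusive pairs are hypothetical and not part of Rook or ocs-operator.

```python
def conflicting_true(conditions, exclusive=(("Connecting", "Connected"),)):
    """Return each mutually exclusive pair whose members are all reported as True."""
    true_types = {c["type"] for c in conditions if c["status"] == "True"}
    return [pair for pair in exclusive if set(pair) <= true_types]


# Values copied from the CephCluster status quoted in this comment.
ceph_conditions = [
    {"type": "Connecting", "status": "True", "reason": "ClusterConnecting"},
    {"type": "Connected", "status": "True", "reason": "ClusterConnected"},
]

print(conflicting_true(ceph_conditions))
```

For the status quoted above, the ("Connecting", "Connected") pair is returned; a status where only "Connected" is True returns an empty list.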

Still trying to pinpoint when and why exactly this happens.

Comment 12 umanga 2022-03-22 17:53:03 UTC
Since this issue was reproduced multiple times (for 4.9 to 4.10 upgrades), adding devel_ack+ .
Posted a fix here: https://github.com/red-hat-storage/ocs-operator/pull/1597.
Once we test this, it needs to be backported to 4.10 and 4.9.

Comment 18 Jose A. Rivera 2022-03-23 14:33:30 UTC
Despite this PR being merged into release-4.10, I don't believe it is an adequate solution. As per the original design, an external StorageCluster should be reporting a Phase of "Connected" instead of "Ready" if reconciliation completed without any errors. @vavuthu when you next test this please check if the StorageCluster is showing "Ready" after upgrade.

That said, this is probably a regression introduced in ODF 4.9 that we've just never caught until now. As such, I do think it would be fair to open another BZ against it, but I will leave it up to QE whether they want to consider it a blocker or not.
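The design expectation in this comment can be written as a small check. This is a hedged Python sketch, not ocs-operator code: `expected_phase` and `phase_mismatch` are hypothetical names, and the "Ready" value for non-external clusters is an assumption inferred from the wording "instead of Ready".

```python
def expected_phase(external: bool) -> str:
    """Phase a StorageCluster should report after an error-free reconcile.

    Assumption: external clusters report "Connected" (per this comment);
    non-external clusters report "Ready" (inferred, not stated explicitly).
    """
    return "Connected" if external else "Ready"


def phase_mismatch(external: bool, observed_phase: str) -> bool:
    """True when the observed phase differs from the expected one."""
    return observed_phase != expected_phase(external)


# Values from `oc get storagecluster` in the description: EXTERNAL=true, PHASE=Ready.
print(phase_mismatch(external=True, observed_phase="Ready"))
```

For the StorageCluster in the description (external, phase Ready) the check reports a mismatch, which is the discrepancy this comment asks QE to look for after the upgrade.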

Comment 19 Vijay Avuthu 2022-03-29 11:18:47 UTC
The upgrade from 4.9.5-4 to ocs-registry:4.10.0-210 was successful.

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3629/console

Comment 20 Vijay Avuthu 2022-03-29 12:20:05 UTC
(In reply to Jose A. Rivera from comment #18)
> Despite this PR being merged into release-4.10, I don't believe it is an
> adequate solution. As per the original design, an external StorageCluster
> should be reporting a Phase of "Connected" instead of "Ready" if
> reconciliation completed without any errors. @vavuthu when you
> next test this please check if the StorageCluster is showing "Ready" after
> upgrade.
> 
> That said, this is probably a regression introduced in ODF 4.9 that we've
> just never caught until now. As such, I do think it would be fair to open
> another BZ against it, but I will leave it up to QE whether they want to
> consider it a blocker or not.

Is it StorageCluster or CephCluster? I can see the StorageCluster status as Ready and the CephCluster status as Connected.

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthuext-upgrade/vavuthuext-upgrade_20220325T061303/logs/testcases_1648191511/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2657d3206a2ff2a742043b6432387fe63fba26c056bf8afd0296ffc42cf29fe1/namespaces/openshift-storage/oc_output/storagecluster

Name:         ocs-external-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  storagesystem.odf.openshift.io/watched-by: ocs-external-storagecluster-storagesystem
              uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster

  Images:
    Ceph:
      Desired Image:  quay.io/rhceph-dev/rhceph@sha256:82acc1ae5b6ee7f4c9100dbc803054b8edd7b77c64461966ecba621b3380f14b
    Noobaa Core:
      Actual Image:   quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:805be8933e81fa2b037a843f9bf521f46ce25b828b28854c1d377dedc23f7f08
      Desired Image:  quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:805be8933e81fa2b037a843f9bf521f46ce25b828b28854c1d377dedc23f7f08
    Noobaa DB:
      Actual Image:   quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e
      Desired Image:  quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e
  Phase:              Ready



http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthuext-upgrade/vavuthuext-upgrade_20220325T061303/logs/testcases_1648191511/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2657d3206a2ff2a742043b6432387fe63fba26c056bf8afd0296ffc42cf29fe1/ceph/namespaces/openshift-storage/ceph.rook.io/cephclusters/

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  creationTimestamp: "2022-03-25T06:47:08Z"
  finalizers:
  - cephcluster.ceph.rook.io
  generation: 1
  labels:
    app: ocs-external-storagecluster
  name: ocs-external-storagecluster-cephcluster
  namespace: openshift-storage


  - lastHeartbeatTime: "2022-03-25T07:12:30Z"
    lastTransitionTime: "2022-03-25T06:47:17Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected
  message: Cluster connected successfully
  phase: Connected
  state: Connected

Comment 21 Vijay Avuthu 2022-03-31 05:53:28 UTC
Since the fix is in 4.9.5-4, we need to upgrade from 4.9.5-4 to ocs-registry:4.10.0-210.

1. upgrade from 4.9.5-4 to ocs-registry:4.10.0-210

Result: PASS

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3629/console

Marking as verified

