Bug 2227871 - ODF Upgrade Failing to Progress - Custom Resources/ArgoCD Present in ODF
Summary: ODF Upgrade Failing to Progress - Custom Resources/ArgoCD Present in ODF
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nitin Goyal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-31 18:12 UTC by Craig Wayman
Modified: 2023-08-10 17:36 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Craig Wayman 2023-07-31 18:12:35 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

  The customer currently has three clusters on which the ODF operator upgrade is failing to progress.


1012  <---------------- THE CLUSTER THIS CASE WAS OPENED FOR.
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   718d   Progressing              2021-08-11T15:48:03Z   4.7.0

1013
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   2y3d   Progressing              2021-07-27T22:13:04Z   4.8.0

1014
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   717d   Progressing              2021-08-12T21:15:25Z   4.6.0


  Looking at their storagecluster.yaml, I see that it’s being managed by ArgoCD. There could be a setting at the Argo layer (the custom ocs-operator) that was synced, and everything may have seemed fine until an ODF upgrade was initiated.
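
  A quick way to confirm the ArgoCD management (assuming the default ArgoCD tracking label/annotation names; their setup may use different ones) is to look for the tracking metadata on the StorageCluster itself:

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o yaml \
    | grep -E 'app.kubernetes.io/instance|argocd.argoproj.io/tracking-id'
    # app.kubernetes.io/instance and argocd.argoproj.io/tracking-id are the usual
    # ArgoCD tracking label/annotation; adjust if their ArgoCD uses custom tracking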

  For this particular cluster, one observation conveyed to the customer is that there are custom resources we need to return to defaults in order to rule them out as a delta. The first is that several ODF pods are picking up custom SCCs (a command to reproduce this pod-to-SCC mapping is sketched after the list):

  NAME: noobaa-core-0
  SCC: gs-default-nobody-scc

  NAME: noobaa-endpoint-74f64ddbd9-fhg97
  SCC: gs-default-nobody-scc

  NAME: noobaa-operator-5f97c5779f-7plwn
  SCC: gs-default-nobody-scc

  NAME: ocs-metrics-exporter-586744f6b-76ksr
  SCC: gs-default-nobody-scc

  NAME: ocs-operator-79bdb48897-bht9n
  SCC: gs-default-nobody-scc

  NAME: odf-console-7c9bf8c6bf-dv28p
  SCC: gs-default-nobody-scc

  NAME: odf-operator-controller-manager-66dcc9fbb8-h88ww
  SCC: gs-default-nobody-scc

  NAME: rook-ceph-operator-698fdb8d87-jttjm
  SCC: gs-default-scc
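
  For reference, this pod-to-SCC mapping can be regenerated from the openshift.io/scc annotation that the admission controller stamps on each pod:

$ oc get pods -n openshift-storage \
    -o custom-columns='NAME:.metadata.name,SCC:.metadata.annotations.openshift\.io/scc'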


  The second, and what I assume is the likely delta as to why the ODF upgrade is failing, is that the customer created a custom ocs-operator called gs-ocs-operator. This is the first time I’ve seen that in an ODF deployment. Given how crucial the ocs-operator is to ODF (e.g., reconciling resources such as configmaps and templates), this is likely the cause. Additionally, the csv output in the version section below shows that the ocs-operator is the failing resource. I’d be willing to guess that deleting the custom gs-ocs-operator deployment would let this ODF upgrade reconcile without a problem; however, we wanted to address the SCCs first. (A quick check to confirm the deployment is not OLM-managed is sketched after the output below.)

$ oc get deployment -n openshift-storage
NAME                                                 READY  UP-TO-DATE  AVAILABLE  AGE
gs-ocs-operator                                      ??     ??          ??         228d
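
  To confirm gs-ocs-operator is customer/Argo-managed rather than OLM-managed, something along these lines could be run and compared against one of the stock operator deployments (the olm.owner labels and CSV ownerReference are what OLM normally stamps on deployments it creates; treat that as an assumption to verify):

$ oc get deployment gs-ocs-operator -n openshift-storage --show-labels
$ oc get deployment gs-ocs-operator -n openshift-storage -o yaml \
    | grep -A6 ownerReferences
    # an OLM/CSV-created deployment normally shows olm.owner labels and a
    # ClusterServiceVersion ownerReference; a custom/Argo-synced one should not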


storagecluster.yaml status:

status:
    conditions:
    - lastHeartbeatTime: "2023-06-22T16:35:59Z"
      lastTransitionTime: "2023-06-04T16:26:11Z"
      message: Reconcile completed successfully
      reason: ReconcileCompleted
      status: "True"
      type: ReconcileComplete
    - lastHeartbeatTime: "2023-06-05T04:23:18Z"
      lastTransitionTime: "2023-05-25T00:28:45Z"
      message: 'CephCluster error: failed the ceph version check: failed to complete
        ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully.
        failed to run job. object is being deleted: jobs.batch "rook-ceph-detect-version"
        already exists'
      reason: ClusterStateError
      status: "False"
      type: Available
    - lastHeartbeatTime: "2023-06-22T16:35:59Z"
      lastTransitionTime: "2023-05-24T10:03:06Z"
      message: Waiting on Nooba instance to finish initialization
      reason: NoobaaInitializing
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2023-06-05T04:23:18Z"
      lastTransitionTime: "2023-05-25T00:28:45Z"
      message: 'CephCluster error: failed the ceph version check: failed to complete
        ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully.
        failed to run job. object is being deleted: jobs.batch "rook-ceph-detect-version"
        already exists'
      reason: ClusterStateError
      status: "True"
      type: Degraded
    - lastHeartbeatTime: "2023-06-22T06:57:39Z"
      lastTransitionTime: "2023-05-24T10:03:06Z"
      message: 'CephCluster is creating: Processing OSD 99 on PVC "ocs-deviceset-2-data-2024lpg2"'
      reason: ClusterStateCreating
      status: "False"
      type: Upgradeable
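
  The "object is being deleted ... already exists" error above usually means the previous rook-ceph-detect-version Job is stuck terminating, so rook cannot recreate it. A possible check/cleanup (not yet confirmed as the fix on this cluster):

$ oc get job rook-ceph-detect-version -n openshift-storage -o yaml \
    | grep -E 'deletionTimestamp|finalizers'
$ oc delete job rook-ceph-detect-version -n openshift-storage --ignore-not-found
    # if the old job is stuck terminating, deleting it lets rook recreate it cleanly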


Version of all relevant components (if applicable):

$ oc get csv -n openshift-storage 

NAME                                    DISPLAY                       VERSION   REPLACES                                             PHASE
loki-operator.v5.6.7                    Loki Operator                 5.6.7     loki-operator.v5.6.6                                 Succeeded
mcg-operator.v4.10.13                   NooBaa Operator               4.10.13   mcg-operator.v4.10.11                                Succeeded
network-observability-operator.v1.2.0   Network observability         1.2.0     network-observability-operator.v1.1.0-202302110050   Succeeded
node-maintenance-operator.v5.1.0        Node Maintenance Operator     5.1.0                                                          Succeeded
ocs-operator.v4.10.11                   OpenShift Container Storage   4.10.11   ocs-operator.v4.9.14                                 Replacing
odf-csi-addons-operator.v4.10.13        CSI Addons                    4.10.13   odf-csi-addons-operator.v4.10.11                     Succeeded
odf-operator.v4.11.8                    OpenShift Data Foundation     4.11.8    odf-operator.v4.10.11                                Succeeded
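
  Since ocs-operator.v4.10.11 is the CSV stuck in the Replacing phase, the OLM side may also be worth a look; a rough sketch (CSV name taken from the output above):

$ oc get subscription,installplan -n openshift-storage
$ oc describe csv ocs-operator.v4.10.11 -n openshift-storage | grep -A15 'Conditions'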


  Awaiting the output of $ ceph versions; will update the BZ once the output is provided.
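
  If the rook-ceph-tools toolbox is deployed on their side, that output should be obtainable with something like (assumes the standard toolbox deployment name):

$ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph versions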


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

  This cluster looks to be a dev cluster; however, it has been brought up in the GS and Red Hat meetings, and after discussion, Engineering recommended opening a BZ for tracking.
  

Is there any workaround available to the best of your knowledge?

  We’ve attempted to provide the customer with troubleshooting steps; however, the last we heard from the TAM is that they’re still working through them.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 3


Additional info:

  One thing this cluster (1012) has going for it is that Ceph is currently reporting the following status:

    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ax,az,bb (age 5h)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 166 osds: 166 up (since 12h), 166 in (since 3h); 1 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 5265 pgs
    objects: 1.66M objects, 6.2 TiB
    usage:   18 TiB used, 355 TiB / 374 TiB avail
    pgs:     5265 active+clean
 
  io:
    client:   125 MiB/s rd, 228 MiB/s wr, 1.16k op/s rd, 528 op/s wr


  The troubleshooting steps currently given to the customer were to edit the gs-default-nobody-scc and gs-default-scc SCCs (priority/groups), in hopes that after deleting the pods associated with those SCCs, they will pick up the correct SCC. Still waiting on confirmation.
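
  For reference, the priority/groups on those SCCs can be inspected and the pods recycled roughly like this (the pod name is just one example from the list above):

$ oc get scc gs-default-nobody-scc gs-default-scc \
    -o custom-columns='NAME:.metadata.name,PRIORITY:.priority,USERS:.users,GROUPS:.groups'
$ oc delete pod noobaa-operator-5f97c5779f-7plwn -n openshift-storage
    # on re-admission the replacement pod should pick up the highest-priority matching SCC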

