Bug 2227871
| Summary: | ODF Upgrade Failing to Progress - Custom Resources/ArgoCD Present in ODF | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Craig Wayman <crwayman> |
| Component: | odf-operator | Assignee: | Nitin Goyal <nigoyal> |
| Status: | ASSIGNED | QA Contact: | Elad <ebenahar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | bkunal, brgardne, muagarwa, nbecker, nigoyal, odf-bz-bot, uchapaga |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description of problem (please be as detailed as possible and provide log snippets):

The customer currently has three clusters on which the ODF operator is failing to progress through an upgrade.

Cluster 1012 (the cluster this case was opened for):

```
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   718d   Progressing              2021-08-11T15:48:03Z   4.7.0
```

Cluster 1013:

```
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   2y3d   Progressing              2021-07-27T22:13:04Z   4.8.0
```

Cluster 1014:

```
NAME                 AGE    PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   717d   Progressing              2021-08-12T21:15:25Z   4.6.0
```

Looking at their storagecluster.yaml, I see that it is being managed by ArgoCD. There could be a setting at the Argo layer (the custom ocs-operator) that was synced, and everything may have seemed fine until an ODF upgrade was initiated.

For this particular cluster, the observations conveyed to the customer were that there are custom resources that need to be returned to their defaults in order to rule them out as a delta.

First, several ODF pods are picking up custom SCCs:

```
NAME: noobaa-core-0                                      SCC: gs-default-nobody-scc
NAME: noobaa-endpoint-74f64ddbd9-fhg97                   SCC: gs-default-nobody-scc
NAME: noobaa-operator-5f97c5779f-7plwn                   SCC: gs-default-nobody-scc
NAME: ocs-metrics-exporter-586744f6b-76ksr               SCC: gs-default-nobody-scc
NAME: ocs-operator-79bdb48897-bht9n                      SCC: gs-default-nobody-scc
NAME: odf-console-7c9bf8c6bf-dv28p                       SCC: gs-default-nobody-scc
NAME: odf-operator-controller-manager-66dcc9fbb8-h88ww   SCC: gs-default-nobody-scc
NAME: rook-ceph-operator-698fdb8d87-jttjm                SCC: gs-default-scc
```

Second, and what I assume is the likely delta behind the failing ODF upgrade, the customer created a custom ocs-operator called gs-ocs-operator. This is the first time I have seen that in an ODF deployment. Given how crucial the ocs-operator is to ODF, e.g. reconciling resources (configmaps, templates, etc.), this is likely the cause. Additionally, the CSV output in the version section below shows that ocs-operator is the failing resource. My guess is that deleting the custom gs-ocs-operator deployment would let the ODF upgrade reconcile without a problem; however, we wanted to address the SCCs first.

```
$ oc get deployment -n openshift-storage
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
gs-ocs-operator   ??      ??           ??          228d
```

storagecluster.yaml status:

```yaml
status:
  conditions:
  - lastHeartbeatTime: "2023-06-22T16:35:59Z"
    lastTransitionTime: "2023-06-04T16:26:11Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: ReconcileComplete
  - lastHeartbeatTime: "2023-06-05T04:23:18Z"
    lastTransitionTime: "2023-05-25T00:28:45Z"
    message: 'CephCluster error: failed the ceph version check: failed to complete
      ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully.
      failed to run job. object is being deleted: jobs.batch "rook-ceph-detect-version"
      already exists'
    reason: ClusterStateError
    status: "False"
    type: Available
  - lastHeartbeatTime: "2023-06-22T16:35:59Z"
    lastTransitionTime: "2023-05-24T10:03:06Z"
    message: Waiting on Nooba instance to finish initialization
    reason: NoobaaInitializing
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2023-06-05T04:23:18Z"
    lastTransitionTime: "2023-05-25T00:28:45Z"
    message: 'CephCluster error: failed the ceph version check: failed to complete
      ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully.
      failed to run job. object is being deleted: jobs.batch "rook-ceph-detect-version"
      already exists'
    reason: ClusterStateError
    status: "True"
    type: Degraded
  - lastHeartbeatTime: "2023-06-22T06:57:39Z"
    lastTransitionTime: "2023-05-24T10:03:06Z"
    message: 'CephCluster is creating: Processing OSD 99 on PVC "ocs-deviceset-2-data-2024lpg2"'
    reason: ClusterStateCreating
    status: "False"
    type: Upgradeable
```
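The Available and Degraded conditions above both complain that the rook-ceph-detect-version job "already exists" while being deleted. A minimal sketch of how that job could be inspected is below; the commands are standard oc/kubectl, but treating the stuck job as the item to chase is my assumption, not a confirmed root cause for this cluster.

```shell
# Check whether the version-detect job is stuck deleting (a non-empty
# deletionTimestamp with remaining finalizers would explain the error):
oc get job rook-ceph-detect-version -n openshift-storage \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

# Related events and the rook-ceph-operator log around the version check:
oc get events -n openshift-storage --field-selector involvedObject.name=rook-ceph-detect-version
oc logs deploy/rook-ceph-operator -n openshift-storage | grep -i detect-version
```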
Version of all relevant components (if applicable):

```
$ oc get csv -n openshift-storage
NAME                                     DISPLAY                       VERSION   REPLACES                                             PHASE
loki-operator.v5.6.7                     Loki Operator                 5.6.7     loki-operator.v5.6.6                                 Succeeded
mcg-operator.v4.10.13                    NooBaa Operator               4.10.13   mcg-operator.v4.10.11                                Succeeded
network-observability-operator.v1.2.0    Network observability         1.2.0     network-observability-operator.v1.1.0-202302110050   Succeeded
node-maintenance-operator.v5.1.0         Node Maintenance Operator     5.1.0                                                          Succeeded
ocs-operator.v4.10.11                    OpenShift Container Storage   4.10.11   ocs-operator.v4.9.14                                 Replacing
odf-csi-addons-operator.v4.10.13         CSI Addons                    4.10.13   odf-csi-addons-operator.v4.10.11                     Succeeded
odf-operator.v4.11.8                     OpenShift Data Foundation     4.11.8    odf-operator.v4.10.11                                Succeeded
```

Awaiting the output of `$ ceph versions`; I will update the BZ once it is provided.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

This looks to be a dev cluster; however, it has been brought up in the GS and Red Hat meetings, and after discussion, Engineering recommended opening a BZ for tracking.

Is there any workaround available to the best of your knowledge?

We have provided the customer with troubleshooting steps; the last we heard from the TAM, they are still working through them.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

3

Additional info:

What this cluster (1012) does have going for it is that Ceph is currently reporting the following status:

```
  health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ax,az,bb (age 5h)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 166 osds: 166 up (since 12h), 166 in (since 3h); 1 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 5265 pgs
    objects: 1.66M objects, 6.2 TiB
    usage:   18 TiB used, 355 TiB / 374 TiB avail
    pgs:     5265 active+clean

  io:
    client:   125 MiB/s rd, 228 MiB/s wr, 1.16k op/s rd, 528 op/s wr
```

The troubleshooting steps currently given to the customer were to edit the gs-default-nobody-scc and gs-default-scc SCCs (priority/groups), in the hope that deleting the pods associated with those SCCs will let them pick up the correct SCC. Still waiting on confirmation.
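For reference, a minimal sketch of how the applied SCC and the SCC priorities can be checked before and after that change, using standard oc output options (the SCC names are the customer's custom ones quoted above; restarting pods is shown only as one possible follow-up, not a confirmed procedure for this cluster):

```shell
# Which SCC each pod in openshift-storage was admitted under
# (OpenShift records it in the openshift.io/scc annotation):
oc get pods -n openshift-storage \
  -o custom-columns='NAME:.metadata.name,SCC:.metadata.annotations.openshift\.io/scc'

# Priority/users/groups of the custom SCCs versus the default restricted SCC;
# a higher priority or broad group bindings (e.g. system:authenticated) is
# what would pull these pods away from their expected SCC:
oc get scc gs-default-nobody-scc gs-default-scc restricted \
  -o custom-columns='NAME:.metadata.name,PRIORITY:.priority,USERS:.users,GROUPS:.groups'

# After adjusting the SCCs, delete an affected pod so it is re-admitted,
# e.g. (assumes deleting the pod is acceptable on this cluster):
oc delete pod noobaa-core-0 -n openshift-storage
```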