Bug 2188670

Summary: [Fusion-aaS]Only 1 OSD out of 3 came up on a DF provider cluster in Managed services
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Neha Berry <nberry>
Component: rook    Assignee: Subham Rai <srai>
Status: CLOSED NOTABUG QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.13    CC: muagarwa, ocs-bugs, odf-bz-bot, paarora, sapillai, srai, tnielsen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-06-06 15:28:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Neha Berry 2023-04-21 14:48:23 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
=========================================================================
Created an agent-based provider cluster with the DF offering installed in the fusion-storage namespace. Only 1 OSD came up, and the prepare pods for the other 2 were not created.

mon-b was also out of quorum; it was therefore failed over, and mon-d came up successfully.

A few repeated messages in the rook-operator log:

2023-04-21 09:07:21.962692 I | op-osd: restarting watcher for OSD provisioning status ConfigMaps. the watcher closed the channel
2023-04-21 09:07:21.968316 I | op-osd: OSD orchestration status for PVC default-0-data-0p9ck9 is "orchestrating"
2023-04-21 09:07:21.968335 I | op-osd: OSD orchestration status for PVC default-1-data-0zj6vs is "orchestrating"
2023-04-21 09:08:17.905979 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 1 of 1 OSDs have been updated
2023-04-21 09:09:17.906212 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 1 of 1 OSDs have been updated
2023-04-21 09:10:17.905390 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 1 of 1 OSDs have been updated

rook-ceph-operator-6c645c7f58-6vfr6                               1/1     Running     0                6h19m   10.129.2.158   ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-57df5c5cb4-drcf5                                  2/2     Running     0                6h2m    10.0.21.159    ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-2-data-0j74zq-l4qzl                 0/1     Completed   0                6h2m    10.0.21.159    ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>
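The listings above can be gathered with a short diagnostic sketch. Assumptions (not stated verbatim in this report): the cluster runs in the fusion-storage namespace mentioned in the description, and the prepare pods carry Rook's standard app=rook-ceph-osd-prepare label. The commands are built as strings and echoed here; run them against the affected cluster to reproduce the output.

```shell
# Diagnostic sketch for the state above: list OSD prepare pods and tail
# the operator log for the repeated "waiting..." messages.
# Assumptions: fusion-storage namespace, Rook's standard pod labels.
NS=fusion-storage
PREPARE_CMD="oc -n ${NS} get pods -l app=rook-ceph-osd-prepare -o wide"
LOG_CMD="oc -n ${NS} logs deploy/rook-ceph-operator --tail=50"
echo "${PREPARE_CMD}"
echo "${LOG_CMD}"
```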


>>> After restarting the rook-ceph-operator pod, the remaining OSD prepare pods were created and all OSDs came up.

rook-ceph-operator-6c645c7f58-bzx78                               1/1     Running     0                14m     10.128.2.85    ip-10-0-15-227.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-57df5c5cb4-drcf5                                  2/2     Running     0                7h29m   10.0.21.159    ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-586d467d49-cwh6z                                  2/2     Running     0                14m     10.0.19.7      ip-10-0-19-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-2-5475f5644d-nk5rm                                  2/2     Running     0                14m     10.0.15.227    ip-10-0-15-227.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-0-data-0p9ck9-cmnl5                 0/1     Completed   0                14m     10.0.15.227    ip-10-0-15-227.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-1-data-0zj6vs-kq89d                 0/1     Completed   0                14m     10.0.19.7      ip-10-0-19-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-prepare-default-2-data-0j74zq-l4qzl                 0/1     Completed   0                7h29m   10.0.21.159    ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>



Version of all relevant components (if applicable):
====================================================
OCP (ROSA) = 4.11.36
ceph image version: "17.2.6-10 quincy"
managed-fusion-agent.v2.0.11              Managed Fusion Agent          2.0.11                                                        Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20              observability-operator.v0.0.19            Succeeded
ocs-operator.v4.13.0-168.stable           OpenShift Container Storage   4.13.0-168.stable                                             Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0                                                        Succeeded
route-monitor-operator.v0.1.498-e33e391   Route Monitor Operator        0.1.498-e33e391     route-monitor-operator.v0.1.496-7e66488   Succeeded



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
=========================================================================
Yes, unless the workaround below (restarting the rook-ceph-operator pod) is applied.

Is there any workaround available to the best of your knowledge?
======================================================================
>> Workaround: restart the rook-ceph-operator pod; the missing OSDs are then created.
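A minimal sketch of the workaround, assuming the operator runs in the fusion-storage namespace and carries Rook's standard app=rook-ceph-operator label (the command is echoed here rather than executed, since it requires a live cluster):

```shell
# Workaround sketch: delete the rook-ceph-operator pod so its Deployment
# recreates it and the operator re-reconciles the missing OSD prepare jobs.
# Assumptions: fusion-storage namespace, standard app=rook-ceph-operator label.
NS=fusion-storage
RESTART_CMD="oc -n ${NS} delete pod -l app=rook-ceph-operator"
echo "${RESTART_CMD}"
```

The Deployment controller recreates the deleted pod automatically, so no manual restart of the operator is needed afterwards.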

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
=====================================
Observed once so far

Can this issue be reproduced from the UI?
==========================================
NA


If this is a regression, please provide more details to justify this:
============================================================================
Not sure


Steps to Reproduce:
============================
1. Create ROSA 4.11.36 cluster with m5.2xlarge instances for worker nodes
2. Install the Fusion aaS agent with build quay.io/resoni/managed-fusion-agent-index:4.13.0-168 (see document [1] for reference)
3. Install DF offering using the managedFusionOffering CR

[1] - https://docs.google.com/document/d/1Jdx8czlMjbumvilw8nZ6LtvWOMAx3H4TfwoVwiBs0nE/edit#
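Step 3 above can be sketched as follows; the file name managed-fusion-offering.yaml is a hypothetical placeholder, and the ManagedFusionOffering spec itself is deployment-specific, so it is not shown:

```shell
# Sketch of installing the DF offering (step 3): apply a manifest
# containing the ManagedFusionOffering CR. The file name is an assumed
# placeholder; its spec contents depend on the deployment.
OFFERING_FILE=managed-fusion-offering.yaml
APPLY_CMD="oc apply -f ${OFFERING_FILE}"
echo "${APPLY_CMD}"
```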


Actual results:
======================
Only 1 OSD and its prepare pod came up; the other 2 did not.

Expected results:
==========================
All 3 OSDs should be up and running.

Comment 6 Travis Nielsen 2023-06-06 15:28:22 UTC
Per offline discussion between Subham, Jilju, and Rewant, this is not reproducing. Please reopen if it is hit again.