Bug 2188670 - [Fusion-aaS] Only 1 OSD out of 3 came up on a DF provider cluster in Managed services
Summary: [Fusion-aaS] Only 1 OSD out of 3 came up on a DF provider cluster in Managed services
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Subham Rai
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-21 14:48 UTC by Neha Berry
Modified: 2023-08-09 17:03 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-06 15:28:22 UTC
Embargoed:



Description Neha Berry 2023-04-21 14:48:23 UTC
Description of problem (please be as detailed as possible and provide log snippets):
=========================================================================
Created an agent-based provider cluster with the DF offering installed in the fusion-storage namespace. Only 1 OSD came up, and the prepare pods for the other 2 were never created.

mon-b was also out of quorum; it was therefore failed over, and mon-d came up successfully.
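
For reference, the mon quorum state can be confirmed from the Rook toolbox. This is only a sketch; it assumes a rook-ceph-tools deployment exists in the fusion-storage namespace, which may not be deployed by default in Fusion aaS:

# Which mons are in quorum and which are missing
oc -n fusion-storage exec deploy/rook-ceph-tools -- ceph quorum_status --format json-pretty

# Shorter summary of mon membership and quorum
oc -n fusion-storage exec deploy/rook-ceph-tools -- ceph mon stat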

A few messages repeated in the rook operator log:

2023-04-21 09:07:21.962692 I | op-osd: restarting watcher for OSD provisioning status ConfigMaps. the watcher closed the channel
2023-04-21 09:07:21.968316 I | op-osd: OSD orchestration status for PVC default-0-data-0p9ck9 is "orchestrating"
2023-04-21 09:07:21.968335 I | op-osd: OSD orchestration status for PVC default-1-data-0zj6vs is "orchestrating"
2023-04-21 09:08:17.905979 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 1 of 1 OSDs have been updated
2023-04-21 09:09:17.906212 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 1 of 1 OSDs have been updated
2023-04-21 09:10:17.905390 I | op-osd: waiting... 0 of 2 OSD prepare jobs have finished processing and 1 of 1 OSDs have been updated

rook-ceph-operator-6c645c7f58-6vfr6                               1/1     Running     0                6h19m   10.129.2.158   ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-57df5c5cb4-drcf5                                  2/2     Running     0                6h2m    10.0.21.159    ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-2-data-0j74zq-l4qzl                 0/1     Completed   0                6h2m    10.0.21.159    ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>
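
For completeness, the stuck state can be inspected directly with commands like the following. This is a sketch that assumes the default Rook labels and the fusion-storage namespace used in this deployment:

# OSD prepare jobs created by the operator (only 1 of 3 existed here)
oc -n fusion-storage get jobs -l app=rook-ceph-osd-prepare -o wide

# Their pods, plus the OSD pods themselves
oc -n fusion-storage get pods -l app=rook-ceph-osd-prepare -o wide
oc -n fusion-storage get pods -l app=rook-ceph-osd -o wide

# Follow the operator log for the repeated "waiting... 0 of 2 OSD prepare jobs" messages
oc -n fusion-storage logs deploy/rook-ceph-operator -f | grep op-osd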


>>> Restarted the rook-ceph-operator pod, after which the remaining OSD prepare pods and OSDs came up.

rook-ceph-operator-6c645c7f58-bzx78                               1/1     Running     0                14m     10.128.2.85    ip-10-0-15-227.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-57df5c5cb4-drcf5                                  2/2     Running     0                7h29m   10.0.21.159    ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-586d467d49-cwh6z                                  2/2     Running     0                14m     10.0.19.7      ip-10-0-19-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-2-5475f5644d-nk5rm                                  2/2     Running     0                14m     10.0.15.227    ip-10-0-15-227.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-0-data-0p9ck9-cmnl5                 0/1     Completed   0                14m     10.0.15.227    ip-10-0-15-227.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-1-data-0zj6vs-kq89d                 0/1     Completed   0                14m     10.0.19.7      ip-10-0-19-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-prepare-default-2-data-0j74zq-l4qzl                 0/1     Completed   0                7h29m   10.0.21.159    ip-10-0-21-159.us-east-2.compute.internal   <none>           <none>



Version of all relevant components (if applicable):
====================================================
OCP (ROSA) = 4.11.36
ceph image version: "17.2.6-10 quincy"
managed-fusion-agent.v2.0.11              Managed Fusion Agent          2.0.11                                                        Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20              observability-operator.v0.0.19            Succeeded
ocs-operator.v4.13.0-168.stable           OpenShift Container Storage   4.13.0-168.stable                                             Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0                                                        Succeeded
route-monitor-operator.v0.1.498-e33e391   Route Monitor Operator        0.1.498-e33e391     route-monitor-operator.v0.1.496-7e66488   Succeeded



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
=========================================================================
Yes, unless the workaround below is applied.

Is there any workaround available to the best of your knowledge?
======================================================================
>> Workaround: restarting the rook-ceph-operator pod caused the missing OSDs to be created (see the sketch below).
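
A minimal sketch of the workaround, assuming the operator runs in the fusion-storage namespace noted above:

# Deleting the operator pod forces a restart; the Deployment recreates it,
# and the fresh reconcile creates the missing OSD prepare jobs and OSDs.
oc -n fusion-storage delete pod -l app=rook-ceph-operator

# Equivalent alternative: restart via the Deployment
oc -n fusion-storage rollout restart deploy/rook-ceph-operator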

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
=====================================
Observed once so far

Can this issue reproduce from the UI?
==========================================
NA


If this is a regression, please provide more details to justify this:
============================================================================
Not sure


Steps to Reproduce:
============================
1. Create ROSA 4.11.36 cluster with m5.2xlarge instances for worker nodes
2. Install the Fusion aaS agent with build quay.io/resoni/managed-fusion-agent-index:4.13.0-168 (see document [1] for reference)
3. Install the DF offering using the ManagedFusionOffering CR

[1] - https://docs.google.com/document/d/1Jdx8czlMjbumvilw8nZ6LtvWOMAx3H4TfwoVwiBs0nE/edit#


Actual results:
======================
Only 1 OSD and its prepare pod came up; the other 2 did not.

>> Workaround: restarting the rook-ceph-operator pod caused the missing OSDs to be created.

Expected results:
==========================
All 3 OSDs should be up and running.
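
A quick way to verify the expected state; a sketch, again assuming the fusion-storage namespace and an optional rook-ceph-tools deployment:

# All 3 OSD pods should be Running with 2/2 containers ready
oc -n fusion-storage get pods -l app=rook-ceph-osd

# From the toolbox, all OSDs should be up and in
oc -n fusion-storage exec deploy/rook-ceph-tools -- ceph osd tree
oc -n fusion-storage exec deploy/rook-ceph-tools -- ceph status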

Comment 6 Travis Nielsen 2023-06-06 15:28:22 UTC
Per an offline discussion between Subham, Jilju, and Rewant, this issue is not reproducing. Please reopen if it is hit again.

