Bug 2172806

Summary: [GSS] After node replacement the osds pods are not created
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: amansan <amanzane>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: high
Version: 4.10
CC: hnallurv, mduasope, mmayeras, ocs-bugs, odf-bz-bot, rafrojas, sapillai, sheggodu, srai, tnielsen
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Flags: sapillai: needinfo? (rafrojas)
       tnielsen: needinfo? (amanzane)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-23 22:16:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 5 Subham Rai 2023-02-23 13:28:47 UTC
I think this is the issue:
```
op-osd: OSD orchestration status for PVC ocs-deviceset-sc-odf-2-data-27lr657 is "failed"
2023-01-24T15:44:23.415980210Z 2023-01-24 15:44:23.415968 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-sc-odf-2-data-27lr657. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to generate osd keyring: failed to get or create auth key for client.bootstrap-osd: failed get-or-create-key client.bootstrap-osd: exit status 1}
2023-01-24T15:44:23.433564120Z 2023-01-24 15:44:23.433532 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 15 failures encountered while running osds on nodes in namespace "openshift-storage". 
```

Did you try this solution: https://access.redhat.com/solutions/3524771? It seems to be the same scenario.
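For reference, a minimal sketch of how to confirm whether the `client.bootstrap-osd` key actually exists, assuming the default rook-ceph-tools deployment is enabled in openshift-storage (names may differ in this cluster):
```
# Confirm the mons are reachable and the bootstrap-osd key is present
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph -s
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph auth get client.bootstrap-osd
```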

Comment 7 Subham Rai 2023-02-27 15:50:56 UTC
@mduasope I don't see any osd-prepare pods or their logs in the must-gather I'm looking at; could you attach them here?
Also, the only error I see in the rook operator logs is
```2023-01-24T15:44:23.415980210Z 2023-01-24 15:44:23.415968 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-sc-odf-2-data-27lr657. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to generate osd keyring: failed to get or create auth key for client.bootstrap-osd: failed get-or-create-key client.bootstrap-osd: exit status 1}```

which says that the OSD is trying to come up but the keys are missing.
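If the prepare pods do exist, a sketch of how to grab them and their logs (`app=rook-ceph-osd-prepare` is the upstream Rook label; adjust if it differs in this cluster):
```
# List the OSD prepare pods and pull their logs from the openshift-storage namespace
oc -n openshift-storage get pods -l app=rook-ceph-osd-prepare -o wide
oc -n openshift-storage logs -l app=rook-ceph-osd-prepare --all-containers --tail=-1
```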

Comment 12 Travis Nielsen 2023-03-07 15:17:07 UTC
Were you able to clean the device and get this working?
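For reference, a sketch of the generic Rook/ODF disk-cleanup steps before a device is reused for a new OSD; the device path is a placeholder and the exact procedure should follow the ODF node-replacement documentation for this platform:
```
# Run on the node that owns the disk; DEVICE is a placeholder, e.g. /dev/sdX
DEVICE="/dev/sdX"
sgdisk --zap-all "$DEVICE"
dd if=/dev/zero of="$DEVICE" bs=1M count=100 oflag=direct,dsync
blkdiscard "$DEVICE"   # only if the device supports discard; skip otherwise
```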

Comment 16 Santosh Pillai 2023-03-16 12:55:42 UTC
Hi. A few questions to better understand what's happening. 

- A node failed and was replaced by a new node. Is that the only event that led to the current situation with the cluster?
- The logs suggest that the OSD is using the same PV `local-pv-2a0b2a3` even on the newly added node. Is that correct? IIUC, when a new node is added, the LSO operator should create a new PV from the disks on that node, so the OSD on the new node should use a new PV.
- When the old node failed, were the failing OSDs removed from that node using the OSD removal job? (See the sketch below.)
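A minimal sketch of the checks I have in mind, assuming the default `localblock` storage class and the standard ODF `ocs-osd-removal` template (the job label may differ between ODF versions):
```
# Check which local PVs exist and whether the new node's disk produced a new one
oc get pv -o wide | grep localblock

# Remove a failed OSD with the ODF removal job (replace <id> with the failed OSD ID)
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<id> | oc create -f -
oc -n openshift-storage logs -l job-name=ocs-osd-removal-job -f
```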

Comment 19 Subham Rai 2023-03-20 05:35:01 UTC
Re-assigning to Santosh as he is looking into it now. Thanks

Comment 23 Travis Nielsen 2023-03-20 17:01:32 UTC
Removing the needinfo since Santosh is looking into it.

Comment 27 Santosh Pillai 2023-03-23 15:18:49 UTC
This is a known ODF issue and is currently being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2102304

Comment 29 Santosh Pillai 2023-03-27 02:21:08 UTC
Hi Alicia,

A few questions:

1. Is comment 17 the only issue the customer is facing right now?
2. Does the issue in comment 17 prevent the user from continuing further?

Comment 37 Santosh Pillai 2023-05-09 02:29:13 UTC
Hi.

So the customer added a new node and still doesn't see any new OSDs coming up. Does that summarize the current situation correctly?


I looked at the must-gather logs:

All the OSD prepare pods are in the Pending state due to the following error:
```
0/23 nodes are available: 10 node(s) didn't match Pod's node affinity/selector,
2 node(s) had taint {node-role.kubernetes.io/spectrum: }, that the pod didn't tolerate,
3 node(s) had taint {node-role.kubernetes.io/infra: reserved}, that the pod didn't tolerate,
3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate,
5 node(s) didn't find available persistent volumes to bind.
```

```
pod/rook-ceph-osd-prepare-297231d095a8b2f3cce9df0c6c95f036-gzs4s      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-4750a47e6acc2ed202fcd5e904355c6f-s2c2w      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-902a8387150a5d7492e83406b88b7312-kp9gv      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-97ca5e82f872abfea03437c8783882e3-2f5w6      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-abb427bd904e400157f3f6ce14328332-lc8zx      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-b0e69ad309df460a3f979a39928800e7-9qhrb      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-c680a538497b3a87011db28b856e0b4e-h98r9      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-cd84b42e320034e2b5447e34441728ef-bn5jm      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-d8bd70f44817b51607869df9e6cc491c-qvtfv      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-daafab739afa1600c6d1d4e311e98c15-56hrc      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-e9c94c244fc6ef4b02a2d75fc74d47f0-7ffxh      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-fe4e80997e70ed70447be3c38d9314c3-zlfbp      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
```
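A sketch of the checks I'd run next to see which of those scheduling constraints is the real blocker; the pod name is taken from the listing above, and the node label is the default ODF storage label (adjust if this cluster uses a different one):
```
# Why is a given prepare pod Pending? The scheduler events spell out the failed constraint
oc -n openshift-storage describe pod rook-ceph-osd-prepare-297231d095a8b2f3cce9df0c6c95f036-gzs4s

# Are there unbound local PVs for the device set, and on which nodes?
oc get pv -o wide | grep localblock

# Do the storage nodes still carry the expected label, and which taints do they have?
oc get nodes -l cluster.ocs.openshift.io/openshift-storage -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```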

Comment 54 mmayeras 2023-06-02 13:48:54 UTC
Hello,

I do not agree with the bug closure.
This still needs to be fixed or documented.