Bug 2172806 - [GSS] After node replacement the osds pods are not created [NEEDINFO]
Summary: [GSS] After node replacement the osds pods are not created
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Santosh Pillai
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-23 07:12 UTC by amansan
Modified: 2023-08-09 17:03 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-23 22:16:54 UTC
Embargoed:
sapillai: needinfo? (rafrojas)
tnielsen: needinfo? (amanzane)


Attachments

Comment 5 Subham Rai 2023-02-23 13:28:47 UTC
I think this is the issue:
```
op-osd: OSD orchestration status for PVC ocs-deviceset-sc-odf-2-data-27lr657 is "failed"
2023-01-24T15:44:23.415980210Z 2023-01-24 15:44:23.415968 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-sc-odf-2-data-27lr657. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to generate osd keyring: failed to get or create auth key for client.bootstrap-osd: failed get-or-create-key client.bootstrap-osd: exit status 1}
2023-01-24T15:44:23.433564120Z 2023-01-24 15:44:23.433532 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 15 failures encountered while running osds on nodes in namespace "openshift-storage". 
```

Did you try this solution: https://access.redhat.com/solutions/3524771? It looks like the same scenario.

Comment 7 Subham Rai 2023-02-27 15:50:56 UTC
@mduasope I don't see any osd-prepare pods or their logs in the must-gather I'm looking at; could you attach them here?
Also, the only error I see in the rook operator logs is:
```
2023-01-24T15:44:23.415980210Z 2023-01-24 15:44:23.415968 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-sc-odf-2-data-27lr657. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to generate osd keyring: failed to get or create auth key for client.bootstrap-osd: failed get-or-create-key client.bootstrap-osd: exit status 1}
```

which says the OSD is trying to come up but the keys are missing.
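For what it's worth, the keyring error can be cross-checked from the rook-ceph toolbox. This is only a sketch: it assumes the toolbox is enabled and the usual `openshift-storage` namespace and `rook-ceph-tools` deployment names.

```shell
# Open a shell in the toolbox pod (the toolbox must be enabled in the CephCluster CR).
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Inside the toolbox: is the bootstrap-osd key present, and are the mons healthy?
ceph auth get client.bootstrap-osd
ceph status
```

If `ceph auth get` fails here too, the problem is mon connectivity or a missing key rather than anything OSD-specific.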

Comment 12 Travis Nielsen 2023-03-07 15:17:07 UTC
Were you able to clean the device and get this working?

Comment 16 Santosh Pillai 2023-03-16 12:55:42 UTC
Hi. A few questions to better understand what's happening. 

- A node failed and was replaced by a new node. Is that the only event that led to the current state of the cluster?
- The logs suggest the OSD is using the same PV `local-pv-2a0b2a3` even on the newly added node. Is that correct? IIUC, when a new node is added, the LSO operator should create a new PV from the disks on that node, so the OSD on the new node should use a new PV.
- When the old node failed, were the failing OSDs removed from it using the OSD removal job?
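For context, the OSD removal job referred to above is normally run via the `ocs-osd-removal` template. A sketch of the documented flow; the template and job names are per the ODF 4.x docs (verify against your version), and `<osd_id>` is a placeholder for the failed OSD ID:

```shell
# Identify the failed OSD(s) first.
oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide

# Kick off the removal job for the failed OSD ID(s) (replace <osd_id>).
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<osd_id> | oc create -f -

# Confirm the job completed successfully.
oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=-1
```

Skipping this step after a node failure can leave stale OSD deployments and auth entries behind, which matches the symptoms discussed here.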

Comment 19 Subham Rai 2023-03-20 05:35:01 UTC
Re-assigning to Santosh as he's looking into it now. Thanks.

Comment 23 Travis Nielsen 2023-03-20 17:01:32 UTC
Removing needinfo since Santosh is looking into it.

Comment 27 Santosh Pillai 2023-03-23 15:18:49 UTC
This is a known ODF issue that is currently being investigated in https://bugzilla.redhat.com/show_bug.cgi?id=2102304

Comment 29 Santosh Pillai 2023-03-27 02:21:08 UTC
Hi Alicia,

A few questions:

1. Is the issue in comment 17 the only one the customer is facing right now?
2. Does the issue in comment 17 prevent the user from continuing further?

Comment 37 Santosh Pillai 2023-05-09 02:29:13 UTC
Hi.

So the customer added a new node and still doesn't see any new OSDs coming up. Does that summarize the current situation correctly?


I looked at the must-gather logs:

All the OSD prepare pods are in Pending state due to the following error:
```
0/23 nodes are available: 10 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/spectrum: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: reserved}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't find available persistent volumes to bind.

```
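The scheduler message above accounts for all 23 nodes across five rejection reasons; a small self-contained sketch (illustrative only, not part of the bug) that tallies them:

```python
import re

# The FailedScheduling message reported for the osd-prepare pods (copied from
# above, with the YAML quote-escaping removed).
msg = ("0/23 nodes are available: "
       "10 node(s) didn't match Pod's node affinity/selector, "
       "2 node(s) had taint {node-role.kubernetes.io/spectrum: }, that the pod didn't tolerate, "
       "3 node(s) had taint {node-role.kubernetes.io/infra: reserved}, that the pod didn't tolerate, "
       "3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, "
       "5 node(s) didn't find available persistent volumes to bind.")

# Tally node counts per rejection reason; the optional group swallows the
# ", that the pod didn't tolerate" tail so each taint groups cleanly.
pattern = r"(\d+) node\(s\) (.+?)(?:, that the pod didn't tolerate)?(?:, |\.$)"
reasons = {}
for count, reason in re.findall(pattern, msg):
    reasons[reason] = reasons.get(reason, 0) + int(count)

print(sum(reasons.values()))  # 23: every node in the cluster is rejected
print(reasons["didn't find available persistent volumes to bind"])  # 5
```

The taints and affinity mismatches are expected on non-storage nodes; the actionable part is the 5 storage nodes with no bindable PVs.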

```
pod/rook-ceph-osd-prepare-297231d095a8b2f3cce9df0c6c95f036-gzs4s      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-4750a47e6acc2ed202fcd5e904355c6f-s2c2w      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-902a8387150a5d7492e83406b88b7312-kp9gv      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-97ca5e82f872abfea03437c8783882e3-2f5w6      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-abb427bd904e400157f3f6ce14328332-lc8zx      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-b0e69ad309df460a3f979a39928800e7-9qhrb      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-c680a538497b3a87011db28b856e0b4e-h98r9      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-cd84b42e320034e2b5447e34441728ef-bn5jm      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-d8bd70f44817b51607869df9e6cc491c-qvtfv      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-daafab739afa1600c6d1d4e311e98c15-56hrc      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-e9c94c244fc6ef4b02a2d75fc74d47f0-7ffxh      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
pod/rook-ceph-osd-prepare-fe4e80997e70ed70447be3c38d9314c3-zlfbp      0/1     Pending   0             18d    <none>           <none>                            <none>           <none>
```
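Given the "didn't find available persistent volumes to bind" reason above, it would also be worth confirming that LSO actually created Available PVs for the new node's disks. A sketch; the `localblock` storage class and the default `openshift-local-storage` namespace are assumptions:

```shell
# Local PVs for the new node's disks should appear with STATUS=Available.
oc get pv -o wide | grep localblock

# Check the LSO CRs and that diskmaker/provisioner pods are running on the new node.
oc get localvolume,localvolumeset -n openshift-local-storage
oc get pods -n openshift-local-storage -o wide
```

If no new Available PVs exist, the prepare pods will stay Pending no matter what Rook does.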

Comment 54 mmayeras 2023-06-02 13:48:54 UTC
Hello,

I do not agree with the bug closure.
This still needs to be fixed or documented.

