Bug 2172806
| Summary: | [GSS] After node replacement the osds pods are not created | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | amansan <amanzane> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.10 | CC: | hnallurv, mduasope, mmayeras, ocs-bugs, odf-bz-bot, rafrojas, sapillai, sheggodu, srai, tnielsen |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | Flags: | sapillai: needinfo? (rafrojas), tnielsen: needinfo? (amanzane) |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-23 22:16:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
@mduasope I don't see any osd-prepare pods or their logs in the must-gather I'm looking at; could you attach them here?
Also, the only error I see in the rook operator logs is:
```
2023-01-24T15:44:23.415980210Z 2023-01-24 15:44:23.415968 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-sc-odf-2-data-27lr657. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to generate osd keyring: failed to get or create auth key for client.bootstrap-osd: failed get-or-create-key client.bootstrap-osd: exit status 1}
```
which indicates the OSD is trying to come up but the bootstrap keyring is missing.
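One way to confirm that is to query the monitors for the bootstrap-osd key from the toolbox. A minimal sketch; `rook-ceph-tools` and `openshift-storage` are the usual ODF toolbox deployment and namespace names, so adjust if this cluster differs:

```
# Check whether the monitors still have a client.bootstrap-osd key at all
# (deployment/namespace names are the common ODF defaults, not taken from this cluster).
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph auth get client.bootstrap-osd
# A quick listing of all auth entries can also help spot a missing or renamed key.
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph auth ls | grep -A2 bootstrap-osd
```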
Were you able to clean the device and get this working?

Hi. A few questions to better understand what's happening:
- A node failed and was replaced by a new node. Is that the only event that led to the current situation with the cluster?
- The logs suggest that the OSD is using the same PV `local-pv-2a0b2a3` even on the newly added node. Is that correct? IIUC, when a new node is added, the LSO operator creates a new PV from the disks on that node, so the OSD on the new node should use a new PV.
- When the old node failed, were the failing OSDs removed from that node using the OSD removal job?

Re-assigning to Santosh as he is looking into this now. Thanks.

Removing needinfo since Santosh is looking into it.

This is a known ODF issue and is currently being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2102304

Hi Alicia, a few questions:
1. Is comment 17 the only issue the customer is facing right now?
2. Does the issue in comment 17 prevent the user from continuing?

Hi. So the customer added a new node and still doesn't see any new OSDs coming up. Does that summarize the current situation correctly?
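For the PV question above, a quick check is to see what the deviceset PVCs are bound to and whether the suspect local PV still has node affinity for the replaced node. A rough sketch, assuming the default ODF namespace, the `local-pv-2a0b2a3` name from the logs, and the standard `ocs-osd-removal` job naming:

```
# Which PVs are the OSD deviceset PVCs bound to right now?
oc -n openshift-storage get pvc | grep ocs-deviceset
# Does the suspect local PV still pin itself to the old node via nodeAffinity?
oc get pv local-pv-2a0b2a3 -o yaml | grep -A5 nodeAffinity
# Was an OSD removal job ever run after the node failure?
oc -n openshift-storage get jobs | grep ocs-osd-removal
```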
I looked at the must-gather logs.
All the osd-prepare pods are stuck in Pending due to the following error:
```
0/23 nodes are available: 10 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/spectrum: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: reserved}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't find available persistent volumes to bind.
```
```
pod/rook-ceph-osd-prepare-297231d095a8b2f3cce9df0c6c95f036-gzs4s 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-4750a47e6acc2ed202fcd5e904355c6f-s2c2w 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-902a8387150a5d7492e83406b88b7312-kp9gv 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-97ca5e82f872abfea03437c8783882e3-2f5w6 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-abb427bd904e400157f3f6ce14328332-lc8zx 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-b0e69ad309df460a3f979a39928800e7-9qhrb 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-c680a538497b3a87011db28b856e0b4e-h98r9 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-cd84b42e320034e2b5447e34441728ef-bn5jm 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-d8bd70f44817b51607869df9e6cc491c-qvtfv 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-daafab739afa1600c6d1d4e311e98c15-56hrc 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-e9c94c244fc6ef4b02a2d75fc74d47f0-7ffxh 0/1 Pending 0 18d <none> <none> <none> <none>
pod/rook-ceph-osd-prepare-fe4e80997e70ed70447be3c38d9314c3-zlfbp 0/1 Pending 0 18d <none> <none> <none> <none>
```
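The scheduler message above points at both taints and PVs that cannot be bound. A rough way to narrow it down (a sketch: the pod name is taken from the listing above, and `localblock` is only the usual LSO storage class name, so adjust for this cluster):

```
# Full scheduling events for one of the Pending osd-prepare pods.
oc -n openshift-storage describe pod rook-ceph-osd-prepare-297231d095a8b2f3cce9df0c6c95f036-gzs4s
# Did the LSO create new local PVs for the replacement node, and are any still Available?
oc get pv -o wide | grep localblock
# Are the deviceset PVCs Bound or still Pending?
oc -n openshift-storage get pvc | grep ocs-deviceset
```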
Hello, I do not agree with the bug closure. This still needs to be fixed or documented.
I think this is the issue:

```
op-osd: OSD orchestration status for PVC ocs-deviceset-sc-odf-2-data-27lr657 is "failed"
2023-01-24T15:44:23.415980210Z 2023-01-24 15:44:23.415968 E | op-osd: failed to provision OSD(s) on PVC ocs-deviceset-sc-odf-2-data-27lr657. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to generate osd keyring: failed to get or create auth key for client.bootstrap-osd: failed get-or-create-key client.bootstrap-osd: exit status 1}
2023-01-24T15:44:23.433564120Z 2023-01-24 15:44:23.433532 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 15 failures encountered while running osds on nodes in namespace "openshift-storage".
```

Did you try this solution? https://access.redhat.com/solutions/3524771 It seems to be the same scenario.
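Assuming the scenario in that solution is a missing or stale `client.bootstrap-osd` key (which is what the error above complains about), a repair sketch from the toolbox could look like the following. This is illustrative only, not the documented procedure; the deployment and namespace names are the usual ODF defaults.

```
# Re-create the bootstrap-osd key with the standard profile if it is missing
# (illustrative; follow the linked KCS article / ODF docs for the supported steps).
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph auth get-or-create client.bootstrap-osd mon 'allow profile bootstrap-osd'
# Then restart the rook operator so it reconciles the OSD prepare jobs again.
oc -n openshift-storage rollout restart deploy/rook-ceph-operator
```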