Description of problem (please be as detailed as possible and provide log snippets):

1. We have 1 Optane drive and 8 NVMe drives in each worker node of the OCP 4.11 bare-metal cluster.
2. As part of drive-detach testing, after abruptly detaching the metadata Optane drive from one of the workers, 7 OSDs went down, which is expected.
3. We also noticed that 7 pods went into CrashLoopBackOff state. The PVs are based on LSO, with the metadata device partitioned.
4. Before reattaching the drive, we performed steps 1 to 10 (skipping step 8) from the link below and then restarted the rook-ceph-operator pod.
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.11/html/replacing_devices/openshift_data_foundation_deployed_using_local_storage_devices#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhodf
5. In step 10, we deleted the PVs corresponding to those 7 OSDs. After deleting the PVCs and PVs, the PVs are in the Available state.
6. Ceph health remains in the same 'WARN' state, showing 7 OSDs as down.

Version of all relevant components (if applicable): 4.11

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 3

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI? No

Actual results: The OSD count did not increase after performing the steps in the above link.

Expected results: The OSD count should increase after performing these steps.

Additional info:
Please find the Dropbox links for the must-gather logs:
must-gather: https://www.dropbox.com/s/3s32abqfo063i9a/must-gather%201.zip?dl=0
ocs must-gather logs: https://www.dropbox.com/s/jty896cuiz2spfw/ocsmustgatherlogs.zip?dl=0
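For reference, the down OSDs and the crashlooping pods were identified with commands along these lines (assuming the rook-ceph-tools toolbox deployment is enabled; output omitted):

  oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide
  # Ceph's own view of the down OSDs, via the toolbox pod
  oc rsh -n openshift-storage deploy/rook-ceph-tools ceph osd tree
  oc rsh -n openshift-storage deploy/rook-ceph-tools ceph health detail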
Created attachment 1927364 [details] oc get pv
Created attachment 1927365 [details] oc get pvc
Created attachment 1927366 [details] oc get csv
Created attachment 1927367 [details] oc get jobs
Created attachment 1927368 [details] oc logs for rook ceph operator pod
Created attachment 1927369 [details] oc describe pod for rook ceph operator pod
Created attachment 1927370 [details] oc describe for one osd pod which is in crashloopback state
Created attachment 1927371 [details] oc describe storagecluster -n openshift-storage
Created attachment 1927372 [details] ceph status
Created attachment 1927373 [details] oc version
Created attachment 1927374 [details] oc get cepcluster , oc get storagecluster
From the ceph status in comment 13, it appears the old OSDs were not fully removed from the cluster. Ceph should not show the existence of those 7 removed OSDs anymore. In steps 6 and 7, what did the logs show for the removed OSDs? There must have been a failure at that step.
Hi Travis,
In step 6, the osd-removal job stayed in the 'Running' state rather than 'Completed', and the corresponding pod's log (checked as per step 7) showed that one of the OSDs is 'NOT ok to destroy'. Screenshot attached.
Created attachment 1928368 [details] osd_removal_job_pod_describe
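For reference, the job status and log in steps 6-7 can be checked with something like the following (job name as used in the 4.11 replacement document):

  oc get job ocs-osd-removal-job -n openshift-storage
  oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=-1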
If the OSD is not ok to destroy, you will either need to wait until the OSD is safe to destroy, or set the force-destroy flag when you run the OSD removal job.
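As a sketch of what that looks like (parameter names follow the ODF 4.11 replacement document; <failed_osd_id> is a placeholder, and the available parameters can be confirmed with 'oc process --parameters -n openshift-storage ocs-osd-removal'):

  # remove any previous removal job before re-running it
  oc delete -n openshift-storage job ocs-osd-removal-job --ignore-not-found
  oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
  # confirm the job reaches Completed
  oc get job ocs-osd-removal-job -n openshift-storage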
Not a 4.12 blocker if the previous OSD wasn't fully removed, moving to 4.13 to complete investigation.
We performed the drive-detach test again in another cluster (OCP 4.11.12), which has 2 NVMe drives and 1 Optane drive in each worker node. After detaching the Optane drive from the worker2 node, we proceeded with step 6 with the force-destroy flag set, and the osd-removal job reached the 'Completed' state. After deleting the corresponding PV, which was in the Released state, a new OSD pod was not created automatically. Describing the corresponding PV shows no errors. Screenshots attached for reference.
Created attachment 1929993 [details] oc -n openshift-local-storage describe localvolume local-metadata
Created attachment 1929994 [details] oc -n openshift-local-storage describe localvolume local-wal
Created attachment 1930018 [details] ceph status
Created attachment 1930020 [details] ceph osd tree
Please find the Dropbox links for the must-gather logs:
must-gather: https://www.dropbox.com/s/uas7d5626wl4e97/must-gather.tar.gz?dl=0
ocs must-gather logs: https://www.dropbox.com/s/rkwovpydl8111ld/ocsmustgather.zip?dl=0
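For reference, after the purge job completed, the released PV cleanup and the operator restart were along these lines (PV name is illustrative):

  oc get pv | grep Released
  oc delete pv <released-pv-name>
  # restart the operator so it reconciles the storageClassDeviceSets again
  oc delete pod -n openshift-storage -l app=rook-ceph-operator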
What is not working now? From comment https://bugzilla.redhat.com/show_bug.cgi?id=2147526#c23 it seems you were able to remove the OSD.
Hi Subham,
As per comment https://bugzilla.redhat.com/show_bug.cgi?id=2147526#c23, we were able to delete the old OSD successfully by passing the force parameter; the OSD count dropped from 3 to 2, which is expected. But after attaching the drive back, the OSD count did not increase. Per the RH link (https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.11/html/replacing_devices/openshift_data_foundation_deployed_using_local_storage_devices#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhodf), the OSD should be added automatically once the OSD pods are created, but here the OSD pod never got created. We even tried deleting the rook-ceph-operator pod, with no change. We waited many hours, even a day, and the new OSD pod still did not appear. There were no errors in the rook-ceph-rgw pod, and neither 'oc -n openshift-local-storage describe localvolume local-metadata' nor 'oc -n openshift-local-storage describe localvolume local-wal' showed any errors.
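For reference, the OSD prepare pods and their logs (analyzed in the next comment) can be located with commands along these lines (pod name is illustrative; the log container is named 'provision' in the must-gather):

  oc get pods -n openshift-storage -l app=rook-ceph-osd-prepare
  oc logs -n openshift-storage rook-ceph-osd-prepare-ocs-deviceset-<n>-data-<suffix> -c provision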
The OSD prepare log [1] shows that the metadata device may be the issue:

2022-12-04T14:36:37.883434079Z 2022-12-04 14:36:37.883431 D | exec: Running command: stdbuf -oL ceph-volume --log-path /var/log/ceph/ocs-deviceset-0-data-0bk76s raw prepare --bluestore --data /mnt/ocs-deviceset-0-data-0bk76s --block.db /srv/ocs-deviceset-0-metadata-0qlwlc --block.wal /wal/ocs-deviceset-0-wal-0rdxnr
2022-12-04T14:36:38.523696491Z 2022-12-04 14:36:38.523557 I | cephosd: stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-04T14:36:38.523696491Z stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-04T14:36:38.523696491Z --> Raw device /srv/ocs-deviceset-0-metadata-0qlwlc is already prepared.

The last line shows that the metadata device is already prepared, so creation of the OSD is skipped. Were the metadata and wal devices also wiped in addition to the data device? To create the new OSD, all three devices need to be cleaned of the previous OSD. Also, do you have the log from the osd purge job, to show whether the metadata and db PVCs were also deleted?

[1] https://www.dropbox.com/s/rkwovpydl8111ld/ocsmustgather.zip?dl=0&file_subpath=%2Focsmustgather%2Fregistry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-5816523a486a0363410fcdb581457e65b993db0c52829c1b8e5124d78b9abd90%2Fnamespaces%2Fopenshift-storage%2Fpods%2Frook-ceph-osd-prepare-ocs-deviceset-0-data-0bk76s-2jmhk%2Fprovision%2Fprovision%2Flogs%2Fcurrent.log
Hi Travis,
This is a mixed-media setup where we have 2 NVMe (slower) drives and 1 Optane (faster) drive connected to each worker (3 workers in total). The data devices are created on the 6 NVMe drives, and the metadata and wal devices are created on the 3 Optane drives. The drive that was detached was the Optane drive, i.e., the metadata & wal device. Is it necessary to remove the corresponding data device as well when we detach the metadata/wal device? Since the data device is on a separate NVMe drive, it is not related to the pulled-out metadata device. Please advise on how to add the OSD back.
Thanks
Ishwarya M
The fundamental issue is that an OSD needs to be created on clean device(s). In this case, since there are separate wal, db, and data devices, if any of them is replaced, all of them must be wiped or replaced. Otherwise, the remnants of the previous OSD will prevent the creation of a new OSD on them.
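As a rough sketch of what "wiped" means here (device paths are illustrative; run from a debug shell on the affected node, and only against devices confirmed to belong to the removed OSD):

  oc debug node/<worker-node>
  chroot /host
  # clear filesystem/BlueStore signatures on each of the data, db, and wal block devices
  wipefs --all --force /dev/<osd-device>
  # remove any partition table remnants on the disk
  sgdisk --zap-all /dev/<osd-disk>
  # zero the start of each device so the old BlueStore label is gone
  dd if=/dev/zero of=/dev/<osd-device> bs=1M count=100 oflag=direct,dsync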
Hi Travis,
We have the below queries related to wiping the drives in the drive-detach scenario:
1. How do we find the data drive corresponding to the detached metadata drive?
2. In our cluster, 'osd-0' was the one that did not come up after re-attaching the drive, so we wiped (dd command) the data drive corresponding to deviceset-0. Before this, we removed its PVC and its PV. On applying the local-data.yaml file, no error was thrown, yet the local-data PV was not added back.
3. On restarting the rook-ceph-operator pod, the rook-ceph-osd-1 pod went into the 'CrashLoopBackOff' state since the PV of the corresponding data drive was not added back. ODF must-gather logs are placed at https://www.dropbox.com/s/0zu8orbxfwf8b2y/odf_must_gather_12dec.tar.gz?dl=0
4. Can you please provide your inputs on how to recover the OSDs from this state?
Also, if it is expected to wipe/replace all data, metadata, and wal devices when replacing one of the drives, this will not be possible in a customer environment. If any one drive goes faulty and needs a replacement, we cannot expect the customer to identify the related drives to wipe. Anyone would expect the data to be recovered automatically on replacing only the faulty drive. Can you please explain why ODF is designed to behave this way?
Thanks
Ishwarya M
(In reply to Ishwarya Munesh from comment #34)
> Hi Travis,
> We have the below queries related to wiping the drives in the drive-detach scenario:
> 1. How do we find the data drive corresponding to the detached metadata drive?

When you purged the OSD, you ran the job template to purge the OSD, correct? This job should have purged the OSDs and deleted all the related PVCs for data, metadata, and wal. While the purge doesn't wipe the devices, it is expected that their PVCs and PVs were deleted, and then the next time the operator reconciles the OSDs, it would create new PVCs in their place. If there are no new PVs available, the OSD creation should wait for a new, clean PV to which the PVC can be bound.

But since it sounds like something is missing in that flow for you: the PVCs for the metadata, wal, and data devices are all named in a related way. You should see PVCs named similar to:
- set1-data-0-<suffix>
- set1-metadata-0-<suffix>
- set1-wal-0-<suffix>

Here the suffix is a random ID, and "set1" is the name of the storageClassDeviceSet. In this example "0" is a simple index for OSDs that were created from the same storageClassDeviceSet. From the PVCs, you can see which PVs they are bound to.

If the PVCs had already been deleted by the purge job, then finding the PVs for the deleted PVCs would be more difficult.

> 2. In our cluster, 'osd-0' was the one that did not come up after re-attaching the drive, so we wiped (dd command) the data drive corresponding to deviceset-0. Before this, we removed its PVC and its PV. On applying the local-data.yaml file, no error was thrown, yet the local-data PV was not added back.

What is the local-data.yaml file? Is it for creating the local PVs? Did you clean the metadata and wal PVs yet? The OSD prepare log is just indicating that they cannot be re-used.

> 3. On restarting the rook-ceph-operator pod, the rook-ceph-osd-1 pod went into the 'CrashLoopBackOff' state since the PV of the corresponding data drive was not added back. ODF must-gather logs are placed at https://www.dropbox.com/s/0zu8orbxfwf8b2y/odf_must_gather_12dec.tar.gz?dl=0
> 4. Can you please provide your inputs on how to recover the OSDs from this state?
> Also, if it is expected to wipe/replace all data, metadata, and wal devices when replacing one of the drives, this will not be possible in a customer environment. If any one drive goes faulty and needs a replacement, we cannot expect the customer to identify the related drives to wipe. Anyone would expect the data to be recovered automatically on replacing only the faulty drive. Can you please explain why ODF is designed to behave this way?

ODF is designed to keep OSDs running. If there is ever a question about whether an OSD should be removed or wiped, ODF expects the admin to be involved in that decision so that ODF doesn't automatically remove any data and accidentally cause data loss.

This scenario of having metadata, wal, and data PVs for the OSD is not common, so we should certainly improve it, at least to make it easier to identify the corresponding devices.

Ultimately, this isn't a managed service where the cloud storage can be fully and automatically managed when hardware dies. When underlying devices are replaced, it is a disruptive change where storage admins will need to be involved.

> Thanks
> Ishwarya M
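To map a PVC to its PV and the underlying device on this LSO-based setup, something like the following should work (assuming the local PVs expose the device under spec.local.path; names are illustrative):

  oc get pvc -n openshift-storage | grep ocs-deviceset
  # find the PV bound to a given PVC
  oc get pvc -n openshift-storage <pvc-name> -o jsonpath='{.spec.volumeName}{"\n"}'
  # find the device path behind that local PV
  oc get pv <pv-name> -o jsonpath='{.spec.local.path}{"\n"}'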
(In reply to Travis Nielsen from comment #35)
> (In reply to Ishwarya Munesh from comment #34)
> > 1. How do we find the data drive corresponding to the detached metadata drive?
>
> When you purged the OSD, you ran the job template to purge the OSD, correct? This job should have purged the OSDs and deleted all the related PVCs for data, metadata, and wal. While the purge doesn't wipe the devices, it is expected that their PVCs and PVs were deleted, and then the next time the operator reconciles the OSDs, it would create new PVCs in their place. If there are no new PVs available, the OSD creation should wait for a new, clean PV to which the PVC can be bound.
>
> But since it sounds like something is missing in that flow for you: the PVCs for the metadata, wal, and data devices are all named in a related way. You should see PVCs named similar to:
> - set1-data-0-<suffix>
> - set1-metadata-0-<suffix>
> - set1-wal-0-<suffix>
>
> Here the suffix is a random ID, and "set1" is the name of the storageClassDeviceSet. In this example "0" is a simple index for OSDs that were created from the same storageClassDeviceSet. From the PVCs, you can see which PVs they are bound to.
>
> If the PVCs had already been deleted by the purge job, then finding the PVs for the deleted PVCs would be more difficult.
>
> > 2. In our cluster, 'osd-0' was the one that did not come up after re-attaching the drive, so we wiped (dd command) the data drive corresponding to deviceset-0. Before this, we removed its PVC and its PV. On applying the local-data.yaml file, no error was thrown, yet the local-data PV was not added back.
>
> What is the local-data.yaml file? Is it for creating the local PVs? Did you clean the metadata and wal PVs yet? The OSD prepare log is just indicating that they cannot be re-used.

>>> Yes, the local-data.yaml file is for creating the local-data PVs. The metadata and wal PVs were deleted while deleting the previous OSDs after detaching the drive.

> > 3. On restarting the rook-ceph-operator pod, the rook-ceph-osd-1 pod went into the 'CrashLoopBackOff' state since the PV of the corresponding data drive was not added back. ODF must-gather logs are placed at https://www.dropbox.com/s/0zu8orbxfwf8b2y/odf_must_gather_12dec.tar.gz?dl=0
> > 4. Can you please provide your inputs on how to recover the OSDs from this state?
> > Also, if it is expected to wipe/replace all data, metadata, and wal devices when replacing one of the drives, this will not be possible in a customer environment. If any one drive goes faulty and needs a replacement, we cannot expect the customer to identify the related drives to wipe. Anyone would expect the data to be recovered automatically on replacing only the faulty drive. Can you please explain why ODF is designed to behave this way?
>
> ODF is designed to keep OSDs running. If there is ever a question about whether an OSD should be removed or wiped, ODF expects the admin to be involved in that decision so that ODF doesn't automatically remove any data and accidentally cause data loss.
>
> This scenario of having metadata, wal, and data PVs for the OSD is not common,

>>> Can you provide more clarity on what the other possible ways of configuring data, metadata, and wal are?

> so we should certainly improve it, at least to make it easier to identify the corresponding devices.
>
> Ultimately, this isn't a managed service where the cloud storage can be fully and automatically managed when hardware dies. When underlying devices are replaced, it is a disruptive change where storage admins will need to be involved.

Also, we performed the same steps in OCP 4.10 and the OSD was added back after reattaching the drive. Was there any change in OCP 4.11 that requires the data, metadata, and wal devices to be wiped for the OSD to be added?
> Also, we performed the same steps in OCP 4.10 and the OSD was added back after reattaching the drive. Was there any change in OCP 4.11 that requires the data, metadata, and wal devices to be wiped for the OSD to be added?

I cannot think of a change between 4.10 and 4.11 that would have affected this OSD replacement. If you can gather the logs of the osd purge job from 4.10 and 4.11, hopefully we can see why the cleanup differs in a way that allowed it to work in 4.10.
Hi Travis,
We once again performed a complete ODF cleanup, re-installed ODF and the storage cluster, and performed the drive-detach scenario again.

Steps followed (initial configuration of the mixed-media setup, 3 worker nodes):
1. Each worker has 3 drives (2 NVMe and 1 Optane (faster drive)).
2. 12 PVs were created on these drives - 3 data, 3 metadata, 3 wal, 3 for mon pods.
3. The metadata and wal PVs were created on each Optane drive by creating 2 partitions.
4. 12 PVCs were created.
5. The storage cluster was created with replica 3.
6. Number of OSDs - 3; Ceph was healthy.
7. Detached one Optane drive (metadata & wal partitions) from one worker node.
8. The number of OSDs reduced to 2, as expected.
9. Purged the old OSD as per the steps provided here - https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.11/html/replacing_devices/openshift_data_foundation_deployed_using_local_storage_devices#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhodf
10. After deleting the PV (step 10 in the above link), its corresponding data PV was also removed. We hope this is expected, as per your previous comment. (In the attached file getpvafter.txt, the deviceset-2 related PVs are not present.)
11. After re-attaching the drive, the data, metadata, and wal PVs were created, and the PVCs were created as well.
12. As per your earlier comments, we cleaned up the data and wal/metadata drives with the dd command and also the sgdisk zap command, but OSD-2 was still not added back.
13. ODF must-gather logs are placed at https://www.dropbox.com/s/ss11mz1i982oifi/odf_mustgat_20dec.tar.gz?dl=0 for reference.

Can you please check the attached logs and let us know what is missing here and how to recover the OSD? Also, can you let us know how to get the logs of the OSD purge job?
Thanks
Ishwarya M
Created attachment 1933764 [details] pv list before detaching the drive
Created attachment 1933766 [details] pv list after detaching the drive and osd purge
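Regarding how to get the OSD purge job logs: assuming the job created from the template is still present and is named ocs-osd-removal-job (as in the replacement document), something like this should capture them before the job is cleaned up:

  oc get pods -n openshift-storage -l job-name=ocs-osd-removal-job
  oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=-1 > osd-purge-job.log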
From the OSD prepare log for the pod [1], I see that the OSD is not being created because the metadata device appears not to be clean:

Raw device /srv/ocs-deviceset-2-metadata-0gfzvc is already prepared

Here is more of the OSD prepare log:

2022-12-20T07:39:29.188095716Z 2022-12-20 07:39:29.188089 D | cephosd: device "/srv/ocs-deviceset-2-metadata-0gfzvc" is a metadata or wal device, skipping this iteration it will be used in the next one
2022-12-20T07:39:29.188095716Z 2022-12-20 07:39:29.188093 D | cephosd: device "/wal/ocs-deviceset-2-wal-0vwsnk" is a metadata or wal device, skipping this iteration it will be used in the next one
2022-12-20T07:39:29.188101190Z 2022-12-20 07:39:29.188097 I | cephosd: configuring new device "/mnt/ocs-deviceset-2-data-04ck6p"
2022-12-20T07:39:29.188101190Z 2022-12-20 07:39:29.188099 I | cephosd: devlink names:
2022-12-20T07:39:29.188105289Z 2022-12-20 07:39:29.188101 I | cephosd: /dev/disk/by-id/nvme-INTEL_SSDPF2KX076TZ_PHAC112401YW7P6CGN
2022-12-20T07:39:29.188105289Z 2022-12-20 07:39:29.188102 I | cephosd: /dev/disk/by-path/pci-0000:31:00.0-nvme-1
2022-12-20T07:39:29.188109150Z 2022-12-20 07:39:29.188104 I | cephosd: /dev/disk/by-id/nvme-eui.01000000000000005cd2e41a4d3e5351
2022-12-20T07:39:29.188112860Z 2022-12-20 07:39:29.188109 D | exec: Running command: stdbuf -oL ceph-volume --log-path /var/log/ceph/ocs-deviceset-2-data-04ck6p raw prepare --bluestore --data /mnt/ocs-deviceset-2-data-04ck6p --block.db /srv/ocs-deviceset-2-metadata-0gfzvc --block.wal /wal/ocs-deviceset-2-wal-0vwsnk
2022-12-20T07:39:30.298429727Z 2022-12-20 07:39:30.298392 I | cephosd: stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-20T07:39:30.298429727Z stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-20T07:39:30.298429727Z --> Raw device /srv/ocs-deviceset-2-metadata-0gfzvc is already prepared.

[1] rook-ceph-osd-prepare-ocs-deviceset-2-data-04ck6p-t72td pod logs in the must-gather
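For what it's worth, one way to check on the node whether a device still carries a BlueStore label from the previous OSD (as far as I recall, a raw BlueStore device starts with a 'bluestore block device' text label; device paths are illustrative):

  oc debug node/<worker-node>
  chroot /host
  # a leftover label from the old OSD shows up as the string
  # "bluestore block device" plus the old OSD fsid in the first bytes
  dd if=/dev/<metadata-partition> bs=4K count=1 2>/dev/null | strings | head
  lsblk -f /dev/<metadata-disk>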
Hi Travis,
We tried cleaning up the mentioned metadata device via the dd and zap commands and restarted the rook-ceph-operator pod, but the OSD still did not come up. Is there any step to be performed after cleaning up the drive?
Thanks
Ishwarya M
Travis,
The ODF must-gather collected after cleaning up the metadata device is placed here for reference: https://www.dropbox.com/s/048fvqmoytu4g6w/odf_mustgather_21dec.tar.gz?dl=0
We still see the 'already prepared' message in the logs. Let us know what step should be done after cleaning up the drives.
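As a hedged suggestion (the operator normally recreates the prepare jobs on its own during reconcile), one way to force a fresh prepare attempt after wiping, rather than re-reading a stale job's log, is roughly:

  oc get jobs -n openshift-storage | grep osd-prepare
  oc delete job -n openshift-storage rook-ceph-osd-prepare-ocs-deviceset-2-data-<suffix>
  # restart the operator so it reconciles and recreates the prepare pod
  oc delete pod -n openshift-storage -l app=rook-ceph-operator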
Hello Travis, Just making sure you saw the latest comment from Ishwarya at https://bugzilla.redhat.com/show_bug.cgi?id=2147526#c43
(In reply to Bertrand from comment #47)
> Hello Travis,
>
> Just making sure you saw the latest comment from Ishwarya at
> https://bugzilla.redhat.com/show_bug.cgi?id=2147526#c43

Yes, back from break now... The latest must-gather still shows the metadata device was already prepared for a prior OSD. Something on the metadata device is still not fully wiped, so Ceph still will not create the new volume.

2022-12-21T05:13:20.570792887Z 2022-12-21 05:13:20.570787 D | exec: Running command: stdbuf -oL ceph-volume --log-path /var/log/ceph/ocs-deviceset-2-data-04ck6p raw prepare --bluestore --data /mnt/ocs-deviceset-2-data-04ck6p --block.db /srv/ocs-deviceset-2-metadata-0gfzvc --block.wal /wal/ocs-deviceset-2-wal-0vwsnk
2022-12-21T05:13:21.632710555Z 2022-12-21 05:13:21.632574 I | cephosd: stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-21T05:13:21.632710555Z stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-21T05:13:21.632710555Z --> Raw device /srv/ocs-deviceset-2-metadata-0gfzvc is already prepared.

From the steps in comment 38, the OSDs were successfully created after the reinstall; the issue is only during OSD replacement. What steps were different when cleaning up for the full reinstall? Something must have cleaned the PV for the full install that was missed for the OSD replacement. The requirement for a clean metadata PV/device is the same in either case.
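One possibility worth ruling out (an assumption on my part, not something confirmed by the logs): if only the start of the parent Optane disk was zeroed, the BlueStore labels at the start of the metadata and wal partitions could survive. A sketch of wiping the exact block devices the local PVs reference:

  # resolve the partition paths from the LSO PVs (assumes spec.local.path is set)
  oc get pv <metadata-pv> -o jsonpath='{.spec.local.path}{"\n"}'
  oc get pv <wal-pv> -o jsonpath='{.spec.local.path}{"\n"}'
  # then, from a debug shell on the node, wipe each resolved partition directly
  oc debug node/<worker-node>
  chroot /host
  wipefs --all --force <resolved-partition-path>
  dd if=/dev/zero of=<resolved-partition-path> bs=1M count=100 oflag=direct,dsync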
Is this solved now so we can close this issue?
Please reopen if there is still an issue