Description of problem (please be as detailed as possible and provide log snippets):
We installed RH ODF 4.10.1 and tried to create a StorageSystem for IBM FlashSystem without any encryption option. The OSD prepare jobs seem to be stuck in the 'in progress' state and no OSDs were deployed.

Version of all relevant components (if applicable): ODF 4.10.1

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? Yes, we cannot install IBM FlashSystem on top of RH ODF.

Is there any workaround available to the best of your knowledge? No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Is this issue reproducible? Yes

Can this issue be reproduced from the UI? Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install ODF 4.10.1
2. Create a StorageSystem for IBM FlashSystem

Actual results: No OSDs were deployed

Expected results: The OSD pods are created successfully

Additional info:
A must-gather is needed for troubleshooting. Based on the description, the OSD prepare pod logs may show the issue.
The must-gather logs contain many files; how can I attach all of them?
If it's too much to zip up the must-gather, please at least attach the OSD prepare and Rook operator logs.
Created attachment 1876827 [details] must-gather logs
must-gather logs attached (uploading worked with .rar instead of .zip, thanks)
Can you access one of the machines where the prepare pod runs? Please send the logs from the /var/lib/rook directory. Thanks
Also, at this stage, I think it's crucial that we get access to the env to do further troubleshooting. Thanks
Created attachment 1877240 [details] log folder under /var/lib/rook
(In reply to Tal Yichye from comment #9) > Created attachment 1877240 [details] > log folder under /var/lib/rook It's likely the wrong node since I only see monitor logs, we need the logs from the node where the rook-ceph-osd-prepare job runs. Thanks.
Created attachment 1877871 [details] log folder under /var/lib/rook with ceph_volume logs Correct log folder with ceph_volume.log file.
After looking at the system closer, it appears that:

* ceph-volume is stuck in the prepare job, trying to configure an OSD on /dev/mapper/mpathat
* the device ceph-volume is trying to consume is a multipath device which seems to be in bad shape:

May 09 08:06:53 | sdd: prio = const (setting: emergency fallback - alua failed)
May 09 08:06:53 | sde: prio = const (setting: emergency fallback - alua failed)
mpathar (36005076810810128480000000000383b) dm-1 IBM,2145
size=512G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
  |- 3:0:0:63 sdd 8:48 failed faulty running
  `- 4:0:0:63 sde 8:64 failed faulty running

Also, there are a bunch of encrypted devices open and non-responsive; it'd be great to close them with "cryptsetup luksClose" or try "dmsetup remove --force". Rebooting the node might help too. Also, let's make sure the multipath devices are working correctly and try to re-install.

I think the root cause is mostly a setup problem on that machine. Tal had a point about the uninstallation not working correctly, or at least not closing the encrypted devices. Even if the uninstall doesn't do it, we should be able to close the devices manually, which is somehow not possible now.

Can we get a clean installation and try again? I don't think it's worth pursuing any further debugging. Tal, can we close this? Thanks!
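For reference, the manual cleanup suggested above could be scripted roughly as below. This is only a sketch: the helper name `stale_crypt_cleanup_cmds` is hypothetical, and the mapping name `mpathat` is just the example from the log above. It prints the commands instead of executing them, so nothing destructive happens until you run the output yourself as root on the affected node.

```shell
# Hedged sketch: emit the cleanup commands for one stale device-mapper entry.
# The helper name and the print-instead-of-execute approach are illustrative;
# double-check the names reported by `dmsetup ls --target crypt` on the node
# before running the printed commands.
stale_crypt_cleanup_cmds() {
  dev="$1"
  # Try a clean LUKS close first, then fall back to a forced dm removal.
  echo "cryptsetup luksClose /dev/mapper/$dev"
  echo "dmsetup remove --force /dev/mapper/$dev"
}

# Example with the device name seen in the prepare job:
stale_crypt_cleanup_cmds mpathat
```

Afterwards, `multipath -ll` should no longer show the paths as 'failed faulty' before attempting a re-install.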
Hi Sebastien, After cleaning up the environment manually (multipath, PVs, etc.) the installation succeeded (there is an issue with the ODF console, but I opened another ticket for it). I think there is a problem with the cleanup process, since many ODF objects are left on the cluster, in addition to the encrypted devices, after uninstalling it through the GUI. I think it's worth verifying the uninstall process again, specifically when the install process is stuck. Thanks!
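As a quick sanity check for the leftover-objects problem described above, something like the following could list what survived an uninstall. This is a sketch assuming the default `openshift-storage` namespace and a cluster-admin `oc` login; the helper name `leftover_check_cmds` is hypothetical, and it only prints the `oc` commands so they can be reviewed before running.

```shell
# Hypothetical post-uninstall check: print the 'oc' commands that would
# reveal leftover ODF objects. Namespace and resource kinds assume a default
# ODF install; adjust as needed, then run the printed commands on a host
# with cluster access.
leftover_check_cmds() {
  ns=openshift-storage
  for kind in storagesystem storagecluster cephcluster; do
    echo "oc get $kind -n $ns"
  done
}

leftover_check_cmds
# Leftover encrypted devices must still be checked per node, e.g. with
# 'dmsetup ls --target crypt' and 'ls /var/lib/rook'.
```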
Thanks Tal, based on https://bugzilla.redhat.com/show_bug.cgi?id=2081431#c13 I'm closing this BZ since the issue is resolved.