Bug 2081431
Summary: | OSD prepare jobs are stuck in state 'in progress' and no OSDs were deployed | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Tal Yichye <tal.yichye> |
Component: | rook | Assignee: | Sébastien Han <shan> |
Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | |
Version: | 4.10 | CC: | madam, ocs-bugs, odf-bz-bot, shan |
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-05-09 14:14:48 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Attachments: | | |
Description
Tal Yichye
2022-05-03 17:19:45 UTC
A must-gather is needed for troubleshooting. Based on the description, the OSD prepare pod logs may show the issue.

The must-gather logs contain many files. How can I attach all of them?

If it's too much to zip up the must-gather, please at least attach the OSD prepare and rook operator logs.

Created attachment 1876827 [details]
must-gather logs
must-gather logs attached (works with .rar instead of .zip, thanks)

Can you access one of the machines where the prepare pod runs? Please send the logs from the /var/lib/rook directory. Thanks.

Also, at this stage, I think it's crucial that we get access to the env to do further troubleshooting. Thanks.

Created attachment 1877240 [details]
log folder under /var/lib/rook
(In reply to Tal Yichye from comment #9)
> Created attachment 1877240 [details]
> log folder under /var/lib/rook

It's likely the wrong node since I only see monitor logs. We need the logs from the node where the rook-ceph-osd-prepare job runs. Thanks.

Created attachment 1877871 [details]
log folder under /var/lib/rook with ceph_volume logs
Correct log folder with ceph_volume.log file.
After looking at the system closer, it appears that:

* ceph-volume is stuck in the prepare job, trying to configure an OSD on /dev/mapper/mpathat
* the device ceph-volume is trying to consume is a multipath device that seems to be in bad shape:

  May 09 08:06:53 | sdd: prio = const (setting: emergency fallback - alua failed)
  May 09 08:06:53 | sde: prio = const (setting: emergency fallback - alua failed)
  mpathar (36005076810810128480000000000383b) dm-1 IBM,2145
  size=512G features='1 queue_if_no_path' hwhandler='0' wp=rw
  `-+- policy='service-time 0' prio=0 status=enabled
    |- 3:0:0:63 sdd 8:48 failed faulty running
    `- 4:0:0:63 sde 8:64 failed faulty running

Also, there are a bunch of encrypted disks that are open and non-responsive. It'd be great to close them with "cryptsetup luksClose" or try "dmsetup remove --force"; rebooting the node might help too. Also, let's make sure the multipath devices are working correctly and try to re-install.

I think the root cause is mostly a setup problem on that machine. Tal had a point about the uninstallation not working correctly, or at least not closing the encrypted devices. Even if it's not done, we should be able to close the devices manually, which is somehow not possible now. Can we get a clean installation and try again? I don't think it's worth pursuing any further debugging.

Tal, can we close this? Thanks!

Hi Sebastien,
After cleaning up the environment manually (multipath, PVs, etc.) the installation succeeded. (There is an issue with the ODF console, but I opened another ticket for it.) I think there is a problem with the cleanup process, since many ODF objects are left on the cluster, in addition to the encrypted devices, after uninstalling it through the GUI. I think it is worth verifying the uninstall process again, specifically when the install process is stuck. Thanks!

Thanks Tal, based on https://bugzilla.redhat.com/show_bug.cgi?id=2081431#c13 I'm closing this BZ since the issue is resolved.
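
For anyone who hits the same symptom, the multipath check from the analysis above can be sketched in Python. This is only an illustration, not part of Rook or ceph-volume: the function name `failed_paths` and the `SAMPLE` constant are hypothetical, and the parser assumes path lines have the same layout as the output quoted in this bug (H:C:T:L, device, major:minor, then three state fields). It reports the devices of each multipath map whose paths are marked failed or faulty.

```python
import re
from typing import Dict, List, Optional

# Path lines look like: "|- 3:0:0:63 sdd 8:48 failed faulty running"
# (layout assumed from the output quoted in this bug).
PATH_LINE = re.compile(
    r"[|`]-\s+(?P<hctl>\d+:\d+:\d+:\d+)\s+(?P<dev>\S+)\s+"
    r"(?P<majmin>\d+:\d+)\s+(?P<dm_state>\S+)\s+(?P<checker>\S+)\s+(?P<online>\S+)"
)
# Map header lines look like: "mpathar (3600...) dm-1 IBM,2145"
MAP_LINE = re.compile(r"^(?P<name>\S+)\s+\((?P<wwid>[0-9a-fA-F]+)\)\s+(?P<dm>dm-\d+)")


def failed_paths(multipath_output: str) -> Dict[str, List[str]]:
    """Return {map_name: [devices whose path is failed or faulty]}."""
    result: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in multipath_output.splitlines():
        header = MAP_LINE.match(line.strip())
        if header:
            # Start collecting paths for a new multipath map.
            current = header.group("name")
            result.setdefault(current, [])
            continue
        path = PATH_LINE.search(line)
        if path and current is not None:
            if path.group("dm_state") == "failed" or path.group("checker") == "faulty":
                result[current].append(path.group("dev"))
    # Keep only the maps that have at least one bad path.
    return {name: devs for name, devs in result.items() if devs}


# Sample text copied from the analysis comment in this bug.
SAMPLE = """\
mpathar (36005076810810128480000000000383b) dm-1 IBM,2145
size=512G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
  |- 3:0:0:63 sdd 8:48 failed faulty running
  `- 4:0:0:63 sde 8:64 failed faulty running
"""

if __name__ == "__main__":
    print(failed_paths(SAMPLE))
```

A map that shows up in the result has at least one unusable path. In this bug both paths of mpathar were failed, which fits the OSD prepare job hanging on a multipath device.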