Bug 2081431 - OSD prepare jobs are stuck in state 'in progress' and no OSDs were deployed
Summary: OSD prepare jobs are stuck in state 'in progress' and no OSDs were deployed
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Sébastien Han
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-03 17:19 UTC by Tal Yichye
Modified: 2023-08-09 17:03 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-09 14:14:48 UTC
Embargoed:


Attachments (Terms of Use)
must-gather logs (19.09 MB, application/vnd.rar)
2022-05-03 21:04 UTC, Tal Yichye
log folder under /var/lib/rook (2.37 MB, application/vnd.rar)
2022-05-05 06:13 UTC, Tal Yichye
log folder under /var/lib/rook with ceph_volume logs (5.13 MB, application/vnd.rar)
2022-05-08 07:12 UTC, Tal Yichye

Description Tal Yichye 2022-05-03 17:19:45 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

We installed RH ODF 4.10.1 and tried to create a StorageSystem for IBM FlashSystem without any encryption option.
The OSD prepare jobs seem to be stuck in state 'in progress' and no OSDs were deployed.

Version of all relevant components (if applicable):

ODF 4.10.1

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes, we cannot install IBM FlashSystem on top of RH ODF

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.Install ODF 4.10.1
2.Create StorageSystem for IBM FlashSystem
3.


Actual results:
No OSDs were deployed

Expected results:
Successfully create the OSD pods

Additional info:

Comment 2 Travis Nielsen 2022-05-03 18:03:37 UTC
A must-gather is needed for troubleshooting. Based on the description, the OSD prepare pod logs may show the issue.
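
For reference, a quick way to pull the OSD prepare pod logs and the Rook operator log, assuming the default openshift-storage namespace and the standard Rook label app=rook-ceph-osd-prepare (both assumptions about this cluster), would be something like:

  # list the OSD prepare pods and dump their logs
  oc -n openshift-storage get pods -l app=rook-ceph-osd-prepare
  oc -n openshift-storage logs -l app=rook-ceph-osd-prepare --all-containers
  # and the Rook operator log
  oc -n openshift-storage logs deploy/rook-ceph-operator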

Comment 3 Tal Yichye 2022-05-03 20:36:02 UTC
The must-gather logs contain many files; how can I attach all of them?

Comment 4 Travis Nielsen 2022-05-03 21:00:28 UTC
If it's too much to zip up the must-gather, please at least attach the OSD prepare and Rook operator logs.

Comment 5 Tal Yichye 2022-05-03 21:04:06 UTC
Created attachment 1876827 [details]
must-gather logs

Comment 6 Tal Yichye 2022-05-03 21:05:03 UTC
The must-gather logs are attached (works with .rar instead of .zip, thanks).

Comment 7 Sébastien Han 2022-05-04 13:53:19 UTC
Can you access one of the machines where the prepare pod runs?
Please send the logs from the /var/lib/rook directory.

Thanks
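
For reference, one way to collect that directory, assuming <node-name> is the node where the rook-ceph-osd-prepare pod was scheduled (placeholder name), is roughly:

  # stream a tarball of /var/lib/rook from the host filesystem via a debug pod
  oc debug node/<node-name> -- chroot /host tar czf - /var/lib/rook > rook-var-lib-rook.tar.gz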

Comment 8 Sébastien Han 2022-05-04 13:59:36 UTC
Also, at this stage, I think it's crucial that we get access to the env to do further troubleshooting.
Thanks

Comment 9 Tal Yichye 2022-05-05 06:13:19 UTC
Created attachment 1877240 [details]
log folder under /var/lib/rook

Comment 10 Sébastien Han 2022-05-05 07:15:25 UTC
(In reply to Tal Yichye from comment #9)
> Created attachment 1877240 [details]
> log folder under /var/lib/rook

This is likely the wrong node since I only see monitor logs; we need the logs from the node where the rook-ceph-osd-prepare job runs.
Thanks.

Comment 11 Tal Yichye 2022-05-08 07:12:20 UTC
Created attachment 1877871 [details]
log folder under /var/lib/rook with ceph_volume logs

This is the correct log folder, with the ceph_volume.log file.

Comment 12 Sébastien Han 2022-05-09 09:16:34 UTC
After taking a closer look at the system, it appears that:

* ceph-volume is stuck in the prepare job, trying to configure an OSD on /dev/mapper/mpathat
* the device ceph-volume is trying to consume is a multipath device that seems to be in bad shape:

May 09 08:06:53 | sdd: prio = const (setting: emergency fallback - alua failed)
May 09 08:06:53 | sde: prio = const (setting: emergency fallback - alua failed)
mpathar (36005076810810128480000000000383b) dm-1 IBM,2145
size=512G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
  |- 3:0:0:63 sdd 8:48 failed faulty running
  `- 4:0:0:63 sde 8:64 failed faulty running

Also, there are a bunch of encrypted devices left open and non-responsive; it'd be great to close them with "cryptsetup luksClose" or try "dmsetup remove --force". Rebooting the node might help too.
Let's also make sure the multipath devices are working correctly and then try to re-install.
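
A rough sketch of that cleanup, to be run on the affected node (the mapping and device names below are placeholders, not taken from this system; verify before removing anything):

  multipath -ll                                # check path state; "failed faulty" paths indicate a problem
  dmsetup ls --target crypt                    # list open dm-crypt (encrypted) mappings
  cryptsetup luksClose <crypt-mapping-name>    # try to close each stuck mapping
  dmsetup remove --force <crypt-mapping-name>  # fallback if luksClose hangs
  multipath -F                                 # flush unused multipath maps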

I think the root cause is mostly a setup problem on that machine.
Tal had a point about the uninstallation not working correctly, or at least not closing the encrypted devices. Even if the uninstall doesn't close them, we should be able to close the devices manually, which is somehow not possible right now.

Can we get a clean installation and try again?
I don't think it's worth pursuing any further debugging.

Tal, can we close this?
Thanks!

Comment 13 Tal Yichye 2022-05-09 13:22:07 UTC
Hi Sebastien,
After cleaning up the environment manually (multipath, PVs, etc.), the installation succeeded (there is an issue with the ODF console, but I opened another ticket for it).
I think there is a problem with the cleanup process, since many ODF objects are left on the cluster, in addition to the encrypted devices, after uninstalling it through the GUI. I think it is worth verifying the uninstall process again, specifically for the case where the install process gets stuck.

Thanks!
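
For reference, a rough sketch of the kind of manual cleanup described above, with hypothetical PV and device names (not necessarily the exact steps used here):

  oc get pv | grep openshift-storage        # find leftover ODF-related PVs
  oc delete pv <leftover-pv-name>
  # on each node, after closing the encrypted mappings:
  wipefs -a /dev/mapper/<mpath-device>      # clear old LUKS/Ceph signatures
  multipath -F                              # flush stale multipath maps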

Comment 14 Sébastien Han 2022-05-09 14:14:48 UTC
Thanks Tal, based on https://bugzilla.redhat.com/show_bug.cgi?id=2081431#c13 I'm closing this BZ since the issue is resolved.

