Description of problem (please be as detailed as possible and provide log snippets):
OSDs do not work with multipath devices. Red Hat OpenShift Kubernetes Service (ROKS) clusters have workers with multipath present on local SSD disks and also on attached remote block devices. OCS needs to recognize and use multipath devices in the OSD prepare pods. Note: the Mon pods identify the multipath devices successfully.
Version of all relevant components (if applicable):
Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
The OCS deployment fails because the OSD devices are not configured due to multipath.
Is there any workaround available to the best of your knowledge?
I manually disable multipath on the device on the node so that I can deploy OCS.
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4
Is this issue reproducible?
Yes
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Deploy OCS on Red Hat OpenShift Kubernetes Service (ROKS) with local SSD disks present on the nodes and used for OSDs.
Actual results:
The OCS deployment fails at the osd-prepare step because the local disks are unusable due to multipath being present on them.
Expected results:
Additional info:
We have a GHE issue open for this: https://github.com/openshift/ocs-operator/issues/452
Reducing severity to high and setting priority to high, as this needs to be prioritized for 4.4. Moving from 4.3 since this is not a blocker for that release.
Moving out at least to 4.5 since this is a feature request that is not possible for 4.4.
Adding the dependent Ceph issue opened in the community tracker: https://tracker.ceph.com/issues/45094
We found a working solution for running OSDs on multipath. The restriction for multipath was in certain configurations with ceph-volume that didn't apply to OCS. Multipath support is now in rook when configuring storageClassDeviceSets (OSDs on PVCs) in raw mode (no LVM).
The changes necessary in rook included:
- For OSDs on PVs, skip the ceph-volume inventory check for a valid device. We just proceed with the OSD provisioning. If it's an invalid PV, the provisioning will just fail anyway.
- Multipath devices are allowed to be used for OSDs.
IBM has confirmed it is working for them in an upstream test based on the proposed changes: https://github.com/openshift/ocs-operator/issues/452#issuecomment-623417136
The fix has been merged upstream, and also synced to the downstream branch with https://github.com/openshift/rook/pull/50. Moving to modified.
ceph-volume inventory should also still be updated to support multipath, at least when in raw mode without LVM.
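The two rook changes listed in this comment can be pictured with a small sketch. This is illustrative Python, not rook's actual Go code: the `Device` type, its fields, and `should_provision` are hypothetical names, and the behavior for devices that are not on a PVC (falling back to an inventory result) is an assumption that this comment does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical description of a candidate OSD device."""
    name: str           # illustrative path, e.g. "/dev/mapper/mpatha"
    is_multipath: bool  # device is a multipath device
    on_pvc: bool        # device backs an OSD provisioned on a PV/PVC
    inventory_ok: bool  # stand-in for the result of a ceph-volume inventory check


def should_provision(dev: Device) -> bool:
    """Return True if OSD provisioning should be attempted on dev."""
    if dev.on_pvc:
        # OSDs on PVs: skip the inventory check and proceed. Multipath is
        # not a reason to reject the device; an invalid PV is expected to
        # fail later, during provisioning itself.
        return True
    # Assumption: devices not on a PVC still depend on the inventory result.
    return dev.inventory_ok
```

With this sketch, a multipath device on a PVC is accepted even if the inventory check would have rejected it, which is the behavior change described above.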
@Neha Agreed, the only scenario where I see multipath as useful for OCS is if LSO is used to create PVs on multipath devices. IBM had said this was a requirement, although they found another solution to avoid multipath for now anyway.
@Michael, at this point in time we don't have access to a ROKS system, so not acking for now.
(In reply to Raz Tamir from comment #8)
> @Michael,
>
> At this point in time, we don't have access to ROKS system so not acking for
> now
I think I could get you access to a ROKS system, but all we really need is access to multipath devices. Is that available somewhere more easily? I think it might be sufficient to use a multipath iSCSI device if we don't have access to physical multipath SSDs.
In this case a partner (IBM) had the requirement and they also are the ones that can validate that the scenario is fully working for them. Is it not sufficient for a partner to validate their scenario is working? There might be other scenarios like this in the future as well where a partner will need to validate their own setup.
(In reply to Travis Nielsen from comment #10)
> In this case a partner (IBM) had the requirement and they also are the ones
> that can validate that the scenario is fully working for them. Is it not
> sufficient for a partner to validate their scenario is working? There might
> be other scenarios like this in the future as well where a partner will need
> to validate their own setup.
I think if our QE team is to give qa_ack (which is formally needed to officially take the BZ into the release), they would not want to just delegate.
@Raz: I think @Petr has written about running our ocs-ci against IBM Cloud, so I guess he can provide access to a test system?
@Travis, adding to the above thought: it is true that the requirement comes from IBM for their IBM Cloud / ROKS use case. But from the generic product PoV, it should be sufficient to test the generic feature with *some* multipath devices. The specific IBM Cloud deployment cannot (yet?) be part of the qualification matrix of the OCS product, but can be done alongside. (So I still think using an iSCSI multipath setup should be sufficient.) Agree?
@Michael Agreed, generically testing multipath should be sufficient. If there is a multipath device, a local PV (LSO) can be added to it, and configured like any other LSO configuration.
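When testing generically, it may help to first confirm that a candidate device really is a multipath device before putting a local PV on it. A minimal sketch, assuming a Linux node where device-mapper exposes `dm/uuid` and `dm/name` under `/sys/block/dm-*` and where multipath maps carry a UUID beginning with `mpath-`; the function names are hypothetical and not part of LSO, rook, or ocs-ci.

```python
from pathlib import Path
from typing import List


def is_multipath_uuid(dm_uuid: str) -> bool:
    """True if a device-mapper UUID belongs to a multipath map.

    Partitions on top of a multipath map use a different prefix and are
    deliberately not matched here.
    """
    return dm_uuid.strip().startswith("mpath-")


def list_multipath_devices(sys_block: Path = Path("/sys/block")) -> List[str]:
    """Return /dev/mapper paths of multipath maps found under sys_block."""
    found = []
    for uuid_file in sorted(sys_block.glob("dm-*/dm/uuid")):
        if is_multipath_uuid(uuid_file.read_text()):
            # dm/name holds the map name that appears under /dev/mapper.
            name = (uuid_file.parent / "name").read_text().strip()
            found.append(f"/dev/mapper/{name}")
    return found
```

Calling `list_multipath_devices()` on the node would list the maps that a local PV could then be created on, like in any other LSO configuration.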
Hey Michael. Access to the IBM cluster was shared last week in this email thread: http://post-office.corp.redhat.com/archives/rhocs-eng/2020-June/msg00240.html. However, it looks like someone started deleting the cluster, so it is probably not installed anymore. I have asked Gangadhar to have a new cluster deployed, and once I have it I can share the access here.
@Akash, what is IBM's timeline requirement for getting the multipath support? Is OCS 4.5.0 a hard requirement?
@Michael, yes, this is an important requirement for IBM Cloud to have OCS working, as there are different use cases which need this. I see it has been postponed from OCS 4.3 to 4.5, so we need to roll out OCS 4.5 with this feature. @Travis, I had also highlighted an issue with multipath and node reboots here: https://github.com/openshift/ocs-operator/issues/452#issuecomment-624510552 Can you confirm whether that is also done now?
(In reply to akgunjal.com from comment #20)
> @Michael, Yes this is a important requirement for IBM Cloud to have OCS
> working as there are different use-cases which need this. I see it has been
> postponed from OCS 4.3 to 4.5 so we need to rollout in OCS 4.5 with this
> feature.
Thanks Akash, proposing as a blocker for 4.5 because of this.
Hi Michael, Akash,
In order to provide QA ack, we will need to make sure we have the HW to test with. QE doesn't have any machine with volumes exposed over multipath that is suitable for an OCS deployment. We would appreciate it if you could provide us with access to such an environment.
Also, Eran, from what I can tell, this is an RFE. The scope of testing is broader than what's described in this BZ and may require proper planning. We need to track it with Jira.
(In reply to Elad from comment #22)
> Hi Michael, Akash,
>
> In order to provide QA ack, we will need to make sure we have the HW to test
> with. QE doesn't possess any machine exposed to volumes by multipath that is
> suitable for OCS deployment.
> Appreciate if you can provide us with access to such an environment.
>
> Also, Eran, from what I can tell, this is an RFE. The scope of testing is
> broader than what's described in this BZ and may require proper planning. We
> need to track it with Jira.
@Elad,
As mentioned before, we don't need specific hardware to test this. We can create multipath disk devices with iSCSI. See comments #12 and #13.
And yes, it may be called a (small) RFE or a bugfix; I'm not sure it really matters. The main point is that deployment was not successful if the disks carried this multipath flag.
Cheers - Michael
(In reply to Michael Adam from comment #23)
> @Elad,
>
> As mentioned before, we don't need to have specific hardware to test this.
> We can create multipath disk devices with iscsi.
> See comments #12 and #13.
OCS4 QE doesn't have machines connected by iSCSI or FC to a storage backend that we can use to deploy OCS on. Maybe other teams have; I can check.
>
> And yes, it may be called a (small) rfe, or called a bugfix, not sure it
> really matters.
> Mainly deployment was not successful if the disks carried this multipath
> flag.
This is a new feature that requires QE work, which at the least means finding suitable HW, setting up a cluster in a new way, and a full regression run. It's not a small feature.
Thanks Michael for suggesting that we verify the bug based on regression testing, just to make sure the bug fix didn't break anything. This is in order to allow IBM to work with multipath. It means that multipath will not be an OCS feature and will not be mentioned in the docs. Once it has a proper Jira epic, we will be able to decide which version we will qualify it for as a fully supported feature. Acking.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754
*** Bug 1866775 has been marked as a duplicate of this bug. ***