Bug 1823409
| Summary: | OSD needs to recognize and support multipath | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | akgunjal <akgunjal> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED ERRATA | QA Contact: | Elad <ebenahar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2 | CC: | ebenahar, etamir, jelopez, madam, mmuench, nberry, ocs-bugs, owasserm, pbalogh, ratamir, sabose, sostapov, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-15 10:16:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
akgunjal@in.ibm.com
2020-04-13 15:50:35 UTC
Reducing severity to high and setting high priority as this needs to be prioritized for 4.4.

Moving from 4.3 as this is not a blocker for the release.

Moving out at least to 4.5 since this is a feature request that is not possible for 4.4.

Adding the dependent Ceph issue open in the community: https://tracker.ceph.com/issues/45094

We found a working solution for running OSDs on multipath. The restriction for multipath was in certain configurations with ceph-volume that didn't apply to OCS. Multipath support is now in rook when configuring storageClassDeviceSets (OSDs on PVCs) in raw mode (no LVM).

The changes necessary in rook included:
- For OSDs on PVs, skip the ceph-volume inventory check for a valid device. We just proceed with the OSD provisioning. If it's an invalid PV, the provisioning will just fail anyway.
- Multipath devices are allowed to be used for OSDs.

IBM has confirmed it is working for them in an upstream test based on the proposed changes.
https://github.com/openshift/ocs-operator/issues/452#issuecomment-623417136

The fix has been merged upstream, and also synced to the downstream branch with https://github.com/openshift/rook/pull/50. Moving to modified.

ceph-volume inventory should also still be updated to support multipath, at least when in raw mode without LVM.

@Neha Agreed, the only scenario where I see multipath as useful for OCS is if LSO is used to create PVs on multipath devices. IBM had said this was a requirement, although they found another solution to avoid multipath for now anyway.

@Michael,

At this point in time, we don't have access to ROKS system so not acking for now

(In reply to Raz Tamir from comment #8)
> @Michael,
>
> At this point in time, we don't have access to ROKS system so not acking for
> now

I think I could get you access to a ROKS system, but all we really need is access to multipath devices. Is that available somewhere more easily?
I think it might be sufficient to use a multipath iSCSI device, if we don't have access to physical multipath SSDs.

In this case a partner (IBM) had the requirement and they also are the ones that can validate that the scenario is fully working for them. Is it not sufficient for a partner to validate their scenario is working? There might be other scenarios like this in the future as well where a partner will need to validate their own setup.

(In reply to Travis Nielsen from comment #10)
> In this case a partner (IBM) had the requirement and they also are the ones
> that can validate that the scenario is fully working for them. Is it not
> sufficient for a partner to validate their scenario is working? There might
> be other scenarios like this in the future as well where a partner will need
> to validate their own setup.

I think if our QE team is to give qa_ack (which is formally needed to officially take the BZ into the release), they would not want to just delegate.

@Raz: I think that @Petr has been writing about running our ocs-ci against IBM Cloud, so I guess he can provide access to a test system?

@Travis, and adding to the above thought:
It is true that the requirement comes from IBM for their IBM Cloud / ROKS use case. But from the generic product PoV, it should be sufficient to test the generic feature with *some* multipath devices. The specific IBM Cloud deployment cannot (yet?) be part of the qualification matrix of the OCS product, but can be done alongside. (So I still think using an iSCSI multipath setup should be sufficient.) Agree?

@Michael Agreed, generically testing multipath should be sufficient. If there is a multipath device, a local PV (LSO) can be added to it, and configured like any other LSO configuration.

Hey Michael. The access to the IBM cluster was shared last week in this email thread: http://post-office.corp.redhat.com/archives/rhocs-eng/2020-June/msg00240.html
But I see that someone probably started deleting the cluster, so the cluster is probably not installed anymore. I have asked Gangadhar to have a new cluster deployed, so once I have it I can share the access here.

@Akash, what is IBM's timeline requirement for getting the multipath support? Is OCS 4.5.0 a hard requirement?

@Michael, Yes this is an important requirement for IBM Cloud to have OCS working as there are different use-cases which need this. I see it has been postponed from OCS 4.3 to 4.5 so we need to roll out in OCS 4.5 with this feature.

@Travis, I had also highlighted an issue with multipath about node reboots here: https://github.com/openshift/ocs-operator/issues/452#issuecomment-624510552
Can you confirm if that's also done now?

(In reply to akgunjal.com from comment #20)
> @Michael, Yes this is an important requirement for IBM Cloud to have OCS
> working as there are different use-cases which need this. I see it has been
> postponed from OCS 4.3 to 4.5 so we need to roll out in OCS 4.5 with this
> feature.

Thanks Akash, proposing as a blocker for 4.5 because of this.

Hi Michael, Akash,

In order to provide QA ack, we will need to make sure we have the HW to test with. QE doesn't possess any machine exposed to volumes by multipath that is suitable for OCS deployment.
Appreciate if you can provide us with access to such an environment.

Also, Eran, from what I can tell, this is an RFE. The scope of testing is broader than what's described in this BZ and may require proper planning. We need to track it with Jira.

(In reply to Elad from comment #22)
> Hi Michael, Akash,
>
> In order to provide QA ack, we will need to make sure we have the HW to test
> with. QE doesn't possess any machine exposed to volumes by multipath that is
> suitable for OCS deployment.
> Appreciate if you can provide us with access to such an environment.
>
> Also, Eran, from what I can tell, this is an RFE. The scope of testing is
> broader than what's described in this BZ and may require proper planning. We
> need to track it with Jira.

@Elad,

As mentioned before, we don't need to have specific hardware to test this.
We can create multipath disk devices with iSCSI.
See comments #12 and #13.

And yes, it may be called a (small) RFE, or called a bugfix, not sure it really matters.
Mainly deployment was not successful if the disks carried this multipath flag.

Cheers - Michael

(In reply to Michael Adam from comment #23)
> @Elad,
>
> As mentioned before, we don't need to have specific hardware to test this.
> We can create multipath disk devices with iSCSI.
> See comments #12 and #13.

OCS4 QE doesn't have machines connected by iSCSI or FC to a storage backend that we can use to deploy OCS on. Maybe other teams have; I can check.

>
> And yes, it may be called a (small) RFE, or called a bugfix, not sure it
> really matters.
> Mainly deployment was not successful if the disks carried this multipath
> flag.

This is a new feature that requires QE work, which is at least finding the suitable HW, setting up a cluster in a new way, and full regression. It's not a small feature.

Thanks Michael for suggesting we will verify the bug based on regression testing, just to make sure the bug fix didn't break anything. This is in order to allow IBM to work with multipath. It means that multipath will not be an OCS feature and will not be mentioned in the docs.
Once it has a proper Jira epic, we will be able to decide which version we will qualify it for, as a fully supported feature.

Acking

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:3754

*** Bug 1866775 has been marked as a duplicate of this bug. ***
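
For anyone reading this bug later: the two rook changes described in the fix comment (skip the ceph-volume inventory check for OSDs on PVs, and allow multipath devices) amount to roughly the decision below. This is only a minimal Python sketch of that logic, not rook's actual Go code. The names `Device`, `accept_for_osd`, `inventory_valid` and `allow_multipath` are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical description of a candidate OSD device (illustration only)."""
    name: str
    is_multipath: bool = False
    # Stand-in for the outcome of a ceph-volume inventory validity check.
    inventory_valid: bool = True


def accept_for_osd(device: Device, on_pvc: bool, allow_multipath: bool = True) -> bool:
    """Decide whether to attempt OSD provisioning on a device.

    Assumed behaviour, based on the description in this bug:
    - multipath devices are no longer rejected up front;
    - for OSDs on PVs (PVC-backed), the inventory check is skipped and
      provisioning is simply attempted; an invalid PV fails later anyway.
    """
    if device.is_multipath and not allow_multipath:
        # Models the earlier behaviour, where multipath devices were refused.
        return False
    if on_pvc:
        # Skip the inventory check entirely for PV-backed OSDs.
        return True
    # Devices not backed by a PV still rely on the inventory result.
    return device.inventory_valid


# A multipath PV that the inventory check would not accept as valid.
mpath_pv = Device("mpatha", is_multipath=True, inventory_valid=False)
accepted = accept_for_osd(mpath_pv, on_pvc=True)
```

With these assumptions, a multipath device presented through a PV is accepted for provisioning even though the inventory stand-in marks it invalid, which matches the note that ceph-volume inventory itself still needs a separate update for multipath.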