1823409 – OSD needs to recognize and support multipath

Summary: OSD needs to recognize and support multipath

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.2
Hardware:	Unspecified
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	OCS 4.5.0
Assignee:	Travis Nielsen
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1866775
Depends On:
Blocks:

Reported:	2020-04-13 15:50 UTC by akgunjal@in.ibm.com
Modified:	2022-02-22 15:47 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-09-15 10:16:49 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:3754	0	None	None	None	2020-09-15 10:17:21 UTC

Description akgunjal@in.ibm.com 2020-04-13 15:50:35 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
OSDs do not work with multipath devices. Red Hat OpenShift Kubernetes Service (ROKS) clusters have workers with multipath present on local SSD disks and also on attached remote block devices. OCS needs to recognize and use multipath devices in the OSD prepare pods.

Note: Mon pods identify the multipath devices successfully.


Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
OCS deployment fails because the OSD devices are not configured due to multipath.


Is there any workaround available to the best of your knowledge?
I manually disable multipath on the device on the node so that I can deploy OCS.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4


Is this issue reproducible?
Yes

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCS on Red Hat OpenShift Kubernetes Service (ROKS) with local SSD disks present on the nodes and used for OSDs.


Actual results:
OCS deployment fails at the osd-prepare step because the local disks are unusable due to multipath being present on them.

Expected results:


Additional info:
We have a GHE open for this issue https://github.com/openshift/ocs-operator/issues/452

Comment 2 Raz Tamir 2020-04-13 16:22:14 UTC

Reducing severity to high and setting high priority as this needs to be prioritized for 4.4.
Moving from 4.3 as this is not a blocker for the release

Comment 3 Travis Nielsen 2020-04-13 18:24:09 UTC

Moving out at least to 4.5 since this is a feature request that is not possible for 4.4.

Comment 4 akgunjal@in.ibm.com 2020-04-22 16:34:54 UTC

Adding the dependent ceph issue open in community https://tracker.ceph.com/issues/45094

Comment 5 Travis Nielsen 2020-05-07 19:56:24 UTC

We found a working solution for running OSDs on multipath. The restriction for multipath was in certain configurations with ceph-volume that didn't apply to OCS.

Multipath support is now in rook when configuring storageClassDeviceSets (OSDs on PVCs) in raw mode (no LVM). The changes necessary in rook included:
- For OSDs on PVs, skip the ceph-volume inventory check for a valid device. We just proceed with the OSD provisioning. If it's an invalid PV, the provisioning will just fail anyway.
- multipath devices are allowed to be used for OSDs

IBM has confirmed it is working for them in an upstream test based on the proposed changes.
https://github.com/openshift/ocs-operator/issues/452#issuecomment-623417136

The fix has been merged upstream, and also synced to the downstream branch with https://github.com/openshift/rook/pull/50.
Moving to modified.

ceph-volume inventory should also still be updated to support multipath, at least when in raw mode without LVM.
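
The two changes listed in this comment can be pictured with a small Python sketch. This is not rook's code (rook is written in Go), and every name in it (`Device`, `ALLOWED_TYPES`, `needs_inventory_check`, `is_osd_candidate`) is hypothetical. It assumes only what the comment states: PVC-backed devices skip the inventory check, and multipath devices are an accepted device type in raw mode. The order of the checks is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical description of a candidate OSD device (not a rook type)."""
    name: str
    dev_type: str     # device type string such as "disk", "part", "mpath", "lvm"
    pvc_backed: bool  # True when the device comes from a storageClassDeviceSets PVC


# Assumption: device types accepted for raw-mode OSDs in this sketch,
# with "mpath" (multipath) now included.
ALLOWED_TYPES = frozenset({"disk", "part", "mpath"})


def needs_inventory_check(dev: Device) -> bool:
    """PVC-backed devices skip the inventory check; other devices are still inspected."""
    return not dev.pvc_backed


def is_osd_candidate(dev: Device, inventory_ok: bool = False) -> bool:
    """Decide whether OSD provisioning should be attempted on this device."""
    if dev.dev_type not in ALLOWED_TYPES:
        return False
    if not needs_inventory_check(dev):
        # Proceed directly; an invalid PV simply fails during provisioning.
        return True
    return inventory_ok


print(is_osd_candidate(Device("mpatha", "mpath", pvc_backed=True)))
```

In this sketch a PVC-backed multipath device is accepted without any inventory result, while a device that is not PVC-backed still depends on `inventory_ok`.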

Comment 7 Travis Nielsen 2020-05-12 18:11:13 UTC

@Neha Agreed, the only scenario where I see multipath as useful for OCS is when LSO is used to create PVs on multipath devices. IBM had said this was a requirement, although they found another solution to avoid multipath for now anyway.

Comment 8 Raz Tamir 2020-06-29 06:40:10 UTC

@Michael,

At this point in time, we don't have access to ROKS system so not acking for now

Comment 9 Michael Adam 2020-06-29 06:58:22 UTC

(In reply to Raz Tamir from comment #8)
> @Michael,
> 
> At this point in time, we don't have access to ROKS system so not acking for
> now

I think I could get you access to a ROKS system, but all we really need is access to multipath devices. Is that available somewhere more easily? I think it might be sufficient to use a multipath iSCSI device, if we don't have access to physical multipath SSDs.

Comment 10 Travis Nielsen 2020-06-29 20:35:05 UTC

In this case a partner (IBM) had the requirement and they also are the ones that can validate that the scenario is fully working for them. Is it not sufficient for a partner to validate their scenario is working? There might be other scenarios like this in the future as well where a partner will need to validate their own setup.

Comment 11 Michael Adam 2020-06-30 09:50:31 UTC

(In reply to Travis Nielsen from comment #10)
> In this case a partner (IBM) had the requirement and they also are the ones
> that can validate that the scenario is fully working for them. Is it not
> sufficient for a partner to validate their scenario is working? There might
> be other scenarios like this in the future as well where a partner will need
> to validate their own setup.

I think if our QE team is to give qa_ack (which is formally needed to officially take the BZ into the release), they would not want to just delegate. 

@Raz: I think that @Petr has written about running our ocs-ci against IBM Cloud, so I guess he can provide access to a test system?

Comment 12 Michael Adam 2020-06-30 14:23:45 UTC

@Travis, and adding to the above thought:

It is true that the requirement comes from IBM for their IBM Cloud / ROKS use case.
But from the generic product PoV, it should be sufficient to test the generic feature with *some* multipath devices.
The specific IBM Cloud deployment cannot (yet?) be part of the qualification matrix of the OCS product, but can be done alongside.

(So I still think using an iSCSI multipath setup should be sufficient.)

Agree?

Comment 13 Travis Nielsen 2020-06-30 14:50:10 UTC

@Michael Agreed, generically testing multipath should be sufficient. If there is a multipath device, a local PV (LSO) can be added to it, and configured like any other LSO configuration.
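
Before a local PV is added on a test node, it can help to confirm that the node really exposes a multipath device. A minimal sketch, assuming the node's block-device tree is available as the JSON printed by `lsblk --json -o NAME,TYPE`, where device-mapper multipath devices carry the type `mpath`. The helper name `find_mpath_devices` and the sample data are illustrative and were not captured from a real node.

```python
from __future__ import annotations

import json


def find_mpath_devices(lsblk_json: str) -> list[str]:
    """Return the names of multipath devices found in `lsblk --json` output."""
    found: dict[str, None] = {}  # dict keeps insertion order and removes duplicates

    def walk(nodes):
        for node in nodes:
            if node.get("type") == "mpath":
                found.setdefault(node["name"], None)
            walk(node.get("children") or [])

    walk(json.loads(lsblk_json).get("blockdevices") or [])
    return list(found)


# Hand-written sample: two paths (sda, sdb) to the same multipath device,
# plus one plain disk (sdc) without multipath.
SAMPLE = """
{"blockdevices": [
  {"name": "sda", "type": "disk", "children": [{"name": "mpatha", "type": "mpath"}]},
  {"name": "sdb", "type": "disk", "children": [{"name": "mpatha", "type": "mpath"}]},
  {"name": "sdc", "type": "disk"}
]}
"""

print(find_mpath_devices(SAMPLE))  # prints ['mpatha']
```

The same multipath device appears once under each of its paths, so the names are de-duplicated before they are returned.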

Comment 14 Petr Balogh 2020-06-30 15:07:23 UTC

Hey Michael.

The access to the IBM cluster was shared in this email thread:
http://post-office.corp.redhat.com/archives/rhocs-eng/2020-June/msg00240.html
last week.


But I see that someone probably started deleting the cluster, so it is probably not installed anymore. I have asked Gangadhar to have a new cluster deployed, so once I have it I can share the access here.

Comment 19 Michael Adam 2020-07-02 07:13:24 UTC

@Akash, what is IBM's timeline requirement for getting the multipath support? Is OCS 4.5.0 a hard requirement?

Comment 20 akgunjal@in.ibm.com 2020-07-02 13:34:34 UTC

@Michael, Yes this is an important requirement for IBM Cloud to have OCS working as there are different use-cases which need this. I see it has been postponed from OCS 4.3 to 4.5 so we need to roll out in OCS 4.5 with this feature.

@Travis, I had also highlighted an issue with multipath about node reboots here https://github.com/openshift/ocs-operator/issues/452#issuecomment-624510552 Can you confirm if that's also done now?

Comment 21 Michael Adam 2020-07-02 21:04:41 UTC

(In reply to akgunjal@in.ibm.com from comment #20)
> @Michael, Yes this is an important requirement for IBM Cloud to have OCS
> working as there are different use-cases which need this. I see it has been
> postponed from OCS 4.3 to 4.5 so we need to roll out in OCS 4.5 with this
> feature.

Thanks Akash, proposing as a blocker for 4.5 because of this.

Comment 22 Elad 2020-07-06 10:24:55 UTC

Hi Michael, Akash,

In order to provide QA ack, we will need to make sure we have the HW to test with. QE doesn't possess any machine exposed to volumes by multipath that is suitable for OCS deployment.
Appreciate if you can provide us with access to such an environment.

Also, Eran, from what I can tell, this is an RFE. The scope of testing is broader than what's described in this BZ and may require proper planning. We need to track it with Jira.

Comment 23 Michael Adam 2020-07-06 10:30:57 UTC

(In reply to Elad from comment #22)
> Hi Michael, Akash,
> 
> In order to provide QA ack, we will need to make sure we have the HW to test
> with. QE doesn't possess any machine exposed to volumes by multipath that is
> suitable for OCS deployment.
> Appreciate if you can provide us with access to such an environment.
> 
> Also, Eran, from what I can tell, this is an RFE. The scope of testing is
> broader than what's described in this BZ and may require proper planning. We
> need to track it with Jira.

@Elad,

As mentioned before, we don't need to have specific hardware to test this.
We can create multipath disk devices with iscsi.
See comments #12 and #13.

And yes, it may be called a (small) rfe, or called a bugfix, not sure it really matters.
Mainly deployment was not successful if the disks carried this multipath flag.

Cheers - Michael

Comment 24 Elad 2020-07-06 11:27:42 UTC

(In reply to Michael Adam from comment #23)

> @Elad,
> 
> As mentioned before, we don't need to have specific hardware to test this.
> We can create multipath disk devices with iscsi.
> See comments #12 and #13.
OCS4 QE doesn't have machines connected by iSCSI or FC to a storage backend that we can use to deploy OCS on.
Maybe other teams do; I can check.
> 
> And yes, it may be called a (small) rfe, or called a bugfix, not sure it
> really matters.
> Mainly deployment was not successful if the disks carried this multipath
> flag.
This is a new feature that requires QE work, which is at least finding suitable HW, setting up a cluster in a new way, and a full regression run. It's not a small feature.

Comment 25 Elad 2020-07-06 13:38:43 UTC

Thanks, Michael, for suggesting that we verify the bug based on regression testing, just to make sure the bug fix didn't break anything.
This is in order to allow IBM to work with multipath. It means that multipath will not be an OCS feature and will not be mentioned in the docs.

Once it has a proper Jira epic, we will be able to decide which version we will qualify it for as a fully supported feature.


Acking

Comment 35 errata-xmlrpc 2020-09-15 10:16:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

Comment 36 Travis Nielsen 2020-09-25 13:17:00 UTC

*** Bug 1866775 has been marked as a duplicate of this bug. ***
