Bug 1524966 - [NetApp 7.5 RFE]: Add group_by_prio support in DM-multipath for NVMe namespaces [NEEDINFO]
Summary: [NetApp 7.5 RFE]: Add group_by_prio support in DM-multipath for NVMe namespaces
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: device-mapper-multipath
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 7.7
Assignee: Ben Marzinski
QA Contact: Lin Li
URL:
Whiteboard:
Keywords: FutureFeature
Depends On:
Blocks: 1500798 1500889 1563290
 
Reported: 2017-12-12 11:37 UTC by gowrav
Modified: 2019-07-15 14:28 UTC (History)
23 users

Clone Of:
Last Closed:
adbarbos: needinfo? (bmarzins)



Description gowrav 2017-12-12 11:37:17 UTC
Description of problem:

A RHEL 7.5 LPe32002 host is configured as an NVMe initiator and connected to an LPFC software target via a 32G Brocade fabric switch. I created 3 namespaces on the LPFC target and mapped them to the NVMe initiator. Each of the 3 namespaces has 2 paths. From the initiator, I was able to connect to the target and discover all 3 namespaces.

# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme1n1     2c3ceea9b05ca515     Linux                                    1           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme1n2     2c3ceea9b05ca515     Linux                                    2           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme1n3     2c3ceea9b05ca515     Linux                                    3           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n1     2c3ceea9b05ca515     Linux                                    1           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n2     2c3ceea9b05ca515     Linux                                    2           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n3     2c3ceea9b05ca515     Linux                                    3           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7

The above output shows 6 NVMe devices: two for each of the 3 namespaces.

I then configured dm-multipath on these NVMe devices and listed the resulting maps.

# multipath -ll
uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 dm-3 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:1:0 nvme1n1 259:0 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:1:0 nvme2n1 259:3 active ready running
uuid.d264dddf-40c9-4c22-a721-6bb1a9d58c67 dm-2 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:3:0 nvme1n3 259:2 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:3:0 nvme2n3 259:5 active ready running
uuid.86751f51-11dc-42d0-957e-c178f7a14d52 dm-4 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:2:0 nvme1n2 259:1 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:2:0 nvme2n2 259:4 active ready running

In the above output, the dm-multipath device uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 (dm-3) has 2 NVMe devices -- nvme1n1 and nvme2n1 -- and both are discovered via the primary (active/optimized, or AO) path only. Yet they are grouped into two different priority groups: the first group's status is "active", indicating the primary (AO) path, while the second group's status is "enabled", indicating a secondary path.
Ideally, both devices should have been listed in the same priority group with the same status.
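The grouping can be checked mechanically from the `multipath -ll` output: with the failover layout each map shows two path groups, whereas a multibus layout would show one. A small sketch (the sample here-doc is the first map from the output above; the helper name is ours):

```shell
# Count path groups per multipath map. Group header lines contain "-+-";
# map header lines start with the WWID (here "uuid.").
count_groups() {
    awk '/^uuid\./ { map = $1 }
         /-[+]-/   { g[map]++ }
         END       { for (m in g) print m, g[m] }'
}

count_groups <<'EOF'
uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 dm-3 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:1:0 nvme1n1 259:0 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:1:0 nvme2n1 259:3 active ready running
EOF
# prints: uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 2
```

Two path groups per map is the failover layout being reported here; the request is for all ready paths to land in one group.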


Version-Release number of selected component (if applicable):

OS: RHEL 7.5 Alpha
Kernel: 3.10.0-799.el7.x86_64
Device-Mapper: device-mapper-1.02.145-4.el7.x86_64
Multipath: device-mapper-multipath-0.4.9-118.el7.x86_64

How reproducible:
Always

Comment 2 Steve Schremmer 2017-12-12 16:09:17 UTC
Are you using the default settings for NVMe multipaths? The default path grouping policy is FAILOVER, and your output from multipath -ll looks correct for that.

Comment 3 Ben Marzinski 2017-12-12 21:21:21 UTC
This isn't a bug. The only multipath NVMe support in rhel-7.5 is simple failover. I'll leave this bug open as a feature request for better path grouping support, but that won't be in rhel-7.5.

Comment 5 gowrav 2017-12-15 07:14:36 UTC
(In reply to Ben Marzinski from comment #3)
> This isn't a bug. The only multipath NVMe support in rhel-7.5 is simple
> failover. I'll leave this bug open as a feature request for better path
> grouping support, but that won't be in rhel-7.5.

Hi Ben,

I shall change the bug title to reflect that it is a feature request. Is there a possibility of supporting the "group_by_prio" option in a future release, for ONTAP storage?

Also, in the current release, if we explicitly override the "failover" option with the "multibus" option, will it work?

Comment 7 Ben Marzinski 2017-12-15 16:04:18 UTC
(In reply to gowrav from comment #5)
> Hi Ben,
> 
> I shall change the bug title to reflect that it is a feature request. Is
> there a possibility of supporting the "group_by_prio" option in a future
> release, for ONTAP storage?

Yes, we are actively looking at improving multipath support for NVMe, including better path grouping. That's why I'm leaving this bug open as a feature request.

> Also in the current release, if we explicitly override "failover" option
> with "multibus" option, will it work?

From multipath's point of view, yes: it will create a device with multiple paths that can all get IO. Whether this will actually work in reality depends on your device and drivers. This is not tested at all, and I would definitely recommend against it except for playing around with it in non-production setups, but I don't personally know of any specific reason why it absolutely can't work.
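For experimenting with such an override, an /etc/multipath.conf sketch might look like the following (the product string "Linux" is taken from the Model column of the `nvme list` output above and is illustrative for this LPFC soft-target setup only; as noted, this configuration is untested):

```
devices {
        device {
                vendor "NVME"
                product "Linux"
                path_grouping_policy multibus
        }
}
```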

Comment 8 Ewan D. Milne 2018-10-02 13:29:14 UTC
cc'ing Mike Snitzer, my understanding is that kernel changes are required to
support load balancing (multiple simultaneous active paths) on NVMe devices
in dm-multipath.

Comment 9 Ben Marzinski 2019-02-01 00:38:46 UTC
There currently is an upstream NetApp builtin config like this:

        {
                /*
                 * NVMe-FC namespace devices: MULTIBUS, queueing preferred
                 *
                 * The hwtable is searched backwards, so place this after
                 * "Generic NVMe"
                 */
                .vendor        = "NVME",
                .product       = "^NetApp ONTAP Controller",
                .pgpolicy      = MULTIBUS,
                .no_path_retry = NO_PATH_RETRY_QUEUE,
        },

Which has been working fine for people, at least with recent Fedora releases. Did you end up doing any testing with MULTIBUS in RHEL-7? At least from a multipath-tools perspective, this should work fine, and I don't know of any kernel work that needs to be done for RHEL-7.7 to make this work. Mike, there's not anything missing in the kernel to handle multiple NVMe paths per pathgroup in RHEL-7 (non-failback setups), is there?

Comment 10 Mike Snitzer 2019-02-01 05:35:36 UTC
(In reply to Ewan D. Milne from comment #8)
> cc'ing Mike Snitzer, my understanding is that kernel changes are required to
> support load balancing (multiple simultaneous active paths) on NVMe devices
> in dm-multipath.

Sorry for the late reply; that is only the case for bio-based NVMe (when "queue_mode bio" is specified on DM multipath table load).
This was to model what native NVMe multipathing supports.

But if "queue_mode rq" (default) or "queue_mode mq" (blk-mq) are used then round-robin will work as usual.

(In reply to Ben Marzinski from comment #9)
> Did you end up doing any testing with MULTIBUS in RHEL-7.  At
> least from a multipath tools perspective, this should work fine, and I don't
> know of any kernel work that needs to be done for RHEL-7.7 to make this
> work. Mike there's not anything missing in the kernel to handle multiple
> NVMe paths per pathgroup in RHEL-7 (non-failback setups), is there?

No, should be cool.  But can you or others test?  Or do you need me to?

Comment 11 Steve Schremmer 2019-03-22 20:00:02 UTC
I'm trying to understand the outcome of this discussion. For NetApp E-Series, we ultimately want to end up with a multipath configuration something like this:
        device {
                vendor "NVME"
                product "NetApp E-Series*"
                path_grouping_policy group_by_prio
                prio ana
                failback immediate
                no_path_retry 30
        }

Once we have the updated device-mapper-multipath package installed, which has the ANA prio, is it okay to use 'group_by_prio', or do we need to be using 'failover' for now?

Also, do you have recommendation of what to use for queue_mode for NVMe multipath?

Thanks,
Steve

Comment 12 Mike Snitzer 2019-03-22 20:35:01 UTC
(In reply to Steve Schremmer from comment #11)
> I'm trying to understand the outcome of this discussion. For NetApp
> E-Series, where we ultimately want to end up is a multipath configuration
> something like this:
>         device {
>                 vendor "NVME"
>                 product "NetApp E-Series*"
>                 path_grouping_policy group_by_prio
>                 prio ana
>                 failback immediate
>                 no_path_retry 30
>         }
> 
> Once we have the updated device-mapper-multipath package installed, which
> has the ANA prio, is it okay to use 'group_by_prio', or do we need to be
> using 'failover' for now?

Ben would be the better person to ask.  But I _think_ you'd use 'group_by_prio'.

> Also, do you have recommendation of what to use for queue_mode for NVMe
> multipath?

NVMe is only blk-mq, so 'queue_mode mq' would be needed (or dm_mod.use_blk_mq=Y on the kernel commandline).
I doubt it is worthwhile to use 'queue_mode bio' because it doesn't support path selectors -- it forces use of failover.
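As a small illustration of the kernel command-line route, the check below is a sketch: the cmdline string is a made-up example, and the helper name is ours. On a real host one would read /proc/cmdline instead.

```shell
# Return success if dm_mod.use_blk_mq=Y appears as a whole token on the
# kernel command line passed in $1.
has_dm_blk_mq() {
    case " $1 " in
        *" dm_mod.use_blk_mq=Y "*) return 0 ;;
        *)                         return 1 ;;
    esac
}

# Illustrative command line; on a real host: cmdline=$(cat /proc/cmdline)
cmdline="BOOT_IMAGE=/vmlinuz root=/dev/mapper/rhel-root dm_mod.use_blk_mq=Y"
has_dm_blk_mq "$cmdline" && echo "request-based dm-multipath will use blk-mq"
```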

Comment 13 Steve Schremmer 2019-06-13 22:23:20 UTC
We're testing with RHEL 7.7 with the following config in /etc/multipath.conf:

devices {
       device {
               vendor "NVME"
               product "NetApp E-Series*"
               path_grouping_policy group_by_prio
               failback immediate
               no_path_retry 30
       }
}

detect_prio defaults to yes and the ANA prio gets used, as expected.

I just updated to RHEL 7.7 SS2 (including kernel-3.10.0-1053 and device-mapper-multipath-0.4.9-127).

We're not applying any specific settings to queue_mode. Is this okay?

Comment 14 Ben Marzinski 2019-06-14 22:32:42 UTC
(In reply to Steve Schremmer from comment #13)
> 
> We're not applying any specific settings to queue_mode. Is this okay?

When I tested the rhel-7 backport of the nvme code, I didn't set queue_mode, and everything appeared to work fine. Mike, is this really necessary in rhel7?

Comment 15 Adelino Barbosa 2019-06-17 14:21:25 UTC
Mike, please see Ben's question.

Comment 16 Mike Snitzer 2019-06-18 18:34:59 UTC
(In reply to Ben Marzinski from comment #14)
> (In reply to Steve Schremmer from comment #13)
> > 
> > We're not applying any specific settings to queue_mode. Is this okay?
> 
> When I tested the rhel-7 backport of the nvme code, I didn't set queue_mode,
> and everything appeared to work fine. Mike, is this really necessary in
> rhel7?

If you want the DM multipath device to use blk-mq then _yes_ it is required to set "queue_mode mq" (or you can establish dm_mod.use_blk_mq=Y on the kernel commandline and then all request-based DM multipath devices will use blk-mq).

Otherwise, as just verified against RHEL7.6, even if the DM-multipath device's underlying paths are all blk-mq the DM-multipath device will still use the old .request_fn request-queue interface.

FYI: Both RHEL8 and upstream no longer allow stacking the old .request_fn (non-blk-mq) interface on top of blk-mq paths (because old .request_fn support no longer exists in those kernels).

All said, you don't need to use blk-mq for the DM multipath device, but for layering multipath on top of fast NVMe underlying paths it really should offer a performance advantage (because it avoids the locking overhead associated with the old .request_fn interface). But using blk-mq in RHEL7 does eliminate the use of traditional IO schedulers (e.g. deadline, cfq) -- which could prove to be an unwelcome change for devices that benefit from that upfront IO scheduling.
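One way to confirm which interface a given map ended up with is the dm sysfs attribute, sketched below. Assumptions: the use_blk_mq attribute is only exposed on kernels built with dm-mq support, and "dm-3" is just the map name from the original report.

```shell
# Print whether a dm device uses blk-mq, if the kernel exposes the attribute.
dm_uses_blk_mq() {
    attr="/sys/block/$1/dm/use_blk_mq"
    if [ -r "$attr" ]; then
        echo "$1 use_blk_mq: $(cat "$attr")"
    else
        echo "$1: use_blk_mq attribute not present"
    fi
}

dm_uses_blk_mq dm-3
```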

I hope I've been clear.  If not, please feel free to ask follow-up questions (and set needinfo from me accordingly).

Comment 17 Adelino Barbosa 2019-07-15 14:28:31 UTC
Ben, please see Mike's comments.

