Bug 1524966 - [NetApp 7.5 RFE]: Add group_by_prio support in DM-multipath for NVMe namespaces [NEEDINFO]
Summary: [NetApp 7.5 RFE]: Add group_by_prio support in DM-multipath for NVMe namespaces
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: device-mapper-multipath
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 7.7
Assignee: Ben Marzinski
QA Contact: Lin Li
URL:
Whiteboard:
Keywords: FutureFeature
Depends On:
Blocks: 1500798 1500889 1563290
 
Reported: 2017-12-12 11:37 UTC by gowrav
Modified: 2019-07-15 14:28 UTC (History)
23 users

Clone Of:
Last Closed:
adbarbos: needinfo? (bmarzins)



Description gowrav 2017-12-12 11:37:17 UTC
Description of problem:

A RHEL 7.5 LPe32002 host is configured as an NVMe initiator and connected to an LPFC software target via a 32G Brocade fabric switch. I created 3 namespaces on the LPFC target and mapped them to the NVMe initiator. Each of the 3 namespaces has 2 paths. From the initiator, I was able to connect to the target and discover all 3 namespaces.

# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme1n1     2c3ceea9b05ca515     Linux                                    1           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme1n2     2c3ceea9b05ca515     Linux                                    2           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme1n3     2c3ceea9b05ca515     Linux                                    3           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n1     2c3ceea9b05ca515     Linux                                    1           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n2     2c3ceea9b05ca515     Linux                                    2           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n3     2c3ceea9b05ca515     Linux                                    3           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7

The above output shows 6 NVMe devices: two for each of the 3 namespaces.

I then configured dm-multipath on these NVMe devices and listed the resulting maps.

# multipath -ll
uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 dm-3 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:1:0 nvme1n1 259:0 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:1:0 nvme2n1 259:3 active ready running
uuid.d264dddf-40c9-4c22-a721-6bb1a9d58c67 dm-2 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:3:0 nvme1n3 259:2 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:3:0 nvme2n3 259:5 active ready running
uuid.86751f51-11dc-42d0-957e-c178f7a14d52 dm-4 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:2:0 nvme1n2 259:1 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:2:0 nvme2n2 259:4 active ready running

In the above output, the dm-multipath device uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 (dm-3) has 2 NVMe devices -- nvme1n1 and nvme2n1 -- and both are discovered via the primary (active/optimized, or AO) path only. Yet they are grouped into two different priority groups: the first group's status is "active", indicating the primary (AO) path, while the second group's status is "enabled", indicating a secondary path.
Ideally, both devices should have been listed in the same priority group with the same status.
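The grouping can be checked mechanically from the `multipath -ll` output: with the failover layout each map shows two path groups, whereas a multibus layout would show one. A small sketch (the sample here-doc is the first map from the output above; the helper name is ours):

```shell
# Count path groups per multipath map. Group header lines contain "-+-";
# map header lines start with the WWID (here "uuid.").
count_groups() {
    awk '/^uuid\./ { map = $1 }
         /-[+]-/   { g[map]++ }
         END       { for (m in g) print m, g[m] }'
}

count_groups <<'EOF'
uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 dm-3 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:1:0 nvme1n1 259:0 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:1:0 nvme2n1 259:3 active ready running
EOF
# prints: uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 2
```

Two path groups per map is the failover layout being reported here; the request is for all ready paths to land in one group.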


Version-Release number of selected component (if applicable):

OS: RHEL 7.5 Alpha
Kernel: 3.10.0-799.el7.x86_64
Device-Mapper: device-mapper-1.02.145-4.el7.x86_64
Multipath: device-mapper-multipath-0.4.9-118.el7.x86_64

How reproducible:
Always

Comment 2 Steve Schremmer 2017-12-12 16:09:17 UTC
Are you using the default settings for NVMe multipaths? The default path grouping policy is FAILOVER, and your output from multipath -ll looks correct for that.

Comment 3 Ben Marzinski 2017-12-12 21:21:21 UTC
This isn't a bug. The only multipath NVMe support in rhel-7.5 is simple failover. I'll leave this bug open as a feature request for better path grouping support, but that won't be in rhel-7.5.

Comment 5 gowrav 2017-12-15 07:14:36 UTC
(In reply to Ben Marzinski from comment #3)
> This isn't a bug. The only multipath NVMe support in rhel-7.5 is simple
> failover. I'll leave this bug open as a feature request for better path
> grouping support, but that won't be in rhel-7.5.

Hi Ben,

I shall change the bug title to reflect that it is a feature request. Is there a possibility of supporting the "group_by_prio" option in a future release, for ONTAP storage?

Also, in the current release, if we explicitly override the "failover" option with the "multibus" option, will it work?

Comment 7 Ben Marzinski 2017-12-15 16:04:18 UTC
(In reply to gowrav from comment #5)
> Hi Ben,
> 
> I shall change the bug title to reflect that it is a feature request. Is
> there a possibility of supporting the "group_by_prio" option in a future
> release, for ONTAP storage?

Yes, we are actively looking at improving multipath support for NVMe, including better path grouping. That's why I'm leaving this bug open as a feature request.

> Also in the current release, if we explicitly override "failover" option
> with "multibus" option, will it work?

From multipath's point of view, yes: it will create a device with multiple paths that can all get IO. Whether this will actually work in reality depends on your device and drivers. This is not tested at all, and I would definitely recommend against it except for playing around with it in non-production setups, but I don't personally know of any specific reason why it absolutely can't work.
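For experimenting with such an override, an /etc/multipath.conf sketch might look like the following (the product string "Linux" is taken from the Model column of the `nvme list` output above and is illustrative for this LPFC soft-target setup only; as noted, this configuration is untested):

```
devices {
        device {
                vendor "NVME"
                product "Linux"
                path_grouping_policy multibus
        }
}
```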

Comment 8 Ewan D. Milne 2018-10-02 13:29:14 UTC
cc'ing Mike Snitzer, my understanding is that kernel changes are required to
support load balancing (multiple simultaneous active paths) on NVMe devices
in dm-multipath.

Comment 9 Ben Marzinski 2019-02-01 00:38:46 UTC
There currently is an upstream NetApp builtin config like this:

        {
                /*
                 * NVMe-FC namespace devices: MULTIBUS, queueing preferred
                 *
                 * The hwtable is searched backwards, so place this after
                 * "Generic NVMe"
                 */
                .vendor        = "NVME",
                .product       = "^NetApp ONTAP Controller",
                .pgpolicy      = MULTIBUS,
                .no_path_retry = NO_PATH_RETRY_QUEUE,
        },

Which has been working fine for people, at least with recent Fedora releases. Did you end up doing any testing with MULTIBUS in RHEL-7? At least from a multipath-tools perspective, this should work fine, and I don't know of any kernel work that needs to be done for RHEL-7.7 to make this work. Mike, there's not anything missing in the kernel to handle multiple NVMe paths per pathgroup in RHEL-7 (non-failback setups), is there?

Comment 10 Mike Snitzer 2019-02-01 05:35:36 UTC
(In reply to Ewan D. Milne from comment #8)
> cc'ing Mike Snitzer, my understanding is that kernel changes are required to
> support load balancing (multiple simultaneous active paths) on NVMe devices
> in dm-multipath.

Sorry for the late reply; that is only the case for bio-based NVMe (when "queue_mode bio" is specified on DM multipath table load).
This was to model what native NVMe multipathing supports.

But if "queue_mode rq" (default) or "queue_mode mq" (blk-mq) are used then round-robin will work as usual.

(In reply to Ben Marzinski from comment #9)
> Did you end up doing any testing with MULTIBUS in RHEL-7.  At
> least from a multipath tools perspective, this should work fine, and I don't
> know of any kernel work that needs to be done for RHEL-7.7 to make this
> work. Mike there's not anything missing in the kernel to handle multiple
> NVMe paths per pathgroup in RHEL-7 (non-failback setups), is there?

No, should be cool.  But can you or others test?  Or do you need me to?

Comment 11 Steve Schremmer 2019-03-22 20:00:02 UTC
I'm trying to understand the outcome of this discussion. For NetApp E-Series, we ultimately want to end up with a multipath configuration something like this:
        device {
                vendor "NVME"
                product "NetApp E-Series*"
                path_grouping_policy group_by_prio
                prio ana
                failback immediate
                no_path_retry 30
        }

Once we have the updated device-mapper-multipath package installed, which has the ANA prio, is it okay to use 'group_by_prio', or do we need to be using 'failover' for now?

Also, do you have recommendation of what to use for queue_mode for NVMe multipath?

Thanks,
Steve

Comment 12 Mike Snitzer 2019-03-22 20:35:01 UTC
(In reply to Steve Schremmer from comment #11)
> I'm trying to understand the outcome of this discussion. For NetApp
> E-Series, where we ultimately want to end up is a multipath configuration
> something like this:
>         device {
>                 vendor "NVME"
>                 product "NetApp E-Series*"
>                 path_grouping_policy group_by_prio
>                 prio ana
>                 failback immediate
>                 no_path_retry 30
>         }
> 
> Once we have the updated device-mapper-multipath package installed, which
> has the ANA prio, is it okay to use 'group_by_prio', or do we need to be
> using 'failover' for now?

Ben would be the better person to ask.  But I _think_ you'd use 'group_by_prio'.

> Also, do you have recommendation of what to use for queue_mode for NVMe
> multipath?

NVMe is only blk-mq, so 'queue_mode mq' would be needed (or dm_mod.use_blk_mq=Y on the kernel commandline).
I doubt it is worthwhile to use 'queue_mode bio' because it doesn't support path selectors -- it forces use of failover.
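As a small illustration of the kernel command-line route, the check below is a sketch: the cmdline string is a made-up example, and the helper name is ours. On a real host one would read /proc/cmdline instead.

```shell
# Return success if dm_mod.use_blk_mq=Y appears as a whole token on the
# kernel command line passed in $1.
has_dm_blk_mq() {
    case " $1 " in
        *" dm_mod.use_blk_mq=Y "*) return 0 ;;
        *)                         return 1 ;;
    esac
}

# Illustrative command line; on a real host: cmdline=$(cat /proc/cmdline)
cmdline="BOOT_IMAGE=/vmlinuz root=/dev/mapper/rhel-root dm_mod.use_blk_mq=Y"
has_dm_blk_mq "$cmdline" && echo "request-based dm-multipath will use blk-mq"
```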

Comment 13 Steve Schremmer 2019-06-13 22:23:20 UTC
We're testing with RHEL 7.7 with the following config in /etc/multipath.conf:

devices {
       device {
               vendor "NVME"
               product "NetApp E-Series*"
               path_grouping_policy group_by_prio
               failback immediate
               no_path_retry 30
       }
}

detect_prio defaults to yes and the ANA prio gets used, as expected.

I just updated to RHEL 7.7 SS2 (including kernel-3.10.0-1053 and device-mapper-multipath-0.4.9-127).

We're not applying any specific settings to queue_mode. Is this okay?

Comment 14 Ben Marzinski 2019-06-14 22:32:42 UTC
(In reply to Steve Schremmer from comment #13)
> 
> We're not applying any specific settings to queue_mode. Is this okay?

When I tested the rhel-7 backport of the nvme code, I didn't set queue_mode, and everything appeared to work fine. Mike, is this really necessary in rhel7?

Comment 15 Adelino Barbosa 2019-06-17 14:21:25 UTC
Mike, please see Ben's question.

Comment 16 Mike Snitzer 2019-06-18 18:34:59 UTC
(In reply to Ben Marzinski from comment #14)
> (In reply to Steve Schremmer from comment #13)
> > 
> > We're not applying any specific settings to queue_mode. Is this okay?
> 
> When I tested the rhel-7 backport of the nvme code, I didn't set queue_mode,
> and everything appeared to work fine. Mike, is this really necessary in
> rhel7?

If you want the DM multipath device to use blk-mq then _yes_ it is required to set "queue_mode mq" (or you can establish dm_mod.use_blk_mq=Y on the kernel commandline and then all request-based DM multipath devices will use blk-mq).

Otherwise, as just verified against RHEL7.6, even if the DM-multipath device's underlying paths are all blk-mq the DM-multipath device will still use the old .request_fn request-queue interface.

FYI: Both RHEL8 and upstream no longer allow stacking the old .request_fn (non-blk-mq) interface on top of blk-mq paths (because old .request_fn support no longer exists in those kernels).

All said, you don't need to use blk-mq for the DM multipath device, but for layering multipath on top of fast NVMe underlying paths it really should offer a performance advantage (because it avoids the locking overhead associated with the old .request_fn interface). But using blk-mq in RHEL7 does eliminate the use of traditional IO schedulers (e.g. deadline, cfq) -- which could prove to be an unwelcome change for devices that benefit from that upfront IO scheduling.
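One way to confirm which interface a given map ended up with is the dm sysfs attribute, sketched below. Assumptions: the use_blk_mq attribute is only exposed on kernels built with dm-mq support, and "dm-3" is just the map name from the original report.

```shell
# Print whether a dm device uses blk-mq, if the kernel exposes the attribute.
dm_uses_blk_mq() {
    attr="/sys/block/$1/dm/use_blk_mq"
    if [ -r "$attr" ]; then
        echo "$1 use_blk_mq: $(cat "$attr")"
    else
        echo "$1: use_blk_mq attribute not present"
    fi
}

dm_uses_blk_mq dm-3
```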

I hope I've been clear.  If not, please feel free to ask follow-up questions (and set needinfo from me accordingly).

Comment 17 Adelino Barbosa 2019-07-15 14:28:31 UTC
Ben, please see Mike's comments.

