Bug 1524966 - [NetApp 7.5 RFE]: Add group_by_prio support in DM-multipath for NVMe namespaces
Summary: [NetApp 7.5 RFE]: Add group_by_prio support in DM-multipath for NVMe namespaces
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: device-mapper-multipath   
(Show other bugs)
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 7.7
Assignee: Ben Marzinski
QA Contact: Lin Li
URL:
Whiteboard:
Keywords: FutureFeature
Depends On:
Blocks: 1500798 1500889 1563290
 
Reported: 2017-12-12 11:37 UTC by gowrav
Modified: 2019-04-17 22:58 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---



Description gowrav 2017-12-12 11:37:17 UTC
Description of problem:

A RHEL 7.5 host with an LPe32002 adapter is configured as an NVMe initiator and connected to an LPFC software target via a 32G Brocade fabric switch. I created 3 namespaces on the LPFC target and mapped them to the NVMe initiator. Each of the 3 namespaces has 2 paths. From the initiator, I was able to connect to the target and discover all 3 namespaces.
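
For context, connecting to an NVMe/FC target with nvme-cli looks roughly like the sketch below; the WWNN:WWPN pairs are placeholders, not the actual addresses used on this host, and the exact sequence here is an assumption:

# Discover subsystems on the target port (placeholder WWNN:WWPN values):
nvme discover --transport=fc \
     --traddr="nn-0x2000d039ea000000:pn-0x2100d039ea000000" \
     --host-traddr="nn-0x20000090fa000000:pn-0x10000090fa000000"

# Connect to all subsystems reported by the discovery controller:
nvme connect-all --transport=fc \
     --traddr="nn-0x2000d039ea000000:pn-0x2100d039ea000000" \
     --host-traddr="nn-0x20000090fa000000:pn-0x10000090fa000000"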

# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme1n1     2c3ceea9b05ca515     Linux                                    1           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme1n2     2c3ceea9b05ca515     Linux                                    2           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme1n3     2c3ceea9b05ca515     Linux                                    3           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n1     2c3ceea9b05ca515     Linux                                    1           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n2     2c3ceea9b05ca515     Linux                                    2           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7
/dev/nvme2n3     2c3ceea9b05ca515     Linux                                    3           5.37  GB /   5.37  GB    512   B +  0 B   3.10.0-7

The above output shows 6 NVMe devices, two for each of the 3 namespaces.

I configured dm-multipath on top of the above NVMe devices and listed the resulting multipath maps.

# multipath -ll
uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 dm-3 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:1:0 nvme1n1 259:0 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:1:0 nvme2n1 259:3 active ready running
uuid.d264dddf-40c9-4c22-a721-6bb1a9d58c67 dm-2 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:3:0 nvme1n3 259:2 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:3:0 nvme2n3 259:5 active ready running
uuid.86751f51-11dc-42d0-957e-c178f7a14d52 dm-4 NVME,Linux
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:2:0 nvme1n2 259:1 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 2:0:2:0 nvme2n2 259:4 active ready running

In the above output, the dm-multipath device uuid.6ae2a34e-089a-4acf-99a4-b6bf9c8bc674 (dm-3) has 2 NVMe devices -- nvme1n1 and nvme2n1 -- and both are discovered over the primary (or AO) path only. Yet they are grouped into two different priority groups: the first group's status is "active", indicating it is the primary (AO) path, and the second group's status is "enabled", indicating it is a secondary path.
Ideally, both devices should have been listed in the same priority group with the same status.


Version-Release number of selected component (if applicable):

OS: RHEL 7.5 Alpha
Kernel: 3.10.0-799.el7.x86_64
Device-Mapper: device-mapper-1.02.145-4.el7.x86_64
Multipath: device-mapper-multipath-0.4.9-118.el7.x86_64

How reproducible:
Always

Comment 2 Steve Schremmer 2017-12-12 16:09:17 UTC
Are you using the default settings for NVMe multipaths? The default path grouping policy is FAILOVER, and your output from multipath -ll looks correct for that.
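
For what it's worth, one way to confirm which policy actually applies is to dump the effective configuration and look at the NVMe entries. A rough sketch follows; the grep pattern assumes the builtin entry uses the vendor string "NVME" seen in the multipath -ll output, and the output formatting varies between versions:

# Built-in hwtable defaults compiled into multipath (vendor string assumed to be "NVME"):
multipath -t | grep -A 8 '"NVME"'
# Effective running configuration, including /etc/multipath.conf overrides:
multipathd -k"show config" | grep -A 8 '"NVME"'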

Comment 3 Ben Marzinski 2017-12-12 21:21:21 UTC
This isn't a bug. The only multipath NVMe support in rhel-7.5 is simple failover. I'll leave this bug open as a feature request for better path grouping support, but that won't be in rhel-7.5.

Comment 5 gowrav 2017-12-15 07:14:36 UTC
(In reply to Ben Marzinski from comment #3)
> This isn't a bug. The only multipath NVMe support in rhel-7.5 is simple
> failover. I'll leave this bug open as a feature request for better path
> grouping support, but that won't be in rhel-7.5.

Hi Ben,

I shall change the bug title to reflect that it is a feature request. Is there a possibility of supporting the "group_by_prio" option in the future for ONTAP storage?

Also, in the current release, if we explicitly override the "failover" option with the "multibus" option, will it work?

Comment 7 Ben Marzinski 2017-12-15 16:04:18 UTC
(In reply to gowrav from comment #5)
> Hi Ben,
> 
> I shall change the bug title to reflect that it is a feature request. Is
> there a possibility of supporting the "group_by_prio" option in the future
> for ONTAP storage?

Yes, we are actively looking at improving multipath support for NVMe, including better path grouping. That's why I'm leaving this bug open as a feature request.

> Also, in the current release, if we explicitly override the "failover"
> option with the "multibus" option, will it work?

From multipath's point of view, yes, it will create a device with multiple paths that can all get IO. Whether this will actually work in practice depends on your device and drivers. This is not tested at all, and I would definitely recommend against it except for experimenting in non-production setups, but I don't personally know of any specific reason why it absolutely can't work.
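
For anyone who does want to experiment with that, a minimal /etc/multipath.conf sketch of such an override might look like the following; the vendor/product strings are taken from the "NVME,Linux" shown in the multipath -ll output above and would need adjusting for other targets, and this is untested, per the caveats above:

        devices {
                device {
                        vendor "NVME"
                        product "^Linux"
                        path_grouping_policy multibus
                }
        }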

Comment 8 Ewan D. Milne 2018-10-02 13:29:14 UTC
cc'ing Mike Snitzer, my understanding is that kernel changes are required to
support load balancing (multiple simultaneous active paths) on NVMe devices
in dm-multipath.

Comment 9 Ben Marzinski 2019-02-01 00:38:46 UTC
There currently is an upstream NetApp builtin config like this:

        {
                /*
                 * NVMe-FC namespace devices: MULTIBUS, queueing preferred
                 *
                 * The hwtable is searched backwards, so place this after
                 * "Generic NVMe"
                 */
                .vendor        = "NVME",
                .product       = "^NetApp ONTAP Controller",
                .pgpolicy      = MULTIBUS,
                .no_path_retry = NO_PATH_RETRY_QUEUE,
        },

This has been working fine for people, at least with recent Fedora releases. Did you end up doing any testing with MULTIBUS in RHEL-7? At least from a multipath-tools perspective, this should work fine, and I don't know of any kernel work that needs to be done for RHEL-7.7 to make this work. Mike, there's not anything missing in the kernel to handle multiple NVMe paths per pathgroup in RHEL-7 (non-failback setups), is there?
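
For comparison, expressing the same settings as an /etc/multipath.conf override (rather than relying on the builtin entry) would look roughly like this; this is a sketch of the mapping from the hwtable fields above, not a configuration that has been tested on RHEL-7:

        devices {
                device {
                        vendor "NVME"
                        product "^NetApp ONTAP Controller"
                        path_grouping_policy multibus
                        no_path_retry queue
                }
        }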

Comment 10 Mike Snitzer 2019-02-01 05:35:36 UTC
(In reply to Ewan D. Milne from comment #8)
> cc'ing Mike Snitzer, my understanding is that kernel changes are required to
> support load balancing (multiple simultaneous active paths) on NVMe devices
> in dm-multipath.

Sorry for the late reply; that is only the case for bio-based NVMe (when "queue_mode bio" is specified on the DM multipath table load).
This was done to model what native NVMe multipathing supports.

But if "queue_mode rq" (default) or "queue_mode mq" (blk-mq) are used then round-robin will work as usual.

(In reply to Ben Marzinski from comment #9)
> Did you end up doing any testing with MULTIBUS in RHEL-7? At least from a
> multipath-tools perspective, this should work fine, and I don't know of any
> kernel work that needs to be done for RHEL-7.7 to make this work. Mike,
> there's not anything missing in the kernel to handle multiple NVMe paths per
> pathgroup in RHEL-7 (non-failback setups), is there?

No, should be cool.  But can you or others test?  Or do you need me to?

Comment 11 Steve Schremmer 2019-03-22 20:00:02 UTC
I'm trying to understand the outcome of this discussion. For NetApp E-Series, what we ultimately want to end up with is a multipath configuration something like this:
        device {
                vendor "NVME"
                product "NetApp E-Series*"
                path_grouping_policy group_by_prio
                prio ana
                failback immediate
                no_path_retry 30
        }

Once we have the updated device-mapper-multipath package installed, which has the ANA prio, is it okay to use 'group_by_prio', or do we need to be using 'failover' for now?

Also, do you have a recommendation for what to use as queue_mode for NVMe multipath?

Thanks,
Steve

Comment 12 Mike Snitzer 2019-03-22 20:35:01 UTC
(In reply to Steve Schremmer from comment #11)
> I'm trying to understand the outcome of this discussion. For NetApp
> E-Series, what we ultimately want to end up with is a multipath
> configuration something like this:
>         device {
>                 vendor "NVME"
>                 product "NetApp E-Series*"
>                 path_grouping_policy group_by_prio
>                 prio ana
>                 failback immediate
>                 no_path_retry 30
>         }
> 
> Once we have the updated device-mapper-multipath package installed, which
> has the ANA prio, is it okay to use 'group_by_prio', or do we need to be
> using 'failover' for now?

Ben would be the better person to ask.  But I _think_ you'd use 'group_by_prio'.

> Also, do you have a recommendation for what to use as queue_mode for NVMe
> multipath?

NVMe is blk-mq only, so 'queue_mode mq' would be needed (or dm_mod.use_blk_mq=Y on the kernel command line).
I doubt it is worthwhile to use 'queue_mode bio', because it doesn't support path selectors -- it forces the use of failover.
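
To illustrate the second option, switching dm-multipath to blk-mq on RHEL 7 could be done roughly as follows; this is a sketch that assumes the stock grub2/grubby setup and requires a reboot to take effect:

# Append the dm blk-mq switch to every installed kernel's command line
# (assumes grubby manages the default grub2 configuration):
grubby --update-kernel=ALL --args="dm_mod.use_blk_mq=y"
# After rebooting, verify that the parameter is enabled:
cat /sys/module/dm_mod/parameters/use_blk_mq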

