Bug 1673167 - [Bug 1155706] Several volumes lost all device mapper paths during storage controller firmware upgrade.
Summary: [Bug 1155706] Several volumes lost all device mapper paths during storage con...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: device-mapper-multipath
Version: 8.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Ben Marzinski
QA Contact: Lin Li
Docs Contact: Marek Suchánek
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-02-06 20:02 UTC by hong.chung
Modified: 2019-06-14 01:51 UTC (History)
16 users (show)

Fixed In Version: device-mapper-multipath-0.7.8-7.el8
Doc Type: Bug Fix
Doc Text:
.`multipathd` no longer disables the path when it fails to get the WWID of a path
Previously, the `multipathd` service treated a failed attempt at getting a path's WWID as getting an empty WWID. As a result, if `multipathd` failed to get the WWID of a path while checking whether it had changed, it sometimes disabled that path. With this update, `multipathd` keeps using the old WWID when it fails to get a new one, and it no longer disables paths in this situation.
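The fixed behavior described above can be sketched as follows. This is a minimal Python illustration, not multipathd's actual C code; the function and field names are invented for the example.

```python
def check_wwid_change(path, get_wwid):
    """Sketch of the fix: a failed WWID lookup is NOT treated as a
    changed (empty) WWID; the stored WWID is kept instead."""
    try:
        new_wwid = get_wwid(path)
    except OSError:
        # Fix: lookup failed (e.g. during firmware download);
        # keep the old WWID and do not disable the path.
        return False
    # Only a successfully read, genuinely different WWID counts as a change.
    return bool(new_wwid) and new_wwid != path["wwid"]
```

With the old behavior, the `OSError` branch would have been treated like an empty (changed) WWID and could disable the path.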
Clone Of:
Environment:
Last Closed: 2019-06-14 01:51:27 UTC
Type: Bug


Attachments (Terms of Use)
host_logs (2.34 MB, application/zip)
2019-02-06 20:02 UTC, hong.chung
no flags Details
host_logs (13.37 MB, application/zip)
2019-02-20 15:52 UTC, hong.chung
no flags Details
host_logs (1.62 MB, application/zip)
2019-02-22 16:13 UTC, hong.chung
no flags Details
host_logs (3.51 MB, application/zip)
2019-02-25 22:32 UTC, hong.chung
no flags Details
host_logs (2.15 MB, application/zip)
2019-02-26 22:26 UTC, hong.chung
no flags Details
host_logs (653.47 KB, application/zip)
2019-02-27 16:42 UTC, hong.chung
no flags Details

Description hong.chung 2019-02-06 20:02:11 UTC
Created attachment 1527686 [details]
host_logs

Description of problem:
4x1 setup:
4 - RHEL8.0 hosts with iSCSI setup.
1 - Sandhawk storage (98.42.32.02)
2 - network switches

- Created iSCSI (IPv4 and IPv6) sessions from each host to the storage.
- The hosts have 8 total paths to the storage (4 to controller A and 4 to controller B).
- Mapped volumes to the hosts.
- Created RAW and several type of filesystems (ext3, ext4, xfs) on the host.
- Started I/O to all the volumes.
- Started controller firmware upgrade stress test.

After several upgrades, one of the hosts (ictm1611s02h3) reported I/O errors on several volumes.  The messages log shows that during the last upgrade, when one of the controllers rebooted, device mapper multipath on the host removed 4 paths as expected.  But a couple of seconds later, it also removed the other 4 paths, leaving 0 paths to the volume.

One of the failed volumes:

3600a098000badf680000bf4c5c3806c0 dm-73 NETAPP,INF-01-00
size=2.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 13:0:0:72  sdjz  65:464   active ready running
| |- 12:0:0:72  sdkj  66:368   active ready running
| |- 16:0:0:72  sdadq 65:960   active ready running
| `- 17:0:0:72  sdadr 65:976   active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 15:0:0:72  sdjw  65:416   active ready running
  |- 14:0:0:72  sdkd  66:272   active ready running
  |- 18:0:0:72  sdacy 8:928    active ready running
  `- 19:0:0:72  sdadb 8:976    active ready running


The volume data from /var/log/messages:

Feb  5 19:07:13 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdjz - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:13 ICTM1611S02H3 multipathd[957]: checker failed path 65:464 in map 3600a098000badf680000bf4c5c3806c0
Feb  5 19:07:13 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: remaining active paths: 7
Feb  5 19:07:14 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdkj - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:14 ICTM1611S02H3 multipathd[957]: checker failed path 66:368 in map 3600a098000badf680000bf4c5c3806c0
Feb  5 19:07:14 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: remaining active paths: 6
Feb  5 19:07:15 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdadq - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:15 ICTM1611S02H3 multipathd[957]: checker failed path 65:960 in map 3600a098000badf680000bf4c5c3806c0
Feb  5 19:07:15 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: remaining active paths: 5
Feb  5 19:07:15 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdadr - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:15 ICTM1611S02H3 multipathd[957]: checker failed path 65:976 in map 3600a098000badf680000bf4c5c3806c0
Feb  5 19:07:15 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: remaining active paths: 4
Feb  5 19:07:16 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdjz - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:16 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdkj - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:17 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdadq - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:17 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdadr - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdjz - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: checker failed path 65:464 in map 3600a098000badf680000bf4c5c3806c0
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: remaining active paths: 3
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdkj - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: checker failed path 66:368 in map 3600a098000badf680000bf4c5c3806c0
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: remaining active paths: 2
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdadq - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: checker failed path 65:960 in map 3600a098000badf680000bf4c5c3806c0
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: remaining active paths: 1
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: sdadr - rdac checker reports path is down: ctlr firmware downloading
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: checker failed path 65:976 in map 3600a098000badf680000bf4c5c3806c0
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: Entering recovery mode: max_retries=30
Feb  5 19:07:22 ICTM1611S02H3 multipathd[957]: 3600a098000badf680000bf4c5c3806c0: remaining active paths: 0


Version-Release number of selected component (if applicable):
device-mapper-multipath-0.7.8-3.el8.x86_64

Actual results:
I/O errors after all paths to the volume were lost.

Expected results:
Half of the paths (those to the optimal controller) should remain active until the other controller comes back online during the controller firmware upgrade.

Additional info:

Comment 1 Ben Marzinski 2019-02-14 23:16:28 UTC
I've got a pretty good idea what's going on here, and I should have a fix to try shortly.

Comment 2 Ben Marzinski 2019-02-18 22:27:23 UTC
You can download test packages at:

http://people.redhat.com/~bmarzins/device-mapper-multipath/rpms/RHEL8/1673167/

Would you be able to try to reproduce this issue with these packages?

Comment 3 Ben Marzinski 2019-02-19 00:34:41 UTC
This bug looks serious enough that its fix should make RHEL-8.0 GA, so I'm requesting a blocker.  The fix I am using is the absolute minimum change to avoid this.  The only difference between the fixed version and the previous code is that multipathd will not update a path's state on a change event (which never happened before RHEL 8, and is the cause of this bug). The impact of the fix on other code is minimal, and multipathd needs to avoid the possibility of disabling queueing incorrectly.

I should note that to hit this issue, something needs to happen that makes a path appear unusable while it still shows up as online. Upgrading firmware is obviously one way that can happen.

Comment 5 Lin Li 2019-02-19 05:42:33 UTC
Hello NETAPP,
Could you provide test results once the fixed version is available?
Thanks in advance!

Comment 6 hong.chung 2019-02-19 16:09:49 UTC
Will try to verify/recreate with the fix.

Thanks
Hong

Comment 7 hong.chung 2019-02-20 15:52:22 UTC
Created attachment 1536753 [details]
host_logs

Comment 8 hong.chung 2019-02-20 15:53:00 UTC
I reran the test with the fix yesterday, but it failed after several storage controller firmware upgrades.  Looking at one of the failed I/O volumes, 3600a09800059d76f00004dad5b0fe18a, it appears that after the first upgrade, half of the volume's paths were never added back to multipath.

Feb 19 12:35:59 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 7
Feb 19 12:35:59 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 6
Feb 19 12:35:59 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 5
Feb 19 12:35:59 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 4
Feb 19 13:02:40 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 3
Feb 19 13:02:40 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 2
Feb 19 13:02:42 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 1
Feb 19 13:02:42 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 0
Feb 19 13:28:59 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 3
Feb 19 13:28:59 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 2
Feb 19 13:28:59 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 1
Feb 19 13:28:59 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 0
.
.
.
Feb 19 17:25:25 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 3
Feb 19 17:25:25 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 2
Feb 19 17:25:25 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 1
Feb 19 17:25:25 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 0
Feb 19 18:19:13 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 3
Feb 19 18:19:13 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 2
Feb 19 18:19:13 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 1
Feb 19 18:19:13 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 0
### I/O failed at 18:22:03
Feb 19 18:22:11 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 1
Feb 19 18:22:27 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 2
Feb 19 18:22:41 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 3
Feb 19 18:22:47 ICTM1608S01H2 multipathd[21557]: 3600a09800059d76f00004dad5b0fe18a: remaining active paths: 4


Also, when running multipath -ll on the volume, "Invalid argument" errors were reported, but it eventually showed the volume information.

multipath -ll 3600a09800059d76f00004dad5b0fe18a
Feb 20 10:33:46 | sdis: failed to get udev uid: Invalid argument
Feb 20 10:33:46 | sdit: failed to get udev uid: Invalid argument
Feb 20 10:33:46 | sdjc: failed to get udev uid: Invalid argument
Feb 20 10:33:46 | sdmo: failed to get udev uid: Invalid argument
Feb 20 10:33:46 | sdmp: failed to get udev uid: Invalid argument
Feb 20 10:33:46 | sdmq: failed to get udev uid: Invalid argument
Feb 20 10:33:46 | sdmr: failed to get udev uid: Invalid argument

I attached the host log "CFW_upgrade_02202019.zip"

Thanks
Hong

Comment 9 Ben Marzinski 2019-02-20 21:17:34 UTC
Looking at the logs, it is obvious that a number of messages have been lost, which makes it hard to determine exactly what's happening.  Would it be possible to recreate this with perhaps 1 multipath device, so that there are fewer messages and less chance of them getting lost?

However, from what I can see, it looks like the root of your problem might be that the path WWID cannot be acquired for a while during the firmware upgrade, and multipathd is misinterpreting that as the WWID changing.  You should be able to work around this issue by adding

disable_changed_wwids "no"

to the defaults section of your /etc/multipath.conf file.  I will make new test packages that keep multipathd from treating a failed attempt at getting the WWID as the WWID changing.
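For illustration, the suggested workaround would sit in /etc/multipath.conf roughly like this (a sketch only — merge it into your existing defaults section rather than adding a second one, and keep any options you already have):

```
defaults {
    disable_changed_wwids "no"
}
```

After editing the file, the running daemon typically needs to be told to reload its configuration (for example via `multipathd reconfigure`) for the change to take effect.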

Comment 10 Ben Marzinski 2019-02-21 02:28:24 UTC
New test rpms are available here:

http://people.redhat.com/~bmarzins/device-mapper-multipath/rpms/RHEL8/1673167/

Comment 13 hong.chung 2019-02-22 16:13:21 UTC
Hi,

I reduced the number of volumes by 2/3 and reran the test last night.  The test passed this morning.  Checking the volumes' active path counts, everything looks promising: most volumes dropped to 4 paths and recovered back to 8 during the storage firmware upgrade.  One volume went down to 2 paths but recovered afterward; is that something you want to take a look at?  I will re-add the original number of volumes to the hosts and rerun the test.

Feb 21 17:43:38 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 5
Feb 21 17:43:41 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 6
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 5
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 4
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 3
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 2
Feb 21 17:43:49 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 3
Feb 21 17:43:53 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 4
Feb 21 17:46:39 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 5
Feb 21 17:46:56 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 6
Feb 21 17:47:03 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 7
Feb 21 17:47:08 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 8

# multipath -ll 3600a09800059d72a0000ff9a5c2f58b4
3600a09800059d72a0000ff9a5c2f58b4 dm-28 NETAPP,INF-01-00
size=96G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 10:0:3:0   sdch 69:80   active ready running
| |- 11:0:0:0   sddk 71:32   active ready running
| |- 12:0:3:0   sdlb 67:400  active ready running
| `- 13:0:0:0   sdme 69:352  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 12:0:0:0   sdhu 134:64  active ready running
  |- 11:0:3:0   sdgr 132:112 active ready running
  |- 10:0:2:0   sdbe 67:128  active ready running
  `- 13:0:3:0   sdpl 130:432 active ready running


Thanks
Hong

Comment 14 hong.chung 2019-02-22 16:13:42 UTC
Created attachment 1537598 [details]
host_logs

Comment 15 Ben Marzinski 2019-02-22 21:14:41 UTC
It's a timing issue.  It looks like, due to the number of events and paths that multipathd is dealing with, it takes multipathd a while to get around to checking the last two paths that come back online; this is why you drop down to two paths. Here's a timeline:

kernel detects failed IO on all the paths to one controller (we'll call it controller A)

Feb 21 17:40:48 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 71:32.
Feb 21 17:40:48 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 67:400.
Feb 21 17:40:48 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 69:80.
Feb 21 17:40:48 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 69:352.

Multipathd acknowledges the errors and marks the paths down

Feb 21 17:40:48 ICTM1608S01H2 multipathd[5732]: sdme: mark as failed
Feb 21 17:40:48 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 7
Feb 21 17:40:48 ICTM1608S01H2 multipathd[5732]: sdlb: mark as failed
Feb 21 17:40:48 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 6
Feb 21 17:40:48 ICTM1608S01H2 multipathd[5732]: sddk: mark as failed
Feb 21 17:40:48 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 5
Feb 21 17:40:48 ICTM1608S01H2 multipathd[5732]: sdch: mark as failed
Feb 21 17:40:48 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 4

First path from controller A comes back

Feb 21 17:43:38 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: sdch - rdac checker reports path is up
Feb 21 17:43:38 ICTM1608S01H2 multipathd[5732]: 69:80: reinstated
Feb 21 17:43:38 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 5

Second path from controller A comes back

Feb 21 17:43:41 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: sdlb - rdac checker reports path is up
Feb 21 17:43:41 ICTM1608S01H2 multipathd[5732]: 67:400: reinstated
Feb 21 17:43:41 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 6

kernel detects failed IO on all the paths to controller B

Feb 21 17:43:43 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 134:64.
Feb 21 17:43:43 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 67:128.
Feb 21 17:43:43 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 130:432.
Feb 21 17:43:43 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 132:112.

Multipathd acknowledges the errors and marks the paths down

Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: sdbe: mark as failed
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 5
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: sdgr: mark as failed
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 4
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: sdhu: mark as failed
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 3
Feb 21 17:43:45 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 2

Third path from controller A comes back

Feb 21 17:43:49 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: sddk - rdac checker reports path is up
Feb 21 17:43:49 ICTM1608S01H2 multipathd[5732]: 71:32: reinstated
Feb 21 17:43:49 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 3

Fourth path from controller A comes back

Feb 21 17:43:53 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: sdme - rdac checker reports path is up
Feb 21 17:43:53 ICTM1608S01H2 multipathd[5732]: 69:352: reinstated
Feb 21 17:43:53 ICTM1608S01H2 multipathd[5732]: 3600a09800059d72a0000ff9a5c2f58b4: remaining active paths: 4

Comment 16 Ben Marzinski 2019-02-22 21:20:33 UTC
Multipathd attempts to check paths within the checker interval.  However, the rdac checker is (and has always been) a synchronous checker, which means only one path can be checked at a time, and after each checked path, multipathd has to handle the events that happened while the checker was running.  All this means that multipathd can't always check the paths within the checker interval.  It would be possible to make each rdac checker run in its own thread, like the tur checker does, so that paths could be checked asynchronously, but that won't make RHEL-8.0.

Comment 17 Steve Schremmer 2019-02-22 21:52:15 UTC
(In reply to Ben Marzinski from comment #16)
> It would be possible to make each rdac checker run in
> its own thread, like the tur checker does, so that paths could be checked
> asynchronously, but that won't make RHEL-8.0.

Before going down that route, maybe we should look into moving E-Series to use the TUR checker like the other ALUA arrays. We've run with the TUR checker before by accident, but it isn't 100% "clean".

Comment 18 Ben Marzinski 2019-02-22 23:25:44 UTC
I've updated the RHEL-8.0 build with the two fixes from this bug. The first keeps multipathd from double-counting lost paths when a path is not offline but also not usable. The second keeps multipathd from viewing a failed attempt at getting the WWID as getting an empty WWID.

Comment 20 hong.chung 2019-02-25 22:16:41 UTC
Hi,

I reran the test with the original number of volumes, and several of the volumes are not getting their paths back.

### Example one of the volumes:

Feb 25 10:50:43 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 7
Feb 25 10:50:43 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 6
Feb 25 10:50:43 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 5
Feb 25 10:50:43 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 11:18:08 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 11:21:22 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 5
Feb 25 11:21:27 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 11:21:28 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 3
Feb 25 11:21:28 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 2
Feb 25 11:21:29 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 1
Feb 25 11:44:22 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 11:44:22 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 3
Feb 25 11:44:22 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 2
Feb 25 11:44:22 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 1
Feb 25 11:50:41 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 5
Feb 25 12:10:43 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 12:37:30 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 12:40:36 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 12:40:40 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 3
Feb 25 12:40:41 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 2
Feb 25 12:40:45 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 1
Feb 25 13:03:26 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 13:29:54 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 13:55:56 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4
Feb 25 13:55:56 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 3
Feb 25 13:55:56 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 2
Feb 25 13:55:56 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 1
Feb 25 13:58:58 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 2
Feb 25 14:02:26 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 5
Feb 25 14:22:44 ICTM1608S01H2 multipathd[25411]: 3600a09800059d72a000049805c628335: remaining active paths: 4


Attached the host log CFW_Upgrade_02252019.zip

Thanks
Hong

Comment 21 hong.chung 2019-02-25 22:32:26 UTC
Created attachment 1538614 [details]
host_logs

Comment 22 Ben Marzinski 2019-02-26 07:21:18 UTC
Hopefully third time's the charm.  I overlooked a case where multipathd could still disable changed wwids when it failed to get a wwid. These patches fix that.

New test rpms are available here:

http://people.redhat.com/~bmarzins/device-mapper-multipath/rpms/RHEL8/1673167/

Please retry with these.

Comment 23 hong.chung 2019-02-26 22:26:02 UTC
Hi,

Thanks for the update.  I reran the test with the new rpms; paths seem to recover better, but I saw one of the volumes go down to 0 paths once.  Also, the active path counts differ strangely between the messages log and multipath -ll.

### One of the volumes went down to 0 paths:

Feb 26 13:59:59 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 8
Feb 26 14:00:01 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 7
Feb 26 14:00:01 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 6
Feb 26 14:00:02 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 5
Feb 26 14:00:02 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 4
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 7
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 6
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 5
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 4
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 3
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 2
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 1
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 0
Feb 26 14:25:59 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 1
Feb 26 14:49:20 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 7
Feb 26 14:49:20 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 6
Feb 26 14:49:20 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 5
Feb 26 14:49:20 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 4
Feb 26 14:52:16 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 5
Feb 26 14:52:33 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 6
Feb 26 14:52:35 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 5
Feb 26 15:15:44 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 7

### It seems the 8th path is no longer listed as active in the messages file:

#### On the last upgrade, multipath -ll shows all 8 paths active:
3600a098000d8580e00005bf95c62c04b dm-118 NETAPP,INF-01-00
size=2.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 13:0:0:248 sdajq 131:896  active ready running
| |- 14:0:2:248 sdbty 70:1920  active ready running
| |- 10:0:2:248 sdnp  71:432   active ready running
| `- 12:0:1:248 sdaff 68:848   active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 10:0:0:248 sddp  71:112   active ready running
  |- 12:0:0:248 sdyq  129:672  active ready running
  |- 13:0:1:248 sdasm 129:1248 active ready running
  `- 14:0:0:248 sdbif 67:1648  active ready running
 
#### From the messages log, the remaining active path count is 7:

Feb 26 16:37:47 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdyq - rdac checker reports path is down: ctlr is in startup sequence
Feb 26 16:37:47 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdasm - rdac checker reports path is down: ctlr is in startup sequence
Feb 26 16:37:47 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdbif - rdac checker reports path is down: ctlr is in startup sequence
Feb 26 16:38:10 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sddp - rdac checker reports path is up
Feb 26 16:38:10 ICTM1608S01H1 multipathd[22580]: 71:112: reinstated
Feb 26 16:38:10 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 5
Feb 26 16:38:10 ICTM1608S01H1 kernel: device-mapper: multipath: Reinstating path 71:112.
Feb 26 16:38:10 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: updating path groups for priority change
Feb 26 16:38:10 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 3 1 service-time 0 1 1 71:112 1 service-time 0 3 1 129:672 1 129:1248 1 67:1648 1 service-time 0 4 1 71:432 1 68:848 1 131:896 1 70:1920 1]
Feb 26 16:38:10 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 129:672.
Feb 26 16:38:10 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 129:1248.
Feb 26 16:38:10 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 67:1648.
Feb 26 16:38:17 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005be55c62bf5d: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 3 1 service-time 0 2 1 68:240 1 69:672 1 service-time 0 2 1 71:1120 1 134:1456 1 service-time 0 4 1 69:272 1 65:800 1 69:784 1 68:1792 1]
Feb 26 16:38:17 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 71:1120.
Feb 26 16:38:18 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdyq - rdac checker reports path is up
Feb 26 16:38:18 ICTM1608S01H1 multipathd[22580]: 129:672: reinstated
Feb 26 16:38:18 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 6
Feb 26 16:38:18 ICTM1608S01H1 kernel: device-mapper: multipath: Reinstating path 129:672.
Feb 26 16:38:18 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: updating path groups for priority change
Feb 26 16:38:19 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 3 1 service-time 0 2 1 71:112 1 129:672 1 service-time 0 2 1 129:1248 1 67:1648 1 service-time 0 4 1 71:432 1 68:848 1 131:896 1 70:1920 1]
Feb 26 16:38:19 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 129:1248.
Feb 26 16:38:19 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 67:1648.
Feb 26 16:38:22 ICTM1608S01H1 multipathd[22580]: 71:1120: reinstated
Feb 26 16:38:22 ICTM1608S01H1 kernel: device-mapper: multipath: Reinstating path 71:1120.
Feb 26 16:38:22 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005be55c62bf5d: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 3 1 service-time 0 3 1 68:240 1 69:672 1 71:1120 1 service-time 0 1 1 134:1456 1 service-time 0 4 1 69:272 1 65:800 1 69:784 1 68:1792 1]
Feb 26 16:38:23 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdasm - rdac checker reports path is up
Feb 26 16:38:23 ICTM1608S01H1 multipathd[22580]: 129:1248: reinstated
Feb 26 16:38:23 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 7
Feb 26 16:38:23 ICTM1608S01H1 kernel: device-mapper: multipath: Reinstating path 129:1248.
Feb 26 16:38:23 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: updating path groups for priority change
Feb 26 16:38:23 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 3 1 service-time 0 3 1 71:112 1 129:672 1 129:1248 1 service-time 0 1 1 67:1648 1 service-time 0 4 1 71:432 1 68:848 1 131:896 1 70:1920 1]
Feb 26 16:38:23 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 67:1648.


Several other volumes are also listed with 7 paths in the messages log.

Thanks
Hong

Comment 24 hong.chung 2019-02-26 22:26:40 UTC
Created attachment 1538988 [details]
host_logs

Comment 25 Ben Marzinski 2019-02-27 00:04:17 UTC
When you get into this state, where some paths simply aren't coming back, could you run

# multipathd show paths

and post the results.

Comment 26 Ben Marzinski 2019-02-27 00:05:21 UTC
Please also post the messages and the multipath -ll output like you have been.

Comment 27 Ben Marzinski 2019-02-27 01:35:02 UTC
As for the number of paths dropping to 0, that is again a timing issue.

Here is when you lose the first four paths:
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: sdnp: mark as failed
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 7
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: sdaff: mark as failed
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 6
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: sdajq: mark as failed
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 5
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: sdbty: mark as failed
Feb 26 14:22:53 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 4

Here is the last path check where they are down:
Feb 26 14:25:41 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdnp - rdac checker reports path is down: ctlr is in startup sequence
Feb 26 14:25:41 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdaff - rdac checker reports path is down: ctlr is in startup sequence
Feb 26 14:25:41 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdajq - rdac checker reports path is down: ctlr is in startup sequence
Feb 26 14:25:41 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdbty - rdac checker reports path is down: ctlr is in startup sequence

Here is where the kernel gets I/O errors on the four remaining paths:
Feb 26 14:25:50 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 129:1248.
Feb 26 14:25:50 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 67:1648.
Feb 26 14:25:50 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 71:112.
Feb 26 14:25:50 ICTM1608S01H1 kernel: device-mapper: multipath: Failing path 129:672.

Here is where multipathd marks those paths down:
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: sddp: mark as failed
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 3
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: sdyq: mark as failed
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 2
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: sdasm: mark as failed
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 1
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: sdbif: mark as failed
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: Entering recovery mode: max_retries=30
Feb 26 14:25:51 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 0

Here is where multipathd sees the paths that failed earlier start to recover:
Feb 26 14:25:59 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: sdnp - rdac checker reports path is up
Feb 26 14:25:59 ICTM1608S01H1 multipathd[22580]: 71:432: reinstated
Feb 26 14:25:59 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: queue_if_no_path enabled
Feb 26 14:25:59 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: Recovered to normal mode
Feb 26 14:25:59 ICTM1608S01H1 multipathd[22580]: 3600a098000d8580e00005bf95c62c04b: remaining active paths: 1

Here is one of the warnings that multipathd prints, saying that it cannot check the paths fast enough:
Feb 26 14:32:05 ICTM1608S01H1 multipathd[22580]: path checkers took longer than 44 seconds, consider increasing max_polling_interval
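
One way to act on that warning is to raise `max_polling_interval` in /etc/multipath.conf. This is only a sketch: the value below is illustrative and should be sized to how long the checker runs actually take on this host (44 seconds were observed above, so anything below that will keep triggering the warning):

```
defaults {
    # Upper bound, in seconds, on the interval between path checks.
    # Illustrative value only; tune to the observed checker run time.
    max_polling_interval 90
}
```

After editing the file, apply the change with `multipathd reconfigure` (or restart the service).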

Like I said, the only good solution here is to use an asynchronous checker, so the paths can be checked in parallel.


Also, when you get the results of

# multipathd show paths

and the failed paths have a chk_st of "i/o pending", can you run
# cat /sys/block/<dev_name>/device/state

on them to see what their state is.  This doesn't seem horribly likely, since the "multipath -ll" output shows them in the "ready" state.
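
That per-device check can be scripted; here is a minimal sketch. The sysfs root is parameterized only so the loop can be exercised against a mock tree; on a real host it is simply /sys, and the device names are whatever "multipathd show paths" reports as failed:

```shell
# print_path_states: read the SCSI device state from sysfs for each
# named block device (prints e.g. "running" or "offline" per device).
print_path_states() {
    root=$1; shift
    for dev in "$@"; do
        state_file="$root/block/$dev/device/state"
        if [ -r "$state_file" ]; then
            printf '%s: %s\n' "$dev" "$(cat "$state_file")"
        else
            printf '%s: no readable state file\n' "$dev"
        fi
    done
}
```

Usage on a live system would look like: print_path_states /sys sdnp sdaff sdajq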

Comment 28 hong.chung 2019-02-27 16:39:29 UTC
Hi,

The test finished last night with no I/O errors. Looking at the messages log, some of the volumes finished with 7 active paths reported.

# One of the volumes finished with 7 paths:

Feb 27 00:07:04 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: remaining active paths: 6
Feb 27 00:07:04 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: remaining active paths: 5
Feb 27 00:07:04 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: remaining active paths: 4
Feb 27 00:13:08 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: remaining active paths: 5
Feb 27 00:13:20 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: remaining active paths: 6
Feb 27 00:13:30 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: remaining active paths: 7


# multipath -ll shows all 8 paths as active:

3600a09800059d72a000049755c6282e0 dm-221 NETAPP,INF-01-00
size=2.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 10:0:2:68  sdli  68:256   active ready running
| |- 11:0:3:68  sdaju 131:960  active ready running
| |- 12:0:0:68  sdaos 67:1216  active ready running
| `- 13:0:3:68  sdbww 131:1856 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 10:0:3:68  sdqg  132:256  active ready running
  |- 11:0:0:68  sdve  68:512   active ready running
  |- 12:0:3:68  sdbdi 131:1408 active ready running
  `- 13:0:0:68  sdbig 67:1664  active ready running


# multipathd show paths (for this volume) shows 8 paths as active as well:

hcil       dev   dev_t    pri dm_st  chk_st dev_st  next_check 
10:0:2:68  sdli  68:256   50  active ready  running XXXXXXXXXX 20/20
10:0:3:68  sdqg  132:256  10  active ready  running XXXXX..... 10/20
11:0:0:68  sdve  68:512   10  active ready  running XXXXX..... 10/20
11:0:3:68  sdaju 131:960  50  active ready  running XXXXXXXXXX 20/20
12:0:0:68  sdaos 67:1216  50  active ready  running XXXXXXXXXX 20/20
12:0:3:68  sdbdi 131:1408 10  active ready  running XXXXX..... 10/20
13:0:0:68  sdbig 67:1664  10  active ready  running XXXXX..... 10/20
13:0:3:68  sdbww 131:1856 50  active ready  running XXXXXXXXXX 20/20

Thanks
Hong

Comment 29 hong.chung 2019-02-27 16:42:19 UTC
Created attachment 1539201 [details]
host_logs

Comment 30 Ben Marzinski 2019-02-27 20:27:51 UTC
So clearly, some messages are missing from the logs, but that last failed path (sdbig 67:1664) doesn't actually appear to be down at the end of these logs.

Here is where the 6th path gets restored:

Feb 27 00:13:20 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: sdve - rdac checker reports path is up
Feb 27 00:13:20 ICTM1608S01H2 multipathd[23065]: 68:512: reinstated
Feb 27 00:13:20 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: remaining active paths: 6
Feb 27 00:13:20 ICTM1608S01H2 kernel: device-mapper: multipath: Reinstating path 68:512.

After that happens, multipathd sees the path's priority has changed and reloads the table. Notice afterwards that there are two paths that are still failed, and multipathd notifies the kernel of this.

Feb 27 00:13:20 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: updating path groups for priority change
Feb 27 00:13:21 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: load table [0 4194304 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 3 1 service-time 0 4 1 68:256 1 131:960 1 67:1216 1 131:1856 1 service-time 0 2 1 132:256 1 68:512 1 service-time 0 2 1 131:1408 1 67:1664 1]
Feb 27 00:13:21 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 131:1408.
Feb 27 00:13:21 ICTM1608S01H2 kernel: device-mapper: multipath: Failing path 67:1664.

Here is the last checker message that shows sdbig as down:

Feb 27 00:12:51 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: sdbig - rdac checker reports path is down: ctlr is in startup sequence

Then here is the other failed path getting restored:

Feb 27 00:13:30 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: sdbdi - rdac checker reports path is up
Feb 27 00:13:30 ICTM1608S01H2 multipathd[23065]: 131:1408: reinstated
Feb 27 00:13:30 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: remaining active paths: 7

After this, multipathd again sees that the path's priority has changed and reloads the device. However, we only get the first line of that output:

Feb 27 00:13:30 ICTM1608S01H2 multipathd[23065]: 3600a09800059d72a000049755c6282e0: updating path groups for priority change

Clearly, log messages have been dropped here. However, we don't need them, because we can directly ask multipathd what the state of the path is. That's what

# multipathd show paths

does.  That is the definitive answer.  It shows that all of the paths are up, which agrees perfectly with the "multipath -ll" output.  That means that the messages where sdbig was restored must have been dropped. But it clearly was restored, since the kernel is actually using it and the checker has it listed as active (according to both "multipathd show paths" and "multipath -ll")

There might be some knobs you can twist to make journalctl less likely to lose messages, but as far as I can see everything looks good here from multipath's side.
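
For reference, the main journald rate-limiting knobs live in /etc/systemd/journald.conf. This is a sketch with illustrative values (on RHEL 8 the defaults are on the order of 10000 messages per 30-second interval; check journald.conf(5) on the host):

```
[Journal]
# Allow more messages per interval before journald starts dropping them.
# Illustrative values; setting RateLimitBurst=0 disables rate limiting.
RateLimitIntervalSec=30s
RateLimitBurst=50000
```

Apply the change with `systemctl restart systemd-journald`. Note that if /var/log/messages is written by rsyslog, rsyslog has its own rate limiting that may also need adjusting.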

Comment 31 Lin Li 2019-03-05 07:31:32 UTC
Hello NETAPP,
Does the fixed version meet your requirements?
If so, I will change the status to VERIFIED.
Thanks!

Comment 32 hong.chung 2019-03-05 15:43:18 UTC
Hi,

Everything is good so far with the fix for this test.

Thanks
Hong

