Bug 1701604
| Summary: | [HPESD 7.6 Bug] Qlogic: After Direct Attach Storage faults, device-mapper-multipath does not reinstate paths after login |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Alan Laws <alan.laws> |
| Component: | device-mapper-multipath |
| Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED ERRATA |
| QA Contact: | Lin Li <lilin> |
| Severity: | high |
| Priority: | unspecified |
| Version: | 7.6 |
| CC: | agk, alan.laws, bmarzins, girish.basrur, heinzm, hmadhani, jkachuck, joseph.szczypek, karen.skweres, lilin, lucas.ham.leborgne, mknutson, msnitzer, phinchman, prajnoha, rhandlin, tom.vaden, trinh.dao, william.gens |
| Target Milestone: | rc |
| Keywords: | OtherQA |
| Target Release: | 7.8 |
| Hardware: | x86_64 |
| OS: | Linux |
| Fixed In Version: | device-mapper-multipath-0.4.9-128.el7 |
| Doc Type: | Bug Fix |
| Doc Text: | Cause: Multipathd can occasionally lose track of the number of active paths. Consequence: multipathd can disable queueing while an active path still exists. Fix: Multipathd now rechecks the number of active paths each time before it changes the value. Result: multipathd can no longer disable queueing while active paths still exist. |
| Last Closed: | 2020-03-31 19:47:09 UTC |
| Type: | Bug |
| Bug Blocks: | 1704833, 1754591 |
The logs in the sosreport cover a period of 4 days, and I'm having trouble figuring out what is supposed to be happening at what time. Would it be possible to run

# multipath -ll

so I can see the initial state of the devices, then reproduce this and send a copy of /var/log/messages, along with the times when the various steps in reproducing this occurred? Thanks.
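A minimal way to collect what is being asked for here might look like the sketch below. It is illustrative only: the output directory and the 30-second interval are assumptions, not anything requested in this bug. It records a timestamped multipath -ll baseline and then keeps taking snapshots while the failover test runs, so the snapshots can later be lined up against /var/log/messages.

```bash
#!/bin/bash
# Minimal sketch of the data collection requested above. The output
# directory and the 30-second interval are assumptions, not part of the
# original request.
OUT=/tmp/mpath-debug            # hypothetical output directory
mkdir -p "$OUT"

# Record the initial state of the multipath devices.
multipath -ll > "$OUT/multipath-ll.initial.$(date +%Y%m%d-%H%M%S)"

# While reproducing the failover, keep timestamped snapshots so they can
# later be lined up against /var/log/messages.
while true; do
    multipath -ll > "$OUT/multipath-ll.$(date +%Y%m%d-%H%M%S)"
    sleep 30
done
```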
Created attachment 1558678 [details]
/var/log/messages from faults on April 25
Hi Ben, It took 3 failovers to trigger this issue today. I have attached /var/log/messages and an sosreport for your information. ++ Controller failover triggered at 08:29:55 ++ Link goes down from controller rebooting: Apr 25 08:30:17 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). ++ ALUA transitions from Standby->Active on remaining controller Apr 25 08:30:22 rtp-hpe-ops24 kernel: sd 3:0:0:3: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 65:32: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 65:80: reinstated ++ Link comes back up: Apr 25 08:35:43 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-2869:1: LOOP READY. Apr 25 08:35:43 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-2869:6: LOOP READY. < No path reinstatements seen > ++ At this point, our test case runs another failover on the storage which causes a complete outage as there are now 0 active paths and the host has to be rebooted to recover. ++ Below are the 'multipath -ll' outputs from the failing scenario described above: Initial state: [root@rtp-hpe-ops24 ~]# multipath -ll mpathe (2635e4cb4327339176c9ce90007a57d5a) dm-7 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:6 sdg 8:96 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:6 sdl 8:176 active ghost running mpathd (244c4c43fb537a0106c9ce90007a57d5a) dm-6 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:5 sdf 8:80 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:5 sdk 8:160 active ghost running mpathc (28d39d8bc9abdd4c96c9ce90007a57d5a) dm-5 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:4 sde 8:64 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:4 sdj 8:144 active ghost running mpathb (206f932dad71084546c9ce90007a57d5a) dm-4 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:3 sdd 8:48 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:3 sdi 8:128 active ghost running mpatha (25c9b224eb73b377e6c9ce90007a57d5a) dm-3 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:2 sdc 8:32 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:2 sdh 8:112 active ghost running mpathj (2fbedf10437f167596c9ce90007a57d5a) dm-12 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:6 sdr 65:16 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:6 sdv 65:80 active 
ghost running mpathi (22da894d1b63d349f6c9ce90007a57d5a) dm-11 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:5 sdq 65:0 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:5 sdu 65:64 active ghost running mpathh (2524ec74989c7a95b6c9ce90007a57d5a) dm-10 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:4 sdo 8:224 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:4 sdt 65:48 active ghost running mpathg (2dfe32757561c617b6c9ce90007a57d5a) dm-9 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:3 sdn 8:208 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:3 sds 65:32 active ghost running mpathf (213dad4f3bbdf3d5d6c9ce90007a57d5a) dm-8 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:2 sdm 8:192 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:2 sdp 8:240 active ghost running Failing state recorded after the array returned to healthy at 08:35:43 : [root@rtp-hpe-ops24 ~]# multipath -ll mpathe (2635e4cb4327339176c9ce90007a57d5a) dm-7 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:6 sdg 8:96 failed faulty running `- 3:0:0:6 sdl 8:176 active ready running mpathd (244c4c43fb537a0106c9ce90007a57d5a) dm-6 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:5 sdf 8:80 failed faulty running `- 3:0:0:5 sdk 8:160 active ready running mpathc (28d39d8bc9abdd4c96c9ce90007a57d5a) dm-5 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:4 sde 8:64 failed faulty running `- 3:0:0:4 sdj 8:144 active ready running mpathb (206f932dad71084546c9ce90007a57d5a) dm-4 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:3 sdd 8:48 failed faulty running `- 3:0:0:3 sdi 8:128 active ready running mpatha (25c9b224eb73b377e6c9ce90007a57d5a) dm-3 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:2 sdc 8:32 failed faulty running `- 3:0:0:2 sdh 8:112 active ready running mpathj (2fbedf10437f167596c9ce90007a57d5a) dm-12 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:6 sdr 65:16 failed faulty running `- 7:0:0:6 sdv 65:80 active ready running mpathi (22da894d1b63d349f6c9ce90007a57d5a) dm-11 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:5 sdq 65:0 failed faulty running `- 7:0:0:5 sdu 65:64 active ready running mpathh (2524ec74989c7a95b6c9ce90007a57d5a) dm-10 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:4 sdo 8:224 failed faulty running `- 7:0:0:4 sdt 65:48 active ready running mpathg (2dfe32757561c617b6c9ce90007a57d5a) dm-9 Nimble ,Server 
size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:3 sdn 8:208 failed faulty running `- 7:0:0:3 sds 65:32 active ready running mpathf (213dad4f3bbdf3d5d6c9ce90007a57d5a) dm-8 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:2 sdm 8:192 failed faulty running `- 7:0:0:2 sdp 8:240 active ready running Thanks for those messages and timeline. It's hard to tell what is going on with multipathd here, since it doesn't have print statements that cover whatever's happening to those devices that aren't coming back. I've created some test rpms that you can pick up here: http://people.redhat.com/~bmarzins/device-mapper-multipath/rpms/RHEL7/bz1701604/ They fix one issue that I noticed while looking at your logs, but which I don't think is related to the devices failing to reinstate. But they also add some additional output that will hopefully catch what's going on here. If you could install them, recreate this issue, and post the output like you did above, that would be really helpful. If you can't install test rpms on your systems, let me know, and we'll try something different. Ben, Thank you for providing the debug packages. After installing the provided software, I have been unable to reproduce this issue after 5 iterations this morning. I will run this for longer and let you know if I see it come up again. I can see a lot of messages like this now while the controller is rebooting with failed paths. Outside of that, the host appears to behave as I would expect and reinstates paths quickly. Apr 29 10:51:12 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sds: unrecognized bus type (1) Apr 29 10:51:12 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathj: sds - state 2, no message Below are the log snippets I pulled while trying to reproduce this morning, everything appears to be fine so far though: ++ Controller failover triggered at 10:48:05: ++ Link goes down from controller rebooting: Apr 29 10:48:05 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). Apr 29 10:48:05 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:d8:00.0]-500b:6: LOOP DOWN detected (2 7 0 0). 
++ Device-Mapper notes 1 remaining path: Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdd: mark as failed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathb: remaining active paths: 1 Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdc: mark as failed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpatha: remaining active paths: 1 Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdm: mark as failed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathf: remaining active paths: 1 Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdn: mark as failed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathg: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:3: Asymmetric access state changed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:2: Asymmetric access state changed Apr 29 10:48:23 rtp-hpe-ops24 kernel: sd 3:0:0:5: Asymmetric access state changed Apr 29 10:48:23 rtp-hpe-ops24 kernel: sd 7:0:0:6: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 10:48:24 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 10:48:24 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 10:48:24 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 10:48:26 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 10:48:26 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 10:48:26 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 10:48:27 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 10:48:27 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 10:48:27 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 10:48:27 rtp-hpe-ops24 multipathd: 65:16: reinstated ++ Link comes back up: Apr 29 10:53:28 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:12:00.0]-500a:1: LOOP UP detected (16 Gbps). Apr 29 10:53:28 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps). ++ Device-mapper reinstates paths from link coming back up: Apr 29 10:53:29 rtp-hpe-ops24 multipathd: 65:0: reinstated Apr 29 10:53:30 rtp-hpe-ops24 multipathd: 8:48: reinstated Apr 29 10:53:30 rtp-hpe-ops24 multipathd: 8:64: reinstated Apr 29 10:53:30 rtp-hpe-ops24 multipathd: 8:208: reinstated Apr 29 10:53:30 rtp-hpe-ops24 multipathd: 8:224: reinstated Apr 29 10:53:31 rtp-hpe-ops24 multipathd: 8:32: reinstated Apr 29 10:53:31 rtp-hpe-ops24 multipathd: 8:192: reinstated Apr 29 10:53:33 rtp-hpe-ops24 multipathd: 8:80: reinstated Apr 29 10:53:33 rtp-hpe-ops24 multipathd: 8:96: reinstated Apr 29 10:53:33 rtp-hpe-ops24 multipathd: 65:32: reinstated ++ Controller failover triggered at 11:03:37: ++ Link goes down from controller rebooting: Apr 29 11:03:37 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:d8:00.1]-500b:7: LOOP DOWN detected (2 7 0 0). Apr 29 11:03:37 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:12:00.1]-500b:3: LOOP DOWN detected (2 7 0 0). 
++ Device-Mapper notes 1 remaining path: Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdr: mark as failed Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathg: remaining active paths: 1 Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdp: mark as failed Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathf: remaining active paths: 1 Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdh: mark as failed Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpatha: remaining active paths: 1 Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdi: mark as failed Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathb: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 11:03:42 rtp-hpe-ops24 kernel: sd 6:0:0:2: Asymmetric access state changed Apr 29 11:03:42 rtp-hpe-ops24 kernel: sd 1:0:0:3: Asymmetric access state changed Apr 29 11:03:56 rtp-hpe-ops24 kernel: sd 1:0:0:3: Asymmetric access state changed Apr 29 11:03:56 rtp-hpe-ops24 kernel: sd 6:0:0:3: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 8:48: reinstated Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 8:64: reinstated Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 8:208: reinstated Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 8:224: reinstated Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 65:0: reinstated Apr 29 11:03:58 rtp-hpe-ops24 multipathd: 8:32: reinstated Apr 29 11:03:58 rtp-hpe-ops24 multipathd: 8:192: reinstated Apr 29 11:04:00 rtp-hpe-ops24 multipathd: 8:80: reinstated Apr 29 11:04:00 rtp-hpe-ops24 multipathd: 8:96: reinstated Apr 29 11:04:00 rtp-hpe-ops24 multipathd: 65:32: reinstated ++ Link comes back up: Apr 29 11:09:00 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.1]-500a:3: LOOP UP detected (16 Gbps). Apr 29 11:09:01 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.1]-500a:7: LOOP UP detected (16 Gbps). ++ Device-mapper reinstates paths from link coming back up: Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 11:09:06 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 11:09:06 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 11:09:06 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 11:09:06 rtp-hpe-ops24 multipathd: 65:16: reinstated ++ Controller failover triggered at 11:16:43: ++ Link goes down from controller rebooting: Apr 29 11:16:43 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). Apr 29 11:16:43 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500b:6: LOOP DOWN detected (2 7 0 0). 
++ Device-Mapper notes 1 remaining path: Apr 29 11:16:48 rtp-hpe-ops24 multipathd: sdd: mark as failed Apr 29 11:16:48 rtp-hpe-ops24 multipathd: mpathb: remaining active paths: 1 Apr 29 11:16:48 rtp-hpe-ops24 multipathd: sdc: mark as failed Apr 29 11:16:48 rtp-hpe-ops24 multipathd: mpatha: remaining active paths: 1 Apr 29 11:16:48 rtp-hpe-ops24 multipathd: sdn: mark as failed Apr 29 11:16:48 rtp-hpe-ops24 multipathd: mpathg: remaining active paths: 1 Apr 29 11:16:48 rtp-hpe-ops24 multipathd: sdm: mark as failed Apr 29 11:16:48 rtp-hpe-ops24 multipathd: mpathf: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 11:16:48 rtp-hpe-ops24 kernel: sd 7:0:0:2: Asymmetric access state changed Apr 29 11:16:48 rtp-hpe-ops24 kernel: sd 3:0:0:3: Asymmetric access state changed Apr 29 11:17:03 rtp-hpe-ops24 kernel: sd 3:0:0:3: Asymmetric access state changed Apr 29 11:17:03 rtp-hpe-ops24 kernel: sd 7:0:0:2: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 11:17:04 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 11:17:04 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 65:16: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 11:17:06 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 11:17:06 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 11:17:06 rtp-hpe-ops24 multipathd: 65:80: reinstated ++ Link comes back up: Apr 29 11:22:10 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500a:1: LOOP UP detected (16 Gbps). Apr 29 11:22:10 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps). ++ Device-mapper reinstates paths from link coming back up: Apr 29 11:22:13 rtp-hpe-ops24 multipathd: 8:64: reinstated Apr 29 11:22:13 rtp-hpe-ops24 multipathd: 8:224: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:32: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:48: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:80: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:96: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:192: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:208: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 65:0: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 65:32: reinstated ++ Controller failover triggered at 11:29:55: ++ Link goes down from controller rebooting: Apr 29 11:29:55 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.1]-500b:7: LOOP DOWN detected (2 7 0 0). Apr 29 11:29:55 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.1]-500b:3: LOOP DOWN detected (2 7 0 0). 
++ Device-Mapper notes 1 remaining path: Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathc: remaining active paths: 1 Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdp: unrecognized bus type (1) Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathf: sdp - state 2, no message Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathf: switch to path group #2 Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdr: unrecognized bus type (1) Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathg: sdr - state 2, no message Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathg: switch to path group #2 Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdt: unrecognized bus type (1) Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathh: sdt - state 2, no message Apr 29 11:30:00 rtp-hpe-ops24 multipathd: checker failed path 65:48 in map mpathh Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathh: remaining active paths: 1 Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdu: unrecognized bus type (1) Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathi: sdu - state 2, no message Apr 29 11:30:00 rtp-hpe-ops24 multipathd: checker failed path 65:64 in map mpathi Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathi: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 11:30:00 rtp-hpe-ops24 kernel: sd 1:0:0:2: Asymmetric access state changed Apr 29 11:30:00 rtp-hpe-ops24 kernel: sd 6:0:0:2: Asymmetric access state changed Apr 29 11:30:13 rtp-hpe-ops24 kernel: sd 1:0:0:6: Asymmetric access state changed Apr 29 11:30:15 rtp-hpe-ops24 kernel: sd 6:0:0:2: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 11:30:14 rtp-hpe-ops24 multipathd: 8:96: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:32: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:48: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:192: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:208: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:224: reinstated Apr 29 11:30:17 rtp-hpe-ops24 multipathd: 8:64: reinstated Apr 29 11:30:17 rtp-hpe-ops24 multipathd: 8:80: reinstated Apr 29 11:30:17 rtp-hpe-ops24 multipathd: 65:0: reinstated Apr 29 11:30:17 rtp-hpe-ops24 multipathd: 65:32: reinstated ++ Link comes back up: Apr 29 11:35:28 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.1]-500a:3: LOOP UP detected (16 Gbps). Apr 29 11:35:29 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.1]-500a:7: LOOP UP detected (16 Gbps). 
++ Device-mapper reinstates paths from link coming back up: Apr 29 11:35:31 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 11:35:32 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 11:35:32 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 11:35:33 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 11:35:33 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 11:35:33 rtp-hpe-ops24 multipathd: 65:16: reinstated Apr 29 11:35:33 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 11:35:34 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 11:35:34 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 11:35:35 rtp-hpe-ops24 multipathd: 8:176: reinstated ++ Controller failover triggered at 11:03:37: ++ Link goes down from controller rebooting: ++ Device-Mapper notes 1 remaining path: ++ ALUA transitions from Standby->Active on remaining controller: ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: ++ Link comes back up: ++ Device-mapper reinstates paths from link coming back up: ++ Controller failover triggered at 11:03:37: ++ Link goes down from controller rebooting: ++ Device-Mapper notes 1 remaining path: ++ ALUA transitions from Standby->Active on remaining controller: ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: ++ Link comes back up: ++ Device-mapper reinstates paths from link coming back up: ++ Controller failover triggered at 11:03:37: ++ Link goes down from controller rebooting: Apr 29 11:44:13 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). Apr 29 11:44:13 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500b:6: LOOP DOWN detected (2 7 0 0). ++ Device-Mapper notes 1 remaining path: Apr 29 11:44:18 rtp-hpe-ops24 multipathd: mpathf: remaining active paths: 1 Apr 29 11:44:18 rtp-hpe-ops24 multipathd: mpatha: remaining active paths: 1 Apr 29 11:44:18 rtp-hpe-ops24 multipathd: mpathb: remaining active paths: 1 Apr 29 11:44:18 rtp-hpe-ops24 multipathd: mpathg: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathc: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathd: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathe: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathh: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathi: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathj: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 11:44:18 rtp-hpe-ops24 kernel: sd 3:0:0:3: Asymmetric access state changed Apr 29 11:44:18 rtp-hpe-ops24 kernel: sd 7:0:0:2: Asymmetric access state changed Apr 29 11:44:30 rtp-hpe-ops24 kernel: sd 3:0:0:5: Asymmetric access state changed Apr 29 11:44:30 rtp-hpe-ops24 kernel: sd 7:0:0:4: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 11:44:35 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 11:44:35 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 11:44:35 rtp-hpe-ops24 multipathd: 65:16: reinstated Apr 29 11:44:36 rtp-hpe-ops24 multipathd: 8:112: reinstated 
++ Link comes back up:

Apr 29 11:49:35 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps).
Apr 29 11:49:35 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500a:1: LOOP UP detected (16 Gbps).

++ Device-mapper reinstates paths from link coming back up:

Apr 29 11:49:39 rtp-hpe-ops24 multipathd: 8:64: reinstated
Apr 29 11:49:39 rtp-hpe-ops24 multipathd: 8:224: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:32: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:48: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:80: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:96: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:192: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:208: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 65:0: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 65:32: reinstated

So the bug I fixed was that, in the initial set of messages you sent me, it was obvious that multipath's count of the number of active paths would occasionally get off (there were times when it said that there were -1 active paths, which is clearly an error). I didn't see that issue in the messages you sent in Comment #3, so I don't see why my fix for that would have fixed your actual error. The issue that I did see is that a path was failing with no messages from the checker, either during the failure or afterwards. This is what I added debugging messages for. Unfortunately, I didn't add one where I needed it. However, looking at your messages and the code, it's pretty obvious that when you see these messages

Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdp: unrecognized bus type (1)
Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathf: sdp - state 2, no message

what's actually happening is that the device is in a sysfs state that multipath treats as a failed path, but doesn't print any messages for. When these messages appear, run:

# cat /sys/block/<devname>/devices/state

(so, for instance, for the above messages, you would run # cat /sys/block/sdp/devices/state). I assume you will see something like "transport-offline" or "created-blocked" or some other possible state that multipath just treats as a down path, without printing anything. This explains why you simply stop seeing messages about some paths. I can't explain why these paths are coming back now, but not before. In the original messages and the ones from comment 3, once a path failed with no messages (which is when these debug messages are now printed) it would stay failed. But perhaps once I know what state it is in, it will make more sense.

Hey Ben, it looks like it was wishful thinking on my behalf. The host survived 10 failovers, but on the 11th it did not recover and re-entered the state we were trying to reproduce. I left it overnight and it is currently in that state, if you want me to collect anything in particular. I am attaching the contents of /var/log/ and a new sosreport.
++ The debug multipathd package is reporting that the failed devices are remaining in state 2: Apr 30 10:12:03 rtp-hpe-ops24 multipathd: mpathe: sdg - state 2, no message Apr 30 10:12:38 rtp-hpe-ops24 multipathd: mpathd: sdf - state 2, no message Apr 30 10:12:53 rtp-hpe-ops24 multipathd: mpathc: sde - state 2, no message Apr 30 10:11:18 rtp-hpe-ops24 multipathd: mpathg: sdn - state 2, no message ++ I am currently left with 0 active/ready paths, so the filesystem has been remounted as Read-Only: Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-8): Remounting filesystem read-only Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-9): Remounting filesystem read-only Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-4): Remounting filesystem read-only ++ Multipath shows a mix of failed/faulty with failed/ghost and failed/faulty with active/ghost. Since we do not allow I/O on standby/ghost paths, the host would only see scsi rejects in this state. mpathe (2635e4cb4327339176c9ce90007a57d5a) dm-7 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:6 sdg 8:96 failed faulty running `- 3:0:0:6 sdl 8:176 failed ghost running Apr 30 09:53:33 | sdf: unrecognized bus type (1) mpathd (244c4c43fb537a0106c9ce90007a57d5a) dm-6 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:5 sdf 8:80 failed faulty running `- 3:0:0:5 sdk 8:160 failed ghost running Apr 30 09:53:33 | sde: unrecognized bus type (1) mpathc (28d39d8bc9abdd4c96c9ce90007a57d5a) dm-5 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:4 sde 8:64 failed faulty running `- 3:0:0:4 sdj 8:144 failed ghost running Apr 30 09:53:33 | sdd: unrecognized bus type (1) mpathb (206f932dad71084546c9ce90007a57d5a) dm-4 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:3 sdd 8:48 failed faulty running `- 3:0:0:3 sdi 8:128 active ghost running Apr 30 09:53:33 | sdc: unrecognized bus type (1) mpatha (25c9b224eb73b377e6c9ce90007a57d5a) dm-3 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:2 sdc 8:32 failed faulty running `- 3:0:0:2 sdh 8:112 active ghost running Apr 30 09:53:33 | sds: unrecognized bus type (1) mpathj (2fbedf10437f167596c9ce90007a57d5a) dm-12 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:6 sds 65:32 failed faulty running `- 7:0:0:6 sdv 65:80 failed ghost running Apr 30 09:53:33 | sdq: unrecognized bus type (1) mpathi (22da894d1b63d349f6c9ce90007a57d5a) dm-11 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:5 sdq 65:0 failed faulty running `- 7:0:0:5 sdu 65:64 failed ghost running Apr 30 09:53:33 | sdo: unrecognized bus type (1) mpathh (2524ec74989c7a95b6c9ce90007a57d5a) dm-10 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:4 sdo 8:224 failed faulty running `- 7:0:0:4 sdt 65:48 failed ghost running Apr 30 09:53:33 | sdn: unrecognized bus type (1) mpathg (2dfe32757561c617b6c9ce90007a57d5a) dm-9 Nimble ,Server size=98G features='1 queue_if_no_path' 
hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:3 sdn 8:208 failed faulty running `- 7:0:0:3 sdr 65:16 active ghost running Apr 30 09:53:33 | sdm: unrecognized bus type (1) mpathf (213dad4f3bbdf3d5d6c9ce90007a57d5a) dm-8 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:2 sdm 8:192 failed faulty running `- 7:0:0:2 sdp 8:240 active ghost running ++ Links go down: Apr 29 17:37:09 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). Apr 29 17:37:09 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500b:6: LOOP DOWN detected (2 7 0 0). ++ Device Mapper notes 1 remaining path: Apr 29 17:37:14 rtp-hpe-ops24 multipathd: sdd: mark as failed Apr 29 17:37:14 rtp-hpe-ops24 multipathd: mpathb: remaining active paths: 1 Apr 29 17:37:14 rtp-hpe-ops24 multipathd: sdc: mark as failed Apr 29 17:37:14 rtp-hpe-ops24 multipathd: mpatha: remaining active paths: 1 Apr 29 17:37:14 rtp-hpe-ops24 multipathd: sdm: mark as failed Apr 29 17:37:14 rtp-hpe-ops24 multipathd: mpathf: remaining active paths: 1 Apr 29 17:37:14 rtp-hpe-ops24 multipathd: sdn: mark as failed Apr 29 17:37:14 rtp-hpe-ops24 multipathd: mpathg: remaining active paths: 1 ++ ALUA begins transition from Standby->Active Apr 29 17:37:14 rtp-hpe-ops24 kernel: sd 3:0:0:2: Asymmetric access state changed Apr 29 17:37:14 rtp-hpe-ops24 kernel: sd 7:0:0:2: Asymmetric access state changed ( I do not see a Asymmetric access state change for sd 3:0:0:5 and 7:0:0:4 as I did before, is this the key? ) ++ Device-Mapper reinstates paths after ALUA transition completes: Apr 29 17:37:27 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 17:37:27 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 17:37:27 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 17:37:27 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 17:37:28 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 17:37:28 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 17:37:31 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 17:37:31 rtp-hpe-ops24 multipathd: 65:16: reinstated Apr 29 17:37:31 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 17:37:32 rtp-hpe-ops24 multipathd: 8:240: reinstated ++ Links comes back up: Apr 29 17:42:35 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500a:1: LOOP UP detected (16 Gbps). Apr 29 17:42:35 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps). ++ I do not see further reinstatement after the link has been brought up ++ Unfortunately at this point, our automation continues to run and triggers another failover of the storage array, which results in the host having 0 paths to the Active controller. ++ Links go down: Apr 29 18:13:13 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.1]-500b:7: LOOP DOWN detected (2 7 0 0). Apr 29 18:13:13 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.1]-500b:3: LOOP DOWN detected (2 7 0 0). 
++ Device Mapper notes 0 remaining active paths:

Apr 29 18:13:18 rtp-hpe-ops24 multipathd: sdi: mark as failed
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpathb: Entering recovery mode: max_retries=30
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpathb: remaining active paths: 0
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: sdh: mark as failed
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpatha: Entering recovery mode: max_retries=30
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpatha: remaining active paths: 0
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: sdr: mark as failed
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpathg: Entering recovery mode: max_retries=30
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpathg: remaining active paths: 0
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: sdp: mark as failed

++ Eventually, the filesystem is re-mounted as read-only:

Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-8): Remounting filesystem read-only
Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-9): Remounting filesystem read-only
Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-4): Remounting filesystem read-only

Were you able to grab the sysfs device state like I mentioned in Comment 7? I just noticed that I made a typo there. You should run

# cat /sys/block/<devname>/device/state

once the link is brought back up but the path does not get re-enabled.

Hey Ben, paths are either being left in 'failed,ghost' or 'failed,faulty':

[root@rtp-hpe-ops24 block]# multipath -ll | grep -i 'failed'
| `- 1:0:0:6 sdg 8:96 failed ghost running
`- 3:0:0:6 sdl 8:176 failed faulty running
| `- 1:0:0:5 sdf 8:80 failed ghost running
`- 3:0:0:5 sdk 8:160 failed faulty running
| `- 1:0:0:4 sde 8:64 failed ghost running
`- 3:0:0:4 sdj 8:144 failed faulty running
`- 3:0:0:3 sdi 8:128 failed faulty running
`- 3:0:0:2 sdh 8:112 failed faulty running

'failed,ghost' paths are showing a sysfs device state of 'running'. These paths are cabled to the Standby controller on the storage array. 'failed,faulty' paths are showing a sysfs device state of 'transport-offline'. These paths are cabled to the Active controller on the storage array.

[root@rtp-hpe-ops24 block]# for i in sdg sdl sdf sdk sde sdj sdi sdh ; do cat /sys/block/$i/device/state ; done
running
transport-offline
running
transport-offline
running
transport-offline
transport-offline
transport-offline

The failed paths are for host1 and host3, which map to:

SN1100Q FW:v8.08.204 DVR:v10.01.00.33.07.6-k2-debug
51:40:2e:c0:01:ca:7c:08
51:40:2e:c0:01:ca:7c:0a

I have set up an analyzer to try to help with tracing out the transport issues and will share my findings there.

So, when multipathd sees that a path is in the transport-offline state, it will just treat it as down (but not offline, since that suggests that user intervention is required). Clearly, multipathd should do a better job of communicating this status, but as long as the device is in that state, there's nothing that multipath can do with it. So, I will fix the issue that I mentioned in Comment 7 with this bug. If it still looks like multipath is not doing something that it should, we may need to open a new bug to track that.

Hello Alan, we don't have Nimble storage in our lab. Could you provide test results with the fixed version? Thanks in advance!

I am happy to re-validate with your fix, Ben. Just let me know once it is available and I will load it in my tests.
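Tying the checks above together (failed paths from multipath -ll, the sysfs device state, and the host-to-WWPN mapping quoted for host1 and host3), a small read-only helper can speed up this triage. This is a sketch only, under the assumption that the multipath -ll path-line format matches the outputs shown in this bug; it reads the same sysfs files referenced above (/sys/block/<dev>/device/state and /sys/class/fc_host/host<N>/port_name) and introduces nothing new.

```bash
#!/bin/bash
# Sketch: for each failed multipath path, print the SCSI device, its sysfs
# device state, and the WWPN of the Fibre Channel host it sits behind.
# Assumes path lines look like "|- 1:0:0:6 sdg 8:96 failed faulty running".

multipath -ll | awk '/failed/ {
    # Find the sdX field and print the H:C:T:L field just before it.
    for (i = 1; i <= NF; i++)
        if ($i ~ /^sd[a-z]+$/) { print $(i-1), $i }
}' | while read -r hctl dev; do
    host=${hctl%%:*}                                   # SCSI host number, e.g. "1"
    state=$(cat "/sys/block/$dev/device/state" 2>/dev/null)
    wwpn=$(cat "/sys/class/fc_host/host$host/port_name" 2>/dev/null)
    printf '%-4s %-10s state=%-18s fc_host=host%s wwpn=%s\n' \
        "$dev" "$hctl" "$state" "$host" "$wwpn"
done
```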
Lin, if you need something specific outside of what was provided previously please let me know what details you are looking for and I will provide it from my next round of testing on this. Like I mentioned above, I've fixed the issue that I noticed in comment #7, where multipathd was losing track of the number of active paths. Multipathd still can't do anything about devices in the transport-offline state. Hi Ben, RH has released RHEL7.8 Alpha. Can you tell me if the fix is included in RHEL7.8 Alpha so HPE can verify it? Thanks, trinh (In reply to Trinh Dao from comment #19) > Hi Ben, > RH has released RHEL7.8 Alpha. > Can you tell me if the fix is included in RHEL7.8 Alpha so HPE can verify it? Yes. You can double-check by looking at the device-mapper-multipath package version. The fix I wrote is in device-mapper-multipath-0.4.9-128.el7 and later versions. Alan, please retest with latest RHEL7.8 snapshot-3 and let us know if it fixes your issue or not. thanks, trinh Hi Alan, Could you tell me if the fixed version works good for you ASAP because the deadline is coming? Thanks in advance! Hi Lin, My hardware is only just now becoming available to test with the Alpha software requested. I'll be able to perform a quick test of this before moving back to GA software validation. I should be able to check tomorrow and report back my findings. Hi Lin, I was able to setup my test to run and it looks good after the first fault. I'll let it run overnight and let you know if I see the same behavior again. For transparency, here's the important part of the log showing the paths reinstate as expected: Feb 04 18:52:57 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:12:00.1]-500a:3: LOOP UP detected (16 Gbps). Feb 04 18:52:57 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps). Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:1: Power-on or device reset occurred Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:2: Power-on or device reset occurred Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:1: Power-on or device reset occurred Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:2: Power-on or device reset occurred Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:16. 
Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathd: sdb - tur checker reports path is in standby state Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:16: reinstated Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathd: remaining active paths: 2 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sdb: prio changed from 50 to 1 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathd: updating path groups 0 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathd: load table [0 2097152000 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:96 1 service-time 0 1 1 8:16 1] Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathe: sdc - tur checker reports path is in standby state Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:32: reinstated Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:32. Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathe: remaining active paths: 2 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sdc: prio changed from 50 to 1 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathe: updating path groups 0 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 1:0:0:1: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 1:0:0:1: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathe: load table [0 2097152000 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:112 1 service-time 0 1 1 8:32 1] Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:0: Power-on or device reset occurred Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathb: sde - tur checker reports path is in standby state Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:64: reinstated Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:64. 
Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathb: remaining active paths: 2 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sde: prio changed from 50 to 1 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathb: updating path groups 0 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 1:0:0:2: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 1:0:0:2: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathb: load table [0 2097152000 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:144 1 service-time 0 1 1 8:64 1] Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathc: sdf - tur checker reports path is in standby state Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:80: reinstated Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathc: remaining active paths: 2 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sdf: prio changed from 50 to 1 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:80. Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathc: updating path groups 0 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:1: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:1: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathc: load table [0 2097152000 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:160 1 service-time 0 1 1 8:80 1] Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:2: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:2: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpatha: sdd - tur checker reports path is in standby state Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:48. 
Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:48: reinstated Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpatha: remaining active paths: 2 Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sdd: prio changed from 50 to 1 Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpatha: updating path groups 0 Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpatha: load table [0 629145600 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:128 1 service-time 0 1 1 8:48 1] Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:0: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:0: alua: port group 01 state A non-preferred supports tolusna Move to verified according to comment 25. Hi Lin, All iterations of my test passed successfully using snapshot-3. We can move this to closed if you agree. Thank you to yourself and Ben for helping us out here! Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1066 |
Created attachment 1556597 [details]
sosreport

Description of problem:
During failover (reboot) of HPE Nimble Storage controllers, Qlogic-based paths are not being reinstated as expected.

Version-Release number of selected component (if applicable):
RHEL 7.6
3.10.0-957.el7.x86_64
device-mapper-multipath-0.4.9-123.el7.x86_64
device-mapper-multipath-debuginfo-0.4.9-123.el7.x86_64
kmod-qlgc-qla2xxx-10.01.00.33.07.6_k2-1.rhel7u6.x86_64

How reproducible:
Roughly 1 in 4 attempts hit this.

Steps to Reproduce:
1. Fail over / reboot the Nimble Storage active controller
2. Wait for the controller to reboot and come back as Standby
3. The Qlogic initiator will detect loop up and log in
4. multipath -ll will still show paths as failed / faulty

Actual results:
Paths remain failed until the host OS is rebooted.

Expected results:
Paths should automatically reinstate after a successful PLOGI/PRLI to the target adapter.

Additional info:
Attached sosreport. The issue is similar to bug 1698970, but device-mapper does not become unresponsive with the updated qla2xxx module on the host.
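For running the reproduction above unattended, a watcher along these lines can flag when step 4 actually happens. It is a hedged sketch only: the poll interval, timeout, log location, and the string it counts ('failed') are assumptions based on the multipath -ll output quoted in this bug, not part of any supported tooling.

```bash
#!/bin/bash
# Sketch: after triggering a controller failover (step 1), poll multipath
# until either all paths report healthy again or a timeout expires, and log
# any paths that stay in a failed state (the symptom described in step 4).
# The 10-second interval and 15-minute timeout are arbitrary assumptions.
LOG=/tmp/failover-watch.log        # hypothetical log location
DEADLINE=$(( $(date +%s) + 900 ))

while [ "$(date +%s)" -lt "$DEADLINE" ]; do
    failed=$(multipath -ll | grep -ci 'failed')
    echo "$(date '+%F %T') failed path lines: $failed" >> "$LOG"
    if [ "$failed" -eq 0 ]; then
        echo "$(date '+%F %T') all paths reinstated" >> "$LOG"
        exit 0
    fi
    sleep 10
done

# Still failed after the timeout: capture the state for the bug report.
multipath -ll >> "$LOG"
echo "$(date '+%F %T') paths still failed after timeout" >> "$LOG"
exit 1
```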