Bug 1701604
| Summary: | [HPESD 7.6 Bug] Qlogic: After Direct Attach Storage faults, device-mapper-multipath does not reinstate paths after login |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Alan Laws <alan.laws> |
| Component: | device-mapper-multipath |
| Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED ERRATA |
| QA Contact: | Lin Li <lilin> |
| Severity: | high |
| Priority: | unspecified |
| Version: | 7.6 |
| CC: | agk, alan.laws, bmarzins, girish.basrur, heinzm, hmadhani, jkachuck, joseph.szczypek, karen.skweres, lilin, lucas.ham.leborgne, mknutson, msnitzer, phinchman, prajnoha, rhandlin, tom.vaden, trinh.dao, william.gens |
| Target Milestone: | rc |
| Keywords: | OtherQA |
| Target Release: | 7.8 |
| Hardware: | x86_64 |
| OS: | Linux |
| Fixed In Version: | device-mapper-multipath-0.4.9-128.el7 |
| Doc Type: | Bug Fix |
| Doc Text: | Cause: Multipathd can occasionally lose track of the number of active paths. Consequence: multipathd can disable queueing while an active path still exists. Fix: Multipathd now rechecks the number of active paths each time before it changes the value. Result: multipathd can no longer disable queueing while active paths still exist. |
| Last Closed: | 2020-03-31 19:47:09 UTC |
| Type: | Bug |
| Bug Blocks: | 1704833, 1754591 |
The logs in the sosreport cover a period of 4 days, and I'm having trouble figuring out what is supposed to be happening at what time. Would it be possible to run

# multipath -ll

so I can see the initial state of the devices, then reproduce this and send a copy of /var/log/messages, along with the times when the various steps in reproducing this occurred? Thanks.
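A minimal way to collect what is being asked for here might look like the sketch below. It is illustrative only: the output directory and the 30-second interval are assumptions, not anything requested in this bug. It records a timestamped multipath -ll baseline and then keeps taking snapshots while the failover test runs, so the snapshots can later be lined up against /var/log/messages.

```bash
#!/bin/bash
# Minimal sketch of the data collection requested above. The output
# directory and the 30-second interval are assumptions, not part of the
# original request.
OUT=/tmp/mpath-debug            # hypothetical output directory
mkdir -p "$OUT"

# Record the initial state of the multipath devices.
multipath -ll > "$OUT/multipath-ll.initial.$(date +%Y%m%d-%H%M%S)"

# While reproducing the failover, keep timestamped snapshots so they can
# later be lined up against /var/log/messages.
while true; do
    multipath -ll > "$OUT/multipath-ll.$(date +%Y%m%d-%H%M%S)"
    sleep 30
done
```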
Created attachment 1558678 [details]
/var/log/messages from faults on April 25
Hi Ben, It took 3 failovers to trigger this issue today. I have attached /var/log/messages and an sosreport for your information. ++ Controller failover triggered at 08:29:55 ++ Link goes down from controller rebooting: Apr 25 08:30:17 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). ++ ALUA transitions from Standby->Active on remaining controller Apr 25 08:30:22 rtp-hpe-ops24 kernel: sd 3:0:0:3: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 25 08:30:39 rtp-hpe-ops24 multipathd: 65:32: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 25 08:30:40 rtp-hpe-ops24 multipathd: 65:80: reinstated ++ Link comes back up: Apr 25 08:35:43 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-2869:1: LOOP READY. Apr 25 08:35:43 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-2869:6: LOOP READY. < No path reinstatements seen > ++ At this point, our test case runs another failover on the storage which causes a complete outage as there are now 0 active paths and the host has to be rebooted to recover. ++ Below are the 'multipath -ll' outputs from the failing scenario described above: Initial state: [root@rtp-hpe-ops24 ~]# multipath -ll mpathe (2635e4cb4327339176c9ce90007a57d5a) dm-7 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:6 sdg 8:96 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:6 sdl 8:176 active ghost running mpathd (244c4c43fb537a0106c9ce90007a57d5a) dm-6 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:5 sdf 8:80 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:5 sdk 8:160 active ghost running mpathc (28d39d8bc9abdd4c96c9ce90007a57d5a) dm-5 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:4 sde 8:64 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:4 sdj 8:144 active ghost running mpathb (206f932dad71084546c9ce90007a57d5a) dm-4 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:3 sdd 8:48 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:3 sdi 8:128 active ghost running mpatha (25c9b224eb73b377e6c9ce90007a57d5a) dm-3 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 1:0:0:2 sdc 8:32 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 3:0:0:2 sdh 8:112 active ghost running mpathj (2fbedf10437f167596c9ce90007a57d5a) dm-12 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:6 sdr 65:16 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:6 sdv 65:80 active 
ghost running mpathi (22da894d1b63d349f6c9ce90007a57d5a) dm-11 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:5 sdq 65:0 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:5 sdu 65:64 active ghost running mpathh (2524ec74989c7a95b6c9ce90007a57d5a) dm-10 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:4 sdo 8:224 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:4 sdt 65:48 active ghost running mpathg (2dfe32757561c617b6c9ce90007a57d5a) dm-9 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:3 sdn 8:208 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:3 sds 65:32 active ghost running mpathf (213dad4f3bbdf3d5d6c9ce90007a57d5a) dm-8 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='service-time 0' prio=50 status=active | `- 6:0:0:2 sdm 8:192 active ready running `-+- policy='service-time 0' prio=1 status=enabled `- 7:0:0:2 sdp 8:240 active ghost running Failing state recorded after the array returned to healthy at 08:35:43 : [root@rtp-hpe-ops24 ~]# multipath -ll mpathe (2635e4cb4327339176c9ce90007a57d5a) dm-7 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:6 sdg 8:96 failed faulty running `- 3:0:0:6 sdl 8:176 active ready running mpathd (244c4c43fb537a0106c9ce90007a57d5a) dm-6 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:5 sdf 8:80 failed faulty running `- 3:0:0:5 sdk 8:160 active ready running mpathc (28d39d8bc9abdd4c96c9ce90007a57d5a) dm-5 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:4 sde 8:64 failed faulty running `- 3:0:0:4 sdj 8:144 active ready running mpathb (206f932dad71084546c9ce90007a57d5a) dm-4 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:3 sdd 8:48 failed faulty running `- 3:0:0:3 sdi 8:128 active ready running mpatha (25c9b224eb73b377e6c9ce90007a57d5a) dm-3 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 1:0:0:2 sdc 8:32 failed faulty running `- 3:0:0:2 sdh 8:112 active ready running mpathj (2fbedf10437f167596c9ce90007a57d5a) dm-12 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:6 sdr 65:16 failed faulty running `- 7:0:0:6 sdv 65:80 active ready running mpathi (22da894d1b63d349f6c9ce90007a57d5a) dm-11 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:5 sdq 65:0 failed faulty running `- 7:0:0:5 sdu 65:64 active ready running mpathh (2524ec74989c7a95b6c9ce90007a57d5a) dm-10 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:4 sdo 8:224 failed faulty running `- 7:0:0:4 sdt 65:48 active ready running mpathg (2dfe32757561c617b6c9ce90007a57d5a) dm-9 Nimble ,Server 
size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:3 sdn 8:208 failed faulty running `- 7:0:0:3 sds 65:32 active ready running mpathf (213dad4f3bbdf3d5d6c9ce90007a57d5a) dm-8 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active |- 6:0:0:2 sdm 8:192 failed faulty running `- 7:0:0:2 sdp 8:240 active ready running Thanks for those messages and timeline. It's hard to tell what is going on with multipathd here, since it doesn't have print statements that cover whatever's happening to those devices that aren't coming back. I've created some test rpms that you can pick up here: http://people.redhat.com/~bmarzins/device-mapper-multipath/rpms/RHEL7/bz1701604/ They fix one issue that I noticed while looking at your logs, but which I don't think is related to the devices failing to reinstate. But they also add some additional output that will hopefully catch what's going on here. If you could install them, recreate this issue, and post the output like you did above, that would be really helpful. If you can't install test rpms on your systems, let me know, and we'll try something different. Ben, Thank you for providing the debug packages. After installing the provided software, I have been unable to reproduce this issue after 5 iterations this morning. I will run this for longer and let you know if I see it come up again. I can see a lot of messages like this now while the controller is rebooting with failed paths. Outside of that, the host appears to behave as I would expect and reinstates paths quickly. Apr 29 10:51:12 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sds: unrecognized bus type (1) Apr 29 10:51:12 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathj: sds - state 2, no message Below are the log snippets I pulled while trying to reproduce this morning, everything appears to be fine so far though: ++ Controller failover triggered at 10:48:05: ++ Link goes down from controller rebooting: Apr 29 10:48:05 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). Apr 29 10:48:05 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:d8:00.0]-500b:6: LOOP DOWN detected (2 7 0 0). 
++ Device-Mapper notes 1 remaining path: Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdd: mark as failed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathb: remaining active paths: 1 Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdc: mark as failed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpatha: remaining active paths: 1 Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdm: mark as failed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathf: remaining active paths: 1 Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdn: mark as failed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathg: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:3: Asymmetric access state changed Apr 29 10:48:11 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:2: Asymmetric access state changed Apr 29 10:48:23 rtp-hpe-ops24 kernel: sd 3:0:0:5: Asymmetric access state changed Apr 29 10:48:23 rtp-hpe-ops24 kernel: sd 7:0:0:6: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 10:48:24 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 10:48:24 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 10:48:24 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 10:48:26 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 10:48:26 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 10:48:26 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 10:48:27 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 10:48:27 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 10:48:27 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 10:48:27 rtp-hpe-ops24 multipathd: 65:16: reinstated ++ Link comes back up: Apr 29 10:53:28 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:12:00.0]-500a:1: LOOP UP detected (16 Gbps). Apr 29 10:53:28 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps). ++ Device-mapper reinstates paths from link coming back up: Apr 29 10:53:29 rtp-hpe-ops24 multipathd: 65:0: reinstated Apr 29 10:53:30 rtp-hpe-ops24 multipathd: 8:48: reinstated Apr 29 10:53:30 rtp-hpe-ops24 multipathd: 8:64: reinstated Apr 29 10:53:30 rtp-hpe-ops24 multipathd: 8:208: reinstated Apr 29 10:53:30 rtp-hpe-ops24 multipathd: 8:224: reinstated Apr 29 10:53:31 rtp-hpe-ops24 multipathd: 8:32: reinstated Apr 29 10:53:31 rtp-hpe-ops24 multipathd: 8:192: reinstated Apr 29 10:53:33 rtp-hpe-ops24 multipathd: 8:80: reinstated Apr 29 10:53:33 rtp-hpe-ops24 multipathd: 8:96: reinstated Apr 29 10:53:33 rtp-hpe-ops24 multipathd: 65:32: reinstated ++ Controller failover triggered at 11:03:37: ++ Link goes down from controller rebooting: Apr 29 11:03:37 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:d8:00.1]-500b:7: LOOP DOWN detected (2 7 0 0). Apr 29 11:03:37 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:12:00.1]-500b:3: LOOP DOWN detected (2 7 0 0). 
++ Device-Mapper notes 1 remaining path: Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdr: mark as failed Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathg: remaining active paths: 1 Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdp: mark as failed Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathf: remaining active paths: 1 Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdh: mark as failed Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpatha: remaining active paths: 1 Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: sdi: mark as failed Apr 29 11:03:42 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[12353]: mpathb: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 11:03:42 rtp-hpe-ops24 kernel: sd 6:0:0:2: Asymmetric access state changed Apr 29 11:03:42 rtp-hpe-ops24 kernel: sd 1:0:0:3: Asymmetric access state changed Apr 29 11:03:56 rtp-hpe-ops24 kernel: sd 1:0:0:3: Asymmetric access state changed Apr 29 11:03:56 rtp-hpe-ops24 kernel: sd 6:0:0:3: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 8:48: reinstated Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 8:64: reinstated Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 8:208: reinstated Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 8:224: reinstated Apr 29 11:03:57 rtp-hpe-ops24 multipathd: 65:0: reinstated Apr 29 11:03:58 rtp-hpe-ops24 multipathd: 8:32: reinstated Apr 29 11:03:58 rtp-hpe-ops24 multipathd: 8:192: reinstated Apr 29 11:04:00 rtp-hpe-ops24 multipathd: 8:80: reinstated Apr 29 11:04:00 rtp-hpe-ops24 multipathd: 8:96: reinstated Apr 29 11:04:00 rtp-hpe-ops24 multipathd: 65:32: reinstated ++ Link comes back up: Apr 29 11:09:00 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.1]-500a:3: LOOP UP detected (16 Gbps). Apr 29 11:09:01 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.1]-500a:7: LOOP UP detected (16 Gbps). ++ Device-mapper reinstates paths from link coming back up: Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 11:09:04 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 11:09:06 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 11:09:06 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 11:09:06 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 11:09:06 rtp-hpe-ops24 multipathd: 65:16: reinstated ++ Controller failover triggered at 11:16:43: ++ Link goes down from controller rebooting: Apr 29 11:16:43 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). Apr 29 11:16:43 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500b:6: LOOP DOWN detected (2 7 0 0). 
++ Device-Mapper notes 1 remaining path: Apr 29 11:16:48 rtp-hpe-ops24 multipathd: sdd: mark as failed Apr 29 11:16:48 rtp-hpe-ops24 multipathd: mpathb: remaining active paths: 1 Apr 29 11:16:48 rtp-hpe-ops24 multipathd: sdc: mark as failed Apr 29 11:16:48 rtp-hpe-ops24 multipathd: mpatha: remaining active paths: 1 Apr 29 11:16:48 rtp-hpe-ops24 multipathd: sdn: mark as failed Apr 29 11:16:48 rtp-hpe-ops24 multipathd: mpathg: remaining active paths: 1 Apr 29 11:16:48 rtp-hpe-ops24 multipathd: sdm: mark as failed Apr 29 11:16:48 rtp-hpe-ops24 multipathd: mpathf: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 11:16:48 rtp-hpe-ops24 kernel: sd 7:0:0:2: Asymmetric access state changed Apr 29 11:16:48 rtp-hpe-ops24 kernel: sd 3:0:0:3: Asymmetric access state changed Apr 29 11:17:03 rtp-hpe-ops24 kernel: sd 3:0:0:3: Asymmetric access state changed Apr 29 11:17:03 rtp-hpe-ops24 kernel: sd 7:0:0:2: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 11:17:04 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 11:17:04 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 65:16: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 11:17:05 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 11:17:06 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 11:17:06 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 11:17:06 rtp-hpe-ops24 multipathd: 65:80: reinstated ++ Link comes back up: Apr 29 11:22:10 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500a:1: LOOP UP detected (16 Gbps). Apr 29 11:22:10 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps). ++ Device-mapper reinstates paths from link coming back up: Apr 29 11:22:13 rtp-hpe-ops24 multipathd: 8:64: reinstated Apr 29 11:22:13 rtp-hpe-ops24 multipathd: 8:224: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:32: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:48: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:80: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:96: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:192: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 8:208: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 65:0: reinstated Apr 29 11:22:14 rtp-hpe-ops24 multipathd: 65:32: reinstated ++ Controller failover triggered at 11:29:55: ++ Link goes down from controller rebooting: Apr 29 11:29:55 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.1]-500b:7: LOOP DOWN detected (2 7 0 0). Apr 29 11:29:55 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.1]-500b:3: LOOP DOWN detected (2 7 0 0). 
++ Device-Mapper notes 1 remaining path: Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathc: remaining active paths: 1 Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdp: unrecognized bus type (1) Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathf: sdp - state 2, no message Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathf: switch to path group #2 Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdr: unrecognized bus type (1) Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathg: sdr - state 2, no message Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathg: switch to path group #2 Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdt: unrecognized bus type (1) Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathh: sdt - state 2, no message Apr 29 11:30:00 rtp-hpe-ops24 multipathd: checker failed path 65:48 in map mpathh Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathh: remaining active paths: 1 Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdu: unrecognized bus type (1) Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathi: sdu - state 2, no message Apr 29 11:30:00 rtp-hpe-ops24 multipathd: checker failed path 65:64 in map mpathi Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathi: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 11:30:00 rtp-hpe-ops24 kernel: sd 1:0:0:2: Asymmetric access state changed Apr 29 11:30:00 rtp-hpe-ops24 kernel: sd 6:0:0:2: Asymmetric access state changed Apr 29 11:30:13 rtp-hpe-ops24 kernel: sd 1:0:0:6: Asymmetric access state changed Apr 29 11:30:15 rtp-hpe-ops24 kernel: sd 6:0:0:2: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 11:30:14 rtp-hpe-ops24 multipathd: 8:96: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:32: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:48: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:192: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:208: reinstated Apr 29 11:30:16 rtp-hpe-ops24 multipathd: 8:224: reinstated Apr 29 11:30:17 rtp-hpe-ops24 multipathd: 8:64: reinstated Apr 29 11:30:17 rtp-hpe-ops24 multipathd: 8:80: reinstated Apr 29 11:30:17 rtp-hpe-ops24 multipathd: 65:0: reinstated Apr 29 11:30:17 rtp-hpe-ops24 multipathd: 65:32: reinstated ++ Link comes back up: Apr 29 11:35:28 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.1]-500a:3: LOOP UP detected (16 Gbps). Apr 29 11:35:29 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.1]-500a:7: LOOP UP detected (16 Gbps). 
++ Device-mapper reinstates paths from link coming back up: Apr 29 11:35:31 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 11:35:32 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 11:35:32 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 11:35:33 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 11:35:33 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 11:35:33 rtp-hpe-ops24 multipathd: 65:16: reinstated Apr 29 11:35:33 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 11:35:34 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 11:35:34 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 11:35:35 rtp-hpe-ops24 multipathd: 8:176: reinstated ++ Controller failover triggered at 11:03:37: ++ Link goes down from controller rebooting: ++ Device-Mapper notes 1 remaining path: ++ ALUA transitions from Standby->Active on remaining controller: ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: ++ Link comes back up: ++ Device-mapper reinstates paths from link coming back up: ++ Controller failover triggered at 11:03:37: ++ Link goes down from controller rebooting: ++ Device-Mapper notes 1 remaining path: ++ ALUA transitions from Standby->Active on remaining controller: ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: ++ Link comes back up: ++ Device-mapper reinstates paths from link coming back up: ++ Controller failover triggered at 11:03:37: ++ Link goes down from controller rebooting: Apr 29 11:44:13 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). Apr 29 11:44:13 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500b:6: LOOP DOWN detected (2 7 0 0). ++ Device-Mapper notes 1 remaining path: Apr 29 11:44:18 rtp-hpe-ops24 multipathd: mpathf: remaining active paths: 1 Apr 29 11:44:18 rtp-hpe-ops24 multipathd: mpatha: remaining active paths: 1 Apr 29 11:44:18 rtp-hpe-ops24 multipathd: mpathb: remaining active paths: 1 Apr 29 11:44:18 rtp-hpe-ops24 multipathd: mpathg: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathc: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathd: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathe: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathh: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathi: remaining active paths: 1 Apr 29 11:44:19 rtp-hpe-ops24 multipathd: mpathj: remaining active paths: 1 ++ ALUA transitions from Standby->Active on remaining controller: Apr 29 11:44:18 rtp-hpe-ops24 kernel: sd 3:0:0:3: Asymmetric access state changed Apr 29 11:44:18 rtp-hpe-ops24 kernel: sd 7:0:0:2: Asymmetric access state changed Apr 29 11:44:30 rtp-hpe-ops24 kernel: sd 3:0:0:5: Asymmetric access state changed Apr 29 11:44:30 rtp-hpe-ops24 kernel: sd 7:0:0:4: Asymmetric access state changed ++ Device-Mapper reinstates the failed paths from the earlier transitioning state: Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 11:44:31 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 11:44:35 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 11:44:35 rtp-hpe-ops24 multipathd: 8:240: reinstated Apr 29 11:44:35 rtp-hpe-ops24 multipathd: 65:16: reinstated Apr 29 11:44:36 rtp-hpe-ops24 multipathd: 8:112: reinstated 
++ Link comes back up:

Apr 29 11:49:35 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps).
Apr 29 11:49:35 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500a:1: LOOP UP detected (16 Gbps).

++ Device-mapper reinstates paths from link coming back up:

Apr 29 11:49:39 rtp-hpe-ops24 multipathd: 8:64: reinstated
Apr 29 11:49:39 rtp-hpe-ops24 multipathd: 8:224: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:32: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:48: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:80: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:96: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:192: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 8:208: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 65:0: reinstated
Apr 29 11:49:40 rtp-hpe-ops24 multipathd: 65:32: reinstated

So the bug I fixed was that, in the initial set of messages you sent me, it was obvious that multipath's count of the number of active paths would occasionally get off (there were times when it said that there were -1 active paths, which is clearly an error). I didn't see that issue in the messages you sent in Comment #3, so I don't see why my fix for that would have fixed your actual error. The issue that I did see is that a path was failing with no messages from the checker, either during the failure or afterwards. This is what I added debugging messages for. Unfortunately, I didn't add one where I needed it. However, looking at your messages and the code, it's pretty obvious that when you see these messages

Apr 29 11:30:00 rtp-hpe-ops24 multipathd: sdp: unrecognized bus type (1)
Apr 29 11:30:00 rtp-hpe-ops24 multipathd: mpathf: sdp - state 2, no message

what's actually happening is that the device is in a sysfs state that multipath treats as a failed path, but doesn't print any messages for. When these messages appear, run:

# cat /sys/block/<devname>/devices/state

(so, for instance, for the above messages, you would run # cat /sys/block/sdp/devices/state). I assume you will see something like "transport-offline" or "created-blocked" or some other possible state that multipath just treats as a down path, without printing anything. This explains why you simply stop seeing messages about some paths. I can't explain why these paths are coming back now, but not before. In the original messages and the ones from comment 3, once a path failed with no messages (which is when these debug messages are now printed) it would stay failed. But perhaps once I know what state it is in, it will make more sense.

Hey Ben, it looks like it was wishful thinking on my behalf. The host survived 10 failovers, but on the 11th it did not recover and re-entered the state we were trying to reproduce. I left it overnight and it is currently in that state, if you want me to collect anything in particular. I am attaching the contents of /var/log/ and a new sosreport.
++ The debug multipathd package is reporting that the failed devices are remaining in state 2: Apr 30 10:12:03 rtp-hpe-ops24 multipathd: mpathe: sdg - state 2, no message Apr 30 10:12:38 rtp-hpe-ops24 multipathd: mpathd: sdf - state 2, no message Apr 30 10:12:53 rtp-hpe-ops24 multipathd: mpathc: sde - state 2, no message Apr 30 10:11:18 rtp-hpe-ops24 multipathd: mpathg: sdn - state 2, no message ++ I am currently left with 0 active/ready paths, so the filesystem has been remounted as Read-Only: Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-8): Remounting filesystem read-only Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-9): Remounting filesystem read-only Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-4): Remounting filesystem read-only ++ Multipath shows a mix of failed/faulty with failed/ghost and failed/faulty with active/ghost. Since we do not allow I/O on standby/ghost paths, the host would only see scsi rejects in this state. mpathe (2635e4cb4327339176c9ce90007a57d5a) dm-7 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:6 sdg 8:96 failed faulty running `- 3:0:0:6 sdl 8:176 failed ghost running Apr 30 09:53:33 | sdf: unrecognized bus type (1) mpathd (244c4c43fb537a0106c9ce90007a57d5a) dm-6 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:5 sdf 8:80 failed faulty running `- 3:0:0:5 sdk 8:160 failed ghost running Apr 30 09:53:33 | sde: unrecognized bus type (1) mpathc (28d39d8bc9abdd4c96c9ce90007a57d5a) dm-5 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:4 sde 8:64 failed faulty running `- 3:0:0:4 sdj 8:144 failed ghost running Apr 30 09:53:33 | sdd: unrecognized bus type (1) mpathb (206f932dad71084546c9ce90007a57d5a) dm-4 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:3 sdd 8:48 failed faulty running `- 3:0:0:3 sdi 8:128 active ghost running Apr 30 09:53:33 | sdc: unrecognized bus type (1) mpatha (25c9b224eb73b377e6c9ce90007a57d5a) dm-3 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 1:0:0:2 sdc 8:32 failed faulty running `- 3:0:0:2 sdh 8:112 active ghost running Apr 30 09:53:33 | sds: unrecognized bus type (1) mpathj (2fbedf10437f167596c9ce90007a57d5a) dm-12 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:6 sds 65:32 failed faulty running `- 7:0:0:6 sdv 65:80 failed ghost running Apr 30 09:53:33 | sdq: unrecognized bus type (1) mpathi (22da894d1b63d349f6c9ce90007a57d5a) dm-11 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:5 sdq 65:0 failed faulty running `- 7:0:0:5 sdu 65:64 failed ghost running Apr 30 09:53:33 | sdo: unrecognized bus type (1) mpathh (2524ec74989c7a95b6c9ce90007a57d5a) dm-10 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:4 sdo 8:224 failed faulty running `- 7:0:0:4 sdt 65:48 failed ghost running Apr 30 09:53:33 | sdn: unrecognized bus type (1) mpathg (2dfe32757561c617b6c9ce90007a57d5a) dm-9 Nimble ,Server size=98G features='1 queue_if_no_path' 
hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:3 sdn 8:208 failed faulty running `- 7:0:0:3 sdr 65:16 active ghost running Apr 30 09:53:33 | sdm: unrecognized bus type (1) mpathf (213dad4f3bbdf3d5d6c9ce90007a57d5a) dm-8 Nimble ,Server size=98G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=1 status=active |- 6:0:0:2 sdm 8:192 failed faulty running `- 7:0:0:2 sdp 8:240 active ghost running ++ Links go down: Apr 29 17:37:09 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500b:1: LOOP DOWN detected (2 7 0 0). Apr 29 17:37:09 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500b:6: LOOP DOWN detected (2 7 0 0). ++ Device Mapper notes 1 remaining path: Apr 29 17:37:14 rtp-hpe-ops24 multipathd: sdd: mark as failed Apr 29 17:37:14 rtp-hpe-ops24 multipathd: mpathb: remaining active paths: 1 Apr 29 17:37:14 rtp-hpe-ops24 multipathd: sdc: mark as failed Apr 29 17:37:14 rtp-hpe-ops24 multipathd: mpatha: remaining active paths: 1 Apr 29 17:37:14 rtp-hpe-ops24 multipathd: sdm: mark as failed Apr 29 17:37:14 rtp-hpe-ops24 multipathd: mpathf: remaining active paths: 1 Apr 29 17:37:14 rtp-hpe-ops24 multipathd: sdn: mark as failed Apr 29 17:37:14 rtp-hpe-ops24 multipathd: mpathg: remaining active paths: 1 ++ ALUA begins transition from Standby->Active Apr 29 17:37:14 rtp-hpe-ops24 kernel: sd 3:0:0:2: Asymmetric access state changed Apr 29 17:37:14 rtp-hpe-ops24 kernel: sd 7:0:0:2: Asymmetric access state changed ( I do not see a Asymmetric access state change for sd 3:0:0:5 and 7:0:0:4 as I did before, is this the key? ) ++ Device-Mapper reinstates paths after ALUA transition completes: Apr 29 17:37:27 rtp-hpe-ops24 multipathd: 8:112: reinstated Apr 29 17:37:27 rtp-hpe-ops24 multipathd: 8:144: reinstated Apr 29 17:37:27 rtp-hpe-ops24 multipathd: 8:160: reinstated Apr 29 17:37:27 rtp-hpe-ops24 multipathd: 65:80: reinstated Apr 29 17:37:28 rtp-hpe-ops24 multipathd: 8:176: reinstated Apr 29 17:37:28 rtp-hpe-ops24 multipathd: 65:48: reinstated Apr 29 17:37:31 rtp-hpe-ops24 multipathd: 8:128: reinstated Apr 29 17:37:31 rtp-hpe-ops24 multipathd: 65:16: reinstated Apr 29 17:37:31 rtp-hpe-ops24 multipathd: 65:64: reinstated Apr 29 17:37:32 rtp-hpe-ops24 multipathd: 8:240: reinstated ++ Links comes back up: Apr 29 17:42:35 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.0]-500a:1: LOOP UP detected (16 Gbps). Apr 29 17:42:35 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps). ++ I do not see further reinstatement after the link has been brought up ++ Unfortunately at this point, our automation continues to run and triggers another failover of the storage array, which results in the host having 0 paths to the Active controller. ++ Links go down: Apr 29 18:13:13 rtp-hpe-ops24 kernel: qla2xxx [0000:d8:00.1]-500b:7: LOOP DOWN detected (2 7 0 0). Apr 29 18:13:13 rtp-hpe-ops24 kernel: qla2xxx [0000:12:00.1]-500b:3: LOOP DOWN detected (2 7 0 0). 
++ Device Mapper notes 0 remaining active paths:

Apr 29 18:13:18 rtp-hpe-ops24 multipathd: sdi: mark as failed
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpathb: Entering recovery mode: max_retries=30
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpathb: remaining active paths: 0
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: sdh: mark as failed
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpatha: Entering recovery mode: max_retries=30
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpatha: remaining active paths: 0
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: sdr: mark as failed
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpathg: Entering recovery mode: max_retries=30
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: mpathg: remaining active paths: 0
Apr 29 18:13:18 rtp-hpe-ops24 multipathd: sdp: mark as failed

++ Eventually, the filesystem is re-mounted as read-only:

Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-8): Remounting filesystem read-only
Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-9): Remounting filesystem read-only
Apr 29 18:15:49 rtp-hpe-ops24 kernel: EXT4-fs (dm-4): Remounting filesystem read-only

Were you able to grab the sysfs device state like I mentioned in Comment 7? I just noticed that I made a typo there. You should run

# cat /sys/block/<devname>/device/state

once the link is brought back up but the path does not get re-enabled.

Hey Ben, paths are either being left in 'failed,ghost' or 'failed,faulty':

[root@rtp-hpe-ops24 block]# multipath -ll | grep -i 'failed'
| `- 1:0:0:6 sdg 8:96 failed ghost running
`- 3:0:0:6 sdl 8:176 failed faulty running
| `- 1:0:0:5 sdf 8:80 failed ghost running
`- 3:0:0:5 sdk 8:160 failed faulty running
| `- 1:0:0:4 sde 8:64 failed ghost running
`- 3:0:0:4 sdj 8:144 failed faulty running
`- 3:0:0:3 sdi 8:128 failed faulty running
`- 3:0:0:2 sdh 8:112 failed faulty running

'failed,ghost' paths are showing a sysfs device state of 'running'. These paths are cabled to the Standby controller on the storage array. 'failed,faulty' paths are showing a sysfs device state of 'transport-offline'. These paths are cabled to the Active controller on the storage array.

[root@rtp-hpe-ops24 block]# for i in sdg sdl sdf sdk sde sdj sdi sdh ; do cat /sys/block/$i/device/state ; done
running
transport-offline
running
transport-offline
running
transport-offline
transport-offline
transport-offline

The failed paths are for host1 and host3, which map to:

SN1100Q FW:v8.08.204 DVR:v10.01.00.33.07.6-k2-debug
51:40:2e:c0:01:ca:7c:08
51:40:2e:c0:01:ca:7c:0a

I have set up an analyzer to try to help with tracing out the transport issues and will share my findings there.

So, when multipathd sees that a path is in the transport-offline state, it will just treat it as down (but not offline, since that suggests that user intervention is required). Clearly, multipathd should do a better job of communicating this status, but as long as the device is in that state, there's nothing that multipath can do with it. So, I will fix the issue that I mentioned in Comment 7 with this bug. If it still looks like multipath is not doing something that it should, we may need to open a new bug to track that.

Hello Alan, we don't have Nimble storage in our lab. Could you provide test results with the fixed version? Thanks in advance!

I am happy to re-validate with your fix, Ben. Just let me know once it is available and I will load it in my tests.
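Tying the checks above together (failed paths from multipath -ll, the sysfs device state, and the host-to-WWPN mapping quoted for host1 and host3), a small read-only helper can speed up this triage. This is a sketch only, under the assumption that the multipath -ll path-line format matches the outputs shown in this bug; it reads the same sysfs files referenced above (/sys/block/<dev>/device/state and /sys/class/fc_host/host<N>/port_name) and introduces nothing new.

```bash
#!/bin/bash
# Sketch: for each failed multipath path, print the SCSI device, its sysfs
# device state, and the WWPN of the Fibre Channel host it sits behind.
# Assumes path lines look like "|- 1:0:0:6 sdg 8:96 failed faulty running".

multipath -ll | awk '/failed/ {
    # Find the sdX field and print the H:C:T:L field just before it.
    for (i = 1; i <= NF; i++)
        if ($i ~ /^sd[a-z]+$/) { print $(i-1), $i }
}' | while read -r hctl dev; do
    host=${hctl%%:*}                                   # SCSI host number, e.g. "1"
    state=$(cat "/sys/block/$dev/device/state" 2>/dev/null)
    wwpn=$(cat "/sys/class/fc_host/host$host/port_name" 2>/dev/null)
    printf '%-4s %-10s state=%-18s fc_host=host%s wwpn=%s\n' \
        "$dev" "$hctl" "$state" "$host" "$wwpn"
done
```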
Lin, if you need something specific outside of what was provided previously please let me know what details you are looking for and I will provide it from my next round of testing on this. Like I mentioned above, I've fixed the issue that I noticed in comment #7, where multipathd was losing track of the number of active paths. Multipathd still can't do anything about devices in the transport-offline state. Hi Ben, RH has released RHEL7.8 Alpha. Can you tell me if the fix is included in RHEL7.8 Alpha so HPE can verify it? Thanks, trinh (In reply to Trinh Dao from comment #19) > Hi Ben, > RH has released RHEL7.8 Alpha. > Can you tell me if the fix is included in RHEL7.8 Alpha so HPE can verify it? Yes. You can double-check by looking at the device-mapper-multipath package version. The fix I wrote is in device-mapper-multipath-0.4.9-128.el7 and later versions. Alan, please retest with latest RHEL7.8 snapshot-3 and let us know if it fixes your issue or not. thanks, trinh Hi Alan, Could you tell me if the fixed version works good for you ASAP because the deadline is coming? Thanks in advance! Hi Lin, My hardware is only just now becoming available to test with the Alpha software requested. I'll be able to perform a quick test of this before moving back to GA software validation. I should be able to check tomorrow and report back my findings. Hi Lin, I was able to setup my test to run and it looks good after the first fault. I'll let it run overnight and let you know if I see the same behavior again. For transparency, here's the important part of the log showing the paths reinstate as expected: Feb 04 18:52:57 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:12:00.1]-500a:3: LOOP UP detected (16 Gbps). Feb 04 18:52:57 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: qla2xxx [0000:d8:00.0]-500a:6: LOOP UP detected (16 Gbps). Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:1: Power-on or device reset occurred Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:2: Power-on or device reset occurred Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:1: Power-on or device reset occurred Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:2: Power-on or device reset occurred Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:52:59 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:16. 
Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathd: sdb - tur checker reports path is in standby state Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:16: reinstated Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathd: remaining active paths: 2 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sdb: prio changed from 50 to 1 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathd: updating path groups 0 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathd: load table [0 2097152000 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:96 1 service-time 0 1 1 8:16 1] Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathe: sdc - tur checker reports path is in standby state Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:32: reinstated Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:32. Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathe: remaining active paths: 2 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sdc: prio changed from 50 to 1 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathe: updating path groups 0 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 1:0:0:1: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 3:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 1:0:0:1: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathe: load table [0 2097152000 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:112 1 service-time 0 1 1 8:32 1] Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:0: Power-on or device reset occurred Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathb: sde - tur checker reports path is in standby state Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:64: reinstated Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:64. 
Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathb: remaining active paths: 2 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sde: prio changed from 50 to 1 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathb: updating path groups 0 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 1:0:0:2: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 1:0:0:2: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:1: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathb: load table [0 2097152000 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:144 1 service-time 0 1 1 8:64 1] Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathc: sdf - tur checker reports path is in standby state Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:80: reinstated Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathc: remaining active paths: 2 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sdf: prio changed from 50 to 1 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:80. Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathc: updating path groups 0 Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:1: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:2: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:1: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpathc: load table [0 2097152000 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:160 1 service-time 0 1 1 8:80 1] Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:2: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:00 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:2: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpatha: sdd - tur checker reports path is in standby state Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: device-mapper: multipath: Reinstating path 8:48. 
Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: 8:48: reinstated Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpatha: remaining active paths: 2 Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: sdd: prio changed from 50 to 1 Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpatha: updating path groups 0 Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com multipathd[2360]: mpatha: load table [0 629145600 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 1 1 8:128 1 service-time 0 1 1 8:48 1] Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:0: alua: port group 01 state A non-preferred supports tolusna Feb 04 18:53:01 rtp-hpe-ops24.rtplab.nimblestorage.com kernel: sd 7:0:0:0: alua: port group 01 state A non-preferred supports tolusna Move to verified according to comment 25. Hi Lin, All iterations of my test passed successfully using snapshot-3. We can move this to closed if you agree. Thank you to yourself and Ben for helping us out here! Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1066 |
Created attachment 1556597 [details]
sosreport

Description of problem:
During failover (reboot) of HPE Nimble Storage controllers, Qlogic-based paths are not being reinstated as expected.

Version-Release number of selected component (if applicable):
RHEL 7.6
3.10.0-957.el7.x86_64
device-mapper-multipath-0.4.9-123.el7.x86_64
device-mapper-multipath-debuginfo-0.4.9-123.el7.x86_64
kmod-qlgc-qla2xxx-10.01.00.33.07.6_k2-1.rhel7u6.x86_64

How reproducible:
Roughly 1 in 4 attempts hit this.

Steps to Reproduce:
1. Fail over / reboot the Nimble Storage active controller
2. Wait for the controller to reboot and come back as Standby
3. The Qlogic initiator will detect loop up and log in
4. multipath -ll will still show paths as failed / faulty

Actual results:
Paths remain failed until the host OS is rebooted.

Expected results:
Paths should automatically reinstate after a successful PLOGI/PRLI to the target adapter.

Additional info:
Attached sosreport. The issue is similar to bug 1698970, but device-mapper does not become unresponsive with the updated qla2xxx module on the host.
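For running the reproduction above unattended, a watcher along these lines can flag when step 4 actually happens. It is a hedged sketch only: the poll interval, timeout, log location, and the string it counts ('failed') are assumptions based on the multipath -ll output quoted in this bug, not part of any supported tooling.

```bash
#!/bin/bash
# Sketch: after triggering a controller failover (step 1), poll multipath
# until either all paths report healthy again or a timeout expires, and log
# any paths that stay in a failed state (the symptom described in step 4).
# The 10-second interval and 15-minute timeout are arbitrary assumptions.
LOG=/tmp/failover-watch.log        # hypothetical log location
DEADLINE=$(( $(date +%s) + 900 ))

while [ "$(date +%s)" -lt "$DEADLINE" ]; do
    failed=$(multipath -ll | grep -ci 'failed')
    echo "$(date '+%F %T') failed path lines: $failed" >> "$LOG"
    if [ "$failed" -eq 0 ]; then
        echo "$(date '+%F %T') all paths reinstated" >> "$LOG"
        exit 0
    fi
    sleep 10
done

# Still failed after the timeout: capture the state for the bug report.
multipath -ll >> "$LOG"
echo "$(date '+%F %T') paths still failed after timeout" >> "$LOG"
exit 1
```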