With only 1 node down, multipath -ll shows multiple paths in "failed" state

Description
================
We were regression-testing the fix for BZ#1623433 when this behavior was observed. In a CNS 3.10 setup, we had a block PVC (block4-1) mounted on an app pod bkblock4-1-1-q7wxq on initiator node 10.70.46.145. The block device was created with HA=4.

Block device name = bk4_glusterfs_block4-1_107d2230-aeb1-11e8-80c4-0a580a830204
Block device IQN  = iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849

Steps in short
======================
Step 1: Brought down a passive-path node, 10.70.46.169. The single path that failed was restored successfully on powering the node back ON. No other paths went into failed state.

Step 2: POWERED OFF the active-path node, 10.70.47.149, and observed that instead of a single path going to failed state, the following happened:
i) path sdf (the active path for this PVC - 10.70.47.149) and path sdh (10.70.46.53 - a node that was UP) went into "failed" state
ii) path sdg became active
iii) then suddenly sdg, sdf and sdi failed and sdh was restored on its own; IO continued from sdh

Thus it looked like the other 3 paths were also failing and reinstating, even though only sdf was supposed to be down.

Path status before any node poweroff/on
================================================
mpatha (36001405e9e0d58fe433463dba63113d4) dm-18 LIO-ORG ,TCMU device
size=3.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 33:0:0:0 sdf 8:80 active ready running
|-+- policy='round-robin 0' prio=10 status=enabled
| `- 34:0:0:0 sdg 8:96 active ready running
|-+- policy='round-robin 0' prio=10 status=enabled
| `- 36:0:0:0 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 35:0:0:0 sdh 8:112 active ready running

[root@dhcp46-145 ~]# ll /dev/disk/by-path/ip*
lrwxrwxrwx. 1 root root 9 Sep 2 18:43 /dev/disk/by-path/ip-10.70.46.169:3260-iscsi-iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849-lun-0 -> ../../sdg
lrwxrwxrwx. 1 root root 9 Sep 2 18:43 /dev/disk/by-path/ip-10.70.46.53:3260-iscsi-iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849-lun-0 -> ../../sdh
lrwxrwxrwx. 1 root root 9 Sep 2 18:43 /dev/disk/by-path/ip-10.70.47.149:3260-iscsi-iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849-lun-0 -> ../../sdf
lrwxrwxrwx. 1 root root 9 Sep 2 18:43 /dev/disk/by-path/ip-10.70.47.79:3260-iscsi-iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849-lun-0 -> ../../sdi
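For quick reference while reading the outputs below, a minimal shell sketch (assuming the by-path symlinks shown above exist on the initiator) that maps each multipath member device back to its iSCSI portal IP:

# Map each member device back to its portal IP (a sketch; assumes the
# by-path symlinks listed above):
for link in /dev/disk/by-path/ip-*-lun-0; do
    dev=$(readlink -f "$link")            # e.g. /dev/sdf
    portal=${link#/dev/disk/by-path/ip-}  # drop the path prefix
    echo "${dev##*/} -> ${portal%%:*}"    # e.g. sdf -> 10.70.47.149
done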
# iscsiadm -m session
tcp: [1] 10.70.47.149:3260,1 iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849 (non-flash)
tcp: [2] 10.70.46.169:3260,2 iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849 (non-flash)
tcp: [3] 10.70.46.53:3260,3 iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849 (non-flash)
tcp: [4] 10.70.47.79:3260,4 iqn.2016-12.org.gluster-block:e9e0d58f-e433-463d-ba63-113d40076849 (non-flash)

Path changes seen in a matter of minutes once 10.70.47.149 was in POWERED OFF state
=========================================
i)
mpatha (36001405e9e0d58fe433463dba63113d4) dm-18 LIO-ORG ,TCMU device
size=3.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 33:0:0:0 sdf 8:80 failed faulty running   <--- Failed (expected)
|-+- policy='round-robin 0' prio=10 status=active
| `- 34:0:0:0 sdg 8:96 active ready running
|-+- policy='round-robin 0' prio=10 status=enabled
| `- 36:0:0:0 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 35:0:0:0 sdh 8:112 failed faulty running  <--- Failed (unexpected)

ii)
[root@dhcp46-145 ~]# multipath -ll
mpatha (36001405e9e0d58fe433463dba63113d4) dm-18 LIO-ORG ,TCMU device
size=3.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 33:0:0:0 sdf 8:80 failed faulty running   <--- Failed (expected)
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 34:0:0:0 sdg 8:96 failed faulty running   <--- Failed (unexpected)
|-+- policy='round-robin 0' prio=10 status=active
| `- 36:0:0:0 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 35:0:0:0 sdh 8:112 active ready running   <--- changed from Failed in (i) back to active

iii)
[root@dhcp46-145 ~]# multipath -ll
mpatha (36001405e9e0d58fe433463dba63113d4) dm-18 LIO-ORG ,TCMU device
size=3.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 33:0:0:0 sdf 8:80 failed faulty running   <--- Failed (expected)
|-+- policy='round-robin 0' prio=10 status=active
| `- 34:0:0:0 sdg 8:96 active ready running
|-+- policy='round-robin 0' prio=10 status=enabled
| `- 36:0:0:0 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 35:0:0:0 sdh 8:112 failed faulty running  <--- changed from active in (ii) back to Failed

iv)
[root@dhcp46-145 pvc-14befacf-aeb4-11e8-8e57-005056a59bf3]# multipath -ll
mpatha (36001405e9e0d58fe433463dba63113d4) dm-18 LIO-ORG ,TCMU device
size=3.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 33:0:0:0 sdf 8:80 failed faulty running   <--- Failed (expected)
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 34:0:0:0 sdg 8:96 failed faulty running   <--- Failed (unexpected)
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 36:0:0:0 sdi 8:128 failed faulty running  <--- Failed (unexpected)
`-+- policy='round-robin 0' prio=10 status=active
  `- 35:0:0:0 sdh 8:112 active ready running
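Because the state kept flipping between the snapshots above, it helps to poll continuously instead of running multipath -ll by hand. A sketch of a timestamped polling loop (mpatha is the map name from this setup; the 10-second interval and log path are arbitrary):

# Log map and path state with timestamps for later correlation
# (a sketch; interval and log file are arbitrary choices):
while true; do
    date -u
    multipath -ll mpatha
    multipathd show paths
    sleep 10
done >> /tmp/mpatha-state.log 2>&1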
Steps Performed
=========================
1. Created a 4-node CNS 3.10 setup with the RC build.
2. Created 2 PVCs with HA=4 (bk4_glusterfs_block4-1_107d2230-aeb1-11e8-80c4-0a580a830204, bk4_glusterfs_block4-2_308c2e27-aeb1-11e8-80c4-0a580a830204).
3. The setup has 1 file volume (heketidbstorage) and 1 BHV (vol_f51ea2467acb3ca749e32a9f1993da12).
4. Powered off the first gluster node, 10.70.46.169, for some time. Approx time = Sun Sep 2 13:31:34 UTC 2018.
5. Checked the path status; the path for 10.70.46.169 went into FAILED state as expected.
6. Powered on node 10.70.46.169. Approx time = Sun Sep 2 13:35:33 UTC 2018. The path was restored.
7. Confirmed that all the paths were in "RUNNING" state.
8. POWERED OFF another gluster node, 10.70.47.149. Approx time = Sun Sep 2 15:32:56 UTC 2018. This was the active path (sdf) for the block device.
9. Checked the multipath status multiple times; each time, more than 1 path was seen in failed state.
10. POWERED ON 10.70.47.149. Approx time = Sun Sep 2 15:48:13 UTC 2018.
11. Once the node/glusterfs pod was UP, all 4 paths were back in RUNNING state.

Hence it is not clear why multiple paths were going on/off when only 1 path was down (a log-correlation sketch follows the expected results below).

How reproducible:
++++++++++++++++++++++++
NOT 100% reproducible; the issue is seen intermittently.

Actual results:
++++++++++++++++++++++++
More than 1 path kept flipping between Failed and Running states.

Expected results:
++++++++++++++++++++++++
When only 1 path is failed by the user, all other paths should stay in Running state, and IO should continue from the first passive path that is failed over to.
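Since the approximate power-off/on times were recorded, the path flapping seen in step 9 can be cross-checked against the daemon and kernel logs on the initiator. A minimal sketch using the window from steps 8-10 (the recorded times are UTC; journalctl --since/--until take local time, so adjust for the host's timezone):

# multipathd's view of path events during the power-off window
journalctl -u multipathd --since "2018-09-02 15:30:00" --until "2018-09-02 15:50:00"

# kernel-side iSCSI/transport messages for the same window
journalctl -k --since "2018-09-02 15:30:00" --until "2018-09-02 15:50:00" | grep -iE 'iscsi|session|connection'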
> Prasanna, let's give Devel ack for this bug, considering we will have the RHGS fix available with the OCS 3.11 release. If it doesn't make it, or the release timeline doesn't match, we will take back the acks.

Humble, at this point RHGS would include this fix, but the release date is the concern. Can't we handle it at a higher level until we have the RHGS version? That way we will have a smooth dependency chain in the release.
I have updated the doc text. Kindly verify it for accuracy.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0285