Description of problem:
+++++++++++++++++++++++++
We were testing an upgrade scenario from the gluster-block negative test cases.
Original setup = OCP 3.9 + CNS 3.9 async3

1. Created multiple app pods with block devices bind-mounted. Re-spun one glusterfs pod, say X, to upgrade the container image to the latest CNS 3.10 build.
Note: The node X where the pod was re-spun was the initiator for some devices, along with being the hosting node for the glusterfs target pod. Node X had 2 mpath devices with 3 paths each.

2. The gluster-blockd and gluster-block-target services couldn't come up on the new pod X (similar issue as https://bugzilla.redhat.com/show_bug.cgi?id=1596369).

3. To recover these services in the pods, as a workaround, we rebooted node X.

4. After the reboot, multipath -ll listed only 2 paths. iscsiadm did not log in to the target portal with IP X, i.e. to the node's own target IP.

5. In dmesg, it is seen that the iSCSI negotiation failed for that particular target portal, as the required auth was not supplied, possibly by OCP. Following is the log message from the initiator:

# dmesg
[ 79.102645] iSCSI Login negotiation failed.
[ 79.103592]  connection30:0: detected conn error (1020)
[ 79.105298] scsi host63: iSCSI Initiator over TCP/IP
[ 79.108097] Initiator is requesting CSG: 1, has not been successfully authenticated, and the Target is enforcing iSCSI Authentication, login failed.
[ 79.109120] iSCSI Login negotiation failed.

Note:
++++++++
This issue is reproducible even on setups where we did not do any upgrade or pod re-spin. With 3 CNS nodes (each acting as both target (pod) and initiator), after creating app pods, we just need to reboot an initiator node. It is then seen that we have 1 less path, as the initiator on node X couldn't log in to its own target portal IP X.
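Not part of the original report, but a possible debugging aid: since the target is enforcing authentication, the CHAP fields that open-iscsi has stored for the target can be compared before and after the reboot to see whether the initiator-side credentials were lost. A minimal sketch, assuming the example volume IQN from the outputs further down; node.session.auth.* are the standard open-iscsi node-record fields:

# Dump the CHAP-related fields of the stored node records for this target
# (IQN is the example volume from this report; run before and after reboot)
iscsiadm -m node -T iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052 -o show | grep node.session.auth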
Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++++++++
We faced this issue on both CNS 3.9 and CNS 3.10; hence, marking the version as 3.10 in the bug.

# oc version
oc v3.9.30

Note: 2 pods have gluster-block version "gluster-block-0.2.1-14.1.el7rhgs.x86_64" and 1 pod has the latest 3.10 version, gluster-block-0.2.1-20.el7rhgs.x86_64.

How reproducible:
++++++++++++++++++
2/2, on 2 separate setups

Steps to Reproduce:
+++++++++++++++++
Though the issue was originally hit after an upgrade (re-spin of gluster pod X) followed by a reboot of node X, we are consistently seeing it even on fresh setups. The simplest way to reproduce the issue is:
-------------------------------
1. Create a CNS cluster with 3 nodes (X, Y, Z) and create 10 app pods (block volumes bind-mounted) distributed amongst the 3 nodes, with HA=3.

2. Each block device will have 3 paths, one from each of the glusterfs target pods X, Y and Z. The same nodes are also used as initiators for the various block devices.
Note: The node on which an app pod gets hosted is the initiator node for that device/volume.

3. Log in to one of the nodes, say X, and check multipath -ll and iscsiadm -m session for 3 paths against each mpath device, e.g.:

# multipath -ll
mpathb (36001405cac98b581d2d488bb7cb3f989) dm-41 LIO-ORG ,TCMU device
size=5.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 36:0:0:0 sdj 8:144 failed faulty running
|-+- policy='round-robin 0' prio=1 status=active
| `- 37:0:0:0 sdk 8:160 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 38:0:0:0 sdl 8:176 active ready running

# ll /dev/disk/by-path/ip-*
lrwxrwxrwx. 1 root root 9 Jul  3 13:07 /dev/disk/by-path/ip-10.70.41.217:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdk
lrwxrwxrwx. 1 root root 9 Jul  3 13:07 /dev/disk/by-path/ip-10.70.42.223:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdl
lrwxrwxrwx. 1 root root 9 Jul  3 13:07 /dev/disk/by-path/ip-10.70.42.84:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdj

4. Reboot node X. Once it is back up, check multipath -ll and iscsiadm -m session again. Now there are only 2 paths:

# multipath -ll
mpathb (36001405cac98b581d2d488bb7cb3f989) dm-30 LIO-ORG ,TCMU device
size=5.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 45:0:0:0 sdg 8:96 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 54:0:0:0 sdj 8:144 active ready running

[root@dhcp42-84 ~]# ll /dev/disk/by-path/ip-*
lrwxrwxrwx. 1 root root 9 Jul  3 16:00 /dev/disk/by-path/ip-10.70.41.217:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdj
lrwxrwxrwx. 1 root root 9 Jul  3 15:59 /dev/disk/by-path/ip-10.70.42.223:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdg

[root@dhcp42-84 ~]# iscsiadm -m session
tcp: [13] 10.70.42.223:3260,3 iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052 (non-flash)
tcp: [22] 10.70.41.217:3260,2 iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052 (non-flash)

Actual results:
+++++++++++++++++
When the initiator and the target glusterfs pod are on the same CNS node, upon reboot of that node, the iSCSI login to the node's own IP (the glusterfs pod IP) fails and we are left with one less path.

That is, we have node X and we reboot node X. Logging back in to node X and checking the iSCSI logins, it is seen that the login to IP X fails due to the auth issue, though the logins to the other 2 target portals succeed. Meanwhile, on the other nodes, the login to X succeeds once X (with its target glusterfs pod) comes back up. Thus the 2nd node still has 3 paths, unlike node X, which now has only 2.

2nd node
--------------
# multipath -ll
mpathc (36001405efc2ec10177b40538d6f54e54) dm-38 LIO-ORG ,TCMU device
size=5.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 39:0:0:0 sdm 8:192 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 40:0:0:0 sdn 8:208 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 41:0:0:0 sdo 8:224 active ready running

Expected results:
+++++++++++++++++++
All paths should be restored upon reboot of a CNS gluster node, and no auth issue should be seen during the iSCSI logins.

Additional info:
+++++++++++++++++
More setup details provided in the next comment.
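As an illustration (not from the original report) of how the missing path can be confirmed by hand as an auth failure rather than a transport problem: retry the login to the node's own portal after the reboot in step 4. A sketch, assuming the IQN and portal IP (10.70.42.84) from the outputs above:

# Retry the login to the node's own target portal
iscsiadm -m node -T iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052 -p 10.70.42.84:3260 --login
# On failure, the matching negotiation error appears in the kernel log
dmesg | grep -i 'login negotiation'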
Moving the bug to assigned for the following reasons.

1) The bug doesn't have any acks
2) The ask is to reproduce the bug
3) There is no fix provided yet for QE to validate
(In reply to krishnaram Karthick from comment #11)
> Moving the bug to assigned for the following reasons.
>
> 1) The bug doesn't have any acks
> 2) The ask is to reproduce the bug
> 3) There is no fix provided yet for QE to validate

To add to that, the 'Depends On' bug #1597320 is still NOT fixed from the OCP side; it is in ASSIGNED state and is currently targeted for OCP 3.10.z, NOT OCP 3.10.
Updated Doc text field, kindly review.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3257