Bug 1597726 - On reboot of the CNS node, ISCSI login failed to its own target IP
Summary: On reboot of the CNS node, ISCSI login failed to its own target IP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhgs-server-container
Version: ocs-3.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 3.11.z Batch Update 4
Assignee: Raghavendra Talur
QA Contact: Nitin Goyal
URL:
Whiteboard:
Depends On:
Blocks: 1707226
 
Reported: 2018-07-03 14:06 UTC by Neha Berry
Modified: 2019-10-30 12:33 UTC
19 users

Fixed In Version: rhgs-server-container-3.11.4-1
Doc Type: Bug Fix
Doc Text:
Previously, iSCSI login failed on node reboot when the gluster-blockd service had not yet started, because gluster-blockd was not part of the readiness check. With this update, gluster-blockd is part of the readiness check performed on reboot, so this situation no longer occurs.
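As a rough illustration only (an assumption based on this doc text and on the linked bug about GLUSTER_BLOCKD_STATUS_PROBE_ENABLE, not the exact probe shipped in rhgs-server-container), the added readiness condition amounts to a script check along these lines inside the glusterfs pod:

  # hypothetical readiness check sketch: the pod is only reported ready once
  # gluster-blockd is active, so initiators do not attempt iSCSI logins
  # against a half-started target after a node reboot
  systemctl is-active glusterd || exit 1
  if [ "${GLUSTER_BLOCKD_STATUS_PROBE_ENABLE:-1}" = "1" ]; then
      systemctl is-active gluster-blockd || exit 1
  fi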
Clone Of:
Environment:
Last Closed: 2019-10-30 12:32:53 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1635586 0 unspecified CLOSED Document about GLUSTER_BLOCKD_STATUS_PROBE_ENABLE. 2023-05-08 07:34:36 UTC
Red Hat Product Errata RHBA-2019:3257 0 None None None 2019-10-30 12:33:10 UTC

Internal Links: 1635586

Description Neha Berry 2018-07-03 14:06:02 UTC
Description of problem:
+++++++++++++++++++++++++

We were testing upgrade scenarios as part of the gluster-block negative test cases.

Original setup = OCP 3.9 + CNS 3.9 async 3

1. Created multiple app pods with block devices bind-mounted. Re-spun one glusterfs pod, say on node X, to upgrade the container image to the latest CNS 3.10 build.
 Note: Node X, where the pod was re-spun, was the initiator for some devices, in addition to hosting the glusterfs target pod. Node X had 2 mpath devices with 3 paths each.

2. The gluster-blockd and gluster-block-target services could not come up in the new pod on node X (similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1596369).

3. To recover these daemon ('d') services in the pod, as a workaround, we rebooted node X.

4. After the reboot, multipath -ll listed only 2 paths. iscsiadm did not log in to the target portal with IP X, i.e. to the node's own target IP.

5. In dmesg, it is seen that iSCSI login negotiation failed for that particular target portal because the required authentication was not supplied, possibly by OCP.


The following log messages are from the initiator:

  # dmesg 
 [   79.102645] iSCSI Login negotiation failed.
 [   79.103592]  connection30:0: detected conn error (1020)
 [   79.105298] scsi host63: iSCSI Initiator over TCP/IP
 [   79.108097] Initiator is requesting CSG: 1, has not been successfully authenticated, and the Target is enforcing iSCSI Authentication, login failed.
 [   79.109120] iSCSI Login negotiation failed.
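
For more detail from the initiator side, the failing portal can be inspected and the login retried with standard iscsiadm commands (a generic sketch; substitute node X's portal IP and the block volume's IQN for the placeholders):

  # iscsiadm -m session -P 3
  (prints full session details, including the negotiated authentication settings)

  # iscsiadm -m node -T <block-volume-iqn> -p <node-X-ip>:3260
  (prints the recorded node settings, including the node.session.auth.* fields)

  # iscsiadm -m node -T <block-volume-iqn> -p <node-X-ip>:3260 --login
  (manually retries the login that is failing in dmesg)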

Note: 
++++++++

This issue is reproducible even on setups where we did not do any upgrade or pod re-spin. With 3 CNS nodes (each acting as both target pod host and initiator), after creating app pods we only need to reboot an initiator node. One path is then missing, because the initiator on node X could not log in to its own target portal IP X.

Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++++++++
We faced this issue on both CNS 3.9 and CNS 3.10. Hence, the version in this bug is marked as 3.10.

oc version
oc v3.9.30

Note: 2 pods have gluster-block version gluster-block-0.2.1-14.1.el7rhgs.x86_64 and 1 pod has the latest 3.10 version, gluster-block-0.2.1-20.el7rhgs.x86_64.


How reproducible:
++++++++++++++++++

2/2 on 2 separate setups


Steps to Reproduce:
+++++++++++++++++

Though the issue was originally hit after an upgrade (re-spin of the gluster pod on node X) followed by a reboot of node X, we consistently see it even on a fresh setup.

The simplest way to reproduce the issue is:
-------------------------------
1. Create a CNS cluster with 3 nodes (X, Y, Z) and create 10 app pods (block volumes bind-mounted), distributed amongst the 3 nodes, with HA=3.
2. Each block device will have 3 paths, one from each glusterfs target pod on X, Y and Z. The same nodes are also used as initiators for different block devices.
Note: The node on which an app pod is hosted is the initiator node for that device/volume.

3. Log in to one of the nodes, say X, and check multipath -ll and iscsiadm -m session for 3 paths against each mpath device:
# multipath -ll

e.g.
mpathb (36001405cac98b581d2d488bb7cb3f989) dm-41 LIO-ORG ,TCMU device    
size=5.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 36:0:0:0 sdj 8:144 failed faulty running
|-+- policy='round-robin 0' prio=1 status=active
| `- 37:0:0:0 sdk 8:160 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 38:0:0:0 sdl 8:176 active ready running
 
# ll /dev/disk/by-path/ip-*

lrwxrwxrwx. 1 root root 9 Jul  3 13:07 /dev/disk/by-path/ip-10.70.41.217:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdk
lrwxrwxrwx. 1 root root 9 Jul  3 13:07 /dev/disk/by-path/ip-10.70.42.223:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdl
lrwxrwxrwx. 1 root root 9 Jul  3 13:07 /dev/disk/by-path/ip-10.70.42.84:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdj
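
Before the reboot it can also be confirmed that node X holds a session to its own portal; a quick check (a sketch, assuming the address reported by hostname -i is the same IP that the local glusterfs pod exposes as a target portal):

  # iscsiadm -m session | grep "$(hostname -i)"
  (expect one line per block device that has a path through the local glusterfs target pod)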
 
4. Reboot node X. Once it is back up, check multipath -ll and iscsiadm -m session again. Now there are only 2 paths:

# multipath -ll
mpathb (36001405cac98b581d2d488bb7cb3f989) dm-30 LIO-ORG ,TCMU device    
size=5.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 45:0:0:0 sdg 8:96  active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 54:0:0:0 sdj 8:144 active ready running

[root@dhcp42-84 ~]# ll /dev/disk/by-path/ip-*

lrwxrwxrwx. 1 root root 9 Jul  3 16:00 /dev/disk/by-path/ip-10.70.41.217:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdj
lrwxrwxrwx. 1 root root 9 Jul  3 15:59 /dev/disk/by-path/ip-10.70.42.223:3260-iscsi-iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052-lun-0 -> ../../sdg

[root@dhcp42-84 ~]# iscsiadm -m session
tcp: [13] 10.70.42.223:3260,3 iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052 (non-flash)
tcp: [22] 10.70.41.217:3260,2 iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052 (non-flash)
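
The missing path is the one to the node's own portal (10.70.42.84 on this setup, per the by-path listing in step 3). A manual re-login from node X can be attempted to reproduce the negotiation failure seen in dmesg (a sketch, using the IQN from the output above):

  # iscsiadm -m session | grep 10.70.42.84
  (returns nothing: there is no session to the node's own portal after the reboot)

  # iscsiadm -m node -T iqn.2016-12.org.gluster-block:cac98b58-1d2d-488b-b7cb-3f9897795052 -p 10.70.42.84:3260 --login
  (expected to fail with an authentication/authorization error while the bug is present)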

Actual results:
+++++++++++++++++
When the initiator and the target glusterfs pod are on the same CNS node, iSCSI login to the node's own IP (the glusterfs pod IP) fails after a reboot of that node, leaving one less path.
That is, we reboot node X, log back in to node X and check the iSCSI logins: the login to IP X fails due to an authentication issue, although logins to the other 2 target portals succeed.

Meanwhile, on the other nodes, login to X is successful once X (with its target glusterfs pod) comes back up.

Thus the 2nd node still has 3 paths, unlike node X, which now has only 2.

2nd node
--------------
# multipath -ll
mpathc (36001405efc2ec10177b40538d6f54e54) dm-38 LIO-ORG ,TCMU device     
size=5.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 39:0:0:0 sdm 8:192 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 40:0:0:0 sdn 8:208 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 41:0:0:0 sdo 8:224 active ready running
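
Since the other nodes can log in to X once it is back up, the target itself is serving again. Whether a manual re-login from node X recovers the third path is not shown in this report, but it can be attempted once the glusterfs pod on X is Running and gluster-blockd is active (a sketch):

  # oc get pods -o wide | grep glusterfs
  (on the master: confirm the glusterfs pod on node X is Running)

  # iscsiadm -m node -p 10.70.42.84:3260 --login
  # multipath -ll
  (re-check the path count for each mpath device)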


Expected results:
+++++++++++++++++++
All paths should be restored after reboot of a CNS gluster node, and no authentication issue should be seen during iSCSI login.

Additional info:
+++++++++++++++++
More setup details provided in next comment.

Comment 11 krishnaram Karthick 2018-07-19 17:26:18 UTC
Moving the bug to assigned for the following reasons.

1) The bug doesn't have any acks
2) The ask is to reproduce the bug
3) There is no fix provided yet for QE to validate

Comment 12 Prasanth 2018-07-20 04:37:37 UTC
(In reply to krishnaram Karthick from comment #11)
> Moving the bug to assigned for the following reasons.
> 
> 1) The bug doesn't have any acks
> 2) The ask is to reproduce the bug
> 3) There is no fix provided yet for QE to validate

To add to that, the dependent bug #1597320 is still NOT fixed on the OCP side; it is in ASSIGNED state and is currently targeted for OCP 3.10.z, NOT OCP 3.10.

Comment 43 Anjana KD 2018-10-11 07:33:06 UTC
Updated the Doc Text field; kindly review.

Comment 67 errata-xmlrpc 2019-10-30 12:32:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3257

