Bug 1599742

Summary: On app pod restart, mpath device name is not mapped/created for some blockvolumes on the new initiator side
Product: OpenShift Container Platform
Reporter: Neha Berry <nberry>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Status: CLOSED NOTABUG
QA Contact: Jianwei Hou <jhou>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, aos-storage-staff, bchilds, bgoyal, jsafrane, kramdoss, nberry, pkarampu, pprakash, prasanna.kalever, rcyriac, rhs-bugs, sankarshan, vbellur, xiubli
Target Milestone: ---
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1598740
Environment:
Last Closed: 2018-08-30 16:54:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1599217
Bug Blocks: 1568862, 1598740

Comment 1 Jan Safranek 2018-07-10 13:52:21 UTC
Copied from https://bugzilla.redhat.com/show_bug.cgi?id=1599217#c6

Looking at the logs, I can see that OpenShift did initiate the attach, but it timed out waiting for 10.70.46.1 and 10.70.46.75:

Jul 10 15:56:32 dhcp46-175.lab.eng.blr.redhat.com atomic-openshift-node[2453]: I0710 15:56:32.454227    2453 iscsi_util.go:314] iscsi: dev /dev/disk/by-path/ip-10.70.46.1:3260-iscsi-iqn.2016-12.org.gluster-block:d2a42cc7-6a07-47e0-9b96-c25706d2fad2-lun-0 err Could not attach disk: Timeout after 10s

Jul 10 15:56:42 dhcp46-175.lab.eng.blr.redhat.com atomic-openshift-node[2453]: I0710 15:56:42.217983    2453 iscsi_util.go:314] iscsi: dev /dev/disk/by-path/ip-10.70.46.75:3260-iscsi-iqn.2016-12.org.gluster-block:d2a42cc7-6a07-47e0-9b96-c25706d2fad2-lun-0 err Could not attach disk: Timeout after 10s


Only 10.70.46.175 succeeds:
Jul 10 15:56:43 dhcp46-175.lab.eng.blr.redhat.com atomic-openshift-node[2453]: I0710 15:56:43.375884    2453 iscsi_util.go:318] iscsi: dev /dev/disk/by-path/ip-10.70.46.175:3260-iscsi-iqn.2016-12.org.gluster-block:d2a42cc7-6a07-47e0-9b96-c25706d2fad2-lun-0 added to devicepath

Since only the *last* portal succeeded, OpenShift quickly checked whether any /sys/block/dm-* device lists sds under /sys/block/dm-X/slaves/, found none (i.e. it considered the path not to be part of a multipath device), and mounted the single path directly.
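
A minimal sketch of the sysfs check described above, written in Python for brevity. It is not the code from iscsi_util.go (which is Go and is not reproduced here); the function name `find_multipath_parent` and the `sysfs_block` parameter are hypothetical and exist only so the check can be pointed at a test directory.

```python
import os
from typing import Optional


def find_multipath_parent(device: str, sysfs_block: str = "/sys/block") -> Optional[str]:
    """Return the dm-* entry that lists `device` under slaves/, or None.

    `device` is a kernel block device name such as "sds".
    """
    try:
        entries = sorted(os.listdir(sysfs_block))
    except FileNotFoundError:
        # No such directory: nothing can claim the device.
        return None
    for entry in entries:
        if not entry.startswith("dm-"):
            continue
        # In real sysfs the slaves/ entries are symlinks, so use lexists.
        if os.path.lexists(os.path.join(sysfs_block, entry, "slaves", device)):
            return entry
    return None
```

If this returns None at the moment of the check, the device looks like a plain single path, even when multipathd would have claimed it a moment later.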

There are several issues with this approach:

1. The iSCSI target or initiator is slow to attach the volume (that's intended, it's a stress test, right?)
2. OpenShift does not wait for multipath to evaluate a device before deciding it is a single path.
3. OpenShift has no configurable parameter for the attach timeout; 10s is hardcoded.
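
To illustrate the third point, here is a rough sketch of what a wait with a configurable timeout could look like. It is written in Python and is an assumption for illustration only, not the existing OpenShift/Kubernetes code; `wait_for_path`, `timeout_s` and `interval_s` are hypothetical names, and the 10s default merely mirrors the hardcoded value mentioned above.

```python
import os
import time
from typing import Callable


def wait_for_path(
    path: str,
    timeout_s: float = 10.0,
    interval_s: float = 1.0,
    exists: Callable[[str], bool] = os.path.exists,
) -> bool:
    """Poll until `path` exists or `timeout_s` elapses.

    Returns True if the path appeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if exists(path):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval_s, remaining))
```

With the timeout exposed as a parameter, a slow target such as the ones in this stress test could be given more than 10s without a code change.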

Comment 2 Jan Safranek 2018-07-10 14:31:59 UTC
This is likely dup of #1597320

Comment 3 Jan Safranek 2018-07-10 15:25:06 UTC
Upstream issue: https://github.com/kubernetes/kubernetes/issues/60894

Comment 4 Jan Safranek 2018-07-10 16:42:24 UTC
One hotfix I can do relatively quickly: OpenShift can check several times (for 10s?) whether a device is part of multipath before mounting a single path. This is not a proper solution to the problem, but it will remove the "blocking" part of this bug.

It will slow down iSCSI setup for customers who run multiple portals for the same volume but don't use multipath. Is that even a valid setup? [I know CNS does not use this setup, but some other iSCSI user might.]
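
The hotfix idea in this comment can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the change that was actually made: `resolve_device_to_mount` and its parameters are hypothetical, and `find_multipath` stands in for whatever lookup maps a path device to its multipath device (returning None when there is none).

```python
import time
from typing import Callable, Optional


def resolve_device_to_mount(
    device_path: str,
    find_multipath: Callable[[str], Optional[str]],
    wait_s: float = 10.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Re-check for a multipath device for up to `wait_s` seconds.

    Returns the multipath device if one shows up in time, otherwise
    falls back to the single path that was passed in.
    """
    deadline = time.monotonic() + wait_s
    while True:
        mpath = find_multipath(device_path)
        if mpath is not None:
            return mpath
        if time.monotonic() >= deadline:
            # Give up waiting and mount the single path.
            return device_path
        sleep(interval_s)
```

The fallback branch is where the cost mentioned above shows up: a setup with several portals but no multipath would always wait the full `wait_s` before mounting.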

Comment 5 Jan Safranek 2018-07-11 11:39:17 UTC
> This is likely dup of #1597320

Sorry, it's a different issue.

Comment 19 Yaniv Kaul 2018-08-31 20:24:21 UTC
Who's looking at retry logic in the CSI driver, if needed?