1599742 – On app pod restart, mpath device name is not mapped/created for some blockvolumes in the new initiator side

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	3.10.z
Assignee:	Jan Safranek
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:	1599217
Blocks:	1568862 1598740

Reported:	2018-07-10 13:05 UTC by Neha Berry
Modified:	2019-02-04 05:03 UTC
CC List:	15 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:	1598740
Environment:
Last Closed:	2018-08-30 16:54:09 UTC
Target Upstream Version:
Embargoed:

Comment 1 Jan Safranek 2018-07-10 13:52:21 UTC

Copied from https://bugzilla.redhat.com/show_bug.cgi?id=1599217#c6

Looking into the logs, I can see that OpenShift did initiate the attach, but it timed out waiting for 10.70.46.1 and 10.70.46.75:

Jul 10 15:56:32 dhcp46-175.lab.eng.blr.redhat.com atomic-openshift-node[2453]: I0710 15:56:32.454227    2453 iscsi_util.go:314] iscsi: dev /dev/disk/by-path/ip-10.70.46.1:3260-iscsi-iqn.2016-12.org.gluster-block:d2a42cc7-6a07-47e0-9b96-c25706d2fad2-lun-0 err Could not attach disk: Timeout after 10s

Jul 10 15:56:42 dhcp46-175.lab.eng.blr.redhat.com atomic-openshift-node[2453]: I0710 15:56:42.217983    2453 iscsi_util.go:314] iscsi: dev /dev/disk/by-path/ip-10.70.46.75:3260-iscsi-iqn.2016-12.org.gluster-block:d2a42cc7-6a07-47e0-9b96-c25706d2fad2-lun-0 err Could not attach disk: Timeout after 10s


Only 10.70.46.175 succeeds:
Jul 10 15:56:43 dhcp46-175.lab.eng.blr.redhat.com atomic-openshift-node[2453]: I0710 15:56:43.375884    2453 iscsi_util.go:318] iscsi: dev /dev/disk/by-path/ip-10.70.46.175:3260-iscsi-iqn.2016-12.org.gluster-block:d2a42cc7-6a07-47e0-9b96-c25706d2fad2-lun-0 added to devicepath

Since only the *last* portal succeeded, OpenShift quickly checked whether any /sys/block/dm-* has /sys/block/dm-X/slaves/sds, found none (i.e. it considered the path not to be part of multipath), and mounted the single path directly.
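
The check described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in iscsi_util.go; the function name `find_multipath_parent` and the `sys_block` parameter are hypothetical, and the only thing assumed about the kernel interface is the sysfs layout named in this comment (/sys/block/dm-X/slaves/<device>).

```python
import glob
import os


def find_multipath_parent(device, sys_block="/sys/block"):
    """Return the dm-X name that lists `device` as a slave, or None.

    `device` is a bare kernel name such as "sds" (not "/dev/sds").
    `sys_block` is a hypothetical parameter so the lookup can be pointed
    at a fake directory tree for testing.
    """
    pattern = os.path.join(sys_block, "dm-*", "slaves", device)
    for match in sorted(glob.glob(pattern)):
        # match looks like <sys_block>/dm-3/slaves/sds; go up two levels
        # to recover the "dm-3" component.
        return os.path.basename(os.path.dirname(os.path.dirname(match)))
    return None
```

A `None` result immediately after login does not prove the device will never be part of multipath; it may only mean multipath has not evaluated the device yet, which is the situation in this bug.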

There are several issues with this approach:

1. The iSCSI target or initiator is slow to attach the volume (that's intended, it's a stress test, right?).
2. OpenShift does not wait for multipath to evaluate a device before checking it.
3. OpenShift has no configurable parameter for the attach timeout; 10s is hardcoded.
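
To illustrate issue 3, here is a minimal sketch of an attach wait with a configurable timeout. It assumes the wait is a simple existence poll on the /dev/disk/by-path entry; `wait_for_device` and all of its parameters are hypothetical names, not OpenShift or Kubernetes APIs, and the 10s default only mirrors the hardcoded value mentioned above.

```python
import os
import time


def wait_for_device(path, timeout_s=10.0, interval_s=1.0,
                    exists=os.path.exists, sleep=time.sleep,
                    clock=time.monotonic):
    """Poll until `path` exists.

    Returns True as soon as the path appears, or False once `timeout_s`
    seconds have elapsed. `exists`, `sleep` and `clock` are injectable
    (hypothetical parameters) so the loop can be tested without real
    devices or real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With a slow target, such as in this stress test, a caller could pass a larger `timeout_s` for each portal instead of giving up after 10s.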

Comment 2 Jan Safranek 2018-07-10 14:31:59 UTC

This is likely a dup of #1597320

Comment 3 Jan Safranek 2018-07-10 15:25:06 UTC

Upstream issue: https://github.com/kubernetes/kubernetes/issues/60894

Comment 4 Jan Safranek 2018-07-10 16:42:24 UTC

One hotfix I can do relatively quickly: OpenShift can check several times (for 10s?) whether a device is part of multipath before mounting a single path. This is not a proper solution to the problem, but it will remove the "blocking" part of this bug.

It will slow down iSCSI setup for customers that run multiple portals for the same volume but don't use multipath. Is that even a valid setup? [I know CNS does not use this setup, but some other iSCSI user might.]
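
A rough sketch of that hotfix idea, in Python for illustration only (it is not the eventual patch). `choose_mount_device`, `find_parent`, `attempts` and `interval_s` are hypothetical names; `find_parent` stands for any function that returns the dm-X name holding the device, or None. Returning "/dev/dm-X" as the mount device is also an assumption of this sketch.

```python
import time


def choose_mount_device(device, find_parent, attempts=10, interval_s=1.0,
                        sleep=time.sleep):
    """Pick the device to mount for a freshly attached iSCSI path.

    Calls `find_parent(device)` up to `attempts` times, sleeping
    `interval_s` between tries. Returns "/dev/<dm-X>" if multipath claims
    the device within that window, otherwise falls back to the single
    path "/dev/<device>".
    """
    for attempt in range(attempts):
        parent = find_parent(device)
        if parent is not None:
            return "/dev/" + parent
        if attempt < attempts - 1:
            # Give multipath more time before the next check.
            sleep(interval_s)
    return "/dev/" + device
```

The worst-case extra delay is `(attempts - 1) * interval_s`, and it is paid only when no multipath device ever shows up, which is exactly the multiple-portals-without-multipath setup questioned above.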

Comment 5 Jan Safranek 2018-07-11 11:39:17 UTC

> This is likely a dup of #1597320

Sorry, it's a different issue.

Comment 19 Yaniv Kaul 2018-08-31 20:24:21 UTC

Who's looking at retry logic in the CSI driver, if needed?