|Summary:||APP pod unable to start after target port failure in cases where single paths are mounted on APP pods (BZ#1599742)|
|Product:||OpenShift Container Platform||Reporter:||Neha Berry <nberry>|
|Component:||Storage||Assignee:||Jan Safranek <jsafrane>|
|Status:||CLOSED ERRATA||QA Contact:||Liang Xia <lxia>|
|Version:||3.10.0||CC:||aos-bugs, aos-storage-staff, apanagio, bchilds, hchiramm, jhou, jsafrane, lxia, madam, nberry, vlaad|
|Fixed In Version:||Doc Type:||If docs needed, set a value|
|Doc Text:||Story Points:||---|
|:||1609788 1637413 1637422 (view as bug list)||Environment:|
|Last Closed:||2018-11-11 16:39:10 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
|Bug Depends On:|
|Bug Blocks:||1596021, 1598740, 1609788, 1637413, 1637422|
Description Neha Berry 2018-07-30 08:17:52 UTC
Description of problem: APP pod unable to start after target port failure in cases where single paths are mounted on APP pods (BZ#1599742).

This issue is related to, and a side-effect of, https://bugzilla.redhat.com/show_bug.cgi?id=1599742. When app pods get bind-mounted with single devices instead of mpath devices, the app pod fails to recover after a path failure. This issue won't be seen if mpath devices are mounted instead of single paths.

Description
==============
In some cases, while creating app pods with bind-mounted block devices, no mpath device is created for the device on the initiator side (https://bugzilla.redhat.com/show_bug.cgi?id=1599742). As a result, instead of an mpath device (/dev/mapper/mpatha, etc.), a single device (/dev/sdk, etc.) gets mounted on the app pod. Since a single path is mounted, if that path goes down, the app pod goes into CrashLoopBackOff state (which is expected). But once the path is restored and the app pod is re-created, the mount path on the initiator still shows input/output error.

Note: the multipath service was up and running on all the initiator nodes.

We see the following 2 issues, based on 2 different scenarios:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Scenario 1. Once the path is restored, the app pod tries to re-spin on the same initiator node
*********************************************************************************************
In this case, when we delete the app pod so as to make it use the restored path again, app pod creation is stuck with the following error message.

Sequence of events
====================
1. The app pod is in Running state on Node-A and has bind-mounted path /dev/sdj, as no mpath device is created for /dev/sdj (belonging to target-X).
2. Failed path sdj by blocking port 3260 on its target portal IP-X.
3. The app pod goes into CrashLoopBackOff state.
4. The path is restored to target-X. All other devices which had mpath show the transition from Failed->Active state. /dev/sdj doesn't have mpath, hence couldn't validate the same using multipath -ll.
5. To make the pod use the restored path again, deleted the app pod. With replica=1, the app pod tried to re-spin, this time on the same initiator node.
6. The app pod stays in ContainerCreating state and oc describe shows the following error message:
------------------------------------------------------
Warning  FailedMount  31s (x7 over 1m)  kubelet, dhcp46-177.lab.eng.blr.redhat.com  MountVolume.WaitForAttach failed for volume "pvc-0cb7c180-92e9-11e8-8e5c-005056a5f3b5" : Heuristic determination of mount point failed: stat /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/iscsi/iface-default/10.70.46.172:3260-iqn.2016-12.org.gluster-block:4b9d7ead-83fc-41a9-9c94-83e2244fc1b4-lun-0: input/output error
------------------------------------------------------
7. The app pod is still in the same state, and has had x340 retries over 12h. The iscsiadm login to the iqn is also intact. The mount point shows input/output error.

Scenario 2. Once the path is restored, the app pod tries to re-spin on a different initiator node
***********************************************************************************************
In this case, the app pod tries to re-spin on another node.

Sequence of events
====================
1. The app pod is running on Node-A and has bind-mounted path /dev/sdm (no mpath device created, BZ#1599742).
2. Failed path /dev/sdm by blocking port 3260 on its target portal IP-X.
3. The app pod goes into CrashLoopBackOff state as expected. The mount path on the initiator shows input/output error.
4. The path is restored to target-X. All other devices which had mpath show the transition from Failed->Active state. /dev/sdm doesn't have mpath, hence couldn't validate the same using multipath -ll.
5. To make the pod use the restored path again, deleted the app pod. With replica=1, the app pod tried to re-spin, this time on a different initiator node-B.
6. The app pod stays in ContainerCreating state for some time - issue reported:
---------------------------------------
Warning  FailedAttachVolume  3m  attachdetach-controller  Multi-Attach error for volume "pvc-857a1bd1-93b5-11e8-8e5c-005056a5f3b5" Volume is already exclusively attached to one node and can't be attached to another
Warning  FailedMount  1m  kubelet, dhcp47-62.lab.eng.blr.redhat.com  Unable to mount volumes for pod "mongo-3-1-9d77w_glusterfs(16db0ded-93bd-11e8-8e5c-005056a5f3b5)": timeout expired waiting for volumes to attach or mount for pod "glusterfs"/"mongo-3-1-9d77w". list of unmounted volumes=[mongo-3-data]. list of unattached volumes=[mongo-3-data default-token-dhnj5]
---------------------------------------
7. After some time, the app pod does get successfully created on the other node.
8. But it is seen that the iqn is still logged in from Node-A. The logout may have failed due to the mount path being in input/output error.
9. Thus the same IQN is logged in from 2 initiator nodes, which is not a desirable case.

More details are attached in subsequent comments.

Version-Release number of selected component (if applicable):
# oc version
oc v3.10.14
heketi-7.0.0-5.el7rhgs.x86_64
glusterfs-3.8.4-54.15
gluster-block-0.2.1-23.el7rhgs.x86_64

How reproducible:
++++++++++++++++++++++
2x2

Steps to Reproduce:
++++++++++++++++++
1. Create app pods with bind-mounted block PVCs (HA=2/HA=3).
2. Due to BZ#1599742, it may happen that for some app pods the mount path is a single device instead of a device-mapper name. E.g.:
# oc rsh mongo-2-1-5hvxq df -kh | grep mongo
/dev/sdj  1.0G  236M  779M  24%  /var/lib/mongodb/data
3. Now bring down the target which services the path, say the above /dev/sdj path - 10.70.46.172 - by rejecting port 3260 in iptables.
lrwxrwxrwx. 1 root root 9 Jul 29 10:07 /dev/disk/by-path/ip-10.70.46.172:3260-iscsi-iqn.2016-12.org.gluster-block:4b9d7ead-83fc-41a9-9c94-83e2244fc1b4-lun-0 -> ../../sdj
4. The app pod should crash as expected, since /dev/sdj was mounted as a single path.
5. Bring up the path and delete the pod.
6. Check whether the pod comes up on the same initiator or a different one, and validate scenarios #1 and #2 above.

Actual results:
+++++++++++++++++++
1. A single path is getting mounted, which is unacceptable as no HA is involved.
2. When the app pod comes up on the same initiator, the app pod fails to get re-created. The mount path keeps showing i/o error. The umount seems to be unsuccessful.
3. When the app pod comes up on a different initiator, the app pod does get re-created. But the mount path still stays behind on the initial node (with i/o error) and is not unmounted. Hence, for the same iqn, we still see logins from 2 nodes - the new and the old initiator.

Expected results:
++++++++++++++++++++
1. An mpath device should always get mounted on the app pods, to get the benefits of multipath.
2. Moreover, if a single device is mounted, it should get unmounted successfully from the old node when the app pod gets re-created on a new initiator.
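The fault-injection part of the steps above can be sketched as a short script. This is only an illustrative sketch: the commands are printed rather than executed, since they must run as root (the iptables rules on the target node, the multipath check on the initiator); the portal IP is the one from this report, and the iptables chain and flags are an assumption about how the reporter blocked the port.

```shell
# Sketch of the path-failure injection described above (assumptions: INPUT
# chain, REJECT target; commands are echoed, not run, as they need root).
PORTAL_IP="10.70.46.172"

# Run on the target node to fail the path (reject iSCSI traffic on 3260):
FAIL_CMD="iptables -A INPUT -p tcp --dport 3260 -j REJECT"
# Run on the target node to restore the path:
RESTORE_CMD="iptables -D INPUT -p tcp --dport 3260 -j REJECT"
# Run on the initiator to check whether the device got a multipath map:
CHECK_CMD="multipath -ll"

printf '%s\n%s\n%s\n' "$FAIL_CMD" "$RESTORE_CMD" "$CHECK_CMD"
```

If `multipath -ll` shows no map for the device (as in BZ#1599742), only the single `/dev/sdX` path is mounted and the failure above is reproducible.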
Comment 22 Jan Safranek 2018-10-09 09:11:15 UTC
> Please let me know if I can re-use my setup or it is needed by you for further analysis.

I downloaded the logs (thanks!), you can destroy the machines.
Comment 23 Jan Safranek 2018-10-09 09:12:15 UTC
3.10.z PR: https://github.com/openshift/ose/pull/1431
Comment 24 Jan Safranek 2018-10-09 09:39:26 UTC
> b) Mount existed for block6 on both the nodes for more than 2H. Then finally we saw the iqn was logged out from dhcp47-196.

In final_journalctl, I can see that pod cirrosblock6-1-qwkp9_glusterfs was deleted, but something prevented its docker container from exiting; the log is full of these messages:

cirrosblock6-1-qwkp9_glusterfs(184cbc6b-c754-11e8-8b48-005056a50953)" has been removed from pod manager. However, it still has one or more containers in the non-exited state. Therefore, it will not be removed from volume manager.

Therefore kubelet did not unmount its volumes. I'd suggest creating a separate issue and consulting the pod team.
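The kubelet decision quoted in that log message can be sketched as follows. This is a hypothetical stand-in, not kubelet code (the real check lives in kubelet's Go volume manager): volumes are unmounted only once every container of the pod has exited.

```shell
# Hypothetical sketch of the rule in the log above: a deleted pod is removed
# from the volume manager (and its volumes unmounted) only when it has zero
# non-exited containers. Function name and input are illustrative.
pod_volume_manager_action() {
  local non_exited_containers="$1"
  if [ "$non_exited_containers" -eq 0 ]; then
    echo "unmount volumes"   # pod fully exited, safe to clean up mounts
  else
    echo "keep mounts"       # containers still running, skip unmount
  fi
}

pod_volume_manager_action 1   # → keep mounts
```

This is why the stuck container in comment 24 left the iqn logged in: with one container still in the non-exited state, the unmount path was never reached.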
Comment 27 Jan Safranek 2018-10-16 11:20:17 UTC
In 3.10 and 3.11 we included these upstream PRs:

* https://github.com/kubernetes/kubernetes/pull/63176 - deletes LUNs that are not needed and avoids excessive scanning for LUNs that kubelet does not need; a prerequisite of the other PRs
* https://github.com/kubernetes/kubernetes/pull/67140 - waits up to 10 seconds for multipath devices to appear; this prevents a single path being mounted instead of multipath
* https://github.com/kubernetes/kubernetes/pull/69140 - fixes a regression introduced by the above PRs
* https://github.com/kubernetes/kubernetes/pull/68141 - re-tries attaching missing paths until at least two of them are available (or the plugin has tried 5 times and given up) -> at least two paths (= multipath) should always be used unless there is something really bad with the target

Summed together, you should always get multipath, unless one target is in really bad shape.
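The "wait up to 10 seconds for the multipath device" behaviour from kubernetes#67140 can be sketched as a polling loop. This is a hedged illustration only: the real implementation is Go code inside kubelet's iSCSI plugin, and the function name and 1-second polling interval here are assumptions.

```shell
# Illustrative sketch (not kubelet code): poll for a device-mapper node for
# up to 10 seconds before giving up, instead of immediately falling back to
# a single /dev/sdX path.
wait_for_multipath() {
  local dm_path="$1" deadline=10 waited=0
  while [ "$waited" -lt "$deadline" ]; do
    if [ -e "$dm_path" ]; then
      echo "multipath device ready: $dm_path"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "timed out waiting for $dm_path"
  return 1
}
```

With this wait in place, a slightly slow multipathd still wins the race, so the single-path mounts that triggered this bug should no longer occur under normal conditions.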
Comment 32 Jianwei Hou 2018-10-22 19:27:51 UTC
Verified on v3.10.57.

- A multipath device is created instead of a single path.
- Login sessions are cleaned up when the Pod is deleted.
- With multiple portals, if one path is unavailable, it retries 5 times:
```
iscsi: failed to sendtargets to portal 10.10.10.10:3260 output: iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connect to 10.10.10.10 timed out
iscsiadm: connection login retries (reopen_max) 5 exceeded
```
- With multiple portals, if only one portal is available, the Pod is stuck at 'ContainerCreating' waiting for multipath to be set up.
- If only one portal is specified, a single path is set up.
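The bounded-retry behaviour seen in that `reopen_max` log can be sketched as a loop. This is an illustration, not the open-iscsi source: `attempt_login` is a hypothetical stand-in for the real iscsiadm connection attempt, and the counter-based success condition is only there to make the sketch self-contained.

```shell
# Illustrative sketch of "retry the login up to 5 times, then give up",
# matching the reopen_max message in the verification output above.
login_with_retries() {
  local max_retries=5 attempt=0
  while [ "$attempt" -lt "$max_retries" ]; do
    if attempt_login; then
      echo "login succeeded on attempt $((attempt + 1))"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "connection login retries (reopen_max) $max_retries exceeded"
  return 1
}

# Stand-in for iscsiadm: here the portal "recovers" on the 3rd attempt.
ATTEMPTS=0
attempt_login() { ATTEMPTS=$((ATTEMPTS + 1)); [ "$ATTEMPTS" -ge 3 ]; }

login_with_retries   # → login succeeded on attempt 3
```

If the portal never recovers (every attempt fails), the loop exits after 5 tries with the "reopen_max exceeded" message, which matches the log captured during verification.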
Comment 34 errata-xmlrpc 2018-11-11 16:39:10 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2709