Bug 1564974

Summary: unknown status pod continues to mount storage
Product: OpenShift Container Platform
Component: Node
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Status: CLOSED WONTFIX
Reporter: Kenjiro Nakayama <knakayam>
Assignee: Seth Jennings <sjenning>
QA Contact: DeShuai Ma <dma>
CC: aos-bugs, jokerman, mmccomas, mori, pdwyer, sjenning
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Type: Bug
Regression: ---
Doc Type: If docs needed, set a value
Last Closed: 2018-04-10 01:56:03 UTC

Description Kenjiro Nakayama 2018-04-09 04:58:41 UTC
Description of problem:
- When an OpenShift node stops working, pods on that node enter "Unknown" status. After that, even though new pods are deployed on other nodes, the old "Unknown" pods continue to mount the storage.
- This happens even when the RWO access mode is used.

Version-Release number of selected component (if applicable):
- OCP 3.7

How reproducible: 100%

Steps to Reproduce:
1. Deploy pods with a PV whose access mode is RWO.
2. Stop the atomic-openshift-node service (to simulate an incident).
3. Pods evacuate to another node. (The Unknown pods will remain, but the new pods will be Running.)
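A minimal sketch of the manifests behind step 1 — the names, storage class, and image are hypothetical placeholders, not taken from this report (and OCP 3.7-era clusters would use `apps/v1beta1` for the Deployment):

```yaml
# Hypothetical RWO claim plus a single-replica Deployment that mounts it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwo-claim
spec:
  accessModes:
    - ReadWriteOnce        # exclusive: attachable to one node at a time
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1        # apps/v1beta1 on OCP 3.7-era clusters
kind: Deployment
metadata:
  name: rwo-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rwo-app
  template:
    metadata:
      labels:
        app: rwo-app
    spec:
      containers:
        - name: app
          image: registry.access.redhat.com/rhel7   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: rwo-claim
```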

Actual results:
- After step 3, the Unknown pods still mount the storage. Logging in with `docker exec` confirms the issue.

Expected results:
- Unknown pods should not mount the volume.

Additional info:
- This causes the new pod to fail to start if the storage does not allow mounting from multiple pods.

Comment 1 Seth Jennings 2018-04-09 19:16:23 UTC
This is expected behavior.

If the node process is down, the control plane has no way to cleanly terminate the pod.  This means the pod could still be using the storage.  The storage cannot be unmounted while the pod is using it, and the volume cannot be safely detached from the node while it is mounted.  The attach-detach controller will continue to ensure that the volume is attached to the node as long as the pod requiring that volume is still assigned to the node.

There are two ways to resolve the situation: 1) delete the pod or node explicitly (oc delete pod <podname> --grace-period=0 --force) or 2) bring the node back to acknowledge the deleted pod.
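As a sketch, the two recovery paths look like this (the pod name is a placeholder, and both commands assume access to the affected cluster and node):

```shell
# Option 1: force delete the stuck pod. Only safe once you know,
# out of band, that the old pod is really dead and the volume is idle.
oc delete pod <podname> --grace-period=0 --force

# Option 2: restore the node service so the kubelet can acknowledge
# the pod deletion and clean up the pod and its mounts itself.
systemctl start atomic-openshift-node
```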

Comment 2 Kenjiro Nakayama 2018-04-10 00:39:49 UTC
> There are two ways to resolve the situation: 1) delete the pod or node explicitly (oc delete pod <podname> --grace-period=0 --force) or 2) bring the node back to acknowledge the deleted pod.

We know how to recover from the issue. The reason we opened this ticket is that an application (with RWO storage) will not be able to fail over when the node process goes down. Is it not possible to fix this?

Comment 4 Seth Jennings 2018-04-10 01:56:03 UTC
I agree it is an unexpected and unfortunate limitation, but there is no safe way for OpenShift/Kubernetes to do this.

This upstream issue discusses it in greater detail:

If the pod was not using exclusive RWO storage, the RS/SS/DS would start a new pod with no issue.  However, the storage complicates things, as it can only be attached to one node at a time and will not detach as long as the old pod is scheduled to the node.

There is no way for OCP/Kube to know that the storage is in a consistent state and not in use if the node will not respond.  If it were to assume the pod is down, which it has no basis to assume, and detach the volume from the node, it could corrupt the data.

The admin must intervene with out-of-band knowledge that the pod on the old node is terminated and that the storage is otherwise in a consistent state, then force delete the pod to allow OpenShift to detach the storage from the current node and attach it to the node where the new pod lands.

This is an unfortunate reality of using RWO persistent volumes on OCP/Kube.

One caveat is if you are using cloud provider integration.  If a node is terminated, the node controller will notice and delete the node and all pods that were running on it, freeing up any attached volumes as well.

Comment 5 Kenjiro Nakayama 2018-04-11 00:35:49 UTC
Thank you. I'm sorry to bother you, but one more question. Taking a different approach: if we asked you to implement a restriction so that a new pod is not spawned when a pod becomes "Unknown", would that be possible?

We are asking because we do not want OpenShift to mount RO volume from 2 pods even if one pod is "unknown" status. The data corruption is what we would like to avoid.

Comment 6 Seth Jennings 2018-04-11 01:48:24 UTC
How would the data be corrupted if the volume is mounted read-only (assuming that is what you mean by RO)?

Comment 7 Kenjiro Nakayama 2018-04-11 02:33:36 UTC
I'm sorry, "RO volume" was not clear (actually a wrong expression). I meant an "RWO" persistent volume.

If we make the backend storage non-exclusive, pods can fail over even in Unknown status. However, that means whenever a pod enters "Unknown" status, there is a possibility that the PV (RWO) is mounted from multiple pods.

So, we would like to stop spawning new pods when a pod becomes Unknown.

Comment 8 Seth Jennings 2018-04-11 03:13:20 UTC
RWO volumes can never be mounted by multiple pods at the same time.  RWO is inherently exclusive.  If the volume is in use by a pod in Unknown state, the volume is still bound to that pod and cannot be used by a new pod.  A new pod might try to start using the same PVC and underlying PV, but it will fail to start, as the volume will be unable to attach to its node because it is bound to the node where the pod in Unknown state is running (or was running).
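In practice, the failure mode described above tends to surface like this (pod names are placeholders and the quoted event text is illustrative of the attach-detach controller's multi-attach error, not captured from this report):

```shell
# The old pod sits in Unknown; the replacement never leaves ContainerCreating:
oc get pods

# Describing the new pod shows the attach failure in its events, along the
# lines of: "Volume is already exclusively attached to one node and can't
# be attached to another"
oc describe pod <new-podname>
```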