Bug 1317994

Summary: pods are unable to delete after storage endpoints were down
Product: OpenShift Container Platform
Reporter: Qian Cai <qcai>
Component: Storage
Assignee: Bradley Childs <bchilds>
Status: CLOSED CURRENTRELEASE
QA Contact: Liming Zhou <lizhou>
Severity: low
Priority: low
Version: 3.5.0
CC: agoldste, gouyang, jhou, jsafrane, lizhou, mmcgrath, qcai
Target Milestone: ---
Keywords: Extras
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-20 02:54:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Qian Cai 2016-03-15 17:29:45 UTC
Description of problem:
After creating pods using glusterfs/cephrbd volumes, if the endpoints went down for some reason, the pods could not be deleted.

# kubectl get pods
NAME            READY     STATUS        RESTARTS   AGE
cephrbd-pod     0/1       Terminating   0          1h
glusterfs-pod   0/1       Terminating   0          1h

The node still shows the volume mount points.

Version-Release number of selected component (if applicable):
kubernetes-1.2.0-0.9.alpha1.gitb57e8bd.el7.x86_64

How reproducible:
TBD

Steps to Reproduce:
1. Create pods using glusterfs/cephrbd volumes via PV/PVC.
2. Shut down the glusterfs/cephrbd endpoints.
3. Delete the pods.
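The steps above can be sketched as a shell session (a reproduction sketch, not from the original report; the manifest file names and the `glusterfs-cluster` endpoints name are hypothetical and depend on the cluster setup):

```shell
# 1. Create a pod backed by a glusterfs PV/PVC.
kubectl create -f glusterfs-pv.yaml
kubectl create -f glusterfs-pvc.yaml
kubectl create -f glusterfs-pod.yaml
kubectl get pods            # wait until glusterfs-pod is Running

# 2. Simulate the outage: stop the storage daemons on the servers,
#    or delete the endpoints object the volume points at.
kubectl delete endpoints glusterfs-cluster

# 3. Try to delete the pod; it gets stuck in Terminating because the
#    kubelet cannot unmount the now-unreachable volume.
kubectl delete pod glusterfs-pod
kubectl get pods            # STATUS stays "Terminating"
```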

Actual results:
Pods are stuck in Terminating status for hours.

Expected results:
Pods are deleted and the mount points on the node are gone.

Comment 2 Jan Chaloupka 2016-03-15 18:05:18 UTC
Jan, any idea what could be wrong? I believe this is expected behaviour. Once the endpoint is down, deletion requests (and other operations) are not forwarded to the daemons (the glusterfs/cephrbd receivers). Perhaps the pod worker just waits for the volume to get deleted/detached so the data inside it is stored/garbage-collected properly.

Comment 3 Jan Chaloupka 2016-03-15 18:07:12 UTC
CAI, does the pod get deleted once the endpoints are back online?

Comment 4 Qian Cai 2016-03-15 18:17:58 UTC
(In reply to Jan Chaloupka from comment #3)
> CAI, does the pod get deleted once the endpoints are back online?
The endpoints went down accidentally. I have to wait for someone to bring them back.

Comment 5 Jan Safranek 2016-03-16 08:39:06 UTC
Looking at the Gluster volume plugin, it just calls umount(), which probably does not finish in time. That should be fixed by the upcoming Attach and Mount controllers - they should let the pod die and periodically retry unmounting or detaching any stale volumes until it finally succeeds.
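For context, the hang can be inspected and cleared manually on the node; a workaround sketch (not from the original report), assuming the stale mount sits under the standard kubelet pod directory layout - the actual pod UID and volume name vary:

```shell
# List stale network-filesystem mounts under the kubelet pod directories.
mount | grep /var/lib/kubelet/pods

# A plain umount blocks indefinitely while the storage server is gone.
# A lazy unmount (umount -l) detaches the mount point from the tree
# immediately and finishes cleanup once the filesystem is no longer busy.
umount -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~glusterfs/<volume-name>
```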

Adding Sami to cc:.

Comment 6 Qian Cai 2016-03-16 14:39:58 UTC
Once this happened, it became a mess.

1) those zombie pods are just stuck there. They won't disappear even after restarting all daemons on the master/node and manually unmounting the volumes.

2) on the node: "systemctl stop kubelet" hangs at a point where it looks like it is trying to unmount the volumes.

3) rebooting the node hangs here:
[  OK  ] Stopped Create Static Device Nodes in /dev.
         Stopping Create Static Device Nodes in /dev...
[  OK  ] Stopped Remount Root and Kernel File Systems.
         Stopping Remount Root and Kernel File Systems...
[  OK  ] Reached target Shutdown.
[1366830.047161] libceph: connect 10.70.43.76:6804 error -101
[1366830.049112] libceph: osd4 10.70.43.76:6804 connect error

Even the soft reboot button in the OpenStack management UI won't recover the instance. I had to ask the OpenStack admin to fix it.
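For the record, a pod stuck like this can usually be removed from the API server by force; a workaround sketch, not from the original report (note that `--force` requires a newer kubectl than the 1.2 build in this report, and this only deletes the API object - it does not clean up the stale mounts on the node):

```shell
# Remove the pod objects immediately, without waiting for the kubelet
# to confirm volume teardown. The kubelet is skipped entirely, so any
# stale mounts remain on the node and must be handled separately.
kubectl delete pod cephrbd-pod glusterfs-pod --grace-period=0 --force
```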

Comment 7 Qian Cai 2016-03-28 15:09:47 UTC
I suspect this also affects NFS or iSCSI.

Comment 8 Guohua Ouyang 2016-04-29 12:32:56 UTC
https://github.com/kubernetes/kubernetes/issues/24819

It looks like they're the same problem.

Comment 9 Andy Goldstein 2016-06-24 13:55:20 UTC
Guohua, 24819 is a very specific issue that is unrelated to remote storage.

I believe the actual issue is https://github.com/kubernetes/kubernetes/issues/27463, which was recently fixed in Kubernetes upstream.

Comment 10 Jan Chaloupka 2016-08-08 13:56:12 UTC
CAI, can you try to reproduce it again and write down a regression test for it?

Comment 13 Qian Cai 2017-03-15 20:57:29 UTC
I am setting needinfo for the QA contact to answer the question in the comment #10.

Comment 14 Jianwei Hou 2017-03-16 07:04:18 UTC
Cannot reproduce on OCP v3.5.0.52. The pod was removed within its grace period when its endpoint was removed. Lowering severity/priority.

Comment 16 Liming Zhou 2017-03-20 02:54:04 UTC
Testing verified the issue is gone with OCP v3.5.0.52; the bug can be closed as CURRENTRELEASE.