Bug 1317994 - pods are unable to delete after storage endpoints were down
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Bradley Childs
QA Contact: Liming Zhou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-15 17:29 UTC by Qian Cai
Modified: 2017-03-20 02:54 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-20 02:54:04 UTC
Target Upstream Version:



Description Qian Cai 2016-03-15 17:29:45 UTC
Description of problem:
After creating pods that use glusterfs/cephrbd volumes, if the endpoints go down for some reason, the pods can no longer be deleted.

# kubectl get pods
NAME            READY     STATUS        RESTARTS   AGE
cephrbd-pod     0/1       Terminating   0          1h
glusterfs-pod   0/1       Terminating   0          1h

The node still shows the volume mount points.

Version-Release number of selected component (if applicable):
kubernetes-1.2.0-0.9.alpha1.gitb57e8bd.el7.x86_64

How reproducible:
TBD

Steps to Reproduce:
1. Create pods using glusterfs/cephrbd PV/PVC volumes.
2. Shut down the glusterfs/cephrbd endpoints.
3. Delete the pods.
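
The steps above could be driven with something like the following. This is a sketch, not the exact reproduction script: the manifest file names and pod names (gluster-pv.yaml, glusterfs-pod, and so on) are hypothetical, and the endpoint shutdown assumes shell access to the Gluster server.

```shell
# 1. Create the PV, PVC, and a pod that mounts the volume
#    (manifest names are hypothetical)
kubectl create -f gluster-pv.yaml
kubectl create -f gluster-pvc.yaml
kubectl create -f glusterfs-pod.yaml

# 2. Shut down the storage endpoint (run on the Gluster server)
systemctl stop glusterd

# 3. Try to delete the pod; it gets stuck in Terminating
kubectl delete pod glusterfs-pod
kubectl get pods    # glusterfs-pod   0/1   Terminating
```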

Actual results:
Pods are stuck in Terminating status for hours.

Expected results:
Pods are deleted and the mount points on the node are gone.

Comment 2 Jan Chaloupka 2016-03-15 18:05:18 UTC
Jan, any idea what could be wrong? I believe this is expected behaviour. Once the endpoint is down, requests for deletion (and other operations) are not forwarded to the daemons (or receivers of glusterfs/cephrbd). Maybe the pod worker (pod procedure) just waits for the volume to be deleted/detached so the data inside it is stored/garbage-collected properly.

Comment 3 Jan Chaloupka 2016-03-15 18:07:12 UTC
CAI, does the pod get deleted once the endpoints are back online?

Comment 4 Qian Cai 2016-03-15 18:17:58 UTC
(In reply to Jan Chaloupka from comment #3)
> CAI, does the pod get deleted once the endpoints are back online?
The endpoints went down accidentally. We have to wait for someone to bring them back.

Comment 5 Jan Safranek 2016-03-16 08:39:06 UTC
Looking at the Gluster volume plugin, it just calls umount(), which probably does not finish in a timely manner. That should be fixed by the upcoming Attach and Mount controllers: they should let the pod die and periodically try to unmount or detach any stale volumes until they finally succeed.

Adding Sami to cc:.

Comment 6 Qian Cai 2016-03-16 14:39:58 UTC
Once this happened, it became a mess.

1) Those zombie pods are just stuck there. They won't disappear even after restarting all daemons on the master/node and manually unmounting the volumes.

2) On the node, "systemctl stop kubelet" hangs at a point where it looks like it is trying to unmount the volumes.

3) Rebooting the node hangs here:
[  OK  ] Stopped Create Static Device Nodes in /dev.
         Stopping Create Static Device Nodes in /dev...
[  OK  ] Stopped Remount Root and Kernel File Systems.
         Stopping Remount Root and Kernel File Systems...
[  OK  ] Reached target Shutdown.
[1366830.047161] libceph: connect 10.70.43.76:6804 error -101
[1366830.049112] libceph: osd4 10.70.43.76:6804 connect error

Even the soft reboot button in the OpenStack management UI won't recover the instance. We had to ask the OpenStack admin to fix it.
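
For pods stuck in Terminating like this, a commonly documented workaround (not mentioned in this bug, and it does not clean up the stale mounts on the node) is to force-delete the pod with a zero grace period; the `--force` flag requires a reasonably recent kubectl. The mount path below is a hypothetical example.

```shell
# Force-delete the stuck pod without waiting for volume teardown
kubectl delete pod cephrbd-pod --grace-period=0 --force

# The stale mount on the node still has to be cleaned up manually,
# e.g. with a lazy unmount (the exact path varies per pod/volume)
umount -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~rbd/cephrbd-volume
```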

Comment 7 Qian Cai 2016-03-28 15:09:47 UTC
I suspect this also affects NFS and iSCSI.

Comment 8 Guohua Ouyang 2016-04-29 12:32:56 UTC
https://github.com/kubernetes/kubernetes/issues/24819

It looks like they're the same problem.

Comment 9 Andy Goldstein 2016-06-24 13:55:20 UTC
Guohua, 24819 is a very specific issue that is unrelated to remote storage.

I believe the actual issue is https://github.com/kubernetes/kubernetes/issues/27463, which was recently fixed in Kubernetes upstream.

Comment 10 Jan Chaloupka 2016-08-08 13:56:12 UTC
CAI, can you try to reproduce it again and write a regression test for it?

Comment 13 Qian Cai 2017-03-15 20:57:29 UTC
I am setting needinfo for the QA contact to answer the question in the comment #10.

Comment 14 Jianwei Hou 2017-03-16 07:04:18 UTC
Cannot reproduce on OCP v3.5.0.52. The pod was removed within its grace period after its endpoint was removed. Lowering severity/priority.

Comment 16 Liming Zhou 2017-03-20 02:54:04 UTC
The test verified the issue is gone in OCP v3.5.0.52; the bug can be closed as CURRENTRELEASE.

