Bug 1337479 - After the NFS server loses its connection, deleting the pod that uses the NFS PV leaves new pods scheduled to that node pending forever
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: hchen
QA Contact: Wenqi He
URL:
Whiteboard:
Depends On:
Blocks: 1367161
 
Reported: 2016-05-19 09:49 UTC by Wang Haoran
Modified: 2016-09-27 09:32 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1367161 (view as bug list)
Environment:
Last Closed: 2016-09-27 09:32:47 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1933 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.3 Release Advisory 2016-09-27 13:24:36 UTC

Description Wang Haoran 2016-05-19 09:49:30 UTC
Description of problem:
Create a pod using an NFS PV. After the NFS server goes down, delete the pod; new pods scheduled to that node will stay pending.

Version-Release number of selected component (if applicable):
openshift v3.2.0.44
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5


How reproducible:
always

Steps to Reproduce:
1. Add an SCC to allow the user to create privileged pods
wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/authorization/scc/scc_super_template.yaml
update the "#NAME#" and "NS" placeholders
2. Create an NFS server
$oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml
3. After the NFS server pod starts up, create a PV using the NFS server
wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/image/db-templates/auto-nfs-pv.json
update the template so the server IP matches the NFS server's service IP
4. Create an app using the template mysql-persistent
oc new-app mysql-persistent
5. After the mysql pod starts up successfully, note the node IP, then delete the NFS service and pod
6. Delete the mysql deployment config and pod
7. Create a new pod and check its status

Actual results:
After step 6, all new pods scheduled to the node where the mysql pod ran stay in Pending status.

Expected results:
New pods should start normally.

Additional info:
After step 5, deleting the mysql pod leaves the NFS volume mount point on the node:
172.30.60.105:/ on /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.2.1,local_lock=none,addr=172.30.60.105)

and since the NFS server is down, we cannot unmount it successfully with:
umount /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es

but after running:

umount -l /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es

the node works normally again.
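The lazy unmount detaches the dead mount immediately instead of waiting on the unreachable server. A minimal sketch of finding the affected NFS mounts under the kubelet volumes directory, using the mount entry quoted above as sample input (on a real node you would read /proc/mounts itself):

```shell
# Sample /proc/mounts-style entry taken from this report; on an affected
# node, replace the string below with the real contents of /proc/mounts.
mounts='172.30.60.105:/ /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es nfs4 rw,hard,proto=tcp 0 0'

# Pick out NFS mounts under the kubelet volumes directory
# (field 2 is the mount point, field 3 the filesystem type).
stale=$(printf '%s\n' "$mounts" | awk '$3 ~ /^nfs/ && $2 ~ /openshift\.local\.volumes/ {print $2}')
echo "$stale"

# For each such path, a lazy unmount detaches it even though the server is gone:
# umount -l "$stale"
```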

Comment 1 hchen 2016-06-10 18:28:50 UTC
When the NFS server is down, unmount hangs until it times out. That probably takes a while (around 300 seconds).
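The hang follows from the mount options quoted in the description: the volume is mounted hard, so the client retries RPCs indefinitely while the server is unreachable. A quick sketch of pulling the retry-related options out of such an option string (sample input copied from the mount entry in this report):

```shell
# Option string from the NFS mount entry quoted in the bug description.
opts='rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys'

# Split on commas and keep only the options that govern retry behaviour.
# "hard" means RPCs are retried forever; timeo is in tenths of a second.
retry_opts=$(printf '%s\n' "$opts" | tr ',' '\n' | grep -E '^(hard|soft|timeo=|retrans=)')
echo "$retry_opts"
```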

Comment 2 Matthew Wong 2016-06-14 14:01:12 UTC
I tried to replicate this, and for me it hangs indefinitely (e.g. over the weekend), and the pod that uses the PV is stuck terminating.

If the NFS server is unreachable, shouldn't umount use -l and/or -f? Here is a kubelet.log: http://paste.fedoraproject.org/378985/46591229/ ; I stop seeing syncloop entries after a while.

Comment 3 Wang Haoran 2016-06-15 02:20:53 UTC
(In reply to hchen from comment #1)
> when the nfs server is down, unmount will hang till timeout. It probably
> takes a while (300 seconds) till timeout happens.

Hi, do you have a PR to resolve this, since you changed the status to MODIFIED?

Comment 4 hchen 2016-06-15 20:08:50 UTC
opened an upstream issue https://github.com/kubernetes/kubernetes/issues/27463

Comment 5 hchen 2016-06-16 12:29:55 UTC
fixed by upstream PR https://github.com/kubernetes/kubernetes/pull/26801

Comment 6 Troy Dawson 2016-07-22 19:55:11 UTC
This has been merged and is in OSE v3.3.0.9 or newer.

Comment 7 Wenqi He 2016-07-25 09:29:14 UTC
I have tested this on the versions below:
openshift v3.3.0.9
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

The mysql pod stays in "Terminating" status, but I can create another pod:
[wehe@wehepc octest]$ oc get pods
NAME              READY     STATUS        RESTARTS   AGE
hello-openshift   1/1       Running       0          8m
mysql-1-dfotg     0/1       Terminating   0          49m

On the node, after the NFS server is deleted, the mount path still exists:

device 172.30.98.62:/ mounted on /var/lib/origin/openshift.local.volumes/pods/755640e5-5242-11e6-a3ee-fa163e5577f0/volumes/kubernetes.io~nfs/nfs with fstype nfs4 statvers=1.1
        opts:   rw,vers=4.0,rsize=524288,wsize=524288,namlen=255,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.1,local_lock=none
        age:    1722

@hchen@redhat.com, I think this is not fully fixed: the mysql pod cannot be terminated even after 300 seconds, and the mount path never disappears.

Comment 8 hchen 2016-07-25 14:13:11 UTC
The Kubernetes fix allows new pod creation to proceed without being blocked by the dead mount. However, the dead mount path cannot be cleaned up because the NFS server is unreachable.
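The leftover dead mount can still be detected without hanging an admin shell by bounding the filesystem probe with timeout(1), so a hard-mounted, unreachable NFS path cannot block the check. A minimal sketch, assuming GNU coreutils; /proc stands in for the mount point so the example runs anywhere (on an affected node it would be the kubernetes.io~nfs volume path):

```shell
# Probe a mount point with a hard time bound so the check itself cannot hang
# on a dead NFS server. /proc is a stand-in path for illustration; on an
# affected node you would use the kubernetes.io~nfs volume directory instead.
mountpoint=/proc
if timeout 5 stat -t "$mountpoint" >/dev/null 2>&1; then
    status=responsive
else
    status=stale
fi
echo "$status"    # a dead hard NFS mount would print "stale" here
```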

Comment 10 Wenqi He 2016-08-16 10:39:25 UTC
Verified on the versions below:
openshift v3.3.0.19
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

This bug is fixed.

Comment 12 errata-xmlrpc 2016-09-27 09:32:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933

