Description of problem:
Create a pod that uses an NFS PV. After the NFS server goes down, delete the pod; new pods scheduled to that node then stay Pending.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Add an SCC that allows the user to create privileged pods.
   Update "#NAME#", "NS", "".
2. Create an NFS server:
   $ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml
3. After the NFS server pod starts, create a PV that uses the NFS server.
   Update the template, setting the server IP to the NFS server's service IP.
4. Create an app from the mysql-persistent template:
   $ oc new-app mysql-persistent
5. After the mysql pod starts successfully, note the node IP, then delete the NFS service and pod.
6. Delete the mysql deployment config and pod.
7. Create a new pod and check its status.
After step 6, all new pods scheduled to the node where the mysql pod ran stay in Pending status.
New pods should run normally.
After step 5, when the mysql pod is deleted, the NFS volume mount point still exists on the node:
172.30.60.105:/ on /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.2.1,local_lock=none,addr=172.30.60.105)
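A leftover mount like this can be spotted by scanning `mount` output for NFS entries under the pod volume directory whose pod is gone. A minimal Python sketch, assuming the `SOURCE on TARGET type FSTYPE (OPTIONS)` line format shown above and the `.../openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume>` layout seen on this node; `find_leftover_nfs_mounts` and `LeftoverMount` are hypothetical names, not part of OpenShift:

```python
from __future__ import annotations

import re
from typing import NamedTuple

# One line of `mount` output: "SOURCE on TARGET type FSTYPE (OPTIONS)".
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)

# Assumed layout of NFS pod volumes under the node's volume directory.
POD_NFS_PATH = re.compile(
    r"/openshift\.local\.volumes/pods/(?P<pod_uid>[^/]+)"
    r"/volumes/kubernetes\.io~nfs/(?P<volume>[^/]+)$"
)


class LeftoverMount(NamedTuple):
    server: str
    pod_uid: str
    volume: str
    target: str


def find_leftover_nfs_mounts(mount_output: str, live_pod_uids: set[str]) -> list[LeftoverMount]:
    """Return NFS pod-volume mounts whose pod UID is not among the live pods."""
    leftovers = []
    for line in mount_output.splitlines():
        m = MOUNT_LINE.match(line.strip())
        if not m or not m["fstype"].startswith("nfs"):
            continue  # not a mount line, or not an NFS mount
        p = POD_NFS_PATH.search(m["target"])
        if p and p["pod_uid"] not in live_pod_uids:
            leftovers.append(LeftoverMount(m["source"], p["pod_uid"], p["volume"], m["target"]))
    return leftovers
```

Passing the UIDs of pods that should still be on the node keeps their volumes out of the result, so only mounts orphaned by deleted pods are reported.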
Because the NFS server is down, we cannot unmount it successfully. But after running:
umount -l /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es
the node works normally again.
When the NFS server is down, unmount hangs until it times out. It probably takes a while (300 seconds) before the timeout happens.
I tried to reproduce this, and for me it hangs indefinitely (e.g. over the weekend). The pod that uses the PV is stuck in Terminating.
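The mount options in the `mount` line from the description are consistent with an indefinite hang. Per the nfs(5) man page, `timeo` is given in tenths of a second, and with `hard` (the default) the client retries NFS requests indefinitely, whereas a `soft` mount returns an error after `retrans` retries. A small Python sketch that summarizes those options; `nfs_retry_policy` is a hypothetical helper and ignores the retry backoff details nfs(5) describes:

```python
def nfs_retry_policy(options: str) -> dict:
    """Summarize retry behaviour from a comma-separated NFS mount option string."""
    opts = {}
    for item in options.split(","):
        key, _, value = item.partition("=")
        opts[key] = value
    # Per nfs(5), mounts are hard unless "soft" is given; hard mounts retry forever.
    hard = "soft" not in opts
    # timeo is expressed in tenths of a second.
    timeo_s = int(opts["timeo"]) / 10 if "timeo" in opts else None
    retrans = int(opts["retrans"]) if "retrans" in opts else None
    return {"hard": hard, "timeo_seconds": timeo_s, "retrans": retrans, "gives_up": not hard}
```

For the options on the leftover mount (`hard,...,timeo=600,retrans=2`) this reports a 60-second `timeo` on a hard mount, so requests to the dead server keep being retried instead of failing.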
If the NFS server is unreachable, should umount use -l and/or -f? Here is a kubelet.log: http://paste.fedoraproject.org/378985/46591229/ After a while I stop seeing sync loop entries.
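One way to act on this suggestion is to attempt a normal unmount with a deadline and fall back to a lazy unmount (`umount -l`, the command that freed the node in the description). This is a hypothetical Python sketch, not what kubelet or the upstream fix does; `unmount_with_fallback`, `run_command` and the 30-second default are assumptions, and the command runner is injectable so the logic can be exercised without root or an NFS server:

```python
import subprocess
from typing import Callable, List

# A runner takes (argv, timeout_seconds) and returns the exit code.
Runner = Callable[[List[str], float], int]


def run_command(cmd: List[str], timeout: float) -> int:
    """Run cmd; return its exit code, or raise subprocess.TimeoutExpired if it hangs."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Not waited for afterwards: a process blocked on a dead NFS server
        # may not exit promptly even after SIGKILL.
        proc.kill()
        raise


def unmount_with_fallback(path: str, timeout: float = 30.0, run: Runner = run_command) -> str:
    """Try a normal umount; on failure or timeout, detach the mount lazily."""
    try:
        if run(["umount", path], timeout) == 0:
            return "unmounted"
    except subprocess.TimeoutExpired:
        pass  # the normal unmount hung, fall through to the lazy unmount
    # Lazy unmount: detach from the hierarchy now, clean up once no longer busy.
    rc = run(["umount", "-l", path], timeout)
    return "lazily detached" if rc == 0 else "failed"
```

A lazy unmount only hides the mount point; any process still holding files on the dead share stays blocked, which is why it unblocks the node without truly cleaning up.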
(In reply to hchen from comment #1)
> when the nfs server is down, unmount will hang till timeout. It probably
> takes a while (300 seconds) till timeout happens.
Hi, do you have a PR that resolves this, since you changed the status to MODIFIED?
Opened an upstream issue: https://github.com/kubernetes/kubernetes/issues/27463
Fixed by upstream PR https://github.com/kubernetes/kubernetes/pull/26801
This has been merged and is in OSE v18.104.22.168 or newer.
I tested this on the following version:
The mysql pod stays in "Terminating" status, but I can create another pod:
[wehe@wehepc octest]$ oc get pods
NAME              READY     STATUS        RESTARTS   AGE
hello-openshift   1/1       Running       0          8m
mysql-1-dfotg     0/1       Terminating   0          49m
On the node, checking the mounted paths shows that the mount path still exists after the NFS server is deleted:
device 172.30.98.62:/ mounted on /var/lib/origin/openshift.local.volumes/pods/755640e5-5242-11e6-a3ee-fa163e5577f0/volumes/kubernetes.io~nfs/nfs with fstype nfs4 statvers=1.1
@email@example.com, I think this is not fully fixed: the mysql pod cannot be terminated within 300 seconds, and the mounted path does not disappear.
The Kubernetes fix ensures that new pod creation is not blocked by a dead mount. However, the dead mount path cannot be cleaned up because the NFS server is unreachable.
Verified on the following version:
This bug is fixed.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.