Description of problem:
Create a pod using an NFS PV. After the NFS server goes down, delete the pod; any new pod scheduled to that node will stay Pending.

Version-Release number of selected component (if applicable):
openshift v3.2.0.44
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1. Add an SCC to allow the user to create privileged pods:
$ wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/authorization/scc/scc_super_template.yaml
and update the "#NAME#" and "NS" fields.
2. Create an NFS server:
$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml
3. After the NFS server pod starts up, create a PV backed by it:
$ wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/image/db-templates/auto-nfs-pv.json
and update the template, setting the server address to the NFS server's service IP.
4. Create an app from the mysql-persistent template:
$ oc new-app mysql-persistent
5. After the mysql pod starts up successfully, note which node it runs on, then delete the NFS service and pod.
6. Delete the mysql deployment config and pod.
7. Create a new pod and check its status.

Actual results:
After step 6, all new pods scheduled to the node where the mysql pod ran stay in Pending status.

Expected results:
New pods should work normally.

Additional info:
After the mysql pod is deleted in step 6, the NFS volume mount point still exists on the node:

172.30.60.105:/ on /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.2.1,local_lock=none,addr=172.30.60.105)

Because the NFS server is down, we cannot unmount it with:

umount /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es

but after running:

umount -l /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es

the node works normally again.
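For anyone hitting this before a fix lands, a minimal recovery sketch (the <pod-uid> and <volume-name> placeholders are hypothetical; list the stale mounts on the affected node first to get the real paths):

# list the NFS mounts still present on the node
$ mount -t nfs4

# a plain umount hangs while the server is unreachable; try a forced
# unmount first, then fall back to a lazy unmount, which detaches the
# path immediately (replace the placeholders with real values)
$ umount -f /var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name> \
  || umount -l /var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>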
When the NFS server is down, unmount hangs until it times out. That probably takes a while (around 300 seconds).
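To observe the hang without tying up a shell indefinitely, the plain umount can be bounded with timeout (a sketch, assuming coreutils timeout is available on the node; the path is illustrative):

# kills the umount after 30 seconds instead of waiting on the dead server;
# timeout exits with status 124 when it had to kill the command
$ timeout 30 umount /var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>
$ echo $?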
I tried to replicate this and for me it hangs indefinitely (e.g. over the weekend), and the pod that uses the PV is stuck terminating. If the NFS server is unreachable, shouldn't umount use -l and/or -f? Here is a kubelet log: http://paste.fedoraproject.org/378985/46591229/ I stop seeing SyncLoop entries in it after a while.
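For anyone checking their own nodes, the SyncLoop messages come from the kubelet embedded in the node service, so they can be pulled from the journal (a sketch; the unit name is an assumption and varies by install, e.g. origin-node on Origin vs. atomic-openshift-node on OSE):

# unit name depends on the install; adjust as needed
$ journalctl -u atomic-openshift-node --since "1 hour ago" | grep -i syncloop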
(In reply to hchen from comment #1)
> when the nfs server is down, unmount will hang till timeout. It probably
> takes a while (300 seconds) till timeout happens.

Hi, do you have a PR to resolve this, since you changed the status to MODIFIED?
Opened an upstream issue: https://github.com/kubernetes/kubernetes/issues/27463
Fixed by upstream PR https://github.com/kubernetes/kubernetes/pull/26801
This has been merged and is in OSE v3.3.0.9 or newer.
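Before re-testing, the running build can be confirmed directly (a sketch; both commands are part of the 3.x tooling and print the versions quoted in this bug):

$ openshift version
$ oc version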
I have tested this on the version below:

openshift v3.3.0.9
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

The mysql pod stays in "Terminating" status, but I can create another pod:

[wehe@wehepc octest]$ oc get pods
NAME              READY     STATUS        RESTARTS   AGE
hello-openshift   1/1       Running       0          8m
mysql-1-dfotg     0/1       Terminating   0          49m

On the node, after the NFS server is deleted, the mount path still exists:

device 172.30.98.62:/ mounted on /var/lib/origin/openshift.local.volumes/pods/755640e5-5242-11e6-a3ee-fa163e5577f0/volumes/kubernetes.io~nfs/nfs with fstype nfs4 statvers=1.1
 opts: rw,vers=4.0,rsize=524288,wsize=524288,namlen=255,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.1,local_lock=none
 age: 1722

@hchen, I think this is not fully fixed, because the mysql pod cannot be terminated within 300 seconds and the mount path never disappears.
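For the record, the per-mount details quoted above (opts and age) are the kernel's NFS mount statistics; a sketch of how to read the same fields on the node:

# human-formatted per-mount NFS options
$ nfsstat -m

# or grep the raw source the kubelet mounts appear in
$ grep -A3 "kubernetes.io~nfs" /proc/self/mountstats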
The Kubernetes fix allows new pod creation to proceed instead of being blocked by the dead mount. However, the dead mount path itself cannot be cleaned up because the NFS server is unreachable.
Verified on the version below:

openshift v3.3.0.19
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

This bug is fixed.
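For completeness, a verification sketch following the original reproduction steps (names follow the mysql-persistent template; the pod suffix is hypothetical, and hello-openshift is just a convenient test image):

# with the NFS server service and pod already deleted:
$ oc delete dc mysql
$ oc delete pod mysql-1-<suffix>

# a fresh pod on the same node should now schedule and run
$ oc run hello-openshift --image=openshift/hello-openshift
$ oc get pods -o wide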
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1933