Split this bug off from 1337479.

Fixed in 1337479: If the unmount fails, the node no longer blocks.

Remaining issue: The mount location on the node isn't fully cleaned up if the NFS server is down.

----------------------------------

+++ This bug was initially created as a clone of Bug #1337479 +++

Description of problem:
Create a pod using the NFS PV. After the NFS server goes down, delete the pod; new pods scheduled to that node will stay pending.

Version-Release number of selected component (if applicable):
openshift v3.2.0.44
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1. Add an SCC to allow the user to create privileged pods:
   wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/authorization/scc/scc_super_template.yaml
   update the "#NAME#", "NS" placeholders
2. Create an NFS server:
   $ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml
3. After the NFS server pod starts up, create a PV using the NFS server:
   wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/image/db-templates/auto-nfs-pv.json
   update the template, setting the server IP to the NFS server service IP
4. Create an app using the mysql-persistent template:
   oc new-app mysql-persistent
5. After the mysql pod starts up successfully, note the node IP, then delete the NFS service and pod.
6. Delete the mysql deployment config and pod.
7. Create a new pod and check the pod status.

Actual results:
After step 6, all new pods scheduled to the node the mysql pod ran on stay in Pending status.

Expected results:
New pods should work well.

Additional info:
After step 5, when the mysql pod is deleted, the NFS volume mount point still exists on the node:

172.30.60.105:/ on /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.2.1,local_lock=none,addr=172.30.60.105)

Since the NFS server is down, we cannot unmount successfully with:

umount /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es

but after running:

umount -l /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es

the node starts working again.

--- Additional comment from on 2016-06-10 14:28:50 EDT ---

When the NFS server is down, unmount will hang until timeout. It probably takes a while (300 seconds) for the timeout to happen.

--- Additional comment from Matthew Wong on 2016-06-14 10:01:12 EDT ---

I tried to replicate this and for me it hangs indefinitely (e.g. over the weekend), and the pod that uses the PV is stuck terminating. If the NFS server is unreachable, should umount use -l and/or -f? Here is a kubelet.log: http://paste.fedoraproject.org/378985/46591229/ -- I stop seeing SyncLoop entries after a while.

--- Additional comment from Wang Haoran on 2016-06-14 22:20:53 EDT ---

(In reply to hchen from comment #1)
> When the NFS server is down, unmount will hang until timeout. It probably
> takes a while (300 seconds) for the timeout to happen.

Hi, do you have a PR to resolve this, since you changed the status to MODIFIED?
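For reference, a minimal sketch of the manual workaround described in the Additional info above, assuming the /var/lib/origin/openshift.local.volumes layout shown in the mount output; the 5-second probe timeout is an arbitrary choice and the loop is illustrative, not part of the product:

# Run on the affected node as root. A stat on a mount whose NFS server is
# unreachable hangs, so probe each kubelet NFS mount with a timeout and
# lazily detach the ones that do not respond.
for mnt in /var/lib/origin/openshift.local.volumes/pods/*/volumes/kubernetes.io~nfs/*; do
    [ -d "$mnt" ] || continue                  # skip if the glob matched nothing
    if ! timeout 5 stat -t "$mnt" >/dev/null 2>&1; then
        echo "lazy-unmounting stale NFS mount: $mnt"
        umount -l "$mnt"                       # detach now; kernel cleans up later
    fi
done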
--- Additional comment from on 2016-06-15 16:08:50 EDT ---

Opened an upstream issue: https://github.com/kubernetes/kubernetes/issues/27463

--- Additional comment from on 2016-06-16 08:29:55 EDT ---

Fixed by upstream PR https://github.com/kubernetes/kubernetes/pull/26801

--- Additional comment from Troy Dawson on 2016-07-22 15:55:11 EDT ---

This has been merged and is in OSE v3.3.0.9 or newer.

--- Additional comment from Wenqi He on 2016-07-25 05:29:14 EDT ---

I have tested this on the version below:
openshift v3.3.0.9
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

The mysql pod stays in "Terminating" status, but I can create another pod:

[wehe@wehepc octest]$ oc get pods
NAME              READY     STATUS        RESTARTS   AGE
hello-openshift   1/1       Running       0          8m
mysql-1-dfotg     0/1       Terminating   0          49m

On the node, checking the mounted path after the NFS server is deleted, the mount path still exists:

device 172.30.98.62:/ mounted on /var/lib/origin/openshift.local.volumes/pods/755640e5-5242-11e6-a3ee-fa163e5577f0/volumes/kubernetes.io~nfs/nfs with fstype nfs4 statvers=1.1
        opts:   rw,vers=4.0,rsize=524288,wsize=524288,namlen=255,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.1,local_lock=none
        age:    1722

@hchen, I think this is not fully fixed: the mysql pod cannot be terminated within 300 seconds, and the mounted path never disappears.

--- Additional comment from on 2016-07-25 10:13:11 EDT ---

The Kubernetes fix allows new pod creation to proceed without being blocked by the dead mount. However, the dead mount path cannot be cleaned up because the NFS server is unreachable.
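As a quick check of the behavior described in the previous comment (new pods run while the dead mount lingers), the leftover kubelet NFS mounts can be listed on the node and a throwaway pod scheduled there. This is a hedged sketch: the pod name, image, and node pinning below are illustrative and not taken from the bug report.

# On the node: list leftover NFS mounts created by the kubelet, and the same
# entries in mountstats (which include the "age" field quoted above).
grep kubernetes.io~nfs /proc/mounts
grep -A 2 kubernetes.io~nfs /proc/self/mountstats

# From a client: confirm new pods still schedule and run on that node.
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe
spec:
  nodeName: <node-with-dead-mount>   # pin to the affected node
  containers:
  - name: probe
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "3600"]
EOF
oc get pod probe -o wide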
This is being tracked upstream by https://github.com/kubernetes/kubernetes/issues/31272 'Hung volumes can wedge the kubelet'
It has been fixed for 1.5 and will be cherry-picked into 1.4: https://github.com/kubernetes/kubernetes/pull/35038
This has been merged into OCP and is in OCP v3.5.0.16 or newer.
Tested on the version below:
openshift v3.5.0.16+a26133a
kubernetes v1.5.2+43a9be4

After deleting all the pods, no pods stay stuck in Terminating. This bug is fixed. Thanks.
Still seeing pod/project stuck in terminating status.

# openshift version
openshift v3.5.0.20+87266c6
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# oc get projects | grep -i terminat
6sgm7     Terminating

# oc get pods --all-namespaces | grep -i terminat
6sgm7     mysql-1-p7x04     0/1     Terminating     1     58m

# oc describe pod mysql-1-p7x04 -n 6sgm7
Name:               mysql-1-p7x04
Namespace:          6sgm7
Security Policy:    restricted
Node:               qe-lxia-node-registry-router-1/10.240.0.11
Start Time:         Tue, 14 Feb 2017 23:50:06 -0500
Labels:             app=mysql-persistent
                    deployment=mysql-1
                    deploymentconfig=mysql
                    name=mysql
Status:             Terminating (expires Tue, 14 Feb 2017 23:59:10 -0500)
Termination Grace Period:  30s
IP:
Controllers:        ReplicationController/mysql-1
Containers:
  mysql:
    Container ID:   docker://668e73ad086d3d2cc0a307c0e7e5d971556b459ae1375f4f136af334cc2fa542
    Image:          registry.ops.openshift.com/rhscl/mysql-57-rhel7@sha256:3136b2989e331fecabfc0c482ca9112efa5aa08289494e844e47da2f71b3de95
    Image ID:       docker-pullable://registry.ops.openshift.com/rhscl/mysql-57-rhel7@sha256:3136b2989e331fecabfc0c482ca9112efa5aa08289494e844e47da2f71b3de95
    Port:           3306/TCP
    Limits:
      memory:       512Mi
    Requests:
      memory:       512Mi
    State:          Running
      Started:      Tue, 14 Feb 2017 23:57:48 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 14 Feb 2017 23:56:23 -0500
      Finished:     Tue, 14 Feb 2017 23:57:46 -0500
    Ready:          False
    Restart Count:  1
    Liveness:       tcp-socket :3306 delay=30s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec [/bin/sh -i -c MYSQL_PWD="$MYSQL_PASSWORD" mysql -h 127.0.0.1 -u $MYSQL_USER -D $MYSQL_DATABASE -e 'SELECT 1'] delay=5s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/lib/mysql/data from mysql-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-vlm5v (ro)
    Environment Variables:
      MYSQL_USER:           <set to the key 'database-user' in secret 'mysql'>
      MYSQL_PASSWORD:       <set to the key 'database-password' in secret 'mysql'>
      MYSQL_ROOT_PASSWORD:  <set to the key 'database-root-password' in secret 'mysql'>
      MYSQL_DATABASE:       sampledb
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  mysql-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mysql
    ReadOnly:   false
  default-token-vlm5v:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-vlm5v
QoS Class:      Burstable
Tolerations:    <none>
No events.
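To help narrow down reports like the one above, a diagnostic sketch for the node named in the describe output; the paths, the 5-second timeout, and the node service name are assumptions (atomic-openshift-node on OCP, origin-node on Origin), so adjust them to the environment:

# Get the pod UID to locate its volume directory on the node:
oc get pod mysql-1-p7x04 -n 6sgm7 -o jsonpath='{.metadata.uid}'

# On qe-lxia-node-registry-router-1: check whether the pod's NFS mount is dead.
# A stat that has to be cut off by the timeout points to an unreachable server.
POD_UID=<uid-from-previous-command>
grep "$POD_UID" /proc/mounts
timeout 5 stat /var/lib/origin/openshift.local.volumes/pods/$POD_UID/volumes/kubernetes.io~nfs/* \
  || echo "mount appears hung"

# Pull the node service logs around the unmount attempts:
journalctl -u atomic-openshift-node --since "1 hour ago" | grep -iE 'unmount|nfs'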