Split this bug off from 1337479.

Fixed in 1337479: If the unmount fails, the node no longer blocks.

Remaining issue: The mount location on the node isn't fully cleaned up if the NFS server is down.

----------------------------------

+++ This bug was initially created as a clone of Bug #1337479 +++

Description of problem:
Create a pod using the NFS PV. After the NFS server goes down, delete the pod; new pods scheduled to that node will stay pending.

Version-Release number of selected component (if applicable):
openshift v3.2.0.44
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1. Add an SCC to allow the user to create privileged pods:
   wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/authorization/scc/scc_super_template.yaml
   update the "#NAME#", "NS" placeholders
2. Create an NFS server:
   $ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml
3. After the NFS server pod starts up, create a PV using the NFS server:
   wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/image/db-templates/auto-nfs-pv.json
   update the template, setting the server IP to the NFS server service IP
4. Create an app using the mysql-persistent template:
   oc new-app mysql-persistent
5. After the mysql pod starts up successfully, note the node IP, then delete the NFS service and pod.
6. Delete the mysql deployment config and pod.
7. Create a new pod and check the pod status.

Actual results:
After step 6, all new pods scheduled to the node the mysql pod ran on stay in Pending status.

Expected results:
New pods should work well.

Additional info:
After step 5, when the mysql pod is deleted, the NFS volume mount point still exists on the node:

172.30.60.105:/ on /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.2.1,local_lock=none,addr=172.30.60.105)

Since the NFS server is down, we cannot unmount successfully with:

umount /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es

but after running:

umount -l /var/lib/origin/openshift.local.volumes/pods/bd4d801a-1d98-11e6-ad81-fa163e8f4eee/volumes/kubernetes.io~nfs/t95es

the node starts working again.

--- Additional comment from on 2016-06-10 14:28:50 EDT ---

When the NFS server is down, unmount will hang until timeout. It probably takes a while (300 seconds) for the timeout to happen.

--- Additional comment from Matthew Wong on 2016-06-14 10:01:12 EDT ---

I tried to replicate this and for me it hangs indefinitely (e.g. over the weekend), and the pod that uses the PV is stuck terminating. If the NFS server is unreachable, should umount use -l and/or -f? Here is a kubelet.log: http://paste.fedoraproject.org/378985/46591229/ -- I stop seeing SyncLoop entries after a while.

--- Additional comment from Wang Haoran on 2016-06-14 22:20:53 EDT ---

(In reply to hchen from comment #1)
> When the NFS server is down, unmount will hang until timeout. It probably
> takes a while (300 seconds) for the timeout to happen.

Hi, do you have a PR to resolve this, since you changed the status to MODIFIED?
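For reference, a minimal sketch of the manual workaround described in the Additional info above, assuming the /var/lib/origin/openshift.local.volumes layout shown in the mount output; the 5-second probe timeout is an arbitrary choice and the loop is illustrative, not part of the product:

# Run on the affected node as root. A stat on a mount whose NFS server is
# unreachable hangs, so probe each kubelet NFS mount with a timeout and
# lazily detach the ones that do not respond.
for mnt in /var/lib/origin/openshift.local.volumes/pods/*/volumes/kubernetes.io~nfs/*; do
    [ -d "$mnt" ] || continue                  # skip if the glob matched nothing
    if ! timeout 5 stat -t "$mnt" >/dev/null 2>&1; then
        echo "lazy-unmounting stale NFS mount: $mnt"
        umount -l "$mnt"                       # detach now; kernel cleans up later
    fi
done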
--- Additional comment from on 2016-06-15 16:08:50 EDT ---

Opened an upstream issue: https://github.com/kubernetes/kubernetes/issues/27463

--- Additional comment from on 2016-06-16 08:29:55 EDT ---

Fixed by upstream PR https://github.com/kubernetes/kubernetes/pull/26801

--- Additional comment from Troy Dawson on 2016-07-22 15:55:11 EDT ---

This has been merged and is in OSE v3.3.0.9 or newer.

--- Additional comment from Wenqi He on 2016-07-25 05:29:14 EDT ---

I have tested this on the version below:
openshift v3.3.0.9
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

The mysql pod stays in "Terminating" status, but I can create another pod:

[wehe@wehepc octest]$ oc get pods
NAME              READY     STATUS        RESTARTS   AGE
hello-openshift   1/1       Running       0          8m
mysql-1-dfotg     0/1       Terminating   0          49m

On the node, checking the mounted path after the NFS server is deleted, the mount path still exists:

device 172.30.98.62:/ mounted on /var/lib/origin/openshift.local.volumes/pods/755640e5-5242-11e6-a3ee-fa163e5577f0/volumes/kubernetes.io~nfs/nfs with fstype nfs4 statvers=1.1
        opts:   rw,vers=4.0,rsize=524288,wsize=524288,namlen=255,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.1.1,local_lock=none
        age:    1722

@hchen, I think this is not fully fixed: the mysql pod cannot be terminated within 300 seconds, and the mounted path never disappears.

--- Additional comment from on 2016-07-25 10:13:11 EDT ---

The Kubernetes fix allows new pod creation to proceed without being blocked by the dead mount. However, the dead mount path cannot be cleaned up because the NFS server is unreachable.
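As a quick check of the behavior described in the previous comment (new pods run while the dead mount lingers), the leftover kubelet NFS mounts can be listed on the node and a throwaway pod scheduled there. This is a hedged sketch: the pod name, image, and node pinning below are illustrative and not taken from the bug report.

# On the node: list leftover NFS mounts created by the kubelet, and the same
# entries in mountstats (which include the "age" field quoted above).
grep kubernetes.io~nfs /proc/mounts
grep -A 2 kubernetes.io~nfs /proc/self/mountstats

# From a client: confirm new pods still schedule and run on that node.
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe
spec:
  nodeName: <node-with-dead-mount>   # pin to the affected node
  containers:
  - name: probe
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "3600"]
EOF
oc get pod probe -o wide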
This is being tracked upstream by https://github.com/kubernetes/kubernetes/issues/31272 'Hung volumes can wedge the kubelet'
It has been fixed for 1.5 and will be cherry-picked into 1.4: https://github.com/kubernetes/kubernetes/pull/35038
This has been merged into OCP and is in OCP v3.5.0.16 or newer.
Tested on the version below:
openshift v3.5.0.16+a26133a
kubernetes v1.5.2+43a9be4

After deleting all the pods, no pods stay stuck in Terminating. This bug is fixed. Thanks.
Still seeing pod/project stuck in terminating status.

# openshift version
openshift v3.5.0.20+87266c6
kubernetes v1.5.2+43a9be4
etcd 3.1.0

# oc get projects | grep -i terminat
6sgm7     Terminating

# oc get pods --all-namespaces | grep -i terminat
6sgm7     mysql-1-p7x04     0/1     Terminating     1     58m

# oc describe pod mysql-1-p7x04 -n 6sgm7
Name:               mysql-1-p7x04
Namespace:          6sgm7
Security Policy:    restricted
Node:               qe-lxia-node-registry-router-1/10.240.0.11
Start Time:         Tue, 14 Feb 2017 23:50:06 -0500
Labels:             app=mysql-persistent
                    deployment=mysql-1
                    deploymentconfig=mysql
                    name=mysql
Status:             Terminating (expires Tue, 14 Feb 2017 23:59:10 -0500)
Termination Grace Period:  30s
IP:
Controllers:        ReplicationController/mysql-1
Containers:
  mysql:
    Container ID:   docker://668e73ad086d3d2cc0a307c0e7e5d971556b459ae1375f4f136af334cc2fa542
    Image:          registry.ops.openshift.com/rhscl/mysql-57-rhel7@sha256:3136b2989e331fecabfc0c482ca9112efa5aa08289494e844e47da2f71b3de95
    Image ID:       docker-pullable://registry.ops.openshift.com/rhscl/mysql-57-rhel7@sha256:3136b2989e331fecabfc0c482ca9112efa5aa08289494e844e47da2f71b3de95
    Port:           3306/TCP
    Limits:
      memory:       512Mi
    Requests:
      memory:       512Mi
    State:          Running
      Started:      Tue, 14 Feb 2017 23:57:48 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 14 Feb 2017 23:56:23 -0500
      Finished:     Tue, 14 Feb 2017 23:57:46 -0500
    Ready:          False
    Restart Count:  1
    Liveness:       tcp-socket :3306 delay=30s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec [/bin/sh -i -c MYSQL_PWD="$MYSQL_PASSWORD" mysql -h 127.0.0.1 -u $MYSQL_USER -D $MYSQL_DATABASE -e 'SELECT 1'] delay=5s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/lib/mysql/data from mysql-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-vlm5v (ro)
    Environment Variables:
      MYSQL_USER:           <set to the key 'database-user' in secret 'mysql'>
      MYSQL_PASSWORD:       <set to the key 'database-password' in secret 'mysql'>
      MYSQL_ROOT_PASSWORD:  <set to the key 'database-root-password' in secret 'mysql'>
      MYSQL_DATABASE:       sampledb
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  mysql-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mysql
    ReadOnly:   false
  default-token-vlm5v:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-vlm5v
QoS Class:      Burstable
Tolerations:    <none>
No events.
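To help narrow down reports like the one above, a diagnostic sketch for the node named in the describe output; the paths, the 5-second timeout, and the node service name are assumptions (atomic-openshift-node on OCP, origin-node on Origin), so adjust them to the environment:

# Get the pod UID to locate its volume directory on the node:
oc get pod mysql-1-p7x04 -n 6sgm7 -o jsonpath='{.metadata.uid}'

# On qe-lxia-node-registry-router-1: check whether the pod's NFS mount is dead.
# A stat that has to be cut off by the timeout points to an unreachable server.
POD_UID=<uid-from-previous-command>
grep "$POD_UID" /proc/mounts
timeout 5 stat /var/lib/origin/openshift.local.volumes/pods/$POD_UID/volumes/kubernetes.io~nfs/* \
  || echo "mount appears hung"

# Pull the node service logs around the unmount attempts:
journalctl -u atomic-openshift-node --since "1 hour ago" | grep -iE 'unmount|nfs'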