Created attachment 1212846 [details]
master logs

Description of problem:

In 3.4 with Docker 1.12, we are seeing multiple instances where deleting a namespace leaves pods stuck in the Terminating state (the namespace is stuck in Terminating as well). On the node where the pod was running, docker ps and docker ps -a do not show the pod, but df -k still shows the tmpfs for the pod's secrets as mounted.

In the node log, the following messages are repeated while the pod is in the Terminating state:

Oct 21 08:37:23 ip-172-31-54-14 atomic-openshift-node: I1021 08:37:23.293664 97046 reconciler.go:299] MountVolume operation started for volume "kubernetes.io/secret/adae41bb-9701-11e6-98f4-029980ce4b73-default-token-vc74p" (spec.Name: "default-token-vc74p") to pod "adae41bb-9701-11e6-98f4-029980ce4b73" (UID: "adae41bb-9701-11e6-98f4-029980ce4b73"). Volume is already mounted to pod, but remount was requested.

Oct 21 08:37:23 ip-172-31-54-14 atomic-openshift-node: E1021 08:37:23.295758 97046 secret.go:182] Couldn't get secret clusterproject0/default-token-vc74p

Oct 21 08:37:23 ip-172-31-54-14 atomic-openshift-node: E1021 08:37:23.295944 97046 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/secret/adae41bb-9701-11e6-98f4-029980ce4b73-default-token-vc74p\" (\"adae41bb-9701-11e6-98f4-029980ce4b73\")" failed. No retries permitted until 2016-10-21 08:39:23.295881273 -0400 EDT (durationBeforeRetry 2m0s). Error: MountVolume.SetUp failed for volume "kubernetes.io/secret/adae41bb-9701-11e6-98f4-029980ce4b73-default-token-vc74p" (spec.Name: "default-token-vc74p") pod "adae41bb-9701-11e6-98f4-029980ce4b73" (UID: "adae41bb-9701-11e6-98f4-029980ce4b73") with: secrets "default-token-vc74p" not found

Oct 21 08:37:23 ip-172-31-54-14 atomic-openshift-node: E1021 08:37:23.312734 97046 event.go:199] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"pausepods389.147f57512c835453", GenerateName:"", Namespace:"clusterproject0", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"clusterproject0", Name:"pausepods389", UID:"adae41bb-9701-11e6-98f4-029980ce4b73", APIVersion:"v1", ResourceVersion:"17165", FieldPath:""}, Reason:"FailedMount", Message:"MountVolume.SetUp failed for volume \"kubernetes.io/secret/adae41bb-9701-11e6-98f4-029980ce4b73-default-token-vc74p\" (spec.Name: \"default-token-vc74p\") pod \"adae41bb-9701-11e6-98f4-029980ce4b73\" (UID: \"adae41bb-9701-11e6-98f4-029980ce4b73\") with: secrets \"default-token-vc74p\" not found", Source:api.EventSource{Component:"kubelet", Host:"ip-172-31-54-14.us-west-2.compute.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63612592008, nsec:951583827, loc:(*time.Location)(0xa29dc40)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63612650243, nsec:295868335, loc:(*time.Location)(0xa29dc40)}}, Count:493, Type:"Warning"}': 'events "pausepods389.147f57512c835453" is forbidden: unable to create new content in namespace clusterproject0 because it is being terminated.' (will not retry!)

oc get events on the namespace comes back empty - the events must already be gone.

Attaching the master and node logs.
Version-Release number of selected component (if applicable):
3.4.0.13 and docker-1.12.2-3.el7.x86_64

How reproducible:
Infrequent, but it has been seen several times in 3.4. Not reproducible on demand.

Steps to Reproduce:
1. Install a simple cluster: 1 master/etcd, 1 infra node, 2 application nodes.
2. Run the svt cluster-loader script (https://github.com/openshift/svt/tree/master/openshift_scalability) with the nodeVertical.yaml workload at 250 pods/node.
3. After the test completes, run: oc delete project -l purpose=test

Actual results:
The project never fully terminates. One or more pods are left in the Terminating state forever. The docker containers are terminated, but the tmpfs filesystem for the secrets is left mounted.

Expected results:
The project and all of its pods are deleted successfully.
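A minimal sketch of the checks used to confirm this state (the pod name below is a placeholder; run the oc commands as cluster-admin on the master and the rest as root on the node that ran the stuck pod):

# master: anything wedged in Terminating?
oc get projects | grep Terminating
oc get pods --all-namespaces | grep Terminating

# node: the container should be gone, but the secret tmpfs is still mounted
docker ps -a | grep <pod-name>
df -k | grep kubernetes.io~secret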
Created attachment 1212847 [details]
node log
Tried rebooting: the tmpfs mounts came back after the reboot.
Tried deleting the pods explicitly (oc delete pod): they stay Terminating.
Tried deleting the node the pods were running on: they stay Terminating.

So far, I've yet to find a way to get rid of them.
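For completeness, the usual manual escalation when a pod will not leave Terminating looks roughly like the following; this is a sketch only (in this report even explicit deletion did not help), --grace-period=0 asks the API to delete the object immediately without waiting for the kubelet, and the unmount path assumes the pod UID directory under /var/lib/kubelet/pods matches the stuck pod:

oc delete pod pausepods389 -n clusterproject0 --grace-period=0

# on the node, find the orphaned secret mount and unmount it by hand
mount | grep kubernetes.io~secret
umount /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/default-token-vc74p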
Avesh - do you want to take a stab at debugging this further? Pull in Paul if you need help in the code area.
Sure Derek.
Seeing this on occasion when running the e2es/Conformance on a live cluster as well.
I am running a 1 master, 1 node (on the same VM) kube cluster to reproduce this issue. When the allowed pod count on the node is 110 (the default), I cannot reproduce the issue. So I increased the allowed max pod count on the node to 250; however, beyond 200 pods, the pods get stuck in the ContainerCreating state and I see "too many open files" errors in the kubectl describe output. I am still looking into it.

[root@fedora24 ~]# kubectl get pods --namespace clusterproject0 | grep -i container
pausepods200   0/1       ContainerCreating   0          8m
pausepods201   0/1       ContainerCreating   0          8m
pausepods202   0/1       ContainerCreating   0          8m
pausepods203   0/1       ContainerCreating   0          8m
pausepods204   0/1       ContainerCreating   0          8m
pausepods205   0/1       ContainerCreating   0          8m
pausepods206   0/1       ContainerCreating   0          8m
pausepods207   0/1       ContainerCreating   0          8m
pausepods208   0/1       ContainerCreating   0          8m
pausepods209   0/1       ContainerCreating   0          8m
pausepods210   0/1       ContainerCreating   0          8m
pausepods211   0/1       ContainerCreating   0          8m
pausepods212   0/1       ContainerCreating   0          8m
pausepods213   0/1       ContainerCreating   0          8m
pausepods214   0/1       ContainerCreating   0          8m
pausepods215   0/1       ContainerCreating   0          8m
pausepods216   0/1       ContainerCreating   0          8m
pausepods217   0/1       ContainerCreating   0          8m
pausepods218   0/1       ContainerCreating   0          8m
pausepods219   0/1       ContainerCreating   0          8m
pausepods220   0/1       ContainerCreating   0          8m
pausepods221   0/1       ContainerCreating   0          8m
pausepods222   0/1       ContainerCreating   0          8m
pausepods223   0/1       ContainerCreating   0          8m
pausepods224   0/1       ContainerCreating   0          8m
pausepods225   0/1       ContainerCreating   0          8m
pausepods226   0/1       ContainerCreating   0          8m
pausepods227   0/1       ContainerCreating   0          8m
pausepods228   0/1       ContainerCreating   0          8m
pausepods229   0/1       ContainerCreating   0          8m
pausepods230   0/1       ContainerCreating   0          8m
pausepods231   0/1       ContainerCreating   0          8m
pausepods232   0/1       ContainerCreating   0          8m
pausepods233   0/1       ContainerCreating   0          8m
pausepods234   0/1       ContainerCreating   0          8m
pausepods235   0/1       ContainerCreating   0          8m
pausepods236   0/1       ContainerCreating   0          8m
pausepods237   0/1       ContainerCreating   0          8m
pausepods238   0/1       ContainerCreating   0          8m
pausepods239   0/1       ContainerCreating   0          8m

[root@fedora24 ~]# kubectl describe pods pausepods226 --namespace clusterproject0
Name:           pausepods226
Namespace:      clusterproject0
Node:           127.0.0.1/127.0.0.1
Start Time:     Wed, 26 Oct 2016 16:00:07 -0400
Labels:         name=pause-amd64
Status:         Pending
IP:
Controllers:    <none>
Containers:
  pause-amd64:
    Container ID:
    Image:              gcr.io/google_containers/pause-amd64:3.0
    Image ID:
    Port:               8080/TCP
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-f4ozb (ro)
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-f4ozb:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-f4ozb
QoS Class:      BestEffort
Tolerations:    <none>
Events:
  FirstSeen  LastSeen  Count  From                 SubObjectPath  Type     Reason      Message
  ---------  --------  -----  ----                 -------------  ----     ------      -------
  9m         9m        1      {default-scheduler }                Normal   Scheduled   Successfully assigned pausepods226 to 127.0.0.1
  7m         7m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"open /dev/mapper/docker-253:0-1050167-7684f5d9dd5228ba3b17edf448dde3b3f1d6ef10888db628815946ede0fb7ce8-init: too many open files\"}"
  8m         6m        2      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"open /var/lib/docker/image/devicemapper/imagedb/content/sha256/99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2: too many open files\"}"
  6m         6m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"open /var/lib/docker/containers/4dfc0842c78d6c2543d5428342a95b1b3efba09c44d85b08876b8e114eb8aa58/.tmp-hostconfig.json564561370: too many open files\"}"
  5m         5m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed for gcr.io/google_containers/pause-amd64:3.0, this may be because there are no credentials on this request. details: (Error pulling image (3.0) from gcr.io/google_containers/pause-amd64, failed to register layer: devmapper: Error activating devmapper device for '14f11f9004d8cd8c570af44c15db333088a687ee53e71beb6cbf13b567ee4fc4': devicemapper: Can't set cookie dm_task_set_cookie failed)"
  8m         4m        3      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed for gcr.io/google_containers/pause-amd64:3.0, this may be because there are no credentials on this request. details: (open /etc/docker/certs.d/gcr.io: too many open files)"
  2m         2m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed for gcr.io/google_containers/pause-amd64:3.0, this may be because there are no credentials on this request. details: (Error pulling image (3.0) from gcr.io/google_containers/pause-amd64, Server error: Status 0 while fetching image layer (bdb43c586e887b513a056722b50553727b255e3a3d9166f318632d4209963464))"
  1m         1m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"devmapper: Error saving transaction metadata: devmapper: Error creating metadata file: open /var/lib/docker/devicemapper/metadata/.tmp459530672: too many open files\"}"
  37s        37s       1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"containerd: container did not start before the specified timeout\"}"
  4m         23s       6      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with ImagePullBackOff: "Back-off pulling image \"gcr.io/google_containers/pause-amd64:3.0\""
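For reference, the "allowed max pod count" mentioned above is the kubelet --max-pods setting (e.g. --max-pods=250), and the "too many open files" errors typically mean the docker daemon's file-descriptor limit is too low for this pod density. A hedged sketch of one common way to raise the docker limits, assuming docker runs under systemd as docker.service (the drop-in path and the 1048576 values are illustrative, not what was actually used in this run):

cat > /etc/systemd/system/docker.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EOF
systemctl daemon-reload
systemctl restart docker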
Mike, it seems that I am able to reproduce this: some pods are stuck in the Terminating state and the namespace is also stuck in Terminating, and it has been quite some time now, though I am still monitoring my cluster to see if anything changes. However, I am not sure whether the way I arrived at this issue matches the way you did, so here is more detail. (I had increased the open-file limit on my VM to avoid the "too many open files" issue, but the kube cluster still did not recover from it. And yes, I am *NOT* using loopback; see the info here: https://paste.fedoraproject.org/461576/16648147/)

At that point there were 200 containers in the Running state and 40 containers in RunContainerError (their status seems to have changed from ContainerCreating to RunContainerError after I changed the open-files limit), and they appeared stuck on the "Back-off pulling image" event. I then deleted the namespace clusterproject0, but the namespace is still stuck in the Terminating state.

More logs:

# kubectl get ns
NAME              STATUS        AGE
clusterproject0   Terminating   2h
default           Active        2h
kube-system       Active        2h

# kubectl get pods --namespace clusterproject0
NAME           READY     STATUS        RESTARTS   AGE
pausepods210   0/1       Terminating   0          1h
pausepods211   0/1       Terminating   0          1h
pausepods224   0/1       Terminating   0          1h
pausepods234   0/1       Terminating   0          1h

Please note that these are pods which were stuck in ContainerCreating, then moved to RunContainerError, but never reached the Running state. There were 36 other pods in a similar state, but somehow these 4 are the ones still stuck in Terminating, which is keeping the namespace stuck too.

kubectl describe of one of the stuck pods:

[root@fedora24 ~]# kubectl describe pods pausepods210 --namespace clusterproject0
Name:           pausepods210
Namespace:      clusterproject0
Node:           127.0.0.1/127.0.0.1
Start Time:     Wed, 26 Oct 2016 16:00:07 -0400
Labels:         name=pause-amd64
Status:         Terminating (expires Wed, 26 Oct 2016 16:46:12 -0400)
Termination Grace Period:       30s
IP:
Controllers:    <none>
Containers:
  pause-amd64:
    Container ID:
    Image:              gcr.io/google_containers/pause-amd64:3.0
    Image ID:
    Port:               8080/TCP
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-f4ozb (ro)
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-f4ozb:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-f4ozb
QoS Class:      BestEffort
Tolerations:    <none>
No events.
Docker output at this stage:

[root@fedora24 ~]# docker ps -a
CONTAINER ID   IMAGE                                       COMMAND    CREATED          STATUS    PORTS   NAMES
82bc3a2dac6e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   13 minutes ago   Created           k8s_POD.1e570369_pausepods210_clusterproject0_b7e2e3e7-9bb6-11e6-a755-525400b5933e_6fcb2388
6d914e0c18ef   gcr.io/google_containers/pause-amd64:3.0    "/pause"   18 minutes ago   Dead              k8s_POD.1e570369_pausepods235_clusterproject0_c4d0bb89-9bb6-11e6-a755-525400b5933e_2a743fa3
4ebd3b167bf9   gcr.io/google_containers/pause-amd64:3.0    "/pause"   19 minutes ago   Dead              k8s_POD.1e570369_pausepods229_clusterproject0_c1ced268-9bb6-11e6-a755-525400b5933e_2a30e0bb
3b7ee28b077e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   20 minutes ago   Created           k8s_POD.1e570369_pausepods211_clusterproject0_b825bda4-9bb6-11e6-a755-525400b5933e_c6b07236
c8c504f94144   gcr.io/google_containers/pause-amd64:3.0    "/pause"   23 minutes ago   Created           k8s_POD.1e570369_pausepods234_clusterproject0_c497fedb-9bb6-11e6-a755-525400b5933e_cdabae77
11a28e257e00   gcr.io/google_containers/pause-amd64:3.0    "/pause"   24 minutes ago   Dead              k8s_POD.1e570369_pausepods227_clusterproject0_c12e01e7-9bb6-11e6-a755-525400b5933e_e80e767c
b0454c977ab1   gcr.io/google_containers/pause-amd64:3.0    "/pause"   24 minutes ago   Dead              k8s_POD.1e570369_pausepods202_clusterproject0_b56ae547-9bb6-11e6-a755-525400b5933e_302e436a
f4870e7052ac   gcr.io/google_containers/pause-amd64:3.0    "/pause"   25 minutes ago   Dead              k8s_POD.1e570369_pausepods234_clusterproject0_c497fedb-9bb6-11e6-a755-525400b5933e_c2def0e8
492c356f83a4   gcr.io/google_containers/pause-amd64:3.0    "/pause"   27 minutes ago   Dead              k8s_POD.1e570369_pausepods211_clusterproject0_b825bda4-9bb6-11e6-a755-525400b5933e_db8f7e08
36f3f24c46cc   gcr.io/google_containers/pause-amd64:3.0    "/pause"   28 minutes ago   Dead              k8s_POD.1e570369_pausepods230_clusterproject0_c2140dc0-9bb6-11e6-a755-525400b5933e_3895b3a5
b7ba24540a65   gcr.io/google_containers/pause-amd64:3.0    "/pause"   28 minutes ago   Dead              k8s_POD.1e570369_pausepods206_clusterproject0_b69826fb-9bb6-11e6-a755-525400b5933e_a8e745b8
224060f3a59b   gcr.io/google_containers/pause-amd64:3.0    "/pause"   30 minutes ago   Dead              k8s_POD.1e570369_pausepods200_clusterproject0_b4b3225d-9bb6-11e6-a755-525400b5933e_f3ed010a
5c91453e4281   gcr.io/google_containers/pause-amd64:3.0    "/pause"   31 minutes ago   Dead              k8s_POD.1e570369_pausepods202_clusterproject0_b56ae547-9bb6-11e6-a755-525400b5933e_bff96438
ac4059eb171e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   34 minutes ago   Dead              k8s_POD.1e570369_pausepods210_clusterproject0_b7e2e3e7-9bb6-11e6-a755-525400b5933e_c410fff2
2659be7bc17c   gcr.io/google_containers/pause-amd64:3.0    "/pause"   36 minutes ago   Created           k8s_POD.1e570369_pausepods224_clusterproject0_c0513dfa-9bb6-11e6-a755-525400b5933e_3dda34e4
9cd0356315ec   gcr.io/google_containers/pause-amd64:3.0    "/pause"   41 minutes ago   Dead              k8s_POD.1e570369_pausepods224_clusterproject0_c0513dfa-9bb6-11e6-a755-525400b5933e_0436fdaa
6ab7f460f8fa   gcr.io/google_containers/pause-amd64:3.0    "/pause"   42 minutes ago   Dead              k8s_POD.1e570369_pausepods227_clusterproject0_c12e01e7-9bb6-11e6-a755-525400b5933e_e875378d
fe392dcc2c1e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   43 minutes ago   Dead              k8s_POD.1e570369_pausepods236_clusterproject0_c502678e-9bb6-11e6-a755-525400b5933e_9b2bedae
551bcf302dab   gcr.io/google_containers/pause-amd64:3.0    "/pause"   43 minutes ago   Dead              k8s_POD.1e570369_pausepods202_clusterproject0_b56ae547-9bb6-11e6-a755-525400b5933e_87e32e5a
dc37e99abc6d   gcr.io/google_containers/pause-amd64:3.0    "/pause"   44 minutes ago   Dead              k8s_POD.1e570369_pausepods224_clusterproject0_c0513dfa-9bb6-11e6-a755-525400b5933e_b209b91b
fe4b686a7f7e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   49 minutes ago   Dead              k8s_POD.1e570369_pausepods229_clusterproject0_c1ced268-9bb6-11e6-a755-525400b5933e_34d881c4
0cabd92bc495   gcr.io/google_containers/pause-amd64:3.0    "/pause"   52 minutes ago   Dead              k8s_POD.1e570369_pausepods219_clusterproject0_bc225c07-9bb6-11e6-a755-525400b5933e_3de90916
9446dc4cfbe1   gcr.io/google_containers/pause-amd64:3.0    "/pause"   56 minutes ago   Dead              k8s_POD.1e570369_pausepods201_clusterproject0_b50c08ce-9bb6-11e6-a755-525400b5933e_27a5b168

I also see some tokens still mounted. I don't see the "Volume is already mounted to pod, but remount was requested." error in my kubelet logs. I will debug it more tomorrow. Anyway, although I can see the same symptom on my kube cluster, I am not sure we arrived at the issue in the same way.

Mike, do you know whether all pods were in the Running state when you deleted the namespace in your cluster?
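Not something from this thread, but a hedged sketch of the checks that usually narrow down why a namespace stays in Terminating; the pod UID prefix in the last command is taken from the docker container names above and is only illustrative:

# Is the namespace just waiting on its "kubernetes" finalizer?
kubectl get ns clusterproject0 -o yaml          # inspect spec.finalizers and status.phase

# Does the API still show a deletionTimestamp on the stuck pod?
kubectl get pod pausepods210 --namespace clusterproject0 -o yaml | grep -i deletion

# On the node: are the secret tmpfs mounts and pod directories still present?
mount | grep default-token-f4ozb
ls /var/lib/kubelet/pods/ | grep b7e2e3e7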
In some tests all pods were running. In others, 80% Running and 20% Error (builds failed by design). I've also seen it a couple times running the Conformance e2e tests - those put pods through a whole range of states.
Bumping the severity. Having a hard time getting through scale up/down tests without getting wedged.
May have been a config issue. Further testing underway. Re-setting sev to medium.
Mike, I have been trying to reproduce the issue again, but have not been able to for the past couple of days. It only happened for me once, the first time, and I have been analyzing those logs. In later runs I see similar messages in the logs, but the namespace still does not get stuck. In the most recent runs I am noticing that a namespace remains in the Terminating state for a while, but after some time the namespace is finally deleted along with all of its pods. So if you see the issue again, please wait for some time, and if the namespace is still not cleared, please collect all logs covering that whole period for analysis.
Created an upstream issue: https://github.com/kubernetes/kubernetes/issues/36069
Just to clarify again: the reason the namespace/pods got stuck on my setup is different from the original logs in this bz, as I don't see similar errors.
Not currently seeing this. Will reopen if it becomes a reproducible issue.