Created attachment 1212846 [details]
master logs

Description of problem:

In 3.4 with Docker 1.12, we are seeing multiple instances where deleting a namespace leaves pods stuck in the Terminating state (the namespace is stuck in Terminating as well). On the node where the pod was running, docker ps and docker ps -a do not show the pod, but df -k still shows the tmpfs for the pod's secrets as mounted.

In the node log, the following messages are repeated while the pod is in the Terminating state:

Oct 21 08:37:23 ip-172-31-54-14 atomic-openshift-node: I1021 08:37:23.293664 97046 reconciler.go:299] MountVolume operation started for volume "kubernetes.io/secret/adae41bb-9701-11e6-98f4-029980ce4b73-default-token-vc74p" (spec.Name: "default-token-vc74p") to pod "adae41bb-9701-11e6-98f4-029980ce4b73" (UID: "adae41bb-9701-11e6-98f4-029980ce4b73"). Volume is already mounted to pod, but remount was requested.

Oct 21 08:37:23 ip-172-31-54-14 atomic-openshift-node: E1021 08:37:23.295758 97046 secret.go:182] Couldn't get secret clusterproject0/default-token-vc74p

Oct 21 08:37:23 ip-172-31-54-14 atomic-openshift-node: E1021 08:37:23.295944 97046 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/secret/adae41bb-9701-11e6-98f4-029980ce4b73-default-token-vc74p\" (\"adae41bb-9701-11e6-98f4-029980ce4b73\")" failed. No retries permitted until 2016-10-21 08:39:23.295881273 -0400 EDT (durationBeforeRetry 2m0s). Error: MountVolume.SetUp failed for volume "kubernetes.io/secret/adae41bb-9701-11e6-98f4-029980ce4b73-default-token-vc74p" (spec.Name: "default-token-vc74p") pod "adae41bb-9701-11e6-98f4-029980ce4b73" (UID: "adae41bb-9701-11e6-98f4-029980ce4b73") with: secrets "default-token-vc74p" not found

Oct 21 08:37:23 ip-172-31-54-14 atomic-openshift-node: E1021 08:37:23.312734 97046 event.go:199] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"pausepods389.147f57512c835453", GenerateName:"", Namespace:"clusterproject0", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"clusterproject0", Name:"pausepods389", UID:"adae41bb-9701-11e6-98f4-029980ce4b73", APIVersion:"v1", ResourceVersion:"17165", FieldPath:""}, Reason:"FailedMount", Message:"MountVolume.SetUp failed for volume \"kubernetes.io/secret/adae41bb-9701-11e6-98f4-029980ce4b73-default-token-vc74p\" (spec.Name: \"default-token-vc74p\") pod \"adae41bb-9701-11e6-98f4-029980ce4b73\" (UID: \"adae41bb-9701-11e6-98f4-029980ce4b73\") with: secrets \"default-token-vc74p\" not found", Source:api.EventSource{Component:"kubelet", Host:"ip-172-31-54-14.us-west-2.compute.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63612592008, nsec:951583827, loc:(*time.Location)(0xa29dc40)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63612650243, nsec:295868335, loc:(*time.Location)(0xa29dc40)}}, Count:493, Type:"Warning"}': 'events "pausepods389.147f57512c835453" is forbidden: unable to create new content in namespace clusterproject0 because it is being terminated.' (will not retry!)

oc get events on the namespace comes back empty - the events must already be gone.

Attaching the master and node logs.
Version-Release number of selected component (if applicable):
3.4.0.13 and docker-1.12.2-3.el7.x86_64

How reproducible:
Infrequent, but it has been seen several times in 3.4. Not reproducible on demand.

Steps to Reproduce:
1. Install a simple cluster: 1 master/etcd, 1 infra node, 2 application nodes.
2. Run the svt cluster-loader script (https://github.com/openshift/svt/tree/master/openshift_scalability) with the nodeVertical.yaml workload at 250 pods/node.
3. After the test completes, run: oc delete project -l purpose=test

Actual results:
The project never fully terminates. One or more pods are left in the Terminating state forever. The docker containers are terminated, but the tmpfs filesystem for the secrets is left mounted.

Expected results:
The project and all of its pods are deleted successfully.
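A minimal sketch of the checks used to confirm this state (the pod name below is a placeholder; run the oc commands as cluster-admin on the master and the rest as root on the node that ran the stuck pod):

# master: anything wedged in Terminating?
oc get projects | grep Terminating
oc get pods --all-namespaces | grep Terminating

# node: the container should be gone, but the secret tmpfs is still mounted
docker ps -a | grep <pod-name>
df -k | grep kubernetes.io~secret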
Created attachment 1212847 [details]
node log
Tried rebooting: the tmpfs mounts came back after the reboot.
Tried deleting the pods explicitly (oc delete pod): they stay Terminating.
Tried deleting the node the pods were running on: they stay Terminating.

So far, I've yet to find a way to get rid of them.
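For completeness, the usual manual escalation when a pod will not leave Terminating looks roughly like the following; this is a sketch only (in this report even explicit deletion did not help), --grace-period=0 asks the API to delete the object immediately without waiting for the kubelet, and the unmount path assumes the pod UID directory under /var/lib/kubelet/pods matches the stuck pod:

oc delete pod pausepods389 -n clusterproject0 --grace-period=0

# on the node, find the orphaned secret mount and unmount it by hand
mount | grep kubernetes.io~secret
umount /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/default-token-vc74p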
Avesh - do you want to take a stab at debugging this further? Pull in Paul if you need help in the code area.
Sure Derek.
Seeing this on occasion when running the e2es/Conformance on a live cluster as well.
I am running a 1 master, 1 node (on the same VM) kube cluster to reproduce this issue. When the allowed pod count on the node is 110 (the default), I cannot reproduce the issue. So I increased the allowed max pod count on the node to 250; however, beyond 200 pods, the pods get stuck in the ContainerCreating state and I see "too many open files" errors in the kubectl describe output. I am still looking into it.

[root@fedora24 ~]# kubectl get pods --namespace clusterproject0 | grep -i container
pausepods200   0/1       ContainerCreating   0          8m
pausepods201   0/1       ContainerCreating   0          8m
pausepods202   0/1       ContainerCreating   0          8m
pausepods203   0/1       ContainerCreating   0          8m
pausepods204   0/1       ContainerCreating   0          8m
pausepods205   0/1       ContainerCreating   0          8m
pausepods206   0/1       ContainerCreating   0          8m
pausepods207   0/1       ContainerCreating   0          8m
pausepods208   0/1       ContainerCreating   0          8m
pausepods209   0/1       ContainerCreating   0          8m
pausepods210   0/1       ContainerCreating   0          8m
pausepods211   0/1       ContainerCreating   0          8m
pausepods212   0/1       ContainerCreating   0          8m
pausepods213   0/1       ContainerCreating   0          8m
pausepods214   0/1       ContainerCreating   0          8m
pausepods215   0/1       ContainerCreating   0          8m
pausepods216   0/1       ContainerCreating   0          8m
pausepods217   0/1       ContainerCreating   0          8m
pausepods218   0/1       ContainerCreating   0          8m
pausepods219   0/1       ContainerCreating   0          8m
pausepods220   0/1       ContainerCreating   0          8m
pausepods221   0/1       ContainerCreating   0          8m
pausepods222   0/1       ContainerCreating   0          8m
pausepods223   0/1       ContainerCreating   0          8m
pausepods224   0/1       ContainerCreating   0          8m
pausepods225   0/1       ContainerCreating   0          8m
pausepods226   0/1       ContainerCreating   0          8m
pausepods227   0/1       ContainerCreating   0          8m
pausepods228   0/1       ContainerCreating   0          8m
pausepods229   0/1       ContainerCreating   0          8m
pausepods230   0/1       ContainerCreating   0          8m
pausepods231   0/1       ContainerCreating   0          8m
pausepods232   0/1       ContainerCreating   0          8m
pausepods233   0/1       ContainerCreating   0          8m
pausepods234   0/1       ContainerCreating   0          8m
pausepods235   0/1       ContainerCreating   0          8m
pausepods236   0/1       ContainerCreating   0          8m
pausepods237   0/1       ContainerCreating   0          8m
pausepods238   0/1       ContainerCreating   0          8m
pausepods239   0/1       ContainerCreating   0          8m

[root@fedora24 ~]# kubectl describe pods pausepods226 --namespace clusterproject0
Name:           pausepods226
Namespace:      clusterproject0
Node:           127.0.0.1/127.0.0.1
Start Time:     Wed, 26 Oct 2016 16:00:07 -0400
Labels:         name=pause-amd64
Status:         Pending
IP:
Controllers:    <none>
Containers:
  pause-amd64:
    Container ID:
    Image:              gcr.io/google_containers/pause-amd64:3.0
    Image ID:
    Port:               8080/TCP
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-f4ozb (ro)
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-f4ozb:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-f4ozb
QoS Class:      BestEffort
Tolerations:    <none>
Events:
  FirstSeen  LastSeen  Count  From                 SubObjectPath  Type     Reason      Message
  ---------  --------  -----  ----                 -------------  ----     ------      -------
  9m         9m        1      {default-scheduler }                Normal   Scheduled   Successfully assigned pausepods226 to 127.0.0.1
  7m         7m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"open /dev/mapper/docker-253:0-1050167-7684f5d9dd5228ba3b17edf448dde3b3f1d6ef10888db628815946ede0fb7ce8-init: too many open files\"}"
  8m         6m        2      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"open /var/lib/docker/image/devicemapper/imagedb/content/sha256/99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2: too many open files\"}"
  6m         6m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"open /var/lib/docker/containers/4dfc0842c78d6c2543d5428342a95b1b3efba09c44d85b08876b8e114eb8aa58/.tmp-hostconfig.json564561370: too many open files\"}"
  5m         5m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed for gcr.io/google_containers/pause-amd64:3.0, this may be because there are no credentials on this request. details: (Error pulling image (3.0) from gcr.io/google_containers/pause-amd64, failed to register layer: devmapper: Error activating devmapper device for '14f11f9004d8cd8c570af44c15db333088a687ee53e71beb6cbf13b567ee4fc4': devicemapper: Can't set cookie dm_task_set_cookie failed)"
  8m         4m        3      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed for gcr.io/google_containers/pause-amd64:3.0, this may be because there are no credentials on this request. details: (open /etc/docker/certs.d/gcr.io: too many open files)"
  2m         2m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed for gcr.io/google_containers/pause-amd64:3.0, this may be because there are no credentials on this request. details: (Error pulling image (3.0) from gcr.io/google_containers/pause-amd64, Server error: Status 0 while fetching image layer (bdb43c586e887b513a056722b50553727b255e3a3d9166f318632d4209963464))"
  1m         1m        1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"devmapper: Error saving transaction metadata: devmapper: Error creating metadata file: open /var/lib/docker/devicemapper/metadata/.tmp459530672: too many open files\"}"
  37s        37s       1      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"containerd: container did not start before the specified timeout\"}"
  4m         23s       6      {kubelet 127.0.0.1}                 Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "POD" with ImagePullBackOff: "Back-off pulling image \"gcr.io/google_containers/pause-amd64:3.0\""
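For reference, the "allowed max pod count" mentioned above is the kubelet --max-pods setting (e.g. --max-pods=250), and the "too many open files" errors typically mean the docker daemon's file-descriptor limit is too low for this pod density. A hedged sketch of one common way to raise the docker limits, assuming docker runs under systemd as docker.service (the drop-in path and the 1048576 values are illustrative, not what was actually used in this run):

cat > /etc/systemd/system/docker.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EOF
systemctl daemon-reload
systemctl restart docker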
Mike, it seems that I am able to reproduce this: some pods are stuck in the Terminating state and the namespace is also stuck in Terminating, and it has been quite some time now, though I am still monitoring my cluster to see if anything changes. However, I am not sure whether the way I arrived at this issue matches the way you did, so here is more detail. (I had increased the open-file limit on my VM to avoid the "too many open files" issue, but the kube cluster still did not recover from it. And yes, I am *NOT* using loopback; see the info here: https://paste.fedoraproject.org/461576/16648147/)

At that point there were 200 containers in the Running state and 40 containers in RunContainerError (their status seems to have changed from ContainerCreating to RunContainerError after I changed the open-files limit), and they appeared stuck on the "Back-off pulling image" event. I then deleted the namespace clusterproject0, but the namespace is still stuck in the Terminating state.

More logs:

# kubectl get ns
NAME              STATUS        AGE
clusterproject0   Terminating   2h
default           Active        2h
kube-system       Active        2h

# kubectl get pods --namespace clusterproject0
NAME           READY     STATUS        RESTARTS   AGE
pausepods210   0/1       Terminating   0          1h
pausepods211   0/1       Terminating   0          1h
pausepods224   0/1       Terminating   0          1h
pausepods234   0/1       Terminating   0          1h

Please note that these are pods which were stuck in ContainerCreating, then moved to RunContainerError, but never reached the Running state. There were 36 other pods in a similar state, but somehow these 4 are the ones still stuck in Terminating, which is keeping the namespace stuck too.

kubectl describe of one of the stuck pods:

[root@fedora24 ~]# kubectl describe pods pausepods210 --namespace clusterproject0
Name:           pausepods210
Namespace:      clusterproject0
Node:           127.0.0.1/127.0.0.1
Start Time:     Wed, 26 Oct 2016 16:00:07 -0400
Labels:         name=pause-amd64
Status:         Terminating (expires Wed, 26 Oct 2016 16:46:12 -0400)
Termination Grace Period:       30s
IP:
Controllers:    <none>
Containers:
  pause-amd64:
    Container ID:
    Image:              gcr.io/google_containers/pause-amd64:3.0
    Image ID:
    Port:               8080/TCP
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-f4ozb (ro)
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-f4ozb:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-f4ozb
QoS Class:      BestEffort
Tolerations:    <none>
No events.
Docker output at this stage:

[root@fedora24 ~]# docker ps -a
CONTAINER ID   IMAGE                                       COMMAND    CREATED          STATUS    PORTS   NAMES
82bc3a2dac6e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   13 minutes ago   Created           k8s_POD.1e570369_pausepods210_clusterproject0_b7e2e3e7-9bb6-11e6-a755-525400b5933e_6fcb2388
6d914e0c18ef   gcr.io/google_containers/pause-amd64:3.0    "/pause"   18 minutes ago   Dead              k8s_POD.1e570369_pausepods235_clusterproject0_c4d0bb89-9bb6-11e6-a755-525400b5933e_2a743fa3
4ebd3b167bf9   gcr.io/google_containers/pause-amd64:3.0    "/pause"   19 minutes ago   Dead              k8s_POD.1e570369_pausepods229_clusterproject0_c1ced268-9bb6-11e6-a755-525400b5933e_2a30e0bb
3b7ee28b077e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   20 minutes ago   Created           k8s_POD.1e570369_pausepods211_clusterproject0_b825bda4-9bb6-11e6-a755-525400b5933e_c6b07236
c8c504f94144   gcr.io/google_containers/pause-amd64:3.0    "/pause"   23 minutes ago   Created           k8s_POD.1e570369_pausepods234_clusterproject0_c497fedb-9bb6-11e6-a755-525400b5933e_cdabae77
11a28e257e00   gcr.io/google_containers/pause-amd64:3.0    "/pause"   24 minutes ago   Dead              k8s_POD.1e570369_pausepods227_clusterproject0_c12e01e7-9bb6-11e6-a755-525400b5933e_e80e767c
b0454c977ab1   gcr.io/google_containers/pause-amd64:3.0    "/pause"   24 minutes ago   Dead              k8s_POD.1e570369_pausepods202_clusterproject0_b56ae547-9bb6-11e6-a755-525400b5933e_302e436a
f4870e7052ac   gcr.io/google_containers/pause-amd64:3.0    "/pause"   25 minutes ago   Dead              k8s_POD.1e570369_pausepods234_clusterproject0_c497fedb-9bb6-11e6-a755-525400b5933e_c2def0e8
492c356f83a4   gcr.io/google_containers/pause-amd64:3.0    "/pause"   27 minutes ago   Dead              k8s_POD.1e570369_pausepods211_clusterproject0_b825bda4-9bb6-11e6-a755-525400b5933e_db8f7e08
36f3f24c46cc   gcr.io/google_containers/pause-amd64:3.0    "/pause"   28 minutes ago   Dead              k8s_POD.1e570369_pausepods230_clusterproject0_c2140dc0-9bb6-11e6-a755-525400b5933e_3895b3a5
b7ba24540a65   gcr.io/google_containers/pause-amd64:3.0    "/pause"   28 minutes ago   Dead              k8s_POD.1e570369_pausepods206_clusterproject0_b69826fb-9bb6-11e6-a755-525400b5933e_a8e745b8
224060f3a59b   gcr.io/google_containers/pause-amd64:3.0    "/pause"   30 minutes ago   Dead              k8s_POD.1e570369_pausepods200_clusterproject0_b4b3225d-9bb6-11e6-a755-525400b5933e_f3ed010a
5c91453e4281   gcr.io/google_containers/pause-amd64:3.0    "/pause"   31 minutes ago   Dead              k8s_POD.1e570369_pausepods202_clusterproject0_b56ae547-9bb6-11e6-a755-525400b5933e_bff96438
ac4059eb171e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   34 minutes ago   Dead              k8s_POD.1e570369_pausepods210_clusterproject0_b7e2e3e7-9bb6-11e6-a755-525400b5933e_c410fff2
2659be7bc17c   gcr.io/google_containers/pause-amd64:3.0    "/pause"   36 minutes ago   Created           k8s_POD.1e570369_pausepods224_clusterproject0_c0513dfa-9bb6-11e6-a755-525400b5933e_3dda34e4
9cd0356315ec   gcr.io/google_containers/pause-amd64:3.0    "/pause"   41 minutes ago   Dead              k8s_POD.1e570369_pausepods224_clusterproject0_c0513dfa-9bb6-11e6-a755-525400b5933e_0436fdaa
6ab7f460f8fa   gcr.io/google_containers/pause-amd64:3.0    "/pause"   42 minutes ago   Dead              k8s_POD.1e570369_pausepods227_clusterproject0_c12e01e7-9bb6-11e6-a755-525400b5933e_e875378d
fe392dcc2c1e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   43 minutes ago   Dead              k8s_POD.1e570369_pausepods236_clusterproject0_c502678e-9bb6-11e6-a755-525400b5933e_9b2bedae
551bcf302dab   gcr.io/google_containers/pause-amd64:3.0    "/pause"   43 minutes ago   Dead              k8s_POD.1e570369_pausepods202_clusterproject0_b56ae547-9bb6-11e6-a755-525400b5933e_87e32e5a
dc37e99abc6d   gcr.io/google_containers/pause-amd64:3.0    "/pause"   44 minutes ago   Dead              k8s_POD.1e570369_pausepods224_clusterproject0_c0513dfa-9bb6-11e6-a755-525400b5933e_b209b91b
fe4b686a7f7e   gcr.io/google_containers/pause-amd64:3.0    "/pause"   49 minutes ago   Dead              k8s_POD.1e570369_pausepods229_clusterproject0_c1ced268-9bb6-11e6-a755-525400b5933e_34d881c4
0cabd92bc495   gcr.io/google_containers/pause-amd64:3.0    "/pause"   52 minutes ago   Dead              k8s_POD.1e570369_pausepods219_clusterproject0_bc225c07-9bb6-11e6-a755-525400b5933e_3de90916
9446dc4cfbe1   gcr.io/google_containers/pause-amd64:3.0    "/pause"   56 minutes ago   Dead              k8s_POD.1e570369_pausepods201_clusterproject0_b50c08ce-9bb6-11e6-a755-525400b5933e_27a5b168

I also see some tokens still mounted. I don't see the "Volume is already mounted to pod, but remount was requested." error in my kubelet logs. I will debug it more tomorrow. Anyway, although I can see the same symptom on my kube cluster, I am not sure we arrived at the issue in the same way.

Mike, do you know whether all pods were in the Running state when you deleted the namespace in your cluster?
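Not something from this thread, but a hedged sketch of the checks that usually narrow down why a namespace stays in Terminating; the pod UID prefix in the last command is taken from the docker container names above and is only illustrative:

# Is the namespace just waiting on its "kubernetes" finalizer?
kubectl get ns clusterproject0 -o yaml          # inspect spec.finalizers and status.phase

# Does the API still show a deletionTimestamp on the stuck pod?
kubectl get pod pausepods210 --namespace clusterproject0 -o yaml | grep -i deletion

# On the node: are the secret tmpfs mounts and pod directories still present?
mount | grep default-token-f4ozb
ls /var/lib/kubelet/pods/ | grep b7e2e3e7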
In some tests all pods were running. In others, 80% Running and 20% Error (builds failed by design). I've also seen it a couple times running the Conformance e2e tests - those put pods through a whole range of states.
Bumping the severity. Having a hard time getting through scale up/down tests without getting wedged.
May have been a config issue. Further testing underway. Re-setting sev to medium.
Mike, I have been trying to reproduce the issue again, but have not been able to for the past couple of days. It only happened for me once, the first time, and I have been analyzing those logs. In later runs I see similar messages in the logs, but the namespace still does not get stuck. In the most recent runs I am noticing that a namespace remains in the Terminating state for a while, but after some time the namespace is finally deleted along with all of its pods. So if you see the issue again, please wait for some time, and if the namespace is still not cleared, please collect all logs covering that whole period for analysis.
Created an upstream issue: https://github.com/kubernetes/kubernetes/issues/36069
Just to clarify again: the reason the namespace/pods got stuck on my setup is different from the original logs in this bz, as I don't see similar errors.
Not currently seeing this. Will reopen if it becomes a reproducible issue.