Bug 1460729

Summary: Pod stuck in Terminating status for hours
Product: OpenShift Container Platform Reporter: Vikas Laad <vlaad>
Component: Containers Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED CURRENTRELEASE QA Contact: DeShuai Ma <dma>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.6.0 CC: amurdaca, aos-bugs, ddarrah, dwalsh, ekuric, eparis, erich, fkluknav, imcleod, jeder, jhonce, jokerman, jupierce, mifiedle, mmccomas, santiago, vgoyal, vlaad, wabouham, xiazhao, xtian
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-36
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-16 20:47:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1470373, 1476433    
Attachments (description, flags):
  describe pod (none)
  pod json (none)
  pod yaml (none)
  pod yaml (none)
  pod yaml (none)
  pod yaml (none)
  pod yaml (none)
  pod yaml (none)
  pod yaml (none)
  Minimal debugging of container 613353 (none)

Description Vikas Laad 2017-06-12 14:10:51 UTC
Description of problem:
I am running reliability tests where we create/delete/access/build/scale quickstart apps over a few days. A few pods are stuck in Terminating after their project was deleted. This is a containerized install of OCP. The OCP cluster looks like this:

1 m4.xlarge master
1 m4.xlarge etcd
1 m4.xlarge infra node
2 m4.xlarge compute nodes

root@ip-172-31-41-232: ~ # oc get pods
NAME                 READY     STATUS        RESTARTS   AGE
postgresql-1-0szlb   1/1       Terminating   0          16h

root@ip-172-31-41-232: ~ # oc get pods -n dancer-mysql-u-67-20-217-78 
NAME               READY     STATUS        RESTARTS   AGE
database-1-0bxw6   1/1       Terminating   1          12h

root@ip-172-31-41-232: ~ # oc get pods -n rails-postgresql-u-67-20-250-88 
NAME                 READY     STATUS        RESTARTS   AGE
postgresql-1-52lx5   1/1       Terminating   0          10h

Version-Release number of selected component (if applicable):
openshift v3.6.102
kubernetes v1.6.1+5115d708d7
etcd 3.2.0-rc.1

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64

How reproducible:

Actual results:
Pods stuck in Terminating status

Expected results:
Pods should be deleted.

Additional info:
Attaching master logs and logs from compute nodes.

Comment 1 Vikas Laad 2017-06-12 14:19:26 UTC
Created attachment 1287060 [details]
describe pod

Comment 2 Vikas Laad 2017-06-12 14:19:56 UTC
Created attachment 1287061 [details]
pod json

Comment 7 Vikas Laad 2017-06-15 14:43:31 UTC
Created attachment 1288088 [details]
pod yaml

Comment 8 Vikas Laad 2017-06-15 14:43:56 UTC
Created attachment 1288089 [details]
pod yaml

Comment 9 Vikas Laad 2017-06-15 14:44:17 UTC
Created attachment 1288090 [details]
pod yaml

Comment 10 Vikas Laad 2017-06-15 14:45:42 UTC
Created attachment 1288091 [details]
pod yaml

Comment 11 Vikas Laad 2017-06-15 14:46:16 UTC
Created attachment 1288092 [details]
pod yaml

Comment 12 Vikas Laad 2017-06-15 14:46:40 UTC
Created attachment 1288093 [details]
pod yaml

Comment 13 Vikas Laad 2017-06-15 14:47:18 UTC
Created attachment 1288094 [details]
pod yaml

Comment 14 Vikas Laad 2017-06-15 15:07:42 UTC
One of the nodes was showing NotReady for some time; here are the logs from that node: http://file.rdu.redhat.com/vlaad/logs/1460729/not_ready_node.zip

Comment 16 Mike Fiedler 2017-06-21 08:17:40 UTC
Currently blocking the 3.6 reliability runs

Comment 17 Seth Jennings 2017-06-21 18:57:26 UTC
I am looking at this but haven't made much progress.

Comment 18 Seth Jennings 2017-06-21 22:49:47 UTC
I have a 3.6.1 cluster up now.  Trying to recreate.

Comment 19 Seth Jennings 2017-06-22 01:15:11 UTC
I've been running this script for a few hours:

https://gist.githubusercontent.com/sjenning/ac42430c05f39ffef2753a11c58224ae/raw/7a319f2d77ca7f2632a52fbdffc6ce151bdc71bd/pod-terminating.sh

I haven't recreated the issue yet.

Vikas, is this script somewhat close to what you are doing?

Comment 20 Seth Jennings 2017-06-22 01:21:02 UTC
Elvir, do you have any logs for the nodes running the terminating pods?

Yes, projects will be affected: deleting a project is a cascading delete, which requires that all resources in the project be cleaned up before the project itself can be removed. Stuck pods will block project deletion.
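
For reference, a quick way to see which pods are blocking a project deletion (a minimal sketch; the namespace name below is one of the ones from this report):

# List every pod stuck in Terminating, cluster-wide
oc get pods --all-namespaces | grep Terminating

# Confirm that the owning project is itself stuck in Terminating
oc get project dancer-mysql-u-67-20-217-78 -o jsonpath='{.status.phase}{"\n"}'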

Comment 21 Elvir Kuric 2017-06-22 08:55:15 UTC
(In reply to Seth Jennings from comment #20)
> Elvir, do you have any logs for the nodes running the terminating pods?

I reinstalled the cluster, so I do not have pod logs.

When I originally hit this problem I had package version 3.6.116, and projects were hanging in Terminating state. I then moved to 3.6.121 and the issue was still there; nothing helped. I then cleaned everything and reinstalled the OCP cluster with the 3.6.121 bits, and I cannot reproduce it now. So I believe the issue was originally in 3.6.116, and once pods were hung in Terminating state an upgrade did not help, but a clean install with 3.6.121 seems to.

Comment 22 Vikas Laad 2017-06-22 12:24:49 UTC
Seth, I have a cluster that is showing this problem again; I will ping you on IRC and give you access to it.

Comment 23 Seth Jennings 2017-06-22 18:40:11 UTC
OK, looking at the nodes, all of the hung pods are getting errors like this:

pod_workers.go:182] Error syncing pod 1769cdd6-56ab-11e7-93c6-02c521393f06 ("mongodb-1-vhsvf_nodejs-mongodb-u-150-50-335-118(1769cdd6-56ab-11e7-93c6-02c521393f06)"), skipping: error killing pod: failed to "KillContainer" for "mongodb" with KillContainerError: "rpc error: code = 4 desc = context deadline exceeded"

I tried a "docker rm -f" on one of the pod containers. It took 23s and returned a DM error, but did delete container.

# docker rm -f f995897b953e
Error response from daemon: Driver devicemapper failed to remove root filesystem f995897b953e9f4f9643b9b0d3362b3841f10d9dfa0ad703079158173a1edb29: failed to remove device 4546ccff8c33a860b7db050971f55e399a84ab0f3ec9a795d404a8ef770979bb:Device is Busy

Log entries of interest:

dockerd-current[14986]: time="2017-06-22T14:31:59.346285130-04:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container f995897b953e9f4f9643b9b0d3362b3841f10d9dfa0ad703079158173a1edb29: rpc error: code = 2 desc = containerd: container not found"

curl -s --insecure -H "Authorization: Bearer $(oc whoami -t)" https://ip-172-31-48-254.us-west-2.compute.internal:10250/metrics | grep -i docker_operations

...
kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.5"} 4.0002713e+07
kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.9"} 4.0011855e+07
kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.99"} 4.0029932e+07
...

At least half of the stop_container operations are pegged at 40s. I'm guessing they are hitting a timeout there.

The node is not under heavy CPU, memory, or I/O load.

Comment 24 Seth Jennings 2017-06-22 18:44:58 UTC
Also found this in dmesg, correlating with the three "docker rm -f" commands I ran:

[188397.454924] device-mapper: thin: Deletion of thin device 3012 failed.
[189504.239177] device-mapper: thin: Deletion of thin device 2783 failed.
[189780.613358] device-mapper: thin: Deletion of thin device 2049 failed.
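
One rough way to check what is keeping a thin device busy (a sketch; the device ID below is the one from the "docker rm -f" error in comment 23):

# Thin devices docker still has mapped
dmsetup ls | grep docker

# Look for mount namespaces that still hold a reference to the container's device
dev=4546ccff8c33a860b7db050971f55e399a84ab0f3ec9a795d404a8ef770979bb
grep -l "$dev" /proc/*/mounts 2>/dev/null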

Comment 25 Eric Paris 2017-06-22 18:45:24 UTC
So we need to give the container team access to these reproducing nodes and ask them to figure out why docker is being so dang slow?

Comment 27 Seth Jennings 2017-06-22 19:06:11 UTC
The I/O interference theory doesn't seem to hold. The node process is trying to delete the containers and is operating normally. The machine is not under heavy CPU, memory, or I/O load.

Here are the two remaining hung pods:

django-psql-u-301-104-368-132        postgresql-1-j8qzn                    1/1       Terminating   0          20h       <none>          ip-172-31-53-165.us-west-2.compute.internal
rails-postgresql-u-100-32-100-35     postgresql-1-lfdsp                    1/1       Terminating   0          1d        <none>          ip-172-31-53-165.us-west-2.compute.internal

docker://6133534837c7977561e35f71c0f0f63f6796ffbd6b082446587b884e06893155
docker://7c9d01457776d278da1397e1fa203a12f1757ffba8aa3d4febf32ac9f0ec9f61

The node is accessible with the libra key on

root.compute.amazonaws.com

Sending back to Containers.

Comment 28 Eric Paris 2017-06-22 19:11:58 UTC
Created attachment 1290803 [details]
Minimal debugging of container 613353

Comment 29 Vivek Goyal 2017-06-22 21:43:16 UTC
(In reply to Eric Paris from comment #28)
> Created attachment 1290803 [details]
> Minimal debugging of container 613353

Hmm, looking at it, maybe the container exited but docker does not know about it and did not update its own metadata about the container state. I am not sure how that happens. Maybe somebody killed the container process outside of docker?

Comment 30 Vivek Goyal 2017-06-22 21:44:50 UTC
BTW, what does it mean when we say a pod is in Terminating state? Does termination mean the container is exiting, or that the container is being deleted?

Also, why is it in Terminating state for such a long time? Is it docker that is not returning a success/error response, or is kube/openshift retrying the operation?

Comment 31 Eric Paris 2017-06-23 02:37:22 UTC
Terminating means kube is trying to stop the container, but docker still hasn't told kube it is gone. kube will keep retrying, telling docker to stop the container forever, until docker reports that the container is stopped.

There are 2 stuck containers on that node. I only looked at one of them; not sure what state the other container is in. Vikas made this happen on numerous nodes. No idea what his workload is like or what magic thing it is doing...
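
For anyone poking at the node, the graceful-deletion state is visible on the pod object itself (a minimal check; the pod and namespace names are the ones from comment 27):

# A non-empty deletionTimestamp on a pod that never goes away means the kubelet
# cannot complete KillContainer, which matches the errors in comment 23.
oc get pod postgresql-1-j8qzn -n django-psql-u-301-104-368-132 \
  -o jsonpath='{.metadata.deletionTimestamp}{" "}{.metadata.deletionGracePeriodSeconds}{"\n"}'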

Comment 32 Mike Fiedler 2017-06-23 02:50:08 UTC
The reliability workload is nothing super-special. It periodically does the following at varying frequencies:

create/delete project
display project info
create/delete app
modify app
visit app
scale app up/down
modify and deploy app
create/delete user
login/logout user

Delete app, delete project, deploy app, and scale down app would likely be the ones of interest here. Any one of these can succeed most of the time and only occasionally leave pods in this state. If timing is suspected to be involved, we can try to get a better reproducer; a rough sketch of such a loop is below.
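
A rough sketch of the kind of loop involved, assuming the rails-postgresql quickstart template and the dc name implied by the pod names in this report (the real harness varies the timing and mixes in the other operations):

#!/bin/bash
# Repeatedly create a quickstart app, let it deploy, optionally scale it down,
# then cascade-delete the project. The delete step is where pods get stuck.
for i in $(seq 1 100); do
  proj="reliability-test-$i"
  oc new-project "$proj"
  oc new-app rails-postgresql-example -n "$proj"
  sleep 120                                        # give the deployment time to come up
  oc scale dc/postgresql --replicas=0 -n "$proj"   # sometimes scale down before deleting
  oc delete project "$proj"
  sleep 60
done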

Comment 33 Seth Jennings 2017-06-23 04:09:54 UTC
*** Bug 1462064 has been marked as a duplicate of this bug. ***

Comment 34 Vivek Goyal 2017-06-23 13:49:14 UTC
(In reply to Eric Paris from comment #31)
> Terminating means kube is trying to stop the container, but docker still
> hasn't told kube it is gone. kube will keep retrying, keep telling docker to
> stop the container forever, until docker tells it that the container is
> stopped.

I logged into one of the nodes and issued "docker stop" manually. It returns after a few seconds, but when I run "docker ps" again the container is still listed, even though its PID is gone.

So there are a few issues:

- How did we get into a state where the PID is gone but docker still thinks the
  container is running? This sounds like a bug and should be fixed in docker.

- I am wondering why docker does not mark such containers as stopped if the PID
  is not found.

- docker stop does return in a few seconds, possibly with an error (I did not see any error printed on the command line). So kube could possibly stop trying to kill the container after a while instead of retrying indefinitely.

I did a quick search and it looks like other people have reported this issue, but it's not clear to me whether the problem was solved.

 
https://github.com/moby/moby/issues/24985

https://github.com/moby/moby/issues/31768

More issues show up in the search. Maybe it is an unresolved problem upstream as well.

It would be good if we could find a reliable reproducer for this problem. Debugging and solving this would become a little easier in that case.
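
One quick way to confirm the "PID is gone but docker still lists the container" state (a sketch; the container ID is one of the two from comment 27):

cid=6133534837c7977561e35f71c0f0f63f6796ffbd6b082446587b884e06893155
docker inspect -f '{{.State.Running}} {{.State.Pid}}' "$cid"
# If docker reports running=true with some PID, check whether that process actually exists:
pid=$(docker inspect -f '{{.State.Pid}}' "$cid")
test -d "/proc/$pid" && echo "process exists" || echo "PID gone, but docker still reports the container as running"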

Comment 35 Daniel Walsh 2017-06-23 14:41:48 UTC
Antonio, any ideas?

Comment 36 Daniel Walsh 2017-06-29 15:41:33 UTC
Docker blamed the issue on OverlayFS, which is not the case here.

Comment 37 Ian McLeod 2017-06-29 18:03:36 UTC
We likely need Seth or Mike to help us get a reproducer environment spun up for Antonio or Jhon to look at.  Can you advise?

Comment 39 Antonio Murdaca 2017-07-03 13:52:37 UTC
So, I see some issues here. The most relevant discussion I found upstream is https://github.com/moby/moby/issues/24974. It's about docker exec, but it describes what seems to be happening here. The upstream issue was about out-of-sync docker/containerd dependencies. I looked at our repositories and backported some missing patches to our containerd 1.12.6 branch.
Another issue is that I see "Device Busy" entries in the docker logs on the node. That may not be related to anything here, but I noticed the docker build on the node is at commit 1398f24. Looking at the projectatomic/docker history at https://github.com/projectatomic/docker/commits/docker-1.12.6, that commit is the first of a series of devmapper fixes. However, 1398f24 isn't the _last_ commit merged as part of those fixes; if you look at the history, there are more devmapper-related commits _above_ 1398f24.

All that said, is there any way to rebuild docker (with containerd) for RHEL and have someone test it on a cluster (even the AWS one we already have in place)? If someone can provision another cluster and reproduce this, it would be great to understand whether the new docker fixes this as well.
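
To see which devmapper fixes are missing from the 1398f24 build on the node, something like this works against the projectatomic/docker tree (the devmapper driver path is my assumption for the 1.12.x layout):

git clone -b docker-1.12.6 https://github.com/projectatomic/docker
cd docker
# Devmapper-related commits merged after the 1398f24 commit the node's build is based on
git log --oneline 1398f24..HEAD -- daemon/graphdriver/devmapper/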

Comment 40 Ian McLeod 2017-07-03 13:57:13 UTC
(In reply to Antonio Murdaca from comment #39)
> All that said, is there any way to rebuild docker (with containerd) for RHEL
> and have someone test it on a cluster (even the AWS one we already have in
> place)?

Thanks for the analysis, Antonio. Just to make things concrete, can you suggest specific commits for each component you'd like to update? (We can then see about having someone rebuild a candidate docker RPM.)

Comment 41 Antonio Murdaca 2017-07-03 14:06:19 UTC
docker: f55a11849bb4b654c511f0d9cfe7e25e833c2bed
containerd: fa8fb3d455e1baf716f3131581f0ed8b07c573a6

Comment 51 Walid A. 2017-07-10 19:41:02 UTC
Update: I let the reliability run continue through the weekend on both the updated RHEL 7.3 and the RHEL 7.4 pre-release environments. I am not seeing any pods stuck in Terminating state on either one.

Comment 53 Vikas Laad 2017-07-12 19:08:53 UTC
I started another run on 07/11; I don't see any problem so far.

Comment 54 Vikas Laad 2017-07-24 15:54:27 UTC
I have had a test running for 2 weeks now and have not seen this issue.

openshift v3.6.140
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-41.1.gitf55a118.el7.x86_64
 Go version:      go1.8.3
 Git commit:      f55a118/1.12.6
 Built:           Thu Jul  6 11:51:52 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-41.1.gitf55a118.el7.x86_64
 Go version:      go1.8.3
 Git commit:      f55a118/1.12.6
 Built:           Thu Jul  6 11:51:52 2017
 OS/Arch:         linux/amd64

Comment 55 Jeff Ligon 2017-08-01 15:19:34 UTC
*** Bug 1476433 has been marked as a duplicate of this bug. ***

Comment 56 Seth Jennings 2017-08-08 17:50:31 UTC
*** Bug 1479362 has been marked as a duplicate of this bug. ***