Bug 1460729
Summary:          Pod stuck in Terminating status for hours

Product:          OpenShift Container Platform
Component:        Containers
Version:          3.6.0
Hardware:         Unspecified
OS:               Unspecified
Priority:         unspecified
Severity:         high
Status:           CLOSED CURRENTRELEASE
Whiteboard:       aos-scalability-36
Reporter:         Vikas Laad <vlaad>
Assignee:         Antonio Murdaca <amurdaca>
QA Contact:       DeShuai Ma <dma>
CC:               amurdaca, aos-bugs, ddarrah, dwalsh, ekuric, eparis, erich, fkluknav, imcleod, jeder, jhonce, jokerman, jupierce, mifiedle, mmccomas, santiago, vgoyal, vlaad, wabouham, xiazhao, xtian
Target Milestone: ---
Target Release:   ---
Doc Type:         If docs needed, set a value
Story Points:     ---
Type:             Bug
Regression:       ---
Last Closed:      2017-08-16 20:47:30 UTC
Bug Blocks:       1470373, 1476433
Description
Vikas Laad, 2017-06-12 14:10:51 UTC
Created attachment 1287060 [details]: describe pod
Created attachment 1287061 [details]: pod json
Created attachment 1288088 [details]: pod yaml
Created attachment 1288089 [details]: pod yaml
Created attachment 1288090 [details]: pod yaml
Created attachment 1288091 [details]: pod yaml
Created attachment 1288092 [details]: pod yaml
Created attachment 1288093 [details]: pod yaml
Created attachment 1288094 [details]: pod yaml
One of the nodes was showing NotReady for some time; here are logs from that node: http://file.rdu.redhat.com/vlaad/logs/1460729/not_ready_node.zip. This is currently blocking the 3.6 reliability runs.

I am looking at this but haven't made much progress. I have a 3.6.1 cluster up now and am trying to recreate. I've been running this for a few hours:

https://gist.githubusercontent.com/sjenning/ac42430c05f39ffef2753a11c58224ae/raw/7a319f2d77ca7f2632a52fbdffc6ce151bdc71bd/pod-terminating.sh

I haven't recreated it yet. Vikas, is this script somewhat close to what you are doing? Elvir, do you have any logs for the nodes running the terminating pods?

Yes, projects will be affected, as deleting a project is a cascading delete and requires that all resources in the project be cleaned up before the project is deleted. Stuck pods will block project deletion.

(In reply to Seth Jennings from comment #20)
> Elvir, do you have any logs for the nodes running the terminating pods?
>
> Yes, projects will be effected as deleting a project is a cascading delete
> and will require that all resources in the project be cleaned up before
> deleting the project. Stuck pods will block project deletion.

I re-installed the cluster; I do not have pod logs. When I originally hit this problem I had package version 3.6.116, and the issue was present: projects were hanging in "Terminating" state. I then moved to 3.6.121, and the issue was still there; nothing helped. I then cleaned everything and reinstalled the OCP cluster with the 3.6.121 bits... and I cannot reproduce it now. So I believe the issue originates in 3.6.116; once pods hung in "Terminating" state, an upgrade did not help, but a clean install with 3.6.121 seems to.

Seth, I have a cluster which is showing this problem again; I will ping you on IRC and give you access to the cluster.
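When chasing runs like this, it helps to list every pod stuck in Terminating across the cluster. A minimal sketch of that filter is below; the sample text is a hypothetical stand-in for real `oc get pods --all-namespaces -o wide` output (the namespace and pod names here are made up), and in practice you would pipe the live command into the awk filter instead.

```shell
# Hypothetical stand-in for `oc get pods --all-namespaces -o wide` output;
# on a real cluster, pipe the live command into the awk filter instead.
sample='NAMESPACE    NAME                 READY   STATUS        RESTARTS   AGE
proj-a       mongodb-1-vhsvf      1/1     Terminating   0          20h
proj-b       postgresql-1-j8qzn   1/1     Running       0          1d'

# Keep only rows whose STATUS column (field 4) reads Terminating.
printf '%s\n' "$sample" | awk '$4 == "Terminating" {print $1 "/" $2}'
```

The same filter works against `kubectl get pods` output, since the column layout is identical.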
OK, so looking at the nodes, all the hung pods are getting errors like this:

pod_workers.go:182] Error syncing pod 1769cdd6-56ab-11e7-93c6-02c521393f06 ("mongodb-1-vhsvf_nodejs-mongodb-u-150-50-335-118(1769cdd6-56ab-11e7-93c6-02c521393f06)"), skipping: error killing pod: failed to "KillContainer" for "mongodb" with KillContainerError: "rpc error: code = 4 desc = context deadline exceeded"

I tried a "docker rm -f" on one of the pod containers. It took 23s and returned a DM error, but it did delete the container.

# docker rm -f f995897b953e
Error response from daemon: Driver devicemapper failed to remove root filesystem f995897b953e9f4f9643b9b0d3362b3841f10d9dfa0ad703079158173a1edb29: failed to remove device 4546ccff8c33a860b7db050971f55e399a84ab0f3ec9a795d404a8ef770979bb: Device is Busy

Log entries of interest:

dockerd-current[14986]: time="2017-06-22T14:31:59.346285130-04:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container f995897b953e9f4f9643b9b0d3362b3841f10d9dfa0ad703079158173a1edb29: rpc error: code = 2 desc = containerd: container not found"

curl -s --insecure -H "Authorization: Bearer $(oc whoami -t)" https://ip-172-31-48-254.us-west-2.compute.internal:10250/metrics | grep -i docker_operations
...
kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.5"} 4.0002713e+07
kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.9"} 4.0011855e+07
kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.99"} 4.0029932e+07
...

At least half the stop_container operations are pegged at 40s; I'm guessing they are hitting a timeout there. The node is not under heavy CPU, memory, or I/O load.

Also found this in dmesg, correlating to the three "docker rm -f"s I did:

[188397.454924] device-mapper: thin: Deletion of thin device 3012 failed.
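The quantile values above are reported in microseconds, which makes the 40s ceiling easy to miss at a glance. A small sanity check, reusing the metric lines quoted above, converts them to seconds:

```shell
# stop_container latency quantiles copied from the kubelet /metrics output
# above; values are in microseconds.
metrics='kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.5"} 4.0002713e+07
kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.9"} 4.0011855e+07
kubelet_docker_operations_latency_microseconds{operation_type="stop_container",quantile="0.99"} 4.0029932e+07'

# Convert the sample value (field 2) from microseconds to seconds.
printf '%s\n' "$metrics" | awk '{printf "%.1fs\n", $2 / 1e6}'
```

All three quantiles land at the same 40-second mark, which is consistent with a fixed timeout rather than ordinary load.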
[189504.239177] device-mapper: thin: Deletion of thin device 2783 failed.
[189780.613358] device-mapper: thin: Deletion of thin device 2049 failed.

So we need to give the container team access to these reproducing nodes and ask them to figure out why docker is being so dang slow?

The I/O interference theory doesn't seem to be the case. The node process is trying to delete the containers and is operating normally. The machine is not under heavy load with respect to CPU, memory, or I/O.

Here are the two remaining hung pods:

django-psql-u-301-104-368-132     postgresql-1-j8qzn   1/1   Terminating   0   20h   <none>   ip-172-31-53-165.us-west-2.compute.internal
rails-postgresql-u-100-32-100-35  postgresql-1-lfdsp   1/1   Terminating   0   1d    <none>   ip-172-31-53-165.us-west-2.compute.internal

docker://6133534837c7977561e35f71c0f0f63f6796ffbd6b082446587b884e06893155
docker://7c9d01457776d278da1397e1fa203a12f1757ffba8aa3d4febf32ac9f0ec9f61

The node is accessible with the libra key on root.compute.amazonaws.com.

Sending back to Containers.

Created attachment 1290803 [details]
Minimal debugging of container 613353
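One basic step in this kind of debugging is checking whether the PID docker records for a container still exists. A minimal sketch follows; `pid_alive` is a hypothetical helper, and the current shell's PID stands in for a real container PID (on a node you would feed it the output of `docker inspect -f '{{.State.Pid}}' <container>`).

```shell
# Return success if the given PID corresponds to a live process on Linux.
pid_alive() {
    [ -n "$1" ] && [ -d "/proc/$1" ]
}

# Illustration: the current shell's PID is alive; 4194305 is above the
# kernel's maximum pid_max (4194304), so no process can ever have it.
pid_alive "$$" && echo "process is alive"
pid_alive 4194305 || echo "PID gone: docker's state would be stale"
```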
(In reply to Eric Paris from comment #28)
> Created attachment 1290803 [details]
> Minimal debugging of container 613353

Hmm, looking at it, maybe the container exited but docker does not know about it and does not update its own metadata about the container state. Not sure how that happens. Maybe somebody killed the container process outside of docker?

BTW, what does it mean when we say a pod is in terminating state? Does termination mean the container is exiting, or that the container is being deleted? And why is it in terminating state for a long time? Is it docker that is not giving a success/error response, or is kube/openshift retrying the operation?

Terminating means kube is trying to stop the container, but docker still hasn't told kube it is gone. kube will keep retrying, telling docker to stop the container forever, until docker tells it that the container is stopped.

There are 2 stuck containers on that node. I only looked at one of them; not sure what state the other container is in. Vikas made this happen on numerous nodes. No idea what his workload is like or what magic thing it is doing...

The reliability workload is nothing super-special. It periodically does the following at varying frequencies:

- create/delete project
- display project info
- create/delete app
- modify app
- visit app
- scale app up/down
- modify and deploy app
- create/delete user
- login/logout user

Delete app, delete project, deploy app, and scale down app would likely be the ones of interest here. Any single one of these can succeed or occasionally leave the pods in this state. If timing is suspected to be involved, we can try to get a better reproducer.

*** Bug 1462064 has been marked as a duplicate of this bug. ***

(In reply to Eric Paris from comment #31)
> Terminating means kube is trying to stop the container, but docker still
> hasn't told kube it is gone. kube will keep retrying, keep telling docker to
> stop the container forever, until docker tells it that the container is
> stopped.
I logged into one of the nodes and issued "docker stop" manually. It returns after a few seconds, but when I do "docker ps" again the container is still there, although its PID is gone. So there are a few issues:

- How did we reach a state where the PID is gone but docker still thinks the container is running? This sounds like a bug and should be fixed in docker.
- I am wondering why docker is not marking such containers stopped when the PID is not found.
- "docker stop" does return in a few seconds, possibly with an error (I did not see any error printed on the command line). So kube could possibly stop trying to kill the container after a while instead of retrying infinitely.

I did a quick search, and it looks like other people have reported this issue, but it is not clear to me whether the problem was solved:

https://github.com/moby/moby/issues/24985
https://github.com/moby/moby/issues/31768

More issues show up in a search; maybe it is an unresolved problem upstream as well. It would be good if we could find a reliable reproducer; debugging and solving this would become a little easier in that case.

Antonio, any ideas? Docker blamed the issue on OverlayFS, which is not the case here. We likely need Seth or Mike to help us get a reproducer environment spun up for Antonio or Jhon to look at. Can you advise?

So, I see some issues here. The most relevant discussion I found upstream is https://github.com/moby/moby/issues/24974. It's about docker exec, but it describes what seems to be happening here. The upstream issue was about out-of-sync docker/containerd dependencies. I looked at our repositories and backported some missing patches onto our containerd 1.12.6 branch.

Another issue I'm seeing is "Device Busy" messages in the docker logs on the node. That may not be related to anything here, but I noticed the docker version on the node is 1398f24.
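The first two issues above suggest a node-side sweep: walk the containers docker reports as running and flag any whose recorded PID no longer exists. A hedged sketch of that sweep follows; the sample pairs are a stand-in for `docker ps -q` combined with `docker inspect -f '{{.Id}} {{.State.Pid}}'`, with the current shell's PID playing a live container and an impossible PID playing a stale one.

```shell
# Hypothetical "container-id pid" pairs; on a real node these would come from:
#   docker ps -q | xargs -r docker inspect -f '{{.Id}} {{.State.Pid}}'
# $$ is a PID that exists; 4194305 exceeds the kernel's pid_max ceiling,
# simulating a container whose process is gone but which docker still lists.
pairs="6133534837c7 $$
7c9d01457776 4194305"

# Flag every container whose recorded PID has no /proc entry.
stale=$(printf '%s\n' "$pairs" | while read -r id pid; do
    [ -d "/proc/$pid" ] || echo "$id"
done)
echo "$stale"
```

Any ID this prints is a container whose process has exited while docker's state still says running, which is exactly the mismatch described above.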
By looking at the projectatomic/docker history here, https://github.com/projectatomic/docker/commits/docker-1.12.6, that commit is the first of a series of devmapper fixes. However, 1398f24 isn't the _last_ commit merged as part of those devmapper fixes; if you look at the history, there are more devmapper-related commits _above_ 1398f24.

All that said, is there any way to rebuild docker (with containerd) for RHEL and have someone test it on a cluster (even the AWS one we already have in place)? If someone can provision another cluster and reproduce, it would be great to understand whether the new docker fixes this as well.

(In reply to Antonio Murdaca from comment #39)
> So, I see some issues here. The most relevant discussion I found upstream is
> this https://github.com/moby/moby/issues/24974. It's about docker exec but
> it describes what seems to be happening here. Upstream issue was about
> out-of-sync docker/containerd dependencies. I looked at our repositories and
> backported some missing patches on our containerd 1.12.6 branch.
> Another issue I'm seeing is that I see "Device Busy" logs in docker logs on
> the node. That may not be related to anything here, but I noticed the docker
> version on the node is 1398f24. By looking at projectatomic/docker history
> here https://github.com/projectatomic/docker/commits/docker-1.12.6 - that
> commit is the first of some fixes about devmapper. However, 1398f24 isn't
> the _last_ commit merged as part of those devmapper fixes. Indeed, if you
> look at the history there are more commits _above_ 1398f24 related to
> devmapper.
>
> All that said, is there any way to rebuild docker (with containerd) for RHEL
> and have someone test it on a cluster? (even the aws one we already have in
> place). If someone can provision another cluster and reproduce this would be
> great to understand if the new docker fixes this also.

Thanks for the analysis, Antonio.
Just to make things concrete, can you suggest specific commits for each component you'd like to update? (We can then see about having someone rebuild a candidate docker RPM.)

docker: f55a11849bb4b654c511f0d9cfe7e25e833c2bed
containerd: fa8fb3d455e1baf716f3131581f0ed8b07c573a6

Update: I let the reliability run go through the weekend on both RHEL 7.3 updated and 7.4 pre-release environments. I am not seeing any pods stuck in Terminating state on either one. I started another run on 07/11 and don't see any problem so far.

I have had a test running for 2 weeks now and did not see this issue:

openshift v3.6.140
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-41.1.gitf55a118.el7.x86_64
 Go version:      go1.8.3
 Git commit:      f55a118/1.12.6
 Built:           Thu Jul 6 11:51:52 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-41.1.gitf55a118.el7.x86_64
 Go version:      go1.8.3
 Git commit:      f55a118/1.12.6
 Built:           Thu Jul 6 11:51:52 2017
 OS/Arch:         linux/amd64

*** Bug 1476433 has been marked as a duplicate of this bug. ***
*** Bug 1479362 has been marked as a duplicate of this bug. ***