Bug 1456943
| Summary: | Build failures due to conflict in container name | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> |
| Component: | Containers | Assignee: | Jhon Honce <jhonce> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | DeShuai Ma <dma> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6.0 | CC: | aos-bugs, jeder, jokerman, mifiedle, mmccomas, sdodson, vlaad |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-36 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-22 16:33:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Created attachment 1283508 [details]
compute node logs
The issue is that the attempt to create a pod sandbox is timing out when interacting with the docker runtime, as shown by the "operation timeout: context deadline exceeded" errors throughout the logs. This is typically symptomatic of an overwhelmed docker daemon.
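A quick way to gauge how widespread these timeouts are on the node (a sketch only, assuming the node service unit is atomic-openshift-node, as the attached logs suggest):

$ journalctl -u atomic-openshift-node | grep -c "context deadline exceeded"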
How many concurrent pods are running on this node?
How many of those pods are builds (it sounds like ~30, but are they spread across different nodes)?
What are the resource requirements for the build pods in this scenario?
Did this scenario work in earlier versions of the product? That is, have we ever been able to support 30 concurrent build pods in this environment? Is 30 a tipping point, and did things work well with 10, 20, or 25? It would be good to have metrics data collected for some of the fields I reference below, so we know where the system actually tips over with longer tail latency in these types of bug reports.
We did not appear to set --system-reserved values for this node. Why? We should be running all test scenarios with at least some resources reserved for system daemons, the kubelet, and the docker daemon. Do we know if docker is CPU-starved in this scenario? Are you able to get usage stats for docker relative to other end-user pods during this run?
From the logs, it appears we set the /kubepods.slice cgroup to cpu shares of 4096, while /system.slice must have the default of 1024. This ratio may be skewed if we are actually CPU-starved. I would like us to start monitoring usage of the docker daemon in this scenario, and understand whether we should be reserving resources and how much. Setting --system-reserved=cpu=1,memory=<whatever we think /system.slice is using in this scenario> would be a good start. We could then look at enforcing node allocatable not just on end-user pods but also on /system.slice by setting --enforce-node-allocatable=pods,system-reserved, but I would like to start by actually setting some CPU and memory reservation before doing that, to see if the cpu-share ratio of 4096:1024 was simply not right versus, say, 3096:1024 (assuming you actually reserve cpu=1).
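For reference only, a minimal sketch of how such a reservation could be expressed in the node's kubeletArguments (in /etc/origin/node/node-config.yaml on a 3.x install); the cpu=1 / memory=1Gi values are placeholders, not measured recommendations:

kubeletArguments:
  system-reserved:
  - "cpu=1,memory=1Gi"
  # possible later step, per the comment above:
  # enforce-node-allocatable:
  # - "pods,system-reserved"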
It would be good to know what the metrics are for the following fields on this node in this scenario as well.
Can you collect the output of the following, so we know what the 50, 90, and 95% latency measurements are, and attach it to the bz?
$ oc get --raw /api/v1/nodes/{node-name}/proxy/metrics
In particular, I am interested in seeing the tail latencies for the following (a quick filter for pulling these out of the raw output is sketched below):
pod_start_latency_microseconds
docker_operations_latency_microseconds
docker_operations
docker_operations_errors
docker_operations_timeout
pod_worker_start_latency_microseconds
pleg_relist_latency_microseconds
pleg_relist_interval_microseconds
runtime_operations
runtime_operations_latency_microseconds
runtime_operations_errors
I suspect we are seeing high tail latency on pleg_relist and docker_operations when this is happening.
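The raw endpoint returns Prometheus text format, so the summaries above can be pulled out with a simple filter; a minimal sketch (the node name is a placeholder):

$ oc get --raw /api/v1/nodes/<node-name>/proxy/metrics | grep -E 'pleg_relist|docker_operations|runtime_operations|pod_start_latency|pod_worker_start_latency'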
Hi Derek,
Please see other answers below
How many concurrent pods are running on this node?
The current status on the node is the following:
Containers: 125
Running: 2
Paused: 0
Stopped: 123
Images: 378
Server Version: 1.12.6
How many of those pods are builds (it sounds like ~30, but are they spread across different nodes)?
They are spread across 2 compute nodes, so ideally each should have about 15 build pods running at any one point. There is nothing else in the projects, just build configs.
What are the resource requirements for the build pods in this scenario?
I am attaching "oc describe node" and "oc describe build pod" for this info.
Did this scenario work in earlier versions of the product?
We have run 50 concurrent builds on 2 compute nodes before and never seen this error. We have seen more failures at 50 concurrent builds on 2 nodes, but 30 builds have been stable.
We did not appear to set --system-reserved values for this node. Why?
$ oc get --raw /api/v1/nodes/{node-name}/proxy/metrics
Please see output attached.
Created attachment 1284161 [details]
describe build pod
Created attachment 1284163 [details]
describe node
Created attachment 1284164 [details]
oc get raw output
This is an openshift-ansible install. If --system-reserved is not set, it is because the install did not set it. @sdodson - should I open a bz for that? 30 concurrent builds is a workload which was successful in previous releases. On an AWS EC2 m4.xlarge (4 vCPU / 16 GB RAM) instance, 50 concurrent builds was the "normal" outer limit in previous releases (3.5, 3.4, etc.).

Mike,
Yeah, let's open a bug to add system-reserved resources. It's come up in the past as a good idea but we've never done it. We'd need guidance on what to default those values to; if that's documented somewhere already, let us know.

Thanks for the data, I will investigate further.
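For the follow-up bug, a sketch of how the reservation might be set at install time, assuming the openshift_node_kubelet_args inventory variable in this openshift-ansible version (the values below are placeholders, not recommendations):

openshift_node_kubelet_args={'system-reserved': ['cpu=1,memory=1Gi']}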
From the metrics, I see the following:
kubelet_docker_operations{operation_type="create_container"} 5336
kubelet_docker_operations{operation_type="info"} 1283
kubelet_docker_operations{operation_type="inspect_container"} 55139
kubelet_docker_operations{operation_type="inspect_image"} 16749
kubelet_docker_operations{operation_type="list_containers"} 343141
kubelet_docker_operations{operation_type="list_images"} 20207
kubelet_docker_operations{operation_type="pull_image"} 7
kubelet_docker_operations{operation_type="remove_container"} 4600
kubelet_docker_operations{operation_type="remove_image"} 848
kubelet_docker_operations{operation_type="start_container"} 3090
kubelet_docker_operations{operation_type="stop_container"} 5858
kubelet_docker_operations{operation_type="version"} 51625
# HELP kubelet_docker_operations_errors Cumulative number of Docker operation errors by operation type.
# TYPE kubelet_docker_operations_errors counter
kubelet_docker_operations_errors{operation_type="create_container"} 2246
kubelet_docker_operations_errors{operation_type="inspect_container"} 177
kubelet_docker_operations_errors{operation_type="inspect_image"} 7
kubelet_docker_operations_errors{operation_type="list_containers"} 23975
kubelet_docker_operations_errors{operation_type="list_images"} 3222
kubelet_docker_operations_errors{operation_type="remove_container"} 432
kubelet_docker_operations_errors{operation_type="remove_image"} 232
kubelet_docker_operations_errors{operation_type="start_container"} 1
kubelet_docker_operations_errors{operation_type="stop_container"} 48
kubelet_docker_operations_errors{operation_type="version"} 7076
It looks like nearly half of our attempts to create a container (2246 errors out of 5336 operations) resulted in an error, followed by a fair number of errors when listing containers and listing images. I suspect most of these were similar deadline-exceeded errors or their follow-on.
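As a rough cross-check of that ratio, the per-operation error rate can be computed from the two counter families above; a sketch, assuming the raw metrics output was saved to metrics.txt:

$ awk '
    /^kubelet_docker_operations[{]/        { split($1, a, "\""); total[a[2]] = $2 }
    /^kubelet_docker_operations_errors[{]/ { split($1, a, "\""); err[a[2]]   = $2 }
    END { for (op in err) printf "%-20s %5.1f%% errors\n", op, 100 * err[op] / total[op] }
  ' metrics.txt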
I need to research whether the move to the CRI, versus running without the CRI, would have had a major impact here on the underlying docker interaction.
Possible variant of https://bugzilla.redhat.com/show_bug.cgi?id=1451902 ?

I agree this is a variant of the same underlying problem referenced in 1451902. Following the life of the pod cakephp-mysql-example-26-build_proj11:
May 30 13:10:55 ip-172-31-5-23 atomic-openshift-node: I0530 13:10:55.236933 15333 kubelet.go:1813] SyncLoop (ADD, "api"): "cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)"
May 30 13:10:55 ip-172-31-5-23 atomic-openshift-node: I0530 13:10:55.566783 15333 kuberuntime_manager.go:370] No sandbox for pod "cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)" can be found. Need to start a new one
May 30 13:12:56 ip-172-31-5-23 atomic-openshift-node: E0530 13:12:56.524357 15333 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)" failed: rpc error: code = 2 desc = failed to create a sandbox for pod "cakephp-mysql-example-26-build": operation timeout: context deadline exceeded
The first attempt to create the pod sandbox took longer than 2m to respond. This deadline is imposed on the CRI call, so any attempt in the dockershim to recover from a conflicting name will not be handled. The theory here is that the initial create of the sandbox container simply took longer than 2m and may or may not have succeeded. Note from the log that the attempt to create the container took >2m.
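For context, a sketch assuming the 2m deadline here is the kubelet's --runtime-request-timeout (default 2m0s); raising it in node-config.yaml would only confirm where the timeout comes from, not fix an overloaded daemon:

kubeletArguments:
  runtime-request-timeout:
  - "4m"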
For the pod in question, we try for many minutes to create the pod sandbox (for example, 3 minutes later):
May 30 13:15:19 ip-172-31-5-23 atomic-openshift-node: E0530 13:15:19.110654 15333 pod_workers.go:182] Error syncing pod f0f7f873-455a-11e7-9b8e-02ca84fd3e44 ("cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)"), skipping: failed to "CreatePodSandbox" for "cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)" with CreatePodSandboxError: "CreatePodSandbox for pod \"cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)\" failed: rpc error: code = 2 desc = failed to create a sandbox for pod \"cakephp-mysql-example-26-build\": operation timeout: context deadline exceeded"
At each attempt, the operation is timing out after 2m:
May 30 13:15:44 ip-172-31-5-23 dockerd-current: time="2017-05-30T13:15:44.727273103-04:00" level=error msg="Handler for POST /v1.24/containers/create?name=k8s_POD_cakephp-mysql-example-26-build_proj11_f0f7f873-455a-11e7-9b8e-02ca84fd3e44_1 returned error: Conflict. The name \"/k8s_POD_cakephp-mysql-example-26-build_proj11_f0f7f873-455a-11e7-9b8e-02ca84fd3e44_1\" is already in use by container 98b847320b96651007b7be98abb4104bacf9334c2a28057b822f7f0431c8717b. You have to remove (or rename) that container to be able to reuse that name."
May 30 13:15:44 ip-172-31-5-23 dockerd-current: time="2017-05-30T13:15:44.727322814-04:00" level=error msg="Handler for POST /v1.24/containers/create returned error: Conflict. The name \"/k8s_POD_cakephp-mysql-example-26-build_proj11_f0f7f873-455a-11e7-9b8e-02ca84fd3e44_1\" is already in use by container 98b847320b96651007b7be98abb4104bacf9334c2a28057b822f7f0431c8717b. You have to remove (or rename) that container to be able to reuse that name."
May 30 13:15:44 ip-172-31-5-23 atomic-openshift-node: W0530 13:15:44.741984 15333 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "cakephp-mysql-example-26-build_proj11": Unexpected command output nsenter: cannot open : No such file or directory
May 30 13:15:44 ip-172-31-5-23 atomic-openshift-node: W0530 13:15:44.755155 15333 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "cakephp-mysql-example-26-build_proj11": Unexpected command output nsenter: cannot open : No such file or directory
Later, it appears the docker daemon finally gives a response that the container with the specified name exists (presumably, it actually completed in one of the many prior attempts that had exceeded 2m earlier), but the kubelet has no idea if the sandbox had been created prior or not since everything just timed out.
Approximately 30s later, the initial sandbox container died.
May 30 13:16:16 ip-172-31-5-23 atomic-openshift-node: I0530 13:16:16.002501 15333 kubelet.go:1847] SyncLoop (PLEG): "cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)", event: &pleg.PodLifecycleEvent{ID:"f0f7f873-455a-11e7-9b8e-02ca84fd3e44", Type:"ContainerDied", Data:"98b847320b96651007b7be98abb4104bacf9334c2a28057b822f7f0431c8717b"}
And the kubelet again tries to create a new sandbox:
May 30 13:16:53 ip-172-31-5-23 atomic-openshift-node: E0530 13:16:53.658582 15333 remote_runtime.go:86] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = failed to create a sandbox for pod "cakephp-mysql-example-26-build": Error response from daemon: {"message":"Conflict. The name \"/k8s_POD_cakephp-mysql-example-26-build_proj11_f0f7f873-455a-11e7-9b8e-02ca84fd3e44_1\" is already in use by container 98b847320b96651007b7be98abb4104bacf9334c2a28057b822f7f0431c8717b. You have to remove (or rename) that container to be able to reuse that name."}
May 30 13:16:53 ip-172-31-5-23 atomic-openshift-node: E0530 13:16:53.658655 15333 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)" failed: rpc error: code = 2 desc = failed to create a sandbox for pod "cakephp-mysql-example-26-build": Error response from daemon: {"message":"Conflict. The name \"/k8s_POD_cakephp-mysql-example-26-build_proj11_f0f7f873-455a-11e7-9b8e-02ca84fd3e44_1\" is already in use by container 98b847320b96651007b7be98abb4104bacf9334c2a28057b822f7f0431c8717b. You have to remove (or rename) that container to be able to reuse that name."}
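For reference, if the conflicting sandbox ever needed to be cleared by hand (the kubelet normally garbage-collects these itself), a sketch using the name and container ID from the log above:

$ docker ps -a --filter "name=k8s_POD_cakephp-mysql-example-26-build_proj11"
$ docker rm 98b847320b96651007b7be98abb4104bacf9334c2a28057b822f7f0431c8717b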
After many more minutes, the docker daemon is still overwhelmed, and calls to create a container are still taking more than 2m to respond:
May 30 13:22:37 ip-172-31-5-23 atomic-openshift-node: E0530 13:22:37.826214 15333 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)" failed: rpc error: code = 2 desc = failed to create a sandbox for pod "cakephp-mysql-example-26-build": operation timeout: context deadline exceeded
May 30 13:22:37 ip-172-31-5-23 atomic-openshift-node: E0530 13:22:37.826236 15333 kuberuntime_manager.go:619] createPodSandbox for pod "cakephp-mysql-example-26-build_proj11(f0f7f873-455a-11e7-9b8e-02ca84fd3e44)" failed: rpc error: code = 2 desc = failed to create a sandbox for pod "cakephp-mysql-example-26-build": operation timeout: context deadline exceeded
I think it's safe to say that this node was not healthy; the kubelet was trying to do what it could given the environment.
We need to know the root cause of the 2m timeout; as a result, I am duping this onto the referenced bug.
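One way to see whether the daemon itself is the slow component, independent of the kubelet's CRI path, would be to time direct docker CLI calls on the node while the builds are running; a rough sketch:

$ time docker ps -a > /dev/null
$ time docker info > /dev/null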
*** This bug has been marked as a duplicate of bug 1451902 ***
On second thought, I think this is slightly different from 1451902 in that the core timeout is happening on the CreateContainer operation (>2 minutes, over multiple minutes), which is different from the PLEG calling list containers. We need to understand why a create container would take >2m, which is the timeout on the kubelet -> CRI shim interaction.

Was the forced delete option on for the docker daemon?

This was a typical openshift-ansible install; there were no changes done to docker settings after the install. How do I check for the force delete option?

--storage-opts dm.use_deferred_deletion=true
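A sketch for checking whether deferred removal/deletion is actually enabled on the node, assuming the devicemapper storage backend:

$ docker info | grep -i deferred
$ cat /etc/sysconfig/docker-storage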
Description of problem:
Builds are failing due to conflict in container name

rpc error: code = 2 desc = Error response from daemon: {"message":"Conflict. The name \"/k8s_sti-build_cakephp-mysql-example-26-build_proj27_f0d04e4c-455a-11e7-9b8e-02ca84fd3e44_0\" is already in use by container e0d916b0ba24721582405d5517cb1408049310e6d18dc35b66b782c19b7fee18. You have to remove (or rename) that container to be able to reuse that name."} 0 10m

oc logs for the pod show the following:

Error from server (BadRequest): container "sti-build" in pod "cakephp-mysql-example-26-build" is waiting to start: rpc error: code = 2 desc = Error response from daemon: {"message":"Conflict. The name \"/k8s_sti-build_cakephp-mysql-example-26-build_proj22_f0dd6098-455a-11e7-9b8e-02ca84fd3e44_0\" is already in use by container 7f8d403b92c7db283dfc8cb18744897ad068378c12e01c497f54f609abb24f38. You have to remove (or rename) that container to be able to reuse that name."}

Version-Release number of selected component (if applicable):
openshift v3.6.79
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-28.git1398f24.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1398f24/1.12.6
 Built:           Wed May 17 01:16:44 2017

How reproducible:
Running 30 concurrent builds on 2 m4.xlarge compute nodes.

Actual results:
Builds fail due to the error.

Expected results:
Builds should pass.

Additional info:
Attaching node logs.
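For reference, a minimal sketch of how this kind of load can be driven; the project and buildconfig names below are placeholders inferred from the pod names in the logs, not the exact test harness:

$ for p in proj{1..30}; do oc start-build cakephp-mysql-example -n "$p"; done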