Description of problem: We've see sometimes pods running on a problem node keeps crash looping, first time restart the docker, it will failed with: can't create unix socket /var/run/docker.sock: is a directory, after manually delete the dir, restart will succeed, and pods start to running well. docker logs: -- Logs begin at Sat 2020-02-29 22:10:13 UTC. -- Mar 31 08:14:56 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:14:56.517942760Z" level=error msg="Handler for POST /v1.26/containers/fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423/stop?t=10 returned error: Container fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423 is already stopped" Mar 31 08:14:56 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:14:56.518149524Z" level=error msg="Handler for POST /v1.26/containers/fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423/stop returned error: Container fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423 is already stopped" Mar 31 08:15:16 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:16.100872696Z" level=error msg="Handler for POST /v1.26/containers/fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423/stop?t=10 returned error: Container fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423 is already stopped" Mar 31 08:15:16 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:16.101272450Z" level=error msg="Handler for POST /v1.26/containers/fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423/stop returned error: Container fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423 is already stopped" Mar 31 08:15:16 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:16.551314196Z" level=warning msg="534293a4e1d18a144e14548c8577610cf41782c6cf45c418b58c3d0e78709241 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:15:16 ip-10-110-2-215.ec2.internal oci-umount[25925]: umounthook <debug>: prestart container_id:f9b26227ce55 rootfs:/var/lib/docker/overlay2/880c25c0b706d99665825bd8cbecbe1652bea0861e8740f4139fe1d91f2bd314/merged Mar 31 08:15:43 ip-10-110-2-215.ec2.internal oci-umount[26330]: umounthook <debug>: prestart container_id:a43518148cda rootfs:/var/lib/docker/overlay2/f31911191c528fb246052d54c0068751aa755f084720f768021ac2c626774f0f/merged Mar 31 08:15:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:52.476750292Z" level=info msg="Container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c failed to exit within 120 seconds of signal 15 - using the force" Mar 31 08:15:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:52.477068613Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c: rpc error: code = 2 desc = containerd: container not found" Mar 31 08:15:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:52.488634996Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c: rpc error: code = 2 desc = containerd: container not found" Mar 31 08:16:02 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:16:02.477208905Z" level=info msg="Container 1be624cbf57f failed to exit within 10 seconds of kill - trying direct SIGKILL" Mar 31 08:16:11 ip-10-110-2-215.ec2.internal oci-umount[26898]: umounthook <debug>: prestart container_id:fab1f6cfd3a8 rootfs:/var/lib/docker/overlay2/c3f4b6f171bfece7869850dd0b5b33e988ce916ebffd31196d219ead5cb3c133/merged Mar 31 08:16:18 ip-10-110-2-215.ec2.internal oci-umount[27038]: umounthook <debug>: prestart container_id:07d597c3c23e rootfs:/var/lib/docker/overlay2/9475f04a9b07eed9aa8bf1e43bc4dbff99514dc9e8691fc44c749e12b23e1b6d/merged Mar 31 08:16:34 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:16:34.110457859Z" level=warning msg="a43518148cdae84ef1bf443525f3973b0e981f6492de8302e4b319b8e81c4f37 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:16:34 ip-10-110-2-215.ec2.internal oci-umount[27340]: umounthook <debug>: prestart container_id:eaad84c94398 rootfs:/var/lib/docker/overlay2/a484781d5d56bfbba9b15a506212b10198ad89f7dc5e101aed1692365bbdede6/merged Mar 31 08:16:41 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:16:41.623288244Z" level=warning msg="fab1f6cfd3a8170eceec4aeefe02880de330eb3eec747c63ea1f5549343e4b30 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:16:41 ip-10-110-2-215.ec2.internal oci-umount[27601]: umounthook <debug>: prestart container_id:1cf8a53d32d2 rootfs:/var/lib/docker/overlay2/a5c3eb47653c7cb79bb77f3707b3c32de4808e3a204d381fb3f0b61d435b5a38/merged Mar 31 08:16:51 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:16:51.172288259Z" level=warning msg="07d597c3c23e464d5a63bf9a0314e059bf2e1c0723a4955f9c487895db4a071d cleanup: failed to unmount secrets: invalid argument" Mar 31 08:16:51 ip-10-110-2-215.ec2.internal oci-umount[27806]: umounthook <debug>: prestart container_id:778b8ce4e38f rootfs:/var/lib/docker/overlay2/86cc805d40c23e31adfce518812b41118abb35ce413a43319b87d023b6fc8007/merged Mar 31 08:17:11 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:11.631669354Z" level=warning msg="1cf8a53d32d26266725eeb18cf198017a1a993fd95acb57c1815ab4b5c0d795b cleanup: failed to unmount secrets: invalid argument" Mar 31 08:17:17 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:17.325794194Z" level=warning msg="f9b26227ce55db3ecabe0e26d1c2b145ad829263105e235438b8986838c795b8 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:17:21 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:21.123348833Z" level=warning msg="778b8ce4e38f57553ea5dcd5f59d8470b5fde0da4d01d564a0a7b25f8f8ed279 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:17:24 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:24.508628240Z" level=warning msg="eaad84c943988708eaba85b607c9c059723ca605b8ae9754ea9252edb6288123 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:17:27 ip-10-110-2-215.ec2.internal oci-umount[28506]: umounthook <debug>: prestart container_id:9d05abf2ff56 rootfs:/var/lib/docker/overlay2/071cfa6ab1cee6f98012a4192cd799a2af25daaac90ecb2de524c4d094745485/merged Mar 31 08:17:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:52.488781519Z" level=info msg="Container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c failed to exit within 120 seconds of signal 15 - using the force" Mar 31 08:17:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:52.489145982Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c: rpc error: code = 2 desc = containerd: container not found" Mar 31 08:17:54 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:54.488361826Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c: rpc error: code = 2 desc = containerd: container not found" Mar 31 08:17:54 ip-10-110-2-215.ec2.internal oci-umount[28910]: umounthook <debug>: prestart container_id:dc7527c97738 rootfs:/var/lib/docker/overlay2/4c2b095b249c381feae0200e4326f111b40c688b6fde62a6a0d59c9038895249/merged Mar 31 08:17:57 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:57.537465334Z" level=warning msg="9d05abf2ff5652316e352ce7a7875f607938fcd59abaaf4c1991c0a1c4ca5c04 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:17:57 ip-10-110-2-215.ec2.internal oci-umount[29046]: umounthook <debug>: prestart container_id:cd75b59e98c9 rootfs:/var/lib/docker/overlay2/2508cfccf6795cd95bdb25c38702425b5099636f933e2b1f895e999133f93d24/merged Mar 31 08:18:02 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:18:02.489270039Z" level=info msg="Container 1be624cbf57f failed to exit within 10 seconds of kill - trying direct SIGKILL" Mar 31 08:18:22 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:18:22.055769141Z" level=warning msg="dc7527c977383e7b71c2aea4575b5a5de13e44b7f1a8841a4a0ee9c2bb664937 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:18:22 ip-10-110-2-215.ec2.internal oci-umount[29528]: umounthook <debug>: prestart container_id:4db0919e510e rootfs:/var/lib/docker/overlay2/62668c57b84408f9f72c751dd25580d27498c4daf8ef433cf6030fd1654494ad/merged Mar 31 08:18:27 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:18:27.534672079Z" level=warning msg="cd75b59e98c987108a6ceaa8d7eb63463859dca739545017ce4f05bd27688217 cleanup: failed to unmount secrets: invalid argument" Mar 31 08:18:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:18:52.019458450Z" level=warning msg="4db0919e510e3b7f95ced7d8cd95bf22bd51dca069c44d6e68f70e8f0e6a8b43 cleanup: failed to unmount secrets: invalid argument" Version-Release number of selected component (if applicable): openshift v3.11.154 Docker version: Client: Version: 1.13.1 API version: 1.26 Package version: docker-1.13.1-104.git4ef4b30.el7.x86_64 Go version: go1.10.3 Git commit: 4ef4b30/1.13.1 Built: Tue Sep 24 18:53:48 2019 OS/Arch: linux/amd64 Server: Version: 1.13.1 API version: 1.26 (minimum version 1.12) Package version: docker-1.13.1-104.git4ef4b30.el7.x86_64 Go version: go1.10.3 Git commit: 4ef4b30/1.13.1 Built: Tue Sep 24 18:53:48 2019 OS/Arch: linux/amd64 Experimental: false How reproducible: Sometimes Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
failed first time restart docker: [root@ip-10-110-2-215 ~]# systemctl restart docker Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details. [root@ip-10-110-2-215 ~]# systemctl status docker ● docker.service - Docker Application Container Engine Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/docker.service.d └─custom.conf Active: failed (Result: exit-code) since Tue 2020-03-31 08:20:29 UTC; 9s ago Docs: http://docs.docker.com Process: 33823 ExecStart=/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY $REGISTRIES (code=exited, status=1/FAILURE) Main PID: 33823 (code=exited, status=1/FAILURE) Tasks: 55 Memory: 153.5M CGroup: /system.slice/docker.service ├─32778 /usr/bin/docker-containerd-shim-current 6a1a309f46b7971733b4e4de1956f75c71946105ad9beb6e13b4984ea5d4a33b /var/run/docker/libcontainerd/6a1a309f46b79717... ├─32838 /usr/bin/docker-containerd-shim-current 40be0a9cc50d5806fa313907ebdb207ec33b140d0daa81328cf36e4ed93bbd03 /var/run/docker/libcontainerd/40be0a9cc50d5806... ├─33071 /usr/bin/docker-containerd-shim-current 20c7f99e7681371156eda41b1f7382226a56224e2fa45e03ce97f182e77d3f21 /var/run/docker/libcontainerd/20c7f99e76813711... ├─33129 /usr/bin/docker-containerd-shim-current 74ab47cd3ad12b443f970cb3ae694d62923b47bab9e7e4270b3d4756c4ec515e /var/run/docker/libcontainerd/74ab47cd3ad12b44... └─33183 /usr/bin/docker-containerd-shim-current 3dd1bf2f31c96767a26483f25a92a908e052f5937d1d3d68be84fcbfe38a0f63 /var/run/docker/libcontainerd/3dd1bf2f31c96767... Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Starting Docker Application Container Engine... Mar 31 08:20:29 ip-10-110-2-215.ec2.internal dockerd-current[33823]: time="2020-03-31T08:20:29.316759629Z" level=warning msg="could not change group /var/run/dock...t found" Mar 31 08:20:29 ip-10-110-2-215.ec2.internal dockerd-current[33823]: can't create unix socket /var/run/docker.sock: is a directory Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Failed to start Docker Application Container Engine. Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Unit docker.service entered failed state. Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: docker.service failed. Hint: Some lines were ellipsized, use -l to show in full. [root@ip-10-110-2-215 ~]# rm -rf /var/run/docker.sock [root@ip-10-110-2-215 ~]# systemctl restart docker
Urvashi, could you take a peak at this please?
Ryan, can one of your folks look into this please.
If the bug is about inability to restart dockerd because of /var/run/docker.sock directory presence, this is the same as https://github.com/moby/moby/issues/30348, and is supposed to be partially addressed by https://github.com/moby/moby/pull/33330 (ported as https://github.com/projectatomic/docker/pull/328). The way to ultimately address this is NOT to use old volumes API (which auto-creates a bind mount source as a directory if it does not exist, and this is what's causing /var/run/docker.sock directory to appear), but use the new mounts API. In case it is already used (I don't know), the directory might still be created because of the race (see https://github.com/moby/moby/issues/37083), fixed by https://github.com/moby/moby/pull/37378. Let me backport this to projectatomic/docker.
If the bug is about "pods running on a problem node keeps crash looping" (as the bug subject implies), there is not enough information to do anything about it, at the very least I need a repro. Please clarify what the bug is about.
> Let me backport this to projectatomic/docker. Please see https://github.com/projectatomic/docker/pull/376
The problem is at some time, the pods running on that specific node cannot running, restart docker should be able to fix the problem, but like what I said, restart always failed with the error that I pasted in comment 1, after manually delete that sock dir, restart will succeed, and pods start to running well, I haven't see this problem for a long time, and we haven't upgrade docker also, so it's not easy to reproduce.
Again, there are two problems in what you described, here is the first one: > the pods running on that specific node cannot running and here is the second one: > restart always failed with the error that I pasted Since there is not enough data provided about first problem to be actionable, I assume we're dealing with the second one here. The backport I mentioned earlier in comment 8 has been merged. I am currently looking into whether OC is using volumes API or mounts API. Might come up with a PR later.
> I am currently looking into whether OC is using volumes API or mounts API. It should use mounts API, but have yet to look into it.