Bug 1819110 - Pods keeps crash due to docker problem
Summary: Pods keeps crash due to docker problem
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 3.11.z
Assignee: Kir Kolyshkin
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-03-31 08:48 UTC by Wang Haoran
Modified: 2020-11-25 14:56 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-25 14:56:44 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Wang Haoran 2020-03-31 08:48:50 UTC
Description of problem:

We've see sometimes pods running on a problem node keeps crash looping, first time restart the docker, it will failed with: can't create unix socket /var/run/docker.sock: is a directory, after manually delete the dir, restart will succeed, and pods start to running well.

docker logs:
-- Logs begin at Sat 2020-02-29 22:10:13 UTC. --
Mar 31 08:14:56 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:14:56.517942760Z" level=error msg="Handler for POST /v1.26/containers/fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423/stop?t=10 returned error: Container fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423 is already stopped"
Mar 31 08:14:56 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:14:56.518149524Z" level=error msg="Handler for POST /v1.26/containers/fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423/stop returned error: Container fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423 is already stopped"
Mar 31 08:15:16 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:16.100872696Z" level=error msg="Handler for POST /v1.26/containers/fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423/stop?t=10 returned error: Container fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423 is already stopped"
Mar 31 08:15:16 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:16.101272450Z" level=error msg="Handler for POST /v1.26/containers/fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423/stop returned error: Container fc75afd05389647d2cbf1fe138193e4fd271517da85ad205c2bd5aa0a7125423 is already stopped"
Mar 31 08:15:16 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:16.551314196Z" level=warning msg="534293a4e1d18a144e14548c8577610cf41782c6cf45c418b58c3d0e78709241 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:15:16 ip-10-110-2-215.ec2.internal oci-umount[25925]: umounthook <debug>: prestart container_id:f9b26227ce55 rootfs:/var/lib/docker/overlay2/880c25c0b706d99665825bd8cbecbe1652bea0861e8740f4139fe1d91f2bd314/merged
Mar 31 08:15:43 ip-10-110-2-215.ec2.internal oci-umount[26330]: umounthook <debug>: prestart container_id:a43518148cda rootfs:/var/lib/docker/overlay2/f31911191c528fb246052d54c0068751aa755f084720f768021ac2c626774f0f/merged
Mar 31 08:15:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:52.476750292Z" level=info msg="Container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c failed to exit within 120 seconds of signal 15 - using the force"
Mar 31 08:15:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:52.477068613Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c: rpc error: code = 2 desc = containerd: container not found"
Mar 31 08:15:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:15:52.488634996Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c: rpc error: code = 2 desc = containerd: container not found"
Mar 31 08:16:02 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:16:02.477208905Z" level=info msg="Container 1be624cbf57f failed to exit within 10 seconds of kill - trying direct SIGKILL"



Mar 31 08:16:11 ip-10-110-2-215.ec2.internal oci-umount[26898]: umounthook <debug>: prestart container_id:fab1f6cfd3a8 rootfs:/var/lib/docker/overlay2/c3f4b6f171bfece7869850dd0b5b33e988ce916ebffd31196d219ead5cb3c133/merged


Mar 31 08:16:18 ip-10-110-2-215.ec2.internal oci-umount[27038]: umounthook <debug>: prestart container_id:07d597c3c23e rootfs:/var/lib/docker/overlay2/9475f04a9b07eed9aa8bf1e43bc4dbff99514dc9e8691fc44c749e12b23e1b6d/merged
Mar 31 08:16:34 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:16:34.110457859Z" level=warning msg="a43518148cdae84ef1bf443525f3973b0e981f6492de8302e4b319b8e81c4f37 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:16:34 ip-10-110-2-215.ec2.internal oci-umount[27340]: umounthook <debug>: prestart container_id:eaad84c94398 rootfs:/var/lib/docker/overlay2/a484781d5d56bfbba9b15a506212b10198ad89f7dc5e101aed1692365bbdede6/merged


Mar 31 08:16:41 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:16:41.623288244Z" level=warning msg="fab1f6cfd3a8170eceec4aeefe02880de330eb3eec747c63ea1f5549343e4b30 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:16:41 ip-10-110-2-215.ec2.internal oci-umount[27601]: umounthook <debug>: prestart container_id:1cf8a53d32d2 rootfs:/var/lib/docker/overlay2/a5c3eb47653c7cb79bb77f3707b3c32de4808e3a204d381fb3f0b61d435b5a38/merged
Mar 31 08:16:51 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:16:51.172288259Z" level=warning msg="07d597c3c23e464d5a63bf9a0314e059bf2e1c0723a4955f9c487895db4a071d cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:16:51 ip-10-110-2-215.ec2.internal oci-umount[27806]: umounthook <debug>: prestart container_id:778b8ce4e38f rootfs:/var/lib/docker/overlay2/86cc805d40c23e31adfce518812b41118abb35ce413a43319b87d023b6fc8007/merged
Mar 31 08:17:11 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:11.631669354Z" level=warning msg="1cf8a53d32d26266725eeb18cf198017a1a993fd95acb57c1815ab4b5c0d795b cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:17:17 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:17.325794194Z" level=warning msg="f9b26227ce55db3ecabe0e26d1c2b145ad829263105e235438b8986838c795b8 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:17:21 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:21.123348833Z" level=warning msg="778b8ce4e38f57553ea5dcd5f59d8470b5fde0da4d01d564a0a7b25f8f8ed279 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:17:24 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:24.508628240Z" level=warning msg="eaad84c943988708eaba85b607c9c059723ca605b8ae9754ea9252edb6288123 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:17:27 ip-10-110-2-215.ec2.internal oci-umount[28506]: umounthook <debug>: prestart container_id:9d05abf2ff56 rootfs:/var/lib/docker/overlay2/071cfa6ab1cee6f98012a4192cd799a2af25daaac90ecb2de524c4d094745485/merged
Mar 31 08:17:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:52.488781519Z" level=info msg="Container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c failed to exit within 120 seconds of signal 15 - using the force"
Mar 31 08:17:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:52.489145982Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c: rpc error: code = 2 desc = containerd: container not found"
Mar 31 08:17:54 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:54.488361826Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 1be624cbf57f444e5bb3bda17881965dabf6d9f403fa3c180ad4a87ff50d4a2c: rpc error: code = 2 desc = containerd: container not found"
Mar 31 08:17:54 ip-10-110-2-215.ec2.internal oci-umount[28910]: umounthook <debug>: prestart container_id:dc7527c97738 rootfs:/var/lib/docker/overlay2/4c2b095b249c381feae0200e4326f111b40c688b6fde62a6a0d59c9038895249/merged
Mar 31 08:17:57 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:17:57.537465334Z" level=warning msg="9d05abf2ff5652316e352ce7a7875f607938fcd59abaaf4c1991c0a1c4ca5c04 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:17:57 ip-10-110-2-215.ec2.internal oci-umount[29046]: umounthook <debug>: prestart container_id:cd75b59e98c9 rootfs:/var/lib/docker/overlay2/2508cfccf6795cd95bdb25c38702425b5099636f933e2b1f895e999133f93d24/merged
Mar 31 08:18:02 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:18:02.489270039Z" level=info msg="Container 1be624cbf57f failed to exit within 10 seconds of kill - trying direct SIGKILL"
Mar 31 08:18:22 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:18:22.055769141Z" level=warning msg="dc7527c977383e7b71c2aea4575b5a5de13e44b7f1a8841a4a0ee9c2bb664937 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:18:22 ip-10-110-2-215.ec2.internal oci-umount[29528]: umounthook <debug>: prestart container_id:4db0919e510e rootfs:/var/lib/docker/overlay2/62668c57b84408f9f72c751dd25580d27498c4daf8ef433cf6030fd1654494ad/merged
Mar 31 08:18:27 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:18:27.534672079Z" level=warning msg="cd75b59e98c987108a6ceaa8d7eb63463859dca739545017ce4f05bd27688217 cleanup: failed to unmount secrets: invalid argument"
Mar 31 08:18:52 ip-10-110-2-215.ec2.internal dockerd-current[3184]: time="2020-03-31T08:18:52.019458450Z" level=warning msg="4db0919e510e3b7f95ced7d8cd95bf22bd51dca069c44d6e68f70e8f0e6a8b43 cleanup: failed to unmount secrets: invalid argument"


Version-Release number of selected component (if applicable):
openshift v3.11.154
Docker version:
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-104.git4ef4b30.el7.x86_64
 Go version:      go1.10.3
 Git commit:      4ef4b30/1.13.1
 Built:           Tue Sep 24 18:53:48 2019
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-104.git4ef4b30.el7.x86_64
 Go version:      go1.10.3
 Git commit:      4ef4b30/1.13.1
 Built:           Tue Sep 24 18:53:48 2019
 OS/Arch:         linux/amd64
 Experimental:    false


How reproducible:
Sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Wang Haoran 2020-03-31 08:51:02 UTC
failed first time restart docker:

[root@ip-10-110-2-215 ~]# systemctl restart docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
[root@ip-10-110-2-215 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─custom.conf
   Active: failed (Result: exit-code) since Tue 2020-03-31 08:20:29 UTC; 9s ago
     Docs: http://docs.docker.com
  Process: 33823 ExecStart=/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY $REGISTRIES (code=exited, status=1/FAILURE)
 Main PID: 33823 (code=exited, status=1/FAILURE)
    Tasks: 55
   Memory: 153.5M
   CGroup: /system.slice/docker.service
           ├─32778 /usr/bin/docker-containerd-shim-current 6a1a309f46b7971733b4e4de1956f75c71946105ad9beb6e13b4984ea5d4a33b /var/run/docker/libcontainerd/6a1a309f46b79717...
           ├─32838 /usr/bin/docker-containerd-shim-current 40be0a9cc50d5806fa313907ebdb207ec33b140d0daa81328cf36e4ed93bbd03 /var/run/docker/libcontainerd/40be0a9cc50d5806...
           ├─33071 /usr/bin/docker-containerd-shim-current 20c7f99e7681371156eda41b1f7382226a56224e2fa45e03ce97f182e77d3f21 /var/run/docker/libcontainerd/20c7f99e76813711...
           ├─33129 /usr/bin/docker-containerd-shim-current 74ab47cd3ad12b443f970cb3ae694d62923b47bab9e7e4270b3d4756c4ec515e /var/run/docker/libcontainerd/74ab47cd3ad12b44...
           └─33183 /usr/bin/docker-containerd-shim-current 3dd1bf2f31c96767a26483f25a92a908e052f5937d1d3d68be84fcbfe38a0f63 /var/run/docker/libcontainerd/3dd1bf2f31c96767...

Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Starting Docker Application Container Engine...
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal dockerd-current[33823]: time="2020-03-31T08:20:29.316759629Z" level=warning msg="could not change group /var/run/dock...t found"
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal dockerd-current[33823]: can't create unix socket /var/run/docker.sock: is a directory
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Failed to start Docker Application Container Engine.
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Unit docker.service entered failed state.
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@ip-10-110-2-215 ~]# rm -rf /var/run/docker.sock
[root@ip-10-110-2-215 ~]# systemctl restart docker

Comment 2 Tom Sweeney 2020-03-31 16:12:58 UTC
Urvashi, could you take a peak at this please?

Comment 5 Tom Sweeney 2020-07-02 20:05:25 UTC
Ryan, can one of your folks look into this please.

Comment 6 Kir Kolyshkin 2020-07-08 18:33:55 UTC
If the bug is about inability to restart dockerd because of /var/run/docker.sock directory presence, this is the same as https://github.com/moby/moby/issues/30348, and is supposed to be partially addressed by https://github.com/moby/moby/pull/33330 (ported as https://github.com/projectatomic/docker/pull/328).

The way to ultimately address this is NOT to use old volumes API (which auto-creates a bind mount source as a directory if it does not exist, and this is what's causing /var/run/docker.sock directory to appear), but use the new mounts API.

In case it is already used (I don't know), the directory might still be created because of the race (see https://github.com/moby/moby/issues/37083), fixed by https://github.com/moby/moby/pull/37378.

Let me backport this to projectatomic/docker.

Comment 7 Kir Kolyshkin 2020-07-08 18:45:23 UTC
If the bug is about "pods running on a problem node keeps crash looping" (as the bug subject implies), there is not enough information to do anything about it, at the very least I need a repro.

Please clarify what the bug is about.

Comment 8 Kir Kolyshkin 2020-07-09 00:34:31 UTC
> Let me backport this to projectatomic/docker.

Please see https://github.com/projectatomic/docker/pull/376

Comment 9 Wang Haoran 2020-07-09 00:44:16 UTC
The problem is at some time, the pods running on that specific node cannot running, restart docker should be able to fix the problem, but like what I said, restart always failed with the error that I pasted in comment 1, after manually delete that sock dir, restart will succeed, and pods start to running well, I haven't see this problem for a long time, and we haven't upgrade docker also, so it's not easy to reproduce.

Comment 10 Kir Kolyshkin 2020-07-10 21:17:11 UTC
Again, there are two problems in what you described, here is the first one:

> the pods running on that specific node cannot running

and here is the second one:

> restart always failed with the error that I pasted

Since there is not enough data provided about first problem to be actionable, I assume we're dealing with the second one here.

The backport I mentioned earlier in comment 8 has been merged.

I am currently looking into whether OC is using volumes API or mounts API. Might come up with a PR later.

Comment 11 Kir Kolyshkin 2020-07-29 21:35:08 UTC
> I am currently looking into whether OC is using volumes API or mounts API.

It should use mounts API, but have yet to look into it.


Note You need to log in before you can comment on or make changes to this bug.