Bug 1819110
| Summary: | Pods keeps crash due to docker problem | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Wang Haoran <haowang> |
| Component: | Node | Assignee: | Kir Kolyshkin <kir> |
| Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | aos-bugs, dwalsh, jokerman, mpatel, nagrawal, tsweeney |
| Version: | 3.11.0 | Keywords: | ServiceDeliveryImpact |
| Target Milestone: | --- | ||
| Target Release: | 3.11.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-11-25 14:56:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Wang Haoran
2020-03-31 08:48:50 UTC
The first attempt to restart docker failed:
[root@ip-10-110-2-215 ~]# systemctl restart docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
[root@ip-10-110-2-215 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─custom.conf
Active: failed (Result: exit-code) since Tue 2020-03-31 08:20:29 UTC; 9s ago
Docs: http://docs.docker.com
Process: 33823 ExecStart=/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY $REGISTRIES (code=exited, status=1/FAILURE)
Main PID: 33823 (code=exited, status=1/FAILURE)
Tasks: 55
Memory: 153.5M
CGroup: /system.slice/docker.service
├─32778 /usr/bin/docker-containerd-shim-current 6a1a309f46b7971733b4e4de1956f75c71946105ad9beb6e13b4984ea5d4a33b /var/run/docker/libcontainerd/6a1a309f46b79717...
├─32838 /usr/bin/docker-containerd-shim-current 40be0a9cc50d5806fa313907ebdb207ec33b140d0daa81328cf36e4ed93bbd03 /var/run/docker/libcontainerd/40be0a9cc50d5806...
├─33071 /usr/bin/docker-containerd-shim-current 20c7f99e7681371156eda41b1f7382226a56224e2fa45e03ce97f182e77d3f21 /var/run/docker/libcontainerd/20c7f99e76813711...
├─33129 /usr/bin/docker-containerd-shim-current 74ab47cd3ad12b443f970cb3ae694d62923b47bab9e7e4270b3d4756c4ec515e /var/run/docker/libcontainerd/74ab47cd3ad12b44...
└─33183 /usr/bin/docker-containerd-shim-current 3dd1bf2f31c96767a26483f25a92a908e052f5937d1d3d68be84fcbfe38a0f63 /var/run/docker/libcontainerd/3dd1bf2f31c96767...
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Starting Docker Application Container Engine...
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal dockerd-current[33823]: time="2020-03-31T08:20:29.316759629Z" level=warning msg="could not change group /var/run/dock...t found"
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal dockerd-current[33823]: can't create unix socket /var/run/docker.sock: is a directory
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Failed to start Docker Application Container Engine.
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: Unit docker.service entered failed state.
Mar 31 08:20:29 ip-10-110-2-215.ec2.internal systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@ip-10-110-2-215 ~]# rm -rf /var/run/docker.sock
[root@ip-10-110-2-215 ~]# systemctl restart docker
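The manual workaround above can be sketched as a small guard that runs before the restart. This is only a sketch: `classify_sock_path` and `clear_stale_sock_dir` are hypothetical helper names, not part of docker or systemd, and it assumes the stray /var/run/docker.sock directory is empty, which is why it uses `rmdir` instead of `rm -rf`.

```shell
#!/bin/sh
# Hypothetical helper (not part of docker or systemd): report what kind of
# file sits at the given path.
classify_sock_path() {
    path=$1
    if [ -S "$path" ]; then
        echo socket
    elif [ -d "$path" ]; then
        echo directory
    elif [ -e "$path" ]; then
        echo other
    else
        echo missing
    fi
}

# Hypothetical helper: remove the path only when it is a directory.
# rmdir refuses to delete a non-empty directory, so this is more
# conservative than the rm -rf used in the transcript.
clear_stale_sock_dir() {
    path=$1
    if [ "$(classify_sock_path "$path")" = directory ]; then
        rmdir "$path"
    fi
}
```

A possible use on the affected node would be `clear_stale_sock_dir /var/run/docker.sock && systemctl restart docker`, which leaves a real socket untouched and stops before the restart if the directory cannot be removed.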
Urvashi, could you take a peek at this, please?

Ryan, can one of your folks look into this, please?

If the bug is about the inability to restart dockerd because a /var/run/docker.sock directory is present, this is the same as https://github.com/moby/moby/issues/30348, and is supposed to be partially addressed by https://github.com/moby/moby/pull/33330 (ported as https://github.com/projectatomic/docker/pull/328).

The way to ultimately address this is NOT to use the old volumes API (which auto-creates a bind mount source as a directory if it does not exist, and this is what causes the /var/run/docker.sock directory to appear), but to use the new mounts API. In case it is already used (I don't know), the directory might still be created because of a race (see https://github.com/moby/moby/issues/37083), fixed by https://github.com/moby/moby/pull/37378. Let me backport this to projectatomic/docker.

If the bug is about "pods running on a problem node keeps crash looping" (as the bug subject implies), there is not enough information to do anything about it; at the very least I need a repro.

Please clarify what the bug is about.

> Let me backport this to projectatomic/docker.

Please see https://github.com/projectatomic/docker/pull/376

The problem is that, at some point, the pods running on that specific node could no longer run. Restarting docker should fix the problem but, as I said, the restart always failed with the error I pasted in comment 1. After manually deleting that sock directory, the restart succeeded and the pods started running well. I haven't seen this problem for a long time, and we haven't upgraded docker either, so it's not easy to reproduce.

Again, there are two problems in what you described. Here is the first one:

> the pods running on that specific node cannot running

and here is the second one:

> restart always failed with the error that I pasted

Since not enough data has been provided about the first problem for it to be actionable, I assume we're dealing with the second one here.

The backport I mentioned earlier in comment 8 has been merged. I am currently looking into whether OC is using the volumes API or the mounts API. Might come up with a PR later.

> I am currently looking into whether OC is using volumes API or mounts API.

It should use the mounts API, but I have yet to look into it.
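To make the volumes-versus-mounts distinction from the comments concrete, here is a toy model in shell. It is not Docker code and calls no Docker API; the function names and the error message are made up for illustration. It only imitates the two behaviours described: the older handling silently creates a missing bind mount source as a directory, while the stricter handling rejects it.

```shell
#!/bin/sh
# Toy model only: these functions imitate the two behaviours described in
# the comments; they are not Docker code and call no Docker API.

# Legacy "volumes"-style handling: a missing bind mount source is silently
# created as a directory.
prepare_source_legacy() {
    src=$1
    [ -e "$src" ] || mkdir -p "$src"
}

# "Mounts"-style handling: a missing bind mount source is an error and
# nothing is created on disk.
prepare_source_strict() {
    src=$1
    if [ ! -e "$src" ]; then
        echo "bind source path does not exist: $src" >&2
        return 1
    fi
}
```

Under this model, a container that asks for /var/run/docker.sock while dockerd is down leaves a directory behind in the legacy case, which matches the "is a directory" error in the description; in the strict case the request fails and the path stays absent.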