Bug 1671861
Summary: | Director deployed OCP 3.11 deployment fails with openshift-ansible getting stuck when restarting docker service on master nodes | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Marius Cornea <mcornea> |
Component: | docker | Assignee: | Valentin Rothberg <vrothber> |
Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.6 | CC: | afariasa, alexander.lackner, amodi, amurdaca, apevec, apmukher, atonner, bendik.paulsrud, chris.liles, cshereme, dbecker, dominik.mierzejewski, dornelas, eaj, fan-wxa, fkluknav, fshaikh, hasuzuki, igreen, jrosenta, kelly.brown1, lsm5, lsu, magnus.heino, m.andre, mburns, mcornea, mnoguera, morazi, mukeshk2, ncurry, nils.ketelsen, palonsor, pasik, pdwyer, rludva, rmanes, rsandu, sasha, subhat, travi, uobergfe, v.nandha, vrothber, wmeng, wwurzbac, yprokule |
Target Milestone: | rc | Keywords: | Extras, Regression, ZStream |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | docker-1.13.1-93.gitb2f74b2.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-03-13 01:52:23 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1186913, 1647729, 1661804, 1663922 |
Description
Marius Cornea
2019-02-01 22:19:01 UTC
The infra and worker nodes do not have any containers running at this step, this could be the reason why they had no issue when restarting docker. The full logs from the master node provide more context: Feb 01 16:30:58 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:30:58.875673621-05:00" level=warning msg="d8e8ffa50cb6850aa67e7c3ca064d33ad84f397a53814d0566013f6bc0a52965 cleanup: failed to unmount secrets: invalid argument" Feb 01 16:38:47 openshift-master-0 systemd[1]: Stopping Docker Application Container Engine... Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.679732066-05:00" level=info msg="Processing signal 'terminated'" Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.682139079-05:00" level=info msg="stopping containerd after receiving terminated" Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.68246551-05:00" level=fatal msg="containerd: serve grpc" error="accept unix /var/run/docker/libcontainerd/docker-containerd.sock: use of closed network connection" Feb 01 16:38:48 openshift-master-0 systemd[1]: Stopped Docker Application Container Engine. Feb 01 16:38:48 openshift-master-0 systemd[1]: Starting Docker Application Container Engine... Feb 01 16:38:49 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:49.173839784-05:00" level=info msg="libcontainerd: new containerd process, pid: 27908" Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.284350795-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds" Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.285523377-05:00" level=info msg="Loading containers: start." Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.303415591-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)." Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)." Seems like an issue in containerd. Additional note, ran "sudo systemctl stop docker" twice and docker finally came up after some time: Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)." Feb 04 03:57:41 openshift-master-0 dockerd-current[27899]: time="2019-02-04T03:57:41.136471082-05:00" level=info msg="Processing signal 'terminated'" Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service stop-sigterm timed out. Killing. Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service: main process exited, code=killed, status=9/KILL Feb 04 03:59:41 openshift-master-0 systemd[1]: Unit docker.service entered failed state. Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service failed. Feb 04 03:59:41 openshift-master-0 systemd[1]: Starting Docker Application Container Engine... Feb 04 03:59:42 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:42.494812650-05:00" level=info msg="libcontainerd: new containerd process, pid: 14499" Feb 04 03:59:42 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:42.515315761-05:00" level=error msg="containerd: notify OOM events" error="open /proc/21179/cgroup: no such file or directory" Feb 04 03:59:43 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:43.570918238-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds" Feb 04 03:59:43 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:43.571784432-05:00" level=info msg="Loading containers: start." Feb 04 03:59:49 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:49.348632137-05:00" level=info msg="Processing signal 'terminated'" Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service stop-sigterm timed out. Killing. Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service: main process exited, code=killed, status=9/KILL Feb 04 04:01:49 openshift-master-0 systemd[1]: Unit docker.service entered failed state. Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service failed. Feb 04 04:01:49 openshift-master-0 systemd[1]: Starting Docker Application Container Engine... Feb 04 04:01:50 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:50.202769479-05:00" level=info msg="libcontainerd: new containerd process, pid: 14681" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.267182649-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.268272443-05:00" level=info msg="Loading containers: start." Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.271485461-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)." Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.271762013-05:00" level=warning msg="libcontainerd: failed to retrieve container 2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462 state: rpc error: code = 2 desc = containerd: container not found" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.282410560-05:00" level=warning msg="2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462 cleanup: failed to unmount secrets: invalid argument" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.286743847-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)." Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.286900077-05:00" level=warning msg="libcontainerd: failed to retrieve container 9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278 state: rpc error: code = 2 desc = containerd: container not found" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.300409683-05:00" level=warning msg="9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278 cleanup: failed to unmount secrets: invalid argument" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.326403986-05:00" level=info msg="Firewalld running: false" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.414573477-05:00" level=info msg="Removing stale sandbox 4398493a2f2c78b6d76f811fec87decf3a838e177252a2b3e0bfb7f8bada9006 (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.546565500-05:00" level=info msg="Removing stale sandbox fad8d18437d6522d8bf7ae954044f56725b2e400f3d93efba287789efe7a090e (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)" Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.572361662-05:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.31.0.0/24. Daemon option --bip can be used to set a preferred IP address" Feb 04 04:01:51 openshift-master-0 oci-umount[14886]: umounthook <debug>: prestart container_id:9237ecfa8341 rootfs:/var/lib/docker/overlay2/cd2092479b3df841fe082c0b632fc0ff78d042445d657267f1a8ea5fdaad3dec/merged Feb 04 04:01:51 openshift-master-0 oci-umount[14887]: umounthook <debug>: prestart container_id:2559d51999b2 rootfs:/var/lib/docker/overlay2/423a7edd1e0fab34950a0562320acfe8b3f015557de5624f0d0008f396ecfd87/merged Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: + sudo -E kolla_set_configs Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: + sudo -E kolla_set_configs Antonio any idea? Can you all provide a core dump of both dockerd and containerd for inspection please? We've identified the broken commit which introduced this regression and reverted it https://github.com/projectatomic/containerd/commit/7989550b83317f799af20ab4df3a5b6487767fc9 Should this be closed as a duplicate of bug 1673692 (or vice-versa) ? *** Bug 1673692 has been marked as a duplicate of this bug. *** *** Bug 1672327 has been marked as a duplicate of this bug. *** *** Bug 1678096 has been marked as a duplicate of this bug. *** Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:0487 *** Bug 1690640 has been marked as a duplicate of this bug. *** *** Bug 1687944 has been marked as a duplicate of this bug. *** *** Bug 1679858 has been marked as a duplicate of this bug. *** |