Bug 1746435
Summary: | killing and restarting containerd | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Mario Abajo <mabajodu> |
Component: | docker | Assignee: | Jindrich Novy <jnovy> |
Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.8 | CC: | abraj, agawand, ajia, alolivei, amurdaca, anisal, aos-bugs, apaladug, asheth, dapark, dconsoli, dornelas, dwalsh, dwojewod, erich, erolon, jcrumple, jnovy, joboyer, jokerman, josebarato321, jperezve, jrosenta, kir, lsm5, lstanton, ltitov, lucarval, mpatel, mrobson, nagrawal, obockows, openshift-bugs-escalate, pasik, pehunt, pkhaire, pthomas, rabdulra, rmanes, rsandu, shsaxena, simore, ssadhale, tmicheli, tsweeney, vwalek |
Target Milestone: | rc | Keywords: | Extras |
Target Release: | --- | Flags: | dconsoli: needinfo- dconsoli: needinfo- dwojewod: needinfo- dwojewod: needinfo- dconsoli: needinfo- |
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | docker-1.13.1-204.git0be3e21.el7_9 or newer | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-03-16 14:42:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1186913, 1913377 |
Comment 1
Hugo Cisneiros (Eitch)
2019-11-14 21:50:23 UTC
Hi, I got another case with this error. Again, not a master node:

Nov 19 10:48:20 dockerd-current[9065]: time="2019-11-19T10:48:19.450127115-05:00" level=info msg="killing and restarting containerd"
Nov 19 10:48:21 dockerd-current[9065]: time="2019-11-19T10:48:21.124108655-05:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Nov 19 10:48:21 dockerd-current[9065]: time="2019-11-19T10:48:21.177531314-05:00" level=info msg="libcontainerd: new containerd process, pid: 116629"
Nov 19 10:48:22 dockerd-current[9065]: time="2019-11-19T10:48:21.981786415-05:00" level=info msg="killing and restarting containerd"

There were only 2 restarts, but because of this there was a liveness probe failure in an ovs pod:

Nov 19 10:48:23 atomic-openshift-node[6811]: I1119 10:48:23.003315 6811 prober.go:111] Liveness probe for "ovs-wpw9l_openshift-sdn(7168c5d1-565f-11e9-95c4-0aaea96ecb78):openvswitch" failed (failure): rpc error: code = 14 desc = grpc: the connection is unavailable

It seems that OpenShift 'desynced' and couldn't exec the liveness probe 'forever'. After a while OpenShift couldn't detect this 'lost' ovs pod and spawned another one, which made things more confusing, as two ovs containers were running on the system at the same time. The node went down because of it.

Nov 20 15:57:23 atomic-openshift-node[6811]: I1120 15:57:23.601660 6811 kubelet.go:1820] SyncLoop (ADD, "api"): "ovs-pmffh_openshift-sdn(59d069dc-0bd8-11ea-95c4-0aaea96ecb78)"

Version Info: OCP 3.11.88, docker-1.13.1-94.gitb2f74b2

Another case with similar issues:

Nov 18 17:44:33 dockerd-current[2544]: time="2019-11-18T17:44:30.159729213-03:00" level=info msg="killing and restarting containerd"
Nov 18 17:44:34 dockerd-current[2544]: time="2019-11-18T17:44:31.680365491-03:00" level=info msg="libcontainerd: new containerd process, pid: 62083"
Nov 18 17:44:34 dockerd-current[2544]: time="2019-11-18T17:44:32.625607700-03:00" level=info msg="killing and restarting containerd"
Nov 18 17:45:18 dockerd-current[2544]: time="2019-11-18T17:44:58.020475742-03:00" level=info msg="libcontainerd: new containerd process, pid: 62244"
Nov 18 17:45:18 dockerd-current[2544]: time="2019-11-18T17:45:01.370124458-03:00" level=info msg="killing and restarting containerd"
Nov 18 17:45:18 dockerd-current[2544]: time="2019-11-18T17:45:02.925274706-03:00" level=info msg="libcontainerd: new containerd process, pid: 62292"
Nov 18 17:45:18 dockerd-current[2544]: time="2019-11-18T17:45:03.631771733-03:00" level=info msg="killing and restarting containerd"

This repeats a lot and contributes to the high system load. The provided workaround is to reboot the node (or stop/start both the atomic-openshift-node and docker services). Is there anything we can do here to at least mitigate ovs/sdn pods shutting down?
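A minimal sketch of the stop/start workaround mentioned above, assuming the standard service names on an OCP 3.11 node (a full node reboot achieves the same result):

    # Run as root on the affected node. Stop the kubelet first so it does not
    # react to the docker restart, then bring both services back up in reverse order.
    systemctl stop atomic-openshift-node
    systemctl stop docker
    systemctl start docker
    systemctl start atomic-openshift-node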
Version Info: OCP 3.11.141, docker-1.13.1-104.git4ef4b30

One more case:

Nov 27 12:42:12 dockerd-current[1649]: time="2019-11-27T12:42:11.706413993Z" level=info msg="killing and restarting containerd"
Nov 27 12:42:12 dockerd-current[1649]: time="2019-11-27T12:42:12.333064712Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Nov 27 12:42:12 dockerd-current[1649]: time="2019-11-27T12:42:12.772160707Z" level=info msg="libcontainerd: new containerd process, pid: 111814"
Nov 27 12:42:13 dockerd-current[1649]: time="2019-11-27T12:42:13.348255275Z" level=info msg="killing and restarting containerd"
Nov 27 12:42:13 dockerd-current[1649]: time="2019-11-27T12:42:13.449019674Z" level=info msg="libcontainerd: new containerd process, pid: 111841"
Nov 27 12:42:14 dockerd-current[1649]: time="2019-11-27T12:42:14.047946349Z" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"

This repeats a lot for 30 minutes, probably due to an extended period of high system load. After that it doesn't restart anymore, but new pods can't be created on the node, with the message:

Warning FailedCreatePodSandBox kubelet, node.example.com Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "77e9d45d52520f555a01684dea35d81fa617aa76b0b98647a4b8eb942c792a7a" network for pod "logging-curator-1574911800-nxptr": NetworkPlugin cni failed to set up pod "logging-curator-1574911800-nxptr_openshift-logging" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused, failed to clean up sandbox container "77e9d45d52520f555a01684dea35d81fa617aa76b0b98647a4b8eb942c792a7a" network for pod "logging-curator-1574911800-nxptr": NetworkPlugin cni failed to teardown pod "logging-curator-1574911800-nxptr_openshift-logging" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused]

I think the sdn pods got 'lost' again.

Version Info: OCP 3.11.98

This has fallen down too far on my stack and I'm not sure I'll be able to pop it up much in the near term. Plus, Peter Hunt is probably more familiar with this codebase than I am at the moment. Peter, can I ask you to look at this please?

I just moved this onto Peter's plate and I know he's got a number of other irons going. I'm going to move this to next sprint.

Peter, have you had a chance to look at this one yet?

This PR might help to fix the issue: https://github.com/projectatomic/docker/pull/377. Not sure how to test it though, as I don't have a repro.

The fix is being tested by at least one customer.

From that log I see dockerd panics during start, while trying to create a new instance of the plugin manager. I can't see anything wrong with the code; will try to repro.

Alas, I do not have any news, good or bad. Will take a fresh look at it next week.

OK, I found a stupid typo in one of my commits which was in plain sight but somehow sort of invisible. The bug is now fixed, see the updated https://github.com/projectatomic/docker/pull/377. I have also mocked up a repro for the issue:

1. Run dockerd as usual.
2. Check that it's working:
   # docker run --rm busybox echo hey
   hey
3. Run killall -STOP docker-containerd-current as root. This will make containerd unresponsive.
4. After some time (10s or so) dockerd will try to restart containerd. From the dockerd logs:

...
INFO[0099] killing and restarting containerd
ERRO[0099] libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing
INFO[0099] libcontainerd: new containerd process, pid: 12230
WARN[0000] containerd: low RLIMIT_NOFILE changing to max current=1024 max=4096
INFO[0100] killing and restarting containerd
INFO[0000] === BEGIN goroutine stack dump ===
...

5. Now dockerd is non-functional, as it's not reconnecting to the new containerd properly:
   [root@centos70-ne dynbinary-client]# docker run --rm busybox echo hey
   /usr/bin/docker-current: Error response from daemon: transport is closing.
6. I found out that it eventually starts working again, but given the high load on a customer's system it might take a while.

Working on validating the fix; got some other issues (not sure if they are related).

OK, the additional issue I got after fixing the initial one is also fixed. For the record, the additional issue was that "docker run --rm busybox echo hey" never exited, because the dockerd goroutine responsible for handling containerd events (such as "container exited") is gone once containerd is restarted. This is now fixed in my patches by backporting an upstream PR. To the best of my knowledge, the current state of https://github.com/projectatomic/docker/pull/377 fixes this. Tested locally using the mock repro from the previous comment. What we need now is a test build based on https://github.com/projectatomic/docker/pull/377, and a customer who wishes to give it a try. @jnovy can you please create a test build?

Kir, the test scratch-build is now available here: http://brew-task-repos.usersys.redhat.com/repos/scratch/jnovy/docker/1.13.1/204.git0be3e21.el7_9/

@jnovy I checked the source rpm, and unfortunately docker-1746435.patch has the old version of https://github.com/projectatomic/docker/pull/377.patch. There should be 6 commits in there, not 3. I guess there was some caching involved. Can you please re-build with the correct patches?

Kir, the scratch build is now updated: http://brew-task-repos.usersys.redhat.com/repos/scratch/jnovy/docker/1.13.1/204.git0be3e21.el7_9/

@dwojewod @erich @tmicheli @asheth @dconsoli @dapark can you please test the new build (comment #67)? I have tested it locally and to the best of my knowledge it fixes the issue.

> Can we have detailed steps that need to be implemented by customers? One of my CU (02624821) is willing to do this.

Perhaps some support engineer who is better aware of how a particular customer is deploying and updating OC should step up here and help. A rough guide is (see the consolidated command sketch below):

1. Download the packages from the URL provided in comment #67 (this should be done by a Red Hat employee, as it is hosted on the intranet) and provide them to the customer.
2. Update the docker* packages on OC nodes that have a high chance of exhibiting this bug. The command to do that is something like sudo rpm -Uhv docker*.rpm.
3. Restart the docker daemon (sudo systemctl restart docker).
4. Make sure the updated dockerd is running (docker info should show a Server Version of 1.13.1-204.git0be3e21.el7_9 or so).
5. Do whatever it takes to reproduce the issue (???).

@kir, @jnovy Another OCP 3.11 customer is seeing this issue (case #02788875). I am not certain if they would want to just test a test build. Any thoughts on when this change could be merged into OCP 3.11?

Anand, we need to be sure this change doesn't break other things, so any feedback from customers' testing would be appreciated. If all goes well, it can be released in the next Extras-RHEL7 update.
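A consolidated sketch of the rough guide above, assuming the scratch-build RPMs from comment #67 have already been copied to the node (package file names are illustrative):

    # On an affected OCP node, as root, with the scratch-build RPMs in the current directory:
    rpm -Uhv docker*.rpm                  # update the docker packages
    systemctl restart docker              # restart the daemon
    rpm -q docker                         # expect 1.13.1-204.git0be3e21.el7_9 or newer
    docker info | grep 'Server Version'   # confirm the updated dockerd is the one running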
Unfortunately, the issue occurred in PROD, and they don't want to deploy a fix if it's not in a GA release. Is there any way to repro this issue in house?

I understand the logic completely, Rob. We certainly don't want to cause a regression in a private fork for a lot more customers. I can tell the customer that they have to wait until we are certain, or provide an option to deploy the fix in prod (as Asheth's customer above did). Let's see what the customer comes back with.

@ssadhale thank you for testing this. I spent some time looking at the logs and the source code -- it seems that the fixes I propose all make sense, but are not enough to fix the issue in cases where the system is overloaded. Will try to come up with additional fix(es) on top of what we already have.

https://github.com/projectatomic/docker/pull/377 was merged yesterday.

Fixed in docker-1.13.1-203.git7b2844e.el7_9

Dan, note docker-1.13.1-203.git7b2844e.el7_9 is a scratch build. We are now waiting for testing feedback. The fix is not committed yet.

Moving this bug to VERIFIED state according to the following tests:

[root@master-0 ~]# rpm -q docker
docker-1.13.1-204.git0be3e21.el7_9.x86_64
[root@master-0 ~]# docker run --rm quay.io/libpod/busybox echo hey
hey
[root@master-0 ~]# killall -STOP docker-containerd-current
[root@master-0 ~]# docker run --rm quay.io/libpod/busybox echo hey
/usr/bin/docker-current: Error response from daemon: transport is closing.
[root@master-0 ~]# journalctl |grep -e "killing and restarting containerd"
Feb 24 06:08:49 master-0.sharedocp311cns.lab.upshift.rdu2.redhat.com dockerd-current[36837]: time="2021-02-24T06:08:49.169316998-05:00" level=info msg="killing and restarting containerd"
Feb 24 06:08:50 master-0.sharedocp311cns.lab.upshift.rdu2.redhat.com dockerd-current[36837]: time="2021-02-24T06:08:50.168330732-05:00" level=info msg="killing and restarting containerd"
Feb 24 06:08:51 master-0.sharedocp311cns.lab.upshift.rdu2.redhat.com dockerd-current[36837]: time="2021-02-24T06:08:51.668382852-05:00" level=info msg="killing and restarting containerd"
[root@master-0 ~]# docker run --rm quay.io/libpod/busybox echo hey
hey

How can I get the fixed package? It seems likely that http://brew-task-repos.usersys.redhat.com/repos/scratch/jnovy/docker/1.13.1/204.git0be3e21.el7_9/ is a private link.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (docker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0888

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
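For anyone re-running the verification above against a fixed build, a script-style sketch of the same steps (image name and wait time are illustrative):

    # Baseline: dockerd and containerd are healthy, this prints "hey".
    docker run --rm quay.io/libpod/busybox echo hey
    # Freeze containerd so that dockerd is forced to kill and restart it.
    killall -STOP docker-containerd-current
    # The next run may still fail while the containerd transport is down.
    docker run --rm quay.io/libpod/busybox echo hey || true
    # Give dockerd time to spawn and reconnect to a new containerd process.
    sleep 30
    journalctl | grep -e "killing and restarting containerd"
    # With the fixed build, dockerd reconnects and this prints "hey" again.
    docker run --rm quay.io/libpod/busybox echo hey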