Bug 1746435

Summary: killing and restarting containerd
Product: Red Hat Enterprise Linux 7
Reporter: Mario Abajo <mabajodu>
Component: docker
Assignee: Jindrich Novy <jnovy>
Status: CLOSED ERRATA
QA Contact: atomic-bugs <atomic-bugs>
Severity: high
Priority: high
Version: 7.8
CC: abraj, agawand, ajia, alolivei, amurdaca, anisal, aos-bugs, apaladug, asheth, dapark, dconsoli, dornelas, dwalsh, dwojewod, erich, erolon, jcrumple, jnovy, joboyer, jokerman, josebarato321, jperezve, jrosenta, kir, lsm5, lstanton, ltitov, lucarval, mpatel, mrobson, nagrawal, obockows, openshift-bugs-escalate, pasik, pehunt, pkhaire, pthomas, rabdulra, rmanes, rsandu, shsaxena, simore, ssadhale, tmicheli, tsweeney, vwalek
Target Milestone: rc
Keywords: Extras
Target Release: ---
Flags: dconsoli: needinfo-, dconsoli: needinfo-, dwojewod: needinfo-, dwojewod: needinfo-, dconsoli: needinfo-
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: docker-1.13.1-204.git0be3e21.el7_9 or newer
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-16 14:42:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1186913, 1913377

Comment 1 Hugo Cisneiros (Eitch) 2019-11-14 21:50:23 UTC
Hi folks,

I also got a customer with this problem, but it wasn't a master node. Lots of messages like this one suddenly began appearing:

time="2019-11-13T02:06:28.267154141-03:00" level=info msg="killing and restarting containerd"

The messages kept appearing for several minutes; the last one came roughly 6 minutes later:

time="2019-11-13T02:12:18.613139120-03:00" level=info msg="killing and restarting containerd"
time="2019-11-13T02:12:18.721713017-03:00" level=info msg="libcontainerd: new containerd process, pid: 107907"

Around this time, the Linux oom-killer killed an application:

Nov 13 02:11:37 kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=968
Nov 13 02:11:37 kernel: java cpuset=docker-34b807f7b19c6267176cc9a05435980eec478dd8254e4e56006a96066abd7043.scope mems_allowed=0
Nov 13 02:11:37 kernel: CPU: 1 PID: 111971 Comm: java Kdump: loaded Not tainted 3.10.0-957.10.1.el7.x86_64 #1

Some containers were affected and got stuck in the 'Terminating' state. After this, docker logged lots of lines like these:

Nov 13 02:16:02 time="2019-11-13T02:16:02.708288273-03:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 582ef9e0156a0740300044bc8bcdba4ae42125464e7e1e6ebb2cd75dac037335: rpc error: code = 2 desc = containerd: container not found"
Nov 13 02:16:26 dockerd-current[48546]: time="2019-11-13T02:16:26.069457473-03:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = containerd: container not found"
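
As a hedged aside (not part of the original report), a quick way to list the pods stuck in Terminating, assuming cluster-admin access with oc:

        # oc get pods --all-namespaces -o wide | grep Terminating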

Version Info:

OCP 3.11.88
docker-1.13.1-94.gitb2f74b2

Comment 2 Hugo Cisneiros (Eitch) 2019-11-22 18:10:56 UTC
Hi,

I got another case with this error. Again, not a master node:

Nov 19 10:48:20 dockerd-current[9065]: time="2019-11-19T10:48:19.450127115-05:00" level=info msg="killing and restarting containerd"
Nov 19 10:48:21 dockerd-current[9065]: time="2019-11-19T10:48:21.124108655-05:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Nov 19 10:48:21 dockerd-current[9065]: time="2019-11-19T10:48:21.177531314-05:00" level=info msg="libcontainerd: new containerd process, pid: 116629"
Nov 19 10:48:22 dockerd-current[9065]: time="2019-11-19T10:48:21.981786415-05:00" level=info msg="killing and restarting containerd"

There were only 2 restarts, but because of them there was a liveness probe failure in an ovs pod:

Nov 19 10:48:23 atomic-openshift-node[6811]: I1119 10:48:23.003315    6811 prober.go:111] Liveness probe for "ovs-wpw9l_openshift-sdn(7168c5d1-565f-11e9-95c4-0aaea96ecb78):openvswitch" failed (failure): rpc error: code = 14 desc = grpc: the connection is unavailable

It seems that OpenShift 'desynced' and could not exec the liveness probe anymore. After a while OpenShift could no longer detect this 'lost' ovs pod and spawned another one, which made things even more confusing because two ovs containers ended up running on the system at the same time... The node went down because of it.

Nov 20 15:57:23 atomic-openshift-node[6811]: I1120 15:57:23.601660    6811 kubelet.go:1820] SyncLoop (ADD, "api"): "ovs-pmffh_openshift-sdn(59d069dc-0bd8-11ea-95c4-0aaea96ecb78)"
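
As a hedged diagnostic aside (not from the original comment), duplicate ovs pods landing on the same node can be spotted by listing them with their node placement, assuming access to the openshift-sdn namespace:

        # oc get pods -n openshift-sdn -o wide | grep ovs-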

Version Info:

OCP 3.11.88
docker-1.13.1-94.gitb2f74b2

Comment 3 Hugo Cisneiros (Eitch) 2019-11-28 14:43:02 UTC
Another case with similar issues.

Nov 18 17:44:33 dockerd-current[2544]: time="2019-11-18T17:44:30.159729213-03:00" level=info msg="killing and restarting containerd"
Nov 18 17:44:34 dockerd-current[2544]: time="2019-11-18T17:44:31.680365491-03:00" level=info msg="libcontainerd: new containerd process, pid: 62083"
Nov 18 17:44:34 dockerd-current[2544]: time="2019-11-18T17:44:32.625607700-03:00" level=info msg="killing and restarting containerd"
Nov 18 17:45:18 dockerd-current[2544]: time="2019-11-18T17:44:58.020475742-03:00" level=info msg="libcontainerd: new containerd process, pid: 62244"
Nov 18 17:45:18 dockerd-current[2544]: time="2019-11-18T17:45:01.370124458-03:00" level=info msg="killing and restarting containerd"
Nov 18 17:45:18 dockerd-current[2544]: time="2019-11-18T17:45:02.925274706-03:00" level=info msg="libcontainerd: new containerd process, pid: 62292"
Nov 18 17:45:18 dockerd-current[2544]: time="2019-11-18T17:45:03.631771733-03:00" level=info msg="killing and restarting containerd"

This repeats a lot and contributes to the high system load.

The workaround provided is to reboot the node (or to stop and start both the atomic-openshift-node and docker services).
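
A minimal sketch of that stop/start workaround on an affected node (the service names are the ones mentioned above; this particular ordering is a reasonable choice, not something prescribed in this bug):

        # systemctl stop atomic-openshift-node
        # systemctl stop docker
        # systemctl start docker
        # systemctl start atomic-openshift-node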

Is there anything we can do here to at least mitigate the ovs/sdn pods shutting down?

Version Info:

OCP 3.11.141
docker-1.13.1-104.git4ef4b30

Comment 4 Hugo Cisneiros (Eitch) 2019-12-02 21:16:35 UTC
One more case:

Nov 27 12:42:12 dockerd-current[1649]: time="2019-11-27T12:42:11.706413993Z" level=info msg="killing and restarting containerd"
Nov 27 12:42:12 dockerd-current[1649]: time="2019-11-27T12:42:12.333064712Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Nov 27 12:42:12 dockerd-current[1649]: time="2019-11-27T12:42:12.772160707Z" level=info msg="libcontainerd: new containerd process, pid: 111814"
Nov 27 12:42:13 dockerd-current[1649]: time="2019-11-27T12:42:13.348255275Z" level=info msg="killing and restarting containerd"
Nov 27 12:42:13 dockerd-current[1649]: time="2019-11-27T12:42:13.449019674Z" level=info msg="libcontainerd: new containerd process, pid: 111841"
Nov 27 12:42:14 dockerd-current[1649]: time="2019-11-27T12:42:14.047946349Z" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"

This repeats a lot for 30 minutes, probably due to extended high system load. After that, the restarts stop, but new pods can't be created on the node, failing with this message:

Warning   FailedCreatePodSandBox   kubelet, node.example.com   Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "77e9d45d52520f555a01684dea35d81fa617aa76b0b98647a4b8eb942c792a7a" network for pod "logging-curator-1574911800-nxptr": NetworkPlugin cni failed to set up pod "logging-curator-1574911800-nxptr_openshift-logging" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused, failed to clean up sandbox container "77e9d45d52520f555a01684dea35d81fa617aa76b0b98647a4b8eb942c792a7a" network for pod "logging-curator-1574911800-nxptr": NetworkPlugin cni failed to teardown pod "logging-curator-1574911800-nxptr_openshift-logging" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused]

I think the sdn pods got 'lost' again.
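
As a hedged diagnostic aside (not from the original comment), the CNI error above points at the SDN pod's CNI server socket, so on the affected node one could check whether the socket exists and whether the sdn pod is still healthy, assuming the usual OCP 3.11 sdn- pod naming:

        # ls -l /var/run/openshift-sdn/cni-server.sock
        # oc get pods -n openshift-sdn -o wide | grep sdn-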

Version Info:

OCP 3.11.98

Comment 7 Tom Sweeney 2020-05-15 16:41:01 UTC
This has fallen down too far on my stack and I'm not sure I'll be able to pop it up much in the near term. Plus, Peter Hunt is probably more familiar with this codebase than I am at the moment. Peter, can I ask you to look at this, please?

Comment 8 Tom Sweeney 2020-05-21 21:47:58 UTC
I just moved this onto Peter's plate, and I know he's got a number of other irons in the fire. I'm going to move this to the next sprint.

Comment 11 Tom Sweeney 2020-06-09 17:12:30 UTC
Peter, have you had a chance to look at this one yet?

Comment 28 Kir Kolyshkin 2020-07-30 02:44:27 UTC
This PR might help to fix the issue: https://github.com/projectatomic/docker/pull/377

Not sure how to test it though, as I don't have a repro.

Comment 36 Kir Kolyshkin 2020-08-20 21:34:12 UTC
The fix is being tested by at least one customer.

Comment 54 Kir Kolyshkin 2020-09-28 22:22:19 UTC
From that log I see dockerd panics during startup while trying to create a new instance of the plugin manager. I can't see anything wrong with the code; I will try to repro.

Comment 61 Kir Kolyshkin 2020-10-22 05:14:46 UTC
Alas, I do not have any news, good or bad. I will take a fresh look at it next week.

Comment 62 Kir Kolyshkin 2020-10-27 23:26:20 UTC
OK, I found a stupid typo in one of my commits; it was in plain sight but somehow remained practically invisible.

The bug is now fixed, see updated https://github.com/projectatomic/docker/pull/377.

I have also mocked up a repro for the issue:

1. run dockerd as usual.

2. check that it's working

        # docker run --rm busybox echo hey
        hey

3. as root, run

        killall -STOP docker-containerd-current

This will make containerd unresponsive (SIGSTOP suspends the process without terminating it).

4. After some time (10s or so) dockerd will try to restart containerd. From the dockerd logs:

...
INFO[0099] killing and restarting containerd            
ERRO[0099] libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing 
INFO[0099] libcontainerd: new containerd process, pid: 12230 
WARN[0000] containerd: low RLIMIT_NOFILE changing to max  current=1024 max=4096
INFO[0100] killing and restarting containerd            
INFO[0000] === BEGIN goroutine stack dump ===
...

5. Now dockerd is non-functional as it's not reconnecting to a new containerd properly:

[root@centos70-ne dynbinary-client]# docker run --rm busybox echo hey
/usr/bin/docker-current: Error response from daemon: transport is closing.

6. I found that it eventually starts working again, but given the high load on a customer's system it might take a while.
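
A minimal sketch (not part of the original repro) of how one might watch for that recovery, reusing the test command from step 2; it simply retries until dockerd answers successfully again:

        # until docker run --rm busybox echo hey; do sleep 10; done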

I am working on validating the fix; I hit some other issues (not sure if they are related).

Comment 63 Kir Kolyshkin 2020-10-28 00:49:33 UTC
OK, the additional issue I hit after fixing the initial one is also fixed.

For the record, the additional issue was that "docker run --rm busybox echo hey" never exited, because the dockerd goroutine responsible for handling containerd events (such as "container exited") was gone after containerd was restarted. This is now fixed in my patch by backporting an upstream PR.

To the best of my knowledge, the current state of https://github.com/projectatomic/docker/pull/377 fixes this. Tested locally using the mock repro from the previous comment.

Comment 64 Kir Kolyshkin 2020-10-28 00:51:13 UTC
What we need now is a test build based on https://github.com/projectatomic/docker/pull/377, and a customer who wishes to give it a try.

@jnovy can you please create a test build?

Comment 65 Jindrich Novy 2020-10-28 07:17:15 UTC
Kir, the test scratch-build is now available here: http://brew-task-repos.usersys.redhat.com/repos/scratch/jnovy/docker/1.13.1/204.git0be3e21.el7_9/

Comment 66 Kir Kolyshkin 2020-10-28 18:20:37 UTC
@jnovy I checked the source rpm, and unfortunately docker-1746435.patch has the old version of https://github.com/projectatomic/docker/pull/377.patch.

There should be 6 commits in there, not 3. I guess there was some caching involved.

Can you please re-build with the correct patches?

Comment 67 Jindrich Novy 2020-10-29 05:43:28 UTC
Kir, the scratch build is now updated: http://brew-task-repos.usersys.redhat.com/repos/scratch/jnovy/docker/1.13.1/204.git0be3e21.el7_9/

Comment 69 Kir Kolyshkin 2020-10-29 19:48:03 UTC
@dwojewod @erich @tmicheli @asheth @dconsoli @dapark can you please test the new build (comment #67)? I have tested it locally and, to the best of my knowledge, it fixes the issue.

Comment 76 Kir Kolyshkin 2020-11-04 03:55:36 UTC
> Can we have detailed steps that need to be implemented by customers? One of my CUs (case 02624821) is willing to do this.

Perhaps some support engineer who is better aware of how a particular customer is deploying and updating OC should step up here and help.

A rough guide (a consolidated command sketch follows the list):

1. Download the packages from the URL provided in comment #67 (this should be done by a Red Hat employee, as it is hosted on the intranet) and provide them to the customer.
2. Update the docker* packages on the OC nodes that have a high chance of exhibiting this bug. The command to do that is something like sudo rpm -Uhv docker*.rpm.
3. Restart the docker daemon (sudo systemctl restart docker).
4. Make sure the updated dockerd is running (docker info should show a Server Version of 1.13.1-204.git0be3e21.el7_9 or so).
5. Do whatever it takes to reproduce the issue (???).
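
A minimal command sketch of steps 2-4, to be run as root on the node in the directory containing the downloaded RPMs (the exact package file names are whatever the scratch build provides):

        # rpm -Uhv docker*.rpm
        # systemctl restart docker
        # docker info | grep -i 'server version'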

Comment 82 Anand Paladugu 2020-11-12 00:22:12 UTC
@kir, @jnovy

Another OCP 3.11 customer is seeing this issue (case #02788875). I am not certain whether they would be willing to just test a scratch build. Any thoughts on when this change could be merged into OCP 3.11?

Comment 83 Jindrich Novy 2020-11-12 08:17:58 UTC
Anand, we need to be sure this change doesn't break other things, so any feedback from the customer's testing would be appreciated. If all goes well, it can be released in the next Extras-RHEL7 update.

Comment 84 Anand Paladugu 2020-11-12 15:17:54 UTC
Unfortunately, the issue occurred in PROD, and they don't want to deploy a fix that is not in a GA release. Is there any way to repro this issue in house?

Comment 87 Anand Paladugu 2020-11-12 17:23:23 UTC
I understand the logic completely, Rob. We certainly don't want to cause a regression in a private fork for a lot more customers. I can tell the customer that they have to wait until we are certain, or offer the option to deploy the fix in prod (as Asheth's customer above did). Let's see what the customer comes back with.

Comment 99 Kir Kolyshkin 2020-12-15 22:35:37 UTC
@ssadhale thank you for testing this. I spent some time looking at the logs and the source code; it seems that the fixes I propose all make sense, but they are not enough to fix the issue in cases where the system is overloaded. I will try to come up with additional fix(es) on top of what we already have.

Comment 103 Kir Kolyshkin 2021-01-06 18:01:48 UTC
https://github.com/projectatomic/docker/pull/377 was merged yesterday.

Comment 105 Daniel Walsh 2021-01-28 11:35:00 UTC
Fixed in docker-1.13.1-203.git7b2844e.el7_9

Comment 106 Jindrich Novy 2021-01-28 12:14:06 UTC
Dan, note docker-1.13.1-203.git7b2844e.el7_9 is a scratch build. We are now waiting for testing feedback. The fix is not committed yet.

Comment 110 Alex Jia 2021-02-24 11:11:20 UTC
Moving this bug to the VERIFIED state based on the following tests.

[root@master-0 ~]# rpm -q docker
docker-1.13.1-204.git0be3e21.el7_9.x86_64

[root@master-0 ~]# docker run --rm quay.io/libpod/busybox echo hey
hey

[root@master-0 ~]# killall -STOP docker-containerd-current
[root@master-0 ~]# docker run --rm quay.io/libpod/busybox echo hey
/usr/bin/docker-current: Error response from daemon: transport is closing.

[root@master-0 ~]# journalctl |grep -e "killing and restarting containerd"
Feb 24 06:08:49 master-0.sharedocp311cns.lab.upshift.rdu2.redhat.com dockerd-current[36837]: time="2021-02-24T06:08:49.169316998-05:00" level=info msg="killing and restarting containerd"
Feb 24 06:08:50 master-0.sharedocp311cns.lab.upshift.rdu2.redhat.com dockerd-current[36837]: time="2021-02-24T06:08:50.168330732-05:00" level=info msg="killing and restarting containerd"
Feb 24 06:08:51 master-0.sharedocp311cns.lab.upshift.rdu2.redhat.com dockerd-current[36837]: time="2021-02-24T06:08:51.668382852-05:00" level=info msg="killing and restarting containerd"

[root@master-0 ~]# docker run --rm quay.io/libpod/busybox echo hey
hey

Comment 111 jose-ccsp 2021-03-13 12:31:13 UTC
How can I get the fixed package?
It seems that http://brew-task-repos.usersys.redhat.com/repos/scratch/jnovy/docker/1.13.1/204.git0be3e21.el7_9/ is a private (Red Hat internal) link.

Comment 114 errata-xmlrpc 2021-03-16 14:42:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (docker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0888

Comment 115 Red Hat Bugzilla 2025-01-30 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.