Created attachment 1569216 [details]
kubelet logs leading up to exit

Description of problem:

Noticed that a kubelet terminated on an RHCOS node. Providing logs in case they are useful.

Last gasp (full logs in attachment):

May 01 16:17:45 ip-10-0-147-186 hyperkube[1047]: E0501 16:17:45.421433 1047 pod_workers.go:190] Error syncing pod 5de4c650-6c2c-11e9-9837-021d26480ed8 ("crel-monitors-app-creation-test-1-deploy_openshift-monitoring(5de4c650-6c2c-11e9-9837-021d26480ed8)"), skipping: timeout expired waiting for volumes to attach or mount for pod "openshift-monitoring"/"crel-monitors-app-creation-test-1-deploy". list of unmounted volumes=[deployer-token-tlvlh]. list of unattached volumes=[deployer-token-tlvlh]
May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: E0501 17:47:08.780235 1047 remote_runtime.go:336] ExecSync 58b3ef666b673113dd6b02afc2d01d7fe1a17c22817191b427e6b1029411f8e7 '/usr/share/openvswitch/scripts/ovs-ctl status' from runtime service failed: rpc error: code = Unknown desc = command error: exec failed: container_linux.go:336: starting container process caused "process_linux.go:90: adding pid 95434 to cgroups caused \"failed to write 95434 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddc9cd6f6_6ade_11e9_b5a4_021d26480ed8.slice/crio-58b3ef666b673113dd6b02afc2d01d7fe1a17c22817191b427e6b1029411f8e7.scope/cgroup.procs: invalid argument\""
May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: , stdout: , stderr: , exit code -1

Version-Release number of selected component (if applicable):
4.1.0-rc.0

How reproducible:
Has affected 4% of nodes in the starter cluster over the course of ~2 weeks.

Steps to Reproduce:
Unknown

Additional info:
The kubelet attempted to restart as expected after terminating.
The fatal error is that the kubelet can't connect to the cri-o socket on startup:

Apr 30 00:27:59 ip-10-0-147-186 hyperkube[879]: F0430 00:27:59.425159 879 server.go:264] failed to run Kubelet: failed to create kubelet: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Apr 30 00:27:59 ip-10-0-147-186 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Apr 30 00:27:59 ip-10-0-147-186 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 30 00:27:59 ip-10-0-147-186 systemd[1]: Failed to start Kubernetes Kubelet.

Is there any indication that cri-o was having trouble in these windows where the kubelet restarts?
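When debugging this on a live node, a first check is whether the socket the kubelet is dialing actually exists. A minimal sketch (the socket path comes from the error above; the helper name is our own, not part of any tooling):

```shell
# Report whether a given cri-o socket path exists as a unix socket.
# If it is missing, the kubelet fails exactly as in the log above.
check_crio_sock() {
  if [ -S "$1" ]; then
    echo "socket present: $1"
  else
    echo "socket missing: $1"
  fi
}

check_crio_sock /var/run/crio/crio.sock
```

If the socket is missing, `systemctl status crio` on the node would show whether cri-o failed or simply has not started yet.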
Created attachment 1569525 [details]
Corresponding crio logs
I've attached the cri-o journal for the node.

- Apr 30 00:27:59 kubelet exit seems legit, as crio exits during the "normal" CNI setup dance.
- May 01 17:47:08 kubelet exit doesn't seem to correspond to anything in the cri-o logs.
This is right after a reboot, and crio is also just getting a chance to come up:

-- Reboot --
Apr 30 00:27:59 ip-10-0-147-186 systemd[1]: Starting Open Container Initiative Daemon...
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: time="2019-04-30T00:28:00Z" level=warning msg="DefaultWorkloadTrust deprecated
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: /*\ Warning /*\
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: DEPRECATED: Use Runtimes instead.
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: The support of this option will continue through versions 1.12 and 1.13. By
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: version 1.14, this option will no longer exist.
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: /*\ Warning /*\
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: "
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: time="2019-04-30T00:28:00Z" level=warning msg="runtime is deprecated in favor of runtimes, please switch to that"
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: time="2019-04-30 00:28:00.485825703Z" level=error msg="error updating cni config: Missing CNI default network"
Apr 30 00:28:00 ip-10-0-147-186 systemd[1]: Started Open Container Initiative Daemon.

Unless you are seeing the kubelet fail and restart during steady-state operation, this isn't a problem. However, we should probably add Wants=crio.service to the kubelet.service file and see if that helps.
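A minimal sketch of that ordering as a systemd drop-in (the drop-in path is an assumption for illustration, and whether Wants= alone is enough or After= is also needed is a judgment call; the real change would land via the machine-config):

```ini
# /etc/systemd/system/kubelet.service.d/10-crio.conf (hypothetical path)
# Wants= pulls crio.service into the same transaction as the kubelet;
# After= delays kubelet startup until cri-o has started, so the
# /var/run/crio/crio.sock dial does not race cri-o's startup.
[Unit]
Wants=crio.service
After=crio.service
```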
I am also not sure what to make of:

May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: , stdout: , stderr: , exit code -1

The log message before it is not fatal, and this is the last line in the log. No dump and no indication of why the kubelet stopped.
It looks like this is fixed in later versions of runc [1,2].

[1] https://github.com/opencontainers/runc/issues/1884
[2] https://community.pivotal.io/s/article/Failure-to-start-container-failing-to-write-to-cgroup-procs-invalid-argument
Apr 30 00:27:59 - Agreed, this is not a concern.

May 01 17:47:08 - This is the mystery restart. There is nothing exceptional in the combined journal at this timestamp:

May 01 17:27:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556731645.4525] dhcp4 (ens3): mtu 9001
May 01 17:27:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556731645.4527] dhcp4 (ens3): state changed bound -> bound
May 01 17:27:25 ip-10-0-147-186 systemd[1]: Starting Network Manager Script Dispatcher Service...
May 01 17:27:25 ip-10-0-147-186 dbus-daemon[869]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
May 01 17:27:25 ip-10-0-147-186 systemd[1]: Started Network Manager Script Dispatcher Service.
May 01 17:27:25 ip-10-0-147-186 nm-dispatcher[75562]: req:1 'dhcp4-change' [ens3]: new request (3 scripts)
May 01 17:27:25 ip-10-0-147-186 nm-dispatcher[75562]: req:1 'dhcp4-change' [ens3]: start running ordered scripts...
May 01 17:27:35 ip-10-0-147-186 systemd[1]: NetworkManager-dispatcher.service: Consumed 20ms CPU time
May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: E0501 17:47:08.780235 1047 remote_runtime.go:336] ExecSync 58b3ef666b673113dd6b02afc2d01d7fe1a17c22817191b427e6b1029411f8e7 '/usr/share/openvswitch/scripts/ov>
May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: , stdout: , stderr: , exit code -1
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8207] dhcp4 (ens3): address 10.0.147.186
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): plen 20
May 01 17:57:25 ip-10-0-147-186 dbus-daemon[869]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.3' (uid=0 pid=8>
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): expires in 3600 seconds
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): nameserver '10.0.0.2'
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): domain name 'us-west-1.compute.internal'
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): hostname 'ip-10-0-147-186'

Is there anything I can pull, or traps I can set, for when this happens again? If the failure rate holds, we should expect to see it again within a couple of days.
Is ceph enabled on these nodes, or are there bidirectional mounts?
ceph - no. All AWS EBS storage.

bidirectional mounts - "oc get pods -o=yaml --all-namespaces | grep mountPropagation" only found "mountPropagation: HostToContainer" entries.
PR: https://github.com/openshift/machine-config-operator/pull/772
I think the PR addresses the Apr 30 00:27:59 issue. Is the idea then that https://bugzilla.redhat.com/show_bug.cgi?id=1710626#c6 is the cause of May 01 17:47:08 (which was steady state before the exit)?
PR merged.
Checked with Ryan, and we do see a couple of these errors in the last 6 days:

https://search.svc.ci.openshift.org/?search=process_linux.go&maxAge=336h&context=2&type=all

stderr: exec failed: container_linux.go:336: starting container process caused "process_linux.go:90: adding pid 69510 to cgroups caused \"failed to write 69510 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod0411dd70_92ae_11e9_be34_120e46d86d90.slice/crio-67294fd018ada3821111e95df92477f4e52cfd91a56d4df54e1f0041219146f2.scope/cgroup.procs: invalid argument\""
command terminated with exit code 1
We need to backport the fix from https://github.com/opencontainers/runc/issues/1884
Going to call this fixed. There were version bumps of various components, including crio to 1.14, that included the fix for this bug. Just make sure the machine is running crio 1.14.
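As a sanity check on a node, the installed version can be compared against that 1.14 minimum. A hedged sketch (the version string would normally come from `crio --version`; it is hardcoded here for illustration):

```shell
# Version-aware comparison using sort -V: succeeds if $1 >= $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

crio_version="1.14.1"   # assumption: substitute the real `crio --version` output
if version_at_least "$crio_version" "1.14"; then
  echo "crio $crio_version includes the fix"
else
  echo "crio $crio_version is too old; upgrade to >= 1.14"
fi
```

The `sort -V` trick avoids the usual string-comparison pitfall where "1.9" would sort after "1.14".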
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922