Created attachment 1569216 [details]
kubelet logs leading up to exit

Description of problem:

Noticed that a kubelet terminated on an RHCOS node. Providing logs in case they are useful.

Last gasp (full logs in attachment):

May 01 16:17:45 ip-10-0-147-186 hyperkube[1047]: E0501 16:17:45.421433 1047 pod_workers.go:190] Error syncing pod 5de4c650-6c2c-11e9-9837-021d26480ed8 ("crel-monitors-app-creation-test-1-deploy_openshift-monitoring(5de4c650-6c2c-11e9-9837-021d26480ed8)"), skipping: timeout expired waiting for volumes to attach or mount for pod "openshift-monitoring"/"crel-monitors-app-creation-test-1-deploy". list of unmounted volumes=[deployer-token-tlvlh]. list of unattached volumes=[deployer-token-tlvlh]
May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: E0501 17:47:08.780235 1047 remote_runtime.go:336] ExecSync 58b3ef666b673113dd6b02afc2d01d7fe1a17c22817191b427e6b1029411f8e7 '/usr/share/openvswitch/scripts/ovs-ctl status' from runtime service failed: rpc error: code = Unknown desc = command error: exec failed: container_linux.go:336: starting container process caused "process_linux.go:90: adding pid 95434 to cgroups caused \"failed to write 95434 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddc9cd6f6_6ade_11e9_b5a4_021d26480ed8.slice/crio-58b3ef666b673113dd6b02afc2d01d7fe1a17c22817191b427e6b1029411f8e7.scope/cgroup.procs: invalid argument\""
May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: , stdout: , stderr: , exit code -1

Version-Release number of selected component (if applicable):
4.1.0-rc.0

How reproducible:
Has affected 4% of nodes in the starter cluster over the course of ~2 weeks.

Steps to Reproduce:
Unknown

Additional info:
The kubelet attempted to restart as expected after terminating.
The fatal error is that the kubelet can't connect to the cri-o socket on startup:

Apr 30 00:27:59 ip-10-0-147-186 hyperkube[879]: F0430 00:27:59.425159 879 server.go:264] failed to run Kubelet: failed to create kubelet: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Apr 30 00:27:59 ip-10-0-147-186 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Apr 30 00:27:59 ip-10-0-147-186 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 30 00:27:59 ip-10-0-147-186 systemd[1]: Failed to start Kubernetes Kubelet.

Is there any indication that cri-o was having trouble in these windows where the kubelet restarts?
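When debugging this on a live node, a first check is whether the socket the kubelet is dialing actually exists. A minimal sketch (the socket path comes from the error above; the helper name is our own, not part of any tooling):

```shell
# Report whether a given cri-o socket path exists as a unix socket.
# If it is missing, the kubelet fails exactly as in the log above.
check_crio_sock() {
  if [ -S "$1" ]; then
    echo "socket present: $1"
  else
    echo "socket missing: $1"
  fi
}

check_crio_sock /var/run/crio/crio.sock
```

If the socket is missing, `systemctl status crio` on the node would show whether cri-o failed or simply has not started yet.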
Created attachment 1569525 [details]
Corresponding crio logs
I've attached the cri-o journal for the node.

- Apr 30 00:27:59 kubelet exit seems legit, as crio exits during the "normal" CNI setup dance.
- May 01 17:47:08 kubelet exit doesn't seem to correspond to anything in the cri-o logs.
This is right after a reboot, and crio is also just getting a chance to come up:

-- Reboot --
Apr 30 00:27:59 ip-10-0-147-186 systemd[1]: Starting Open Container Initiative Daemon...
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: time="2019-04-30T00:28:00Z" level=warning msg="DefaultWorkloadTrust deprecated
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: /*\ Warning /*\
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: DEPRECATED: Use Runtimes instead.
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: The support of this option will continue through versions 1.12 and 1.13. By
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: version 1.14, this option will no longer exist.
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: /*\ Warning /*\
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: "
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: time="2019-04-30T00:28:00Z" level=warning msg="runtime is deprecated in favor of runtimes, please switch to that"
Apr 30 00:28:00 ip-10-0-147-186 crio[980]: time="2019-04-30 00:28:00.485825703Z" level=error msg="error updating cni config: Missing CNI default network"
Apr 30 00:28:00 ip-10-0-147-186 systemd[1]: Started Open Container Initiative Daemon.

Unless you are seeing the kubelet fail and restart during steady-state operation, this isn't a problem. However, we should probably add Wants=crio.service to the kubelet.service file and see if that helps.
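A minimal sketch of that ordering as a systemd drop-in (the drop-in path is an assumption for illustration, and whether Wants= alone is enough or After= is also needed is a judgment call; the real change would land via the machine-config):

```ini
# /etc/systemd/system/kubelet.service.d/10-crio.conf (hypothetical path)
# Wants= pulls crio.service into the same transaction as the kubelet;
# After= delays kubelet startup until cri-o has started, so the
# /var/run/crio/crio.sock dial does not race cri-o's startup.
[Unit]
Wants=crio.service
After=crio.service
```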
I am also not sure what to make of:

May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: , stdout: , stderr: , exit code -1

The log message before it is not fatal, and this is the last line in the log. No dump and no indication of why the kubelet stopped.
It looks like this is fixed in later versions of runc [1,2].

[1] https://github.com/opencontainers/runc/issues/1884
[2] https://community.pivotal.io/s/article/Failure-to-start-container-failing-to-write-to-cgroup-procs-invalid-argument
Apr 30 00:27:59 - Agreed, this is not a concern.

May 01 17:47:08 - This is the mystery restart. There is nothing exceptional in the combined journal at this timestamp:

May 01 17:27:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556731645.4525] dhcp4 (ens3): mtu 9001
May 01 17:27:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556731645.4527] dhcp4 (ens3): state changed bound -> bound
May 01 17:27:25 ip-10-0-147-186 systemd[1]: Starting Network Manager Script Dispatcher Service...
May 01 17:27:25 ip-10-0-147-186 dbus-daemon[869]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
May 01 17:27:25 ip-10-0-147-186 systemd[1]: Started Network Manager Script Dispatcher Service.
May 01 17:27:25 ip-10-0-147-186 nm-dispatcher[75562]: req:1 'dhcp4-change' [ens3]: new request (3 scripts)
May 01 17:27:25 ip-10-0-147-186 nm-dispatcher[75562]: req:1 'dhcp4-change' [ens3]: start running ordered scripts...
May 01 17:27:35 ip-10-0-147-186 systemd[1]: NetworkManager-dispatcher.service: Consumed 20ms CPU time
May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: E0501 17:47:08.780235 1047 remote_runtime.go:336] ExecSync 58b3ef666b673113dd6b02afc2d01d7fe1a17c22817191b427e6b1029411f8e7 '/usr/share/openvswitch/scripts/ov>
May 01 17:47:08 ip-10-0-147-186 hyperkube[1047]: , stdout: , stderr: , exit code -1
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8207] dhcp4 (ens3): address 10.0.147.186
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): plen 20
May 01 17:57:25 ip-10-0-147-186 dbus-daemon[869]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.3' (uid=0 pid=8>
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): expires in 3600 seconds
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): nameserver '10.0.0.2'
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): domain name 'us-west-1.compute.internal'
May 01 17:57:25 ip-10-0-147-186 NetworkManager[871]: <info> [1556733445.8208] dhcp4 (ens3): hostname 'ip-10-0-147-186'

Is there anything I can pull, or traps I can set, for when this happens again? If the failure rate holds, we should expect to see it again within a couple of days.
Is ceph enabled on these nodes, or are there bidirectional mounts?
ceph - no. All AWS EBS storage.

bidirectional mounts - "oc get pods -o=yaml --all-namespaces | grep mountPropagation" only found "mountPropagation: HostToContainer" entries.
PR: https://github.com/openshift/machine-config-operator/pull/772
I think the PR addresses the Apr 30 00:27:59 issue. Is the idea then that https://bugzilla.redhat.com/show_bug.cgi?id=1710626#c6 is the cause of May 01 17:47:08 (which was steady state before the exit)?
PR merged.
Checked with Ryan, and we do see a couple of these errors in the last 6 days:

https://search.svc.ci.openshift.org/?search=process_linux.go&maxAge=336h&context=2&type=all

stderr: exec failed: container_linux.go:336: starting container process caused "process_linux.go:90: adding pid 69510 to cgroups caused \"failed to write 69510 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod0411dd70_92ae_11e9_be34_120e46d86d90.slice/crio-67294fd018ada3821111e95df92477f4e52cfd91a56d4df54e1f0041219146f2.scope/cgroup.procs: invalid argument\""
command terminated with exit code 1
We need to backport the fix from https://github.com/opencontainers/runc/issues/1884
Going to call this fixed. There were version bumps of various components, including crio to 1.14, that included the fix for this bug. Just make sure the machine is running crio 1.14.
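As a sanity check on a node, the installed version can be compared against that 1.14 minimum. A hedged sketch (the version string would normally come from `crio --version`; it is hardcoded here for illustration):

```shell
# Version-aware comparison using sort -V: succeeds if $1 >= $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

crio_version="1.14.1"   # assumption: substitute the real `crio --version` output
if version_at_least "$crio_version" "1.14"; then
  echo "crio $crio_version includes the fix"
else
  echo "crio $crio_version is too old; upgrade to >= 1.14"
fi
```

The `sort -V` trick avoids the usual string-comparison pitfall where "1.9" would sort after "1.14".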
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922