Description of problem:

DaemonSet pods fail to start after a reboot or a restart of atomic-openshift-node. The atomic-openshift-node logs contain many messages like the following:

# journalctl -u atomic-openshift-node | grep checkpoint | grep logging-fluentd | tail -n 1
May 09 03:11:42 xyz.example.com atomic-openshift-node[3490]: E0509 03:11:42.975032 3761 kuberuntime_gc.go:152] Failed to stop sandbox "bf6abb4d6c368b505a013e680842b41abb4fcc29714d8752415916994944d125" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logging-fluentd-xfs4k_logging" network: could not retrieve port mappings: checkpoint is not found

Customer's workaround:

The directory /var/lib/dockershim, where the checkpoints are stored, resides on the node container's read-write layer:

# docker exec -ti atomic-openshift-node bash
# df /var/lib/dockershim
Filesystem                                                                                         1K-blocks    Used Available Use% Mounted on
/dev/mapper/docker-253:5-33574976-07be611136ee5596d755ae568a65bb7cb7c0382162b4f2108b22d8705cb9f08a  10467328 1272424   9194904  13% /

Since the atomic-openshift-node service removes the previous container as a pre-start step, the contents (checkpoints) are lost on every restart:

# systemctl show -p ExecStartPre atomic-openshift-node | head -1
ExecStartPre={ path=/usr/bin/docker ; argv[]=/usr/bin/docker rm -f atomic-openshift-node ; ignore_errors=yes ; start_time=[Fri 2018-04-27 04:45:07 UTC] ; stop_time=[Fri 2018-04-27 04:45:07 UTC] ; pid=7244 ; code=exited ; status=0 }

If we persist the checkpoints via a host bind mount, e.g. by adding "-v /var/lib/dockershim:/var/lib/dockershim:z" to /etc/systemd/system/atomic-openshift-node.service:

# systemctl show -p ExecStart atomic-openshift-node --no-pager
ExecStart={ path=/usr/bin/docker ; argv[]=/usr/bin/docker run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node -v /:/rootfs:ro,rslave -e CONFIG_FILE=${CONFIG_FILE} -e OPTIONS=${OPTIONS} -e HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v /etc/origin/openvswitch:/etc/openvswitch -v /etc/origin/sdn:/etc/openshift-sdn -v /var/lib/cni:/var/lib/cni -v /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log -v /dev:/dev $DOCKER_ADDTL_BIND_MOUNTS -v /etc/pki:/etc/pki:ro -v /var/lib/dockershim:/var/lib/dockershim:z openshift_paas-openshift_platform_v3_images-openshift3_node:${IMAGE_VERSION} ; ignore_errors=no ; start_time=[Tue 2018-05-08 06:26:56 UTC] ; stop_time=[n/a] ; pid=3490 ; code=(null) ; status=0/0 }

After this change, the DaemonSet pods start properly after a reboot.

Actual results:
DaemonSet pods fail to start after a reboot.

Expected results:
DaemonSet pods should start properly after a reboot.

Additional info:
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1463574
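A quick way to sanity-check the workaround on an affected node (a sketch; TEST-MARKER is a hypothetical file name used only to prove persistence):

# systemctl daemon-reload
# systemctl restart atomic-openshift-node
# touch /var/lib/dockershim/TEST-MARKER
# systemctl restart atomic-openshift-node
# docker exec atomic-openshift-node ls /var/lib/dockershim/TEST-MARKER
/var/lib/dockershim/TEST-MARKER

Since each restart replaces the node container (the "docker rm -f" in ExecStartPre), the marker surviving the second restart confirms the checkpoint directory now lives on the host rather than on the discarded read-write layer.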
Not sure if this is a pod or a container problem; it doesn't appear to be specific to logging.
xrefs:
https://bugzilla.redhat.com/show_bug.cgi?id=1534419
https://github.com/openshift/origin/issues/19604
https://github.com/openshift/origin/issues/18827
https://github.com/openshift/origin/issues/19138

The fix is https://github.com/kubernetes/kubernetes/pull/55826, which is already present in 3.9 and later. Risk is low on this fix. Proceeding with the backport.
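On a node running a build that includes the fix, the sandbox teardown should no longer fail on the missing checkpoint. A simple check (a sketch; counts occurrences of the error since the current boot, so it should print 0 after a reboot on a fixed build):

# journalctl -u atomic-openshift-node -b | grep -c 'checkpoint is not found'
0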
Actually, this fix was already backported and is included in v3.7.23 and later. The customer reports being on v3.7.14; please advise the customer to upgrade to v3.7.23 or later.
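For a containerized install like the one shown above, the running version can be confirmed from the env file passed via --env-file in the ExecStart (this assumes IMAGE_VERSION is set there, as the ${IMAGE_VERSION} reference in the docker run line suggests; the output below is illustrative, matching the customer's reported version):

# grep IMAGE_VERSION /etc/sysconfig/atomic-openshift-node
IMAGE_VERSION=v3.7.14

Anything below v3.7.23 lacks the backport and will still hit this bug.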