Description of problem:

DaemonSet pods fail to start after a reboot or a restart of atomic-openshift-node. The atomic-openshift-node logs contain many messages like the following:

# journalctl -u atomic-openshift-node | grep checkpoint | grep logging-fluentd | tail -n 1
May 09 03:11:42 xyz.example.com atomic-openshift-node[3490]: E0509 03:11:42.975032 3761 kuberuntime_gc.go:152] Failed to stop sandbox "bf6abb4d6c368b505a013e680842b41abb4fcc29714d8752415916994944d125" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logging-fluentd-xfs4k_logging" network: could not retrieve port mappings: checkpoint is not found

Customer's workaround:

The directory /var/lib/dockershim, where the checkpoints are stored, resides on the node container's read-write layer:

# docker exec -ti atomic-openshift-node bash
# df /var/lib/dockershim
Filesystem                                                                                         1K-blocks    Used Available Use% Mounted on
/dev/mapper/docker-253:5-33574976-07be611136ee5596d755ae568a65bb7cb7c0382162b4f2108b22d8705cb9f08a  10467328 1272424   9194904  13% /

Since the atomic-openshift-node service removes the previous container as a pre-start step, the contents (checkpoints) are lost on every restart:

# systemctl show -p ExecStartPre atomic-openshift-node | head -1
ExecStartPre={ path=/usr/bin/docker ; argv[]=/usr/bin/docker rm -f atomic-openshift-node ; ignore_errors=yes ; start_time=[Fri 2018-04-27 04:45:07 UTC] ; stop_time=[Fri 2018-04-27 04:45:07 UTC] ; pid=7244 ; code=exited ; status=0 }

If we persist the checkpoints via a host bind mount, e.g. by adding "-v /var/lib/dockershim:/var/lib/dockershim:z" to /etc/systemd/system/atomic-openshift-node.service:

# systemctl show -p ExecStart atomic-openshift-node --no-pager
ExecStart={ path=/usr/bin/docker ; argv[]=/usr/bin/docker run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node -v /:/rootfs:ro,rslave -e CONFIG_FILE=${CONFIG_FILE} -e OPTIONS=${OPTIONS} -e HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v /etc/origin/openvswitch:/etc/openvswitch -v /etc/origin/sdn:/etc/openshift-sdn -v /var/lib/cni:/var/lib/cni -v /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log -v /dev:/dev $DOCKER_ADDTL_BIND_MOUNTS -v /etc/pki:/etc/pki:ro -v /var/lib/dockershim:/var/lib/dockershim:z openshift_paas-openshift_platform_v3_images-openshift3_node:${IMAGE_VERSION} ; ignore_errors=no ; start_time=[Tue 2018-05-08 06:26:56 UTC] ; stop_time=[n/a] ; pid=3490 ; code=(null) ; status=0/0 }

After this change, the DaemonSet pods start properly after a reboot.

Actual results:
DaemonSet pods fail to start after a reboot.

Expected results:
DaemonSet pods should start properly after a reboot.

Additional info:
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1463574
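A quick way to sanity-check the workaround on an affected node (a sketch; TEST-MARKER is a hypothetical file name used only to prove persistence):

# systemctl daemon-reload
# systemctl restart atomic-openshift-node
# touch /var/lib/dockershim/TEST-MARKER
# systemctl restart atomic-openshift-node
# docker exec atomic-openshift-node ls /var/lib/dockershim/TEST-MARKER
/var/lib/dockershim/TEST-MARKER

Since each restart replaces the node container (the "docker rm -f" in ExecStartPre), the marker surviving the second restart confirms the checkpoint directory now lives on the host rather than on the discarded read-write layer.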
Not sure if this is a pod or a container problem; it doesn't appear to be specific to logging.
xrefs:
https://bugzilla.redhat.com/show_bug.cgi?id=1534419
https://github.com/openshift/origin/issues/19604
https://github.com/openshift/origin/issues/18827
https://github.com/openshift/origin/issues/19138

The fix is https://github.com/kubernetes/kubernetes/pull/55826, which is already present in 3.9 and later. Risk is low on this fix. Proceeding with the backport.
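On a node running a build that includes the fix, the sandbox teardown should no longer fail on the missing checkpoint. A simple check (a sketch; counts occurrences of the error since the current boot, so it should print 0 after a reboot on a fixed build):

# journalctl -u atomic-openshift-node -b | grep -c 'checkpoint is not found'
0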
Actually, this fix was already backported and is included in v3.7.23 and later. The customer reports being on v3.7.14; please advise the customer to upgrade to v3.7.23 or later.
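For a containerized install like the one shown above, the running version can be confirmed from the env file passed via --env-file in the ExecStart (this assumes IMAGE_VERSION is set there, as the ${IMAGE_VERSION} reference in the docker run line suggests; the output below is illustrative, matching the customer's reported version):

# grep IMAGE_VERSION /etc/sysconfig/atomic-openshift-node
IMAGE_VERSION=v3.7.14

Anything below v3.7.23 lacks the backport and will still hit this bug.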