Bug 1576777

Summary: DaemonSet pods fail to start after reboot
Product: OpenShift Container Platform
Component: Node
Version: 3.7.0
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Priyanka Kanthale <pkanthal>
Assignee: Seth Jennings <sjenning>
QA Contact: DeShuai Ma <dma>
CC: aos-bugs, jokerman, mmccomas, rmeggins
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-05-10 16:26:52 UTC

Description Priyanka Kanthale 2018-05-10 11:40:17 UTC
Description of problem:

DaemonSet pods fail to start after a reboot or an atomic-openshift-node restart.

The atomic-openshift-node logs contain many messages like the following:

# journalctl -u atomic-openshift-node | grep checkpoint | grep logging-fluentd | tail -n 1

May 09 03:11:42 xyz.example.com atomic-openshift-node[3490]: E0509 03:11:42.975032    3761 kuberuntime_gc.go:152] Failed to stop sandbox "bf6abb4d6c368b505a013e680842b41abb4fcc29714d8752415916994944d125" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logging-fluentd-xfs4k_logging" network: could not retrieve port mappings: checkpoint is not found.
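
For reference, the checkpoints the error refers to are small per-sandbox files kept by dockershim under /var/lib/dockershim. A quick way to check whether they are present on the node (a sketch; the sandbox/ subdirectory is assumed from the default dockershim layout):

# docker exec -ti atomic-openshift-node ls /var/lib/dockershim/sandbox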


Customer's Workaround

The directory /var/lib/dockershim, where the checkpoints are stored, resides on the container's read-write (rw) layer:

# docker exec -ti atomic-openshift-node bash
# df /var/lib/dockershim
Filesystem                                                                                         1K-blocks    Used Available Use% Mounted on
/dev/mapper/docker-253:5-33574976-07be611136ee5596d755ae568a65bb7cb7c0382162b4f2108b22d8705cb9f08a  10467328 1272424   9194904  13% /

Since the atomic-openshift-node service removes the previous container as a pre-start step, its contents (including the checkpoints) are lost on every restart:

# systemctl show -p ExecStartPre atomic-openshift-node | head -1
ExecStartPre={ path=/usr/bin/docker ; argv[]=/usr/bin/docker rm -f atomic-openshift-node ; ignore_errors=yes ; start_time=[Fri 2018-04-27 04:45:07 UTC] ; stop_time=[Fri 2018-04-27 04:45:07 UTC] ; pid=7244 ; code=exited ; status=0 }
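
The effect can be demonstrated directly: after a service restart, the node container starts from a fresh rw layer, so the previous checkpoints are gone and the checkpoint directory in the new container starts out empty or absent (a sketch, reusing the assumed default sandbox/ layout from above):

# systemctl restart atomic-openshift-node
# docker exec -ti atomic-openshift-node ls /var/lib/dockershim/sandbox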


The checkpoints can be persisted via a host bind mount, e.g. by adding "-v /var/lib/dockershim:/var/lib/dockershim:z" to the docker run arguments in /etc/systemd/system/atomic-openshift-node.service:


# systemctl show -p ExecStart atomic-openshift-node --no-pager
ExecStart={ path=/usr/bin/docker ; argv[]=/usr/bin/docker run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node -v /:/rootfs:ro,rslave -e CONFIG_FILE=${CONFIG_FILE} -e OPTIONS=${OPTIONS} -e HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v /etc/origin/openvswitch:/etc/openvswitch -v /etc/origin/sdn:/etc/openshift-sdn -v /var/lib/cni:/var/lib/cni -v /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log -v /dev:/dev $DOCKER_ADDTL_BIND_MOUNTS -v /etc/pki:/etc/pki:ro -v /var/lib/dockershim:/var/lib/dockershim:z openshift_paas-openshift_platform_v3_images-openshift3_node:${IMAGE_VERSION} ; ignore_errors=no ; start_time=[Tue 2018-05-08 06:26:56 UTC] ; stop_time=[n/a] ; pid=3490 ; code=(null) ; status=0/0 }
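
For reference, applying the workaround amounts to adding the extra "-v /var/lib/dockershim:/var/lib/dockershim:z" argument to the docker run line in the unit file and then reloading and restarting the service (a sketch; the unit file path is the one from this case):

# vi /etc/systemd/system/atomic-openshift-node.service
# systemctl daemon-reload
# systemctl restart atomic-openshift-node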


After this change, the DaemonSet pods seem to start properly after a reboot.
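
A quick way to verify this after a node reboot is to check that the DaemonSet pods scheduled on that node are Running again, e.g. for the fluentd pods from this case (a sketch; namespace and pod name are taken from the log message above):

# oc get pods -n logging -o wide | grep logging-fluentd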


Actual results: DaemonSet pods fail to start after a reboot


Expected results: DaemonSet pods should start properly after a reboot


Additional info:  
Ref:  https://bugzilla.redhat.com/show_bug.cgi?id=1463574

Comment 1 Rich Megginson 2018-05-10 14:42:12 UTC
Not sure whether this is a pod problem or a container problem; it doesn't appear to be specific to logging.

Comment 3 Seth Jennings 2018-05-10 16:26:52 UTC
This fix was already backported and is included in v3.7.23 and later.
The customer reports being on v3.7.14; please advise them to upgrade.
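
For reference, the running OpenShift version can be confirmed before and after the upgrade with (a sketch; the output format varies slightly between releases):

# oc version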