Bug 1473531

Summary:	atomic-openshift-node does not terminate journalctl child process cleanly
Product:	OpenShift Container Platform	Reporter:	Gan Huang <ghuang>
Component:	Node	Assignee:	Seth Jennings <sjenning>
Status:	CLOSED CANTFIX	QA Contact:	DeShuai Ma <dma>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.6.0	CC:	aos-bugs, jchaloup, jokerman, jsafrane, mmccomas, sdodson, wmeng
Target Milestone:	---
Target Release:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-09-06 19:30:54 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Gan Huang 2017-07-21 06:00:44 UTC

Description of problem:
Installation sometimes (10%) failed at "TASK [openshift_node : Start and enable node]"

Version-Release number of the following components:
openshift-ansible-3.6.161-1

# openshift version
openshift v3.6.153
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

docker-1.12.6-48.git0fdc778.el7.x86_64

RHEL-7.4
# uname -r
3.10.0-693.el7.x86_64

How reproducible:
almost one fail in 10 attempts with same configurations

Steps to Reproduce:
1. Trigger installation via openshift-ansible


Actual results:
TASK [openshift_node : Start and enable node] **********************************
Friday 21 July 2017  05:00:12 +0000 (0:00:00.109)       0:08:52.795 *********** 
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
fatal: [host-8-241-28.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "attempts": 1, 
    "changed": false, 
    "failed": true
}

MSG:

Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.


...ignoring
fatal: [host-8-241-40.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "attempts": 1, 
    "changed": false, 
    "failed": true
}

MSG:

Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.


...ignoring


Expected results:
No errors

Additional info:
The node logs on the master-node and app-node indicate 2 different issues, and the node should able to back active after restart atomic-openshift-master service.

On the master node:
# journalctl -u atomic-openshift-node
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.795384   51700 start_node.go:251] Reading node configuration from /etc/origin/node/node-config.yaml
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.827406   51700 node.go:123] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "host-8-241-28.host.centralci.eng.rdu2.redhat.com" (IP ""), iptables sync period "30s"
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.827852   51700 ipcmd.go:44] Executing: /usr/sbin/ip link set lbr0 down
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.831722   51700 ipcmd.go:48] Error executing /usr/sbin/ip: Cannot find device "lbr0"
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.831835   51700 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.831848   51700 docker.go:384] Start docker client with request timeout=2m0s
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: W0721 01:35:39.833592   51700 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: F0721 01:35:39.841989   51700 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com systemd[1]: Failed to start OpenShift Node.
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service failed.

On the app node:
# journalctl -u atomic-openshift-node
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: W0721 01:36:19.644933   33431 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.645039   33431 kubelet.go:2069] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: E0721 01:36:19.645060   33431 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.645156   33431 iptables.go:394] running iptables -C [KUBE-FIREWALL -t filter -m comment --comment kubernetes firewall for dropping marked packets -m mark --mark 0x00008000/0x00008000 -j DROP]
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.648548   33431 iptables.go:394] running iptables -C [OUTPUT -t filter -j KUBE-FIREWALL]
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.650726   33431 iptables.go:394] running iptables -C [INPUT -t filter -j KUBE-FIREWALL]
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.652538   33431 iptables.go:394] running iptables -N [KUBE-MARK-MASQ -t nat]
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.653987   33431 iptables.go:394] running iptables -N [KUBE-POSTROUTING -t nat]
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.655618   33431 iptables.go:394] running iptables -C [KUBE-MARK-MASQ -t nat -j MARK --set-xmark 0x00004000/0x00004000]
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.657336   33431 iptables.go:394] running iptables -C [POSTROUTING -t nat -m comment --comment kubernetes postrouting rules -j KUBE-POSTROUTING]
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.659052   33431 iptables.go:394] running iptables -C [KUBE-POSTROUTING -t nat -m comment --comment kubernetes service traffic requiring SNAT -m mark --mark 0x00004000/0x00004000 -j MASQUERADE]
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: E0721 01:36:19.661405   33431 factory.go:330] devicemapper filesystem stats will not be reported: usage of thin_ls is disabled to preserve iops
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.661418   33431 factory.go:342] Registering Docker factory
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: W0721 01:36:19.661428   33431 manager.go:247] Registration of the rkt container factory failed: unable to communicate with Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp [::1]:15441: getsockopt: connection refused
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.661432   33431 factory.go:54] Registering systemd factory
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.661530   33431 factory.go:86] Registering Raw factory
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: F0721 01:36:19.661621   33431 kubelet.go:1235] Failed to start cAdvisor inotify_init: too many open files
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com systemd[1]: Failed to start OpenShift Node.
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service failed.

Comment 13 Seth Jennings 2017-07-25 18:52:04 UTC

The upstream commit that exposed this is being reverted:

https://github.com/openshift/openshift-ansible/pull/4863 (master)
https://github.com/openshift/openshift-ansible/pull/4864 (3.6)

I'm going to keep this bug open though and reduce the severity and moving target to 3.7 because we should stop the exec'ed journalctl processes cleanly but that will require an upstream kube/cadvisor change.

Comment 14 Seth Jennings 2017-09-06 19:30:54 UTC

installer change that exposed this issue has been reverted so the journalctl process is killed with the node process at unit stop time.