Bug 1293826 - Node tries to connect to the master before the master is ready during container install.
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.1.0
Hardware: Unspecified  OS: Unspecified
Priority: medium  Severity: medium
Assigned To: Andrew Butcher
QA Contact: Ma xiaoqiang
Depends On:
Blocks:
Reported: 2015-12-23 03:00 EST by Johnny Liu
Modified: 2016-07-03 20:46 EDT
CC: 7 users

Fixed In Version: openshift-ansible-3.0.33-1.git.0.bb1f8aa.el7aos
Doc Type: Bug Fix
Last Closed: 2016-01-29 15:58:39 EST
Type: Bug


Attachments: None
Description Johnny Liu 2015-12-23 03:00:31 EST
Description of problem:
Ran a containerized install on Atomic Host (AH); the installation failed.
$ ansible-playbook -i ~/installer/hosts-3.1 ./playbooks/byo/config.yml
<--snip-->
TASK: [openshift_manage_node | Wait for Node Registration] ******************** 
ok: [test1.cluster.local] => (item=test1.cluster.local)
failed: [test1.cluster.local] => (item=test2.cluster.local) => {"attempts": 20, "changed": false, "cmd": ["oc", "get", "node", "test2.cluster.local"], "delta": "0:00:02.070645", "end": "2015-12-23 15:06:00.226005", "failed": true, "item": "test2.cluster.local", "rc": 1, "start": "2015-12-23 15:05:58.155360", "stdout_lines": [], "warnings": []}
stderr: Error from server: node "test2.cluster.local" not found
msg: Task failed as maximum retries was encountered
failed: [test1.cluster.local] => (item=test3.cluster.local) => {"attempts": 20, "changed": false, "cmd": ["oc", "get", "node", "test3.cluster.local"], "delta": "0:00:02.223930", "end": "2015-12-23 15:08:29.962372", "failed": true, "item": "test3.cluster.local", "rc": 1, "start": "2015-12-23 15:08:27.738442", "stdout_lines": [], "warnings": []}
stderr: Error from server: node "test3.cluster.local" not found
msg: Task failed as maximum retries was encountered

FATAL: all hosts have already failed -- aborting


Version-Release number of selected component (if applicable):
https://github.com/openshift/openshift-ansible.git -b master

How reproducible:
Often (especially on powerful hardware)

Steps to Reproduce:
1. Launch AH VMs on a physical machine
2. Run the ansible playbook to perform a containerized install

Actual results:
Installation failed.

Expected results:
Installation should PASS.

Additional info:
When the failure happens, check the master and node service logs:
master:
# docker logs 85f50037c9d9
W1223 02:03:16.751128       1 start_master.go:269] assetConfig.loggingPublicURL: invalid value '', Details: required to view aggregated container logs in the console
W1223 02:03:16.866502       1 start_master.go:269] assetConfig.metricsPublicURL: invalid value '', Details: required to view cluster metrics in the console
I1223 02:03:17.795335       1 plugins.go:71] No cloud provider specified.
I1223 02:03:18.034573       1 start_master.go:380] Starting master on 0.0.0.0:8443 (v3.1.1.0)
I1223 02:03:18.034621       1 start_master.go:381] Public master address is https://atomic.nay.redhat.com:8443
I1223 02:03:18.034647       1 start_master.go:385] Using images from "registry.access.redhat.com/openshift3/ose-<component>:v3.1.1.0"
2015-12-23 02:03:18.034840 I | netutil: resolving test1.cluster.local:7001 to 10.66.81.156:7001
2015-12-23 02:03:18.034927 I | netutil: resolving test1.cluster.local:4001 to 10.66.81.156:4001
I1223 02:03:18.304385       1 server.go:65] etcd: peerTLS: cert = /etc/origin/master/etcd.server.crt, key = /etc/origin/master/etcd.server.key, ca = /etc/origin/master/ca.crt
I1223 02:03:18.523217       1 server.go:76] etcd: listening for peers on https://0.0.0.0:7001
I1223 02:03:18.523246       1 server.go:87] etcd: clientTLS: cert = /etc/origin/master/etcd.server.crt, key = /etc/origin/master/etcd.server.key, ca = /etc/origin/master/ca.crt
I1223 02:03:18.738948       1 server.go:98] etcd: listening for client requests on https://0.0.0.0:4001
2015-12-23 02:03:18.739616 I | etcdserver: name = openshift.local
2015-12-23 02:03:18.739647 I | etcdserver: data dir = /var/lib/origin/openshift.local.etcd
<--snip-->

node:
# docker logs 4e7fe2505fa3
I1223 15:02:27.771803    3644 start_node.go:178] Starting a node connected to https://test1.cluster.local:8443
I1223 15:02:27.777989    3644 server.go:328] Running kubelet in containerized mode (experimental)
I1223 15:02:27.783461    3644 plugins.go:71] No cloud provider specified.
I1223 15:02:27.787300    3644 common.go:70] Self IP: 10.66.81.158.
I1223 15:02:27.787335    3644 start_node.go:255] Starting node test2.cluster.local (v3.1.1.0)
I1223 15:02:27.792851    3644 node.go:54] Connecting to Docker at unix:///var/run/docker.sock
I1223 15:02:27.795864    3644 manager.go:128] cAdvisor running in container: "/system.slice/docker-4e7fe2505fa35d3da2ec4e5be7728bf77bbba5b28f253f8dc5ee25d380fa06ba.scope"
E1223 15:02:27.803302    3644 common.go:197] Failed to obtain ClusterNetwork: Get https://test1.cluster.local:8443/oapi/v1/clusternetworks/default: dial tcp 10.66.81.156:8443: connection refused
F1223 15:02:27.803354    3644 node.go:173] SDN Node failed: Get https://test1.cluster.local:8443/oapi/v1/clusternetworks/default: dial tcp 10.66.81.156:8443: connection refused
<--snip-->

Comparing the timestamps in the logs shows that, when the node service tries to contact the master, the master has not started yet. (The default timezone differs between the master and node containers, so the hour values differ; only the minute and second values need to be compared.)


After the failure occurred, restarting the node's docker container brought the node service up successfully.
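For reference, the manual workaround is simply to restart the node service on the affected host. A minimal sketch, assuming the default atomic-openshift-node unit/container name used by the containerized install:

# restart via the systemd unit that wraps the container
systemctl restart atomic-openshift-node
# or restart the container directly (name as shown by 'docker ps')
docker restart atomic-openshift-node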
Comment 1 Scott Dodson 2016-01-07 17:25:55 EST
Proposed fix : https://github.com/openshift/openshift-ansible/pull/1137
Comment 2 Scott Dodson 2016-01-07 17:28:08 EST
To test:

git pull https://github.com/sdodson/openshift-ansible bz1293826
Comment 5 Scott Dodson 2016-01-11 14:54:29 EST
Which tasks fail now? Can you provide the log output of the failed task and the tasks immediately preceding it so I can figure out where else we're having this problem?
Comment 7 Andrew Butcher 2016-01-13 11:57:09 EST
New PR with proposed fix is:
https://github.com/openshift/openshift-ansible/pull/1159

The API is now verified in several places during the install, such as prior to node registration.
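For illustration, the kind of API check described above can be approximated from the command line by polling the master API until it answers. This is a sketch only, not the playbook's actual task (port 8443 matches this environment; the /healthz endpoint is assumed to be reachable):

# poll the master API for up to ~2 minutes, stopping at the first successful response
for i in $(seq 1 24); do
    curl -skf https://test1.cluster.local:8443/healthz >/dev/null && break
    sleep 5
done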

Moving to ON_QA for advanced install method.
Comment 8 Johnny Liu 2016-01-14 04:06:43 EST
Still reproducible.

GATHERING FACTS *************************************************************** 
ok: [test1.cluster.local]

TASK: [Wait for master API to become available before proceeding] ************* 
ok: [test1.cluster.local]

TASK: [fail ] ***************************************************************** 
skipping: [test1.cluster.local]

TASK: [openshift_manage_node | Wait for Node Registration] ******************** 
ok: [test1.cluster.local] => (item=test1.cluster.local)
failed: [test1.cluster.local] => (item=test2.cluster.local) => {"attempts": 20, "changed": false, "cmd": ["oc", "get", "node", "test2.cluster.local"], "delta": "0:00:02.227486", "end": "2016-01-14 16:33:18.045583", "failed": true, "item": "test2.cluster.local", "rc": 1, "start": "2016-01-14 16:33:15.818097", "stdout_lines": [], "warnings": []}
stderr: 
================================================================================
ATTENTION: You are running oc via a wrapper around 'docker run openshift3/ose'.
This wrapper is intended only to be used to bootstrap an environment. Please
install client tools on another host once you have granted cluster-admin
privileges to a user. 
See https://docs.openshift.com/enterprise/latest/cli_reference/get_started_cli.html
=================================================================================

Error from server: node "test2.cluster.local" not found
msg: Task failed as maximum retries was encountered
ok: [test1.cluster.local] => (item=test3.cluster.local)

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/home/jialiu/config.retry

localhost                  : ok=13   changed=0    unreachable=0    failed=0   
test1.cluster.local        : ok=287  changed=62   unreachable=0    failed=1   
test2.cluster.local        : ok=89   changed=23   unreachable=0    failed=0   
test3.cluster.local        : ok=89   changed=23   unreachable=0    failed=0   



This issue actually involves the "restart docker during node initialization" behavior of the container install that Scott mentioned in the PR.

Master:
-bash-4.2# docker ps -a
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS               NAMES
c8b7b62ee6f5        openshift3/node          "/usr/local/bin/origi"   5 minutes ago       Up 5 minutes                            atomic-openshift-node
e8f1200118b2        openshift3/ose           "oc get node test1.cl"   5 minutes ago       Dead                                    
3fbe2f30323e        openshift3/ose           "/usr/bin/openshift s"   5 minutes ago       Up 5 minutes                            atomic-openshift-master
4d035939ba46        openshift3/openvswitch   "/usr/local/bin/ovs-r"   5 minutes ago       Up 5 minutes                            openvswitch
d4b4edd41b72        openshift3/openvswitch   "/usr/local/bin/ovs-r"   7 minutes ago       Dead                                    agitated_franklin
6b5a3de5f713        openshift3/ose           "/usr/bin/openshift s"   12 minutes ago      Dead                                    trusting_shockley

As seen in the container list above, there are two master containers (one is dead, one is running). The following is my assumption:
When the master is started for the 1st time and its API is available, the node services on vm1, vm2, and vm3 start booting.
*NOTE*:
vm1 is the master and is also a node; vm2 and vm3 are pure nodes.

When the node service starts on vm1 for the 1st time, it triggers a docker restart due to SDN configuration initialization. The docker restart kills the running master container, and a 2nd master container starts.
At the same time, the node containers on vm2 and vm3 are launching and trying to connect to the master API service.
Unfortunately, some nodes (such as vm2) fail to start because the 2nd master container is still booting up and its API is not ready yet.

The following is the 2nd master container's service log on vm1.
-bash-4.2# docker logs 3fbe2f30323e
W0114 03:29:50.516815       1 start_master.go:269] assetConfig.loggingPublicURL: invalid value '', Details: required to view aggregated container logs in the console
W0114 03:29:50.528617       1 start_master.go:269] assetConfig.metricsPublicURL: invalid value '', Details: required to view cluster metrics in the console
I0114 03:29:51.721133       1 plugins.go:71] No cloud provider specified.
I0114 03:29:51.949401       1 start_master.go:380] Starting master on 0.0.0.0:8443 (v3.1.1.2)
I0114 03:29:51.949436       1 start_master.go:381] Public master address is https://atomic.nay.redhat.com:8443
I0114 03:29:51.951534       1 start_master.go:385] Using images from "registry.access.redhat.com/openshift3/ose-<component>:v3.1.1.2"
2016-01-14 03:29:51.951674 I | netutil: resolving test1.cluster.local:7001 to 10.66.81.114:7001

The following is the node container's service log on vm2.
-bash-4.2# docker ps -a
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS               NAMES
c6ebe5475fb4        openshift3/node          "/usr/local/bin/origi"   3 minutes ago       Up 3 minutes                            atomic-openshift-node
243f3af874c0        openshift3/openvswitch   "/usr/local/bin/ovs-r"   4 minutes ago       Up 3 minutes                            openvswitch
-bash-4.2# docker logs c6ebe5475fb4
I0114 16:28:32.053584    3644 start_node.go:178] Starting a node connected to https://test1.cluster.local:8443
I0114 16:28:32.065973    3644 server.go:328] Running kubelet in containerized mode (experimental)
I0114 16:28:32.072097    3644 plugins.go:71] No cloud provider specified.
I0114 16:28:32.075507    3644 common.go:70] Self IP: 10.66.81.113.
I0114 16:28:32.075586    3644 start_node.go:255] Starting node test2.cluster.local (v3.1.1.2)
I0114 16:28:32.091017    3644 node.go:54] Connecting to Docker at unix:///var/run/docker.sock
I0114 16:28:32.095107    3644 manager.go:128] cAdvisor running in container: "/system.slice/docker-c6ebe5475fb4e736f74613bffc03be50f6dd2f3446c29a3bd71432c63dfdc7d3.scope"
E0114 16:28:32.097292    3644 common.go:197] Failed to obtain ClusterNetwork: Get https://test1.cluster.local:8443/oapi/v1/clusternetworks/default: dial tcp 10.66.81.114:8443: connection refused
F0114 16:28:32.097332    3644 node.go:173] SDN Node failed: Get https://test1.cluster.local:8443/oapi/v1/clusternetworks/default: dial tcp 10.66.81.114:8443: connection refused


The following is the node container's service log on vm3.
-bash-4.2# docker ps -a
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS               NAMES
30b9c1aa4946        openshift3/node          "/usr/local/bin/origi"   4 minutes ago       Up 4 minutes                            atomic-openshift-node
cb283a07e2b3        openshift3/openvswitch   "/usr/local/bin/ovs-r"   4 minutes ago       Up 4 minutes                            openvswitch
e980061022ae        openshift3/openvswitch   "/usr/local/bin/ovs-r"   6 minutes ago       Dead                                    sleepy_einstein
-bash-4.2# docker logs 30b9c1aa4946
I0114 16:30:03.460692    4271 start_node.go:178] Starting a node connected to https://test1.cluster.local:8443
I0114 16:30:03.472615    4271 server.go:328] Running kubelet in containerized mode (experimental)
I0114 16:30:03.482893    4271 plugins.go:71] No cloud provider specified.
I0114 16:30:05.987759    4271 common.go:70] Self IP: 10.66.81.115.
I0114 16:30:05.987817    4271 start_node.go:255] Starting node test3.cluster.local (v3.1.1.2)
I0114 16:30:06.034235    4271 node.go:54] Connecting to Docker at unix:///var/run/docker.sock
I0114 16:30:06.556958    4271 manager.go:128] cAdvisor running in container: "/system.slice/docker-30b9c1aa49463330de595f356fd6a390147a2b72b1d69ff47798164124818c57.scope"



Checking the timestamps in the logs, you will find that the node service on vm2 tries to connect to the master API before the 2nd master service is ready.
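One way to confirm this ordering independently of the containers' differing timezones is to compare the start times docker records for each container (reported in UTC). A sketch using the container IDs from the 'docker ps -a' output above:

# on vm1: start time of the 2nd master container
docker inspect -f '{{ .State.StartedAt }}' 3fbe2f30323e
# on vm2: start time of the node container
docker inspect -f '{{ .State.StartedAt }}' c6ebe5475fb4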


So adding "Wait for master API to become available before proceeding" before "Wait for Node Registration" cannot resolve this issue, because the node service start has already been triggered before the 2nd master API becomes ready.

A workaround would be to add a "node restart" for the failed nodes in "Wait for Node Registration".
A better resolution would be to add retries for connecting to the master API in the openshift source code when the node is starting, instead of attempting the connection only once during node boot-up.
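For example, the kind of retry I have in mind would look roughly like this. This is a hypothetical sketch, not the actual playbook change (the unit name assumes the default containerized install):

# on the failed node host (e.g. vm2): restart the containerized node service
systemctl restart atomic-openshift-node
# then, from the master, poll for registration again (the same check the playbook retries)
for i in $(seq 1 20); do
    oc get node test2.cluster.local && break
    sleep 5
done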

I am not sure my understanding is right; I hope this helps.
Comment 9 Andrew Butcher 2016-01-14 14:53:17 EST
Thanks Johnny, that was a huge help and I was able to reproduce locally. The node was unhealthy to begin with so checking the API prior to checking node registration wasn't going to get us anywhere.

https://github.com/openshift/openshift-ansible/pull/1185 adds an API check prior to starting the node so that we can ensure it starts correctly. Both of the API checks (prior to node start and prior to registration check) ensure the nodes will start and register in my testing.
Comment 12 Johnny Liu 2016-01-18 07:20:26 EST
Verified this bug with the OpenShiftEnterpriseErrata/3.1/2016-01-16.1 puddle, and it passed.

The issue does NOT reproduce now.
