Bug 1614904
| Summary: | Validation of static pod fails due to inconsistent names | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
| Component: | Installer | Assignee: | Michael Gugino <mgugino> |
| Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.10.0 | CC: | aleks, aos-bugs, arghosh, brian.millett, byount, dhwanil.raval, fshaikh, jcrumple, jkaur, jokerman, jolee, mark.vinkx, maupadhy, mmccomas, msomasun, openshift-bugs-escalate, rbost, rhowe, rkant, rkshirsa, schoudha, scuppett, sdodson, sgarciam, sheldyakov, shlao, torben, wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1638525 | Environment: | |
| Last Closed: | 2018-11-20 03:10:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1638525 | | |
Steven, we need to get the logs from the static pods and the complete journal from the node service on all masters:

`journalctl --no-pager > node.log`
`master-logs etcd etcd &> etcd.log`
`master-logs api api &> api.log`
`master-logs controllers controllers &> controllers.log`

The static pods for the API should come up before CNI and SDN are initialized and the node is marked ready. There should be no need to install atomic-openshift-sdn-ovs in 3.10; this is all handled via a daemonset that is provisioned after the API bootstraps.

Hi, we previously asked the customer to check whether the pods were around, e.g. `oc get pods -n kube-system`, but the customer could not get output for these because the master was not responding. Do we expect that command to respond if we ask for it in a different namespace? Otherwise, how should we check for these logs? Sorry, the "services running as pods" model is still a bit new to me. Is "master-logs" a command, or shorthand for getting journalctl output? I don't see it as an option in my 3.10 cluster, so I assume the latter.

Never mind, I see "master-logs" in /usr/local/bin; I'll have the customer grab those logs.

*** Bug 1615754 has been marked as a duplicate of this bug. ***

*** Bug 1613348 has been marked as a duplicate of this bug. ***

QE also hit a similar issue; refer to scenario #1 in https://bugzilla.redhat.com/show_bug.cgi?id=1629726#c2.

Seeing something similar in a bare-metal environment. Any updates?

This should be addressed via https://github.com/openshift/openshift-ansible/pull/10356 on release-3.11.

According to the dev's proposed verification path (https://gist.github.com/michaelgugino/c961476d8be7d160a5e53fe9a9734051), a 3.11 fresh install testing scenario #4 also needs a backport similar to what was done for 3.10 in https://github.com/openshift/openshift-ansible/pull/10409.

PR created for 3.11: https://github.com/openshift/openshift-ansible/pull/10447

The 3.11 PR has merged.

Verified this bug with openshift-ansible-3.11.38-1.git.0.d146f83.el7.noarch, and it PASSES:
Scenario #1:
Try to install a new 3.11 cluster with openshift_kubelet_name_override set; the install should fail. (A sketch of how this check is triggered follows the output below.)
############ ANSIBLE RUN: playbooks/prerequisites.yml ############
PLAY [Fail openshift_kubelet_name_override for new hosts] **********************
TASK [Gathering Facts] *********************************************************
Monday 05 November 2018 14:33:22 +0800 (0:00:00.111) 0:00:00.111 *******
ok: [qe-jialiu312-master-etcd-1.1105-0gs.qe.rhcloud.com]
ok: [qe-jialiu312-node-1.1105-0gs.qe.rhcloud.com]
ok: [qe-jialiu312-node-registry-router-1.1105-0gs.qe.rhcloud.com]
TASK [Fail when openshift_kubelet_name_override is defined] ********************
Monday 05 November 2018 14:33:23 +0800 (0:00:01.097) 0:00:01.209 *******
fatal: [qe-jialiu312-master-etcd-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
fatal: [qe-jialiu312-node-registry-router-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
fatal: [qe-jialiu312-node-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
to retry, use: --limit @/home/slave3/workspace/Launch Environment Flexy Wrapper/private-openshift-ansible/playbooks/prerequisites.retry
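For reference, a minimal sketch of how scenario #1 is triggered, assuming a standard openshift-ansible inventory at /etc/ansible/hosts (the inventory path and the override value are hypothetical; the variable name and failure message come from the output above):

# Add to the [OSEv3:vars] section of the inventory (hypothetical value):
#   openshift_kubelet_name_override=ip-10-0-1-23.ec2.internal
ansible-playbook -i /etc/ansible/hosts playbooks/prerequisites.yml
# Expected: the "Fail when openshift_kubelet_name_override is defined" task
# fails each new host with:
#   "openshift_kubelet_name_override Cannot be defined for new hosts"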
Scenario #2:
Cluster install on OSP (snvl2) without the cloud provider enabled, with short hostnames: PASS.
Scenario #3:
Cluster install on OSP (snvl2) with the cloud provider enabled, with short hostnames: PASS.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537
Description of problem:
Installation/startup in AWS fails: the network plugin is not ready and the CNI config is uninitialized. Pods are failing to start with the following messages in /var/log/messages:

Aug 7 11:29:27 AWGMEUOM01 atomic-openshift-node: W0807 11:29:27.047991 32375 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Aug 7 11:29:27 AWGMEUOM01 atomic-openshift-node: E0807 11:29:27.048134 32375 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Version-Release number of the following components:
The customer tried with:

$ rpm -qa | grep -i -e ansible -e atomic
openshift-ansible-roles-3.10.21-1.git.0.6446011.el7.noarch
ansible-2.6.2-1.el7.noarch
openshift-ansible-playbooks-3.10.21-1.git.0.6446011.el7.noarch
openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch
openshift-ansible-docs-3.10.21-1.git.0.6446011.el7.noarch

I advised the customer to use the Ansible 2.4 RPMs instead; they downgraded the packages and tried again, but hit the same issue.

Actual results:
FAILED - RETRYING: Wait for control plane pods to appear (18 retries left). Result was:
{
. . .
"msg": {
"cmd": "/bin/oc get pod master-etcd-awgmeuom02 -o json -n kube-system",
"results": [ {} ],
"returncode": 1,
"stderr": "Unable to connect to the server: EOF\n",
"stdout": ""
},

I'll upload the full Ansible logs to the BZ.

NOTES: This issue seems similar to https://github.com/openshift/openshift-ansible/issues/7967 and https://bugzilla.redhat.com/show_bug.cgi?id=1592010. However, I'm not certain it's the same issue, because it is a different version and I don't see all of the same messages, like "Unable to connect to the server".
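Since the control plane API is unreachable here ("Unable to connect to the server: EOF"), triage has to happen directly on a master. Below is a minimal sketch using the log-gathering commands requested earlier in this thread; the docker name filter and the openshift-sdn namespace are assumptions about a standard 3.10/3.11 layout, not details confirmed in this report:

# Complete journal plus the static pod logs (commands from this thread):
journalctl --no-pager > node.log
/usr/local/bin/master-logs etcd etcd &> etcd.log
/usr/local/bin/master-logs api api &> api.log
/usr/local/bin/master-logs controllers controllers &> controllers.log

# Static pod containers can be inspected even while the API is down;
# kubelet-managed containers are conventionally prefixed with "k8s_":
docker ps --filter name=k8s_ --format '{{.Names}}\t{{.Status}}'

# Once the API responds, check whether the SDN daemonset has written the
# CNI config (namespace name assumed):
ls /etc/cni/net.d
oc get daemonset -n openshift-sdn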