Bug 1614904 - Validation of static pod fails due to inconsistent names
Summary: Validation of static pod fails due to inconsistent names
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Michael Gugino
QA Contact: Johnny Liu
URL:
Whiteboard:
Duplicates: 1613348 1615754 (view as bug list)
Depends On:
Blocks: 1638525
 
Reported: 2018-08-10 16:50 UTC by Steven Walter
Modified: 2019-04-03 17:13 UTC
CC: 28 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1638525 (view as bug list)
Environment:
Last Closed: 2018-11-20 03:10:43 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 3564171 (last updated 2018-12-10 18:41:55 UTC)
Red Hat Product Errata RHBA-2018:3537 (last updated 2018-11-20 03:11:48 UTC)

Description Steven Walter 2018-08-10 16:50:43 UTC
Description of problem:

Installation/startup in AWS fails: the network plugin is not ready and the CNI config is uninitialized.

Pods are failing to start with the following messages in /var/log/messages:

Aug  7 11:29:27 AWGMEUOM01 atomic-openshift-node: W0807 11:29:27.047991   32375 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Aug  7 11:29:27 AWGMEUOM01 atomic-openshift-node: E0807 11:29:27.048134   32375 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
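
For reference, two quick checks on an affected node will confirm this state (the /etc/cni/net.d path comes from the log above; the openshift-sdn namespace and the sdn-*/ovs-* daemonset pods are the usual 3.10 defaults, assumed here rather than confirmed in this bug):

# Has the SDN written a CNI config yet?
ls -l /etc/cni/net.d/
# Are the SDN daemonset pods running? (needs a reachable API)
oc get pods -n openshift-sdn -o wide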



Version-Release number of the following components:
Customer tried with:
$ rpm -qa | grep -i -e ansible -e atomic
openshift-ansible-roles-3.10.21-1.git.0.6446011.el7.noarch
ansible-2.6.2-1.el7.noarch
openshift-ansible-playbooks-3.10.21-1.git.0.6446011.el7.noarch
openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch
openshift-ansible-docs-3.10.21-1.git.0.6446011.el7.noarch
I advised the customer to use the ansible 2.4 rpms instead; they downgraded the packages and tried again, but hit the same issue.


Actual results:
FAILED - RETRYING: Wait for control plane pods to appear (18 retries left).Result was: {
. . .
    "msg": {
        "cmd": "/bin/oc get pod master-etcd-awgmeuom02 -o json -n kube-system", 
        "results": [
            {}
        ], 
        "returncode": 1, 
        "stderr": "Unable to connect to the server: EOF\n", 
        "stdout": ""
    }, 

I'll upload full ansible logs to the bz.
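
For context, the check that is timing out amounts to roughly the following when run on a master (the master-etcd-/master-api-/master-controllers- prefixes are the usual 3.10 static pod names, and the node-name suffix, awgmeuom02 above, is whatever the kubelet registered; this is a sketch of the installer's check, not its exact task):

# rough manual equivalent of "Wait for control plane pods to appear"
for pod in master-etcd master-api master-controllers; do
  /bin/oc get pod "${pod}-awgmeuom02" -o json -n kube-system
done

Here even that fails with "Unable to connect to the server: EOF", i.e. the API static pod itself never came up.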



NOTES:
This issue seems similar to https://github.com/openshift/openshift-ansible/issues/7967 and https://bugzilla.redhat.com/show_bug.cgi?id=1592010
However, I'm not certain it's the same issue, because it is a different version and I don't see all of the same messages, like "Unable to connect to the server".

Comment 4 Scott Dodson 2018-08-10 18:21:00 UTC
Steven,

We need to get logs from the static pods and the complete journal from the node service on all masters.

`journalctl --no-pager > node.log`
`master-logs etcd etcd &> etcd.log`
`master-logs api api &> api.log`
`master-logs controllers controllers &> controllers.log`

The static pods for the API should come up before CNI and SDN are initialized and the node is marked ready.

There should be no need to install atomic-openshift-sdn-ovs in 3.10; this is all handled via a daemonset that's provisioned after the API bootstraps.
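
For reference, the static pods can also be checked directly on a master (the /etc/origin/node/pods path and the k8s_* container naming are the standard 3.10 conventions, assumed here):

# static pod manifests the node service should have picked up
ls /etc/origin/node/pods/
# containers the kubelet actually started for them (k8s_<container>_<pod>_... naming)
docker ps -a --filter name=k8s_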

Comment 5 Steven Walter 2018-08-10 18:42:16 UTC
Hi,
We previously asked the customer to check whether the pods were present, e.g.:
oc get pods -n kube-system
But the customer was not able to get output for these because the master was not responding. Do we expect that command to respond if we ask for it in a different namespace? If not, how should we check for these logs? Sorry, the "services running as pods" model is still a bit new to me.

Is "master-logs" a command or shorthand for getting journalctl output? I dont see it as an option in my 3.10 cluster so I assume the latter

Comment 6 Steven Walter 2018-08-10 18:43:52 UTC
Never mind, I see "master-logs" in /usr/local/bin. I'll have the customer grab those.

Comment 10 Scott Dodson 2018-08-14 20:15:04 UTC
*** Bug 1615754 has been marked as a duplicate of this bug. ***

Comment 13 Stephen Cuppett 2018-08-22 14:45:23 UTC
*** Bug 1613348 has been marked as a duplicate of this bug. ***

Comment 31 Johnny Liu 2018-09-29 06:47:47 UTC
QE also hit a similar issue to this bug; refer to scenario #1 in https://bugzilla.redhat.com/show_bug.cgi?id=1629726#c2.

Comment 32 Dhwanil Raval 2018-10-02 14:40:19 UTC
Getting a similar failure in a bare metal environment. Any updates?

Comment 37 Scott Dodson 2018-10-11 19:32:10 UTC
This should be addressed via https://github.com/openshift/openshift-ansible/pull/10356 on release-3.11.

Comment 38 Johnny Liu 2018-10-17 12:18:23 UTC
According to the dev's proposed verification path:
https://gist.github.com/michaelgugino/c961476d8be7d160a5e53fe9a9734051

For a 3.11 fresh install, testing scenario #4 also needs a backport similar to what was done for 3.10 in https://github.com/openshift/openshift-ansible/pull/10409.

Comment 39 Michael Gugino 2018-10-23 14:10:14 UTC
PR created for 3.11: https://github.com/openshift/openshift-ansible/pull/10447

Comment 40 Michael Gugino 2018-10-23 14:13:40 UTC
3.11 merged.

Comment 42 Johnny Liu 2018-11-05 08:08:08 UTC
Verified this bug with openshift-ansible-3.11.38-1.git.0.d146f83.el7.noarch, and PASS.

Scenario #1:
Try to install a new 3.11 cluster with openshift_kubelet_name_override set. The install should fail.

############ ANSIBLE RUN: playbooks/prerequisites.yml ############

PLAY [Fail openshift_kubelet_name_override for new hosts] **********************

TASK [Gathering Facts] *********************************************************
Monday 05 November 2018  14:33:22 +0800 (0:00:00.111)       0:00:00.111 ******* 
ok: [qe-jialiu312-master-etcd-1.1105-0gs.qe.rhcloud.com]
ok: [qe-jialiu312-node-1.1105-0gs.qe.rhcloud.com]
ok: [qe-jialiu312-node-registry-router-1.1105-0gs.qe.rhcloud.com]

TASK [Fail when openshift_kubelet_name_override is defined] ********************
Monday 05 November 2018  14:33:23 +0800 (0:00:01.097)       0:00:01.209 ******* 
fatal: [qe-jialiu312-master-etcd-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
fatal: [qe-jialiu312-node-registry-router-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
fatal: [qe-jialiu312-node-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
	to retry, use: --limit @/home/slave3/workspace/Launch Environment Flexy Wrapper/private-openshift-ansible/playbooks/prerequisites.retry
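
(For anyone tripping this guard: the simplest pre-flight check is to confirm the variable is not set anywhere in the inventory before running prerequisites.yml; the inventory path below is a placeholder.)

# should print nothing for a fresh 3.11 install
grep -n 'openshift_kubelet_name_override' /path/to/inventory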


Scenario #2:
cluster install on OSP (snvl2) without cloudprovider enabled + short hostname, PASS.

Scenario #3:
cluster install on OSP (snvl2) with cloudprovider enabled + short hostname, PASS.

Comment 44 errata-xmlrpc 2018-11-20 03:10:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537

