Bug 1614904
| Summary: | Validation of static pod fails due to inconsistent names | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
| Component: | Installer | Assignee: | Michael Gugino <mgugino> |
| Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.10.0 | CC: | aleks, aos-bugs, arghosh, brian.millett, byount, dhwanil.raval, fshaikh, jcrumple, jkaur, jokerman, jolee, mark.vinkx, maupadhy, mmccomas, msomasun, openshift-bugs-escalate, rbost, rhowe, rkant, rkshirsa, schoudha, scuppett, sdodson, sgarciam, sheldyakov, shlao, torben, wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1638525 | Environment: | |
| Last Closed: | 2018-11-20 03:10:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1638525 | | |
Steven, we need to get the logs from the static pods and the complete journal from the node service on all masters:

`journalctl --no-pager > node.log`
`master-logs etcd etcd &> etcd.log`
`master-logs api api &> api.log`
`master-logs controllers controllers &> controllers.log`

The static pods for the API should come up before CNI and SDN are initialized and the node is marked ready. There should be no need to install atomic-openshift-sdn-ovs in 3.10; this is all handled via a daemonset that is provisioned after the API bootstraps.

Hi, we previously asked the customer to check whether the pods were around, e.g. `oc get pods -n kube-system`, but the customer could not get output for these because the master was not responding. Do we expect that command to respond if we ask for it in a different namespace? Otherwise, how should we check for these logs? Sorry, the "services running as pods" model is still a bit new to me. Is "master-logs" a command, or shorthand for getting journalctl output? I don't see it as an option in my 3.10 cluster, so I assume the latter.

Never mind, I see "master-logs" in /usr/local/bin; I'll have the customer grab those logs.

*** Bug 1615754 has been marked as a duplicate of this bug. ***

*** Bug 1613348 has been marked as a duplicate of this bug. ***

QE also hit a similar issue; refer to scenario #1 in https://bugzilla.redhat.com/show_bug.cgi?id=1629726#c2.

Seeing something similar in a bare-metal environment. Any updates?

This should be addressed via https://github.com/openshift/openshift-ansible/pull/10356 on release-3.11.

According to the dev's proposed verification path (https://gist.github.com/michaelgugino/c961476d8be7d160a5e53fe9a9734051), a 3.11 fresh install testing scenario #4 also needs a backport similar to what was done for 3.10 in https://github.com/openshift/openshift-ansible/pull/10409.

PR created for 3.11: https://github.com/openshift/openshift-ansible/pull/10447

The 3.11 PR has merged.

Verified this bug with openshift-ansible-3.11.38-1.git.0.d146f83.el7.noarch, and it PASSES:
Scenario #1:
Try to install a new 3.11 cluster with openshift_kubelet_name_override set; the install should fail. (A sketch of how this check is triggered follows the output below.)
############ ANSIBLE RUN: playbooks/prerequisites.yml ############
PLAY [Fail openshift_kubelet_name_override for new hosts] **********************
TASK [Gathering Facts] *********************************************************
Monday 05 November 2018 14:33:22 +0800 (0:00:00.111) 0:00:00.111 *******
ok: [qe-jialiu312-master-etcd-1.1105-0gs.qe.rhcloud.com]
ok: [qe-jialiu312-node-1.1105-0gs.qe.rhcloud.com]
ok: [qe-jialiu312-node-registry-router-1.1105-0gs.qe.rhcloud.com]
TASK [Fail when openshift_kubelet_name_override is defined] ********************
Monday 05 November 2018 14:33:23 +0800 (0:00:01.097) 0:00:01.209 *******
fatal: [qe-jialiu312-master-etcd-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
fatal: [qe-jialiu312-node-registry-router-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
fatal: [qe-jialiu312-node-1.1105-0gs.qe.rhcloud.com]: FAILED! => {"changed": false, "msg": "openshift_kubelet_name_override Cannot be defined for new hosts"}
to retry, use: --limit @/home/slave3/workspace/Launch Environment Flexy Wrapper/private-openshift-ansible/playbooks/prerequisites.retry
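For reference, a minimal sketch of how scenario #1 is triggered, assuming a standard openshift-ansible inventory at /etc/ansible/hosts (the inventory path and the override value are hypothetical; the variable name and failure message come from the output above):

# Add to the [OSEv3:vars] section of the inventory (hypothetical value):
#   openshift_kubelet_name_override=ip-10-0-1-23.ec2.internal
ansible-playbook -i /etc/ansible/hosts playbooks/prerequisites.yml
# Expected: the "Fail when openshift_kubelet_name_override is defined" task
# fails each new host with:
#   "openshift_kubelet_name_override Cannot be defined for new hosts"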
Scenario #2:
Cluster install on OSP (snvl2) without the cloud provider enabled, with short hostnames: PASS.
Scenario #3:
Cluster install on OSP (snvl2) with the cloud provider enabled, with short hostnames: PASS.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537
Description of problem:
Installation/startup in AWS fails: the network plugin is not ready and the CNI config is uninitialized. Pods are failing to start with the following messages in /var/log/messages:

Aug 7 11:29:27 AWGMEUOM01 atomic-openshift-node: W0807 11:29:27.047991 32375 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Aug 7 11:29:27 AWGMEUOM01 atomic-openshift-node: E0807 11:29:27.048134 32375 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Version-Release number of the following components:
The customer tried with:

$ rpm -qa | grep -i -e ansible -e atomic
openshift-ansible-roles-3.10.21-1.git.0.6446011.el7.noarch
ansible-2.6.2-1.el7.noarch
openshift-ansible-playbooks-3.10.21-1.git.0.6446011.el7.noarch
openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch
openshift-ansible-docs-3.10.21-1.git.0.6446011.el7.noarch

I advised the customer to use the Ansible 2.4 RPMs instead; they downgraded the packages and tried again, but hit the same issue.

Actual results:
FAILED - RETRYING: Wait for control plane pods to appear (18 retries left). Result was:
{
. . .
"msg": {
"cmd": "/bin/oc get pod master-etcd-awgmeuom02 -o json -n kube-system",
"results": [ {} ],
"returncode": 1,
"stderr": "Unable to connect to the server: EOF\n",
"stdout": ""
},

I'll upload the full Ansible logs to the BZ.

NOTES: This issue seems similar to https://github.com/openshift/openshift-ansible/issues/7967 and https://bugzilla.redhat.com/show_bug.cgi?id=1592010. However, I'm not certain it's the same issue, because it is a different version and I don't see all of the same messages, like "Unable to connect to the server".
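Since the control plane API is unreachable here ("Unable to connect to the server: EOF"), triage has to happen directly on a master. Below is a minimal sketch using the log-gathering commands requested earlier in this thread; the docker name filter and the openshift-sdn namespace are assumptions about a standard 3.10/3.11 layout, not details confirmed in this report:

# Complete journal plus the static pod logs (commands from this thread):
journalctl --no-pager > node.log
/usr/local/bin/master-logs etcd etcd &> etcd.log
/usr/local/bin/master-logs api api &> api.log
/usr/local/bin/master-logs controllers controllers &> controllers.log

# Static pod containers can be inspected even while the API is down;
# kubelet-managed containers are conventionally prefixed with "k8s_":
docker ps --filter name=k8s_ --format '{{.Names}}\t{{.Status}}'

# Once the API responds, check whether the SDN daemonset has written the
# CNI config (namespace name assumed):
ls /etc/cni/net.d
oc get daemonset -n openshift-sdn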