Bug 1754939
Summary: [upi] [baremetal] Installer doesn't validate dns requirements

Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Version: 4.2.0
Target Release: 4.3.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: Reopened
Reporter: Sam Yangsao <syangsao>
Assignee: Abhinav Dahiya <adahiya>
QA Contact: sheng.lao <shlao>
CC: adahiya, aos-bugs, bbennett, danw, deads, gblomqui, mfojtik, nagrawal, wking
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-01-23 11:06:54 UTC
Bug Blocks: 1755111 (view as bug list)
Description
Sam Yangsao
2019-09-24 12:04:30 UTC
Scott Dodson (comment 3):

Please gather the output of `oc adm must-gather` and attach that to this bug. Likely root cause is either ingress or auth.

Abhinav Dahiya (comment 4):

Did you add the compute nodes? https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#machine-requirements_installing-bare-metal

Sam Yangsao:

(In reply to Abhinav Dahiya from comment #4)
> Did you add the compute nodes?
> https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#machine-requirements_installing-bare-metal

I did not yet. I can PXE boot them up with the current configs. Since we're at this state, should I start over and also ensure the compute nodes are up (along with the bootstrap/control nodes)? Or should I start over while setting this option to `0` instead?

compute:
- hyperthreading: Disabled
  name: worker
  replicas: 3   <--- Change to `0` and restart a fresh install

Sam Yangsao:

(In reply to Scott Dodson from comment #3)
> Please gather the output of `oc adm must-gather` and attach that to this bug.
>
> Likely root cause is either ingress or auth.

It fails to run; I'm assuming because the cluster is not fully up.

# oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9ecf8fce3bc1cf67073b09cbf95006b35bd8715d141ddd9f961dcebf74719b43
[must-gather      ] OUT namespace/openshift-must-gather-tx6rj created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rvbll created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9ecf8fce3bc1cf67073b09cbf95006b35bd8715d141ddd9f961dcebf74719b43 created
[must-gather-m72n4] POD 2019/09/24 16:06:39 Finished successfully with no errors.
[must-gather-m72n4] POD 2019/09/24 16:06:40 Gathering data for ns/openshift-cluster-version...
[must-gather-m72n4] POD 2019/09/24 16:06:40 Collecting resources for namespace "openshift-cluster-version"...
[must-gather-m72n4] POD 2019/09/24 16:06:40 Gathering pod data for namespace "openshift-cluster-version"...
[must-gather-m72n4] POD 2019/09/24 16:06:40 Gathering data for pod "cluster-version-operator-85b98666c8-zgl9x"
[must-gather-m72n4] POD 2019/09/24 16:06:49 Skipping container endpoint collection for pod "cluster-version-operator-85b98666c8-zgl9x" container "cluster-version-operator": No ports
[must-gather-m72n4] POD 2019/09/24 16:07:34 Finished successfully with no errors.
[must-gather-m72n4] POD 2019/09/24 16:07:35 Gathering config.openshift.io resource data...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Gathering kubeapiserver.operator.openshift.io resource data...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Gathering cluster operator resource data...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Gathering related object reference information for ClusterOperator "authentication"...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Found related object "authentications.operator.openshift.io/cluster" for ClusterOperator "authentication"...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Found related object "authentications.config.openshift.io/cluster" for ClusterOperator "authentication"...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Found related object "infrastructures.config.openshift.io/cluster" for ClusterOperator "authentication"...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Found related object "oauths.config.openshift.io/cluster" for ClusterOperator "authentication"...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Found related object "namespaces/openshift-config" for ClusterOperator "authentication"...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Found related object "namespaces/openshift-config-managed" for ClusterOperator "authentication"...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Found related object "namespaces/openshift-authentication" for ClusterOperator "authentication"...
[must-gather-m72n4] POD 2019/09/24 16:07:37 Found related object "namespaces/openshift-authentication-operator" for ClusterOperator "authentication"...
[must-gather-m72n4] OUT gather logs unavailable: unexpected EOF
[must-gather-m72n4] OUT waiting for gather to complete
[must-gather-m72n4] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-rvbll deleted
[must-gather      ] OUT namespace/openshift-must-gather-tx6rj deleted
error: gather never finished for pod must-gather-m72n4: timed out waiting for the condition

Greg Blomquist (comment 8):

Must-gather says CVO reports no ports. Sending to networking.

Reply:

(In reply to Greg Blomquist from comment #8)
> Must gather says CVO reports no ports. Sending to networking.

That error appears to mean "the container does not declare any ports, so I'm not going to test if I can reach those ports". It does not indicate a networking error.

David Eads (comment 10):

Providing some concrete next steps:

1. "FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring" suggests network-edge; you could copy Dan Mace.
2. We need some data to start digging in. `oc get clusteroperators -oyaml` will be helpful in cases where must-gather failed. `oc get clusteroperators` may also help you determine who to engage with.
3. In cases where you can't collect must-gather, we're going to want a separate bug against `oc` so we can always provide *something*.
4. In cases without must-gather, you're likely to want to give dev access to the cluster to poke around and see what's available. An email to the owner or @neelesh will keep creds out of bugzilla.

Reply:

(In reply to David Eads from comment #10)
> 1. FATAL failed to initialize the cluster: Some cluster operators are still
> updating: authentication, console, image-registry, ingress, monitoring
> suggests network-edge, you could copy dan mace

Bouncing to Routing.
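Stepping back to the bug summary (the installer not validating DNS requirements): a minimal pre-flight probe of the required wildcard ingress record could look like the sketch below. This is illustrative Python only, not installer code (the installer itself is written in Go), and `check_wildcard_dns` is a hypothetical helper, not an existing installer function.

```python
import socket
import uuid

def check_wildcard_dns(cluster_name, base_domain):
    """Return True if the wildcard *.apps.<cluster>.<base> record resolves.

    A random label is probed so that only a true wildcard record (not a
    cached or explicitly created host entry) can satisfy the check.
    """
    probe = "%s.apps.%s.%s" % (uuid.uuid4().hex, cluster_name, base_domain)
    try:
        socket.gethostbyname(probe)
        return True
    except socket.gaierror:
        # NXDOMAIN or resolver failure: the wildcard entry is missing
        # or unreachable, which is exactly the failure mode in this bug.
        return False
```

For the cluster in this report (name `lab`, base domain `msp.redhat.com`), the probe would be a random label under `apps.lab.msp.redhat.com`; running this before bootstrap would have flagged the missing wildcard entry up front instead of letting multiple operators degrade later.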
To do anything with this bug, we need either concrete, portable reproducer steps, or at least the information David asked for in https://bugzilla.redhat.com/show_bug.cgi?id=1754939#c10.

This has failed because the wildcard DNS entry isn't present. We should validate that, either by having the installer emit a warning or by making the bootstrap process fail with a clear message indicating the root cause:

message: 'RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.lab.msp.redhat.com on 172.30.0.10:53: no such host'

An installer warning is likely to be ignored, and we'll end up chasing this failure through multiple operators. If the bootstrap process hard-fails, the user will get a clear message and any bug report will be directed to a team with the knowledge to help the user through the failure. Deferring the failure and debugging until the bootstrap has "succeeded" makes distinguishing types of failures harder, increases the time to resolution, and produces output without any real value.

Just an update: the install is still going. Odd that it shows 97% initially, then jumps back to 49%.

[root@tatooine ocp42]# openshift-install --dir=sam/ wait-for install-complete --log-level debug
DEBUG OpenShift Installer v4.2.0
DEBUG Built from commit f96afb99f1ce4f8976ce62f7df44acb24d2062d6
INFO Waiting up to 30m0s for the cluster at https://api.lab.msp.redhat.com:6443 to initialize...
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 97% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 10% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 16% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 26% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 38% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 40% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 49% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 49% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 49% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 49% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 49% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 24% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 41% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 50% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 53% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 55% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 60% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 65% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 69% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 71% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 76% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 85% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 98% complete

[root@etcd-0 ~]# crictl pods
POD ID          CREATED             STATE     NAME                                                             NAMESPACE                                       ATTEMPT
c345ccd93f74c   16 minutes ago      NotReady  revision-pruner-8-etcd-0                                         openshift-kube-apiserver                        0
913189c5f154b   21 minutes ago      Ready     kube-apiserver-etcd-0                                            openshift-kube-apiserver                        0
4519ea8a316ae   22 minutes ago      NotReady  installer-8-etcd-0                                               openshift-kube-apiserver                        0
28ffcaa78440c   About an hour ago   Ready     packageserver-7b58f6fc-ql4sp                                     openshift-operator-lifecycle-manager            0
e14688578ab67   2 hours ago         Ready     node-exporter-svzdg                                              openshift-monitoring                            0
652c44af0cf88   3 hours ago         Ready     oauth-openshift-7d6b8b465f-jvpw8                                 openshift-authentication                        0
a1e3b9c4b23dc   3 hours ago         Ready     console-b6fbc547d-sd4tb                                          openshift-console                               0
278ac32e1c531   3 hours ago         Ready     cluster-autoscaler-operator-87d7696b9-kzdjk                      openshift-machine-api                           0
92e0e826b1a8e   3 hours ago         NotReady  revision-pruner-7-etcd-0                                         openshift-kube-apiserver                        0
72ccfe3835808   3 hours ago         NotReady  installer-7-etcd-0                                               openshift-kube-apiserver                        0
d74486afb9ae4   4 hours ago         NotReady  revision-pruner-6-etcd-0                                         openshift-kube-scheduler                        0
0a1880ba2f76d   4 hours ago         Ready     prometheus-operator-668f98845c-6g99l                             openshift-monitoring                            0
90f52417a1bf2   4 hours ago         NotReady  revision-pruner-7-etcd-0                                         openshift-kube-controller-manager               0
8fbd4cb25bb3b   4 hours ago         Ready     openshift-kube-scheduler-etcd-0                                  openshift-kube-scheduler                        0
11a0457f32c06   4 hours ago         Ready     kube-controller-manager-etcd-0                                   openshift-kube-controller-manager               0
c2dd6f7f154fd   4 hours ago         NotReady  installer-6-etcd-0                                               openshift-kube-scheduler                        0
9e583fd107371   4 hours ago         NotReady  installer-7-etcd-0                                               openshift-kube-controller-manager               0
eff3a6e151fef   5 hours ago         Ready     console-operator-6f984ccd78-j96qk                                openshift-console-operator                      0
75a1e71330e5d   5 hours ago         Ready     cluster-image-registry-operator-fcf6564b8-zgfnc                  openshift-image-registry                        0
0623776fd46a0   5 hours ago         NotReady  revision-pruner-5-etcd-0                                         openshift-kube-apiserver                        0
0000a2c618bf5   5 hours ago         NotReady  revision-pruner-6-etcd-0                                         openshift-kube-controller-manager               0
befcb7df525a2   5 hours ago         NotReady  installer-6-etcd-0                                               openshift-kube-controller-manager               0
445f89f69f504   5 hours ago         NotReady  installer-5-etcd-0                                               openshift-kube-apiserver                        0
783f064d23f39   5 hours ago         Ready     controller-manager-k5smz                                         openshift-controller-manager                    0
2dfa70d011930   5 hours ago         NotReady  revision-pruner-4-etcd-0                                         openshift-kube-apiserver                        0
d4e56a6ab2ffb   5 hours ago         NotReady  revision-pruner-2-etcd-0                                         openshift-kube-apiserver                        0
b0b0e16e8a6e6   5 hours ago         NotReady  installer-4-etcd-0                                               openshift-kube-apiserver                        0
bbf3821e69aeb   21 hours ago        Ready     etcd-quorum-guard-6c5d5869bf-5n9cq                               openshift-machine-config-operator               0
922317e327bcb   21 hours ago        Ready     machine-config-server-v2bdw                                      openshift-machine-config-operator               0
44d2c4afd3dd5   22 hours ago        Ready     tuned-fm9mb                                                      openshift-cluster-node-tuning-operator          0
727ea4b9adf1f   22 hours ago        NotReady  revision-pruner-5-etcd-0                                         openshift-kube-controller-manager               0
16ad938ec3479   22 hours ago        Ready     openshift-service-catalog-apiserver-operator-9766c9d48-qrvcg     openshift-service-catalog-apiserver-operator    0
fe6b1b518c4ca   22 hours ago        Ready     openshift-service-catalog-controller-manager-operator-7b76tvfgm  openshift-service-catalog-controller-manager-operator  0
aeb8a9c230338   22 hours ago        NotReady  installer-5-etcd-0                                               openshift-kube-controller-manager               0
128e5ead83e30   22 hours ago        NotReady  revision-pruner-5-etcd-0                                         openshift-kube-scheduler                        0
55df920fc0a5c   22 hours ago        NotReady  revision-pruner-4-etcd-0                                         openshift-kube-controller-manager               0
0d8c3170f20ab   22 hours ago        NotReady  installer-5-etcd-0                                               openshift-kube-scheduler                        0
258b76cfee50b   22 hours ago        NotReady  revision-pruner-3-etcd-0                                         openshift-kube-scheduler                        0
5ab412aa94bdb   22 hours ago        NotReady  installer-4-etcd-0                                               openshift-kube-controller-manager               0
8b93ea33a2dec   22 hours ago        NotReady  installer-3-etcd-0                                               openshift-kube-scheduler                        0
7864345a4c382   22 hours ago        Ready     multus-admission-controller-h8jkx                                openshift-multus                                0
1037f39df3846   22 hours ago        Ready     apiserver-gk2kz                                                  openshift-apiserver                             0
944682a513e71   22 hours ago        Ready     catalog-operator-65857c7d75-cgw6n                                openshift-operator-lifecycle-manager            0
9c13d00f124db   22 hours ago        NotReady  revision-pruner-3-etcd-0                                         openshift-kube-controller-manager               0
7b78936881253   22 hours ago        Ready     machine-config-daemon-9qwn6                                      openshift-machine-config-operator               0
0e6bf9c6b139e   22 hours ago        Ready     dns-default-kv5zk                                                openshift-dns                                   0
f3313296533af   22 hours ago        NotReady  installer-3-etcd-0                                               openshift-kube-controller-manager               0
967aaa2983ca8   22 hours ago        NotReady  revision-pruner-2-etcd-0                                         openshift-kube-scheduler                        0
b43d1931516b2   22 hours ago        NotReady  revision-pruner-2-etcd-0                                         openshift-kube-controller-manager               0
8f28976c60584   22 hours ago        NotReady  installer-2-etcd-0                                               openshift-kube-apiserver                        0
269224e6b9fee   22 hours ago        NotReady  installer-2-etcd-0                                               openshift-kube-scheduler                        0
df5dbafd4048b   22 hours ago        NotReady  installer-2-etcd-0                                               openshift-kube-controller-manager               0
58d71fd3b607e   22 hours ago        Ready     cloud-credential-operator-58bbb76884-z7llh                       openshift-cloud-credential-operator             0
cd7f7179274b5   22 hours ago        Ready     machine-config-operator-76bfc97464-dfsld                         openshift-machine-config-operator               0
333fd3fced044   22 hours ago        Ready     openshift-apiserver-operator-6b479c8f9f-lqspx                    openshift-apiserver-operator                    0
67c2ce11821a0   22 hours ago        Ready     cluster-version-operator-85b98666c8-8m4rq                        openshift-cluster-version                       0
3a89c83600810   22 hours ago        Ready     sdn-d7ffx                                                        openshift-sdn                                   0
024a862aec6d1   23 hours ago        Ready     sdn-controller-vxgt8                                             openshift-sdn                                   0
6af7554b88fb4   23 hours ago        Ready     ovs-ndbbz                                                        openshift-sdn                                   0
a7af937bea208   23 hours ago        Ready     multus-nbl7h                                                     openshift-multus                                0
b98740725942e   23 hours ago        Ready     network-operator-56668895ff-rpp6g                                openshift-network-operator                      0
d24e348db8189   24 hours ago        Ready     etcd-member-etcd-0                                               openshift-etcd                                  0

[root@tatooine ocp42]# oc get csr
NAME        AGE    REQUESTOR                                                                   CONDITION
csr-2t72z   4h23m  system:node:worker2                                                         Approved,Issued
csr-5nhzl   6h6m   system:node:etcd-1                                                          Approved,Issued
csr-7246v   23h    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-74wcx   4h38m  system:node:worker2                                                         Approved,Issued
csr-7smv8   23h    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8cpt2   23h    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8xjs4   3h41m  system:node:worker1                                                         Approved,Issued
csr-9gm94   23h    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-9qqqb   23h    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bbtxq   4h21m  system:node:etcd-1                                                          Approved,Issued
csr-bckk8   3h38m  system:node:etcd-0                                                          Approved,Issued
csr-blmq7   143m   system:node:etcd-2                                                          Approved,Issued
csr-d7p6p   3h26m  system:node:worker1                                                         Approved,Issued
csr-dscsk   23h    system:node:etcd-2                                                          Approved,Issued
csr-fbbv6   4h52m  system:node:etcd-0                                                          Approved,Issued
csr-fnwn9   84m    system:node:worker2                                                         Approved,Issued
csr-fzdqk   4h59m  system:node:worker1                                                         Approved,Issued
csr-gkzff   23h    system:node:worker1                                                         Approved,Issued
csr-h9wrq   23h    system:node:worker2                                                         Approved,Issued
csr-hq24b   4h51m  system:node:etcd-0                                                          Approved,Issued
csr-hrtdv   4h51m  system:node:etcd-0                                                          Approved,Issued
csr-hv7zq   81m    system:node:etcd-1                                                          Approved,Issued
csr-k87z2   7h27m  system:node:etcd-2                                                          Approved,Issued
csr-ksr74   23h    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mg8dj   3h56m  system:node:worker1                                                         Approved,Issued
csr-mh5m7   79m    system:node:etcd-2                                                          Approved,Issued
csr-mqppd   3h22m  system:node:etcd-0                                                          Approved,Issued
csr-n8hqd   3h11m  system:node:worker1                                                         Approved,Issued
csr-nfpps   4h38m  system:node:worker2                                                         Approved,Issued
csr-nssxq   3h22m  system:node:etcd-0                                                          Approved,Issued
csr-nvxhm   23h    system:node:etcd-1                                                          Approved,Issued
csr-pkxdw   79m    system:node:etcd-2                                                          Approved,Issued
csr-q2pgf   23h    system:node:etcd-0                                                          Approved,Issued
csr-rff48   4h38m  system:node:worker2                                                         Approved,Issued
csr-sc7cc   3h53m  system:node:etcd-0                                                          Approved,Issued
csr-vnpwp   5h22m  system:node:etcd-2                                                          Approved,Issued
csr-wsctc   23h    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-zlgs7   4h36m  system:node:etcd-1                                                          Approved,Issued
csr-zq7rf   6h42m  system:node:worker2                                                         Approved,Issued

The difference between this install and the original install for this BZ: I recreated the ignition configuration as follows:

<snip>
# cat install-config.yaml
apiVersion: v1
baseDomain: msp.redhat.com
compute:
- hyperthreading: Disabled
  name: worker
  replicas: 2   <<< Changed from '3'
controlPlane:
  hyperthreading: Disabled
  name: master
  replicas: 3
metadata:
  name: lab
networking:
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
pullSecret: '{"auths": #######}'
sshKey: 'ssh-ed25519 ######'
</snip>

PXE booted the following VMs for installation:

bootstrap
etcd-0
etcd-1
etcd-2
worker1 * did not exist in the other install
worker2 * did not exist in the other install

Installation has been going for ~24 hours, which I'm assuming is due mainly to the slower disk backend, but at least it's still running.
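The difference between the failing and working installs above is that the configured compute nodes were never booted. A pre-flight comparison of the configured replica count against the workers the cluster actually registers could surface that directly. The sketch below is illustrative Python under stated assumptions, not installer code: `check_compute_replicas` is a hypothetical helper, with the configured count taken from install-config.yaml (`compute[0].replicas`) and the worker list from something like `oc get nodes -l node-role.kubernetes.io/worker`.

```python
def check_compute_replicas(configured_replicas, observed_workers):
    """Return a list of human-readable problems; an empty list means OK.

    configured_replicas: the replicas value from install-config.yaml.
    observed_workers: names of worker nodes that have registered.
    """
    problems = []
    if configured_replicas > 0 and not observed_workers:
        # The failure mode in this bug: replicas: 3 configured, no
        # workers ever PXE booted, install hangs with a vague message.
        problems.append(
            "install-config requests %d compute nodes but none have "
            "registered; the install cannot complete" % configured_replicas
        )
    elif len(observed_workers) < configured_replicas:
        problems.append(
            "only %d of %d requested compute nodes have registered"
            % (len(observed_workers), configured_replicas)
        )
    return problems
```

For the original install in this report (`replicas: 3`, no workers booted), this check would return a single clear problem string instead of leaving the user to diagnose degraded authentication, console, and ingress operators hours later.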
I'll write up another bug, but I was thinking: if the install-config has compute nodes configured and the install does not see them, we should have some error reporting stating that it cannot find the compute nodes it needs to continue the installation.

> odd that it shows 97% initially, then jumps back to 49%

This is explained in bug 1690816.

You'll need a bug in order to backport your change to 4.2.z. I think this one works well and clearly demonstrates the problem you're solving and the value you're bringing. I've reopened it for you.

This PR provides more readable information when jobs fail.

I launched a cluster on baremetal with 4.3.0-0.nightly-2019-10-22-165241, and it works well, aside from openshift-monitoring.

Also linking the follow-up fix, which was in 4.3.0-0.nightly-2019-10-22-165241 and so got VERIFIED as well in comment 23.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062