Created attachment 1664283 [details]
install logs

Description of problem:

As part of CRC we need to create a single-node OpenShift cluster, for which we use https://github.com/code-ready/snc/blob/master/snc.sh. Testing it with a 4.4 nightly fails every single time: the bootstrap fails while waiting for the Kubernetes API.

Version-Release number of the following components:

```
$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift-release-dev/ocp-release-nightly@sha256:3dd2b6f2ad288f220005c48e92c287bfde8eaa74735afa0a5cf469fc20eac86e
$ openshift-install version
openshift-install unreleased-master-2546-g43bed121efd7d9b3353e7ef5bd85dae07e0cc97e
built from commit 43bed121efd7d9b3353e7ef5bd85dae07e0cc97e
```

How reproducible:

Clone the snc repo (https://github.com/code-ready/snc), then:

```
$ cd snc
$ export MIRROR=https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview
$ export OPENSHIFT_VERSION=4.4.0-0.nightly-2020-02-18-104959
$ export OPENSHIFT_PULL_SECRET='<Get from cloud.openshift.com>'
$ ./snc.sh
```

Actual results:

```
DEBUG Still waiting for the Kubernetes API: Get https://api.crc.testing:6443/version?timeout=32s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.crc.testing:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 192.168.126.11:6443: i/o timeout
DEBUG Fetching Install Config...
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
DEBUG Using Install Config loaded from state file
DEBUG Reusing previously-fetched Install Config
INFO Pulling debug logs from the bootstrap machine
ERROR Attempted to gather debug logs after installation failure: failed to create SSH client: failed to initialize the SSH agent: failed to parse SSH private key from "/home/prkumar/.ssh/authorized_keys": ssh: no key found
FATAL Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded
failed to create the cluster, but that is expected. We will block on a successful cluster via a future wait-for.
```

Expected results:

The cluster should be provisioned successfully.

Additional info:

Install logs and bootstrap logs attached.
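For anyone reproducing this, the installer's wait on the API can be mimicked by hand; the sketch below is a generic retry loop (not the installer's actual code) that polls a command until it succeeds or the attempts run out, with the endpoint from the logs above shown as the intended use:

```shell
# Sketch: poll a command until it succeeds or the attempts run out,
# mimicking the installer's "waiting for Kubernetes API" loop.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    return 1
}

# Against this reproducer it would be used like:
#   wait_for 60 30 curl -ks https://api.crc.testing:6443/version
```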
Created attachment 1664284 [details]
bootstrap-logs

Bootstrap and control plane logs.
Waiting on CEO, moving to etcd.
AFAIK this is the PR that broke it: https://github.com/openshift/cluster-etcd-operator/pull/157/files#diff-16c82eb805d9624f37fc2f0121ddc6eaR46
We are working on a solution for 4.4 that will, with luck, ship; it is currently being tested.
The current PR is https://github.com/openshift/cluster-etcd-operator/pull/266, which we (the CRC team) are actively testing. It looks like it is able to create the single-node cluster on libvirt, but the certificate now generated by the etcd operator depends on the cluster-internal IP — in the libvirt case, the IP configured by the libvirt provider — instead of using SRV records (which was the case up to 4.3.x). For CRC we create a bundle and then run the generated bundle on different platforms, on which we cannot force a static IP, so running the bundle on those platforms will break and we are still going to be blocked :(
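To illustrate why an IP-pinned serving certificate is a problem for a relocated bundle, one way to check which addresses a certificate is valid for is to list its Subject Alternative Name entries. The sketch below generates a throwaway self-signed cert with an IP SAN (the real etcd cert path on a node is not shown here and would differ):

```shell
# Sketch: inspect which IPs a serving certificate is pinned to via its SANs.
# We generate a demo cert; the IP matches the libvirt-assigned one in this bug.
CERT=${CERT:-/tmp/etcd-demo.crt}
KEY=${KEY:-/tmp/etcd-demo.key}

# Requires OpenSSL 1.1.1+ for -addext/-ext.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "$KEY" -out "$CERT" -days 1 \
  -subj "/CN=etcd-demo" \
  -addext "subjectAltName=IP:192.168.130.11" 2>/dev/null

# If only a single node IP appears here, the cert will not validate
# when the VM boots with a different address.
openssl x509 -in "$CERT" -noout -ext subjectAltName
```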
Just an update: we thought that if we could create a virtual network interface [0] and have it picked up by OpenShift, we could deal with this etcd cert issue. But per our experiments OpenShift does not take that network's IP; it uses whatever the actual interface has, so this is not going to work out for us at the moment :(

```
[core@crc-p5vnv-master-0 ~]$ ifconfig ens3
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.130.11  netmask 255.255.255.0  broadcast 192.168.130.255
        inet6 fe80::53fd:8725:ea4a:8093  prefixlen 64  scopeid 0x20<link>
        ether 52:fd:fc:07:21:82  txqueuelen 1000  (Ethernet)
        RX packets 25456  bytes 9639852 (9.1 MiB)
        RX errors 0  dropped 10  overruns 0  frame 0
        TX packets 34018  bytes 38511681 (36.7 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[core@crc-p5vnv-master-0 ~]$ ifconfig ens3:0
ens3:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.126.11  netmask 255.255.255.0  broadcast 192.168.126.255
        ether 52:fd:fc:07:21:82  txqueuelen 1000  (Ethernet)

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.130.1   0.0.0.0         UG    100    0        0 ens3
10.88.0.0       0.0.0.0         255.255.0.0     U     0      0        0 cni-podman0
10.128.0.0      0.0.0.0         255.252.0.0     U     0      0        0 tun0
172.30.0.0      0.0.0.0         255.255.0.0     U     0      0        0 tun0
192.168.126.0   0.0.0.0         255.255.255.0   U     0      0        0 ens3
192.168.126.0   0.0.0.0         255.255.255.0   U     100    0        0 ens3
192.168.130.0   0.0.0.0         255.255.255.0   U     100    0        0 ens3

$ oc get nodes -o wide
NAME                 STATUS   ROLES           AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                      KERNEL-VERSION                CONTAINER-RUNTIME
crc-p5vnv-master-0   Ready    master,worker   7h17m   v1.17.1   192.168.130.11   <none>        Red Hat Enterprise Linux CoreOS 45.81.202003231628-0 (Ootpa)  4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8

$ oc logs etcd-crc-p5vnv-master-0 -n openshift-etcd -c etcd
[...]
2020-03-24 15:36:50.006277 I | embed: rejected connection from "10.128.0.79:33292" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:51.219809 I | embed: rejected connection from "192.168.130.11:33952" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:51.717504 I | embed: rejected connection from "10.128.0.79:33326" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:51.805955 I | embed: rejected connection from "192.168.130.11:33962" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:52.581834 I | embed: rejected connection from "192.168.130.11:33968" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:53.981921 I | embed: rejected connection from "192.168.130.11:33982" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:54.804054 I | embed: rejected connection from "10.128.0.79:33360" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:55.943713 I | embed: rejected connection from "192.168.130.11:34006" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:56.544935 I | embed: rejected connection from "192.168.130.11:34012" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:57.032870 I | embed: rejected connection from "192.168.130.11:34026" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:57.767818 I | embed: rejected connection from "192.168.130.11:34032" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:58.176810 I | embed: rejected connection from "192.168.130.11:34040" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:58.643069 I | embed: rejected connection from "192.168.130.11:34044" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:58.867680 I | embed: rejected connection from "192.168.130.11:34048" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:58.874256 I | embed: rejected connection from "10.128.0.79:33418" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:37:02.810137 I | embed: rejected connection from "192.168.130.11:34108" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:37:02.844207 I | embed: rejected connection from "192.168.130.11:34110" (error "remote error: tls: bad certificate", ServerName "")
```

[0] https://linuxconfig.org/configuring-virtual-network-interfaces-in-linux
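For reference, an interface alias like the `ens3:0` shown above can be created with `ip` rather than legacy `ifconfig`; this is a sketch of host network configuration (interface name and addresses are the ones from this bug, and it requires root on the node), and as described above the kubelet still registers with the primary address, so this alone does not solve the cert issue:

```shell
# Sketch: add a second IPv4 address (alias ens3:0) to an existing NIC.
# Requires root; adjust the interface name and addresses for your host.
sudo ip addr add 192.168.126.11/24 broadcast 192.168.126.255 \
    dev ens3 label ens3:0

# Verify both addresses are now present on the interface.
ip -4 addr show dev ens3
```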
Another update: we are able to test #266 (which was closed without being merged) by using the dummy network and changing the kubelet systemd unit file (adding `--node-ip` [thanks, Alay]). But now #279 is the one that should resolve this, and I am afraid of how we can automate our bundle process, since it needs manual intervention to patch the `etcd` resource as soon as the bootstrap makes the API available :( It would be great if there were a way to make this change part of the manifests, so that it could be added before starting cluster creation.
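For the record, the kubelet change mentioned above can be done with a systemd drop-in instead of editing the unit file in place. The sketch below is illustrative only: the drop-in name, the environment-variable wiring, and the IP are assumptions (how the flag actually reaches ExecStart depends on the kubelet unit shipped on the node), and the default output directory is a scratch path so the sketch is safe to run off-node:

```shell
# Sketch: generate a systemd drop-in that passes --node-ip to the kubelet.
# On a real node set DROPIN_DIR=/etc/systemd/system/kubelet.service.d;
# the default below writes to a temp directory for safe experimentation.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
NODE_IP="${NODE_IP:-192.168.126.11}"   # the libvirt-assigned IP in this bug

mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/20-node-ip.conf" <<EOF
[Service]
# How this reaches the kubelet command line depends on the unit's ExecStart.
Environment=KUBELET_NODE_IP=--node-ip=${NODE_IP}
EOF

echo "wrote $DROPIN_DIR/20-node-ip.conf"
```

On the node this would be followed by `systemctl daemon-reload` and a kubelet restart.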
I verified this bug, and it needs manual intervention [0] to patch the `etcd` resource as soon as the bootstrap makes the API available. After applying

```
$ oc patch etcd cluster -p='{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}' --type=merge
```

I can see the bootstrap succeed without any issue.

[0] https://github.com/openshift/cluster-etcd-operator/pull/279#issue-393886988
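One way to automate that manual step in a bundle-creation script is to retry the patch until the bootstrap API starts answering; this is a sketch (the function name and retry defaults are ours, not part of the fix), assuming `oc` is on PATH with a kubeconfig pointing at the new cluster:

```shell
# Sketch: retry the etcd patch until the bootstrap API starts answering.
patch_etcd_when_ready() {
    retries=${1:-60}
    delay=${2:-10}
    patch='{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
    while [ "$retries" -gt 0 ]; do
        # Succeeds only once the API is reachable and the etcd resource exists.
        if oc patch etcd cluster -p="$patch" --type=merge 2>/dev/null; then
            return 0
        fi
        retries=$((retries - 1))
        sleep "$delay"
    done
    echo "etcd resource never became patchable" >&2
    return 1
}
```

A script would call `patch_etcd_when_ready` right after launching `openshift-install` in the background, before waiting for bootstrap completion.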
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409