Bug 1805034
| Summary: | Bootstrap fails when installing 4.4 nightly on single node | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Praveen Kumar <prkumar> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED ERRATA | QA Contact: | ge liu <geliu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4 | CC: | alpatel, benjamin.dabelow, cfergeau, dbelenky, eparis, fdeutsch, fromani, fsimonce, gercan, kboumedh, mfojtik, mfuruta, mnewby, moddi, ngompa13, oshoval, skolicha, sspeiche, sttts, ykashtan |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Feature: Support single-node cluster installations. Reason: Needed for development-only CRC testing. Result: With the proper environment set and the patch applied, single-node clusters are supported. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| : | 1821748 (view as bug list) | | |
| Last Closed: | 2020-07-13 17:16:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | Bug Depends On: | |
| Bug Blocks: | 1821748 | | |
| Attachments: | bootstrap-logs (attachment 1664284) | | |
Description
Praveen Kumar
2020-02-20 06:35:36 UTC
Created attachment 1664284 [details]
bootstrap-logs
Bootstrap and control plane logs.
Waiting on CEO; moving to etcd. AFAIK this is the PR which broke this: https://github.com/openshift/cluster-etcd-operator/pull/157/files#diff-16c82eb805d9624f37fc2f0121ddc6eaR46 We have a solution we are working on for 4.4; with luck it will ship. It is currently being tested.

The current PR, https://github.com/openshift/cluster-etcd-operator/pull/266, which we (the CRC team) are actively testing, looks able to create a single-node cluster on libvirt. However, the cert now generated by the etcd operator depends on the cluster-internal IP, which in the libvirt case is the IP configured by the libvirt provider, instead of using SRV records (as was the case up to 4.3.x). For CRC we create a bundle and then run the generated bundle on other platforms, where we cannot force a static IP, so running the bundle on those platforms will cause issues and we will still be blocked :(

Just an update: we thought that if we could create a virtual network interface [0] and have it picked up by OpenShift, we could deal with this etcd cert issue, but per our experiment OpenShift does not take that network's IP; it uses whatever the actual interface has, and this is not going to work out for us atm :(

```
[core@crc-p5vnv-master-0 ~]$ ifconfig ens3
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.130.11  netmask 255.255.255.0  broadcast 192.168.130.255
        inet6 fe80::53fd:8725:ea4a:8093  prefixlen 64  scopeid 0x20<link>
        ether 52:fd:fc:07:21:82  txqueuelen 1000  (Ethernet)
        RX packets 25456  bytes 9639852 (9.1 MiB)
        RX errors 0  dropped 10  overruns 0  frame 0
        TX packets 34018  bytes 38511681 (36.7 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[core@crc-p5vnv-master-0 ~]$ ifconfig ens3:0
ens3:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.126.11  netmask 255.255.255.0  broadcast 192.168.126.255
        ether 52:fd:fc:07:21:82  txqueuelen 1000  (Ethernet)

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.130.1   0.0.0.0         UG    100    0        0 ens3
10.88.0.0       0.0.0.0         255.255.0.0     U     0      0        0 cni-podman0
10.128.0.0      0.0.0.0         255.252.0.0     U     0      0        0 tun0
172.30.0.0      0.0.0.0         255.255.0.0     U     0      0        0 tun0
192.168.126.0   0.0.0.0         255.255.255.0   U     0      0        0 ens3
192.168.126.0   0.0.0.0         255.255.255.0   U     100    0        0 ens3
192.168.130.0   0.0.0.0         255.255.255.0   U     100    0        0 ens3

$ oc get nodes -o wide
NAME                 STATUS   ROLES           AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
crc-p5vnv-master-0   Ready    master,worker   7h17m   v1.17.1   192.168.130.11   <none>        Red Hat Enterprise Linux CoreOS 45.81.202003231628-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
```
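For reference, an alias like the ens3:0 shown above can be created along the lines below (a sketch following the approach in [0]; the exact commands used in our experiment are not recorded in this report):

```
# Sketch: attach a second, fixed address to ens3 as alias ens3:0 (per [0]).
# 192.168.126.11/24 matches the alias shown in the ifconfig output above.
sudo ip addr add 192.168.126.11/24 dev ens3 label ens3:0
sudo ip link set dev ens3 up
```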
The etcd member logs then show every connection being rejected because the serving cert does not match:

```
$ oc logs etcd-crc-p5vnv-master-0 -n openshift-etcd -c etcd
[...]
2020-03-24 15:36:50.006277 I | embed: rejected connection from "10.128.0.79:33292" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:51.219809 I | embed: rejected connection from "192.168.130.11:33952" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:51.717504 I | embed: rejected connection from "10.128.0.79:33326" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:51.805955 I | embed: rejected connection from "192.168.130.11:33962" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:52.581834 I | embed: rejected connection from "192.168.130.11:33968" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:53.981921 I | embed: rejected connection from "192.168.130.11:33982" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:54.804054 I | embed: rejected connection from "10.128.0.79:33360" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:55.943713 I | embed: rejected connection from "192.168.130.11:34006" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:56.544935 I | embed: rejected connection from "192.168.130.11:34012" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:57.032870 I | embed: rejected connection from "192.168.130.11:34026" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:57.767818 I | embed: rejected connection from "192.168.130.11:34032" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:58.176810 I | embed: rejected connection from "192.168.130.11:34040" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:58.643069 I | embed: rejected connection from "192.168.130.11:34044" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:58.867680 I | embed: rejected connection from "192.168.130.11:34048" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:36:58.874256 I | embed: rejected connection from "10.128.0.79:33418" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:37:02.810137 I | embed: rejected connection from "192.168.130.11:34108" (error "remote error: tls: bad certificate", ServerName "")
2020-03-24 15:37:02.844207 I | embed: rejected connection from "192.168.130.11:34110" (error "remote error: tls: bad certificate", ServerName "")
```

[0] https://linuxconfig.org/configuring-virtual-network-interfaces-in-linux

Another update: we are able to test #266 (which was closed without being merged) by using the dummy network and changing the kubelet systemd unit file (adding `--node-ip`; thanks, Alay). But #279 is now the one that should resolve this, and I am concerned about how we can automate the bundle process, since it needs manual intervention to patch the `etcd` resource as soon as the bootstrap makes the API available :( It would be great if there were a way to make this change part of the manifests, so that it could be added before starting cluster creation.

I verified this bug, and it needs the manual intervention [0] of patching the `etcd` resource as soon as the bootstrap makes the API available. After applying `oc patch etcd cluster -p='{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}' --type=merge`, I can see the bootstrap succeed without any issue.
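For reference, a minimal sketch of how that manual step could be scripted during bundle creation (the polling loop and its timing are assumptions; the patch itself is the one quoted above):

```
# Sketch: wait until the bootstrap control plane serves the etcd operator CR,
# then apply the single-member override. The retry loop is an assumption;
# the patch is the one used for verification above.
until oc get etcd cluster >/dev/null 2>&1; do
  sleep 5
done
oc patch etcd cluster --type=merge \
  -p='{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
```

If the operator honored it at install time, committing the same spec as a manifest before `openshift-install create cluster` would remove the need for the loop entirely, which is what the update above is asking for.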
[0] https://github.com/openshift/cluster-etcd-operator/pull/279#issue-393886988

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409