Description of problem:

Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container

Version-Release number of the following components:
4.5.7
vSphere 6.7U3

How reproducible:
Unsure

Steps to Reproduce:
1. Follow the disconnected installation instructions on vSphere [1]
2. The bootstrap node fails with the error message in the results below

[1] https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-restricted-networks-vsphere.html#installation-initializing-manual_installing-restricted-networks-vsphere

Actual results:

[core@bootstrap ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Fri 2020-09-04 19:24:26 UTC. --
[...]
Sep 04 20:29:10 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Skipped "secret-kube-apiserver-to-kubelet-signer.yaml" secrets.v1./kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator as it already exists
Sep 04 20:29:11 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Skipped "secret-loadbalancer-serving-signer.yaml" secrets.v1./loadbalancer-serving-signer -n openshift-kube-apiserver-operator as it already exists
Sep 04 20:29:11 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Skipped "secret-localhost-serving-signer.yaml" secrets.v1./localhost-serving-signer -n openshift-kube-apiserver-operator as it already exists
Sep 04 20:29:11 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Skipped "secret-service-network-serving-signer.yaml" secrets.v1./service-network-serving-signer -n openshift-kube-apiserver-operator as it already exists
Sep 04 20:29:21 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: E0904 20:29:21.735288 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
Sep 04 20:29:21 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: E0904 20:29:21.759732 1 reflector.go:251] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to watch *v1.Pod: Get https://localhost:6443/api/v1/pods?watch=true: dial tcp [::1]:6443: connect: connection refused
Sep 04 20:29:22 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: E0904 20:29:22.761579 1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
Sep 04 20:29:23 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: E0904 20:29:23.763460 1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: error while checking pod status: timed out waiting for the condition
Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Tearing down temporary bootstrap control plane...
Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: error while checking pod status: timed out waiting for the condition
Sep 04 20:48:38 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container
Sep 04 20:48:38 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Sep 04 20:48:38 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Sep 04 20:48:43 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Sep 04 20:48:43 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 4.
Sep 04 20:48:43 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Sep 04 20:48:43 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.
Sep 04 20:49:00 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Starting etcd certificate signer...
Sep 04 20:49:01 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: 6a59a3e4c6a4a93d53756df48b49fea9e64149c059f105101d3b6262aabd9ac2
Sep 04 20:49:02 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: https://localhost:2379 is healthy: successfully committed proposal: took = 15.98495ms
Sep 04 20:49:02 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: etcd cluster up. Killing etcd certificate signer...
Sep 04 20:49:02 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: 6a59a3e4c6a4a93d53756df48b49fea9e64149c059f105101d3b6262aabd9ac2
Sep 04 20:49:02 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Starting cluster-bootstrap...
Sep 04 20:49:03 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Starting temporary bootstrap control plane...
Sep 04 20:49:03 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Skipped "0000_00_cluster-version-operator_00_namespace.yaml" namespaces.v1./openshift-cluster-version -n as it already exists
Sep 04 20:49:03 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Skipped "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml" customresourcedefinitions.v1beta1.apiextensions.k8s.io/clusteroperators.config.openshift.io -n as it already exists
[...]
[root@bootstrap ~]# crictl pods
POD ID              CREATED          STATE      NAME                                                                       NAMESPACE                             ATTEMPT
96913e0cd7de9       11 minutes ago   Ready      bootstrap-kube-apiserver-bootstrap.discocp4.lab.msp.redhat.com             kube-system                           1
71ce0c944231d       11 minutes ago   Ready      bootstrap-kube-scheduler-bootstrap.discocp4.lab.msp.redhat.com             kube-system                           1
2518d5fbaf566       11 minutes ago   Ready      bootstrap-kube-controller-manager-bootstrap.discocp4.lab.msp.redhat.com    kube-system                           1
560e173e96a84       11 minutes ago   Ready      bootstrap-cluster-version-operator-bootstrap.discocp4.lab.msp.redhat.com   openshift-cluster-version             1
1f59d9b34619f       11 minutes ago   Ready      cloud-credential-operator-bootstrap.discocp4.lab.msp.redhat.com            openshift-cloud-credential-operator   1
bf512c1a3af90       32 minutes ago   NotReady   bootstrap-kube-scheduler-bootstrap.discocp4.lab.msp.redhat.com             kube-system                           0
6887e35e74929       32 minutes ago   NotReady   bootstrap-kube-controller-manager-bootstrap.discocp4.lab.msp.redhat.com    kube-system                           0
dcc9121a53df1       32 minutes ago   NotReady   bootstrap-kube-apiserver-bootstrap.discocp4.lab.msp.redhat.com             kube-system                           0
c278592bc9851       32 minutes ago   NotReady   bootstrap-cluster-version-operator-bootstrap.discocp4.lab.msp.redhat.com   openshift-cluster-version             0
705aca2b677d9       32 minutes ago   Ready      bootstrap-machine-config-operator-bootstrap.discocp4.lab.msp.redhat.com    default                               0
f849fb2f7034b       33 minutes ago   Ready      etcd-bootstrap-member-bootstrap.discocp4.lab.msp.redhat.com                openshift-etcd                        0

[root@bootstrap ~]# crictl images
IMAGE                                                   TAG      IMAGE ID        SIZE
quay.io/openshift-release-dev/ocp-release@sha256        <none>   e7c443017e821   306MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   790b38ec6f81b   307MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   f67097361498f   283MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   3fcd563edad3b   255MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   b0d508e56910d   305MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   7e44a17a2951a   282MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   c5072ae56904b   308MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   0c893df5a716e   308MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   793d4a1e7161c   305MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   eaff45a171adb   307MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   d1bb18c7027ae   432MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   5afa4eae3d651   311MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   d1eec47fd97e5   326MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   32b54e50bc4bc   288MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>   d8375a61d36e3   674MB

Expected results:

Installation should run through and/or give us some details on why it's failing.

Additional info:

Install config is as follows:

[root@tatooine ocp45]# cat install-config.yaml
apiVersion: v1
baseDomain: lab.msp.redhat.com
compute:
- hyperthreading: Disabled
  name: worker
  replicas: 2
controlPlane:
  hyperthreading: Disabled
  name: master
  replicas: 3
metadata:
  name: discocp4
platform:
  vsphere:
    vcenter: vcenter01.lab.msp.redhat.com
    username: ocp4
    password: OpenShift2020!
    datacenter: msp-lab
    defaultDatastore: storage03-iscsi-lun0
networking:
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
pullSecret: '{"auths": ...}'
sshKey: 'ssh-ed25519 AAAA...'
imageContentSources:
- mirrors:
  - registry.lab.msp.redhat.com:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - registry.lab.msp.redhat.com:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
Please attach the log bundle generated from `openshift-install gather bootstrap`; see `--help` if you're not familiar with the command. When the installer failed, it should have attempted to gather the bundle or emitted instructions for doing so. That log bundle should be attached to any bug involving a bootstrap failure.
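For reference, a gather invocation typically looks like the sketch below. The `--bootstrap` and `--master` flags are the ones the installer documents for reaching the nodes over SSH; the IP addresses are placeholders, not values from this report, and the snippet only prints the command rather than running it.

```shell
# Placeholder addresses -- substitute the bootstrap and master IPs from
# your own environment.
BOOTSTRAP_IP="198.51.100.10"
MASTER_IPS="198.51.100.11 198.51.100.12 198.51.100.13"

# Build the gather invocation: one --master flag per control-plane node.
cmd="openshift-install gather bootstrap --bootstrap ${BOOTSTRAP_IP}"
for m in ${MASTER_IPS}; do
  cmd="${cmd} --master ${m}"
done

# Print the command to review before running it for real.
echo "${cmd}"
```

Running the real command produces a `log-bundle-<timestamp>.tar.gz` in the working directory, which is what should be attached here.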
Created attachment 1713847 [details]
bootstrap logs

Log bundle attached.
Created attachment 1714154 [details]
log bundle 2

I was able to reproduce the issue again this morning; log bundle 2 from the bootstrap node is attached.
The etcd team creates the etcd signer, so I think they can help best here.
This is fixed in 4.6 by removing the signer container entirely:

https://github.com/openshift/cluster-etcd-operator/pull/412
https://github.com/openshift/installer/pull/3995
https://github.com/openshift/cluster-etcd-operator/pull/416

We can backport this to 4.5. I opened https://bugzilla.redhat.com/show_bug.cgi?id=1877374 to track that work.
> Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container

We are looking to backport removal of etcd-signer from 4.5, but the logging above, in my opinion, is not the reason your cluster is not bootstrapping. We set a trap to remove the container on error; it was tripped by:

> Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: error while checking pod status: timed out waiting for the condition

So it is expected that we attempt to remove the container, but in this case the container was already scaled down.

Based on the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1876091#c3, machine-config-server does not show that any of the masters have pulled ignition. This has nothing to do with etcd.

```
bootstrap/containers/machine-config-server-ab360f7f7def867b4818f6792787b53954d73b8880c17e5efea011c062ae4732.log
I0908 15:14:00.969656       1 bootstrap.go:37] Version: v4.5.0-202008130542.p0-dirty (f6ec58e7b69f4fc1eb2297c2734b0470a581f378)
I0908 15:14:00.969890       1 api.go:56] Launching server on :22624
I0908 15:14:00.969963       1 api.go:56] Launching server on :22623
```

Why are your master nodes not making requests to pull ignition? Can you review the terminal logs for those instances to see why?
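The error-trap behavior described above can be illustrated with a minimal sketch. This is not the actual bootkube.sh: the container removal is replaced with `echo` lines reproducing the messages from this report, to show how a cleanup trap fires after the real failure and then complains about an already-removed container.

```shell
# Hypothetical sketch of the bootkube.sh pattern: a trap set for cleanup
# runs on failure and tries to evict the signer container. If the signer
# was already torn down, the eviction emits "no such container" -- a
# symptom of the earlier timeout, not the root cause.
run_bootkube_step() (
  set -e
  cleanup() {
    echo "Tearing down temporary bootstrap control plane..."
    # In the real script this is a container removal; when etcd-signer is
    # already gone, it fails with the message below.
    echo "Failed to evict container etcd-signer: no such container"
  }
  trap cleanup EXIT
  # Simulate the real failure that tripped the trap:
  echo "error while checking pod status: timed out waiting for the condition" >&2
  exit 1
)

status=0
out=$(run_bootkube_step 2>/dev/null) || status=$?
echo "${out}"
echo "exit status: ${status}"
```

The step exits non-zero because of the timeout, and the trap output appears regardless, which is why the "Failed to evict" line shows up last in the journal even though it is not the underlying problem.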
(In reply to Sam Batschelet from comment #8)
> We are looking to backport removal of etcd-signer from 4.5, but the logging
> above, in my opinion, is not the reason your cluster is not bootstrapping.
> We set a trap to remove the container on error; it was tripped by:
>
> > Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: error while checking pod status: timed out waiting for the condition
>
> So it is expected that we attempt to remove the container, but in this case
> the container was already scaled down.
>
> Why are your master nodes not making requests to pull ignition? Can you
> review the terminal logs for those instances to see why?

OK, in earlier versions of the OCP installer (4.4 or below), the bootstrap node would take a while to start up and would have more pods up and running before the installation of the masters kicked off. This behaviour seems to have changed. I'm not sure whether it's because we're doing a disconnected install here, but with fewer pods up and running on the bootstrap node, the master nodes ignite more quickly than I've seen in older releases. With both the bootstrap and masters started up simultaneously, the installer runs through on the bootstrap node and completes (~15 minutes), and I'm now waiting on the masters.
The workflow has changed because of the etcd-operator added in 4.4. In the past, the etcd static pod manifests were embedded in the ignition: bootkube would wait for the masters to pull ignition and bootstrap the etcd cluster, and then we would pivot from the temporary control plane to the master control plane. Now we don't need to wait for etcd to bootstrap. We start a single etcd instance on the bootstrap node, so the temporary control plane can get started faster; we then deploy the operator and scale up etcd across the master nodes. So if you don't have masters pulling ignition, that tells me something is wrong.

> With both the bootstrap and masters started up simultaneously, the installer runs through on the bootstrap node and completes (~15 minutes), and I'm now waiting on the masters.

Yeah, I don't know why the masters are not pulling ignition. If they are running, I would check the console logs for hints as to why they can't connect to the machine-config-server on the bootstrap node, which is hosting the ignition files.
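One way to sanity-check that connectivity from a working host is to request the master ignition config from the machine-config-server directly. The sketch below only prints the `curl` command; the URL is a placeholder for this cluster (on a UPI install the masters typically reach the MCS through whatever load balancer fronts port 22623), and the Ignition spec version in the Accept header is an assumption for a 4.5-era RHCOS node.

```shell
# Placeholder endpoint -- substitute the address your masters actually
# use to reach the bootstrap node's machine-config-server on :22623.
MCS_URL="https://api-int.discocp4.lab.msp.redhat.com:22623/config/master"

# -k skips TLS verification, since the MCS certificate is signed by the
# cluster's internal root CA, which a debugging host won't trust.
# The Accept header value (spec 2.2.0) is an assumed 4.5-era version.
check_cmd="curl -k -H 'Accept: application/vnd.coreos.ignition+json; version=2.2.0' ${MCS_URL}"

# Print the check to run from a host on the masters' network.
echo "${check_cmd}"
```

If that request hangs or is refused from the masters' network, the problem is load-balancer or firewall reachability to port 22623 rather than anything in etcd.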
Installed 4.6 disconnected UPI on vSphere and have not hit this error.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196