Bug 1876091

Summary:

Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container

Product:

OpenShift Container Platform

Reporter:

Sam Yangsao <syangsao>

Component:

Etcd

Assignee:

Dan Mace <dmace>

Status:

CLOSED ERRATA

QA Contact:

ge liu <geliu>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

4.5

CC:

adahiya, dmace, sbatsche, skolicha

Target Milestone:

---

Target Release:

4.6.0

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Clones:

1877374 (view as bug list)

Environment:

Last Closed:

2020-10-27 16:37:57 UTC

Type:

Bug

Bug Depends On:

Bug Blocks:

1877374

Attachments:

Description	Flags
bootstrap logs	none
log bundle 2	none

Description Sam Yangsao 2020-09-05 13:46:59 UTC

Description of problem:

Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container

Version-Release number of the following components:

4.5.7
vSphere 6.7U3

How reproducible:

Unsure

Steps to Reproduce:

1. Following disconnected installation instructions on vSphere [1]
2. Bootstrap node fails with the error message in the results below

[1] https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-restricted-networks-vsphere.html#installation-initializing-manual_installing-restricted-networks-vsphere

Actual results:

[core@bootstrap ~]$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Fri 2020-09-04 19:24:26 UTC. --
[...]
Sep 04 20:29:10 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Skipped "secret-kube-apiserver-to-kubelet-signer.yaml" secrets.v1./kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator as it already exists
Sep 04 20:29:11 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Skipped "secret-loadbalancer-serving-signer.yaml" secrets.v1./loadbalancer-serving-signer -n openshift-kube-apiserver-operator as it already exists
Sep 04 20:29:11 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Skipped "secret-localhost-serving-signer.yaml" secrets.v1./localhost-serving-signer -n openshift-kube-apiserver-operator as it already exists
Sep 04 20:29:11 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Skipped "secret-service-network-serving-signer.yaml" secrets.v1./service-network-serving-signer -n openshift-kube-apiserver-operator as it already exists
Sep 04 20:29:21 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: E0904 20:29:21.735288       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
Sep 04 20:29:21 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: E0904 20:29:21.759732       1 reflector.go:251] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to watch *v1.Pod: Get https://localhost:6443/api/v1/pods?watch=true: dial tcp [::1]:6443: connect: connection refused
Sep 04 20:29:22 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: E0904 20:29:22.761579       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
Sep 04 20:29:23 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: E0904 20:29:23.763460       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: error while checking pod status: timed out waiting for the condition
Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Tearing down temporary bootstrap control plane...
Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: error while checking pod status: timed out waiting for the condition
Sep 04 20:48:38 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container
Sep 04 20:48:38 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Sep 04 20:48:38 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Sep 04 20:48:43 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Sep 04 20:48:43 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 4.
Sep 04 20:48:43 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Sep 04 20:48:43 bootstrap.discocp4.lab.msp.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.
Sep 04 20:49:00 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Starting etcd certificate signer...
Sep 04 20:49:01 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: 6a59a3e4c6a4a93d53756df48b49fea9e64149c059f105101d3b6262aabd9ac2
Sep 04 20:49:02 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: https://localhost:2379 is healthy: successfully committed proposal: took = 15.98495ms
Sep 04 20:49:02 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: etcd cluster up. Killing etcd certificate signer...
Sep 04 20:49:02 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: 6a59a3e4c6a4a93d53756df48b49fea9e64149c059f105101d3b6262aabd9ac2
Sep 04 20:49:02 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Starting cluster-bootstrap...
Sep 04 20:49:03 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Starting temporary bootstrap control plane...
Sep 04 20:49:03 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Skipped "0000_00_cluster-version-operator_00_namespace.yaml" namespaces.v1./openshift-cluster-version -n  as it already exists
Sep 04 20:49:03 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[33924]: Skipped "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml" customresourcedefinitions.v1beta1.apiextensions.k8s.io/clusteroperators.config.openshift.io -n  as it already exists
[...]

[root@bootstrap ~]# crictl pods
POD ID              CREATED             STATE               NAME                                                                       NAMESPACE                             ATTEMPT
96913e0cd7de9       11 minutes ago      Ready               bootstrap-kube-apiserver-bootstrap.discocp4.lab.msp.redhat.com             kube-system                           1
71ce0c944231d       11 minutes ago      Ready               bootstrap-kube-scheduler-bootstrap.discocp4.lab.msp.redhat.com             kube-system                           1
2518d5fbaf566       11 minutes ago      Ready               bootstrap-kube-controller-manager-bootstrap.discocp4.lab.msp.redhat.com    kube-system                           1
560e173e96a84       11 minutes ago      Ready               bootstrap-cluster-version-operator-bootstrap.discocp4.lab.msp.redhat.com   openshift-cluster-version             1
1f59d9b34619f       11 minutes ago      Ready               cloud-credential-operator-bootstrap.discocp4.lab.msp.redhat.com            openshift-cloud-credential-operator   1
bf512c1a3af90       32 minutes ago      NotReady            bootstrap-kube-scheduler-bootstrap.discocp4.lab.msp.redhat.com             kube-system                           0
6887e35e74929       32 minutes ago      NotReady            bootstrap-kube-controller-manager-bootstrap.discocp4.lab.msp.redhat.com    kube-system                           0
dcc9121a53df1       32 minutes ago      NotReady            bootstrap-kube-apiserver-bootstrap.discocp4.lab.msp.redhat.com             kube-system                           0
c278592bc9851       32 minutes ago      NotReady            bootstrap-cluster-version-operator-bootstrap.discocp4.lab.msp.redhat.com   openshift-cluster-version             0
705aca2b677d9       32 minutes ago      Ready               bootstrap-machine-config-operator-bootstrap.discocp4.lab.msp.redhat.com    default                               0
f849fb2f7034b       33 minutes ago      Ready               etcd-bootstrap-member-bootstrap.discocp4.lab.msp.redhat.com                openshift-etcd                        0

[root@bootstrap ~]# crictl images
IMAGE                                                   TAG                 IMAGE ID            SIZE
quay.io/openshift-release-dev/ocp-release@sha256        <none>              e7c443017e821       306MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              790b38ec6f81b       307MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              f67097361498f       283MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              3fcd563edad3b       255MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              b0d508e56910d       305MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              7e44a17a2951a       282MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              c5072ae56904b       308MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              0c893df5a716e       308MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              793d4a1e7161c       305MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              eaff45a171adb       307MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              d1bb18c7027ae       432MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              5afa4eae3d651       311MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              d1eec47fd97e5       326MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              32b54e50bc4bc       288MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256   <none>              d8375a61d36e3       674MB

Expected results:

Installation should run through and/or give us some details on why it's failing

Additional info:

Install config is as follows:

[root@tatooine ocp45]# cat install-config.yaml 
apiVersion: v1
baseDomain: lab.msp.redhat.com
compute:
- hyperthreading: Disabled   
  name: worker
  replicas: 2 
controlPlane:
  hyperthreading: Disabled   
  name: master 
  replicas: 3 
metadata:
  name: discocp4
platform:
  vsphere:
    vcenter: vcenter01.lab.msp.redhat.com
    username: ocp4
    password: OpenShift2020!
    datacenter: msp-lab
    defaultDatastore: storage03-iscsi-lun0 
networking:
  clusterNetworks:
  - cidr: 10.128.0.0/14 
    hostPrefix: 23 
  networkType: OpenShiftSDN
  serviceNetwork: 
  - 172.30.0.0/16
platform:
  none: {}
pullSecret: '{"auths": ...}' 
sshKey: 'ssh-ed25519 AAAA...' 
imageContentSources: 
- mirrors:
  - registry.lab.msp.redhat.com:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - registry.lab.msp.redhat.com:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev

Comment 1 Scott Dodson 2020-09-05 16:46:28 UTC

Please attach the log bundle generated by `openshift-install gather bootstrap`; see `--help` if you're not familiar with the command. When the installer failed, it should have attempted to gather the bundle or emitted instructions to do so. That log bundle should be attached to any bug involving a bootstrap failure.
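
For a user-provisioned install such as the vSphere one in this report, the gather step might look like the sketch below. The function name and argument order are hypothetical, and the directory and addresses are placeholders, not values from this bug. `--dir`, `--bootstrap`, and `--master` are the installer flags as I understand them; confirm against `openshift-install gather bootstrap --help` for your version.

```shell
# Sketch only: collect the bootstrap log bundle for a user-provisioned cluster.
# Usage: gather_bootstrap_logs <install-dir> <bootstrap-address> <master-address>...
# All argument values are placeholders supplied by the caller.
gather_bootstrap_logs() (
  install_dir=$1
  bootstrap_addr=$2
  shift 2

  # Turn each remaining argument into a "--master <address>" pair.
  remaining=$#
  while [ "$remaining" -gt 0 ]; do
    set -- "$@" --master "$1"
    shift
    remaining=$((remaining - 1))
  done

  openshift-install gather bootstrap \
    --dir "$install_dir" \
    --bootstrap "$bootstrap_addr" \
    "$@"
)
```

Calling it as `gather_bootstrap_logs ./ocp45 <bootstrap-ip> <master-ip>...` passes one `--master` flag per control-plane address.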

Comment 2 Sam Yangsao 2020-09-06 00:01:22 UTC

Created attachment 1713847 [details]
bootstrap logs

Log bundle attached.

Comment 3 Sam Yangsao 2020-09-08 15:47:55 UTC

Created attachment 1714154 [details]
log bundle 2

I was able to reproduce the issue again this morning, log bundle 2 attached from the bootstrap node.

Comment 4 Abhinav Dahiya 2020-09-08 16:54:34 UTC

The etcd team creates the etcd signer, so I think they can help best here.

Comment 5 Dan Mace 2020-09-09 13:39:53 UTC

This is fixed in 4.6 by removing the signer container entirely:

https://github.com/openshift/cluster-etcd-operator/pull/412
https://github.com/openshift/installer/pull/3995
https://github.com/openshift/cluster-etcd-operator/pull/416

We can backport this to 4.5. I opened https://bugzilla.redhat.com/show_bug.cgi?id=1877374 to track that work.

Comment 8 Sam Batschelet 2020-09-09 13:49:49 UTC

> Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container

We are looking to backport the removal of etcd-signer to 4.5, but the logging above is, in my opinion, not the reason your cluster is not bootstrapping. We set a trap to remove the container on error, and it was tripped by:

> Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: error while checking pod status: timed out waiting for the condition

So it is expected that we attempt to remove the container. But in this case the container was already scaled down.
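
A minimal sketch of that kind of cleanup trap, written so that an already-removed container is not reported as a second error. This is illustrative only and is not the actual bootkube.sh; the function names are hypothetical, and it assumes the container is managed with `podman` (`podman container exists` and `podman rm --force`).

```shell
# Sketch only: hypothetical cleanup logic, not taken from bootkube.sh.
SIGNER_NAME="etcd-signer"

cleanup_signer() {
  # Only attempt removal when the container is still present.
  if podman container exists "$SIGNER_NAME"; then
    podman rm --force "$SIGNER_NAME"
  else
    echo "container $SIGNER_NAME is already gone, nothing to clean up" >&2
  fi
}

on_exit() {
  # Capture the script's exit status before running anything else.
  status=$?
  if [ "$status" -ne 0 ]; then
    cleanup_signer
  fi
}

# Run the cleanup whenever the script exits with a failure status.
trap on_exit EXIT
```

With a guard like this, the real failure (the pod status timeout) stays the last error in the journal instead of being followed by a misleading "no such container" message.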

Based on the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1876091#c3, machine-config-server does not show that any of the masters have pulled ignition. This has nothing to do with etcd.

```
bootstrap/containers/machine-config-server-ab360f7f7def867b4818f6792787b53954d73b8880c17e5efea011c062ae4732.log
  I0908 15:14:00.969656       1 bootstrap.go:37] Version: v4.5.0-202008130542.p0-dirty (f6ec58e7b69f4fc1eb2297c2734b0470a581f378)
  I0908 15:14:00.969890       1 api.go:56] Launching server on :22624
  I0908 15:14:00.969963       1 api.go:56] Launching server on :22623

```
Why are your master nodes not making requests to pull ignition? Can you review the terminal logs for these instances to see why?
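
One way to narrow this down is to probe the machine-config-server from a host on the same network as the masters. The sketch below assumes the usual `api-int.<cluster>.<domain>` name and the `/config/master` path on port 22623 (the port appears in the log above; the host name and path are assumptions about this environment). The function names are hypothetical, and newer releases may also expect an `Accept` header.

```shell
# Sketch only: probe the endpoint that masters are expected to fetch Ignition from.
# The api-int host name and /config/master path are assumptions; adjust for your setup.
mcs_url() {
  # $1 = cluster name, $2 = base domain
  printf 'https://api-int.%s.%s:22623/config/master\n' "$1" "$2"
}

check_mcs() {
  # Prints only the HTTP status code.
  # --insecure is used because the serving certificate is typically not trusted by the client.
  curl --insecure --silent --output /dev/null \
    --max-time 10 --write-out '%{http_code}\n' \
    "$(mcs_url "$1" "$2")"
}
```

For the cluster in this report that would be `check_mcs discocp4 lab.msp.redhat.com`. A timeout or connection error here points at DNS, the load balancer, or a firewall rather than at etcd.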

Comment 9 Sam Yangsao 2020-09-15 17:50:53 UTC

(In reply to Sam Batschelet from comment #8)

OK, in earlier versions of the OCP installer (4.4 or below), the bootstrap node would take a bit to start up and have more pods up and running before kicking off the installer for the masters.

This behaviour seems to have changed. I'm not sure whether that is because we're doing a `disconnected` install here, but with fewer pods up and running on the bootstrap node, the master nodes ignite more quickly than I've seen in older releases.

With both the bootstrap and masters started up simultaneously, the installer runs through the bootstrap node, completes (~ 15 minutes) and I'm now waiting on the masters.

Comment 10 Sam Batschelet 2020-09-15 18:09:42 UTC

The workflow has changed because of the etcd-operator added in 4.4. In the past, the etcd static pod manifests were embedded in the ignition. Bootkube would wait for the masters to pull ignition and bootstrap the etcd cluster. Then we would pivot from the temp control-plane to the master control-plane.

But now we don't need to wait for etcd to bootstrap. We start a single etcd instance on the bootstrap node. So the temp control-plane can get started faster. We then deploy the operator and scale up etcd across the master nodes. So if you don't have masters pulling ignition that tells me something is wrong.


> With both the bootstrap and masters started up simultaneously, the installer runs through the bootstrap node, completes (~ 15 minutes) and I'm now waiting on the masters.

Yeah, I don't know why the masters are not pulling ignition. If they are running, I would check the console logs for hints as to why they can't connect to the machine-config-server on the bootstrap node, which is hosting the ignition files.

Comment 11 ge liu 2020-09-27 07:59:54 UTC

Installed 4.6 disconnected UPI on vSphere and did not hit this error.

Comment 13 errata-xmlrpc 2020-10-27 16:37:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196