Bug 1876091
Summary: | Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container
---|---
Product: | OpenShift Container Platform
Reporter: | Sam Yangsao <syangsao>
Component: | Etcd
Assignee: | Dan Mace <dmace>
Status: | CLOSED ERRATA
QA Contact: | ge liu <geliu>
Severity: | medium
Priority: | unspecified
Version: | 4.5
CC: | adahiya, dmace, sbatsche, skolicha
Target Milestone: | ---
Target Release: | 4.6.0
Hardware: | x86_64
OS: | Linux
Doc Type: | No Doc Update
Story Points: | ---
Last Closed: | 2020-10-27 16:37:57 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | ---
Cloudforms Team: | ---
Bug Blocks: | 1877374
Description
Sam Yangsao
2020-09-05 13:46:59 UTC
Please attach the log bundle generated by `openshift-install gather bootstrap` (see `--help` if you're not familiar with the command). When the installer failed, it should have attempted to gather the bundle or emitted instructions to do so. That log bundle should be attached to any bug involving a bootstrap failure.

Created attachment 1713847
bootstrap logs
Log bundle attached.

Created attachment 1714154
log bundle 2
I was able to reproduce the issue again this morning; log bundle 2, taken from the bootstrap node, is attached.
The etcd team creates the etcd signer, so I think they can help best here.

This is fixed in 4.6 by removing the signer container entirely:
https://github.com/openshift/cluster-etcd-operator/pull/412
https://github.com/openshift/installer/pull/3995
https://github.com/openshift/cluster-etcd-operator/pull/416
We can backport this to 4.5. I opened https://bugzilla.redhat.com/show_bug.cgi?id=1877374 to track that work.

> Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container

We are looking to backport the removal of etcd-signer to 4.5, but in my opinion the error above is not the reason your cluster is not bootstrapping. We set a trap to remove the container on error, and it was tripped by:

> Sep 04 20:48:37 bootstrap.discocp4.lab.msp.redhat.com bootkube.sh[26740]: Error: error while checking pod status: timed out waiting for the condition

So it is expected that we attempt to remove the container, but in this case the container was already scaled down.

Based on the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1876091#c3, machine-config-server does not show that any of the masters have pulled ignition. This has nothing to do with etcd.

```
bootstrap/containers/machine-config-server-ab360f7f7def867b4818f6792787b53954d73b8880c17e5efea011c062ae4732.log
I0908 15:14:00.969656 1 bootstrap.go:37] Version: v4.5.0-202008130542.p0-dirty (f6ec58e7b69f4fc1eb2297c2734b0470a581f378)
I0908 15:14:00.969890 1 api.go:56] Launching server on :22624
I0908 15:14:00.969963 1 api.go:56] Launching server on :22623
```

Why are your master nodes not making requests to pull ignition? Can you review the terminal logs for these instances to see why?
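The trap behavior described in the comment above can be sketched as follows. This is a minimal Python illustration, not the actual bootkube.sh logic: `FakeRuntime`, `remove_container`, and `run_with_cleanup` are hypothetical names, and the container runtime is simulated in memory. It shows why a cleanup handler that fires after an unrelated failure can report "no such container" when the container is already gone, and how treating that case as benign keeps the real error (the timeout) visible.

```python
class NoSuchContainer(Exception):
    """Raised by the simulated runtime when a container is not found."""


class FakeRuntime:
    """In-memory stand-in for a container runtime (hypothetical)."""

    def __init__(self, containers):
        self.containers = set(containers)

    def remove_container(self, name):
        if name not in self.containers:
            raise NoSuchContainer(f"no container with name or ID {name} found")
        self.containers.remove(name)


def run_with_cleanup(runtime, name, step):
    """Run step(); on failure, remove the container, tolerating its absence.

    Returns a list of messages describing what happened.
    """
    messages = []
    try:
        step()
    except Exception as err:
        # This is the root cause and is always reported first.
        messages.append(f"step failed: {err}")
        try:
            runtime.remove_container(name)
            messages.append(f"removed {name}")
        except NoSuchContainer:
            # The container was already gone: noise, not the root cause.
            messages.append(f"{name} already removed, nothing to clean up")
    return messages


def failing_step():
    """Simulates the pod status check timing out."""
    raise TimeoutError("timed out waiting for the condition")


# The signer container is already gone when the cleanup handler fires.
runtime = FakeRuntime(containers=[])
for line in run_with_cleanup(runtime, "etcd-signer", failing_step):
    print(line)
```

In this sketch the timeout is always the first message, so a missing container during cleanup cannot be mistaken for the cause of the failure.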
(In reply to Sam Batschelet from comment #8)
> Why are your master nodes not making requests to pull ignition? Can you
> review the terminal logs for these instances to see why?

OK, in earlier versions of the OCP installer (4.4 or below), the bootstrap node would take a while to start up and would have more pods running before kicking off the installer for the masters. This behaviour seems to have changed. I'm not sure whether this is because we're doing a `disconnected` install here, but with fewer pods running on the bootstrap node, the master nodes ignite more quickly than I've seen in older releases. With the bootstrap node and the masters started simultaneously, the installer runs through the bootstrap node and completes (~15 minutes), and I'm now waiting on the masters.
The workflow has changed because of the etcd-operator added in 4.4. In the past, the etcd static pod manifests were embedded in the ignition. Bootkube would wait for the masters to pull ignition and bootstrap the etcd cluster. Then we would pivot from the temp control-plane to the master control-plane.
But now we don't need to wait for etcd to bootstrap. We start a single etcd instance on the bootstrap node. So the temp control-plane can get started faster. We then deploy the operator and scale up etcd across the master nodes. So if you don't have masters pulling ignition that tells me something is wrong.
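The difference between the two workflows can be modeled roughly as below. This is only an illustrative sketch based on the description in this comment: the `Step` type, the step names, and the `needs_masters` flags are assumptions made for illustration, not something taken from the installer source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    needs_masters: bool  # assumption: whether the step requires masters to be up


# Pre-4.4 flow as described above: etcd is bootstrapped on the masters first.
OLD_FLOW = [
    Step("masters pull ignition and bootstrap the etcd cluster", True),
    Step("pivot from temp control-plane to master control-plane", True),
]

# Flow with the etcd-operator: etcd starts on the bootstrap node.
NEW_FLOW = [
    Step("start a single etcd instance on the bootstrap node", False),
    Step("start the temp control-plane", False),
    Step("deploy the operator and scale etcd across the masters", True),
]


def steps_without_masters(flow):
    """Return the names of the leading steps that can finish with no masters up."""
    done = []
    for step in flow:
        if step.needs_masters:
            break
        done.append(step.name)
    return done


print("old flow:", steps_without_masters(OLD_FLOW))
print("new flow:", steps_without_masters(NEW_FLOW))
```

Under these assumptions the new flow makes visible progress on the bootstrap node before any master has pulled ignition, which matches the observation that bootstrap now gets further while the masters are still missing.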
> With both the bootstrap and masters started up simultaneously, the installer runs through the bootstrap node, completes (~ 15 minutes) and I'm now waiting on the masters.
Yeah, I don't know why the masters are not pulling ignition. If they are running, I would check the console logs for hints as to why they can't connect to the machine-config-server on the bootstrap node, which is hosting the ignition files.
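One quick way to narrow this down is to check whether the machine-config-server ports are reachable at all from the network the masters sit on. The sketch below is a generic TCP reachability check, assuming Python is available on a host in that network. The ports 22623 and 22624 come from the machine-config-server log earlier in this bug, while the function names and the default host are placeholders.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host, ports=(22623, 22624), timeout=3.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: can_connect(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Placeholder default; pass the bootstrap node's address as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for port, ok in check_ports(host).items():
        print(f"{host}:{port} {'reachable' if ok else 'unreachable'}")
```

A successful connection only shows that the port is open. It does not prove that the masters are requesting the right URL or that ignition is served correctly, so the console logs of the masters are still the primary source of evidence.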
Installed 4.6 disconnected UPI on vSphere; have not hit this error.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196