Bug 2010665 - Bootkube tries to use oc after cluster bootstrap is done and there is no API
Summary: Bootkube tries to use oc after cluster bootstrap is done and there is no API
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: 4.10.0
Assignee: Matthew Staebler
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
: 2011324 (view as bug list)
Depends On:
Blocks: 2011652 2011701
TreeView+ depends on / blocked
 
Reported: 2021-10-05 10:14 UTC by Igal Tsoiref
Modified: 2023-09-15 01:15 UTC (History)
14 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Bug fix for a BZ that already has appropriate doc text.
Clone Of:
: 2011652 2011701 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:17:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 5271 0 None open Bug 2010665: do not modify cvo ignores for bootstrap-in-place 2021-10-06 20:52:54 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:17:38 UTC

Description Igal Tsoiref 2021-10-05 10:14:44 UTC
Single node iBIP flow:
Cluster bootstrap will try to send event after tear down of temporary control plane.
The issue happened 5 /7 I the son live-iso CI.
https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-single-node-live-iso%22%7D%5D%7D&sortField=timestamp&sort=desc

The issue started with 4.9.0-rc.5 (rc.4 works without issues)

See the race here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-assisted-test-infra-master-e2e-metal-single-node-live-iso-periodic/1444452225400180736

from the logs we can see that kube-api got shutdown request and that caused cluster bootstrap to fail on sending event

kube-api:
I1003 00:22:20.076822       1 genericapiserver.go:421] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"
I1003 00:22:20.076840       1 genericapiserver.go:751] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"openshift-kube-apiserver", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'InFlightRequestsDrained' All non long-running request(s) in-flight have drained
I1003 00:22:20.077676       1 dynamic_serving_content.go:144] "Shutting down controller" name="aggregator-proxy-cert::/etc/kubernetes/secrets/apiserver-proxy.crt::/etc/kubernetes/secrets/apiserver-proxy.key"

bootkube:
Oct 03 00:22:20 test-infra-cluster-master-0 bootkube.sh[2601]: Tearing down temporary bootstrap control plane...
Oct 03 00:22:20 test-infra-cluster-master-0 bootkube.sh[2601]: Sending bootstrap-finished event.The connection to the server api-int.test-infra-cluster.redhat.com:6443 was refused - did you specify the right host or port?
Oct 03 00:22:20 test-infra-cluster-master-0 systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 03 00:22:20 test-infra-cluster-master-0 systemd[1]: bootkube.service: Failed with result 'exit-code'.
Oct 03 00:22:25 test-infra-cluster-master-0 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 03 00:22:25 test-infra-cluster-master-0 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.

Comment 1 Marius Cornea 2021-10-06 12:51:40 UTC
The same issue is reproducing for me with the zero touch provisioning flow and 4.9.0-0.nightly-2021-10-05-004711

Comment 2 Marius Cornea 2021-10-06 14:07:56 UTC
*** Bug 2011324 has been marked as a duplicate of this bug. ***

Comment 3 Ryan Phillips 2021-10-06 15:29:48 UTC
There is a new rhcos image being built for 4.9 that fixes pod status reporting... so this may clear up...

https://bugzilla.redhat.com/show_bug.cgi?id=2011050

Comment 4 Eran Cohen 2021-10-06 15:33:05 UTC
Setting as a blocker since this issue fails the single node live-iso CI.

Comment 5 Eran Cohen 2021-10-06 16:09:54 UTC
The actual issue is:
https://github.com/openshift/installer/blob/6617bc2e334654bb6e85976f049e51fc1c01aa3f/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L427
Bootkube tries to use oc after cluster bootstrap is done and there is no API

This explains why all the bootkube retries failed with this error:
Oct 06 15:51:53 test-infra-cluster-master-0 bootkube.sh[243426]: Starting cluster-bootstrap...
Oct 06 15:51:53 test-infra-cluster-master-0 bootkube.sh[243426]: The connection to the server api-int.test-infra-cluster.redhat.com:6443 was refused - did you specify the right host or port? 
It's not really starting cluster-bootstrap, cluster-bootstrap is done (the log is misleading) and the command that fails is the oc patch

Comment 6 Omri Hochman 2021-10-06 17:56:10 UTC
*** Bug 2011306 has been marked as a duplicate of this bug. ***

Comment 11 Alexander Chuzhoy 2021-10-07 17:46:02 UTC
FailedQA

Version: registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446



[core@sno-0-0 ~]$   journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-10-07 16:53:11 UTC. --
Oct 07 17:45:02 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: https://localhost:2379 is healthy: successfully committed proposal: took = 12.689336ms
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: Starting cluster-bootstrap...
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: The connection to the server api-int.sealusa35-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Oct 07 17:45:15 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 07 17:45:15 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 212.
Oct 07 17:45:16 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Oct 07 17:45:16 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: https://localhost:2379 is healthy: successfully committed proposal: took = 12.694707ms
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: Starting cluster-bootstrap...
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: The connection to the server api-int.sealusa35-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Oct 07 17:45:29 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 07 17:45:29 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 213.
Oct 07 17:45:30 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Oct 07 17:45:30 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.

Comment 12 Matthew Staebler 2021-10-07 18:19:49 UTC
(In reply to Alexander Chuzhoy from comment #11)
> FailedQA
> 
> Version:
> registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446

That nightly does not include the fix. This BZ targets 4.10.0. You want registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-10-07-095518 or more recent.

Comment 13 Alexander Chuzhoy 2021-10-07 19:53:11 UTC
(In reply to Matthew Staebler from comment #12)
> (In reply to Alexander Chuzhoy from comment #11)
> > FailedQA
> > 
> > Version:
> > registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446
> 
> That nightly does not include the fix. This BZ targets 4.10.0. You want
> registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-10-07-095518 or
> more recent.

Do we need to clone this one for 4.9? The failure is still relevant in 4.9

Comment 14 Matthew Staebler 2021-10-07 20:52:20 UTC
> Do we need to clone this one for 4.9? The failure is still relevant in 4.9

It has already been cloned. https://bugzilla.redhat.com/show_bug.cgi?id=2011701

Comment 17 Alexander Chuzhoy 2021-12-06 18:56:13 UTC
VerifiedQA
Version: 4.10.0-0.nightly-2021-12-06-123512

The reported issue doesn't reproduce.

The bootstrap phase completed successfully (truncated message):


Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13945]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Log bundle written to log-bundle-bootstrap.tar.gz
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Adding bootstrap control plane and bootstrap installer-gather bundle to master ignition
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13930]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14112]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13959]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13934]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14038]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14033]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13963]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13971]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14106]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14023]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14121]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14094]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14016]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13992]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14071]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:49 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: bootkube.service complete
Dec 06 18:49:49 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d systemd[1]: bootkube.service: Succeeded.

Broadcast message from root@random-hostname-defd6463-4144-48f9-92ef-870ec209b77d (Mon 2021-12-06 18:50:49 UTC):

'Installation completed, server is going to reboot.'
The system is going down for reboot at Mon 2021-12-06 18:51:49 UTC!

q
Broadcast message from root@random-hostname-defd6463-4144-48f9-92ef-870ec209b77d (Mon 2021-12-06 18:51:49 UTC):

'Installation completed, server is going to reboot.'
The system is going down for reboot NOW!

Connection to 192.168.123.142 closed by remote host.
Connection to 192.168.123.142 closed.

Comment 20 errata-xmlrpc 2022-03-10 16:17:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 21 Red Hat Bugzilla 2023-09-15 01:15:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days


Note You need to log in before you can comment on or make changes to this bug.