Bug 2010665

Summary: Bootkube tries to use oc after cluster bootstrap is done and there is no API
Product: OpenShift Container Platform Reporter: Igal Tsoiref <itsoiref>
Component: InstallerAssignee: Matthew Staebler <mstaeble>
Installer sub component: openshift-installer QA Contact: Alexander Chuzhoy <sasha>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: agurenko, aireilly, aos-bugs, ccrum, ercohen, keyoung, mcornea, mifiedle, mstaeble, nstielau, ohochman, rphillips, sasha, wking
Version: 4.9Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Bug fix for a BZ that already has appropriate doc text.
Story Points: ---
Clone Of:
: 2011652 2011701 (view as bug list) Environment:
Last Closed: 2022-03-10 16:17:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2011652, 2011701    

Description Igal Tsoiref 2021-10-05 10:14:44 UTC
Single node iBIP flow:
Cluster bootstrap will try to send event after tear down of temporary control plane.
The issue happened 5 /7 I the son live-iso CI.
https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-single-node-live-iso%22%7D%5D%7D&sortField=timestamp&sort=desc

The issue started with 4.9.0-rc.5 (rc.4 works without issues)

See the race here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-assisted-test-infra-master-e2e-metal-single-node-live-iso-periodic/1444452225400180736

from the logs we can see that kube-api got shutdown request and that caused cluster bootstrap to fail on sending event

kube-api:
I1003 00:22:20.076822       1 genericapiserver.go:421] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"
I1003 00:22:20.076840       1 genericapiserver.go:751] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"openshift-kube-apiserver", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'InFlightRequestsDrained' All non long-running request(s) in-flight have drained
I1003 00:22:20.077676       1 dynamic_serving_content.go:144] "Shutting down controller" name="aggregator-proxy-cert::/etc/kubernetes/secrets/apiserver-proxy.crt::/etc/kubernetes/secrets/apiserver-proxy.key"

bootkube:
Oct 03 00:22:20 test-infra-cluster-master-0 bootkube.sh[2601]: Tearing down temporary bootstrap control plane...
Oct 03 00:22:20 test-infra-cluster-master-0 bootkube.sh[2601]: Sending bootstrap-finished event.The connection to the server api-int.test-infra-cluster.redhat.com:6443 was refused - did you specify the right host or port?
Oct 03 00:22:20 test-infra-cluster-master-0 systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 03 00:22:20 test-infra-cluster-master-0 systemd[1]: bootkube.service: Failed with result 'exit-code'.
Oct 03 00:22:25 test-infra-cluster-master-0 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 03 00:22:25 test-infra-cluster-master-0 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.

Comment 1 Marius Cornea 2021-10-06 12:51:40 UTC
The same issue is reproducing for me with the zero touch provisioning flow and 4.9.0-0.nightly-2021-10-05-004711

Comment 2 Marius Cornea 2021-10-06 14:07:56 UTC
*** Bug 2011324 has been marked as a duplicate of this bug. ***

Comment 3 Ryan Phillips 2021-10-06 15:29:48 UTC
There is a new rhcos image being built for 4.9 that fixes pod status reporting... so this may clear up...

https://bugzilla.redhat.com/show_bug.cgi?id=2011050

Comment 4 Eran Cohen 2021-10-06 15:33:05 UTC
Setting as a blocker since this issue fails the single node live-iso CI.

Comment 5 Eran Cohen 2021-10-06 16:09:54 UTC
The actual issue is:
https://github.com/openshift/installer/blob/6617bc2e334654bb6e85976f049e51fc1c01aa3f/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L427
Bootkube tries to use oc after cluster bootstrap is done and there is no API

This explains why all the bootkube retries failed with this error:
Oct 06 15:51:53 test-infra-cluster-master-0 bootkube.sh[243426]: Starting cluster-bootstrap...
Oct 06 15:51:53 test-infra-cluster-master-0 bootkube.sh[243426]: The connection to the server api-int.test-infra-cluster.redhat.com:6443 was refused - did you specify the right host or port? 
It's not really starting cluster-bootstrap, cluster-bootstrap is done (the log is misleading) and the command that fails is the oc patch

Comment 6 Omri Hochman 2021-10-06 17:56:10 UTC
*** Bug 2011306 has been marked as a duplicate of this bug. ***

Comment 11 Alexander Chuzhoy 2021-10-07 17:46:02 UTC
FailedQA

Version: registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446



[core@sno-0-0 ~]$   journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Thu 2021-10-07 16:53:11 UTC. --
Oct 07 17:45:02 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: https://localhost:2379 is healthy: successfully committed proposal: took = 12.689336ms
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: Starting cluster-bootstrap...
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: The connection to the server api-int.sealusa35-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Oct 07 17:45:15 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 07 17:45:15 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 212.
Oct 07 17:45:16 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Oct 07 17:45:16 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: https://localhost:2379 is healthy: successfully committed proposal: took = 12.694707ms
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: Starting cluster-bootstrap...
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: The connection to the server api-int.sealusa35-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Oct 07 17:45:29 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 07 17:45:29 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 213.
Oct 07 17:45:30 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Oct 07 17:45:30 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.

Comment 12 Matthew Staebler 2021-10-07 18:19:49 UTC
(In reply to Alexander Chuzhoy from comment #11)
> FailedQA
> 
> Version:
> registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446

That nightly does not include the fix. This BZ targets 4.10.0. You want registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-10-07-095518 or more recent.

Comment 13 Alexander Chuzhoy 2021-10-07 19:53:11 UTC
(In reply to Matthew Staebler from comment #12)
> (In reply to Alexander Chuzhoy from comment #11)
> > FailedQA
> > 
> > Version:
> > registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446
> 
> That nightly does not include the fix. This BZ targets 4.10.0. You want
> registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-10-07-095518 or
> more recent.

Do we need to clone this one for 4.9? The failure is still relevant in 4.9

Comment 14 Matthew Staebler 2021-10-07 20:52:20 UTC
> Do we need to clone this one for 4.9? The failure is still relevant in 4.9

It has already been cloned. https://bugzilla.redhat.com/show_bug.cgi?id=2011701

Comment 17 Alexander Chuzhoy 2021-12-06 18:56:13 UTC
VerifiedQA
Version: 4.10.0-0.nightly-2021-12-06-123512

The reported issue doesn't reproduce.

The bootstrap phase completed successfully (truncated message):


Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13945]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Log bundle written to log-bundle-bootstrap.tar.gz
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Adding bootstrap control plane and bootstrap installer-gather bundle to master ignition
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13930]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14112]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13959]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13934]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14038]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14033]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13963]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13971]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14106]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14023]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14121]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14094]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14016]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13992]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host
Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14071]: pam_unix(sudo:session): session closed for user root
Dec 06 18:49:49 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: bootkube.service complete
Dec 06 18:49:49 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d systemd[1]: bootkube.service: Succeeded.

Broadcast message from root@random-hostname-defd6463-4144-48f9-92ef-870ec209b77d (Mon 2021-12-06 18:50:49 UTC):

'Installation completed, server is going to reboot.'
The system is going down for reboot at Mon 2021-12-06 18:51:49 UTC!

q
Broadcast message from root@random-hostname-defd6463-4144-48f9-92ef-870ec209b77d (Mon 2021-12-06 18:51:49 UTC):

'Installation completed, server is going to reboot.'
The system is going down for reboot NOW!

Connection to 192.168.123.142 closed by remote host.
Connection to 192.168.123.142 closed.

Comment 20 errata-xmlrpc 2022-03-10 16:17:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 21 Red Hat Bugzilla 2023-09-15 01:15:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days