Single node iBIP flow: Cluster bootstrap will try to send event after tear down of temporary control plane. The issue happened 5 /7 I the son live-iso CI. https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-single-node-live-iso%22%7D%5D%7D&sortField=timestamp&sort=desc The issue started with 4.9.0-rc.5 (rc.4 works without issues) See the race here: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-assisted-test-infra-master-e2e-metal-single-node-live-iso-periodic/1444452225400180736 from the logs we can see that kube-api got shutdown request and that caused cluster bootstrap to fail on sending event kube-api: I1003 00:22:20.076822 1 genericapiserver.go:421] "[graceful-termination] shutdown event" name="InFlightRequestsDrained" I1003 00:22:20.076840 1 genericapiserver.go:751] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"openshift-kube-apiserver", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'InFlightRequestsDrained' All non long-running request(s) in-flight have drained I1003 00:22:20.077676 1 dynamic_serving_content.go:144] "Shutting down controller" name="aggregator-proxy-cert::/etc/kubernetes/secrets/apiserver-proxy.crt::/etc/kubernetes/secrets/apiserver-proxy.key" bootkube: Oct 03 00:22:20 test-infra-cluster-master-0 bootkube.sh[2601]: Tearing down temporary bootstrap control plane... Oct 03 00:22:20 test-infra-cluster-master-0 bootkube.sh[2601]: Sending bootstrap-finished event.The connection to the server api-int.test-infra-cluster.redhat.com:6443 was refused - did you specify the right host or port? Oct 03 00:22:20 test-infra-cluster-master-0 systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE Oct 03 00:22:20 test-infra-cluster-master-0 systemd[1]: bootkube.service: Failed with result 'exit-code'. Oct 03 00:22:25 test-infra-cluster-master-0 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart. Oct 03 00:22:25 test-infra-cluster-master-0 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
The same issue is reproducing for me with the zero touch provisioning flow and 4.9.0-0.nightly-2021-10-05-004711
*** Bug 2011324 has been marked as a duplicate of this bug. ***
There is a new rhcos image being built for 4.9 that fixes pod status reporting... so this may clear up... https://bugzilla.redhat.com/show_bug.cgi?id=2011050
Setting as a blocker since this issue fails the single node live-iso CI.
The actual issue is: https://github.com/openshift/installer/blob/6617bc2e334654bb6e85976f049e51fc1c01aa3f/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L427 Bootkube tries to use oc after cluster bootstrap is done and there is no API This explains why all the bootkube retries failed with this error: Oct 06 15:51:53 test-infra-cluster-master-0 bootkube.sh[243426]: Starting cluster-bootstrap... Oct 06 15:51:53 test-infra-cluster-master-0 bootkube.sh[243426]: The connection to the server api-int.test-infra-cluster.redhat.com:6443 was refused - did you specify the right host or port? It's not really starting cluster-bootstrap, cluster-bootstrap is done (the log is misleading) and the command that fails is the oc patch
*** Bug 2011306 has been marked as a duplicate of this bug. ***
FailedQA Version: registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446 [core@sno-0-0 ~]$ journalctl -b -f -u release-image.service -u bootkube.service -- Logs begin at Thu 2021-10-07 16:53:11 UTC. -- Oct 07 17:45:02 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster. Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: https://localhost:2379 is healthy: successfully committed proposal: took = 12.689336ms Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: Starting cluster-bootstrap... Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[703794]: The connection to the server api-int.sealusa35-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port? Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE Oct 07 17:45:10 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'. Oct 07 17:45:15 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart. Oct 07 17:45:15 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 212. Oct 07 17:45:16 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Stopped Bootstrap a Kubernetes cluster. Oct 07 17:45:16 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster. Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: https://localhost:2379 is healthy: successfully committed proposal: took = 12.694707ms Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: Starting cluster-bootstrap... Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com bootkube.sh[707152]: The connection to the server api-int.sealusa35-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port? Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE Oct 07 17:45:24 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'. Oct 07 17:45:29 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart. Oct 07 17:45:29 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 213. Oct 07 17:45:30 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Stopped Bootstrap a Kubernetes cluster. Oct 07 17:45:30 sno-0-0.sealusa35-0.qe.lab.redhat.com systemd[1]: Started Bootstrap a Kubernetes cluster.
(In reply to Alexander Chuzhoy from comment #11) > FailedQA > > Version: > registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446 That nightly does not include the fix. This BZ targets 4.10.0. You want registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-10-07-095518 or more recent.
(In reply to Matthew Staebler from comment #12) > (In reply to Alexander Chuzhoy from comment #11) > > FailedQA > > > > Version: > > registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-10-07-110446 > > That nightly does not include the fix. This BZ targets 4.10.0. You want > registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-10-07-095518 or > more recent. Do we need to clone this one for 4.9? The failure is still relevant in 4.9
> Do we need to clone this one for 4.9? The failure is still relevant in 4.9 It has already been cloned. https://bugzilla.redhat.com/show_bug.cgi?id=2011701
VerifiedQA Version: 4.10.0-0.nightly-2021-12-06-123512 The reported issue doesn't reproduce. The bootstrap phase completed successfully (truncated message): Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13945]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Log bundle written to log-bundle-bootstrap.tar.gz Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Adding bootstrap control plane and bootstrap installer-gather bundle to master ignition Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13930]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14112]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13959]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13934]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14038]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14033]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13963]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13971]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14106]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14023]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14121]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14094]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14016]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[13992]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: Unable to connect to the server: dial tcp: lookup api-int.elvis2.qe.lab.redhat.com on 192.168.123.1:53: no such host Dec 06 18:49:34 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d sudo[14071]: pam_unix(sudo:session): session closed for user root Dec 06 18:49:49 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d bootkube.sh[5299]: bootkube.service complete Dec 06 18:49:49 random-hostname-defd6463-4144-48f9-92ef-870ec209b77d systemd[1]: bootkube.service: Succeeded. Broadcast message from root@random-hostname-defd6463-4144-48f9-92ef-870ec209b77d (Mon 2021-12-06 18:50:49 UTC): 'Installation completed, server is going to reboot.' The system is going down for reboot at Mon 2021-12-06 18:51:49 UTC! q Broadcast message from root@random-hostname-defd6463-4144-48f9-92ef-870ec209b77d (Mon 2021-12-06 18:51:49 UTC): 'Installation completed, server is going to reboot.' The system is going down for reboot NOW! Connection to 192.168.123.142 closed by remote host. Connection to 192.168.123.142 closed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days