Bug 2011652 - Bootkube tries to use oc after cluster bootstrap is done and there is no API
Summary: Bootkube tries to use oc after cluster bootstrap is done and there is no API
Keywords:
Status: CLOSED DUPLICATE of bug 2011701
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Matthew Staebler
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On: 2010665 2011701
Blocks:
Reported: 2021-10-07 01:16 UTC by Matthew Staebler
Modified: 2021-10-07 13:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2010665
Environment:
Last Closed: 2021-10-07 13:43:39 UTC
Target Upstream Version:
Embargoed:



Description Matthew Staebler 2021-10-07 01:16:20 UTC
+++ This bug was initially created as a clone of Bug #2010665 +++

Single node iBIP flow:
Cluster bootstrap tries to send an event after the temporary control plane has been torn down.
The issue occurred in 5 of 7 runs of the SNO live-iso CI.
https://sippy.ci.openshift.org/sippy-ng/jobs/4.9/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-single-node-live-iso%22%7D%5D%7D&sortField=timestamp&sort=desc

The issue started with 4.9.0-rc.5 (rc.4 works without issues)

See the race here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-assisted-test-infra-master-e2e-metal-single-node-live-iso-periodic/1444452225400180736

From the logs we can see that kube-api received a shutdown request, which caused cluster bootstrap to fail when sending the event.

kube-api:
I1003 00:22:20.076822       1 genericapiserver.go:421] "[graceful-termination] shutdown event" name="InFlightRequestsDrained"
I1003 00:22:20.076840       1 genericapiserver.go:751] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"default", Name:"openshift-kube-apiserver", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'InFlightRequestsDrained' All non long-running request(s) in-flight have drained
I1003 00:22:20.077676       1 dynamic_serving_content.go:144] "Shutting down controller" name="aggregator-proxy-cert::/etc/kubernetes/secrets/apiserver-proxy.crt::/etc/kubernetes/secrets/apiserver-proxy.key"

bootkube:
Oct 03 00:22:20 test-infra-cluster-master-0 bootkube.sh[2601]: Tearing down temporary bootstrap control plane...
Oct 03 00:22:20 test-infra-cluster-master-0 bootkube.sh[2601]: Sending bootstrap-finished event.The connection to the server api-int.test-infra-cluster.redhat.com:6443 was refused - did you specify the right host or port?
Oct 03 00:22:20 test-infra-cluster-master-0 systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 03 00:22:20 test-infra-cluster-master-0 systemd[1]: bootkube.service: Failed with result 'exit-code'.
Oct 03 00:22:25 test-infra-cluster-master-0 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 03 00:22:25 test-infra-cluster-master-0 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
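
The failure mode in the bootkube log above (one command fails after teardown, the whole script exits with status 1, and systemd schedules a restart) can be illustrated with a minimal sketch. It assumes the script runs with errexit-style handling, which is consistent with the log but is not shown in this report. The helper `best_effort` and the stand-in function `send_bootstrap_finished_event` are hypothetical names for illustration only and are not taken from bootkube.sh.

```shell
#!/usr/bin/env bash
set -euo pipefail

# best_effort CMD...: run CMD; on failure, log a warning and keep going
# instead of letting errexit abort the whole script.
# (Hypothetical helper, not part of bootkube.sh.)
best_effort() {
    if ! "$@"; then
        echo "WARNING: '$*' failed; continuing" >&2
    fi
}

# Stand-in for a call that needs the API after it has gone away.
# (Assumption for illustration; the real script uses oc here.)
send_bootstrap_finished_event() {
    echo "connection refused" >&2
    return 1
}

echo "Tearing down temporary bootstrap control plane..."
best_effort send_bootstrap_finished_event
echo "script reached the end"
```

The trade-off is that a swallowed failure means the event is never delivered, so this pattern only suits steps that are optional once the bootstrap itself has succeeded.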

--- Additional comment from Marius Cornea on 2021-10-06 12:51:40 UTC ---

The same issue is reproducing for me with the zero touch provisioning flow and 4.9.0-0.nightly-2021-10-05-004711

--- Additional comment from Marius Cornea on 2021-10-06 14:07:56 UTC ---



--- Additional comment from Ryan Phillips on 2021-10-06 15:29:48 UTC ---

A new RHCOS image is being built for 4.9 that fixes pod status reporting, so this may clear up.

https://bugzilla.redhat.com/show_bug.cgi?id=2011050

--- Additional comment from Eran Cohen on 2021-10-06 15:33:05 UTC ---

Setting as a blocker since this issue fails the single node live-iso CI.

--- Additional comment from Eran Cohen on 2021-10-06 16:09:54 UTC ---

The actual issue is:
https://github.com/openshift/installer/blob/6617bc2e334654bb6e85976f049e51fc1c01aa3f/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L427
Bootkube tries to use oc after cluster bootstrap is done and there is no API

This explains why all the bootkube retries failed with this error:
Oct 06 15:51:53 test-infra-cluster-master-0 bootkube.sh[243426]: Starting cluster-bootstrap...
Oct 06 15:51:53 test-infra-cluster-master-0 bootkube.sh[243426]: The connection to the server api-int.test-infra-cluster.redhat.com:6443 was refused - did you specify the right host or port? 
The "Starting cluster-bootstrap..." message is misleading: cluster-bootstrap has already finished, and the command that actually fails is the oc patch.
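
One way to picture the retry behaviour described in this comment is a script whose stages record completion, so a restarted run skips finished stages and goes straight to whatever is left. The sketch below is an assumption about the general pattern, not the actual bootkube.sh logic or the fix tracked in bug 2011701. `run_once`, `STATE_DIR`, and both stand-in functions are hypothetical. It shows API-dependent work ordered before the stage that removes the API, so a restart never re-runs it against a missing server.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical marker directory; the real script's layout is not shown here.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"

# run_once NAME CMD...: run CMD only if NAME did not complete in an
# earlier attempt; record completion only when CMD succeeds.
run_once() {
    local name="$1"
    shift
    if [ -f "${STATE_DIR}/${name}.done" ]; then
        echo "Skipping ${name}: already done"
        return 0
    fi
    echo "Running ${name}..."
    "$@" || return "$?"
    touch "${STATE_DIR}/${name}.done"
}

# Stand-ins for the real steps (assumptions for illustration only).
patch_via_api() { echo "patching while the API is still up"; }
cluster_bootstrap() { echo "bootstrapping, then tearing down the control plane"; }

# API-dependent work runs before the stage that takes the API away.
run_once api-patch patch_via_api
run_once cluster-bootstrap cluster_bootstrap
```

With this ordering, a restart after teardown finds both markers present and performs no API calls at all.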

--- Additional comment from Omri Hochman on 2021-10-06 17:56:10 UTC ---

Comment 1 Matthew Staebler 2021-10-07 13:43:39 UTC

*** This bug has been marked as a duplicate of bug 2011701 ***

