Version: OCP 4.6.12 IPI

Adding a manifest that contains a Service with `nodePort` specified blocks bootstrap. Bootkube keeps logging this:

```
Jan 27 09:51:30 gcp-ocp46-oss193-cc-94498-bootstrap.c.cilium-dev.internal bootkube.sh[2403]: "cluster-network-04-cilium-test-0001-echo-b-service.yaml": failed to create services.v1./echo-b -n cilium-test: Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31313: provided port is already allocated
```

The cluster eventually comes up fine, but only after bootkube times out, so setting up the cluster takes much longer and it is not at all clear to the user what has gone wrong. I should say, I wish manifests were somehow validated by the installer; there are many failure modes that occur due to various minor issues with user-provided manifests. This one, of course, is not even a user error: it is actually mishandling of Service objects in bootkube's proto-kubectl-apply. Previously reported in https://github.com/openshift/okd/issues/484.
The YAML manifest that triggered this looks like this:

```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: echo-b
  labels:
    name: echo-b
    topology: any
    component: services-check
    traffic: internal
    quarantine: "false"
    type: autocheck
  namespace: cilium-test
spec:
  ports:
  - name: http
    port: 8080
    nodePort: 31313
  type: NodePort
  selector:
    name: echo-b
```
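As an aside on the wish for installer-side validation above, a minimal sketch of the kind of check that could flag this manifest early. This is hypothetical, not an actual openshift-install feature, and operates on the manifest as an already-parsed dict:

```python
# Hypothetical installer-side lint (not an actual openshift-install
# feature): warn when a user-provided Service manifest pins a fixed
# nodePort, since re-applying it during bootstrap can hit the
# "provided port is already allocated" validation error.

def find_pinned_node_ports(manifest: dict) -> list:
    """Return (service_name, nodePort) pairs for explicitly pinned ports."""
    if manifest.get("kind") != "Service":
        return []
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    ports = manifest.get("spec", {}).get("ports", []) or []
    return [(name, p["nodePort"]) for p in ports if "nodePort" in p]

# The echo-b manifest from this report, as a dict:
echo_b = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "echo-b", "namespace": "cilium-test"},
    "spec": {
        "type": "NodePort",
        "ports": [{"name": "http", "port": 8080, "nodePort": 31313}],
        "selector": {"name": "echo-b"},
    },
}

print(find_pinned_node_ports(echo_b))  # [('echo-b', 31313)]
```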
Please provide the gather bundle collected by the installer when the installation failed.
Looking at the log bundle provided in the OKD issue, the service is getting created successfully at one point:

```
Jan 27 09:10:27 gcp-ocp46-oss193-cc-94498-bootstrap.c.cilium-dev.internal bootkube.sh[2403]: Created "cluster-network-04-cilium-test-0001-echo-b-service.yaml" services.v1./echo-b -n cilium-test
```

I can see that the service exists in the cluster. However, when cluster-bootstrap loops through the manifests again, it does not recognize that the resource already exists and tries to apply it again:

```
Jan 27 09:23:00 gcp-ocp46-oss193-cc-94498-bootstrap.c.cilium-dev.internal bootkube.sh[2403]: Failed to create "cluster-network-04-cilium-test-0001-echo-b-service.yaml" services.v1./echo-b -n cilium-test: Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31313: provided port is already allocated
```
So I think what is happening is an issue with Kubernetes Services: when there is an attempt to create the Service again after it has already been created, the error coming back from the API server is a validation error about the nodePort already being in use, rather than an AlreadyExists error. cluster-bootstrap relies on the server responding with AlreadyExists in this case.
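This failure mode can be sketched as follows. This is a simplified Python model of cluster-bootstrap's create-or-skip loop and the API server behavior observed in the logs, not the actual Go code of either component:

```python
# Simplified model (not the real cluster-bootstrap Go code): the loop
# skips a manifest only when the API server answers AlreadyExists. For a
# Service with a pinned nodePort, the port-allocation check fires first
# and yields an Invalid error, so the manifest is retried indefinitely.

class ApiError(Exception):
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason  # "AlreadyExists" or "Invalid"

class FakeApiServer:
    def __init__(self):
        self.objects = set()
        self.allocated_node_ports = set()

    def create(self, name, node_port=None):
        if node_port is not None and node_port in self.allocated_node_ports:
            # The nodePort allocator rejects the request before the
            # name-uniqueness check, so the caller never sees AlreadyExists.
            raise ApiError("Invalid")
        if name in self.objects:
            raise ApiError("AlreadyExists")
        self.objects.add(name)
        if node_port is not None:
            self.allocated_node_ports.add(node_port)

def apply_once(api, name, node_port=None):
    """Mimics bootkube's log outcomes: Created, Skipped, or Failed."""
    try:
        api.create(name, node_port)
        return "Created"
    except ApiError as err:
        if err.reason == "AlreadyExists":
            return "Skipped"  # the only error treated as success
        return "Failed"       # anything else is retried on the next pass

api = FakeApiServer()
print(apply_once(api, "echo-b", node_port=31313))  # Created
print(apply_once(api, "echo-b", node_port=31313))  # Failed (never Skipped)
```

With no pinned nodePort, the second create raises AlreadyExists and the manifest is skipped as intended; the pinned port is what diverts the error path.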
Moving this to the openshift-apiserver team as the owner of cluster-bootstrap.
Researched https://github.com/openshift/okd/issues/484 ; the symptom occurred with user-provided manifests of a NodePort Service and a CSV. So, following the reproducer instructions at https://docs.cilium.io/en/v1.9/gettingstarted/k8s-install-openshift-okd/ , provide user manifests of these to test openshift-install 4.8.0-0.nightly-2021-03-22-104536:

```
$ CLUSTER_NAME="cluster-1"
$ openshift-install version
openshift-install 4.8.0-0.nightly-2021-03-22-104536
$ openshift-install create install-config --dir "${CLUSTER_NAME}"
$ sed -i 's/networkType:\ [^ ]*/networkType:\ Cilium/' "${CLUSTER_NAME}/install-config.yaml"
$ openshift-install create manifests --dir "${CLUSTER_NAME}"
```

Then download the https://github.com/cilium/cilium-olm/tree/b4799bc72bcaf19bffef55aca2a5c23f921dff88/manifests/cilium.v1.9.1 manifests under ${CLUSTER_NAME}/manifests/ . Also put the YAML manifest from comment 1 above there, rename it to cluster-network-06-cilium-00015-echo-b-service.yaml, and modify the namespace name to cilium. Then run:

```
$ openshift-install create cluster --dir "${CLUSTER_NAME}" --log-level debug
```

Meanwhile, in another terminal:

```
$ ssh core@<bootstrap_host> journalctl -b -n all -f -u release-image.service -u bootkube.service > ~/all/bootstrap_services.log
```

The bootstrap host is terminated within 20 minutes, unlike in the issue, which timed out for hours.
```
$ tail -f -n +1 ~/all/bootstrap_services.log | grep -e echo-b -e clusterserviceversion
Mar 23 10:56:44 ip-10-0-26-245 bootkube.sh[2305]: Created "cluster-network-06-cilium-00015-echo-b-service.yaml" services.v1./echo-b -n cilium
Mar 23 10:56:56 ip-10-0-26-245 bootkube.sh[2305]: "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": unable to get REST mapping for "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": no matches for kind "ClusterServiceVersion" in version "operators.coreos.com/v1alpha1"
Mar 23 10:57:17 ip-10-0-26-245 bootkube.sh[2305]: "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": unable to get REST mapping for "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": no matches for kind "ClusterServiceVersion" in version "operators.coreos.com/v1alpha1"
Mar 23 10:57:24 ip-10-0-26-245 bootkube.sh[2305]: "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": unable to get REST mapping for "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": no matches for kind "ClusterServiceVersion" in version "operators.coreos.com/v1alpha1"
Mar 23 10:57:25 ip-10-0-26-245 bootkube.sh[2305]: Created "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml" clusterserviceversions.v1alpha1.operators.coreos.com/cilium.v1.9.1 -n cilium
Mar 23 11:10:49 ip-10-0-26-245 bootkube.sh[2305]: Skipped "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml" clusterserviceversions.v1alpha1.operators.coreos.com/cilium.v1.9.1 -n cilium as it already exists
Mar 23 11:10:50 ip-10-0-26-245 bootkube.sh[2305]: Skipped "cluster-network-06-cilium-00015-echo-b-service.yaml" services.v1./echo-b -n cilium as it already exists
```

The logs are the same as in the PR, so moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438