Bug 1933263 - user manifest with nodeport services causes bootstrap to block
Summary: user manifest with nodeport services causes bootstrap to block
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-26 11:34 UTC by Ilya Dmitrichenko
Modified: 2021-07-27 22:48 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:48:28 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-bootstrap pull 56 (open): Bug 1933263: bump(library-go) (last updated 2021-03-02 11:21:58 UTC)
Github openshift/library-go pull 1002 (open): Bug 1933263: pkg/assets: get resources before create to cope with pre-etcd-create errors (last updated 2021-03-01 15:41:31 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:48:48 UTC)

Description Ilya Dmitrichenko 2021-02-26 11:34:02 UTC
Version: OCP 4.6.12 IPI

Adding a manifest that contains a service with nodePort specified blocks bootstrap.

Bootkube keeps logging this:

Jan 27 09:51:30 gcp-ocp46-oss193-cc-94498-bootstrap.c.cilium-dev.internal bootkube.sh[2403]: "cluster-network-04-cilium-test-0001-echo-b-service.yaml": failed to create services.v1./echo-b -n cilium-test: Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31313: provided port is already allocated

The cluster eventually comes up fine, but only after bootkube times out, so it takes an extra long time to set up the cluster, and it is not at all clear to the user what has gone wrong.

I should say, I wish manifests were somehow validated by the installer; there are many failure modes that occur due to various minor issues with user-provided manifests. This one, of course, is not even a user error, it's actually mishandling of Service objects in bootkube's proto-kubectl-apply.

Previously reported in https://github.com/openshift/okd/issues/484.

Comment 1 Ilya Dmitrichenko 2021-02-26 11:44:51 UTC
The YAML manifest that broke this looks like this:


---
apiVersion: v1
kind: Service
metadata:
  name: echo-b
  labels:
    name: echo-b
    topology: any
    component: services-check
    traffic: internal
    quarantine: "false"
    type: autocheck
  namespace: cilium-test
spec:
  ports:
  - name: http
    port: 8080
    nodePort: 31313
  type: NodePort
  selector:
    name: echo-b

Comment 2 Matthew Staebler 2021-02-26 18:26:47 UTC
Please provide the gather bundle collected by the installer when it failed the installation.

Comment 3 Matthew Staebler 2021-02-26 18:42:47 UTC
Looking at the log bundle provided in the OKD issue, the service is getting created successfully at one point.

```
Jan 27 09:10:27 gcp-ocp46-oss193-cc-94498-bootstrap.c.cilium-dev.internal bootkube.sh[2403]: Created "cluster-network-04-cilium-test-0001-echo-b-service.yaml" services.v1./echo-b -n cilium-test
```

I can see that the service exists in the cluster.

However, when the cluster-bootstrap loops through the manifests again, it does not recognize that the resource already exists and tries to apply it again.

```
Jan 27 09:23:00 gcp-ocp46-oss193-cc-94498-bootstrap.c.cilium-dev.internal bootkube.sh[2403]: Failed to create "cluster-network-04-cilium-test-0001-echo-b-service.yaml" services.v1./echo-b -n cilium-test: Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31313: provided port is already allocated
```

Comment 4 Matthew Staebler 2021-02-26 18:52:06 UTC
So I think what is happening is an issue with Kubernetes Services. When there is an attempt to create the Service again after it has already been created, the error coming back from the API server is a validation error about the nodePort already being in use, rather than an AlreadyExists error. cluster-bootstrap relies on the server responding with AlreadyExists in this case.
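
To illustrate, here is a minimal Go sketch (not the cluster-bootstrap source; the Service fields come from the comment 1 manifest, and the clientset is assumed to be built elsewhere from the bootstrap kubeconfig) of why a create-only loop cannot skip this manifest: the repeated create comes back as a validation error, so an AlreadyExists check never matches.

```
// Minimal sketch, not the cluster-bootstrap source: a create-only "apply" of
// the echo-b Service from comment 1, showing that the second attempt is not
// reported as AlreadyExists.
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createEchoB(ctx context.Context, client kubernetes.Interface) {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "echo-b", Namespace: "cilium-test"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: map[string]string{"name": "echo-b"},
			Ports:    []corev1.ServicePort{{Name: "http", Port: 8080, NodePort: 31313}},
		},
	}
	_, err := client.CoreV1().Services(svc.Namespace).Create(ctx, svc, metav1.CreateOptions{})
	switch {
	case err == nil:
		fmt.Println("created")
	case apierrors.IsAlreadyExists(err):
		// This is the branch a create-only loop relies on to skip the manifest,
		// but it is not taken here: the node port allocation check rejects the
		// request before the name-uniqueness check is reached.
		fmt.Println("skipped, already exists")
	default:
		// The repeated create lands here with
		// "spec.ports[0].nodePort: Invalid value: 31313: provided port is already allocated".
		fmt.Println("failed:", err)
	}
}
```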

Comment 5 Matthew Staebler 2021-02-26 18:53:00 UTC
Moving this to the openshift-apiserver team as the owner of cluster-bootstrap.
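
The linked library-go pull request ("pkg/assets: get resources before create to cope with pre-etcd-create errors") addresses this in the apply path that cluster-bootstrap vendors. A rough sketch of that get-before-create idea using the dynamic client follows; the helper name and shape are illustrative, not the actual patch.

```
// Illustrative sketch of a get-before-create helper; the real change is in
// openshift/library-go pkg/assets (pull 1002), this only shows the general idea.
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// ensureCreated reports success both when it creates the object and when the
// object is already present, so a bootstrap retry loop can skip the manifest
// instead of retrying until it times out.
func ensureCreated(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, obj *unstructured.Unstructured) error {
	// GET first: if a previous pass already created the object, skip it no
	// matter how a repeated CREATE would fail.
	_, err := client.Resource(gvr).Namespace(obj.GetNamespace()).Get(ctx, obj.GetName(), metav1.GetOptions{})
	if err == nil {
		return nil // already exists, nothing to do
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	_, err = client.Resource(gvr).Namespace(obj.GetNamespace()).Create(ctx, obj, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // created concurrently between the GET and the CREATE
	}
	return err
}
```

Doing the GET first means any error raised before the etcd name-uniqueness check, such as the node port allocator rejecting 31313, can no longer hide the fact that the object is already there.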

Comment 9 Xingxing Xia 2021-03-23 11:26:14 UTC
Researched https://github.com/openshift/okd/issues/484 ; the symptom occurred with user-provided manifests of a NodePort service and a CSV. So, following the reproducer instructions at https://docs.cilium.io/en/v1.9/gettingstarted/k8s-install-openshift-okd/ , provide user manifests of these to test openshift-install 4.8.0-0.nightly-2021-03-22-104536:
$ CLUSTER_NAME="cluster-1"
$ openshift-install version
openshift-install 4.8.0-0.nightly-2021-03-22-104536
$ openshift-install create install-config --dir "${CLUSTER_NAME}"
$ sed -i 's/networkType:\ [^ ]*/networkType:\ Cilium/' "${CLUSTER_NAME}/install-config.yaml"
$ openshift-install create manifests --dir "${CLUSTER_NAME}"
Then download the manifests from https://github.com/cilium/cilium-olm/tree/b4799bc72bcaf19bffef55aca2a5c23f921dff88/manifests/cilium.v1.9.1 into ${CLUSTER_NAME}/manifests/ .
Also put the comment 1 YAML manifest there, rename it to cluster-network-06-cilium-00015-echo-b-service.yaml, and change the namespace name to cilium.
Then run:
$ openshift-install create cluster --dir "${CLUSTER_NAME}" --log-level debug

Meanwhile, in another terminal:
$ ssh core@<bootstrap_host> journalctl -b -n all -f -u release-image.service -u bootkube.service > ~/all/bootstrap_services.log
The bootstrap host is terminated within 20 minutes, unlike the original issue, where bootstrap timed out after hours.

$ tail -f -n +1 ~/all/bootstrap_services.log | grep -e echo-b -e clusterserviceversion
Mar 23 10:56:44 ip-10-0-26-245 bootkube.sh[2305]: Created "cluster-network-06-cilium-00015-echo-b-service.yaml" services.v1./echo-b -n cilium
Mar 23 10:56:56 ip-10-0-26-245 bootkube.sh[2305]: "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": unable to get REST mapping for "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": no matches for kind "ClusterServiceVersion" in version "operators.coreos.com/v1alpha1"
Mar 23 10:57:17 ip-10-0-26-245 bootkube.sh[2305]: "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": unable to get REST mapping for "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": no matches for kind "ClusterServiceVersion" in version "operators.coreos.com/v1alpha1"
Mar 23 10:57:24 ip-10-0-26-245 bootkube.sh[2305]: "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": unable to get REST mapping for "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml": no matches for kind "ClusterServiceVersion" in version "operators.coreos.com/v1alpha1"
Mar 23 10:57:25 ip-10-0-26-245 bootkube.sh[2305]: Created "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml" clusterserviceversions.v1alpha1.operators.coreos.com/cilium.v1.9.1 -n cilium
Mar 23 11:10:49 ip-10-0-26-245 bootkube.sh[2305]: Skipped "cluster-network-06-cilium-00014-cilium.v1.9.1-clusterserviceversion.yaml" clusterserviceversions.v1alpha1.operators.coreos.com/cilium.v1.9.1 -n cilium as it already exists
Mar 23 11:10:50 ip-10-0-26-245 bootkube.sh[2305]: Skipped "cluster-network-06-cilium-00015-echo-b-service.yaml" services.v1./echo-b -n cilium as it already exists

The logs are the same as in the PR, so moving to VERIFIED.

Comment 12 errata-xmlrpc 2021-07-27 22:48:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

