Bug 1732984

Summary:	GCP RHCOS image cannot accept hostnames greater than 64 characters
Product:	OpenShift Container Platform	Reporter:	Abhinav Dahiya <adahiya>
Component:	RHCOS	Assignee:	Steve Milner <smilner>
Status:	CLOSED WONTFIX	QA Contact:	Micah Abbott <miabbott>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.2.0	CC:	bbreard, dustymabe, imcleod, jligon, lucab, nstielau, walters
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2019-07-26 12:49:02 UTC	Type:	Bug

Description Abhinav Dahiya 2019-07-24 21:53:08 UTC

Description of problem:

For an instance named `cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal`, NetworkManager logs the following:

Jul 24 18:10:16 localhost NetworkManager[931]: <info>  [1563991816.2584] dhcp4 (ens4):   hostname 'cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal'
Jul 24 18:10:16 localhost NetworkManager[931]: <info>  [1563991816.2705] policy: set-hostname: set hostname to 'cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal' (from DHCPv4)
Jul 24 18:10:16 localhost NetworkManager[931]: <warn>  [1563991816.2759] hostname: couldn't set the system hostname to 'cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal' using hostnamed: GDBus.Error:org.freedesktop.DBus.Error.InvalidArgs: Invalid hostname 'cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal'
Jul 24 18:10:16 localhost NetworkManager[931]: <warn>  [1563991816.2760] policy: set-hostname: couldn't set the system hostname to 'cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal': (1) Operation not permitted
Jul 24 18:10:16 localhost NetworkManager[931]: <warn>  [1563991816.2760] policy: set-hostname: you should use hostnamed when systemd hardening is in effect!
Jul 24 18:10:17 localhost NetworkManager[931]: <info>  [1563991817.4836] policy: set-hostname: current hostname was changed outside NetworkManager: 'localhost'

The OS fails to set the hostname and uses localhost.
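As a rough illustration of why the name is rejected, the sketch below only checks the length of the hostname from the log against the Linux `HOST_NAME_MAX` of 64. The helper name `fits_host_name_max` is hypothetical, and hostnamed applies additional validity rules that are not modelled here.

```python
# Linux caps the system hostname at HOST_NAME_MAX (64) bytes.
HOST_NAME_MAX = 64

# Hostname copied verbatim from the NetworkManager log above.
DHCP_HOSTNAME = "cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal"


def fits_host_name_max(hostname: str, limit: int = HOST_NAME_MAX) -> bool:
    """Return True if the hostname is non-empty and within the length limit.

    Hypothetical helper: only the byte length is checked, nothing else.
    """
    return 0 < len(hostname.encode("utf-8")) <= limit


if __name__ == "__main__":
    # Print the length of the DHCP-provided name and whether it fits.
    print(len(DHCP_HOSTNAME), fits_host_name_max(DHCP_HOSTNAME))
```

The DHCP-provided name is 71 characters long, so it fails this check, while `localhost` passes.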


Based on https://github.com/coreos/bugs/issues/2273

it seems like GKE doesn't have this problem and we probably should also be accepting this.

Comment 1 Luca BRUNO 2019-07-25 08:04:54 UTC

For reference, systemd-networkd does truncate the hostname to the first dot or to `HOST_MAX_LEN` (whichever comes earlier) when receiving an overlong one from DHCP: https://github.com/systemd/systemd/pull/7616
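A minimal sketch of that truncation rule, assuming a limit of 64 and written in Python purely for illustration. It is not the systemd code, and the function name `truncate_overlong_hostname` is hypothetical.

```python
# Assumed limit, matching the Linux HOST_NAME_MAX of 64.
HOST_NAME_MAX = 64


def truncate_overlong_hostname(hostname: str, limit: int = HOST_NAME_MAX) -> str:
    """Shorten an overlong hostname to its first label, capped at `limit`.

    Names that already fit are returned unchanged.
    """
    if len(hostname) <= limit:
        return hostname
    # Cut at the first dot, then cap the remaining label at the limit.
    first_label = hostname.split(".", 1)[0]
    return first_label[:limit]


if __name__ == "__main__":
    fqdn = "cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal"
    print(truncate_overlong_hostname(fqdn))
```

Applied to the hostname from this report, the result is the first DNS label, `cewong-8wj4b-worker-us-east1-b-cqpsv`.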

Comment 3 Steve Milner 2019-07-25 15:23:02 UTC

The hostname is invalid according to systemd (https://github.com/systemd/systemd/issues/3979#issuecomment-240887597), but we'll take a look.

Comment 4 Colin Walters 2019-07-25 16:01:14 UTC

>  Based on https://github.com/coreos/bugs/issues/2273

That's a Container Linux bug...

> it seems like GKE doesn't have this problem and we probably should also be accepting this.

Which doesn't really have a direct relationship with GKE.

I don't think we should truncate; seems highly likely to cause problems with node identity, CSR signing etc.

I think the installer PR to use shorter names is the right thing here short term.

*Longer* term I think we should have a better concept of node identity than hostnames; basically the MCO/machineAPI would combine to own this, and the injected Ignition would include bits to control the hostname or so.

That said I am spinning up a GKE cluster right now to see what they do, out of curiosity.

Comment 5 Abhinav Dahiya 2019-07-25 16:13:41 UTC

> I don't think we should truncate; seems highly likely to cause problems with node identity, CSR signing etc.

Based on https://cloud.google.com/compute/docs/internal-dns#instance-fully-qualified-domain-names

the FQDN that RHCOS will receive for an instance on GCP is `[INSTANCE_NAME].[ZONE].c.[PROJECT_ID].internal`, so the parts other than the instance name will have a length of 23 (longest zone name, northamerica-northeast1) + 1 (c) + 30 (max project ID) + 8 (internal) + 4 (dots) = 66, which is already longer than the HOST_NAME_MAX of 64.

So it looks like the hostname has to be truncated from the FQDN to the first DNS label for RHCOS to be supported on GCP.
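The arithmetic above can be reproduced with a short sketch. The zone string and the 30-character project ID limit are the values cited in this comment; the helper `fqdn_suffix` and the placeholder project ID are hypothetical.

```python
HOST_NAME_MAX = 64

# Values as cited in the comment above.
LONGEST_ZONE = "northamerica-northeast1"
MAX_PROJECT_ID_LEN = 30


def fqdn_suffix(zone: str, project_id: str) -> str:
    """Return everything that follows [INSTANCE_NAME] in the GCP FQDN."""
    return "." + ".".join([zone, "c", project_id, "internal"])


# Placeholder project ID of maximum length, used only to measure the suffix.
suffix = fqdn_suffix(LONGEST_ZONE, "p" * MAX_PROJECT_ID_LEN)

if __name__ == "__main__":
    print(len(suffix), len(suffix) > HOST_NAME_MAX)
```

With these inputs the suffix alone is 66 characters, so no instance name of any length can produce an FQDN that fits in 64.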


> I don't think we should truncate; seems highly likely to cause problems with node identity, CSR signing etc.


Yes, we definitely need the kubelet to register with the FQDN of the instance, because the node-name in the cluster currently needs to be resolvable inside the cluster.

But if we truncate, the kubelet on GCP will use the `os.GetHostname` for the node-name: https://github.com/kubernetes/kubernetes/blob/81684586dba9ee4d446c624e91d2a82346f022df/staging/src/k8s.io/legacy-cloud-providers/gce/gce_instances.go#L356-L360

So we might have to edit our kubelet service on GCP to use the `--hostname-override` flag to set the node-name to be registered as the FQDN of the instance.
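A small sketch of that idea, assuming the instance FQDN is already known from some other source. The helper `kubelet_hostname_args` is hypothetical; `--hostname-override` is the kubelet flag named above.

```python
from typing import List


def kubelet_hostname_args(instance_fqdn: str, system_hostname: str) -> List[str]:
    """Return extra kubelet arguments needed to register under the FQDN.

    If the system hostname already equals the FQDN, no override is needed.
    """
    if system_hostname == instance_fqdn:
        return []
    return ["--hostname-override=" + instance_fqdn]


if __name__ == "__main__":
    fqdn = "cewong-8wj4b-worker-us-east1-b-cqpsv.c.openshift-dev-installer.internal"
    # With a truncated system hostname, the override carries the full FQDN.
    print(kubelet_hostname_args(fqdn, "cewong-8wj4b-worker-us-east1-b-cqpsv"))
```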

Comment 6 Colin Walters 2019-07-25 17:24:46 UTC

xref https://github.com/openshift/installer/pull/2088

Comment 7 Colin Walters 2019-07-25 18:21:56 UTC

I was curious to poke at the current state of GKE, so I just spun up a cluster there.
(One side note: apparently `oc` can't work with GCE auth, and Fedora ships /usr/bin/kubectl -> oc, so I had to build upstream kubectl.)

Spun up a privileged pod and chrooted into the host, and the hostname is just:

gke-walters-test-default-pool-39fad701-fmbd

From what I can tell, that's from cloud-init, not DHCP.

Jul 25 16:05:49 gke-walters-test-default-pool-39fad701-fmbd cloud-init[1223]: [CLOUDINIT] url_helper.py[DEBUG]: [0/6] open 'http://metadata.google.internal/computeMetadata/v1/instance/hostname' with {'url': 'http://metadata.google.internal/computeMetadata/v1/instance/hostname', 'headers': {'X-Google-Metadata-Request': 'True'}, 'allow_redirects': True, 'method': 'GET'} configuration
Jul 25 16:05:49 gke-walters-test-default-pool-39fad701-fmbd cloud-init[1223]: [CLOUDINIT] url_helper.py[DEBUG]: Read from http://metadata.google.internal/computeMetadata/v1/instance/hostname (200, 74b) after 1 attempts

So, perhaps RHCOS should do the same?  And here by "RHCOS" I really mean Afterburn https://github.com/coreos/afterburn/ which is doing a similar thing for Azure.
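For illustration, the request that cloud-init logs above could be sketched as follows. The URL and header are copied from that log; the function names are hypothetical, and this is not how Afterburn is implemented.

```python
import urllib.request

# URL as it appears in the cloud-init log above.
METADATA_HOSTNAME_URL = "http://metadata.google.internal/computeMetadata/v1/instance/hostname"


def build_hostname_request(url: str = METADATA_HOSTNAME_URL) -> urllib.request.Request:
    """Build (but do not send) a GET request for the instance hostname."""
    # Header name and value copied from the cloud-init log above.
    headers = {"X-Google-Metadata-Request": "True"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_hostname(timeout: float = 2.0) -> str:
    """Send the request; only works from inside a GCE instance."""
    with urllib.request.urlopen(build_hostname_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

Calling `fetch_hostname()` outside GCE will fail, since `metadata.google.internal` only resolves on an instance.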

Comment 8 Luca BRUNO 2019-07-26 10:35:45 UTC

> So, perhaps RHCOS should do the same?

I'd rather not. The DHCP provides the hostname for the node, that's the authoritative source of truth.

If we want to statically override the hostname, that's a legit customization, and Afterburn supports that (on GCP too; see `AFTERBURN_GCP_HOSTNAME` and `--hostname`).

However, by default we don't do that, as it would introduce a two generals problem regarding the source of truth for a machine's hostname, especially in case of failures or bugs in Afterburn or in the metadata service.

> which is doing a similar thing for Azure

It is not. On Azure, the DHCP does not provide the hostname for the node. As such, we are forced to hack around via Afterburn.

Comment 9 Colin Walters 2019-07-26 12:49:02 UTC

> It is not. On Azure, the DHCP does not provide the hostname for the node. As such, we are forced to hack around via Afterburn.
...
> I'd rather not. The DHCP provides the hostname for the node, that's the authoritative source of truth.

Right, fair enough!

RESOLVED => DHCP