Description of problem:
In OpenShift on GCP, we are hitting an issue because of an invalid entry in the /etc/resolv.conf file. The file says it is generated by NetworkManager. See https://github.com/kubevirt/kubevirt/issues/5447#issuecomment-926646739

Steps to Reproduce:
1. Get an OCP cluster on GCP via OpenShift CI
2. Check the resolv.conf file

Actual results:
It contains an entry ending with a hyphen

Expected results:
It should not contain an entry ending with a hyphen

Feel free to ping me (eerol) to reproduce the issue.
There should be no truncation of DNS searches done by NM. Please enable trace logging in NetworkManager by setting level=trace in the [logging] section of /etc/NetworkManager/NetworkManager.conf; also set "RateLimitIntervalSec=0" in /etc/systemd/journald.conf so that systemd-journald does not drop messages. Then reproduce the problem and attach the output of 'journalctl -u NetworkManager -b'. Thanks.
I reproduced the issue and attached the logs. I observed something weird in this case. I checked the "/etc/resolv.conf" file before our tests and there was no such entry there. While our tests were running, I observed the invalid entry again. Then it disappeared after our tests. Even a second run of the tests passed. Also, I checked some of the nodes. I only observe this issue on the node where we run the pods for the virtual machine.
Hi, thanks for the log. I think the problem is that when NM starts it finds the system hostname set to "ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.c.openshift-gce-devel-" which is the 64-byte truncation of "ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.c.openshift-gce-devel-ci.internal". Then NM interprets "c.openshift-gce-devel-" as the suffix of the hostname and places it in the search list.

How is the initial hostname set? Is dracut involved? I suspect that something is doing:

echo "ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.c.openshift-gce-devel-ci.internal" > /proc/sys/kernel/hostname

without considering that the string will be truncated by the kernel. It would be better if the kernel returned an error instead of truncating the string.
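To make the suspected sequence concrete, here is a minimal Python sketch. It assumes the kernel silently keeps only the first 64 bytes of whatever is written, and that NM treats everything after the first dot of the system hostname as the domain. The helper names are hypothetical; this models the behavior described above and is not the kernel's or NM's code.

```python
# Illustrative model only; both helper names are hypothetical.
KERNEL_HOSTNAME_MAX = 64  # bytes kept, matching the 64-byte truncation seen in the log


def kernel_truncate(name: str) -> str:
    """Model the silent truncation of a name written to /proc/sys/kernel/hostname."""
    return name.encode("ascii")[:KERNEL_HOSTNAME_MAX].decode("ascii")


def domain_suffix(hostname: str) -> str:
    """Return the part after the first dot, or "" when there is no dot."""
    _, _, domain = hostname.partition(".")
    return domain


full = "ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.c.openshift-gce-devel-ci.internal"
stored = kernel_truncate(full)        # what NM later reads as the system hostname
search_entry = domain_suffix(stored)  # what ends up in the search list
print(stored)
print(search_entry)
```

With the hostname from the log, the derived search entry is the one ending in a hyphen.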
If you look at the journal log of boot ("journalctl -b") you should be able to see at which point the wrong hostname gets set.
The MCO lands a couple of systemd units that verify the hostname of a system on GCP is correct: the `gcp-hostname` service and the `node-valid-hostname` service

https://github.com/openshift/machine-config-operator/blob/master/templates/common/gcp/units/gcp-hostname.service.yaml
https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/units/node-valid-hostname.service.yaml

Both call out to the `mco-hostname` binary:

https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/units/node-valid-hostname.service.yaml

This is probably what is involved with the truncation of the hostname.

To investigate, I pulled the worker node journal from one of the failed tests linked to https://github.com/kubevirt/kubevirt/issues/5447

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1329/pull-ci-kubevirt-hyperconverged-cluster-operator-main-hco-e2e-image-index-gcp/1425837152389828608/artifacts/hco-e2e-image-index-gcp/gather-extra/artifacts/nodes/ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm/journal

In the logs, you'll see the system boot once, lay down some files via Ignition, then pivot to the new/updated OS. Afterwards, we can look at how the hostname is determined.

From my read of the logs, it looks like systemd originally sets the hostname with a trailing `-`

```
Aug 12 15:39:22.707401 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Set hostname to <ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel->.
```

NetworkManager starts up and uses the hostname from systemd and I'm guessing this is where the incorrect entry in `resolv.conf` happens:

```
Aug 12 15:39:25.641650 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Starting Hostname Service...
Aug 12 15:39:25.725293 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- dbus-daemon[1010]: [system] Successfully activated service 'org.freedesktop.hostname1'
Aug 12 15:39:25.725928 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Started Hostname Service.
Aug 12 15:39:25.727656 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- NetworkManager[1205]: <info> [1628782765.7273] hostname: hostname: using hostnamed
Aug 12 15:39:25.727688 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- NetworkManager[1205]: <info> [1628782765.7276] hostname: hostname changed from (none) to "ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel-"
```

NM finishes, then the `gcp-hostname` starts, fetches the hostname from the GCP metadata service, and truncates it:

```
Aug 12 15:39:25.798521 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- NetworkManager[1205]: <info> [1628782765.7984] manager: NetworkManager state is now CONNECTED_GLOBAL
Aug 12 15:39:25.799396 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- NetworkManager[1205]: <info> [1628782765.7993] manager: startup complete
Aug 12 15:39:25.803501 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Started Network Manager Wait Online.
Aug 12 15:39:25.813118 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Starting Set GCP Transient Hostname...
Aug 12 15:39:25.822096 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Starting Configures OVS with proper host networking configuration...
Aug 12 15:39:25.826267 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- configure-ovs.sh[1272]: + touch /var/run/ovs-config-executed
Aug 12 15:39:25.843070 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- configure-ovs.sh[1272]: + NM_CONN_OVERLAY=/etc/NetworkManager/systemConnectionsMerged
Aug 12 15:39:25.843070 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- configure-ovs.sh[1272]: + NM_CONN_UNDERLAY=/etc/NetworkManager/system-connections
Aug 12 15:39:25.843070 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- configure-ovs.sh[1272]: + '[' -d /etc/NetworkManager/systemConnectionsMerged ']'
Aug 12 15:39:25.843070 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- configure-ovs.sh[1272]: + NM_CONN_PATH=/etc/NetworkManager/systemConnectionsMerged
Aug 12 15:39:25.843070 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- configure-ovs.sh[1272]: + MANAGED_NM_CONN_SUFFIX=-slave-ovs-clone
Aug 12 15:39:25.898075 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm nm-dispatcher[1251]: Error: Device '' not found.
Aug 12 15:39:25.898654 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1267]: Aug 12 15:39:25.854 INFO Fetching http://metadata.google.internal/computeMetadata/v1/instance/hostname: Attempt #1
Aug 12 15:39:25.898654 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1267]: Aug 12 15:39:25.859 INFO Fetch successful
Aug 12 15:39:25.899120 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm configure-ovs.sh[1272]: + rpm -qa
Aug 12 15:39:25.888026 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm systemd-hostnamed[1219]: Changed static host name to 'ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm'
Aug 12 15:39:25.914627 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm nm-dispatcher[1251]: Error: Device '' not found.
Aug 12 15:39:25.914669 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1267]: ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel-ci.internal is longer than 63 characters, using truncated hostname
Aug 12 15:39:25.914669 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1267]: setting static hostname to ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm
Aug 12 15:39:25.914732 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm configure-ovs.sh[1272]: + grep -q openvswitch
Aug 12 15:39:25.888568 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm NetworkManager[1205]: <info> [1628782765.8885] hostname: hostname changed from "ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel-" to "ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm"
Aug 12 15:39:25.892952 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm systemd[1]: Started Set GCP Transient Hostname.
```

Then the `node-valid-hostname` service starts and happily reports that the hostname is sane:

```
Aug 12 15:39:31.181426 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm systemd[1]: Starting Wait for a non-localhost hostname...
Aug 12 15:39:31.185097 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1393]: waiting for non-localhost hostname to be assigned
Aug 12 15:39:31.186621 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1393]: node identified as ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm
Aug 12 15:39:31.187551 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm systemd[1]: Started Wait for a non-localhost hostname.
```

My thought is that perhaps the `gcp-hostname` service could be enhanced to remove the NM generated `search` entry from `resolv.conf` and add a valid entry.
(In reply to Micah Abbott from comment #6)
> My thought is that perhaps the `gcp-hostname` service could be enhanced to
> remove the NM generated `search` entry from `resolv.conf` and add a valid
> entry.

I might be missing something. To me the obvious solution would be to set a correct hostname in the beginning instead of setting a wrong one and then cleaning this up afterwards. As far as I understand, NM's behavior is correct, and so is the resolv.conf entry, since it matches the domain of the hostname.

Micah, could you please assign this bug to the right component for the gcp-hostname service, or to the component that is responsible for setting the wrong hostname in the first place?
(In reply to Till Maas from comment #7) > (In reply to Micah Abbott from comment #6) > > > My thought is that perhaps the `gcp-hostname` service could be enhanced to > > remove the NM generated `search` entry from `resolv.conf` and add a valid > > entry. > > I might be missing something. To me the obvious solution would be to set a > correct hostname in the beginning instead of setting a wrong one and then > cleaning this up afterwards. As far as I understand, NM's behavior is > correct and even the resolv.conf entry since it matches the domain of the > hostname. IIRC, we tried to solve the valid hostname problem in the best way possible given the circumstances. I believe the root cause of the too-long hostname was a combination of the node ID generated by the CI jobs and the hostname being offered by the GCP DHCP service. I also think we discussed trying to shorten the node ID used by the CI jobs, but weren't able to come to a satisfactory agreement with all involved. I do agree that NM is behaving properly in this case; the problem lies in the hostname input which NM is operating on. > Micah, please assign this bug to the right component for the gcp-hostname > service or to the component that is responsible for setting the wrong > hostname in the first place? I think we can reassign this to the RHCOS team and we can try to sort this out on our end.
After long previous discussions on this, the basic conclusion here is that *only on GCP* we know that we can safely truncate the hostname, because the GCP DNS resolvers also truncate it that way.

I think what we want here is to:

- Push truncation support into NetworkManager
- Teach the MCO to add a drop-in `/etc/NetworkManager/conf.d/05-gcp-truncation.conf` that has e.g. `[dns]\ntruncation=last-dot`

And the NetworkManager code would apply the truncation logic to both the hostname and the DNS search domains.

Having the MCO code try to hackily clean up `/etc/resolv.conf` after it's been written by NM is a losing proposition because, unlike the hostname case, we can't (AFAIK) take ownership of writing it.
If you (NM developers) agree, can you take the action of filing an issue/BZ against NM and then we'll mark it as a dependency of this one?
> I also think we discussed trying to shorten the node ID used by the CI jobs, I believe this issue can be hit by regular users and customers too, not just our CI jobs. Having long cluster identifiers is a reasonable thing to do.
(In reply to Micah Abbott from comment #6)
> From my read of the logs, it looks like systemd originally sets the hostname
> with a trailing `-`
>
> ```
> Aug 12 15:39:22.707401 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Set hostname to <ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel->.
> ```

(In reply to Colin Walters from comment #10)
> After long previous discussions on this, the basic conclusion here is that
> *only on GCP*
> we know that we can safely truncate the hostname, because the GCP DNS
> resolvers also truncate it that way.

Please elaborate: why would it be useful to have a search list entry that is generated like this? As far as I understand https://github.com/kubevirt/kubevirt/issues/5447#issuecomment-926646739 you suggest that the search list should be

search c.openshift-gce-devel-ci.internal google.internal c

So as far as I understand, this might lead to lookups like example.c, which don't seem to be useful to me.

Or in case a FQDN is something like

ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com.openshift-gce-devel-ci.internal

it would be saved in the kernel as

ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com.openshift-gce-deve

and then truncated by that logic to

ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com

leading to a search list like

search com

which would then translate the lookup for the hostname example to example.com

> I think what we want here is to:
>
> - Push truncation support into NetworkManager

Initially, systemd is setting the (wrongly?) truncated hostname, why is that not the correct place to fix this?

> - Teach the MCO to add a drop-in `/etc/NetworkManager/conf.d/05-gcp-truncation.conf that has e.g. `[dns]\ntruncation=last-dot`

Why the last dot and not the first dot? When should it be truncated, only when the hostname is 64 bytes? Why should it be a configuration option?
(In reply to Colin Walters from comment #11) > If you (NM developers) agree, can you take the action of filing an issue/BZ > against NM and then we'll mark it as a dependency of this one? I would prefer to get more clarity on the requirements before being able to file a meaningful BZ.
Hi all, As far as I understand, you found the problem, which is great. Thanks for that! Is there anything I can do? We really need a fix for this issue to verify our upstream code on GCP.
(In reply to Erkan Erol from comment #14) > Hi all, > > As far as I understand, you found the problem, which is great. Thanks for > that! Is there anything I can do? We really need a fix for this issue to > verify our upstream code on GCP. What is your opinion about the expected behavior as I asked in comment:13?
(In reply to Till Maas from comment #13)
> Or in case a FQDN is something like
>
> ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com.openshift-gce-devel-ci.internal
>
> it would be saved in the kernel as
>
> ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com.openshift-gce-deve
>
> and then truncated by that logic to
>
> ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com
>
> leading to a search list like
>
> search com
>
> which would then translate the lookup for the hostname example to example.com

I think the proposal is that instead of letting other components set the hostname in kernel, this should be done by NM, which would truncate the long hostname received from DHCP:

ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com.openshift-gce-devel-ci.internal

to something like:

(1) ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9

or

(2) ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com

Then it would add a search domain to resolv.conf accordingly. The code that handles this is in NM:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1.33.4-dev/src/core/dns/nm-dns-manager.c#L1263-1280

Note that domain_is_valid() on Fedora uses the public suffix list

https://github.com/rockdaboot/libpsl

to avoid adding top level domains (e.g. "com") as search domains. On RHEL, the PSL is not used by NM. Therefore, the search domain will be:

for (1): none
for (2):
  Fedora: "ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com"
  RHEL : "com"
(In reply to Beniamino Galvani from comment #16) > > (In reply to Till Maas from comment #13) > > > Or in case a FQDN is something like > > > > ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com.openshift-gce-devel-ci.internal > > > > it would be saved in the kernel as > > > > ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com.openshift-gce-deve > > > > and then truncated by that logic to > > > > > > ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9.com > > > > leading to a search list like > > > > search com > > > > which would then translate the lookup for the hostname example to example.com > > I think the proposal is that instead of letting other components set the > hostname in kernel, this should be done by NM, which would truncate the long > hostname received from DHCP: I don't see the proposal that other tools will not set/change the hostname in kernel. So it would be great if this can be confirmed by the team for the other tools. Is this issue only about DHCP, though? It seems that DHCP is a different scenario. When looking at how other projects handle DHCP, they truncate at first dot or HOST_MAX_LEN, which seems plausible to me, too: https://github.com/systemd/systemd/pull/7616 https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/572 Also, in case of DHCP, NM has all the information, so it can both 1) set the truncated hostname for the kernel (without domain) 2) Set the correct search domain in resolv.conf (NM has then access to the untruncated domain from DHCP). Nevertheless, this would not fix the issue with reading a truncated hostname from the kernel (which was reported here) and in that case, it is the job of the tool setting the kernel hostname to properly truncate it (first dot or HOST_MAX_LEN). 
> https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1.33.4-dev/src/core/dns/nm-dns-manager.c#L1263-1280
>
> Note that domain_is_valid() on Fedora uses the public suffix list
>
> https://github.com/rockdaboot/libpsl
>
> to avoid adding top level domains (e.g. "com") as search domains. On RHEL,

What is the value of having a truncated search domain? Are external resolvers trying to add another domain on top of that? This seems like a problematic approach since it will lead to a lot of unnecessary DNS traffic.
(In reply to Colin Walters from comment #10)
> After long previous discussions on this, the basic conclusion here is that
> *only on GCP*
> we know that we can safely truncate the hostname, because the GCP DNS
> resolvers also truncate it that way.

This seems to be a bad assumption.

> I think what we want here is to:
>
> - Push truncation support into NetworkManager
> - Teach the MCO to add a drop-in
> `/etc/NetworkManager/conf.d/05-gcp-truncation.conf that has e.g.
> `[dns]\ntruncation=last-dot`

In this BZ you assume/propose to truncate at the last dot; in https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677#note_691128 it is the first dot. I am not sure what changed in the meantime or whether you confused first and last. Nevertheless, it proves again that we need clear requirements/user stories/acceptance criteria from users/customers, and then we can develop a proper solution. Requests for a specific solution without providing the background information led to problems in the past and seem to do so here, too.
For the way we're doing OCP/CoreOS today, the "mco-hostname" special bits are injected via Ignition and run in the real root.

Broadly speaking, we don't need DNS until *after* mco-hostname.service runs. We do need to contact the link-local IP 169.254.169.254 for metadata, but doing so doesn't involve DNS.

In other words, I think for now we can safely ignore all behavior from the initial NetworkManager run in the initramfs, including the temporarily invalid hostname, etc.

(It would clearly be better to have this logic built into NM and keyed off being in GCP, but again we can ignore that for now)

Now: our truncation of the hostname is clearly predicated on the search domains working. Hmm, does NM not log the search domains received from GCP?

I just launched a cluster and got:

```
sh-4.4# cat /host/etc/resolv.conf
# Generated by NetworkManager
search c.openshift-gce-devel-ci.internal google.internal c.openshift-gce-devel-ci.inte
```

What *is* truncating that 3rd search domain?

Based on the original ticket, the problem is the trailing `-` there breaking some Go resolv.conf parsing code. I think offhand, we only need the first two, so simply dropping it instead of truncating would work.
Here's the trace logging of the DHCP response from NM:

```
> [1635948966.3423] dhcp4 (ens4): received ACK of 10.0.0.4 from 10.0.0.1
> [1635948966.3423] dhcp4 (ens4): client event 2
> [1635948966.3423] dhcp4 (ens4): lease available (new)
> [1635948966.3451] dhcp4 (ens4): option dhcp_lease_time => '86400'
> [1635948966.3452] dhcp4 (ens4): option dhcp_server_identifier => '169.254.169.254'
> [1635948966.3452] dhcp4 (ens4): option domain_name => 'c.openshift-gce-devel-ci.internal'
> [1635948966.3452] dhcp4 (ens4): option domain_name_servers => '169.254.169.254'
> [1635948966.3453] dhcp4 (ens4): option domain_search => 'c.openshift-gce-devel-ci.internal google.internal'
> [1635948966.3453] dhcp4 (ens4): option expiry => '1636035366'
> [1635948966.3453] dhcp4 (ens4): option host_name => 'ci-ln-jptyi32-72292-zqdxg-master-0.c.openshift-gce-devel-ci.internal'
> [1635948966.3454] dhcp4 (ens4): option interface_mtu => '1460'
> [1635948966.3454] dhcp4 (ens4): option ip_address => '10.0.0.4'
> [1635948966.3454] dhcp4 (ens4): option next_server => '10.0.0.1'
> [1635948966.3454] dhcp4 (ens4): option ntp_servers => '169.254.169.254'
> [1635948966.3454] dhcp4 (ens4): option requested_broadcast_address => '1'
> [1635948966.3454] dhcp4 (ens4): option requested_domain_name => '1'
> [1635948966.3454] dhcp4 (ens4): option requested_domain_name_servers => '1'
> [1635948966.3454] dhcp4 (ens4): option requested_domain_search => '1'
> [1635948966.3454] dhcp4 (ens4): option requested_host_name => '1'
> [1635948966.3454] dhcp4 (ens4): option requested_interface_mtu => '1'
> [1635948966.3454] dhcp4 (ens4): option requested_ms_classless_static_routes => '1'
> [1635948966.3454] dhcp4 (ens4): option requested_nis_domain => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_nis_servers => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_ntp_servers => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_rfc3442_classless_static_routes => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_root_path => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_routers => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_static_routes => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_subnet_mask => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_time_offset => '1'
> [1635948966.3455] dhcp4 (ens4): option requested_wpad => '1'
> [1635948966.3455] dhcp4 (ens4): option rfc3442_classless_static_routes => '10.0.0.1/32 0.0.0.0 0.0.0.0/0 10.0.0.1'
> [1635948966.3455] dhcp4 (ens4): option routers => '10.0.0.1'
> [1635948966.3455] dhcp4 (ens4): option subnet_mask => '255.255.255.255'
[1635948966.3455] dhcp4 (ens4): state changed unknown -> bound, address=10.0.0.4
```

From this,

> [1635948966.3453] dhcp4 (ens4): option domain_search => 'c.openshift-gce-devel-ci.internal google.internal'

seems clear. So presumably it is NM synthesizing the invalid 3rd entry there?
Not replying to Colin's comments 19 and 20 here. I just want to track some information from more discussion in the NM team.

Regarding hostnames from DHCP, there is certainly a possibility for improvement and we will continue to work on https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677 - the current status is to get it ready with the truncation rule first dot/64 chars (whichever is shorter). Additional requirements, like supporting truncation already at 63 chars, are handled as a second feature request. This will be tracked in a new NM BZ with a clear user story and acceptance criteria, so it can be reviewed.

Regarding this BZ, according to the log line

Aug 12 15:39:25.727688 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- NetworkManager[1205]: <info> [1628782765.7276] hostname: hostname changed from (none) to "ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel-"

something else is already changing the hostname and NM uses that one instead of the DHCP one. So to unblock this issue, the team working on the code that sets the hostname should implement the correct truncation. Eventually, once the DHCP fix is available via NM, maybe they can stop setting the hostname and let NM set it from DHCP.
(In reply to Colin Walters from comment #19)
> ```
> sh-4.4# cat /host/etc/resolv.conf
> # Generated by NetworkManager
> search c.openshift-gce-devel-ci.internal google.internal c.openshift-gce-devel-ci.inte
> ```
>
> What *is* truncating that 3rd search domain?

It's generated by NM taking the domain part of the current system hostname. See:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1.33.4-dev/src/core/dns/nm-dns-manager.c#L1263-1280
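As a rough cross-check of this explanation, the values quoted earlier in this thread can be compared in a few lines of Python. This assumes that the resolv.conf output and the DHCP trace come from the same node, and that the system hostname was the first 64 bytes of the DHCP host_name; neither is confirmed in the thread.

```python
# Values copied from the resolv.conf output and the DHCP trace quoted in this thread.
resolv_search = (
    "c.openshift-gce-devel-ci.internal google.internal c.openshift-gce-devel-ci.inte"
).split()
dhcp_domain_search = "c.openshift-gce-devel-ci.internal google.internal".split()
dhcp_host_name = "ci-ln-jptyi32-72292-zqdxg-master-0.c.openshift-gce-devel-ci.internal"

# Entries present in resolv.conf that DHCP did not supply.
extra = [d for d in resolv_search if d not in dhcp_domain_search]

# Assumption: the system hostname was the 64-byte truncation of the DHCP host_name.
system_hostname = dhcp_host_name[:64]
derived = system_hostname.partition(".")[2]  # domain part after the first dot

print(extra)
print(derived)
```

Under those assumptions, the one extra entry in resolv.conf is exactly the domain part of the truncated hostname.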
OK, so two things: First, if the hostname is configured to not be managed by NM, NM should not do that injection into resolv.conf: https://github.com/openshift/machine-config-operator/blob/5b4670f644e10a425ec9858cab25d56b5542a84c/templates/common/gcp/files/etc-networkmanager-conf.d-hostname.yaml#L12 Second, I think regardless, NM should not inject truncated and potentially invalid search domains based on the hostname. I think this is likely to be fixed by https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677 But it'd still be good to have extra verification around that search path injection. > something else is already changing the hostname and NM uses that one instead of the DHCP one. Yes, we changed OpenShift to handle it on GCP as a stop gap until NM is able to do the same. > Eventually, once the DHCP fix is available via NM, maybe the can stop setting the hostname and let NM set it from DHCP. Yes, agreed.
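One possible shape for such extra verification is sketched below in Python. It assumes the usual hostname label rules (letters, digits and hyphens, 1 to 63 characters per label, no leading or trailing hyphen). The function name is hypothetical, and this is not NM's domain_is_valid().

```python
import re

# One DNS label: 1-63 letters, digits or hyphens, not starting or ending with a hyphen.
_LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_plausible_search_domain(domain: str) -> bool:
    """Hypothetical syntactic check for a resolv.conf search entry."""
    if not domain or len(domain) > 253:
        return False
    return all(_LABEL_RE.fullmatch(label) for label in domain.split("."))


candidates = [
    "c.openshift-gce-devel-ci.internal",
    "google.internal",
    "c.openshift-gce-devel-",         # truncated, ends with a hyphen
    "c.openshift-gce-devel-ci.inte",  # truncated, but syntactically fine
]
kept = [d for d in candidates if is_plausible_search_domain(d)]
print(kept)
```

A purely syntactic check like this would drop the entry ending in `-`, but it cannot detect a truncated entry such as `c.openshift-gce-devel-ci.inte`, which still looks like a valid domain.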
(In reply to Colin Walters from comment #23)
> OK, so two things:
>
> First, if the hostname is configured to not be managed by NM, NM should not
> do that injection into resolv.conf:
> https://github.com/openshift/machine-config-operator/blob/5b4670f644e10a425ec9858cab25d56b5542a84c/templates/common/gcp/files/etc-networkmanager-conf.d-hostname.yaml#L12

This needs further discussion since the config option does not mention anything about search domains. It mentions only the hostname, which NM does not touch.

> Second, I think regardless, NM should not inject truncated and potentially
> invalid search domains based on the hostname.

As far as I understand, NM cannot distinguish truncated hostnames from hostnames that are exactly 64 chars long. So can you please state your expectations for the following cases:

- Given a system with a system hostname that is 64 characters long and contains a "." and ends with a "-", when NM configures the DNS search domains, then ...
- Given a system with a system hostname that is 64 characters long and contains one or more ".", when NM configures the DNS search domains, then ...
- Given a system with a system hostname that is less than 64 characters long and contains one or more ".", when NM configures the DNS search domains, then ...

Currently, my expectation is that in all these cases NM will add the part after the first "." to the DNS search domains. If the system hostname is already truncated, then it is a configuration error IMHO.

- Given a system with a system hostname that is 64 characters long and does not contain a ".", when NM configures the DNS search domains, then ...
- Given a system with a system hostname that is less than 64 characters long and does not contain a ".", when NM configures the DNS search domains, then ...

My expectation: NM will not change the DNS search domains based on the system hostname.
AFAIU, this behavior is expected regardless of whether DHCP is used. If it also depends on whether DHCP is enabled or on other factors, we need more criteria.

> I think this is likely to be fixed by
> https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677

No, that patch is about changing the system hostname based on DHCP hostnames, so it does not affect the behavior based on an already configured hostname. (Beniamino, please correct me if necessary).

> But it'd still be good to have extra verification around that search path
> injection.
>
> > something else is already changing the hostname and NM uses that one instead of the DHCP one.
>
> Yes, we changed OpenShift to handle it on GCP as a stop gap until NM is able
> to do the same.

Do you mean as a result of this BZ? I am asking because this BZ occurred because there is something (not NM) storing a wrongly truncated system hostname, which needs to be fixed.
> No, that patch is about changing the system hostname based on DHCP hostnames, so it does not affect the behavior based on an already configured hostname. (Beniamino, please correct me if necessary). The search domain addition is *derived* from the hostname. I think (hope) that if NM is truncating the hostname correctly, then that search domain code: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1.33.4-dev/src/core/dns/nm-dns-manager.c#L1263-1280 will also not incorrectly truncate the search domain addition.
(In reply to Colin Walters from comment #25)
> > No, that patch is about changing the system hostname based on DHCP hostnames, so it does not affect the behavior based on an already configured hostname. (Beniamino, please correct me if necessary).
>
> The search domain addition is *derived* from the hostname. I think (hope)
> that if NM is truncating the hostname correctly, then that search domain
> code:
> https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1.33.4-dev/src/core/dns/nm-dns-manager.c#L1263-1280
> will also not incorrectly truncate the search domain addition.

There are two hostnames involved in this BZ. One is the system hostname, which is managed outside of NM and is truncated incorrectly. NM will use the system hostname to set up the search domain; you requested that this should be something that can be disabled, or should not be done, if NM does not handle the hostname. This needs discussion in the NM team.

The change that is agreed is that the hostname from DHCP will be truncated properly before making it the system hostname. In this case, NM should still use the full search domain (it has access to it through DHCP), so there will be no truncation for the search list, only for the hostname. This will not fix this BZ, since something else is setting the system hostname without proper truncation.
Colin, can you please update the bug with a tentative target date when you think it will be fixed?
> This will not fix this BZ, since something else is setting the system hostname without proper truncation. After NM does the truncation internally, we will drop the MCO handling of hostname on GCP, so it will be solely NM which will handle the hostname.
xref with the upstream issue - https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/572
(In reply to Colin Walters from comment #28) > > This will not fix this BZ, since something else is setting the system hostname without proper truncation. > > After NM does the truncation internally, we will drop the MCO handling of > hostname on GCP, so it will be solely NM which will handle the hostname. Can you please confirm that the acceptance criteria from comment 24 work for you regarding this bug? Also, you don't need the NM change, the code that sets the hostname in /proc/sys/kernel/hostname just needs to be fixed to truncate after the first dot if necessary (AFAICS), so this is not blocked by NM.
It seems we both keep restating what was said before. I have re-read your replies, but at this time I still believe you are either not understanding me, or you are wrong.

First I'm going to try restating what I think are *facts*:

- There are two things involved: the hostname and the search domain
- NM has an option to allow us to control the hostname, but does not offer an API to control the search domain
  (I think there's an option to stop it writing /etc/resolv.conf entirely, but that means more burden on OCP here)
- Today in OCP, we changed things so that the MCO writes the hostname, but NM still writes the search domain
- This BZ is about the search domain
- The search domain in NM is *derived* from the hostname - and specifically from NM's view of the hostname!

And now, to restate https://bugzilla.redhat.com/show_bug.cgi?id=2008521#c28
if OCP *stops* doing the truncation external to NM (i.e. NM handles the truncation of the hostname) then it will also be much easier for NM to correctly generate the search domain, because that's derived from the hostname.

Now, regarding your question about the exact behavior from https://bugzilla.redhat.com/show_bug.cgi?id=2008521#c24

Really the first pivotal question to answer is: Is this *specific to GCP* or not? In OCP today, it is. That means the problem domain is much less generic; we know exactly what we need to do.

I think this sub-thread was mostly covered by https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677 where it seemed other NM developers wanted to do it across the board and not specialize on GCP (which seems valid too). I would defer this to the NM team.
I think we have to distinguish the two cases:

1) when NM finds the truncated hostname already set;

2) when NM sets the hostname from DHCP (or via reverse DNS) and handles truncation.

Currently 1) happens in OCP because the MCO writes the truncated hostname. In this case, NM doesn't know whether the system hostname was truncated or not, so there is little to change there. What we could improve is that NM doesn't write search domains that are clearly invalid, like those ending with '-'.

Anyway, I think we agree that the better solution is 2). In that case NM would be in control of the hostname and would truncate it according to these rules:

- if the first MAX_LENGTH characters contain a dot, then the hostname is truncated to the dot. The part after the first dot of the original hostname is used as search domain in resolv.conf.

- otherwise, truncate to the first MAX_LENGTH characters. The part after the first dot of the original hostname is used as search domain in resolv.conf.

MAX_LENGTH is 64. Does this sound ok?

According to [1] in some scenarios there is the need to enforce a hostname not longer than 63 characters. In that case, we can make MAX_LENGTH configurable with a global config option (e.g. main.hostname-max-length). Thoughts?

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677#note_692143

> I think this sub-thread was mostly covered by
> https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677
> where it seemed other NM developers wanted to do it across the board
> and not specialize on GCP (which seems valid too). I would defer
> this to the NM team.

Yes, the result of the previous discussion in the MR is that we want to do the truncation unconditionally.
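The two truncation rules proposed in this comment can be sketched in Python as follows. This is only an illustration, not NetworkManager code: the function name `truncate_hostname` and its return shape are hypothetical, and the sketch assumes that "truncated to the dot" means the first dot and that a hostname already within MAX_LENGTH is left unchanged.

```python
from typing import Optional, Tuple

MAX_LENGTH = 64  # value proposed in the comment; 63 may be needed in some scenarios


def truncate_hostname(original: str,
                      max_length: int = MAX_LENGTH) -> Tuple[str, Optional[str]]:
    """Return (hostname, search_domain) for a possibly overlong hostname.

    Hypothetical helper illustrating the proposed rules only.
    """
    # The search domain always comes from the *original* name:
    # everything after its first dot, or None if there is no dot.
    head, dot, tail = original.partition(".")
    search_domain = tail if dot and tail else None

    # Assumption: names that already fit are not modified.
    if len(original) <= max_length:
        return original, search_domain

    # Rule 1: a dot within the first max_length characters means
    # the hostname is cut at the (first) dot.
    if "." in original[:max_length]:
        return head, search_domain

    # Rule 2: otherwise keep the first max_length characters.
    return original[:max_length], search_domain


if __name__ == "__main__":
    fqdn = ("ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9"
            ".c.openshift-gce-devel-ci.internal")
    print(truncate_hostname(fqdn))
```

With the hostname from comment 4, this would give the short name `ci-op-dgnrb143-44266-42t6h-worker-c-ldlk9` and the full search domain `c.openshift-gce-devel-ci.internal`, so no entry ending in '-' is produced.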
(In reply to Colin Walters from comment #31)
> It seems we keep both restating what was said before. I have re-read your
> replies, but at this time I still believe you are either not understanding
> me, or you are wrong.

Yes, I am not understanding you, since you have not provided the information that I requested. Please tell me which code is writing the (untruncated or badly truncated) hostname to /proc/sys/kernel/hostname, which causes the problem highlighted in comment:6

| From my read of the logs, it looks like systemd originally sets the hostname with a trailing `-`
| ```
| Aug 12 15:39:22.707401 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Set hostname to <ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel->.
| ```

Fix this to set the correct hostname (truncated after the first dot) and this bug is fixed.

> First I'm going to try restating what I think are *facts*:
>
> - There are two things involved: the hostname and the search domain
> - NM has an option to allow us to control the hostname, but does not offer
> an API to control the search domain
> (I think there's an option to stop it writing /etc/resolv.conf entirely,
> but that means more burden on OCP here)
> - Today in OCP, we changed things so that the MCO writes the hostname, but
> NM still writes the search domain
> - This BZ is about the search domain
> - The search domain in NM is *derived* from the hostname - and specifically
> from NM's view of the hostname!

No, it is not NM's view, it is the system's hostname that is stored in the kernel at the time NM configures the search domains.

> And now, to restate https://bugzilla.redhat.com/show_bug.cgi?id=2008521#c28
> if OCP *stops* doing the truncation external to NM (i.e. NM handles the
> truncation of the hostname)

OCP does not do the truncation at the right point in time, so there is nothing to stop here. It is very likely that the kernel truncates the hostname due to its maximum size.
I guess you are referring to these log lines:

Aug 12 15:39:25.914669 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1267]: ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel-ci.internal is longer than 63 characters, using truncated hostname
Aug 12 15:39:25.914669 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1267]: setting static hostname to ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm

These show that something is happening too late, so it is not relevant here. What's important, as Beniamino asked in comment:4, is what is causing this untruncated hostname to be stored:

Aug 12 15:39:22.707401 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Set hostname to <ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel->.

> I think this sub-thread was mostly covered by
> https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677
> where it seemed other NM developers wanted to do it across the board and not
> specialize on GCP (which seems valid too).
> I would defer this to the NM team.

NM can address the feature request but, AFAIU, it will not change anything as long as something is writing the bad hostname to the kernel before NM does anything, and it will also not fix the bug that the kernel hostname is invalid.
(In reply to Beniamino Galvani from comment #32)
I replied in https://bugzilla.redhat.com/show_bug.cgi?id=2033643#c3 since this is about discussing the new feature in NM, which is tracked in that other bug report.
> Please tell me which code is writing the (untruncated or badly truncated) hostname

https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/usr-local-bin-mco-hostname.yaml

(This was linked in https://bugzilla.redhat.com/show_bug.cgi?id=2008521#c6 too )

I think the hostname is not "badly" truncated. We know the hostname is valid when truncated this way. The problem in this bug is that the injected search domain computed from that hostname is invalid. Which goes to this:

(In reply to Beniamino Galvani from comment #32)
> ...
> What we could improve is
> that NM doesn't write search domains that are clearly invalid, like those
> ending with '-'.

Right, I'd need to double check this, but I think it would work if NM skipped writing a hostname-derived search domain if it would be invalid. Looking at the generated resolv.conf:
https://github.com/kubevirt/kubevirt/issues/5447#issuecomment-926646739

$ cat /etc/resolv.conf
# Generated by NetworkManager
search c.openshift-gce-devel-ci.internal google.internal c.openshift-gce-devel-
nameserver 169.254.169.254
$

I think the DHCP-provided c.openshift-gce-devel-ci.internal search domain is all we need.

(Actually, do other networking systems do this hostname-derived search domain injection at all? Does systemd-networkd do it? Does other unix like FreeBSD?
In other words, is this really a NetworkManager-specific problem? What's the origin of this hostname-derived search domain injection?)
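A minimal sketch of the idea of skipping search domains that are clearly invalid. It assumes the usual RFC 1123 label rules (letters, digits and hyphens, 1 to 63 characters per label, no leading or trailing hyphen); the check NM would actually apply may differ, and `is_valid_search_domain` and `filter_search_list` are hypothetical names used only for illustration.

```python
import re
from typing import List

# One DNS label: alphanumeric at both ends, hyphens allowed in between,
# 1 to 63 characters in total (assumed RFC 1123 style rules).
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_search_domain(domain: str) -> bool:
    """Return True if every label of the domain looks syntactically valid."""
    if domain.endswith("."):
        domain = domain[:-1]  # tolerate a single trailing root dot
    if not domain or len(domain) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in domain.split("."))


def filter_search_list(domains: List[str]) -> List[str]:
    """Drop entries that fail the syntactic check, keeping the order."""
    return [d for d in domains if is_valid_search_domain(d)]


if __name__ == "__main__":
    # Search list taken from the resolv.conf quoted in this comment.
    search = [
        "c.openshift-gce-devel-ci.internal",
        "google.internal",
        "c.openshift-gce-devel-",
    ]
    print(filter_search_list(search))
```

Note that a purely syntactic check like this would only catch entries such as `c.openshift-gce-devel-`. A truncated but well-formed entry such as `c.openshift-gce-devel-c`, which also shows up in this bug, would pass.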
(In reply to Colin Walters from comment #35) > > Please tell me which code is writing the (untruncated or badly truncated) hostname > > https://github.com/openshift/machine-config-operator/blob/master/templates/ > common/_base/files/usr-local-bin-mco-hostname.yaml > > (This was linked in https://bugzilla.redhat.com/show_bug.cgi?id=2008521#c6 > too ) This is not what I asked for. This is causing the log entries Aug 12 15:39:25.914669 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1267]: ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel-ci.internal is longer than 63 characters, using truncated hostname Aug 12 15:39:25.914669 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm mco-hostname[1267]: setting static hostname to ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm to quote my comment:33 again: Which show that something is happening too late so it is not relevant here. What's important is since Beniamino asked in comment:4, what is causing this the untruncated hostname to be stored: Aug 12 15:39:22.707401 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Set hostname to <ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel->.
(In reply to Colin Walters from comment #35)
> I think the hostname is not "badly" truncated. We know the hostname is
> valid when truncated this way. The problem in this bug is that the injected
> search domain computed from that hostname is invalid. Which goes to this:

Just to clarify again, NM does not derive the search domain from this hostname and the script does not seem to truncate it badly. It is also not what needs to be discussed or analyzed here.

> (In reply to Beniamino Galvani from comment #32)
> > ...
> > What we could improve is
> > that NM doesn't write search domains that are clearly invalid, like those
> > ending with '-'.
>
> Right, I'd need to double check this, but I think it would work if NM
> skipped writing a hostname-derived search domain if it would be invalid.
> Looking at the generated resolv.conf:
> https://github.com/kubevirt/kubevirt/issues/5447#issuecomment-926646739
>
> $ cat /etc/resolv.conf
> # Generated by NetworkManager
> search c.openshift-gce-devel-ci.internal google.internal c.openshift-gce-devel-
> nameserver 169.254.169.254
> $
>
> I think the DHCP-provided c.openshift-gce-devel-ci.internal search domain is
> all we need.

Yes, it could work. There are a lot of other solutions that might work. That does not make them correct, given that it is still not justified or explained why the hostname is set to the bad one ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- in the first place.

I checked again with the engineers since the initial analysis was a while ago and I wanted to make sure that I remember the details correctly. NM will not change the hostname if it is already set to a static hostname (which seems to be the case here). So even if NM starts to implement the RFE that was proposed in https://bugzilla.redhat.com/show_bug.cgi?id=2033643 NM will still add the domain from the static hostname to the search list, so this part will still need to be fixed.
I am really not following why you keep diverting from this point. This needs an analysis from someone who knows this setup/OpenShift/CoreOS. > (Actually, do other networking systems do this hostname-derived search > domain injection at all? Does systemd-networkd do it? Does other unix like > FreeBSD? > In other words, is this really a NetworkManager-specific problem? What's > the origin of this hostname-derived search domain injection?) Deriving the search domain from the hostname is the default if no search list is specified (see https://man7.org/linux/man-pages/man5/resolv.conf.5.html). And according to https://man7.org/linux/man-pages/man2/gethostname.2.html there is also no dot based truncation. The default seems to be similar on FreeBSD: https://www.freebsd.org/cgi/man.cgi?resolv.conf
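To make the derivation concrete, here is a small sketch of the arithmetic using the hostname from the logs in this bug. It assumes a 64-byte limit on the hostname and the resolv.conf(5) rule that the local domain is everything after the first '.'. The function names are hypothetical, and the sketch says nothing about which component performs the truncation.

```python
from typing import Optional

HOST_NAME_MAX = 64  # assumed limit for the kernel hostname, in bytes


def truncate_to_limit(name: str, limit: int = HOST_NAME_MAX) -> str:
    """Plain byte truncation with no awareness of dots (ASCII names only)."""
    return name.encode("ascii")[:limit].decode("ascii")


def default_search_domain(hostname: str) -> Optional[str]:
    """Everything after the first dot of the hostname, or None without a dot."""
    _, dot, domain = hostname.partition(".")
    return domain if dot and domain else None


if __name__ == "__main__":
    full = ("ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm"
            ".c.openshift-gce-devel-ci.internal")
    truncated = truncate_to_limit(full)
    print(truncated)
    print(default_search_domain(truncated))  # domain derived from the cut name
    print(default_search_domain(full))       # domain derived from the full name
```

The byte-truncated name ends in `-`, so the domain derived from it is `c.openshift-gce-devel-`, which matches the bad entry in the reported resolv.conf. A name cut at the first dot has no domain part at all, so nothing would be derived from it.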
> to quote my comment:33 again: > Which show that something is happening too late so it is not relevant here. What's important is since Beniamino asked in comment:4, what is causing this the untruncated hostname to be stored: > Aug 12 15:39:22.707401 ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel- systemd[1]: Set hostname to <ci-op-rp567tzw-44266-78d6h-worker-c-n2wjm.c.openshift-gce-devel->. Sorry, you're right I missed this distinction. I had thought that log message was systemd reflecting the MCO script explicitly setting the hostname, but indeed you're right, it's a separate thing. In systemd v239 (rhel8) as far as I can tell, that code is just reading /etc/hostname https://github.com/systemd/systemd/blob/de7436b02badc82200dc127ff190b8155769b8e7/src/core/hostname-setup.c#L49 And: ``` [root@ci-ln-yld8kqt-72292-m6s5b-worker-a-lx4jn ~]# journalctl -q -b --grep='Set hostname' Jan 05 22:23:26 ci-ln-yld8kqt-72292-m6s5b-worker-a-lx4jn.c.openshift-gce-devel-c systemd[1]: Set hostname to <ci-ln-yld8kqt-72292-m6s5b-worker-a-lx4jn.c.openshift-gce-devel-c>. [root@ci-ln-yld8kqt-72292-m6s5b-worker-a-lx4jn ~]# cat /etc/hostname ci-ln-yld8kqt-72292-m6s5b-worker-a-lx4jn [root@ci-ln-yld8kqt-72292-m6s5b-worker-a-lx4jn ~]# ``` Ohhh but wait, confusingly it actually calls `gethostname()` first. So I think all we're seeing in this log message is the hostname as configured in the initramfs, which comes from NetworkManager's DHCP request¹. I may do a PR to systemd to more clearly distinguish the origin of the hostname. So...actually though you may be right in that this hostname is the origin of the corrupted `/etc/resolv.conf` entry, which is again an argument to have NM own this problem, no? ¹ more detail: because our mco-hostname service only takes effect in the real root, not the initramfs because in CoreOS we don't right now encourage initramfs customization. But having it be consistent in the initramfs and real root is another argument for having NM own this.
OK in latest Fedora CoreOS at least the log messages here are all much more informative, e.g.: ``` [root@cosa-devsh ~]# journalctl --grep=hostname | grep -v audit Jan 05 22:58:52 localhost systemd[1]: afterburn-hostname.service - Afterburn Hostname was skipped because all trigger condition checks failed. ... Jan 05 22:59:00 localhost NetworkManager[1112]: <info> [1641423540.1286] policy: set-hostname: set hostname to 'cosa-devsh' (from DHCPv4) Jan 05 22:59:00 cosa-devsh systemd-hostnamed[1181]: Hostname set to <cosa-devsh> (transient) Jan 05 22:59:00 cosa-devsh systemd-resolved[1103]: System hostname changed to 'cosa-devsh'. ... [root@cosa-devsh ~]# hostnamectl set-hostname --static foo [root@cosa-devsh ~]# journalctl --grep=hostname | grep -v audit ... Jan 05 23:00:41 foo NetworkManager[1112]: <info> [1641423641.8623] policy: set-hostname: set hostname to 'foo' (from system configuration) Jan 05 23:00:41 foo systemd-resolved[1103]: System hostname changed to 'foo'. Jan 05 23:00:41 foo systemd-hostnamed[1470]: Hostname set to <foo> (static) ```
We had a great real-time collaboration call about this. Some notes:

- We realized that afterburn is writing /sysroot/etc/hostname from the initramfs, which is the full, untruncated hostname

- systemd then switches root, and reads /etc/hostname, and *truncates it*
  https://github.com/systemd/systemd/blob/e97a3001483951f081d14ec5726ef6108da636f2/src/basic/hostname-util.c#L139
  At this point, the kernel hostname is set to a truncated 64 character version, but /etc/hostname contains an untruncated version
  Note that this systemd truncation is *not* the "first dot" truncation that mco-hostname and networkd use for DHCP that we want!
  TODO: consider changing systemd to either fail on overlong /etc/hostname or truncate as desired

- In the real root, NetworkManager runs again and does DHCP, but does not set the hostname because we told it not to and because /etc/hostname exists
  But it's at this point that NM derives the search domain from the incorrect afterburn->systemd truncated value
  (We are pretty sure it's afterburn, but still to be verified)
  And that generates /etc/resolv.conf which persists

- Now that we're past network-online.target, mco-hostname runs. It normally just sets the transient hostname, but when it runs it notices that /etc/hostname exists and changes to write a persistent (static) hostname. So confusingly, it overwrites the /etc/hostname that afterburn wrote during the first boot.

Shorter term fixes:

- Change afterburn to do first-dot truncation
- Change systemd to do truncation or fail
- Even more quickly, see if we can change mco-hostname to retrigger NM's regeneration of /etc/resolv.conf
  (Worst case, we can tear down the network) But is there a nicer way to do this? Maybe just SIGHUP of NM? Or `nmcli device reapply` or so

Longer term fixes:

- Change NM to handle truncation and own the hostname consistently - not afterburn on gcp! There's also nm-cloud-setup here.
- To me, we should never, ever be writing /etc/hostname except by explicit system administrator action. Afterburn should just be setting the *transient* hostname on each boot. This also relates to https://github.com/coreos/fedora-coreos-config/commit/4584017792c2757eddcb5dc21f4cf978ef70efac which is trying to propagate the NM state "out of band"
There was already an afterburn issue for this: https://github.com/coreos/afterburn/issues/509 so let's use that to track a fix there. I also did https://github.com/coreos/afterburn/pull/668
I also want to explicitly state here: I was basically wrong in most of my earlier comments - I was operating on some faulty assumptions around which actors were in play (e.g. I didn't think afterburn was involved) and had missed a key indicator that something else was wrong in the systemd log message that Till was correctly pointing out. Sorry about that!
Benjamin landed a fix in afterburn to do hostname truncation - https://github.com/coreos/afterburn/pull/673 There is a new afterburn release in flight and we expect to see a new package in RHCOS 4.10 with this fix.
Will we get an update on this bug when the afterburn release with the fix is available in RHCOS?
The upstream PR landed in `afterburn-5.2.0-1.rhaos4.10.el8` which was included in RHCOS 410.84.202201182334-0 on Jan 18. Moving to MODIFIED.
In practice this needs a bootimage bump.
This bug has been reported fixed in a new RHCOS build and is ready for QE verification. To mark the bug verified, set the Verified field to Tested. This bug will automatically move to MODIFIED once the fix has landed in a new bootimage.
Pre-verification passed with the latest RHCOS 4.10.0-0.nightly-2022-01-25-023600: it has the fixed version afterburn-5.2.0-1.rhaos4.10.el8.x86_64, and resolv.conf does not contain an entry ending with a hyphen

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-25-023600   True        False         6m49s   Cluster version is 4.10.0-0.nightly-2022-01-25-023600

$ oc get nodes
...
ci-ln-w7bw75k-72292-x989q-worker-c-slcn6   Ready   worker   14m   v1.23.0+06791f6

$ oc debug node/ci-ln-w7bw75k-72292-x989q-worker-c-slcn6
...
sh-4.4# rpm -qa afterburn
afterburn-5.2.0-1.rhaos4.10.el8.x86_64
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b288e48d43ed5c349e5b5621cbe6d05d750575a05d2360358cc137241c0735d2
              CustomOrigin: Managed by machine-config-operator
                   Version: 410.84.202201242137-0 (2022-01-24T21:40:37Z)

  ostree://658b35d30d5da7226bf2abeb9c318a92c1521de2ea65486bc47632f2eee4e6c6
                   Version: 410.84.202112040202-0 (2021-12-04T02:05:40Z)
sh-4.4# journalctl -q -b --grep='Set hostname' | more
Jan 25 07:36:50 ci-ln-w7bw75k-72292-x989q-worker-c-slcn6.c.openshift-gce-devel-c systemd[1]: Set hostname to <ci-ln-w7bw75k-72292-x989q-worker-c-slcn6.c.openshift-gce-devel-c>.
sh-4.4# cat /etc/hostname
ci-ln-w7bw75k-72292-x989q-worker-c-slcn6
sh-4.4# cat /etc/resolv.conf
# Generated by NetworkManager
search c.openshift-gce-devel-ci.internal google.internal c.openshift-gce-devel-c
nameserver 169.254.169.254
sh-4.4# rpm-ostree db list 658b35d30d5da7226bf2abeb9c318a92c1521de2ea65486bc47632f2eee4e6c6 | grep afterburn
 afterburn-5.1.0-1.rhaos4.10.el8.x86_64
sh-4.4# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="410.84.202201242137-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 410.84.202201242137-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.10/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.10"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.10"
OPENSHIFT_VERSION="4.10"
RHEL_VERSION="8.4"
OSTREE_VERSION='410.84.202201242137-0'
The fix for this bug has landed in a bootimage bump, as tracked in bug 2043297 (now in status MODIFIED). Moving this bug to MODIFIED.
I verified that the latest 4.10.0-0.nightly-2022-01-31-012936 (Red Hat Enterprise Linux CoreOS 410.84.202201302203-0) has: ``` Jan 31 21:32:55 localhost afterburn[1013]: Jan 31 21:32:55.602 INFO received hostname "ci-ln-vfhzbct-f76d1-dphk2-master-0.c.openshift-gce-devel-ci.internal" longer than 64 characters; truncating Jan 31 21:32:55 localhost afterburn[1013]: Jan 31 21:32:55.604 INFO wrote hostname ci-ln-vfhzbct-f76d1-dphk2-master-0 to /sysroot/etc/hostname ``` The search domains in /etc/resolv.conf look correct.
Changing status to VERIFIED based on Comment 52 and the result below with 4.10.0-0.nightly-2022-02-04-015640, which has the fixed afterburn-5.2.0-1.rhaos4.10.el8.x86_64

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-04-015640   True        False         3m21s   Cluster version is 4.10.0-0.nightly-2022-02-04-015640

$ oc get nodes
...
ci-ln-cj33r8b-72292-lvjph-worker-a-xkdf5   Ready   worker   17m   v1.23.3+b63be7f

$ oc debug node/ci-ln-cj33r8b-72292-lvjph-worker-a-xkdf5
Starting pod/ci-ln-cj33r8b-72292-lvjph-worker-a-xkdf5-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.3
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# rpm -qa afterburn
afterburn-5.2.0-1.rhaos4.10.el8.x86_64
sh-4.4# cat /etc/hostname
ci-ln-cj33r8b-72292-lvjph-worker-a-xkdf5
sh-4.4# cat /etc/resolv.conf
# Generated by NetworkManager
search c.openshift-gce-devel-ci.internal google.internal
nameserver 169.254.169.254
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
Reworked the doc text.