1804426 – ip= kernel argument not working after reboot w/ static ips configured, hostname reverting to "localhost" in RHCOS

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RHCOS
Sub Component:
Version:	4.3.z
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Colin Walters
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1813019
Depends On:
Blocks:	1186913 1815669

Reported:	2020-02-18 19:42 UTC by Dave Cain
Modified:	2024-06-13 22:28 UTC
CC List:	27 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1815669
Environment:
Last Closed:	2020-07-13 17:15:41 UTC
Target Upstream Version:
Embargoed:

Attachments
active-active static bond (39.58 KB, image/png) 2020-02-18 20:29 UTC, umesh_sunnapu

Links
System	ID	Priority	Status	Summary	Last Updated
Github	coreos ignition-dracut pull 156	None	closed	[spec2x] Persist the first-boot hostname if not otherwise set	2020-12-05 20:39:54 UTC
Red Hat Bugzilla	1743661	medium	CLOSED	Fail to bootstrap an UPI BM cluster with OCP 4.2	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1763700	high	CLOSED	NetworkManager & kubelet startup race condition	2023-10-06 18:41:33 UTC
Red Hat Bugzilla	1800900	urgent	CLOSED	After a reboot nodes get "localhost.localdomain" when "idrac" NIC is present	2024-01-06 04:27:59 UTC
Red Hat Product Errata	RHBA-2020:2409	None	None	None	2020-07-13 17:16:10 UTC

Description Dave Cain 2020-02-18 19:42:04 UTC

Description of problem:
Using the ip=<ip address>... kernel arguments in OCP 4.3 initially produces a working configuration in a target environment that uses channel bonding and static IPs.  This was discovered in a UPI bare-metal deployment.

Passed in arguments:
```
ip=192.168.68.90::192.168.68.1:255.255.255.128:rdu-worker1.dcain-ocp4.raleigh.redhat.com:bond0:off nameserver=192.168.68.254 nameserver=8.8.8.8 bond=bond0:enp6s0,ens15:mode=active-backup,miimon=100,primary=enp6s0
```
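
For readers less familiar with the syntax: the static form of dracut's `ip=` argument is colon-separated (client IP, peer, gateway, netmask, hostname, interface, autoconf method; see dracut.cmdline(7)). Below is a minimal Python sketch that splits the argument above into named fields. The helper name `parse_ip_karg` is hypothetical, and the sketch assumes IPv4 only (bracketed IPv6 addresses contain colons and would need extra handling).

```python
from typing import Dict

# Field order for the static form of dracut's ip= argument (dracut.cmdline(7)).
FIELDS = ("client_ip", "peer", "gateway", "netmask", "hostname", "interface", "autoconf")


def parse_ip_karg(karg: str) -> Dict[str, str]:
    """Split an IPv4 'ip=...' kernel argument into named fields.

    Hypothetical helper for illustration only; optional trailing MTU/MAC
    fields are ignored and bracketed IPv6 addresses are not handled.
    """
    if not karg.startswith("ip="):
        raise ValueError("not an ip= argument: %r" % karg)
    parts = karg[len("ip="):].split(":")
    if len(parts) < len(FIELDS):
        raise ValueError("expected at least %d colon-separated fields" % len(FIELDS))
    # zip() stops at the shorter sequence, so extra trailing fields are dropped.
    return dict(zip(FIELDS, parts))


if __name__ == "__main__":
    arg = ("ip=192.168.68.90::192.168.68.1:255.255.255.128:"
           "rdu-worker1.dcain-ocp4.raleigh.redhat.com:bond0:off")
    for name, value in parse_ip_karg(arg).items():
        print(f"{name:10s} {value!r}")
```

The empty second field is the unused peer address; the fifth field is the hostname that this bug is about.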

This node (a worker) comes up with the correct static IP, correct bonding configuration to match the underlying physical infrastructure, and correct hostname (at least initially).

`hostname` and `hostnamectl` report the correct transient hostname from the passed-in command-line arguments:

```
[root@rdu-worker1 ~]# hostname
rdu-worker1.dcain-ocp4.raleigh.redhat.com

[root@rdu-worker1 ~]# hostnamectl 
   Static hostname: n/a
Transient hostname: rdu-worker1.dcain-ocp4.raleigh.redhat.com
```

However, after a reboot, the system picks up "localhost" as its hostname, which is unintended:
```
[core@localhost ~]$ hostname
localhost
[core@localhost ~]$ hostnamectl
   Static hostname: n/a
Transient hostname: localhost
```
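
Note that in both outputs the static hostname is `n/a`: only the transient hostname is set, and the transient hostname is not stored on disk. A small Python sketch that flags this state is shown below, assuming the `Key: value` layout of the `hostnamectl` output above; `parse_hostnamectl` and `hostname_is_persistent` are hypothetical names.

```python
from typing import Dict


def parse_hostnamectl(output: str) -> Dict[str, str]:
    """Turn 'Key: value' lines from hostnamectl output into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hostname_is_persistent(output: str) -> bool:
    """True when a static hostname is set (anything other than 'n/a' or empty)."""
    static = parse_hostnamectl(output).get("Static hostname", "")
    return static not in ("", "n/a")


if __name__ == "__main__":
    sample = "   Static hostname: n/a\nTransient hostname: localhost\n"
    print(hostname_is_persistent(sample))
```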

Version-Release number of selected component (if applicable):
OCP 4.3.0
Red Hat Enterprise Linux CoreOS 43.81.202001142154.0


How reproducible:
Every time.


Steps to Reproduce:
1. Provision a node, passing the network configuration via ip= kernel arguments
2. System hostname is correct on initial boot after provision
3. System hostname is incorrect after one reboot (triggered by install process)


Expected results:
System keeps its statically defined hostname through micro/minor updates as well as reboots.

Comment 1 Micah Abbott 2020-02-18 20:12:51 UTC

This BZ looks similar - https://bugzilla.redhat.com/show_bug.cgi?id=1803962

Is NetworkManager-wait-online.service failing?  Are there any systemd units failed?

Comment 2 Dave Cain 2020-02-18 20:27:56 UTC

Yes.  Only other unit failing is rdma.service, which is a red herring I think.

[core@localhost ~]$ journalctl -u NetworkManager-wait-online.service
-- Logs begin at Tue 2020-02-18 19:28:15 UTC, end at Tue 2020-02-18 20:32:40 UTC. --
Feb 18 19:29:37 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: Starting Network Manager Wait Online...
Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: Failed to start Network Manager Wait Online.
Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: NetworkManager-wait-online.service: Consumed 77ms CPU time
-- Reboot --
Feb 18 19:34:00 localhost systemd[1]: Starting Network Manager Wait Online...
Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Feb 18 19:34:30 localhost systemd[1]: Failed to start Network Manager Wait Online.
Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service: Consumed 83ms CPU time

The only difference between that BZ and this one is that they are using DHCP, while I'm trying to use statically assigned addresses.
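
The journal excerpt above also shows the hostname column changing from the FQDN to `localhost` across the `-- Reboot --` marker. A rough Python sketch for pulling that transition out of a longer journal follows. It assumes the default short `journalctl` output format (`Mon DD HH:MM:SS HOST UNIT[PID]: MESSAGE`), and `journal_hostnames` is a hypothetical helper name.

```python
from typing import Iterable, List


def journal_hostnames(lines: Iterable[str]) -> List[str]:
    """Return the sequence of hostnames seen, collapsing consecutive repeats.

    Marker lines starting with '--' (e.g. '-- Reboot --') and lines that do
    not match the assumed short format are skipped.
    """
    seen: List[str] = []
    for line in lines:
        if not line.strip() or line.startswith("--"):
            continue
        tokens = line.split(None, 4)  # month, day, time, host, rest
        if len(tokens) < 5:
            continue
        host = tokens[3]
        if not seen or seen[-1] != host:
            seen.append(host)
    return seen


if __name__ == "__main__":
    excerpt = [
        "Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: Failed to start Network Manager Wait Online.",
        "-- Reboot --",
        "Feb 18 19:34:00 localhost systemd[1]: Starting Network Manager Wait Online...",
    ]
    print(journal_hostnames(excerpt))
```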

Comment 3 umesh_sunnapu 2020-02-18 20:29:27 UTC

Created attachment 1663888
active-active static bond

Comment 4 umesh_sunnapu 2020-02-18 20:31:19 UTC

I see a similar issue when using static IPs for an active-active setup. Active-passive with static IPs works fine, though. I do still have DHCP and DNS on the cluster admin host, which I believe plays a role in sending the correct hostname.

Comment 5 umesh_sunnapu 2020-02-18 20:35:10 UTC

Below is my static IP entry in grub.cfg for one of the worker nodes (for reference):


menuentry 'r3worker2' --class fedora --class gnu-linux --class gnu --class os {
  linuxefi rhcos/4.3/rhcos-4.3.0-x86_64-installer-kernel nomodeset rd.neednet=1 coreos.inst=yes coreos.inst.install_dev=nvme0n1 coreos.inst.image_url=http://<ipaddr:httpport>/rhcos/4.3/rhcos-4.3.0-x86_64-metal.raw.gz coreos.inst.ignition_url=http://<ipaddr:httpport>/ignition/worker.ign ip=<ipaddress>::<gateway>:<netmask>:r3worker2.oss.labs:bond0:none bond=bond0:ens2f0,ens2f1:mode=active-backup,miimon=100 nameserver=<dns ip>
  initrdefi rhcos/4.3/rhcos-4.3.0-x86_64-installer-initramfs.img
}

Comment 6 Micah Abbott 2020-02-18 20:39:12 UTC

(In reply to Dave Cain from comment #2)

Anything else in the journal between the start/failure of NetworkManager?  The error reported isn't a lot to go on.

Comment 7 Dave Cain 2020-02-18 23:51:12 UTC

This was caused by a missing A/PTR DNS record in my environment for the system being provisioned.  Take those records out and you have this fallback "localhost" behavior.  Put it back in and the hostname persists across reboots as desired by the user/deployment.

I really think that if a user defines a hostname it should persist on the node in question, regardless of what is in the DNS.
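
For anyone hitting the same symptom, here is a minimal Python sketch for checking whether both the forward (A) and reverse (PTR) lookups resolve for a node. `check_dns_records` is a hypothetical helper. It goes through the system resolver via `socket.gethostbyname` and `socket.gethostbyaddr`, so `/etc/hosts` entries can also satisfy it, and it only compares the first IPv4 address returned.

```python
import socket
from typing import Callable, Tuple


def check_dns_records(
    fqdn: str,
    ip: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> Tuple[bool, bool]:
    """Return (a_record_ok, ptr_record_ok) for the given name/address pair.

    The resolver callables are injectable so the check can be exercised
    without network access; by default the system resolver is used.
    """
    try:
        a_ok = forward(fqdn) == ip
    except OSError:  # socket.gaierror and socket.herror subclass OSError
        a_ok = False
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        ptr_name = reverse(ip)[0]
        ptr_ok = ptr_name.rstrip(".").lower() == fqdn.rstrip(".").lower()
    except OSError:
        ptr_ok = False
    return a_ok, ptr_ok

# Example call (performs real lookups):
#   check_dns_records("rdu-worker1.dcain-ocp4.raleigh.redhat.com", "192.168.68.90")
```

A result of `(True, False)` or `(False, False)` would match the environment described in this comment, where the records were missing.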

Comment 13 Ben Howard 2020-03-05 17:39:55 UTC

I believe that this will affect any UPI installation: Bare-metal or VMware.

Comment 15 Colin Walters 2020-03-05 19:02:19 UTC

I think http://bugzilla.redhat.com/1763700 is strongly related here.  It's due out for the next 4.3.X.

Comment 21 Micah Abbott 2020-03-20 20:33:23 UTC

This was fixed upstream here - https://github.com/coreos/ignition-dracut/pull/156

It landed in RHCOS 45.81.202003121328-0 via `ignition-0.35.1-2.rhaos4.5.git7afbeba.el8`

It was also fixed in 4.4; a separate BZ will be cloned for that.
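
Going only by the PR title ("Persist the first-boot hostname if not otherwise set"), the behavior can be sketched roughly as below. This is an illustrative Python sketch, not the actual ignition-dracut change; `persist_hostname` and `PLACEHOLDER_NAMES` are hypothetical, and the set of placeholder names is an assumption.

```python
from pathlib import Path

# Assumed placeholder names that should never be persisted.
PLACEHOLDER_NAMES = {"", "localhost", "localhost.localdomain", "(none)"}


def persist_hostname(current: str, etc_hostname: Path) -> bool:
    """Write `current` to `etc_hostname` unless a hostname is already set.

    Returns True when the file was written. Placeholder names are never
    persisted, and an existing file is left untouched.
    """
    name = current.strip()
    if etc_hostname.exists() or name in PLACEHOLDER_NAMES:
        return False
    etc_hostname.write_text(name + "\n")
    return True
```

On a real system, `current` would come from `/proc/sys/kernel/hostname` and the target would be `/etc/hostname`; the path is a parameter here so the logic can be tried against a scratch directory.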

Comment 24 Michael Nguyen 2020-04-02 16:28:28 UTC

[core@myhostname ~]$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://25314b9608a0f6b1c95a9c17af338f463ab287bb78a40c04181b1e0bd776b5b9
                   Version: 45.81.202004020816-0 (2020-04-02T08:22:41Z)
[core@myhostname ~]$ cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-306a4d8154fea1020c3352651fe383d95c452fae599db866b1e15e318b0bed3e/vmlinuz-4.18.0-147.5.1.el8_1.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu rd.luks.options=discard ignition.firstboot rd.neednet=1 ip=192.168.122.30::192.168.122.1:255.255.255.0:myhostname:enp1s0:none nameserver=192.168.122.1 ostree=/ostree/boot.1/rhcos/306a4d8154fea1020c3352651fe383d95c452fae599db866b1e15e318b0bed3e/0
[core@myhostname ~]$ hostname
myhostname
[core@myhostname ~]$ hostnamectl 
   Static hostname: myhostname
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 1c840d6c49214a9ca1e9d5601f5d8036
           Boot ID: 9e0b3debe90041bdaedbbe0be0b005a3
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux CoreOS 45.81.202004020816-0 (Ootpa)
            Kernel: Linux 4.18.0-147.5.1.el8_1.x86_64
      Architecture: x86-64
[core@myhostname ~]$ cat /proc/sys/kernel/random/boot_id
9e0b3deb-e900-41bd-aedb-be0be0b005a3
[core@myhostname ~]$ sudo systemctl reboot

--snip--

myhostname login: core
Password: 
Last login: Thu Apr  2 16:22:58 on ttyS0
Red Hat Enterprise Linux CoreOS 45.81.202004020816-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html

---
[core@myhostname ~]$ cat /proc/sys/kernel/random/boot_id
e7269ef3-1c9d-4ead-9021-a7e603fba2e6
[core@myhostname ~]$ hostname
myhostname
[core@myhostname ~]$ hostnamectl 
   Static hostname: myhostname
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 1c840d6c49214a9ca1e9d5601f5d8036
           Boot ID: e7269ef31c9d4ead9021a7e603fba2e6
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux CoreOS 45.81.202004020816-0 (Ootpa)
            Kernel: Linux 4.18.0-147.5.1.el8_1.x86_64
      Architecture: x86-64
[core@myhostname ~]$ sudo systemctl reboot

--snip--
myhostname login: core
Password: 
Last login: Thu Apr  2 16:24:09 on ttyS0
Red Hat Enterprise Linux CoreOS 45.81.202004020816-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html

---

[core@myhostname ~]$ cat /proc/sys/kernel/random/boot_id
7c902a6e-738e-4d33-85ee-c3a8a6a1f139
[core@myhostname ~]$ hostnamectl 
   Static hostname: myhostname
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 1c840d6c49214a9ca1e9d5601f5d8036
           Boot ID: 7c902a6e738e4d3385eec3a8a6a1f139
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux CoreOS 45.81.202004020816-0 (Ootpa)
            Kernel: Linux 4.18.0-147.5.1.el8_1.x86_64
      Architecture: x86-64
[core@myhostname ~]$ hostname
myhostname
[core@myhostname ~]$ rpm -q ignition
ignition-0.35.1-4.rhaos4.5.gite49283b.el8.x86_64

Comment 25 Michael Nguyen 2020-04-02 16:40:12 UTC

The previous RHCOS release maps to OCP 4.5.0-0.nightly-2020-04-02-104742

Comment 26 Dusty Mabe 2020-06-18 20:31:03 UTC

*** Bug 1813019 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2020-07-13 17:15:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 31 Red Hat Bugzilla 2024-01-06 04:28:06 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
