Description of problem:

Using the ip=<ip address>... kernel arguments in OCP 4.3 initially produces a working configuration in the target environment, with channel bonding and static IPs. This was discovered in a UPI bare-metal footprint.

Passed-in arguments:
```
ip=192.168.68.90::192.168.68.1:255.255.255.128:rdu-worker1.dcain-ocp4.raleigh.redhat.com:bond0:off nameserver=192.168.68.254 nameserver=8.8.8.8 bond=bond0:enp6s0,ens15:mode=active-backup,miimon=100,primary=enp6s0
```

This node (a worker) comes up with the correct static IP, a bonding configuration that matches the underlying physical infrastructure, and the correct hostname (at least initially). hostname and hostnamectl report the correct transient hostname from the passed-in command line arguments:
```
[root@rdu-worker1 ~]# hostname
rdu-worker1.dcain-ocp4.raleigh.redhat.com
[root@rdu-worker1 ~]# hostnamectl
   Static hostname: n/a
Transient hostname: rdu-worker1.dcain-ocp4.raleigh.redhat.com
```

However, after a reboot, the system picks up "localhost" as its hostname, which is unintended:
```
[core@localhost ~]$ hostname
localhost
[core@localhost ~]$ hostnamectl
   Static hostname: n/a
Transient hostname: localhost
```

Version-Release number of selected component (if applicable):
OCP 4.3.0
Red Hat Enterprise Linux CoreOS 43.81.202001142154.0

How reproducible:
Every time.

Steps to Reproduce:
1. Pass in command line arguments via ip= conventions, provision a node
2. System hostname is correct on initial boot after provision
3. System hostname is incorrect after one reboot (triggered by install process)

Expected results:
System keeps its statically defined hostname through micro/minor updates as well as reboots.
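For reference, the hostname is the fifth colon-separated field of the ip= argument in dracut's static form (ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<client_hostname>:<interface>:<autoconf>). Below is a minimal Python sketch of pulling that field out of a command line. The names StaticIpArg and parse_static_ip_arg are hypothetical helpers for illustration only, and the naive split on ':' assumes IPv4 addresses (bracketed IPv6 addresses would need more care).

```python
from typing import NamedTuple, Optional


class StaticIpArg(NamedTuple):
    """Fields of the 7-part static ip= kernel argument (hypothetical helper type)."""
    client_ip: str
    peer: str
    gateway: str
    netmask: str
    hostname: str
    interface: str
    autoconf: str


def parse_static_ip_arg(cmdline: str) -> Optional[StaticIpArg]:
    """Return the first static ip= argument found in cmdline, or None.

    Assumption: IPv4 only, so a plain split on ':' is sufficient.
    """
    for token in cmdline.split():
        if not token.startswith("ip="):
            continue
        fields = token[len("ip="):].split(":")
        # Short forms such as ip=dhcp carry no hostname field; skip them.
        if len(fields) < 7:
            continue
        return StaticIpArg(*fields[:7])
    return None


# Arguments copied from the description above.
cmdline = (
    "ip=192.168.68.90::192.168.68.1:255.255.255.128:"
    "rdu-worker1.dcain-ocp4.raleigh.redhat.com:bond0:off "
    "nameserver=192.168.68.254 nameserver=8.8.8.8 "
    "bond=bond0:enp6s0,ens15:mode=active-backup,miimon=100,primary=enp6s0"
)
arg = parse_static_ip_arg(cmdline)
print(arg.hostname, arg.interface)
```

This only shows which value the node is expected to keep as its hostname; it says nothing about how the initramfs or NetworkManager actually consume the argument.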
This BZ looks similar - https://bugzilla.redhat.com/show_bug.cgi?id=1803962

Is NetworkManager-wait-online.service failing? Have any other systemd units failed?
Yes. The only other unit failing is rdma.service, which I think is a red herring.

[core@localhost ~]$ journalctl -u NetworkManager-wait-online.service
-- Logs begin at Tue 2020-02-18 19:28:15 UTC, end at Tue 2020-02-18 20:32:40 UTC. --
Feb 18 19:29:37 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: Starting Network Manager Wait Online...
Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: Failed to start Network Manager Wait Online.
Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: NetworkManager-wait-online.service: Consumed 77ms CPU time
-- Reboot --
Feb 18 19:34:00 localhost systemd[1]: Starting Network Manager Wait Online...
Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Feb 18 19:34:30 localhost systemd[1]: Failed to start Network Manager Wait Online.
Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service: Consumed 83ms CPU time

The only difference between that BZ and this one is that they are using DHCP, whereas I'm trying to use statically assigned addresses.
Created attachment 1663888: active-active static bond
I see a similar issue when using static IPs for an active-active setup. Active-passive with a static IP works fine, though. I do still have DHCP and DNS running on the cluster admin host, which I believe plays a role in sending the correct hostname.
Below is the static IP entry in my grub.cfg for one of the worker nodes (as a reference):

menuentry 'r3worker2' --class fedora --class gnu-linux --class gnu --class os {
    linuxefi rhcos/4.3/rhcos-4.3.0-x86_64-installer-kernel nomodeset rd.neednet=1 coreos.inst=yes coreos.inst.install_dev=nvme0n1 coreos.inst.image_url=http://<ipaddr:httpport>/rhcos/4.3/rhcos-4.3.0-x86_64-metal.raw.gz coreos.inst.ignition_url=http://<ipaddr:httpport>/ignition/worker.ign ip=<ipaddress>::<gateway>:<netmask>:r3worker2.oss.labs:bond0:none bond=bond0:ens2f0,ens2f1:mode=active-backup,miimon=100 nameserver=<dns ip>
    initrdefi rhcos/4.3/rhcos-4.3.0-x86_64-installer-initramfs.img
}
(In reply to Dave Cain from comment #2)
> Yes. Only other unit failing is rdma.service, which is a red herring I
> think.
>
> [core@localhost ~]$ journalctl -u NetworkManager-wait-online.service
> -- Logs begin at Tue 2020-02-18 19:28:15 UTC, end at Tue 2020-02-18 20:32:40
> UTC. --
> Feb 18 19:29:37 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]:
> Starting Network Manager Wait Online...
> Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]:
> NetworkManager-wait-online.service: Main process exited, code=exited,
> status=1/FAILURE
> Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]:
> NetworkManager-wait-online.service: Failed with result 'exit-code'.
> Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]: Failed
> to start Network Manager Wait Online.
> Feb 18 19:30:08 rdu-worker1.dcain-ocp4.raleigh.redhat.com systemd[1]:
> NetworkManager-wait-online.service: Consumed 77ms CPU time
> -- Reboot --
> Feb 18 19:34:00 localhost systemd[1]: Starting Network Manager Wait Online...
> Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service:
> Main process exited, code=exited, status=1/FAILURE
> Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service:
> Failed with result 'exit-code'.
> Feb 18 19:34:30 localhost systemd[1]: Failed to start Network Manager Wait
> Online.
> Feb 18 19:34:30 localhost systemd[1]: NetworkManager-wait-online.service:
> Consumed 83ms CPU time
>
> The only difference in that BZ and this one is they are using DHCP, I'm
> trying to use statically assigned addresses.

Anything else in the journal between the start/failure of NetworkManager? The error reported isn't a lot to go on.
This was caused by a missing A/PTR DNS record in my environment for the system being provisioned. Take those records out and you have this fallback "localhost" behavior. Put it back in and the hostname persists across reboots as desired by the user/deployment. I really think that if a user defines a hostname it should persist on the node in question, regardless of what is in the DNS.
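A quick way to check for the missing records described above is to test forward and reverse resolution for the node from a host that uses the same nameservers. The following is a rough Python sketch, not a definitive diagnostic: check_dns_records is a hypothetical helper, and it goes through the system resolver (so /etc/hosts entries also count) rather than querying DNS directly.

```python
import socket
from typing import Callable, Tuple


def check_dns_records(
    hostname: str,
    expected_ip: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> Tuple[bool, bool]:
    """Return (a_ok, ptr_ok) for a hostname/IP pair.

    a_ok   -- the name resolves to expected_ip (forward / A record)
    ptr_ok -- expected_ip resolves back to the name (reverse / PTR record)
    The resolver functions are injectable so the logic can be exercised
    without network access.
    """
    try:
        a_ok = forward(hostname) == expected_ip
    except OSError:
        # socket.gaierror and socket.herror are both OSError subclasses.
        a_ok = False

    try:
        ptr_name = reverse(expected_ip)[0]
        ptr_ok = ptr_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    except OSError:
        ptr_ok = False

    return a_ok, ptr_ok
```

For the node in this report, the call would be check_dns_records("rdu-worker1.dcain-ocp4.raleigh.redhat.com", "192.168.68.90"); a result of (False, False) or (True, False) would be consistent with the missing-record situation described above.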
I believe that this will affect any UPI installation: Bare-metal or VMware.
I think http://bugzilla.redhat.com/1763700 is strongly related here. It's due out for the next 4.3.X.
This was fixed upstream here - https://github.com/coreos/ignition-dracut/pull/156

It landed in RHCOS 45.81.202003121328-0 via `ignition-0.35.1-2.rhaos4.5.git7afbeba.el8`.

It was also fixed in 4.4; a separate BZ will be cloned for that.
[core@myhostname ~]$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* ostree://25314b9608a0f6b1c95a9c17af338f463ab287bb78a40c04181b1e0bd776b5b9
                   Version: 45.81.202004020816-0 (2020-04-02T08:22:41Z)

[core@myhostname ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-306a4d8154fea1020c3352651fe383d95c452fae599db866b1e15e318b0bed3e/vmlinuz-4.18.0-147.5.1.el8_1.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu rd.luks.options=discard ignition.firstboot rd.neednet=1 ip=192.168.122.30::192.168.122.1:255.255.255.0:myhostname:enp1s0:none nameserver=192.168.122.1 ostree=/ostree/boot.1/rhcos/306a4d8154fea1020c3352651fe383d95c452fae599db866b1e15e318b0bed3e/0

[core@myhostname ~]$ hostname
myhostname

[core@myhostname ~]$ hostnamectl
   Static hostname: myhostname
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 1c840d6c49214a9ca1e9d5601f5d8036
           Boot ID: 9e0b3debe90041bdaedbbe0be0b005a3
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux CoreOS 45.81.202004020816-0 (Ootpa)
            Kernel: Linux 4.18.0-147.5.1.el8_1.x86_64
      Architecture: x86-64

[core@myhostname ~]$ cat /proc/sys/kernel/random/boot_id
9e0b3deb-e900-41bd-aedb-be0be0b005a3

[core@myhostname ~]$ sudo systemctl reboot

--snip--

myhostname login: core
Password:
Last login: Thu Apr 2 16:22:58 on ttyS0
Red Hat Enterprise Linux CoreOS 45.81.202004020816-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html

---
[core@myhostname ~]$ cat /proc/sys/kernel/random/boot_id
e7269ef3-1c9d-4ead-9021-a7e603fba2e6

[core@myhostname ~]$ hostname
myhostname

[core@myhostname ~]$ hostnamectl
   Static hostname: myhostname
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 1c840d6c49214a9ca1e9d5601f5d8036
           Boot ID: e7269ef31c9d4ead9021a7e603fba2e6
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux CoreOS 45.81.202004020816-0 (Ootpa)
            Kernel: Linux 4.18.0-147.5.1.el8_1.x86_64
      Architecture: x86-64

[core@myhostname ~]$ sudo systemctl reboot

--snip--

myhostname login: core
Password:
Last login: Thu Apr 2 16:24:09 on ttyS0
Red Hat Enterprise Linux CoreOS 45.81.202004020816-0
  Part of OpenShift 4.5, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.5/architecture/architecture-rhcos.html

---
[core@myhostname ~]$ cat /proc/sys/kernel/random/boot_id
7c902a6e-738e-4d33-85ee-c3a8a6a1f139

[core@myhostname ~]$ hostnamectl
   Static hostname: myhostname
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 1c840d6c49214a9ca1e9d5601f5d8036
           Boot ID: 7c902a6e738e4d3385eec3a8a6a1f139
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux CoreOS 45.81.202004020816-0 (Ootpa)
            Kernel: Linux 4.18.0-147.5.1.el8_1.x86_64
      Architecture: x86-64

[core@myhostname ~]$ hostname
myhostname

[core@myhostname ~]$ rpm -q ignition
ignition-0.35.1-4.rhaos4.5.gite49283b.el8.x86_64
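As a sanity check on the transcript above: hostnamectl prints the Boot ID without dashes, while /proc/sys/kernel/random/boot_id prints it as a dashed UUID. The small Python sketch below (same_boot is a hypothetical helper; the values are copied from the transcript) normalizes both forms and confirms that each pair matches and that three distinct boots were observed, each reporting the hostname "myhostname".

```python
import uuid

# (value from /proc/sys/kernel/random/boot_id, Boot ID from hostnamectl),
# one pair per boot, copied from the transcript above.
boots = [
    ("9e0b3deb-e900-41bd-aedb-be0be0b005a3", "9e0b3debe90041bdaedbbe0be0b005a3"),
    ("e7269ef3-1c9d-4ead-9021-a7e603fba2e6", "e7269ef31c9d4ead9021a7e603fba2e6"),
    ("7c902a6e-738e-4d33-85ee-c3a8a6a1f139", "7c902a6e738e4d3385eec3a8a6a1f139"),
]


def same_boot(proc_id: str, hostnamectl_id: str) -> bool:
    """Compare two boot IDs regardless of dash formatting."""
    # uuid.UUID accepts both the dashed and the plain 32-hex-digit form.
    return uuid.UUID(proc_id) == uuid.UUID(hostnamectl_id)


consistent = all(same_boot(p, h) for p, h in boots)
distinct_boots = len({uuid.UUID(p) for p, _ in boots})
print(consistent, distinct_boots)  # prints: True 3
```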
The RHCOS release verified in the previous comment maps to OCP 4.5.0-0.nightly-2020-04-02-104742.
*** Bug 1813019 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days