OCP Version at Install Time: 4.10 (development)
RHCOS Version at Install Time: 410.84.202112040202-0
Platform: baremetal
Architecture: x86_64

Installing masters in a baremetal cluster (simulated using VMs), using coreos-installer install --copy-network from an image booted via PXE. There are two network interfaces: a provisioning network (connected to enp1s0) and the external network (connected to enp2s0).

The installation works fine, but after rebooting, the ignition file cannot be retrieved from the Machine Config Operator because NetworkManager does not start DHCP on enp2s0. (DHCP is started on enp1s0, but by this point the DHCP server is no longer running so it times out. This is not a problem in itself, as it wouldn't help anyway.)

This behaviour persists even after setting the kernel command line flag "ip=dhcp" on the PXE boot, and passing the following additional arguments to coreos-installer install:

    --append-karg ip=dhcp --firstboot-args rd.neednet=1

An equivalent test using an ISO booted via virtualmedia instead of the PXE boot shows no such problems, though this may be due to a different network topology in the test.

The apparent cause is that on the first boot via PXE, the dracut NetworkManager module creates a file /etc/NetworkManager/system-connections/default_connection.nmconnection that points to enp1s0. When installing with --copy-network, this file appears to be copied to disk, contrary to expectations created by https://github.com/coreos/fedora-coreos-config/pull/773. When the host reboots into the installed OS, dracut sees this file and enables only the default connection.

Note that using --copy-network is a requirement, because we *allow* users to provide custom network keyfiles. However, we cannot *require* users to provide custom network keyfiles; on a fairly ordinary network (no bonds, no tagged VLANs, DHCP available) it must work out of the box.
This could presumably be reproduced by booting a CoreOS host over PXE, and running "coreos-installer install --copy-network" with an ignition containing a pointer to further ignition data that can only be downloaded over the *second* network interface.
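A quick way to check whether a keyfile directory holds a device-pinned connection like the one described above is to look for non-empty mac-address= or interface-name= keys. This is only a diagnostic sketch: the function name pinned_keyfiles is made up for illustration, and which of the two keys dracut uses to pin the device is an assumption.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of any CoreOS tooling).
# Prints the NetworkManager keyfiles in a directory that pin the connection
# to one device via a non-empty mac-address= or interface-name= key.
pinned_keyfiles() {
    dir=$1
    # -l lists matching file names only; '.+' requires a non-empty value, and
    # the '=' directly after the key name keeps mac-address-blacklist= from matching.
    grep -l -E '^(mac-address|interface-name)=.+' "$dir"/*.nmconnection 2>/dev/null
}
```

For example, `pinned_keyfiles /etc/NetworkManager/system-connections` could be run on the live system before installing, or on the installed system after the failed boot.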
I've attached the journal/console output from the initial PXE boot and the boot of the installed OS. Notable features follow:

The PXE boot is started with ip=dhcp

    Dec 15 02:44:15 localhost dracut-cmdline[457]: Using kernel command line parameters: rd.driver.pre=dm_multipath deploy_kernel selinux=0 troubleshoot=0 text nofb nomodeset vga=normal ipa-insecure=1 sshkey="<ssh-pubkey>" ip=dhcp coreos.live.rootfs_url=http://172.22.0.2:80/images/ironic-python-agent.rootfs ignition.firstboot ignition.platform.id=metal ipa-debug=1 ipa-inspection-collectors=default,extra-hardware,logs ipa-enable-vlan-interfaces=all ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=1 ipa-inspection-callback-url=http://172.22.0.2:5050/v1/continue ipa-api-url=http://172.22.0.2:6385 ipa-global-request-id=req-1c1589ef-3485-4d1a-9daf-2897bb58dc98 BOOTIF=00:84:54:eb:aa:47 initrd=deploy_ramdisk

The default_connection produced by dracut is propagated into the running (live) system:

    Dec 15 02:44:29 localhost coreos-teardown-initramfs[1114]: Files /run/coreos-teardown-initramfs/connections-compare-1/default_connection.nmconnection and /run/coreos-teardown-initramfs/connections-compare-2/default_connection.nmconnection differ
    Dec 15 02:44:29 localhost coreos-teardown-initramfs[1114]: info: propagating initramfs networking config to the real root
    Dec 15 02:44:29 localhost coreos-teardown-initramfs[1114]: '/run/NetworkManager/system-connections/default_connection.nmconnection' -> '/sysroot/etc/NetworkManager/system-connections/default_connection.nmconnection'
    Dec 15 02:44:30 localhost coreos-teardown-initramfs[1114]: Relabeled /sysroot//etc/NetworkManager/system-connections/default_connection.nmconnection from (null) to system_u:object_r:NetworkManager_etc_rw_t:s0
    Dec 15 02:44:30 localhost systemd[1]: coreos-teardown-initramfs.service: Succeeded.
    Dec 15 02:44:30 localhost coreos-teardown-initramfs[1114]: Files /run/coreos-teardown-initramfs/connections-compare-1/default_connection.nmconnection and /run/coreos-teardown-initramfs/connections-compare-2/default_connection.nmconnection differ
    Dec 15 02:44:30 localhost coreos-teardown-initramfs[1114]: info: propagating initramfs networking config to the real root
    Dec 15 02:44:30 localhost coreos-teardown-initramfs[1114]: '/run/NetworkManager/system-connections/default_connection.nmconnection' -> '/sysroot/etc/NetworkManager/system-connections/default_connection.nmconnection'
    Dec 15 02:44:30 localhost coreos-teardown-initramfs[1114]: Relabeled /sysroot//etc/NetworkManager/system-connections/default_connection.nmconnection from (null) to system_u:object_r:NetworkManager_etc_rw_t:s0
    Dec 15 02:44:30 localhost systemd[1]: coreos-teardown-initramfs.service: Succeeded.

After installation, the kernel args include ip=dhcp and rd.neednet=1:

    [ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-b0c17d2741adcd571b08d69dbae84aa880469604b4ea9d5e7b6ccbe53c3a3cf2/vmlinuz-4.18.0-305.28.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ignition.firstboot rd.neednet=1 ostree=/ostree/boot.1/rhcos/b0c17d2741adcd571b08d69dbae84aa880469604b4ea9d5e7b6ccbe53c3a3cf2/0 ip=dhcp

The default_connection is still present:

    [ 8.094080] coreos-copy-firstboot-network[834]: info: copying files from /mnt/boot_partition/coreos-firstboot-network to /run/NetworkManager/system-connections/
    [ 8.100189] coreos-copy-firstboot-network[834]: '/mnt/boot_partition/coreos-firstboot-network/default_connection.nmconnection' -> '/run/NetworkManager/system-connections/default_connection.nmconnection'
    [ 8.114086] systemd[1]: Started Copy CoreOS Firstboot Networking Config.
    [ OK ] Started Copy CoreOS Firstboot Networking Config.
Created attachment 1846314: default_connection file pointing to enp1s0
After generating the default config (as seen in https://github.com/coreos/fedora-coreos-config/pull/773 - though necessarily _after_ the initrd has been cleaned up and dracut has exited) with the following command:

    /usr/libexec/nm-initrd-generator -c initrd-generated/ -i /run/coreos-teardown-initramfs/initrd-data-dir -- ip=dhcp,dhcp6

We see the following differences from the actual config produced by dracut:

    --- default_connection.nmconnection 2021-12-14 22:40:53.266338062 -0500
    +++ default_connection.nmconnection.generated 2021-12-14 22:58:31.365653260 -0500
    @@ -1,21 +1,19 @@
     [connection]
     id=Wired Connection
    -uuid=0f56d96f-b25d-421a-b788-b61dab1ce19a
    +uuid=85147c74-0481-456e-95c4-b88fc0df20be
     type=ethernet
     autoconnect-retries=1
     multi-connect=3
     permissions=
    -wait-device-timeout=60000

     [ethernet]
    -mac-address=00:E7:34:0E:9E:A7
     mac-address-blacklist=

     [ipv4]
     dhcp-timeout=90
     dns-search=
    -may-fail=false
     method=auto
    +required-timeout=20000

     [ipv6]
     addr-gen-mode=eui64

When deciding whether to copy the file, the uuid and wait-device-timeout are ignored. However, the mac-address, may-fail, and required-timeout options are not:
https://github.com/coreos/fedora-coreos-config/blob/testing-devel/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-ignition/coreos-teardown-initramfs.sh#L25-L59
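The effect of that comparison can be illustrated with a rough sketch. This is not the code from coreos-teardown-initramfs.sh; the function names are hypothetical, and only the two ignored keys named above (uuid and wait-device-timeout) are filtered out before comparing.

```shell
#!/bin/sh
# Hypothetical illustration of the heuristic described above; the real logic
# lives in coreos-teardown-initramfs.sh and may differ in detail.

# Print a keyfile with the keys that the comparison is said to ignore removed.
normalize_keyfile() {
    grep -v -e '^uuid=' -e '^wait-device-timeout=' "$1"
}

# Succeeds (exit status 0) when the two keyfiles still differ after
# normalization, i.e. when the initramfs config would be treated as
# customized and propagated to the real root.
keyfiles_differ() {
    [ "$(normalize_keyfile "$1")" != "$(normalize_keyfile "$2")" ]
}
```

With the two files from the diff above, `keyfiles_differ default_connection.nmconnection default_connection.nmconnection.generated` would succeed, because the mac-address, may-fail and required-timeout lines survive normalization.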
This is arguably the same bug as bug 1901517, with the heuristics used there either not working in some circumstances or having regressed.
If we do:

    /usr/libexec/nm-initrd-generator -c initrd-generated/ -i /run/coreos-teardown-initramfs/initrd-data-dir -- ip=dhcp

(i.e. matching the kernel params - without the ",dhcp6") there are fewer differences:

    --- default_connection.nmconnection 2021-12-14 22:40:53.266338062 -0500
    +++ default_connection.nmconnection.generated-dhcp4-only 2021-12-15 00:10:10.856014066 -0500
    @@ -1,14 +1,12 @@
     [connection]
     id=Wired Connection
    -uuid=0f56d96f-b25d-421a-b788-b61dab1ce19a
    +uuid=3655b1b4-14e0-4a4a-a7e4-2b8a7c529e38
     type=ethernet
     autoconnect-retries=1
     multi-connect=3
     permissions=
    -wait-device-timeout=60000

     [ethernet]
    -mac-address=00:E7:34:0E:9E:A7
     mac-address-blacklist=

     [ipv4]

Even so, the addition of the mac-address is not ignored, which presumably is why this issue was occurring even before adding ip=dhcp to the kernel command line.
First attempt at a hacky workaround: https://github.com/openshift/ironic-agent-image/pull/27
Note that when I tested it on Fedora CoreOS, it worked as you expect: the temporary connection was created in /run/something, not in /etc
I think what's throwing it off here is the `BOOTIF=` kernel argument, which I presume is injected by PXE boot. With this, nm-initrd-generator will output a connection with a specific MAC address. I.e. it's assuming (without any more device-specific karg constraints) that you want to bring up networking on the same device that you booted from.

Now, that's fine for the PXE boot, but then we're copying that into the live real root and then into the target system via --copy-network. Should we special-case this and not propagate the config? I'm not sure. At the very least, it'd be a breaking change.

Dusty and I talked about this, and we'll need to think about it more, but for now you should be able to specify `coreos.no_persist_ip` to the live PXE boot to prevent the device-specific configuration from being propagated forward. You can still provide `--copy-network` but it will be a no-op in the default case, and otherwise propagate networking from the live root if any was provided via the live Ignition config. Remove all other kargs, like `rd.neednet=1` and `ip=` kargs, and all `--append-karg` and `--firstboot-args` bits.

If you can also make `--copy-network` conditional on the user actually providing a NM config, it would also fix this because then the live initrd code will not forward anything.
More information: there exists an `rd.bootif` flag that should be able to be used to get the behavior we (the RHCOS team) have expected from `nm-initrd-generator` even if `BOOTIF` is on the kernel command line:

```
rd.bootif=0
    Disable BOOTIF parsing, which is provided by PXE
```

Even if this flag is used we still recommend removing all other kargs and also only conditionally adding `--copy-network` if the user is actually providing networking configuration.
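For installer wrappers, the recommendation to add `--copy-network` only when the user actually supplied network configuration could look roughly like the sketch below. The function name and the idea of a separate directory holding only user-provided keyfiles are assumptions for illustration; checking /etc/NetworkManager/system-connections directly would not work in the PXE case, because the propagated default_connection file already lives there.

```shell
#!/bin/sh
# Hypothetical wrapper logic, not taken from any existing installer.
# $1: directory that holds ONLY user-provided NetworkManager keyfiles (assumed layout)
# $2: target device
# Prints the coreos-installer arguments to use.
build_install_args() {
    user_keyfile_dir=$1
    device=$2
    args="install $device"
    # Pass --copy-network only if at least one user keyfile exists; when the
    # glob matches nothing, ls fails and the flag is left out.
    if ls "$user_keyfile_dir"/*.nmconnection >/dev/null 2>&1; then
        args="$args --copy-network"
    fi
    printf '%s\n' "$args"
}
```

The output would then be appended to the `coreos-installer` invocation, while `coreos.no_persist_ip` or `rd.bootif=0` is set on the live PXE kernel command line as described in the comments above.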
Opened an upstream issue to discuss how we want to handle the `nm-initrd-generator` parsing of `BOOTIF` going forward: https://github.com/coreos/fedora-coreos-tracker/issues/1048
Thanks, this explains why it is only PXE that seems to be affected. For now the workaround in https://github.com/openshift/ironic-agent-image/pull/27 is doing the trick, so we'll merge that and investigate the alternative coreos.no_persist_ip flag workaround when everything is not on fire :D I suspect we will still need the ip=dhcp/ip=dhcp6, but for other reasons (we had them even before switching to CoreOS for the IPA image). I'll watch for the outcome of the upstream discussion with interest.
This bug is waiting on work or input from an upstream project. It is not scheduled to be completed this sprint.
Upstream PR to account for this problem: https://github.com/coreos/fedora-coreos-config/pull/1559
This landed in https://github.com/openshift/os/commit/bd3c4943f82b77d0b2f3bea0d0aeb66e886aaa0d
The first ART build with it is `411.85.202203040239-0`.
Since this needs a bootimage bump to correctly deliver the fix, I am going to set this back to POST
Reproduced the problem with the rhcos-410.84.202201251210-0 build; will pre-verify with the fixed version when the live images are available.

Installed a VM using PXE with 2 different networks (nic1 and nic2), and put the remote file on another VM which is in the same network as nic2. After the installation finished, the VM failed to boot while fetching the remote file:

    files: createFilesystemsFiles: createFiles: op(2): [started] writing file "/sysroot/var/home/core/remotefile"
    [ 28.965658] ignition[917]: INFO : files: createFilesystemsFiles: createFiles: op(2): GET http://192.168.100.217:8088/hello: attempt #1
    [ 29.985894] ignition[917]: INFO : files: createFilesystemsFiles: createFiles: op(2): GET error: Get "http://192.168.100.217:8088/hello": dial tcp 192.168.100.217:8088: connect: connection refused
The fix for this bug has landed in a bootimage bump, as tracked in bug 2047935 (now in status MODIFIED). Moving this bug to MODIFIED.
Verification passed with the 411.85.202203242008-0 live images, following the steps in Comment 29.
This issue was also reproduced when installing via the live ISO image (rather than a PXE install): rhcos-4.10.3-x86_64-live.x86_64.iso

Steps to reproduce:

* installed OpenShift cluster version 4.10.5 on VMware vSphere using the bare metal install method

* used Dell PowerEdge MX740c (BIOS Version 2.13.3) to boot the iso via Virtual CD in iDRAC

* after the node booted, configured the network manually:

    nmcli connection add type bond ifname bond0 con-name bond0 mode 0 miimon 100
    nmcli connection modify bond0 ipv4.method disabled ipv6.method ignore
    nmcli connection add type bond-slave ifname ens1f0 con-name bond0-slave-ens1f0 master bond0
    nmcli connection add type bond-slave ifname ens1f1 con-name bond0-slave-ens1f1 master bond0
    nmcli connection up bond0
    nmcli connection add type vlan ifname bond0.xxxx con-name bond0.xxxx id xxxx dev bond0 ip4 a.b.c.d/xx gw4 a.b.c.d ipv4.dns a.b.c.d,e.f.j.k

* tested network e.g. with nslookup google.com

* tested reachability of API and API-INT:

    curl https://api.ocp4.example.com:6443 -k
    curl https://api-int.ocp4.example.com:22623/config/worker -k
    <non-pretty json of the ignition obtained as output>

* used fdisk to identify HDD

    fdisk -l
    Disk /dev/loop0: 377.3 GiB
    Disk /dev/sda: 1.0 TiB

* tried to install CoreOS with ignition url signed with a non-trusted CA, thus using --insecure or --insecure-ignition (or both) flags:

    coreos-installer install /dev/sda --ignition-url https://api-int.ocp4.example.com:22623/config/worker --copy-network --insecure
    Error: downloading source ignition config https://api-int.ocp4.example.com:22623/config/worker
    Caused by:
        0: fetching 'https://api-int.ocp4.example.com:22623/config/worker'
        1: error sending request for url (https://api-int.ocp4.example.com:22623/config/worker): error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        2: error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        3: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        4: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915:

    coreos-installer install /dev/sda --ignition-url https://api-int.ocp4.example.com:22623/config/worker --copy-network --insecure-ignition
    Error: downloading source ignition config https://api-int.ocp4.example.com:22623/config/worker
    Caused by:
        0: fetching 'https://api-int.ocp4.example.com:22623/config/worker'
        1: error sending request for url (https://api-int.ocp4.example.com:22623/config/worker): error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        2: error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        3: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        4: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915:

    coreos-installer install /dev/sda --ignition-url https://api-int.ocp4.example.com:22623/config/worker --copy-network --insecure-ignition --insecure
    Error: downloading source ignition config https://api-int.ocp4.example.com:22623/config/worker
    Caused by:
        0: fetching 'https://api-int.ocp4.example.com:22623/config/worker'
        1: error sending request for url (https://api-int.ocp4.example.com:22623/config/worker): error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        2: error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        3: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915: (unable to get local issuer certificate)
        4: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_cint.c:1915:

* downloaded the ignition manually from api-int

    curl https://api-int.ocp4.example.com:22623/config/worker -k -o worker.ign

* converted json to pretty json

    cat worker.ign | jq '.' > pretty.ign

* checked the ignition version in the prettified json file downloaded from api-int (the one generated at installation time uses version 3.2.0):

    "ignition": { "version": "2.2.0" }

* used the locally downloaded ignition file to install CoreOS on disk:

    coreos-installer install /dev/sda -i pretty.ign --copy-network
    Installing Red Hat Enterprise Linux CoreOS 410.84.202201251210-0 (Ootpa) x86_64 (512-byte sectors)
    ...
    Read disk 3.0 GiB/3.0 GiB (100%)
    Writing Ignition config
    Copying networking configuration from /etc/NetworkManager/system-connections/
    Copying /etc/NetworkManager/system-connections/bond0.nmconnection to installed system
    Copying /etc/NetworkManager/system-connections/bond0-slave-ens1f0.nmconnection to installed system
    Copying /etc/NetworkManager/system-connections/bond0-slave-ens1f1.nmconnection to installed system
    Copying /etc/NetworkManager/system-connections/bond0.xxxx.nmconnection to installed system
    Install complete.

* rebooted the node

* Grub screen shows one entry named "Red Hat Enterprise Linux CoreOS 410.84.202201251210-0 (Ootpa) (ostree:0)"

* boot messages are available as part of a sosreport in case they are needed. One thing to notice is the message:

    systemd[1]: Started CoreOS Ignition User Config Setup.
    systemd[1]: Starting Ignition (fetch-offline)...
    ignition[xxxx]: Ignition 2.13.0
    systemd[1]: ignition-fetch-offline.service: Main process exited, code=exited, status=1/FAILURE
    ignition[xxxx]: Stage: fetch-offline
    systemd[1]: ignition-fetch-offline.service: Failed with result 'exit-code'.
    ignition[xxxx]: reading system config file "/usr/lib/ignition/base.d/00-core.ign"
    systemd[1]: Failed to start Ignition (fetch-offline).
    systemd[1]: Dependency failed for Ignition Complete.
    ignition[xxxx]: no config dir at "/usr/lib/ignition/base.platform.d/metal"
    ignition[xxxx]: no config URL provided
    ignition[xxxx]: reading system config file "/usr/lib/ignition/user.ign"
    systemd[1]: ignition-complete.target: Job ignition-complete.target/start failed with result 'dependency'.
Adding some more messages from the booting node of the previous comment:

    systemd[1]: Starting Ignition (fetch-offline)...
    ignition[xxxx]: Ignition 2.13.0
    systemd[1]: ignition-fetch-offline.service: Main process exited, code=exited, status=1/FAILURE
    ignition[xxxx]: Stage: fetch-offline
    systemd[1]: ignition-fetch-offline.service: Failed with result 'exit-code'.
    ignition[xxxx]: reading system config file "/usr/lib/ignition/base.d/00-core.ign"
    systemd[1]: Failed to start Ignition (fetch-offline).
    systemd[1]: Dependency failed for Ignition Complete.
    ignition[xxxx]: no config dir at "/usr/lib/ignition/base.platform.d/metal"
    ignition[xxxx]: no config URL provided
    systemd[1]: initrd.target: Triggering OnFailure= dependencies.
    ignition[xxxx]: reading system config file "/usr/lib/ignition/user.ign"
    systemd[1]: ignition-complete.target: Job ignition-complete.target/start failed with result 'dependency'.
    systemd[1]: ignition-fetch-offline.service: Triggering OnFailure= dependencies.
    ignition[xxxx]: failed to fetch config: unsupported config version
    systemd[1]: coreos-ignition-setup-user.service: Succeeded.
    ignition[xxxx]: failed to acquire config: unsupported config version
    systemd[1]: Stopped CoreOS Ignition User Config Setup.
    ignition[xxxx]: Ignition failed: unsupported config version
@fminafra, that doesn't appear to be related to this issue. You are correct that coreos-installer does not have a built-in mechanism for trusting a non-trusted CA. You can use the existing system mechanisms for that, or manually curl the config as you have done. However, when curling a config from the MCS, you'll need to set the HTTP Accept header, or the MCS will fall back to serving Ignition spec 2. You can use `curl -H "Accept: application/vnd.coreos.ignition+json;version=3.2.0, */*;q=0.1"` for that. If you have further concerns, please file a new BZ.
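Put together, the manual download with the Accept header might look like the sketch below. The function names are made up for illustration; the `-k` flag mirrors the reporter's own curl invocation and skips certificate verification, so it should only be used where that is acceptable.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the curl command suggested above.

# Print the Accept header asking the MCS for a given Ignition spec version.
ignition_accept_header() {
    printf 'Accept: application/vnd.coreos.ignition+json;version=%s, */*;q=0.1' "$1"
}

# $1: spec version (e.g. 3.2.0), $2: config URL, $3: output file.
# -k disables TLS verification, as in the reporter's original command.
fetch_ignition() {
    curl -k -H "$(ignition_accept_header "$1")" -o "$3" "$2"
}
```

For example, `fetch_ignition 3.2.0 https://api-int.ocp4.example.com:22623/config/worker worker.ign`, after which the "version" field in worker.ign can be checked before passing the file to `coreos-installer install -i`.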
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069