Bug 1881703
| Summary: | Ignition Configuration to disable network interface on worker nodes does not work leading to networking problems | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sai Sindhur Malleni <smalleni> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | high | Priority: | urgent |
| Version: | 4.6 | CC: | adahiya, bbreard, brad, dblack, dustymabe, imcleod, jerzhang, jlebon, jligon, mcornea, miabbott, mifiedle, nstielau, pablo.iranzo, shardy, smilner, trozet, tsedovic, wabouham, walters, wking, yprokule |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | 4.6.0 | Hardware: | x86_64 |
| OS: | Linux | Doc Type: | If docs needed, set a value |
| Fixed In Version: | | Story Points: | --- |
| Last Closed: | 2020-10-27 16:44:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Bug Blocks: | 1831748 | | |
Description (Sai Sindhur Malleni, 2020-09-22 22:07:45 UTC)
To add... This is a regression from 4.5. This same setup worked on 4.5.

These are the steps used in our playbooks to create the ignition configs and modify them: https://github.com/openshift-kni/baremetal-deploy/blob/d577f5911061f7b8ed5b7bdc02ed84813b8d31ef/ansible-ipi-install/roles/installer/tasks/55_customize_filesystem.yml

We essentially create a fake-root directory and use filetranspiler to modify the ignition configs produced by `create ignition-configs`, adding an ifcfg file that disables NIC eno1 on both masters and workers. It is working on masters but not workers.

(In reply to Brad P. Crochet from comment #1)
> To add... This is a regression from 4.5. This same setup worked on 4.5.

Yes, the same setup/hardware has worked in 4.3, 4.4 and 4.5.

The installer is not involved in configuring the machine via Ignition; that is owned by the RHCOS and MCO teams. Since this configuration is being applied using the stub ignition config, I'm moving to RHCOS for triage. Also, when you move the component, please attach a comment explaining the reason; it helps with context.

@Brad P Crochet Can you provide the journal from the worker nodes? It should contain entries from Ignition showing the files being written out to the host. If the masters are getting configured properly but the workers are not, it makes me wonder if the MCS is not serving up the worker configs properly. Also, if you can attach the full worker + master Ignition configs, that would be useful.

A quick grep of the worker log doesn't show any evidence that `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by Ignition.
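For concreteness, here is a minimal sketch (Python, illustrative names only; not the playbook or filetranspiler code itself) of the Ignition `storage.files` entry that the fake-root plus filetranspiler step effectively injects. The ifcfg contents and mode match the user_data pasted later in this bug:

```python
import json
from urllib.parse import quote

# ifcfg fragment that disables eno1 (contents as seen in the pasted user_data).
ifcfg = "DEVICE=eno1\nBOOTPROTO=none\nONBOOT=no\n"

# Ignition storage.files entry equivalent to placing the file in a fake-root
# tree and transpiling it into the pointer config (sketch, not the tool's code).
entry = {
    "path": "/etc/sysconfig/network-scripts/ifcfg-eno1",
    "mode": 0o664,  # decimal 436, as it appears in the rendered config
    "overwrite": True,
    "contents": {"source": "data:," + quote(ifcfg)},
}

print(json.dumps(entry, indent=2))
```

Decoding the resulting `data:` URL with `urllib.parse.unquote` round-trips back to the original ifcfg text.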
(Similarly, there is no evidence of it happening on the masters, but that log looks like the Ignition portion was truncated from the beginning of the log.)

The platform ID of the worker node suggests it is an OpenStack environment, and we have reports of troubles with early networking in BZ#1877740.

Additionally, changes have recently been made to upstream Ignition around OpenStack which will appear in the new boot images used by `openshift-install`:

https://github.com/coreos/ignition/pull/1094
https://github.com/coreos/ignition/pull/1095
https://github.com/coreos/ignition/pull/1098
https://github.com/openshift/installer/pull/4206

I would like to see if this problem is repeatable with the new installer/RHCOS boot image. @Jonathan do you think you could take a look at this?

(In reply to Micah Abbott from comment #9)
> A quick grep of the worker log doesn't show any evidence that
> `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by
> Ignition. (Similarly, there is no evidence of it happening on the masters,
> but that log looks like the Ignition portion was truncated from the
> beginning of the log)
>
> The platform ID of the worker node suggests it is an OpenStack environment
> and we have reports of troubles with early networking in BZ#1877740

This is OpenShift on Baremetal; not sure why the platform ID would report as OpenStack.

> Additionally, changes have recently been made to upstream Ignition around
> OpenStack which will appear in the new boot images used by
> `openshift-install`:
>
> https://github.com/coreos/ignition/pull/1094
> https://github.com/coreos/ignition/pull/1095
> https://github.com/coreos/ignition/pull/1098
> https://github.com/openshift/installer/pull/4206
>
> I would like to see if this problem is repeatable with the new
> installer/RHCOS boot image.

Yes, the master journal seems truncated; is there any way to get it back?
I'm just using the command `journalctl`, but that doesn't give me everything. In the case of the worker nodes, the full log seems to be present.

I can confirm from a previous look: when I looked at `journalctl -b -1` on the masters, I could see /etc/sysconfig/network-scripts/ifcfg-eno1 being written on the masters (and the file was actually present), whereas the same could not be said for the workers.

[root@master-0 core]# grep -inr ifcfg-eno1 master_journal_new.log
2080:Sep 23 19:51:28 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [started] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2081:Sep 23 19:51:28 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [finished] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2471:Sep 23 19:51:32 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [started] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2472:Sep 23 19:51:32 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [finished] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"

(In reply to Sai Sindhur Malleni from comment #10)
> (In reply to Micah Abbott from comment #9)
> > A quick grep of the worker log doesn't show any evidence that
> > `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by
> > Ignition. (Similarly, there is no evidence of it happening on the masters,
> > but that log looks like the Ignition portion was truncated from the
> > beginning of the log)
> >
> > The platform ID of the worker node suggests it is an OpenStack environment
> > and we have reports of troubles with early networking in BZ#1877740
> This is OpenShift on Baremetal, not sure why the platform ID would report as
> OpenStack.
$ grep -m 1 ignition.platform journal_worker.log
Sep 22 20:06:57 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-7fb133ae75316366f0c9ead0f7b95f476a097e0b5c443fa5e584d52942193364/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot rd.neednet=1 ostree=/ostree/boot.1/rhcos/7fb133ae75316366f0c9ead0f7b95f476a097e0b5c443fa5e584d52942193364/0 ignition.platform.id=openstack

I think this is a "quirk" of BM IPI, but perhaps I am speaking out of line.

(In reply to Micah Abbott from comment #15)
> $ grep -m 1 ignition.platform journal_worker.log
> [kernel command line as above, ending in ignition.platform.id=openstack]
>
> I think this is a "quirk" of BM IPI, but perhaps I am speaking out of line.

Thanks for the clarification; that looks like it warrants a BZ. I'm going to open a separate BZ.
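The platform check above can be reproduced without grep. A small sketch (the cmdline string is abridged from the worker log; the ostree paths are shortened here) that splits the kernel command line into key=value pairs:

```python
# Kernel command line abridged from the worker journal above; tokens without
# an "=" (like ignition.firstboot) are bare flags and are skipped here.
cmdline = (
    "BOOT_IMAGE=(hd0,gpt1)/ostree/.../vmlinuz-4.18.0-211.el8.x86_64 "
    "rhcos.root=crypt_rootfs random.trust_cpu=on ignition.firstboot "
    "rd.neednet=1 ignition.platform.id=openstack"
)

# Parse "key=value" tokens into a dict; split on the first "=" only, since
# values like BOOT_IMAGE paths may themselves contain "=" elsewhere.
args = dict(tok.split("=", 1) for tok in cmdline.split() if "=" in tok)
print(args["ignition.platform.id"])  # -> openstack, even on BM IPI
```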
This is a regression in 4.6 as well; this worked on 4.6.0-0.nightly-2020-09-01-042030.

Looking at the worker and master journal logs here, it does seem like Ignition isn't writing the `/etc/sysconfig/network-scripts/ifcfg-eno1` file at all in the worker case, but is in the master case. It could be a bug in Ignition, but it's much more likely that the config Ignition was given simply doesn't have that file in the worker case. This is unlikely to be an RHCOS bug and more likely something in the provisioning stack. But to double-check, here are two things to try:

1. On a worker node, check whether the file is in the final merged worker config that Ignition handled, using e.g. `jq ".storage.files[] | select(.path==\"/etc/sysconfig/network-scripts/ifcfg-eno1\")" /run/ignition.json`.
2. On a worker node, mount the config drive and look at the user-data directly in there to see if it matches what you provided.

Can you check those? If you don't see the sysconfig file in those configs, we should get the Bare Metal IPI folks to take a look. Can you also provide the exact procedure you use to run the install?

>> I think this is a "quirk" of BM IPI, but perhaps I am speaking out of line.
>
> Thanks for the clarification, looks like that warrants a BZ. I'm going to open that up a separate BZ.

Correct, the BM IPI flow currently uses the OpenStack image by design, because it reuses the provisioning stack from OpenStack. They will eventually migrate to the bare metal image, so this is definitely already on their radar.

Could you also add the contents of the rendered-worker being served? i.e.
what's in "https://192.168.222.3:22623/config/worker". You can check `oc get mc` and paste the output of `oc describe mc/rendered-worker-xxx`, just to make sure we're not overwriting your file somehow.

Looks like the master nodes have a mention of ifcfg-eno1:

[kni@e16-h12-b01-fc640 config-2]$ grep -inr eno1 *
master/openstack/2012-08-10/user_data:1:{"ignition": {"config": {"merge": [{"source": "https://192.168.222.3:22623/config/master"}]}, "security": {"tls": {"certificateAuthorities": [{"source": "data:text/plain;charset=utf-8;base64,[base64 CA certificate elided]"}]}}, "version": "3.1.0"}, "storage": {"files": [{"path": "/etc/sysconfig/network-scripts/ifcfg-eno1", "mode": 436, "overwrite": true, "contents": {"source": "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"}}]}}

There is no such thing for the worker nodes.

Indeed, it seems like the bare metal provisioning stack is somehow not properly passing the worker Ignition configs. Moving to the Bare Metal component.

I'm currently not clear how/why this may be specific to baremetal, but I'll take the bz for now and try to reproduce so we can clarify that.

Ok, I reproduced this and believe it may be related to the recent MCO changes to manage the pointer ignition config.

I use dev-scripts, ref https://github.com/openshift-metal3/dev-scripts/blob/master/utils.sh#L47..L94 - which uses openshift-baremetal-install (openshift-install compiled with flags to enable baremetal IPI).

Summary of the process:

openshift-install create manifests
<add some manifests for NTP, unrelated to this bug>
openshift-install create ignition-configs
<merge some ignition for masters/workers to add a test file>

The ignition I merged looks like:

$ cat ignition/file_example.ign | jq .
{
  "ignition": {
    "version": "3.1.0"
  },
  "storage": {
    "files": [
      {
        "path": "/etc/test",
        "mode": 436,
        "contents": {
          "source": "data:,test-foo%0A"
        }
      }
    ]
  }
}

We see this reflected in the installer-generated ignition config (for both masters and workers, but I'll only refer to workers below):

$ jq .storage worker.ign
{
  "files": [
    {
      "path": "/etc/test",
      "mode": 436,
      "contents": {
        "source": "data:,test-foo%0A"
      }
    }
  ]
}

We also see it in the manifest passed via the bootstrap.ign to create the user-data secret:

$ cat bootstrap.ign | jq -r '.storage.files[]|select(.path=="/opt/openshift/openshift/99_openshift-cluster-api_worker-user-data-secret.yaml")' | jq -r .contents.source | sed "s/data:text\plain;charset=utf-8;base64,//" | base64 -d | yq -r .data.userData | base64 -d | jq .storage.files
[
  {
    "path": "/etc/test",
    "mode": 436,
    "contents": {
      "source": "data:,test-foo%0A"
    }
  }
]

However, after running create cluster, we see the following:

$ oc get secret worker-user-data-managed -o json | jq -r .data.userData | base64 -d | jq .storage
{}

This is the data consumed via the machine API to deploy the workers, which explains why the injected file is missing:

$ for i in 20 21 22 23 24; do ssh core.111.$i "hostname && cat /etc/test"; done
master-0
test-foo
master-1
test-foo
master-2
test-foo
worker-0
cat: /etc/test: No such file or directory
worker-1
cat: /etc/test: No such file or directory

The reason the masters contain the file is that these get deployed via terraform (using the installer data directly, which, as we see above, does contain the additional ignition config).

I suspect this is related to the changes in https://github.com/openshift/machine-config-operator/pull/1792 - in particular, it seems the generated config doesn't consider any additional content from the installer-generated resource? https://github.com/openshift/machine-config-operator/pull/1792/files#diff-5926e0cec606d949b66f823310790298

Probably needs a close look from the MCO team, so I'll reassign this for their input.
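What the reproduction shows is the user's `storage` section being dropped when the managed user-data is regenerated. A hedged sketch of the expected behavior (illustrative Python, not MCO code; function and variable names are assumptions): `storage.files` entries from the installer-generated fragment should be carried, keyed by path, into the managed user-data, whereas the 4.6 regression effectively returned the empty `base` below instead of `merged`:

```python
from urllib.parse import unquote

# Managed config as regenerated by the MCO (storage lost in the regression).
base = {"ignition": {"version": "3.1.0"}, "storage": {"files": []}}

# Installer-generated fragment carrying the user's customization, as pasted above.
fragment = {"storage": {"files": [
    {"path": "/etc/test", "mode": 436,
     "contents": {"source": "data:,test-foo%0A"}},
]}}

def merge_storage_files(base, fragment):
    """Merge files by path: fragment entries replace same-path base entries."""
    by_path = {f["path"]: f for f in base["storage"]["files"]}
    by_path.update({f["path"]: f for f in fragment["storage"]["files"]})
    out = dict(base)
    out["storage"] = {"files": list(by_path.values())}
    return out

merged = merge_storage_files(base, fragment)
source = merged["storage"]["files"][0]["contents"]["source"]
print(unquote(source[len("data:,"):]))  # -> test-foo
```

With the merge in place, the `ssh`/`cat /etc/test` loop above would succeed on workers as well as masters.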
Is there a reason you're not providing this change via MachineConfig? If you don't have machine-specific configuration, that's definitely the preferred path. (Doing it via MachineConfig should work transparently in both 4.5 and current 4.6, and has the additional benefit that the changes are tracked by the MCO, so you can see and change them "day 2".)

(In reply to Colin Walters from comment #29)
> Is there a reason you're not providing this change via MachineConfig? If
> you don't have machine-specific configuration that's definitely the
> preferred path.

This is to disable the lab public interface on first boot; we really need this interface disabled before everything else to be able to get repeatable deployments in our labs. Is MachineConfig done early enough so that the interface is disabled and the OVS system configuration picks the right interface to add to br-ex? (In our case, if this is not done soon enough, the NIC that should have been disabled has the default route and ends up getting added to br-ex, breaking communication between the machine-config daemon on the worker nodes and the MCS.)

> This is to disable the lab public interface on first boot, we really need this interface disabled before everything else to be able to get repeatable deployments in our labs. Is MachineConfig done early enough so that the interface is disabled and the ovs system-configuration picks the right interface to add to br-ex?

Yes. Any MachineConfig you provide "day 1" via https://github.com/openshift/installer/blob/master/docs/user/customization.md#install-time-customization-for-machine-configuration is included in the full rendered Ignition that is provided on firstboot for machines. This matches the general philosophy of Ignition: your system is either configured or not. There's no "half configured" state, and we try hard to avoid "multiple phase configuration" steps.
The only power that you get from explicitly customizing the pointer configuration today is support for *per machine* configuration; see discussion in https://github.com/openshift/machine-config-operator/issues/1720

To clarify, in the general case it depends whether the customization needs to affect the Ignition stage itself, or just the real root. If the latter, then indeed, day-1 MCs should work fine. Otherwise, today the hack is to inject things into the qcow2 (to be replaced by a cleaner approach once the bare metal provisioning stack natively supports these customizations; though worth noting that with https://github.com/openshift/enhancements/pull/467, even the former case would be addressed by day-1 MCs).

(In reply to Colin Walters from comment #29)
> Is there a reason you're not providing this change via MachineConfig? If
> you don't have machine-specific configuration that's definitely the
> preferred path.

Our original deployment scripting design for the shared labs avoided the MC approach, partially due to lack of expertise and partially due to an apparent need to disable the network interface very early in the process. So we built, and have been using since 4.3, a method based around modifying the ignition config files. Considering this, though, I just attempted a deployment in the lab where I inserted a MC file prior to install, and it turns out that did work for me; I was able to get a complete cluster up and running with masters and workers. So for our lab purposes we may now have a workaround for this BZ, and potentially a future change for our deployment methods.

(In reply to Colin Walters from comment #29)
> Is there a reason you're not providing this change via MachineConfig? If
> you don't have machine-specific configuration that's definitely the
> preferred path.

I guess the problem is this is a currently documented/supported interface, even if it's not the preferred one?
I'm aware of folks in the field using the same approach, so I think we'll need to avoid breaking it, even if the plan is to deprecate the ignition customization and mandate that all config be passed via MachineConfig manifests?

(In reply to Steven Hardy from comment #35)
> I guess the problem is this is a currently documented/supported interface,
> even if it's not the preferred one?
>
> I'm aware of folks in the field using the same approach, so I think we'll
> need to avoid breaking it, even if the plan is to deprecate the ignition
> customization and mandate all config should be passed via MachineConfig
> manifests?

Right, we're moving forward with the revert that caused this in 4.6 indeed. We'll work in a future release to provide that feature again (synced by the MCO pointer config). The revert is aligned with the fact that there are users leveraging the pointer config interface, and we don't want to break them.
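For readers hitting the same problem, the MachineConfig equivalent of the ifcfg customization looks roughly like the manifest below. This is a sketch: the name `99-worker-disable-eno1` is an arbitrary example, and the contents/mode mirror the user_data pasted earlier in this bug. Dropped into the `openshift/` directory after `openshift-install create manifests`, it is included in the rendered worker Ignition on firstboot:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-disable-eno1   # arbitrary example name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/sysconfig/network-scripts/ifcfg-eno1
          mode: 436        # decimal, i.e. 0664, matching the pasted config
          overwrite: true
          contents:
            source: "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"
```

An analogous manifest with the `master` role label would cover the control-plane nodes.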
This is also slightly related to the flattened ignition config enhancement.

> This is also slightly related to the flattened ignition config enhancement

Yes, that's true. If we didn't have to support this interface, we could just use the rendered config and ignore the pointer config (which we did previously; that broke this interface in a similar way, ref https://bugzilla.redhat.com/show_bug.cgi?id=1833483, and led to a revert of that implementation).

Verified that this is fixed and working in:

[kni@e16-h12-b01-fc640 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-03-051134   True        False         105s    Cluster version is 4.6.0-0.nightly-2020-10-03-051134

Marking this BZ verified per comment #39.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196