Description of problem:

In our baremetal environments, we typically disable the lab public interface on the nodes by using an ignition configuration file that places an ifcfg file disabling the NIC. Disabling this NIC is a technical requirement for getting OpenShift running in our labs, and it is what we have been doing successfully until now. However, with OCP 4.6, while the NIC is disabled on the master nodes, it is not disabled on the worker nodes. This causes networking problems: the wrong interface is moved to the OVS bridge on the worker nodes, so pods running on worker nodes are unable to reach the API server.

Worker and master ignition configs:

[kni@e16-h12-b01-fc640 clusterconfigs]$ diff worker.ign.bkup master.ign.bkup
1c1
< {"ignition": {"config": {"merge": [{"source": "https://192.168.222.3:22623/config/worker"}]}, "security": {"tls": {"certificateAuthorities": [{"source": "data:text/plain;charset=utf-8;base64,LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lJVDJxejcvMzZlUlV3RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRREV3ZHliMjkwTFdOaE1CNFhEVEl3TURreU1qRTRORGMwTjFvWApEVE13TURreU1ERTRORGMwTjFvd0pqRVNNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUkF3RGdZRFZRUURFd2R5CmIyOTBMV05oTUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUExbXZRWm5DYnFGa3cKcGI3MDlrTFV1TGpqOVRPL1g2Mks2ZmR3Ynp3azBrNzZ6RThrUE5GaUFIRlk5MDJDVmlHcGRqZHByMVBnNVlCTQphOTNzRE1KT2xJUC9XbnpBSUJER2d5UjhRRFJOMDZiRjlRM0g1M01BcEVYRklYb01TZjNTU1MzQXEzVXp1OFVzClg5UjcvRDUzY2kzUzlhTjk3blpmWmlycld3VEtpMW1CWlpGTU5KVTdFVHZQTWRpd1pHeTFBdXhibmhqK2FZc0IKWWNjZndrZ2I4M0ltRlY2d3N1K1hqVy96R1RhV2kzL09xUGNvWWxHMzdXdTN6amxsMkgwWWVzb3hFSkRHYmVIdQpvNHhFM2VtYjdrSXNLVTI1YUJZa3pZVWpIbzRwUTdFNjhLTUxyN3N2ZWthZ0ROd0ljTkN5RStncWlDSnFtc3NjCkNMSUhGaFlaeVFJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVTQwQktsMk5PQzljcWJYWkNkOUUydk16QVpHc3dEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUlRSmJOR0JDN1U0ZHFQOXZxYTNCb1o2RWwyMHcramwzbE
5VY2IzRVlRRWtiNkRFU1o5WTJwTCt6cDRTCnZGWHhEakVlWWhGWGhUQkRNRHRQK0pzampXLzI1Mk5sUm1PdVNWRWNld2MyQUZGV2hJSlZmTklkT0pkYkNkMmMKUnF3VTV3U3k3cE0zaXdxSkNYUldjdWdEMTdiMUV1b1B6QnB0NTF0d1Eza2diUy9iWEttRHFhc3g2czNGM200SgpYWURLbE1ZbS9ld2xtMGIyWkNNS2JxNjlxWG9MOE9WaXAwdGZiaURHcVRNWnVaaGY0QU9iYnBEajMvdjFQdDBGCm1mVUQ3OWNBTlRCbWg2Z1I1QVdoZGFOTTQvRXNoVHRPTDlmdGJ0SVU5eWszeERMREo3ekxiSVJrditLK3RqM3IKYU5TRXZ4ZGNmYWMvS3JKVm50VUh0RTFFblpJPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg=="}]}}, "version": "3.1.0"}, "storage": {"files": [{"path": "/etc/sysconfig/network-scripts/ifcfg-eno1", "mode": 436, "overwrite": true, "contents": {"source": "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"}}]}} \ No newline at end of file --- > {"ignition": {"config": {"merge": [{"source": "https://192.168.222.3:22623/config/master"}]}, "security": {"tls": {"certificateAuthorities": [{"source": "data:text/plain;charset=utf-8;base64,LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lJVDJxejcvMzZlUlV3RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRREV3ZHliMjkwTFdOaE1CNFhEVEl3TURreU1qRTRORGMwTjFvWApEVE13TURreU1ERTRORGMwTjFvd0pqRVNNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUkF3RGdZRFZRUURFd2R5CmIyOTBMV05oTUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUExbXZRWm5DYnFGa3cKcGI3MDlrTFV1TGpqOVRPL1g2Mks2ZmR3Ynp3azBrNzZ6RThrUE5GaUFIRlk5MDJDVmlHcGRqZHByMVBnNVlCTQphOTNzRE1KT2xJUC9XbnpBSUJER2d5UjhRRFJOMDZiRjlRM0g1M01BcEVYRklYb01TZjNTU1MzQXEzVXp1OFVzClg5UjcvRDUzY2kzUzlhTjk3blpmWmlycld3VEtpMW1CWlpGTU5KVTdFVHZQTWRpd1pHeTFBdXhibmhqK2FZc0IKWWNjZndrZ2I4M0ltRlY2d3N1K1hqVy96R1RhV2kzL09xUGNvWWxHMzdXdTN6amxsMkgwWWVzb3hFSkRHYmVIdQpvNHhFM2VtYjdrSXNLVTI1YUJZa3pZVWpIbzRwUTdFNjhLTUxyN3N2ZWthZ0ROd0ljTkN5RStncWlDSnFtc3NjCkNMSUhGaFlaeVFJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVTQwQktsMk5PQzljcWJYWkNkOUUydk16QVpHc3dEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUlRSmJOR0JDN1U0ZHFQOXZxYTNCb1o2RWwyMHcramwzbE5VY2IzRVlRRWtiNkRFU1o5WTJwTCt6cDRTCnZGWHhEakVlWWhGWGh
UQkRNRHRQK0pzampXLzI1Mk5sUm1PdVNWRWNld2MyQUZGV2hJSlZmTklkT0pkYkNkMmMKUnF3VTV3U3k3cE0zaXdxSkNYUldjdWdEMTdiMUV1b1B6QnB0NTF0d1Eza2diUy9iWEttRHFhc3g2czNGM200SgpYWURLbE1ZbS9ld2xtMGIyWkNNS2JxNjlxWG9MOE9WaXAwdGZiaURHcVRNWnVaaGY0QU9iYnBEajMvdjFQdDBGCm1mVUQ3OWNBTlRCbWg2Z1I1QVdoZGFOTTQvRXNoVHRPTDlmdGJ0SVU5eWszeERMREo3ekxiSVJrditLK3RqM3IKYU5TRXZ4ZGNmYWMvS3JKVm50VUh0RTFFblpJPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg=="}]}}, "version": "3.1.0"}, "storage": {"files": [{"path": "/etc/sysconfig/network-scripts/ifcfg-eno1", "mode": 436, "overwrite": true, "contents": {"source": "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"}}]}}
\ No newline at end of file

We can see how the same configuration leads to ifcfg-eno1 being placed on the master nodes but not on the workers:

[root@master-0 core]# cat /etc/sysconfig/network-scripts/ifcfg-eno1
DEVICE=eno1
BOOTPROTO=none
ONBOOT=no

[core@worker000 ~]$ sudo su
[systemd] Failed Units: 1
  NetworkManager-wait-online.service
[root@worker000 core]# cd /etc/sysconfig/network-scripts/
[root@worker000 network-scripts]# ls

This leads to eno1 being added to br-ex on the worker node, while ens2f1 is the interface that should be added, as can be seen on the master nodes:

=========== Worker Node ===========
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port eno1
            Interface eno1
                type: system
        Port patch-br-ex_worker000-to-br-int
            Interface patch-br-ex_worker000-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-ex_worker000}
    ovs_version: "2.13.2"

=========== Master Node ===========
    Bridge br-ex
        Port ens2f1
            Interface ens2f1
                type: system
        Port br-ex
            Interface br-ex
                type: internal
        Port patch-br-ex_master-0-to-br-int
            Interface patch-br-ex_master-0-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-ex_master-0}
    ovs_version: "2.13.2"

It looks like the NIC with the default route is moved to the OVS bridge, so this shouldn't have happened at all: if the NIC had been disabled, it wouldn't have had the default route.
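As a side note, the `contents.source` value in the ignition snippet above is just a percent-encoded data URL; decoding it shows exactly the ifcfg file that should land on disk. A quick standalone sketch (added for illustration, not part of the original report):

```python
from urllib.parse import unquote

# The "contents.source" data URL from the ignition config above.
source = "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"

# Strip the "data:," scheme prefix, then percent-decode the payload.
payload = source[len("data:,"):]
print(unquote(payload))
# DEVICE=eno1
# BOOTPROTO=none
# ONBOOT=no
```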
How reproducible: 100% when disabling a NIC through ignition

Steps to Reproduce:
1. Disable a NIC through ignition for worker nodes
2. Verify NIC has actually been disabled

Actual results:
Several pods on workers fail to come up (ingress, monitoring etc.) due to networking issues caused by having the wrong interface attached to OVS bridge br-ex (it should have been ens2f1 instead of eno1, which should have been disabled)

Expected results:
The NIC eno1 should have been disabled and should not have had the default route

Additional info:
Default routes on master and worker

[root@master-1 core]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.222.1   0.0.0.0         UG    800    0        0 br-ex
10.128.0.0      0.0.0.0         255.255.254.0   U     0      0        0 ovn-k8s-mp0
10.128.0.0      10.128.0.1      255.252.0.0     UG    0      0        0 ovn-k8s-mp0
169.254.0.0     0.0.0.0         255.255.240.0   U     0      0        0 ovn-k8s-gw0
172.22.0.0      0.0.0.0         255.255.255.0   U     101    0        0 ens2f0
172.30.0.0      10.128.0.1      255.255.0.0     UG    0      0        0 ovn-k8s-mp0
192.168.222.0   0.0.0.0         255.255.255.0   U     800    0        0 br-ex

[root@worker001 core]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.222.1   0.0.0.0         UG    103    0        0 ens2f1
0.0.0.0         10.1.39.254     0.0.0.0         UG    800    0        0 br-ex
10.1.36.0       0.0.0.0         255.255.252.0   U     800    0        0 br-ex
10.128.0.0      10.128.2.1      255.252.0.0     UG    0      0        0 ovn-k8s-mp0
10.128.2.0      0.0.0.0         255.255.254.0   U     0      0        0 ovn-k8s-mp0
169.254.0.0     0.0.0.0         255.255.240.0   U     0      0        0 ovn-k8s-gw0
172.22.0.0      0.0.0.0         255.255.255.0   U     102    0        0 ens2f0
172.30.0.0      10.128.2.1      255.255.0.0     UG    0      0        0 ovn-k8s-mp0
192.168.222.0   0.0.0.0         255.255.255.0   U     103    0        0 ens2f1

Logs from router pod

E0922 22:03:15.744114       1 reflector.go:127] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I0922 22:03:41.265909       1 trace.go:205] Trace[1397635573]: "Reflector ListAndWatch"
name:github.com/openshift/router/pkg/router/controller/factory/factory.go:125 (22-Sep-2020 22:03:11.265) (total time: 30000ms):
Trace[1397635573]: [30.000516158s] [30.000516158s] END
E0922 22:03:41.265931       1 reflector.go:127] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1.Route: failed to list *v1.Route: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I0922 22:03:58.915245       1 trace.go:205] Trace[882044693]: "Reflector ListAndWatch" name:github.com/openshift/router/pkg/router/controller/factory/factory.go:125 (22-Sep-2020 22:03:28.914) (total time: 30000ms):
Trace[882044693]: [30.000623427s] [30.000623427s] END
E0922 22:03:58.915284       1 reflector.go:127] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: Get "https://172.30.0.1:443/apis/discovery.k8s.io/v1beta1/endpointslices?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

Logs from machine-config-daemon pods on workers

E0922 22:03:52.117844    6271 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I0922 22:04:35.379500    6271 trace.go:205] Trace[1145578265]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (22-Sep-2020 22:04:05.378) (total time: 30000ms):
Trace[1145578265]: [30.000682973s] [30.000682973s] END
E0922 22:04:35.379522    6271 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I0922 22:05:01.235364    6271 trace.go:205] Trace[1747172884]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (22-Sep-2020 22:04:31.234) (total time: 30000ms):
Trace[1747172884]: [30.000641364s] [30.000641364s] END
E0922 22:05:01.235389    6271 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

There are no errors in the machine-config-server pods.

[kni@e16-h12-b01-fc640 clusterconfigs]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 False       False         True       164m
cloud-credential                           4.6.0-0.nightly-2020-09-22-051033   True        False         False      3h7m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-22-051033   True        False         False      156m
config-operator                            4.6.0-0.nightly-2020-09-22-051033   True        False         False      165m
console                                    4.6.0-0.nightly-2020-09-22-051033   Unknown     True          False      137m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-22-051033   True        False         False      155m
dns                                        4.6.0-0.nightly-2020-09-22-051033   True        False         False      164m
etcd                                       4.6.0-0.nightly-2020-09-22-051033   True        False         False      162m
image-registry                             4.6.0-0.nightly-2020-09-22-051033   True        False         False      138m
ingress                                                                        False       True          True       156m
insights                                   4.6.0-0.nightly-2020-09-22-051033   True        False         False      156m
kube-apiserver                             4.6.0-0.nightly-2020-09-22-051033   True        False         False      162m
kube-controller-manager                    4.6.0-0.nightly-2020-09-22-051033   True        False         False      159m
kube-scheduler                             4.6.0-0.nightly-2020-09-22-051033   True        False         False      158m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-22-051033   True        False         False      109m
machine-api                                4.6.0-0.nightly-2020-09-22-051033   True        False         False      130m
machine-approver                           4.6.0-0.nightly-2020-09-22-051033   True        False         False      161m
machine-config                             4.6.0-0.nightly-2020-09-22-051033   True        False         False      159m
marketplace                                4.6.0-0.nightly-2020-09-22-051033   True        False         False      136m
monitoring                                                                     False       True          True       151m
network                                    4.6.0-0.nightly-2020-09-22-051033   True        False         False      160m
node-tuning                                4.6.0-0.nightly-2020-09-22-051033   True        False         False      165m
openshift-apiserver                        4.6.0-0.nightly-2020-09-22-051033   True        False         False      141m
openshift-controller-manager               4.6.0-0.nightly-2020-09-22-051033   True        False         False      155m
openshift-samples                          4.6.0-0.nightly-2020-09-22-051033   True        False         False      141m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-22-051033   True        False         False      164m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-22-051033   True        False         False      164m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-22-051033   True        False         False      138m
service-ca                                 4.6.0-0.nightly-2020-09-22-051033   True        False         False      165m
storage                                    4.6.0-0.nightly-2020-09-22-051033   True        False         False      165m

[kni@e16-h12-b01-fc640 clusterconfigs]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3h9m    Unable to apply 4.6.0-0.nightly-2020-09-22-051033: some cluster operators have not yet rolled out
To add... This is a regression from 4.5. This same setup worked on 4.5.
These are the steps to create the ignition configs and modify them, as used in our playbooks: https://github.com/openshift-kni/baremetal-deploy/blob/d577f5911061f7b8ed5b7bdc02ed84813b8d31ef/ansible-ipi-install/roles/installer/tasks/55_customize_filesystem.yml

We essentially create a fake-root directory and use filetranspiler to modify the ignition configs produced by `create ignition-configs`, for both masters and workers, adding an ifcfg file that disables NIC eno1. It is working on masters but not on workers.
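For context, the fake-root + filetranspiler step essentially walks a directory tree and appends each file to the config's `storage.files` list as a percent-encoded data URL. A rough Python sketch of that merge logic (a simplified illustration of what filetranspiler does, not its actual code; the directory and file names are examples):

```python
import json
import os
from urllib.parse import quote

def merge_fake_root(ign: dict, fake_root: str) -> dict:
    """Append every file under fake_root to the config's storage.files,
    encoding contents as a percent-encoded data URL (plain-text files only)."""
    files = ign.setdefault("storage", {}).setdefault("files", [])
    for dirpath, _, names in os.walk(fake_root):
        for name in names:
            host_path = os.path.join(dirpath, name)
            with open(host_path) as f:
                data = f.read()
            files.append({
                "path": "/" + os.path.relpath(host_path, fake_root),
                "mode": 436,  # decimal 436 == 0664, matching the configs above
                "overwrite": True,
                "contents": {"source": "data:," + quote(data)},
            })
    return ign

# Example: inject the ifcfg-eno1 file into a minimal worker config.
os.makedirs("fake-root/etc/sysconfig/network-scripts", exist_ok=True)
with open("fake-root/etc/sysconfig/network-scripts/ifcfg-eno1", "w") as f:
    f.write("DEVICE=eno1\nBOOTPROTO=none\nONBOOT=no\n")

worker = {"ignition": {"version": "3.1.0"}}
merged = merge_fake_root(worker, "fake-root")
print(json.dumps(merged["storage"]["files"][0], indent=2))
```

This produces a storage entry identical in shape to the ifcfg-eno1 entries in the diffed configs earlier in this report.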
(In reply to Brad P. Crochet from comment #1)
> To add... This is a regression from 4.5. This same setup worked on 4.5.

Yes, the same setup/hardware has worked in 4.3, 4.4 and 4.5.
The installer is not involved in configuring the machine using ignition; that is handled by the RHCOS and MCO teams. Since this configuration is being applied using the stub ignition config, I'm moving this to RHCOS for triage.

Also, please make sure that when you move the component you attach a comment explaining the reason; it helps with the context. @Brad P Crochet
Can you provide the journal from the worker nodes? It should contain entries from Ignition showing the files being written out to the host. If the masters are getting configured properly but the workers are not, it makes me wonder if the MCS is not serving up the worker configs properly. Also if you can attach the full worker + master Ignition configs, that would be useful.
A quick grep of the worker log doesn't show any evidence that `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by Ignition. (Similarly, there is no evidence of it happening on the masters, but that log looks like the Ignition portion was truncated from the beginning of the log.)

The platform ID of the worker node suggests it is an OpenStack environment, and we have reports of troubles with early networking in BZ#1877740.

Additionally, changes have recently been made to upstream Ignition around OpenStack which will appear in the new boot images used by `openshift-install`:

https://github.com/coreos/ignition/pull/1094
https://github.com/coreos/ignition/pull/1095
https://github.com/coreos/ignition/pull/1098
https://github.com/openshift/installer/pull/4206

I would like to see if this problem is repeatable with the new installer/RHCOS boot image.

@Jonathan do you think you could take a look at this?
(In reply to Micah Abbott from comment #9)
> A quick grep of the worker log doesn't show any evidence that
> `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by
> Ignition. (Similarly, there is no evidence of it happening on the masters,
> but that log looks like the Ignition portion was truncated from the
> beginning of the log)
>
> The platform ID of the worker node suggests it is an OpenStack environment
> and we have reports of troubles with early networking in BZ#1877740

This is OpenShift on Baremetal, not sure why the platform ID would report as OpenStack.

> Additionally, changes have recently been made to upstream Ignition around
> OpenStack which will appear in the new boot images used by
> `openshift-install`:
>
> https://github.com/coreos/ignition/pull/1094
> https://github.com/coreos/ignition/pull/1095
> https://github.com/coreos/ignition/pull/1098
> https://github.com/openshift/installer/pull/4206
>
> I would like to see if this problem is repeatable with the new
> installer/RHCOS boot image.
>
> @Jonathan do you think you could take a look at this?
Yes, the master journal seems truncated; is there any way to get it back? I'm just using the command "journalctl", but that doesn't give me everything. In the case of the worker nodes, the full log seems to be present.
I can confirm from a previous look: when I looked at 'journalctl -b -1' on the masters, I could see /etc/sysconfig/network-scripts/ifcfg-eno1 being written (and the file was actually present), whereas the same could not be said for the workers.
[root@master-0 core]# grep -inr ifcfg-eno1 master_journal_new.log
2080:Sep 23 19:51:28 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [started] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2081:Sep 23 19:51:28 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [finished] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2471:Sep 23 19:51:32 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [started] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2472:Sep 23 19:51:32 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [finished] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
(In reply to Sai Sindhur Malleni from comment #10)
> (In reply to Micah Abbott from comment #9)
> > A quick grep of the worker log doesn't show any evidence that
> > `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by
> > Ignition. (Similarly, there is no evidence of it happening on the masters,
> > but that log looks like the Ignition portion was truncated from the
> > beginning of the log)
> >
> > The platform ID of the worker node suggests it is an OpenStack environment
> > and we have reports of troubles with early networking in BZ#1877740
> This is OpenShift on Baremetal, not sure why the platform ID would report as
> OpenStack.

$ grep -m 1 ignition.platform journal_worker.log
Sep 22 20:06:57 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-7fb133ae75316366f0c9ead0f7b95f476a097e0b5c443fa5e584d52942193364/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot rd.neednet=1 ostree=/ostree/boot.1/rhcos/7fb133ae75316366f0c9ead0f7b95f476a097e0b5c443fa5e584d52942193364/0 ignition.platform.id=openstack

I think this is a "quirk" of BM IPI, but perhaps I am speaking out of line.
(In reply to Micah Abbott from comment #15)
> (In reply to Sai Sindhur Malleni from comment #10)
> > (In reply to Micah Abbott from comment #9)
> > > A quick grep of the worker log doesn't show any evidence that
> > > `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by
> > > Ignition. (Similarly, there is no evidence of it happening on the masters,
> > > but that log looks like the Ignition portion was truncated from the
> > > beginning of the log)
> > >
> > > The platform ID of the worker node suggests it is an OpenStack environment
> > > and we have reports of troubles with early networking in BZ#1877740
> > This is OpenShift on Baremetal, not sure why the platform ID would report as
> > OpenStack.
>
> $ grep -m 1 ignition.platform journal_worker.log
> Sep 22 20:06:57 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-7fb133ae75316366f0c9ead0f7b95f476a097e0b5c443fa5e584d52942193364/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot rd.neednet=1 ostree=/ostree/boot.1/rhcos/7fb133ae75316366f0c9ead0f7b95f476a097e0b5c443fa5e584d52942193364/0 ignition.platform.id=openstack
>
> I think this is a "quirk" of BM IPI, but perhaps I am speaking out of line.

Thanks for the clarification; it looks like that warrants a BZ of its own. I'm going to open a separate BZ.
This is a regression within 4.6 as well: it worked on 4.6.0-0.nightly-2020-09-01-042030.
Looking at the worker and master journal logs here, it does seem like Ignition isn't writing the `/etc/sysconfig/network-scripts/ifcfg-eno1` file at all in the worker case, but is in the master case. It could be a bug in Ignition, but it's much more likely that the config Ignition was given simply doesn't have that file in the worker case. This is unlikely to be an RHCOS bug and more likely something in the provisioning stack. But to double-check, here are two things to try:

1. On a worker node, you can check if the file is in the final merged worker config that Ignition handled using e.g. `jq ".storage.files[] | select(.path==\"/etc/sysconfig/network-scripts/ifcfg-eno1\")" /run/ignition.json`.
2. On a worker node, you can mount the config drive and look at the user-data directly in there to see if it matches what you provided.

Can you check those? If you don't see the sysconfig file in those configs, we should get the Bare Metal IPI folks to take a look. Can you also provide the exact procedure you use to run the install?

>> I think this is "quirk" of BM IPI, but perhaps I am speaking out of line.
>
> Thanks for the clarification, looks like that warrants a BZ. I'm going to open that up a separate BZ.

Correct, the BM IPI flow currently uses the OpenStack image by design because it reuses the provisioning stack from OpenStack. They will eventually migrate to the bare metal image, so this is definitely already on their radar.
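The jq filter suggested above can also be applied to a saved copy of the merged config; here is an illustrative Python sketch of the same check, run against a fabricated sample instead of a real `/run/ignition.json` (the sample JSON is made up for the example):

```python
import json

# Fabricated stand-in for the merged config Ignition writes to /run/ignition.json.
merged = json.loads("""
{"ignition": {"version": "3.1.0"},
 "storage": {"files": [
   {"path": "/etc/containers/registries.conf", "mode": 420},
   {"path": "/etc/sysconfig/network-scripts/ifcfg-eno1", "mode": 436}
 ]}}
""")

# Equivalent of: jq '.storage.files[] | select(.path=="...ifcfg-eno1")'
wanted = "/etc/sysconfig/network-scripts/ifcfg-eno1"
hits = [f for f in merged.get("storage", {}).get("files", [])
        if f["path"] == wanted]
print("present" if hits else "missing")  # prints "present" for this sample
```

On an affected worker, the real check would print "missing", confirming the file never made it into the merged config.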
Could you also add the contents of the rendered-worker config being served, i.e. what's in "https://192.168.222.3:22623/config/worker"? You can check `oc get mc` and paste the output of `oc describe mc/rendered-worker-xxx`, just to make sure we're not overwriting your file somehow.
Looks like while the masters nodes have a mention of ifcfg-eno1 [kni@e16-h12-b01-fc640 config-2]$ grep -inr eno1 * master/openstack/2012-08-10/user_data:1:{"ignition": {"config": {"merge": [{"source": "https://192.168.222.3:22623/config/master"}]}, "security": {"tls": {"certificateAuthorities": [{"source": "data:text/plain;charset=utf-8;base64,LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lJTU9FWFo1dXdaVW93RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRREV3ZHliMjkwTFdOaE1CNFhEVEl3TURreU5URXpNRGswTUZvWApEVE13TURreU16RXpNRGswTUZvd0pqRVNNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUkF3RGdZRFZRUURFd2R5CmIyOTBMV05oTUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUFzMk5Wc2pWMWZVYnYKT0VoQXZLVklNaEJSNFo5cTJteXkrUThRenc3MUYxZ2xIb0M4ZnJFZHVkZ0RpbEtrS0svN1RBRVBVdWZnajdYdwpRV2hNWTVwOXUrZEpxekZZZU5KYnFGdUFpT2lFaG53M1U4alE2SEtOYWp0WGNqLzVySWlRM3VGTk9XUkttRU93CkM1SllSQVNQaWdrV3Z1MkYzTnBKK1hxYmlWL2s3VjZHc2prL2FiRjhIN3V1UTN0RTR3K3QxTWpGWmhQK1JOczgKNzZGYm9MNVN4ajBZMW9kUUJTSGVVYU1mRmpHR1lTc3hvdlVwaU9TWTNrdUVNcTlQRWpCdU1nRTVJTVVLQzRqagp4NEp4QzlzWS82M0xZNFkybTRZUFhhWTlIa1VoUWt0ZDJLYVg0V0tJbFRETkZvZUFjS2VXWXY5Z0tZclBhRnpiCi81NWNPSXFWNFFJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVWNjZnJ6bXdkeWZwaGVac1c4eCtSQ1RSYWlra3dEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUlGSld0N1ZOSSt4dmFwYXA2Y0JWMkdzR3NsOFBIbjBoTFFUaVJEMVo4eFlsUlkrNS9ydkczRjM4SjdFClhWOHMwT0tPYVJ5Y2dHdENvWURocFZEVm5nbFBMby9xSWVwbkdEK0htakNSYUU0T2NEZk5ZUUVQVVltazBuK1YKaXlDM2RpcWFtamo1KzZrc2NWR2l4bnVxZjBSZlRVYm1PNTVoT3hCUTlOdUlvSUxYb2x6aVI1UUYwaElqeW5UVQpVaFpYUEwySzV4a3VZOFcwenRKbnR1SEZZR2VrL21KS09pd1pjTzUya2s1b2JkY3dxMEFyT1VPMGkzNXpiTDRwCnE5NHVHbmRjblJpKzZpRk95aS9Oc3p0SUxVYjlwSCtscGRWTENSb0NzVS9qaGJYMWhDRnZSQlcxQ1Q3UnYvVXoKaUtYeE1MQ1Y2S2lCRU82TzgxVFYvc0ZEaVRnPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg=="}]}}, "version": "3.1.0"}, "storage": {"files": [{"path": "/etc/sysconfig/network-scripts/ifcfg-eno1", "mode": 436, "overwrite": true, "contents": 
{"source": "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"}}]}}

There is no such entry for the worker nodes.
Indeed it seems like the bare metal provisioning stack is somehow not properly passing the worker Ignition configs. Moving to Bare Metal component.
I'm currently not clear how/why this may be specific to baremetal, but I'll take the bz for now and try to reproduce so we can clarify that.
Ok, I reproduced this and believe it may be related to the recent MCO changes to manage the pointer ignition config.

I use dev-scripts, ref https://github.com/openshift-metal3/dev-scripts/blob/master/utils.sh#L47..L94 - which uses openshift-baremetal-install (which is openshift-install compiled with flags to enable baremetal IPI).

Summary of the process:

openshift-install create manifests
<add some manifests for NTP, unrelated to this bug>
openshift-install create ignition-configs
<merge some ignition for masters/workers to add a test file>

The ignition I merged looks like:

$ cat ignition/file_example.ign | jq .
{
  "ignition": {
    "version": "3.1.0"
  },
  "storage": {
    "files": [
      {
        "path": "/etc/test",
        "mode": 436,
        "contents": {
          "source": "data:,test-foo%0A"
        }
      }
    ]
  }
}

We see this reflected in the installer-generated ignition config (for both masters and workers, but I'll only refer to workers below):

$ jq .storage worker.ign
{
  "files": [
    {
      "path": "/etc/test",
      "mode": 436,
      "contents": {
        "source": "data:,test-foo%0A"
      }
    }
  ]
}

We also see it in the manifest passed via the bootstrap.ign to create the user-data secret:

$ cat bootstrap.ign | jq -r '.storage.files[]|select(.path=="/opt/openshift/openshift/99_openshift-cluster-api_worker-user-data-secret.yaml")' | jq -r .contents.source | sed "s/data:text\plain;charset=utf-8;base64,//" | base64 -d | yq -r .data.userData | base64 -d | jq .storage.files
[
  {
    "path": "/etc/test",
    "mode": 436,
    "contents": {
      "source": "data:,test-foo%0A"
    }
  }
]

However, after running create cluster, we see the following:

$ oc get secret worker-user-data-managed -o json | jq -r .data.userData | base64 -d | jq .storage
{}

This is the data consumed via the machine API to deploy the workers, which explains why the injected file is missing:

$ for i in 20 21 22 23 24; do ssh core.111.$i "hostname && cat /etc/test"; done
master-0
test-foo
master-1
test-foo
master-2
test-foo
worker-0
cat: /etc/test: No such file or directory
worker-1
cat: /etc/test: No such file or directory

The reason the masters contain the file is that these get deployed via terraform (using the installer data directly, which as we see above does contain the additional ignition config).

I suspect this is related to the changes in https://github.com/openshift/machine-config-operator/pull/1792 - in particular it seems the generated config doesn't consider any additional content from the installer-generated resource?

https://github.com/openshift/machine-config-operator/pull/1792/files#diff-5926e0cec606d949b66f823310790298

Probably needs a close look from the MCO team, so I'll reassign this for their input.
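The `oc get secret ... | jq -r .data.userData | base64 -d` pipeline used above can be mimicked in plain Python; the sketch below fabricates a secret whose userData carries an empty `storage` section, reproducing the symptom observed on `worker-user-data-managed` (the secret contents are made up for illustration):

```python
import base64
import json

# Fabricated stand-in for `oc get secret worker-user-data-managed -o json`:
# Kubernetes secret values are base64, and userData is itself an ignition config.
user_data = {"ignition": {"version": "3.1.0"}, "storage": {}}
secret = {"data": {"userData": base64.b64encode(
    json.dumps(user_data).encode()).decode()}}

# Equivalent of: jq -r .data.userData | base64 -d | jq .storage
decoded = json.loads(base64.b64decode(secret["data"]["userData"]))
print(json.dumps(decoded["storage"]))  # prints {} - the injected file is gone
```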
Is there a reason you're not providing this change via MachineConfig? If you don't have machine-specific configuration that's definitely the preferred path.
(And doing it via MachineConfig should work transparently in both 4.5 and current 4.6, and has the additional benefit that the changes are tracked by the MCO so you can see and change them "day 2")
(In reply to Colin Walters from comment #29)
> Is there a reason you're not providing this change via MachineConfig? If
> you don't have machine-specific configuration that's definitely the
> preferred path.

This is to disable the lab public interface on first boot; we really need this interface disabled before everything else in order to get repeatable deployments in our labs. Is MachineConfig applied early enough that the interface is disabled and the OVS system configuration picks the right interface to add to br-ex? (In our case, if this is not done soon enough, the NIC that should have been disabled holds the default route, ends up getting added to br-ex, and breaks communication between the machine-config daemon on the worker nodes and the MCS.)
> This is to disable the lab public interface on first boot, we really need this interface disabled before everything else to be able to get repeatable deployments in our labs. Is MachineConfig done early enough so that the interface is disabled and the ovs system-configuration picks the right interface to add to br-ex?

Yes. Any MachineConfig you provide "day 1" via https://github.com/openshift/installer/blob/master/docs/user/customization.md#install-time-customization-for-machine-configuration is included in the full rendered Ignition that is provided on firstboot for machines.

This matches the general philosophy of Ignition - your system is either configured or not. There's no "half configured" state and we try hard to avoid "multiple phase configuration" steps.

The only power that you get from explicitly customizing the pointer configuration today is support for *per machine* configuration; see the discussion in https://github.com/openshift/machine-config-operator/issues/1720
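For reference, a day-1 MachineConfig carrying the same ifcfg payload as the pointer-config snippet earlier in this bug would look roughly like the following (the manifest name and role label are illustrative choices, not taken from the report; drop it into the directory produced by `openshift-install create manifests` before installing):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    # Illustrative: targets the worker pool; use "master" for masters.
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-disable-eno1
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/sysconfig/network-scripts/ifcfg-eno1
          mode: 436  # decimal 436 == 0664, as in the ignition configs above
          overwrite: true
          contents:
            source: data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A
```

The MCO renders this into the worker pool's config, so the file is present in the full Ignition served on first boot, which is why the MC approach works in the deployment attempt described below.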
To clarify, in the general case it depends whether the customization needs to affect the Ignition stage itself, or just the real root. If the latter, then indeed, day 1 MCs should work fine. Otherwise, today the hack is to inject things in the qcow2 (to be replaced by a cleaner approach once the bare metal provisioning stack natively supports these customizations, though worth noting also that with https://github.com/openshift/enhancements/pull/467, even the former case would be addressed by day 1 MCs).
(In reply to Colin Walters from comment #29) > Is there a reason you're not providing this change via MachineConfig? If > you don't have machine-specific configuration that's definitely the > preferred path. Our original deployment scripting design for the shared labs avoided the MC approach partially due to lack of expertise and partially due to an apparent need to disable the network interface very early in the process. So we built and have been using since 4.3 a method based around modifying the ignition config files. Considering this, though, I just attempted to do a deployment in the lab where I inserted a MC file prior to install, and it turns out that did work for me and I was able to get a complete cluster up and running with masters and workers. So for our lab purposes, we may now have a workaround for this BZ, and potentially a future change for our deployment methods.
(In reply to Colin Walters from comment #29) > Is there a reason you're not providing this change via MachineConfig? If > you don't have machine-specific configuration that's definitely the > preferred path. I guess the problem is this is a currently documented/supported interface, even if it's not the preferred one? I'm aware of folks in the field using the same approach, so I think we'll need to avoid breaking it, even if the plan is to deprecate the ignition customization and mandate all config should be passed via MachineConfig manifests?
(In reply to Steven Hardy from comment #35)
> (In reply to Colin Walters from comment #29)
> > Is there a reason you're not providing this change via MachineConfig? If
> > you don't have machine-specific configuration that's definitely the
> > preferred path.
>
> I guess the problem is this is a currently documented/supported interface,
> even if it's not the preferred one?
>
> I'm aware of folks in the field using the same approach, so I think we'll
> need to avoid breaking it, even if the plan is to deprecate the ignition
> customization and mandate all config should be passed via MachineConfig
> manifests?

Right, we're moving forward with reverting the change that caused this in 4.6. We'll work in a future release to provide that feature again (synced by the MCO pointer config). The revert is aligned with the fact that there are users leveraging the pointer config interface and we don't want to break them. This is also slightly related to the flattened ignition config enhancement.
> This is also slightly related to the flattened ignition config enhancement

Yes, that's true. If we didn't have to support this interface, we could just use the rendered config and ignore the pointer config (which we did previously; that broke this interface in a similar way, ref https://bugzilla.redhat.com/show_bug.cgi?id=1833483, and led to a revert of that implementation).
Verified that this is fixed and working in 4.6.0-0.nightly-2020-10-03-051134:

[kni@e16-h12-b01-fc640 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-03-051134   True        False         105s    Cluster version is 4.6.0-0.nightly-2020-10-03-051134
Marking this BZ verified per comment #39
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196