Bug 1881703
| Summary: | Ignition Configuration to disable network interface on worker nodes does not work leading to networking problems | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sai Sindhur Malleni <smalleni> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | high | Priority: | urgent |
| Version: | 4.6 | CC: | adahiya, bbreard, brad, dblack, dustymabe, imcleod, jerzhang, jlebon, jligon, mcornea, miabbott, mifiedle, nstielau, pablo.iranzo, shardy, smilner, trozet, tsedovic, wabouham, walters, wking, yprokule |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | 4.6.0 | Hardware: | x86_64 |
| OS: | Linux | Doc Type: | If docs needed, set a value |
| Fixed In Version: | | Story Points: | --- |
| Last Closed: | 2020-10-27 16:44:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Bug Blocks: | 1831748 | | |
Description (Sai Sindhur Malleni, 2020-09-22 22:07:45 UTC)
To add... This is a regression from 4.5. This same setup worked on 4.5.

These are the steps used in our playbooks to create the ignition configs and modify them: https://github.com/openshift-kni/baremetal-deploy/blob/d577f5911061f7b8ed5b7bdc02ed84813b8d31ef/ansible-ipi-install/roles/installer/tasks/55_customize_filesystem.yml

We essentially create a fake-root directory and use filetranspiler to modify the ignition configs produced by `create ignition-configs`, adding an ifcfg file that disables NIC eno1 on both masters and workers. It is working on masters but not workers.

(In reply to Brad P. Crochet from comment #1)
> To add... This is a regression from 4.5. This same setup worked on 4.5.

Yes, the same setup/hardware has worked in 4.3, 4.4 and 4.5.

The installer is not involved in configuring the machine via Ignition; that is owned by the RHCOS and MCO teams. Since this configuration is being applied using the stub ignition config, I'm moving to RHCOS for triage. Also, when you move the component, please attach a comment explaining the reason; it helps with context.

@Brad P Crochet Can you provide the journal from the worker nodes? It should contain entries from Ignition showing the files being written out to the host. If the masters are getting configured properly but the workers are not, it makes me wonder if the MCS is not serving up the worker configs properly. Also, if you can attach the full worker + master Ignition configs, that would be useful.

A quick grep of the worker log doesn't show any evidence that `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by Ignition.
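For concreteness, here is a minimal sketch (Python, illustrative names only; not the playbook or filetranspiler code itself) of the Ignition `storage.files` entry that the fake-root plus filetranspiler step effectively injects. The ifcfg contents and mode match the user_data pasted later in this bug:

```python
import json
from urllib.parse import quote

# ifcfg fragment that disables eno1 (contents as seen in the pasted user_data).
ifcfg = "DEVICE=eno1\nBOOTPROTO=none\nONBOOT=no\n"

# Ignition storage.files entry equivalent to placing the file in a fake-root
# tree and transpiling it into the pointer config (sketch, not the tool's code).
entry = {
    "path": "/etc/sysconfig/network-scripts/ifcfg-eno1",
    "mode": 0o664,  # decimal 436, as it appears in the rendered config
    "overwrite": True,
    "contents": {"source": "data:," + quote(ifcfg)},
}

print(json.dumps(entry, indent=2))
```

Decoding the resulting `data:` URL with `urllib.parse.unquote` round-trips back to the original ifcfg text.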
(Similarly, there is no evidence of it happening on the masters, but that log looks like the Ignition portion was truncated from the beginning of the log.)

The platform ID of the worker node suggests it is an OpenStack environment, and we have reports of troubles with early networking in BZ#1877740.

Additionally, changes have recently been made to upstream Ignition around OpenStack which will appear in the new boot images used by `openshift-install`:

https://github.com/coreos/ignition/pull/1094
https://github.com/coreos/ignition/pull/1095
https://github.com/coreos/ignition/pull/1098
https://github.com/openshift/installer/pull/4206

I would like to see if this problem is repeatable with the new installer/RHCOS boot image. @Jonathan do you think you could take a look at this?

(In reply to Micah Abbott from comment #9)
> A quick grep of the worker log doesn't show any evidence that
> `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by
> Ignition. (Similarly, there is no evidence of it happening on the masters,
> but that log looks like the Ignition portion was truncated from the
> beginning of the log)
>
> The platform ID of the worker node suggests it is an OpenStack environment
> and we have reports of troubles with early networking in BZ#1877740

This is OpenShift on Baremetal; not sure why the platform ID would report as OpenStack.

> Additionally, changes have recently been made to upstream Ignition around
> OpenStack which will appear in the new boot images used by
> `openshift-install`:
>
> https://github.com/coreos/ignition/pull/1094
> https://github.com/coreos/ignition/pull/1095
> https://github.com/coreos/ignition/pull/1098
> https://github.com/openshift/installer/pull/4206
>
> I would like to see if this problem is repeatable with the new
> installer/RHCOS boot image.

Yes, the master journal seems truncated; is there any way to get it back?
I'm just using the command `journalctl`, but that doesn't give me everything. In the case of the worker nodes, the full log seems to be present.

I can confirm from a previous look: when I looked at `journalctl -b -1` on the masters, I could see /etc/sysconfig/network-scripts/ifcfg-eno1 being written on the masters (and the file was actually present), whereas the same could not be said for the workers.

[root@master-0 core]# grep -inr ifcfg-eno1 master_journal_new.log
2080:Sep 23 19:51:28 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [started] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2081:Sep 23 19:51:28 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [finished] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2471:Sep 23 19:51:32 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [started] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"
2472:Sep 23 19:51:32 e16-h12-b02-fc640.rdu2.scalelab.redhat.com ignition[1697]: INFO : files: createFilesystemsFiles: createFiles: op(21): [finished] writing file "/sysroot/etc/sysconfig/network-scripts/ifcfg-eno1"

(In reply to Sai Sindhur Malleni from comment #10)
> (In reply to Micah Abbott from comment #9)
> > A quick grep of the worker log doesn't show any evidence that
> > `/etc/sysconfig/network-scripts/ifcfg-eno1` is being written out by
> > Ignition. (Similarly, there is no evidence of it happening on the masters,
> > but that log looks like the Ignition portion was truncated from the
> > beginning of the log)
> >
> > The platform ID of the worker node suggests it is an OpenStack environment
> > and we have reports of troubles with early networking in BZ#1877740
> This is OpenShift on Baremetal, not sure why the platform ID would report as
> OpenStack.
$ grep -m 1 ignition.platform journal_worker.log
Sep 22 20:06:57 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-7fb133ae75316366f0c9ead0f7b95f476a097e0b5c443fa5e584d52942193364/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot rd.neednet=1 ostree=/ostree/boot.1/rhcos/7fb133ae75316366f0c9ead0f7b95f476a097e0b5c443fa5e584d52942193364/0 ignition.platform.id=openstack

I think this is a "quirk" of BM IPI, but perhaps I am speaking out of line.

(In reply to Micah Abbott from comment #15)
> $ grep -m 1 ignition.platform journal_worker.log
> [kernel command line as above, ending in ignition.platform.id=openstack]
>
> I think this is a "quirk" of BM IPI, but perhaps I am speaking out of line.

Thanks for the clarification; that looks like it warrants a BZ. I'm going to open a separate BZ.
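The platform check above can be reproduced without grep. A small sketch (the cmdline string is abridged from the worker log; the ostree paths are shortened here) that splits the kernel command line into key=value pairs:

```python
# Kernel command line abridged from the worker journal above; tokens without
# an "=" (like ignition.firstboot) are bare flags and are skipped here.
cmdline = (
    "BOOT_IMAGE=(hd0,gpt1)/ostree/.../vmlinuz-4.18.0-211.el8.x86_64 "
    "rhcos.root=crypt_rootfs random.trust_cpu=on ignition.firstboot "
    "rd.neednet=1 ignition.platform.id=openstack"
)

# Parse "key=value" tokens into a dict; split on the first "=" only, since
# values like BOOT_IMAGE paths may themselves contain "=" elsewhere.
args = dict(tok.split("=", 1) for tok in cmdline.split() if "=" in tok)
print(args["ignition.platform.id"])  # -> openstack, even on BM IPI
```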
This is a regression in 4.6 as well; this worked on 4.6.0-0.nightly-2020-09-01-042030.

Looking at the worker and master journal logs here, it does seem like Ignition isn't writing the `/etc/sysconfig/network-scripts/ifcfg-eno1` file at all in the worker case, but is in the master case. It could be a bug in Ignition, but it's much more likely that the config Ignition was given simply doesn't have that file in the worker case. This is unlikely to be an RHCOS bug and more likely something in the provisioning stack. But to double-check, here are two things to try:

1. On a worker node, check whether the file is in the final merged worker config that Ignition handled, using e.g. `jq ".storage.files[] | select(.path==\"/etc/sysconfig/network-scripts/ifcfg-eno1\")" /run/ignition.json`.
2. On a worker node, mount the config drive and look at the user-data directly in there to see if it matches what you provided.

Can you check those? If you don't see the sysconfig file in those configs, we should get the Bare Metal IPI folks to take a look. Can you also provide the exact procedure you use to run the install?

>> I think this is a "quirk" of BM IPI, but perhaps I am speaking out of line.
>
> Thanks for the clarification, looks like that warrants a BZ. I'm going to open that up a separate BZ.

Correct, the BM IPI flow currently uses the OpenStack image by design, because it reuses the provisioning stack from OpenStack. They will eventually migrate to the bare metal image, so this is definitely already on their radar.

Could you also add the contents of the rendered-worker being served? i.e.
what's in "https://192.168.222.3:22623/config/worker". You can check `oc get mc` and paste the output of `oc describe mc/rendered-worker-xxx`, just to make sure we're not overwriting your file somehow.

Looks like the master nodes have a mention of ifcfg-eno1:

[kni@e16-h12-b01-fc640 config-2]$ grep -inr eno1 *
master/openstack/2012-08-10/user_data:1:{"ignition": {"config": {"merge": [{"source": "https://192.168.222.3:22623/config/master"}]}, "security": {"tls": {"certificateAuthorities": [{"source": "data:text/plain;charset=utf-8;base64,[base64 CA certificate elided]"}]}}, "version": "3.1.0"}, "storage": {"files": [{"path": "/etc/sysconfig/network-scripts/ifcfg-eno1", "mode": 436, "overwrite": true, "contents": {"source": "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"}}]}}

There is no such thing for the worker nodes.

Indeed, it seems like the bare metal provisioning stack is somehow not properly passing the worker Ignition configs. Moving to the Bare Metal component.

I'm currently not clear how/why this may be specific to baremetal, but I'll take the bz for now and try to reproduce so we can clarify that.

Ok, I reproduced this and believe it may be related to the recent MCO changes to manage the pointer ignition config.

I use dev-scripts, ref https://github.com/openshift-metal3/dev-scripts/blob/master/utils.sh#L47..L94 - which uses openshift-baremetal-install (openshift-install compiled with flags to enable baremetal IPI).

Summary of the process:

openshift-install create manifests
<add some manifests for NTP, unrelated to this bug>
openshift-install create ignition-configs
<merge some ignition for masters/workers to add a test file>

The ignition I merged looks like:

$ cat ignition/file_example.ign | jq .
{
  "ignition": {
    "version": "3.1.0"
  },
  "storage": {
    "files": [
      {
        "path": "/etc/test",
        "mode": 436,
        "contents": {
          "source": "data:,test-foo%0A"
        }
      }
    ]
  }
}

We see this reflected in the installer-generated ignition config (for both masters and workers, but I'll only refer to workers below):

$ jq .storage worker.ign
{
  "files": [
    {
      "path": "/etc/test",
      "mode": 436,
      "contents": {
        "source": "data:,test-foo%0A"
      }
    }
  ]
}

We also see it in the manifest passed via the bootstrap.ign to create the user-data secret:

$ cat bootstrap.ign | jq -r '.storage.files[]|select(.path=="/opt/openshift/openshift/99_openshift-cluster-api_worker-user-data-secret.yaml")' | jq -r .contents.source | sed "s/data:text\plain;charset=utf-8;base64,//" | base64 -d | yq -r .data.userData | base64 -d | jq .storage.files
[
  {
    "path": "/etc/test",
    "mode": 436,
    "contents": {
      "source": "data:,test-foo%0A"
    }
  }
]

However, after running create cluster, we see the following:

$ oc get secret worker-user-data-managed -o json | jq -r .data.userData | base64 -d | jq .storage
{}

This is the data consumed via the machine API to deploy the workers, which explains why the injected file is missing:

$ for i in 20 21 22 23 24; do ssh core.111.$i "hostname && cat /etc/test"; done
master-0
test-foo
master-1
test-foo
master-2
test-foo
worker-0
cat: /etc/test: No such file or directory
worker-1
cat: /etc/test: No such file or directory

The reason the masters contain the file is that these get deployed via terraform (using the installer data directly, which, as we see above, does contain the additional ignition config).

I suspect this is related to the changes in https://github.com/openshift/machine-config-operator/pull/1792 - in particular, it seems the generated config doesn't consider any additional content from the installer-generated resource? https://github.com/openshift/machine-config-operator/pull/1792/files#diff-5926e0cec606d949b66f823310790298

Probably needs a close look from the MCO team, so I'll reassign this for their input.
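What the reproduction shows is the user's `storage` section being dropped when the managed user-data is regenerated. A hedged sketch of the expected behavior (illustrative Python, not MCO code; function and variable names are assumptions): `storage.files` entries from the installer-generated fragment should be carried, keyed by path, into the managed user-data, whereas the 4.6 regression effectively returned the empty `base` below instead of `merged`:

```python
from urllib.parse import unquote

# Managed config as regenerated by the MCO (storage lost in the regression).
base = {"ignition": {"version": "3.1.0"}, "storage": {"files": []}}

# Installer-generated fragment carrying the user's customization, as pasted above.
fragment = {"storage": {"files": [
    {"path": "/etc/test", "mode": 436,
     "contents": {"source": "data:,test-foo%0A"}},
]}}

def merge_storage_files(base, fragment):
    """Merge files by path: fragment entries replace same-path base entries."""
    by_path = {f["path"]: f for f in base["storage"]["files"]}
    by_path.update({f["path"]: f for f in fragment["storage"]["files"]})
    out = dict(base)
    out["storage"] = {"files": list(by_path.values())}
    return out

merged = merge_storage_files(base, fragment)
source = merged["storage"]["files"][0]["contents"]["source"]
print(unquote(source[len("data:,"):]))  # -> test-foo
```

With the merge in place, the `ssh`/`cat /etc/test` loop above would succeed on workers as well as masters.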
Is there a reason you're not providing this change via MachineConfig? If you don't have machine-specific configuration, that's definitely the preferred path. (Doing it via MachineConfig should work transparently in both 4.5 and current 4.6, and has the additional benefit that the changes are tracked by the MCO, so you can see and change them "day 2".)

(In reply to Colin Walters from comment #29)
> Is there a reason you're not providing this change via MachineConfig? If
> you don't have machine-specific configuration that's definitely the
> preferred path.

This is to disable the lab public interface on first boot; we really need this interface disabled before everything else to be able to get repeatable deployments in our labs. Is MachineConfig done early enough so that the interface is disabled and the OVS system configuration picks the right interface to add to br-ex? (In our case, if this is not done soon enough, the NIC that should have been disabled has the default route and ends up getting added to br-ex, breaking communication between the machine-config daemon on the worker nodes and the MCS.)

> This is to disable the lab public interface on first boot, we really need this interface disabled before everything else to be able to get repeatable deployments in our labs. Is MachineConfig done early enough so that the interface is disabled and the ovs system-configuration picks the right interface to add to br-ex?

Yes. Any MachineConfig you provide "day 1" via https://github.com/openshift/installer/blob/master/docs/user/customization.md#install-time-customization-for-machine-configuration is included in the full rendered Ignition that is provided on firstboot for machines. This matches the general philosophy of Ignition: your system is either configured or not. There's no "half configured" state, and we try hard to avoid "multiple phase configuration" steps.
The only power that you get from explicitly customizing the pointer configuration today is support for *per machine* configuration; see discussion in https://github.com/openshift/machine-config-operator/issues/1720

To clarify, in the general case it depends whether the customization needs to affect the Ignition stage itself, or just the real root. If the latter, then indeed, day-1 MCs should work fine. Otherwise, today the hack is to inject things into the qcow2 (to be replaced by a cleaner approach once the bare metal provisioning stack natively supports these customizations; though worth noting that with https://github.com/openshift/enhancements/pull/467, even the former case would be addressed by day-1 MCs).

(In reply to Colin Walters from comment #29)
> Is there a reason you're not providing this change via MachineConfig? If
> you don't have machine-specific configuration that's definitely the
> preferred path.

Our original deployment scripting design for the shared labs avoided the MC approach, partially due to lack of expertise and partially due to an apparent need to disable the network interface very early in the process. So we built, and have been using since 4.3, a method based around modifying the ignition config files. Considering this, though, I just attempted a deployment in the lab where I inserted a MC file prior to install, and it turns out that did work for me; I was able to get a complete cluster up and running with masters and workers. So for our lab purposes we may now have a workaround for this BZ, and potentially a future change for our deployment methods.

(In reply to Colin Walters from comment #29)
> Is there a reason you're not providing this change via MachineConfig? If
> you don't have machine-specific configuration that's definitely the
> preferred path.

I guess the problem is this is a currently documented/supported interface, even if it's not the preferred one?
I'm aware of folks in the field using the same approach, so I think we'll need to avoid breaking it, even if the plan is to deprecate the ignition customization and mandate that all config be passed via MachineConfig manifests?

(In reply to Steven Hardy from comment #35)
> I guess the problem is this is a currently documented/supported interface,
> even if it's not the preferred one?
>
> I'm aware of folks in the field using the same approach, so I think we'll
> need to avoid breaking it, even if the plan is to deprecate the ignition
> customization and mandate all config should be passed via MachineConfig
> manifests?

Right, we're moving forward with the revert that caused this in 4.6 indeed. We'll work in a future release to provide that feature again (synced by the MCO pointer config). The revert is aligned with the fact that there are users leveraging the pointer config interface, and we don't want to break them.
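For readers hitting the same problem, the MachineConfig equivalent of the ifcfg customization looks roughly like the manifest below. This is a sketch: the name `99-worker-disable-eno1` is an arbitrary example, and the contents/mode mirror the user_data pasted earlier in this bug. Dropped into the `openshift/` directory after `openshift-install create manifests`, it is included in the rendered worker Ignition on firstboot:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-disable-eno1   # arbitrary example name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/sysconfig/network-scripts/ifcfg-eno1
          mode: 436        # decimal, i.e. 0664, matching the pasted config
          overwrite: true
          contents:
            source: "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"
```

An analogous manifest with the `master` role label would cover the control-plane nodes.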
This is also slightly related to the flattened ignition config enhancement.

> This is also slightly related to the flattened ignition config enhancement

Yes, that's true. If we didn't have to support this interface, we could just use the rendered config and ignore the pointer config (which we did previously; that broke this interface in a similar way, ref https://bugzilla.redhat.com/show_bug.cgi?id=1833483, and led to a revert of that implementation).

Verified that this is fixed and working in:

[kni@e16-h12-b01-fc640 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-03-051134   True        False         105s    Cluster version is 4.6.0-0.nightly-2020-10-03-051134

Marking this BZ verified per comment #39.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196