Description of problem:
On a bare-metal (BM) UPI install where the system has more than one NIC (say, six), the install fails with:

> [systemd]
> Failed Units: 2
> crio.service
> NetworkManager-wait-online.service

Version-Release number of selected component (if applicable):
4.1.z

Additional info:
- I used ip=eno2:dhcp, but the system still tries DHCP on all the interfaces.
- I tried disabling the other interfaces with ip=eno3:off and so on, but the installer stops at the first one it finds, complaining that this flag cannot be used without specifying static IP addresses.
- If using a VM as the bootstrap node ==> everything works
- If simulating a BM install in a virtual environment ==> everything works

When using an actual BM server as the bootstrap node, it does NOT work. I have tried multiple configuration types and every single one of them fails. Here are some of the most common failures I'm seeing with BM bootstrap nodes (rhcos-410.8.20190516):

--> If multiple NICs ==> it fails with CRIO and NetworkManager, and bootkube is not even started
--> If using a NIC different from the first NIC (as detected by the system) ==> it fails with NetworkManager-wait-online.service, and bootkube starts failing and goes into a loop starting etcd; even 24 hours later it is still doing this
--> If passing NM configuration to disable other interfaces ==> 3 processes fail: one about certificates, CRIO, and NetworkManager (and it even assigns the wrong permission flags to the file, "130" instead of "600"; see configs below [1])
--> If passing NM configurations to the bootstrap node ==> the bootkube process fails and journalctl shows another process complaining of missing OCP certificate keys under /opt/openshift/<some-path>, and when I go into that folder, there is nothing under /opt
This bug looks similar to:
- https://bugzilla.redhat.com/show_bug.cgi?id=1715203
- https://bugzilla.redhat.com/show_bug.cgi?id=1715194
Yeah, this is from William. I believe we hit this because the OOB management interface was attaching as a USB NIC and NetworkManager was assigning the default route to that interface. This was stopping coreos-install, and we fixed that with dracut options. The workaround at boot was to lengthen the timeout on the NetworkManager-wait-online unit to 300 seconds. We still have the defaults set in the 4.2 stream:

Type=oneshot
ExecStart=/usr/bin/nm-online -s -q --timeout=30
RemainAfterExit=yes

...but as I'm writing this, I want to say that there was a fix in CRI-O so that it wouldn't be impacted by this at all. If that's true, the default timeout is ideal.
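For reference, the timeout workaround described above could be applied as a systemd drop-in along these lines. This is only a sketch: `write_nm_timeout_dropin` is a hypothetical helper name, and using `/etc/systemd/system` as the target for a local override is an assumption, not something taken from the RHCOS change itself.

```shell
#!/bin/sh
# Sketch only: write a drop-in that overrides the nm-online timeout.
# write_nm_timeout_dropin is a hypothetical helper, not an RHCOS tool.
write_nm_timeout_dropin() {
    root="$1"      # unit directory, e.g. /etc/systemd/system (assumed)
    timeout="$2"   # seconds passed to nm-online --timeout
    dir="$root/NetworkManager-wait-online.service.d"
    mkdir -p "$dir" || return 1
    # The empty ExecStart= resets the command list, so the original
    # 30-second command is not run in addition to the new one.
    cat > "$dir/timeout.conf" <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -s -q --timeout=$timeout
EOF
}

# Example (as root), followed by a reload so systemd picks it up:
#   write_nm_timeout_dropin /etc/systemd/system 300
#   systemctl daemon-reload
```

A drop-in leaves the packaged unit file untouched, so the override can be removed later by deleting `timeout.conf` and reloading systemd.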
Possible cri-o fix referenced: https://github.com/cri-o/cri-o/pull/2662
Note that in 4.2 we're not starting `crio.service` by default; it's started by the installer on the bootstrap node, which needs networking anyway to pull the release image. Are we sure changes to RHCOS here are still needed in 4.2?
See also https://github.com/openshift/installer/pull/1768
Both QE and the original reporter lack access to the hardware where this issue was originally seen, so it is not possible to completely verify the fix. We are able to confirm that the version of RHCOS in 4.2.0-0.nightly-2019-09-18-114152 includes the proposed timeout workaround.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-18-114152   True        False         84m     Cluster version is 4.2.0-0.nightly-2019-09-18-114152

$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-131-105.us-west-2.compute.internal   Ready    master   103m   v1.14.6+75b9923b0
ip-10-0-137-4.us-west-2.compute.internal     Ready    worker   95m    v1.14.6+75b9923b0
ip-10-0-154-225.us-west-2.compute.internal   Ready    master   103m   v1.14.6+75b9923b0
ip-10-0-157-206.us-west-2.compute.internal   Ready    worker   95m    v1.14.6+75b9923b0
ip-10-0-167-176.us-west-2.compute.internal   Ready    worker   95m    v1.14.6+75b9923b0
ip-10-0-173-108.us-west-2.compute.internal   Ready    master   103m   v1.14.6+75b9923b0

$ oc debug node/ip-10-0-137-4.us-west-2.compute.internal
Starting pod/ip-10-0-137-4us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.137.4
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /usr/lib/systemd/system/NetworkManager-wait-online.service.d/timeout.conf
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -s -q --timeout=300

sh-4.4# systemctl status NetworkManager-wait-online.service
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/NetworkManager-wait-online.service.d
           └─timeout.conf
   Active: active (exited) since Thu 2019-09-19 13:14:47 UTC; 1h 38min ago
     Docs: man:nm-online(1)
  Process: 1001 ExecStart=/usr/bin/nm-online -s -q --timeout=300 (code=exited, status=0/SUCCESS)
 Main PID: 1001 (code=exited, status=0/SUCCESS)
      CPU: 36ms

Sep 19 13:14:47 localhost systemd[1]: Starting Network Manager Wait Online...
Sep 19 13:14:47 ip-10-0-137-4 systemd[1]: Started Network Manager Wait Online.

sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bf3ac436a9049274fb59e615a906fe1809a18e0c62b8085bbce34ba37ac2954a
              CustomOrigin: Managed by machine-config-operator
                   Version: 42.80.20190918.0 (2019-09-18T05:52:50Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a2427869e1675f392cf1feb10ed62214801b74c9a0801ec986b3deb5b51a2d2
              CustomOrigin: Image generated via coreos-assembler
                   Version: 42.80.20190827.1 (2019-08-27T20:55:34Z)
```
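The same check could be scripted when several nodes need to be verified. The following is a rough sketch that extracts the configured timeout from a drop-in file; `nm_online_timeout` is a hypothetical helper name, and it assumes the drop-in uses the `--timeout=<seconds>` form shown in the output above.

```shell
#!/bin/sh
# Sketch only: print the nm-online timeout set in a unit or drop-in file.
# nm_online_timeout is a hypothetical helper name.
nm_online_timeout() {
    # Match ExecStart= lines carrying --timeout=<digits>; the empty
    # "ExecStart=" reset line does not match. If several lines match,
    # report the last one in the file.
    sed -n 's/^ExecStart=.*--timeout=\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# Example, run on the node after `chroot /host`:
#   nm_online_timeout /usr/lib/systemd/system/NetworkManager-wait-online.service.d/timeout.conf
```

The helper prints nothing when no `--timeout=` option is present, which a calling script can treat as "no explicit timeout configured in this file".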
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922