Bug 1741296 - Systems with multiple nics fail to boot/complete an install.
Summary: Systems with multiple nics fail to boot/complete an install.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Steve Milner
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks: 1741694
 
Reported: 2019-08-14 17:26 UTC by Eric Rich
Modified: 2019-10-16 06:36 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1741694
Environment:
Last Closed: 2019-10-16 06:35:55 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:36:10 UTC

Description Eric Rich 2019-08-14 17:26:25 UTC
Description of problem: If you have a bare metal UPI install and the system has more than one NIC (say 6), the install fails because:

> [systemd]
> Failed Units: 2
>   crio.service
>   NetworkManager-wait-online.service


Version-Release number of selected component (if applicable): 4.1.z

Additional info:

- I used ip=eno2:dhcp, but it still goes and tries to DHCP on all the interfaces.
- I tried disabling the other interfaces with ip=eno3:off and so on, but the installer stops once it finds the first one, complaining that this flag cannot be used without specifying static IP addresses.

- If using a VM as the Bootstrap node ==> everything works
- If simulating a BM install in a virtual environment ==> everything works

When using an actual BM server as the Bootstrap node, it is NOT working. I've tried multiple configuration types and every single one of them fails. Here are some of the most common failures I'm seeing with BM Bootstrap nodes (rhcos-410.8.20190516):
--> If multiple NICs ==> it fails with CRI-O and NetworkManager, and bootkube is not even started
--> If using a NIC different from the first NIC (as detected by the system) ==> it fails with NetworkManager-wait-online.service and bootkube failing to start; it goes into a loop starting etcd, and even 24hrs later it is still doing this
--> If passing NM configuration to disable the other interfaces ==> 3 processes fail: one about certificates, CRI-O, and NetworkManager (and it even assigns the wrong permissions to the file, "130" instead of "600"; see configs below [1])
--> If passing NM configurations to the Bootstrap node ==> the bootkube process fails, and journalctl shows another process complaining of missing OCP certificate keys under /opt/openshift/<some-path>; when I go into that folder, there is nothing under /opt
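
For reference, a minimal way to inspect these failures on the bootstrap node is sketched below (assumptions: SSH access as the core user works, and the hostname is a placeholder):

```
# SSH to the bootstrap node (placeholder hostname) and see what systemd reports as failed.
ssh core@bootstrap.example.com
sudo systemctl list-units --state=failed

# Pull the current-boot logs for the two units that fail in the scenarios above.
sudo journalctl -b -u NetworkManager-wait-online.service -u crio.service --no-pager

# Confirm which ip= arguments the initramfs actually saw and which interfaces
# NetworkManager ended up bringing up.
cat /proc/cmdline
nmcli -t -f DEVICE,STATE,CONNECTION device
```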

Comment 2 Ben Breard 2019-08-14 19:14:02 UTC
Yeah this is from William. 

I believe we hit this because the out-of-band (OOB) management interface was attaching as a USB NIC and NetworkManager was assigning the default route to that interface. This was stopping coreos-install, and we fixed that with dracut options. The workaround at boot was to lengthen the timeout on the NetworkManager-wait-online unit to 300.

We still have the defaults set in the 4.2 stream.

Type=oneshot
ExecStart=/usr/bin/nm-online -s -q --timeout=30
RemainAfterExit=yes

...but as I'm writing this, I seem to recall that there was a fix in CRI-O so it wouldn't be impacted by this at all. If that's true, the default timeout is ideal.
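
For reference, the 300-second workaround is a standard systemd drop-in; a minimal sketch of applying it by hand on a node follows (the shipped fix carries an equivalent drop-in under /usr/lib/systemd, see comment 14):

```
# Manually apply the longer nm-online timeout as a local systemd override.
# (The RHCOS fix ships an equivalent drop-in under /usr/lib/systemd/system/; see comment 14.)
sudo mkdir -p /etc/systemd/system/NetworkManager-wait-online.service.d
sudo tee /etc/systemd/system/NetworkManager-wait-online.service.d/timeout.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -s -q --timeout=300
EOF
sudo systemctl daemon-reload
```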

Comment 4 Steve Milner 2019-08-14 19:36:40 UTC
Possible cri-o fix referenced: https://github.com/cri-o/cri-o/pull/2662

Comment 12 Colin Walters 2019-08-19 13:48:58 UTC
Note in 4.2 we're not starting `crio.service` by default, it's started by the installer on the bootstrap node, which needs networking anyways to pull the release image.

Are we sure changes to RHCOS here are still needed in 4.2?
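
A quick way to check this on a given node (sketch; assumes shell access to the node, e.g. via `oc debug node/...` and `chroot /host` as in comment 14):

```
# Check whether crio.service is enabled to start at boot and whether it is
# currently running on this node.
systemctl is-enabled crio.service
systemctl is-active crio.service
```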

Comment 13 Colin Walters 2019-08-19 14:01:35 UTC
See also https://github.com/openshift/installer/pull/1768

Comment 14 Micah Abbott 2019-09-19 15:19:31 UTC
Both QE and the original reporter lack access to the hardware where this issue was originally seen, so it is not possible to fully verify the fix.

We are able to confirm that the version of RHCOS in 4.2.0-0.nightly-2019-09-18-114152 has the proposed timeout workaround included.


```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-18-114152   True        False         84m     Cluster version is 4.2.0-0.nightly-2019-09-18-114152

$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-131-105.us-west-2.compute.internal   Ready    master   103m   v1.14.6+75b9923b0
ip-10-0-137-4.us-west-2.compute.internal     Ready    worker   95m    v1.14.6+75b9923b0
ip-10-0-154-225.us-west-2.compute.internal   Ready    master   103m   v1.14.6+75b9923b0
ip-10-0-157-206.us-west-2.compute.internal   Ready    worker   95m    v1.14.6+75b9923b0
ip-10-0-167-176.us-west-2.compute.internal   Ready    worker   95m    v1.14.6+75b9923b0
ip-10-0-173-108.us-west-2.compute.internal   Ready    master   103m   v1.14.6+75b9923b0

$ oc debug node/ip-10-0-137-4.us-west-2.compute.internal
Starting pod/ip-10-0-137-4us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.137.4
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /usr/lib/systemd/system/NetworkManager-wait-online.service.d/timeout.conf 
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -s -q --timeout=300
sh-4.4# systemctl status NetworkManager-wait-online.service                                                                                    
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/NetworkManager-wait-online.service.d 
           └─timeout.conf
   Active: active (exited) since Thu 2019-09-19 13:14:47 UTC; 1h 38min ago
     Docs: man:nm-online(1)
  Process: 1001 ExecStart=/usr/bin/nm-online -s -q --timeout=300 (code=exited, status=0/SUCCESS)
 Main PID: 1001 (code=exited, status=0/SUCCESS)
      CPU: 36ms

Sep 19 13:14:47 localhost systemd[1]: Starting Network Manager Wait Online...
Sep 19 13:14:47 ip-10-0-137-4 systemd[1]: Started Network Manager Wait Online.
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bf3ac436a9049274fb59e615a906fe1809a18e0c62b8085bbce34ba37ac2954a
              CustomOrigin: Managed by machine-config-operator
                   Version: 42.80.20190918.0 (2019-09-18T05:52:50Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a2427869e1675f392cf1feb10ed62214801b74c9a0801ec986b3deb5b51a2d2
              CustomOrigin: Image generated via coreos-assembler
                   Version: 42.80.20190827.1 (2019-08-27T20:55:34Z)

```

Comment 15 errata-xmlrpc 2019-10-16 06:35:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

