Bug 1741296
| Summary: | Systems with multiple nics fail to boot/complete an install. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Eric Rich <erich> |
| Component: | RHCOS | Assignee: | Steve Milner <smilner> |
| Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.z | CC: | bbreard, dustymabe, imcleod, jligon, nstielau, walters |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| : | 1741694 | Environment: | |
| Last Closed: | 2019-10-16 06:35:55 UTC | Type: | Bug |
| Bug Blocks: | 1741694 | | |
This bug looks similar to:

- https://bugzilla.redhat.com/show_bug.cgi?id=1715203
- https://bugzilla.redhat.com/show_bug.cgi?id=1715194

Yeah this is from William. I believe we hit this because the oob management interface was attaching as a USB NIC and NetworkManager was assigning the default route to that interface. This was stopping the coreos-install, and we fixed that with dracut options. The workaround at boot was to lengthen the timeout on the NetworkManager-wait-online unit to 300.

We still have the defaults set in the 4.2 stream:

- `Type=oneshot`
- `ExecStart=/usr/bin/nm-online -s -q --timeout=30`
- `RemainAfterExit=yes`

...but as I'm writing this I want to say that there was a fix in CRI-O so it wouldn't be impacted at all by this. If that's true, the default timeout is ideal.

Possible cri-o fix referenced: https://github.com/cri-o/cri-o/pull/2662

Note in 4.2 we're not starting `crio.service` by default; it's started by the installer on the bootstrap node, which needs networking anyway to pull the release image. Are we sure changes to RHCOS here are still needed in 4.2?

Both QE and the original reporter lack access to the hardware where this issue was originally seen, so it is not possible to completely verify the fix.
We are able to confirm that the version of RHCOS in 4.2.0-0.nightly-2019-09-18-114152 has the proposed timeout workaround included.
```
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.2.0-0.nightly-2019-09-18-114152 True False 84m Cluster version is 4.2.0-0.nightly-2019-09-18-114152
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-131-105.us-west-2.compute.internal Ready master 103m v1.14.6+75b9923b0
ip-10-0-137-4.us-west-2.compute.internal Ready worker 95m v1.14.6+75b9923b0
ip-10-0-154-225.us-west-2.compute.internal Ready master 103m v1.14.6+75b9923b0
ip-10-0-157-206.us-west-2.compute.internal Ready worker 95m v1.14.6+75b9923b0
ip-10-0-167-176.us-west-2.compute.internal Ready worker 95m v1.14.6+75b9923b0
ip-10-0-173-108.us-west-2.compute.internal Ready master 103m v1.14.6+75b9923b0
$ oc debug node/ip-10-0-137-4.us-west-2.compute.internal
Starting pod/ip-10-0-137-4us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.137.4
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /usr/lib/systemd/system/NetworkManager-wait-online.service.d/timeout.conf
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -s -q --timeout=300
sh-4.4# systemctl status NetworkManager-wait-online.service
● NetworkManager-wait-online.service - Network Manager Wait Online
Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/NetworkManager-wait-online.service.d
└─timeout.conf
Active: active (exited) since Thu 2019-09-19 13:14:47 UTC; 1h 38min ago
Docs: man:nm-online(1)
Process: 1001 ExecStart=/usr/bin/nm-online -s -q --timeout=300 (code=exited, status=0/SUCCESS)
Main PID: 1001 (code=exited, status=0/SUCCESS)
CPU: 36ms
Sep 19 13:14:47 localhost systemd[1]: Starting Network Manager Wait Online...
Sep 19 13:14:47 ip-10-0-137-4 systemd[1]: Started Network Manager Wait Online.
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bf3ac436a9049274fb59e615a906fe1809a18e0c62b8085bbce34ba37ac2954a
CustomOrigin: Managed by machine-config-operator
Version: 42.80.20190918.0 (2019-09-18T05:52:50Z)
pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a2427869e1675f392cf1feb10ed62214801b74c9a0801ec986b3deb5b51a2d2
CustomOrigin: Image generated via coreos-assembler
Version: 42.80.20190827.1 (2019-08-27T20:55:34Z)
```
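The check above was done by reading the drop-in by eye. As a minimal sketch, the same check can be scripted by extracting the `--timeout` value from the last `ExecStart=` line of a unit or drop-in. The function name `nm_online_timeout` and the temporary sample file are hypothetical and exist only for this illustration; the sample content mirrors the `timeout.conf` shown in the transcript.

```shell
#!/bin/sh
# Hypothetical helper: print the --timeout value from the last ExecStart= line
# found in the given file(s), or on stdin when no file is given.
# The bare "ExecStart=" reset line carries no timeout and is skipped.
nm_online_timeout() {
    sed -n 's/^ExecStart=.*--timeout=\([0-9][0-9]*\).*/\1/p' "$@" | tail -n 1
}

# Sample drop-in with the same content as the timeout.conf in the transcript.
# The temporary file is for illustration only.
dropin=$(mktemp)
printf '%s\n' \
    '[Service]' \
    'ExecStart=' \
    'ExecStart=/usr/bin/nm-online -s -q --timeout=300' > "$dropin"

timeout_value=$(nm_online_timeout "$dropin")
echo "nm-online timeout: ${timeout_value}s"
rm -f "$dropin"
```

On a node, the helper could instead be fed the merged unit, for example `systemctl cat NetworkManager-wait-online.service | nm_online_timeout`. Drop-ins are printed after the main unit file, so the last `ExecStart=` line should be the effective one.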
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922
Description of problem:

If you have a bare-metal UPI install, and the system has more than one NIC (say 6), the install fails because:

> [systemd]
> Failed Units: 2
> crio.service
> NetworkManager-wait-online.service

Version-Release number of selected component (if applicable): 4.1.z

Additional info:

- I used ip=eno2:dhcp but it still goes and tries to DHCP on all the interfaces.
- I tried disabling other interfaces with ip=eno3:off and so on, but the installer stops once it finds the first one, complaining that this flag cannot be used without specifying static IP addresses.

Results by environment:

- If using a VM as the Bootstrap node ==> everything works
- If simulating a BM install on a virtual environment ==> everything works

When using an actual BM server as the Bootstrap node, it is NOT working. I've tried multiple configuration types and every single one of them fails. Here are some of the most common failures I'm seeing with BM Bootstrap Nodes (rhcos-410.8.20190516):

- If multiple NICs ==> it fails with CRIO and NetworkManager, and bootkube is not even started.
- If using a NIC different from the first NIC (as detected by the system) ==> it fails with NetworkManager-wait-online.service, bootkube starts failing and goes into a loop starting etcd, and even 24 hours later it is still doing this.
- If passing NM configuration to disable other interfaces ==> 3 processes fail: one about certificates, CRIO, and NetworkManager (and it even assigns the wrong flags to the file, "130" instead of "600"; see configs below [1]).
- If passing NM configurations to the Bootstrap Node ==> the bootkube process fails and journalctl shows another process complaining of missing OCP certificate keys under /opt/openshift/<some-path>, and when I go into that folder, there is nothing under /opt.
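The comments on this bug attribute the failure to NetworkManager assigning the default route to the out-of-band management interface. A minimal sketch for checking which interface holds the preferred default route on a multi-NIC host follows. It parses text in the format printed by `ip route show default` and sorts the candidates by metric, where the lowest metric is preferred. The function name `default_route_devs`, the interface name `enp0s20u1`, and the sample routes are made up for illustration and do not come from the affected hardware.

```shell
#!/bin/sh
# Hypothetical helper: read `ip route show default` style lines on stdin and
# print "<metric> <device>" pairs, lowest metric (most preferred) first.
# A route without an explicit metric is treated as metric 0.
default_route_devs() {
    awk '$1 == "default" {
        dev = ""; metric = 0
        for (i = 2; i < NF; i++) {
            if ($i == "dev") dev = $(i + 1)
            if ($i == "metric") metric = $(i + 1)
        }
        if (dev != "") print metric, dev
    }' | sort -n
}

# Made-up sample: a USB management NIC and the intended install NIC both
# received a default route over DHCP.
preferred=$(printf '%s\n' \
    'default via 10.0.0.1 dev eno2 proto dhcp metric 101' \
    'default via 169.254.0.1 dev enp0s20u1 proto dhcp metric 100' \
    | default_route_devs | head -n 1)
echo "preferred default route (metric device): $preferred"
```

On a live system the input would come from `ip route show default | default_route_devs`. If the first line names the management interface rather than the NIC passed via `ip=`, that would be consistent with the explanation given in the comments.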