Bug 1991928

Summary:          Installation with multiple NIC failed on OCP 4.9
Product:          Red Hat Enterprise Linux 8
Component:        NetworkManager
Version:          8.4
Hardware:         s390x
OS:               Linux
Status:           CLOSED NOTABUG
Severity:         high
Priority:         high
Target Milestone: beta
Target Release:   8.4
Keywords:         Triaged
Reporter:         Muhammad Adeel (IBM) <madeel>
Assignee:         NetworkManager Development Team <nm-team>
QA Contact:       Desktop QE <desktop-qa-list>
CC:               anbhat, atragler, bfournie, bgalvani, christian.lapolt, danili, dslavens, lrintel, mtarsel, rkhan, sukulkar, thaller, till, wolfgang.voesch
Type:             Bug
Last Closed:      2021-09-07 08:53:15 UTC
Attachments:
  journalctl NetworkManager logs

Description Muhammad Adeel (IBM) 2021-08-10 12:12:35 UTC
Created attachment 1812736
journalctl NetworkManager logs

Description of problem:
OCP 4.9 installation with multiple ip= kernel parameters fails with DNS resolution errors. The multiple-NIC installation procedure is described here: https://docs.openshift.com/container-platform/4.7/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-static-network_installing-bare-metal

Version-Release number of selected component (if applicable):
RHCOS version: 49.84.202108032348-0
OpenShift: 4.9.0-0.nightly-s390x-2021-08-04-204925

How reproducible:
Adding multiple ip= options for the bootstrap node should reproduce the problem. We are seeing this with an OSA device and an additional RoCE Ethernet card.

Actual results:
The bootstrap node sometimes does not boot, failing to fetch the rootfs.
Sometimes it boots, but with the hostname set to localhost: [core@localhost ~]
The "search ocp-m1314001.lnxne.boe" entry is always missing from resolv.conf.

Expected results:
The bootstrap node should boot as it does in a single-NIC installation.

Additional info:
NetworkManager logs are attached.

Comment 1 Muhammad Adeel (IBM) 2021-08-10 12:16:45 UTC
A similar problem was observed in BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1974411

Comment 2 Dan Li 2021-08-11 14:10:56 UTC
Muhammad will add additional details on the two problems; there is a workaround for problem #2. Once those details are in, we can decide whether this bug belongs with the networking team. Also setting the "reviewed-in-sprint" flag, as this bug is still being evaluated and is unlikely to be resolved before the end of the sprint (August 14th).

Comment 3 Muhammad Adeel (IBM) 2021-08-11 14:53:56 UTC
There are two problems associated with this BZ:

1. Sometimes, even with a single NIC, the network doesn't come up and the CoreOS rootfs can't be fetched from the HTTP server. The console logs show that CoreOS keeps retrying to fetch the rootfs, but it never finishes.

2. Add an additional NIC to the node by using ip= in the param file:
      ip=10.13.114.2::10.13.114.1:255.255.255.0::enc1000:none
      ip=10.100.214.2::10.100.214.1:255.255.255.0::enP513s129:none
   Here,
       enc1000 is the primary network interface, where the DNS server is reachable.
       enP513s129 is an additional network which has no DNS server.
    
    On some machines, NetworkManager (NM) picks enc1000 and selects the gateway IP 10.13.114.1 as the default route, as shown in the NM logs:
       policy: set 'enc1000' (enc1000) as default for IPv4 routing and DNS
    In this case the cluster installation succeeds because the correct default route has been set up.

    However, on other machines NM selects enP513s129 as the primary interface and sets 10.100.214.1 as the default route. In this case the cluster installation fails, which is
    expected because there is no DNS server on that route. A workaround is to remove the gateway IP from the ip= param of the secondary NIC, which leaves only the single
    10.13.114.1 default route (see the ip= sketch after this comment).

I think we need to understand two things here:
a. Why does the NIC probe order change between machines, and in particular, which rule does NM depend on?
b. How do we ensure the correct default route is set up?
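
For reference, the relevant fields of the dracut ip= parameter used above are (per `man dracut.cmdline`):

   ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<client-hostname>:<interface>:{none|off|dhcp|on|any|dhcp6|auto6|ibft}

The workaround for problem #2, dropping the gateway from the secondary NIC only, would then look roughly like this (a sketch; only the gateway field of the second line changes):

   ip=10.13.114.2::10.13.114.1:255.255.255.0::enc1000:none
   ip=10.100.214.2:::255.255.255.0::enP513s129:none

With no gateway configured on enP513s129, NM has only one candidate default route (via 10.13.114.1 on enc1000), so the outcome no longer depends on the probe order.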

Comment 4 Dan Li 2021-08-12 20:07:21 UTC
Re-assigning to the networking team to evaluate this bug further, as Muhammad has described the behavior in Comment 3. Please feel free to set the "Blocker?" status as the team sees fit.

Comment 5 Dan Li 2021-08-12 20:10:12 UTC
Please also feel free to assign this to the correct sub-component; I only took a guess based on multi-NIC's relation to the NMState Operator.

Comment 6 Ben Nemec 2021-08-20 16:55:16 UTC
This doesn't appear to have any involvement from OpenShift networking. It's purely NetworkManager/dracut behavior. Sending to the NetworkManager team for their input.

I can say they will most likely ask for trace logs from NetworkManager. In this case, you should be able to enable trace logging by passing a machine-config manifest to the installer that creates a file in /etc/NetworkManager/conf.d with the content:

[logging]
level=TRACE
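
For reference, a minimal sketch of such a MachineConfig manifest (the role label and the file/manifest names here are illustrative; Ignition spec 3.2.0 matches OCP 4.9):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master   # illustrative; use the role you are debugging
  name: 99-master-nm-trace-logging                   # illustrative name
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/NetworkManager/conf.d/99-trace-logging.conf
          mode: 420        # decimal for octal 0644
          overwrite: true
          contents:
            # URL-encoded form of "[logging]\nlevel=TRACE\n"
            source: data:,%5Blogging%5D%0Alevel%3DTRACE%0A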

Comment 8 Thomas Haller 2021-08-23 08:18:37 UTC
(In reply to Ben Nemec from comment #6)
> I can say they will most likely ask for trace logs from NetworkManager. In
> this case, you should be able to enable trace logging by passing a
> machine-config manifest to the installer that creates a file in
> /etc/NetworkManager/conf.d with the content:
> 
> [logging]
> level=TRACE

Yes, please. You can enable debug logging during boot by setting `rd.debug` (see `man dracut.cmdline`). Then provide the complete logs. Thank you.
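
For example, appended to the parameter file alongside the ip= options from comment 3 (illustrative):

   rd.debug ip=10.13.114.2::10.13.114.1:255.255.255.0::enc1000:none ip=10.100.214.2::10.100.214.1:255.255.255.0::enP513s129:none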

Comment 9 Muhammad Adeel (IBM) 2021-08-25 08:37:34 UTC
I was able to identify the root cause of the problem: it was the multiple default routes. We will update the installation part of the documentation so that it reflects the correct multiple-NIC configuration. Thank you.

Comment 10 Muhammad Adeel (IBM) 2021-09-02 11:54:44 UTC
I will close this one and open a corresponding documentation BZ. Thomas, is there any action item required from your side?

Comment 11 Muhammad Adeel (IBM) 2021-09-02 12:42:31 UTC
The corresponding documentation BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2000583

Comment 12 Thomas Haller 2021-09-02 14:20:16 UTC
Sorry, I didn't understand your comment 9 and comment 10.

Am I reading this correctly, that you think there is no bug here?

If yes, then we can indeed just close it...

Comment 13 Muhammad Adeel (IBM) 2021-09-07 07:25:00 UTC
Yes, there is no bug.

Comment 14 Thomas Haller 2021-09-07 08:53:15 UTC
Thanks. Closing per comment 13.


If something is missing, please comment or reopen. Thank you!!