Bug 1991928 - Installation with multiple NIC failed on OCP 4.9
Summary: Installation with multiple NIC failed on OCP 4.9
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: NetworkManager
Version: 8.4
Hardware: s390x
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: 8.4
Assignee: NetworkManager Development Team
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-10 12:12 UTC by Muhammad Adeel (IBM)
Modified: 2021-09-07 09:05 UTC (History)
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-07 08:53:15 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
journalctl network manager logs (15.57 KB, text/plain)
2021-08-10 12:12 UTC, Muhammad Adeel (IBM)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | RHELPLAN-94363 | 0 | None | None | None | 2021-08-22 14:02:08 UTC

Description Muhammad Adeel (IBM) 2021-08-10 12:12:35 UTC
Created attachment 1812736 [details]
journalctl network manager logs

Description of problem:
OCP 4.9 installation with multiple ip= kernel parameters failed at DNS resolution. Installation with multiple NICs is described here: https://docs.openshift.com/container-platform/4.7/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-static-network_installing-bare-metal

Version-Release number of selected component (if applicable):
RHCOS version: 49.84.202108032348-0
OpenShift: 4.9.0-0.nightly-s390x-2021-08-04-204925

How reproducible:
Adding multiple ip= options to the bootstrap node should reproduce the problem. We are seeing this with OSA and an additional RoCE Ethernet card.

Steps to Reproduce:
1.
2.
3.

Actual results:
The bootstrap node sometimes does not boot and fails while fetching the rootfs.
Sometimes it boots, but with a localhost hostname: [core@localhost ~]
and the "search ocp-m1314001.lnxne.boe" entry is always missing from resolv.conf.

Expected results:
The bootstrap node should boot as it does in a single-NIC installation.

Additional info:
NetworkManager logs are attached.

Comment 1 Muhammad Adeel (IBM) 2021-08-10 12:16:45 UTC
A similar problem was observed in BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1974411

Comment 2 Dan Li 2021-08-11 14:10:56 UTC
Muhammad will add details on the two problems; there is a workaround for problem #2. Then we can decide whether this bug belongs to the networking team. Also setting the "reviewed-in-sprint" flag, as this bug is still in evaluation and is unlikely to be resolved before the end of the sprint (August 14th).

Comment 3 Muhammad Adeel (IBM) 2021-08-11 14:53:56 UTC
There are two problems associated with this BZ:

1. Sometimes, even with a single NIC, the network does not come up and the CoreOS rootfs cannot be fetched from the HTTP server. The console logs show that CoreOS keeps retrying to fetch the rootfs but never finishes.

2. Add an additional NIC to the node by using ip= in the param file:
      ip=10.13.114.2::10.13.114.1:255.255.255.0::enc1000:none
      ip=10.100.214.2::10.100.214.1:255.255.255.0::enP513s129:none
   Here,
       enc1000 is the primary network interface, where the DNS server is reachable.
       enP513s129 is an additional network, which has no DNS server.
    
    On some machines, NetworkManager (NM) picks up enc1000 and selects gateway IP 10.13.114.1 as the default route, as shown in the NM logs:
       policy: set 'enc1000' (enc1000) as default for IPv4 routing and DNS
    In this case the cluster installation succeeds because the correct default route has been set up.

    However, on other machines NM selects enP513s129 as the primary interface and sets 10.100.214.1 as the default route. In this case the cluster installation fails,
    which is expected because there is no DNS on that route. A workaround is to remove the gateway IP from the ip= param for enP513s129, which leaves 10.13.114.1 as the only default route.

I think we need to understand two things here:
a. Why does the NIC probe order change between machines, and in particular which rule does NM depend on?
b. How do we set up our default route to be the correct one?
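
As a rough illustration of the multiple-default-route situation described in this comment, the Python sketch below (not part of the original report) splits IPv4 ip= arguments on colons, using the field order seen above (address, peer, gateway, netmask, hostname, interface, autoconf). It lists the interfaces that carry a gateway and rewrites an argument without one. The helper names are hypothetical, and IPv6 addresses or additional trailing fields are not handled.

```python
from typing import List


def split_ip_arg(arg: str) -> List[str]:
    """Split an IPv4 'ip=' kernel argument into its colon-separated fields."""
    value = arg[3:] if arg.startswith("ip=") else arg
    fields = value.split(":")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields in {arg!r}")
    return fields


def interfaces_with_gateway(args: List[str]) -> List[str]:
    """Return the interface names whose ip= argument sets a gateway."""
    result = []
    for arg in args:
        fields = split_ip_arg(arg)
        gateway, interface = fields[2], fields[5]
        if gateway:
            result.append(interface)
    return result


def drop_gateway(arg: str) -> str:
    """Return the same ip= argument with the gateway field left empty."""
    fields = split_ip_arg(arg)
    fields[2] = ""
    return "ip=" + ":".join(fields)


# The two arguments from the param file in this comment.
primary = "ip=10.13.114.2::10.13.114.1:255.255.255.0::enc1000:none"
secondary = "ip=10.100.214.2::10.100.214.1:255.255.255.0::enP513s129:none"

# More than one entry here means more than one candidate default route.
print(interfaces_with_gateway([primary, secondary]))

# Workaround form: the additional NIC without a gateway.
print(drop_gateway(secondary))
```

With the gateway removed from the additional NIC, only enc1000 contributes a default route, which matches the workaround described above.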

Comment 4 Dan Li 2021-08-12 20:07:21 UTC
Re-assigning to Networking team to further evaluate this bug, as Muhammad has provided information and behavior in Comment 3. Please feel free to evaluate the "Blocker?" status as the team sees fit.

Comment 5 Dan Li 2021-08-12 20:10:12 UTC
Please also feel free to assign to the correct sub-component, as I only took a guess based on multi-NIC's relation to the NMState Operator.

Comment 6 Ben Nemec 2021-08-20 16:55:16 UTC
This doesn't appear to have any involvement from OpenShift networking. It's purely NetworkManager/Dracut behavior. Sending to the NetworkManager team for their input.

I can say they will most likely ask for trace logs from NetworkManager. In this case, you should be able to enable trace logging by passing a machine-config manifest to the installer that creates a file in /etc/NetworkManager/conf.d with the content:

[logging]
level=TRACE
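
One possible way to produce such a manifest is sketched below in Python. This is only an assumption-laden sketch, not something confirmed in this bug: the manifest name, the file name 99-trace-logging.conf, the role label value, and the Ignition spec version 3.2.0 are all illustrative choices and should be checked against the installer version in use. The script emits JSON, which is also valid YAML.

```python
import json
from urllib.parse import quote

# Content of the NetworkManager drop-in quoted in the comment above.
NM_TRACE_CONF = "[logging]\nlevel=TRACE\n"


def nm_trace_machineconfig(role: str = "master") -> dict:
    """Build a MachineConfig dict that writes the trace-logging drop-in.

    The name, file path suffix, and Ignition version are assumptions.
    """
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": f"99-{role}-nm-trace-logging",  # hypothetical name
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {
            "config": {
                "ignition": {"version": "3.2.0"},  # assumed spec version
                "storage": {
                    "files": [
                        {
                            "path": "/etc/NetworkManager/conf.d/99-trace-logging.conf",
                            "mode": 420,  # decimal for octal 0644
                            "overwrite": True,
                            # File content embedded as a percent-encoded data URL.
                            "contents": {"source": "data:," + quote(NM_TRACE_CONF)},
                        }
                    ]
                },
            }
        },
    }


manifest = nm_trace_machineconfig()
print(json.dumps(manifest, indent=2))
```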

Comment 8 Thomas Haller 2021-08-23 08:18:37 UTC
(In reply to Ben Nemec from comment #6)
> I can say they will most likely ask for trace logs from NetworkManager. In
> this case, you should be able to enable trace logging by passing a
> machine-config manifest to the installer that creates a file in
> /etc/NetworkManager/conf.d with the content:
> 
> [logging]
> level=TRACE

Yes, please. You enable debug logging during boot by setting `rd.debug` (from `man dracut.cmdline`). Then provide the complete logs. Thank you.

Comment 9 Muhammad Adeel (IBM) 2021-08-25 08:37:34 UTC
I was able to identify the root cause of the problem: it was due to multiple default routes. We will update the installation section of the documentation so that it reflects the correct multi-NIC configuration. Thank you.

Comment 10 Muhammad Adeel (IBM) 2021-09-02 11:54:44 UTC
I will close this one and open a corresponding documentation BZ. Thomas, is there any AI required from your side?

Comment 11 Muhammad Adeel (IBM) 2021-09-02 12:42:31 UTC
The corresponding documentation BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2000583

Comment 12 Thomas Haller 2021-09-02 14:20:16 UTC
sorry, I didn't understand your comment 9 and comment 10.

Am I reading this correctly, that you think there is no bug here?

If yes, then we can indeed just close it...

Comment 13 Muhammad Adeel (IBM) 2021-09-07 07:25:00 UTC
Yes, there is no bug.

Comment 14 Thomas Haller 2021-09-07 08:53:15 UTC
thanks. Closing due to comment 13.


If something is missing, please comment or reopen. Thank you!!

