Bug 1859897 - Configuration of LACP bond with tagged VLAN
Summary: Configuration of LACP bond with tagged VLAN
Keywords:
Status: CLOSED DUPLICATE of bug 1857532
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.5
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dusty Mabe
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-23 09:46 UTC by Ahmed Anwar
Modified: 2020-08-19 20:39 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-19 20:39:22 UTC
Target Upstream Version:
Embargoed:



Description Ahmed Anwar 2020-07-23 09:46:05 UTC
Description of problem:

While deploying a sandbox environment with a customer, we faced a major issue during the network configuration. The customer requires that the nodes connected directly to the switch be configured with VLAN tagging and that the interfaces be placed in an LACP bond. To configure the VLAN and the LACP bond, the following parameters were used when booting with the RHCOS ISO images:

... ip=172.21.118.203::172.21.116.1:255.255.252.0:rhvsctst03.example.net:bond0.116:none bond=bond0:ens3,ens4:mode=802.3ad,lacp_rate=fast,miimon=100 vlan=bond0.116:bond0 ...
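
For reference, the field-by-field intent of those arguments, following the documented dracut.cmdline syntax (the values are the ones from the command line above):

```
# ip=<client-IP>::<gateway>:<netmask>:<hostname>:<interface>:<autoconf>
#   static 172.21.118.203/22, gateway 172.21.116.1, hostname rhvsctst03.example.net,
#   applied to the tagged VLAN interface bond0.116
ip=172.21.118.203::172.21.116.1:255.255.252.0:rhvsctst03.example.net:bond0.116:none

# bond=<bondname>:<slaves>:<bonding-options>
#   bond0 over ens3 and ens4, LACP (802.3ad), fast LACPDUs, 100 ms link monitoring
bond=bond0:ens3,ens4:mode=802.3ad,lacp_rate=fast,miimon=100

# vlan=<vlanname>:<physical-device>
#   VLAN 116 tagged on top of bond0
vlan=bond0.116:bond0
```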

The installation of RHCOS on the physical machine was successful, which means the parameters were set properly in terms of building the LACP bond and setting up the VLAN tagging. The raw disk image of RHCOS was downloaded from an external webserver and extracted to the installation drive (sda), and the machine then rebooted.

Upon the initial boot of the installed RHCOS, the network stack is configured properly during initrd initialization. The ignition system downloads the master.ign file from the external webserver and subsequently the master's full ignition file (api-int:22623/config/master) from the bootstrap server, and the boot continues.
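
For completeness, the master.ign served from the webserver is only a small pointer config (spec 2 in 4.5) that chains to the machine config server; roughly something like the sketch below, with the cluster domain and CA data elided (treat this as an illustration generated by openshift-install, not the exact file):

```
{
    "ignition": {
        "config": {
            "append": [
                { "source": "https://api-int.<cluster-domain>:22623/config/master" }
            ]
        },
        "security": {
            "tls": {
                "certificateAuthorities": [
                    { "source": "data:text/plain;charset=utf-8;base64,..." }
                ]
            }
        },
        "version": "2.2.0"
    }
}
```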

When the system fully boots, the network stack isn't properly configured and network connectivity is lost. No SSH access could be established to further investigate and troubleshoot the issue.

Version-Release number of selected component (if applicable):
- RHOCP 4.5.2
- RHCOS 4.5 (rhcos-4.5.2-x86_64-metal.x86_64.raw.gz)

How reproducible:


Steps to Reproduce:
1. The switch must be configured with LACP on the ports and with VLAN tagging
2. Generate and host the ignition files and the raw disk image on an external webserver
3. Using an RHCOS ISO image, boot the machines with the following parameters

coreos.inst=yes ip=172.21.118.203::172.21.116.1:255.255.252.0:rhvsctst03.example.net:bond0.116:none nameserver=10.240.20.53 nameserver=172.16.4.25 bond=bond0:ens3,ens4:mode=802.3ad,lacp_rate=fast,miimon=100 vlan=bond0.116:bond0 coreos.inst.install_dev=sda coreos.inst.image_url=172.21.118.2:80/rhcos-4.5.2-x86_64-metal.x86_64.raw.gz coreos.inst.ignition_url=172.21.118.2:80/master.ign

Actual results:
- Network connectivity is lost when RHCOS fully boots
- The network interfaces shown on the login prompt only list IPv6 addresses and show neither the configured IPv4 address nor the VLAN.

Expected results:
- Network connectivity should function properly when RHCOS fully boots
- The network interfaces on the login prompt must show the configured IPv4 address.

Additional info:
Since this was a sandbox environment, the priority for the customer was to get RHOCP up and running properly rather than configuring the advanced networking parameters. We ended up removing the LACP configuration and the VLAN tagging and switching the switch port to an access port. This led to RHOCP being deployed properly.

Comment 1 Micah Abbott 2020-07-23 13:17:43 UTC
This may be a limitation in RHCOS 4.5 about which parameters passed to dracut during the install are propagated to the real root.  BZ#1857532 seems to complain of a similar problem.
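
(If console access can be regained on an affected node, a quick way to confirm what actually carried over to the real root would be something along these lines; this is standard RHCOS/NetworkManager tooling, nothing specific to this bug:)

```
cat /proc/cmdline                    # which ip=/bond=/vlan= args the real root booted with
ls /etc/sysconfig/network-scripts/   # which ifcfg files were generated/propagated
nmcli connection show                # what NetworkManager actually configured
journalctl -b -u NetworkManager      # NetworkManager's view of bond0/bond0.116 bring-up
```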

In RHCOS 4.6, we will have better support for complex network configurations.

Since a workaround was found (understandably, not a desirable one), I'm going to set the priority to medium and target this for 4.6.

Comment 2 Dusty Mabe 2020-07-30 19:37:34 UTC
This bug has not been selected for work in the current sprint.

Comment 3 Dusty Mabe 2020-08-03 21:34:42 UTC
Hey Ahmed,

As Micah mentioned in comment 1, the networking transition between the initramfs and real root was overhauled for the better in 4.6, so this problem might already be solved. I have a few questions for you:

1. Do you mind trying your setup with a 4.6 image to see if it works now? You can use a simple ignition config based on spec 3 (ignition spec changed to spec 3 in 4.6). Something like this should work:

```
{
    "ignition": {
        "version": "3.1.0"
    },
    "passwd": {
        "users": [
            {
                "groups": [
                    "sudo"
                ],
                "name": "core",
                "sshAuthorizedKeys": [
                    "ssh-rsa AAAA..."
                ]
            }
        ]
    }
}
```
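
If it helps, a spec 3 config can be sanity-checked before booting with the upstream validator (assuming the `ignition-validate` binary is available; `config.ign` below is just a placeholder file name):

```
ignition-validate config.ign && echo "config is valid"
```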

2. On the 4.5 system you are having trouble with, can you get into the system? If so can you tell me the contents of the files in the `/etc/sysconfig/network-scripts/` directory?
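
For reference, for the bond + VLAN layout described above I'd expect that directory to contain ifcfg files roughly like the sketch below (file names and exact keys are an assumption based on the kernel arguments, not taken from an affected system):

```
# ifcfg-bond0 -- the bond itself carries no address; that lives on the VLAN
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad lacp_rate=fast miimon=100"
BOOTPROTO=none
ONBOOT=yes

# ifcfg-ens3 (and a matching ifcfg-ens4) -- the bond slaves
DEVICE=ens3
TYPE=Ethernet
MASTER=bond0
SLAVE=yes
ONBOOT=yes

# ifcfg-bond0.116 -- the tagged VLAN carrying the static address
DEVICE=bond0.116
TYPE=Vlan
VLAN=yes
PHYSDEV=bond0
BOOTPROTO=none
IPADDR=172.21.118.203
NETMASK=255.255.252.0
GATEWAY=172.21.116.1
ONBOOT=yes
```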

Comment 4 Ahmed Anwar 2020-08-04 09:38:55 UTC
Hey Dusty,

I clarified this to the customer and he's okay with it as long as it'll be supported in a future release.

For #1, I disengaged from the customer and I have some limitations in reproducing a similar environment.

For #2, the problem occurred when RHCOS fully booted: the network connectivity was completely lost. What happened was that while the system was booting we got a couple of ping responses. After that, when the console reached the login prompt, the IPs didn't show up and the network connectivity to the node was lost.

We tried another round of deploying OCP configuring VLANs only, without LACP bonds. This time RHCOS was installed from an ISO, and upon the first boot the network connectivity was lost (much like what happened with bonds). At this point we rebooted the node by hand, and on the second boot the network was configured properly with VLANs and I was able to ssh into the system. However, the bootstrap didn't progress properly, probably because the first boot wasn't carried out successfully.

I didn't have time to troubleshoot further while with the customer, but if I had the chance I would have booted the node into another runlevel, modified the `core` user's password, granted it the ability to log in from the console, rebooted the node, and logged in to investigate.
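
For the record, the recovery path I had in mind was roughly the standard dracut emergency-shell procedure, sketched below (untested in this environment; an SELinux relabel of /etc/shadow may also be needed afterwards):

```
# append a breakpoint to the kernel command line at the GRUB prompt
rd.break

# then, from the emergency shell:
mount -o remount,rw /sysroot
chroot /sysroot
passwd core        # set a temporary password so core can log in on the console
exit
reboot
```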

Comment 5 Dusty Mabe 2020-08-12 21:43:28 UTC
Thanks Ahmed for the information.

There are several bugs related to VLANs and I think I'm going to try to get them all moved into the same bug (assuming they share the same root cause). Right now I'm focusing on https://bugzilla.redhat.com/show_bug.cgi?id=1857532#c4 to see if we can get more feedback. If those efforts seem promising I'll probably close this one out as a duplicate.

Dusty

Comment 6 Dusty Mabe 2020-08-19 20:39:22 UTC
I'm going to close this and a few other bugs as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1857532. Unless more information comes to light, I think that is the case.

*** This bug has been marked as a duplicate of bug 1857532 ***

