Bug 1818697
| Summary: | Vlan over bond is not active after first boot | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Qin Yuan <qiyuan> |
| Component: | NetworkManager | Assignee: | Beniamino Galvani <bgalvani> |
| Status: | CLOSED ERRATA | QA Contact: | Vladimir Benes <vbenes> |
| Severity: | high | Priority: | high |
| Version: | 8.2 | Target Release: | 8.2 |
| Target Milestone: | rc | Keywords: | Triaged |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | NetworkManager-1.28.0-0.1.el8 | Doc Type: | If docs needed, set a value |
| Last Closed: | 2021-05-18 13:29:37 UTC | Type: | Bug |
| Bug Depends On: | 1783891 | CC: | acardace, amusil, atragler, bgalvani, cshao, cutaylor, dholler, fpokryvk, lrintel, lsvaty, mavital, mkalinin, mtessun, peyu, ptalbert, qiyuan, rkhan, rvykydal, sbonazzo, sbueno, shlei, sukulkar, thaller, till, vbenes, weiwang, yaniwang |
Dominik, can you please have a look at this one?

Qin, is the bond configured to use dhcp? Is the dhcp successful?

For bond, ipv4 is disabled, ipv6 is ignored. Isn't this the right way to configure vlan over bond?

In /var/log/messages, I saw:

Mar 30 02:06:53 localhost NetworkManager[1417]: <warn> [1585534013.0331] dhcp4 (bond0.50): request timed out
Mar 30 02:06:53 localhost NetworkManager[1417]: <info> [1585534013.0333] dhcp4 (bond0.50): state changed unknown -> timeout
Mar 30 02:06:53 localhost NetworkManager[1417]: <info> [1585534013.0334] device (bond0.50): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')

(In reply to Qin Yuan from comment #3)
> For bond, ipv4 is disabled, ipv6 is ignored. Isn't this the right way to
> configure vlan over bond?
>
> In /var/log/messages, I saw:
> Mar 30 02:06:53 localhost NetworkManager[1417]: <warn> [1585534013.0331]
> dhcp4 (bond0.50): request timed out
> Mar 30 02:06:53 localhost NetworkManager[1417]: <info> [1585534013.0333]
> dhcp4 (bond0.50): state changed unknown -> timeout
> Mar 30 02:06:53 localhost NetworkManager[1417]: <info> [1585534013.0334]
> device (bond0.50): state change: ip-config -> failed (reason
> 'ip-config-unavailable', sys-iface-state: 'managed')

Thanks, is dhcp enabled for the VLAN on the bond?

Yes.

# cat /etc/sysconfig/network-scripts/ifcfg-VLAN_connection_1
VLAN=yes
TYPE=Vlan
PHYSDEV=d5008bac-31f3-4b75-803e-dbca0ee4e871
VLAN_ID=50
REORDER_HDR=yes
GVRP=no
MVRP=no
HWADDR=
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=dhcp
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_PRIVACY=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME="VLAN connection 1"
UUID=009ca293-b03d-422f-a35f-b1cc36e8c2f5
ONBOOT=yes

# nmcli c show "VLAN connection 1"
ipv4.method: auto
ipv6.method: auto

Thomas, is this the intended behavior of NetworkManager?

Should be fixed by nmstate-0.2.5-1.el8, which is included in RHEL 8.2.

(In reply to Sandro Bonazzola from comment #7)
> Should be fixed by nmstate-0.2.5-1.el8 which is included in RHEL 8.2

Unfortunately not; Anaconda is using NetworkManager directly, without nmstate. This means that if we want to change the behavior, Anaconda has to create a NetworkManager config similar to the one nmstate is creating.

(In reply to Dominik Holler from comment #6)
> Thomas, is this the intended behavior of NetworkManager?

I don't fully understand the question. But yes, it seems intended. If you enable DHCP on a device and DHCP fails, the device goes down. As we discussed several weeks ago, that depends on some circumstances like the ipv4.may-fail/ipv6.may-fail settings (and ipv4.dhcp-timeout and ipv6.ra-timeout)... but yes, it seems intended.

If something is unclear, please provide full level=TRACE logs.

(In reply to Nir Levy from comment #9)
Anaconda just copies the ifcfg files created during configuration in the Anaconda GUI (via NetworkManager Connection Editor) to the installed system, so Anaconda does not create the ifcfg files in this case. The only thing that could possibly be caused/fixed by Anaconda that comes to my mind would be interference with some other ifcfg files created during installation (such as the default ifcfg files for devices), but from the logs in the Description that does not seem to be the case. I think we can learn more only from the logs requested in comment #10.

Created attachment 1688793 [details]
level=TRACE log
Attached level=TRACE log
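A TRACE-level log like the attached one can typically be captured by raising NetworkManager's logging level. A minimal sketch, assuming a journald-based system; the drop-in file name is arbitrary:

```
# Raise logging to TRACE at runtime (reverts on service restart):
nmcli general logging level TRACE domains ALL

# Or persistently, via a configuration drop-in:
cat > /etc/NetworkManager/conf.d/95-trace.conf <<'EOF'
[logging]
level=TRACE
domains=ALL
EOF
systemctl restart NetworkManager

# Collect the log for the current boot:
journalctl -u NetworkManager -b > nm-trace.log
```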
Radek, can you please have a look at the provided logs? I think we need NM eyes here.

thaller, can you please have a look?

From the log in comment 13:

<debug> [1589550235.7710] ++ connection.id = 'VLAN connection 1'
<debug> [1589550235.7710] ++ connection.permissions = []
<debug> [1589550235.7710] ++ connection.type = 'vlan'
<debug> [1589550235.7710] ++ connection.uuid = 'b6f95590-fa05-4b89-bcc9-64ee2c2ced6f'
...
<debug> [1589550235.7711] ++ vlan.id = 50
<debug> [1589550235.7711] ++ vlan.ingress-priority-map = []
<debug> [1589550235.7711] ++ vlan.parent = '7a9a8c92-ec6d-420f-96a8-cff4086b3534'
<debug> [1589550235.7711] ++ ipv4 [ 0x55d137d0a1e0 ]
<debug> [1589550235.7711] ++ ipv4.addresses = ((GPtrArray*) 0x55d137d5c060)
<debug> [1589550235.7711] ++ ipv4.dns = []
<debug> [1589550235.7711] ++ ipv4.dns-search = []
<debug> [1589550235.7712] ++ ipv4.method = 'auto'
<debug> [1589550235.7713] ++ ipv4.routes = ((GPtrArray*) 0x55d137d62880)
<debug> [1589550235.7713] ++ ipv4.routing-rules = <unknown>
<debug> [1589550235.7713] ++ ipv6 [ 0x55d137cca530 ]
<debug> [1589550235.7713] ++ ipv6.addresses = ((GPtrArray*) 0x55d137d61b20)
<debug> [1589550235.7713] ++ ipv6.dns = []
<debug> [1589550235.7713] ++ ipv6.dns-search = []
<debug> [1589550235.7713] ++ ipv6.ip6-privacy = ((NMSettingIP6ConfigPrivacy) NM_SETTING_IP6_CONFIG_PRIVACY_DISABLED)
<debug> [1589550235.7714] ++ ipv6.method = 'auto'
<debug> [1589550235.7714] ++ ipv6.routes = ((GPtrArray*) 0x55d137d5c0e0)
<debug> [1589550235.7714] ++ ipv6.routing-rules = <unknown>
...
<info> [1589550239.5732] policy: auto-activating connection 'VLAN connection 1' (b6f95590-fa05-4b89-bcc9-64ee2c2ced6f)
...
<info> [1589550239.5872] device (bond0.50): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
...
<warn> [1589550284.8424] dhcp4 (bond0.50): request timed out
<info> [1589550284.8428] dhcp4 (bond0.50): state changed unknown -> timeout
...
<info> [1589550284.8430] device (bond0.50): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
...
<warn> [1589550284.8449] device (bond0.50): Activation: failed for connection 'VLAN connection 1'

So far, so expected. You configured a VLAN profile that should do DHCP; it timed out and failed. The solution for this problem is: don't configure the profile this way, if that is not the correct configuration for your setup.

However, we would then expect the profile to keep trying to autoconnect indefinitely. That doesn't seem to happen:

<info> [1589550284.8644] policy: auto-activating connection 'VLAN connection 1' (b6f95590-fa05-4b89-bcc9-64ee2c2ced6f)
...
<debug> [1589550284.8869] device[5b394a254974fbfe] (bond0.50): parent: clear
<debug> [1589550284.8875] device[5b394a254974fbfe] (bond0.50): unmanaged: flags set to [platform-init,!sleeping,!parent,!by-type,!user-explicit,!user-settings=0x10/0x7d/unmanaged/unrealize>
<debug> [1589550284.8876] device[5b394a254974fbfe] (bond0.50): unmanaged: flags set to [platform-init,!sleeping,!user-settings=0x10/0x51/unmanaged/unrealized], forget [parent,by-type,user->
<info> [1589550284.8876] device (bond0.50): state change: disconnected -> unmanaged (reason 'user-requested', sys-iface-state: 'managed')

That's odd.

Do you need this bug report to investigate why the profile was configured the way it is (when it possibly shouldn't be)? Depending on that, I'll either clone or reassign the bug, to check why the autoconnect doesn't work.
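The settings Thomas refers to can be inspected and adjusted with nmcli. A minimal sketch using the connection name from this report; the modification shown is illustrative, not a recommendation for this setup (ipv6.ra-timeout is only available on newer NetworkManager versions, so it is omitted here):

```
# Show the current failure-handling knobs for the profile:
nmcli -f ipv4.may-fail,ipv6.may-fail,ipv4.dhcp-timeout \
    connection show "VLAN connection 1"

# Keep retrying DHCP instead of failing the device after the timeout:
nmcli connection modify "VLAN connection 1" ipv4.dhcp-timeout infinity
```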
(In reply to Thomas Haller from comment #18)
> do you need this bug report to investigate why the profile was configured in
> the way it is (when it possibly shouldn't be)?
>
> Depending on that, I either clone or reassign the bug, to check why the
> autoconnect doesn't work.

Please clone; we still need some investigation into how we get the profile configured in this way. As far as I understood, Anaconda allowed the profile to be configured like this, and if this is not supposed to be the right configuration, we may need to work with the Anaconda team to prevent this configuration from being selected.

Moving to anaconda for checking the profile generation here. In RHEL 7 this worked fine.

Based on comment #12 and comment #18 I am reassigning to NetworkManager / Thomas for checking why the autoconnection does not work.

@Sandro: I think there was no change on the Anaconda side between RHEL 7 and RHEL 8 regarding this kind of configuration, and the profile is generated by nm-c-e/NetworkManager. As I understand what Thomas said, the configuration passed to the installed system by Anaconda seems OK/expected. I think we need to find out why the configuration works in the Anaconda environment but does not work on the installed system.

Hi Patrick, nice investigation! As you have found, the problem is that we don't schedule an activation-check after the device is deleted (in fact we schedule it while the device is still being deleted, which is the wrong time). I bisected the regression to commit d35d3c468a304c3e0e78b4b068d105b1d753876c, which is a large rework; it's not yet clear to me what part of that commit caused the regression. I have opened a merge request with a possible fix at:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/613/commits

device: fix autoactivating virtual devices after a failure

When a virtual device fails, its state goes to FAILED and then DISCONNECTED. In DISCONNECTED we call schedule_activate_check() to schedule an auto-activation if needed. We also schedule the deletion of the link through delete_on_deactivate_check_and_schedule(). The auto-activation attempt fails because the link deletion unmanages the device; as a result, the device doesn't try to auto-activate again.

To fix this:
- don't allow the device to auto-activate if the device deletion is pending;
- check again if the device can be auto-activated after its deletion.

Created attachment 1715342 [details]
Reproducer
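A setup matching the one in the report can be sketched with plain nmcli; the interface and connection names below are illustrative assumptions, not taken from the attachment:

```
# Bond with no L3 configuration, as in the Anaconda setup:
nmcli connection add type bond con-name bond0 ifname bond0 \
    bond.options "mode=active-backup" ipv4.method disabled ipv6.method ignore
nmcli connection add type ethernet con-name bond0-port1 ifname ens1 \
    master bond0 slave-type bond
nmcli connection add type ethernet con-name bond0-port2 ifname ens2 \
    master bond0 slave-type bond

# VLAN 50 on top of the bond, with DHCP enabled; without a DHCP server on
# the VLAN, activation times out and the device should keep retrying:
nmcli connection add type vlan con-name bond0.50 ifname bond0.50 \
    dev bond0 id 50 ipv4.method auto ipv6.method auto
```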
*** Bug 1879003 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: NetworkManager and libnma security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1574
Created attachment 1674634 [details]
logs

Description of problem:
Configure a vlan over bond device in the Anaconda GUI; it can get an IP during installation, but after the first reboot the vlan device is not up.

Version-Release number of selected component (if applicable):
RHVH-4.4-20200325.0-RHVH-x86_64-dvd1.iso

How reproducible:
100%

Steps to Reproduce:
1. Configure vlan over bond in the Anaconda GUI:
   1) Bond:
      slaves: 2 nics
      mode: active-backup
      ipv4: disabled
      ipv6: ignore
   2) Vlan:
      parent interface: bond0
      vlan id: 50
2. Continue to finish the other required settings, and begin installation
3. Reboot, enter the system, and check vlan over bond

Actual results:
1. The vlan over bond device is not up

Expected results:
1. The vlan over bond device should be up after first boot.

Additional info:
1. The vlan over bond device can be activated by an nmcli command (see the sketch below).
2. The vlan over bond device will be activated automatically when activating another nic using an nmcli command.
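The manual workaround from the additional info above, as a short sketch; the connection and interface names are taken from the report:

```
# Bring the VLAN up by hand after boot:
nmcli connection up "VLAN connection 1"

# Verify the device state and its address:
nmcli device status
ip addr show bond0.50
```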