Bug 1818697
| Summary: | Vlan over bond is not active after first boot | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Qin Yuan <qiyuan> |
| Component: | NetworkManager | Assignee: | Beniamino Galvani <bgalvani> |
| Status: | CLOSED ERRATA | QA Contact: | Vladimir Benes <vbenes> |
| Severity: | high | Priority: | high |
| Version: | 8.2 | Target Release: | 8.2 |
| Target Milestone: | rc | Keywords: | Triaged |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | NetworkManager-1.28.0-0.1.el8 | Doc Type: | If docs needed, set a value |
| Last Closed: | 2021-05-18 13:29:37 UTC | Type: | Bug |
| Bug Depends On: | 1783891 | CC: | acardace, amusil, atragler, bgalvani, cshao, cutaylor, dholler, fpokryvk, lrintel, lsvaty, mavital, mkalinin, mtessun, peyu, ptalbert, qiyuan, rkhan, rvykydal, sbonazzo, sbueno, shlei, sukulkar, thaller, till, vbenes, weiwang, yaniwang |
Dominik, can you please have a look at this one?

Qin, is the bond configured to use dhcp? Is the dhcp successful?

For bond, ipv4 is disabled, ipv6 is ignored. Isn't this the right way to configure vlan over bond?

In /var/log/messages, I saw:

Mar 30 02:06:53 localhost NetworkManager[1417]: <warn> [1585534013.0331] dhcp4 (bond0.50): request timed out
Mar 30 02:06:53 localhost NetworkManager[1417]: <info> [1585534013.0333] dhcp4 (bond0.50): state changed unknown -> timeout
Mar 30 02:06:53 localhost NetworkManager[1417]: <info> [1585534013.0334] device (bond0.50): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')

(In reply to Qin Yuan from comment #3)
> For bond, ipv4 is disabled, ipv6 is ignored. Isn't this the right way to
> configure vlan over bond?
>
> In /var/log/messages, I saw:
> Mar 30 02:06:53 localhost NetworkManager[1417]: <warn> [1585534013.0331]
> dhcp4 (bond0.50): request timed out
> Mar 30 02:06:53 localhost NetworkManager[1417]: <info> [1585534013.0333]
> dhcp4 (bond0.50): state changed unknown -> timeout
> Mar 30 02:06:53 localhost NetworkManager[1417]: <info> [1585534013.0334]
> device (bond0.50): state change: ip-config -> failed (reason
> 'ip-config-unavailable', sys-iface-state: 'managed')

Thanks, is dhcp enabled for the VLAN on the bond?

Yes.

# cat /etc/sysconfig/network-scripts/ifcfg-VLAN_connection_1
VLAN=yes
TYPE=Vlan
PHYSDEV=d5008bac-31f3-4b75-803e-dbca0ee4e871
VLAN_ID=50
REORDER_HDR=yes
GVRP=no
MVRP=no
HWADDR=
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=dhcp
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_PRIVACY=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME="VLAN connection 1"
UUID=009ca293-b03d-422f-a35f-b1cc36e8c2f5
ONBOOT=yes

# nmcli c show "VLAN connection 1"
ipv4.method: auto
ipv6.method: auto

Thomas, is this the intended behavior of NetworkManager?

Should be fixed by nmstate-0.2.5-1.el8, which is included in RHEL 8.2.

(In reply to Sandro Bonazzola from comment #7)
> Should be fixed by nmstate-0.2.5-1.el8 which is included in RHEL 8.2

Unfortunately not; Anaconda is using NetworkManager directly, without nmstate. This means that if we want to change the behavior, Anaconda has to create a NetworkManager config similar to the one nmstate is creating.

(In reply to Dominik Holler from comment #6)
> Thomas, is this the intended behavior of NetworkManager?

I don't fully understand the question. But yes, it seems intended. If you enable DHCP on a device and DHCP fails, the device goes down. As we discussed several weeks ago, that depends on some circumstances like the ipv4.may-fail/ipv6.may-fail settings (and ipv4.dhcp-timeout and ipv6.ra-timeout)... but yes, it seems intended.

If something is unclear, please provide full level=TRACE logs.

(In reply to Nir Levy from comment #9)
Anaconda just copies the ifcfg files created during configuration in the Anaconda GUI (via NetworkManager Connection Editor) to the installed system, so Anaconda does not create the ifcfg files in this case. The only thing that could possibly be caused/fixed by Anaconda that comes to my mind would be interference with some other ifcfg files created during installation (such as the default ifcfg files for devices), but from the logs in the Description that does not seem to be the case. I think we can learn more only from the logs requested in comment #10.

Created attachment 1688793 [details]
level=TRACE log
Attached level=TRACE log
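A TRACE-level log like the attached one can typically be captured by raising NetworkManager's logging level. A minimal sketch, assuming a journald-based system; the drop-in file name is arbitrary:

```
# Raise logging to TRACE at runtime (reverts on service restart):
nmcli general logging level TRACE domains ALL

# Or persistently, via a configuration drop-in:
cat > /etc/NetworkManager/conf.d/95-trace.conf <<'EOF'
[logging]
level=TRACE
domains=ALL
EOF
systemctl restart NetworkManager

# Collect the log for the current boot:
journalctl -u NetworkManager -b > nm-trace.log
```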
Radek, can you please have a look at the provided logs? I think we need NM eyes here.

thaller, can you please have a look?

From the log in comment 13:

<debug> [1589550235.7710] ++ connection.id = 'VLAN connection 1'
<debug> [1589550235.7710] ++ connection.permissions = []
<debug> [1589550235.7710] ++ connection.type = 'vlan'
<debug> [1589550235.7710] ++ connection.uuid = 'b6f95590-fa05-4b89-bcc9-64ee2c2ced6f'
...
<debug> [1589550235.7711] ++ vlan.id = 50
<debug> [1589550235.7711] ++ vlan.ingress-priority-map = []
<debug> [1589550235.7711] ++ vlan.parent = '7a9a8c92-ec6d-420f-96a8-cff4086b3534'
<debug> [1589550235.7711] ++ ipv4 [ 0x55d137d0a1e0 ]
<debug> [1589550235.7711] ++ ipv4.addresses = ((GPtrArray*) 0x55d137d5c060)
<debug> [1589550235.7711] ++ ipv4.dns = []
<debug> [1589550235.7711] ++ ipv4.dns-search = []
<debug> [1589550235.7712] ++ ipv4.method = 'auto'
<debug> [1589550235.7713] ++ ipv4.routes = ((GPtrArray*) 0x55d137d62880)
<debug> [1589550235.7713] ++ ipv4.routing-rules = <unknown>
<debug> [1589550235.7713] ++ ipv6 [ 0x55d137cca530 ]
<debug> [1589550235.7713] ++ ipv6.addresses = ((GPtrArray*) 0x55d137d61b20)
<debug> [1589550235.7713] ++ ipv6.dns = []
<debug> [1589550235.7713] ++ ipv6.dns-search = []
<debug> [1589550235.7713] ++ ipv6.ip6-privacy = ((NMSettingIP6ConfigPrivacy) NM_SETTING_IP6_CONFIG_PRIVACY_DISABLED)
<debug> [1589550235.7714] ++ ipv6.method = 'auto'
<debug> [1589550235.7714] ++ ipv6.routes = ((GPtrArray*) 0x55d137d5c0e0)
<debug> [1589550235.7714] ++ ipv6.routing-rules = <unknown>
...
<info> [1589550239.5732] policy: auto-activating connection 'VLAN connection 1' (b6f95590-fa05-4b89-bcc9-64ee2c2ced6f)
...
<info> [1589550239.5872] device (bond0.50): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
...
<warn> [1589550284.8424] dhcp4 (bond0.50): request timed out
<info> [1589550284.8428] dhcp4 (bond0.50): state changed unknown -> timeout
...
<info> [1589550284.8430] device (bond0.50): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
...
<warn> [1589550284.8449] device (bond0.50): Activation: failed for connection 'VLAN connection 1'

So far, so expected. You configured a VLAN profile that should do DHCP; it timed out and failed. The solution for this problem is: don't configure the profile this way, if that is not the correct configuration for your setup.

However, we would then expect the profile to keep trying to autoconnect indefinitely. That doesn't seem to happen:

<info> [1589550284.8644] policy: auto-activating connection 'VLAN connection 1' (b6f95590-fa05-4b89-bcc9-64ee2c2ced6f)
...
<debug> [1589550284.8869] device[5b394a254974fbfe] (bond0.50): parent: clear
<debug> [1589550284.8875] device[5b394a254974fbfe] (bond0.50): unmanaged: flags set to [platform-init,!sleeping,!parent,!by-type,!user-explicit,!user-settings=0x10/0x7d/unmanaged/unrealize>
<debug> [1589550284.8876] device[5b394a254974fbfe] (bond0.50): unmanaged: flags set to [platform-init,!sleeping,!user-settings=0x10/0x51/unmanaged/unrealized], forget [parent,by-type,user->
<info> [1589550284.8876] device (bond0.50): state change: disconnected -> unmanaged (reason 'user-requested', sys-iface-state: 'managed')

That's odd.

Do you need this bug report to investigate why the profile was configured the way it is (when it possibly shouldn't be)? Depending on that, I'll either clone or reassign the bug, to check why the autoconnect doesn't work.
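The settings Thomas refers to can be inspected and adjusted with nmcli. A minimal sketch using the connection name from this report; the modification shown is illustrative, not a recommendation for this setup (ipv6.ra-timeout is only available on newer NetworkManager versions, so it is omitted here):

```
# Show the current failure-handling knobs for the profile:
nmcli -f ipv4.may-fail,ipv6.may-fail,ipv4.dhcp-timeout \
    connection show "VLAN connection 1"

# Keep retrying DHCP instead of failing the device after the timeout:
nmcli connection modify "VLAN connection 1" ipv4.dhcp-timeout infinity
```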
(In reply to Thomas Haller from comment #18)
> do you need this bug report to investigate why the profile was configured in
> the way it is (when it possibly shouldn't be)?
>
> Depending on that, I either clone or reassign the bug, to check why the
> autoconnect doesn't work.

Please clone; we still need some investigation into how we get the profile configured in this way. As far as I understood, Anaconda allowed the profile to be configured like this, and if this is not supposed to be the right configuration, we may need to work with the Anaconda team to prevent this configuration from being selected.

Moving to anaconda for checking the profile generation here. In RHEL 7 this worked fine.

Based on comment #12 and comment #18 I am reassigning to NetworkManager / Thomas for checking why the autoconnection does not work.

@Sandro: I think there was no change on the Anaconda side between RHEL 7 and RHEL 8 regarding this kind of configuration, and the profile is generated by nm-c-e/NetworkManager. As I understand what Thomas said, the configuration passed to the installed system by Anaconda seems OK/expected. I think we need to find out why the configuration works in the Anaconda environment but does not work on the installed system.

Hi Patrick, nice investigation! As you have found, the problem is that we don't schedule an activation-check after the device is deleted (in fact we schedule it while the device is still being deleted, which is the wrong time). I bisected the regression to commit d35d3c468a304c3e0e78b4b068d105b1d753876c, which is a large rework; it's not yet clear to me what part of that commit caused the regression. I have opened a merge request with a possible fix at:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/613/commits

device: fix autoactivating virtual devices after a failure

When a virtual device fails, its state goes to FAILED and then DISCONNECTED. In DISCONNECTED we call schedule_activate_check() to schedule an auto-activation if needed. We also schedule the deletion of the link through delete_on_deactivate_check_and_schedule(). The auto-activation attempt fails because the link deletion unmanages the device; as a result, the device doesn't try to auto-activate again.

To fix this:
- don't allow the device to auto-activate if the device deletion is pending;
- check again if the device can be auto-activated after its deletion.

Created attachment 1715342 [details]
Reproducer
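A setup matching the one in the report can be sketched with plain nmcli; the interface and connection names below are illustrative assumptions, not taken from the attachment:

```
# Bond with no L3 configuration, as in the Anaconda setup:
nmcli connection add type bond con-name bond0 ifname bond0 \
    bond.options "mode=active-backup" ipv4.method disabled ipv6.method ignore
nmcli connection add type ethernet con-name bond0-port1 ifname ens1 \
    master bond0 slave-type bond
nmcli connection add type ethernet con-name bond0-port2 ifname ens2 \
    master bond0 slave-type bond

# VLAN 50 on top of the bond, with DHCP enabled; without a DHCP server on
# the VLAN, activation times out and the device should keep retrying:
nmcli connection add type vlan con-name bond0.50 ifname bond0.50 \
    dev bond0 id 50 ipv4.method auto ipv6.method auto
```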
*** Bug 1879003 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: NetworkManager and libnma security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1574
Created attachment 1674634 [details]
logs

Description of problem:
Configure a vlan over bond device in the Anaconda GUI; it can get an IP during installation, but after the first reboot the vlan device is not up.

Version-Release number of selected component (if applicable):
RHVH-4.4-20200325.0-RHVH-x86_64-dvd1.iso

How reproducible:
100%

Steps to Reproduce:
1. Configure vlan over bond in the Anaconda GUI:
   1) Bond:
      slaves: 2 nics
      mode: active-backup
      ipv4: disabled
      ipv6: ignore
   2) Vlan:
      parent interface: bond0
      vlan id: 50
2. Continue to finish the other required settings, and begin installation
3. Reboot, enter the system, and check vlan over bond

Actual results:
1. The vlan over bond device is not up

Expected results:
1. The vlan over bond device should be up after first boot.

Additional info:
1. The vlan over bond device can be activated by an nmcli command (see the sketch below).
2. The vlan over bond device will be activated automatically when activating another nic using an nmcli command.
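The manual workaround from the additional info above, as a short sketch; the connection and interface names are taken from the report:

```
# Bring the VLAN up by hand after boot:
nmcli connection up "VLAN connection 1"

# Verify the device state and its address:
nmcli device status
ip addr show bond0.50
```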