Bug 1356635 - RHVH can't obtain ip over bond+vlan network after anaconda interactive installation.
Summary: RHVH can't obtain ip over bond+vlan network after anaconda interactive installation.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-node
Classification: oVirt
Component: Installation & Update
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ovirt-4.0.2
Target Release: 4.0
Assignee: Petr Horáček
QA Contact: cshao
URL:
Whiteboard:
Duplicates: 1364476 (view as bug list)
Depends On:
Blocks: 1338732
 
Reported: 2016-07-14 14:10 UTC by cshao
Modified: 2017-02-06 23:15 UTC (History)
20 users (show)

Fixed In Version: v4.18.10 (ovirt-4.0.2-6)
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-12 14:24:22 UTC
oVirt Team: Network
Embargoed:
rule-engine: ovirt-4.0.z+
ykaul: blocker+
mgoldboi: planning_ack+
fdeutsch: devel_ack+
cshao: testing_ack+


Attachments
bond_vlan.png (436.78 KB, image/png)
2016-07-14 14:11 UTC, cshao
no flags Details
all_log_info (6.16 MB, application/x-gzip)
2016-07-14 14:12 UTC, cshao
no flags Details
new log after run sed and reboot (6.32 MB, application/x-gzip)
2016-07-15 03:59 UTC, cshao
no flags Details
ifcfg files and networkmanager log (4.51 KB, application/x-gzip)
2016-07-19 17:44 UTC, Ryan Barry
no flags Details
new_bond_vlan_log_with_NM (6.23 MB, application/x-gzip)
2016-07-22 08:05 UTC, cshao
no flags Details
bond_vlan_new_log_0803 (6.48 MB, application/x-gzip)
2016-08-09 07:19 UTC, cshao
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 61396 0 'None' MERGED modprobe: set bonding max_bonds to 0 2020-09-23 09:57:42 UTC
oVirt gerrit 61470 0 'None' ABANDONED modules-load.d: drop bonding UNTESTED 2020-09-23 09:57:42 UTC
oVirt gerrit 61593 0 'None' MERGED modprobe: set bonding max_bonds to 0 2020-09-23 09:57:42 UTC

Description cshao 2016-07-14 14:10:30 UTC
Description of problem:
RHVH can't obtain an IP address over a bond+VLAN network after an Anaconda interactive installation.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Anaconda interactive install of RHVH via ISO (with default ks).
2. Enter the network page.
3. Add a bond network -> save.
4. Enter the bond connection editing page -> bond tab, click the Add button.
5. Choose a connection type -> VLAN -> select a VLAN NIC -> set the VLAN name and ID 20.
6. Click the Add button again, choose a connection type -> VLAN -> select another VLAN NIC -> set the VLAN name and ID 20.
7. Set bond mode -> active-backup.
8. Save.
9. Bond+VLAN can obtain a VLAN IP.
10. Continue the installation.
11. Reboot and log in to RHVH.
12. ip addr


Actual results:
1. After step 9, the bond+VLAN network can obtain a VLAN IP.
2. After step 12 (after reboot), RHVH can't obtain an IP.

Expected results:
RHVH can obtain an IP over the bond+VLAN network after reboot.

Additional info:

Comment 1 cshao 2016-07-14 14:11:18 UTC
Created attachment 1179875 [details]
bond_vlan.png

Comment 2 cshao 2016-07-14 14:12:06 UTC
Created attachment 1179876 [details]
all_log_info

Comment 3 Ryan Barry 2016-07-14 14:21:06 UTC
Does this mean bz#1355678 can be closed? It looks like the same configuration.

Can you please get a sosreport?

Comment 4 cshao 2016-07-14 14:28:45 UTC
(In reply to Ryan Barry from comment #3)
> Does this mean bz#1355678 can be closed? It looks like the same
> configuration.
> 
Yes.
> Can you please get a sosreport?
sosreport has been included in #c2.

Comment 5 Ryan Barry 2016-07-14 14:48:02 UTC
Sorry, I must have missed it --

ethtool shows p3p1 and p4p1 (the bond slaves) as having links

However, networkmanager shows them as disconnected (possibly because ONBOOT=no)

I'll try to reproduce, but can you please try "sed -e 's/ONBOOT=no/ONBOOT=yes/' /etc/sysconfig/network-scripts/ifcfg-p?p1", then rebooting?

This is not a permanent solution, but I'd like to ensure that this is the problem in case I can't reproduce.
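A side note on the sed invocation above: without -i, GNU sed only prints the transformed text to stdout and leaves the ifcfg files unchanged. A minimal sketch of the persistent variant, run against a scratch file rather than the real /etc/sysconfig/network-scripts/ifcfg-p?p1 (assumption: GNU sed, as shipped on RHEL/RHVH):

```shell
# Scratch file standing in for an ifcfg file
f=$(mktemp)
echo 'ONBOOT=no' > "$f"

# Without -i, sed would only print the result; -i edits the file in place.
sed -i -e 's/ONBOOT=no/ONBOOT=yes/' "$f"

cat "$f"   # ONBOOT=yes
```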

Comment 6 cshao 2016-07-15 03:58:01 UTC
(In reply to Ryan Barry from comment #5)
> Sorry, I must have missed it --
> 
> ethtool shows p3p1 and p4p1 (the bond slaves) as having links
> 
> However, networkmanager shows them as disconnected (possibly because
> ONBOOT=no)
> 
> I'll try to reproduce, but can you please try "sed -e
> 's/ONBOOT=no/ONBOOT=yes/' /etc/sysconfig/network-scripts/ifcfg-p?p1", then
> rebooting?
> 
The bond+VLAN network still doesn't come up after a reboot; new log attached.

# sed -e 's/ONBOOT=no/ONBOOT=yes/' /etc/sysconfig/network-scripts/ifcfg-p?p1
TYPE=Ethernet
BOOTPROTO=dhcp
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=p3p1
UUID=29db3899-e8e0-4f46-98fb-3639107a2726
DEVICE=p3p1
ONBOOT=yes
TYPE=Ethernet
BOOTPROTO=dhcp
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=p4p1
UUID=9a0f4af5-983f-4472-a0b3-7ea0868bcaa5
DEVICE=p4p1
ONBOOT=yes


> This is not a permanent solution, but I'd like to ensure that this is the
> problem in case I can't reproduce.

Comment 7 cshao 2016-07-15 03:59:32 UTC
Created attachment 1180023 [details]
new log after run sed and reboot

Comment 8 Ryan Barry 2016-07-15 04:01:33 UTC
Can you please leave a test system up at the end of your day? This appears to be a networkmanager problem, but I need to investigate, and I don't have an appropriate testing environment in my lab right now.

Comment 9 cshao 2016-07-15 04:30:42 UTC
(In reply to Ryan Barry from comment #8)
> Can you please leave a test system up at the end of your day? This appears
> to be a networkmanager problem, but I need to investigate, and I don't have
> an appropriate testing environment in my lab right now.

Sure, I will leave the environment up for you at the end of today.

Comment 10 cshao 2016-07-15 07:56:06 UTC
Hi Ryan,

I have sent you a mail with the test environment details.

Comment 11 Red Hat Bugzilla Rules Engine 2016-07-18 16:38:59 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 12 Ryan Barry 2016-07-19 17:43:31 UTC
Is this reproducible on RHEL?

This appears to be a bug in NetworkManager (or the way anaconda writes NetworkManager files) -- this is not something which RHVH touches.

# nmcli c up "bond0 slave 1"
Error: Connection activation failed: Master device bond0 unmanaged or not available for activation

[root@dell-op790-01 ~]# nmcli d
DEVICE  TYPE      STATE         CONNECTION 
em1     ethernet  connected     em1        
p3p1    ethernet  disconnected  --         
p4p1    ethernet  disconnected  --         
p4p2    ethernet  disconnected  --         
11      vlan      disconnected  --         
22      vlan      disconnected  --         
bond0   bond      unmanaged     --         
lo      loopback  unmanaged     --   

The slaves aren't up, either.

However:

[root@dell-op790-01 ~]# systemctl stop NetworkManager.service ;systemctl start network.service ;systemctl stop network.service; systemctl start NetworkManager.service && nmcli d

[root@dell-op790-01 ~]# nmcli d
DEVICE  TYPE      STATE                                  CONNECTION        
em1     ethernet  connected                              em1               
11      vlan      connected                              bond0 slave 1     
22      vlan      connected                              bond0 slave 2     
bond0   bond      connecting (getting IP configuration)  Bond connection 1 
p3p1    ethernet  disconnected                           --                
p4p1    ethernet  disconnected                           --                
p4p2    ethernet  disconnected                           --                
lo      loopback  unmanaged                              -- 

ifcfg files and journalctl -u NetworkManager.service are attached.

thaller, what could be happening here?

Comment 13 Ryan Barry 2016-07-19 17:44:03 UTC
Created attachment 1181751 [details]
ifcfg files and networkmanager log

Comment 14 Beniamino Galvani 2016-07-21 15:29:48 UTC
(In reply to Ryan Barry from comment #12)

> ifcfg files and journalctl -u NetworkManager.service are attached.
> 
> thaller, what could be happening here?

The configuration in the ifcfg files seems correct. Can you please set level=DEBUG in the [logging] section of /etc/NetworkManager/NetworkManager.conf, reproduce the issue and attach NM logs? Thanks.

Comment 15 Ryan Barry 2016-07-21 15:33:23 UTC
Thanks Beniamino -

I don't actually have a test environment in my lab, so we'll both need to wait for QE.

Comment 16 cshao 2016-07-22 08:04:50 UTC
(In reply to Beniamino Galvani from comment #14)
> (In reply to Ryan Barry from comment #12)
> 
> > ifcfg files and journalctl -u NetworkManager.service are attached.
> > 
> > thaller, what could be happening here?
> 
> The configuration in the ifcfg files seems correct. Can you please set
> level=DEBUG in the [logging] section of
> /etc/NetworkManager/NetworkManager.conf, reproduce the issue and attach NM
> logs? Thanks.

I can still reproduce this issue on another test environment.
1. During Anaconda interactive installation of RHVH via ISO (with the default ks), set level=DEBUG in the [logging] section of /etc/NetworkManager/NetworkManager.conf.
2. Reproduce the issue.

Actually there is no NetworkManager.log. You can find the NetworkManager logs in /var/log/syslog, which acts as a catch-all for log messages.

Meanwhile, I have sent a mail containing the reproduced environment details to both of you.
The environment will be kept for 2 days.

Thanks.

Comment 17 cshao 2016-07-22 08:05:41 UTC
Created attachment 1182732 [details]
new_bond_vlan_log_with_NM

Comment 18 Beniamino Galvani 2016-07-22 10:08:24 UTC
I've analyzed the logs and the problem is that there is an
externally-created bond0 when NM starts and the device is down. NM
doesn't manage software interfaces with link down to avoid situations
like [1]. There are two possible workarounds: (1) don't create the
interface before NM starts or (2) at least bring it up so that NM will
manage it and activate the existing connection.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1030947
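The condition described above can be illustrated with a small check (a sketch only — NM's real logic lives in its C sources; the sysfs directory is a parameter here so the check can be exercised against a mock tree instead of /sys/class/net/bond0):

```shell
# Succeeds when the given interface directory reports a link that is
# down, i.e. the state in which NM leaves a pre-existing software
# interface unmanaged.
link_is_down() {
    read -r state < "$1/operstate" && [ "$state" = "down" ]
}

# Workaround (2) from the comment, for an externally-created bond0
# (illustrative only, requires root):
#   ip link set bond0 up
#   nmcli connection up "Bond connection 1"
```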

Comment 19 cshao 2016-07-22 11:19:58 UTC
(In reply to Beniamino Galvani from comment #18)
> I've analyzed the logs and the problem is that there is an
> externally-created bond0 when NM starts and the device is down. NM
> doesn't manage software interfaces with link down to avoid situations
> like [1]. There are two possible workarounds:
> (1) don't create the interface before NM starts or 
Does "don't create the interface" mean don't create a bond over the VLAN network, am I right?
If so, is creating just a VLAN NIC enough?
In my test with only one VLAN network created, it could obtain an IP both before and after reboot.

> (2) at least bring it up so that NM will manage it and activate the existing
> connection.

Actually the NIC over bond+VLAN was in the up state before reboot and could obtain an IP (192.168.xx.xx); please see attachment "bond_vlan-1.png" for more details. The IP is just lost after a reboot.

> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1030947

Comment 20 Beniamino Galvani 2016-07-22 11:52:15 UTC
(In reply to shaochen from comment #19)
> (In reply to Beniamino Galvani from comment #18)
> > I've analyzed the logs and the problem is that there is an
> > externally-created bond0 when NM starts and the device is down. NM
> > doesn't manage software interfaces with link down to avoid situations
> > like [1]. There are two possible workarounds:
> > (1) don't create the interface before NM starts or 
> Don't create the interface means don't create bond over vlan network, am I
> right?

Correct, you should not manually create bond0 at boot. NM will do so when it starts, if there is the "Bond connection 1" connection referring to bond0.

> If so, just create a vlan nic is enough?

The point is that you don't need to manually create interfaces.

> I did the test only create one vlan network, it can obtain IP before and
> after reboot.
> 
> >(2) at least bring it up so that NM will manage it and activate the existing    > connection.
> 
> Actually the nic which over bond + vlan was up status before reboot, the nic
> can obtain IP(192.168.xx.xx), please see attachment "bond_vlan-1.png" for
> more details. The ip just lost after a reboot.

I see, the reason is the one explained above and in comment 18.

Comment 21 Fabian Deutsch 2016-07-22 11:59:30 UTC
Considering comment 18 and comment 20, this sounds like something to improve on the vdsm side.

Comment 22 Red Hat Bugzilla Rules Engine 2016-07-22 11:59:36 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 23 Ryan Barry 2016-07-22 14:01:59 UTC
(In reply to Beniamino Galvani from comment #20)
> (In reply to shaochen from comment #19)
> > (In reply to Beniamino Galvani from comment #18)
> > > I've analyzed the logs and the problem is that there is an
> > > externally-created bond0 when NM starts and the device is down. NM
> > > doesn't manage software interfaces with link down to avoid situations
> > > like [1]. There are two possible workarounds:
> > > (1) don't create the interface before NM starts or 
> > Don't create the interface means don't create bond over vlan network, am I
> > right?

The externally-created bond0 is likely user error on my part -- I don't often use nmcli, and never before with names like "Bond_Connection_1".

If ONBOOT=yes is set for em1, the provided test system can easily be rebooted and configurations with user error tested.

But the steps to reproduce are essentially:

Create a bond on top of 2 VLAN devices in Anaconda
Finish install
Bond doesn't work

The provided test system shows this configuration.

Any messages about manual creation should be put down to user error from my poking at the system...

> 
> Correct, you should not manually create bond0 at boot. NM will do so when it
> starts, if there is the "Bond connection 1" connection referring to bond0.
> 
> > If so, just create a vlan nic is enough?
> 
> The point is that you don't need to manually create interfaces.

I believe this refers to creating the interface through Anaconda's NetworkManager abstraction

> I see, the reason is the one explained above and in comment 18.

By "before reboot", I believe this refers to "during Anaconda"

Comment 24 Beniamino Galvani 2016-07-22 20:44:23 UTC
I think I understand what's happening. As said before, the root cause
of the failure of the bond activation is the presence of a bond0
interface at NM startup. This happens because the file
/etc/modules-load.d/vdsm.conf loads the bonding module and doing so
automatically generates a bond0 interface.

To avoid this problem in NM, we pass the max_bonds=0 option to the
bonding module upon load, so that the initial interface is not
created. I suppose the simplest solution would be to add the line:

 options bonding max_bonds=0

to the vdsm.conf file. I haven't tried it, but I think it should work.

Comment 25 cshao 2016-07-25 06:53:20 UTC
(In reply to Beniamino Galvani from comment #24)
> I think I understand what's happening. As said before, the root cause
> of the failure of the bond activation is the presence of a bond0
> interface at NM startup. This happens because the file
> /etc/modules-load.d/vdsm.conf loads the bonding module and doing so
> automatically generates a bond0 interface.
> 
> To avoid such problem in NM we pass the max_bonds=0 option to the
> bonding module upon load, so that the initial interface is not
> created. I suppose the simplest solution would be to add the line:
> 
>  options bonding max_bonds=0
> 
> to the vdsm.conf file. I haven't tried it, but I think it should work.

The nic over bond_vlan still can up after append "options bonding max_bonds=0" to /etc/modules-load.d/vdsm.conf.

Comment 26 cshao 2016-07-26 03:36:10 UTC
(In reply to shaochen from comment #25)
> (In reply to Beniamino Galvani from comment #24)
> > I think I understand what's happening. As said before, the root cause
> > of the failure of the bond activation is the presence of a bond0
> > interface at NM startup. This happens because the file
> > /etc/modules-load.d/vdsm.conf loads the bonding module and doing so
> > automatically generates a bond0 interface.
> > 
> > To avoid such problem in NM we pass the max_bonds=0 option to the
> > bonding module upon load, so that the initial interface is not
> > created. I suppose the simplest solution would be to add the line:
> > 
> >  options bonding max_bonds=0
> > 
> > to the vdsm.conf file. I haven't tried it, but I think it should work.
> 
> The nic over bond_vlan still can up after append "options bonding
> max_bonds=0" to /etc/modules-load.d/vdsm.conf.

Typo: the NIC over bond_vlan still can't come up after appending "options bonding max_bonds=0" to /etc/modules-load.d/vdsm.conf.

Comment 27 Petr Horáček 2016-07-26 13:37:44 UTC
It is not possible to set module options in /etc/modules-load.d/vdsm.conf; you have to write them into /etc/modprobe.d/vdsm.conf.

[root@vm_fc ~]# cat /etc/modprobe.d/vdsm-bonding-modprobe.conf 
# VDSM bonding modprobe configuration
options bonding max_bonds=0

I created a patch which should fix this problem, but I don't know how to verify it with an Anaconda installation. Could you help me? I can give you RPMs; just tell me for which system and which version of VDSM I should build them.
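For reference, a sketch of the distinction above (file names as used in this bug): files under /etc/modules-load.d/ only list module names to load at boot, one per line, and cannot carry options, while files under /etc/modprobe.d/ hold the option lines that are applied when the module is loaded.

```
# /etc/modules-load.d/vdsm.conf -- read by systemd-modules-load:
# bare module names only, no options
bonding

# /etc/modprobe.d/vdsm-bonding-modprobe.conf -- read by modprobe:
options bonding max_bonds=0
```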

Comment 28 Fabian Deutsch 2016-07-26 13:51:40 UTC
Petr, if your fix works on RHEL-H then this should very likely also work on NGN.

Did you verify it on RHEL?

Comment 29 Petr Horáček 2016-07-26 14:36:32 UTC
I built and installed it on Fedora 23; it does not create bond0 on boot anymore. On CentOS 7 I just checked whether the config files work; they do the same as on Fedora 23 (at least for me; edwardh reported that for him it still creates bond0 even with the changed bonding module options).

I have not tried it together with Anaconda. Is it possible to install VDSM via CentOS 7 Anaconda?

Comment 30 Edward Haas 2016-07-26 15:15:49 UTC
(In reply to Beniamino Galvani from comment #24)
> I think I understand what's happening. As said before, the root cause
> of the failure of the bond activation is the presence of a bond0
> interface at NM startup. This happens because the file
> /etc/modules-load.d/vdsm.conf loads the bonding module and doing so
> automatically generates a bond0 interface.
> 
> To avoid such problem in NM we pass the max_bonds=0 option to the
> bonding module upon load, so that the initial interface is not
> created. I suppose the simplest solution would be to add the line:
> 
>  options bonding max_bonds=0
> 
> to the vdsm.conf file. I haven't tried it, but I think it should work.

Unfortunately, this solution is not consistent:
the /etc/modprobe.d/* files may be read after the kernel module is loaded, making them irrelevant to the existing state.

I would suggest avoiding the use of bond0, as it collides with NM.

Comment 31 Martin Tessun 2016-08-05 13:38:48 UTC
*** Bug 1364476 has been marked as a duplicate of this bug. ***

Comment 32 Edward Haas 2016-08-05 17:26:25 UTC
Pasting here the comment from the gerrit patch.
It explains why the bonding module may still be loaded even though modprobe.d has been updated.

For the patch to be helpful, it requires a 'dracut -f' command to update the initrd.

If the initrd used is an old one, the following scenario will occur:
- Install VDSM.
- Upgrade the kernel (initrd is created, taking /etc/modules-load.d/vdsm.conf config)
[From this point on, bonding is loaded by initrd]
- Upgrade VDSM with a build that includes this patch: no effect on boot.

As NGN updates initrd on each boot, this solution will work smoothly.
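The scenario above hinges on bonding being listed in a modules-load.d file that got baked into the initrd. A minimal sketch of a check for that condition (the directory is a parameter so it can be run against a test tree; the helper name is ours, not from vdsm):

```shell
# Succeeds when any modules-load.d style .conf file in the given
# directory force-loads the bonding module (bare module name on a line).
loads_bonding() {
    grep -qsx 'bonding' "$1"/*.conf
}

if loads_bonding "${1:-/etc/modules-load.d}"; then
    # After dropping the entry (or changing module options), the initrd
    # must be rebuilt so the old config is not used at boot:
    #   dracut -f
    echo "bonding is force-loaded at boot"
else
    echo "bonding is not force-loaded"
fi
```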

Comment 33 cshao 2016-08-08 08:16:43 UTC
Test version:
redhat-virtualization-host-4.0-20160803.3
imgbased-0.7.4-0.1.el7ev.noarch 
vdsm-4.18.10-1.el7ev.x86_64

I have to re-assign this bug because I can still reproduce this issue with the latest RHVH build: RHVH still can't obtain an IP after reboot.

I also notice that all the related patches belong to the vdsm component; should we move the bug to the VDSM component?

Comment 34 Edward Haas 2016-08-08 08:35:15 UTC
Is this the bond0 problem or something new?
Please confirm whether /etc/modprobe.d/vdsm-bonding-modprobe.conf exists and that after boot, bond0 is not defined.

Comment 35 cshao 2016-08-09 07:18:30 UTC
(In reply to Edward Haas from comment #34)
> Is this the bond0 problem or something new?

It seems it is still the bond0 (bond+VLAN network) problem; please see the attachment for more details.

> Please confirm if /etc/modprobe.d/vdsm-bonding-modprobe.conf exist and that
> after boot, bond0 is not defined.

# cat /etc/modprobe.d/vdsm-bonding-modprobe.conf
# VDSM bonding modprobe configuration
options bonding max_bonds=0

Comment 36 cshao 2016-08-09 07:19:33 UTC
Created attachment 1189074 [details]
bond_vlan_new_log_0803

Comment 37 Red Hat Bugzilla Rules Engine 2016-08-09 08:19:53 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 38 Anatoly Litovsky 2016-08-10 12:12:52 UTC
OK, then we should add --append="max_bond=0" to the bootloader command in the ISO ks file.

Comment 39 Fabian Deutsch 2016-08-10 13:01:02 UTC
We should not have custom installation arguments for NGN.

We need to understand why the modprobe.d approach is not working (having the config in a file i modprobe.d)

Comment 40 Edward Haas 2016-08-11 11:03:02 UTC
I had not noticed that we are talking about having VLAN interfaces slaved under a bond.
That option is not supported by RHEL as far as I know (https://access.redhat.com/solutions/483803), and it is surely not supported by VDSM.

Please retest without the VLAN slaves (I guess moving the VLAN on top of the bond).

Comment 41 cshao 2016-08-11 13:41:58 UTC
(In reply to Edward Haas from comment #40)
> I have not noticed that we are talking about having vlan interfaces as
> slaved under a bond.
> That option is not supported by RHEL as far as I know
> (https://access.redhat.com/solutions/483803) and it is surely not supported
> by VDSM.
> 
> Please retest without the vlan slaves (I guess moving the vlan on top of the
> bond).

Test result without the vlan slaves:


Test version: 
redhat-virtualization-host-4.0-20160810.1 
imgbased-0.8.3-0.1.el7ev.noarch
redhat-release-virtualization-host-4.0-0.29.el7.x86_64
vdsm-4.18.11-1.el7ev.x86_64


Test steps:
1. Anaconda interactive install of RHVH via PXE.
2. Enter the network page.
3. Add a bond network -> save.
4. Enter the bond connection editing page -> bond tab, click the Add button.
5. Choose a connection type -> Ethernet -> select a NIC.
6. Click the Add button again, choose a connection type -> Ethernet -> select another NIC.
7. Set bond mode -> active-backup.
8. Save.
9. The bond can obtain an IP.
10. Continue the installation.
11. Reboot and log in to RHVH.
12. ip addr

Test result:
RHVH can still obtain an IP over the bond network after reboot.

So the bug is fixed; changing bug status to VERIFIED.

Comment 42 Fabian Deutsch 2016-08-11 14:38:48 UTC
Thanks Edward!

I've opened bug 1366298 to prevent the creation of this setup in NetworkManager.

