Bug 1154399

Summary: VDSM script reset network configuration on every reboot when based on predefined bond
Product: [Retired] oVirt
Component: vdsm
Version: 3.5
Hardware: x86_64
OS: Linux
Whiteboard: network
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Keywords: Reopened
Reporter: Aleksandr <aleksandr.bembel>
Assignee: Petr Horáček <phoracek>
QA Contact: Michael Burman <mburman>
CC: aleksandr.bembel, asegurap, bazulay, bugs, danken, dougsland, ecohen, enrico.tagliavini, fdeutsch, gklein, iheim, ldelouw, lsurette, lvernia, mario.langmann, mburman, mgoldboi, mkalinin, nicolas, pdwyer, pzhukov, rbalakri, rbarry, sbonazzo, shihliu, sraje, troels, ycui, yeylon, ylavi
Target Release: 3.5.2
Flags: lvernia: ovirt_requires_release_note?
Fixed In Version: vdsm-4.16.13-1.el6ev
Doc Type: Bug Fix
Type: Bug
oVirt Team: Network
Clones: 1194553, 1213842
Bug Depends On: 1209486
Bug Blocks: 1186161, 1193058, 1194553, 1213842
Last Closed: 2015-04-29 06:18:59 UTC

Description Aleksandr 2014-10-19 12:57:37 UTC
Description of problem:
I configured an oVirt node with an ovirtmgmt interface. It is UP in ovirt-engine, but when I reboot the node, it boots and after a few seconds I lose connectivity. I connected to the node via IPMI, and in /etc/sysconfig/network-scripts/ there is no ifcfg-bond0.X and no ifcfg-ovirtmgmt.

Also, the vdsm daemon did not start:
 service vdsmd start
libvirtd start/running, process 6113
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running run_init_hooks
vdsm: Running check_is_configured
libvirt is already configured for vdsm
vdsm: Running validate_configuration
SUCCESS: ssl configured to true. No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
vdsm: Running unified_network_persistence_upgrade
vdsm: Running restore_nets
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm-restore-net-config", line 137, in <module>
    restore()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 123, in restore
    unified_restoration()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 66, in unified_restoration
    persistentConfig.bonds)
  File "/usr/share/vdsm/vdsm-restore-net-config", line 91, in _filter_nets_bonds
    bonds[bond]['nics'], net)
KeyError: u'bond0'
vdsm: stopped during execute restore_nets task (task returned with error code 1).
vdsm start                                                 [FAILED]

It starts only if I manually restore the network configuration, delete /var/lib/vdsm/persistence/netconf and create a nets_restored file in /var/lib/vdsm.
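
A rough sketch of that manual recovery (assuming nets_restored is just an empty marker file):

# first restore the ifcfg files for bond0, bond0.X and ovirtmgmt by hand, then:
rm -rf /var/lib/vdsm/persistence/netconf
touch /var/lib/vdsm/nets_restored
service vdsmd start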

Version-Release number of selected component (if applicable):
rpm -qa | grep vdsm
vdsm-python-4.16.7-1.gitdb83943.el6.noarch
vdsm-jsonrpc-4.16.7-1.gitdb83943.el6.noarch
vdsm-python-zombiereaper-4.16.7-1.gitdb83943.el6.noarch
vdsm-xmlrpc-4.16.7-1.gitdb83943.el6.noarch
vdsm-yajsonrpc-4.16.7-1.gitdb83943.el6.noarch
vdsm-4.16.7-1.gitdb83943.el6.x86_64
vdsm-cli-4.16.7-1.gitdb83943.el6.noarch

cat /etc/redhat-release
CentOS release 6.5 (Final)


Steps to Reproduce:
1. configure network manually in /etc/sysconfig/network-scripts/
2. add node to ovirt-engine
3. reboot node

Actual results:
Lost connectivity to node, vdsm scripts reset network configuration

Expected results:
normal reboot

Additional info:

Comment 1 Aleksandr 2014-10-20 06:12:17 UTC
I found the problem and how to solve it manually. 

I created the bond interface manually by hand before installing oVirt, and created the ovirtmgmt interface manually too. After I installed oVirt on the node, the VDSM scripts created folders with the network configuration in /var/lib/vdsm/persistence/netconf/; there is a "nets" folder with the configuration of the ovirtmgmt interface, but there is no "bonds" folder for the bonding configs.

After creating this folder with a configuration for the bond0 interface, everything starts to work and the node reboots normally.
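
For illustration, recreating the missing entry might look roughly like this (the exact schema of the persisted files is not shown here, so the JSON below is an assumption modelled on the setupNetworks bond attributes):

mkdir -p /var/lib/vdsm/persistence/netconf/bonds
# assumed format: one JSON file per bond, named after the bond
cat > /var/lib/vdsm/persistence/netconf/bonds/bond0 <<'EOF'
{"nics": ["eth0", "eth1"], "options": "mode=active-backup miimon=150"}
EOF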

Comment 2 Dan Kenigsberg 2014-10-26 00:55:59 UTC
Toni, could you take a look? I thought http://gerrit.ovirt.org/32769 should have fixed that.

Comment 3 Antoni Segura Puimedon 2014-10-27 07:52:13 UTC
@Aleksandr: Does the /etc/sysconfig/network-scripts/ifcfg-bond0 you manually created have any of these headers:

- '# Generated by VDSM version'
- '# automatically generated by vdsm'

If that is the case, they will be removed at every boot. If that is not the case, are you calling 'persist /etc/sysconfig/network-scripts/ifcfg-bond0' in the command line after creating them?

vdsm only persists the networks and bonds it creates and since ifcfg-bond0 is created by you, it assumes (wrongly or not) it will be there on boot. There are three ways to go about this:

- creating the bond with vdsClient like so:
  vdsClient -s 0 setupNetworks bondings='{bond11:{nics:p1p3+p1p4}}'
  # Then create the network over it (which will persist the bond too in /var/lib/vdsm/persistence/netconf/bonds)
- Using the node persistence directly:
  persist /etc/sysconfig/network-scripts/ifcfg-bond0
- Code: Somehow detect that device configuration we depend on is not persisted and do like in the upgrade script to unified persistence.

Comment 4 Aleksandr 2014-10-27 08:11:43 UTC
(In reply to Antoni Segura Puimedon from comment #3)
> @Aleksandr: Does the /etc/sysconfig/network-scripts/ifcfg-bond0 you manually
> create have any of these headers:
> 
> - '# Generated by VDSM version'
> - '# automatically generated by vdsm'
> 
> If that is the case, they will be removed at every boot. If that is not the
> case, are you calling 'persist /etc/sysconfig/network-scripts/ifcfg-bond0'
> in the command line after creating them?
> 
> vdsm only persists the networks and bonds it creates and since ifcfg-bond0
> is created by you, it assumes (wrongly or not) it will be there on boot.
> There are three ways to go about this:
> 
> - creating the bond with vdsClient like so:
>   vdsClient -s 0 setupNetworks bondings='{bond11:{nics:p1p3+p1p4}}'
>   # Then create the network over it (which will persist the bond too in
> /var/lib/vdsm/persistence/netconf/bonds
> - Using the node persistence directly:
>   persist /etc/sysconfig/network-scripts/ifcfg-bond0
> - Code: Somehow detect that device configuration we depend on is not
> persisted and do like in the upgrade script to unified persistence.

/etc/sysconfig/network-scripts/ifcfg-bond0 doesn't have such a header. I created it manually before installing oVirt on this node.

Comment 5 Dan Kenigsberg 2014-10-29 14:08:58 UTC
And when you run

 persist /etc/sysconfig/network-scripts/ifcfg-bond0

does the problem go away? If so - it's not a bug. Manual creation requires manual persistence.
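
For completeness, the node-side step would be (oVirt Node only - there is no such command on a plain EL host, as noted further down in this bug):

persist /etc/sysconfig/network-scripts/ifcfg-bond0
# assumed mechanism: the node copies the file under /config and bind-mounts it back,
# so it survives reboots of the otherwise stateless image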

Comment 6 Dan Kenigsberg 2014-11-17 16:25:48 UTC
(In reply to Dan Kenigsberg from comment #5)
> does the problem go away?

Please reopen if this is not the case.

Comment 7 Dan Kenigsberg 2015-02-07 17:20:56 UTC
We have heard more reports about our failure to revive bonds that were created outside Vdsm but are required by its networks.

Since this is a common use case, particularly for hosted engine, it may require extreme measures such as consuming these bonds and making them ours.

Comment 8 Nicolas Ecarnot 2015-02-26 09:52:58 UTC
Hello,

After upgrading from 3.4.1 to 3.5.1 on CentOS 6.6 (1 manager and 3 hosts), all my manually configured interfaces, with bonded interfaces, bridges and VLANs, survived host reboots on 2 hosts, but not on the third one.

All 3 hosts are exact clones; they were installed by kickstart+cfengine, so no manual error is possible.

Anyway, on this third host, I'm facing the exact issue described here as well as in the related BZs (#1134346, #1188251).

I cannot use the proposed workaround because these CentOS 6.6 hosts have no "persist" command (does this command only exist on oVirt _nodes_, not complete hosts?).

I'd be OK with completely wiping my manual network config and doing it again via the web GUI, but will it allow me to add the VLAN tag?

If:
- I cannot redefine the interface settings+VLAN using the oVirt GUI
and
- I cannot persist my manual interface settings because this command does not exist,

is there another way?
Can I manually persist the needed ifcfg files? Where should these persisted files reside, and how?

Comment 9 Mario Langmann 2015-03-02 06:49:41 UTC
Hello,

I solved this problem by editing the file /etc/vdsm/vdsm.conf:
Add the line "net_persistence = ifcfg", which makes vdsm use the manually created network configuration.
The persist command is, as far as I know, only for the hypervisor images. Correct me if I am wrong.
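
A sketch of that change (assuming the option belongs in the [vars] section of /etc/vdsm/vdsm.conf; adjust if your file already has that section):

# append to /etc/vdsm/vdsm.conf -- skip the [vars] header if it is already present
cat >> /etc/vdsm/vdsm.conf <<'EOF'
[vars]
net_persistence = ifcfg
EOF
service vdsmd restart    # so vdsm re-reads the configuration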

Comment 10 Dan Kenigsberg 2015-03-02 17:05:56 UTC
(In reply to Nicolas Ecarnot from comment #8)

> All 3 hosts are absolutely clones, they were installed by
> kickstart+cfengine, so no manual error possible.

This is extremely interesting. Can you think of anything different about the faulty host? Do you have `chkconfig network on` on all three? Can you find "timeout waiting for ipv4 addresses" in your /var/log/messages from the first post-upgrade boot?

> I can not use the proposed workaround because these CentOS 6.6 have no
> "persist" command (does this command only exist on oVirt _nodes_, not
> complete hosts?).

You are perfectly right. "persist" is node-only.

Reverting to net_persistence=ifcfg is the important part of the workaround.

Comment 11 Nicolas Ecarnot 2015-03-03 10:34:51 UTC
(In reply to Dan Kenigsberg from comment #10)
> This is extremely interesting. Can you think of any different in the faulty
> host?

I honestly don't recall anything I may have done differently on this third host.
When things began to act differently, I obviously began to take additional actions, like ifup, ifdown, double-checking things, reboots.
But these are clearly consequences, not causes.

> Do you have `chkconfig network on` on all three?

Yes.

> Can you find "timeout
> waiting for ipv4 addresses" on your /var/log/message on the first
> post-upgrade boot?

No. A grep (and additional lookups) on all the old /var/log/messages* did not return anything.

> Reverting to net_persistence=ifcfg is the important part of the workaround.

_This_ is puzzling me!
I solved my issue by using what was advised in a related BZ, i.e.:

# vdsClient -s 0 setupNetworks bondings='{bond0:{nics:em1+em2},bond1:{nics:em3+em4}}'
# vdsClient -s 0 setSafeNetworkConfig

But I confirm that I have no 'net_persistence' setting in my vdsm.conf.

Just to add something: I have 3 datacenters, and the one I upgraded is in semi-production - I can afford some downtime. This one only contains 3 hosts, so network corrections are bearable.
I'm hesitant to upgrade my two other datacenters containing 10 hosts each.

I can test some of your suggestions on the small DC.
Just saying that today, all 3 hosts are working well, and did well when they cleanly followed the path maintenance > manual reboot > wait for both networks to come up (mgmt+iscsi) > activation.

Comment 12 Sandro Bonazzola 2015-03-03 12:51:13 UTC
Re-targeting to 3.5.3 since this bug has not been marked as blocker for 3.5.2 and we have already released 3.5.2 Release Candidate.

Comment 13 Dan Kenigsberg 2015-03-03 12:56:15 UTC
Nicolas, I understand your worry regarding an upgrade of a production system, before we understand the nature of this bug. Do you have ifcfg files with no '# Generated by VDSM' header on your production hosts?

Making sure that all /etc/sysconfig/network-scripts/ifcfg-bond* files are "owned" by Vdsm by reconfiguring them via command line is a good precaution. I'm happy that it has worked for you.

Please note that your command line sets the bonds with default options (mode=4), which may not be what you want. The following syntax is used to set further options:

# vdsClient -s 0 setupNetworks bondings='{bond0:{nics:em1+em2,options:mode=3 miimon=222}}'
# vdsClient -s 0 setSafeNetworkConfig

Setting net_persistence=ifcfg in vdsm.conf before the upgrade is a more "brutal" workaround. It means that vdsm does not attempt to move network definitions from ifcfg to /var/lib/vdsm/persistence/netconf.

Comment 14 Michael Burman 2015-03-03 15:46:47 UTC
Hello,

I tried to reproduce this issue with the following steps, and the issue did not reproduce:
3.5.1-0.1.el6ev
vdsm-4.16.12-2.el6ev.x86_64
Red Hat Enterprise Linux Server release 6.6 (Santiago)


1. Created bond0 manually:
cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth0
HWADDR=00:14:5e:dd:09:24
MASTER=bond0
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no

cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth1
HWADDR=00:14:5e:dd:09:26
MASTER=bond0
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no

cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=bond0
BONDING_OPTS='mode=active-backup miimon=150'
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=dhcp

2. /etc/init.d/network restart

3. Installed the host in my vt14 setup; the rhevm bridge was created on bond0 and got an IP.

4. Rebooted host

5. Host went to non-operational state until it finished rebooting, then came up and got an IP.

- Couldn't reproduce this way.

- I will try to reproduce another way; I will try to upgrade from 3.4 > 3.5.1.

- If someone has or knows the steps to reproduce this issue, please let me know.

Best regards,

Comment 15 Nicolas Ecarnot 2015-03-04 09:02:36 UTC
(In reply to Michael Burman from comment #14)
> I was trying to reproduce this issue with the next steps and issue didn't
> reproduced:

Thank you Michael for taking some time on this bug.
I suppose that what's different between your setup and mine is that I have additional things I did not see in your description:

- usage of VLANs on some bonds (i.e. "bond0.6"...)
- specifying additional params (e.g. MTU=9000). Does it matter?
- presence of oVirt bridges

Dan, OK for your explanations.
When coping with bugs, I always try to look towards the future, and I'd like to avoid wasting too much time on behaviour that won't last. I haven't yet completely read and understood what network persistence implies, but if _this_ is the future, let us all test the setups that will be used in the future.

If this whole issue is known to be corrected in 3.5.2, and as my present setup is OK, I'm not sure it's worth looking further.

Just to answer your questions:
- some of my files contain the '# generated by vdsm blahblah' header and some others don't, but I don't take it as evidence, as I may have copied them from another setup some time long ago
- every ifcfg-* file is owned by root only. Do I have to chown 36:36 them all, like what is done on NFS shares?
- cat /proc/net/bonding/bond* tells me that mode active-backup is used.
That does not match your statement about mode 4 being set by the vdsClient command.
Sounds like I have further reading to do now, because it seems I'm using a mix of two concurrent technologies, and I don't feel good about it.

Comment 16 Dan Kenigsberg 2015-03-05 12:34:08 UTC
Nicolas, based on your account, two of three identical hosts do not reproduce the bug. Only the third host expressed it. This means that we have a flaky, race-prone issue here, which may be unrelated to vlans/mtu/bridges.

We are not yet sure when the bug will be fixed, as we do not understand it properly.

By ifcfg "ownership" I mean the existence of the '# generated by vdsm' header, not the filesystem owner (which is always root). It is paramount to understand what the case is on your production servers. Do they have non-vdsm-owned ifcfg-bond* files?

If Vdsm does not receive any bond option, it sets 'mode=802.3ad miimon=150' as its default. However, if the bond device already exists, it keeps its former options. Specifying the options explicitly is the safe way to ensure the final config.
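
For example, forcing the desired mode and then checking that the kernel picked it up (the bond name, nics and options below are placeholders; the commands mirror those given above, with mode=1 being active-backup):

# vdsClient -s 0 setupNetworks bondings='{bond0:{nics:em1+em2,options:mode=1 miimon=150}}'
# vdsClient -s 0 setSafeNetworkConfig
# cat /proc/net/bonding/bond0
(check the "Bonding Mode" and "MII Polling Interval" lines in the output)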

Comment 17 Michael Burman 2015-03-09 07:11:32 UTC
Hello All,

I made another attempt to reproduce this issue, this time upgrading vdsm from 3.4>3.5.1.
I didn't manage to reproduce the issue this time either.

My steps:

1) Started with clean rhel6.6 host with vdsm-4.14.18-7.el6ev.x86_64

2) Manually created bond0:

vi /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
BONDING_OPTS='mode=active-backup miimon=150'
NM_CONTROLLED=no
BOOTPROTO=dhcp

vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

vi /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

3) Restarted the network service and installed the rhel6.6 host in my rhevm-3.5.1-0.1.el6ev.noarch engine with success.

4) rhevm bridge was created on top of bond0

5) Set the host to maintenance mode and ran 'yum update'. The host was updated with vdsm-4.16.12-2.el6ev.x86_64.

6) Activated the host in a 3.5 cluster with no issue; the host is up and has connectivity.

7) Rebooted the host. The host went to non-operational state until it finished booting and then came up again. No connectivity issues. The rhevm bridge is on top of bond0 (that was created in step 2) and has an IP.

Comment 18 Nicolas Ecarnot 2015-03-09 07:25:18 UTC
(In reply to Michael Burman from comment #17)
Michael,

I wrote previously:

> I suppose that what's different between your setup and mine is that I have
> additionnal things I did not see in your description :
> - usage of VLAN on some bonds (ie "bond0.6"...)

Would it be worth trying to add this point in your tests?
Does it matter?

Comment 19 Michael Burman 2015-03-09 16:22:54 UTC
Another try to reproduce, this time with a vlan on top of the bond.
Upgraded from 3.4>3.5.1.

Managed to reproduce the issue this time after rebooting.

1) Started with clean rhel6.6 host with vdsm-4.14.18-7.el6ev.x86_64

2) Manually created bond0.162

  vi /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
BONDING_OPTS='mode=active-backup miimon=150'
NM_CONTROLLED=no
BOOTPROTO=none

   vi /etc/sysconfig/network-scripts/ifcfg-bond0.162
DEVICE=bond0.162
VLAN=yes
BOOTPROTO=static
IPADDR=10.35.129.14
NETMASK=255.255.255.0
GATEWAY=10.35.129.254
NM_CONTROLLED=no
ONBOOT=yes

   vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

   vi /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

3) Restarted the network service and installed the rhel6.6 host in my rhevm-3.5.1-0.1.el6ev.noarch engine with success.

4) rhevm bridge was created on top of bond0.162

5) Set the host to maintenance mode and ran 'yum update'. The host was updated with vdsm-4.16.12-2.el6ev.x86_64.

6) Activated the host in a 3.5 cluster; the host is up and has connectivity.

7) Rebooted the host. The host went to non-operational state. bond0.162 disappeared.

Before reboot:
ls /var/lib/vdsm/persistence/netconf/bonds
bond0 bond0.162

After reboot:
only bond0

So this bug reproduced with a vlan on top of a manually created bond.
netconf does not persist the vlan on top of the bond.

If I manually run ifup bond0.162, I get an IP, but if I restart the network service, I lose the IP.
So I think it's clearer now.

Best regards.

Comment 20 Nicolas Ecarnot 2015-03-09 18:40:24 UTC
(In reply to Michael Burman from comment #19)
> Another try to reproduce, this time with vlan on top of the bond.
> upgraded from 3.4>3.5.1
> 
> Managed to reproduce this issue this time after rebooting.

Michael, you're the man!
Thank you for confirming this, it is a relief for me :)

Comment 21 Michael Burman 2015-03-10 06:20:57 UTC
Dan, 

These are the ifcfg files after rebooting:
cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=bond0
BONDING_OPTS='mode=active-backup miimon=150'
BRIDGE=rhevm
ONBOOT=no
MTU=1500
NM_CONTROLLED=no
HOTPLUG=no

cat /etc/sysconfig/network-scripts/ifcfg-bond0.162
DEVICE=bond0.162
VLAN=yes
BOOTPROTO=static
IPADDR=10.35.129.14
NETMASK=255.255.255.0
GATEWAY=10.35.129.254
NM_CONTROLLED=no
ONBOOT=yes

cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth0
HWADDR=00:14:5e:dd:09:24
MASTER=bond0
SLAVE=yes
ONBOOT=no
MTU=1500
NM_CONTROLLED=no


cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth1
HWADDR=00:14:5e:dd:09:26
MASTER=bond0
SLAVE=yes
ONBOOT=no
MTU=1500
NM_CONTROLLED=no

- As you can see in the ifcfg-bond0.162 file, there is no 
# Generated by VDSM version 4.16.12-2.el6ev line.

- I'm going to give it another run, this time without upgrading from 3.4>3.5.1.
I believe the upgrade process has nothing to do with this issue.

Comment 22 Michael Burman 2015-03-10 08:07:15 UTC
So this issue is not related to the upgrade process.
vdsm does not generate the vlan on top of a manually created bond.
We can see it before rebooting the host:

- vlan on top of a bond created manually (same steps as in comment 21) and host installed in RHEV-M. Host installed successfully, rhevm bridge created on top of bond0.162
Before reboot:

[root@navy-vds1 netconf]# ls nets
rhevm

[root@navy-vds1 netconf]# ls bonds
bond0

[root@navy-vds1 netconf]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=bond0
BONDING_OPTS='mode=active-backup miimon=150'
BRIDGE=rhevm
ONBOOT=yes
BOOTPROTO=none
MTU=1500
NM_CONTROLLED=no
HOTPLUG=no

[root@navy-vds1 netconf]# cat /etc/sysconfig/network-scripts/ifcfg-bond0.162
DEVICE=bond0.162
VLAN=yes
BOOTPROTO=static
IPADDR=10.35.129.14
NETMASK=255.255.255.0
GATEWAY=10.35.129.254
NM_CONTROLLED=no
ONBOOT=yes

[root@navy-vds1 netconf]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth0
HWADDR=00:14:5e:dd:09:24
MASTER=bond0
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no

[root@navy-vds1 netconf]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth1
HWADDR=00:14:5e:dd:09:26
MASTER=bond0
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no

ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:10:18:24:4a:fc brd ff:ff:ff:ff:ff:ff
    inet6 fe80::210:18ff:fe24:4afc/64 scope link
       valid_lft forever preferred_lft forever
3: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:10:18:24:4a:fd brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
5: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
6: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 52:54:00:3e:29:c1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
7: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 500
    link/ether 52:54:00:3e:29:c1 brd ff:ff:ff:ff:ff:ff
9: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
10: bond0.162@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
    inet 10.35.129.14/24 brd 10.35.129.255 scope global bond0.162
    inet6 fe80::214:5eff:fedd:924/64 scope link
       valid_lft forever preferred_lft forever
12: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 9a:cc:8d:94:6c:92 brd ff:ff:ff:ff:ff:ff
13: rhevm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::214:5eff:fedd:924/64 scope link
       valid_lft forever preferred_lft forever



After reboot:
Host went to a non-responsive state and doesn't recover. Host lost connectivity.
Only the 'ifup bond0.162' command will bring the host up.

So we have a reproduction here, with clear steps.

Best regards

Comment 23 Dan Kenigsberg 2015-03-10 10:15:49 UTC
I am afraid this is a different bug. I see that Engine attempts to create a rhevm network with no vlan, no IP address, and no DHCP:

Thread-25::DEBUG::2015-03-10 09:31:24,745::__init__::469::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'rhevm': {'bonding': 'bond0', 'STP': 'no', 'bridged': 'true', 'mtu': '1500'}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}

Nothing good can come out of that. It's an Engine bug. I suspect that this is because in Engine, rhevm is defined with no vlan id. Please open a separate bug on that.

The case that the users discuss is a bit different. They had a working vdsm of ovirt-3.4, connected to the Engine with vlan networks on top of a predefined bond. Then, they upgrade vdsm, and reboot the host.

No new setupNetworks command is issued from Engine to Vdsm in that process. Thus, I need more of your time to help reproduce the users' issue.

Comment 24 Michael Burman 2015-03-11 07:32:52 UTC
Another attempt to reproduce, this time without any success.
Followed these steps:

1) clean rhel6.6 host with vdsm-4.14.18-7.el6ev.x86_64

2) created bond0 manually from eth2 and eth3

3) Added host to RHEV-M with success. 'rhevm' attached to eth0

4) created VM vlan tagged network and attached to bond0 via SN

5) set host to maintenance and updated vdsm to 3.5.1(vdsm 4.16.12) with 'yum update' command.

6) activated host in 3.5 cluster. bond0 doesn't break, vlan network attached to bond0

7) rebooted host. bond0 doesn't break, vlan attached to bond0

ls /var/lib/vdsm/persistence/netconf/nets/
rhevm  test_net(vlan network)

ls /var/lib/vdsm/persistence/netconf/bonds/
bond0



- No reproduction.

Comment 25 Petr Horáček 2015-03-11 13:56:21 UTC
Reproduced:

1) installed and configured rhel66

2) install VDSM dependencies
$ yum install -y http://download.devel.redhat.com/brewroot/packages/augeas/1.0.0/8.el6/x86_64/augeas-1.0.0-8.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/augeas/1.0.0/8.el6/x86_64/augeas-devel-1.0.0-8.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/augeas/1.0.0/8.el6/x86_64/augeas-libs-1.0.0-8.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/augeas/1.0.0/8.el6/x86_64/augeas-debuginfo-1.0.0-8.el6.x86_64.rpm
$ yum install -y http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/fence-sanlock-2.8-1.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-2.8-1.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-devel-2.8-1.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-lib-2.8-1.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-python-2.8-1.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-debuginfo-2.8-1.el6.x86_64.rpm
$ yum install -y http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-0.10.2-49.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-client-0.10.2-49.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-devel-0.10.2-49.el6.x86_64.rpm \
http://download.eng.bos.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-lock-sanlock-0.10.2-49.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-python-0.10.2-49.el6.x86_64.rpm \
http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-debuginfo-0.10.2-49.el6.x86_64.rpm

2) $ yum install -y http://resources.ovirt.org/pub/yum-repo/ovirt-release34.rpm

3) $ yum install -y vdsm vdsm-cli vdsm-xmlrpc vdsm-jsonrpc vdsm-debug-plugin

4) configure and start vdsm (if it fails, try again; there is one ugly bug)
vdsm-tool configure --force
service vdsmd start

4) create non-vdsm vlaned bond over a dummy
$ ip link add dummy_10 type dummy
$ echo "DEVICE=dummy_9
MASTER=bond10
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no" > /etc/sysconfig/network-scripts/ifcfg-dummy_10
$ echo "DEVICE=bond10
BONDING_OPTS='mode=802.3ad miimon=150'
ONBOOT=yes
BOOTPROTO=none
DEFROUTE=yes
NM_CONTROLLED=no
HOTPLUG=no" > /etc/sysconfig/network-scripts/ifcfg-bond10
$ echo "DEVICE=bond10.180
VLAN=yes
ONBOOT=yes
BOOTPROTO=static
NM_CONTROLLED=no
HOTPLUG=no" > /etc/sysconfig/network-scripts/ifcfg-bond10.180
$ service network restart

5) setup bridged vdsm network over existing vlaned bridge
$ vdsClient -s 0 setupNetworks "networks={test-network:{bonding:bond10,vlan:180,bridged:true}}"

6) current state of ifcfg files (vlan is tagged, but nic and bond are still
non-vdsm):
[root@rhel6 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond10.180 
# Generated by VDSM version 4.14.17-0.el6
DEVICE=bond10.180
ONBOOT=yes
VLAN=yes
BRIDGE=test-network
NM_CONTROLLED=no
HOTPLUG=no
[root@rhel6 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond10
DEVICE=bond10                                                           
BONDING_OPTS='mode=802.3ad miimon=150'                                          
ONBOOT=yes                                                                      
BOOTPROTO=none                                                                  
DEFROUTE=yes                                                                    
NM_CONTROLLED=no                                                                
HOTPLUG=no
[root@rhel6 ~]# cat /etc/sysconfig/network-scripts/ifcfg-dummy_10 
DEVICE=dummy_10                                                         
MASTER=bond10                                                                   
SLAVE=yes                                                                       
ONBOOT=yes                                                                      
MTU=1500                                                                        
NM_CONTROLLED=no

7) upgrade vdsm
$ service vdsmd stop
$ yum install -y http://resources.ovirt.org/pub/yum-repo/ovirt-release35.rpm
$ yum update -y

8) enable network, disable vdsmd on startup and reboot
$ chkconfig vdsmd off
$ shutdown -r now

9) we lost dummy, so add it again
$ ip link add dummy_10 type dummy
$ service network restart

10) start the vdsm service manually:
$ service vdsmd start
[root@rhel6 ~]# service vdsmd start
Please enter your authentication name: Please enter your password: 

Can't connect to default. Skipping.
Stopping ksmtuned:                                         [  OK  ]
Starting multipathd daemon:                                [  OK  ]
Starting ntpd:                                             [  OK  ]
initctl: Job is already running: libvirtd
Starting iscsid:                                           [  OK  ]
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running wait_for_network
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
vdsm: Running check_is_configured
libvirt is already configured for vdsm
vdsm: Running validate_configuration
SUCCESS: ssl configured to true. No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
vdsm: Running unified_network_persistence_upgrade
vdsm: Running restore_nets
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm-restore-net-config", line 137, in <module>
    restore()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 123, in restore
    unified_restoration()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 66, in unified_restoration
    persistentConfig.bonds)
  File "/usr/share/vdsm/vdsm-restore-net-config", line 91, in _filter_nets_bonds
    bonds[bond]['nics'], net)
KeyError: u'bond10'
vdsm: stopped during execute restore_nets task (task returned with error code 1).
vdsm start

11) start failed; the ifcfg files of the bridge, vlan, bond and dummy were removed from network-scripts

Comment 26 Michael Burman 2015-03-11 15:53:35 UTC
Followed Petr's steps, only using my host's interfaces to create the bond.

Reached step 6) before the vdsm upgrade, and our ifcfg files are a bit different:

cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.14.18-7.el6ev
DEVICE=bond0
ONBOOT=yes
BONDING_OPTS='mode=802.3ad miimon=150'
BRIDGE=test
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
STP=no
HOTPLUG=no

cat /etc/sysconfig/network-scripts/ifcfg-bond0.162
DEVICE=bond0.162
VLAN=yes
BOOTPROTO=static
NM_CONTROLLED=no
ONBOOT=yes

cat /etc/sysconfig/network-scripts/ifcfg-eth2
# Generated by VDSM version 4.14.18-7.el6ev
DEVICE=eth2
ONBOOT=yes
HWADDR=00:10:18:24:4A:FC
MASTER=bond0
SLAVE=yes
MTU=1500
NM_CONTROLLED=no
STP=no

cat /etc/sysconfig/network-scripts/ifcfg-eth3
# Generated by VDSM version 4.14.18-7.el6ev
DEVICE=eth3
ONBOOT=yes
HWADDR=00:10:18:24:4A:FD
MASTER=bond0
SLAVE=yes
MTU=1500
NM_CONTROLLED=no
STP=no

- As you can see, my bond0.162 is generated by VDSM and bond0 is not;
in Petr's steps it is the opposite.

So I stopped at this stage before continuing to the update step.

Comment 27 Petr Horáček 2015-03-11 16:37:24 UTC
Sorry, I have a typo in step 4: it should be '$ echo "DEVICE=dummy_10', not '$ echo "DEVICE=dummy_9'

Comment 28 Lior Vernia 2015-03-15 12:26:45 UTC
Just updating that I was with Burman when he tried to reproduce. It seemed like vdsm "owned" the bond and VLAN devices as soon as we added the VLAN network via a Setup Networks command from the engine (the host had pre-existing bond and VLAN devices created manually), and the bug wasn't reproduced.

Comment 31 Dan Kenigsberg 2015-03-17 21:48:34 UTC
This bug annoys multitudes of users. Despite the problems with reproduction in lab conditions, I'd like to include the two suggested patches in the nearest release.

Comment 32 Nicolas Ecarnot 2015-03-17 22:01:17 UTC
Today, I had to semi-manually recreate the many files of nearly 40 hosts in two DCs, and add the net_persistence = ifcfg setting on all of them.
I have been witnessing this issue and fighting it for two weeks, so at least, please don't close this bug by pretending one cannot reproduce it in a lab.

Dan is right, and is politely saying it: it is annoying many of us.

(3.5.1)

Comment 34 Michael Burman 2015-04-02 06:24:49 UTC
I want to test this with rhel and rhev-h as well.
Dan, do we have a rhev-h build that includes this fix in vdsm?

Comment 35 Michael Burman 2015-04-02 08:51:44 UTC
Verified and tested successfully with new build rhevm-3.5.1-0.3.el6ev.noarch
vdsm-4.16.13-1.el6ev.x86_64
rhel 6.6
vdsm upgrade from 4.14 >> 4.16.13-1
Followed these steps:
1) bond0 and bond0.162 created manually outside RHEV-M (network restart)
2) installed the server in the RHEV-M setup successfully
3) attached a VM vlan-tagged network (162) to the host via SN
4) copied the relevant repos (vt14.12) to the server and ran 'yum update'; vdsm upgraded successfully, but due to BZ 1200467 the vdsmd service needed to be restarted manually; the operation was successful and no network configuration was broken (refreshed capabilities)
5) rebooted the server with success. All network configuration was saved and nothing broke, including the manually created bond0 and bond0.162

I would like to perform the same test with rhev-H 6.6 including this fix before moving this bug to verified. Originally I managed to reproduce this issue only with rhev-H.

Comment 36 Lior Vernia 2015-04-02 09:06:37 UTC
Since Dan is on PTO, Fabian should be able to tell us if a build has gone out that included the tracked patches?...

Comment 37 Michael Burman 2015-04-05 11:07:18 UTC
Verified and tested successfully with new rhev-hypervisor6-6.6-20150402.0.el6ev.noarch.rpm
that includes vdsm-4.16.13-1.el6ev.x86_64
vdsm upgrade from vdsm-4.14.18-6.el6ev >> vdsm-4.16.13-1.el6ev.x86_64
Followed these steps:
1) bond0 and bond0.162 created manually outside RHEV-M - via TUI
2) installed the server in the RHEV-M setup (vt14.2) successfully
3) attached a VM vlan-tagged network (162) to the host via SN
4) downloaded and installed rhev-hypervisor6-6.6-20150402.0.el6ev.noarch.rpm in the engine
5) put the host into maintenance and ran 'upgrade' via RHEV-M
6) host rebooted successfully. All network configuration was saved and nothing broke, including the manually (TUI) configured bond0 and bond0.162.

Comment 38 Michael Burman 2015-04-05 11:58:59 UTC
Should have been verified on 3.5.2 - my mistake.
Also, it looks like the fix for this bug created another related issue: bond0 and bond0.162 are persisted and nothing breaks, but none of the networks or NICs have a BOOTPROTO= line in their ifcfg files.

Moving back to Assigned instead of creating a new BZ.

Comment 39 Lior Vernia 2015-04-08 12:22:41 UTC
*** Bug 1203226 has been marked as a duplicate of this bug. ***

Comment 40 Lior Vernia 2015-04-08 12:25:35 UTC
*** Bug 1170869 has been marked as a duplicate of this bug. ***

Comment 41 Lior Vernia 2015-04-08 12:43:31 UTC
If you're going to perform any further tests (or if you're going to write automated tests for these cases - you should!), then please also test pre-defined bridges on top of regular NICs (as described in Bug 1203226).
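
An illustrative ifcfg pair for that extra case (device names and addressing below are placeholders, not taken from Bug 1203226):

cat > /etc/sysconfig/network-scripts/ifcfg-br0 <<'EOF'
DEVICE=br0
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=dhcp
NM_CONTROLLED=no
EOF
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
BRIDGE=br0
ONBOOT=yes
NM_CONTROLLED=no
EOF
service network restart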

Comment 42 Lior Vernia 2015-04-20 08:05:58 UTC
To my understanding it should be okay for the 3.4.* --> 3.5.1 upgrade, but it needs release notes for 3.5.0 --> 3.5.1.

Dan, could you supply documentation on what goes wrong during a 3.5.0 --> 3.5.1 upgrade, and how to work around it once it breaks?

Comment 43 Lior Vernia 2015-04-20 08:08:03 UTC
Sorry, I meant 3.4.* --> 3.5.2 and 3.5.* --> 3.5.2.

Comment 44 Michael Burman 2015-04-27 12:26:38 UTC
Verified on - 3.5.1-0.4.el6ev


- rhev-h 6.6 3.4.z >> rhev-h 6.6 3.5.1
using the following builds:
rhev-h 6.6 3.4.z 20150123.1.el6ev >> rhev-h 6.6 3.5.1  20150421.0.el6ev

1) clean rhev-h 6.6 3.4.z 20150123.1.el6ev installed via USB
2) bond0.162 configured via TUI with dhcp
3) installed server in RHEV-M, rhevm network created on top of bond0
4) via SN attached network to other NIC

* Host is up after upgrade and reboot, all networks attached to server, rhevm got ip, host is up in RHEV-M and can be activated on 3.5 cluster.


- clean rhev-h 6.6 3.5.1  20150421.0.el6ev

1) clean rhev-h 6.6 3.5.1  20150421.0.el6ev installed via USB
2) bond0.162 configured via TUI with dhcp
3) installed server in RHEV-M, rhevm network created on top of bond0
4) via SN attached network to other NIC

* Host is up after reboot, all networks attached to server, rhevm got ip, host is up in RHEV-M.


- rhev-h 7.1 3.5.1  20150420.0.el7ev

1) clean rhev-h 7.1 3.5.1  20150420.0.el7ev installed via USB
2) bond0.162 configured via TUI with dhcp
3) installed server in RHEV-M, rhevm network created on top of bond0
4) via SN attached network to other NIC

* Host is up after reboot, all networks attached to server, rhevm got ip, host is up in RHEV-M.

Comment 45 Eyal Edri 2015-04-29 06:18:59 UTC
oVirt 3.5.2 was GA'd. Closing current release.

Comment 46 Dan Kenigsberg 2015-05-18 15:05:40 UTC
(In reply to Lior Vernia from comment #42)
> To my understanding should be okay for upgrade 3.4.* --> 3.5.1, but needs
> release notes for 3.5.0 --> 3.5.1.
> 
> Dan, could you supply documentation what goes wrong during a 3.5.0 --> 3.5.1
> upgrade, and how to work around it once it breaks?

3.4.* --> 3.5.1 is the recommended upgrade path, as 3.5.0 had many problems related to upgrade.

I am not aware of any issue regarding upgrade from a working 3.5.0 to rhev-3.5.1. However if one did upgrade to 3.5.0 and did break one's persisted network config, upgrade to rhev-3.5.1 would not help restore it - one would have to define a local IP and reconfigure networking on the host via Engine.