Bug 1249397 - vdsm should restore networks much earlier, to let net-dependent services start
Status: CLOSED DEFERRED
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
3.5.4
x86_64 Linux
urgent Severity medium
: ---
: 3.5.4
Assigned To: nobody nobody
Meni Yakove
network
: ZStream
: 1249396
Depends On: 1203422
Blocks: rhsc_qe_tracker_everglades 1208376 1215388 1222139 1228280 1229227 1234293
 
Reported: 2015-08-02 09:01 EDT by Ido Barkan
Modified: 2016-02-10 14:46 EST (History)
24 users

See Also:
Fixed In Version: vt16.6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1203422
Environment:
Last Closed: 2015-08-03 16:42:26 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 41973 None None None Never
oVirt gerrit 42146 None None None Never
oVirt gerrit 42355 None None None Never
oVirt gerrit 42362 None None None Never
oVirt gerrit 42955 None None None Never
oVirt gerrit 42956 None None None Never
oVirt gerrit 42957 None None None Never
oVirt gerrit 42958 None None None Never
oVirt gerrit 42959 None None None Never
oVirt gerrit 42960 None None None Never
oVirt gerrit 42961 None None None Never
oVirt gerrit 42962 None None None Never
oVirt gerrit 42963 None None None Never
oVirt gerrit 42964 None None None Never
oVirt gerrit 43027 None None None Never
oVirt gerrit 43028 None None None Never
oVirt gerrit 43029 None None None Never
oVirt gerrit 43190 None None None Never
oVirt gerrit 43209 None None None Never
oVirt gerrit 43222 None None None Never
oVirt gerrit 43238 None None None Never
oVirt gerrit 43276 None None None Never
oVirt gerrit 43381 None None None Never
oVirt gerrit 43382 None None None Never
oVirt gerrit 43499 None None None Never
oVirt gerrit 43507 None None None Never
oVirt gerrit 43512 None None None Never
oVirt gerrit 43513 None None None Never
oVirt gerrit 43545 None None None Never
oVirt gerrit 43684 None None None Never
oVirt gerrit 43685 None None None Never
oVirt gerrit 44133 None None None Never
oVirt gerrit 44153 None None None Never
oVirt gerrit 44199 None None None Never

Description Ido Barkan 2015-08-02 09:01:34 EDT
+++ This bug was initially created as a clone of Bug #1203422 +++

Description of problem:
After rebooting, it seems that 'vdsm-tool restore-nets' changes the configuration of the ovirtmgmt interface from "ONBOOT=yes" to "ONBOOT=no"

In our environment, that interface is the primary NIC ("em1"), and is not on a VLAN, nor is it used as a VM network.

When initially adding the interface, it keeps the "ONBOOT" setting, but after a single reboot it reverts to "ONBOOT=no":


# Generated by VDSM version 4.16.10-8.gitc937927.el6
DEVICE=em1
HWADDR=b8:2a:72:de:05:fe
ONBOOT=no
IPADDR=10.227.178.131
NETMASK=255.255.255.128
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no

Version-Release number of selected component (if applicable):
oVirt Engine Version: 3.5.1.1-1.el6
vdsm-4.16.10-8.gitc937927.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install oVirt 3.5.1 and add a host
2. Set the ovirtmgmt interface to not be a VM network
3. Add other VLAN-tagged VM networks
4. Reboot

Actual results:
After running 'vdsm-tool restore-nets' or rebooting, /etc/sysconfig/network-scripts/ifcfg-em1 switches from "ONBOOT=yes" to "ONBOOT=no"


Expected results:
I expect VDSM not to alter the "ONBOOT" setting.


Additional info:
This causes cascading failures for other services when the primary interface is not reactivated after booting. Right now, I have Ansible ensuring it gets set to "yes", but that's less than ideal, especially if we need to reboot twice in succession for some reason.
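The workaround described above can be sketched minimally. This is not the reporter's actual Ansible task, just an illustration of forcing ONBOOT back to "yes" in an ifcfg file (the device name em1 and the file contents come from this report):

```python
import re

def force_onboot_yes(ifcfg_text):
    """Rewrite any ONBOOT= line to ONBOOT=yes, leaving other keys alone."""
    return re.sub(r'^ONBOOT=.*$', 'ONBOOT=yes', ifcfg_text, flags=re.M)

# Example against the ifcfg contents shown in the description.
before = "DEVICE=em1\nONBOOT=no\nBOOTPROTO=none\n"
after = force_onboot_yes(before)
print(after)  # DEVICE=em1 / ONBOOT=yes / BOOTPROTO=none
```

On a real host this would read and rewrite /etc/sysconfig/network-scripts/ifcfg-em1, which is what both the Ansible job and the perl one-liner mentioned later in this report do.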

--- Additional comment from Matt R on 2015-03-18 15:10:43 EDT ---

I should add, this is not a strict duplicate of #1128140 - the notes in that bug indicate it's an issue with self-hosted engines. Our engine is not self-hosted in this instance.

--- Additional comment from Dan Kenigsberg on 2015-03-19 06:55:29 EDT ---

Setting ONBOOT=no was an intentional step on the road to leaving ifcfg files behind and moving to what we call "unified persistence", where ovirt-owned networks are stored under /var/lib/vdsm/persistence/netconf.

Once we merge https://gerrit.ovirt.org/29441/, networking will be started earlier. Would you please share more details on your cascade of failures, so we can understand whether they would be solved by an independent vdsm-network service?

--- Additional comment from Matt R on 2015-03-19 08:59:08 EDT ---

Hi Dan,

The biggest thing in my setup is that each oVirt node is also a gluster storage node. Glusterd starts before VDSMD, and it fails when the network isn't active. This then causes the storage domain to fail, as well. Aside from that, it also causes nslcd and autofs to not load properly, which prevents logging in with a non-admin account once the machine does finish booting.

Is there any reason not to start the vdsm daemon earlier in the boot process (say, at the same point as the 'network' service) until that merge is released? Or would gluster then need to start before vdsmd, creating a sort of catch-22 scenario?

--- Additional comment from Dan Kenigsberg on 2015-03-19 09:12:47 EDT ---

Vdsm requires libvirtd, which requires network. The motivation for the vdsm-network service is to break this vicious circle.
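On systemd-based hosts, the independent vdsm-network service mentioned here could look roughly like the following one-shot unit, which restores networks without depending on libvirtd. The directive values below are an illustrative sketch, not vdsm's actual unit file:

```ini
# Illustrative sketch only: a dedicated network-restoration unit that runs
# after basic networking but before libvirtd and other net-dependent
# services, breaking the vdsmd -> libvirtd -> network circle.
[Unit]
Description=Virtual Desktop Server Manager network restoration (sketch)
After=network.target
Before=libvirtd.service glusterd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/vdsm-tool restore-nets
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

On el6, the equivalent would be a SysV init script placed early in the boot sequence, as Matt suggests in the next comment.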

--- Additional comment from Matt R on 2015-03-19 09:36:09 EDT ---

Is it possible to emulate the new merge by setting up a new init script that only runs "vdsm-tool restore-nets", and placing it earlier in the boot sequence?

--- Additional comment from Dan Kenigsberg on 2015-03-19 12:43:23 EDT ---

I'm so sorry, Matt. Only now do I notice that you are using el6, where the referred patch has no effect at all.

Your idea of extending it to el6 makes sense, but may need to wait a bit more (unless you post the patch).

--- Additional comment from Yaniv Dary on 2015-03-30 08:39:04 EDT ---

any updates on this one?

--- Additional comment from Ido Barkan on 2015-03-30 10:33:44 EDT ---

We have investigated this and concluded that a quick solution cannot be implemented. VDSM currently needs libvirt, and restoring networks leverages VDSM code. Since libvirt is network-dependent, restoring networks can unfortunately take place only long after the network service and libvirt are up.
This is true on both el6 and el7.

In the current unified persistence mode, VDSM prevents the network service from configuring all of its networks, since VDSM would only tear them down again while restoring them (hence the ONBOOT=no).

This leaves us with 2 main choices:

1. Try to drop our dependency on libvirt. This is not a trivial task.
2. Revert to the old ifcfg persistence mode (this would require a downgrade path for the next release).

There are a few more dirty hacks that we might pull, but we need to think more carefully and do some more analysis before we continue.

--- Additional comment from Matt R on 2015-03-30 12:06:10 EDT ---

Thanks for everyone's help.

For the time being, I'm resorting to what might be the dirtiest hack. I've edited /etc/init.d/network to include the line:

# Fix for VDSM bug
/usr/bin/perl -pi -e 's/^ONBOOT.*$/ONBOOT=yes/g' /etc/sysconfig/network-scripts/ifcfg-em1

after it sources the 'functions' file. Additionally, I have our config management (Ansible) fixing the file periodically as well.

Seems like this is a Catch-22 situation.

--- Additional comment from Yaniv Dary on 2015-03-31 04:33:30 EDT ---

I'm moving the fix for this to 3.5.3, since it is quite a complex fix and will not fit the schedule.

--- Additional comment from Sandro Bonazzola on 2015-04-08 05:27:38 EDT ---

Removing from the tracker as per comment #10

--- Additional comment from Dan Kenigsberg on 2015-04-24 05:24:19 EDT ---



--- Additional comment from Dan Kenigsberg on 2015-06-08 06:08:20 EDT ---



--- Additional comment from Dan Kenigsberg on 2015-06-08 06:19:41 EDT ---



--- Additional comment from Yaniv Dary on 2015-06-23 11:59:06 EDT ---

What is the status of this?

--- Additional comment from Ido Barkan on 2015-06-24 03:09:01 EDT ---

Late stages of code review. Should be pushed soon, but will need QA attention.

--- Additional comment from Ido Barkan on 2015-06-29 07:03:56 EDT ---

for QA:

The following verifications must be done in order to release this safely:
Axes:
 1. persistence = "unified"/"ifcfg" (ifcfg is to test for regression).
 2. rhel/rhev-h
 3. scenarios: 
  a. upgrade from 3.4.x to 3.5.4.
  b. upgrade from 3.5.x to 3.5.4.
  c. "selective restoration" (only supported with persistence=unified):
   o. setup networks.
   o. setSafeNetworkConfig
   o. manually change (some or all) networks (such as changing IP, bonding options etc.)
   o. reboot
   o. verify that VDSM restores *only* the networks you have 'sabotaged' two steps before.

* all scenarios must include a complex network setup which involves bonds (including custom bonding options), VLAN devices, and networks with and without bridges
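The "selective restoration" behavior being verified here can be sketched as a simple diff between the running and the persisted configuration. This is an illustration of the idea only, not vdsm's actual `_find_changed_or_missing` implementation:

```python
def find_changed_or_missing(running, persisted):
    """Return the persisted entries that differ from (or are absent in)
    the running configuration; only these need to be restored."""
    return {name: wanted for name, wanted in persisted.items()
            if running.get(name) != wanted}

# A bond whose mode was hand-edited ("sabotaged") from mode=4 to mode=1,
# alongside an untouched network:
running = {
    'bond0': {'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'},
    'rhevm': {'nic': 'eth0', 'bridged': True},
}
persisted = {
    'bond0': {'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=4'},
    'rhevm': {'nic': 'eth0', 'bridged': True},
}
print(sorted(find_changed_or_missing(running, persisted)))  # ['bond0']
```

Only bond0 is returned, matching the expected behavior in the logs below: networks that were not changed since they were persisted are skipped on reboot.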

--- Additional comment from Barak on 2015-07-06 08:10:14 EDT ---

Danken, what was the scenario that failed?

--- Additional comment from Michael Burman on 2015-07-06 09:07:28 EDT ---

Scenarios failed: (RHEL 7.1)
Upgrade 3.5.3 >> 3.5.4 
Upgrade 3.4.5 >> 3.5.4
Both with bonds.

On the first boot after the upgrade, vdsm doesn't see the bond's slaves and flags the bond as changed, along with the network attached to it.
A manual change had been made to the bond via ifcfg-bond0 before the upgrade (from bond mode=4 to mode=1), but on the second and third reboots vdsm still detects a change in bond0, because the bond's slaves weren't up in time. So on every reboot vdsm touches the bond and the network attached to it.

MainThread::INFO::2015-07-02 15:21:12,243::vdsm-restore-net-config::163::root::(_find_changed_or_missing) bond0 is different or missing from persistent configuration. current: {'nics': [], 'options': ''}, persisted: {u'nics': [u'ens1f0', u'ens1f1'], u'options': u'miimon=100 mode=4'}
MainThread::INFO::2015-07-02 15:21:12,243::vdsm-restore-net-config::163::root::(_find_changed_or_missing) net_lb is different or missing from persistent configuration. current: None, persisted: {u'bondingOptions': u'mode=4 miimon=100', u'mtu': '1500', u'bonding': u'bond0', 'bootproto': 'none', 'stp': False, u'bridged': True, 'defaultRoute': False}


- Couldn't test RHEV-H, because the latest rhev-h builds include only vdsm-4.16.20, and the fix was done in vdsm-4.16.21

--- Additional comment from Fabian Deutsch on 2015-07-07 05:46:02 EDT ---

I tested vdsm-4.16.21 on RHEV-H and was still seeing this issue.

--- Additional comment from Ido Barkan on 2015-07-14 01:56:37 EDT ---

Quite a few bugs were found thanks to QE during the integration process:

 1. https://gerrit.ovirt.org/#/c/43507/
 2. https://gerrit.ovirt.org/#/c/43222/
 3. https://gerrit.ovirt.org/#/c/43238/
 4. https://gerrit.ovirt.org/#/c/43512/
 5. https://gerrit.ovirt.org/#/c/43382/

--- Additional comment from Michael Burman on 2015-07-15 01:10:35 EDT ---

Not sure why this bug was pushed to ON_QA.
We still don't have a full fix for this; a new underlying bug was discovered yesterday,
when upgrading RHEL 6.7 from vdsm 3.5.3 >> 3.5.4.

This bug can't be verified at this point; there are more tests that need to be done here.

vdsm still touches the bond and the network attached to it on every boot, although no change has been made.

Ido, feel free to add any comment, thanks.

--- Additional comment from Ido Barkan on 2015-07-19 01:24:29 EDT ---

AFAIC all known problems are solved and merged, a lot of them thanks to QE helping with pre-integration. This can be moved to ON_QA, and if it is blocked on another bug, let it be so.

--- Additional comment from Michael Burman on 2015-07-29 08:29:08 EDT ---

RHEL 6.7 and 7.1 tested and verified on vt16.3 --> vdsm-4.16.23-1.
RHEV-H latest (vdsm-4.16.23-1) 6.7 and 7.1 failed QA. Working with DEV on the fix.

RHEL tests:

1) Basic flow and testing:
Base tests for rhel 6.7 and 7.1
- Clean servers with vdsm-4.16.23-1 installed, network configurations done via Setup Networks.
Testing ifcfg-* files generated by vdsm with ONBOOT=yes.
Restarting the network service doesn't break the network configuration on the server, and neither do reboots.
PASS

2) Red Hat Enterprise Linux Server release 6.7 (Santiago) 
vdsm-4.14.18-7.el6ev.x86_64(3.4.5) >> vdsm-4.16.23-1.el6ev.x86_64(3.5.4)

[root@pink-vds2 yum.repos.d]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;             8000.000000000000       no
gggg            8000.00215e3fdb2e       no              eth1.145
rhevm           8000.00215e3fdb2c       no              eth0
t1              8000.001b21d0bb4a       no              bond0
virbr0          8000.525400dc8e40       yes             virbr0-nic

- upgrade to>> vdsm-4.16.23-1.el6ev.x86_64

[root@pink-vds2 ~]# vi /etc/sysconfig/network-scripts/ifcfg-bond0 
change bond mode to mode=1

- rebooted

- MainThread::INFO::2015-07-21 10:40:21,584::netconfpersistence::86::root::(setBonding) Adding bond0({'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'})
MainThread::INFO::2015-07-21 10:40:21,584::vdsm-restore-net-config::220::root::(_find_changed_or_missing) bond0 is different or missing from persistent configuration. current: {'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'}, persisted: {u'nics': [u'eth2', u'eth3'], u'options': u'miimon=100 mode=4'}
MainThread::INFO::2015-07-21 10:40:21,584::vdsm-restore-net-config::224::root::(_find_changed_or_missing) gggg was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 10:40:21,584::vdsm-restore-net-config::224::root::(_find_changed_or_missing) rhevm was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 10:40:21,584::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t1 was not changed since last time it was persisted, skipping restoration.
MainThread::DEBUG::2015-07-21 10:40:21,585::vdsm-restore-net-config::91::root::(unified_restoration) Calling setupNetworks with networks ({}) and bond ({'bond0': {u'nics': [u'eth2', u'eth3'], u'options': u'mode=4 miimon=100'}}).

* vdsm touched only the change made to the bond mode and restored it to mode=4

- second reboot
- vdsm did not touch anything, as no change was made. The server is up and all network configurations are OK.

PASS

3) Red Hat Enterprise Linux Server release 6.7 (Santiago)
vdsm-4.16.16-1.el6ev.x86_64(3.5.3) >> vdsm-4.16.23-1.el6ev.x86_64(3.5.4)

[root@pink-vds2 yum.repos.d]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;             8000.000000000000       no
rhevm           8000.00215e3fdb2c       no              eth0
t1              8000.001b21d0bb4a       no              bond0
t2              8000.00215e3fdb2e       no              eth1.151
virbr0          8000.525400deea43       yes             virbr0-nic

- vim /etc/sysconfig/network-scripts/ifcfg-bond0
changed bond mode to mode=1

upgrade >> vdsm-4.16.23-1.el6ev.x86_64(3.5.4)

- first reboot after upgrade
MainThread::INFO::2015-07-21 13:35:38,051::netconfpersistence::75::root::(setNetwork) Adding network rhevm({'nic': 'eth0', 'mtu': '1500', 'bootproto': 'dhcp', 'stp': False, 'bridged': True, 'defaultRoute': True})
MainThread::INFO::2015-07-21 13:35:38,053::netconfpersistence::75::root::(setNetwork) Adding network t2({'nic': 'eth1', 'vlan': '151', 'mtu': '1500', 'bootproto': 'none', 'stp': False, 'bridged': True, 'defaultRoute': False})
MainThread::INFO::2015-07-21 13:35:38,053::netconfpersistence::75::root::(setNetwork) Adding network t1({'mtu': '1500', 'bonding': 'bond0', 'bootproto': 'none', 'stp': False, 'bridged': True, 'defaultRoute': False})
MainThread::INFO::2015-07-21 13:35:38,053::netconfpersistence::86::root::(setBonding) Adding bond0({'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'})
MainThread::INFO::2015-07-21 13:35:38,053::vdsm-restore-net-config::220::root::(_find_changed_or_missing) bond0 is different or missing from persistent configuration. current: {'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'}, persisted: {u'nics': [u'eth2', u'eth3'], u'options': u'miimon=100 mode=4'}
MainThread::INFO::2015-07-21 13:35:38,054::vdsm-restore-net-config::224::root::(_find_changed_or_missing) rhevm was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 13:35:38,054::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t2 was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 13:35:38,054::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t1 was not changed since last time it was persisted, skipping restoration.
MainThread::DEBUG::2015-07-21 13:35:38,054::vdsm-restore-net-config::91::root::(unified_restoration) Calling setupNetworks with networks ({}) and bond ({'bond0': {u'nics': [u'eth2', u'eth3'], u'options': u'mode=4 miimon=100'}}).

* vdsm touched only the change made to the bond mode and restored it to mode=4

- second reboot
- vdsm did not touch anything, as no change was made. The server is up and all network configurations are OK.

PASS

4) Red Hat Enterprise Linux Server release 7.1 (Maipo) 
vdsm-4.16.16-1.el7ev.x86_64(3.5.3) >> vdsm-4.16.23-1.el7ev.x86_64(3.5.4)

[root@navy-vds1 ~]# brctl show                                                                                                                                                                                       
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;             8000.000000000000       no
rhevm           8000.00145edd0924       no              enp4s0
t1              8000.001018244afc       no              bond0
t2              8000.00145edd0926       no              enp6s0.151

- vim /etc/sysconfig/network-scripts/ifcfg-bond0
changed bond mode to mode=1

- upgrade >> vdsm-4.16.23-1.el7ev.x86_64

- first reboot after upgrade:
MainThread::INFO::2015-07-21 14:28:16,100::netconfpersistence::75::root::(setNetwork) Adding network rhevm({'nic': 'enp4s0', 'mtu': '1500', 'bootproto': 'dhcp', 'stp': False, 'bridged': True, 'defaultRoute': True})
MainThread::INFO::2015-07-21 14:28:16,101::netconfpersistence::75::root::(setNetwork) Adding network t2({'nic': 'enp6s0', 'vlan': '151', 'mtu': '1500', 'bootproto': 'none', 'stp': False, 'bridged': True, 'defaultRoute': False})
MainThread::INFO::2015-07-21 14:28:16,101::netconfpersistence::75::root::(setNetwork) Adding network t1({'mtu': '1500', 'bonding': 'bond0', 'bootproto': 'none', 'stp': False, 'bridged': True, 'defaultRoute': False})
MainThread::INFO::2015-07-21 14:28:16,101::netconfpersistence::86::root::(setBonding) Adding bond0({'nics': ['ens2f0', 'ens2f1'], 'options': 'miimon=100 mode=1'})
MainThread::INFO::2015-07-21 14:28:16,102::vdsm-restore-net-config::220::root::(_find_changed_or_missing) bond0 is different or missing from persistent configuration. current: {'nics': ['ens2f0', 'ens2f1'], 'options': 'miimon=100 mode=1'}, persisted: {u'nics': [u'ens2f0', u'ens2f1'], u'options': u'miimon=100 mode=4'}
MainThread::INFO::2015-07-21 14:28:16,102::vdsm-restore-net-config::224::root::(_find_changed_or_missing) rhevm was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 14:28:16,102::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t2 was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 14:28:16,102::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t1 was not changed since last time it was persisted, skipping restoration.
MainThread::DEBUG::2015-07-21 14:28:16,102::vdsm-restore-net-config::91::root::(unified_restoration) Calling setupNetworks with networks ({}) and bond ({'bond0': {u'nics': [u'ens2f0', u'ens2f1'], u'options': u'mode=4 miimon=100'}}).

* vdsm touched only the change made to the bond mode and restored it to mode=4

- second reboot
- vdsm did not touch anything, as no change was made. The server is up and all network configurations are OK.

PASS

* All rhev-h scenarios failed.
Comment 2 Eyal Edri 2015-08-02 09:18:44 EDT
*** Bug 1249396 has been marked as a duplicate of this bug. ***
