Bug 1203422
Summary: vdsm should restore networks much earlier, to let net-dependent services start

| Field | Value |
|---|---|
| Product | [Retired] oVirt |
| Component | vdsm |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | urgent |
| Version | 3.5 |
| Target Release | 3.5.4 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | network |
| oVirt Team | Network |
| Fixed In Version | vdsm-4.16.26 |
| Doc Type | Bug Fix |
| Type | Bug |
| Reporter | Matt R <mriedel> |
| Assignee | Ido Barkan <ibarkan> |
| QA Contact | Michael Burman <mburman> |
| CC | audgiri, bazulay, bugs, cshao, danken, ecohen, eedri, fdeutsch, ibarkan, knarra, leiwang, lsurette, mburman, mgoldboi, mriedel, pablo.iranzo, pzhukov, Ravi.Wijesekera, rbalakri, rmcswain, sbonazzo, simon.neininger, sshnaidm, ycui, yeylon, ylavi |
| Clones | 1249396, 1249397 |
| Bug Blocks | 1187461, 1208376, 1215388, 1222139, 1228280, 1229227, 1234293, 1249396, 1249397 |
| Last Closed | 2015-09-03 13:54:32 UTC |
Description
Matt R
2015-03-18 19:06:04 UTC
I should add, this is not a strict duplicate of #1128140 - the notes in that bug indicate it's an issue with self-hosted engines. Our engine is not self-hosted in this instance.

Setting ONBOOT=no was an intentional step on the route of leaving ifcfg files behind and moving to what we call "unified persistence", where oVirt-owned networks are stored under /var/lib/vdsm/persistence/netconf. When we merge https://gerrit.ovirt.org/29441/ networking will be started earlier. Would you please share more details on your cascade of failures, so we can understand whether they would be solved by an independent vdsm-network service?

Hi Dan,
The biggest thing in my setup is that each oVirt node is also a gluster storage node. Glusterd starts before vdsmd, and it fails when the network isn't active. This then causes the storage domain to fail as well. Aside from that, it also causes nslcd and autofs to not load properly, which prevents logging in with a non-admin account once the machine does finish booting. Is there any reason not to start the vdsm daemon earlier in the boot process (say, at the same point as the 'network' service) until that merge is released? Or should gluster be started before vdsmd, causing a sort of catch-22 scenario?

Vdsm requires libvirtd, which requires network. The motivation for the vdsm-network service is to break this vicious circle.

Is it possible to emulate the new merge by setting up a new init script that only does a "vdsm-tool restorenets", and sticking it earlier in the boot sequence?

I'm so sorry, Matt. Only now do I notice that you are using el6, where the referred patch has no effect at all. Your idea of extending it to el6 makes sense, but may need to wait a bit more (unless you post the patch).

Any updates on this one?

We have investigated this and came to the conclusion that a quick solution cannot be implemented. VDSM currently needs libvirt, and restoring networks leverages VDSM code. Since libvirt is network dependent, restoring networks can unfortunately take place only long after the network service and libvirt are up. This is true for both el6 and el7. In the current unified persistence mode, VDSM tries to prevent the network service from configuring all of its networks, only for VDSM to tear them down while restoring them (hence the ONBOOT=no). This leaves us with two main choices:

1. Try to drop our dependency on libvirt. This is not a trivial task.
2. Revert back to the old ifcfg persistence mode (this will require a downgrade path for the next release).

There are a few more dirty hacks that we might pull, but we need to think more carefully and do some more analysis before we continue. Thanks for everyone's help.

For the time being, I'm resorting to what might be the dirtiest hack. I've edited /etc/init.d/network to include the line:

```shell
# Fix for VDSM bug
/usr/bin/perl -pi -e 's/^ONBOOT.*$/ONBOOT=yes/g' /etc/sysconfig/network-scripts/ifcfg-em1
```

after it sources the 'functions' file. Additionally, I have our config management (ansible) fixing the file periodically as well. Seems like this is a Catch-22 situation.

I'm moving the fix for this to 3.5.3, since it is quite a complex fix and will not fit the schedule.

Removing from the tracker as per comment #10

*** Bug 1215011 has been marked as a duplicate of this bug. ***
*** Bug 1226056 has been marked as a duplicate of this bug. ***
*** Bug 1215388 has been marked as a duplicate of this bug. ***

What is the status of this?

Late stages of code review. Should be pushed soon, but will need QA attention.

For QA: the following verifications must be done in order to release this safely. Axes:

1. persistence = "unified"/"ifcfg" - ifcfg is to test for regression.
2. rhel/rhev-h
3. scenarios:
   a. upgrade from 3.4.x to 3.5.4.
   b. upgrade from 3.5.x to 3.5.4.
   c. "selective restoration" (only supported with persistence=unified):
      - setup networks.
      - setSafeNetworkConfig
      - manually change some or all networks (such as changing IP, bonding options, etc.)
      - reboot
      - verify that VDSM restores *only* the networks you have 'sabotaged' two steps before.

All scenarios must include a complex network setup which involves bonds (including custom bonding options), VLAN devices, and with/without bridges.

Danken, what was the scenario that failed?

Scenarios failed (RHEL 7.1):

- Upgrade 3.5.3 >> 3.5.4
- Upgrade 3.4.5 >> 3.5.4

Both with bonds. On the first boot after upgrade, vdsm doesn't see the slaves of the bond and recognizes it as a change, as well as the network attached to the bond. Although the manual change had been made on the bond via ifcfg-bond0 before the upgrade (from bond mode=4 to bond mode=1), on the second and third reboots vdsm still recognizes a change in bond0, because the slaves of the bond weren't up in time. So on every reboot vdsm will touch the bond and the network attached to it.

```
MainThread::INFO::2015-07-02 15:21:12,243::vdsm-restore-net-config::163::root::(_find_changed_or_missing) bond0 is different or missing from persistent configuration. current: {'nics': [], 'options': ''}, persisted: {u'nics': [u'ens1f0', u'ens1f1'], u'options': u'miimon=100 mode=4'}
MainThread::INFO::2015-07-02 15:21:12,243::vdsm-restore-net-config::163::root::(_find_changed_or_missing) net_lb is different or missing from persistent configuration. current: None, persisted: {u'bondingOptions': u'mode=4 miimon=100', u'mtu': '1500', u'bonding': u'bond0', 'bootproto': 'none', 'stp': False, u'bridged': True, 'defaultRoute': False}
```

- Couldn't test RHEV-H, because the latest rhev-h builds include only vdsm-4.16.20, and the fix was done in vdsm-4.16.21.

I tested vdsm-4.16.21 on RHEV-H and was still seeing this issue.

There were quite a few bugs that were found thanks to QE during the integration process:

1. https://gerrit.ovirt.org/#/c/43507/
2. https://gerrit.ovirt.org/#/c/43222/
3. https://gerrit.ovirt.org/#/c/43238/
4. https://gerrit.ovirt.org/#/c/43512/
5. https://gerrit.ovirt.org/#/c/43382/

Not sure why this bug was pushed to ON_QA. We still don't have a full fix for this; a new underlying bug was discovered yesterday, when upgrading rhel 6.7 from vdsm 3.5.3 >> 3.5.4. This bug can't be verified at this point; there are more tests that need to be done here. vdsm is still touching the bond and the network attached to it on every boot, although no change has been made. Ido, feel free to add any comment, thanks.

AFAIC all known problems are solved and merged. A lot of them were found thanks to QE helping on pre-integration. This can be moved to ON_QA, and if it is blocked on another bug, let it be so.

RHEL 6.7 and 7.1 tested and verified on vt16.3 --> vdsm-4.16.23-1. RHEV-H latest (vdsm-4.16.23-1) 6.7 and 7.1 failed QA; working with DEV on the fix.

RHEL tests:

1) Basic flow and testing: base tests for rhel 6.7 and 7.1 - clean servers with vdsm-4.16.23-1 installed, network configuration via Setup Networks. Testing that the ifcfg-* files generated by vdsm have ONBOOT=yes; restarting the network service doesn't break the network configuration on the server, nor do reboots. PASS

2) Red Hat Enterprise Linux Server release 6.7 (Santiago), vdsm-4.14.18-7.el6ev.x86_64 (3.4.5) >> vdsm-4.16.23-1.el6ev.x86_64 (3.5.4)

```
[root@pink-vds2 yum.repos.d]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
gggg            8000.00215e3fdb2e       no              eth1.145
rhevm           8000.00215e3fdb2c       no              eth0
t1              8000.001b21d0bb4a       no              bond0
virbr0          8000.525400dc8e40       yes             virbr0-nic
```

- Upgraded to vdsm-4.16.23-1.el6ev.x86_64.
- Edited /etc/sysconfig/network-scripts/ifcfg-bond0 to change the bond mode to mode=1.
- Rebooted.

```
MainThread::INFO::2015-07-21 10:40:21,584::netconfpersistence::86::root::(setBonding) Adding bond0({'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'})
MainThread::INFO::2015-07-21 10:40:21,584::vdsm-restore-net-config::220::root::(_find_changed_or_missing) bond0 is different or missing from persistent configuration. current: {'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'}, persisted: {u'nics': [u'eth2', u'eth3'], u'options': u'miimon=100 mode=4'}
MainThread::INFO::2015-07-21 10:40:21,584::vdsm-restore-net-config::224::root::(_find_changed_or_missing) gggg was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 10:40:21,584::vdsm-restore-net-config::224::root::(_find_changed_or_missing) rhevm was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 10:40:21,584::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t1 was not changed since last time it was persisted, skipping restoration.
MainThread::DEBUG::2015-07-21 10:40:21,585::vdsm-restore-net-config::91::root::(unified_restoration) Calling setupNetworks with networks ({}) and bond ({'bond0': {u'nics': [u'eth2', u'eth3'], u'options': u'mode=4 miimon=100'}}).
```

vdsm touched only the change made to the bond mode, restoring it to mode=4.

- Second reboot: vdsm is not touching anything, since no change was made. The server is up and all network configuration is OK.
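The selective restoration seen in these logs (each device's running state is compared with the persisted configuration, and only differing or missing entries are replayed through setupNetworks) can be sketched roughly as follows. The function and data names here are illustrative, not VDSM's actual internals:

```python
def find_changed_or_missing(persisted, current):
    """Select the persisted entries that need restoration: anything whose
    running state differs from, or is missing versus, the persisted config."""
    to_restore = {}
    for name, wanted in persisted.items():
        if current.get(name) != wanted:
            to_restore[name] = wanted
    return to_restore

# Mirroring the bond0 log lines above: running mode=1, persisted mode=4.
persisted_bonds = {'bond0': {'nics': ['eth2', 'eth3'],
                             'options': 'miimon=100 mode=4'}}
current_bonds = {'bond0': {'nics': ['eth2', 'eth3'],
                           'options': 'miimon=100 mode=1'}}
print(find_changed_or_missing(persisted_bonds, current_bonds))
# → {'bond0': {'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=4'}}
```

Only bond0 lands in the restoration set, which mirrors the `Calling setupNetworks with networks ({}) and bond ({'bond0': ...})` debug line: unchanged networks are skipped rather than torn down and rebuilt.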
PASS

3) Red Hat Enterprise Linux Server release 6.7 (Santiago), vdsm-4.16.16-1.el6ev.x86_64 (3.5.3) >> vdsm-4.16.23-1.el6ev.x86_64 (3.5.4)

```
[root@pink-vds2 yum.repos.d]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
rhevm           8000.00215e3fdb2c       no              eth0
t1              8000.001b21d0bb4a       no              bond0
t2              8000.00215e3fdb2e       no              eth1.151
virbr0          8000.525400deea43       yes             virbr0-nic
```

- Edited /etc/sysconfig/network-scripts/ifcfg-bond0 to change the bond mode to mode=1.
- Upgraded to vdsm-4.16.23-1.el6ev.x86_64 (3.5.4).
- First reboot after upgrade:

```
MainThread::INFO::2015-07-21 13:35:38,051::netconfpersistence::75::root::(setNetwork) Adding network rhevm({'nic': 'eth0', 'mtu': '1500', 'bootproto': 'dhcp', 'stp': False, 'bridged': True, 'defaultRoute': True})
MainThread::INFO::2015-07-21 13:35:38,053::netconfpersistence::75::root::(setNetwork) Adding network t2({'nic': 'eth1', 'vlan': '151', 'mtu': '1500', 'bootproto': 'none', 'stp': False, 'bridged': True, 'defaultRoute': False})
MainThread::INFO::2015-07-21 13:35:38,053::netconfpersistence::75::root::(setNetwork) Adding network t1({'mtu': '1500', 'bonding': 'bond0', 'bootproto': 'none', 'stp': False, 'bridged': True, 'defaultRoute': False})
MainThread::INFO::2015-07-21 13:35:38,053::netconfpersistence::86::root::(setBonding) Adding bond0({'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'})
MainThread::INFO::2015-07-21 13:35:38,053::vdsm-restore-net-config::220::root::(_find_changed_or_missing) bond0 is different or missing from persistent configuration. current: {'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=1'}, persisted: {u'nics': [u'eth2', u'eth3'], u'options': u'miimon=100 mode=4'}
MainThread::INFO::2015-07-21 13:35:38,054::vdsm-restore-net-config::224::root::(_find_changed_or_missing) rhevm was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 13:35:38,054::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t2 was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 13:35:38,054::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t1 was not changed since last time it was persisted, skipping restoration.
MainThread::DEBUG::2015-07-21 13:35:38,054::vdsm-restore-net-config::91::root::(unified_restoration) Calling setupNetworks with networks ({}) and bond ({'bond0': {u'nics': [u'eth2', u'eth3'], u'options': u'mode=4 miimon=100'}}).
```

vdsm touched only the change made to the bond mode, restoring it to mode=4.

- Second reboot: vdsm is not touching anything, since no change was made. The server is up and all network configurations are OK.

PASS

4) Red Hat Enterprise Linux Server release 7.1 (Maipo), vdsm-4.16.16-1.el7ev.x86_64 (3.5.3) >> vdsm-4.16.23-1.el7ev.x86_64 (3.5.4)

```
[root@navy-vds1 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
rhevm           8000.00145edd0924       no              enp4s0
t1              8000.001018244afc       no              bond0
t2              8000.00145edd0926       no              enp6s0.151
```

- Edited /etc/sysconfig/network-scripts/ifcfg-bond0 to change the bond mode to mode=1.
- Upgraded to vdsm-4.16.23-1.el7ev.x86_64.
- First reboot after upgrade:

```
MainThread::INFO::2015-07-21 14:28:16,100::netconfpersistence::75::root::(setNetwork) Adding network rhevm({'nic': 'enp4s0', 'mtu': '1500', 'bootproto': 'dhcp', 'stp': False, 'bridged': True, 'defaultRoute': True})
MainThread::INFO::2015-07-21 14:28:16,101::netconfpersistence::75::root::(setNetwork) Adding network t2({'nic': 'enp6s0', 'vlan': '151', 'mtu': '1500', 'bootproto': 'none', 'stp': False, 'bridged': True, 'defaultRoute': False})
MainThread::INFO::2015-07-21 14:28:16,101::netconfpersistence::75::root::(setNetwork) Adding network t1({'mtu': '1500', 'bonding': 'bond0', 'bootproto': 'none', 'stp': False, 'bridged': True, 'defaultRoute': False})
MainThread::INFO::2015-07-21 14:28:16,101::netconfpersistence::86::root::(setBonding) Adding bond0({'nics': ['ens2f0', 'ens2f1'], 'options': 'miimon=100 mode=1'})
MainThread::INFO::2015-07-21 14:28:16,102::vdsm-restore-net-config::220::root::(_find_changed_or_missing) bond0 is different or missing from persistent configuration. current: {'nics': ['ens2f0', 'ens2f1'], 'options': 'miimon=100 mode=1'}, persisted: {u'nics': [u'ens2f0', u'ens2f1'], u'options': u'miimon=100 mode=4'}
MainThread::INFO::2015-07-21 14:28:16,102::vdsm-restore-net-config::224::root::(_find_changed_or_missing) rhevm was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 14:28:16,102::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t2 was not changed since last time it was persisted, skipping restoration.
MainThread::INFO::2015-07-21 14:28:16,102::vdsm-restore-net-config::224::root::(_find_changed_or_missing) t1 was not changed since last time it was persisted, skipping restoration.
MainThread::DEBUG::2015-07-21 14:28:16,102::vdsm-restore-net-config::91::root::(unified_restoration) Calling setupNetworks with networks ({}) and bond ({'bond0': {u'nics': [u'ens2f0', u'ens2f1'], u'options': u'mode=4 miimon=100'}}).
```

vdsm touched only the change made to the bond mode, restoring it to mode=4.

- Second reboot: vdsm is not touching anything, since no change was made. The server is up and all network configurations are OK.

PASS

* All rhev-h scenarios failed.

Moved to ON_QA with release vdsm-4.16.24-2.

A clarification regarding the behaviour of rhev-h hypervisors which upgrade to 3.5.4: since this version, rhev-h will persist all vdsm-owned ifcfg files into the file system upon a call to setSafeNetworkConfig (the "save network configuration" option in the engine). This is done in order to support the ONBOOT=yes option in all of VDSM's ifcfg files, which is also configured by VDSM since this version.
If setSafeNetworkConfig is not called and those ifcfg files are not persisted, then on every subsequent boot the host will restore all networks. Although this results in a correct network configuration, it may take more time to boot. For this call to actually happen, users will have to make some network change (configuring the host networks), since the engine will refuse to do anything if there is no change in the network configuration on the host. This will be fixed in a proper way upstream, and is also planned to be fixed in 3.5.5.

*** Bug 1207377 has been marked as a duplicate of this bug. ***

Moved to MODIFIED along with the respin of the 3.5.4 builds:
- vdsm-4.16.25-1.el7ev
- vdsm-4.16.25-1.el6ev

Failed QA; upgrade scenario tests failed.

- rhev-h 6.6 3.5.3 >> rhev-h 6.7 3.5.4 - vdsmd and libvirtd are not running after upgrade and reboot
- rhev-h 7.1 3.5.3 >> rhev-h 7.1 3.5.4 - vdsm touching networks it shouldn't touch on boot after upgrade

Tested on rhevm-3.5.4.2-1.3.el6ev.noarch with vdsm-4.16.26-1.el6ev.x86_64, 6.7 (20150823.0.el6ev) and 7.1 (20150823.0.el7ev).

Verified and tested with:
- rhev-hypervisor6-6.7-20150826.0.el6ev
- rhev-hypervisor7-7.1-20150826.0.el7ev
- ovirt-node-3.2.3-20
- vdsm-4.16.26-1
- on 3.5.4.2-1.3.el6ev

RHEV-H 3.5.4 scenarios:

1) rhev-h 6.6 - 20150512.0.el6ev >> rhev-hypervisor6-6.7-20150826.0.el6ev (vdsm-4.16.13.1-1.el6ev.x86_64 >> vdsm-4.16.26-1.el6ev)

```
[root@pink-vds2 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
net_lb          8000.00215e3fdb2e       no              eth3.163
queue           8000.00215e3fdb2e       no              eth3.166
rhevm           8000.00215e3fdb2c       no              eth2
t1              8000.001b21d0bb4a       no              bond0

[root@navy-vds1-vlan162 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
gggg            8000.00145edd0926       no              eth3.145
rhevm           8000.00145edd0924       no              eth2.162
t1              8000.001018244afc       no              bond0
```

- Run upgrade and reboot - PASS
- Second and third reboot without changes - PASS
- Change network config via Setup Networks and fourth reboot - PASS

```
[root@localhost ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
net_lb          8000.001b21d0bb4b       no              eth1.163
queue           8000.001b21d0bb4b       no              eth1.166
rhevm           8000.00215e3fdb2c       no              eth2
t1              8000.001b21d0bb4a       no              bond0

[root@localhost ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
gggg            8000.001018244afc       no              eth0.145
rhevm           8000.00145edd0924       no              eth2.162
t1              8000.001018244afc       no              bond0
```

PASS

2) rhev-h 7.1 - 20150512.1.el7ev >> rhev-hypervisor7-7.1-20150826.0.el7ev (vdsm-4.16.13.1-1.el7ev.x86_64 >> vdsm-4.16.26-1.el7ev)

```
[root@orchid-vds2 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
rhevm           8000.001a647a9462       no              enp4s0
t1              8000.0015173dcdce       no              bond0
traffic         8000.001a647a9464       no              enp6s0.164
vmfex           8000.001a647a9464       no              enp6s0.160
```

- Run upgrade and reboot - PASS
- Second reboot without changes - PASS
- Change network configuration via Setup Networks and third reboot - PASS

```
[root@localhost ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
rhevm           8000.001a647a9462       no              enp4s0
t1              8000.0015173dcdce       no              bond0
traffic         8000.0015173dcdcf       no              ens1f1.164
vmfex           8000.0015173dcdcf       no              ens1f1.160
```

PASS

3) rhev-h 6.6 - 20150123.1.el6ev >> rhev-hypervisor6-6.7-20150826.0.el6ev (3.5.4) (vdsm-4.14.18-6.el6ev.x86_64 >> vdsm-4.16.26-1.el6ev.x86_64)

```
[root@pink-vds2 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
gggg            8000.00215e3fdb2e       no              eth3.145
rhevm           8000.00215e3fdb2c       no              eth2
t1              8000.001b21d0bb4a       no              bond0
```

- Run upgrade and reboot - PASS
- Second reboot without changes - PASS
- Change network configuration via Setup Networks and third reboot - PASS

```
[root@localhost ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
gggg            8000.001b21d0bb4b       no              eth1.145
rhevm           8000.00215e3fdb2c       no              eth2
t1              8000.001b21d0bb4a       no              bond0
```

PASS

4) Clean rhev-h 7.1 - 20150826.0.el7ev installed in rhev-m (vdsm-4.16.26-1.el7ev)
- First reboot - PASS
- Network configuration via Setup Networks, then second reboot; all networks are up - PASS

5) Clean rhev-h 6.7 - 20150826.0.el6ev installed in rhev-m (vdsm-4.16.26-1.el6ev)
- First reboot - PASS
- Network configuration via Setup Networks, then second reboot; all networks are up - PASS

This is an automated message. oVirt 3.5.4 has been released on September 3rd 2015 and should include the fix for this BZ. Moving to closed current release.

*** Bug 1260892 has been marked as a duplicate of this bug. ***
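For reference, the "unified persistence" store that restoration replays from lives under /var/lib/vdsm/persistence/netconf, as discussed early in this bug. A minimal sketch of loading such a store follows; the nets/ and bonds/ subdirectory layout and the one-JSON-file-per-entry format are assumptions for illustration, not a description of VDSM's actual on-disk format:

```python
import json
import os
import tempfile

def load_netconf(base):
    """Load a persisted network-configuration store: one JSON file per
    entry, grouped under 'nets' and 'bonds' subdirectories (layout assumed)."""
    conf = {'nets': {}, 'bonds': {}}
    for kind in conf:
        path = os.path.join(base, kind)
        if not os.path.isdir(path):
            continue  # tolerate a store that has no bonds (or no nets) yet
        for name in sorted(os.listdir(path)):
            with open(os.path.join(path, name)) as f:
                conf[kind][name] = json.load(f)
    return conf

# Demo against a throwaway store shaped like the persisted bond0 above.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'bonds'))
with open(os.path.join(base, 'bonds', 'bond0'), 'w') as f:
    json.dump({'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=4'}, f)

print(load_netconf(base))
# → {'nets': {}, 'bonds': {'bond0': {'nics': ['eth2', 'eth3'], 'options': 'miimon=100 mode=4'}}}
```

Comparing a structure like this against the running state is what lets restoration replay only the entries that were "sabotaged", as the verification scenarios above exercise.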