Bug 1194553

Summary: VDSM script resets network configuration on every reboot when based on a predefined bond
Product: Red Hat Enterprise Virtualization Manager
Reporter: Pavel Zhukov <pzhukov>
Component: vdsm
Assignee: Petr Horáček <phoracek>
Status: CLOSED ERRATA
QA Contact: Michael Burman <mburman>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.5.0
CC: aleksandr.bembel, asegurap, audgiri, bazulay, bugs, danken, eedri, fdeutsch, gwatson, iheim, ldelouw, lpeer, lsurette, mburman, meverett, mgoldboi, mkalinin, myakove, phoracek, pmukhedk, pzhukov, rbalakri, rhodain, rmcswain, troels, ycui, yeylon, ykaul, ylavi
Target Milestone: ovirt-3.6.0-rc
Keywords: Reopened, ZStream
Target Release: 3.6.0
Flags: ylavi: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: v4.16.12.1
Doc Type: Bug Fix
Doc Text:
ifcfg-bond* devices defined outside of VDSM (manually or via the RHEV-H TUI) are removed during the upgrade of VDSM to 3.5.0. There is currently no complete fix; the simplest workaround is to re-define the bond devices via VDSM, for example:
  vdsClient -s 0 setupNetworks bondings='{bond11:{nics:p1p3,p1p4}}'
  vdsClient -s 0 setSafeNetworkConfig
Alternatively, bond devices can be re-defined via the engine. After the upgrade, the created bond11 is persisted in VDSM's /var/lib/vdsm/persistence/netconf/bonds/bond11 and is available on reboot.
Story Points: ---
Clone Of: 1154399
: 1205711 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:31:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1154399, 1213842    
Bug Blocks: 1205711    

Description Pavel Zhukov 2015-02-20 07:37:36 UTC
The same problem exists in downstream vdsm.

+++ This bug was initially created as a clone of Bug #1154399 +++

Description of problem:
I configured an oVirt node with an ovirtmgmt interface. It shows as UP in ovirt-engine, but when I reboot the node it comes back up and a few seconds later I lose connectivity. Connecting to the node via IPMI shows that /etc/sysconfig/network-scripts/ no longer contains ifcfg-bond0.X or ifcfg-ovirtmgmt.

The vdsm daemon also did not start:
 service vdsmd start
libvirtd start/running, process 6113
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running run_init_hooks
vdsm: Running check_is_configured
libvirt is already configured for vdsm
vdsm: Running validate_configuration
SUCCESS: ssl configured to true. No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
vdsm: Running unified_network_persistence_upgrade
vdsm: Running restore_nets
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm-restore-net-config", line 137, in <module>
    restore()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 123, in restore
    unified_restoration()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 66, in unified_restoration
    persistentConfig.bonds)
  File "/usr/share/vdsm/vdsm-restore-net-config", line 91, in _filter_nets_bonds
    bonds[bond]['nics'], net)
KeyError: u'bond0'
vdsm: stopped during execute restore_nets task (task returned with error code 1).
vdsm start                                                 [FAILED]

It starts only if I manually restore the network configuration, delete /var/lib/vdsm/persistence/netconf and create the nets_restored file in /var/lib/vdsm.
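
The traceback suggests the restore script only handles bonds that are present in vdsm's persisted configuration: it indexes persistentConfig.bonds directly, so a bond created outside vdsm (and therefore never persisted) raises KeyError. A minimal sketch of that filtering step, assuming nets and bonds are plain dicts (names are illustrative, not the actual vdsm source):

# Illustrative sketch only -- not the actual vdsm code.
def filter_nets_bonds(nets, bonds):
    """Keep only nets whose underlying bond exists in the persisted bonds."""
    kept = {}
    for name, attrs in nets.items():
        bond = attrs.get('bonding')
        if bond is not None and bond not in bonds:
            # vdsm instead accessed bonds[bond]['nics'] unconditionally,
            # which is what produced "KeyError: u'bond0'" above.
            continue
        kept[name] = attrs
    return kept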

Version-Release number of selected component (if applicable):
rpm -qa | grep vdsm
vdsm-python-4.16.7-1.gitdb83943.el6.noarch
vdsm-jsonrpc-4.16.7-1.gitdb83943.el6.noarch
vdsm-python-zombiereaper-4.16.7-1.gitdb83943.el6.noarch
vdsm-xmlrpc-4.16.7-1.gitdb83943.el6.noarch
vdsm-yajsonrpc-4.16.7-1.gitdb83943.el6.noarch
vdsm-4.16.7-1.gitdb83943.el6.x86_64
vdsm-cli-4.16.7-1.gitdb83943.el6.noarch

cat /etc/redhat-release
CentOS release 6.5 (Final)


Steps to Reproduce:
1. configure network manually in /etc/sysconfig/network-scripts/
2. add node to ovirt-engine
3. reboot node
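
For illustration only, a manually configured bond of this kind (step 1) might look like the following ifcfg files; the device and NIC names here are hypothetical, not taken from the affected host:

# /etc/sysconfig/network-scripts/ifcfg-bond0 (hypothetical example)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=4 miimon=100"
BRIDGE=ovirtmgmt

# /etc/sysconfig/network-scripts/ifcfg-em1 (one of the slaves)
DEVICE=em1
ONBOOT=yes
MASTER=bond0
SLAVE=yes

Note that neither file carries a "# Generated by VDSM" header, since vdsm did not create them.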

Actual results:
Connectivity to the node is lost; the vdsm scripts reset the network configuration.

Expected results:
normal reboot

Additional info:

--- Additional comment from Aleksandr on 2014-10-20 02:12:17 EDT ---

I found the problem and how to solve it manually. 

I created the bond interface manually before installing oVirt, and created the ovirtmgmt interface manually too. After installing oVirt on the node, the VDSM scripts create folders with the network configuration in /var/lib/vdsm/persistence/netconf/; there is a "nets" folder with the configuration of the ovirtmgmt interface, but there is no "bonds" folder for the bonding configs.

After creating this folder with the configuration for the bond0 interface, everything starts to work and the node reboots normally.
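
For reference, each persisted bond is a small JSON file under /var/lib/vdsm/persistence/netconf/bonds/, roughly like the following (the exact key set depends on the vdsm version, so treat this layout as an assumption):

# /var/lib/vdsm/persistence/netconf/bonds/bond0 (assumed layout)
{"nics": ["em1", "em2"], "options": "mode=4 miimon=100"}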

--- Additional comment from Dan Kenigsberg on 2014-10-25 20:55:59 EDT ---

Toni, could you take a look? I thought http://gerrit.ovirt.org/32769 should have fixed that.

--- Additional comment from Antoni Segura Puimedon on 2014-10-27 03:52:13 EDT ---

@Aleksandr: Does the /etc/sysconfig/network-scripts/ifcfg-bond0 you manually create have any of these headers:

- '# Generated by VDSM version'
- '# automatically generated by vdsm'

If that is the case, they will be removed at every boot. If that is not the case, are you calling 'persist /etc/sysconfig/network-scripts/ifcfg-bond0' in the command line after creating them?

vdsm only persists the networks and bonds it creates and since ifcfg-bond0 is created by you, it assumes (wrongly or not) it will be there on boot. There are three ways to go about this:

- creating the bond with vdsClient like so:
  vdsClient -s 0 setupNetworks bondings='{bond11:{nics:p1p3+p1p4}}'
  # Then create the network over it (which will persist the bond too in /var/lib/vdsm/persistence/netconf/bonds
- Using the node persistence directly:
  persist /etc/sysconfig/network-scripts/ifcfg-bond0
- Code: Somehow detect that device configuration we depend on is not persisted and do like in the upgrade script to unified persistence.
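
The third option is the least concrete. A rough sketch of what such a detection step could look like, assuming the persisted nets/bonds are directories of JSON files as described above (paths and function names are illustrative, not an actual vdsm patch):

# Illustrative sketch only -- not vdsm code.
import json
import os

NETCONF = '/var/lib/vdsm/persistence/netconf'

def missing_bonds():
    """Return bonds that persisted nets depend on but that were never persisted."""
    nets_dir = os.path.join(NETCONF, 'nets')
    bonds_dir = os.path.join(NETCONF, 'bonds')
    persisted = set(os.listdir(bonds_dir)) if os.path.isdir(bonds_dir) else set()
    needed = set()
    for name in os.listdir(nets_dir):
        with open(os.path.join(nets_dir, name)) as f:
            net = json.load(f)
        bond = net.get('bonding')
        if bond and bond not in persisted:
            needed.add(bond)
    return needed

Any bond reported by such a check could then be re-persisted the same way the upgrade-to-unified-persistence script does.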

--- Additional comment from Aleksandr on 2014-10-27 04:11:43 EDT ---

(In reply to Antoni Segura Puimedon from comment #3)
> @Aleksandr: Does the /etc/sysconfig/network-scripts/ifcfg-bond0 you manually
> create have any of these headers:
> 
> - '# Generated by VDSM version'
> - '# automatically generated by vdsm'
> 
> If that is the case, they will be removed at every boot. If that is not the
> case, are you calling 'persist /etc/sysconfig/network-scripts/ifcfg-bond0'
> in the command line after creating them?
> 
> vdsm only persists the networks and bonds it creates and since ifcfg-bond0
> is created by you, it assumes (wrongly or not) it will be there on boot.
> There are three ways to go about this:
> 
> - creating the bond with vdsClient like so:
>   vdsClient -s 0 setupNetworks bondings='{bond11:{nics:p1p3+p1p4}}'
>   # Then create the network over it (which will persist the bond too in
> /var/lib/vdsm/persistence/netconf/bonds
> - Using the node persistence directly:
>   persist /etc/sysconfig/network-scripts/ifcfg-bond0
> - Code: Somehow detect that device configuration we depend on is not
> persisted and do like in the upgrade script to unified persistence.

/etc/sysconfig/network-scripts/ifcfg-bond0 doesn't have such header. I create it manually before installing oVirt on this node.

--- Additional comment from Dan Kenigsberg on 2014-10-29 10:08:58 EDT ---

And when you run

 persist /etc/sysconfig/network-scripts/ifcfg-bond0

does the problem go away? If so - it's not a bug. Manual creation requires manual persistence.

--- Additional comment from Dan Kenigsberg on 2014-11-17 11:25:48 EST ---

(In reply to Dan Kenigsberg from comment #5)
> does the problem go away?

Please reopen if this is not the case.

--- Additional comment from Dan Kenigsberg on 2015-02-07 12:20:56 EST ---

We have heard more reports about our failure to revive bonds that were created outside Vdsm but are required by its networks.

Since this is a common use case, particularly for hosted engine, it may require extreme measures such as consuming these bonds and making them ours.

Comment 1 Prasad Mukhedkar 2015-02-23 08:02:25 UTC
*** Bug 1194267 has been marked as a duplicate of this bug. ***

Comment 2 Dan Kenigsberg 2015-02-25 17:09:45 UTC
Pavel, what's `chkconfig | grep network` on the affected hosts?
I heard a report that `chkconfig network on` makes the problem go away. Could you verify that?

Comment 3 Pavel Zhukov 2015-02-26 10:03:29 UTC
(In reply to Dan Kenigsberg from comment #2)
> Pavel, what's `chkconfig | grep network` on the affected hosts?
> I heard a report that `chkconfig network on` makes the problem go away.
> Could you verify that?

Seems like it's active already.
network         0:off 1:off 2:on  3:on  4:on  5:on  6:off

For RHEL hosts we can work around this by using ifcfg as the persistence store. In that case vdsm doesn't complain about manually created bond devices.
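
Presumably this means selecting the ifcfg persistence backend in vdsm's configuration; a minimal sketch, assuming the backend is chosen via the net_persistence option in /etc/vdsm/vdsm.conf:

# /etc/vdsm/vdsm.conf -- assumed setting for the ifcfg persistence backend
[vars]
net_persistence = ifcfg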

Comment 4 Petr Horáček 2015-02-27 09:01:46 UTC
@Pavel could you grant me access to your machine? I would like to check logs, versions, etc. Then it would be great if we could try to set the machine up to its pre-upgrade state and upgrade it again with some changes or extra logging.

Comment 14 Dan Kenigsberg 2015-03-19 16:27:33 UTC
Sorry, Roman - I did not refresh my browser and did not see your recent update when I moved this bug to MODIFIED. I do not understand your report. This bug is about ifcfg-bond* files not existing upon upgrade. Is this the case with your reproduction? Can you attach the post-boot supervdsm.log?

There may well be more issues regarding network upgrade on the node; I'm not sure that what you describe is the problem I'm trying to solve.

Comment 21 Dan Kenigsberg 2015-03-23 11:08:13 UTC
I suspect that the other problems we see are related to the fact that ovirt-node restarts networking while vdsm is starting up and performing the network config upgrade.

Hence this bug can go back to ON_QA.

Comment 23 Michael Burman 2015-04-06 08:22:37 UTC
Hi Yaniv,

Please help sort out the Fixed In Version and Target Release for this BZ, because we definitely have some mess here.

Thank you,

Comment 24 Yaniv Lavi 2015-04-11 22:18:04 UTC
Please provide info needed in comment #23.

Comment 25 Petr Horáček 2015-04-13 10:14:52 UTC
The backported patch hasn't passed QA on bug 1205711: https://bugzilla.redhat.com/show_bug.cgi?id=1205711#c10

Comment 26 Robert McSwain 2015-04-20 13:51:38 UTC
Any updates on this and the QA process since last week?

Comment 27 Fabian Deutsch 2015-04-22 09:52:31 UTC
Yes, please see the clone of this bug for the updates.

Petr, can this bug get moved to MODIFIED as well?

Comment 28 Pavel Zhukov 2015-04-22 10:23:38 UTC
I hope the request for info from me was made by mistake...

Comment 29 Petr Horáček 2015-04-22 10:52:39 UTC
Relevant patches are merged.

Comment 32 Michael Burman 2015-05-31 07:17:55 UTC
How is this BZ ON_QA?

Do we have a RHEV-H 3.6.0?

Comment 33 Yaniv Lavi 2015-05-31 12:14:50 UTC
(In reply to Michael Burman from comment #32)
> How is this BZ ON_QA?
> 
> Do we have a RHEV-H 3.6.0?

This affects RHEL as well; a RHEV-H build should arrive soon.

Comment 34 Meni Yakove 2015-07-01 07:19:04 UTC
Can we verify this on RHEL or do we need to wait for 3.6 RHEV-H?
If we need to wait for RHEV-H, please remove ON_QA from the bug.

Comment 35 Yaniv Lavi 2015-07-06 09:19:51 UTC
(In reply to Meni Yakove from comment #34)
> Can we verify this on RHEL or do we need to wait for 3.6 RHEV-H?
> If we need to wait for RHEV-H, please remove ON_QA from the bug.

You need to VERIFY on both.

Comment 36 Michael Burman 2015-11-08 09:02:03 UTC
Verified on 3.6.0.3-0.1.el6 and:
- Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151104.0.el7ev)
- vdsm-4.17.10.1-0.el7ev.noarch
- ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch

Comment 41 errata-xmlrpc 2016-03-09 19:31:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html