Bug 1154399
Summary: VDSM script reset network configuration on every reboot when based on predefined bond

Product: [Retired] oVirt
Component: vdsm
Version: 3.5
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Whiteboard: network
Keywords: Reopened
Reporter: Aleksandr <aleksandr.bembel>
Assignee: Petr Horáček <phoracek>
QA Contact: Michael Burman <mburman>
CC: aleksandr.bembel, asegurap, bazulay, bugs, danken, dougsland, ecohen, enrico.tagliavini, fdeutsch, gklein, iheim, ldelouw, lsurette, lvernia, mario.langmann, mburman, mgoldboi, mkalinin, nicolas, pdwyer, pzhukov, rbalakri, rbarry, sbonazzo, shihliu, sraje, troels, ycui, yeylon, ylavi
Flags: lvernia: ovirt_requires_release_note?
Target Milestone: ---
Target Release: 3.5.2
Fixed In Version: vdsm-4.16.13-1.el6ev
Doc Type: Bug Fix
Story Points: ---
Clones: 1194553, 1213842 (view as bug list)
Last Closed: 2015-04-29 06:18:59 UTC
Type: Bug
Regression: ---
oVirt Team: Network
Cloudforms Team: ---
Bug Depends On: 1209486
Bug Blocks: 1186161, 1193058, 1194553, 1213842
Description  Aleksandr  2014-10-19 12:57:37 UTC
I found the problem and how to solve it manually. I created the bond interface by hand before installing oVirt, and created the ovirtmgmt interface manually too. After I install oVirt on the node, the VDSM scripts create folders with the network configuration in /var/lib/vdsm/persistence/netconf/. There is a "nets" folder with the configuration of the ovirtmgmt interface, but there is no "bonds" folder for the bonding configs. After creating this folder with the configuration for the bond0 interface, everything starts to work and reboots normally.

Toni, could you take a look? I thought http://gerrit.ovirt.org/32769 should have fixed that.

@Aleksandr: Does the /etc/sysconfig/network-scripts/ifcfg-bond0 you manually create have any of these headers:

- '# Generated by VDSM version'
- '# automatically generated by vdsm'

If that is the case, they will be removed at every boot. If that is not the case, are you calling 'persist /etc/sysconfig/network-scripts/ifcfg-bond0' on the command line after creating them?

vdsm only persists the networks and bonds it creates, and since ifcfg-bond0 is created by you, it assumes (wrongly or not) it will be there on boot. There are three ways to go about this:

- Creating the bond with vdsClient like so:
  vdsClient -s 0 setupNetworks bondings='{bond11:{nics:p1p3+p1p4}}'
  # Then create the network over it (which will persist the bond too in
  # /var/lib/vdsm/persistence/netconf/bonds)
- Using the node persistence directly:
  persist /etc/sysconfig/network-scripts/ifcfg-bond0
- Code: somehow detect that a device configuration we depend on is not persisted, and do as in the upgrade script to unified persistence.

(In reply to Antoni Segura Puimedon from comment #3)
> @Aleksandr: Does the /etc/sysconfig/network-scripts/ifcfg-bond0 you manually
> create have any of these headers:
>
> - '# Generated by VDSM version'
> - '# automatically generated by vdsm'
>
> If that is the case, they will be removed at every boot.
> If that is not the case, are you calling 'persist /etc/sysconfig/network-scripts/ifcfg-bond0'
> on the command line after creating them?
>
> vdsm only persists the networks and bonds it creates and since ifcfg-bond0
> is created by you, it assumes (wrongly or not) it will be there on boot.
> There are three ways to go about this:
>
> - creating the bond with vdsClient like so:
>   vdsClient -s 0 setupNetworks bondings='{bond11:{nics:p1p3+p1p4}}'
>   # Then create the network over it (which will persist the bond too in
>   # /var/lib/vdsm/persistence/netconf/bonds)
> - Using the node persistence directly:
>   persist /etc/sysconfig/network-scripts/ifcfg-bond0
> - Code: Somehow detect that device configuration we depend on is not
>   persisted and do like in the upgrade script to unified persistence.

/etc/sysconfig/network-scripts/ifcfg-bond0 doesn't have such a header. I created it manually before installing oVirt on this node.

And when you run

  persist /etc/sysconfig/network-scripts/ifcfg-bond0

does the problem go away? If so, it's not a bug. Manual creation requires manual persistence.

(In reply to Dan Kenigsberg from comment #5)
> does the problem go away?

Please reopen if this is not the case.

We have heard more reports about our failure to revive bonds that were created outside Vdsm but are required by its networks. Since this is a common use case, particularly for hosted engine, it may require extreme measures such as consuming these bonds and making them ours.

Hello,

After upgrading from 3.4.1 to 3.5.1 on CentOS 6.6 (1 manager and 3 hosts), all my manually configured interfaces, with bonded interfaces, bridges and VLANs, went OK and survived host reboots on 2 hosts, but not on the third one. All 3 hosts are exact clones; they were installed by kickstart+cfengine, so no manual error is possible. Anyway, on this third host I'm facing the exact issue described here as well as in the related BZs (#1134346, #1188251).
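The ownership rule quoted above boils down to checking the first line of an ifcfg file for one of the two VDSM-generated headers. A minimal Python sketch of that check, as a hypothetical helper (not vdsm's actual code):

```python
# Sketch: decide whether an ifcfg file is "owned" by VDSM, based on the two
# header comments discussed in this thread. Hypothetical helper, not vdsm code.
VDSM_HEADERS = (
    '# Generated by VDSM version',
    '# automatically generated by vdsm',
)

def is_vdsm_owned(ifcfg_text):
    """Return True if the file's first line carries a VDSM-generated header."""
    if not ifcfg_text.strip():
        return False
    first_line = ifcfg_text.lstrip().splitlines()[0]
    return any(first_line.startswith(h) for h in VDSM_HEADERS)
```

Files for which this returns False are exactly the ones vdsm assumes the administrator will restore, which is why they need `persist` (on node) or ifcfg persistence to survive a reboot.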
I cannot use the proposed workaround because these CentOS 6.6 hosts have no "persist" command (does this command only exist on oVirt _nodes_, not complete hosts?). I'd be OK to completely wipe my manual network config and do it again via the web GUI, but will it allow me to add the VLAN tag? If:
- I cannot redefine the interface settings+VLAN using the oVirt GUI, and
- I cannot persist my manual interface settings because this command does not exist,
is there another way? Can I manually persist the needed ifcfg files? Where should these persistence files reside, and how?

Hello, I solved this problem by editing the file /etc/vdsm/vdsm.conf: add the line "net_persistence = ifcfg", which now uses the manually created network configuration. The persist command is, as far as I know, only for the hypervisor images. Correct me if I am wrong.

(In reply to Nicolas Ecarnot from comment #8)
> All 3 hosts are absolutely clones, they were installed by
> kickstart+cfengine, so no manual error possible.

This is extremely interesting. Can you think of anything different on the faulty host? Do you have `chkconfig network on` on all three? Can you find "timeout waiting for ipv4 addresses" in your /var/log/messages on the first post-upgrade boot?

> I can not use the proposed workaround because these CentOS 6.6 have no
> "persist" command (does this command only exist on oVirt _nodes_, not
> complete hosts?).

You are perfectly right. "persist" is node-only. Reverting to net_persistence=ifcfg is the important part of the workaround.

(In reply to Dan Kenigsberg from comment #10)
> This is extremely interesting. Can you think of anything different on the
> faulty host?

I honestly don't recall anything I may have done differently on this third host. When things began to act differently, I obviously began to take additional actions, like ifup, ifdown, double-checking things, reboots. But these are clearly consequences, not causes.

> Do you have `chkconfig network on` on all three?

Yes.
> Can you find "timeout waiting for ipv4 addresses" in your /var/log/messages
> on the first post-upgrade boot?

No. A grep (and additional lookups) on all old /var/log/messages* did not return anything.

> Reverting to net_persistence=ifcfg is the important part of the workaround.

_This_ is puzzling me! I solved my issue by using what was advised in a related BZ, i.e.:

# vdsClient -s 0 setupNetworks bondings='{bond0:{nics:em1+em2},bond1:{nics:em3+em4}}'
# vdsClient -s 0 setSafeNetworkConfig

But I confirm that I have no 'net_persistence' setting in my vdsm.conf.

Just to add something: I have 3 datacenters, and the one I upgraded is in semi-production; I can afford some downtime. It only contains 3 hosts, so network corrections are bearable. I'm hesitating to upgrade my two other datacenters, which contain 10 hosts each. I can test some of your suggestions on the small DC. Just saying that today, all 3 hosts are working well, and did well when they cleanly followed the path maintenance > manual reboot > wait for both networks to come up (mgmt+iscsi) > activation.

Re-targeting to 3.5.3 since this bug has not been marked as a blocker for 3.5.2 and we have already released the 3.5.2 Release Candidate.

Nicolas, I understand your worry regarding an upgrade of a production system before we understand the nature of this bug. Do you have ifcfg files with no '# Generated by VDSM' header on your production hosts? Making sure that all /etc/sysconfig/network-scripts/ifcfg-bond* files are "owned" by Vdsm by reconfiguring them via the command line is a good precaution. I'm happy that it has worked for you. Please note that your command line sets the bonds with default options (mode=4), which may not be what you want.
The following syntax is used to set further options:

# vdsClient -s 0 setupNetworks bondings='{bond0:{nics:em1+em2,options:mode=3 miimon=222}}'
# vdsClient -s 0 setSafeNetworkConfig

Setting net_persistence=ifcfg in vdsm.conf before the upgrade is a more "brutal" workaround. It means that vdsm does not attempt to move network definitions from ifcfg to /var/lib/vdsm/persistence/network.

Hello, I was trying to reproduce this issue with the following steps, and the issue did not reproduce:

3.5.1-0.1.el6ev
vdsm-4.16.12-2.el6ev.x86_64
Red Hat Enterprise Linux Server release 6.6 (Santiago)

1. Created bond0 manually:

cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth0
HWADDR=00:14:5e:dd:09:24
MASTER=bond0
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no

cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth1
HWADDR=00:14:5e:dd:09:26
MASTER=bond0
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no

cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=bond0
BONDING_OPTS='mode=active-backup miimon=150'
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=dhcp

2. /etc/init.d/network restart
3. Installed the host in my vt14 setup; the rhevm bridge was created on bond0 and got an IP.
4. Rebooted the host.
5. The host went to non-operational state until it finished rebooting, then went up and got an IP.

- Couldn't reproduce this way.
- I will try to reproduce another way: upgrading from 3.4 > 3.5.1.
- If someone has/knows the steps to reproduce this issue, please let me know.

Best regards,

(In reply to Michael Burman from comment #14)
> I was trying to reproduce this issue with the next steps and issue didn't
> reproduced:

Thank you, Michael, for taking some time on this bug. I suppose that what's different between your setup and mine is that I have additional things I did not see in your description:
- usage of VLANs on some bonds (i.e. "bond0.6"...)
- specifying additional params (i.e. MTU=9000). Does it matter?
- presence of oVirt bridges

Dan, OK for your explanations. When coping with bugs, I always try to head towards the future, and I'd like to avoid wasting too much time on behaviour that won't last. I haven't yet completely read and understood what network persistence implies, but if _this_ is the future, let us all test the setups that will be used in the future. If this whole issue is known to be corrected in 3.5.2, and as my present setup is OK, I'm not sure it's worth looking further. Just to answer your questions:
- some of my files contain, and some others don't contain, the '# generated by vdsm' header, but I don't take it as evidence, as I may have copied them from another setup some time long ago
- every ifcfg-* file is owned by root only. Do I have to chown 36:36 them all, like what is done on NFS shares?
- cat /proc/net/bonding/bond* tells me that mode active-backup is used. That does not match your statement about mode 4 being set by the vdsClient command.

Sounds like I have further reading to do now, because it seems I'm using a mix of two concurrent technologies, and I don't feel good about it.

Nicolas, based on your account, two of three identical hosts do not reproduce the bug; only the third host expressed it. This means we have a flaky, raceful issue here, which may be unrelated to vlans/mtu/bridges. We are not yet sure when the bug will be fixed, as we do not understand it properly.

By ifcfg "own" I mean the existence of a '# generated by vdsm' header, not the filesystem-owning user (which is always root). It is paramount to understand what is the case on your production servers. Do they have non-vdsm-owned ifcfg-bond* files?

If Vdsm does not receive any bond option, it sets 'mode=802.3ad miimon=150' as its default. However, if the bond device already exists, it keeps its former options. Specifying the options explicitly is the safe way to ensure the final config.
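The option-defaulting rule Dan describes (explicit options win; an existing bond keeps its options; otherwise the 'mode=802.3ad miimon=150' default applies) can be sketched as follows. This is a hypothetical illustration of the stated rule, not vdsm's actual code:

```python
# Sketch of the bond-option precedence described above.
# Hypothetical helpers, not vdsm code.
DEFAULT_BOND_OPTS = 'mode=802.3ad miimon=150'

def effective_bond_opts(requested_opts, existing_opts=None):
    """Both arguments are option strings like 'mode=1 miimon=150' (or empty)."""
    if requested_opts:
        return requested_opts        # explicit options always win
    if existing_opts:
        return existing_opts         # a pre-existing bond keeps its options
    return DEFAULT_BOND_OPTS         # fresh bond, nothing requested

def parse_bond_opts(opts):
    """Split 'mode=802.3ad miimon=150' into {'mode': '802.3ad', 'miimon': '150'}."""
    return dict(pair.split('=', 1) for pair in opts.split())
```

This explains Nicolas's observation: his pre-existing active-backup bonds kept their mode even though the vdsClient default is mode 4 (802.3ad).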
Hello all,

I made another attempt to reproduce this issue, this time upgrading vdsm from 3.4 > 3.5.1. I didn't manage to reproduce the issue this time either. My steps:

1) Started with a clean rhel6.6 host with vdsm-4.14.18-7.el6ev.x86_64
2) Manually created bond0:

vi /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
BONDING_OPTS='mode=active-backup miimon=150'
NM_CONTROLLED=no
BOOTPROTO=dhcp

vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

vi /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

3) Restarted the network service and installed the rhel6.6 host in my rhevm-3.5.1-0.1.el6ev.noarch engine with success.
4) The rhevm bridge was created on top of bond0.
5) Set the host to maintenance mode and ran 'yum update'. The host was updated to vdsm-4.16.12-2.el6ev.x86_64.
6) Activated the host in a 3.5 cluster with no issue; the host is up and has connectivity.
7) Rebooted the host. The host went to non-operational state until it finished booting, then went up again. No connectivity issues. The rhevm bridge is on top of bond0 (created in step 2) and has an IP.

(In reply to Michael Burman from comment #17)

Michael, I wrote previously:
> I suppose that what's different between your setup and mine is that I have
> additional things I did not see in your description:
> - usage of VLAN on some bonds (i.e. "bond0.6"...)

Would it be worth trying to add this point to your tests? Does it matter?

Another attempt to reproduce, this time with a VLAN on top of the bond, upgraded from 3.4 > 3.5.1. I managed to reproduce the issue this time, after rebooting.
1) Started with a clean rhel6.6 host with vdsm-4.14.18-7.el6ev.x86_64
2) Manually created bond0.162:

vi /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
BONDING_OPTS='mode=active-backup miimon=150'
NM_CONTROLLED=no
BOOTPROTO=none

vi /etc/sysconfig/network-scripts/ifcfg-bond0.162
DEVICE=bond0.162
VLAN=yes
BOOTPROTO=static
IPADDR=10.35.129.14
NETMASK=255.255.255.0
GATEWAY=10.35.129.254
NM_CONTROLLED=no
ONBOOT=yes

vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

vi /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

3) Restarted the network service and installed the rhel6.6 host in my rhevm-3.5.1-0.1.el6ev.noarch engine with success.
4) The rhevm bridge was created on top of bond0.162.
5) Set the host to maintenance mode and ran 'yum update'. The host was updated to vdsm-4.16.12-2.el6ev.x86_64.
6) Activated the host in a 3.5 cluster; the host is up and has connectivity.
7) Rebooted the host. The host went to non-operational state; bond0.162 disappeared.

Before reboot:

ls /var/lib/vdsm/persistence/netconf/bonds
bond0  bond0.162

After reboot: only bond0.

So this bug reproduces with a VLAN on top of a bond created manually: netconf persistence does not keep the VLAN on top of the bond. If I manually run 'ifup bond0.162', I get an IP, but if I restart the network service, I lose the IP. So I think it's clearer now. Best regards.

(In reply to Michael Burman from comment #19)
> Another try to reproduce, this time with vlan on top of the bond.
> upgraded from 3.4>3.5.1
>
> Managed to reproduce this issue this time after rebooting.

Michael, you're the man!
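The before/after listings of /var/lib/vdsm/persistence/netconf/bonds above can be compared mechanically. A hypothetical helper for spotting devices that vanished across a reboot (on a real host the two lists would come from os.listdir() snapshots of that directory):

```python
# Sketch: diff the persisted-device listings taken before and after a reboot,
# as in the netconf/bonds example above. Hypothetical helper, not vdsm code.
def missing_after_reboot(before, after):
    """Return the persisted devices that vanished across the reboot, sorted."""
    return sorted(set(before) - set(after))
```

On Michael's host this would flag exactly the VLAN device that the restore step dropped.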
Thank you for confirming this; it is a relief for me :)

Dan, these are the ifcfg files after rebooting:

cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=bond0
BONDING_OPTS='mode=active-backup miimon=150'
BRIDGE=rhevm
ONBOOT=no
MTU=1500
NM_CONTROLLED=no
HOTPLUG=no

cat /etc/sysconfig/network-scripts/ifcfg-bond0.162
DEVICE=bond0.162
VLAN=yes
BOOTPROTO=static
IPADDR=10.35.129.14
NETMASK=255.255.255.0
GATEWAY=10.35.129.254
NM_CONTROLLED=no
ONBOOT=yes

cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth0
HWADDR=00:14:5e:dd:09:24
MASTER=bond0
SLAVE=yes
ONBOOT=no
MTU=1500
NM_CONTROLLED=no

cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth1
HWADDR=00:14:5e:dd:09:26
MASTER=bond0
SLAVE=yes
ONBOOT=no
MTU=1500
NM_CONTROLLED=no

- As you can see, in the ifcfg-bond0.162 file there is no "# Generated by VDSM version 4.16.12-2.el6ev" line.
- I'm going to give it another run, this time without upgrading from 3.4 > 3.5.1. I believe the upgrade process has nothing to do with this issue.

So this issue is not related to the upgrade process: vdsm is not generating the VLAN on top of a bond created manually. We can see it before rebooting the host:

- VLAN on top of a bond created manually (same steps as in comment 21), and host installed in RHEV-M.
Host installed successfully; the rhevm bridge was created on top of bond0.162.

Before reboot:

[root@navy-vds1 netconf]# ls nets
rhevm
[root@navy-vds1 netconf]# ls bonds
bond0
[root@navy-vds1 netconf]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=bond0
BONDING_OPTS='mode=active-backup miimon=150'
BRIDGE=rhevm
ONBOOT=yes
BOOTPROTO=none
MTU=1500
NM_CONTROLLED=no
HOTPLUG=no
[root@navy-vds1 netconf]# cat /etc/sysconfig/network-scripts/ifcfg-bond0.162
DEVICE=bond0.162
VLAN=yes
BOOTPROTO=static
IPADDR=10.35.129.14
NETMASK=255.255.255.0
GATEWAY=10.35.129.254
NM_CONTROLLED=no
ONBOOT=yes
[root@navy-vds1 netconf]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth0
HWADDR=00:14:5e:dd:09:24
MASTER=bond0
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no
[root@navy-vds1 netconf]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Generated by VDSM version 4.16.12-2.el6ev
DEVICE=eth1
HWADDR=00:14:5e:dd:09:26
MASTER=bond0
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no

ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:10:18:24:4a:fc brd ff:ff:ff:ff:ff:ff
    inet6 fe80::210:18ff:fe24:4afc/64 scope link
       valid_lft forever preferred_lft forever
3: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:10:18:24:4a:fd brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
5: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
6: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 52:54:00:3e:29:c1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
7: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 500
    link/ether 52:54:00:3e:29:c1 brd ff:ff:ff:ff:ff:ff
9: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
10: bond0.162@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
    inet 10.35.129.14/24 brd 10.35.129.255 scope global bond0.162
    inet6 fe80::214:5eff:fedd:924/64 scope link
       valid_lft forever preferred_lft forever
12: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 9a:cc:8d:94:6c:92 brd ff:ff:ff:ff:ff:ff
13: rhevm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:14:5e:dd:09:24 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::214:5eff:fedd:924/64 scope link
       valid_lft forever preferred_lft forever

After reboot: the host went to non-responsive state and doesn't recover; it lost connectivity. Only an 'ifup bond0.162' command will bring the host up. So we have a reproduction here, with clear steps. Best regards

I am afraid this is a different bug. I see that Engine attempts to create a rhevm network with no vlan, no IP address, and no DHCP:

Thread-25::DEBUG::2015-03-10 09:31:24,745::__init__::469::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'rhevm': {'bonding': 'bond0', 'STP': 'no', 'bridged': 'true', 'mtu': '1500'}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}

Nothing good can come out of that. It's an Engine bug. I suspect that this is because in Engine, rhevm is defined with no vlan id. Please open a separate bug on that.

The case that the users discuss is a bit different. They had a working vdsm of ovirt-3.4, connected to the Engine with vlan networks on top of a predefined bond.
Then they upgrade vdsm and reboot the host. No new setupNetworks command is issued from Engine to Vdsm in that process. Thus, I need more of your time to help reproduce the users' issue.

Another attempt to reproduce, this time without any success. Followed these steps:

1) clean rhel6.6 host with vdsm-4.14.18-7.el6ev.x86_64
2) created bond0 manually from eth2 and eth3
3) added the host to RHEV-M with success; 'rhevm' attached to eth0
4) created a VM VLAN-tagged network and attached it to bond0 via SN
5) set the host to maintenance and updated vdsm to 3.5.1 (vdsm 4.16.12) with the 'yum update' command
6) activated the host in a 3.5 cluster. bond0 doesn't break; the VLAN network is attached to bond0
7) rebooted the host. bond0 doesn't break; the VLAN is attached to bond0

ls /var/lib/vdsm/persistence/netconf/nets/
rhevm  test_net (vlan network)
ls /var/lib/vdsm/persistence/netconf/bonds/
bond0

- No reproduction.

Reproduced:

1) installed and configured rhel66
2) install VDSM dependencies

$ yum install -y http://download.devel.redhat.com/brewroot/packages/augeas/1.0.0/8.el6/x86_64/augeas-1.0.0-8.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/augeas/1.0.0/8.el6/x86_64/augeas-devel-1.0.0-8.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/augeas/1.0.0/8.el6/x86_64/augeas-libs-1.0.0-8.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/augeas/1.0.0/8.el6/x86_64/augeas-debuginfo-1.0.0-8.el6.x86_64.rpm
$ yum install -y http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/fence-sanlock-2.8-1.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-2.8-1.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-devel-2.8-1.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-lib-2.8-1.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-python-2.8-1.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/sanlock/2.8/1.el6/x86_64/sanlock-debuginfo-2.8-1.el6.x86_64.rpm
$ yum install -y http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-0.10.2-49.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-client-0.10.2-49.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-devel-0.10.2-49.el6.x86_64.rpm \
    http://download.eng.bos.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-lock-sanlock-0.10.2-49.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-python-0.10.2-49.el6.x86_64.rpm \
    http://download.devel.redhat.com/brewroot/packages/libvirt/0.10.2/49.el6/x86_64/libvirt-debuginfo-0.10.2-49.el6.x86_64.rpm

2) $ yum install -y http://resources.ovirt.org/pub/yum-repo/ovirt-release34.rpm
3) $ yum install -y vdsm vdsm-cli vdsm-xmlrpc vdsm-jsonrpc vdsm-debug-plugin
4) configure and start vdsm (if it fails, try again; there is one ugly bug)

$ vdsm-tool configure --force
$ service vdsmd start

4) create a non-vdsm vlaned bond over a dummy

$ ip link add dummy_10 type dummy
$ echo "DEVICE=dummy_9
MASTER=bond10
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no" > /etc/sysconfig/network-scripts/ifcfg-dummy_10
$ echo "DEVICE=bond10
BONDING_OPTS='mode=802.3ad miimon=150'
ONBOOT=yes
BOOTPROTO=none
DEFROUTE=yes
NM_CONTROLLED=no
HOTPLUG=no" > /etc/sysconfig/network-scripts/ifcfg-bond10
$ echo "DEVICE=bond10.180
VLAN=yes
ONBOOT=yes
BOOTPROTO=static
NM_CONTROLLED=no
HOTPLUG=no" > /etc/sysconfig/network-scripts/ifcfg-bond10.180
$ service network restart

5) set up a bridged vdsm network over the existing vlaned bond

$ vdsClient -s 0 setupNetworks "networks={test-network:{bonding:bond10,vlan:180,bridged:true}}"

6) current state of ifcfg files
(the vlan is tagged, but the nic and bond are still non-vdsm):

[root@rhel6 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond10.180
# Generated by VDSM version 4.14.17-0.el6
DEVICE=bond10.180
ONBOOT=yes
VLAN=yes
BRIDGE=test-network
NM_CONTROLLED=no
HOTPLUG=no
[root@rhel6 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond10
DEVICE=bond10
BONDING_OPTS='mode=802.3ad miimon=150'
ONBOOT=yes
BOOTPROTO=none
DEFROUTE=yes
NM_CONTROLLED=no
HOTPLUG=no
[root@rhel6 ~]# cat /etc/sysconfig/network-scripts/ifcfg-dummy_10
DEVICE=dummy_10
MASTER=bond10
SLAVE=yes
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no

7) upgrade vdsm

$ service vdsmd stop
$ yum install -y http://resources.ovirt.org/pub/yum-repo/ovirt-release35.rpm
$ yum update -y

8) enable network, disable vdsmd on startup, and reboot

$ chkconfig vdsmd off
$ shutdown -r now

9) we lost the dummy, so add it again

$ ip link add dummy_10 type dummy
$ service network restart

10) start the vdsm service manually:

[root@rhel6 ~]# service vdsmd start
Please enter your authentication name:
Please enter your password:
Can't connect to default. Skipping.
Stopping ksmtuned: [ OK ]
Starting multipathd daemon: [ OK ]
Starting ntpd: [ OK ]
initctl: Job is already running: libvirtd
Starting iscsid: [ OK ]
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running configure_vdsm_logs
vdsm: Running wait_for_network
vdsm: Running run_init_hooks
vdsm: Running upgraded_version_check
vdsm: Running check_is_configured
libvirt is already configured for vdsm
vdsm: Running validate_configuration
SUCCESS: ssl configured to true.
No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
vdsm: Running unified_network_persistence_upgrade
vdsm: Running restore_nets
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm-restore-net-config", line 137, in <module>
    restore()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 123, in restore
    unified_restoration()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 66, in unified_restoration
    persistentConfig.bonds)
  File "/usr/share/vdsm/vdsm-restore-net-config", line 91, in _filter_nets_bonds
    bonds[bond]['nics'], net)
KeyError: u'bond10'
vdsm: stopped during execute restore_nets task (task returned with error code 1).
vdsm start

11) the start failed; the ifcfg files of the bridge, vlan, bond and dummy were removed from network-scripts

Followed Petr's steps, only using my host's interfaces to create the bond. I reached step 6), before the vdsm upgrade, and our ifcfg files are a bit different:

cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Generated by VDSM version 4.14.18-7.el6ev
DEVICE=bond0
ONBOOT=yes
BONDING_OPTS='mode=802.3ad miimon=150'
BRIDGE=test
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
STP=no
HOTPLUG=no

cat /etc/sysconfig/network-scripts/ifcfg-bond0.162
DEVICE=bond0.162
VLAN=yes
BOOTPROTO=static
NM_CONTROLLED=no
ONBOOT=yes

cat /etc/sysconfig/network-scripts/ifcfg-eth2
# Generated by VDSM version 4.14.18-7.el6ev
DEVICE=eth2
ONBOOT=yes
HWADDR=00:10:18:24:4A:FC
MASTER=bond0
SLAVE=yes
MTU=1500
NM_CONTROLLED=no
STP=no

cat /etc/sysconfig/network-scripts/ifcfg-eth3
# Generated by VDSM version 4.14.18-7.el6ev
DEVICE=eth3
ONBOOT=yes
HWADDR=00:10:18:24:4A:FD
MASTER=bond0
SLAVE=yes
MTU=1500
NM_CONTROLLED=no
STP=no

- As you can see, my bond0 is generated by VDSM and bond0.162 is not; in Petr's steps it is the opposite.
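The traceback above shows the shape of the failure: the restore script walks the persisted networks and looks up each network's bond in the persisted-bonds dict, and a bond that exists on the host but was never persisted (bond10 here) is missing from that dict, so the plain lookup raises KeyError. A hypothetical simplification of the unguarded lookup and a guarded alternative (not vdsm's actual _filter_nets_bonds code):

```python
# Sketch of the KeyError in the traceback above. Hypothetical simplification,
# not vdsm code. nets/bonds mirror the persisted netconf dictionaries.

def filter_nets_unguarded(nets, bonds):
    """Mimics the failing lookup: assumes every net's bond was persisted."""
    for net, attrs in nets.items():
        bond = attrs.get('bonding')
        if bond:
            bonds[bond]['nics']  # raises KeyError if the bond was not persisted

def filter_nets_guarded(nets, bonds):
    """Skip (and report) networks whose underlying bond is not persisted."""
    missing = []
    for net, attrs in nets.items():
        bond = attrs.get('bonding')
        if bond and bond not in bonds:
            missing.append(net)
    return sorted(missing)
```

With the manually created bond10 absent from /var/lib/vdsm/persistence/netconf/bonds, the unguarded form is exactly why restore_nets dies and the boot-time cleanup then removes the ifcfg files.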
So I stopped at this stage before continuing to the update step.

Sorry, I have a typo in step 4: it should be '$ echo "DEVICE=dummy_10', not '$ echo "DEVICE=dummy_9'.

Just updating: I was with Burman when he tried to reproduce. It seemed like vdsm "owned" the bond and VLAN device as soon as we added the VLAN network via a Setup Networks command from the engine (the host had pre-existing bond and VLAN devices created manually), and the bug wasn't reproduced.

This bug annoys multitudes of users. Despite the problems with reproduction in lab conditions, I'd like to include the two suggested patches in the nearest release.

Today, I had to semi-manually recreate the many files of close to 40 hosts in two DCs, and add the persistence = ifcfg setting on all of them. I've been witnessing this issue and fighting it for two weeks, so at least, please don't close this bug by pretending one cannot reproduce it in the lab. Dan is right, and is politely saying it: it is annoying many of us.

(3.5.1) Want to test this with rhel and rhev-h as well. Dan, do we have a rhev-h build that includes this fix in vdsm?

Verified and tested successfully with the new build:
rhevm-3.5.1-0.3.el6ev.noarch
vdsm-4.16.13-1.el6ev.x86_64
rhel 6.6
vdsm upgrade from 4.14 >> 4.16.13-1

Followed these steps:
1) bond0 and bond0.162 created manually outside RHEV-M (network restart)
2) installed the server in the RHEV-M setup successfully
3) attached a VM VLAN-tagged network (162) to the host via SN
4) copied the relevant repos (vt14.12) to the server and ran 'yum update'; vdsm upgraded successfully, but due to BZ 1200467 the vdsmd service needed to be restarted manually. The operation was successful and no network configuration is broken. (refreshed capabilities)
5) rebooted the server with success. All network configuration was saved and didn't break, including the manually created bond0 and bond0.162.

I would like to perform the same test with a rhev-H 6.6 build that includes this fix before moving this bug to verified. Originally, I managed to reproduce this issue only with rhev-H.
Since Dan is on PTO, Fabian should be able to tell us if a build has gone out that included the tracked patches?...

Verified and tested successfully with the new rhev-hypervisor6-6.6-20150402.0.el6ev.noarch.rpm, which includes vdsm-4.16.13-1.el6ev.x86_64. vdsm upgrade from vdsm-4.14.18-6.el6ev >> vdsm-4.16.13-1.el6ev.x86_64.

Followed these steps:
1) bond0 and bond0.162 created manually outside RHEV-M, via the TUI
2) installed the server in the RHEV-M setup (vt14.2) successfully
3) attached a VM VLAN-tagged network (162) to the host via SN
4) downloaded and installed rhev-hypervisor6-6.6-20150402.0.el6ev.noarch.rpm in the engine
5) put the host into maintenance and ran 'upgrade' via RHEV-M
6) the host rebooted successfully. All network configuration was saved and didn't break, including the manually (TUI) configured bond0 and bond0.162.

Should have been verified on 3.5.2; my mistake.

Also, it looks like the fix for this bug created another, related issue: bond0 and bond0.162 persist and nothing breaks, but none of the networks or NICs have a BOOTPROTO= line in their ifcfg files. Moving back to Assigned instead of creating a new BZ.

*** Bug 1203226 has been marked as a duplicate of this bug. ***

*** Bug 1170869 has been marked as a duplicate of this bug. ***

If you're going to perform any further tests (or if you're going to write automated tests for these cases -- you should!), then please also test pre-defined bridges on top of regular NICs (as described in Bug 1203226).

To my understanding it should be okay for an upgrade 3.4.* --> 3.5.1, but it needs release notes for 3.5.0 --> 3.5.1. Dan, could you supply documentation on what goes wrong during a 3.5.0 --> 3.5.1 upgrade, and how to work around it once it breaks?

Sorry, I meant 3.4.* --> 3.5.2 and 3.5.* --> 3.5.2.
Verified on:
- 3.5.1-0.4.el6ev
- rhev-h 6.6 3.4.z >> rhev-h 6.6 3.5.1, using the following builds: rhev-h 6.6 3.4.z 20150123.1.el6ev >> rhev-h 6.6 3.5.1 20150421.0.el6ev

1) clean rhev-h 6.6 3.4.z 20150123.1.el6ev installed via USB
2) bond0.162 configured via TUI with dhcp
3) installed the server in RHEV-M; the rhevm network was created on top of bond0
4) attached a network to another NIC via SN
* The host is up after upgrade and reboot, all networks are attached to the server, rhevm got an IP, the host is up in RHEV-M and can be activated in a 3.5 cluster.

- clean rhev-h 6.6 3.5.1 20150421.0.el6ev

1) clean rhev-h 6.6 3.5.1 20150421.0.el6ev installed via USB
2) bond0.162 configured via TUI with dhcp
3) installed the server in RHEV-M; the rhevm network was created on top of bond0
4) attached a network to another NIC via SN
* The host is up after reboot, all networks are attached to the server, rhevm got an IP, the host is up in RHEV-M.

- rhev-h 7.1 3.5.1 20150420.0.el7ev

1) clean rhev-h 7.1 3.5.1 20150420.0.el7ev installed via USB
2) bond0.162 configured via TUI with dhcp
3) installed the server in RHEV-M; the rhevm network was created on top of bond0
4) attached a network to another NIC via SN
* The host is up after reboot, all networks are attached to the server, rhevm got an IP, the host is up in RHEV-M.

ovirt 3.5.2 was GA'd. Closing current release.

(In reply to Lior Vernia from comment #42)
> To my understanding should be okay for upgrade 3.4.* --> 3.5.1, but needs
> release notes for 3.5.0 --> 3.5.1.
>
> Dan, could you supply documentation what goes wrong during a 3.5.0 --> 3.5.1
> upgrade, and how to work around it once it breaks?

3.4.* --> 3.5.1 is the recommended upgrade path, as 3.5.0 had many problems related to upgrade. I am not aware of any issue regarding an upgrade from a working 3.5.0 to rhev-3.5.1. However, if one did upgrade to 3.5.0 and did break one's persisted network config, upgrading to rhev-3.5.1 would not help restore it; one would have to define a local IP and reconfigure networking on the host via Engine.