Bug 979081
Summary: | RHEVH network configuration is not persisted when addNetwork fails | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Martin Pavlik <mpavlik> | ||||||||||
Component: | vdsm | Assignee: | psebek | ||||||||||
Status: | CLOSED DUPLICATE | QA Contact: | Martin Pavlik <mpavlik> | ||||||||||
Severity: | high | Docs Contact: | |||||||||||
Priority: | unspecified | ||||||||||||
Version: | 3.3.0 | CC: | abaron, acathrow, alonbl, asegurap, bazulay, cpelland, cshao, danken, dougsland, eedri, fdeutsch, gklein, glenn.crawford, gouyang, hadong, hateya, huiwa, iheim, jkt, leiwang, lpeer, lsong, mburns, mpavlik, myakove, psebek, Rhev-m-bugs, yaniwang, ycui, yeylon | ||||||||||
Target Milestone: | --- | Keywords: | Regression, Triaged | ||||||||||
Target Release: | 3.3.0 | ||||||||||||
Hardware: | x86_64 | ||||||||||||
OS: | Linux | ||||||||||||
Whiteboard: | network | ||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | |||||||||||||
Clones: | 999123 | Environment: | |||||||||||
Last Closed: | 2013-09-03 07:22:20 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | 979567 | ||||||||||||
Bug Blocks: | 979570, 979572, 999123 | ||||||||||||
Dan, delNetwork/addNetwork fails. It looks like it tries to perform DHCP, fails, and then skips the persist? But afterwards, once the bridge has settled, it does get an IP address... Can you please take a look? Thanks, Alon

Created attachment 766205
vdsm-reg.log
Hi Martin,

I would like to see the DHCP failure reason. Could you please edit the initscripts ifup to run #!/bin/bash -xv for more detailed output, and modify ifup in /usr/share/vdsm/netconf/ifcfg.py so that the raise logs all of the output and not just the message? If you need help doing so, contact me.

Hi Antoni,

Maybe we need to persist before bringing up the interface? Or... properly roll back on failure: destroy the bridge, etc. I think the issue is caused by DHCP, but it may suggest that something should be done differently when an error is detected. What do you think?

Hi Alon,

The flow is:

1. addNetwork: this is not persisted, nor does it roll back; rollback is only done when using the more modern API endpoint, setupNetworks.
2. setSafeNetworkConfig: this deletes the backup files in /var/lib/vdsm/netconfback/*, implying that what the operation left in /etc/sysconfig/network-scripts/ is the persistent configuration.

So the option I'd go for is to always use setupNetworks, since it has a rollback feature that would prevent the deployer from losing access to the host. When an error is detected, we should notify rhev-m so that it can give relevant info to the admin, who will hopefully be able to come up with a different config for the still-accessible host.

(In reply to Antoni Segura Puimedon from comment #5)
> Hi Alon,

Hi!

Maybe we need to fix host-deploy; this was not changed since vdsm-bootstrap... OH! It is vdsm-reg... bad for us. Anyway...

>
> The flow is:
>
> 1. addNetwork: This is not persisted, nor does it rollback, rollback is only
> done when using the more modern API endpoint, setupNetworks.

Yes, I use the addNetwork script to add the network.

> 2. setSafeNetworkConfig: This deletes the backup files in
> /var/lib/vdsm/netconfback/* implying that what the operation left in
> /etc/sysconfig/network-scripts/ is the persistent conf.

No, I use vdsm-store-net-config.

But both vdsm-reg and ovirt-host-deploy do nothing when addNetwork fails... maybe there is another script I need to run?
> So the option I'd go for is to always use setupNetworks since it has a
> rollback feature that would prevent the deployer from losing access to the
> host. When an error is detected, we should notify rhev-m so that it can give
> relevant info to
> the admin who will hopefully be able to come up with a different config for
> the still accessible host.

Right, this is where we are going, but we still need to fix bugs for older versions. Even in 3.3, older ovirt-node will not be using setupNetworks, as we did not implement the deletion of brXXXX bridges in the engine... so we fall back to the legacy method.

(In reply to Alon Bar-Lev from comment #6)
> (In reply to Antoni Segura Puimedon from comment #5)
> > Hi Alon,
>
> Hi!
>
> Maybe we need to fix host-deploy, this was not changed since
> vdsm-bootstrap...
>
> OH! it is vdsm-reg... bad for us.
>
> Anyway...
>
> >
> > The flow is:
> >
> > 1. addNetwork: This is not persisted, nor does it rollback, rollback is only
> > done when using the more modern API endpoint, setupNetworks.
>
> Yes, I use: addNetwork script to add the network.
>
> > 2. setSafeNetworkConfig: This deletes the backup files in
> > /var/lib/vdsm/netconfback/* implying that what the operation left in
> > /etc/sysconfig/network-scripts/ is the persistent conf.
>
> No, I use: vdsm-store-net-config

Well, setSafeNetworkConfig calls vdsm-store-net-config anyway ;-)

>
> But both vdsm-reg and ovirt-host-deploy do nothing when addNetwork fails...
> maybe there is other script I need to run?

When it fails, the script to call to get a rollback is vdsm-restore-net-config. However... for the node we are kind of out of luck, since that script just aborts when it detects that it is running on a node (this is because the script is intended to be executed at boot time, and the node doesn't need such actions; it rolls back by its own nature).
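The node check that makes vdsm-restore-net-config abort can be sketched as a standalone Python function. This is a minimal illustration, not vdsm code: the function name `looks_like_ovirt_node` and its `root` parameter are hypothetical (the parameter exists only so the check can be exercised against a test directory); the two marker paths are the ones used by the node's copy of the script quoted in this bug.

```python
import glob
import os


def looks_like_ovirt_node(root="/"):
    """Return True if the RHEV-H / ovirt-node release markers exist under root.

    Hypothetical helper mirroring the check in the pre-fix
    vdsm-restore-net-config: /etc/rhev-hypervisor-release, or any
    /etc/ovirt-node-*-release file.
    """
    etc = os.path.join(root, "etc")
    if os.path.exists(os.path.join(etc, "rhev-hypervisor-release")):
        return True
    # glob returns a (possibly empty) list of matching release files.
    return len(glob.glob(os.path.join(etc, "ovirt-node-*-release"))) > 0


if __name__ == "__main__":
    # Expected to print True on RHEV-H, which is why the old script returned
    # early without calling restorePersistentBackup().
    print(looks_like_ovirt_node())
```

Because the script exits on this check, a failed addNetwork on a node leaves nothing to undo the half-applied configuration.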
So for the node I'd say that the following should do it: ifcfg.ConfigWriter().restorePersistentBackup(), which is equivalent to ignoring the ovirt-node check.

> > So the option I'd go for is to always use setupNetworks since it has a
> > rollback feature that would prevent the deployer from losing access to the
> > host. When an error is detected, we should notify rhev-m so that it can give
> > relevant info to
> > the admin who will hopefully be able to come up with a different config for
> > the still accessible host.
>
> Right, this is where we are going, but still need to fix bugs for older
> versions, and even in 3.3 older ovirt-node will not be using the
> setup-network as we did not implement the deletion of brXXXX bridges in
> engine... so we fallback to the legacy method.

So I'd try what I wrote above.

Some history: the ovirt-node short-circuit was introduced to avoid bug 494533. Nowadays, the implementation of vdsm-restore-net-config has changed so thoroughly that it would not re-expose that bug if we let it run to completion.

I see that the bug is in POST and has a patch. Is it supposed to be fixed in RHEV Hypervisor - 6.4 - 20130813.0.el6_4, which includes vdsm-4.10.2-24.1.el6ev? I am asking because the bug still occurs.

RHEV Hypervisor - 6.4 - 20130815.0.el6_4
vdsm-4.10.2-24.1.el6ev
The bug is still present.

Hi,

Can you please check whether /usr/share/vdsm/vdsm-restore-net-config is updated per [1]?

[1] http://gerrit.ovirt.org/#/c/16847/4/vdsm/vdsm-restore-net-config,unified

(In reply to Alon Bar-Lev from comment #11)
> Hi,
>
> Can you please see if:
>
> /usr/share/vdsm/vdsm-restore-net-config
>
> Is updated per[1]?
>
> [1] http://gerrit.ovirt.org/#/c/16847/4/vdsm/vdsm-restore-net-config,unified

No, it is not. Content of /usr/share/vdsm/vdsm-restore-net-config on the node:

    import os
    import glob
    import configNetwork

    def main():
        # this should NOT be used in ovirt-node, where configuration persistence is
        # handled otherwise.
        try:
            if os.path.exists('/etc/rhev-hypervisor-release') or \
                    not len(glob.glob('/etc/ovirt-node-*-release')) == 0:
                return
        except:
            pass
        configWriter = configNetwork.ConfigWriter()
        configWriter.restorePersistentBackup()

    if __name__ == '__main__':
        main()

(In reply to Martin Pavlik from comment #12)
> (In reply to Alon Bar-Lev from comment #11)
> > Hi,
> >
> > Can you please see if:
> >
> > /usr/share/vdsm/vdsm-restore-net-config
> >
> > Is updated per[1]?
> >
> > [1] http://gerrit.ovirt.org/#/c/16847/4/vdsm/vdsm-restore-net-config,unified
>
> No it is not

So POST is correct... :) Moving to MODIFIED since the patch is merged.

It turns out that patch http://gerrit.ovirt.org/#/c/16847/ does NOT solve the bug. After reboot, the host is in a non-operational state and there is no rhevm network.

Toni suggested that this could be because of bug 988986, but that is not the case. After installing the host and adding it to RHEV-M, these files exist:

[root@dell-r210ii-06 ~]# ls /config/etc/libvirt/qemu/networks/
autostart default.xml vdsm-rhevm.xml

After reboot, the network is down and vdsm-brem1.xml is present instead of vdsm-rhevm.xml. So it is not connected with bug 988986; there is still another problem.

(In reply to psebek from comment #20)
> It turns out that patch http://gerrit.ovirt.org/#/c/16847/ does NOT solve
> the bug. After reboot the host is in non-operational state and there is no
> rhevm network.
>
> Toni suggested that this could because of bug 988986. But it is not the
> case. After install the host and adding to the rhevm there are this files:
>
> [root@dell-r210ii-06 ~]# ls /config/etc/libvirt/qemu/networks/
> autostart default.xml vdsm-rhevm.xml
>
> After reboot the network is down and instead of vdsm-rhevm.xml is
> vdsm-brem1.xml. So it is not connected with bug 988986, but there is still
> another problem.

Which RHEV-H/VDSM versions are you using in this last test?
rhevh-6.4-20130815.0.iso
rhev-hypervisor6-6.4-20130815.0.el6

(In reply to psebek from comment #20)
>
> After reboot the network is down and instead of vdsm-rhevm.xml is
> vdsm-brem1.xml. So it is not connected with bug 988986, but there is still
> another problem.

I would guess this lies on the border between libvirt and the node, and is a relatively new regression: when a libvirt network is defined, it should also be persisted on the node. Unfortunately, I do not recall how this was done in former versions. Do you, Mike?

I tried it with RHEV Hypervisor - 6.4 - 20130415.0.el6_4, vdsm-4.10.2-1.9.el6ev, and oVirt Engine Version 3.3.0-0.2.master.20130730134744.gitf5e6e45.fc18, and after reboot there is no problem with the host. The management network (ovirtmgmt) is still there and the host is in state Up again. No problem with this configuration.

(In reply to Dan Kenigsberg from comment #23)
> I would guess this lies on the border of libvirt and node, and is a
> relatively new regression - when a libvirt network is defined, it should
> also persist it on node. Unfortunately, I do not recall how this was done in
> former versions. Do you, Mike?

We persist the /etc/libvirt/qemu/networks/ directory by default, so all networks defined there should be persisted automatically. The fact that /config/etc/libvirt/qemu/networks/ exists (from comment 20) indicates that persistence is working.

Created attachment 788924
vdsm.log during described action
Created attachment 788925
ovirt.log
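Mike's point about persistence can be checked file by file. The sketch below assumes that a persisted file on the node has a copy under /config at the same relative path, which is inferred from the /config/etc/libvirt/qemu/networks/ listing in comment 20 rather than from ovirt-node documentation; `persisted_copy_exists` is a hypothetical helper, not an ovirt-node API.

```python
import os


def persisted_copy_exists(path, config_root="/config"):
    """Return True if `path` has a copy under the node's persistence root.

    Assumption: a persisted /etc/foo is stored as <config_root>/etc/foo,
    matching the /config/etc/libvirt/qemu/networks/ layout seen in this bug.
    """
    # Turn '/etc/libvirt/...' into 'etc/libvirt/...' so it can be joined.
    relative = os.path.relpath(os.path.abspath(path), "/")
    return os.path.exists(os.path.join(config_root, relative))


if __name__ == "__main__":
    for name in ("vdsm-rhevm.xml", "vdsm-brem1.xml"):
        target = "/etc/libvirt/qemu/networks/" + name
        print(target, persisted_copy_exists(target))
```

On the host from comment 20, this kind of check would show vdsm-rhevm.xml persisted before reboot; finding vdsm-brem1.xml there afterwards points at what was written, not at the persistence mechanism.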
My first impression from vdsm-reg.log is that libvirt was down, so adding the new interface (rhevm) to the libvirt database with the addNetwork script won't work:

libvir: XML-RPC error : Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
(the line above is repeated many more times in the log)
INFO:root:Network 'brem1': doesn't exist in libvirt database
MainThread::DEBUG::2013-06-27 14:24:25,245::deployUtil::135::root::['/usr/share/vdsm/addNetwork', 'rhevm', '', '', 'em1', 'PEERNTP=yes', 'DELAY=0', 'IPV6INIT=no', 'IPV6_AUTOCONF=no', 'IPV6FORWARDING=no', 'BOOTPROTO=dhcp', 'ONBOOT=yes', 'blockingdhcp=true']
MainThread::DEBUG::2013-06-27 14:24:31,254::deployUtil::143::root::Determining IP information for rhevm... failed.
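When comparing vdsm-reg.log files such as the excerpt above, a small helper can count the libvirt socket errors and pick out the interfaces whose DHCP step failed. This is an illustrative sketch, not part of vdsm: the function name `triage_vdsm_reg_log` is hypothetical, and the two patterns are copied from the log lines quoted in this bug.

```python
import re

# Pattern for lines such as:
# ...::deployUtil::143::root::Determining IP information for rhevm... failed.
DHCP_FAILED = re.compile(r"Determining IP information for (\S+?)\.\.\. failed\.")
LIBVIRT_SOCKET = "Failed to connect socket to '/var/run/libvirt/libvirt-sock'"


def triage_vdsm_reg_log(lines):
    """Summarize two failure signatures seen in vdsm-reg.log lines.

    Every matching line counts, so a DHCP failure that also appears in a
    traceback or a makeBridge message is listed more than once.
    """
    dhcp_failures = []
    libvirt_errors = 0
    for line in lines:
        if LIBVIRT_SOCKET in line:
            libvirt_errors += 1
        match = DHCP_FAILED.search(line)
        if match:
            dhcp_failures.append(match.group(1))
    return {"dhcp_failed_for": dhcp_failures,
            "libvirt_socket_errors": libvirt_errors}


if __name__ == "__main__":
    with open("/var/log/vdsm-reg/vdsm-reg.log") as log:  # path is an assumption
        print(triage_vdsm_reg_log(log))
```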
Logs at 10.34.67.2 (vdsm-reg.log)
================================================
MainThread::DEBUG::2013-08-21 13:50:54,852::deployUtil::135::root::['/usr/share/vdsm/addNetwork', 'rhevm', '', '', 'em1', 'PEERNTP=yes', 'DELAY=0', 'IPV6INIT=no', 'IPV6_AUTOCONF=no', 'IPV6FORWARDING=no', 'BOOTPROTO=dhcp', 'ONBOOT=yes', 'blockingdhcp=true']
MainThread::DEBUG::2013-08-21 13:50:56,638::deployUtil::143::root::Determining IP information for rhevm... failed.
INFO:root:Adding network rhevm with vlan=, bonding=, nics=['em1'], bondingOptions=None, mtu=None, bridged=True, options={'IPV6FORWARDING': 'no', 'blockingdhcp': 'true', 'IPV6INIT': 'no', 'delay': '0', 'bootproto': 'dhcp', 'IPV6_AUTOCONF': 'no', 'PEERNTP': 'yes', 'onboot': 'yes'}
Traceback (most recent call last):
  File "/usr/share/vdsm/configNetwork.py", line 1535, in <module>
  File "/usr/share/vdsm/configNetwork.py", line 1504, in main
  File "/usr/share/vdsm/configNetwork.py", line 1022, in addNetwork
  File "/usr/share/vdsm/configNetwork.py", line 86, in ifup
  File "/usr/share/vdsm/configNetwork.py", line 75, in _ifup
ConfigNetworkError: (29, 'Determining IP information for rhevm... failed.')
MainThread::DEBUG::2013-08-21 13:50:56,645::deployUtil::1025::root::makeBridge Failed to add rhevm bridge out=Determining IP information for rhevm... failed.
===================================

At this stage we have created the file /etc/sysconfig/network-scripts/ifcfg-rhevm, but vdsm-reg-setup -> addNetwork -> configNetwork failed, and vdsm-reg-setup won't persist the file.
vdsm-reg-setup
===================================
    #Rename existing bridge
    fReturn = deployUtil.makeBridge(self.vdcName, self.vdsmDir)
    if not fReturn:
        logging.error("renameBridge Failed to rename existing bridge!")

    #Persist changes
    if fReturn:
        try:
            out, err, ret = deployUtil._logExec([os.path.join(self.vdsmDir, SCRIPT_NAME_SAVE)])

Where:
SCRIPT_NAME_SAVE="vdsm-store-net-config"
===================================

Also, I have executed a few tests, and the host (rhev-h - 20130815.0.el6_4) is able to persist files.

# rpm -qa | grep -i vdsm
vdsm-cli-4.10.2-24.1.el6ev.noarch
vdsm-hook-vhostmd-4.10.2-24.1.el6ev.noarch
vdsm-4.10.2-24.1.el6ev.x86_64
vdsm-python-4.10.2-24.1.el6ev.x86_64
vdsm-xmlrpc-4.10.2-24.1.el6ev.noarch
vdsm-reg-4.10.2-24.1.el6ev.noarch

Additionally, I have created a test environment with RHEV-H 6.4 (20130815.0.el6_4) + RHEV-M 3.2.2-0.41.el6ev, and it worked (the rhevm interface persisted after reboot).

MainThread::DEBUG::2013-08-21 20:53:08,054::deployUtil::135::root::['/usr/share/vdsm/addNetwork', 'rhevm', '', '', 'eth0', 'PEERNTP=yes', 'DELAY=0', 'IPV6INIT=no', 'IPV6_AUTOCONF=no', 'IPV6FORWARDING=no', 'BOOTPROTO=dhcp', 'ONBOOT=yes', 'blockingdhcp=true']
MainThread::DEBUG::2013-08-21 20:53:14,422::deployUtil::143::root::
MainThread::DEBUG::2013-08-21 20:53:14,427::deployUtil::144::root::WARNING:Storage.LVM:Cannot create env file [Errno 2] No such file or directory: '/var/run/vdsm/lvm.env'
WARNING:root:options DELAY is deprecated. Use delay instead
WARNING:root:options BOOTPROTO is deprecated. Use bootproto instead
WARNING:root:options ONBOOT is deprecated. Use onboot instead
INFO:root:Adding network rhevm with vlan=, bonding=, nics=['eth0'], bondingOptions=None, mtu=None, bridged=True, options={'IPV6FORWARDING': 'no', 'blockingdhcp': 'true', 'IPV6INIT': 'no', 'delay': '0', 'bootproto': 'dhcp', 'IPV6_AUTOCONF': 'no', 'PEERNTP': 'yes', 'onboot': 'yes'}
libvir: Network Driver error : Network not found: no network with matching name 'vdsm-rhevm'
libvir: Network Driver error : Network not found: no network with matching name 'vdsm-rhevm'
MainThread::DEBUG::2013-08-21 20:53:14,427::deployUtil::1043::root::makeBridge return.
MainThread::DEBUG::2013-08-21 20:53:14,428::deployUtil::135::root::['/usr/share/vdsm/vdsm-store-net-config']
MainThread::DEBUG::2013-08-21 20:53:14,477::deployUtil::143::root:: File persisted
Successfully persisted /etc/sysconfig/network-scripts/ifcfg-eth0
File persisted
Successfully persisted /etc/sysconfig/network-scripts/ifcfg-rhevm

I've tested this behaviour on my machine and I was NOT able to reproduce it. Versions used:
rhevm: 3.2.3-0.42.el6ev
rhevh: 20130815.0.el6_4

With 3.3 it also works as it should:
rhevm: 3.3.0-0.16.master.el6ev
rhevh: 20130815.0.el6_4

So I suspect it is something related to Martin Pavlík's environment, probably the DHCP server. We should still examine why and when exactly it's failing.

IIUC, in Martin's environment the DHCP server fails to provide an address to the newly defined rhevm network. This should have failed the installation and rolled back the network definition. However, due to bug 979572, it only fails the persisting of the network.

I suggest not fixing this bug either; it would go away once vdsm-reg is deprecated. Please reopen bug 979572 if it becomes a real-life problem. I would suggest fixing that bug by replacing vdsm-reg.

*** This bug has been marked as a duplicate of bug 979572 ***
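The behavior described in the last comments (fail and roll back instead of silently skipping the persist) could look roughly like the sketch below. It is not the vdsm-reg code: `add_network_or_rollback` is a hypothetical helper, the script paths are the /usr/share/vdsm ones seen in the logs in this bug, and `run` is any callable that takes an argv list and returns an exit status (subprocess.call by default). On RHEV-H, the rollback step only has an effect once vdsm-restore-net-config no longer short-circuits on the node check.

```python
import subprocess

VDSM_DIR = "/usr/share/vdsm"  # script location as seen in the vdsm-reg.log excerpts


def add_network_or_rollback(add_network_args, run=subprocess.call):
    """Hypothetical helper: persist on success, roll back on failure.

    add_network_args are the positional arguments for addNetwork, e.g.
    ['rhevm', '', '', 'em1', 'BOOTPROTO=dhcp', 'ONBOOT=yes'].
    run(argv) must return an exit status (0 means success).
    Returns True only if addNetwork and the persist step both succeeded.
    """
    rc = run([VDSM_DIR + "/addNetwork"] + list(add_network_args))
    if rc == 0:
        # Success: make the new configuration the persistent one.
        return run([VDSM_DIR + "/vdsm-store-net-config"]) == 0
    # Failure: restore the last persisted configuration instead of leaving
    # an unpersisted ifcfg-rhevm and bridge behind.
    run([VDSM_DIR + "/vdsm-restore-net-config"])
    return False
```

Injecting `run` keeps the decision logic testable without touching a real host; the caller still has to report a False result to the engine so the admin can pick a different configuration.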
Created attachment 766190
install_log

Description of problem:
Used 3.3 engine with 3.2 DC/CL.
The RHEVH host can be successfully added into RHEVM. After the RHEVH host reboots, it becomes Non Responsive due to the missing rhevm interface: the rhevm bridge is gone, as is the ifcfg-rhevm file, and a brem1 bridge appears with no ifcfg file.

Version-Release number of selected component (if applicable):
oVirt Engine Version: 3.3.0-0.5.master.el6ev
RHEV Hypervisor - 6.4 - 20130528.0.el6_4
vdsm-4.10.2-22.0.el6ev

How reproducible:
100%

Steps to Reproduce:
1. Add a RHEVH (20130528.0.el6_4) host into a 3.2 DC/CL on RHEVM 3.3.
2. Reboot the RHEVH host.

Actual results:
The RHEVH host becomes Non Responsive due to the missing rhevm interface.

Expected results:
The network configuration is persisted through reboot.

Additional info:
###Host installed from PXE

####files after adding host to RHEVM
[root@dell-r210ii-07 ~]# date && ll /etc/sysconfig/network-scripts/ifcfg-*
Thu Jun 27 14:06:14 UTC 2013
-rw-rw-r--. 1 root root 144 2013-06-27 13:50 /etc/sysconfig/network-scripts/ifcfg-em1
-rw-r--r--. 1 root root 46 2013-06-27 15:47 /etc/sysconfig/network-scripts/ifcfg-em2
-rw-r--r--. 1 root root 254 2013-01-09 11:13 /etc/sysconfig/network-scripts/ifcfg-lo
-rw-r--r--. 1 root root 47 2013-06-27 15:47 /etc/sysconfig/network-scripts/ifcfg-p1p1
-rw-r--r--. 1 root root 47 2013-06-27 15:47 /etc/sysconfig/network-scripts/ifcfg-p1p2
-rw-rw-r--. 1 root root 135 2013-06-27 13:50 /etc/sysconfig/network-scripts/ifcfg-rhevm
[root@dell-r210ii-07 ~]# cat /etc/sysconfig/network-scripts/ifcfg-rhevm
DEVICE=rhevm
ONBOOT=yes
TYPE=Bridge
DELAY=0
BOOTPROTO=dhcp
NM_CONTROLLED=no
IPV6FORWARDING=no
IPV6_AUTOCONF=no
IPV6INIT=no
PEERNTP=yes
[root@dell-r210ii-07 ~]# cat /etc/sysconfig/network-scripts/ifcfg-em1
DEVICE=em1
ONBOOT=yes
HWADDR=d0:67:e5:f0:82:44
BRIDGE=rhevm
NM_CONTROLLED=no
IPV6FORWARDING=no
DELAY=0
PEERNTP=yes
IPV6INIT=no
IPV6_AUTOCONF=no
[root@dell-r210ii-07 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
rhevm           8000.d067e5f08244       no              em1

###files after reboot
[root@localhost ~]# date && ls /etc/sysconfig/network-scripts/ifcfg-*
Thu Jun 27 14:36:00 UTC 2013
/etc/sysconfig/network-scripts/ifcfg-em1  /etc/sysconfig/network-scripts/ifcfg-lo  /etc/sysconfig/network-scripts/ifcfg-p1p2
/etc/sysconfig/network-scripts/ifcfg-em2  /etc/sysconfig/network-scripts/ifcfg-p1p1
[root@localhost ~]# cat /etc/sysconfig/network-scripts/ifcfg-em1
DEVICE=em1
HWADDR=d0:67:e5:f0:82:44
BRIDGE=brem1
ONBOOT=yes
[root@localhost ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
brem1           8000.d067e5f08244       no              em1
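To see exactly what changed across the reboot, the before/after ifcfg-em1 contents above can be diffed key by key. A minimal Python sketch; the parser is a deliberate simplification (it assumes plain KEY=VALUE lines, skips comments and blanks, and only strips surrounding quotes), not the initscripts parser, and both function names are hypothetical.

```python
def parse_ifcfg(text):
    """Parse simple KEY=VALUE lines of an ifcfg file into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        result[key.strip()] = value.strip().strip("\"'")
    return result


def diff_ifcfg(before, after):
    """Return {key: (before_value, after_value)} for keys that differ.

    A missing key shows up as None on that side.
    """
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}


if __name__ == "__main__":
    before = "DEVICE=em1\nONBOOT=yes\nHWADDR=d0:67:e5:f0:82:44\nBRIDGE=rhevm\n"
    after = "DEVICE=em1\nHWADDR=d0:67:e5:f0:82:44\nBRIDGE=brem1\nONBOOT=yes\n"
    print(diff_ifcfg(parse_ifcfg(before), parse_ifcfg(after)))
```

For the full ifcfg-em1 listings in this report, the diff shows BRIDGE changing from rhevm to brem1, plus NM_CONTROLLED, DELAY, PEERNTP and the IPV6* keys disappearing, which matches a file regenerated for the brem1 bridge rather than the one written by addNetwork.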