Description of problem:

Setting up hosted-engine via cockpit fails during setup of the ovirtmgmt bridge if cockpit is also accessed via the management network.

Version-Release number of selected component (if applicable):
RHEV-H-7.2-20160627.2-RHVH-x86_64-dvd1.iso

How reproducible:
always

Steps to Reproduce:
1. Install RHEV-H from ISO
2. Define network settings
3. Start hosted-engine installation from cockpit (https://<FQDN>:9090/)
4. Select the network you are connected to cockpit over as the management network

Actual results:
During installation the setup aborts and cockpit shows a "Disconnected" message. The ovirtmgmt bridge stays in status down. Routing and DNS are also not set up correctly.

Expected results:
hosted-engine setup should be able to proceed and reconfigure the network as required.

Additional info:
Network state after cockpit shows "Disconnected":

[root@ovirt1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UP qlen 1000
    link/ether 52:54:00:80:c7:08 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:5c:b2:16 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:e2:27:f6 brd ff:ff:ff:ff:ff:ff
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
    link/ether 5a:8f:e2:eb:a8:7c brd ff:ff:ff:ff:ff:ff
6: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 72:a3:24:54:0e:73 brd ff:ff:ff:ff:ff:ff
7: ovirtmgmt: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN
    link/ether 52:54:00:80:c7:08 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.21/24 brd 192.168.100.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
[root@ovirt1 ~]#

Unfortunately the hosted-engine-setup log does not show much relevant information (besides the iSCSI failing because the network is gone):

2016-06-29 10:34:59 DEBUG otopi.context context.dumpEnvironment:760 ENVIRONMENT DUMP - BEGIN
2016-06-29 10:34:59 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_VDSM/vdscli=_Server:'<vdsm.jsonrpcvdscli._Server object at 0x4c72b50>'
2016-06-29 10:34:59 DEBUG otopi.context context.dumpEnvironment:774 ENVIRONMENT DUMP - END
2016-06-29 10:34:59 DEBUG otopi.context context._executeMethod:128 Stage misc METHOD otopi.plugins.gr_he_common.network.bridge.Plugin._misc
2016-06-29 10:34:59 INFO otopi.plugins.gr_he_common.network.bridge bridge._misc:372 Configuring the management bridge
2016-06-29 10:34:59 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:384 networks: {'ovirtmgmt': {'nic': 'eth0', 'ipaddr': u'192.168.100.21', 'netmask': u'255.255.255.0', 'bootproto': u'none', 'gateway': u'192.168.100.1', 'defaultRoute': True}}
2016-06-29 10:34:59 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:385 bonds: {}
2016-06-29 10:34:59 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:386 options: {'connectivityCheck': False}
2016-06-29 10:35:03 DEBUG otopi.context context._executeMethod:128 Stage misc METHOD otopi.plugins.gr_he_setup.storage.blockd.Plugin._misc
2016-06-29 10:35:03 INFO otopi.plugins.gr_he_setup.storage.blockd blockd._misc:656 Creating Volume Group
2016-06-29 10:35:13 DEBUG otopi.plugins.gr_he_setup.storage.blockd blockd._misc:658 {'status': {'message': u'Failed to initialize physical device: ("[u\'/dev/mapper/36001405ddaf6032e1bf47538069c78c2\']",)', 'code': 601}}
2016-06-29 10:35:13 ERROR otopi.plugins.gr_he_setup.storage.blockd blockd._misc:664 Error creating Volume Group: Failed to initialize physical device: ("[u'/dev/mapper/36001405ddaf6032e1bf47538069c78c2']",)
Created attachment 1173726 [details]
All messages from /var/log of the hypervisor

Just in case it is of interest, I added a tar file containing all files from /var/log after the abort.

Route and DNS are also no longer configured:

[root@ovirt1 ~]# ip route list
192.168.100.0/24 dev ovirtmgmt proto kernel scope link src 192.168.100.21
[root@ovirt1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search satellite.local

# No nameservers found; try putting DNS servers into your
# ifcfg files in /etc/sysconfig/network-scripts like so:
#
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
[root@ovirt1 ~]#
Created attachment 1173732 [details]
sosreport after the failed ovirtmgmt bridge setup

Some additional remarks. It seems this is not cockpit related: running hosted-engine --deploy on the console leads to the same result (even if run in a screen session). The screen session has the advantage that one can still see the console output after the management bridge has been configured. So hosted-engine --deploy simply proceeds to the storage part (which fails). I gathered a sosreport, which is also attached.

Output from screen:

[ INFO ] Stage: Setup validation

  --== CONFIGURATION PREVIEW ==--

  Bridge interface : eth0
  Engine FQDN : ovirt.satellite.local
  Bridge name : ovirtmgmt
  Host address : ovirt1.satellite.local
  SSH daemon port : 22
  Firewall manager : iptables
  Gateway address : 192.168.100.1
  Host name for web application : ovirt1
  Storage Domain type : iscsi
  Host ID : 1
  LUN ID : 36001405ddaf6032e1bf47538069c78c2
  Image size GB : 50
  iSCSI Portal IP Address : 192.168.100.1
  iSCSI Target Name : iqn.2003-01.org.linux-iscsi.kirk.x8664:sn.21dc789db84d
  iSCSI Portal port : 3260
  iSCSI Portal user :
  Console type : vnc
  Memory size MB : 6144
  MAC address : 00:16:3e:3e:b0:69
  Boot type : disk
  Number of CPUs : 4
  OVF archive (for disk boot) : /root/rhevm-appliance-20160623.0-1.x86_64.rhevm.ova
  Restart engine VM after engine-setup: True
  CPU Type : model_SandyBridge

  Please confirm installation settings (Yes, No)[Yes]:
[ INFO ] Stage: Transaction setup
[ INFO ] Stage: Misc configuration
[ INFO ] Stage: Package installation
[ INFO ] Stage: Misc configuration
[ INFO ] Configuring libvirt
[ INFO ] Configuring VDSM
[ INFO ] Starting vdsmd
[ INFO ] Configuring the management bridge
[ INFO ] Creating Volume Group
[ ERROR ] Error creating Volume Group: Failed to initialize physical device: ("[u'/dev/mapper/36001405ddaf6032e1bf47538069c78c2']",)
  The selected device is already used.
  To create a vg on this device, you must use Force.
  WARNING: This will destroy existing data on the device. (Force, Abort)[Abort]?
While testing I found that I only run into this issue with static IPs. If I use DHCP for the hypervisor host, deployment of the hosted engine works fine.
(In reply to Martin Tessun from comment #2)
> [ ERROR ] Error creating Volume Group: Failed to initialize physical device:
> ("[u'/dev/mapper/36001405ddaf6032e1bf47538069c78c2']",)
>   The selected device is already used.
>   To create a vg on this device, you must use Force.
>   WARNING: This will destroy existing data on the device. (Force, Abort)[Abort]?

What do I need to reproduce this? Was there already a volume group on that LUN, or does this happen every time iSCSI is used?

It seems that the primary problem here is hosted-engine-setup dropping the network when there's a storage problem on iSCSI, rather than cockpit (which can't be expected to show anything if the network drops), but this is hard to verify without a reproducer. Similar "stopping" errors are shown correctly in cockpit, but...

(In reply to dmoessne from comment #3)
> While testing I found that I only run into this issue with static IPs.
> If I use DHCP for the hypervisor host, deployment of the hosted engine works fine.

The management bridge dropping, or iSCSI? I'm reasonably sure that we've verified static IPs (Douglas, can you confirm?), but I'm not sure which issue you're referring to.
In the logs I see:

jsonrpc.Executor/4::ERROR::2016-06-29 10:35:13,452::task::868::Storage.TaskManager.Task::(_setError) Task=`69e9e009-7157-44e7-a73a-e2c28743b01e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2110, in createVG
    (force.capitalize() == "True")))
  File "/usr/share/vdsm/storage/lvm.py", line 936, in createVG
    _initpvs(pvs, metadataSize, force)
  File "/usr/share/vdsm/storage/lvm.py", line 739, in _initpvs
    raise se.PhysDevInitializationError(str(devices))
Hi Piotr, Ryan,

this is the result of the ovirtmgmt bridge being down. Apart from this the installation goes fine, if you run the following loop just before starting hosted-engine --deploy:

[root@ovirt1 ~]# while true; do ip link set up ovirtmgmt; done

The VG creation fails because I am using iSCSI, and as soon as the IP is gone (ovirtmgmt interface being created) it can't access the storage devices any more.

So best to reproduce the same way I did:

1. Have a static IP set for the hypervisor, e.g.:

[root@ovirt2 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
NAME=eth0
UUID=8bcd806a-9028-461b-ac8e-54595c302144
DEVICE=eth0
ONBOOT=yes
IPADDR=192.168.100.22
PREFIX=24
GATEWAY=192.168.100.1
DNS1=192.168.100.1
DOMAIN=satellite.local
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_PRIVACY=no
[root@ovirt2 ~]#

2. Start hosted-engine --deploy (either via cockpit or locally)
3. Select the connected interface as the management interface
4. Wait until hosted-engine/vdsm create the ovirtmgmt bridge. If the above loop isn't running, the ovirtmgmt bridge will stay in down state.

As said, the iSCSI storage error is just a side effect, as my iSCSI is also connected via eth0 in this case. As outlined by Daniel in C#3 it will not fail if DHCP is used instead of a static IP.

Just let me know if you need more information for a reproducer.

Cheers,
Martin
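For anyone trying the reproducer above, a simple way to watch what happens to the bridge while hosted-engine --deploy runs (plain iproute2/nmcli/journalctl commands, nothing specific to this setup):

    # watch the administrative state of ovirtmgmt and NM's device list once per second
    watch -n1 'ip -o link show ovirtmgmt; nmcli -t -f DEVICE,STATE d'
    # in a second terminal, follow NetworkManager's log while the bridge is created
    journalctl -u NetworkManager -f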
Just an additional finding:

Having that "while" loop does help getting hosted-engine further along, but DNS is still missing:

Failed to execute stage 'Closing up': [ERROR]::oVirt API connection failure, (6, 'Could not resolve host: ovirt.satellite.local; Unknown error')
Hosted Engine deployment failed: this system is not reliable, please check the issue, fix and redeploy

[root@ovirt1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search satellite.local

# No nameservers found; try putting DNS servers into your
# ifcfg files in /etc/sysconfig/network-scripts like so:
#
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
[root@ovirt1 ~]#
[root@ovirt1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
# Generated by VDSM version 4.18.3-0.el7ev
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
IPADDR=192.168.100.21
NETMASK=255.255.255.0
GATEWAY=192.168.100.1
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
IPV6INIT=no
[root@ovirt1 ~]#

So this is still missing the DNS1=192.168.100.1 entry.

Trying to redeploy now with a different script running while I do the deployment:

[root@ovirt1 ~]# while true; do ip link set up ovirtmgmt; (grep -q '^nameserver' /etc/resolv.conf || echo "nameserver 192.168.100.1" >> /etc/resolv.conf); done

Please also find the configuration files (ifcfg-eth0 and resolv.conf) as they look directly after the hypervisor is installed:

[root@ovirt1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
NAME=eth0
UUID=fbe724dd-2b4a-43d0-bec7-db9cffa39609
DEVICE=eth0
ONBOOT=yes
IPADDR=192.168.100.21
PREFIX=24
GATEWAY=192.168.100.1
DNS1=192.168.100.1
DOMAIN=satellite.local
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_PRIVACY=no
[root@ovirt1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search satellite.local
nameserver 192.168.100.1
[root@ovirt1 ~]#

So what I would expect:
1. ovirtmgmt should get the same configuration as eth0 had
2. resolv.conf should also be maintained as is
3. Interface ovirtmgmt should be set to up after it has been configured.
(In reply to Martin Tessun from comment #7)
> So what I would expect:
> 1. ovirtmgmt should get the same configuration as eth0 had
> 2. resolv.conf should also be maintained as is
> 3. Interface ovirtmgmt should be set to up after it has been configured.

This is also what I'd expect.

Dan, Simone, any thoughts?
(In reply to Ryan Barry from comment #8)
> (In reply to Martin Tessun from comment #7)
> > So what I would expect:
> > 1. ovirtmgmt should get the same configuration as eth0 had
> > 2. resolv.conf should also be maintained as is
> > 3. Interface ovirtmgmt should be set to up after it has been configured.
>
> This is also what I'd expect.
>
> Dan, Simone, any thoughts?

Ryan, see this one: https://bugzilla.redhat.com/show_bug.cgi?id=1160423
(In reply to Simone Tiraboschi from comment #9)
>
> Ryan, see this one: https://bugzilla.redhat.com/show_bug.cgi?id=1160423

Well, that one is for the DNS (and hopefully will solve (1) and (2)). That still leaves us with the following:

> > 3. Interface ovirtmgmt should be set to up after it has been configured.

Additionally, as I understand it, BZ #1160423 is already scheduled for 4.1 although it was discovered back in 2014?

Personally I think these issues should *really* be fixed in RHV 4.0 at least. Is there anything that can be done to get this resolved in RHV 4.0?

Cheers,
Martin
Hi Martin -

I think the impact is low, since this only appears to affect interfaces which use DNS1/DNS2 in ifcfg files, and systems which set DNS in resolv.conf (or via DHCP) operate normally.

Personally, I also find this surprising, since I can't remember the last time I set a DNS server in resolv.conf instead of DNS1, but the lack of customer tickets attached to a very old bug indicates that this may also be an uncommon use case among customers. It may be hard to get it into 4.0 this late.

For issue #3, this definitely looks like ovirt-host-deploy. I'm just getting back from holiday, but I'll see if I have some time to set up a reproducer this week. Is there anything in journald about it not coming up?
(In reply to Ryan Barry from comment #11)
> Hi Martin -
>
> I think the impact is low, since this only appears to affect interfaces which
> use DNS1/DNS2 in ifcfg files, and systems which set DNS in resolv.conf (or
> via DHCP) operate normally.
>
> Personally, I also find this surprising, since I can't remember the
> last time I set a DNS server in resolv.conf instead of DNS1, but the lack of
> customer tickets attached to a very old bug indicates that this may also be
> an uncommon use case among customers. It may be hard to get it into 4.0 this
> late.

Well, I would tend to agree if all our customers were using RHEL-H, but installing RHEV-H with anaconda does exactly this (using DNS1/DNS2) for the configuration. As such, I believe this is quite a relevant bug. I am not sure how many customers will use this way of installation (RHEV-H + Hosted Engine), but this installation will fail unless it is changed manually afterwards or my small infinite loop is run during installation.

As such I think this is now a more relevant bug than it might have been before.

>
> For issue #3, this definitely looks like ovirt-host-deploy. I'm just getting
> back from holiday, but I'll see if I have some time to set up a reproducer
> this week. Is there anything in journald about it not coming up?

Indeed, see below:

Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> ifcfg-rh: remove /etc/sysconfig/network-scripts/ifcfg-eth0 (d0fea1c3-8518-4407-8345-3544d3c
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (eth0): device state change: activated -> deactivating (reason 'connection-removed') [100 1
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> NetworkManager state is now DISCONNECTING
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-eth0 (5fb06bd0-0bb0-7ffb-45f1
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <warn> ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-eth0 (5fb06bd0-0bb0-7ffb
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (eth0): device state change: deactivating -> unmanaged (reason 'unmanaged') [110 10 3]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> NetworkManager state is now DISCONNECTED
Jul 06 08:37:01 ovirt1.satellite.local dbus[894]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-di
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (eth0): link disconnected
Jul 06 08:37:01 ovirt1.satellite.local dbus-daemon[894]: dbus[894]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 06 08:37:01 ovirt1.satellite.local dbus[894]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 06 08:37:01 ovirt1.satellite.local dbus-daemon[894]: dbus[894]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 06 08:37:01 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching action 'down' for eth0
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started Network Manager Script Dispatcher Service.
[...]
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup eth0.
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup eth0.
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): new Bridge device (carrier: OFF, driver: 'bridge', ifindex: 7)
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (eth0): link connected
Jul 06 08:37:01 ovirt1.satellite.local kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Jul 06 08:37:01 ovirt1.satellite.local kernel: device eth0 entered promiscuous mode
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): bridge port eth0 was attached
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (eth0): enslaved to ovirtmgmt
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup ovirtmgmt.
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup ovirtmgmt.
Jul 06 08:37:01 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered forwarding state
Jul 06 08:37:01 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered forwarding state
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): link connected
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: unmanaged -> unavailable (reason 'connection-assumed') [1
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> ifcfg-rh: add connection in-memory (cb0a5d45-38d3-4e3b-8e6e-58993fd453d1,"ovirtmgmt")
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: unavailable -> disconnected (reason 'connection-assumed')
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): Activation: starting connection 'ovirtmgmt' (cb0a5d45-38d3-4e3b-8e6e-58993fd45
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: disconnected -> prepare (reason 'none') [30 40 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: prepare -> config (reason 'none') [40 50 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: config -> ip-config (reason 'none') [50 70 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: ip-config -> ip-check (reason 'ip-config-unavailable') [7
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: ip-check -> secondaries (reason 'none') [80 90 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: secondaries -> activated (reason 'none') [90 100 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> NetworkManager state is now CONNECTED_LOCAL
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): Activation: successful, device activated.
Jul 06 08:37:01 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching action 'up' for ovirtmgmt
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Unit iscsi.service cannot be reloaded because it is inactive.
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (9a0b07c0-2983-fe97
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <warn> ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (9a0b07c0-2983
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): device state change: activated -> unmanaged (reason 'unmanaged') [100 10 3]
Jul 06 08:37:02 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered disabled state
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info> NetworkManager state is now DISCONNECTED
Jul 06 08:37:02 ovirt1.satellite.local kernel: IPv6: ADDRCONF(NETDEV_UP): ovirtmgmt: link is not ready
Jul 06 08:37:02 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching action 'down' for ovirtmgmt
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info> (ovirtmgmt): link disconnected
Jul 06 08:37:03 ovirt1.satellite.local /etc/sysconfig/network-scripts/ifup-eth[19409]: Error adding default gateway 192.168.100.1 for ovirtmgmt.
Jul 06 08:37:04 ovirt1.satellite.local daemonAdapter[19001]: libvirt: Network Driver error : Network not found: no network with matching name 'vdsm-ovirtmgmt'
hosted-engine-setup calls Host.setupNetworks and everything seems fine.

jsonrpc.Executor/2::DEBUG::2016-06-29 10:34:59,933::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'ovirtmgmt': {u'nic': u'eth0', u'ipaddr': u'192.168.100.21', u'netmask': u'255.255.255.0', u'bootproto': u'none', u'gateway': u'192.168.100.1', u'defaultRoute': True}}, u'options': {u'connectivityCheck': False}}
jsonrpc.Executor/2::DEBUG::2016-06-29 10:35:02,983::__init__::550::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.setupNetworks' in bridge with {'message': 'Done', 'code': 0}
jsonrpc.Executor/3::DEBUG::2016-06-29 10:35:03,008::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setSafeNetworkConfig' in bridge with {}

The only relevant difference here between the el7 and NGN scenarios is that hosted-engine-setup on NGN runs with NetworkManager active, since it's required to show the network status in Cockpit. By default on el7 we ask the user to disable NetworkManager.

Martin, can you please try to reproduce on your NGN scenario, explicitly stopping and disabling NetworkManager?
Unfortunately, we do not have any means to set DNS explicitly (see bug 1160667). We could sneak it in by editing ifcfg-ovirtmgmt using a hook: http://www.ovirt.org/blog/2016/05/modify-ifcfg-files/
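Until such a hook is in place, a crude stop-gap along the same lines (my own sketch, not the hook mechanism from the blog post; the nameserver address is the one from this reproducer, and invoking ifup-post by hand to regenerate resolv.conf is an assumption about initscripts behaviour) could look like:

    # append the missing DNS entry to the ifcfg file VDSM generated
    grep -q '^DNS1=' /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt || \
        echo 'DNS1=192.168.100.1' >> /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
    # let initscripts propagate DNS1/DOMAIN from the ifcfg file into /etc/resolv.conf
    cd /etc/sysconfig/network-scripts && ./ifup-post ifcfg-ovirtmgmt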
Hi Simone,

(In reply to Simone Tiraboschi from comment #13)
> hosted-engine-setup calls Host.setupNetworks and everything seems fine.
>
> jsonrpc.Executor/2::DEBUG::2016-06-29 10:34:59,933::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'ovirtmgmt': {u'nic': u'eth0', u'ipaddr': u'192.168.100.21', u'netmask': u'255.255.255.0', u'bootproto': u'none', u'gateway': u'192.168.100.1', u'defaultRoute': True}}, u'options': {u'connectivityCheck': False}}
> jsonrpc.Executor/2::DEBUG::2016-06-29 10:35:02,983::__init__::550::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.setupNetworks' in bridge with {'message': 'Done', 'code': 0}
> jsonrpc.Executor/3::DEBUG::2016-06-29 10:35:03,008::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setSafeNetworkConfig' in bridge with {}
>
> The only relevant difference here between the el7 and NGN scenarios is that
> hosted-engine-setup on NGN runs with NetworkManager active, since it's
> required to show the network status in Cockpit.
> By default on el7 we ask the user to disable NetworkManager.
>
> Martin, can you please try to reproduce on your NGN scenario, explicitly
> stopping and disabling NetworkManager?

Indeed. As soon as I disable NetworkManager the setup works fine:

Jul 06 09:51:07 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup eth0.
Jul 06 09:51:07 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup eth0.
Jul 06 09:51:07 ovirt1.satellite.local kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Jul 06 09:51:07 ovirt1.satellite.local kernel: device eth0 entered promiscuous mode
Jul 06 09:51:07 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup ovirtmgmt.
Jul 06 09:51:07 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup ovirtmgmt.
Jul 06 09:51:07 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered forwarding state
Jul 06 09:51:07 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered forwarding state
Jul 06 09:51:09 ovirt1.satellite.local daemonAdapter[3687]: libvirt: Network Driver error : Network not found: no network with matching name 'vdsm-ovirtmgmt'
[...]

So this is clearly NetworkManager and RHV-H related. BTW: even my DNS stays intact:

[root@ovirt1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search satellite.local
nameserver 192.168.100.1
[root@ovirt1 ~]#

So both issues are (at least partly) NetworkManager related.

Cheers,
Martin
(In reply to Martin Tessun from comment #15)
>
> So both issues are (at least partly) NetworkManager related.
>
> Cheers,
> Martin

This is very interesting (and not good for NGN). NetworkManager support for vdsm was pushed into NGN so it works nicely with cockpit, but there's some ongoing work which needs to be done there...
Dan, any ideas here?
Martin, I do not understand how your DNS1/DNS2 configuration is working when NM is turned off. Did they somehow stay written in the configuration file?
Could this be the cause? https://bugzilla.redhat.com/show_bug.cgi?id=1335426
Yaniv, could you elaborate? I see no relation between loss of DNS resolution and punching a firewall hole for cockpit.
Hi Dan,

(In reply to Dan Kenigsberg from comment #18)
> Martin, I do not understand how your DNS1/DNS2 configuration is working when
> NM is turned off. Did they somehow stay written in the configuration file?

Exactly. After /etc/resolv.conf had been created by anaconda/NetworkManager, disabling NM resulted in /etc/resolv.conf not being touched again, and as such /etc/resolv.conf stayed as it was. Probably because DNS1/DNS2 are not present in /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt, the "classic" network startup "saw" no reason to modify /etc/resolv.conf.

So to sum up:
- DNS1/2 are still missing in the sysconfig files
- /etc/resolv.conf doesn't get touched again by the "classic" network startup
Tomas, do you know if NetworkManager would avoid clearing up resolv.conf when DNS entries are dropped from ifcfg, once we disable monitor-connection-files? (We can take https://gerrit.ovirt.org/#/c/59260 only when bug 1346947 is fixed)
(In reply to Dan Kenigsberg from comment #22)
> Tomas, do you know if NetworkManager would avoid clearing up resolv.conf
> when DNS entries are dropped from ifcfg, once we disable
> monitor-connection-files?
>
> (We can take https://gerrit.ovirt.org/#/c/59260 only when bug 1346947 is
> fixed)

If I understand well, you have monitor-connection-files=yes and when an ifcfg file is modified (by removing and re-adding it), the content of resolv.conf changes. This is expected from the NM point of view, as the remove/re-add cycle causes the connection to be activated again, changing resolv.conf.

If you disable monitor-connection-files, NM will not react to changes of ifcfg files and thus resolv.conf will not change until a "nmcli connection reload", followed by a reactivation of the connection, is performed.
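For reference, disabling the option is a one-line change in NetworkManager's own configuration (monitor-connection-files is a [main]-section key; the excerpt below is a minimal sketch, and restarting the NetworkManager service is simply the most obvious way to apply it):

    # /etc/NetworkManager/NetworkManager.conf (excerpt)
    [main]
    # stop watching ifcfg files; changes are only picked up after an explicit
    # "nmcli connection reload" (or a per-file "nmcli con load <file>")
    monitor-connection-files=no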
https://gerrit.ovirt.org/#/c/61184/ makes VDSM write /etc/resolv.conf entries to ovirtmgmt's ifcfg file (currently DNS1 and DNS2, DOMAIN may be necessary as well) and as a result, ifup-post updates /etc/resolv.conf with them. Martin, can you give it a test? I will be looking for a suitable VM too.
Hi Ondrej,

sure, I will give it a try. I expect that I need to apply the patches manually, as there is no build with these patches available yet?

Cheers,
Martin
Hi Ondrej,

(In reply to Ondřej Svoboda from comment #24)
> https://gerrit.ovirt.org/#/c/61184/ makes VDSM write /etc/resolv.conf
> entries to ovirtmgmt's ifcfg file (currently DNS1 and DNS2, DOMAIN may be
> necessary as well) and as a result, ifup-post updates /etc/resolv.conf with
> them.
>
> Martin, can you give it a test? I will be looking for a suitable VM too.

The patches did not apply 100% cleanly, so I needed to do some rework (not that difficult). But the results for the DNS are promising:

[root@ovirt1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search satellite.local

# No nameservers found; try putting DNS servers into your
# ifcfg files in /etc/sysconfig/network-scripts like so:
#
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
nameserver 192.168.100.1
[root@ovirt1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
# Generated by VDSM version 4.18.6-1.el7ev
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
DNS1=192.168.100.1
IPADDR=192.168.100.21
NETMASK=255.255.255.0
GATEWAY=192.168.100.1
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
IPV6INIT=no
[root@ovirt1 ~]#

The ovirtmgmt bridge still stays in "DOWN" state if configured with a static IP address, but I think the patch was not meant to solve that as well. Additionally, the "DOMAIN=" entry is still missing.

Cheers,
Martin
Thank you so much, Martin.

Does /var/log/vdsm/supervdsm.log show any hints as to why the bridge couldn't be brought up?

I will add DOMAIN= handling to the patch (or rather, introduce a follow-up patch) and also prepare a backport closer to your version so it hopefully applies cleanly this time.
$ nmcli d

might also be interesting, Martin.

Maybe it's related to bug 1356635.
I can't reproduce this issue with a normal network configuration. Is there any other network configuration involved?

My test steps were as follows:
1. Install RHVH via ISO (with default ks)
2. Configure a simple network (one NIC).
3. Start hosted-engine installation from cockpit (https://<FQDN>:9090/)
4. Select the network you are connected to cockpit over as the management network
5. The hosted engine is configured successfully.
(In reply to shaochen from comment #29)
> I can't reproduce this issue with a normal network configuration. Is there
> any other network configuration involved?
>
> My test steps were as follows:
> 1. Install RHVH via ISO (with default ks)
> 2. Configure a simple network (one NIC).

Did you use static IP configuration, which is affected by this bug? DHCP-based configuration was reported to be fine.

> 3. Start hosted-engine installation from cockpit (https://<FQDN>:9090/)
> 4. Select the network you are connected to cockpit over as the management network
> 5. The hosted engine is configured successfully.
A few points (actually, a big one, which I feared) that shouldn't be missed (from Martin's comment #12):

> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> ifcfg-rh: remove /etc/sysconfig/network-scripts/ifcfg-eth0
> (d0fea1c3-8518-4407-8345-3544d3c
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (eth0):
> device state change: activated -> deactivating (reason 'connection-removed')
> [100 1
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-eth0
> (5fb06bd0-0bb0-7ffb-45f1
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <warn>
> ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-eth0
> (5fb06bd0-0bb0-7ffb
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> (eth0):
> device state change: deactivating -> unmanaged (reason 'unmanaged') [110 10
> 3]

NetworkManager correctly ignores eth0, as ifcfg-eth0 was deleted and recreated by VDSM.

> Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup
> eth0.
> Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup
> eth0.
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): new Bridge device (carrier: OFF, driver: 'bridge', ifindex: 7)

VDSM creates a bridge "manually" (as usual) just after configuring eth0 all on its own.

> Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup
> ovirtmgmt.
> Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup
> ovirtmgmt.

Initscripts start static-IP configuration based on ifcfg-ovirtmgmt. But, at the same time... (assuming precise timestamps on the NetworkManager logger's side)

> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): link connected
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: unmanaged -> unavailable (reason
> 'connection-assumed') [1
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> ifcfg-rh: add connection in-memory
> (cb0a5d45-38d3-4e3b-8e6e-58993fd453d1,"ovirtmgmt")

NetworkManager picks up the bridge as well, unaware of ifcfg-ovirtmgmt yet.
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: unavailable -> disconnected (reason
> 'connection-assumed')
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): Activation: starting connection 'ovirtmgmt'
> (cb0a5d45-38d3-4e3b-8e6e-58993fd45
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: disconnected -> prepare (reason 'none')
> [30 40 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: prepare -> config (reason 'none') [40 50 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: config -> ip-config (reason 'none') [50 70
> 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: ip-config -> ip-check (reason
> 'ip-config-unavailable') [7
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: ip-check -> secondaries (reason 'none')
> [80 90 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: secondaries -> activated (reason 'none')
> [90 100 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> NetworkManager state is now CONNECTED_LOCAL
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): Activation: successful, device activated.
> Jul 06 08:37:01 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching
> action 'up' for ovirtmgmt

NM is now done configuring the bridge. In the meantime, initscripts are also doing their job, feeling alone and safe.

What puzzles me about the way NM configured the bridge is that I think it didn't even try DHCP; it probably only assigned an IPv4 link-local address (inferring from CONNECTED_LOCAL). Perhaps 'ip-config-unavailable' means that it should do just that.

> Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <warn>
> ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
> (9a0b07c0-2983

Only here does NM realize it should not manage the bridge, and something worse happens.

> Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): device state change: activated -> unmanaged (reason
> 'unmanaged') [100 10 3]
> Jul 06 08:37:02 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0)
> entered disabled state
> Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info>
> NetworkManager state is now DISCONNECTED
> Jul 06 08:37:02 ovirt1.satellite.local kernel: IPv6: ADDRCONF(NETDEV_UP):
> ovirtmgmt: link is not ready
> Jul 06 08:37:02 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching
> action 'down' for ovirtmgmt
> Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info>
> (ovirtmgmt): link disconnected
> Jul 06 08:37:03 ovirt1.satellite.local

Is it NetworkManager who downed the bridge?

> /etc/sysconfig/network-scripts/ifup-eth[19409]: Error adding default gateway
> 192.168.100.1 for ovirtmgmt.

In any case, this last error is probably only a result of the bridge being down already.

If I got it right, this is a better-than-a-textbook example of a deadly race that I caused by giving monitor-connection-files=yes a chance.
In this bug, we should probably deal with the DNS problem and solve the race one way or another (by switching monitor-connection-files back to 'no' and either calling 'nmcli load' ourselves before ifup or having initscripts do that for us) in https://bugzilla.redhat.com/show_bug.cgi?id=1344411
> > My test steps were as follows:
> > 1. Install RHVH via ISO (with default ks)
> > 2. Configure a simple network (one NIC).
>
> Did you use static IP configuration, which is affected by this bug?
> DHCP-based configuration was reported to be fine.

Deploying HE is still successful with a static IP configuration.
Hi,

(In reply to shaochen from comment #29)
> I can't reproduce this issue with a normal network configuration. Is there
> any other network configuration involved?
>
> My test steps were as follows:
> 1. Install RHVH via ISO (with default ks)
> 2. Configure a simple network (one NIC).
> 3. Start hosted-engine installation from cockpit (https://<FQDN>:9090/)
> 4. Select the network you are connected to cockpit over as the management network
> 5. The hosted engine is configured successfully.

How did you configure your network? With DHCP? If so, please use a static configuration, as DHCP ensures the interface is up afterwards. A static configuration doesn't.

Please note: it is not about HE having a static IP, but about the hypervisor (RHV-H NGN) having a static IP (and no ovirtmgmt bridge set up already). Ondrej's analysis in comment #31 looks quite to the point.

(In reply to Fabian Deutsch from comment #28)
> $ nmcli d
>
> might also be interesting, Martin.
>
> Maybe it's related to bug 1356635.

No, I don't think so, as that setup is done with DHCP. My issue only shows up if DHCP is not used, but a static configuration is used instead.

As mentioned earlier in my update, I think Ondrej's analysis in comment #31 is right to the point, so we have a rather nasty race here (and probably NM does shut down the interface).

I will do some further tests with the NM configuration (disabling monitor-connection-files).

Cheers,
Martin
Hi,

(In reply to Martin Tessun from comment #33)
> I will do some further tests with the NM configuration (disabling
> monitor-connection-files)

Patching /etc/sysconfig/network-scripts/network-functions as below should have the same effect as monitor-connection-files=yes, minus the raciness.

Quoting from https://bugzilla.redhat.com/show_bug.cgi?id=1345919 and https://git.fedorahosted.org/cgit/initscripts.git/commit/?id=61fb1cb4efd62120ffbc021d7fdee1cd25059c08 (the diff is the same as the one posted by Thomas Haller):

In /etc/sysconfig/network-scripts/network-functions, replace:

    if ! is_false $NM_CONTROLLED && is_nm_running; then
        nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
        UUID=$(get_uuid_by_config $CONFIG)
        [ -n "$UUID" ] && _use_nm=true
    fi

With:

    if is_nm_running; then
        nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
        if ! is_false $NM_CONTROLLED; then
            UUID=$(get_uuid_by_config $CONFIG)
            [ -n "$UUID" ] && _use_nm=true
        fi
    fi

I'll try this on my VMs as well, but anyone is welcome to test.

Thanks,
Ondra
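A quick sanity check after applying the patch (my own suggestion, not part of the instructions above): cycle the bridge with initscripts and confirm that it stays up and that NM keeps treating it as unmanaged.

    # cycle the bridge via initscripts
    ifdown ovirtmgmt && ifup ovirtmgmt
    # the bridge should now report state UP rather than DOWN
    ip -o link show ovirtmgmt
    # and NM should still list ovirtmgmt as unmanaged instead of re-activating it
    nmcli d | grep ovirtmgmt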
Hi Fabian,

first of all, the nmcli d outputs:

1. Prior to installation:

[root@ovirt1 ~]# nmcli d
DEVICE  TYPE      STATE         CONNECTION
eth0    ethernet  connected     eth0
eth1    ethernet  disconnected  --
eth2    ethernet  disconnected  --
bond0   bond      unmanaged     --
lo      loopback  unmanaged     --
[root@ovirt1 ~]#

2. After the installation got stuck:

[root@ovirt1 ~]# nmcli d
DEVICE       TYPE      STATE         CONNECTION
eth1         ethernet  disconnected  --
eth2         ethernet  disconnected  --
bond0        bond      unmanaged     --
;vdsmdummy;  bridge    unmanaged     --
ovirtmgmt    bridge    unmanaged     --
eth0         ethernet  unmanaged     --
lo           loopback  unmanaged     --
[root@ovirt1 ~]#

[root@ovirt1 ~]# ip a s dev ovirtmgmt
7: ovirtmgmt: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN
    link/ether 52:54:00:80:c7:08 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.21/24 brd 192.168.100.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
[root@ovirt1 ~]#

So I am now starting over again with the changes from comment #34 applied to the setup. I will update the BZ once it is finished.

Cheers,
Martin
Hi Ondra,

(In reply to Ondřej Svoboda from comment #34)
> Hi,
>
> (In reply to Martin Tessun from comment #33)
> > I will do some further tests with the NM configuration (disabling
> > monitor-connection-files)
>
> Patching /etc/sysconfig/network-scripts/network-functions as below should
> have the same effect as monitor-connection-files=yes, minus the raciness.
>
> Quoting from https://bugzilla.redhat.com/show_bug.cgi?id=1345919 and
> https://git.fedorahosted.org/cgit/initscripts.git/commit/?id=61fb1cb4efd62120ffbc021d7fdee1cd25059c08
> (the diff is the same as the one posted by Thomas Haller):
>
> In /etc/sysconfig/network-scripts/network-functions, replace:
>
>     if ! is_false $NM_CONTROLLED && is_nm_running; then
>         nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
>         UUID=$(get_uuid_by_config $CONFIG)
>         [ -n "$UUID" ] && _use_nm=true
>     fi
>
> With:
>
>     if is_nm_running; then
>         nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
>         if ! is_false $NM_CONTROLLED; then
>             UUID=$(get_uuid_by_config $CONFIG)
>             [ -n "$UUID" ] && _use_nm=true
>         fi
>     fi
>
> I'll try this on my VMs as well, but anyone is welcome to test.
>
> Thanks,
> Ondra

Yep, this does work. No more "downed" ovirtmgmt bridge.

If you need them, I can provide the logs as well.

Cheers,
Martin
(In reply to Martin Tessun from comment #33)
> Hi,
>
> (In reply to shaochen from comment #29)
> > I can't reproduce this issue with a normal network configuration. Is there
> > any other network configuration involved?
> >
> > My test steps were as follows:
> > 1. Install RHVH via ISO (with default ks)
> > 2. Configure a simple network (one NIC).
> > 3. Start hosted-engine installation from cockpit (https://<FQDN>:9090/)
> > 4. Select the network you are connected to cockpit over as the management network
> > 5. The hosted engine is configured successfully.
>
> How did you configure your network? With DHCP? If so, please use a static
> configuration, as DHCP ensures the interface is up afterwards. A static
> configuration doesn't.
>
> Please note: it is not about HE having a static IP, but about the hypervisor
> (RHV-H NGN) having a static IP (and no ovirtmgmt bridge set up already).

Thanks for bringing up the point, I can reproduce this issue now.

Test steps:
1. Anaconda interactive install of RHVH via ISO
2. Define network settings (with a static configuration)
3. Start hosted-engine installation from cockpit (https://<FQDN>:9090/)
4. Select the network you are connected to cockpit over as the management network

Actual results:
During installation the setup aborts and cockpit shows a "Disconnected" message. The ovirtmgmt bridge stays in status down.
We've decided to disable NM on NGN (see bug 1364126) in order to avoid this bug.
How can this be ON_QA and targeted at 4.0.4?
*** Bug 1361017 has been marked as a duplicate of this bug. ***
After some time we saw that "filter:vdsm-no-mac-spoofing,specParams:{}" had been added to /var/run/ovirt-hosted-engine-ha/vm.conf; it prevented the VM from starting on the host, so we removed it manually and started the VM.
(In reply to Nikolai Sednev from comment #46)
> After some time we saw that "filter:vdsm-no-mac-spoofing,specParams:{}" had
> been added to /var/run/ovirt-hosted-engine-ha/vm.conf; it prevented the VM
> from starting on the host, so we removed it manually and started the VM.

It's already in /usr/share/ovirt-hosted-engine-setup/templates/vm.conf.in; the question is why it prevents the VM from starting.
Can you please also add a sosreport from the engine VM?
(In reply to Simone Tiraboschi from comment #48)
> Can you please also add a sosreport from the engine VM?

I've seen that both ovirt-ha-agent and the broker were down and the engine's VM was failing to start. Once we changed /var/run/ovirt-hosted-engine-ha/vm.conf as described in https://bugzilla.redhat.com/show_bug.cgi?id=1351095#c46, I could start the agent and broker, and the VM also got started.

One more thing: I tried to reboot the host right before changing /var/run/ovirt-hosted-engine-ha/vm.conf, as I thought it might be interesting to see if it changed anything, and discovered that /var/run/ovirt-hosted-engine-ha/vm.conf simply disappeared after the host was rebooted, so we had to create this file manually and fill it with everything that was previously there, except for the "filter:vdsm-no-mac-spoofing,specParams:{}" line.
Created attachment 1192544 [details]
sosreport from the engine
(In reply to Nikolai Sednev from comment #50)
> I've seen that both ovirt-ha-agent and the broker were down and the engine's
> VM was failing to start. Once we changed
> /var/run/ovirt-hosted-engine-ha/vm.conf as described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1351095#c46, I could start the
> agent and broker, and the VM also got started.

Pretty strange.

> One more thing: I tried to reboot the host right before changing
> /var/run/ovirt-hosted-engine-ha/vm.conf, as I thought it might be
> interesting to see if it changed anything, and discovered that
> /var/run/ovirt-hosted-engine-ha/vm.conf simply disappeared after the host
> was rebooted, so we had to create this file manually and fill it with
> everything that was previously there, except for the
> "filter:vdsm-no-mac-spoofing,specParams:{}" line.

/var/run is on tmpfs, so it's perfectly fine that it disappears on reboot. The agent should recreate vm.conf from what is saved on the shared storage; the issue is just why the agent didn't start.
Aug 18 16:26:44 alma03 vdsmd_init_common.sh: libvirt: Network Filter Driver error : Network filter not found: no nwfilter with matching name 'vdsm-no-mac-spoofing'
Aug 18 16:26:44 alma03 vdsmd_init_common.sh: vdsm: Running dummybr

The issue seems to be here, so I suspect that the bridge wasn't working properly at first VM boot time and so hosted-engine-setup didn't complete.
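For what it's worth, a quick way to check whether that filter is actually registered with libvirt, and to re-register VDSM's filters if it is missing (this assumes the 'nwfilter' verb of vdsm-tool is available in this VDSM version):

    # list the network filters libvirt knows about
    virsh nwfilter-list | grep vdsm-no-mac-spoofing
    # if it is missing, ask vdsm-tool to (re)define VDSM's nwfilters, then retry starting the VM
    vdsm-tool nwfilter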
Works for me with osvoboda's fixes for VDSM, on these components.

Engine:
ovirt-log-collector-4.0.0-1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.2-1.el7ev.noarch
ovirt-engine-lib-4.0.2.7-0.1.el7ev.noarch
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.0.2.7-0.1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-imageio-proxy-0.3.0-0.el7ev.noarch
ovirt-engine-cli-3.6.8.1-1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7ev.noarch
ovirt-imageio-proxy-setup-0.3.0-0.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-tools-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-dbscripts-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.2-1.el7ev.noarch
ovirt-engine-dashboard-1.0.2-1.el7ev.x86_64
ovirt-engine-setup-base-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-4.0.2.7-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.1-1.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-userportal-4.0.2.7-0.1.el7ev.noarch
python-ovirt-engine-sdk4-4.0.0-0.5.a5.el7ev.x86_64
ovirt-engine-setup-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-backend-4.0.2.7-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.0-3.el7ev.noarch
rhevm-4.0.2.7-0.1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhev-guest-tools-iso-4.0-5.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.2-1.el7ev.noarch
rhev-release-4.0.2-9-001.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
Linux version 3.10.0-327.28.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jun 27 14:48:28 EDT 2016
Linux 3.10.0-327.28.2.el7.x86_64 #1 SMP Mon Jun 27 14:48:28 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Host:
ovirt-setup-lib-1.0.2-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
libvirt-client-1.2.17-13.el7_2.5.x86_64
rhevm-appliance-20160811.0-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
ovirt-imageio-common-0.3.0-0.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-hosted-engine-ha-2.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.5-1.el7ev.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.4-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
Linux version 3.10.0-327.28.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jun 27 14:48:28 EDT 2016
Linux 3.10.0-327.28.2.el7.x86_64 #1 SMP Mon Jun 27 14:48:28 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.2

I cleanly installed NGN from the latest RHVH-7.2-20160815.0-RHVH-x86_64-dvd1.iso, then used rhevm-appliance-20160811.0-1.el7ev.noarch for deployment of the HE over iSCSI; prior to that I changed the IP address configuration on the NGN from DHCP to static. NetworkManager was not disabled before the deployment. HE deployment succeeded via Cockpit and the management network bridge was created without any issues. NetworkManager was disabled during deployment.

Once I had the engine up and running, I upgraded it to the latest bits from the repos and restarted it, then added a data iSCSI storage domain to get hosted_storage auto-imported into the engine's WebUI. Finally the whole deployment succeeded with hosted_storage being auto-imported and the HE VM visible from the WebUI.

Please consider pushing your fixes and moving this bug to MODIFIED.
https://bugzilla.redhat.com/show_bug.cgi?id=1351095#c54
Moving to VERIFIED per https://bugzilla.redhat.com/show_bug.cgi?id=1351095#c54.
Hello,
I'm doing an evaluation of RHEV and I'm running into this bug with RHVH-4.1-20170817.0-RHVH-x86_64-dvd1.iso. Possible regression?

In my case I have a physical blade with 6 network adapters. I configured the first one with the IP of the host during the anaconda install. Then from cockpit I bonded it with the second network adapter, creating bond0 (802.3ad mode). From inside cockpit I also configured bond1 (two other network adapters) and bond2. All good until now.

Then I start the self-hosted engine install. I select bond0 as the adapter to create the bridge on (I'm offered bond0, bond1, bond2). Is that expected? In my case I pre-created the bond because otherwise the network guys would have had to force-disable the second port on the Cisco switch.

I got disconnected from cockpit and after some further seconds I can connect again, but it seems I'm not offered a way to resume, only to start over.

On the host I have:

[root@rhevora1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens2f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 2e:d8:2f:07:5b:67 brd ff:ff:ff:ff:ff:ff
3: eno49: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 portid 0100000000000000000000304d31353543 state UP qlen 1000
    link/ether 00:fd:45:f6:09:b0 brd ff:ff:ff:ff:ff:ff
4: ens2f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 2e:d8:2f:07:5b:67 brd ff:ff:ff:ff:ff:ff
5: eno50: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 portid 0200000000000000000000304d31353543 state UP qlen 1000
    link/ether 00:fd:45:f6:09:b0 brd ff:ff:ff:ff:ff:ff
6: ens2f2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond2 state UP qlen 1000
    link/ether 48:df:37:0c:7f:5a brd ff:ff:ff:ff:ff:ff
7: ens2f3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond2 state UP qlen 1000
    link/ether 48:df:37:0c:7f:5a brd ff:ff:ff:ff:ff:ff
23: bond2: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 48:df:37:0c:7f:5a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::4adf:37ff:fe0c:7f5a/64 scope link
       valid_lft forever preferred_lft forever
24: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 00:fd:45:f6:09:b0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::2fd:45ff:fef6:9b0/64 scope link
       valid_lft forever preferred_lft forever
25: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP qlen 1000
    link/ether 2e:d8:2f:07:5b:67 brd ff:ff:ff:ff:ff:ff
26: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether a6:09:2b:a3:90:88 brd ff:ff:ff:ff:ff:ff
27: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 2e:d8:2f:07:5b:67 brd ff:ff:ff:ff:ff:ff
    inet 192.168.50.21/24 brd 192.168.50.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
[root@rhevora1 ~]#

[root@rhevora1 ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
ovirtmgmt       8000.2ed82f075b67       no              bond0
[root@rhevora1 ~]#
BTW: I configured a static IP.
(In reply to Gianluca Cecchi from comment #57)
> I got disconnected from cockpit and after some further seconds I can connect
> again, but it seems I'm not offered a way to resume, only to start over.

Did hosted-engine-setup die, or did you simply lose the connection to the cockpit UI?
Could you please attach the hosted-engine-setup logs?
No, there was no hosted-engine-setup process left. And also, connecting to cockpit I had to start over; no VM and no previous setup was detected to resume.

I got an error because the VG and its LVs had already been created in the first run when I selected the same LUN. To be able to re-run the setup I had to remove the VG and PV created in the first phase:

[root@rhevora1 ~]# vgremove 22adbae5-4698-4e9a-bfe0-758695b1552b
Do you really want to remove volume group "22adbae5-4698-4e9a-bfe0-758695b1552b" containing 6 logical volumes? [y/n]: n
  Volume group "22adbae5-4698-4e9a-bfe0-758695b1552b" not removed
[root@rhevora1 ~]# vgremove 22adbae5-4698-4e9a-bfe0-758695b1552b
Do you really want to remove volume group "22adbae5-4698-4e9a-bfe0-758695b1552b" containing 6 logical volumes? [y/n]: y
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/metadata? [y/n]: y
  Logical volume "metadata" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/outbox? [y/n]: y
  Logical volume "outbox" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/leases? [y/n]: y
  Logical volume "leases" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/ids? [y/n]: y
  Logical volume "ids" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/inbox? [y/n]: y
  Logical volume "inbox" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/master? [y/n]: y
  Logical volume "master" successfully removed
  Volume group "22adbae5-4698-4e9a-bfe0-758695b1552b" successfully removed
[root@rhevora1 ~]#
[root@rhevora1 ~]# pvremove /dev/mapper/360002ac000000000000000530001894c
  Labels on physical volume "/dev/mapper/360002ac000000000000000530001894c" successfully wiped.
[root@rhevora1 ~]#

The new run then didn't ask me which card to use for ovirtmgmt and automatically chose bond0 (I can see it in the settings summary screen), probably because it found the ovirtmgmt bridge already in place on bond0, and it completed OK, with the final message:

Hosted Engine Setup successfully completed!

But I'm quite experienced with the OS and oVirt; I think a normal user following the install guide could have serious problems going ahead. I'm going to collect both hosted-engine-setup logs and send them to you if it can help debugging.
Hello, I'm going to attach, as a tgz, the series of ovirt-hosted-engine-setup*.log files generated.
The one related to the cockpit disconnect is: ovirt-hosted-engine-setup-20170912105542-anlmgu.log
The one from the run that completed without errors is: ovirt-hosted-engine-setup-20170912114523-r1pgjs.log
The log between them (ovirt-hosted-engine-setup-20170912112818-ywfeby.log) is from when I tried to see if it was possible to resume in any way, but it gave an error about the LUN already having a VG on it from the previous attempt.
The earlier ones are because I had already downloaded the appliance but didn't find a way to give it to the installer. In fact the step where I can choose a file for the appliance comes "after" it installs the appliance rpm itself (I'm going to attach a screenshot), so even though I had already downloaded the 1.5 GB appliance OVA file, I was forced to download/install the appliance rpm and so re-download a 1.5 GB file. Let me know if I have to open another bug/RFE for this.
Also, the engine VM was not reachable after install, and I see that it depends on the gateway not being set. You can walk through the logs to see if you can find the reason for this too... let me know if you want me to open a bug for this too.
This was the situation of the engine VM (which was reachable from the host because it is on the same ovirtmgmt LAN):

[root@rhevmgr ~]# ip route show
192.168.50.0/24 dev eth0 proto kernel scope link src 192.168.50.20 metric 100
[root@rhevmgr ~]#

Instead, on the host:

[root@rhevora1 ~]# ip route show
default via 192.168.50.1 dev ovirtmgmt
169.254.0.0/16 dev ovirtmgmt scope link metric 1027
192.168.50.0/24 dev ovirtmgmt proto kernel scope link src 192.168.50.21
[root@rhevora1 ~]#

On the engine, NetworkManager is set up this way:

[root@rhevmgr network-scripts]# nmcli con show
NAME         UUID                                  TYPE            DEVICE
System eth0  5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03  802-3-ethernet  eth0
[root@rhevmgr network-scripts]# nmcli con show "System eth0"
connection.id:                          System eth0
connection.uuid:                        5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03
connection.stable-id:                   --
connection.interface-name:              eth0
connection.type:                        802-3-ethernet
connection.autoconnect:                 yes
connection.autoconnect-priority:        0
connection.autoconnect-retries:         -1 (default)
connection.timestamp:                   1505216520
connection.read-only:                   no
connection.permissions:                 --
connection.zone:                        public
connection.master:                      --
connection.slave-type:                  --
connection.autoconnect-slaves:          -1 (default)
connection.secondaries:                 --
connection.gateway-ping-timeout:        0
connection.metered:                     unknown
connection.lldp:                        -1 (default)
802-3-ethernet.port:                    --
802-3-ethernet.speed:                   0
802-3-ethernet.duplex:                  --
802-3-ethernet.auto-negotiate:          no
802-3-ethernet.mac-address:             --
802-3-ethernet.cloned-mac-address:      --
802-3-ethernet.generate-mac-address-mask:--
802-3-ethernet.mac-address-blacklist:   --
802-3-ethernet.mtu:                     auto
802-3-ethernet.s390-subchannels:        --
802-3-ethernet.s390-nettype:            --
802-3-ethernet.s390-options:            --
802-3-ethernet.wake-on-lan:             1 (default)
802-3-ethernet.wake-on-lan-password:    --
ipv4.method:                            manual
ipv4.dns:                               172.16.1.11,172.16.1.2
ipv4.dns-search:                        my.domain
ipv4.dns-options:                       (default)
ipv4.dns-priority:                      0
ipv4.addresses:                         192.168.50.20/24
ipv4.gateway:                           --
ipv4.routes:                            --
ipv4.route-metric:                      -1
ipv4.ignore-auto-routes:                no
ipv4.ignore-auto-dns:                   no
ipv4.dhcp-client-id:                    --
ipv4.dhcp-timeout:                      0
ipv4.dhcp-send-hostname:                yes
ipv4.dhcp-hostname:                     --
ipv4.dhcp-fqdn:                         --
ipv4.never-default:                     no
ipv4.may-fail:                          yes
ipv4.dad-timeout:                       -1 (default)
ipv6.method:                            ignore
ipv6.dns:                               --
ipv6.dns-search:                        --
ipv6.dns-options:                       (default)
ipv6.dns-priority:                      0
ipv6.addresses:                         --
ipv6.gateway:                           --
ipv6.routes:                            --
ipv6.route-metric:                      -1
ipv6.ignore-auto-routes:                no
ipv6.ignore-auto-dns:                   no
ipv6.never-default:                     no
ipv6.may-fail:                          yes
ipv6.ip6-privacy:                       -1 (unknown)
ipv6.addr-gen-mode:                     stable-privacy
ipv6.dhcp-send-hostname:                yes
ipv6.dhcp-hostname:                     --
ipv6.token:                             --
proxy.method:                           none
proxy.browser-only:                     no
proxy.pac-url:                          --
proxy.pac-script:                       --
GENERAL.NAME:                           System eth0
GENERAL.UUID:                           5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03
GENERAL.DEVICES:                        eth0
GENERAL.STATE:                          activated
GENERAL.DEFAULT:                        no
GENERAL.DEFAULT6:                       no
GENERAL.VPN:                            no
GENERAL.ZONE:                           public
GENERAL.DBUS-PATH:                      /org/freedesktop/NetworkManager/ActiveConnection/1
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/Settings/1
GENERAL.SPEC-OBJECT:                    --
GENERAL.MASTER-PATH:                    --
IP4.ADDRESS[1]:                         192.168.50.20/24
IP4.GATEWAY:                            --
IP4.DNS[1]:                             172.16.1.11
IP4.DNS[2]:                             172.16.1.2
IP6.ADDRESS[1]:                         fe80::216:3eff:fe7c:6534/64
IP6.GATEWAY:                            --
[root@rhevmgr network-scripts]#

To solve this further problem (hoping it stays persistent across engine VM reboots), I did:

[root@rhevmgr network-scripts]# nmcli con modify "System eth0" ipv4.gateway 192.168.50.1
[root@rhevmgr network-scripts]# nmcli con reload "System eth0"
[root@rhevmgr network-scripts]# nmcli con up "System eth0"
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/2)
[root@rhevmgr network-scripts]#
[root@rhevmgr ~]# nmcli con show "System eth0" | grep -i GATEWAY
connection.gateway-ping-timeout:        0
ipv4.gateway:                           192.168.50.1
ipv6.gateway:                           --
IP4.GATEWAY:                            192.168.50.1
IP6.GATEWAY:                            --
[root@rhevmgr ~]#

And now I'm able to reach it from outside.
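The nmcli con modify above writes the gateway into the persistent connection profile, so it should survive engine VM reboots; a quick way to double-check (connection name and ifcfg path are the ones from this report and may differ elsewhere) is:

# confirm the gateway is both stored in the profile and active
nmcli -g ipv4.gateway con show "System eth0"
grep -i gateway /etc/sysconfig/network-scripts/ifcfg-eth0
ip route show default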
Created attachment 1324839 [details] tar gzip of hosted-engine-setup logs
Created attachment 1324852 [details]
screenshot where I chose the gateway of the engine vm

Inside the hosted-engine-setup log already provided (ovirt-hosted-engine-setup-20170912114523-r1pgjs.log) there is this line that should configure the gateway of the engine VM:

2017-09-12 11:47:07 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_NETWORK/gateway=str:'192.168.50.1'

but it actually has not been configured.
(In reply to Gianluca Cecchi from comment #61)
> The earlier ones are because I had already downloaded the appliance but
> didn't find a way to give it to the installer. In fact the step where I can
> choose a file for the appliance comes "after" it installs the appliance rpm
> itself (I'm going to attach a screenshot), so even though I had already
> downloaded the 1.5 GB appliance OVA file, I was forced to download/install
> the appliance rpm and so re-download a 1.5 GB file.
> Let me know if I have to open another bug/RFE for this.

https://bugzilla.redhat.com/show_bug.cgi?id=1481095
Already fixed.

> Also, the engine VM was not reachable after install, and I see that it
> depends on the gateway not being set. You can walk through the logs to see
> if you can find the reason for this too... let me know if you want me to
> open a bug for this too.

Yes please: could you attach /var/log/messages and the cloud-init logs from the engine VM?
hosted-engine-setup got a SIGHUP and so it terminated; this is consistent with cockpit being its controlling terminal.

2017-09-12 11:22:09 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/storage/storage.py", line 980, in _misc
    self._createStoragePool()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/storage/storage.py", line 657, in _createStoragePool
    leaseRetries=None,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 165, in _callMethod
    kwargs.pop('_transport_timeout', self._default_timeout)))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 363, in call
    call.wait(kwargs.get('timeout', CALL_TIMEOUT))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 333, in wait
    self._ev.wait(timeout)
  File "/usr/lib64/python2.7/threading.py", line 622, in wait
    self.__cond.wait(timeout, balancing)
  File "/usr/lib64/python2.7/threading.py", line 362, in wait
    _sleep(delay)
  File "/usr/lib/python2.7/site-packages/otopi/main.py", line 53, in _signal
    raise RuntimeError("SIG%s" % signum)
RuntimeError: SIG1

Now we need to understand why cockpit died. Gianluca, could you please share /var/log/messages and the cockpit logs from that host?
By the way, I'm not really sure it was a network-related issue, since at that time hosted-engine-setup was creating the storage pool, which shouldn't affect the network:

2017-09-12 11:21:24 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._createStoragePool:646 createStoragePool(args=[storagepoolID=203a9a04-9e88-40b3-931c-50e1dd63520e, name=hosted_datacenter, masterSdUUID=ab2cb8c5-656e-4d1e-8d69-ed2e9d8b6e77, masterVersion=1, domainList=['ab2cb8c5-656e-4d1e-8d69-ed2e9d8b6e77', '22adbae5-4698-4e9a-bfe0-758695b1552b'], lockRenewalIntervalSec=None, leaseTimeSec=None, ioOpTimeoutSec=None, leaseRetries=None])

The management bridge creation indeed seems fine:

2017-09-12 11:21:12 DEBUG otopi.context context._executeMethod:128 Stage misc METHOD otopi.plugins.gr_he_common.network.bridge.Plugin._misc
2017-09-12 11:21:12 INFO otopi.plugins.gr_he_common.network.bridge bridge._misc:359 Configuring the management bridge
2017-09-12 11:21:13 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:371 networks: {'ovirtmgmt': {'bonding': 'bond0', 'ipaddr': u'192.168.50.21', 'netmask': u'255.255.255.0', 'defaultRoute': True, 'gateway': u'192.168.50.1'}}
2017-09-12 11:21:13 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:372 bonds: {}
2017-09-12 11:21:13 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:373 options: {'connectivityCheck': False}
2017-09-12 11:21:20 DEBUG otopi.context context._executeMethod:128 Stage misc METHOD otopi.plugins.gr_he_setup.storage.blockd.Plugin._misc
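Until the disconnect itself is understood, a possible way to avoid losing a whole deployment to this SIGHUP is to run the setup from an SSH shell inside a detached session instead of the cockpit wizard. A minimal sketch (assuming screen is available on the host, and skipping whatever the cockpit wizard pre-configures):

# run the deploy in a screen session so a dropped terminal/websocket cannot SIGHUP it
screen -S he-deploy
hosted-engine --deploy
# if the connection drops, log in again and reattach with:
screen -r he-deploy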
Created attachment 1324940 [details] /var/log/messages of hypervisor
Created attachment 1324941 [details] /var/log/messages of engine vm
I have attached /var/log/messages of the hypervisor and of the engine VM. Can you tell me the paths of the cloud-init logs on the engine VM and of the cockpit logs on the hypervisor, so that I can attach them too?
Cockpit should be in the journal. We do not log separately for cockpit.
Created attachment 1324946 [details]
journal log

Here is the output of: journalctl -x --since today | gzip > /tmp/journal_today.txt.gz
Sep 12 11:22:09 rhevora1.padana.locale cockpit-ws[3826]: WebSocket from 172.16.4.22 for session closed

This happened at 11:22:09, exactly when hosted-engine-setup got the SIGHUP.
cockpit-ovirt terminates HE setup if it loses the connection. The question is why it disconnected...
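One way to narrow down why the websocket dropped is to pull the cockpit entries from the journal around the disconnect and line them up with the abort in the setup log; for example, using the timestamps from this report:

# cockpit messages around the disconnect
journalctl -u cockpit --since "2017-09-12 11:20:00" --until "2017-09-12 11:23:00"
# the matching abort in hosted-engine-setup
grep -n "SIG1" /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-*.log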
(In reply to Simone Tiraboschi from comment #64)
> > Also, the engine VM was not reachable after install, and I see that it
> > depends on the gateway not being set. You can walk through the logs to see
> > if you can find the reason for this too... let me know if you want me to
> > open a bug for this too.
>
> Yes please: could you attach /var/log/messages and the cloud-init logs
> from the engine VM?

https://bugzilla.redhat.com/show_bug.cgi?id=1492726
Ah ok, fine. It is a general cloud-init problem with RHEL 7.4 that applies to the engine VM too, as its deployment uses cloud-init... I will follow that bugzilla. Thanks
I saw something very similar today - cockpit died during the hosted-engine-setup 'Closing up' stage with SIG1:

2017-09-26 10:43:32 DEBUG otopi.plugins.gr_he_common.engine.health appliance_esetup._appliance_connect:89 Successfully connected to the appliance
2017-09-26 10:43:32 INFO otopi.plugins.gr_he_common.engine.health health._closeup:127 Running engine-setup on the appliance
2017-09-26 10:45:32 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Closing up': SIG1
2017-09-26 10:45:32 DEBUG otopi.context context.dumpEnvironment:760 ENVIRONMENT DUMP - BEGIN
2017-09-26 10:45:32 DEBUG otopi.context context.dumpEnvironment:770 ENV BASE/error=bool:'True'
2017-09-26 10:45:32 DEBUG otopi.context context.dumpEnvironment:770 ENV BASE/exceptionInfo=list:'[(<type 'exceptions.RuntimeError'>, RuntimeError('SIG1',), <traceback object at 0x2b90908>

set 26 10:45:32 orchid-vds2.qa.lab.tlv.redhat.com cockpit-ws[5725]: WebSocket from 10.35.4.183 for session closed

/var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20170926102646-3xy8j5.log:2017-09-26 10:45:32 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Closing up': SIG1
Can you post the full log from /var/log/ovirt-hosted-engine-setup?
(In reply to Ryan Barry from comment #76)
> Can you post the full log from /var/log/ovirt-hosted-engine-setup?

Yes, and if you need access to the setup let me know, as it is still alive, but I'm going to kill it in the next hour.
Created attachment 1330930 [details] HE sig1 cockpit died log
In an additional run I had:

2017-09-26 15:06:03 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/storage/storage.py", line 985, in _misc
    self.environment[ohostedcons.StorageEnv.FAKE_MASTER_SD_UUID]
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/storage/storage.py", line 784, in _activateStorageDomain
    spUUID
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 165, in _callMethod
    kwargs.pop('_transport_timeout', self._default_timeout)))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 363, in call
    call.wait(kwargs.get('timeout', CALL_TIMEOUT))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 333, in wait
    self._ev.wait(timeout)
  File "/usr/lib64/python2.7/threading.py", line 622, in wait
    self.__cond.wait(timeout, balancing)
  File "/usr/lib64/python2.7/threading.py", line 362, in wait
    _sleep(delay)
  File "/usr/lib/python2.7/site-packages/otopi/main.py", line 53, in _signal
    raise RuntimeError("SIG%s" % signum)
RuntimeError: SIG1
2017-09-26 15:06:03 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Misc configuration': SIG1

Are you sure, guys, that this bug should be considered verified? Are all those SIG1 errors the same issue? HE + cockpit is just not working...

rhvh-4.1-0.20170914.0+1
rhvm-appliance-4.1.20170914.0-1.el7.noarch.rpm
Created attachment 1331029 [details] new failure
Yes, it should be verified. This is regularly tested under a large number of scenarios, and widely used for deployment. Can you post more details about your environment? What kind of storage, what's on the host (VM with limited memory? Physical host? What's the network configuration?)
If it's verified then maybe it's better to stop the discussion here and not spam the bug. I'm using a physical host with enough memory, a static network configuration (with DHCP I'm thrown out of the installation wizard), and NFS storage. I will give it another attempt now and see. Ryan, if you want, please contact me and I will provide you access to the setup. Thanks,