Bug 1351095 - [RHV-H Cockpit] hosted-engine-setup fails when creating the ovirtmgmt bridge
Summary: [RHV-H Cockpit] hosted-engine-setup fails when creating the ovirtmgmt bridge
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.0.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.0.4
Target Release: 4.0.4
Assignee: Ondřej Svoboda
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1361017
Depends On: 1346947
Blocks: 1160423 1338732 1361017
 
Reported: 2016-06-29 08:47 UTC by Martin Tessun
Modified: 2019-04-28 14:29 UTC (History)
26 users (show)

Fixed In Version: redhat-virtualization-host-4.0-20160812.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-10 11:53:09 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
All messages from /var/log of the hypervisor (218.73 KB, application/x-gzip)
2016-06-29 08:54 UTC, Martin Tessun
sosreport after the failed ovirtmgmt bridge setup (5.88 MB, application/x-xz)
2016-06-29 09:53 UTC, Martin Tessun
sosreport from the engine (9.02 MB, application/x-xz)
2016-08-21 08:57 UTC, Nikolai Sednev
tar gzip of hosted-engine-setup logs (218.90 KB, application/x-gzip)
2017-09-12 12:06 UTC, Gianluca Cecchi
screenshot where I chose gateway of engine vm (26.39 KB, image/png)
2017-09-12 12:10 UTC, Gianluca Cecchi
/var/log/messages of hypervisor (93.97 KB, application/x-gzip)
2017-09-12 15:11 UTC, Gianluca Cecchi
/var/log/messages of engine vm (31.08 KB, application/x-gzip)
2017-09-12 15:12 UTC, Gianluca Cecchi
journal log (85.18 KB, application/x-gzip)
2017-09-12 15:36 UTC, Gianluca Cecchi
HE sig1 cockpit died log (49.31 KB, application/x-gzip)
2017-09-26 08:51 UTC, Michael Burman
new failure (36.56 KB, application/x-gzip)
2017-09-26 12:19 UTC, Michael Burman


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1160423 0 high CLOSED hosted-engine --deploy doesn't copy DNS config to ovirtmgmt 2021-02-22 00:41:40 UTC
oVirt gerrit 61184 0 'None' MERGED net: write out nameservers to ifcfg files 2021-02-11 12:30:53 UTC
oVirt gerrit 61931 0 'None' MERGED net: Introduce nameservers (dns) network api 2021-02-11 12:30:53 UTC
oVirt gerrit 61932 0 'None' MERGED network: rename reported 'dnss' to 'nameservers' for clarity 2021-02-11 12:30:53 UTC
oVirt gerrit 62148 0 'None' MERGED net: add a 'nameservers' property to NetInfo 2021-02-11 12:30:53 UTC
oVirt gerrit 62359 0 'None' MERGED net: add a 'nameservers' property to NetInfo 2021-02-11 12:30:53 UTC
oVirt gerrit 62360 0 'None' MERGED net: write out nameservers to ifcfg files 2021-02-11 12:30:53 UTC
oVirt gerrit 62361 0 'None' MERGED net tests: add a 'status' parameter to SetupNetworksError 2021-02-11 12:30:54 UTC
oVirt gerrit 62362 0 'None' MERGED net: don't accept nameservers on a non-default network 2021-02-11 12:30:54 UTC

Internal Links: 1160423

Description Martin Tessun 2016-06-29 08:47:44 UTC
Description of problem:
Setting up hosted-engine via Cockpit fails during setup of the ovirtmgmt bridge if Cockpit is accessed via the management network.

Version-Release number of selected component (if applicable):
RHEV-H-7.2-20160627.2-RHVH-x86_64-dvd1.iso

How reproducible:
always

Steps to Reproduce:
1. Install RHEV-H from the ISO
2. Define network settings
3. Start the hosted-engine installation from Cockpit (https://<FQDN>:9090/)
4. Select the network you are connected to Cockpit through as the management network

Actual results:
During installation the setup aborts and Cockpit shows a "Disconnected" message.
The ovirtmgmt bridge stays down. Routing and DNS are also not set up correctly.

Expected results:
hosted-engine setup should be able to proceed and reconfigure the network as required.

Additional info:
Network state after Cockpit shows "Disconnected":

[root@ovirt1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UP qlen 1000
    link/ether 52:54:00:80:c7:08 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:5c:b2:16 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:e2:27:f6 brd ff:ff:ff:ff:ff:ff
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN 
    link/ether 5a:8f:e2:eb:a8:7c brd ff:ff:ff:ff:ff:ff
6: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
    link/ether 72:a3:24:54:0e:73 brd ff:ff:ff:ff:ff:ff
7: ovirtmgmt: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN 
    link/ether 52:54:00:80:c7:08 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.21/24 brd 192.168.100.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
[root@ovirt1 ~]# 

Unfortunately the hosted-engine-setup log does not show much that is relevant (besides the iSCSI failure, caused by the network being gone):

2016-06-29 10:34:59 DEBUG otopi.context context.dumpEnvironment:760 ENVIRONMENT DUMP - BEGIN
2016-06-29 10:34:59 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_VDSM/vdscli=_Server:'<vdsm.jsonrpcvdscli._Server object at 0x4c72b50>'
2016-06-29 10:34:59 DEBUG otopi.context context.dumpEnvironment:774 ENVIRONMENT DUMP - END
2016-06-29 10:34:59 DEBUG otopi.context context._executeMethod:128 Stage misc METHOD otopi.plugins.gr_he_common.network.bridge.Plugin._misc
2016-06-29 10:34:59 INFO otopi.plugins.gr_he_common.network.bridge bridge._misc:372 Configuring the management bridge
2016-06-29 10:34:59 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:384 networks: {'ovirtmgmt': {'nic': 'eth0', 'ipaddr': u'192.168.100.21', 'netmask': u'255.255.255.0', 'bootproto': u'none', 'gateway': u'192.168.100.1', 'defaultRoute': True}}
2016-06-29 10:34:59 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:385 bonds: {}
2016-06-29 10:34:59 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:386 options: {'connectivityCheck': False}
2016-06-29 10:35:03 DEBUG otopi.context context._executeMethod:128 Stage misc METHOD otopi.plugins.gr_he_setup.storage.blockd.Plugin._misc
2016-06-29 10:35:03 INFO otopi.plugins.gr_he_setup.storage.blockd blockd._misc:656 Creating Volume Group
2016-06-29 10:35:13 DEBUG otopi.plugins.gr_he_setup.storage.blockd blockd._misc:658 {'status': {'message': u'Failed to initialize physical device: ("[u\'/dev/mapper/36001405ddaf6032e1bf47538069c78c2\']",)', 'code': 601}}
2016-06-29 10:35:13 ERROR otopi.plugins.gr_he_setup.storage.blockd blockd._misc:664 Error creating Volume Group: Failed to initialize physical device: ("[u'/dev/mapper/36001405ddaf6032e1bf47538069c78c2']",)

Comment 1 Martin Tessun 2016-06-29 08:54:54 UTC
Created attachment 1173726 [details]
All messages from /var/log of the hypervisor

Just in case it is of interest, I added a tar file containing all files from /var/log after the abort.

The route and DNS are also no longer configured:
[root@ovirt1 ~]# ip route list
192.168.100.0/24 dev ovirtmgmt  proto kernel  scope link  src 192.168.100.21 
[root@ovirt1 ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
search satellite.local


# No nameservers found; try putting DNS servers into your
# ifcfg files in /etc/sysconfig/network-scripts like so:
#
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
[root@ovirt1 ~]#
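
Until the root cause is fixed, connectivity can be restored by hand for further debugging (a minimal sketch, using the gateway/nameserver of this environment):

ip link set ovirtmgmt up
ip route add default via 192.168.100.1 dev ovirtmgmt
grep -q '^nameserver' /etc/resolv.conf || echo "nameserver 192.168.100.1" >> /etc/resolv.conf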

Comment 2 Martin Tessun 2016-06-29 09:53:51 UTC
Created attachment 1173732 [details]
sosreport after the failed ovirtmgmt bridge setup

Some additional remarks.

It seems this is not Cockpit-related. Running hosted-engine --deploy on the console leads to the same result (even if run in a screen session).

The screen session has the advantage that one can see the console output after the management bridge has been configured. So hosted-engine --deploy just proceeds to the storage part (which fails).

I gathered a sosreport that is also attached.

Output from screen:
[ INFO  ] Stage: Setup validation
         
          --== CONFIGURATION PREVIEW ==--
         
          Bridge interface                   : eth0
          Engine FQDN                        : ovirt.satellite.local
          Bridge name                        : ovirtmgmt
          Host address                       : ovirt1.satellite.local
          SSH daemon port                    : 22
          Firewall manager                   : iptables
          Gateway address                    : 192.168.100.1
          Host name for web application      : ovirt1
          Storage Domain type                : iscsi
          Host ID                            : 1
          LUN ID                             : 36001405ddaf6032e1bf47538069c78c2
          Image size GB                      : 50
          iSCSI Portal IP Address            : 192.168.100.1
          iSCSI Target Name                  : iqn.2003-01.org.linux-iscsi.kirk.x8664:sn.21dc789db84d
          iSCSI Portal port                  : 3260
          iSCSI Portal user                  : 
          Console type                       : vnc
          Memory size MB                     : 6144
          MAC address                        : 00:16:3e:3e:b0:69
          Boot type                          : disk
          Number of CPUs                     : 4
          OVF archive (for disk boot)        : /root/rhevm-appliance-20160623.0-1.x86_64.rhevm.ova
          Restart engine VM after engine-setup: True
          CPU Type                           : model_SandyBridge
         
          Please confirm installation settings (Yes, No)[Yes]: 
[ INFO  ] Stage: Transaction setup
[ INFO  ] Stage: Misc configuration
[ INFO  ] Stage: Package installation
[ INFO  ] Stage: Misc configuration
[ INFO  ] Configuring libvirt
[ INFO  ] Configuring VDSM
[ INFO  ] Starting vdsmd
[ INFO  ] Configuring the management bridge
[ INFO  ] Creating Volume Group
[ ERROR ] Error creating Volume Group: Failed to initialize physical device: ("[u'/dev/mapper/36001405ddaf6032e1bf47538069c78c2']",)
          The selected device is already used.
          To create a vg on this device, you must use Force.
          WARNING: This will destroy existing data on the device.
          (Force, Abort)[Abort]?

Comment 3 daniel 2016-06-29 12:21:32 UTC
While testing I found that I only run into the same issue with static IPs.
If I use DHCP for the hypervisor host, deployment of the hosted engine is fine.

Comment 4 Ryan Barry 2016-06-29 14:20:40 UTC
(In reply to Martin Tessun from comment #2)
> [ ERROR ] Error creating Volume Group: Failed to initialize physical device:
> ("[u'/dev/mapper/36001405ddaf6032e1bf47538069c78c2']",)
>           The selected device is already used.
>           To create a vg on this device, you must use Force.
>           WARNING: This will destroy existing data on the device.
>           (Force, Abort)[Abort]?

What do I need to reproduce this?

Was there already a volume group on that LUN, or does this happen every time iscsi is used?

It seems that the primary problem here is that hosted-engine-setup drops the network when there's a storage problem on iSCSI, rather than a Cockpit issue (Cockpit can't be expected to show anything if the network drops), but this is hard to verify without a reproducer. Similar "stopping" errors are shown correctly in Cockpit, but...

(In reply to dmoessne from comment #3)
> While testing I found that I only run into the same issue whith static IPs.
> In case I use DHCP for the hypervisor host deployment of hosted engine is
> fine

The management bridge dropping, or iscsi? I'm reasonably sure that we've verified static IPs (Douglas, can you confirm?), but I'm not sure which issue you're referring to.

Comment 5 Piotr Kliczewski 2016-06-30 14:29:49 UTC
In the logs I see:

jsonrpc.Executor/4::ERROR::2016-06-29 10:35:13,452::task::868::Storage.TaskManager.Task::(_setError) Task=`69e9e009-7157-44e7-a73a-e2c28743b01e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2110, in createVG
    (force.capitalize() == "True")))
  File "/usr/share/vdsm/storage/lvm.py", line 936, in createVG
    _initpvs(pvs, metadataSize, force)
  File "/usr/share/vdsm/storage/lvm.py", line 739, in _initpvs
    raise se.PhysDevInitializationError(str(devices))

Comment 6 Martin Tessun 2016-07-05 09:42:15 UTC
Hi Piotr, Ryan

this is the result of the ovirtmgmt bridge being down.
Besides this, the installation goes fine if you run the following loop just before starting hosted-engine --deploy:

[root@ovirt1 ~]# while true;do ip link set up ovirtmgmt;done

The VG creation fails because I am using iSCSI: as soon as the IP is gone (while the ovirtmgmt interface is being created), the host can no longer access the storage devices.

So it is best to reproduce the same way I did:
1. Have a static IP set for Hypervisor, e.g.:

[root@ovirt2 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
TYPE=Ethernet
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
NAME=eth0
UUID=8bcd806a-9028-461b-ac8e-54595c302144
DEVICE=eth0
ONBOOT=yes
IPADDR=192.168.100.22
PREFIX=24
GATEWAY=192.168.100.1
DNS1=192.168.100.1
DOMAIN=satellite.local
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_PRIVACY=no
[root@ovirt2 ~]# 

2. Start hosted-engine --deploy (either via cockpit or locally)
3. Select the connected interface as management interface
4. Wait until hosted-engine/vdsm create the ovirtmgmt bridge.

If the above loop isn't run, the ovirtmgmt bridge will stay in the down state.

As said, the iSCSI storage error is just a side effect, as my iSCSI is also connected via eth0 in this case.

As outlined by Daniel in comment #3, it will not fail if DHCP is used instead of a static IP.

Just let me know if you need more information for a reproducer.

Cheers,
Martin

Comment 7 Martin Tessun 2016-07-05 09:58:19 UTC
Just an additional finding:

Having that "while" loop does help hosted-engine get further, but DNS is still missing:

 Failed to execute stage 'Closing up': [ERROR]::oVirt API connection failure, (6, 'Could not resolve host: ovirt.satellite.local; Unknown error')
Hosted Engine deployment failed: this system is not reliable, please check the issue, fix and redeploy

[root@ovirt1 ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
search satellite.local


# No nameservers found; try putting DNS servers into your
# ifcfg files in /etc/sysconfig/network-scripts like so:
#
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
[root@ovirt1 ~]# 


[root@ovirt1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt 
# Generated by VDSM version 4.18.3-0.el7ev
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
IPADDR=192.168.100.21
NETMASK=255.255.255.0
GATEWAY=192.168.100.1
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
IPV6INIT=no
[root@ovirt1 ~]# 

So this is still missing the DNS1=192.168.100.1 entry.

Now trying to redeploy with a different script running during the deployment:
[root@ovirt1 ~]# while true;do ip link set up ovirtmgmt;(grep -q '^nameserver' /etc/resolv.conf || echo "nameserver 192.168.100.1" >> /etc/resolv.conf);done


Please also find the configuration files (ifcfg-eth0 and resolv.conf) as they look directly after the hypervisor is installed:
[root@ovirt1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
TYPE=Ethernet
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
NAME=eth0
UUID=fbe724dd-2b4a-43d0-bec7-db9cffa39609
DEVICE=eth0
ONBOOT=yes
IPADDR=192.168.100.21
PREFIX=24
GATEWAY=192.168.100.1
DNS1=192.168.100.1
DOMAIN=satellite.local
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_PRIVACY=no
[root@ovirt1 ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
search satellite.local
nameserver 192.168.100.1
[root@ovirt1 ~]# 


So what I would expect:
1. ovirtmgmt should get the same configuration as eth0 had
2. resolv.conf should also be maintained as is
3. Interface ovirtmgmt should be set to up after it has been configured.
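
For reference, a quick way to check all three points after the bridge is configured (standard commands, values from this environment):

ip addr show ovirtmgmt    # expect 192.168.100.21/24 and state UP
ip route show             # expect: default via 192.168.100.1 dev ovirtmgmt
cat /etc/resolv.conf      # expect: nameserver 192.168.100.1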

Comment 8 Ryan Barry 2016-07-05 15:08:07 UTC
(In reply to Martin Tessun from comment #7)
> So what I would expect:
> 1. ovirtmgmt should get the same configuration as eth0 had
> 2. resolv.conf should also be maintained as is
> 3. Interface ovirtmgmt should be set to up after it has been configured.

This is also what I'd expect.

Dan, Simone, any thoughts?

Comment 9 Simone Tiraboschi 2016-07-05 15:23:06 UTC
(In reply to Ryan Barry from comment #8)
> (In reply to Martin Tessun from comment #7)
> > So what I would expect:
> > 1. ovirtmgmt should get the same configuration as eth0 had
> > 2. resolv.conf should also be maintained as is
> > 3. Interface ovirtmgmt should be set to up after it has been configured.
> 
> This is also what I'd expect.
> 
> Dan, Simone, any thoughts?

Ryan, see this one: https://bugzilla.redhat.com/show_bug.cgi?id=1160423

Comment 10 Martin Tessun 2016-07-05 16:12:48 UTC
(In reply to Simone Tiraboschi from comment #9)
> 
> Ryan, see this one: https://bugzilla.redhat.com/show_bug.cgi?id=1160423

Well, that one is for the DNS (and hopefully will solve (1) and (2)). That still leaves us with the following:

> > 3. Interface ovirtmgmt should be set to up after it has been configured.


Additionally, as I understand it, BZ #1160423 is already scheduled for 4.1 although it was discovered back in 2014?
Personally I think these issues really should be fixed in RHV 4.0 at least. Is there anything that can be done to get this resolved in RHV 4.0?

Cheers,
Martin

Comment 11 Ryan Barry 2016-07-05 16:28:33 UTC
Hi Martin -

I think the impact is low, since this only appears to affect interfaces which use DNS1/DNS2 in ifcfg files, and systems which set DNS in resolv.conf (or via DHCP) operate normally.

Personally, I also find this to be surprising, since I can't remember the last time I set a DNS server in resolv.conf instead of DNS1, but the lack of customer tickets attached to a very old bug indicates that this may also be an uncommon use case among customers. It may be hard to get it into 4.0 this late.

For issue #3, this definitely looks like ovirt-host-deploy. I'm just getting back from holiday, but I'll see if I have some time to set up a reproducer this week. Is there anything in journald about it not coming up?

Comment 12 Martin Tessun 2016-07-06 06:42:27 UTC
(In reply to Ryan Barry from comment #11)
> Hi Martin -
> 
> I think the impact is low, this this only appears to affect interfaces which
> use DNS1/DNS2 in ifcfg files, and systems which set DNS in resolv.conf (or
> via DHCP) operate normally.
> 
> Personally, I also find this to be surprising, since I can't remember the
> last time I set a DNS server in resolv.conf instead of DNS1, but the lack of
> customer tickets attached to a very old bug indicates that this may also be
> an uncommon use case among customers. It may be hard to get it into 4.0 this
> late.

I would tend to agree if all our customers were using RHEL-H, but installing RHEV-H with Anaconda does exactly this (it uses DNS1/DNS2 for the configuration). As such, I believe this is quite a relevant bug.
I am not sure how many customers will use this way of installation (RHEV-H + Hosted Engine), but it will fail unless the configuration is changed manually afterwards or my small infinite loop is run during the installation.
So I think this bug is now more relevant than it might have been before.

> 
> For issue #3, this definitely looks like ovirt-host-deploy. I'm just getting
> back from holiday, but I'll see if I have some time to set up a reproducer
> this week. Is there anything in journald about it not coming up?

Indeed, see below:

Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  ifcfg-rh: remove /etc/sysconfig/network-scripts/ifcfg-eth0 (d0fea1c3-8518-4407-8345-3544d3c
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (eth0): device state change: activated -> deactivating (reason 'connection-removed') [100 1
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  NetworkManager state is now DISCONNECTING
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-eth0 (5fb06bd0-0bb0-7ffb-45f1
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <warn>  ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-eth0 (5fb06bd0-0bb0-7ffb
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (eth0): device state change: deactivating -> unmanaged (reason 'unmanaged') [110 10 3]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  NetworkManager state is now DISCONNECTED
Jul 06 08:37:01 ovirt1.satellite.local dbus[894]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-di
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (eth0): link disconnected
Jul 06 08:37:01 ovirt1.satellite.local dbus-daemon[894]: dbus[894]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 06 08:37:01 ovirt1.satellite.local dbus[894]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 06 08:37:01 ovirt1.satellite.local dbus-daemon[894]: dbus[894]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 06 08:37:01 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching action 'down' for eth0
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started Network Manager Script Dispatcher Service.
[...]
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup eth0.
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup eth0.
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): new Bridge device (carrier: OFF, driver: 'bridge', ifindex: 7)
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (eth0): link connected
Jul 06 08:37:01 ovirt1.satellite.local kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Jul 06 08:37:01 ovirt1.satellite.local kernel: device eth0 entered promiscuous mode
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): bridge port eth0 was attached
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (eth0): enslaved to ovirtmgmt
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup ovirtmgmt.
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup ovirtmgmt.
Jul 06 08:37:01 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered forwarding state
Jul 06 08:37:01 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered forwarding state
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): link connected
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: unmanaged -> unavailable (reason 'connection-assumed') [1
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  ifcfg-rh: add connection in-memory (cb0a5d45-38d3-4e3b-8e6e-58993fd453d1,"ovirtmgmt")
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: unavailable -> disconnected (reason 'connection-assumed')
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): Activation: starting connection 'ovirtmgmt' (cb0a5d45-38d3-4e3b-8e6e-58993fd45
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: disconnected -> prepare (reason 'none') [30 40 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: prepare -> config (reason 'none') [40 50 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: config -> ip-config (reason 'none') [50 70 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: ip-config -> ip-check (reason 'ip-config-unavailable') [7
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: ip-check -> secondaries (reason 'none') [80 90 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: secondaries -> activated (reason 'none') [90 100 0]
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  NetworkManager state is now CONNECTED_LOCAL
Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): Activation: successful, device activated.
Jul 06 08:37:01 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching action 'up' for ovirtmgmt
Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Unit iscsi.service cannot be reloaded because it is inactive.
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info>  ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (9a0b07c0-2983-fe97
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <warn>  ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (9a0b07c0-2983
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): device state change: activated -> unmanaged (reason 'unmanaged') [100 10 3]
Jul 06 08:37:02 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered disabled state
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info>  NetworkManager state is now DISCONNECTED
Jul 06 08:37:02 ovirt1.satellite.local kernel: IPv6: ADDRCONF(NETDEV_UP): ovirtmgmt: link is not ready
Jul 06 08:37:02 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching action 'down' for ovirtmgmt
Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info>  (ovirtmgmt): link disconnected
Jul 06 08:37:03 ovirt1.satellite.local /etc/sysconfig/network-scripts/ifup-eth[19409]: Error adding default gateway 192.168.100.1 for ovirtmgmt.
Jul 06 08:37:04 ovirt1.satellite.local daemonAdapter[19001]: libvirt: Network Driver error : Network not found: no network with matching name 'vdsm-ovirtmgmt'

Comment 13 Simone Tiraboschi 2016-07-06 07:26:31 UTC
hosted-engine-setup calls Host.setupNetworks and everything seems fine.

jsonrpc.Executor/2::DEBUG::2016-06-29 10:34:59,933::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks': {u'ovirtmgmt': {u'nic': u'eth0', u'ipaddr': u'192.168.100.21', u'netmask': u'255.255.255.0', u'bootproto': u'none', u'gateway': u'192.168.100.1', u'defaultRoute': True}}, u'options': {u'connectivityCheck': False}}
jsonrpc.Executor/2::DEBUG::2016-06-29 10:35:02,983::__init__::550::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.setupNetworks' in bridge with {'message': 'Done', 'code': 0}
jsonrpc.Executor/3::DEBUG::2016-06-29 10:35:03,008::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setSafeNetworkConfig' in bridge with {}

The only relevant difference here between the el7 and NGN scenarios is that hosted-engine-setup on NGN runs with NetworkManager active, since it's required to show the network status in Cockpit.
By default on el7 we ask the user to disable NetworkManager.

Martin, can you please try to reproduce on your NGN scenario after explicitly stopping and disabling NetworkManager?
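
(For reference, on el7/NGN that typically boils down to something like the following before re-running the deployment; shown here only as a sketch:)

systemctl stop NetworkManager
systemctl disable NetworkManager
systemctl restart network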

Comment 14 Dan Kenigsberg 2016-07-06 07:33:13 UTC
Unfortunately, we do not have any means to set DNS explicitly (see bug 1160667). We could sneak it in by editing ifcfg-ovirtmgmt using a hook http://www.ovirt.org/blog/2016/05/modify-ifcfg-files/

Comment 15 Martin Tessun 2016-07-06 07:55:19 UTC
Hi Simone,

(In reply to Simone Tiraboschi from comment #13)
> hosted-engine-setup calls Host.setupNetworks and everything seams fine.
> 
> jsonrpc.Executor/2::DEBUG::2016-06-29
> 10:34:59,933::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling
> 'Host.setupNetworks' in bridge with {u'bondings': {}, u'networks':
> {u'ovirtmgmt': {u'nic': u'eth0', u'ipaddr': u'192.168.100.21', u'netmask':
> u'255.255.255.0', u'bootproto': u'none', u'gateway': u'192.168.100.1',
> u'defaultRoute': True}}, u'options': {u'connectivityCheck': False}}
> jsonrpc.Executor/2::DEBUG::2016-06-29
> 10:35:02,983::__init__::550::jsonrpc.JsonRpcServer::(_serveRequest) Return
> 'Host.setupNetworks' in bridge with {'message': 'Done', 'code': 0}
> jsonrpc.Executor/3::DEBUG::2016-06-29
> 10:35:03,008::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling
> 'Host.setSafeNetworkConfig' in bridge with {}
> 
> The only relevant different here between el7 and NGN scenario is that
> hosted-engine-setup on NGN runs with NetworkManager active since since it's
> required to show the network status in Cockpit.
> By default on el7 we ask the user to disable NetworkManager.
> 
> Martin, can you please try to reproduce on your NGN scenario explicitly
> stopping and disabling NetworkManager?

Indeed. As soon as I disable NetworkManager the setup works fine:

Jul 06 09:51:07 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup eth0.
Jul 06 09:51:07 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup eth0.
Jul 06 09:51:07 ovirt1.satellite.local kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Jul 06 09:51:07 ovirt1.satellite.local kernel: device eth0 entered promiscuous mode
Jul 06 09:51:07 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup ovirtmgmt.
Jul 06 09:51:07 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup ovirtmgmt.
Jul 06 09:51:07 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered forwarding state
Jul 06 09:51:07 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0) entered forwarding state
Jul 06 09:51:09 ovirt1.satellite.local daemonAdapter[3687]: libvirt: Network Driver error : Network not found: no network with matching name 'vdsm-ovirtmgmt'
[...]

So this is clearly NetworkManager and RHV-H related then.
BTW: Even my DNS stays intact:
[root@ovirt1 ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
search satellite.local
nameserver 192.168.100.1
[root@ovirt1 ~]# 

So both issues are (at least partly) NetworkManager related.

Cheers,
Martin

Comment 16 Ryan Barry 2016-07-06 12:10:56 UTC
(In reply to Martin Tessun from comment #15)
> 
> So both issues are (at least partly) NetworkManager related.
> 
> Cheers,
> Martin

This is very interesting (and not good for NGN). NetworkManager support for vdsm was pushed into NGN so it works nicely with cockpit, but there's some ongoing work which needs to be done there...

Comment 17 Ryan Barry 2016-07-06 12:59:16 UTC
Dan, any ideas here?

Comment 18 Dan Kenigsberg 2016-07-17 11:51:30 UTC
Martin, I do not understand how your DNS1/DNS2 configuration is working when NM is turned off. Did they somehow stay written in the configuration file?

Comment 19 Yaniv Lavi 2016-07-17 11:59:32 UTC
Could this be the cause?
https://bugzilla.redhat.com/show_bug.cgi?id=1335426

Comment 20 Dan Kenigsberg 2016-07-18 05:47:17 UTC
Yaniv, could you elaborate? I see no relation between loss of DNS resolution to punching a firewall hole for cockpit.

Comment 21 Martin Tessun 2016-07-18 07:07:45 UTC
Hi Dan,

(In reply to Dan Kenigsberg from comment #18)
> Martin, I do not understand how you DNS1/DNS2 configuration is working when
> NM is turned off. Did they somehow stayed written in the configuration file?

Exactly. After /etc/resolv.conf had been created by anaconda/NetworkManager, disabling NM resulted in /etc/resolv.conf not being touched again, and as such /etc/resolv.conf stayed as it was.

Probably because DNS1/DNS2 are not present in /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt, the "classic" network startup "saw" no reason to modify /etc/resolv.conf.

So to sum up:
- DNS1/2 are still missing in the sysconfig files
- /etc/resolv.conf doesn't get touched again by the "classic" network startup
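
Until vdsm writes them itself, the entries can be added by hand so that ifup-post regenerates resolv.conf on the next ifup (a sketch using the values from this setup; note that cycling the bridge briefly drops connectivity):

# appended manually to /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
DNS1=192.168.100.1
DOMAIN=satellite.local

ifdown ovirtmgmt && ifup ovirtmgmt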

Comment 22 Dan Kenigsberg 2016-07-21 06:29:20 UTC
Tomas, do you know if NetworkManager would avoid clearing up resolv.conf when DNS entries are dropped from ifcfg, once we disable monitor-connection-files?

(We can take https://gerrit.ovirt.org/#/c/59260 only when bug 1346947 is fixed)

Comment 23 Beniamino Galvani 2016-07-21 12:49:05 UTC
(In reply to Dan Kenigsberg from comment #22)
> Tomas, do you know if NetworkManager would avoid clearing up resolv.conf
> when DNS entries are dropped from ifcfg, once we disable
> monitor-connection-files?
>
> (We can take https://gerrit.ovirt.org/#/c/59260 only when bug 1346947 is
> fixed)

If I understand correctly, you have monitor-connection-files=yes and when an
ifcfg file is modified (by removing and re-adding it), the content of
resolv.conf changes. This is expected from NM's point of view, as the
remove/re-add cycle causes the connection to be activated again,
changing resolv.conf.

If you disable monitor-connection-files, NM will not react to changes
to ifcfg files and thus resolv.conf will not change until an "nmcli
connection reload", followed by a reactivation of the connection, is
performed.
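
For reference, the knob being discussed lives in /etc/NetworkManager/NetworkManager.conf; a minimal sketch of the setting, plus the explicit reload that then becomes necessary after editing an ifcfg file:

# /etc/NetworkManager/NetworkManager.conf
[main]
monitor-connection-files=no

# after changing an ifcfg file:
nmcli connection reload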

Comment 24 Ondřej Svoboda 2016-07-21 13:47:20 UTC
https://gerrit.ovirt.org/#/c/61184/ makes VDSM write /etc/resolv.conf entries to ovirtmgmt's ifcfg file (currently DNS1 and DNS2, DOMAIN may be necessary as well) and as a result, ifup-post updates /etc/resolv.conf with them.

Martin, can you give it a test? I will be looking for a suitable VM too.
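
(For anyone wanting to try the change on a local vdsm checkout, fetching a Gerrit change generally looks like the following; the patchset suffix /1 is only illustrative:)

git clone https://gerrit.ovirt.org/vdsm && cd vdsm
git fetch https://gerrit.ovirt.org/vdsm refs/changes/84/61184/1
git cherry-pick FETCH_HEAD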

Comment 25 Martin Tessun 2016-07-22 06:45:39 UTC
Hi Ondrej,

sure, I will give it a try. I expect that I need to apply the patches manually, as there is no build with these patches available yet?

Cheers,
Martin

Comment 26 Martin Tessun 2016-07-22 07:16:27 UTC
Hi Ondrej

(In reply to Ondřej Svoboda from comment #24)
> https://gerrit.ovirt.org/#/c/61184/ makes VDSM write /etc/resolv.conf
> entries to ovirtmgmt's ifcfg file (currently DNS1 and DNS2, DOMAIN may be
> necessary as well) and as a result, ifup-post updates /etc/resolv.conf with
> them.
> 
> Martin, can you give it a test? I will be looking for a suitable VM too.

The patches did not apply 100% cleanly, so I needed to do some rework (not that difficult). But the results for the DNS are promising:

[root@ovirt1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search satellite.local


# No nameservers found; try putting DNS servers into your
# ifcfg files in /etc/sysconfig/network-scripts like so:
#
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
nameserver 192.168.100.1

[root@ovirt1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt 
# Generated by VDSM version 4.18.6-1.el7ev
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
DNS1=192.168.100.1
IPADDR=192.168.100.21
NETMASK=255.255.255.0
GATEWAY=192.168.100.1
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
IPV6INIT=no
[root@ovirt1 ~]# 

Still, the ovirtmgmt bridge stays in the "DOWN" state if configured with a static IP address, but I think the patch was not meant to solve this as well.

Additionally, the "DOMAIN=" entry is still missing.

Cheers,
Martin

Comment 27 Ondřej Svoboda 2016-07-22 07:38:02 UTC
Thank you so much, Martin.

Does /var/log/vdsm/supervdsm.log show any hints as to why the bridge couldn't be brought up?

I will add DOMAIN= handling to the patch (or rather, introduce a follow-up patch) and also prepare a backport closer to your version so it hopefully applies cleanly this time.

Comment 28 Fabian Deutsch 2016-07-22 09:08:07 UTC
$ nmcli d

might also be interesting, Martin.

maybe it's related to bug 1356635

Comment 29 cshao 2016-07-22 11:24:31 UTC
I can't reproduce this issue with a normal network configuration; is there any other network configuration involved?

My test steps are as follows:
1. Install RHVH via ISO (with the default ks)
2. Configure a simple network (one NIC).
3. Start the hosted-engine installation from Cockpit (https://<FQDN>:9090/)
4. Select the network you are connected to Cockpit through as the management network
5. Hosted-engine configuration completes successfully.

Comment 30 Ondřej Svoboda 2016-07-22 12:12:39 UTC
(In reply to shaochen from comment #29)
> I can't reproduce this issue with normal network configure, is there any
> other network configure?
> 
> My test steps as following:
> 1. Install RHVH via ISO(with defaulf ks)
> 2. Configure a simple network(one nic).

Did you use static IP configuration, which is affected by this bug? DHCP-based configuration was reported to be fine.

> 3. Start hosted-engine installation from cockpit (https://<FWDN>:9090/)
> 4. Select the network you are connected to cockpit as the management network
> 5. Configure hosted engine can successful.

Comment 31 Ondřej Svoboda 2016-07-22 13:29:38 UTC
A few points (actually, a big one, which I feared) that shouldn't be missed (from Martin's comment #12):

> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> ifcfg-rh: remove /etc/sysconfig/network-scripts/ifcfg-eth0
> (d0fea1c3-8518-4407-8345-3544d3c
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (eth0):
> device state change: activated -> deactivating (reason 'connection-removed')
> [100 1

> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-eth0
> (5fb06bd0-0bb0-7ffb-45f1
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <warn> 
> ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-eth0
> (5fb06bd0-0bb0-7ffb
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info>  (eth0):
> device state change: deactivating -> unmanaged (reason 'unmanaged') [110 10
> 3]

NetworkManager correctly ignores eth0, as ifcfg-eth0 was deleted and recreated by VDSM.

> Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup
> eth0.
> Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup
> eth0.
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): new Bridge device (carrier: OFF, driver: 'bridge', ifindex: 7)

VDSM creates a bridge "manually" (as usual) just after configuring eth0 all on its own.

> Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Started /usr/sbin/ifup
> ovirtmgmt.
> Jul 06 08:37:01 ovirt1.satellite.local systemd[1]: Starting /usr/sbin/ifup
> ovirtmgmt.

Initscripts start static-IP configuration based on ifcfg-ovirtmgmt. But, at the same time... (assuming precise timestamps on a NetworkManager logger's side)

> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): link connected
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: unmanaged -> unavailable (reason
> 'connection-assumed') [1
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> ifcfg-rh: add connection in-memory
> (cb0a5d45-38d3-4e3b-8e6e-58993fd453d1,"ovirtmgmt")

NetworkManager picks up the bridge as well, unaware of ifcfg-ovirtmgmt yet.

> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: unavailable -> disconnected (reason
> 'connection-assumed')
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): Activation: starting connection 'ovirtmgmt'
> (cb0a5d45-38d3-4e3b-8e6e-58993fd45
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: disconnected -> prepare (reason 'none')
> [30 40 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: prepare -> config (reason 'none') [40 50 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: config -> ip-config (reason 'none') [50 70
> 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: ip-config -> ip-check (reason
> 'ip-config-unavailable') [7
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: ip-check -> secondaries (reason 'none')
> [80 90 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: secondaries -> activated (reason 'none')
> [90 100 0]
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> NetworkManager state is now CONNECTED_LOCAL
> Jul 06 08:37:01 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): Activation: successful, device activated.
> Jul 06 08:37:01 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching
> action 'up' for ovirtmgmt

NM is now done configuring the bridge. In the meantime, initscripts are also doing their job, feeling alone and safe. 

What puzzles me about the way NM configured the bridge is that I think it didn't even try DHCP, it probably only assigned an IPv4 link-local address (inferring from CONNECTED_LOCAL). Perhaps 'ip-config-unavailable' means that it should do just that.

> Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <warn> 
> ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
> (9a0b07c0-2983

Only here does NM realize it should not manage the bridge, and something worse happens.

> Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): device state change: activated -> unmanaged (reason
> 'unmanaged') [100 10 3]
> Jul 06 08:37:02 ovirt1.satellite.local kernel: ovirtmgmt: port 1(eth0)
> entered disabled state
> Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info> 
> NetworkManager state is now DISCONNECTED
> Jul 06 08:37:02 ovirt1.satellite.local kernel: IPv6: ADDRCONF(NETDEV_UP):
> ovirtmgmt: link is not ready
> Jul 06 08:37:02 ovirt1.satellite.local nm-dispatcher[19272]: Dispatching
> action 'down' for ovirtmgmt
> Jul 06 08:37:02 ovirt1.satellite.local NetworkManager[1024]: <info> 
> (ovirtmgmt): link disconnected
> Jul 06 08:37:03 ovirt1.satellite.local

Is it NetworkManager who downed the bridge?

> /etc/sysconfig/network-scripts/ifup-eth[19409]: Error adding default gateway
> 192.168.100.1 for ovirtmgmt.

In any case, this last error is probably only a result of the bridge being down already.

If I got it right, this is a better-than-a-textbook example of a deadly race that I caused by giving a chance to monitor-connection-files=yes.

In this bug, we should probably deal with the DNS problem and solve the race one way or another (by switching monitor-connection-files back to 'no' and either calling 'nmcli load' ourselves before ifup or having initscripts do that for us) in https://bugzilla.redhat.com/show_bug.cgi?id=1344411
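
For anyone debugging a similar race, the NetworkManager side can be inspected with standard tools, e.g.:

nmcli -f DEVICE,TYPE,STATE,CONNECTION device status
journalctl -u NetworkManager | grep -E 'ovirtmgmt|ifcfg-rh'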

Comment 32 cshao 2016-07-25 06:55:30 UTC
> > My test steps as following:
> > 1. Install RHVH via ISO(with defaulf ks)
> > 2. Configure a simple network(one nic).
> 
> Did you use static IP configuration, which is affected by this bug?
> DHCP-based configuration was reported to be fine.

Deploying HE is still successful with a static IP configuration.

Comment 33 Martin Tessun 2016-07-25 12:23:51 UTC
Hi,

(In reply to shaochen from comment #29)
> I can't reproduce this issue with normal network configure, is there any
> other network configure?
> 
> My test steps as following:
> 1. Install RHVH via ISO(with defaulf ks)
> 2. Configure a simple network(one nic).
> 3. Start hosted-engine installation from cockpit (https://<FWDN>:9090/)
> 4. Select the network you are connected to cockpit as the management network
> 5. Configure hosted engine can successful.

How did you configure your network? With DHCP? If so, please use static configuration, as dhcp ensures the interface is up afterwards. Static configuration doesn't.

Please note: It is not about HE having a static IP, but about the Hypervisor (RHV-H NGN) having a static IP (and no ovirtmgmt bridge set up already).

Looks like Ondrej's analysis in comment #31 is quite to the point.

(In reply to Fabian Deutsch from comment #28)
> $ nmcli d
> 
> might also be interesting, Martin.
> 
> maybe it's related to bug 1356635

No, I don't think so, as this setup is done with DHCP. My issue only shows up if dhcp is not used, but static configuration is used instead.

As mentioned earlier in my update, I think Ondrej's analysis in comment #31 is right to the point, so we have a sort of nasty race here (and probably NM does shut down the interface).

I will do some further tests with nm configuration (disabling monitor-connection-files)

Cheers,
Martin

Comment 34 Ondřej Svoboda 2016-07-25 13:08:04 UTC
Hi,

(In reply to Martin Tessun from comment #33)
> I will do some further tests with nm configuration (disabling
> monitor-connection-files)

Patching /etc/sysconfig/network-scripts/network-functions as below should have the same effect as monitor-connection-files=yes, minus the raciness.

Quoting from https://bugzilla.redhat.com/show_bug.cgi?id=1345919 and https://git.fedorahosted.org/cgit/initscripts.git/commit/?id=61fb1cb4efd62120ffbc021d7fdee1cd25059c08 (the diff is the same as the one posted by Thomas Haller)

In /etc/sysconfig/network-scripts/network-functions, replace:

    if ! is_false $NM_CONTROLLED && is_nm_running; then
        nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
        UUID=$(get_uuid_by_config $CONFIG)
        [ -n "$UUID" ] && _use_nm=true
    fi

With:

    if is_nm_running; then
        nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
        if ! is_false $NM_CONTROLLED; then
            UUID=$(get_uuid_by_config $CONFIG)
            [ -n "$UUID" ] && _use_nm=true
        fi
    fi

I'll try this on my VMs as well, but anyone is welcome to test.

Thanks,
Ondra

Comment 35 Martin Tessun 2016-07-26 10:48:28 UTC
Hi Fabian,

first all the nmcli d outputs:

1. Prior to installation:
[root@ovirt1 ~]# nmcli d
DEVICE  TYPE      STATE         CONNECTION 
eth0    ethernet  connected     eth0       
eth1    ethernet  disconnected  --         
eth2    ethernet  disconnected  --         
bond0   bond      unmanaged     --         
lo      loopback  unmanaged     --         
[root@ovirt1 ~]# 

2. After installation got stuck:
[root@ovirt1 ~]# nmcli d
DEVICE       TYPE      STATE         CONNECTION 
eth1         ethernet  disconnected  --                
eth2         ethernet  disconnected  --         
bond0        bond      unmanaged     --         
;vdsmdummy;  bridge    unmanaged     --         
ovirtmgmt    bridge    unmanaged     --         
eth0         ethernet  unmanaged     --         
lo           loopback  unmanaged     --         
[root@ovirt1 ~]# 

[root@ovirt1 ~]# ip a s dev ovirtmgmt
7: ovirtmgmt: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN 
    link/ether 52:54:00:80:c7:08 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.21/24 brd 192.168.100.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
[root@ovirt1 ~]# 


So I am now starting over again with the changes applied in the setup as described in Comment #34. I will update the BZ once it is finished.

Cheers,
Martin

Comment 36 Martin Tessun 2016-07-26 10:59:01 UTC
Hi Ondra,

(In reply to Ondřej Svoboda from comment #34)
> Hi,
> 
> (In reply to Martin Tessun from comment #33)
> > I will do some further tests with nm configuration (disabling
> > monitor-connection-files)
> 
> Patching /etc/sysconfig/network-scripts/network-functions as below should
> have the same effect as monitor-connection-files=yes, minus the raciness.
> 
> Quoting from https://bugzilla.redhat.com/show_bug.cgi?id=1345919 and
> https://git.fedorahosted.org/cgit/initscripts.git/commit/
> ?id=61fb1cb4efd62120ffbc021d7fdee1cd25059c08 (the diff is the same as the
> one posted by Thomas Haller)
> 
> In /etc/sysconfig/network-scripts/network-functions, replace:
> 
>     if ! is_false $NM_CONTROLLED && is_nm_running; then
>         nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
>         UUID=$(get_uuid_by_config $CONFIG)
>         [ -n "$UUID" ] && _use_nm=true
>     fi
> 
> With:
> 
>     if is_nm_running; then
>         nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
>         if ! is_false $NM_CONTROLLED; then
>             UUID=$(get_uuid_by_config $CONFIG)
>             [ -n "$UUID" ] && _use_nm=true
>         fi
>     fi
> 
> I'll try this on my VMs as well, but anyone is welcome to test.
> 
> Thanks,
> Ondra

Yep this does work. No more "downed" ovirtmgmt bridge. If you need it, I can provide the logs as well.

Cheers,
Martin

Comment 37 cshao 2016-07-27 04:36:05 UTC
(In reply to Martin Tessun from comment #33)
> Hi,
> 
> (In reply to shaochen from comment #29)
> > I can't reproduce this issue with normal network configure, is there any
> > other network configure?
> > 
> > My test steps as following:
> > 1. Install RHVH via ISO(with defaulf ks)
> > 2. Configure a simple network(one nic).
> > 3. Start hosted-engine installation from cockpit (https://<FWDN>:9090/)
> > 4. Select the network you are connected to cockpit as the management network
> > 5. Configure hosted engine can successful.
> 
> How did you configure your network? With DHCP? If so, please use static
> configuration, as dhcp ensures the interface is up afterwards. Static
> configuration doesn't.
> 
> Please note: It is not about HE having a static IP, but about the Hypervisor
> (RHV-H NGN) having a static IP (and no ovirtmgmt bridge set up already).


Thanks for bringing up the point, I can reproduce this issue now.

Test steps:
1. Interactively install RHVH from the ISO via Anaconda
2. Define network settings (with a static configuration)
3. Start the hosted-engine installation from Cockpit (https://<FQDN>:9090/)
4. Select the network you are connected to Cockpit through as the management network

Actual results:
During installation the setup aborts and Cockpit shows a "Disconnected" message.
The ovirtmgmt bridge stays down.

Comment 38 Dan Kenigsberg 2016-08-08 12:27:09 UTC
We've decided to disable NM on NGN (see bug 1364126) in order to avoid this bug.

Comment 41 Yaniv Kaul 2016-08-18 07:52:03 UTC
How can it be on QA and on 4.0.4?

Comment 42 Dan Kenigsberg 2016-08-18 12:14:36 UTC
*** Bug 1361017 has been marked as a duplicate of this bug. ***

Comment 46 Nikolai Sednev 2016-08-18 21:12:09 UTC
After some time we saw that "filter:vdsm-no-mac-spoofing,specParams:{}" had been added to /var/run/ovirt-hosted-engine-ha/vm.conf; it prevented the VM from starting on the host, so we removed it manually and started the VM.

Comment 47 Simone Tiraboschi 2016-08-19 10:16:00 UTC
(In reply to Nikolai Sednev from comment #46)
> After some time we've seen that in /var/run/ovirt-hosted-engine-ha/vm.conf,
> was added "filter:vdsm-no-mac-spoofing,specParams:{}", it prevented VM from
> starting on host, so we removed it manually and started the VM.

It's already in /usr/share/ovirt-hosted-engine-setup/templates/vm.conf.in;
the question is why it prevents the VM from starting.
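
(Whether the filter is actually defined in libvirt can be checked, for instance, with:)

virsh nwfilter-list | grep vdsm-no-mac-spoofing
virsh nwfilter-dumpxml vdsm-no-mac-spoofing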

Comment 48 Simone Tiraboschi 2016-08-19 10:46:25 UTC
Can you please add also a sos reports from the engine VM?

Comment 50 Nikolai Sednev 2016-08-21 08:54:42 UTC
(In reply to Simone Tiraboschi from comment #48)
> Can you please add also a sos reports from the engine VM?

I saw that both ovirt-ha-agent and ovirt-ha-broker were down and the engine's VM was failing to start. Once we changed /var/run/ovirt-hosted-engine-ha/vm.conf as described in https://bugzilla.redhat.com/show_bug.cgi?id=1351095#c46, I could start the agent and the broker, and the VM also started. One more thing: I rebooted the host right before changing /var/run/ovirt-hosted-engine-ha/vm.conf, as I thought it would be interesting to see whether that changed anything, and discovered that /var/run/ovirt-hosted-engine-ha/vm.conf simply disappeared after the host was rebooted, so we had to recreate the file manually and fill it with everything that was previously there, except for the "filter:vdsm-no-mac-spoofing,specParams:{}" line.

Comment 51 Nikolai Sednev 2016-08-21 08:57:25 UTC
Created attachment 1192544 [details]
sosreport from the engine

Comment 52 Simone Tiraboschi 2016-08-22 08:22:03 UTC
(In reply to Nikolai Sednev from comment #50)
> I've seen that both ovirt-ha-agent and broker were down and engine's VM was
> failing to get started, once we've changed
> /var/run/ovirt-hosted-engine-ha/vm.conf as appears in
> https://bugzilla.redhat.com/show_bug.cgi?id=1351095#c46, I could start agent
> and broker and VM also got started.

Pretty strange

> One more thing, I've tried to reboot the
> host right before changing /var/run/ovirt-hosted-engine-ha/vm.conf as
> thought that it might be interesting to see for if it might change something
> and discovered that /var/run/ovirt-hosted-engine-ha/vm.conf simply
> disappeared after host has been rebooted, so we had to create this file
> manually and filled it with everything that was previously there, except for
> the "filter:vdsm-no-mac-spoofing,specParams:{}" line.

/var/run is on tmpfs, so it's perfectly fine that it disappears on reboot.
The agent should recreate vm.conf from what is saved on the shared storage; the question is just why the agent didn't start.
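
(A quick way to see why the agent did not come up, sketched with standard commands:)

systemctl status ovirt-ha-agent ovirt-ha-broker
journalctl -u ovirt-ha-agent -b
hosted-engine --vm-status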

Comment 53 Simone Tiraboschi 2016-08-22 09:52:20 UTC
Aug 18 16:26:44 alma03 vdsmd_init_common.sh: libvirt: Network Filter Driver error : Network filter not found: no nwfilter with matching name 'vdsm-no-mac-spoofing'
Aug 18 16:26:44 alma03 vdsmd_init_common.sh: vdsm: Running dummybr


The issue seems to be here, so I suspect that the bridge wasn't working properly at first VM boot time and so hosted-engine-setup didn't complete.

Comment 54 Nikolai Sednev 2016-08-22 10:57:38 UTC
Works for me with osvoboda's fixes for VDSM on these components on engine:

ovirt-log-collector-4.0.0-1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.2-1.el7ev.noarch
ovirt-engine-lib-4.0.2.7-0.1.el7ev.noarch
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.0.2.7-0.1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-imageio-proxy-0.3.0-0.el7ev.noarch
ovirt-engine-cli-3.6.8.1-1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7ev.noarch
ovirt-imageio-proxy-setup-0.3.0-0.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-tools-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-dbscripts-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.2-1.el7ev.noarch
ovirt-engine-dashboard-1.0.2-1.el7ev.x86_64
ovirt-engine-setup-base-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-4.0.2.7-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.1-1.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-userportal-4.0.2.7-0.1.el7ev.noarch
python-ovirt-engine-sdk4-4.0.0-0.5.a5.el7ev.x86_64
ovirt-engine-setup-4.0.2.7-0.1.el7ev.noarch
ovirt-engine-backend-4.0.2.7-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.0-3.el7ev.noarch
rhevm-4.0.2.7-0.1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhev-guest-tools-iso-4.0-5.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.2-1.el7ev.noarch
rhev-release-4.0.2-9-001.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
Linux version 3.10.0-327.28.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jun 27 14:48:28 EDT 2016
Linux 3.10.0-327.28.2.el7.x86_64 #1 SMP Mon Jun 27 14:48:28 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Host:
ovirt-setup-lib-1.0.2-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
libvirt-client-1.2.17-13.el7_2.5.x86_64
rhevm-appliance-20160811.0-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
ovirt-imageio-common-0.3.0-0.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-hosted-engine-ha-2.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.5-1.el7ev.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.4-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
Linux version 3.10.0-327.28.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jun 27 14:48:28 EDT 2016
Linux 3.10.0-327.28.2.el7.x86_64 #1 SMP Mon Jun 27 14:48:28 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.2

I cleanly installed NGN from the latest RHVH-7.2-20160815.0-RHVH-x86_64-dvd1.iso, then used rhevm-appliance-20160811.0-1.el7ev.noarch to deploy the HE over iSCSI. Before the deployment I changed the IP address configuration on NGN from DHCP to static; NetworkManager was not disabled beforehand. The HE deployment succeeded via Cockpit and the management network bridge was created without any issues. NetworkManager was disabled during the deployment. Once I had the engine up and running, I upgraded it to the latest bits from the repos and restarted it, then added a data iSCSI storage domain to get hosted_storage auto-imported into the engine's web UI.
Finally the whole deployment succeeded, with hosted_storage auto-imported and the HE VM visible from the web UI.

Please consider pushing your fixes and moving this bug to modified.

Comment 56 Nikolai Sednev 2016-09-01 11:37:56 UTC
Moving to verified per https://bugzilla.redhat.com/show_bug.cgi?id=1351095#c54.

Comment 57 Gianluca Cecchi 2017-09-12 09:42:50 UTC
Hello,
I'm doing an evaluation of RHEV and I'm running into this bug with
RHVH-4.1-20170817.0-RHVH-x86_64-dvd1.iso.
Possible regression?
In my case I have a physical blade with 6 network adapters.
I configured the first one with the IP of the host during the Anaconda install.
Then from Cockpit I bonded it with the second network adapter, creating bond0 (802.3ad mode). From inside Cockpit I also configured bond1 (two other network adapters) and bond2.
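
For reference, the equivalent from the command line would be roughly the following (a sketch only, assuming NetworkManager and the interface names shown below; I actually created the bonds through the Cockpit UI):

nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,miimon=100"
nmcli con add type bond-slave con-name bond0-p1 ifname ens2f0 master bond0
nmcli con add type bond-slave con-name bond0-p2 ifname ens2f1 master bond0
nmcli con up bond0
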
All good until now.
Then I start the self-hosted engine install.
I select bond0 as the adapter to create the bridge on (I'm offered bond0, bond1 and bond2). Is that expected?
In my case I pre-created the bond because otherwise the network guys would have had to force-disable the second port on the Cisco switch.
I got disconnected from Cockpit and after a few more seconds I can connect again, but it seems I'm not offered a way to resume, only to start over.

At host I have:

[root@rhevora1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens2f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 2e:d8:2f:07:5b:67 brd ff:ff:ff:ff:ff:ff
3: eno49: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 portid 0100000000000000000000304d31353543 state UP qlen 1000
    link/ether 00:fd:45:f6:09:b0 brd ff:ff:ff:ff:ff:ff
4: ens2f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 2e:d8:2f:07:5b:67 brd ff:ff:ff:ff:ff:ff
5: eno50: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 portid 0200000000000000000000304d31353543 state UP qlen 1000
    link/ether 00:fd:45:f6:09:b0 brd ff:ff:ff:ff:ff:ff
6: ens2f2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond2 state UP qlen 1000
    link/ether 48:df:37:0c:7f:5a brd ff:ff:ff:ff:ff:ff
7: ens2f3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond2 state UP qlen 1000
    link/ether 48:df:37:0c:7f:5a brd ff:ff:ff:ff:ff:ff
23: bond2: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 48:df:37:0c:7f:5a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::4adf:37ff:fe0c:7f5a/64 scope link 
       valid_lft forever preferred_lft forever
24: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 00:fd:45:f6:09:b0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::2fd:45ff:fef6:9b0/64 scope link 
       valid_lft forever preferred_lft forever
25: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP qlen 1000
    link/ether 2e:d8:2f:07:5b:67 brd ff:ff:ff:ff:ff:ff
26: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether a6:09:2b:a3:90:88 brd ff:ff:ff:ff:ff:ff
27: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 2e:d8:2f:07:5b:67 brd ff:ff:ff:ff:ff:ff
    inet 192.168.50.21/24 brd 192.168.50.255 scope global ovirtmgmt
       valid_lft forever preferred_lft forever
[root@rhevora1 ~]# 

[root@rhevora1 ~]# brctl show
bridge name	bridge id		STP enabled	interfaces
;vdsmdummy;		8000.000000000000	no		
ovirtmgmt		8000.2ed82f075b67	no		bond0
[root@rhevora1 ~]#

Comment 58 Gianluca Cecchi 2017-09-12 09:43:12 UTC
BTW: I configured a static IP.

Comment 59 Simone Tiraboschi 2017-09-12 09:52:21 UTC
(In reply to Gianluca Cecchi from comment #57)
> I got disconnected from cockpit and after some further seconds I can connect
> again, but it seems I'm not proposed a way to resume, but only start over.

Did hosted-engine-setup die, or did you simply lose the connection to the Cockpit UI?
Could you please attach hosted-engine-setup logs?

Comment 60 Gianluca Cecchi 2017-09-12 10:02:28 UTC
No, there was no hosted-engine-setup process in place any more.

Also, when connecting to Cockpit again I had to start over: no VM and no previous setup was detected to resume.
I got an error because the VG and its LVs had already been created in the first run when I selected the same LUN.
To be able to re-run the setup I had to remove the VG and PV created in the first attempt:

[root@rhevora1 ~]# vgremove 22adbae5-4698-4e9a-bfe0-758695b1552b
Do you really want to remove volume group "22adbae5-4698-4e9a-bfe0-758695b1552b" containing 6 logical volumes? [y/n]: n
  Volume group "22adbae5-4698-4e9a-bfe0-758695b1552b" not removed
[root@rhevora1 ~]# vgremove 22adbae5-4698-4e9a-bfe0-758695b1552b
Do you really want to remove volume group "22adbae5-4698-4e9a-bfe0-758695b1552b" containing 6 logical volumes? [y/n]: y
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/metadata? [y/n]: y
  Logical volume "metadata" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/outbox? [y/n]: y
  Logical volume "outbox" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/leases? [y/n]: y
  Logical volume "leases" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/ids? [y/n]: y
  Logical volume "ids" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/inbox? [y/n]: y
  Logical volume "inbox" successfully removed
Do you really want to remove active logical volume 22adbae5-4698-4e9a-bfe0-758695b1552b/master? [y/n]: y
  Logical volume "master" successfully removed
  Volume group "22adbae5-4698-4e9a-bfe0-758695b1552b" successfully removed
[root@rhevora1 ~]#

[root@rhevora1 ~]# pvremove /dev/mapper/360002ac000000000000000530001894c
  Labels on physical volume "/dev/mapper/360002ac000000000000000530001894c" successfully wiped.
[root@rhevora1 ~]# 
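
The same cleanup could also be done non-interactively (assuming the VG really is just a leftover of the failed run and none of its LVs are in use) with something like:

vgremove -f 22adbae5-4698-4e9a-bfe0-758695b1552b         # -f skips the per-LV confirmations
pvremove /dev/mapper/360002ac000000000000000530001894c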

Then the new run didn't ask me which interface to use for ovirtmgmt; it automatically chose bond0 (I can see it on the summary screen), probably because it found the ovirtmgmt bridge already in place on bond0, and it completed OK with the final screen:

Hosted Engine Setup successfully completed!

But I'm quite expert in the OS and oVirt; I think a normal user following the install guide could have serious problems going ahead.
I'm going to collect both hosted-engine-setup logs and send them to you, in case they help with debugging.

Comment 61 Gianluca Cecchi 2017-09-12 12:05:48 UTC
Hello,
I'm going to attach, as a tgz, the series of ovirt-hosted-engine-setup*.log files that were generated.

The one related to the Cockpit disconnect is:
ovirt-hosted-engine-setup-20170912105542-anlmgu.log

The one that afterwards completed without errors is:
ovirt-hosted-engine-setup-20170912114523-r1pgjs.log

The log between them (ovirt-hosted-engine-setup-20170912112818-ywfeby.log) is from when I tried to see whether it was possible to resume in any way, but it gave an error about the LUN already having a VG on it from the previous attempt.

The previous ones are from when I had already downloaded the appliance but couldn't find a way to give it to the installer. In fact, the step where I can choose a file for the appliance comes "after" it installs the appliance RPM itself (I'm going to attach a screenshot), so even though I had already downloaded the 1.5 GB appliance OVA file, I was forced to download/install the appliance RPM and thus re-download a 1.5 GB file.
Let me know if I have to open another bug/RFE for this.

Also, the engine VM was not reachable after the install, and I see that this is because the default gateway was not set. You can walk through the logs to see if you can find the reason for this too; let me know if you want me to open a bug for this as well.

This was the situation on the engine VM (which was reachable from the host because it is on the same ovirtmgmt LAN):

[root@rhevmgr ~]# ip route show
192.168.50.0/24 dev eth0 proto kernel scope link src 192.168.50.20 metric 100 
[root@rhevmgr ~]#

Instead, on the host:
[root@rhevora1 ~]# ip route show
default via 192.168.50.1 dev ovirtmgmt 
169.254.0.0/16 dev ovirtmgmt scope link metric 1027 
192.168.50.0/24 dev ovirtmgmt proto kernel scope link src 192.168.50.21 
[root@rhevora1 ~]# 
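
Just to confirm it was only the missing default route, a quick non-persistent test from the engine VM console (the persistent nmcli fix I used is further below) is:

ip route add default via 192.168.50.1 dev eth0   # temporary, lost at reboot
ping -c3 172.16.1.11                             # or any host outside 192.168.50.0/24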

On the engine, NetworkManager is set up this way:

[root@rhevmgr network-scripts]# nmcli con show 
NAME         UUID                                  TYPE            DEVICE 
System eth0  5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03  802-3-ethernet  eth0   

[root@rhevmgr network-scripts]# nmcli con show "System eth0"
connection.id:                          System eth0
connection.uuid:                        5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03
connection.stable-id:                   --
connection.interface-name:              eth0
connection.type:                        802-3-ethernet
connection.autoconnect:                 yes
connection.autoconnect-priority:        0
connection.autoconnect-retries:         -1 (default)
connection.timestamp:                   1505216520
connection.read-only:                   no
connection.permissions:                 --
connection.zone:                        public
connection.master:                      --
connection.slave-type:                  --
connection.autoconnect-slaves:          -1 (default)
connection.secondaries:                 --
connection.gateway-ping-timeout:        0
connection.metered:                     unknown
connection.lldp:                        -1 (default)
802-3-ethernet.port:                    --
802-3-ethernet.speed:                   0
802-3-ethernet.duplex:                  --
802-3-ethernet.auto-negotiate:          no
802-3-ethernet.mac-address:             --
802-3-ethernet.cloned-mac-address:      --
802-3-ethernet.generate-mac-address-mask:--
802-3-ethernet.mac-address-blacklist:   --
802-3-ethernet.mtu:                     auto
802-3-ethernet.s390-subchannels:        --
802-3-ethernet.s390-nettype:            --
802-3-ethernet.s390-options:            --
802-3-ethernet.wake-on-lan:             1 (default)
802-3-ethernet.wake-on-lan-password:    --
ipv4.method:                            manual
ipv4.dns:                               172.16.1.11,172.16.1.2
ipv4.dns-search:                        my.domain
ipv4.dns-options:                       (default)
ipv4.dns-priority:                      0
ipv4.addresses:                         192.168.50.20/24
ipv4.gateway:                           --
ipv4.routes:                            --
ipv4.route-metric:                      -1
ipv4.ignore-auto-routes:                no
ipv4.ignore-auto-dns:                   no
ipv4.dhcp-client-id:                    --
ipv4.dhcp-timeout:                      0
ipv4.dhcp-send-hostname:                yes
ipv4.dhcp-hostname:                     --
ipv4.dhcp-fqdn:                         --
ipv4.never-default:                     no
ipv4.may-fail:                          yes
ipv4.dad-timeout:                       -1 (default)
ipv6.method:                            ignore
ipv6.dns:                               --
ipv6.dns-search:                        --
ipv6.dns-options:                       (default)
ipv6.dns-priority:                      0
ipv6.addresses:                         --
ipv6.gateway:                           --
ipv6.routes:                            --
ipv6.route-metric:                      -1
ipv6.ignore-auto-routes:                no
ipv6.ignore-auto-dns:                   no
ipv6.never-default:                     no
ipv6.may-fail:                          yes
ipv6.ip6-privacy:                       -1 (unknown)
ipv6.addr-gen-mode:                     stable-privacy
ipv6.dhcp-send-hostname:                yes
ipv6.dhcp-hostname:                     --
ipv6.token:                             --
proxy.method:                           none
proxy.browser-only:                     no
proxy.pac-url:                          --
proxy.pac-script:                       --
GENERAL.NAME:                           System eth0
GENERAL.UUID:                           5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03
GENERAL.DEVICES:                        eth0
GENERAL.STATE:                          activated
GENERAL.DEFAULT:                        no
GENERAL.DEFAULT6:                       no
GENERAL.VPN:                            no
GENERAL.ZONE:                           public
GENERAL.DBUS-PATH:                      /org/freedesktop/NetworkManager/ActiveConnection/1
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/Settings/1
GENERAL.SPEC-OBJECT:                    --
GENERAL.MASTER-PATH:                    --
IP4.ADDRESS[1]:                         192.168.50.20/24
IP4.GATEWAY:                            --
IP4.DNS[1]:                             172.16.1.11
IP4.DNS[2]:                             172.16.1.2
IP6.ADDRESS[1]:                         fe80::216:3eff:fe7c:6534/64
IP6.GATEWAY:                            --
[root@rhevmgr network-scripts]# 


To solve this further problem (hoping it stays persistent across engine reboots), what I did was:

[root@rhevmgr network-scripts]# nmcli con modify "System eth0" ipv4.gateway 192.168.50.1


[root@rhevmgr network-scripts]# nmcli con reload "System eth0"

[root@rhevmgr network-scripts]# nmcli con up "System eth0"
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/2)
[root@rhevmgr network-scripts]# 

[root@rhevmgr ~]# nmcli con show "System eth0" | grep -i GATEWAY
connection.gateway-ping-timeout:        0
ipv4.gateway:                           192.168.50.1
ipv6.gateway:                           --
IP4.GATEWAY:                            192.168.50.1
IP6.GATEWAY:                            --
[root@rhevmgr ~]# 

And now I'm able to reach it from outside
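
To verify from the CLI and to make sure it survives a reboot of the engine VM (a quick sketch, assuming the same connection name and that NetworkManager's ifcfg-rh plugin persists the setting):

nmcli -f ipv4.gateway con show "System eth0"                 # should print 192.168.50.1
ip route show default                                        # should show: default via 192.168.50.1 dev eth0
grep -i gateway /etc/sysconfig/network-scripts/ifcfg-eth0    # persisted value, if the ifcfg-rh plugin is in use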

Comment 62 Gianluca Cecchi 2017-09-12 12:06:59 UTC
Created attachment 1324839 [details]
tar gzip of hosted-engine-setup logs

Comment 63 Gianluca Cecchi 2017-09-12 12:10:36 UTC
Created attachment 1324852 [details]
screenshot where I chose gateway of engine vm

Inside the hosted-engine-setup log already provided (ovirt-hosted-engine-setup-20170912114523-r1pgjs.log) there is this line, which should configure the gateway of the engine VM...

2017-09-12 11:47:07 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_NETWORK/gateway=str:'192.168.50.1'

but it actually has not been configured.
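
A way to check where the value got lost (a sketch, assuming cloud-init's usual state directory and the ifcfg naming on the appliance) is to look for the gateway on the engine VM side:

grep -ri 192.168.50.1 /var/lib/cloud /etc/sysconfig/network-scripts/ 2>/dev/null
grep -i gateway /etc/sysconfig/network-scripts/ifcfg-eth0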

Comment 64 Simone Tiraboschi 2017-09-12 14:46:15 UTC
(In reply to Gianluca Cecchi from comment #61)
> The previous ones were because I already downloaded the appliance but I
> didn't find a way to give it to the installer. In fact the step where I can
> choose a file for the appliance is "after" it installs the appliance rpm
> itself (I'm going to attach screenshot), so even if I already downloaded the
> 1.5Gb appliance ova file, I was forced to download/install the appliance rpm
> and so re-download a 1.5gb file
> Let me know if I have to open another bug/rfe for this.

https://bugzilla.redhat.com/show_bug.cgi?id=1481095
Already fixed
 
> Also, the engine vm was not reachable after install and I see that the it
> depends on gateway not set. You can walk through the logs if you can find
> the reason for this too.. let me know if you want me to open a bug fir this
> too.

Yes, please. Could you attach /var/log/messages and the cloud-init logs from the engine VM?

Comment 65 Simone Tiraboschi 2017-09-12 14:56:57 UTC
hosted-engine-setup got a SIGHUP and so it terminated; this is consistent, since cockpit was the controlling terminal.

2017-09-12 11:22:09 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/storage/storage.py", line 980, in _misc
    self._createStoragePool()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/storage/storage.py", line 657, in _createStoragePool
    leaseRetries=None,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 165, in _callMethod
    kwargs.pop('_transport_timeout', self._default_timeout)))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 363, in call
    call.wait(kwargs.get('timeout', CALL_TIMEOUT))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 333, in wait
    self._ev.wait(timeout)
  File "/usr/lib64/python2.7/threading.py", line 622, in wait
    self.__cond.wait(timeout, balancing)
  File "/usr/lib64/python2.7/threading.py", line 362, in wait
    _sleep(delay)
  File "/usr/lib/python2.7/site-packages/otopi/main.py", line 53, in _signal
    raise RuntimeError("SIG%s" % signum)
RuntimeError: SIG1


Now we need to understand why cockpit died.
Gianluca, could you please share /var/log/messages and the cockpit logs from that host?

By the way, I'm not really sure it was a network-related issue, since at that time hosted-engine-setup was creating the storage pool, which shouldn't affect the network.

2017-09-12 11:21:24 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._createStoragePool:646 createStoragePool(args=[storagepoolID=203a9a04-9e88-40b3-931c-50e1dd63520e, name=hosted_datacenter, masterSdUUID=ab2cb8c5-656e-4d1e-8d69-ed2e9d8b6e77, masterVersion=1, domainList=['ab2cb8c5-656e-4d1e-8d69-ed2e9d8b6e77', '22adbae5-4698-4e9a-bfe0-758695b1552b'], lockRenewalIntervalSec=None, leaseTimeSec=None, ioOpTimeoutSec=None, leaseRetries=None])


Management bridge creation indeed seems fine:

2017-09-12 11:21:12 DEBUG otopi.context context._executeMethod:128 Stage misc METHOD otopi.plugins.gr_he_common.network.bridge.Plugin._misc
2017-09-12 11:21:12 INFO otopi.plugins.gr_he_common.network.bridge bridge._misc:359 Configuring the management bridge
2017-09-12 11:21:13 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:371 networks: {'ovirtmgmt': {'bonding': 'bond0', 'ipaddr': u'192.168.50.21', 'netmask': u'255.255.255.0', 'defaultRoute': True, 'gateway': u'192.168.50.1'}}
2017-09-12 11:21:13 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:372 bonds: {}
2017-09-12 11:21:13 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:373 options: {'connectivityCheck': False}
2017-09-12 11:21:20 DEBUG otopi.context context._executeMethod:128 Stage misc METHOD otopi.plugins.gr_he_setup.storage.blockd.Plugin._misc
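
As a possible workaround while we debug this (just a sketch, and it bypasses the Cockpit wizard; it assumes tmux is available on the host), the deployment can be run from an ssh session inside tmux so that a lost session does not deliver SIGHUP to the setup:

tmux new -s hedeploy
hosted-engine --deploy
# if the session drops, reattach later with: tmux attach -t hedeploy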

Comment 66 Gianluca Cecchi 2017-09-12 15:11:39 UTC
Created attachment 1324940 [details]
/var/log/messages of hypervisor

Comment 67 Gianluca Cecchi 2017-09-12 15:12:17 UTC
Created attachment 1324941 [details]
/var/log/messages of engine vm

Comment 68 Gianluca Cecchi 2017-09-12 15:14:15 UTC
I have attached /var/log/messages of hypervisor and engine vm.
Can you tell me the paths of the cloud-init logs on the engine VM and of the cockpit logs on the hypervisor, so that I can attach them too?

Comment 69 Ryan Barry 2017-09-12 15:17:27 UTC
Cockpit should be in the journal. We do not log separately for cockpit-ovirt.

Comment 70 Gianluca Cecchi 2017-09-12 15:36:09 UTC
Created attachment 1324946 [details]
journal log

Here is the output of:

journalctl -x --since today | gzip > /tmp/journal_today.txt.gz

Comment 71 Simone Tiraboschi 2017-09-12 15:58:50 UTC
Sep 12 11:22:09 rhevora1.padana.locale cockpit-ws[3826]: WebSocket from 172.16.4.22 for session closed

This happened at 11:22:09, exactly when hosted-engine-setup got a SIGHUP.
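
To correlate the two, the cockpit side of the disconnect can be pulled from the journal around that time (assuming the standard cockpit units), e.g.:

journalctl -u cockpit -u cockpit.socket --since "2017-09-12 11:20:00" --until "2017-09-12 11:25:00"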

Comment 72 Ryan Barry 2017-09-12 16:06:50 UTC
cockpit-ovirt terminates HE setup if it loses the connection. The question is why it disconnected...

Comment 73 Simone Tiraboschi 2017-09-18 14:27:22 UTC
(In reply to Simone Tiraboschi from comment #64)
> > Also, the engine vm was not reachable after install and I see that the it
> > depends on gateway not set. You can walk through the logs if you can find
> > the reason for this too.. let me know if you want me to open a bug fir this
> > too.
> 
> Yes please, could you please attach /var/log/messages and cloud-init logs
> from the engine VM?

https://bugzilla.redhat.com/show_bug.cgi?id=1492726

Comment 74 Gianluca Cecchi 2017-09-18 15:02:15 UTC
Ah OK, fine.
It is a general cloud-init problem with RHEL 7.4 that applies to the engine VM too, as its deployment uses cloud-init...
I will follow that bugzilla.
Thanks

Comment 75 Michael Burman 2017-09-26 08:41:23 UTC
I saw something very similar today - cockpit died in the engine-setup 'Closing up' stage with SIG1 -

2017-09-26 10:43:32 DEBUG otopi.plugins.gr_he_common.engine.health appliance_esetup._appliance_connect:89 Successfully connected to the appliance
2017-09-26 10:43:32 INFO otopi.plugins.gr_he_common.engine.health health._closeup:127 Running engine-setup on the appliance
2017-09-26 10:45:32 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Closing up': SIG1
2017-09-26 10:45:32 DEBUG otopi.context context.dumpEnvironment:760 ENVIRONMENT DUMP - BEGIN
2017-09-26 10:45:32 DEBUG otopi.context context.dumpEnvironment:770 ENV BASE/error=bool:'True'
2017-09-26 10:45:32 DEBUG otopi.context context.dumpEnvironment:770 ENV BASE/exceptionInfo=list:'[(<type 'exceptions.RuntimeError'>, RuntimeError('SIG1',), <traceback object at 0x2b90908>

set 26 10:45:32 orchid-vds2.qa.lab.tlv.redhat.com cockpit-ws[5725]: WebSocket from 10.35.4.183 for session closed
/var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20170926102646-3xy8j5.log:2017-09-26 10:45:32 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Closing up': SIG1

Comment 76 Ryan Barry 2017-09-26 08:47:41 UTC
Can you post the full log from /var/log/ovirt-hosted-engine-setup?

Comment 77 Michael Burman 2017-09-26 08:49:02 UTC
(In reply to Ryan Barry from comment #76)
> Can you post the full log from /var/log/ovirt-hosted-engine-setup?

Yes, and if you need access to the setup then let me know, as it is still alive, but I'm going to kill it within the next hour.

Comment 78 Michael Burman 2017-09-26 08:51:05 UTC
Created attachment 1330930 [details]
HE sig1 cockpit died log

Comment 79 Michael Burman 2017-09-26 12:18:45 UTC
In an additional run I had:

2017-09-26 15:06:03 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/storage/storage.py", line 985, in _misc
    self.environment[ohostedcons.StorageEnv.FAKE_MASTER_SD_UUID]
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/storage/storage.py", line 784, in _activateStorageDomain
    spUUID
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 165, in _callMethod
    kwargs.pop('_transport_timeout', self._default_timeout)))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 363, in call
    call.wait(kwargs.get('timeout', CALL_TIMEOUT))
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 333, in wait
    self._ev.wait(timeout)
  File "/usr/lib64/python2.7/threading.py", line 622, in wait
    self.__cond.wait(timeout, balancing)
  File "/usr/lib64/python2.7/threading.py", line 362, in wait
    _sleep(delay)
  File "/usr/lib/python2.7/site-packages/otopi/main.py", line 53, in _signal
    raise RuntimeError("SIG%s" % signum)
RuntimeError: SIG1
2017-09-26 15:06:03 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Misc configuration': SIG1

Are you sure, guys, that this bug should be considered verified?
Are all those SIG1 errors the same issue?
HE + cockpit is just not working...

rhvh-4.1-0.20170914.0+1
rhvm-appliance-4.1.20170914.0-1.el7.noarch.rpm

Comment 80 Michael Burman 2017-09-26 12:19:44 UTC
Created attachment 1331029 [details]
new failure

Comment 81 Ryan Barry 2017-09-26 12:23:38 UTC
Yes, it should be verified. This is regularly tested under a large number of scenarios, and widely used for deployment.

Can you post more details about your environment? What kind of storage, what's on the host (VM with limited memory? Physical host? What's the network configuration?)

Comment 82 Michael Burman 2017-09-26 12:45:46 UTC
If it's verified then maybe it's better to stop the discussion here and not spam the bug.

I'm using a physical host with enough memory, a static network configuration (as with DHCP I'm thrown out of the installation wizard) and NFS storage.

I will give it an additional attempt now and see.

Ryan, if you want, please contact me and I will provide you access to the setup. Thanks,

