Created attachment 1388724 [details]
ip-info

Description of problem:
During HE setup, deployment fails at the task that waits for the host to become non-operational after adding the host.

2018-01-30 21:22:27,229+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:75 {u'_ansible_parsed': True, u'_ansible_no_log': False, u'changed': False, u'attempts': 150, u'invocation': {u'module_args': {u'pattern': u'name=rhsdev-grafton2-nic2.lab.eng.blr.redhat.com', u'fetch_nested': False, u'nested_attributes': []}}, u'ansible_facts': {u'ovirt_hosts': []}}
2018-01-30 21:22:27,329+0530 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:79 fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 150, "changed": false}
2018-01-30 21:22:27,531+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:75 PLAY RECAP [grafton-engine1.redhat.com] : ok: 15 changed: 8 unreachable: 0 skipped: 4 failed: 0
2018-01-30 21:22:27,631+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:75 PLAY RECAP [localhost] : ok: 26 changed: 12 unreachable: 0 skipped: 1 failed: 1
2018-01-30 21:22:27,732+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:161 ansible-playbook rc: 2
2018-01-30 21:22:27,732+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:168 ansible-playbook stdout:
2018-01-30 21:22:27,732+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:170 to retry, use: --limit @/usr/share/ovirt-hosted-engine-setup/ansible/bootstrap_local_vm.retry

In the engine VM, engine.log:

2018-01-30 21:16:36,361+05 INFO  [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-4) [1cb6edf0] Running command: CreateUserSessionCommand internal: false.
2018-01-30 21:16:36,396+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-4) [1cb6edf0] EVENT_ID: USER_VDC_LOGIN(30), User admin@internal-authz connecting from '192.168.122.1' using session 'JuGooYeg8VITZPiqwDRJ+bHdzx3H2Wuabg4IQKaMN2YNu1zI8XLaWRkAQ2zvwU/1wW2PZgq4BQjdxDScxHQ4xQ==' logged in.
2018-01-30 21:16:56,760+05 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-7) [7772aab9-aa8a-4f60-8aed-e04f7e58917e] Failed to establish session with host 'rhsdev-grafton2-nic2.lab.eng.blr.redhat.com': SSH connection timed out connecting to 'root.eng.blr.redhat.com'
2018-01-30 21:16:56,762+05 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-7) [7772aab9-aa8a-4f60-8aed-e04f7e58917e] Validation of action 'AddVds' failed for user admin@internal-authz. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server rhsdev-grafton2-nic2.lab.eng.blr.redhat.com,VDS_CANNOT_CONNECT_TO_SERVER
2018-01-30 21:16:56,786+05 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-7) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.]

If I log in to the engine VM and try, the host's FQDN is indeed not reachable:

[root@grafton-engine1 ~]# ping rhsdev-grafton2-nic2.lab.eng.blr.redhat.com
PING rhsdev-grafton2-nic2.lab.eng.blr.redhat.com (10.70.40.34) 56(84) bytes of data.
^C
--- rhsdev-grafton2-nic2.lab.eng.blr.redhat.com ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 2999ms

Steps to Reproduce:
1. Deploy HE on a host with 2 NICs

Additional info:
Attached the output of ip addr, ip route list and ip rule list from the host and the engine VM.
Any workaround for this? Deploying on the RHHI setup is blocked due to this.
Can we look into what's dropping the traffic? firewall, ARP tables, etc.?
The issue seems to be due to the 2 NICs configured on the host, each on a different subnet and with a different FQDN. While installing, I chose 10.70.43.254 as the gateway and ens3f0 as the host interface.

On host:

ip route
default via 10.70.43.254 dev ovirtmgmt
default via 10.70.37.254 dev enp4s0f0 proto dhcp metric 100
10.70.36.0/23 dev enp4s0f0 proto kernel scope link src 10.70.37.28 metric 100
10.70.40.0/22 dev ovirtmgmt proto kernel scope link src 10.70.40.34
169.254.0.0/16 dev ovirtmgmt scope link metric 1045
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1

On engine VM:

ip route
default via 192.168.122.1 dev eth0 proto static metric 100
192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.240 metric 100

From the engine VM I am able to ping 10.70.37.28 - the address of enp4s0f0 (the second NIC) - but not 10.70.40.34, the address associated with the host FQDN that the HE install picked up based on the NIC/gateway I provided. I'm not sure whether this is down to the inputs I provided or to the way the bridge was created.
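To narrow down where the path breaks, here is a minimal sketch of checks one could run (addresses and interface names are taken from the outputs above; the expected results are assumptions, not verified on this setup):

# On the engine VM: which route is used towards the host's ovirtmgmt address?
ip route get 10.70.40.34          # presumably: via 192.168.122.1 dev eth0

# On the host: which route is used for the reply back to the engine VM?
ip route get 192.168.122.240

# On the host: do the ICMP echo requests arrive, and do replies leave on the same interface?
tcpdump -ni virbr0 icmp
tcpdump -ni ovirtmgmt icmp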
What's the bridge setup? Perhaps it was indeed set up on the other interface?
(In reply to Yaniv Kaul from comment #4)
> What's the bridge setup? Perhaps it was indeed set up on the other interface?

# brctl show
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;     8000.000000000000       no
ovirtmgmt       8000.0cc47a80f93c       no              ens3f0
virbr0          8000.52540029272e       yes             virbr0-nic
                                                        vnet0

The ovirtmgmt bridge is created on the correct interface - ens3f0 - but the local VM is using the virbr0 bridge, which I think is on dummy devices. From the host's ip addr output that's already attached:

21: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:29:27:2e brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever

[root@rhsdev-grafton2-nic2 ~]# ip rule list
0:      from all lookup local
32764:  from all to 10.70.40.0/22 iif ovirtmgmt lookup 172369954
32765:  from 10.70.40.0/22 lookup 172369954
32766:  from all lookup main
32767:  from all lookup default
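To see what those policy rules do to replies sourced from 10.70.40.0/22, it may help to dump the custom table they reference (the table id 172369954 is taken from the ip rule output above; this is only a diagnostic sketch, not a confirmed root cause):

# routes in the table referenced by rules 32764/32765
ip route show table 172369954

# if that table has no route back to 192.168.122.0/24 (the libvirt NAT network the
# local engine VM sits on), replies sourced from 10.70.40.34 would be misrouted:
ip route get 192.168.122.240 from 10.70.40.34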
(In reply to Sahina Bose from comment #1)
> Any workaround for this?
> Deploying on the RHHI setup is blocked due to this.

Ido is trying to reproduce it, but it seems to work correctly on our side.
yzhao reproduced it. From the engine VM on the libvirt NATed network I'm able to ping 8.8.8.8 but not the host address. If I stop firewalld on the host, the engine VM can no longer ping 8.8.8.8 either, since the NAT stops as well.
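To check whether it is the host firewall (rather than routing) that drops traffic from the NAT network to the host's own address, one could inspect the active rules while the deployment is stuck. These are generic diagnostic commands, not taken from the logs in this bug; zone and interface names are the firewalld/libvirt defaults and may differ:

# rules hit by traffic arriving from the local VM's network (virbr0)
iptables -S INPUT | grep -iE 'virbr0|drop|reject'
# NAT/masquerade rules libvirt installs for 192.168.122.0/24
iptables -t nat -S | grep 192.168.122
# firewalld's own view of the active zones/rules
firewall-cmd --list-all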
Stopping NetworkManager and restarting the network service on the host made it work again.
(In reply to Simone Tiraboschi from comment #8)
> Stopping NetworkManager and restarting the network service on the host made
> it work again.

Bad news: it failed again, but this time it is hanging on:

[ INFO  ] TASK [Restore network configuration]
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [Wait for the host to be up]

and ICMP from the VM to the host is still not working.
I was able to complete deployment after stopping NetworkManager on the host prior to deployment.
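For reference, the workaround applied here was roughly the following, run on the host before starting the deployment (the same NetworkManager/network commands are quoted again further down in this bug):

systemctl stop NetworkManager
systemctl restart network
hosted-engine --deploy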
I am also seeing this issue with 4.2.1 RC6. All worked fine with RC4; after upgrading to RC6 the host and engine could not communicate. I tried ditching everything, cleaning up, and then redeploying a new engine, and I can't get past:

TASK [Wait for the host to become non operational]

without:

fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 150, "changed": false}

This is a CentOS 7.4 box - a fresh install with RC4 a few days ago.
zipur, thanks for the report. Can you please try again stopping NetworkManager on your host before executing hosted-engine-setup?
Danken, any hint for us?
Simone, I tried that as well as everything else listed above. Nothing worked. I'm going to try a fresh install of CentOS 7.4 and oVirt 4.2.1 RC6 again to see if that works. My RC4 install was working great up until the upgrade to RC6, after which the engine could no longer communicate with the host. I then removed the engine, uninstalled everything, cleaned up, and tried again with no luck. Will let you know what I find from a fresh install of everything.
Looks like the issue with RC6 is fixed when installing from scratch (new CentOS 7.4 and then a new HostedEngine install). It must be a bug related to data left over from pre-RC6 installs?
Simone asked me to add here the description of a problem I saw in my environment while running hosted-engine --deploy on host cyan-vdsf.qa.lab.tlv.redhat.com.

Setup failed with:

2018-02-13 11:52:05,428+02 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-8) [da0fee02-6e68-477d-9d1d-cda14450bf87] Failed to establish session with host 'cyan-vdsf.qa.lab.tlv.redhat.com': SSH connection timed out connecting to 'root.lab.tlv.redhat.com'
2018-02-13 11:52:05,430+02 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-8) [da0fee02-6e68-477d-9d1d-cda14450bf87] Validation of action 'AddVds' failed for user admin@internal-authz. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server cyan-vdsf.qa.lab.tlv.redhat.com,VDS_CANNOT_CONNECT_TO_SERVER

The problem disappeared after running the following on the host:

systemctl stop NetworkManager
systemctl restart network
I also failed to deploy HE 4.2 with ansible on rhel-7.5, with the following error:

[ INFO ] TASK [Get ovirtmgmt route table id]
[ ERROR ] fatal: [localhost]: FAILED! => {\"attempts\": 50, \"changed\": true, \"cmd\": \"ip rule list | grep ovirtmgmt | sed s/\\\\\\\\[.*\\\\\\\\]\\\\ //g | awk '{ print $9 }'\", \"delta\": \"0:00:00.010850\", \"end\": \"2018-02-19 17:59:07.135606\", \"rc\": 0, \"start\": \"2018-02-19 17:59:07.124756\", \"stderr\": \"\", \"stderr_lines\": [], \"stdout\": \"\", \"stdout_lines\": []}

Notes:
- when I applied the WA Polina mentioned above, it installed successfully.
- when I did the ansible deploy last week on rhel-7.4, it worked without any WA.

maybe it's also related to the following issue that we have:
https://bugzilla.redhat.com/show_bug.cgi?id=1538630
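The failing task just retries the shell pipeline quoted in the fatal line above, so it can be run by hand on the host to see whether the ovirtmgmt policy rules ever appear; the following is roughly the un-escaped equivalent of that command (empty output is what makes the task give up after 50 attempts):

# should print the ovirtmgmt route table id once vdsm has set up its rules
ip rule list | grep ovirtmgmt | sed 's/\[.*\] //g' | awk '{ print $9 }'

# if it prints nothing, check whether any ovirtmgmt rules exist at all
ip rule list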
(In reply to Kobi Hakimi from comment #17)
> maybe it's also related to the following issue that we have:
> https://bugzilla.redhat.com/show_bug.cgi?id=1538630

Yes, right. We always see something like:

Feb 20 18:00:50 alma03 NetworkManager[564]: <info> [1519142450.0279] manager: (bondscan-J6Ls2c): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being deleted...
Feb 20 18:00:50 alma03 kernel: (unregistered net_device): Released all slaves
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being created...
Feb 20 18:00:50 alma03 NetworkManager[564]: <info> [1519142450.0640] manager: (bondscan-J6Ls2c): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being deleted...
Feb 20 18:00:50 alma03 kernel: (unregistered net_device): Released all slaves
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being created...
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being deleted...
Feb 20 18:00:50 alma03 kernel: (unregistered net_device): Released all slaves
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being created...
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being deleted...
Feb 20 18:00:50 alma03 kernel: (unregistered net_device): Released all slaves
Feb 20 18:00:50 alma03 NetworkManager[564]: <info> [1519142450.1181] manager: (bondscan-J6Ls2c): new Bond device (/org/freedesktop/NetworkManager/Devices/16)
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being created...
Feb 20 18:00:50 alma03 NetworkManager[564]: <info> [1519142450.1286] manager: (bondscan-J6Ls2c): new Bond device (/org/freedesktop/NetworkManager/Devices/17)
Feb 20 18:00:50 alma03 lldpad: recvfrom(Event interface): No buffer space available
Feb 20 18:00:50 alma03 vdsm-tool: Traceback (most recent call last):
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
Feb 20 18:00:50 alma03 vdsm-tool: return tool_command[cmd]["command"](*args)
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 94, in dump_bonding_options
Feb 20 18:00:50 alma03 vdsm-tool: sysfs_options_mapper.dump_bonding_options()
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 47, in dump_bonding_options
Feb 20 18:00:50 alma03 vdsm-tool: jdump(_get_default_bonding_options(), f)
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 69, in _get_default_bonding_options
Feb 20 18:00:50 alma03 vdsm-tool: with _bond_device(bond_name, mode):
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
Feb 20 18:00:50 alma03 vdsm-tool: return self.gen.next()
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 102, in _bond_device
Feb 20 18:00:50 alma03 vdsm-tool: _change_mode(bond_name, mode)
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 112, in _change_mode
Feb 20 18:00:50 alma03 vdsm-tool: opt.write(mode)
Feb 20 18:00:50 alma03 vdsm-tool: IOError: [Errno 22] Invalid argument
Feb 20 18:00:50 alma03 systemd: vdsm-network-init.service: main process exited, code=exited, status=1/FAILURE
Feb 20 18:00:50 alma03 systemd: Failed to start Virtual Desktop Server Manager network IP+link restoration.
Feb 20 18:00:50 alma03 systemd: Dependency failed for Virtual Desktop Server Manager network restoration.
Feb 20 18:00:50 alma03 systemd: Dependency failed for Virtual Desktop Server Manager.
Feb 20 18:00:50 alma03 systemd: Dependency failed for MOM instance configured for VDSM purposes.
Feb 20 18:00:50 alma03 systemd: Job mom-vdsm.service/start failed with result 'dependency'.
Feb 20 18:00:50 alma03 systemd: Job vdsmd.service/start failed with result 'dependency'.
Feb 20 18:00:50 alma03 systemd: Job vdsm-network.service/start failed with result 'dependency'.
Feb 20 18:00:50 alma03 systemd: Unit vdsm-network-init.service entered failed state.
Feb 20 18:00:50 alma03 systemd: vdsm-network-init.service failed.

So I think it's definitively related to that one.
Please retry with a newer kernel >= -851.
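To check which kernel build the host is actually running (the -851 build number is taken from the comment above; this is just a quick sanity check):

uname -r
# e.g. 3.10.0-851.el7.x86_64 or newer; older builds may still hit the
# bondscan / vdsm-network-init failure shown in the journal excerpt above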
Works for me on these components:

ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-851.el7.x86_64 #1 SMP Mon Feb 12 07:53:52 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

Deployment passed on NFS and Gluster storage types; the host had two NICs and a RHEL 7.5 OS.
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.