Bug 1540451

Summary: HE setup fails due to routing/natting issue - host not reachable from engine VM
Product: [oVirt] ovirt-hosted-engine-setup
Component: General
Version: 2.2.6
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Sahina Bose <sabose>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Nikolai Sednev <nsednev>
CC: bugs, danken, irosenzw, khakimi, klaas, lsvaty, pagranat, plord, rcyriac, sabose, stirabos, yzhao
Keywords: AutomationBlocker, Triaged
Flags: rule-engine: ovirt-4.2+, ykaul: blocker+
Target Milestone: ovirt-4.2.2
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2018-03-29 10:57:06 UTC
Type: Bug
oVirt Team: Integration
Bug Depends On: 1534483, 1538630
Bug Blocks:
Attachments: ip-info

Description Sahina Bose 2018-01-31 05:32:35 UTC
Created attachment 1388724 [details]
ip-info

Description of problem:

During HE setup, after the host is added, the task that waits for the host to become non-operational fails.

2018-01-30 21:22:27,229+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:75 {u'_ansible_parsed'
: True, u'_ansible_no_log': False, u'changed': False, u'attempts': 150, u'invocation': {u'module_args': {u'pattern': u'name=rhsdev-gr
afton2-nic2.lab.eng.blr.redhat.com', u'fetch_nested': False, u'nested_attributes': []}}, u'ansible_facts': {u'ovirt_hosts': []}}
2018-01-30 21:22:27,329+0530 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:79 fatal: [localhost]:
 FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 150, "changed": false}
2018-01-30 21:22:27,531+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:75 PLAY RECAP [grafton-engine1.redhat.com] : ok: 15 changed: 8 unreachable: 0 skipped: 4 failed: 0
2018-01-30 21:22:27,631+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:75 PLAY RECAP [localhost] : ok: 26 changed: 12 unreachable: 0 skipped: 1 failed: 1
2018-01-30 21:22:27,732+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:161 ansible-playbook rc: 2
2018-01-30 21:22:27,732+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:168 ansible-playbook stdout:
2018-01-30 21:22:27,732+0530 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:170  to retry, use: --limit @/usr/share/ovirt-hosted-engine-setup/ansible/bootstrap_local_vm.retry
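
The task that times out here is just polling the engine for the newly added host and gives up after 150 attempts with an empty ovirt_hosts list. A rough manual equivalent of that check against the engine REST API (a sketch only; the admin password and FQDNs below are placeholders) would be:

# Sketch: query the engine for the host the failing task is waiting for.
# PASSWORD, ENGINE_FQDN and HOST_FQDN are placeholders.
curl -s -k -u admin@internal:PASSWORD \
  -H 'Accept: application/xml' \
  'https://ENGINE_FQDN/ovirt-engine/api/hosts?search=name%3DHOST_FQDN'
# An empty <hosts/> element corresponds to the empty ovirt_hosts fact above.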

In the engine VM, engine.log

2018-01-30 21:16:36,361+05 INFO  [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-4) [1cb6edf0] Running command: CreateUserSessionCommand internal: false.
2018-01-30 21:16:36,396+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-4) [1cb6edf0] EVENT_ID: USER_VDC_LOGIN(30), User admin@internal-authz connecting from '192.168.122.1' using session 'JuGooYeg8VITZPiqwDRJ+bHdzx3H2Wuabg4IQKaMN2YNu1zI8XLaWRkAQ2zvwU/1wW2PZgq4BQjdxDScxHQ4xQ==' logged in.
2018-01-30 21:16:56,760+05 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-7) [7772aab9-aa8a-4f60-8aed-e04f7e58917e] Failed to establish session with host 'rhsdev-grafton2-nic2.lab.eng.blr.redhat.com': SSH connection timed out connecting to 'root.eng.blr.redhat.com'
2018-01-30 21:16:56,762+05 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-7) [7772aab9-aa8a-4f60-8aed-e04f7e58917e] Validation of action 'AddVds' failed for user admin@internal-authz. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server rhsdev-grafton2-nic2.lab.eng.blr.redhat.com,VDS_CANNOT_CONNECT_TO_SERVER
2018-01-30 21:16:56,786+05 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-7) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.]

If I log in to the engine VM and try, the hostname is indeed not reachable:
[root@grafton-engine1 ~]# ping rhsdev-grafton2-nic2.lab.eng.blr.redhat.com
PING rhsdev-grafton2-nic2.lab.eng.blr.redhat.com (10.70.40.34) 56(84) bytes of data.
^C
--- rhsdev-grafton2-nic2.lab.eng.blr.redhat.com ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 2999ms



Steps to Reproduce:
1. Deploy HE on host with 2 NICs



Additional info:

Attached is the output of ip addr, ip route list and ip rule list from the host and the engine VM.

Comment 1 Sahina Bose 2018-02-05 11:13:31 UTC
Any workaround for this?
Deploying on the RHHI setup is blocked due to this.

Comment 2 Yaniv Kaul 2018-02-05 11:36:22 UTC
Can we look into what's dropping the traffic? firewall, ARP tables, etc.?

Comment 3 Sahina Bose 2018-02-05 12:10:57 UTC
The issue seems to be due to the 2 NICs configured on the host, each on a different subnet and using a different FQDN.
While installing, I had chosen 10.70.43.254 as the gateway and ens3f0 as the host interface.
On the host:
ip route
default via 10.70.43.254 dev ovirtmgmt 
default via 10.70.37.254 dev enp4s0f0 proto dhcp metric 100 
10.70.36.0/23 dev enp4s0f0 proto kernel scope link src 10.70.37.28 metric 100 
10.70.40.0/22 dev ovirtmgmt proto kernel scope link src 10.70.40.34 
169.254.0.0/16 dev ovirtmgmt scope link metric 1045 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1

On engine VM:
ip route
default via 192.168.122.1 dev eth0 proto static metric 100 
192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.240 metric 100 

From the engine VM, I'm able to ping 10.70.37.28 - the IP address associated with enp4s0f0 (the second NIC) - but not 10.70.40.34, the address associated with the host FQDN that the HE install picked up based on the NIC/gateway I provided.

I'm not sure whether this is due to the inputs I provided or to the way the bridge was created.
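
As a diagnostic sketch (not from the original report), the addresses above can be fed to the kernel's route lookup on the host to see which path and source address it would pick for traffic to and from the engine VM:

# Run on the host; addresses are the ones quoted in this comment.
ip route get 192.168.122.240                     # path towards the engine VM
ip route get 192.168.122.240 from 10.70.40.34    # reply path from the ovirtmgmt address
ip rule list                                     # policy rules consulted for that reply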

Comment 4 Yaniv Kaul 2018-02-05 12:45:11 UTC
What's the bridge setup? Perhaps it was indeed set up on the other interface?

Comment 5 Sahina Bose 2018-02-05 13:40:42 UTC
(In reply to Yaniv Kaul from comment #4)
> What's the bridge setup? Perhaps it was indeed set up on the other interface?

#brctl show
bridge name	bridge id		STP enabled	interfaces
;vdsmdummy;		8000.000000000000	no		
ovirtmgmt		8000.0cc47a80f93c	no		ens3f0
virbr0		8000.52540029272e	yes		virbr0-nic
							vnet0
The ovirtmgmt bridge is created on the correct interface (ens3f0), but the local VM is using the virbr0 bridge, which I think is on dummy devices.

From the host's ip addr that's already attached:
21: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:29:27:2e brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever

[root@rhsdev-grafton2-nic2 ~]# ip rule list
0:	from all lookup local 
32764:	from all to 10.70.40.0/22 iif ovirtmgmt lookup 172369954 
32765:	from 10.70.40.0/22 lookup 172369954 
32766:	from all lookup main 
32767:	from all lookup default
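
The two non-default rules above send anything sourced from 10.70.40.0/22 (i.e. replies from the host's ovirtmgmt address) to VDSM's per-network table 172369954. A sketch of how to inspect that table, plus a purely hypothetical workaround of adding a return route to the libvirt NAT subnet there (not the shipped fix, just an illustration of the suspicion):

# Inspect the per-network routing table referenced by the rules above.
ip route show table 172369954
# Hypothetical only: give that table an explicit return route to the
# libvirt NAT subnet so replies to the engine VM go back via virbr0.
# ip route add 192.168.122.0/24 dev virbr0 table 172369954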

Comment 6 Simone Tiraboschi 2018-02-05 16:06:45 UTC
(In reply to Sahina Bose from comment #1)
> Any workaround for this?
> Deploying on the RHHI setup is blocked due to this.

Ido is trying to reproduce it, but it seems to work correctly on our side.

Comment 7 Simone Tiraboschi 2018-02-06 09:49:50 UTC
yzhao reproduced it.
From the engine VM on the libvirt NATted network I'm able to ping 8.8.8.8 but not the host address.
If I stop firewalld on the host, the engine VM can no longer ping 8.8.8.8 either, since the NAT stops working as well.
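
Since the libvirt default network's NAT is implemented with firewall rules on the host, a hedged way to check whether masquerading for 192.168.122.0/24 is in place and whether something is rejecting traffic towards the host would be:

# Run on the host: check NAT and filter rules touching the libvirt subnet.
iptables -t nat -L POSTROUTING -n -v | grep 192.168.122
iptables -L -n -v | grep -E 'REJECT|DROP' | head
firewall-cmd --get-active-zones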

Comment 8 Simone Tiraboschi 2018-02-06 10:00:21 UTC
Stopping NetworkManager and restarting the network service on the host made it work again.
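
Concretely, the workaround referred to here (and spelled out again in comment 16) amounts to running the following on the host:

# Workaround only, not the final fix.
systemctl stop NetworkManager
systemctl restart network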

Comment 9 Yihui Zhao 2018-02-06 10:21:26 UTC
(In reply to Simone Tiraboschi from comment #8)
> Stopping NetworkManager and restarting network service on the host made it
> working again.

Bad news: it failed again; this time it hangs at:

[ INFO  ] TASK [Restore network configuration]
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [Wait for the host to be up]


And ICMP from the VM to the host is not OK.

Comment 10 Sahina Bose 2018-02-07 08:34:39 UTC
I was able to complete deployment after stopping NetworkManager on the host prior to deployment.

Comment 11 zipur 2018-02-07 17:09:54 UTC
I am also seeing this issue with 4.2.1 RC6. Everything worked fine with RC4; after the upgrade to RC6, the host and engine could not communicate. I tried ditching it, cleaning up, and then redeploying a new engine, and I can't get past:

TASK [Wait for the host to become non operational]

without:

fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 150, "changed": false}

This is a CentOS 7.4 box - a fresh install with RC4 from a few days ago.

Comment 12 Simone Tiraboschi 2018-02-07 18:21:48 UTC
zipur,
thanks for the report.

Can you please try again stopping NetworkManager on your host before executing hosted-engine-setup?

Comment 13 Simone Tiraboschi 2018-02-07 18:24:05 UTC
Danken,
any hint for us?

Comment 14 zipur 2018-02-07 19:15:10 UTC
Simone,

I tried that, as well as everything else listed above. Nothing worked. I was going to try a fresh install of CentOS 7.4 and oVirt 4.2.1 RC6 again to see if that works. My RC4 install was working great until the upgrade to RC6, which left the engine unable to communicate with the host. I then removed the engine, uninstalled everything, cleaned up, and tried again with no luck. I will let you know what I find from a fresh install of everything.

Comment 15 zipur 2018-02-07 23:15:55 UTC
It looks like the issue with RC6 is fixed when installing from scratch (new CentOS 7.4 and then a new HostedEngine install). It must be a bug related to data left over from releases prior to RC6?

Comment 16 Polina 2018-02-13 12:18:45 UTC
Simone asked me to add here a description of a problem I saw in my environment while running hosted-engine --deploy on host cyan-vdsf.qa.lab.tlv.redhat.com.

The setup failed with:

2018-02-13 11:52:05,428+02 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-8) [da0fee02-6e68-477d-9d1d-cda14450bf87] Failed to establish session with host 'cyan-vdsf.qa.lab.tlv.redhat.com': SSH connection timed out connecting to 'root.lab.tlv.redhat.com'
2018-02-13 11:52:05,430+02 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-8) [da0fee02-6e68-477d-9d1d-cda14450bf87] Validation of action 'AddVds' failed for user admin@internal-authz. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server cyan-vdsf.qa.lab.tlv.redhat.com,VDS_CANNOT_CONNECT_TO_SERVER

The problem disappeared after running the following on the host:
systemctl stop NetworkManager
systemctl restart network

Comment 17 Kobi Hakimi 2018-02-20 08:39:35 UTC
I also failed to deploy HE 4.2 with ansible on RHEL 7.5, with the following error:

[ INFO  ] TASK [Get ovirtmgmt route table id]
[ ERROR ] fatal: [localhost]: FAILED! => {\"attempts\": 50, \"changed\": true, \"cmd\": \"ip rule list | grep ovirtmgmt | sed s/\\\\\\\\[.*\\\\\\\\]\\\\ //g | awk '{ print $9 }'\", \"delta\": \"0:00:00.010850\", \"end\": \"2018-02-19 17:59:07.135606\", \"rc\": 0, \"start\": \"2018-02-19 17:59:07.124756\", \"stderr\": \"\", \"stderr_lines\": [], \"stdout\": \"\", \"stdout_lines\": []}
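
De-escaped, the command that task runs is roughly the pipeline below; running it by hand on the host shows whether the ovirtmgmt policy-routing rule exists at all (empty output means VDSM never created it):

# Same check the failing "Get ovirtmgmt route table id" task performs,
# reconstructed from the error above.
ip rule list | grep ovirtmgmt
ip rule list | grep ovirtmgmt | sed 's/\[.*\] //g' | awk '{ print $9 }'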

Notes:
 - When I applied the WA Polina mentioned above, the installation succeeded.
 - When I did the ansible deploy last week on RHEL 7.4, it worked without any WA.

maybe it's also related to the following issue that we have:
https://bugzilla.redhat.com/show_bug.cgi?id=1538630

Comment 18 Simone Tiraboschi 2018-02-20 16:14:35 UTC
(In reply to Kobi Hakimi from comment #17)
> maybe it's also related to the following issue that we have:
> https://bugzilla.redhat.com/show_bug.cgi?id=1538630

Yes, right.
We always see something like:

Feb 20 18:00:50 alma03 NetworkManager[564]: <info>  [1519142450.0279] manager: (bondscan-J6Ls2c): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being deleted...
Feb 20 18:00:50 alma03 kernel: (unregistered net_device): Released all slaves
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being created...
Feb 20 18:00:50 alma03 NetworkManager[564]: <info>  [1519142450.0640] manager: (bondscan-J6Ls2c): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being deleted...
Feb 20 18:00:50 alma03 kernel: (unregistered net_device): Released all slaves
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being created...
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being deleted...
Feb 20 18:00:50 alma03 kernel: (unregistered net_device): Released all slaves
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being created...
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being deleted...
Feb 20 18:00:50 alma03 kernel: (unregistered net_device): Released all slaves
Feb 20 18:00:50 alma03 NetworkManager[564]: <info>  [1519142450.1181] manager: (bondscan-J6Ls2c): new Bond device (/org/freedesktop/NetworkManager/Devices/16)
Feb 20 18:00:50 alma03 kernel: bonding: bondscan-J6Ls2c is being created...
Feb 20 18:00:50 alma03 NetworkManager[564]: <info>  [1519142450.1286] manager: (bondscan-J6Ls2c): new Bond device (/org/freedesktop/NetworkManager/Devices/17)
Feb 20 18:00:50 alma03 lldpad: recvfrom(Event interface): No buffer space available
Feb 20 18:00:50 alma03 vdsm-tool: Traceback (most recent call last):
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
Feb 20 18:00:50 alma03 vdsm-tool: return tool_command[cmd]["command"](*args)
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/network.py", line 94, in dump_bonding_options
Feb 20 18:00:50 alma03 vdsm-tool: sysfs_options_mapper.dump_bonding_options()
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 47, in dump_bonding_options
Feb 20 18:00:50 alma03 vdsm-tool: jdump(_get_default_bonding_options(), f)
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 69, in _get_default_bonding_options
Feb 20 18:00:50 alma03 vdsm-tool: with _bond_device(bond_name, mode):
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
Feb 20 18:00:50 alma03 vdsm-tool: return self.gen.next()
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 102, in _bond_device
Feb 20 18:00:50 alma03 vdsm-tool: _change_mode(bond_name, mode)
Feb 20 18:00:50 alma03 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py", line 112, in _change_mode
Feb 20 18:00:50 alma03 vdsm-tool: opt.write(mode)
Feb 20 18:00:50 alma03 vdsm-tool: IOError: [Errno 22] Invalid argument
Feb 20 18:00:50 alma03 systemd: vdsm-network-init.service: main process exited, code=exited, status=1/FAILURE
Feb 20 18:00:50 alma03 systemd: Failed to start Virtual Desktop Server Manager network IP+link restoration.
Feb 20 18:00:50 alma03 systemd: Dependency failed for Virtual Desktop Server Manager network restoration.
Feb 20 18:00:50 alma03 systemd: Dependency failed for Virtual Desktop Server Manager.
Feb 20 18:00:50 alma03 systemd: Dependency failed for MOM instance configured for VDSM purposes.
Feb 20 18:00:50 alma03 systemd: Job mom-vdsm.service/start failed with result 'dependency'.
Feb 20 18:00:50 alma03 systemd: Job vdsmd.service/start failed with result 'dependency'.
Feb 20 18:00:50 alma03 systemd: Job vdsm-network.service/start failed with result 'dependency'.
Feb 20 18:00:50 alma03 systemd: Unit vdsm-network-init.service entered failed state.
Feb 20 18:00:50 alma03 systemd: vdsm-network-init.service failed.

So I think it's definitely related to that one.
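
Given the systemd failures in that journal excerpt, a quick way to confirm on the host that VDSM's network services never came up (which would explain why the engine cannot add the host) is:

# Sketch: check the VDSM network units that failed in the log above.
systemctl status vdsm-network-init.service vdsm-network.service vdsmd.service
journalctl -u vdsm-network-init.service --no-pager -n 50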

Comment 19 Dan Kenigsberg 2018-02-21 08:49:51 UTC
Please retry with a newer kernel (>= -851).

Comment 20 Nikolai Sednev 2018-02-21 15:27:00 UTC
Works for me on these components:
ovirt-hosted-engine-ha-2.2.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-851.el7.x86_64 #1 SMP Mon Feb 12 07:53:52 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)

Deployment passed on NFS and Gluster storage types; the host had two NICs and ran RHEL 7.5.

Comment 21 Sandro Bonazzola 2018-03-29 10:57:06 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.