Description of problem:

hosted-engine-setup in the new node-zero flow adds the first host to the engine and waits for the host to reach 'Up' status before performing further actions through the engine. Some time later, add_sourceroute kicks in on the host, adding additional routes; the host goes into non_responsive status in the middle of the deployment, and so hosted-engine deployment fails with different errors, all of them due to the fact that the engine lost the host in the middle of HE deployment.

Let's walk through the logs.

From engine.log, hosted-engine-setup adds the host here:

2018-02-27 11:09:04,128+01 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-10) [8da0ea32-39f0-47f2-ae3c-f103e6818fc1] Running command: AddVdsCommand internal: false. Entities affected : ID: f53b4ec2-1ba4-11e8-b5bc-001a4c103f14 Type: ClusterAction group CREATE_HOST with role type ADMIN

otopi host-deploy completes here:

2018-02-27 11:10:11,298+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [2f1e0198] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host 10-37-160-177.rhev.lab.eng.brq.redhat.com. Stage: Termination.

And then the engine starts ansible host-deploy:

2018-02-27 11:10:15,427+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1) [2f1e0198] EVENT_ID: VDS_ANSIBLE_INSTALL_STARTED(560), Ansible host-deploy playbook execution has started on host 10-37-160-177.rhev.lab.eng.brq.redhat.com.

which completes here:

2018-02-27 11:11:27,547+01 INFO [org.ovirt.engine.core.common.utils.ansible.AnsibleExecutor] (EE-ManagedThreadFactory-engine-Thread-1) [2f1e0198] Ansible playbook command has exited with value: 0
2018-02-27 11:11:27,731+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1) [2f1e0198] EVENT_ID: VDS_ANSIBLE_INSTALL_FINISHED(561), Ansible host-deploy playbook execution has successfully finished on host 10-37-160-177.rhev.lab.eng.brq.redhat.com.

So the engine starts HostSetupNetworksVDSCommand:

2018-02-27 11:11:32,786+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1) [16ec1655] START, HostSetupNetworksVDSCommand(HostName = 10-37-160-177.rhev.lab.eng.brq.redhat.com, HostSetupNetworksVdsCommandParameters:{hostId='e6b7744c-035b-4bea-8276-c6d4af7e420a', vds='Host[10-37-160-177.rhev.lab.eng.brq.redhat.com,e6b7744c-035b-4bea-8276-c6d4af7e420a]', rollbackOnFailure='true', connectivityTimeout='120', networks='[HostNetwork:{defaultRoute='true', bonding='false', networkName='ovirtmgmt', vdsmName='ovirtmgmt', nicName='eth0', vlan='null', mtu='0', vmNetwork='true', stp='false', properties='null', ipv4BootProtocol='DHCP', ipv4Address='null', ipv4Netmask='null', ipv4Gateway='null', ipv6BootProtocol='AUTOCONF', ipv6Address='null', ipv6Prefix='null', ipv6Gateway='null', nameServers='null'}]', removedNetworks='[]', bonds='[]', removedBonds='[]', clusterSwitchType='LEGACY'}), log id: 7f93bbfd

which completes here:

2018-02-27 11:11:53,102+01 INFO [org.ovirt.engine.core.bll.network.host.HostSetupNetworksCommand] (EE-ManagedThreadFactory-engine-Thread-1) [16ec1655] Host setup networks finished. Lock released.
Monitoring can run now for host '10-37-160-177.rhev.lab.eng.brq.redhat.com' from data-center 'Default'

Everything seems fine on the network side, so the network configuration gets committed:

2018-02-27 11:11:53,330+01 INFO [org.ovirt.engine.core.bll.network.host.CommitNetworkChangesCommand] (EE-ManagedThreadFactory-engine-Thread-1) [9c04c28] Running command: CommitNetworkChangesCommand internal: true. Entities affected : ID: e6b7744c-035b-4bea-8276-c6d4af7e420a Type: VDSAction group CONFIGURE_HOST_NETWORK with role type ADMIN

and the host is set to Up:

2018-02-27 11:12:00,117+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [59b7705f] EVENT_ID: VDS_DETECTED(13), Status of host 10-37-160-177.rhev.lab.eng.brq.redhat.com was set to Up.

From ovirt-hosted-engine-setup-20180227103913-jocut8.log, hosted-engine-setup indeed notices it:

2018-02-27 11:12:06,507+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 host_result_up: {'failed': False, 'attempts': 4, u'changed': False, u'ansible_facts': {u'ovirt_hosts': [{u'comment': u'', u'update_available': False, u'protocol': u'stomp', u'affinity_labels': [], u'hooks': [], u'cluster': {u'href': u'/ovirt-engine/api/clusters/f53b4ec2-1ba4-11e8-b5bc-001a4c103f14', u'id': u'f53b4ec2-1ba4-11e8-b5bc-001a4c103f14'}, u'href': u'/ovirt-engine/api/hosts/e6b7744c-035b-4bea-8276-c6d4af7e420a', u'devices': [], u'port': 54321, u'external_status': u'ok', u'statistics': [], u'certificate': {u'organization': u'rhev.lab.eng.brq.redhat.com', u'subject': u'O=rhev.lab.eng.brq.redhat.com,CN=10-37-160-177.rhev.lab.eng.brq.redhat.com'}, u'nics': [], u'iscsi': {u'initiator': u'iqn.1994-05.com.redhat:3269ab941f28'}, u'cpu': {u'speed': 2133.0, u'name': u'Intel Core i7 9xx (Nehalem Class Core i7)', u'topology': {u'cores': 1, u'threads': 1, u'sockets': 2}}, u'id': u'e6b7744c-035b-4bea-8276-c6d4af7e420a', u'hardware_information': {u'uuid': u'647A3197-8EB5-4E10-A48F-CE9B0980945A', u'family': u'Red Hat Enterprise Linux', u'supported_rng_sources': [u'random', u'hwrng'], u'version': u'7-3.1611.el7.centos', u'serial_number': u'0103A4A5-91FE-D511-A0F0-80C16E7243EA', u'product_name': u'oVirt Node', u'manufacturer': u'oVirt'}, u'version': {u'full_version': u'vdsm-4.20.18-18.gitddba3f1.el7.centos', u'revision': 0, u'major': 4, u'minor': 20, u'build': 18}, u'memory': 8199864320, u'ksm': {u'enabled': False}, u'se_linux': {u'mode': u'enforcing'}, u'type': u'rhel', u'status': u'up', u'tags': [], u'katello_errata': [], u'external_network_provider_configurations': [], u'ssh': {u'port': 22, u'fingerprint': u'SHA256:pdiEu+2IUsbabzcFjnf6c+rVJzkkvzAJAFr8/TRZUsc'}, u'address': u'10-37-160-177.rhev.lab.eng.brq.redhat.com', u'numa_nodes': [], u'device_passthrough': {u'enabled': False}, u'unmanaged_networks': [], u'permissions': [], u'numa_supported': False, u'libvirt_version': {u'full_version': u'libvirt-3.2.0-14.el7_4.7', u'revision': 0, u'major': 3, u'minor': 2, u'build': 0}, u'power_management': {u'kdump_detection': True, u'enabled': False, u'pm_proxies': [], u'automatic_pm_enabled': True}, u'name': u'10-37-160-177.rhev.lab.eng.brq.redhat.com', u'max_scheduling_memory': 7863271424, u'summary': {u'total': 0}, u'auto_numa_status': u'disable', u'transparent_huge_pages': {u'enabled': False}, u'network_attachments': [], u'os': {u'version': {u'full_version': u'7 - 4.1708.el7.centos', u'major': 7}, u'type': u'RHEL', u'custom_kernel_cmdline': u'', u'reported_kernel_cmdline':
u'BOOT_IMAGE=/vmlinuz-3.10.0-693.17.1.el7.x86_64 root=/dev/mapper/centos_10--37--160--177-root ro crashkernel=auto rd.lvm.lv=centos_10-37-160-177/root rd.lvm.lv=centos_10-37-160-177/swap rhgb quiet LANG=en_US.UTF-8'}, u'storage_connection_extensions': [], u'kdump_status': u'disabled', u'spm': {u'priority': 5, u'status': u'none'}}]}}

and so hosted-engine-setup starts using the host.

The host went Up at 11:12:00, but 23 seconds later something kicked in on the supervdsm side, re-configuring the network. From supervdsm.log:

MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,150::supervdsm_server::96::SuperVdsm.ServerCallback::(wrapper) call add_sourceroute with ('10.37.160.177', '255.255.255.0', '10.37.160.254') {}
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,151::sourceroute::190::root::(add) Adding source route for device ovirtmgmt
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,152::cmdutils::150::root::(exec_cmd) /sbin/ip -4 route add 0.0.0.0/0 via 10.37.160.254 dev ovirtmgmt table 170238129 (cwd None)
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,182::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,183::cmdutils::150::root::(exec_cmd) /sbin/ip -4 route add 10.37.160.0/24 via 10.37.160.177 dev ovirtmgmt table 170238129 (cwd None)
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,190::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,191::cmdutils::150::root::(exec_cmd) /sbin/ip rule add from 10.37.160.0/24 table 170238129 (cwd None)
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,198::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,199::cmdutils::150::root::(exec_cmd) /sbin/ip rule add from all to 10.37.160.0/24 dev ovirtmgmt table 170238129 (cwd None)
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,207::cmdutils::158::root::(exec_cmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|dhclient-monitor::DEBUG::2018-02-27 11:12:23,209::supervdsm_server::103::SuperVdsm.ServerCallback::(wrapper) return add_sourceroute with None

Those source-route rules broke hosted-engine bootstrap networking (the bootstrap VM runs on the default natted libvirt network), and so the engine loses the host. From engine.log:

2018-02-27 11:12:31,249+01 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [535d8d66] No interaction with host '10-37-160-177.rhev.lab.eng.brq.redhat.com' for 20000 ms.
2018-02-27 11:12:34,257+01 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [535d8d66] Heartbeat exceeded for host '10-37-160-177.rhev.lab.eng.brq.redhat.com', last response arrived 23472 ms ago.

23472 ms before 11:12:34 means the last valid interaction with the host was at about 11:12:21, and indeed add_sourceroute kicked in at 11:12:23. So the host gets set to non_responsive by the engine:

2018-02-27 11:13:43,793+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-21) [] EVENT_ID: VDS_FAILURE(12), Host 10-37-160-177.rhev.lab.eng.brq.redhat.com is non responsive.

Fencing fails as well, since power management is not configured yet, and SSH soft fencing fails too since the engine VM cannot reach the host it is running on.
2018-02-27 11:13:44,469+01 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-23) [357e31ab] Running command: SshSoftFencingCommand internal: true. Entities affected : ID: e6b7744c-035b-4bea-8276-c6d4af7e420a Type: VDS

In the meantime hosted-engine-setup was still running, so we can get different errors depending on where exactly the engine lost the host.

Version-Release number of selected component (if applicable):
engine 4.2.1.7-1.el7.centos
vdsm 4.20.18-18.gitddba3f1.el7.centos

How reproducible:
We hit it more than once during hosted-engine node zero deployment; not systematic.

Steps to Reproduce:
1. Try to deploy hosted-engine with node zero.
2.
3.

Actual results:
The host goes Up in the engine and hosted-engine-setup starts using the engine to create the storage domain and so on, but then add_sourceroute kicks in and the host goes into non_responsive status.

Expected results:
The host goes Up in the engine's eyes and stays Up until the end of the deployment.

Additional info:
Not systematic.
Created attachment 1401390 [details] vdsm logs
Created attachment 1401391 [details] supervdsm logs
Created attachment 1401392 [details] hosted-engine-setup logs
Created attachment 1401393 [details] host-deploy-ansible logs
Created attachment 1401394 [details] host-deploy logs
Created attachment 1401395 [details] engine logs
I think it's due to the handling of DHCP responses. VDSM registers add_sourceroute to react to DHCP responses:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/network/initializer.py#L67

If the host already has a DHCP lease, the deployment will continue and the host will go up with that address. Then, once the DHCP response arrives (and the host could already be up in the engine's eyes at that time, if the DHCP server is slow enough), add_sourceroute will kick in, and this is exactly what happened here.

Indeed the DHCP response completed exactly when we lost the host:

Feb 27 11:12:23 10-37-160-177 dhclient[30837]: bound to 10.37.160.177 -- renewal in 1140207 seconds.

But the dhcp client process started 50 seconds before:

Feb 27 11:11:37 10-37-160-177 dhclient[30837]: DHCPDISCOVER on ovirtmgmt to 255.255.255.255 port 67 interval 6 (xid=0x54bcf2d5)
Feb 27 11:11:43 10-37-160-177 dhclient[30837]: DHCPDISCOVER on ovirtmgmt to 255.255.255.255 port 67 interval 10 (xid=0x54bcf2d5)
Feb 27 11:11:45 10-37-160-177 dhclient[30837]: DHCPREQUEST on ovirtmgmt to 255.255.255.255 port 67 (xid=0x54bcf2d5)
Feb 27 11:11:45 10-37-160-177 dhclient[30837]: DHCPOFFER from 10.37.160.253
Feb 27 11:11:45 10-37-160-177 dhclient[30837]: DHCPACK from 10.37.160.253 (xid=0x54bcf2d5)
Feb 27 11:12:23 10-37-160-177 dhclient[30837]: bound to 10.37.160.177 -- renewal in 1140207 seconds.
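For clarity, here is a minimal Python sketch of that flow. It is illustrative only, not VDSM's actual implementation: the table id and subnet are hard-coded from this deployment's logs just to show the shape of the rules, and the callback name is made up.

import subprocess


def add_sourceroute(ip, gateway, device="ovirtmgmt"):
    # Illustrative values taken from supervdsm.log; VDSM derives the routing
    # table id and the subnet from the actual device and DHCP lease.
    table = "170238129"
    subnet = "10.37.160.0/24"
    for cmd in (
        ["ip", "-4", "route", "add", "0.0.0.0/0", "via", gateway, "dev", device, "table", table],
        ["ip", "-4", "route", "add", subnet, "via", ip, "dev", device, "table", table],
        ["ip", "rule", "add", "from", subnet, "table", table],
        ["ip", "rule", "add", "from", "all", "to", subnet, "dev", device, "table", table],
    ):
        subprocess.check_call(cmd)


def on_dhcp_configured(ip, gateway):
    # Fired by the dhclient monitor for every newly bound (or renewed) lease.
    # Here the lease was bound at 11:12:23, 23 seconds after the host went Up,
    # so routing got rewired underneath an already-Up host.
    add_sourceroute(ip, gateway)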
Created attachment 1401416 [details] /var/log/messages from the host
How common is this? Would you specify the severity for you? Can you explain how hosted engine setup modifies host routing?
(In reply to Dan Kenigsberg from comment #9)
> How common is this?

Probably not that common, but we definitely hit it more than once.

> Would you specify the severity for you?

The nasty point is that the error the user gets is almost random, depending on when we lost the host in the middle of the deployment, and this can cause a lot of noise. I'd say high.

> Can you explain how hosted engine setup modifies host routing?

The local bootstrap VM runs over the default libvirt natted network. We start it with:

 virsh net-start default

and the routing rules look like:

[root@c74he20180108h1 ~]# ip route
default via 192.168.1.1 dev eth0 proto static metric 100
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.114 metric 100
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1
[root@c74he20180108h1 ~]# ip rule list
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

Then ovirt-hosted-engine-setup adds the host to the engine, the ovirtmgmt bridge gets created, SuperVdsm.ServerCallback::add_sourceroute gets called, and it disrupts the libvirt default natted network.

We detect it and we have a kind of workaround that fixes it, running:

 ip route add 192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 table 170238129

where 170238129 is the id of the table of ovirtmgmt. 'src 192.168.122.1' is enough to win over the rules added by add_sourceroute, so the engine VM is still able to talk with the engine.

The problem is when, as in this bug, add_sourceroute gets called after our workaround, disrupting it. add_sourceroute runs:

 ip -4 route add 0.0.0.0/0 via 10.37.160.254 dev ovirtmgmt table 170238129
 ip -4 route add 10.37.160.0/24 via 10.37.160.177 dev ovirtmgmt table 170238129
 ip rule add from 10.37.160.0/24 table 170238129
 ip rule add from all to 10.37.160.0/24 dev ovirtmgmt table 170238129
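For reference, a minimal Python sketch of that workaround step. This is a hypothetical helper, not the actual ovirt-hosted-engine-setup code; the table id here is just the value from this deployment, and in reality it has to be discovered at runtime.

import subprocess


def protect_libvirt_default_net(table_id="170238129",
                                subnet="192.168.122.0/24",
                                src="192.168.122.1",
                                device="virbr0"):
    # Re-add the libvirt default-network route inside the ovirtmgmt
    # source-routing table, so the local bootstrap engine VM can still reach
    # the host even when 'ip rule' entries divert traffic to that table.
    subprocess.check_call(
        ["ip", "route", "add", subnet, "dev", device,
         "proto", "kernel", "scope", "link", "src", src, "table", table_id])

The failure described above is purely an ordering problem: when add_sourceroute runs after this step, the workaround gets disrupted again.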
> ip rule add from 10.37.160.0/24 table 170238129
> ip rule add from all to 10.37.160.0/24 dev ovirtmgmt table 170238129

These are the rules that cause traffic to be routed based on the special table (170238129). If you need to bypass it, you just need to know which IP/subnet you need to 'protect' and add a high priority rule (a low pref/priority number) on the host, so it will be matched before the ones sourceroute adds:

ip rule add from 10.37.160.0/24 priority 100 table main
ip rule add from all to 10.37.160.0/24 dev ovirtmgmt priority 100 table main

Can you try it?
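In case it helps testing, a small Python sketch applying those two rules. This is a hypothetical helper; the subnet, device and priority are just the values from the example above.

import subprocess


def add_bypass_rules(subnet="10.37.160.0/24", device="ovirtmgmt",
                     priority="100"):
    # A low preference number means these rules are evaluated before the ones
    # added by sourceroute, so matching traffic keeps using the main table.
    subprocess.check_call(
        ["ip", "rule", "add", "from", subnet,
         "priority", priority, "table", "main"])
    subprocess.check_call(
        ["ip", "rule", "add", "from", "all", "to", subnet, "dev", device,
         "priority", priority, "table", "main"])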
(In reply to Edward Haas from comment #11)
> Can you try it?

Sure, thanks, do you think https://gerrit.ovirt.org/88295/ is enough?
(In reply to Simone Tiraboschi from comment #12)
> (In reply to Edward Haas from comment #11)
> > Can you try it?
>
> Sure, thanks, do you think https://gerrit.ovirt.org/88295/ is enough?

I don't think that change is valid: the command is not "ip route" but "ip rule", and the arguments are different, if I'm not mistaken.
Created attachment 1402414 [details] before add_sourceroute
Created attachment 1402415 [details] after add_sourceroute
*** Bug 1552027 has been marked as a duplicate of this bug. ***
*** Bug 1551289 has been marked as a duplicate of this bug. ***
Works for me on these components:
ovirt-hosted-engine-ha-2.2.7-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.13-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-861.el7.x86_64 #1 SMP Wed Mar 14 10:21:01 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Deployment worked fine over iSCSI storage.
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.