Description of problem:

Hosted-engine migration fails when using "Virtual Machines" -> HE VM -> Migrate.

2015-01-25 15:11:39,523 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler_Worker-29) Exception during connection
2015-01-25 15:11:39,523 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-29) Command GetCapabilitiesVDSCommand(HostName = blue-vdsc.qa.lab.tlv.redhat.com, HostId = 84de041b-e0ad-4ba5-bb91-c645f9ccbaf3, vds=Host[blue-vdsc.qa.lab.tlv.redhat.com,84de041b-e0ad-4ba5-bb91-c645f9ccbaf3]) execution failed. Exception: VDSNetworkException: java.rmi.ConnectException: Connection timeout
2015-01-25 15:11:39,523 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
2015-01-25 15:11:39,523 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-29) Host blue-vdsc.qa.lab.tlv.redhat.com is not responding. It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued.
2015-01-25 15:11:39,533 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-29) Failure to refresh Vds runtime info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.rmi.ConnectException: Connection timeout
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:126) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:571) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:648) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source) [:1.7.0_75]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: java.rmi.ConnectException: Connection timeout
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient$1.call(ReactorClient.java:113) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient$1.call(ReactorClient.java:97) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable.call(Retryable.java:26) [vdsm-jsonrpc-java-client.jar:]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_75]
        at org.ovirt.vdsm.jsonrpc.client.utils.ReactorScheduler.performPendingOperations(ReactorScheduler.java:28) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:60) [vdsm-jsonrpc-java-client.jar:]
2015-01-25 15:11:42,541 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to blue-vdsc.qa.lab.tlv.redhat.com/10.35.103.12
2015-01-25 15:11:43,673 WARN  [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) Retry failed

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install a fresh NFS hosted-engine environment using all-RHEL 7.0 hosts and a RHEL 6.6 HE VM.
2. Have at least 4 guests.
3. Try migrating the HE VM from the host it is currently running on to another random or manually chosen host.

Actual results:
Migration fails with the error:
"2015-Jan-25, 15:23 Migration failed due to Error: Fatal error during migration (VM: HostedEngine, Source: alma03.qa.lab.tlv.redhat.com, Destination: alma04.qa.lab.tlv.redhat.com)."

Expected results:
Migration should succeed.

Additional info:
rhevm-3.5.0-0.30.el6ev.noarch
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
Linux version 2.6.32-504.3.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC) ) #1 SMP Fri Dec 12 16:05:43 EST 2014
qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64
libvirt-client-1.1.1-29.el7_0.5.x86_64
vdsm-4.16.8.1-6.el7ev.x86_64
sanlock-3.1.0-2.el7.x86_64
ovirt-hosted-engine-ha-1.2.4-5.el7ev.noarch
mom-0.4.1-4.el7ev.noarch
ovirt-hosted-engine-setup-1.2.1-9.el7ev.noarch

I've used these 4 hypervisors:
alma04.qa.lab.tlv.redhat.com
alma03.qa.lab.tlv.redhat.com
brown-vdsc.qa.lab.tlv.redhat.com
blue-vdsc.qa.lab.tlv.redhat.com

Tried to migrate alma03.qa.lab.tlv.redhat.com -> alma04.qa.lab.tlv.redhat.com, and also to any random hypervisor.
[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 14669
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=14669 (Sun Jan 25 18:24:00 2015)
        host-id=1
        score=2400
        maintenance=False
        state=EngineUp

--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 4520
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=4520 (Sun Jan 25 18:23:54 2015)
        host-id=2
        score=2400
        maintenance=False
        state=EngineDown

--== Host 4 status ==--

Status up-to-date                  : True
Hostname                           : brown-vdsc.qa.lab.tlv.redhat.com
Host ID                            : 4
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 4408
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=4408 (Sun Jan 25 18:23:54 2015)
        host-id=4
        score=2400
        maintenance=False
        state=EngineDown

--== Host 6 status ==--

Status up-to-date                  : True
Hostname                           : blue-vdsc.qa.lab.tlv.redhat.com
Host ID                            : 6
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 4405
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=4405 (Sun Jan 25 18:24:00 2015)
        host-id=6
        score=2400
        maintenance=False
        state=EngineDown
Created attachment 983970 [details] logs
Looks like a duplicate of bug #1184129.

Didi, Simone, can you take a look?
I now looked at the machines.

alma03 was "preparing for maintenance"; I activated it, then tried to migrate the VM to alma04.

In alma04's vdsm.log:

Thread-2295::ERROR::2015-01-26 16:14:09,922::vm::2318::vm.Vm::(_startUnderlyingVm) vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2291, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 4095, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 4148, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86'

I then tried to migrate it to blue-vdsc, and it succeeded.

Seems like some local configuration issue.

Was the system (hosts, storage) clean before installation?
(In reply to Yedidyah Bar David from comment #4)
> I now looked at the machines.
> alma03 was "preparing for maintenance"
> I activated it
> Then tried to migrate the vm to alma04
> in alma04 vdsm.log:
>
> Thread-2295::ERROR::2015-01-26
> 16:14:09,922::vm::2318::vm.Vm::(_startUnderlyingVm)
> vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to start a migration
> destination vm
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/vm.py", line 2291, in _startUnderlyingVm
>     self._completeIncomingMigration()
>   File "/usr/share/vdsm/virt/vm.py", line 4095, in _completeIncomingMigration
>     self._incomingMigrationFinished.isSet(), usedTimeout)
>   File "/usr/share/vdsm/virt/vm.py", line 4148, in
> _attachLibvirtDomainAfterMigration
>     raise MigrationError(e.get_error_message())
> MigrationError: Domain not found: no domain with matching uuid
> '2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86'
>
> Tried to migrate it to blue-vdsc and it succeeded.
>
> Seems like some local configuration issue.
>
> Was the system (hosts, storage) clean before installation?

It was all a clean installation: all hosts were restarted after deployment, and the NFS share was cleared before deployment.
I've added logs from the whole setup to blue-vdsc.qa.lab.tlv.redhat.com:/root/1185661sosreport-LogCollector-20150125160853 (copy).tar.xz

Please review.
Might we have something in common with https://bugzilla.redhat.com/show_bug.cgi?id=923946 ?

I've checked: SELinux is enforcing on all of my hosts.
From the logs it's clear that the destination host is somehow resolved to 127.0.0.1. A few people have seen that before, and now as well.

alma03 vdsm.log:

::starting migration to qemu+tls://alma04.qa.lab.tlv.redhat.com/system with miguri tcp://10.35.117.26

So far the name and IP are OK, but moving on:

Thread-3417::DEBUG::2015-01-25 17:20:43,230::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 10 level: 2 message: internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819

And libvirt.log looks weird - the XML block for the migration already has alma03 set as the target:

2015-01-25 15:20:43.190+0000: 65166: debug : doPeer2PeerMigrate3:3718 : driver=0x7f5394164c90, sconn=0x7f5380001590, dconn=0x7f5388001960, dconnuri=qemu+tls://alma04.qa.lab.tlv.redhat.com/system, vm=0x7f53941ba7e0, xmlin=(null), dname=(null), uri=tcp://10.35.117.26, graphicsuri=(null), listenAddress=(null), bandwidth=32, useParams=1, flags=1003

The parameters look good, but they are wrong in the XML:

2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationBeginPhase:2025 : driver=0x7f5394164c90, vm=0x7f53941ba7e0, xmlin=(null), dname=(null), cookieout=0x7f53a3c6f910, cookieoutlen=0x7f53a3c6f8f4, flags=1003
2015-01-25 15:20:43.190+0000: 65166: debug : qemuDomainObjSetJobPhase:919 : Setting 'migration out' phase to 'begin3'
2015-01-25 15:20:43.190+0000: 65166: debug : qemuProcessAutoDestroyActive:4762 : vm=HostedEngine
2015-01-25 15:20:43.190+0000: 65166: debug : virCloseCallbacksGet:197 : vm=HostedEngine, uuid=2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86, conn=(nil)
2015-01-25 15:20:43.190+0000: 65166: debug : virCloseCallbacksGet:207 : cb=(nil)
2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationEatCookie:1044 : cookielen=0 cookie='(null)'
2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationBakeCookie:1021 : cookielen=588 cookie=<qemu-migration>
  <name>HostedEngine</name>
  <uuid>2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86</uuid>
  <hostname>alma03.qa.lab.tlv.redhat.com</hostname>
  <hostuuid>87de1c88-a068-4dfd-bde0-70d6e17c4819</hostuuid>
  <feature name='lockstate'/>
  <lockstate driver='sanlock'>
    <leases>d029dde8-5ad0-462d-a04f-1e5c992716d4:8e14d3ee-b1d4-46e5-a57b-91ed695a4457:/rhev/data-center/mnt/10.35.160.108\:_RHEV_nsednev__HE__3__5/d029dde8-5ad0-462d-a04f-1e5c992716d4/images/5ab421b3-da97-43ea-b484-6be39b3cfd62/8e14d3ee-b1d4-46e5-a57b-91ed695a4457.lease:0:4</leases>
  </lockstate>
</qemu-migration>
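For reference, the <hostuuid> element in that cookie is the value being compared when libvirt rejects the migration with "Attempt to migrate guest to the same host". A minimal sketch of pulling the source identity out of the logged cookie for inspection (the cookie text is pasted here as a literal; in practice you would copy it out of libvirt.log):

import xml.etree.ElementTree as ET

# Cookie excerpt copied from the libvirt debug log above.
cookie = """<qemu-migration>
  <name>HostedEngine</name>
  <uuid>2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86</uuid>
  <hostname>alma03.qa.lab.tlv.redhat.com</hostname>
  <hostuuid>87de1c88-a068-4dfd-bde0-70d6e17c4819</hostuuid>
</qemu-migration>"""

root = ET.fromstring(cookie)
# These are the source host's identity as baked into the migration cookie.
print("source hostname:", root.findtext("hostname"))
print("source hostuuid:", root.findtext("hostuuid"))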
I'm not sure "resolved" is the right term. the actual parameter is suddenly alma03 in qemu xml snip as see in the comment 8
I think the problem lies within /etc/libvirt/qemu.conf:

# Override the listen address for all incoming migrations. Defaults to
# 0.0.0.0 or :: in case if both host and qemu are capable of IPv6.
#migration_address = "127.0.0.1"

You should never use 127.0.0.1 for IPv6, and all of my hosts are capable of using IPv6.

[root@alma03 ~]# ifconfig
enp3s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::a236:9fff:fe3a:c4f0  prefixlen 64  scopeid 0x20<link>
        ether a0:36:9f:3a:c4:f0  txqueuelen 1000  (Ethernet)
        RX packets 6676070  bytes 9046046102 (8.4 GiB)
        RX errors 0  dropped 334  overruns 0  frame 0
        TX packets 3639434  bytes 3646678067 (3.3 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 88786  bytes 55678971 (53.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 88786  bytes 55678971 (53.0 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

rhevm: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.35.117.24  netmask 255.255.255.0  broadcast 10.35.117.255
        inet6 fe80::a236:9fff:fe3a:c4f0  prefixlen 64  scopeid 0x20<link>
        ether a0:36:9f:3a:c4:f0  txqueuelen 0  (Ethernet)
        RX packets 1731099  bytes 8666120265 (8.0 GiB)
        RX errors 0  dropped 20  overruns 0  frame 0
        TX packets 1394741  bytes 3470484930 (3.2 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

vnet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::fc16:3eff:fe7b:b853  prefixlen 64  scopeid 0x20<link>
        ether fe:16:3e:7b:b8:53  txqueuelen 500  (Ethernet)
        RX packets 17764  bytes 3757405 (3.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 34303  bytes 12319812 (11.7 MiB)
        TX errors 0  dropped 421  overruns 0  carrier 0  collisions 0

less /var/log/messages:

Jan 26 16:14:09 alma03 journal: vdsm vm.Vm ERROR vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819
Jan 26 16:14:09 alma03 journal: vdsm vm.Vm ERROR vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 246, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 325, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1264, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
I would also suggest not using XML files as a communication protocol that directly tells the destination which IP to connect to, unless the address is first checked against the host's TCP stack, so that routing and resolution decisions stay pinned to the relevant components of the system, such as DNS and the TCP stack. A sketch of such a check follows.
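A minimal sketch of that kind of pre-flight sanity check - resolve the destination through the host's own resolver and refuse to proceed if it only maps to loopback. This is a hypothetical helper, not actual vdsm code:

import ipaddress
import socket

def resolves_to_remote_address(hostname):
    """Return True if hostname resolves to at least one non-loopback IP."""
    infos = socket.getaddrinfo(hostname, None)
    addresses = {info[4][0] for info in infos}
    return any(not ipaddress.ip_address(a).is_loopback for a in addresses)

# Example: the destination host from this report.
if not resolves_to_remote_address("alma04.qa.lab.tlv.redhat.com"):
    raise RuntimeError("destination resolves to loopback - refusing to migrate")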
It might also be that when I copied the libvirt.conf file from alma03 to all the other hosts in my setup, in order to enable libvirt logging, its host_uuid was copied to all the other hosts as well. In that case, during migration, libvirt on alma03 chose itself as the destination because it received its own host_uuid back from the other hosts. If so, it is my misconfiguration, and I need to change host_uuid on each host to a unique value.
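One way to verify this theory would be to compare the host UUID each hypervisor reports via its libvirt capabilities. A minimal sketch using the libvirt Python bindings (the host list is taken from this report; the qemu+tls URI scheme is an assumption - adjust to however your hosts accept connections):

import xml.etree.ElementTree as ET
import libvirt

hosts = [
    "alma03.qa.lab.tlv.redhat.com",
    "alma04.qa.lab.tlv.redhat.com",
    "brown-vdsc.qa.lab.tlv.redhat.com",
    "blue-vdsc.qa.lab.tlv.redhat.com",
]

seen = {}
for host in hosts:
    conn = libvirt.openReadOnly("qemu+tls://%s/system" % host)
    # The capabilities XML carries the host UUID under <host><uuid>.
    caps = ET.fromstring(conn.getCapabilities())
    uuid = caps.findtext("host/uuid")
    conn.close()
    if uuid in seen:
        print("DUPLICATE: %s and %s share host UUID %s" % (seen[uuid], host, uuid))
    else:
        seen[uuid] = host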
According to the libvirt log file you attached:

2015-01-25 15:23:31.177+0000: 65164: error : virNetClientProgramDispatchError:175 : internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819

Closing this issue.
The issue was caused by copying the same host_uuid to all hosts in the setup, in the form of the libvirt.conf copied from alma03. This caused a migration loop: when I tried to migrate the VM from alma03 to any other host, they all returned the same host_uuid as the initial alma03, causing it to try migrating to itself (127.0.0.1). I'll be happy to close this bug as works-for-me now, and, if possible, to add some code which checks whether the destination host_uuid equals the source's own, triggers an appropriate warning to the customer, and cancels the migration at once.
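A minimal sketch of the guard suggested above, under the assumption that the source can query the destination's capabilities before starting. The function names are hypothetical; this is not actual engine/vdsm code:

import xml.etree.ElementTree as ET
import libvirt

def host_uuid(conn):
    """Host UUID as reported in the libvirt capabilities XML."""
    return ET.fromstring(conn.getCapabilities()).findtext("host/uuid")

def check_migration_target(dest_hostname):
    # Compare our own host UUID with the destination's before migrating;
    # identical UUIDs indicate a duplicated host_uuid configuration.
    local = libvirt.open("qemu:///system")
    remote = libvirt.openReadOnly("qemu+tls://%s/system" % dest_hostname)
    try:
        if host_uuid(local) == host_uuid(remote):
            raise RuntimeError(
                "destination %s reports the same host UUID as the source - "
                "duplicated host_uuid? Cancelling migration." % dest_hostname)
    finally:
        local.close()
        remote.close()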