Bug 1185661
| Summary: | Hosted engine migration fails when using "Virtual Machines"->HE-VM->Migrate. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Nikolai Sednev <nsednev> | ||||
| Component: | ovirt-engine | Assignee: | Martin Sivák <msivak> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | |||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 3.5.0 | CC: | dfediuck, didi, ecohen, gklein, iheim, lpeer, lsurette, nsednev, rbalakri, rgolan, Rhev-m-bugs, sbonazzo, stirabos, yeylon, ylavi | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.5.0-1 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | sla | ||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2015-01-26 17:13:41 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1164311 | ||||||
| Attachments: |
|
||||||
[root@alma03 ~]# hosted-engine --vm-status
--== Host 1 status ==--
Status up-to-date : True
Hostname : alma03.qa.lab.tlv.redhat.com
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Score : 2400
Local maintenance : False
Host timestamp : 14669
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=14669 (Sun Jan 25 18:24:00 2015)
host-id=1
score=2400
maintenance=False
state=EngineUp
--== Host 2 status ==--
Status up-to-date : True
Hostname : alma04.qa.lab.tlv.redhat.com
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 2400
Local maintenance : False
Host timestamp : 4520
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4520 (Sun Jan 25 18:23:54 2015)
host-id=2
score=2400
maintenance=False
state=EngineDown
--== Host 4 status ==--
Status up-to-date : True
Hostname : brown-vdsc.qa.lab.tlv.redhat.com
Host ID : 4
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 2400
Local maintenance : False
Host timestamp : 4408
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4408 (Sun Jan 25 18:23:54 2015)
host-id=4
score=2400
maintenance=False
state=EngineDown
--== Host 6 status ==--
Status up-to-date : True
Hostname : blue-vdsc.qa.lab.tlv.redhat.com
Host ID : 6
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 2400
Local maintenance : False
Host timestamp : 4405
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4405 (Sun Jan 25 18:24:00 2015)
host-id=6
score=2400
maintenance=False
state=EngineDown
Created attachment 983970 [details]
logs
Looks like a duplicate of bug #1184129.
Didi, Simone, can you take a look?
I now looked at the machines.
alma03 was "preparing for maintenance"
I activated it
Then tried to migrate the vm to alma04
in alma04 vdsm.log:
Thread-2295::ERROR::2015-01-26 16:14:09,922::vm::2318::vm.Vm::(_startUnderlyingVm) vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to start a migration destination vm
Traceback (most recent call last):
File "/usr/share/vdsm/virt/vm.py", line 2291, in _startUnderlyingVm
self._completeIncomingMigration()
File "/usr/share/vdsm/virt/vm.py", line 4095, in _completeIncomingMigration
self._incomingMigrationFinished.isSet(), usedTimeout)
File "/usr/share/vdsm/virt/vm.py", line 4148, in _attachLibvirtDomainAfterMigration
raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86'
Tried to migrate it to blue-vdsc and it succeeded.
Seems like some local configuration issue.
Was the system (hosts, storage) clean before installation?
(In reply to Yedidyah Bar David from comment #4)
> I now looked at the machines.
> alma03 was "preparing for maintenance"
> I activated it
> Then tried to migrate the vm to alma04
> in alma04 vdsm.log:
>
> Thread-2295::ERROR::2015-01-26 16:14:09,922::vm::2318::vm.Vm::(_startUnderlyingVm) vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to start a migration destination vm
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/vm.py", line 2291, in _startUnderlyingVm
>     self._completeIncomingMigration()
>   File "/usr/share/vdsm/virt/vm.py", line 4095, in _completeIncomingMigration
>     self._incomingMigrationFinished.isSet(), usedTimeout)
>   File "/usr/share/vdsm/virt/vm.py", line 4148, in _attachLibvirtDomainAfterMigration
>     raise MigrationError(e.get_error_message())
> MigrationError: Domain not found: no domain with matching uuid '2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86'
>
> Tried to migrate it to blue-vdsc and it succeeded.
>
> Seems like some local configuration issue.
>
> Was the system (hosts, storage) clean before installation?
It was all a clean installation: all hosts were restarted after deployment, and the NFS share was cleared before deployment.
I've added logs from the whole setup to blue-vdsc.qa.lab.tlv.redhat.com:/root/1185661sosreport-LogCollector-20150125160853 (copy).tar.xz
Please review.
Might we have something in common with https://bugzilla.redhat.com/show_bug.cgi?id=923946 ?
I've checked; SELinux is enforcing on all of my hosts.
From the logs it's clear that the destination host is somehow resolved to 127.0.0.1. A few people have seen that before, and see it now as well.
-- alma03 vdsm.log
::starting migration to qemu+tls://alma04.qa.lab.tlv.redhat.com/system with miguri tcp://10.35.117.26
-- so far the name and IP are OK, but moving on:
Thread-3417::DEBUG::2015-01-25 17:20:43,230::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 10 level: 2 message: internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819
-- and libvirt.log looks weird: the XML block for the migration already has alma03 set as the target
2015-01-25 15:20:43.190+0000: 65166: debug : doPeer2PeerMigrate3:3718 : driver=0x7f5394164c90, sconn=0x7f5380001590, dconn=0x7f5388001960, dconnuri=qemu+tls://alma04.qa.lab.tlv.redhat.com/system, vm=0x7f53941ba7e0, xmlin=(null), dname=(null), uri=tcp://10.35.117.26, graphicsuri=(null), listenAddress=(null), bandwidth=32, useParams=1, flags=1003
-- the parameters look good, but the XML is wrong
2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationBeginPhase:2025 : driver=0x7f5394164c90, vm=0x7f53941ba7e0, xmlin=(null), dname=(null), cookieout=0x7f53a3c6f910, cookieoutlen=0x7f53a3c6f8f4, flags=1003
2015-01-25 15:20:43.190+0000: 65166: debug : qemuDomainObjSetJobPhase:919 : Setting 'migration out' phase to 'begin3'
2015-01-25 15:20:43.190+0000: 65166: debug : qemuProcessAutoDestroyActive:4762 : vm=HostedEngine
2015-01-25 15:20:43.190+0000: 65166: debug : virCloseCallbacksGet:197 : vm=HostedEngine, uuid=2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86, conn=(nil)
2015-01-25 15:20:43.190+0000: 65166: debug : virCloseCallbacksGet:207 : cb=(nil)
2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationEatCookie:1044 : cookielen=0 cookie='(null)'
2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationBakeCookie:1021 : cookielen=588 cookie=<qemu-migration>
<name>HostedEngine</name>
<uuid>2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86</uuid>
<hostname>alma03.qa.lab.tlv.redhat.com</hostname>
<hostuuid>87de1c88-a068-4dfd-bde0-70d6e17c4819</hostuuid>
<feature name='lockstate'/>
<lockstate driver='sanlock'>
<leases>d029dde8-5ad0-462d-a04f-1e5c992716d4:8e14d3ee-b1d4-46e5-a57b-91ed695a4457:/rhev/data-center/mnt/10.35.160.108\:_RHEV_nsednev__HE__3__5/d029dde8-5ad0-462d-a04f-1e5c992716d4/images/5ab421b3-da97-43ea-b484-6be39b3cfd62/8e14d3ee-b1d4-46e5-a57b-91ed695a4457.lease:0:4</leases>
</lockstate>
</qemu-migration>
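The `<qemu-migration>` cookie above carries the *source* host's `hostuuid`. The same-host failure can be illustrated with a small sketch (this is illustrative code, not vdsm or libvirt internals): if the destination host reports the same UUID that arrives in the cookie, libvirt aborts with "Attempt to migrate guest to the same host", which is exactly what happens when the libvirt configuration carrying `host_uuid` has been copied between hosts.

```python
# Illustrative sketch: compare the source hostuuid embedded in the
# qemu-migration cookie against the destination host's own UUID.
import xml.etree.ElementTree as ET

COOKIE = """<qemu-migration>
  <name>HostedEngine</name>
  <uuid>2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86</uuid>
  <hostname>alma03.qa.lab.tlv.redhat.com</hostname>
  <hostuuid>87de1c88-a068-4dfd-bde0-70d6e17c4819</hostuuid>
</qemu-migration>"""

def same_host(cookie_xml, dest_host_uuid):
    """True when the cookie's source hostuuid equals the destination
    host's UUID -- the 'migrate guest to the same host' condition."""
    src = ET.fromstring(cookie_xml).findtext("hostuuid")
    return src is not None and src.lower() == dest_host_uuid.lower()

# The destination wrongly carries alma03's UUID (copied config) -> True.
print(same_host(COOKIE, "87de1c88-a068-4dfd-bde0-70d6e17c4819"))  # True
# With a genuinely unique destination UUID the check passes -> False.
print(same_host(COOKIE, "11111111-2222-3333-4444-555555555555"))  # False
```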
I'm not sure "resolved" is the right term; the actual parameter is suddenly alma03 in the qemu XML snippet, as seen in comment 8.
I think the problem lies within /etc/libvirt/qemu.conf:
"# Override the listen address for all incoming migrations. Defaults to
# 0.0.0.0 or :: in case if both host and qemu are capable of IPv6.
#migration_address = "127.0.0.1"
"
You should never use 127.0.0.1 for IPv6.
All my hosts are capable of using IPv6.
[root@alma03 ~]# ifconfig
enp3s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::a236:9fff:fe3a:c4f0 prefixlen 64 scopeid 0x20<link>
ether a0:36:9f:3a:c4:f0 txqueuelen 1000 (Ethernet)
RX packets 6676070 bytes 9046046102 (8.4 GiB)
RX errors 0 dropped 334 overruns 0 frame 0
TX packets 3639434 bytes 3646678067 (3.3 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 88786 bytes 55678971 (53.0 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 88786 bytes 55678971 (53.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
rhevm: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.35.117.24 netmask 255.255.255.0 broadcast 10.35.117.255
inet6 fe80::a236:9fff:fe3a:c4f0 prefixlen 64 scopeid 0x20<link>
ether a0:36:9f:3a:c4:f0 txqueuelen 0 (Ethernet)
RX packets 1731099 bytes 8666120265 (8.0 GiB)
RX errors 0 dropped 20 overruns 0 frame 0
TX packets 1394741 bytes 3470484930 (3.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vnet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::fc16:3eff:fe7b:b853 prefixlen 64 scopeid 0x20<link>
ether fe:16:3e:7b:b8:53 txqueuelen 500 (Ethernet)
RX packets 17764 bytes 3757405 (3.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 34303 bytes 12319812 (11.7 MiB)
TX errors 0 dropped 421 overruns 0 carrier 0 collisions 0
less /var/log/messages
Jan 26 16:14:09 alma03 journal: vdsm vm.Vm ERROR vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819
Jan 26 16:14:09 alma03 journal: vdsm vm.Vm ERROR vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to migrate
Traceback (most recent call last):
File "/usr/share/vdsm/virt/migration.py", line 246, in run
self._startUnderlyingMigration(time.time())
File "/usr/share/vdsm/virt/migration.py", line 325, in _startUnderlyingMigration
None, maxBandwidth)
File "/usr/share/vdsm/virt/vm.py", line 689, in f
ret = attr(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1264, in migrateToURI2
if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
I would also think about not using XML files as a communication protocol that tells the destination, in direct form, which IP to connect to, unless it is first checked against the host's TCP stack, so that routing and resolution decisions stay pinned to the relevant components of the system, such as DNS and the TCP stack.

Also, it might be that when I copied the libvirt.conf file from alma03 to all the other hosts in my setup (in order to enable libvirt logging), its host_uuid was copied to all the other hosts as well. In that case, during migration libvirt on alma03 chose itself, because it received its own host_uuid back from the other hosts. If so, this is my misconfiguration, and I need to set a unique host_uuid on each host.

According to the libvirt log file you attached:

2015-01-25 15:23:31.177+0000: 65164: error : virNetClientProgramDispatchError:175 : internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819

Closing this issue. It was caused by copying the same host_uuid to all hosts in the setup (via the libvirt.conf taken from alma03). This created a migration loop: when I tried to migrate the VM from alma03 to any other host, they all returned the same host_uuid as alma03, causing it to try to migrate to itself (127.0.0.1). I'll be happy to close this bug as works-for-me now. If possible, it would be useful to add a check that, when the destination host_uuid equals the source's own, triggers an appropriate warning to the customer and cancels the migration at once.
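The misconfiguration described in the closing comment can be spotted up front by comparing the libvirt host UUIDs across the hosts. The sketch below is a hypothetical admin helper, not part of RHEV-M or vdsm; the hostname-to-UUID mapping is hard-coded for illustration (on real hosts the UUID would come from e.g. the host_uuid setting or `virsh` output), and the UUID assignments mirror the scenario in this bug, where migration to blue-vdsc succeeded while the other hosts shared one UUID.

```python
# Hypothetical helper: flag libvirt host UUIDs shared by more than one
# host, which makes migration between those hosts fail with
# "Attempt to migrate guest to the same host".
from collections import defaultdict

def find_duplicate_host_uuids(host_uuids):
    """host_uuids: dict of hostname -> libvirt host UUID.
    Returns {uuid: [hostnames]} for every UUID used by >1 host."""
    by_uuid = defaultdict(list)
    for host, uuid in host_uuids.items():
        by_uuid[uuid.lower()].append(host)
    return {u: sorted(h) for u, h in by_uuid.items() if len(h) > 1}

# Illustrative mapping matching this bug's setup (UUIDs assumed):
hosts = {
    "alma03.qa.lab.tlv.redhat.com": "87de1c88-a068-4dfd-bde0-70d6e17c4819",
    "alma04.qa.lab.tlv.redhat.com": "87de1c88-a068-4dfd-bde0-70d6e17c4819",
    "brown-vdsc.qa.lab.tlv.redhat.com": "87de1c88-a068-4dfd-bde0-70d6e17c4819",
    "blue-vdsc.qa.lab.tlv.redhat.com": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
}

dupes = find_duplicate_host_uuids(hosts)
for uuid, names in dupes.items():
    print("WARNING: host_uuid %s is shared by: %s" % (uuid, ", ".join(names)))
```

A pre-migration check in the engine could call the same comparison and cancel the migration with a warning instead of letting libvirt fail late with the same-host error.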
Description of problem:
Hosted engine migration fails when using "Virtual Machines"->HE-VM->Migrate.

2015-01-25 15:11:39,523 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler_Worker-29) Exception during connection
2015-01-25 15:11:39,523 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-29) Command GetCapabilitiesVDSCommand(HostName = blue-vdsc.qa.lab.tlv.redhat.com, HostId = 84de041b-e0ad-4ba5-bb91-c645f9ccbaf3, vds=Host[blue-vdsc.qa.lab.tlv.redhat.com,84de041b-e0ad-4ba5-bb91-c645f9ccbaf3]) execution failed. Exception: VDSNetworkException: java.rmi.ConnectException: Connection timeout
2015-01-25 15:11:39,523 WARN [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
2015-01-25 15:11:39,523 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-29) Host blue-vdsc.qa.lab.tlv.redhat.com is not responding. It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued.
2015-01-25 15:11:39,533 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-29) Failure to refresh Vds runtime info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.rmi.ConnectException: Connection timeout
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:126) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:571) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:648) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source) [:1.7.0_75]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: java.rmi.ConnectException: Connection timeout
    at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient$1.call(ReactorClient.java:113) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient$1.call(ReactorClient.java:97) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable.call(Retryable.java:26) [vdsm-jsonrpc-java-client.jar:]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_75]
    at org.ovirt.vdsm.jsonrpc.client.utils.ReactorScheduler.performPendingOperations(ReactorScheduler.java:28) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:60) [vdsm-jsonrpc-java-client.jar:]
2015-01-25 15:11:42,541 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to blue-vdsc.qa.lab.tlv.redhat.com/10.35.103.12
2015-01-25 15:11:43,673 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) Retry failed

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install a fresh NFS hosted engine environment using all RHEL 7.0 hosts and a RHEL 6.6 HE VM.
2. Have at least 4 guests.
3. Try migrating the HE from the currently running host to another random or manually chosen host.

Actual results:
Migration fails with error "2015-Jan-25, 15:23 Migration failed due to Error: Fatal error during migration (VM: HostedEngine, Source: alma03.qa.lab.tlv.redhat.com, Destination: alma04.qa.lab.tlv.redhat.com)."

Expected results:
Migration should succeed.
Additional info:
rhevm-3.5.0-0.30.el6ev.noarch
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
Linux version 2.6.32-504.3.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC) ) #1 SMP Fri Dec 12 16:05:43 EST 2014
qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64
libvirt-client-1.1.1-29.el7_0.5.x86_64
vdsm-4.16.8.1-6.el7ev.x86_64
sanlock-3.1.0-2.el7.x86_64
ovirt-hosted-engine-ha-1.2.4-5.el7ev.noarch
mom-0.4.1-4.el7ev.noarch
ovirt-hosted-engine-setup-1.2.1-9.el7ev.noarch

I've used these 4 hypervisors:
alma04.qa.lab.tlv.redhat.com
alma03.qa.lab.tlv.redhat.com
brown-vdsc.qa.lab.tlv.redhat.com
blue-vdsc.qa.lab.tlv.redhat.com

Tried to migrate alma03.qa.lab.tlv.redhat.com->alma04.qa.lab.tlv.redhat.com and also to any random hypervisor.