Bug 1185661
| Summary: | Hosted engine migration fails when using "Virtual Machines"->HE-VM->Migrate. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Nikolai Sednev <nsednev> | ||||
| Component: | ovirt-engine | Assignee: | Martin Sivák <msivak> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | |||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 3.5.0 | CC: | dfediuck, didi, ecohen, gklein, iheim, lpeer, lsurette, nsednev, rbalakri, rgolan, Rhev-m-bugs, sbonazzo, stirabos, yeylon, ylavi | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.5.0-1 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | sla | ||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2015-01-26 17:13:41 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1164311 | ||||||
| Attachments: |
|
||||||
[root@alma03 ~]# hosted-engine --vm-status
--== Host 1 status ==--
Status up-to-date : True
Hostname : alma03.qa.lab.tlv.redhat.com
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Score : 2400
Local maintenance : False
Host timestamp : 14669
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=14669 (Sun Jan 25 18:24:00 2015)
host-id=1
score=2400
maintenance=False
state=EngineUp
--== Host 2 status ==--
Status up-to-date : True
Hostname : alma04.qa.lab.tlv.redhat.com
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 2400
Local maintenance : False
Host timestamp : 4520
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4520 (Sun Jan 25 18:23:54 2015)
host-id=2
score=2400
maintenance=False
state=EngineDown
--== Host 4 status ==--
Status up-to-date : True
Hostname : brown-vdsc.qa.lab.tlv.redhat.com
Host ID : 4
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 2400
Local maintenance : False
Host timestamp : 4408
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4408 (Sun Jan 25 18:23:54 2015)
host-id=4
score=2400
maintenance=False
state=EngineDown
--== Host 6 status ==--
Status up-to-date : True
Hostname : blue-vdsc.qa.lab.tlv.redhat.com
Host ID : 6
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 2400
Local maintenance : False
Host timestamp : 4405
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4405 (Sun Jan 25 18:24:00 2015)
host-id=6
score=2400
maintenance=False
state=EngineDown
Created attachment 983970 [details]
logs
Looks like a duplicate of bug #1184129.
Didi, Simone, can you take a look?
I now looked at the machines.
alma03 was "preparing for maintenance"
I activated it
Then tried to migrate the vm to alma04
in alma04 vdsm.log:
Thread-2295::ERROR::2015-01-26 16:14:09,922::vm::2318::vm.Vm::(_startUnderlyingVm) vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to start a migration destination vm
Traceback (most recent call last):
File "/usr/share/vdsm/virt/vm.py", line 2291, in _startUnderlyingVm
self._completeIncomingMigration()
File "/usr/share/vdsm/virt/vm.py", line 4095, in _completeIncomingMigration
self._incomingMigrationFinished.isSet(), usedTimeout)
File "/usr/share/vdsm/virt/vm.py", line 4148, in _attachLibvirtDomainAfterMigration
raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86'
Tried to migrate it to blue-vdsc and it succeeded.
Seems like some local configuration issue.
Was the system (hosts, storage) clean before installation?
(In reply to Yedidyah Bar David from comment #4)
> I now looked at the machines.
> alma03 was "preparing for maintenance"
> I activated it
> Then tried to migrate the vm to alma04
> in alma04 vdsm.log:
>
> Thread-2295::ERROR::2015-01-26 16:14:09,922::vm::2318::vm.Vm::(_startUnderlyingVm) vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to start a migration destination vm
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/vm.py", line 2291, in _startUnderlyingVm
>     self._completeIncomingMigration()
>   File "/usr/share/vdsm/virt/vm.py", line 4095, in _completeIncomingMigration
>     self._incomingMigrationFinished.isSet(), usedTimeout)
>   File "/usr/share/vdsm/virt/vm.py", line 4148, in _attachLibvirtDomainAfterMigration
>     raise MigrationError(e.get_error_message())
> MigrationError: Domain not found: no domain with matching uuid '2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86'
>
> Tried to migrate it to blue-vdsc and it succeeded.
>
> Seems like some local configuration issue.
>
> Was the system (hosts, storage) clean before installation?
It was all a clean installation: all hosts were restarted after deployment, and the NFS share was cleared before deployment.
I've added logs from the whole setup to blue-vdsc.qa.lab.tlv.redhat.com:/root/1185661sosreport-LogCollector-20150125160853 (copy).tar.xz
Please review.
Might we have something in common with https://bugzilla.redhat.com/show_bug.cgi?id=923946 ?
I've checked; SELinux is enforcing on all of my hosts.
From the logs it's clear that the destination host is somehow resolved to 127.0.0.1. A few people have seen that before, and see it now as well.
-- alma03 vdsm.log
::starting migration to qemu+tls://alma04.qa.lab.tlv.redhat.com/system with miguri tcp://10.35.117.26
-- so far the name and IP are OK, but moving on:
Thread-3417::DEBUG::2015-01-25 17:20:43,230::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 10 level: 2 message: internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819
-- and libvirt.log looks weird: the XML block for the migration already has alma03 set as the target
2015-01-25 15:20:43.190+0000: 65166: debug : doPeer2PeerMigrate3:3718 : driver=0x7f5394164c90, sconn=0x7f5380001590, dconn=0x7f5388001960, dconnuri=qemu+tls://alma04.qa.lab.tlv.redhat.com/system, vm=0x7f53941ba7e0, xmlin=(null), dname=(null), uri=tcp://10.35.117.26, graphicsuri=(null), listenAddress=(null), bandwidth=32, useParams=1, flags=1003
-- the parameters look good, but the XML is wrong
2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationBeginPhase:2025 : driver=0x7f5394164c90, vm=0x7f53941ba7e0, xmlin=(null), dname=(null), cookieout=0x7f53a3c6f910, cookieoutlen=0x7f53a3c6f8f4, flags=1003
2015-01-25 15:20:43.190+0000: 65166: debug : qemuDomainObjSetJobPhase:919 : Setting 'migration out' phase to 'begin3'
2015-01-25 15:20:43.190+0000: 65166: debug : qemuProcessAutoDestroyActive:4762 : vm=HostedEngine
2015-01-25 15:20:43.190+0000: 65166: debug : virCloseCallbacksGet:197 : vm=HostedEngine, uuid=2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86, conn=(nil)
2015-01-25 15:20:43.190+0000: 65166: debug : virCloseCallbacksGet:207 : cb=(nil)
2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationEatCookie:1044 : cookielen=0 cookie='(null)'
2015-01-25 15:20:43.190+0000: 65166: debug : qemuMigrationBakeCookie:1021 : cookielen=588 cookie=<qemu-migration>
<name>HostedEngine</name>
<uuid>2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86</uuid>
<hostname>alma03.qa.lab.tlv.redhat.com</hostname>
<hostuuid>87de1c88-a068-4dfd-bde0-70d6e17c4819</hostuuid>
<feature name='lockstate'/>
<lockstate driver='sanlock'>
<leases>d029dde8-5ad0-462d-a04f-1e5c992716d4:8e14d3ee-b1d4-46e5-a57b-91ed695a4457:/rhev/data-center/mnt/10.35.160.108\:_RHEV_nsednev__HE__3__5/d029dde8-5ad0-462d-a04f-1e5c992716d4/images/5ab421b3-da97-43ea-b484-6be39b3cfd62/8e14d3ee-b1d4-46e5-a57b-91ed695a4457.lease:0:4</leases>
</lockstate>
</qemu-migration>
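The `<qemu-migration>` cookie above carries the *source* host's `hostuuid`. The same-host failure can be illustrated with a small sketch (this is illustrative code, not vdsm or libvirt internals): if the destination host reports the same UUID that arrives in the cookie, libvirt aborts with "Attempt to migrate guest to the same host", which is exactly what happens when the libvirt configuration carrying `host_uuid` has been copied between hosts.

```python
# Illustrative sketch: compare the source hostuuid embedded in the
# qemu-migration cookie against the destination host's own UUID.
import xml.etree.ElementTree as ET

COOKIE = """<qemu-migration>
  <name>HostedEngine</name>
  <uuid>2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86</uuid>
  <hostname>alma03.qa.lab.tlv.redhat.com</hostname>
  <hostuuid>87de1c88-a068-4dfd-bde0-70d6e17c4819</hostuuid>
</qemu-migration>"""

def same_host(cookie_xml, dest_host_uuid):
    """True when the cookie's source hostuuid equals the destination
    host's UUID -- the 'migrate guest to the same host' condition."""
    src = ET.fromstring(cookie_xml).findtext("hostuuid")
    return src is not None and src.lower() == dest_host_uuid.lower()

# The destination wrongly carries alma03's UUID (copied config) -> True.
print(same_host(COOKIE, "87de1c88-a068-4dfd-bde0-70d6e17c4819"))  # True
# With a genuinely unique destination UUID the check passes -> False.
print(same_host(COOKIE, "11111111-2222-3333-4444-555555555555"))  # False
```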
I'm not sure "resolved" is the right term; the actual parameter is suddenly alma03 in the qemu XML snippet, as seen in comment 8.
I think the problem lies within /etc/libvirt/qemu.conf:
"# Override the listen address for all incoming migrations. Defaults to
# 0.0.0.0 or :: in case if both host and qemu are capable of IPv6.
#migration_address = "127.0.0.1"
"
You should never use 127.0.0.1 for IPv6.
All my hosts are capable of using IPv6.
[root@alma03 ~]# ifconfig
enp3s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::a236:9fff:fe3a:c4f0 prefixlen 64 scopeid 0x20<link>
ether a0:36:9f:3a:c4:f0 txqueuelen 1000 (Ethernet)
RX packets 6676070 bytes 9046046102 (8.4 GiB)
RX errors 0 dropped 334 overruns 0 frame 0
TX packets 3639434 bytes 3646678067 (3.3 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 88786 bytes 55678971 (53.0 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 88786 bytes 55678971 (53.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
rhevm: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.35.117.24 netmask 255.255.255.0 broadcast 10.35.117.255
inet6 fe80::a236:9fff:fe3a:c4f0 prefixlen 64 scopeid 0x20<link>
ether a0:36:9f:3a:c4:f0 txqueuelen 0 (Ethernet)
RX packets 1731099 bytes 8666120265 (8.0 GiB)
RX errors 0 dropped 20 overruns 0 frame 0
TX packets 1394741 bytes 3470484930 (3.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vnet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::fc16:3eff:fe7b:b853 prefixlen 64 scopeid 0x20<link>
ether fe:16:3e:7b:b8:53 txqueuelen 500 (Ethernet)
RX packets 17764 bytes 3757405 (3.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 34303 bytes 12319812 (11.7 MiB)
TX errors 0 dropped 421 overruns 0 carrier 0 collisions 0
less /var/log/messages
Jan 26 16:14:09 alma03 journal: vdsm vm.Vm ERROR vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819
Jan 26 16:14:09 alma03 journal: vdsm vm.Vm ERROR vmId=`2c9fdbde-bb68-4f39-8acc-fdc3d4cfcf86`::Failed to migrate
Traceback (most recent call last):
File "/usr/share/vdsm/virt/migration.py", line 246, in run
self._startUnderlyingMigration(time.time())
File "/usr/share/vdsm/virt/migration.py", line 325, in _startUnderlyingMigration
None, maxBandwidth)
File "/usr/share/vdsm/virt/vm.py", line 689, in f
ret = attr(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1264, in migrateToURI2
if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
I would also think about not using XML files as a communication protocol that tells the destination, in direct form, which IP to connect to, unless it is first checked against the host's TCP stack, so that routing and resolution decisions stay pinned to the relevant components of the system, such as DNS and the TCP stack.

Also, it might be that when I copied the libvirt.conf file from alma03 to all the other hosts in my setup (in order to enable libvirt logging), its host_uuid was copied to all the other hosts as well. In that case, during migration libvirt on alma03 chose itself, because it received its own host_uuid back from the other hosts. If so, this is my misconfiguration, and I need to set a unique host_uuid on each host.

According to the libvirt log file you attached:

2015-01-25 15:23:31.177+0000: 65164: error : virNetClientProgramDispatchError:175 : internal error: Attempt to migrate guest to the same host 87de1c88-a068-4dfd-bde0-70d6e17c4819

Closing this issue. It was caused by copying the same host_uuid to all hosts in the setup (via the libvirt.conf taken from alma03). This created a migration loop: when I tried to migrate the VM from alma03 to any other host, they all returned the same host_uuid as alma03, causing it to try to migrate to itself (127.0.0.1). I'll be happy to close this bug as works-for-me now. If possible, it would be useful to add a check that, when the destination host_uuid equals the source's own, triggers an appropriate warning to the customer and cancels the migration at once.
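The misconfiguration described in the closing comment can be spotted up front by comparing the libvirt host UUIDs across the hosts. The sketch below is a hypothetical admin helper, not part of RHEV-M or vdsm; the hostname-to-UUID mapping is hard-coded for illustration (on real hosts the UUID would come from e.g. the host_uuid setting or `virsh` output), and the UUID assignments mirror the scenario in this bug, where migration to blue-vdsc succeeded while the other hosts shared one UUID.

```python
# Hypothetical helper: flag libvirt host UUIDs shared by more than one
# host, which makes migration between those hosts fail with
# "Attempt to migrate guest to the same host".
from collections import defaultdict

def find_duplicate_host_uuids(host_uuids):
    """host_uuids: dict of hostname -> libvirt host UUID.
    Returns {uuid: [hostnames]} for every UUID used by >1 host."""
    by_uuid = defaultdict(list)
    for host, uuid in host_uuids.items():
        by_uuid[uuid.lower()].append(host)
    return {u: sorted(h) for u, h in by_uuid.items() if len(h) > 1}

# Illustrative mapping matching this bug's setup (UUIDs assumed):
hosts = {
    "alma03.qa.lab.tlv.redhat.com": "87de1c88-a068-4dfd-bde0-70d6e17c4819",
    "alma04.qa.lab.tlv.redhat.com": "87de1c88-a068-4dfd-bde0-70d6e17c4819",
    "brown-vdsc.qa.lab.tlv.redhat.com": "87de1c88-a068-4dfd-bde0-70d6e17c4819",
    "blue-vdsc.qa.lab.tlv.redhat.com": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
}

dupes = find_duplicate_host_uuids(hosts)
for uuid, names in dupes.items():
    print("WARNING: host_uuid %s is shared by: %s" % (uuid, ", ".join(names)))
```

A pre-migration check in the engine could call the same comparison and cancel the migration with a warning instead of letting libvirt fail late with the same-host error.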
Description of problem:
Hosted engine migration fails when using "Virtual Machines"->HE-VM->Migrate.

2015-01-25 15:11:39,523 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler_Worker-29) Exception during connection
2015-01-25 15:11:39,523 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-29) Command GetCapabilitiesVDSCommand(HostName = blue-vdsc.qa.lab.tlv.redhat.com, HostId = 84de041b-e0ad-4ba5-bb91-c645f9ccbaf3, vds=Host[blue-vdsc.qa.lab.tlv.redhat.com,84de041b-e0ad-4ba5-bb91-c645f9ccbaf3]) execution failed. Exception: VDSNetworkException: java.rmi.ConnectException: Connection timeout
2015-01-25 15:11:39,523 WARN [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
2015-01-25 15:11:39,523 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-29) Host blue-vdsc.qa.lab.tlv.redhat.com is not responding. It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued.
2015-01-25 15:11:39,533 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-29) Failure to refresh Vds runtime info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.rmi.ConnectException: Connection timeout
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:126) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:571) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:648) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source) [:1.7.0_75]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: java.rmi.ConnectException: Connection timeout
    at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient$1.call(ReactorClient.java:113) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient$1.call(ReactorClient.java:97) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable.call(Retryable.java:26) [vdsm-jsonrpc-java-client.jar:]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_75]
    at org.ovirt.vdsm.jsonrpc.client.utils.ReactorScheduler.performPendingOperations(ReactorScheduler.java:28) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:60) [vdsm-jsonrpc-java-client.jar:]
2015-01-25 15:11:42,541 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to blue-vdsc.qa.lab.tlv.redhat.com/10.35.103.12
2015-01-25 15:11:43,673 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) Retry failed

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install a fresh NFS hosted engine environment using all RHEL 7.0 hosts and a RHEL 6.6 HE VM.
2. Have at least 4 guests.
3. Try migrating the HE from the currently running host to another random or manually chosen host.

Actual results:
Migration fails with error "2015-Jan-25, 15:23 Migration failed due to Error: Fatal error during migration (VM: HostedEngine, Source: alma03.qa.lab.tlv.redhat.com, Destination: alma04.qa.lab.tlv.redhat.com)."

Expected results:
Migration should succeed.
Additional info:
rhevm-3.5.0-0.30.el6ev.noarch
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
Linux version 2.6.32-504.3.3.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC) ) #1 SMP Fri Dec 12 16:05:43 EST 2014
qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64
libvirt-client-1.1.1-29.el7_0.5.x86_64
vdsm-4.16.8.1-6.el7ev.x86_64
sanlock-3.1.0-2.el7.x86_64
ovirt-hosted-engine-ha-1.2.4-5.el7ev.noarch
mom-0.4.1-4.el7ev.noarch
ovirt-hosted-engine-setup-1.2.1-9.el7ev.noarch

I've used these 4 hypervisors:
alma04.qa.lab.tlv.redhat.com
alma03.qa.lab.tlv.redhat.com
brown-vdsc.qa.lab.tlv.redhat.com
blue-vdsc.qa.lab.tlv.redhat.com

Tried to migrate alma03.qa.lab.tlv.redhat.com->alma04.qa.lab.tlv.redhat.com and also to any random hypervisor.