Description of problem:

The user started a migration, but it got stuck on the source host; the underlying migration was never started. It looks like the connection to the destination vdsm was set up (setupVdsConnection()) and the guest was prepared (_prepareGuest()), but the source thread never acquired the semaphore, even though there were no other simultaneous migrations. There were, however, 2 migrations about 1h10m earlier whose monitoring threads failed with an uncaught exception when the VMs were presumably shut down from within the guest.

CLOCKS:
engine and source: in sync
engine and destination: -12s (engine and source ahead by 12s)

ENGINE:

2018-04-23 13:54:25,511-05 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-7-thread-42) [c02781dd-abaf-4f31-82e5-60015b4b862d] START, MigrateVDSCommand( MigrateVDSCommandParameters:{runAsync='true', hostId='b35bac7d-0a5d-4d8f-a562-c2afa55dc0eb', vmId='217e7696-c131-40f6-a2ee-15bc01fd52cb', srcHost='rs-rhev-h-2', dstVdsId='3f808fa2-9913-4d66-ac14-dcd4dfd329bb', dstHost='rs-rhev-h-4:54321', migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='true', migrateCompressed='false', consoleAddress='null', maxBandwidth='10000', enableGuestEvents='true', maxIncomingMigrations='2', maxOutgoingMigrations='2', convergenceSchedule='[init=[{name=setDowntime, params=[100]}], stalling=[{limit=1, action={name=setDowntime, params=[150]}}, {limit=2, action={name=setDowntime, params=[200]}}, {limit=3, action={name=setDowntime, params=[300]}}, {limit=4, action={name=setDowntime, params=[400]}}, {limit=6, action={name=setDowntime, params=[500]}}, {limit=-1, action={name=abort, params=[]}}]]'}), log id: 67cf21a6

DESTINATION VDSM:

Only the connection was set up (setupVdsConnection()), nothing else. 172.28.43.238 is the source IP address:

2018-04-23 13:54:16,919-0500 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:172.28.43.238:42728 (protocoldetector:72)
2018-04-23 13:54:16,924-0500 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::ffff:172.28.43.238:42728 (protocoldetector:127)
<goes silent>

SOURCE VDSM:

2018-04-23 13:54:25,514-0500 INFO (jsonrpc/4) [vdsm.api] START migrate(params={u'incomingLimit': 2, u'src': u'rs-rhev-h-2', u'dstqemu': u'172.28.43.236', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'rs-rhev-h-4:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'217e7696-c131-40f6-a2ee-15bc01fd52cb', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 10000, u'method': u'online', 'mode': 'remote'}) from=::ffff:172.28.43.235,49396, flow_id=c02781dd-abaf-4f31-82e5-60015b4b862d (api:46)
2018-04-23 13:54:25,516-0500 INFO (jsonrpc/4) [vdsm.api] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::ffff:172.28.43.235,49396, flow_id=c02781dd-abaf-4f31-82e5-60015b4b862d (api:52)
<goes silent>

Looking at the source code, the SourceMigration thread was created fine and start() was called. It ran through setupVdsConnection() and _prepareGuest(), because the connection to the destination was established and vdsm started reporting this guest as "Source Migration". So the most obvious point to block on is the simultaneous-outgoing-migrations semaphore, which comes right after this. It was set to 2, but there were no migrations active at that time.

However, about 1h earlier there were 2 bad ongoing migrations that ended with an exception on their monitoring thread:

1) 0bda3dbe

2018-04-23 12:45:00,759-0500 WARN (migmon/0bda3dbe) [virt.vm] (vmId='0bda3dbe-fdaa-4213-ac86-939388c328eb') Migration stalling: remaining (0MiB) > lowmark (0MiB). Refer to RHBZ#919201. (migration:785)
2018-04-23 12:45:00,759-0500 INFO (migmon/0bda3dbe) [virt.vm] (vmId='0bda3dbe-fdaa-4213-ac86-939388c328eb') Migration Progress: 308116 seconds elapsed, 0% of data processed, total data: 0MB, processed data: 0MB, remaining data: 0MB, transfer speed 0MBps, zero pages: 0MB, compressed: 0MB, dirty rate: 0, memory iteration: 0 (migration:815)
2018-04-23 12:45:08,708-0500 INFO (libvirt/events) [virt.vm] (vmId='0bda3dbe-fdaa-4213-ac86-939388c328eb') underlying process disconnected (vm:694)
2018-04-23 12:45:10,761-0500 ERROR (migmon/0bda3dbe) [root] Unhandled exception (utils:374)

2) e45c70fe

2018-04-23 12:45:20,763-0500 WARN (migmon/e45c70fe) [virt.vm] (vmId='e45c70fe-9e35-46ae-91a9-f1b128e4820e') Migration stalling: remaining (0MiB) > lowmark (0MiB). Refer to RHBZ#919201. (migration:785)
2018-04-23 12:45:20,763-0500 INFO (migmon/e45c70fe) [virt.vm] (vmId='e45c70fe-9e35-46ae-91a9-f1b128e4820e') Migration Progress: 308136 seconds elapsed, 0% of data processed, total data: 0MB, processed data: 0MB, remaining data: 0MB, transfer speed 0MBps, zero pages: 0MB, compressed: 0MB, dirty rate: 0, memory iteration: 0 (migration:815)
2018-04-23 12:45:40,936-0500 INFO (libvirt/events) [virt.vm] (vmId='e45c70fe-9e35-46ae-91a9-f1b128e4820e') underlying process disconnected (vm:694)
2018-04-23 12:45:40,767-0500 ERROR (migmon/e45c70fe) [root] Unhandled exception (utils:374)

The exception that both migrations failed with is the monitoring thread attempting to get the migration stats for a guest that did not exist anymore:

2018-04-23 12:45:10,763-0500 ERROR (migmon/0bda3dbe) [root] Unhandled exception (utils:374)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/concurrent.py", line 186, in run
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 705, in run
    self.monitor_migration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 738, in monitor_migration
    job_stats = self._vm._dom.jobStats()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 47, in __getattr__
    % self.vmid)
NotConnectedError: VM u'0bda3dbe-fdaa-4213-ac86-939388c328eb' was not started yet or was shut down

After the exception there are no signs of those migrations in the logs.

Could this exception have left the migration semaphore acquired for those 2 migrations, blocking any subsequent migrations?

Those 2 migrations are weird (I will investigate them later and open another BZ if necessary), but there seems to be a problem here: no matter what happened to those, this new migration should have reached the underlying migration phase, and it didn't.

Version-Release number of selected component (if applicable):
rhevm-4.1.10.3-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64
libvirt-daemon-3.2.0-14.el7_4.9.x86_64

How reproducible:
Unknown
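For illustration only: the reasoning above assumes the usual bounded-semaphore pattern for limiting concurrent outgoing migrations. The sketch below is not vdsm's actual code; the names (MAX_OUTGOING, run_migration, do_migrate) are made up. It just shows how a permit that is acquired but never released (for example because the thread handling that migration died before releasing it) would make a later migration block right after _prepareGuest(), with no further log output, which matches what was observed.

import threading

# Hypothetical names; sized to match maxOutgoingMigrations=2 from the logs.
MAX_OUTGOING = 2
outgoing_semaphore = threading.BoundedSemaphore(MAX_OUTGOING)

def run_migration(do_migrate):
    # With no timeout, this blocks forever if two earlier migrations leaked
    # their permits -- the new migration would appear "stuck" here.
    outgoing_semaphore.acquire()
    try:
        do_migrate()
    finally:
        # Releasing in a finally block is what keeps a crashed or aborted
        # migration from permanently consuming one of the two permits.
        outgoing_semaphore.release()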
This is the "sister BZ" for the two VMs whose migrations failed with the exception. Their migrations were also very weird.

https://bugzilla.redhat.com/show_bug.cgi?id=1572451
(In reply to Germano Veit Michel from comment #2)
> This is the "sister BZ" for the two VMs whose migrations failed with the
> exception. Their migrations were also very weird.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1572451

hm, nothing intelligent to say about this one... close as insufficient_data as well?
(In reply to Michal Skrivanek from comment #4)
> hm, nothing intelligent to say about this one... close as insufficient_data
> as well?

Right, very weird. Maybe close it as-is, or, if you can think of any logging improvement to help clarify what on earth happened here, that would help if it ever happens again. Maybe an info message when a migration thread waits for the semaphore that limits N concurrent migrations? Dumping which threads are holding it? I don't know, it might not even have been that...
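A minimal sketch of the kind of logging suggested above, assuming a plain threading.BoundedSemaphore guards outgoing migrations; the names (log, outgoing_semaphore, acquire_outgoing_slot, vm_id) are illustrative and not vdsm's API:

import logging
import threading

log = logging.getLogger("virt.migration")
outgoing_semaphore = threading.BoundedSemaphore(2)

def acquire_outgoing_slot(vm_id):
    # Try a non-blocking acquire first so we only log at INFO level when the
    # migration is actually going to wait for a free slot.
    if outgoing_semaphore.acquire(False):
        return
    log.info("vm %s waiting for an outgoing migration slot", vm_id)
    outgoing_semaphore.acquire()
    log.info("vm %s acquired an outgoing migration slot", vm_id)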
no idea:( are you fine with closing it?
(In reply to Michal Skrivanek from comment #6)
> no idea:( are you fine with closing it?

I would like to see some better logging when a migration waits for the semaphore; other than that, we can close this with insufficient data.
We have that at debug level. Even with the current logging default of info, I do not think it would be such a problem to find that out in the logs when they are complete. The value is fairly low (2 by default), so whenever there are other migrations around the same time it is very likely that it blocks on the semaphore.

We could fix the exception and exit gracefully (it looks like a race where the monitor thread notices the VM is gone earlier than the lifecycle event is handled). It shouldn't affect the observed behavior, though. Milane, care to post a patch?
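Roughly what "exit gracefully" could mean here, as a sketch only (this is not the posted patch, and monitor_step is a made-up helper; the NotConnectedError import is assumed from the traceback above): treat NotConnectedError raised by jobStats() as "the domain is already gone" and stop monitoring instead of letting the exception escape the migmon thread.

from vdsm.virt import virdomain

def monitor_step(vm):
    # Returns the libvirt job stats, or None if the VM disappeared while we
    # were monitoring it (e.g. it was shut down from inside the guest).
    try:
        return vm._dom.jobStats()
    except virdomain.NotConnectedError:
        # End the monitoring loop quietly instead of raising an unhandled
        # exception out of the migmon thread.
        return None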
Patch posted: https://gerrit.ovirt.org/94076