Description of problem:

The user started a migration, but it got stuck on the source host; the underlying migration was never started. It looks like the connection to the destination vdsm was set up (setupVdsConnection()) and the guest was prepared (_prepareGuest()), but the source thread never acquired the semaphore, even though there were no other simultaneous migrations. There were, however, 2 migrations about 1h10m earlier whose monitoring threads failed with an uncaught exception when the VMs were presumably shut down from within the guest.

CLOCKS:
engine and source: in sync
engine and destination: -12s (engine and source ahead by 12s)

ENGINE:

2018-04-23 13:54:25,511-05 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-7-thread-42) [c02781dd-abaf-4f31-82e5-60015b4b862d] START, MigrateVDSCommand( MigrateVDSCommandParameters:{runAsync='true', hostId='b35bac7d-0a5d-4d8f-a562-c2afa55dc0eb', vmId='217e7696-c131-40f6-a2ee-15bc01fd52cb', srcHost='rs-rhev-h-2', dstVdsId='3f808fa2-9913-4d66-ac14-dcd4dfd329bb', dstHost='rs-rhev-h-4:54321', migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='true', migrateCompressed='false', consoleAddress='null', maxBandwidth='10000', enableGuestEvents='true', maxIncomingMigrations='2', maxOutgoingMigrations='2', convergenceSchedule='[init=[{name=setDowntime, params=[100]}], stalling=[{limit=1, action={name=setDowntime, params=[150]}}, {limit=2, action={name=setDowntime, params=[200]}}, {limit=3, action={name=setDowntime, params=[300]}}, {limit=4, action={name=setDowntime, params=[400]}}, {limit=6, action={name=setDowntime, params=[500]}}, {limit=-1, action={name=abort, params=[]}}]]'}), log id: 67cf21a6

DESTINATION VDSM:

Only the connection was set up (setupVdsConnection()), nothing else. 172.28.43.238 is the source IP address:

2018-04-23 13:54:16,919-0500 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:172.28.43.238:42728 (protocoldetector:72)
2018-04-23 13:54:16,924-0500 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::ffff:172.28.43.238:42728 (protocoldetector:127)
<goes silent>

SOURCE VDSM:

2018-04-23 13:54:25,514-0500 INFO (jsonrpc/4) [vdsm.api] START migrate(params={u'incomingLimit': 2, u'src': u'rs-rhev-h-2', u'dstqemu': u'172.28.43.236', u'autoConverge': u'true', u'tunneled': u'false', u'enableGuestEvents': True, u'dst': u'rs-rhev-h-4:54321', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'217e7696-c131-40f6-a2ee-15bc01fd52cb', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 10000, u'method': u'online', 'mode': 'remote'}) from=::ffff:172.28.43.235,49396, flow_id=c02781dd-abaf-4f31-82e5-60015b4b862d (api:46)
2018-04-23 13:54:25,516-0500 INFO (jsonrpc/4) [vdsm.api] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::ffff:172.28.43.235,49396, flow_id=c02781dd-abaf-4f31-82e5-60015b4b862d (api:52)
<goes silent>

Looking at the source code, the SourceMigration thread was created fine and start() was called. It ran through setupVdsConnection() and _prepareGuest(), because the connection to the destination was established and vdsm started reporting this guest as "Source Migration". So the most obvious point to block on is the simultaneous-outgoing-migrations semaphore, which comes right after this. It was set to 2, but there were no migrations active at that time.

However, about 1h earlier there were 2 bad ongoing migrations that ended with an exception on their monitoring thread:

1) 0bda3dbe

2018-04-23 12:45:00,759-0500 WARN (migmon/0bda3dbe) [virt.vm] (vmId='0bda3dbe-fdaa-4213-ac86-939388c328eb') Migration stalling: remaining (0MiB) > lowmark (0MiB). Refer to RHBZ#919201. (migration:785)
2018-04-23 12:45:00,759-0500 INFO (migmon/0bda3dbe) [virt.vm] (vmId='0bda3dbe-fdaa-4213-ac86-939388c328eb') Migration Progress: 308116 seconds elapsed, 0% of data processed, total data: 0MB, processed data: 0MB, remaining data: 0MB, transfer speed 0MBps, zero pages: 0MB, compressed: 0MB, dirty rate: 0, memory iteration: 0 (migration:815)
2018-04-23 12:45:08,708-0500 INFO (libvirt/events) [virt.vm] (vmId='0bda3dbe-fdaa-4213-ac86-939388c328eb') underlying process disconnected (vm:694)
2018-04-23 12:45:10,761-0500 ERROR (migmon/0bda3dbe) [root] Unhandled exception (utils:374)

2) e45c70fe

2018-04-23 12:45:20,763-0500 WARN (migmon/e45c70fe) [virt.vm] (vmId='e45c70fe-9e35-46ae-91a9-f1b128e4820e') Migration stalling: remaining (0MiB) > lowmark (0MiB). Refer to RHBZ#919201. (migration:785)
2018-04-23 12:45:20,763-0500 INFO (migmon/e45c70fe) [virt.vm] (vmId='e45c70fe-9e35-46ae-91a9-f1b128e4820e') Migration Progress: 308136 seconds elapsed, 0% of data processed, total data: 0MB, processed data: 0MB, remaining data: 0MB, transfer speed 0MBps, zero pages: 0MB, compressed: 0MB, dirty rate: 0, memory iteration: 0 (migration:815)
2018-04-23 12:45:40,936-0500 INFO (libvirt/events) [virt.vm] (vmId='e45c70fe-9e35-46ae-91a9-f1b128e4820e') underlying process disconnected (vm:694)
2018-04-23 12:45:40,767-0500 ERROR (migmon/e45c70fe) [root] Unhandled exception (utils:374)

The exception that both migrations failed with is the monitoring thread attempting to get the migration stats for a guest that did not exist anymore:

2018-04-23 12:45:10,763-0500 ERROR (migmon/0bda3dbe) [root] Unhandled exception (utils:374)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/concurrent.py", line 186, in run
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 705, in run
    self.monitor_migration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 738, in monitor_migration
    job_stats = self._vm._dom.jobStats()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 47, in __getattr__
    % self.vmid)
NotConnectedError: VM u'0bda3dbe-fdaa-4213-ac86-939388c328eb' was not started yet or was shut down

After the exception there are no signs of those migrations in the logs.

Could this exception have left the migration semaphore acquired for those 2 migrations, blocking any subsequent migrations?

Those 2 migrations are weird (I will investigate them later and open another BZ if necessary), but there seems to be a problem here: no matter what happened to those, this new migration should have reached the underlying migration phase, and it didn't.

Version-Release number of selected component (if applicable):
rhevm-4.1.10.3-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64
libvirt-daemon-3.2.0-14.el7_4.9.x86_64

How reproducible:
Unknown
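For illustration only: the reasoning above assumes the usual bounded-semaphore pattern for limiting concurrent outgoing migrations. The sketch below is not vdsm's actual code; the names (MAX_OUTGOING, run_migration, do_migrate) are made up. It just shows how a permit that is acquired but never released (for example because the thread handling that migration died before releasing it) would make a later migration block right after _prepareGuest(), with no further log output, which matches what was observed.

import threading

# Hypothetical names; sized to match maxOutgoingMigrations=2 from the logs.
MAX_OUTGOING = 2
outgoing_semaphore = threading.BoundedSemaphore(MAX_OUTGOING)

def run_migration(do_migrate):
    # With no timeout, this blocks forever if two earlier migrations leaked
    # their permits -- the new migration would appear "stuck" here.
    outgoing_semaphore.acquire()
    try:
        do_migrate()
    finally:
        # Releasing in a finally block is what keeps a crashed or aborted
        # migration from permanently consuming one of the two permits.
        outgoing_semaphore.release()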
This is the "sister BZ" for the two VMs whose migrations failed with the exception. Their migrations were also very weird.

https://bugzilla.redhat.com/show_bug.cgi?id=1572451
(In reply to Germano Veit Michel from comment #2)
> This is the "sister BZ" for the two VMs whose migrations failed with the
> exception. Their migrations were also very weird.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1572451

hm, nothing intelligent to say about this one... close as insufficient_data as well?
(In reply to Michal Skrivanek from comment #4)
> hm, nothing intelligent to say about this one... close as insufficient_data
> as well?

Right, very weird. Maybe close it as-is, or, if you can think of any logging improvement to help clarify what on earth happened here, that would help if it ever happens again. Maybe an info message when a migration thread waits for the semaphore that limits N concurrent migrations? Dumping which threads are holding it? I don't know, it might not even have been that...
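A minimal sketch of the kind of logging suggested above, assuming a plain threading.BoundedSemaphore guards outgoing migrations; the names (log, outgoing_semaphore, acquire_outgoing_slot, vm_id) are illustrative and not vdsm's API:

import logging
import threading

log = logging.getLogger("virt.migration")
outgoing_semaphore = threading.BoundedSemaphore(2)

def acquire_outgoing_slot(vm_id):
    # Try a non-blocking acquire first so we only log at INFO level when the
    # migration is actually going to wait for a free slot.
    if outgoing_semaphore.acquire(False):
        return
    log.info("vm %s waiting for an outgoing migration slot", vm_id)
    outgoing_semaphore.acquire()
    log.info("vm %s acquired an outgoing migration slot", vm_id)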
no idea:( are you fine with closing it?
(In reply to Michal Skrivanek from comment #6)
> no idea:( are you fine with closing it?

I would like to see some better logging when a migration waits for the semaphore; other than that, we can close this with insufficient data.
We have that at debug level. Even with the current logging default of info, I do not think it would be such a problem to find that out in the logs when they are complete. The value is fairly low (2 by default), so whenever there are other migrations around the same time it is very likely that it blocks on the semaphore.

We could fix the exception and exit gracefully (it looks like a race where the monitor thread notices the VM is gone earlier than the lifecycle event is handled). It shouldn't affect the observed behavior, though. Milane, care to post a patch?
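Roughly what "exit gracefully" could mean here, as a sketch only (this is not the posted patch, and monitor_step is a made-up helper; the NotConnectedError import is assumed from the traceback above): treat NotConnectedError raised by jobStats() as "the domain is already gone" and stop monitoring instead of letting the exception escape the migmon thread.

from vdsm.virt import virdomain

def monitor_step(vm):
    # Returns the libvirt job stats, or None if the VM disappeared while we
    # were monitoring it (e.g. it was shut down from inside the guest).
    try:
        return vm._dom.jobStats()
    except virdomain.NotConnectedError:
        # End the monitoring loop quietly instead of raising an unhandled
        # exception out of the migmon thread.
        return None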
Patch posted: https://gerrit.ovirt.org/94076