Bug 1142776 - rhev-m stops syncing the VM statuses after massive live VM migration which fails.
Summary: rhev-m stops syncing the VM statuses after massive live VM migration which fails.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.1-1
Hardware: All
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Francesco Romani
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks: rhev35gablocker 1174813
 
Reported: 2014-09-17 11:41 UTC by Roman Hodain
Modified: 2016-03-09 19:25 UTC
CC List: 16 users

Fixed In Version: vt11
Doc Type: Bug Fix
Doc Text:
Simultaneous migration of many virtual machines could create a deadlock between the threads that monitor the hosts. As a result, the hosts were not monitored, and the statuses of their virtual machines were not updated by the Red Hat Enterprise Virtualization Manager. The monitoring code now prevents the deadlock, so the virtual machine statuses stay in sync.
Clone Of:
: 1146908 1174813 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:25:26 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
mavital: needinfo+


Attachments
logs_20141107_144809 (695.70 KB, application/x-xz)
2014-11-07 14:46 UTC, Roman Hodain
no flags Details
Bug-1142776-logs (2.69 MB, application/zip)
2014-12-08 09:59 UTC, Israel Pinto
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0362 0 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 34995 0 master MERGED core: prevent possible deadlock between monitoring threads Never
oVirt gerrit 35099 0 ovirt-engine-3.5 MERGED core: prevent possible deadlock between monitoring threads Never
oVirt gerrit 35896 0 master MERGED vm: make _set_lastStatus safer Never
oVirt gerrit 35962 0 ovirt-3.5 MERGED vm: make _set_lastStatus safer Never

Description Roman Hodain 2014-09-17 11:41:55 UTC
Description of problem:
	When a large number of VMs are migrated at once and the migration fails, the
	engine stops monitoring the VM statuses and the environment becomes unusable.

Version-Release number of selected component (if applicable):
	rhevm-3.4.2-0.2.el6ev

How reproducible:
	100%

Steps to Reproduce:
	1. Start 30 VMs across two hypervisors.
	2. Select all of them in the UI and start the migration.
	3. Wait 30 seconds.
	4. Kill all the VMs on both hypervisors:

		# ps -ef | grep qemu |tr -s ' ' | cut -d " " -f 2 | xargs kill -9

Actual results:
	Some of the VMs are marked as Down, but the rest remain in the Migrating or Up
	state. There are also many jobs stuck in the "started" state.

Expected results:
	The jobs are marked as failed and the VM statuses are synced.

Comment 7 Omer Frenkel 2014-09-21 14:32:23 UTC
the source host logs are cut before the calls to the migration verbs..
please attach full logs

Comment 9 Roman Hodain 2014-09-30 07:08:42 UTC
(In reply to Omer Frenkel from comment #7)
> the source host logs are cut before the calls to the migration verbs..
> please attach full logs

I have repeated the test and here is the process

	start 2014-09-29 11:32
	The UI does not refresh the statuses
	engine_2014-09-29_1323.tar
	engine restart 2014-09-29 13:23
	The situation is still the same
	engine_2014-09-29_1552.tar
	engine_2014-09-29_1643.tar
	I have restarted vdsms on both of the hypervisors and engine
	VMs down
	engine_2014-09-30_0821.tar

The logs and DB dumps are attached.

Comment 11 Michal Skrivanek 2014-11-03 15:45:58 UTC
thoughts?

Comment 12 Arik 2014-11-05 11:27:19 UTC
Roman, it would help if you could reproduce it and add the following information:
1. Name the VMs that were 'stuck' in UP/Migrate statuses at the end of the process
2. Attach the output of 'list' verb in vdsm in both hosts:
vdsClient 0 list

Can you please reproduce it again and add this information along with the engine+vdsm logs?

Comment 13 Roman Hodain 2014-11-07 14:46:15 UTC
Created attachment 954960 [details]
logs_20141107_144809

I have repeated the test and uploaded the logs from the hypervisors and the engine to the bugzilla.

The time is synced on all the systems in UTC.

The test was as follows:

   - 40 VMs were started across two hypervisors

   - Migration of all the VMs was triggered.

   - After a while (some of the VMs had already migrated and were in the UP state, some had not) I killed all the qemu processes:

          for i in `pidof qemu-kvm`; do kill -9 $i; done;

   - I collected the statuses of the VMs from the hypervisors a couple of times. There is always a timestamp in the files.

   - I restarted vdsm on the hypervisors and collected the list of VMs again. There was no VM listed. The result is in the same files, but obviously there is just the timestamp, as no data was returned.


   I have then collected all the logs.

Comment 14 Roman Hodain 2014-11-07 14:51:03 UTC
Regarding:
> 1. Name the VMs that were 'stuck' in UP/Migrate statuses at the end of the process

None of the statuses are synced: all the VMs were killed, yet the UI shows all of them as UP or Migrating, apart from the two VMs (TapeTest and Win7) which were not part of the test and have never been started.

Comment 15 Arik 2014-11-09 08:50:15 UTC
(In reply to Michal Skrivanek from comment #11)
Roman, thanks, now it's clear.

There's a deadlock between the two monitoring (VURTI) threads:
Each of them holds the lock on its host (the one it monitors). Since both of them detect that one of their VMs went down while being migrated (on the source and on the destination respectively), each tries to kill that VM on the 'other' host. In order to kill a VM on a host, one has to lock that host. So each thread holds the lock on its own host and tries to lock the other host => deadlock.
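
The actual engine-side fix is in the gerrit patches linked above ('core: prevent possible deadlock between monitoring threads'). Purely to illustrate the failure mode described here (the sketch below is hypothetical Python, not the engine's Java code, and every name in it is made up), one common way to break such a cycle is to take the second host's lock with a bounded wait and defer the kill to the next monitoring cycle when the lock cannot be acquired:

    import threading

    # Illustration only: hypothetical names, not RHEV-M engine code.
    # A monitoring thread that already holds its own host's lock must not
    # block forever on the other host's lock, otherwise two threads can end
    # up each holding one lock and waiting for the other.

    class Host(object):
        def __init__(self, host_id):
            self.id = host_id
            self.lock = threading.Lock()

    def kill_vm_on_other_host(other_host, vm_name, timeout=5):
        # The caller's own host lock is assumed to be held already.
        if other_host.lock.acquire(timeout=timeout):
            try:
                print("destroying %s on host %s" % (vm_name, other_host.id))
            finally:
                other_host.lock.release()
        else:
            print("host %s is busy, deferring %s to the next monitoring cycle"
                  % (other_host.id, vm_name))

Acquiring the two host locks in a single, globally consistent order is another standard way to make this hold-one-wait-for-the-other cycle impossible.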

Comment 17 Israel Pinto 2014-12-08 09:59:07 UTC
Created attachment 965736 [details]
Bug-1142776-logs

Comment 18 Israel Pinto 2014-12-08 12:06:38 UTC
I reproduced it with vt12.
The scenario is:
Setup: 2 hosts, 10 VMs
1. Start migration from Host1 to Host2.
2. After 10 seconds, kill the qemu process of every VM
(with: for i in `pidof qemu-kvm`; do kill -9 $i; done;)
I see that 6 of the 10 VMs in the pool (Pool_test_2) are not responding,
e.g. "VM Pool_test_2-2 is not responding."


Francesco Romani has investigated it and found a problem in VDSM.
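
For reference, the vdsm patches linked above are titled 'vm: make _set_lastStatus safer'. As a rough, hypothetical sketch of that kind of guard (class and attribute names below are assumptions, not the real vdsm source), a status setter can refuse to move a VM out of the terminal Down state and log a warning instead of corrupting the state:

    import logging
    import threading

    DOWN = 'Down'

    # Hypothetical sketch only; not taken from the actual vdsm implementation.
    class VmStatus(object):
        def __init__(self, vm_name):
            self._log = logging.getLogger('vm.%s' % vm_name)
            self._lock = threading.Lock()
            self._last_status = 'WaitForLaunch'

        def set_last_status(self, value):
            with self._lock:
                if self._last_status == DOWN and value != DOWN:
                    # A VM already reported Down must never be "resurrected"
                    # by a late status update; ignore the bogus transition.
                    self._log.warning(
                        'attempt to set status %s on a Down VM, ignoring',
                        value)
                    return
                self._last_status = value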

Comment 20 Michal Skrivanek 2014-12-09 12:23:08 UTC
3.6 is done

Comment 24 Israel Pinto 2015-09-08 13:03:53 UTC
Verified with:
Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.13.master.el6
VDSM: vdsm-4.17.5-1.el7ev

The scenario is:
Setup: 2 hosts, 10 VMs
1. Start migration from Host1 to Host2.
2. After 10 seconds, kill the qemu process of every VM
(with: for i in `pidof qemu-kvm`; do kill -9 $i; done;)
All VMs are down

Comment 25 jas 2015-11-08 00:23:32 UTC
I posted the following to ovirt users list on Friday, and got no response..

https://www.mail-archive.com/users@ovirt.org/msg28962.html

After doing some experimentation, I can see that I can migrate 1 VM between two hosts back and forth, and there's no problem whatsoever.  However, if I do the same with 3 hosts at the same time, I run into a problem whereby near the end of migration, engine says that the migration failed, and the VM is DOWN.  I'm not 100% sure it's this bug, but I'm 99% sure it is.  That's pretty bad, because I can make it happen simply by switching a host into maintenance mode, or, with the balancing policy, after I bring the host back up and engine starts to migrate VMs back to the machine.  I haven't been using 3.5 for *so* long.  Ironically, the release notes for 3.5 say that this bug was fixed, but I don't think it is, yet I think it's pretty serious since I've had to disable balancing, and now I need to be super careful about putting a host into maintenance mode unless I want my VMs to shut down.  In addition, I'm seeing this happen with far fewer VMs than have been reported in this report.

Just a FYI .. the last time I tried it with 3 hosts - 2 migrated fine, the last was almost done, and then it happened.

Comment 26 jas 2015-11-08 00:34:46 UTC
Sigh.. maybe it's a different bug. I just migrated 5 VMs from one host to another, one at a time, and didn't see the problem.  When I migrated the 6th VM: "Migration failed", and the VM is down.

I threw my engine, virt1, virt2, and virt3 logs in http://www.eecs.yorku.ca/~jas/ovirt-debug/11072015 if someone can have a look.  The VM in question was called "webapp" and moved from virt3 to virt1.

Comment 28 errata-xmlrpc 2016-03-09 19:25:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

