Bug 1334105 - VMs from auto-start pool randomly stop getting started
Summary: VMs from auto-start pool randomly stop getting started
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.5
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ovirt-4.0.0-rc
Target Release: 4.0.0
Assignee: Liron Aravot
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks: 1348079
 
Reported: 2016-05-08 15:44 UTC by Barak Korren
Modified: 2016-08-23 20:38 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1348079
Environment:
Last Closed: 2016-08-23 20:38:29 UTC
oVirt Team: Storage
Target Upstream Version:


Attachments
engine.log (158.48 KB, application/x-xz)
2016-05-08 15:44 UTC, Barak Korren


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:1743 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine) 2016-09-02 21:54:01 UTC
oVirt gerrit 57823 0 master MERGED core: RunVm - do not pass compensation context 2016-05-22 15:25:56 UTC
oVirt gerrit 57848 0 ovirt-engine-3.6 MERGED core: RunVm - do not pass compensation context 2016-05-23 12:41:58 UTC

Description Barak Korren 2016-05-08 15:44:09 UTC
Created attachment 1155066 [details]
engine.log

Description of problem:
Given a pool of auto-started VMs, occasionally some VMs get into a state where the engine constantly fails to start them.

Version-Release number of selected component (if applicable):
3.6.5.3-0.1.el6

How reproducible:
Hard - this happens sporadically on a relatively large system, and no correlation with other actions has been found yet.

Steps to Reproduce:
1. Install RHEV and set up a VM pool with auto-started VMs.
2. Set up some process to occasionally shut down some of the VMs so they are brought back up automatically by the engine.

Actual results:
At some point one of the VMs will stop coming back up, and the following message will show up in the UI event log:

  VM your-vm-name was started on Host <UNKNOWN> as stateless

Expected results:
The VM should start up

Additional info:
Engine log attached. In that log the non-starting VM is "nested-lab4-rgi-7.2-20160302.0-tester-1" and its VM ID is "9354d16a-a03b-48a5-91fc-443c331a78ee".

Comment 1 Michal Skrivanek 2016-05-09 07:00:05 UTC
Failing due to "Could not restore stateless VM 9354d16a-a03b-48a5-91fc-443c331a78ee due to a failure to detach Disk 1e480e80-a386-4442-aefb-951736cb100d", but earlier logs are missing, so it is not possible to say what went wrong with the disk. Please attach logs covering the earlier period.

Comment 4 Michal Skrivanek 2016-05-09 15:37:29 UTC
Investigating on the setup, we found out the snapshots are somehow messed up. The VM's snapshots table refers to a snapshot which doesn't exist in the storage images table, yet there is an entry from the corresponding time and for the corresponding image group. The snapshot ID inside the OVF also doesn't match the ID in the snapshots table.
We don't know why; it probably happened on 25.4 already. That's the last time running the VM succeeded, after 3 hours of attempts timing out on snapshot creation.
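For illustration, a check along these lines (just a sketch; the snapshots and images tables and their columns are the ones shown in the psql output later in this bug, and the VM ID is the one from the description) would list snapshots of this VM that no volume in the images table points at:

  -- Snapshots of the VM with no corresponding volume in the images table.
  -- In the state described above, the STATELESS snapshot appears here,
  -- since no volume references it.
  SELECT s.snapshot_id, s.snapshot_type, s._create_date
  FROM snapshots s
  LEFT JOIN images i ON i.vm_snapshot_id = s.snapshot_id
  WHERE s.vm_id = '9354d16a-a03b-48a5-91fc-443c331a78ee'
    AND i.image_guid IS NULL;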

Comment 5 Barak Korren 2016-05-10 07:33:13 UTC
More data provided in direct channels,
please reopen needinfo if further assistance is needed.

Comment 6 Michal Skrivanek 2016-05-10 08:27:50 UTC
Barak, please grab a full sosreport, as it might need some time to investigate and the logs are slowly getting rotated away. The interesting time frame is 25.4, ~20:00 till 23:20, when the VM was attempted to start but failed due to snapshot creation issues (can be observed on the SPM at that time); then at 23:16 the SPM was switched to a new host, after which the snapshot succeeded and the VM was started for the last time. I haven't seen any apparent exceptions at that time, but the end result was that the DB got inconsistent. I would suspect a snapshot async task issue, hence pushing to storage. Please feel free to talk with Arik for more info.

Comment 7 Arik 2016-05-10 10:58:07 UTC
Some more info:
VM named "nested-lab4-rgi-7.2-20160302.0-tester-1" can't start because it cannot be restored to its stateless snapshot.

The VM has 2 snapshots:
         _create_date          |             snapshot_id              | snapshot_type 
-------------------------------+--------------------------------------+---------------
 2016-04-25 23:20:16.985383+03 | b0e60768-3ec9-4e31-b3f4-55d34f5d5527 | STATELESS
 2016-04-25 20:23:09.138351+03 | 1d473f32-f914-4954-9f7a-2c434f4d978c | ACTIVE

Looking at images of its single disk (1e480e80-a386-4442-aefb-951736cb100d):
              image_guid              |            vm_snapshot_id            
--------------------------------------+--------------------------------------
 2463273e-792d-4c71-a59d-ed3e10a3fa91 | 1d473f32-f914-4954-9f7a-2c434f4d978c
 67a0989e-fbcb-4fd4-ace6-3c89596d393c | 7c6a8ca5-7523-4c77-92b8-d361d946b8cb

As you can see, the disk has two volumes: one is associated with the active snapshot and the other with a non-existent snapshot.

At the time the snapshot was taken, the active snapshot of the VM was 7c6a8ca5-7523-4c77-92b8-d361d946b8cb and now there is no such snapshot.

The VM cannot be started because we try to detach every disk that is associated with the active snapshot but not associated with the stateless snapshot (we assume that the disk was attached to the stateless VM while it was running). The issue here is that we cannot detach the disk since it has snapshots.
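
The reverse check (again just a sketch, using the same tables and the disk ID from the query above) would surface the orphaned volume, i.e. the volume whose vm_snapshot_id points at a snapshot that no longer exists:

  -- Volumes of the disk whose vm_snapshot_id has no matching row in snapshots.
  -- With the data above, this returns 67a0989e-fbcb-4fd4-ace6-3c89596d393c,
  -- which references the non-existent snapshot 7c6a8ca5-7523-4c77-92b8-d361d946b8cb.
  SELECT i.image_guid, i.vm_snapshot_id
  FROM images i
  LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
  WHERE i.image_group_id = '1e480e80-a386-4442-aefb-951736cb100d'
    AND s.snapshot_id IS NULL;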

Comment 8 Tal Nisan 2016-05-10 14:26:55 UTC
Greg, you're this week's QE contact; please have a look.

Comment 9 Yaniv Kaul 2016-05-22 07:12:23 UTC
Tal - can you make sure there's progress being made on this bug?

Comment 10 Liron Aravot 2016-05-22 12:48:34 UTC
I've found the issue causing this behavior and pushed a patch that addresses it (http://gerrit.ovirt.org/57823)

thanks,
Liron.

Comment 12 Elad 2016-06-06 14:34:44 UTC
Liron,

Is there a specific flow for verification?

Comment 13 Liron Aravot 2016-06-09 13:30:24 UTC
Try the following:
1. Add a VM with a disk image.
2. Delete the disk from the underlying storage.
3. Attempt to run the VM in stateless mode.
4. Check the DB consistency - verify in the images table that the vm_snapshot_id of the disk matches the active snapshot ID (see the query sketch at the end of this comment).

If you need help with the verification, let me know.
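
A query sketch for step 4 (assuming the same schema as in the psql output elsewhere in this bug; replace the image_group_id value with the ID of the disk from step 1):

  -- Show each volume of the disk together with the snapshot row it points at.
  -- With the fix, the active volume's vm_snapshot_id should resolve to the VM's
  -- ACTIVE snapshot instead of pointing at a snapshot that no longer exists.
  SELECT i.image_guid, i.active, i.vm_snapshot_id, s.snapshot_type, s.status
  FROM images i
  LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
  WHERE i.image_group_id = '<disk id>';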

Comment 14 Elad 2016-06-09 14:43:37 UTC
Tested according to the steps in comment #13.
==================================================
The VM snapshot that the image belongs to, according to the images table in the DB, exists in the snapshots table:


engine=# select * from images where image_group_id  = 'd68d9b40-3c8b-457a-a6b4-798be30a3bf8';
-[ RECORD 1 ]---------+-------------------------------------
image_guid            | ede29f0a-f20a-415e-ba89-26dbae4c5350
creation_date         | 2016-06-09 16:54:57+03
size                  | 1073741824
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 00000000-0000-0000-0000-000000000000
imagestatus           | 1
lastmodified          | 1970-01-01 02:00:00+02
vm_snapshot_id        | afee4ab7-684c-4abe-b020-d21a1721bb92
volume_type           | 1
volume_format         | 5
image_group_id        | d68d9b40-3c8b-457a-a6b4-798be30a3bf8
_create_date          | 2016-06-09 13:54:53.265983+03
_update_date          | 2016-06-09 13:57:15.645048+03
active                | t
volume_classification | 1

engine=# select * from snapshots where snapshot_id ='afee4ab7-684c-4abe-b020-d21a1721bb92';
-[ RECORD 1 ]----+-------------------------------------
snapshot_id      | afee4ab7-684c-4abe-b020-d21a1721bb92
vm_id            | 1ffd2f45-1628-4e35-9c75-922523dd3ec0
snapshot_type    | ACTIVE
status           | OK
description      | Active VM
creation_date    | 2016-06-09 13:54:52.261+03
app_list         | 
vm_configuration | 
_create_date     | 2016-06-09 13:54:52.248655+03
_update_date     | 
memory_volume    | 


=======================================================
Verified using:
rhevm-3.6.7.2-0.1.el6.noarch
vdsm-4.17.30-1.el7ev.noarch

Comment 15 Eyal Edri 2016-06-19 12:55:37 UTC
This bug is still on 4.0 + 3.6, I don't know if VERIFIED means 3.6 or 4.0.
Please clone this bug and set flags/status accordingly.

Comment 16 Allon Mureinik 2016-06-19 14:26:28 UTC
(In reply to Eyal Edri from comment #15)
> This bug is still on 4.0 + 3.6, I don't know if VERIFIED means 3.6 or 4.0.
> Please clone this bug and set flags/status accordingly.

It was clearly verified on 3.6:
(In reply to Elad from comment #14)
> Verified using:
> rhevm-3.6.7.2-0.1.el6.noarch
> vdsm-4.17.30-1.el7ev.noarch

Regardless, you are right - this should be cloned.

Comment 18 Eyal Edri 2016-06-20 08:25:50 UTC
Please update the target milestone to 4.0 since the bug was cloned.
Not sure about the status; if it wasn't verified yet, it should be ON_QA.

Comment 19 Eyal Edri 2016-06-20 08:26:47 UTC
Just saw it was the target release that was still on 3.6.7; moved to 4.0.0.

Comment 20 Allon Mureinik 2016-06-20 08:53:35 UTC
As per comments 14 and 16, this bug was verified on a 3.6.7 build. I set the clone (bug 1348079) to VERIFIED, and am returning this bug to ON_QA to signify it was never verified on a 4.0 build.

Leaving it to the QA's discretion whether to re-test on a 4.0 build or simply close based on the 3.6.7 verification.

Comment 21 Natalie Gavrielov 2016-06-22 18:15:37 UTC
Verified using:
rhevm-4.0.0.5-0.1.el7ev.noarch
vdsm-4.18.3-0.el7ev.x86_64

engine=# select * from images where image_group_id  ='0458ee04-05cd-4da9-888d-1cdde4deb176';
-[ RECORD 1 ]---------+-------------------------------------
image_guid            | 75c67409-2195-40c2-9160-184369266ec3
creation_date         | 2016-06-22 20:51:05+03
size                  | 5368709120
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 00000000-0000-0000-0000-000000000000
imagestatus           | 1
lastmodified          | 1970-01-01 02:00:00+02
vm_snapshot_id        | bf962e10-393f-45b4-a2d6-6007da8aac4f
volume_type           | 1
volume_format         | 5
image_group_id        | 0458ee04-05cd-4da9-888d-1cdde4deb176
_create_date          | 2016-06-22 20:51:05.129645+03
_update_date          | 2016-06-22 20:55:43.272086+03
active                | t
volume_classification | 1

engine=# select * from snapshots where snapshot_id ='bf962e10-393f-45b4-a2d6-6007da8aac4f';
-[ RECORD 1 ]-----------+-------------------------------------
snapshot_id             | bf962e10-393f-45b4-a2d6-6007da8aac4f
vm_id                   | af367995-993b-4901-9130-66dfbfe28c51
snapshot_type           | ACTIVE
status                  | OK
description             | Active VM
creation_date           | 2016-06-22 20:51:02.348+03
app_list                | 
vm_configuration        | 
_create_date            | 2016-06-22 20:51:02.312752+03
_update_date            | 
memory_volume           | 
memory_metadata_disk_id | 
memory_dump_disk_id     |

Comment 23 errata-xmlrpc 2016-08-23 20:38:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

