Bug 1334105

Summary: VMs from auto-start pool randomly stop getting started
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.6.5
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Keywords: Regression, ZStream
Reporter: Barak Korren <bkorren>
Assignee: Liron Aravot <laravot>
QA Contact: Natalie Gavrielov <ngavrilo>
CC: ahadas, amarchuk, amureini, bkorren, eedri, gklein, laravot, lsurette, michal.skrivanek, ngavrilo, rbalakri, Rhev-m-bugs, srevivo, tnisan, ykaul, ylavi
Target Milestone: ovirt-4.0.0-rc
Target Release: 4.0.0
oVirt Team: Storage
Type: Bug
Doc Type: Bug Fix
Cloned to: 1348079 (view as bug list)
Bug Blocks: 1348079
Last Closed: 2016-08-23 20:38:29 UTC

Attachments:
engine.log

Description Barak Korren 2016-05-08 15:44:09 UTC
Created attachment 1155066 [details]
engine.log

Description of problem:
Given a pool of auto-started VMs, occasionally some VMs get into a state where the engine constantly fails to start them.

Version-Release number of selected component (if applicable):
3.6.5.3-0.1.el6

How reproducible:
Hard - this happens sporadically on a relatively large system, and no correlation with other actions has been found yet.

Steps to Reproduce:
1. Install RHEV and set up a VM pool with auto-started VMs.
2. Set up some process to occasionally shut down some of the VMs so they are brought back up automatically by the engine.

Actual results:
At some point one of the VMs stops coming back up, and the following message shows up in the UI event log:

  VM your-vm-name was started on Host <UNKNOWN> as stateless

Expected results:
The VM should start up

Additional info:
Engine log attached. In that log the non-starting VM is "nested-lab4-rgi-7.2-20160302.0-tester-1" and its VM ID is "9354d16a-a03b-48a5-91fc-443c331a78ee".

Comment 1 Michal Skrivanek 2016-05-09 07:00:05 UTC
Failing due to "Could not restore stateless VM 9354d16a-a03b-48a5-91fc-443c331a78ee due to a failure to detach Disk 1e480e80-a386-4442-aefb-951736cb100d", but earlier logs are missing, so it is not possible to say what went wrong with the disk. Please attach logs covering the earlier period.

Comment 4 Michal Skrivanek 2016-05-09 15:37:29 UTC
Investigating on the setup, we found that the snapshots are somehow messed up. The VM's snapshots table refers to a snapshot which doesn't exist in the storage's images table, yet there is an entry from the corresponding time and for the corresponding image group. The snapshot id inside the OVF also doesn't match the id in the snapshots table.
We don't know why; it probably already happened on 25.4. That is the last time running the VM succeeded, after 3 hours of attempts timing out on snapshot creation.
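
For reference, both directions of this inconsistency can be spotted directly in the engine database. A minimal sketch, assuming psql access to the engine DB; the vm_id is the one from the description, and the table/column names are as in the query output quoted later in this bug:

  -- snapshots of the affected VM that no volume references any more
  SELECT s.snapshot_id, s.snapshot_type
  FROM snapshots s
  LEFT JOIN images i ON i.vm_snapshot_id = s.snapshot_id
  WHERE s.vm_id = '9354d16a-a03b-48a5-91fc-443c331a78ee'
    AND i.image_guid IS NULL;

  -- volumes whose vm_snapshot_id points to a snapshot that no longer exists
  SELECT i.image_group_id, i.image_guid, i.vm_snapshot_id
  FROM images i
  LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
  WHERE s.snapshot_id IS NULL;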

Comment 5 Barak Korren 2016-05-10 07:33:13 UTC
More data provided in direct channels,
please reopen needinfo if further assistance is needed.

Comment 6 Michal Skrivanek 2016-05-10 08:27:50 UTC
Barak, please grab a full sosreport, as the investigation might take some time and the logs are slowly getting rotated away. The interesting time frame is 25.4, ~20:00 till 23:20, when the VM was repeatedly attempted to start but failed due to snapshot creation issues (this can be observed on the SPM at that time); then at 23:16 the SPM was switched to a new host, after which the snapshot succeeded and the VM was started for the last time. I haven't seen any apparent exceptions at that time, but the end result was that the DB got inconsistent. I would suspect a snapshot async task issue, hence pushing to storage. Please feel free to talk with Arik for more info.

Comment 7 Arik 2016-05-10 10:58:07 UTC
Some more info:
VM named "nested-lab4-rgi-7.2-20160302.0-tester-1" can't start because it cannot be restored to its stateless snapshot.

The VM has 2 snapshots:
         _create_date          |             snapshot_id              | snapshot_type 
-------------------------------+--------------------------------------+---------------
 2016-04-25 23:20:16.985383+03 | b0e60768-3ec9-4e31-b3f4-55d34f5d5527 | STATELESS
 2016-04-25 20:23:09.138351+03 | 1d473f32-f914-4954-9f7a-2c434f4d978c | ACTIVE

Looking at images of its single disk (1e480e80-a386-4442-aefb-951736cb100d):
              image_guid              |            vm_snapshot_id            
--------------------------------------+--------------------------------------
 2463273e-792d-4c71-a59d-ed3e10a3fa91 | 1d473f32-f914-4954-9f7a-2c434f4d978c
 67a0989e-fbcb-4fd4-ace6-3c89596d393c | 7c6a8ca5-7523-4c77-92b8-d361d946b8cb

As you can see, the disk has two volumes: one is associated with the active snapshot and the other with a non-existent snapshot.

At the time the snapshot was taken, the active snapshot of the VM was 7c6a8ca5-7523-4c77-92b8-d361d946b8cb and now there is no such snapshot.

The VM cannot be started because we try to detach every disk that is associated with the active snapshot but not associated with the stateless snapshot (we assume that the disk was attached to the stateless VM while it was running). The issue here is that we cannot detach the disk since it has snapshots.
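
For illustration only (the engine implements this logic in Java, not SQL), the detach-candidate check described above, and the reason the detach then fails, can be expressed against the engine DB roughly as follows, using the snapshot and disk ids quoted in this comment:

  -- disks associated with the active snapshot but not with the stateless snapshot
  -- (these are the ones the stateless restore tries to detach)
  SELECT i.image_group_id
  FROM images i
  WHERE i.vm_snapshot_id = '1d473f32-f914-4954-9f7a-2c434f4d978c'        -- ACTIVE
    AND i.image_group_id NOT IN (
        SELECT image_group_id FROM images
        WHERE vm_snapshot_id = 'b0e60768-3ec9-4e31-b3f4-55d34f5d5527');  -- STATELESS

  -- the detach fails because the disk still has a second volume, the one pointing
  -- at the non-existing snapshot 7c6a8ca5-7523-4c77-92b8-d361d946b8cb
  SELECT image_guid, vm_snapshot_id
  FROM images
  WHERE image_group_id = '1e480e80-a386-4442-aefb-951736cb100d';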

Comment 8 Tal Nisan 2016-05-10 14:26:55 UTC
Greg, you're this week's QE contact, please have a look

Comment 9 Yaniv Kaul 2016-05-22 07:12:23 UTC
Tal - can you make sure there's progress being made on this bug?

Comment 10 Liron Aravot 2016-05-22 12:48:34 UTC
I've found the root cause of the issue and pushed a patch that addresses it (http://gerrit.ovirt.org/57823)

thanks,
Liron.

Comment 12 Elad 2016-06-06 14:34:44 UTC
Liron,

Is there a specific flow for verification?

Comment 13 Liron Aravot 2016-06-09 13:30:24 UTC
Try the following:
1. Add a VM with a disk image.
2. Delete the disk from the underlying storage.
3. Attempt to run the VM in stateless mode.
4. Check the DB consistency - verify in the images table that the vm_snapshot_id of the disk matches the active snapshot id (see the query sketch below).
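
A sketch of the check in step 4, assuming psql access to the engine DB; '<disk-id>' is a placeholder for the image_group_id of the disk used in the test:

  -- the disk's volume(s) should resolve to the VM's ACTIVE snapshot
  SELECT i.image_guid, i.vm_snapshot_id, s.snapshot_type
  FROM images i
  LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
  WHERE i.image_group_id = '<disk-id>';
  -- expected: snapshot_type = 'ACTIVE'; a NULL snapshot_type would mean the
  -- vm_snapshot_id is dangling, i.e. the inconsistency described in this bug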

If you need help with the verification, let me know.

Comment 14 Elad 2016-06-09 14:43:37 UTC
Tested according to the steps in comment #13.
==================================================
The VM snapshot to which the image belongs (according to the images table in the DB) exists in the snapshots table:


engine=# select * from images where image_group_id  = 'd68d9b40-3c8b-457a-a6b4-798be30a3bf8';
-[ RECORD 1 ]---------+-------------------------------------
image_guid            | ede29f0a-f20a-415e-ba89-26dbae4c5350
creation_date         | 2016-06-09 16:54:57+03
size                  | 1073741824
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 00000000-0000-0000-0000-000000000000
imagestatus           | 1
lastmodified          | 1970-01-01 02:00:00+02
vm_snapshot_id        | afee4ab7-684c-4abe-b020-d21a1721bb92
volume_type           | 1
volume_format         | 5
image_group_id        | d68d9b40-3c8b-457a-a6b4-798be30a3bf8
_create_date          | 2016-06-09 13:54:53.265983+03
_update_date          | 2016-06-09 13:57:15.645048+03
active                | t
volume_classification | 1

engine=# select * from snapshots where snapshot_id ='afee4ab7-684c-4abe-b020-d21a1721bb92';
-[ RECORD 1 ]----+-------------------------------------
snapshot_id      | afee4ab7-684c-4abe-b020-d21a1721bb92
vm_id            | 1ffd2f45-1628-4e35-9c75-922523dd3ec0
snapshot_type    | ACTIVE
status           | OK
description      | Active VM
creation_date    | 2016-06-09 13:54:52.261+03
app_list         | 
vm_configuration | 
_create_date     | 2016-06-09 13:54:52.248655+03
_update_date     | 
memory_volume    | 


=======================================================
Verified using:
rhevm-3.6.7.2-0.1.el6.noarch
vdsm-4.17.30-1.el7ev.noarch

Comment 15 Eyal Edri 2016-06-19 12:55:37 UTC
This bug is still on 4.0 + 3.6, I don't know if VERIFIED means 3.6 or 4.0.
Please clone this bug and set flags/status accordingly.

Comment 16 Allon Mureinik 2016-06-19 14:26:28 UTC
(In reply to Eyal Edri from comment #15)
> This bug is still on 4.0 + 3.6, I don't know if VERIFIED means 3.6 or 4.0.
> Please clone this bug and set flags/status accordingly.

It was clearly verified on 3.6:
(In reply to Elad from comment #14)
> Verified using:
> rhevm-3.6.7.2-0.1.el6.noarch
> vdsm-4.17.30-1.el7ev.noarch

Regardless, you are right - this should be cloned.

Comment 18 Eyal Edri 2016-06-20 08:25:50 UTC
Please update the target milestone to 4.0 since the bug was cloned.
Not sure about the status; if it wasn't verified yet, it should be ON_QA.

Comment 19 Eyal Edri 2016-06-20 08:26:47 UTC
Just saw it was the target release that was still on 3.6.7; moved it to 4.0.0.

Comment 20 Allon Mureinik 2016-06-20 08:53:35 UTC
As per comments 14 and 16, this bug was verified on a 3.6.7 build. I set the clone (bug 1348079) to VERIFIED, and am returning this bug to ON_QA to signify it was never verified on a 4.0 build.

Leaving it to the QA's discretion whether to re-test on a 4.0 build or simply close based on the 3.6.7 verification.

Comment 21 Natalie Gavrielov 2016-06-22 18:15:37 UTC
Verified using:
rhevm-4.0.0.5-0.1.el7ev.noarch
vdsm-4.18.3-0.el7ev.x86_64

engine=# select * from images where image_group_id  ='0458ee04-05cd-4da9-888d-1cdde4deb176';
-[ RECORD 1 ]---------+-------------------------------------
image_guid            | 75c67409-2195-40c2-9160-184369266ec3
creation_date         | 2016-06-22 20:51:05+03
size                  | 5368709120
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 00000000-0000-0000-0000-000000000000
imagestatus           | 1
lastmodified          | 1970-01-01 02:00:00+02
vm_snapshot_id        | bf962e10-393f-45b4-a2d6-6007da8aac4f
volume_type           | 1
volume_format         | 5
image_group_id        | 0458ee04-05cd-4da9-888d-1cdde4deb176
_create_date          | 2016-06-22 20:51:05.129645+03
_update_date          | 2016-06-22 20:55:43.272086+03
active                | t
volume_classification | 1

engine=# select * from snapshots where snapshot_id ='bf962e10-393f-45b4-a2d6-6007da8aac4f';
-[ RECORD 1 ]-----------+-------------------------------------
snapshot_id             | bf962e10-393f-45b4-a2d6-6007da8aac4f
vm_id                   | af367995-993b-4901-9130-66dfbfe28c51
snapshot_type           | ACTIVE
status                  | OK
description             | Active VM
creation_date           | 2016-06-22 20:51:02.348+03
app_list                | 
vm_configuration        | 
_create_date            | 2016-06-22 20:51:02.312752+03
_update_date            | 
memory_volume           | 
memory_metadata_disk_id | 
memory_dump_disk_id     |

Comment 23 errata-xmlrpc 2016-08-23 20:38:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html