Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1620009

Summary:

Starting stateless VM failed as it still restores the stateless VM's Snapshot AFTER reaching down state

Product:

[oVirt] ovirt-engine

Reporter:

Avihai <aefrat>

Component:

BLL.Storage

Assignee:

Tal Nisan <tnisan>

Status:

CLOSED NOTABUG

QA Contact:

Elad <ebenahar>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

4.2.6.2

CC:

ahadas, bugs, bzlotnik, derez, eshenitz, michal.skrivanek, tnisan

Target Milestone:

---

Keywords:

Automation

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-10-28 15:05:33 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Storage

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
engine and VDSM logs	none

Description Avihai 2018-08-22 07:54:22 UTC

Created attachment 1477812 [details]
engine and VDSM logs

Description of problem:
Starting stateless VM failed as it still restores the stateless VM's Snapshot AFTER reaching down state.

This is causing VM start to fail if you start it right after VM is in down state with the validation error “Cannot run VM. The VM is performing an operation on a Snapshot”.


Engine log:

1) Stop stateless VM:
2018-08-21 16:27:14,920+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-16) [vms_syncAction_020c61cc-3312-4bda] EVENT_ID: USER_STOP_VM(33), VM vm_TestCase5131_2116182812 powered off by admin@internal-authz (Host: host_mixed_2).

2) VM is in down state before restore snapshot task is finished :
2018-08-21 16:27:15,031+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_DOWN(61), VM vm_TestCase5131_2116182812 is down.  


3) VM snapshot restore snapshot Task is STILL running in the background:
2018-08-21 16:27:16,050+03 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (EE-ManagedThreadFactory-engine-Thread-41668) [16ac7ea0] Deleting snapshot for stateless vm '89016312-01f4-47f7-8b39-f4bfe1de47a0

2018-08-21 16:27:16,260+03 WARN  [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-41668) [261b4c89] Couldn't update VM 'vm_TestCase5131_2116182812' (89016312-01f4-47f7
-8b39-f4bfe1de47a0) version from it's template, continue with restoring stateless snapshot

2018-08-21 16:27:16,760+03 INFO  [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (EE-ManagedThreadFactory-engine-Thread-41668) [4171a8b4] Running command: RestoreAllSnapshotsCommand internal: true. Entities affected :  ID: 89016312-01f4-47f7-8b39-f4bfe1de47a0 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER

2018-08-21 16:27:26,885+03 INFO  [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-41668) [4171a8b4] Lock freed to object 'EngineLock:{exclusiveLocks='[89016312-01f4-47f7-8b39-f4bfe1de47a0=VM]', sharedLocks=''}'


4) Right after VM reach 'down' state -> Start statless VM.

This cause Validation error as the VM is performing an operation on a Snapshot:
2018-08-21 16:27:27,556+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-4) [vms_syncAction_4ed33744-f251-451a] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM vm_TestCase5131_2116182812 due to a failed validation: [Cannot run VM. The VM is performing an operation on a Snapshot. Please wait for the operation to finish, and try again.] (User: admin@internal-authz).

STILL RUNNING snapshot operations 3 seconds after startVM was initiated:
2018-08-21 16:27:30,550+03 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-41675) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'RestoreFromSnapshot',

Version-Release number of selected component (if applicable):
4.2.6.4_SNAPSHOT-97.g02f589b.0.scratch.master.el7ev

How reproducible:
-100% running automation TestCase5131 when I try to start VM right after stateless VM is down .

-Harder to reproduce manually but it also occurs.
 Timing is important - you should starting VM right after it reached down state 
 in the engine log.

Steps to Reproduce:
1.Create Stateless VM from template or change an existing VM to ‘stateless’
2.Start Stateless VM
3.Stop Stateless VM 

Actual results:
VM is in ‘down’ state before VM’s snapshot restore operation are complete .
This is causing VM start to fail if you start it right after VM is in down state with the validation error.

Expected results:
Fix should be to wait until all stateless VM snapshot operations (restore snapshot & destroy active VM volume) are completed, only then go to ‘down’ state.

Additional info:
Bug does not look like regression, this was found running tier4 stabilization TC5131 which was not run until now.

Comment 1 Tal Nisan 2018-08-22 12:59:18 UTC

The VM shutdown and is marked as DOWN i.e. the QEMU process is dead and gone.
After that the operation on a snapshot is performed and in that time frame you cannot start the VM, behavior is right in the fact that the VM state doesn't retain as UP.
Only thing I can think about is perhaps during the snapshot operation to move the VM to another status (ImageLocked? RestoringState?)
Michael, what do you think?

Comment 2 Michal Skrivanek 2018-08-22 16:29:37 UTC

Bot a new VM state, but we can probably just display something else when VM has some ImageLocked disks. But since it’s working fine I consider it rather cosmetic

Comment 3 Michal Skrivanek 2018-08-23 04:57:13 UTC

“Not a new VM state...”

Tal, did anything change in how these snapshots are handled recently? Bug 1620087 also seem to indicate change in behavior

Comment 4 Tal Nisan 2018-08-27 15:03:54 UTC

Eyal, Daniel, do you have any recollection of some change around snapshot locking?

Comment 5 Eyal Shenitzky 2018-08-28 07:06:18 UTC

(In reply to Tal Nisan from comment #4)
> Eyal, Daniel, do you have any recollection of some change around snapshot
> locking?

Not from my side.
Maybe Benny can add some info.
I know he is working around there.

Comment 6 Benny Zlotnik 2018-08-28 08:17:38 UTC

I did the changes in snapshot creation introduced in 4.2.4 (Did it work then or started only recently?). There was a race issue in OST, because snapshot creation took a few seconds longer because of the CoCo polling[1].

However the error message does not indicate it's failing because of the memory lock that is taken in CreateSnapshotForVm

The lock is released here:
2018-08-21 16:25:04,623+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [] Lock freed to object 'EngineLock:{exclusiveLocks='[89016312-01f4-47f7-8b39-f4bfe1de47a0=VM]', sharedLocks=''}'



However, it seems that RestoreAllSnapshotsCommand is finished 5 seconds after the RunVm attempt

RestoreAllSnapshotsCommand:

2018-08-21 16:27:32,820+03 INFO  [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-41) [4171a8b4] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand' successfully.
2018-08-21 16:27:32,854+03 INFO  [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-41) [4171a8b4] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand' successfully.

RunVm:
2018-08-21 16:27:27,233+03 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-4) [vms_syncAction_4ed33744-f251-451a] Lock Acquired to object 'EngineLock:{exclusiveLocks='[89016312-01f4-47f7-8b39-f4bfe1de47a0=VM]', sharedLocks=''}'
2018-08-21 16:27:27,556+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-4) [vms_syncAction_4ed33744-f251-451a] EVENT_ID: USER_FAILED_RUN_VM(54), Failed to run VM vm_TestCase5131_2116182812 due to a failed validation: [Cannot run VM. The VM is performing an operation on a Snapshot. Please wait for the operation to finish, and try again.] (User: admin@internal-authz).
2018-08-21 16:27:27,556+03 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-4) [vms_syncAction_4ed33744-f251-451a] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT


I am not very familiar with these flow, but it seems the snapshot is unlocked in endSuccessfully, though I might be wrong



[1] - https://lists.ovirt.org/archives/list/devel@ovirt.org/thread/Q3JOOMG5HLBPGQXSRKYSDPLEOHMZ6SUH/

Comment 7 Tal Nisan 2018-09-02 14:45:32 UTC

Arik, can you shed some light on this flow maybe?

Comment 8 Arik 2018-10-07 15:17:06 UTC

(In reply to Tal Nisan from comment #7)
> Arik, can you shed some light on this flow maybe?

The snapshot has always been removed after the VM is switched to Down and locks/VM status should prevent the VM from starting as long as the snapshot exists.

It could be that due to the use of CoCo it takes more time for the RestoreAllSnapshots to finish and thus it is more noticeable now.

Comment 9 Arik 2018-10-07 15:22:23 UTC

(In reply to Michal Skrivanek from comment #2)
> Bot a new VM state, but we can probably just display something else when VM
> has some ImageLocked disks. But since it’s working fine I consider it rather
> cosmetic

Actually, RestoreAllSnapshots changes the VM status to ImageLocked. But this transition to ImageLock is not immediate, the VM status transitions the user might see:
1. Some running status (e.g., UP)
2. Down - the monitoring sets the VM to Down.
3. ImageLocked - RestoreAllSnapshot sets it when it starts.
4. Down - RestoreAllSnapshot sets it when it ends.

Comment 10 Tal Nisan 2018-10-22 08:15:29 UTC

So what should we do here then? Close as not a bug?

Comment 11 Arik 2018-10-22 09:16:30 UTC

(In reply to Tal Nisan from comment #10)
> So what should we do here then? Close as not a bug?

+1

Comment 12 Tal Nisan 2018-10-28 15:05:33 UTC

Closing based on comments 8 and 9