Bug 1628909

Summary: Engine marks the snapshot status as OK before the actual snapshot operation
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: ovirt-engine
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA
QA Contact: Yosi Ben Shimon <ybenshim>
Severity: urgent
Docs Contact:
Priority: medium
Version: 4.2.5
CC: ahadas, audgiri, bscalio, bzlotnik, ebenahar, gveitmic, gwatson, michael.moir, mkalinin, mtessun, peli, peter, Rhev-m-bugs, shipatil, tnisan
Target Milestone: ovirt-4.3.0
Keywords: ZStream
Target Release: 4.3.0
Flags: lsvaty: testing_plan_complete-
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: ovirt-engine-4.3.0_alpha
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1635189 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:38:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1635189

Description nijin ashok 2018-09-14 10:37:36 UTC
Description of problem:

The ovirt-engine marks the snapshot status as OK before it sends the "snapshot" command to the VM. If a backup automation tool such as Commvault checks the progress of the snapshot by polling this "status" field, it will assume the snapshot operation is complete as soon as the field returns "OK" and will proceed to the next step of attaching the snapshot disk to its backup agent VM. Since the status is already OK, that step also completes successfully. The result is that the snapshot disk is attached to the backup agent VM before the actual snapshot operation has finished. In addition, the SnapshotVDSCommand can fail (see, for example, bug 1572801), in which case an "invalid" snapshot disk ends up attached to the backup agent VM.
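
To make the failure mode concrete, below is a minimal sketch of what such a backup tool effectively does: it polls the snapshot's status through the REST API and treats "ok" as "safe to attach the snapshot disk". The engine URL, credentials and IDs are placeholders; snapshot_status is the standard REST API field.

# Sketch of the backup tool's polling logic (URL, credentials and IDs are placeholders).
import time
import xml.etree.ElementTree as ET
import requests

ENGINE = "https://rhvm.example.com/ovirt-engine/api"   # placeholder
AUTH = ("admin@internal", "password")                  # placeholder
VM_ID = "11111111-1111-1111-1111-111111111111"         # placeholder
SNAP_ID = "22222222-2222-2222-2222-222222222222"       # placeholder

def snapshot_status():
    """Return the <snapshot_status> value the engine reports for the snapshot."""
    resp = requests.get("%s/vms/%s/snapshots/%s" % (ENGINE, VM_ID, SNAP_ID),
                        auth=AUTH, verify=False,
                        headers={"Accept": "application/xml"})
    resp.raise_for_status()
    return ET.fromstring(resp.content).findtext("snapshot_status")

# The tool's (reasonable) assumption: once the status is "ok", the snapshot is
# complete and its disk can be attached to the backup agent VM. With this bug,
# "ok" is reported before SnapshotVDSCommand has even been sent, so the tool
# proceeds too early and can attach an incomplete or invalid snapshot disk.
while snapshot_status() == "locked":
    time.sleep(5)
print("snapshot status is 'ok' -> attaching the snapshot disk to the agent VM")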


Version-Release number of selected component (if applicable):

rhvm-4.2.6.4-0.1.el7ev.noarch

How reproducible:

100%

Steps to Reproduce:

1. I added a delay in the vdsm code at the point where it freezes the guest filesystem, so that I could replicate a snapshot failure (see the sketch below).
2. The snapshot status is changed to OK before the engine sends the "snapshot" command to the VM.
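
For reference, a delay of this kind can be injected with a simple sleep right before vdsm freezes the guest filesystems. The exact file and method patched are not stated in this bug; the snippet below assumes the freeze path in vdsm's lib/vdsm/virt/vm.py (4.2-era layout) and is only meant to widen the race window:

# Hypothetical reproduction aid -- not the actual patch used by the reporter.
# The location (Vm.freeze in lib/vdsm/virt/vm.py) is an assumption; the point
# is only to delay the guest filesystem freeze long enough to watch the
# snapshot status flip to OK in the engine before the snapshot completes.
import time

def freeze(self):
    self.log.info("freeze: sleeping 30s before fsFreeze (reproduction only)")
    time.sleep(30)                  # artificial delay to widen the race window
    frozen = self._dom.fsFreeze()   # freeze guest filesystems via libvirt
    self.log.info("%d guest filesystem(s) frozen", frozen)
    return frozen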

Actual results:

The snapshot status is changed to OK immediately after it creates the volume and before it sends the "snapshot" command to the VM.

Expected results:

The snapshot status should be changed to "OK" only after the complete snapshot operation.

Additional info:

Comment 1 Tal Nisan 2018-09-16 11:30:06 UTC
Eyal please have a look.
Arik, do you have any insights from Virt side?

Comment 3 Arik 2018-09-16 11:45:35 UTC
(In reply to Tal Nisan from comment #1)
> Eyal please have a look.
> Arik, do you have any insights from Virt side?

That looks like a regression caused by the relatively recent changes in the create-snapshot command. The snapshot should indeed remain locked until all tasks are finished.

Comment 4 Eyal Shenitzky 2018-09-16 12:04:12 UTC
Benny,
Can you please take a look?

Comment 5 Arik 2018-09-16 12:14:02 UTC
(In reply to Arik from comment #3)
> (In reply to Tal Nisan from comment #1)
> > Eyal please have a look.
> > Arik, do you have any insights from Virt side?
> 
> That looks like a regression caused by the relatively recent changes in the
> create-snapshot command. The snapshot should indeed remain locked until all
> tasks are finished.

Actually, I was wrong; it seems we unlocked the snapshot before calling the live-snapshot verb in 4.1 as well [1], i.e. before those changes.

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.1/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/CreateAllSnapshotsFromVmCommand.java#L401-L404

Comment 6 Tal Nisan 2018-09-16 12:44:14 UTC
So, is it a Virt issue or Storage?

Comment 7 Arik 2018-09-16 12:52:51 UTC
(In reply to Tal Nisan from comment #6)
> So, is it a Virt issue or Storage?

It can go either way, but I would keep it as Storage since the storage team was the last to introduce a major change to the way this command operates.

Comment 8 Tal Nisan 2018-09-16 13:02:45 UTC
*** Bug 1620087 has been marked as a duplicate of this bug. ***

Comment 15 RHV bug bot 2018-12-10 15:12:59 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 16 Yosi Ben Shimon 2018-12-25 16:05:38 UTC
Tested using:
ovirt-engine-4.3.0-0.6.alpha2.el7.noarch
vdsm-4.30.4-1.el7ev.x86_64

- Tested both with and without 30 seconds delay in VDSM.
- Tested with both VM states - up and down.
- There were 5 preallocated disks of size 20G each.

The snapshot creation process took much more than 10 seconds (~60 seconds).

For the entire duration of the operation (until it completed), the snapshot status remained "locked" (checked via the REST API).
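
For reference, this kind of check can be done with a simple polling loop against the REST API (engine URL, credentials and IDs below are placeholders):

# Minimal verification sketch: poll the snapshot's REST status and measure how
# long it stays "locked". URL, credentials and IDs are placeholders.
import time
import xml.etree.ElementTree as ET
import requests

url = "https://rhvm.example.com/ovirt-engine/api/vms/VM_ID/snapshots/SNAP_ID"
start = time.monotonic()
status = "locked"
while status == "locked":
    resp = requests.get(url, auth=("admin@internal", "password"),
                        verify=False, headers={"Accept": "application/xml"})
    resp.raise_for_status()
    status = ET.fromstring(resp.content).findtext("snapshot_status")
    time.sleep(2)
print("snapshot status became %r after ~%.0f seconds" % (status, time.monotonic() - start))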

Moving to VERIFIED

Comment 17 RHV bug bot 2019-01-15 23:35:27 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 19 errata-xmlrpc 2019-05-08 12:38:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 20 Daniel Gur 2019-08-28 13:12:09 UTC
sync2jira

Comment 21 Daniel Gur 2019-08-28 13:16:22 UTC
sync2jira