Bug 1660742

Summary: Successful snapshot status returned by API although the snapshot creation failed
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: ovirt-engine
Assignee: Fred Rolland <frolland>
Status: CLOSED NOTABUG
QA Contact: meital avital <mavital>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2.7
CC: bzlotnik, dev-unix-virtualization, ebenahar, frolland, gauravjadhav.jadhav, gveitmic, max, mkalinin, rbarry, Rhev-m-bugs, tnisan
Target Milestone: ovirt-4.3.5
Flags: lsvaty: testing_plan_complete-
Target Release: 4.3.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-05-16 09:12:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1660997

Description nijin ashok 2018-12-19 06:25:17 UTC
Description of problem:

When a backup application creates a snapshot through the RHV API, the API returns the ID of the newly created snapshot to the application. The application can then check the status of the snapshot operation by querying that snapshot ID. Once the status changes from "LOCKED" to "OK", the application concludes that the snapshot operation is complete.

However, if the snapshot operation fails while the snapshot command is being sent to the KVM domain, the newly created snapshot and its volumes are deleted automatically. During this cleanup, the newly created snapshot ID is reassigned to the "Active VM" entry. An application that queries the status by snapshot ID therefore gets "OK" and assumes that the operation succeeded although it failed, which leads it to carry out further incorrect work.

To reproduce the issue, I killed the vdsm process just before it sent the snapshot command to libvirt. Below are the database and curl outputs taken during the snapshot operation and after it failed and the snapshot was deleted automatically.

During snapshot operation.
====

 description |             snapshot_id              | status 
-------------+--------------------------------------+--------
 Active VM   | 86cc6641-9da5-4265-bcbe-8287783ea1c3 | OK
 backup      | 592fc15d-0b12-4156-826b-a15827cf79ab | LOCKED
(2 rows)

    <snapshot href="/ovirt-engine/api/vms/acfba9f2-5de8-4c50-a30e-04024013ab28/snapshots/86cc6641-9da5-4265-bcbe-8287783ea1c3" id="86cc6641-9da5-4265-bcbe-8287783ea1c3">
        <description>Active VM</description>
        <snapshot_status>ok</snapshot_status>
    <snapshot href="/ovirt-engine/api/vms/acfba9f2-5de8-4c50-a30e-04024013ab28/snapshots/592fc15d-0b12-4156-826b-a15827cf79ab" id="592fc15d-0b12-4156-826b-a15827cf79ab">
        <description>backup</description>
        <snapshot_status>locked</snapshot_status>
            <description></description>
            <status>up</status>

After it failed.
====

 description |             snapshot_id              | status 
-------------+--------------------------------------+--------
 Active VM   | 592fc15d-0b12-4156-826b-a15827cf79ab | OK
(1 row)


    <snapshot href="/ovirt-engine/api/vms/acfba9f2-5de8-4c50-a30e-04024013ab28/snapshots/592fc15d-0b12-4156-826b-a15827cf79ab" id="592fc15d-0b12-4156-826b-a15827cf79ab">
        <description>Active VM</description>
        <snapshot_status>ok</snapshot_status>
===

592fc15d was the UUID of the new snapshot. If the application checks its status, it gets "OK" and can incorrectly conclude that the operation succeeded.
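
To illustrate the failure mode, here is a minimal Python sketch that replays the two database states shown above against a status-only check. The function name naive_is_complete is hypothetical and stands in for whatever the backup application does; it is not taken from any real client.

```python
# Snapshot rows copied from the two database outputs in this report,
# keyed by snapshot ID: (description, status).
DURING = {
    "86cc6641-9da5-4265-bcbe-8287783ea1c3": ("Active VM", "OK"),
    "592fc15d-0b12-4156-826b-a15827cf79ab": ("backup", "LOCKED"),
}
AFTER_FAILURE = {
    "592fc15d-0b12-4156-826b-a15827cf79ab": ("Active VM", "OK"),
}

# ID that the API returned to the application for the new snapshot.
NEW_SNAPSHOT_ID = "592fc15d-0b12-4156-826b-a15827cf79ab"


def naive_is_complete(rows, snapshot_id):
    """Hypothetical status-only check: 'OK' is taken to mean success."""
    _description, status = rows[snapshot_id]
    return status == "OK"


# While the snapshot is being created, the check correctly says "not done".
print(naive_is_complete(DURING, NEW_SNAPSHOT_ID))
# After the failure and cleanup, the same ID now belongs to "Active VM",
# so the check reports completion even though the snapshot was removed.
print(naive_is_complete(AFTER_FAILURE, NEW_SNAPSHOT_ID))
```

The second call returns True only because the ID was reused for the "Active VM" entry, which is exactly the misleading result described here.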


Version-Release number of selected component (if applicable):

RHV 4.2.7

How reproducible:

100%

Steps to Reproduce:

See above.

Actual results:

Checking the snapshot status by querying its UUID leads to an incorrect interpretation of the operation's result.

Expected results:

The snapshot status should be reported correctly, or there should be some other way for an external application to check the status of the snapshot.

Additional info:

Comment 2 nijin ashok 2018-12-19 06:31:26 UTC
The issue was observed during a backup with Commvault. Commvault support reported that the application gets a success status when it queries the snapshot ID although the snapshot failed in RHV.

Comment 3 Elad 2018-12-21 00:52:14 UTC
Sounds similar to bug 1660997, which was reported for upstream

Comment 4 nijin ashok 2018-12-21 04:02:43 UTC
(In reply to Elad from comment #3)
> Sounds similar to bug 1660997, which was reported for upstream

I think this is different. In my case, the snapshot operation is marked as "failed" in the engine and is also cleaned up automatically.

Comment 5 Sandro Bonazzola 2019-01-28 09:39:58 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 7 Eyal Shenitzky 2019-03-17 11:16:33 UTC
Nijin, 
Can you please add engine and VDSM logs?

Comment 8 nijin ashok 2019-03-18 01:46:03 UTC
(In reply to Eyal Shenitzky from comment #7)
> Nijin, 
> Can you please add engine and VDSM logs?

I don't have the same environment now. However, it was easy to reproduce. Could you please try at your end?

Comment 9 Fred Rolland 2019-04-22 08:43:36 UTC
Nijin hi,

It seems that looking only at the status of the snapshot entry is not enough.

You could try one of the following:

1. Same as described, check the status of the snapshot but once the status is OK, check the 'snapshot_type'.
   - If it got back to 'ACTIVE', then it means that the operation failed.
   - If it is 'REGULAR' and the status is 'OK', then the operation is successful.

2. Add a correlation ID when creating the snapshot, and check that all jobs with this ID are finished without failures.
This is the way it is implemented in oVirt system tests:
   - Add correlation ID:
         https://github.com/oVirt/ovirt-system-tests/blob/master/basic-suite-master/test-scenarios/004_basic_sanity.py#L363
   - Search jobs with correlation ID:
         https://github.com/oVirt/ovirt-system-tests/blob/10d0662f1a34d0f1ac5e27b80ad7a79a5fda3779/basic-suite-master/test_utils/__init__.py#L211


I don't think that we plan to change the current logic of the snapshot statuses in the near future.

Please tell me what you think about the above proposals.

Thanks,
Freddy
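
Option 1 from the comment above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the engine's or any backup vendor's implementation: it assumes the snapshot XML returned by the API carries a snapshot_type element next to the snapshot_status element shown in the description, with lowercase values such as "active" and "regular". The function name classify_snapshot and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def classify_snapshot(snapshot_xml):
    """Classify one <snapshot> document.

    Returns 'in_progress', 'failed', 'succeeded' or 'unknown'.
    Assumes <snapshot_status> and <snapshot_type> are direct children
    of <snapshot> (the latter is an assumption of this sketch).
    """
    root = ET.fromstring(snapshot_xml)
    status = (root.findtext("snapshot_status") or "").strip().lower()
    snap_type = (root.findtext("snapshot_type") or "").strip().lower()

    if status == "locked":
        # The operation is still running; keep polling.
        return "in_progress"
    if status == "ok" and snap_type == "active":
        # The ID was handed back to the "Active VM" entry: creation failed.
        return "failed"
    if status == "ok" and snap_type == "regular":
        return "succeeded"
    return "unknown"


# Hypothetical sample documents modelled on the output in the description.
FAILED = """<snapshot id="592fc15d-0b12-4156-826b-a15827cf79ab">
  <description>Active VM</description>
  <snapshot_status>ok</snapshot_status>
  <snapshot_type>active</snapshot_type>
</snapshot>"""

SUCCEEDED = """<snapshot id="592fc15d-0b12-4156-826b-a15827cf79ab">
  <description>backup</description>
  <snapshot_status>ok</snapshot_status>
  <snapshot_type>regular</snapshot_type>
</snapshot>"""

print(classify_snapshot(FAILED))
print(classify_snapshot(SUCCEEDED))
```

A caller would poll while the result is 'in_progress' and treat 'unknown' as something to investigate rather than as success.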

Comment 10 nijin ashok 2019-04-22 12:52:10 UTC
(In reply to Fred Rolland from comment #9)
Thank you Fred. I have asked the customer to forward this feedback to Commvault team.

Comment 11 Marina Kalinin 2019-04-30 19:09:39 UTC
Nijin, 
Can you please put this in a KCS as well?

Comment 12 Marina Kalinin 2019-05-03 17:57:36 UTC
Thanks, Nijin!

Should we close the bug now?

Comment 13 dev-unix-virtualization 2019-05-07 18:32:04 UTC
Freddy, 

Can you please post the correct syntax for the XML request to create a snapshot using a correlation ID?

We tried the forms below and always receive a 400 from the API server.


XML Req 1 : 
<snapshot>
  <description>My snapshot</description>
  <persist_memorystate>false</persist_memorystate>
  <query>correlation_id=test</query>
</snapshot>


XML Req 2 : 
<snapshot>
  <description>My snapshot</description>
  <persist_memorystate>false</persist_memorystate>
  <query><correlation_id>test</correlation_id></query>
</snapshot>


JSON Req ===> 

{
	
		"description" : "My snap2",
		"query": { 
			"correlation_id": "test"
		}
}

Resp: 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fault>
    <detail>For correct usage, see: https://172.24.25.3/ovirt-engine/apidoc#services/snapshots/methods/add</detail>
    <reason>Request syntactically incorrect.</reason>
</fault>

Comment 14 Benny Zlotnik 2019-05-13 12:14:14 UTC
Hi,

You can pass the correlation_id as follows:
POST /ovirt-engine/api/vms/{vm_id}/snapshots?correlation_id=097d3014-b5c4-4ab0-96d9-003f310a1b31

and to search for it you can use:
GET /ovirt-engine/api/jobs?search=correlation_id%3D097d3014-b5c4-4ab0-96d9-003f310a1b31
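
The two requests above can be built and evaluated with a short Python sketch like the one below. The helper names (create_snapshot_path, jobs_search_path, jobs_outcome) are hypothetical. The sketch also assumes that the jobs search returns a <jobs> document whose <job> elements each have a <status> child with values such as "started", "finished", "failed" or "aborted"; that shape is an assumption and should be checked against the API documentation of the engine in use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API = "/ovirt-engine/api"


def create_snapshot_path(vm_id, correlation_id):
    """Path for the POST that creates a snapshot with a correlation ID.

    The correlation ID goes in the query string, not in the request body.
    """
    query = urlencode({"correlation_id": correlation_id})
    return f"{API}/vms/{vm_id}/snapshots?{query}"


def jobs_search_path(correlation_id):
    """Path for the GET that searches jobs by correlation ID."""
    query = urlencode({"search": f"correlation_id={correlation_id}"})
    return f"{API}/jobs?{query}"


def jobs_outcome(jobs_xml):
    """Return 'failed', 'pending' or 'succeeded' for a <jobs> document.

    Assumed response shape: <jobs><job><status>...</status></job>...</jobs>
    """
    statuses = [
        (job.findtext("status") or "").strip().lower()
        for job in ET.fromstring(jobs_xml).findall("job")
    ]
    if any(s in ("failed", "aborted") for s in statuses):
        return "failed"
    if not statuses or any(s != "finished" for s in statuses):
        # No job visible yet, or at least one job is still running.
        return "pending"
    return "succeeded"


CORRELATION_ID = "097d3014-b5c4-4ab0-96d9-003f310a1b31"
print(create_snapshot_path("acfba9f2-5de8-4c50-a30e-04024013ab28", CORRELATION_ID))
print(jobs_search_path(CORRELATION_ID))
```

Using a fresh UUID as the correlation ID for every snapshot request keeps the job search from matching jobs that belong to earlier operations.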

Comment 15 Tal Nisan 2019-05-13 14:12:59 UTC
*** Bug 1702188 has been marked as a duplicate of this bug. ***

Comment 16 Fred Rolland 2019-05-16 09:07:15 UTC
Nijin,

Can we close the bug?

Thanks

Comment 17 nijin ashok 2019-05-16 09:10:38 UTC
Sure Fred. I think we can close it.