Bug 1835096

Summary: Snapshot reports as 'done' even though it failed (due to I/O error)
Product: [oVirt] vdsm
Reporter: Beni Pelled <bpelled>
Component: General
Assignee: Liran Rotenberg <lrotenbe>
Status: CLOSED CURRENTRELEASE
QA Contact: Beni Pelled <bpelled>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.40.13
CC: bugs, lrotenbe, rbarry, rdlugyhe
Target Milestone: ovirt-4.4.1
Flags: rbarry: ovirt-4.4?
Target Release: 4.40.19
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: vdsm-4.40.19
Doc Type: Bug Fix
Doc Text:
Previously, if creating a live snapshot failed because of a storage error, the RHV Manager would incorrectly report that it had been successful. The current release fixes this issue. Now, if creating a snapshot fails, the Manager correctly shows that it failed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-08 08:27:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Beni Pelled 2020-05-13 05:52:02 UTC
Description of problem:
If the connection to the storage domain (SD) fails for a few minutes (less than the timeout configured for the snapshot),
the snapshot process is reported as "successful", but there is no option to use the snapshot (clone, review, etc.).

Version-Release number of selected component (if applicable):
- ovirt-engine-4.4.0-0.33.master.el8ev.noarch
- libvirt-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
- vdsm-4.40.13-1.el8ev.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Run a VM and create a snapshot with memory
2. Break the connection to the SD and reconnect it after 5 minutes

Actual results:
The engine reports that the snapshot process completed successfully, but there is no option to use the snapshot (clone, review, etc.).

Expected results:
If the snapshot process cannot be completed after the connection is restored (by continuing from the point of failure),
the engine should report that the process failed and update the snapshot status accordingly.

Additional info:

Comment 1 Beni Pelled 2020-07-05 14:25:28 UTC
Verified with:
- ovirt-engine-4.4.1.7-0.3.el8ev.noarch
- vdsm-4.40.22-1.el8ev.x86_64
- libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64


Steps to Reproduce:
1. Run a VM and create a snapshot with memory
2. Break the connection to the SD (iSCSI in my case)

Result:
- The snapshot operation failed after about a minute with "Failed to complete snapshot 'snap1' creation for VM ...."


- engine.log

    2020-07-05 17:07:16,326+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snap1' creation for VM 'bpelled_test_snapshot_break_storage_connection'.
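
To check for this event without reading engine.log by hand, the audit entry can be picked out with a small script. The following is a minimal sketch, not part of oVirt: the regular expression is modeled only on the single engine.log line quoted above, and the helper names (parse_audit_event, snapshot_failures) are hypothetical. Other engine.log entries may use a layout this pattern does not match.

```python
import re

# Assumption: the layout is inferred from the one engine.log line quoted in
# this comment (timestamp, level, [logger], ..., EVENT_ID: NAME(code), text).
AUDIT_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\].*?"
    r"EVENT_ID: (?P<event>[A-Z_]+)\((?P<code>\d+)\), (?P<message>.*)$"
)

# Event name taken verbatim from the log line above.
SNAPSHOT_FAILURE_EVENT = "USER_CREATE_SNAPSHOT_FINISHED_FAILURE"


def parse_audit_event(line):
    """Return the audit event fields of one log line, or None if it has none."""
    match = AUDIT_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"])
    return fields


def snapshot_failures(lines):
    """Return the parsed audit events that report a failed snapshot creation."""
    events = (parse_audit_event(line) for line in lines)
    return [e for e in events if e and e["event"] == SNAPSHOT_FAILURE_EVENT]
```

Calling snapshot_failures() on the lines of an open engine.log file returns one dictionary per matching entry; an empty list means no failure event was found in the lines that were scanned.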


- vdsm.log

    2020-07-05 17:07:12,933+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') stopping in state failed (force False) (task:1265)
    2020-07-05 17:07:12,933+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') ref 1 aborting True (task:1008)
    2020-07-05 17:07:12,933+0300 INFO  (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') aborting: Task is aborted: "value=Storage domain does not exist: ('b28faf24-4a1f-4bed-8830-21b4d1578141',) abortedcode=358" (task:1190)
    2020-07-05 17:07:12,933+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') Prepare: aborted: value=Storage domain does not exist: ('b28faf24-4a1f-4bed-8830-21b4d1578141',) abortedcode=358 (task:1195)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') ref 0 aborting True (task:1008)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') Task._doAbort: force False (task:944)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') moving from state failed -> state aborting (task:624)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') _aborting: recover policy none (task:578)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') moving from state failed -> state failed (task:624)
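
The task lines above carry two details relevant to the verification: the state transitions and the abort code. A minimal sketch for collecting both per task follows. It is an illustration only: the patterns are derived from the vdsm.log lines quoted in this comment, and summarize_tasks is a hypothetical helper, not a vdsm function.

```python
import re

# Assumption: patterns are inferred from the storage.TaskManager.Task lines
# quoted above; other vdsm.log entries are simply skipped.
TASK_RE = re.compile(
    r"\(Task='(?P<task>[0-9a-f-]+)'\) (?P<message>.*?) \(task:\d+\)$"
)
TRANSITION_RE = re.compile(r"moving from state (\w+) -> state (\w+)")
ABORT_CODE_RE = re.compile(r"abortedcode=(\d+)")


def summarize_tasks(lines):
    """Map each task ID to its logged state transitions and abort codes."""
    summary = {}
    for line in lines:
        match = TASK_RE.search(line.strip())
        if match is None:
            continue
        entry = summary.setdefault(
            match.group("task"), {"transitions": [], "abort_codes": set()}
        )
        message = match.group("message")
        transition = TRANSITION_RE.search(message)
        if transition:
            entry["transitions"].append(transition.groups())
        code = ABORT_CODE_RE.search(message)
        if code:
            entry["abort_codes"].add(int(code.group(1)))
    return summary
```

Abort codes are stored in a set because the same code can be logged more than once for a task, as it is for the "aborting" and "Prepare: aborted" lines above.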

Comment 2 Sandro Bonazzola 2020-07-08 08:27:08 UTC
This bug is included in the oVirt 4.4.1 release, published on July 8th, 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.