Bug 1835096 - Snapshot reports as 'done' even though it failed (due to I/O error)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.40.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.1
Target Release: 4.40.19
Assignee: Liran Rotenberg
QA Contact: Beni Pelled
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-13 05:52 UTC by Beni Pelled
Modified: 2020-08-17 06:27 UTC (History)

Fixed In Version: vdsm-4.40.19
Doc Type: Bug Fix
Doc Text:
Previously, if creating a live snapshot failed because of a storage error, the RHV Manager would incorrectly report that it had been successful. The current release fixes this issue. Now, if creating a snapshot fails, the Manager correctly shows that it failed.
Clone Of:
Environment:
Last Closed: 2020-07-08 08:27:08 UTC
oVirt Team: Virt
Embargoed:
rbarry: ovirt-4.4?


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 108998 0 master MERGED virt: snapshot failure handling 2020-08-17 06:27:26 UTC

Description Beni Pelled 2020-05-13 05:52:02 UTC
Description of problem:
If the connection to the storage domain (SD) fails for a few minutes (less than the timeout configured for the snapshot),
the snapshot process is reported as "successful", but there is no option to use the snapshot (clone, review, etc.).

Version-Release number of selected component (if applicable):
- ovirt-engine-4.4.0-0.33.master.el8ev.noarch
- libvirt-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
- vdsm-4.40.13-1.el8ev.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Run a VM and create a snapshot with memory
2. Break the connection to the SD and reconnect it after 5 minutes

Actual results:
The engine reports that the snapshot process completed successfully, but there is no option to use the snapshot (clone, review, etc.).

Expected results:
If the snapshot process cannot be completed after the connection is restored (by continuing from the point of failure),
the engine should report that the process failed and update accordingly.
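
The expected behaviour can be illustrated with a minimal Python sketch. It is not the vdsm fix (that is the gerrit change listed under Links); the names run_snapshot_job and broken_storage and the status dict are hypothetical. The only point shown is that the reported status is derived from whether the snapshot step raised an I/O error, instead of being set to 'done' unconditionally.

```python
import errno


def run_snapshot_job(take_snapshot):
    """Run take_snapshot() and return a status dict (hypothetical shape).

    The status is 'done' only when the callable returns without raising.
    Any OSError, for example EIO from unreachable storage, yields 'failed'.
    """
    try:
        take_snapshot()
    except OSError as e:
        # Propagate the failure in the reported status instead of hiding it.
        return {"status": "failed", "error": e.strerror or str(e)}
    return {"status": "done", "error": None}


def broken_storage():
    """Stand-in for a snapshot step that hits an I/O error (assumption)."""
    raise OSError(errno.EIO, "Input/output error")
```

Calling run_snapshot_job(broken_storage) returns a 'failed' status carrying the error text, while a callable that returns normally produces 'done'.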

Additional info:

Comment 1 Beni Pelled 2020-07-05 14:25:28 UTC
Verified with:
- ovirt-engine-4.4.1.7-0.3.el8ev.noarch
- vdsm-4.40.22-1.el8ev.x86_64
- libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64


Steps to Reproduce:
1. Run a VM and create a snapshot with memory
2. Break the connection to the SD (iSCSI in my case)

Result:
- The snapshot operation failed after about a minute with "Failed to complete snapshot 'snap1' creation for VM ...."


- engine.log

    2020-07-05 17:07:16,326+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snap1' creation for VM 'bpelled_test_snapshot_break_storage_connection'.


- vdsm.log

    2020-07-05 17:07:12,933+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') stopping in state failed (force False) (task:1265)
    2020-07-05 17:07:12,933+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') ref 1 aborting True (task:1008)
    2020-07-05 17:07:12,933+0300 INFO  (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') aborting: Task is aborted: "value=Storage domain does not exist: ('b28faf24-4a1f-4bed-8830-21b4d1578141',) abortedcode=358" (task:1190)
    2020-07-05 17:07:12,933+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') Prepare: aborted: value=Storage domain does not exist: ('b28faf24-4a1f-4bed-8830-21b4d1578141',) abortedcode=358 (task:1195)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') ref 0 aborting True (task:1008)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') Task._doAbort: force False (task:944)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') moving from state failed -> state aborting (task:624)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') _aborting: recover policy none (task:578)
    2020-07-05 17:07:12,934+0300 DEBUG (jsonrpc/4) [storage.TaskManager.Task] (Task='f271b5e5-2ef3-4240-b17a-563102540548') moving from state failed -> state failed (task:624)
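
When triaging similar failures, the aborted storage task can be pulled out of vdsm.log with a short script. This is a sketch that assumes the line format shown in the excerpt above; the regular expression and the name find_aborted_tasks are illustrative and not part of vdsm.

```python
import re

# Matches the INFO "aborting" line as it appears in the excerpt above.
# The format is an assumption based on that excerpt only.
ABORT_RE = re.compile(
    r"\(Task='(?P<task>[0-9a-f-]+)'\) aborting: Task is aborted: "
    r"\"value=(?P<reason>.*) abortedcode=(?P<code>\d+)\""
)


def find_aborted_tasks(lines):
    """Return (task_id, aborted_code, reason) for each matching log line."""
    results = []
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            results.append(
                (m.group("task"), int(m.group("code")), m.group("reason"))
            )
    return results
```

Feeding it the lines of a log file (for example, an open file object) yields one tuple per aborted task; lines that do not contain the "aborting" message are skipped.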

Comment 2 Sandro Bonazzola 2020-07-08 08:27:08 UTC
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

