Created attachment 1250901 [details]
engine and vdsm logs

Description of problem:
On a Hosted-Engine environment only, when performing a volume removal (live merge, LSM, etc.), after the operation completes an error appears in the Events tab:

VDSM <HOST NAME> command HSMGetAllTasksStatusesVDS failed: Volume does not exist

From the engine.log:

2017-02-16 11:04:07,624-05 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler3) [b318a40] Command 'RemoveSnapshot' (id: '67304551-2c3b-495c-979b-695efa3a217a') waiting on child command id: '049b769f-f7c3-456a-b48a-6d77682173a2' type:'RemoveSnapshotSingleDiskLive' to complete
2017-02-16 11:04:09,635-05 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (DefaultQuartzScheduler6) [69366425] Command 'RemoveSnapshotSingleDiskLive' (id: '049b769f-f7c3-456a-b48a-6d77682173a2') waiting on child command id: '1aa42bb6-4083-4319-94f9-5da8260263db' type:'DestroyImage' to complete
2017-02-16 11:04:11,960-05 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler7) [b318a40] Polling and updating Async Tasks: 3 tasks, 1 tasks to poll now
2017-02-16 11:04:12,479-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler7) [b318a40] Failed in 'HSMGetAllTasksStatusesVDS' method
2017-02-16 11:04:12,484-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [b318a40] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM _slot-5b command HSMGetAllTasksStatusesVDS failed: Volume does not exist

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
A browser issue submitted the bug before I finished editing.

Version-Release number of selected component (if applicable):
rhevm-4.1.0.4-0.1.el7

How reproducible:
100%, on several operations

Steps to Reproduce:
1. Perform a live merge
2.
3.

Actual results:

Expected results:

Additional info:
This somehow affects manually switching the SPM to another host...

Feb 16, 2017 11:45:00 AM
Failed to force select slot-5a as the SPM due to a failure to stop the current SPM.
Please provide more info on the flow that led you to this.
Was the environment upgraded from a previous version? Was the VM upgraded from an older compatibility level?
(In reply to Yaniv Dary from comment #3)
> Please provide more info on the flow that led you to this.
> Was the environment upgraded from a previous version? Was the VM upgraded
> from an older compatibility level?

Our setup is an old env: it was migrated to SHE in 3.5 (including a DB restore), then upgraded to 3.6 and 4.0. Recently the env was upgraded to 4.1, but it still keeps the 4.0 cluster compatibility level.
Please describe the VM in question. Is it running at the latest compatibility level? Is it in the 4.0 cluster?
The log has a lot of:

Exception: org.springframework.jdbc.BadSqlGrammarException: StatementCallback; bad SQL grammar

Maybe something went wrong in the upgrade process?
The issue seems to disappear after we changed the cluster level to 4.1.
The cluster level was 4.0.
What was the VDSM version?
(In reply to Fred Rolland from comment #8)
> The cluster level was 4.0.
> What was the VDSM version?

vdsm-4.19.4-1.el7ev.x86_64
(In reply to Jiri Belka from comment #7)
> The issue seems to disappear after we changed the cluster level to 4.1.

Not really; I just restarted ovirt-engine while on 4.1-beta and it's back again. If you would like access to the env, ping me on IRC.
Some details on the investigation so far:

- The message "VDSM _slot-5b command HSMGetAllTasksStatusesVDS failed: Volume does not exist" appears because there are tasks that were not cleared on the SPM.
- Because of the uncleared tasks, the SPM cannot move to another host.
- These tasks belong to a command that failed on the engine with the following exception:

  [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (default task-18) [2a3e3a24] Exception: java.lang.IllegalStateException: Duplicate key org.ovirt.engine.core.common.businessentities.storage.DiskImage@3711fed7

- The cause of this exception is that the view all_disks_for_vms in the database contains, for the same VM, more than one disk with the same id but with a different VM snapshot id.
- Previewing a snapshot of one of the VMs adds one more disk to the view (a query sketch for spotting such duplicates follows below).
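For reference, duplicate rows of this kind can be spotted with a query along the lines below. This is only an illustrative sketch: the column names (vm_id, disk_id, vm_snapshot_id) are assumptions based on the description above and may not match the actual view definition.

-- disks that appear more than once for the same VM in the view
-- (column names assumed; check them against the view definition first)
SELECT vm_id, disk_id, COUNT(*) AS occurrences
FROM all_disks_for_vms
GROUP BY vm_id, disk_id
HAVING COUNT(*) > 1;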
Strange to see it as a blocker if the severity is medium. Setting the severity to URGENT (also because I understood it happened in RHEV.TLV?)
Bug 1427104 will be resolved for 4.1.1. ATM, we aren't sure which of these two caused the other, and there doesn't seem to be a clear reproducer for this one. Pushing out to 4.1.2 for further investigation. Daniel - any insights so far?
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
According to the logs [1], the source issue seems similar to bug 1427104, i.e. the 'Duplicate key' is a consequence of the failure during the snapshot commit. Moving to MODIFIED, as it should be resolved as part of the same fix.

[1]
2017-02-15 07:28:28,746-05 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand'.
2017-02-15 07:28:28,747-05 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand'.
2017-02-15 07:28:28,748-05 INFO [org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl] (org.ovirt.thread.pool-6-thread-6) [] Rollback for command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand'
Moving back to 4.1.1 then, thanks Daniel.
Where is the backport for 4.1.1.z if it's in 4.1.1?
(In reply to Eyal Edri from comment #18)
> Where is the backport for 4.1.1.z if it's in 4.1.1?

It's the same backport as bug 1427104.
(In reply to Daniel Erez from comment #19)
> (In reply to Eyal Edri from comment #18)
> > Where is the backport for 4.1.1.z if it's in 4.1.1?
>
> It's the same backport as bug 1427104.

So it's a dup? If so, please close it as such.
(In reply to Yaniv Kaul from comment #20)
> (In reply to Daniel Erez from comment #19)
> > (In reply to Eyal Edri from comment #18)
> > > Where is the backport for 4.1.1.z if it's in 4.1.1?
> >
> > It's the same backport as bug 1427104.
>
> So it's a dup? If so, please close it as such.

I prefer making this a test-only bug, so that we make sure it is resolved.
(In reply to Yaniv Kaul from comment #20)
> (In reply to Daniel Erez from comment #19)
> > (In reply to Eyal Edri from comment #18)
> > > Where is the backport for 4.1.1.z if it's in 4.1.1?
> >
> > It's the same backport as bug 1427104.
>
> So it's a dup? If so, please close it as such.

It's a similar root cause, but a different scenario, so it is better verified separately.
The issue still occurs on our env, the one where it was originally found.
Target release should be placed once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Hi,

You should not verify on a VM that is already in an invalid state. Please check that you don't have two active snapshots before trying to verify this fix (a query sketch for this check follows below).

In order to recover a VM that is in this invalid state, a manual step is needed; it will be described in bz 1441558.

Please verify on a VM that is in a valid state.

Thanks,
Fred
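For reference, the 'two active snapshots' condition can be checked directly in the engine database. A minimal sketch, assuming the snapshot_type column stores the literal 'ACTIVE' for the active snapshot (the snapshots table layout is shown later in this bug):

-- VMs in the invalid state: more than one active snapshot
-- ('ACTIVE' as the snapshot_type value is an assumption)
SELECT vm_id, COUNT(*) AS active_snapshots
FROM snapshots
WHERE snapshot_type = 'ACTIVE'
GROUP BY vm_id
HAVING COUNT(*) > 1;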
See https://bugzilla.redhat.com/show_bug.cgi?id=1427104#c32 for how I hit the same issue.
The basic error was fixed. This exposed another error with handling memory volumes that seems to have been present since 3.5. Pushing out of the async.
Solved in bug 1427104.
(In reply to Fred Rolland from comment #30)
> Solved in bug 1427104.

Moving to MODIFIED alongside it.
There's some progress (I was previewing an old snapshot with memory and committing it; it kept only the snapshot I was previewing/committing while deleting the others, and there was no error in the UI), but I see this in engine.log:

version: ovirt-engine-4.1.3.2-0.1.el7.noarch

# tail -f /var/log/ovirt-engine/engine.log | grep ERROR
2017-06-16 14:05:25,614+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (default task-14) [4f0a8e31] Failed to remove memory 'fc82af4a-67a6-4a7d-beb3-02a8d8d3155a,3a63d854-bed0-11e0-b671-545200312d04,a320622a-deff-4ab1-b9c0-334b09273e40,02144c69-25d2-4d80-8332-c24565228965,6137d9d4-34ad-48fb-94d5-693f6cd9257f,008966a9-f2f6-42f7-ab5f-7140f8d2290f' of snapshot '82b8fdbc-5d49-4870-8511-428e8da6b2f4'
2017-06-16 14:05:30,422+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler1) [797c6f73] Failed in 'HSMGetAllTasksStatusesVDS' method
2017-06-16 14:05:30,431+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [797c6f73] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM slot-1 command HSMGetAllTasksStatusesVDS failed: Volume does not exist
2017-06-16 14:05:41,466+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler9) [694d7b39] Failed in 'HSMGetAllTasksStatusesVDS' method
2017-06-16 14:05:41,479+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [694d7b39] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM slot-1 command HSMGetAllTasksStatusesVDS failed: Volume does not exist
2017-06-16 14:05:41,483+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler9) [694d7b39] BaseAsyncTask::logEndTaskFailure: Task '868de0a5-c259-4c26-8b8b-d87e8a4ef60c' (Parent Command 'RestoreFromSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
2017-06-16 14:05:49,160+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (DefaultQuartzScheduler4) [b76f8b48-375f-4336-9d65-2f78ba30e429] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand' with failure.
2017-06-16 14:05:49,177+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (DefaultQuartzScheduler4) [b76f8b48-375f-4336-9d65-2f78ba30e429] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand' with failure.
Did you test on a clean environment?

Here are the steps for verifying:

- Create a VM with a disk (no need to install an OS)
- Start the VM
- Take a snapshot with memory (snap1)
- Take another snapshot with memory (snap2)
- Shut down the VM
- Preview the first snapshot, including the memory (snap1)
- Commit the first snapshot (snap1)
(In reply to Fred Rolland from comment #34)
> Did you test on a clean environment?
>
> Here are the steps for verifying:
>
> - Create a VM with a disk (no need to install an OS)
> - Start the VM
> - Take a snapshot with memory (snap1)
> - Take another snapshot with memory (snap2)
> - Shut down the VM
> - Preview the first snapshot, including the memory (snap1)
> - Commit the first snapshot (snap1)

No, I did not test it on a clean env. We have an env which has this issue - I am the reporter - and we want the issue fixed. What can I do to get this fix applied in the real world?
Can you please provide the output of the command below?

$ psql -d engine -U engine
engine=> \d snapshots

Thanks
(In reply to Fred Rolland from comment #36)
> Can you please provide the output of the command below?
>
> $ psql -d engine -U engine
> engine=> \d snapshots

engine=# \d snapshots
                         Table "public.snapshots"
         Column          |           Type           |        Modifiers
-------------------------+--------------------------+-------------------------
 snapshot_id             | uuid                     | not null
 vm_id                   | uuid                     | not null
 snapshot_type           | character varying(32)    | not null
 status                  | character varying(32)    | not null
 description             | character varying(4000)  |
 creation_date           | timestamp with time zone | not null
 app_list                | text                     |
 vm_configuration        | text                     |
 _create_date            | timestamp with time zone | default now()
 _update_date            | timestamp with time zone |
 memory_volume           | character varying(255)   |
 memory_metadata_disk_id | uuid                     |
 memory_dump_disk_id     | uuid                     |
 vm_configuration_broken | boolean                  | not null default false
Indexes:
    "pk_snapshots" PRIMARY KEY, btree (snapshot_id)
    "idx_snapshots_memory_dump_disk_id" btree (memory_dump_disk_id)
    "idx_snapshots_memory_metadata_disk_id" btree (memory_metadata_disk_id)
    "idx_snapshots_snapshot_type" btree (snapshot_type)
    "idx_snapshots_vm_id" btree (vm_id)
Foreign-key constraints:
    "fk_snapshot_vm" FOREIGN KEY (vm_id) REFERENCES vm_static(vm_guid)
Referenced by:
    TABLE "vm_device" CONSTRAINT "vm_device_snapshot_id_fkey" FOREIGN KEY (snapshot_id) REFERENCES snapshots(snapshot_id) ON DELETE CASCADE

engine=#
Hi,

Does the log you provided also include entries from before you upgraded to engine-4.1.3.2? If yes, can you tell me when the upgrade was done?

Thanks,
Freddy
(In reply to Jiri Belka from comment #32)
> There's some progress (I was previewing an old snapshot with memory and
> committing it; it kept only the snapshot I was previewing/committing while
> deleting the others, and there was no error in the UI), but I see this in
> engine.log:
>
> version: ovirt-engine-4.1.3.2-0.1.el7.noarch
>
> # tail -f /var/log/ovirt-engine/engine.log | grep ERROR
> 2017-06-16 14:05:25,614+02 ERROR

The cause of these errors is that in previous attempts to commit the previewed snapshot, before the last fix, the memory volumes were already deleted from the db/storage. Did the operation eventually succeed?
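For reference, snapshots that still carry stale memory-volume references can be listed from the snapshots table shown above. A minimal sketch; interpreting the six comma-separated UUIDs in memory_volume as (storage domain, pool, dump image, dump volume, metadata image, metadata volume) is an assumption inferred from the log in comment 32:

-- snapshots that still reference memory volumes; rows whose volumes were
-- already removed from storage would explain the 'Failed to remove memory'
-- errors above (six-UUID interpretation assumed, not verified)
SELECT snapshot_id, vm_id, description, memory_volume
FROM snapshots
WHERE memory_volume IS NOT NULL
  AND memory_volume <> '';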
(In reply to Fred Rolland from comment #38)
> Hi,
>
> Does the log you provided also include entries from before you upgraded to
> engine-4.1.3.2?

Could you be more specific about which logs you need (i.e. what should I search for in the logs from before the upgrade)?

> If yes, can you tell me when the upgrade was done?

Fri Jun 16 13:50:37 2017 - Fri Jun 16 13:51:14 2017
OK, on ovirt-engine-4.1.3.2-0.1.el7.noarch the issues described in comment #32 are remnants of the incorrect snapshot handling in previous versions.

Otherwise, I do not see the HSMGetAllTasksStatusesVDSCommand ERROR events anymore. They can still appear if the engine works with snapshots made in a previous version, but only once.