Created attachment 1250901 [details]
engine and vdsm logs

Description of problem:
On a Hosted-Engine environment only, when performing a volume removal (live merge, LSM, etc.), after the operation completes an error appears in the Events tab:

VDSM <HOST NAME> command HSMGetAllTasksStatusesVDS failed: Volume does not exist

From the engine.log:

2017-02-16 11:04:07,624-05 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler3) [b318a40] Command 'RemoveSnapshot' (id: '67304551-2c3b-495c-979b-695efa3a217a') waiting on child command id: '049b769f-f7c3-456a-b48a-6d77682173a2' type:'RemoveSnapshotSingleDiskLive' to complete
2017-02-16 11:04:09,635-05 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (DefaultQuartzScheduler6) [69366425] Command 'RemoveSnapshotSingleDiskLive' (id: '049b769f-f7c3-456a-b48a-6d77682173a2') waiting on child command id: '1aa42bb6-4083-4319-94f9-5da8260263db' type:'DestroyImage' to complete
2017-02-16 11:04:11,960-05 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler7) [b318a40] Polling and updating Async Tasks: 3 tasks, 1 tasks to poll now
2017-02-16 11:04:12,479-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler7) [b318a40] Failed in 'HSMGetAllTasksStatusesVDS' method
2017-02-16 11:04:12,484-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [b318a40] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM _slot-5b command HSMGetAllTasksStatusesVDS failed: Volume does not exist

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
A browser issue submitted the bug before I finished editing.

Version-Release number of selected component (if applicable):
rhevm-4.1.0.4-0.1.el7

How reproducible:
100%, on several operations

Steps to Reproduce:
1. Perform a live merge
2.
3.

Actual results:

Expected results:

Additional info:
This somehow affects manually switching the SPM to another host...

Feb 16, 2017 11:45:00 AM
Failed to force select slot-5a as the SPM due to a failure to stop the current SPM.
Please provide more info on the flow that led you to this.
Was the environment upgraded from a previous version? Was the VM upgraded from an older compatibility level?
(In reply to Yaniv Dary from comment #3)
> Please provide more info on the flow that led you to this.
> Was the environment upgraded from a previous version? Was the VM upgraded
> from an older compatibility level?

Our setup is an old env: it was migrated to SHE in 3.5 (including a DB restore), then upgraded to 3.6 and 4.0. Recently the env was upgraded to 4.1, but it still keeps the 4.0 cluster compatibility level.
Please describe the VM in question. Is it running at the latest compatibility level? Is it in the 4.0 cluster?
The log has a lot of:

Exception: org.springframework.jdbc.BadSqlGrammarException: StatementCallback; bad SQL grammar

Maybe something went wrong in the upgrade process?
The issue seems to disappear after we changed the cluster level to 4.1.
The cluster level was 4.0.
What was the VDSM version?
(In reply to Fred Rolland from comment #8)
> The cluster level was 4.0.
> What was the VDSM version?

vdsm-4.19.4-1.el7ev.x86_64
(In reply to Jiri Belka from comment #7)
> The issue seems to disappear after we changed the cluster level to 4.1.

Not really; I just restarted ovirt-engine while on 4.1-beta and it's back again. If you would like access to the env, ping me on IRC.
Some details on the investigation so far:

- The message "VDSM _slot-5b command HSMGetAllTasksStatusesVDS failed: Volume does not exist" appears because there are tasks that were not cleared on the SPM.
- Because of the uncleared tasks, the SPM cannot move to another host.
- These tasks belong to a command that failed on the engine with the following exception:

  [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (default task-18) [2a3e3a24] Exception: java.lang.IllegalStateException: Duplicate key org.ovirt.engine.core.common.businessentities.storage.DiskImage@3711fed7

- The cause of this exception is that the view all_disks_for_vms in the database contains, for the same VM, more than one disk with the same id but with a different VM snapshot id.
- Previewing a snapshot of one of the VMs adds one more disk to the view (a query sketch for spotting such duplicates follows below).
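For reference, duplicate rows of this kind can be spotted with a query along the lines below. This is only an illustrative sketch: the column names (vm_id, disk_id, vm_snapshot_id) are assumptions based on the description above and may not match the actual view definition.

-- disks that appear more than once for the same VM in the view
-- (column names assumed; check them against the view definition first)
SELECT vm_id, disk_id, COUNT(*) AS occurrences
FROM all_disks_for_vms
GROUP BY vm_id, disk_id
HAVING COUNT(*) > 1;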
Strange to see it as a blocker if the severity is medium. Setting the severity to URGENT (also because I understood it happened in RHEV.TLV?)
Bug 1427104 will be resolved for 4.1.1. ATM, we aren't sure which of these two caused the other, and there doesn't seem to be a clear reproducer for this one. Pushing out to 4.1.2 for further investigation. Daniel - any insights so far?
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
According to the logs [1], the source issue seems similar to bug 1427104, i.e. the 'Duplicate key' is a consequence of the failure during the snapshot commit. Moving to MODIFIED, as it should be resolved as part of the same fix.

[1]
2017-02-15 07:28:28,746-05 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand'.
2017-02-15 07:28:28,747-05 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand'.
2017-02-15 07:28:28,748-05 INFO [org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl] (org.ovirt.thread.pool-6-thread-6) [] Rollback for command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand'
Moving back to 4.1.1 then, thanks Daniel.
Where is the backport for 4.1.1.z if it's in 4.1.1?
(In reply to Eyal Edri from comment #18)
> Where is the backport for 4.1.1.z if it's in 4.1.1?

It's the same backport as bug 1427104.
(In reply to Daniel Erez from comment #19)
> (In reply to Eyal Edri from comment #18)
> > Where is the backport for 4.1.1.z if it's in 4.1.1?
>
> It's the same backport as bug 1427104.

So it's a dup? If so, please close it as such.
(In reply to Yaniv Kaul from comment #20)
> (In reply to Daniel Erez from comment #19)
> > (In reply to Eyal Edri from comment #18)
> > > Where is the backport for 4.1.1.z if it's in 4.1.1?
> >
> > It's the same backport as bug 1427104.
>
> So it's a dup? If so, please close it as such.

I prefer making this a test-only bug, so that we make sure it is resolved.
(In reply to Yaniv Kaul from comment #20)
> (In reply to Daniel Erez from comment #19)
> > (In reply to Eyal Edri from comment #18)
> > > Where is the backport for 4.1.1.z if it's in 4.1.1?
> >
> > It's the same backport as bug 1427104.
>
> So it's a dup? If so, please close it as such.

It's a similar root cause, but a different scenario, so it is better verified separately.
The issue still occurs on our env, the one where it was originally found.
Target release should be placed once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Hi,

You should not verify on a VM that is already in an invalid state. Please check that you don't have two active snapshots before trying to verify this fix (a query sketch for this check follows below).

In order to recover a VM that is in this invalid state, a manual step is needed; it will be described in bz 1441558.

Please verify on a VM that is in a valid state.

Thanks,
Fred
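For reference, the 'two active snapshots' condition can be checked directly in the engine database. A minimal sketch, assuming the snapshot_type column stores the literal 'ACTIVE' for the active snapshot (the snapshots table layout is shown later in this bug):

-- VMs in the invalid state: more than one active snapshot
-- ('ACTIVE' as the snapshot_type value is an assumption)
SELECT vm_id, COUNT(*) AS active_snapshots
FROM snapshots
WHERE snapshot_type = 'ACTIVE'
GROUP BY vm_id
HAVING COUNT(*) > 1;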
See https://bugzilla.redhat.com/show_bug.cgi?id=1427104#c32 for how I hit the same issue.
The basic error was fixed. This exposed another error with handling memory volumes that seems to have been present since 3.5. Pushing out of the async.
Solved in bug 1427104.
(In reply to Fred Rolland from comment #30)
> Solved in bug 1427104.

Moving to MODIFIED alongside it.
There's some progress (I was previewing an old snapshot with memory and committing it; it kept only the snapshot I was previewing/committing while deleting the others, and there was no error in the UI), but I see this in engine.log:

version: ovirt-engine-4.1.3.2-0.1.el7.noarch

# tail -f /var/log/ovirt-engine/engine.log | grep ERROR
2017-06-16 14:05:25,614+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (default task-14) [4f0a8e31] Failed to remove memory 'fc82af4a-67a6-4a7d-beb3-02a8d8d3155a,3a63d854-bed0-11e0-b671-545200312d04,a320622a-deff-4ab1-b9c0-334b09273e40,02144c69-25d2-4d80-8332-c24565228965,6137d9d4-34ad-48fb-94d5-693f6cd9257f,008966a9-f2f6-42f7-ab5f-7140f8d2290f' of snapshot '82b8fdbc-5d49-4870-8511-428e8da6b2f4'
2017-06-16 14:05:30,422+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler1) [797c6f73] Failed in 'HSMGetAllTasksStatusesVDS' method
2017-06-16 14:05:30,431+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [797c6f73] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM slot-1 command HSMGetAllTasksStatusesVDS failed: Volume does not exist
2017-06-16 14:05:41,466+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler9) [694d7b39] Failed in 'HSMGetAllTasksStatusesVDS' method
2017-06-16 14:05:41,479+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [694d7b39] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM slot-1 command HSMGetAllTasksStatusesVDS failed: Volume does not exist
2017-06-16 14:05:41,483+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler9) [694d7b39] BaseAsyncTask::logEndTaskFailure: Task '868de0a5-c259-4c26-8b8b-d87e8a4ef60c' (Parent Command 'RestoreFromSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
2017-06-16 14:05:49,160+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (DefaultQuartzScheduler4) [b76f8b48-375f-4336-9d65-2f78ba30e429] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand' with failure.
2017-06-16 14:05:49,177+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (DefaultQuartzScheduler4) [b76f8b48-375f-4336-9d65-2f78ba30e429] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand' with failure.
Did you test on a clean environment?

Here are the steps for verifying:

- Create a VM with a disk (no need to install an OS)
- Start the VM
- Take a snapshot with memory (snap1)
- Take another snapshot with memory (snap2)
- Shut down the VM
- Preview the first snapshot, including the memory (snap1)
- Commit the first snapshot (snap1)
(In reply to Fred Rolland from comment #34)
> Did you test on a clean environment?
>
> Here are the steps for verifying:
>
> - Create a VM with a disk (no need to install an OS)
> - Start the VM
> - Take a snapshot with memory (snap1)
> - Take another snapshot with memory (snap2)
> - Shut down the VM
> - Preview the first snapshot, including the memory (snap1)
> - Commit the first snapshot (snap1)

No, I did not test it on a clean env. We have an env which has this issue - I am the reporter - and we want the issue fixed. What can I do to get this fix applied in the real world?
Can you please provide the output of the command below?

$ psql -d engine -U engine
engine=> \d snapshots

Thanks
(In reply to Fred Rolland from comment #36)
> Can you please provide the output of the command below?
>
> $ psql -d engine -U engine
> engine=> \d snapshots

engine=# \d snapshots
                         Table "public.snapshots"
         Column          |           Type           |        Modifiers
-------------------------+--------------------------+-------------------------
 snapshot_id             | uuid                     | not null
 vm_id                   | uuid                     | not null
 snapshot_type           | character varying(32)    | not null
 status                  | character varying(32)    | not null
 description             | character varying(4000)  |
 creation_date           | timestamp with time zone | not null
 app_list                | text                     |
 vm_configuration        | text                     |
 _create_date            | timestamp with time zone | default now()
 _update_date            | timestamp with time zone |
 memory_volume           | character varying(255)   |
 memory_metadata_disk_id | uuid                     |
 memory_dump_disk_id     | uuid                     |
 vm_configuration_broken | boolean                  | not null default false
Indexes:
    "pk_snapshots" PRIMARY KEY, btree (snapshot_id)
    "idx_snapshots_memory_dump_disk_id" btree (memory_dump_disk_id)
    "idx_snapshots_memory_metadata_disk_id" btree (memory_metadata_disk_id)
    "idx_snapshots_snapshot_type" btree (snapshot_type)
    "idx_snapshots_vm_id" btree (vm_id)
Foreign-key constraints:
    "fk_snapshot_vm" FOREIGN KEY (vm_id) REFERENCES vm_static(vm_guid)
Referenced by:
    TABLE "vm_device" CONSTRAINT "vm_device_snapshot_id_fkey" FOREIGN KEY (snapshot_id) REFERENCES snapshots(snapshot_id) ON DELETE CASCADE

engine=#
Hi,

Does the log you provided also include entries from before you upgraded to engine-4.1.3.2? If yes, can you tell me when the upgrade was done?

Thanks,
Freddy
(In reply to Jiri Belka from comment #32)
> There's some progress (I was previewing an old snapshot with memory and
> committing it; it kept only the snapshot I was previewing/committing while
> deleting the others, and there was no error in the UI), but I see this in
> engine.log:
>
> version: ovirt-engine-4.1.3.2-0.1.el7.noarch
>
> # tail -f /var/log/ovirt-engine/engine.log | grep ERROR
> 2017-06-16 14:05:25,614+02 ERROR

The cause of these errors is that in previous attempts to commit the previewed snapshot, before the last fix, the memory volumes were already deleted from the db/storage. Did the operation eventually succeed?
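For reference, snapshots that still carry stale memory-volume references can be listed from the snapshots table shown above. A minimal sketch; interpreting the six comma-separated UUIDs in memory_volume as (storage domain, pool, dump image, dump volume, metadata image, metadata volume) is an assumption inferred from the log in comment 32:

-- snapshots that still reference memory volumes; rows whose volumes were
-- already removed from storage would explain the 'Failed to remove memory'
-- errors above (six-UUID interpretation assumed, not verified)
SELECT snapshot_id, vm_id, description, memory_volume
FROM snapshots
WHERE memory_volume IS NOT NULL
  AND memory_volume <> '';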
(In reply to Fred Rolland from comment #38)
> Hi,
>
> Does the log you provided also include entries from before you upgraded to
> engine-4.1.3.2?

Could you be more specific about which logs you need (i.e. what should I search for in the logs from before the upgrade)?

> If yes, can you tell me when the upgrade was done?

Fri Jun 16 13:50:37 2017 - Fri Jun 16 13:51:14 2017
OK, on ovirt-engine-4.1.3.2-0.1.el7.noarch the issues described in comment #32 are remnants of the incorrect snapshot handling in previous versions.

Otherwise, I do not see the HSMGetAllTasksStatusesVDSCommand ERROR events anymore. They can still appear if the engine works with snapshots made in a previous version, but only once.