Description of problem:

This issue may be affected by https://bugzilla.redhat.com/show_bug.cgi?id=1422949.

When committing a snapshot in preview which was created on an older cluster compatibility level, the action ends with: 'Error while executing action Revert to Snapshot: Internal Engine Error'. There is also a mismatch in the snapshot listing of the affected VM - 'Active VM before the preview' appears multiple times, and each 'Preview' adds another duplicate entry:

--------%>-----------
Feb 27, 2017 11:38:36 AM  OK  Active VM before the preview
Feb 27, 2017 11:48:39 AM  OK  Active VM before the preview
Feb 27, 2017 12:05:27 PM  OK  Active VM before the preview
Feb 22, 2017 11:36:33 AM  OK  before_migration_to_4.1
Feb 20, 2017 4:59:18 PM   OK  ttt
Feb 17, 2017 5:45:20 PM   OK  3.6 engine
------------<%--------------

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0.4-0.1.el7.noarch
4.1 dc/cl compat
vdsm-4.19.4-1.el7ev

How reproducible:
just happens

Steps to Reproduce:
1. Preview and commit a snapshot which was created on a cluster compatibility level older than the current one (e.g. 3.6 vs 4.1).

Actual results:
Failure, and duplicated entries in the VM's snapshot list.

Expected results:
The commit should work.

Additional info:
Each preview also increases the number of disks shown in the Disks subtab list.
engine=# select vm_guid from vms where vm_name = 'jbelka-vm3';
               vm_guid
--------------------------------------
 87869854-45bc-4645-896e-53ccc0a358b8
(1 row)

engine=# select snapshot_id,description from snapshots where vm_id = '87869854-45bc-4645-896e-53ccc0a358b8';
             snapshot_id              |         description
--------------------------------------+------------------------------
 3700ada4-ceb9-4403-b674-b5d76abcedb9 | Active VM before the preview
 8a74b248-e6c2-4e06-bcf1-9d892d42868c | Active VM before the preview
 8384221d-351b-4e52-ae13-d63666d8837a | Active VM before the preview
 aab77425-b829-41b9-8885-fd6b4ce299b3 | before_migration_to_4.1
 5b5fba9b-91de-4b44-a8af-34848628056d | Active VM
 5d553e67-9189-4f37-bdf6-9b567fcdf511 | Active VM before the preview
 7507b657-7b74-4dcc-bb5d-9e904665d716 | Active VM before the preview
 67404161-3f43-41c0-922f-e4bd675b1492 | ttt
 11fefc3d-3f21-4639-b49c-81c2ceed2edb | Active VM before the preview
 1ccb49de-2e3c-40f0-ab35-7220028aa123 | Active VM before the preview
 60af453c-1d03-4ed8-9e5e-eccd8c4d7384 | 3.6 engine
(11 rows)

[oVirt shell (connected)]# list disks --parent-vm-name jbelka-vm3

id   : 77575912-f759-4976-82f4-17a53239aa44
name : jbelka-vm3_Disk1

id   : 77575912-f759-4976-82f4-17a53239aa44
name : jbelka-vm3_Disk1

id   : 77575912-f759-4976-82f4-17a53239aa44
name : jbelka-vm3_Disk1

id   : 77575912-f759-4976-82f4-17a53239aa44
name : jbelka-vm3_Disk1

id   : 77575912-f759-4976-82f4-17a53239aa44
name : jbelka-vm3_Disk1

id   : 77575912-f759-4976-82f4-17a53239aa44
name : jbelka-vm3_Disk1

id   : 77575912-f759-4976-82f4-17a53239aa44
name : jbelka-vm3_Disk1
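A quick way to spot the duplication directly in the DB is to group the snapshot rows by description. This is only a sketch - it uses just the columns already shown above and the vm_guid of the affected VM:

engine=# select description, count(*)
           from snapshots
          where vm_id = '87869854-45bc-4645-896e-53ccc0a358b8'
          group by description
         having count(*) > 1;

On a healthy VM this should return no rows; here it would return 'Active VM before the preview' with a count of 7, matching the 7 identical entries in the disk listing above.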
(In reply to Jiri Belka from comment #0)
> Description of problem:
>
> Maybe this issue is affected by
> https://bugzilla.redhat.com/show_bug.cgi?id=1422949

Indeed, this seems related. Assigning to the same assignee and targeting 4.1.1. If we see these are completely unrelated issues, we can always rethink it, although I doubt that will be needed.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Hi,

Can you reproduce this on a clean environment? Daniel tried an upgrade from 3.6 to 4.1 and it did not reproduce.

Can you also attach a DB dump so we can try to debug with your database?

Thanks
(In reply to Fred Rolland from comment #6)
> Can you reproduce this on a clean environment? Daniel tried an upgrade
> from 3.6 to 4.1 and it did not reproduce.

Any possibility you could try to reproduce on a SHE env?
Because of this issue I lost my VM with an installed engine and all prepared versions in snapshots. The only thing I could do with it was remove it. I have another VM in this state too, and now I'm afraid to do anything with snapshots on my other VMs.

The problem occurs only with newly created snapshots. A VM whose snapshots were all created before upgrading the cluster works fine, but after I created a new snapshot the problem was there. Could that state be repaired so that I will not lose more VMs?
Lucie hi,

Can you explain exactly the flow that happened to you?
Does it happen with a new VM + snapshot?

Thanks,
Freddy
I couldn't reproduce it either.
After upgrading the DC & cluster from 4.0 to 4.1 and committing the old snapshot, it completed with this message:

2017-03-13 09:44:45,787+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [] EVENT_ID: USER_RESTORE_FROM_SNAPSHOT_FINISH_SUCCESS(100), Correlation ID: 60a048a5-37da-4cbe-a4b4-70fb14dba97c, Job ID: a34a1481-68cb-40cb-ae63-cf84492f7951, Call Stack: null, Custom Event ID: -1, Message: VM Test_vm3 restoring from Snapshot has been completed.

My snapshot list after the preview and commit:

engine=# select snapshot_id,description from snapshots where vm_id = '32853641-65c8-4b42-9678-87420ed258b6';
-[ RECORD 1 ]------------------------------------
snapshot_id | 9bdc351b-d43d-49ec-b214-002e49415520
description | Active VM
-[ RECORD 2 ]------------------------------------
snapshot_id | f4b7c004-e3aa-4beb-a7fa-7ae26d500457
description | snap_test

My disks remained the same as well.
(In reply to Lilach Zitnitski from comment #12)
> I couldn't reproduce it either.
> After upgrading the DC & cluster from 4.0 to 4.1 and committing the old
> snapshot, it completed with this message: [...]

First of all, this env is a SHE env. Second, this env originates in 3.0 and was converted into SHE at the time of 3.5, so it certainly has some skeletons from its historical evolution. IMO trying to reproduce on a clean env doesn't make much sense.
It is on SHE with rhvm upgraded to 4.1.0.4-0.1.el7. I had my VMs in cluster version 3.6. When trying to update the cluster version to 4.0, there were problems with VMs that had System -> Custom Compatibility Version set to 3.6. After changing this value to empty, the cluster update to 4.0 was successful and working with snapshots was OK. A few days later the cluster version was changed to 4.1 and all the trouble started. Newly created VMs in this 4.1 cluster work with snapshots OK.
1) Had you started your VMs in preview before you made the engine upgrade 3.6->4.0->4.1?
2) Were your VMs running on 4.0 or 4.1 hosts after you bumped up the compatibility level of the host cluster to 4.1?
(In reply to Nikolai Sednev from comment #15)
> 1) Had you started your VMs in preview before you made the engine upgrade
> 3.6->4.0->4.1?

From the logs it seems that none of my VMs were in preview.

> 2) Were your VMs running on 4.0 or 4.1 hosts after you bumped up the
> compatibility level of the host cluster to 4.1?

The VMs were running on 4.0 hosts before the cluster update to 4.0, and on 4.1 hosts before the cluster update to 4.1. All the hosts were upgraded step by step from 3.6.

Jiri pointed out that a new VM also ended up in an invalid state. I did more testing (VM: lleistne-test) and he is right. A VM created from a template (VM: lleistne-engine4) still looks good whatever I do with its snapshots. See the log file.

Going through the older logs I realized I wrote it wrong - problems with snapshots also appeared after the first cluster update to 4.0, sorry.
Created attachment 1262999 [details] engine.log.20170314
1) I've deployed a clean hosted engine environment on NFS and added two NFS data storage domains to it, got hosted_storage auto-imported, and added glance as an external storage provider to create 4 RHEL 7.3 VMs from a template gotten from there.
2) I've created one snapshot for each VM, with RAM for VM1 and VM3 and without RAM for VM2 and VM4.
3) VM1 was powered off, while VM2-4 were up.
4) Upgraded one of the two hosts to 4.1 components and made it the SPM host.
5) Migrated all VMs to the 4.1 host.
6) HE-VM was also migrated to the 4.1 host.
7) Moved the 3.6 host to local maintenance.
8) Set global maintenance from the CLI of the 4.1 host.
9) Installed rhevm-appliance-4.0.20170302.0-1.el7ev.noarch.rpm on the 4.1 host.
10) Backed up the db on the engine using "engine-backup --mode=backup --file=nsednev_from_alma04_rhevm_3_6 --log=Log_nsednev_from_alma04_rhevm_3_6".
11) Created a "/backup" directory on the 4.1 host and copied the backup files from the engine there.
12) Executed "hosted-engine --upgrade-appliance" on the 4.1 host.
13) Upgraded the 3.6 engine to the 4.0 engine (4.0.7.4-0.1.el7ev) and then updated it to the latest 4.0 components; the engine was running on the 4.1 host. Then removed global maintenance from the hosts.

rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhevm-4.0.7.4-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhev-guest-tools-iso-4.0-7.el7ev.noarch
rhevm-setup-plugins-4.0.0.3-1.el7ev.noarch
rhevm-doc-4.0.7-1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-4.el7ev.noarch
rhevm-branding-rhev-4.0.0-7.el7ev.noarch
Linux version 3.10.0-514.6.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Feb 17 19:21:31 EST 2017
Linux 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

14) For simplicity, I removed the first 3.6 host instead of upgrading it to 4.1 components and activating it back.
15) Bumped up the compatibility level 3.6->4.0 on the host cluster and then on the data center.
16) All active guest VMs appeared as "Server with newer configuration for the next run."
17) Previewed and committed a snapshot which was created on a cluster compatibility level older than the current one (e.g. 3.6 vs 4.0) on VM1 (with RAM) and on VM3 (without RAM).
18) Started VM1 and VM2 without any issues.
19) Did not see the reported error on my environment.

Also see the attached screen cast.
https://drive.google.com/open?id=0B85BEaDBcF88MDhmY1hqZExTLWc
1) Upgraded 4.0.7 to the 4.1 engine to check whether this reproduces for VM2 and VM4 on a 4.1 engine.

Components on the engine:
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhevm-doc-4.1.0-2.el7ev.noarch
rhev-guest-tools-iso-4.1-4.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-4.1.1.4-0.1.el7.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
Linux version 3.10.0-514.6.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Feb 17 19:21:31 EST 2017
Linux 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

2) Tried to preview and commit, but VMs 1 and 4 failed to start. See the attached screen cast and the logs from the engine and host.
https://drive.google.com/open?id=0B85BEaDBcF88ZlpOeUpxMTlpTUE
Created attachment 1263025 [details] sosreport-alma04.qa.lab.tlv.redhat.com-20170314193823.tar.xz
Created attachment 1263026 [details] sosreport-nsednev-he-1.qa.lab.tlv.redhat.com-20170314193810.tar.xz
The described scenario can be reproduced regardless of upgrading. A failure during the snapshot commit process (e.g. stopping the engine during the commit) changes the 'Preview' snapshot to status 'OK' while leaving the 'Active VM before the preview' snapshot intact, which causes the duplicate snapshots/disks when previewing again.

The issue in comment 20 seems unrelated and looks like a duplicate of bug 1431246.
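For finding other VMs left in this inconsistent state, something like the following may help. This is only a sketch - it relies solely on the snapshots columns already quoted in this bug (snapshot_id, description, vm_id); the real schema may offer a status/type column that is more precise:

engine=# select vm_id, count(*)
           from snapshots
          where description = 'Active VM before the preview'
          group by vm_id
         having count(*) > 1;

A VM that is legitimately in preview right now will have exactly one such row, so a count above one is the signal for the stale state described above.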
Verified with the following code:
-------------------------------------------
ovirt-engine-4.1.1.5-0.1.el7.noarch
rhevm-4.1.1.5-0.1.el7.noarch
vdsm-4.19.9-1.el7ev.x86_64

Verified with the following scenario based on comment 24:
-----------------------------------------------------------
- Created a VM with an NFS and a block disk
- Created a snapshot
- Previewed the snapshot
- Started to commit the snapshot and restarted the engine during the operation
>>>>> the engine restarted and the commit succeeded.

Moving to Verified!
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Hi,

You should not verify on a VM that is already in an invalid state. Please check that you don't have two active snapshots before trying to verify this fix; a query like the sketch below can help. To recover a VM that is already in this invalid state, manual steps are needed; they will be described in bz 1441558.

Please verify on a VM that is in a valid state.

Thanks,
Fred
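The check mentioned above - a minimal sketch, assuming only the column names shown in comment #0 and substituting the affected VM's vm_guid:

engine=# select snapshot_id, description
           from snapshots
          where vm_id = '<vm_guid>'
            and description in ('Active VM', 'Active VM before the preview');

A VM in a valid state (and not currently in preview) should show exactly one 'Active VM' row and no 'Active VM before the preview' rows.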
Jiri,
Moving this bug to your verification.
I was able to reproduce on another env, which was created with the following flow, while still having VMs created in the original 3.5 env running and having a snapshot with memory from the 3.5 era:

- install 3.5 engine
- install el6 host
- create vms with snapshot with memory
  ^^ here are our old snapshots, originally created
- upgrade to 3.6 engine
- add el7 host with 3.x vdsm
- inplaceupgrade policy on 3.5 cluster
- migrate vms to el7 host
- remove/re-add el6 host so it is el7 with 3.x vdsm
- bump up cluster to 3.6
- upgrade/migrate to 4.0 engine
- upgrade hosts to 4.0 vdsm
- bump up cluster to 4.0
- upgrade to 4.1 engine
- upgrade hosts to 4.1 vdsm
- bump up cluster to 4.1
- select a vm with snapshot
- create new snapshot on that vm
  ^^ this snapshot is created on the 4.1 env
- preview __old__ snapshot without memory
- commit this __old__ snapshot
  ^^ bang, here it fails
- do 'undo' of the currently previewed old snapshot
  ^^ bang, here the latest snapshot is lost!
- preview __old__ snapshot
- commit this __old__ snapshot
  ^^ here it works fine
The basic error was fixed. This exposed another error with handling memory volumes that seems to have been present since 3.5. Pushing out of the async.
There is a simpler way to reproduce:

- Create a VM with a disk (no need to install an OS)
- Start the VM
- Take a snapshot with memory (snap1)
- Take another snapshot with memory (snap2)
- Shut down the VM
- Preview the first snapshot including the memory (snap1)
- Commit the first snapshot (snap1)

The commit will fail after some time with the following error in the engine log:

Transaction was aborted in 'org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand'
2017-04-19 10:04:55,978+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (default task-16) [4bb333c7] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction
There's some progress (I was previewing an old snapshot with memory and committing it; it kept only the snapshot I was previewing/committing while deleting the others, and there was no error in the UI), but I see this in engine.log:

# tail -f /var/log/ovirt-engine/engine.log | grep ERROR
2017-06-16 14:05:25,614+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (default task-14) [4f0a8e31] Failed to remove memory 'fc82af4a-67a6-4a7d-beb3-02a8d8d3155a,3a63d854-bed0-11e0-b671-545200312d04,a320622a-deff-4ab1-b9c0-334b09273e40,02144c69-25d2-4d80-8332-c24565228965,6137d9d4-34ad-48fb-94d5-693f6cd9257f,008966a9-f2f6-42f7-ab5f-7140f8d2290f' of snapshot '82b8fdbc-5d49-4870-8511-428e8da6b2f4'
2017-06-16 14:05:30,422+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler1) [797c6f73] Failed in 'HSMGetAllTasksStatusesVDS' method
2017-06-16 14:05:30,431+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [797c6f73] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM slot-1 command HSMGetAllTasksStatusesVDS failed: Volume does not exist
2017-06-16 14:05:41,466+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler9) [694d7b39] Failed in 'HSMGetAllTasksStatusesVDS' method
2017-06-16 14:05:41,479+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [694d7b39] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM slot-1 command HSMGetAllTasksStatusesVDS failed: Volume does not exist
2017-06-16 14:05:41,483+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler9) [694d7b39] BaseAsyncTask::logEndTaskFailure: Task '868de0a5-c259-4c26-8b8b-d87e8a4ef60c' (Parent Command 'RestoreFromSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
2017-06-16 14:05:49,160+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (DefaultQuartzScheduler4) [b76f8b48-375f-4336-9d65-2f78ba30e429] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand' with failure.
2017-06-16 14:05:49,177+02 ERROR [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (DefaultQuartzScheduler4) [b76f8b48-375f-4336-9d65-2f78ba30e429] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand' with failure.
Did you test on a clean environment?

Here are the steps for verifying:

- Create a VM with a disk (no need to install an OS)
- Start the VM
- Take a snapshot with memory (snap1)
- Take another snapshot with memory (snap2)
- Shut down the VM
- Preview the first snapshot including the memory (snap1)
- Commit the first snapshot (snap1)

After the commit, the snapshot list in the DB can be sanity-checked; see the sketch below.
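The DB sanity check referenced above - a sketch only, reusing the query shape from comment #0 (replace <vm_guid> with the VM's id from the vms table):

engine=# select snapshot_id, description
           from snapshots
          where vm_id = '<vm_guid>';

After a successful commit the VM should be left with a single 'Active VM' row and no duplicated 'Active VM before the preview' entries.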
(In reply to Fred Rolland from comment #40)
> Did you test on a clean environment?
> [...]

No, I did not test it on a clean env. We have an env which has this issue - I am the reporter - and we want this issue fixed. What can I do to get this fix into the real world?
OK, ovirt-engine-4.1.3.2-0.1.el7.noarch looks OK on our problematic env (brq-setup).