Created attachment 1155066 [details]
engine.log

Description of problem:
Given a pool of auto-started VMs, occasionally some VMs get into a state where the engine constantly fails to start them.

Version-Release number of selected component (if applicable):
3.6.5.3-0.1.el6

How reproducible:
Hard - this happens sporadically on a relatively large system and no relation to other actions has been found yet.

Steps to Reproduce:
1. Install RHEV and set up a VM pool with auto-started VMs.
2. Set up some process to occasionally shut down some of the VMs so they are brought back up automatically by the engine.

Actual results:
At some point one of the VMs stops coming back up, and the following message shows up in the UI event log:
VM your-vm-name was started on Host <UNKNOWN> as stateless

Expected results:
The VM should start up.

Additional info:
Engine log attached. In that log the non-starting VM is "nested-lab4-rgi-7.2-20160302.0-tester-1" and the VM ID is "9354d16a-a03b-48a5-91fc-443c331a78ee".
The VM is failing due to "Could not restore stateless VM 9354d16a-a03b-48a5-91fc-443c331a78ee due to a failure to detach Disk 1e480e80-a386-4442-aefb-951736cb100d", but the earlier logs needed to tell what went wrong with the disk are missing. Please attach logs covering the earlier period.
Investigating on the setup, we found that the snapshots are somehow inconsistent. The VM's snapshots table refers to a snapshot which doesn't exist in the storage's images table, yet there is an entry from the corresponding time and for the corresponding image group. The snapshot id inside the OVF also doesn't match the id in the snapshots table. We don't know why; it probably happened on 25.4 already. That's the last time running the VM succeeded, after 3 hours of attempts timing out on snapshot creation.
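For reference, the dangling reference can be seen directly in the DB with something like the following (just a sketch, using only the images and snapshots tables quoted later in this bug; the disk id is the one from this report, substitute yours as needed):

engine=# select image_guid, vm_snapshot_id from images where image_group_id = '1e480e80-a386-4442-aefb-951736cb100d' and vm_snapshot_id not in (select snapshot_id from snapshots);

Any row returned is a volume whose vm_snapshot_id points at a snapshot that no longer exists in the snapshots table.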
More data was provided in direct channels; please reopen the needinfo if further assistance is needed.
Barak, please grab a full sosreport, as this might take some time to investigate and the logs are slowly getting rotated away. The interesting time frame is 25.4, ~20:00 till 23:20, when the VM was repeatedly attempted to start but failed due to snapshot creation issues (this can be observed at the SPM at that time); then at 23:16 the SPM was switched to a new host, after which the snapshot succeeded and the VM was started for the last time. I haven't seen any apparent exceptions at that time, but the end result was that the DB became inconsistent. I would suspect a snapshot async task issue, hence pushing to storage. Please feel free to talk with Arik for more info.
Some more info:
The VM named "nested-lab4-rgi-7.2-20160302.0-tester-1" can't start because it cannot be restored to its stateless snapshot.

The VM has 2 snapshots:

          _create_date          |             snapshot_id              | snapshot_type
--------------------------------+--------------------------------------+---------------
 2016-04-25 23:20:16.985383+03  | b0e60768-3ec9-4e31-b3f4-55d34f5d5527 | STATELESS
 2016-04-25 20:23:09.138351+03  | 1d473f32-f914-4954-9f7a-2c434f4d978c | ACTIVE

Looking at the images of its single disk (1e480e80-a386-4442-aefb-951736cb100d):

              image_guid              |            vm_snapshot_id
--------------------------------------+--------------------------------------
 2463273e-792d-4c71-a59d-ed3e10a3fa91 | 1d473f32-f914-4954-9f7a-2c434f4d978c
 67a0989e-fbcb-4fd4-ace6-3c89596d393c | 7c6a8ca5-7523-4c77-92b8-d361d946b8cb

As you can see, the disk has two volumes: one is associated with the active snapshot and the other one with a non-existing snapshot. At the time the stateless snapshot was taken, the active snapshot of the VM was 7c6a8ca5-7523-4c77-92b8-d361d946b8cb, and now there is no such snapshot. The VM cannot be started because we try to detach every disk that is associated with the active snapshot but not associated with the stateless snapshot (we assume such a disk was attached to the stateless VM while it was running). The issue here is that we cannot detach the disk since it has snapshots.
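Roughly, in DB terms, the condition the engine trips over can be illustrated like this (only an illustration of the logic, not the engine's actual query; the two snapshot ids are the ACTIVE and STATELESS ids from the tables above):

engine=# select distinct image_group_id from images where vm_snapshot_id = '1d473f32-f914-4954-9f7a-2c434f4d978c' and image_group_id not in (select image_group_id from images where vm_snapshot_id = 'b0e60768-3ec9-4e31-b3f4-55d34f5d5527');

Disks returned here are the ones that look attached to the active snapshot but not to the stateless one, so they are selected for detach when restoring the stateless snapshot - which is why this disk gets caught by the detach attempt even though it has snapshots.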
Greg, you're this week's QE contact, please have a look
Tal - can you make sure there's progress being made on this bug?
I've found the issue causing this and pushed a patch that addresses it (http://gerrit.ovirt.org/57823). Thanks, Liron.
Liron, is there a specific flow for verification?
Try the following:
1. Add a VM with a disk image.
2. Delete the disk from the underlying storage.
3. Attempt to run the VM in stateless mode.
4. Check the DB consistency - verify in the images table that the vm_snapshot_id of the disk matches the active snapshot id (see the example query below).

If you need help with the verification, let me know.
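For step 4, a check along these lines should be enough (a sketch only; '<disk-id>' and '<vm-id>' are placeholders for your disk's image_group_id and the VM id):

engine=# select i.image_guid, i.vm_snapshot_id, s.snapshot_id as active_snapshot_id, (i.vm_snapshot_id = s.snapshot_id) as consistent from images i, snapshots s where i.image_group_id = '<disk-id>' and s.vm_id = '<vm-id>' and s.snapshot_type = 'ACTIVE';

The verification passes if "consistent" is t for the disk's active volume.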
Tested according to the steps in comment #13.
==================================================

The VM snapshot which the image belongs to in the images table in the DB exists in the snapshots table:

engine=# select * from images where image_group_id = 'd68d9b40-3c8b-457a-a6b4-798be30a3bf8';
-[ RECORD 1 ]---------+-------------------------------------
image_guid            | ede29f0a-f20a-415e-ba89-26dbae4c5350
creation_date         | 2016-06-09 16:54:57+03
size                  | 1073741824
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 00000000-0000-0000-0000-000000000000
imagestatus           | 1
lastmodified          | 1970-01-01 02:00:00+02
vm_snapshot_id        | afee4ab7-684c-4abe-b020-d21a1721bb92
volume_type           | 1
volume_format         | 5
image_group_id        | d68d9b40-3c8b-457a-a6b4-798be30a3bf8
_create_date          | 2016-06-09 13:54:53.265983+03
_update_date          | 2016-06-09 13:57:15.645048+03
active                | t
volume_classification | 1

engine=# select * from snapshots where snapshot_id = 'afee4ab7-684c-4abe-b020-d21a1721bb92';
-[ RECORD 1 ]----+-------------------------------------
snapshot_id      | afee4ab7-684c-4abe-b020-d21a1721bb92
vm_id            | 1ffd2f45-1628-4e35-9c75-922523dd3ec0
snapshot_type    | ACTIVE
status           | OK
description      | Active VM
creation_date    | 2016-06-09 13:54:52.261+03
app_list         |
vm_configuration |
_create_date     | 2016-06-09 13:54:52.248655+03
_update_date     |
memory_volume    |

=======================================================

Verified using:
rhevm-3.6.7.2-0.1.el6.noarch
vdsm-4.17.30-1.el7ev.noarch
This bug is still on 4.0 + 3.6, I don't know if VERIFIED means 3.6 or 4.0. Please clone this bug and set flags/status accordingly.
(In reply to Eyal Edri from comment #15)
> This bug is still on 4.0 + 3.6, I don't know if VERIFIED means 3.6 or 4.0.
> Please clone this bug and set flags/status accordingly.

It was clearly verified on 3.6:

(In reply to Elad from comment #14)
> Verified using:
> rhevm-3.6.7.2-0.1.el6.noarch
> vdsm-4.17.30-1.el7ev.noarch

Regardless, you are right - this should be cloned.
Please update the target milestone to 4.0 since the bug was cloned. Not sure about the status; if it wasn't verified yet, it should be ON_QA.
Just saw it was the target release that was still on 3.6.7; moved it to 4.0.0.
As per comments 14 and 16, this bug was verified on a 3.6.7 build. I set the clone (bug 1348079) to VERIFIED, and am returning this bug to ON_QA to signify it was never verified on a 4.0 build. Leaving it to QA's discretion whether to re-test on a 4.0 build or simply close based on the 3.6.7 verification.
Verified using:
rhevm-4.0.0.5-0.1.el7ev.noarch
vdsm-4.18.3-0.el7ev.x86_64

engine=# select * from images where image_group_id = '0458ee04-05cd-4da9-888d-1cdde4deb176';
-[ RECORD 1 ]---------+-------------------------------------
image_guid            | 75c67409-2195-40c2-9160-184369266ec3
creation_date         | 2016-06-22 20:51:05+03
size                  | 5368709120
it_guid               | 00000000-0000-0000-0000-000000000000
parentid              | 00000000-0000-0000-0000-000000000000
imagestatus           | 1
lastmodified          | 1970-01-01 02:00:00+02
vm_snapshot_id        | bf962e10-393f-45b4-a2d6-6007da8aac4f
volume_type           | 1
volume_format         | 5
image_group_id        | 0458ee04-05cd-4da9-888d-1cdde4deb176
_create_date          | 2016-06-22 20:51:05.129645+03
_update_date          | 2016-06-22 20:55:43.272086+03
active                | t
volume_classification | 1

engine=# select * from snapshots where snapshot_id = 'bf962e10-393f-45b4-a2d6-6007da8aac4f';
-[ RECORD 1 ]-----------+-------------------------------------
snapshot_id             | bf962e10-393f-45b4-a2d6-6007da8aac4f
vm_id                   | af367995-993b-4901-9130-66dfbfe28c51
snapshot_type           | ACTIVE
status                  | OK
description             | Active VM
creation_date           | 2016-06-22 20:51:02.348+03
app_list                |
vm_configuration        |
_create_date            | 2016-06-22 20:51:02.312752+03
_update_date            |
memory_volume           |
memory_metadata_disk_id |
memory_dump_disk_id     |
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1743.html