Description of problem:
When shutting down a stateless VM, the disk state snapshot sticks around until the VM is started back up, occupying otherwise unused storage domain disk space. This can make it impossible to resolve full-storage issues when stateless VMs are involved.

Version-Release number of selected component (if applicable):
4.0.4.4-0.1

How reproducible:
Easily

Steps to Reproduce:
1. Deploy oVirt with a small (~20 GB) storage domain.
2. Create a VM template with a thin-allocated disk whose actual size is ~1 GB but whose virtual size is larger than the space in the storage domain (~40 GB). (One way to do this is to import the CentOS cloud image from oVirt Glance and resize the template disk.)
3. Create an auto-starting VM pool with ~10 VMs based on the template. Let all VMs in the pool start up. The storage should now have ~5-10 GB used.
4. Run a process in one of the pool VMs that writes to the local disk and fills it up (see the sketch at the end of this description). Let it run until the VM gets paused because of I/O write errors.
5. Force-shut down the VM.

Actual results:
The storage will remain full. It will be impossible to bring the full VM back up. Any VM writing to its local disk will quickly become paused. It will probably be impossible to start up any stateless VM that gets shut down. It will be impossible to create any new VMs.

Expected results:
Once a stateless VM is shut down, its state becomes inaccessible, so not having it take up disk space is a reasonable expectation.
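For step 4, the disk filler can be as simple as the following (a minimal sketch; the file path and block size are arbitrary choices for illustration):

    import java.io.FileOutputStream;
    import java.io.IOException;

    // Keeps appending zeroed blocks to a file on the VM's local disk until a
    // write fails; on a thin-allocated disk this inflates the image until the
    // underlying storage domain runs out of space and the VM is paused.
    public class FillDisk {
        public static void main(String[] args) {
            byte[] block = new byte[1024 * 1024]; // 1 MiB of zeros per write
            try (FileOutputStream out = new FileOutputStream("/var/tmp/filler")) {
                while (true) {
                    out.write(block);
                    out.getFD().sync(); // make sure the data actually hits the disk
                }
            } catch (IOException e) {
                System.err.println("Write failed, disk is presumably full: " + e);
            }
        }
    }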
Arik, can you please describe the current behavior: when exactly is it removed? Only on next VM start?
(In reply to Michal Skrivanek from comment #1)
> Arik, can you please describe the current behavior: when exactly is it
> removed? Only on next VM start?

We try to remove the snapshot when we handle a stateless VM that went down. As a fallback, we also try to remove it when we see a VM has a stateless snapshot while it is being started - that should almost never happen. We do this since the storage pool might be down when the VM went down, in which case we cannot remove the snapshot on the first attempt.

Barak, can you please provide the engine log?
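For clarity, the two removal points amount to roughly this (a self-contained sketch with hypothetical names, not the actual ovirt-engine code):

    // Sketch of when the engine tries to remove the stateless snapshot.
    class StatelessSnapshotCleanup {
        static class Vm {
            boolean stateless;
            boolean hasStatelessSnapshot;
            boolean storagePoolUp;
        }

        // Primary path: the VM went down, so restore (i.e. remove) the
        // stateless snapshot right away, unless the storage pool is down.
        void onVmDown(Vm vm) {
            if (vm.stateless && vm.storagePoolUp) {
                restoreStatelessSnapshot(vm);
            }
            // otherwise removal is deferred to the next start attempt
        }

        // Fallback path: finding a stateless snapshot at start time means the
        // removal on shutdown failed; this should almost never happen.
        void onVmStart(Vm vm) {
            if (vm.stateless && vm.hasStatelessSnapshot) {
                restoreStatelessSnapshot(vm);
            }
        }

        void restoreStatelessSnapshot(Vm vm) {
            // in the real engine this runs the RestoreAllSnapshots flow
            System.out.println("removing stateless snapshot");
        }
    }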
Created attachment 1218012 [details]
engine.log.xz

Added engine.log from a reproducing system (running on Lago, BTW). One can see the 'fill-pool-4' VM filling up the storage until it gets paused; then, when it is shut down, the snapshot stays there and no further stateless VMs can be started because the storage is full.
What happens is this:
- the storage is filled up completely, so there is no space left on it
- when the stateless snapshot should be deleted, there is a validation which checks whether the available disk space is smaller than a critical size (configured when creating the storage domain as "Critical Space Action Blocker (GB)"; by default it is 5 GB)
- if the available space is smaller than this 5 GB, then the storage action is aborted (this is what happens to you:
  Validation of action 'RestoreAllSnapshots' failed for user SYSTEM. Reasons: VAR__ACTION__REVERT_TO,VAR__TYPE__SNAPSHOT,ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN,$storageName iscsi_small)
- you should be able to fix this by going to the storage tab in webadmin, clicking "manage domain", then in "advanced" setting 0 for "critical space action blocker", and hoping for the best. But keep in mind that setting it to 0 is normally not a good idea.

Could you please confirm that these steps help?
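For reference, the validation in question amounts to roughly this (a minimal sketch with hypothetical names, not the actual RestoreAllSnapshots code):

    // The action is blocked when the domain's available space falls below the
    // configured "Critical Space Action Blocker (GB)" value (5 GB by default),
    // which produces ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN.
    class CriticalSpaceValidator {
        static final long GIB = 1024L * 1024 * 1024;

        boolean canRunStorageAction(long availableBytes, long criticalBlockerGb) {
            return availableBytes >= criticalBlockerGb * GIB;
        }
    }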
(In reply to Tomas Jelinek from comment #4)

I wonder if that validation of the remaining space in the storage domain (allDomainsWithinThresholds) is really needed when restoring the active snapshot of a stateless VM. At the end of the process we will surely have more free space than we had before, but I don't know whether during the snapshot removal we rely on having more free space than the defined threshold (this validation was added as part of I4cfc89, which seems to be mostly a refactoring, so I am not sure it is intentional). Therefore, moving this to storage.
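The suggested change would amount to something like this (a hypothetical sketch, not an actual patch): skip the threshold check when the command is restoring the active snapshot of a stateless VM, since that operation can only free space.

    class RestoreSnapshotsSpaceValidation {
        boolean validateSpace(boolean isStatelessSnapshotRestore,
                              boolean allDomainsWithinThresholds) {
            if (isStatelessSnapshotRestore) {
                return true; // removing the stateless snapshot only releases space
            }
            return allDomainsWithinThresholds; // keep the check for other restores
        }
    }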
Created attachment 1218887 [details]
new engine.log

(In reply to Tomas Jelinek from comment #4)
> - you should be able to fix this by going to the storage tab in webadmin,
> clicking "manage domain", then in "advanced" setting 0 for "critical space
> action blocker", and hoping for the best. But keep in mind that setting it
> to 0 is normally not a good idea.

Hi, I had already reduced it to 2% because the storage was just too small for the engine to react at 5%. Reducing it further to 0% now does not seem to help anymore; the storage is just too full at this point... Nevertheless, removing a snapshot should not require any more space, so I should be able to do it. I'm adding an updated log that includes the change to 0% and what happens after it.
The fix for this issue should be included in oVirt 4.1.0 beta 1, released on December 1st. If it is not included, please move this back to modified.
Verified at build 4.1.0-0.2.master.20161210231201.git26a385e.el7.centos

Scenario:
1) small storage domain (3 GB available storage) + Critical Space Action Blocker (GB) = 0
2) created a stateless VM from a template (1 GB size) with a thin-provisioned disk
3) wrote 5 GB to the disk until the VM got paused
4) shut down the VM

I saw that the stateless snapshot was restored successfully.

From engine log:
2016-12-13 16:53:14,447+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler2) [d4bcc7b] Command 'RestoreAllSnapshots' id: 'ac83bc83-6228-4fc8-83d6-b0288c2bfd7a' child commands '[bc15788f-f237-4feb-8c4d-d6bd892df2b4]' executions were completed, status 'SUCCEEDED'
2016-12-13 16:53:15,467+02 INFO  [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (DefaultQuartzScheduler4) [d4bcc7b] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand' successfully.
2016-12-13 16:53:15,491+02 INFO  [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (DefaultQuartzScheduler4) [d4bcc7b] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand' successfully.