Bug 1390072

Summary: Stopping a stateless VM does not erase state snapshot
Product: [oVirt] ovirt-engine
Reporter: Barak Korren <bkorren>
Component: BLL.Storage
Assignee: Allon Mureinik <amureini>
Status: CLOSED CURRENTRELEASE
QA Contact: Avihai <aefrat>
Severity: medium
Docs Contact:
Priority: low
Version: 4.0.4.4
CC: ahadas, bkorren, bugs, gklein, mgoldboi, pzhukov, ratamir, tjelinek
Target Milestone: ovirt-4.1.0-alpha
Flags: rule-engine: ovirt-4.1+
rule-engine: planning_ack+
amureini: devel_ack+
ratamir: testing_ack+
Target Release: 4.1.0.2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-01 14:33:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
Description      Flags
engine.log.xz    none
new engine.log   none

Description Barak Korren 2016-10-31 07:47:07 UTC
Description of problem:
When a stateless VM is shut down, its disk state snapshot sticks around until the VM is started back up, occupying otherwise unused space in the storage domain.

This can make it impossible to resolve out-of-space conditions on a storage domain when stateless VMs are involved.

Version-Release number of selected component (if applicable):
4.0.4.4-0.1

How reproducible:
Easily

Steps to Reproduce:
1. Deploy oVirt with a small (~20g) storage domain
2. Create a VM template with thin allocated disk whose actual size is ~1g
   but whose virtual size is larger than the space in the storage domain
   (~40g).
   (One way to do this is to import the CentOS cloud image from oVirt Glance
    and resize the template disk)
3. Create an auto-starting VM pool with ~10 VMs based on the template.
   Let all VMs in the pool start up.
   The storage should now have ~5-10g used.
4. Run a process in one of the pool VMs to write to the local disk and fill
   it up (a possible filler program is sketched after these steps). Let it
   run until the VM gets paused because of I/O write errors.
5. Force-shut down the VM.
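
For step 4, a minimal filler program along these lines can be run inside the guest. This is only an illustrative sketch (the file path is arbitrary, and any other writer such as dd works just as well); it is not part of any reproducer tooling:

import java.io.FileOutputStream;
import java.io.IOException;

public class FillDisk {
    public static void main(String[] args) {
        byte[] chunk = new byte[1 << 20]; // 1 MiB of zeros per write
        try (FileOutputStream out = new FileOutputStream("/var/tmp/fill.bin")) {
            while (true) {
                out.write(chunk); // keep writing until the thin disk runs out of backing space
                out.flush();
            }
        } catch (IOException e) {
            // Expected once the disk is full; on the host side the VM gets paused on I/O errors.
            System.err.println("Write failed, disk is presumably full: " + e.getMessage());
        }
    }
}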

Actual results:
The storage domain will remain full.
It will be impossible to bring the VM that filled the storage back up.
Any VM writing to its local disk will quickly become paused.
It will probably be impossible to start any stateless VM that gets shut down.
It will be impossible to create any new VMs.

Expected results:
Once a stateless VM is shut down, its state becomes inaccessible, so it is
reasonable to expect that state not to take up disk space.

Comment 1 Michal Skrivanek 2016-11-02 12:40:30 UTC
Arik, can you please describe the current behavior: when exactly is the snapshot removed? Only on the next VM start?

Comment 2 Arik 2016-11-02 13:38:05 UTC
(In reply to Michal Skrivanek from comment #1)
> Arik, can you please describe the current behavior: when exactly is the
> snapshot removed? Only on the next VM start?

We try to remove the snapshot when we handle a stateless VM that went down. As a fallback, we also try to remove it when we see that a VM being started still has a stateless snapshot - that should almost never happen. The fallback exists because the storage pool might be down when the VM goes down, in which case we cannot remove the snapshot on the first attempt.
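
To make the two paths concrete, here is a minimal illustrative sketch of the logic described above; the class and method names are invented for this comment and are not the actual ovirt-engine code:

class Vm {
    final boolean stateless;
    boolean hasStatelessSnapshot;

    Vm(boolean stateless, boolean hasStatelessSnapshot) {
        this.stateless = stateless;
        this.hasStatelessSnapshot = hasStatelessSnapshot;
    }
}

class StatelessSnapshotCleanup {
    // Primary path: invoked when monitoring reports that a stateless VM went down.
    void onVmDown(Vm vm, boolean storagePoolUp) {
        if (vm.stateless && vm.hasStatelessSnapshot && storagePoolUp) {
            removeStatelessSnapshot(vm);
        }
        // If the storage pool is down, the snapshot is intentionally left for the fallback path.
    }

    // Fallback path: invoked when the VM is started again; should almost never find a leftover snapshot.
    void onVmStart(Vm vm) {
        if (vm.stateless && vm.hasStatelessSnapshot) {
            removeStatelessSnapshot(vm);
        }
    }

    private void removeStatelessSnapshot(Vm vm) {
        vm.hasStatelessSnapshot = false; // stands in for the real RestoreAllSnapshots flow
    }
}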

Barak, can you please provide the engine log?

Comment 3 Barak Korren 2016-11-07 13:32:57 UTC
Created attachment 1218012 [details]
engine.log.xz

Added engine.log from a reproducing system (Running on Lago BTW).

One can see the 'fill-pool-4' VM filling up the storage until it gets paused; when it is then shut down, the snapshot stays there and no further stateless VMs can be started because the storage is full.

Comment 4 Tomas Jelinek 2016-11-09 09:48:10 UTC
What happens is this:
- the storage is filled up completely, so there is no space left on it
- when the stateless snapshot should be deleted, a validation checks whether the available space on the domain is below a critical threshold (configured when creating the storage domain as "Critical Space Action Blocker (GB)"; by default it is 5GB). A rough illustration of this check follows this list.
- if the available space is smaller than this threshold, the storage action is aborted (this is what happens in your case:
Validation of action 'RestoreAllSnapshots' failed for user SYSTEM. Reasons: VAR__ACTION__REVERT_TO,VAR__TYPE__SNAPSHOT,ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN,$storageName iscsi_small
)
- you should be able to fix this by going to the storage tab in webadmin, clicking "manage domain", then under "advanced" setting "critical space action blocker" to 0 and hoping for the best. But keep in mind that setting it to 0 is normally not a good idea.
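
As an illustration only (these are not the real ovirt-engine classes; the names are invented for this comment), the check behaves roughly like this:

class StorageDomain {
    long availableGb;                       // free space reported for the domain
    long criticalSpaceActionBlockerGb = 5;  // "Critical Space Action Blocker (GB)", 5GB by default
}

class LowSpaceValidator {
    // Returns false when free space is below the critical threshold, which is when
    // actions such as RestoreAllSnapshots are aborted with
    // ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN.
    boolean isWithinThreshold(StorageDomain domain) {
        return domain.availableGb >= domain.criticalSpaceActionBlockerGb;
    }
}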

Could you please confirm that these steps help?

Comment 5 Arik 2016-11-09 09:55:06 UTC
(In reply to Tomas Jelinek from comment #4)

I wonder whether that validation of the remaining space in the storage domain (allDomainsWithinThresholds) is really needed when restoring the active snapshot of a stateless VM. At the end of the process we will surely end up with more free space than we had before, but I don't know whether the snapshot removal itself relies on having more free space than the defined threshold (this validation was added as part of I4cfc89, which seems to be mostly a refactoring, so I'm not sure it is intentional).
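
Purely to illustrate the question being raised (this is not the actual command code and not a proposed patch; the names are invented), the distinction under discussion is roughly:

class RestoreAllSnapshotsValidatorSketch {
    // Hypothetical: restoring the active snapshot of a stateless VM only removes data,
    // so the free-space check could arguably be skipped in that case.
    boolean validate(boolean restoringStatelessVm, boolean allDomainsWithinThresholds) {
        if (restoringStatelessVm) {
            return true; // removal frees space; low free space is not a reason to block it
        }
        return allDomainsWithinThresholds; // other restore flows keep the existing check
    }
}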

Therefore, moving it to storage.

Comment 6 Barak Korren 2016-11-09 10:32:40 UTC
Created attachment 1218887 [details]
new engine.log

(In reply to Tomas Jelinek from comment #4)
> - you should be able to fix this by going to storage tab in webadmin, click
> "manage domain", than in "advanced" set 0 for "critical space action
> blocker" and hope in the best. But keep in mind that setting it to 0 is
> normally not a good idea.

Hi, I had already reduced it to 2% because the storage was just too small for the engine to react at 5%. Reducing it further to 0% now does not seem to help any more; the storage is just too full at this point...

Nevertheless, removing a snapshot should not require any more space, so I should be able to do it.

I'm attaching an updated log that should include the change to 0% and what happens after it...

Comment 7 Sandro Bonazzola 2016-12-12 13:53:17 UTC
The fix for this issue should be included in oVirt 4.1.0 beta 1, released on December 1st. If it is not included, please move the bug back to modified.

Comment 8 Avihai 2016-12-13 15:02:24 UTC
Verified at build 4.1.0-0.2.master.20161210231201.git26a385e.el7.centos.

Scenario:

1) Small storage domain (3GB available storage) with Critical Space Action Blocker (GB) = 0.
2) Created a stateless VM from a template (1GB size) with a thin provisioned disk.
3) Wrote 5GB to the disk until the VM got paused.
4) Shut down the VM.

I saw that the stateless snapshot was restored successfully.

From engine log:
2016-12-13 16:53:14,447+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler2) [d4bcc7b] Command 'RestoreAllSnapshots' id: 'ac83bc83-6228-4fc8-83d6-b0288c2bfd7a' child commands '[bc15788f-f237-4feb-8c4d-d6bd892df2b4]' executions were completed, status 'SUCCEEDED'
2016-12-13 16:53:15,467+02 INFO  [org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand] (DefaultQuartzScheduler4) [d4bcc7b] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreAllSnapshotsCommand' successfully.
2016-12-13 16:53:15,491+02 INFO  [org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand] (DefaultQuartzScheduler4) [d4bcc7b] Ending command 'org.ovirt.engine.core.bll.snapshots.RestoreFromSnapshotCommand' successfully.