Bug 1408594

Summary: VM would pause if it writes faster than the time it takes to lvextend its LV (happens with qcow2 over fast block storage)
Product: [oVirt] vdsm
Reporter: guy chen <guchen>
Component: General
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED CURRENTRELEASE
QA Contact: mlehrer
Severity: high
Docs Contact:
Priority: medium
Version: ---
CC: aefrat, bugs, dagur, ebenahar, frolland, guchen, lsurette, lsvaty, mlehrer, srevivo, tnisan, ycui
Target Milestone: ovirt-4.3.3
Keywords: Performance
Target Release: ---
Flags: rule-engine: ovirt-4.3+
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text: Please add the relevant info on fine tuning to the documentation
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-29 13:57:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description guy chen 2016-12-25 12:52:23 UTC
Description of problem:

On fast storage (XtremIO), the LV cannot be extended as fast as the guest writes, and the VM pauses with the error "VM has been paused due to no Storage space error."

How reproducible:
always

Steps to Reproduce:
    * Start a single VM.
    * Run fio to write 10 files of 1 GB each, then delete them.

Actual results:
    * Before starting fio, the actual disk size of the disk was 5 GB.
    * At some point while fio is running, the error message appears:
      "VM has been paused due to no Storage space error."
    * After several seconds the VM is back online.
    * The actual disk size of the disk increases by 10 GB.
    * The fio writes and deletes succeed.
    * I repeated the fio test 4 times; each run gave the same results with the error, and the actual disk size reached 45 GB.
    * From the fifth run on, the error no longer appears and the actual disk size stabilizes at 47 GB.

Expected results:

No error should appear.

Additional info:

Topology:
Storage: XtremIO
1 host 
1 engine 
1 VM 


fio configuration (also tried with dd, with the same results):

[global] 
rw=write 
size=1g 
directory=/home/fio_results 
thread=10 
unified_rw_reporting=1 
group_reporting 
iodepth=10 

Error in the vdsm log:

libvirtEventLoop::INFO::2016-12-21 17:31:29,347::vm::4041::virt.vm:onIOError) vmId=`99866c78-ede5-4707-86e4-12c1d627fa0c`::abnormal vm stop device virtio-disk0 error enospc 
libvirtEventLoop::INFO::2016-12-21 17:31:29,347::vm::4877::virt.vm:_logGuestCpuStatus) vmId=`99866c78-ede5-4707-86e4-12c1d627fa0c`::CPU stopped: onIOError

Comment 1 Yaniv Kaul 2016-12-25 12:55:57 UTC
The workaround is to change the relevant watermark parameters - WHEN we look to extend and by how much we extend.

An interesting idea is to look if the storage is thinly provisioned - if it is, assume we can extend more aggressively.

Comment 2 Yaniv Kaul 2016-12-26 08:35:13 UTC
Yaniv - we probably should document it if it's not already documented in some KB.

Comment 6 Yaniv Kaul 2016-12-29 11:47:04 UTC
Note: also, having an initially bigger disk would probably have prevented this altogether (since the watermark would have been reached later, which would have given VDSM even more time to extend).

Comment 8 Yaniv Kaul 2017-01-01 13:21:36 UTC
From a related bug (which is tightly coupled with / a duplicate of this one):

[irs]
volume_utilization_percent = 50
volume_utilization_chunk_mb = 1024


In the case above, the user can:
1. Extend the initial image size (makes sense anyway - say, 10G, not 1G).
2. Change the above settings - lower volume_utilization_percent and raise volume_utilization_chunk_mb, respectively (see the sketch after this list).
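
A rough back-of-the-envelope sketch of what these two options buy. The trigger rule used here - an extend is requested once free space in the LV drops below (100 - volume_utilization_percent)% of volume_utilization_chunk_mb, and each extend adds volume_utilization_chunk_mb - is my reading of vdsm's thin-provisioning watermark logic, so treat it (and the illustrative helper names) as an assumption rather than something stated in this bug:

def headroom_mb(percent, chunk_mb):
    # Free space still left in the LV at the moment the extend request fires
    # (assumed trigger: free space < (100 - percent)% of chunk_mb).
    return chunk_mb * (100 - percent) / 100.0

def seconds_until_pause(percent, chunk_mb, write_mb_s):
    # Time the guest can keep writing after the request before ENOSPC pauses it,
    # i.e. the budget vdsm has to finish the lvextend.
    return headroom_mb(percent, chunk_mb) / write_mb_s

def written_before_first_extend_mb(initial_mb, percent, chunk_mb):
    # How much can be written into a fresh image before the first extend request;
    # a bigger initial image (option 1) only pushes this point further out.
    return max(initial_mb - headroom_mb(percent, chunk_mb), 0.0)

# Defaults vs. tuned values (the ones later tried in comment 10), at a 100 MB/s write rate:
for label, percent, chunk_mb in [("defaults", 50, 1024), ("tuned", 15, 4048)]:
    print(label,
          round(headroom_mb(percent, chunk_mb)), "MB headroom,",
          round(seconds_until_pause(percent, chunk_mb, 100), 1), "s to complete the extend")
# defaults -> 512 MB headroom, ~5.1 s; tuned -> ~3441 MB headroom, ~34.4 s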

Comment 9 Yaniv Lavi 2017-01-01 13:23:48 UTC
Can you validate that the tweaking in comment 8 resolves your issue?

Comment 10 guy chen 2017-01-16 13:50:16 UTC
I changed the thresholds in vdsm.conf to:

[irs]
volume_utilization_percent = 15
volume_utilization_chunk_mb = 4048

The initial disk size was extended from 1 GB to 5 GB.
After that I ran the load; there were 3 extends, so the disk size grew to 17 GB.
The VM was not paused.
So the configuration change fixes the issue. BTW, I also found relevant historical correspondence about the issue:

http://lists.ovirt.org/pipermail/users/2016-May/039803.html
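
For reference, the reported growth is consistent with three chunk-sized extends, assuming (as in the sketch after comment 8) that each extend adds exactly volume_utilization_chunk_mb:

# Consistency check of the numbers above (assumes each extend adds exactly chunk_mb).
initial_gb = 5
extends = 3
chunk_mb = 4048
print(round(initial_gb + extends * chunk_mb / 1024, 1))   # ~16.9 GB, i.e. the ~17 GB observed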

Comment 11 Yaniv Kaul 2017-01-16 14:01:32 UTC
Perhaps the initial values are sub-optimal.
If we initially have 1GB, and the watermark is at 50%, it means we have 500MB at most to be filled before we choke. Considering we may not even get the notification ASAP, we have ~5 seconds at a 100MB/sec write rate (which is not very fast!) till we run out of space and pause.
I suggest we change at least the extend size to 2GB or so, to give us room to breathe - I don't expect the initial extend to be very quick, since I don't expect a real-life scenario to grow so much right away (but perhaps I'm wrong - if it's a data dump? If it's an OS installation, the decompression of installation files ensures IO isn't that fast).
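
Plugging the suggested 2GB extend chunk into the same assumed watermark arithmetic as in the earlier sketch (extend requested when free space drops below (100 - volume_utilization_percent)% of the chunk) roughly doubles the breathing room:

# Same assumed trigger rule as the earlier sketch, with the default 50% watermark.
for chunk_mb in (1024, 2048):
    headroom = chunk_mb * (100 - 50) / 100.0
    print(chunk_mb, "MB chunk ->", headroom, "MB headroom,",
          round(headroom / 100, 1), "s at 100 MB/s before the VM pauses")
# 1024 MB chunk -> 512 MB headroom, ~5.1 s; 2048 MB chunk -> 1024 MB headroom, ~10.2 s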

Comment 12 Yaniv Kaul 2017-06-28 14:09:06 UTC
*** Bug 1461536 has been marked as a duplicate of this bug. ***

Comment 13 Sandro Bonazzola 2019-01-28 09:39:50 UTC
This bug has not been marked as a blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 14 Daniel Gur 2019-04-14 11:12:00 UTC
What was fixed or changed here? Did you change the defaults in vdsm.conf?
Please provide clear validation instructions.

(Also, I see in comment 10 from 2017 that the conf change was already validated back then.)