Bug 1408594 - VM would pause if it writes faster than the time it takes to lvextend its LV (happens with qcow2 over fast block storage)
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.3.3
Assignee: Benny Zlotnik
QA Contact: mlehrer
 
Reported: 2016-12-25 12:52 UTC by guy chen
Modified: 2019-04-29 13:57 UTC

Last Closed: 2019-04-29 13:57:41 UTC
oVirt Team: Storage
Flags: rule-engine: ovirt-4.3+



Description guy chen 2016-12-25 12:52:23 UTC
Description of problem:

On fast storage - XtremIO - extending the actual disk size produces the error "VM has been paused due to no Storage space error."

How reproducible:
always

Steps to Reproduce:
    * Start a single VM.
    * Run fio to write 10 files of 1 GB each, then delete them.

Actual results:
    * Before I started fio, the actual disk size of the disk was 5 GB.
    * When I start fio, at some point I get the error message:
      "VM has been paused due to no Storage space error."
    * After several seconds the VM is back online.
    * The actual disk size of the disk increases by 10 GB.
    * The fio writes and deletes succeed.
    * I repeated the fio test 4 times; each time I got the same results with the error, and the actual disk size reached 45 GB.
    * From the fifth run on, the error disappears and the actual disk size stabilizes at 47 GB.

Expected results:

No error should appear.

Additional info:

topology : 
storage - XtremIO
1 host 
1 engine 
1 VM 


fio configuration (also tried with dd, with the same results):

[global] 
rw=write 
size=1g 
directory=/home/fio_results 
thread=10 
unified_rw_reporting=1 
group_reporting 
iodepth=10 

Error in the vdsm log:

libvirtEventLoop::INFO::2016-12-21 17:31:29,347::vm::4041::virt.vm:onIOError) vmId=`99866c78-ede5-4707-86e4-12c1d627fa0c`::abnormal vm stop device virtio-disk0 error enospc 
libvirtEventLoop::INFO::2016-12-21 17:31:29,347::vm::4877::virt.vm:_logGuestCpuStatus) vmId=`99866c78-ede5-4707-86e4-12c1d627fa0c`::CPU stopped: onIOError

Comment 1 Yaniv Kaul 2016-12-25 12:55:57 UTC
The workaround is to change the relevant parameters of the watermark - WHEN we look to extend and by how much we extend.

An interesting idea is to check whether the storage is thinly provisioned - if it is, assume we can extend more aggressively.

Comment 2 Yaniv Kaul 2016-12-26 08:35:13 UTC
Yaniv - we probably should document it if it's not already documented in some KB.

Comment 6 Yaniv Kaul 2016-12-29 11:47:04 UTC
Note: also, starting with a bigger initial disk would probably have prevented this altogether (since the watermark would have been reached later, giving VDSM even more time to extend).

Comment 8 Yaniv Kaul 2017-01-01 13:21:36 UTC
From : (which is tightly coupled / duplicate of this one):

[irs]
volume_utilization_percent = 50
volume_utilization_chunk_mb = 1024


In the case above, the user can:
1. Extend the initial image size (makes sense anyway - say, 10G, not 1G)
2. Lower volume_utilization_percent and raise volume_utilization_chunk_mb, respectively (see the sketch below).
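
For illustration, here is a minimal Python sketch of the watermark model behind these two settings. It assumes - this is my reading of the documented behavior, not a copy of vdsm's code, and the function names are made up - that vdsm requests an extend of a thin LV once its free space drops below (100 - volume_utilization_percent)% of volume_utilization_chunk_mb, and then grows the LV by one chunk:

MiB = 1024 * 1024

def watermark_limit_bytes(utilization_percent, chunk_mb):
    # Free space (bytes) below which an extend request is issued.
    return (100 - utilization_percent) * chunk_mb * MiB // 100

def should_extend(lv_size_bytes, allocated_bytes, utilization_percent, chunk_mb):
    # True once the guest's writes eat into the watermark.
    free = lv_size_bytes - allocated_bytes
    return free < watermark_limit_bytes(utilization_percent, chunk_mb)

def next_lv_size(lv_size_bytes, chunk_mb):
    # Grow by one chunk per extend request.
    return lv_size_bytes + chunk_mb * MiB

Under this model, the defaults above fire an extend when less than 512 MiB is free, and each extend adds 1 GiB.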

Comment 9 Yaniv Lavi 2017-01-01 13:23:48 UTC
Can you validate that the tweaking in comment 8 resolves your issue?

Comment 10 guy chen 2017-01-16 13:50:16 UTC
I changed the thresholds in vdsm.conf to

[irs]
volume_utilization_percent = 15
volume_utilization_chunk_mb = 4048

The initial disk size was extended from 1 GB to 5 GB.
After that I ran the load; there were 3 extends, so the disk size grew to 17 GB.
The VM was not paused.
So the configuration fixes the issue. BTW, I also found relevant historical correspondence about the issue:

http://lists.ovirt.org/pipermail/users/2016-May/039803.html
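
Plugging both the default and these tuned values into the simplified model sketched under comment 8 (same caveat: an illustration, not vdsm's actual code) shows why the pause goes away - the extend now fires with far more free space left:

# Free-space threshold = (100 - percent) * chunk_mb / 100, in MiB.
default_mib = (100 - 50) * 1024 // 100    # 512 MiB free when extend fires
tuned_mib   = (100 - 15) * 4048 // 100    # 3440 MiB free when extend fires
print(default_mib, tuned_mib)             # 512 3440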

Comment 11 Yaniv Kaul 2017-01-16 14:01:32 UTC
Perhaps the initial values are sub-optimal.
If we initially have 1GB and the watermark is at 50%, it means we have at most 500MB to fill before we choke. Considering we may not even get the notification immediately, we have ~5 seconds at a 100MB/sec write rate (which is not very fast!) until we run out of space and pause.
I suggest we change at least the extend size to 2GB or so, to give us room to breathe. I don't expect the initial extend to be very quick, since I don't expect a real-life scenario to grow so much right away (but perhaps I'm wrong - what if it's a data dump? If it's an OS installation, the decompression of installation files ensures IO isn't that fast).
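
As a rough worked version of the arithmetic in this comment (a hypothetical helper; it ignores notification and lvextend latency, which only shorten the window):

# Seconds of writing the guest can absorb after the watermark fires,
# before the LV is full and the VM pauses with ENOSPC.
def headroom_seconds(headroom_mb, write_mb_per_s):
    return headroom_mb / write_mb_per_s

print(headroom_seconds(500, 100))   # 5.0 s at 100 MB/s, as above
print(headroom_seconds(500, 500))   # 1.0 s on fast storage like XtremIO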

Comment 12 Yaniv Kaul 2017-06-28 14:09:06 UTC
*** Bug 1461536 has been marked as a duplicate of this bug. ***

Comment 13 Sandro Bonazzola 2019-01-28 09:39:50 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 14 Daniel Gur 2019-04-14 11:12:00 UTC
What was fixed or changed here? Did you change the vdsm.conf defaults?
Please provide clear validation instructions.

(Also, I see in comment 10 from 2017 that the conf change was already validated back then.)

