Bug 1024353

Summary: VM is not automatically unpaused after no space IO error on NFS
Product: [oVirt] vdsm    Reporter: Katarzyna Jachim <kjachim>
Component: General    Assignee: Tal Nisan <tnisan>
Status: CLOSED DEFERRED    QA Contact: Avihai <aefrat>
Severity: low    Docs Contact:
Priority: unspecified
Version: ---    CC: acanan, amureini, bazulay, bugs, fsimonce, klaas, kshukla, lpeer, michal.skrivanek, mtessun, ncredi, nsoffer, rbalakri, rbarry, srevivo, tnisan, ylavi
Target Milestone: ---    Keywords: FutureFeature, Improvement, Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:    Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2019-11-18 20:22:36 UTC    Type: Bug
Regression: ---    Mount Type: ---
Documentation: ---    CRM:
Verified Versions:    Category: ---
oVirt Team: Virt    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---    Target Upstream Version:

Description Katarzyna Jachim 2013-10-29 13:32:07 UTC
Description of problem:
After an IO error on a storage domain (out of space), the VM is automatically paused. However, after the problem with the storage domain is fixed (some space is freed), the VM is not automatically unpaused.


Version-Release number of selected component (if applicable): is20


How reproducible: 100% in automated tests


Steps to Reproduce:
1. create an NFS storage domain (the underlying storage should be rather small)
2. create a VM with a THIN disk (say, 20 GB) on this SD, install an OS, boot it, and start writing (just dd)
3. create a big PREALLOCATED disk (roughly the free space on the storage domain minus 10 GB) on the storage domain
4. wait until the VM is paused
5. delete the disk added in point 3
6. wait until the VM is unpaused
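
The write in step 2 is just dd inside the guest; an equivalent sketch in Python is below (the path and chunk size are arbitrary, and the `max_bytes` parameter is added here only so the sketch can stop without actually exhausting a disk):

```python
import errno
import os

def fill_until_enospc(path, chunk=1024 * 1024, max_bytes=None):
    """Write zeros to `path` until the filesystem reports ENOSPC.

    Stand-in for the plain `dd if=/dev/zero` used in step 2. Returns the
    number of bytes written before running out of space (or hitting the
    optional `max_bytes` cap).
    """
    written = 0
    buf = b"\0" * chunk
    try:
        with open(path, "wb") as f:
            while max_bytes is None or written < max_bytes:
                f.write(buf)
                f.flush()
                os.fsync(f.fileno())  # force data (and any error) to disk
                written += chunk
    except OSError as e:
        if e.errno != errno.ENOSPC:
            raise  # only swallow the out-of-space error
    return written
```

Once the underlying NFS export is full, this write is what makes qemu report the ENOSPC I/O error that pauses the VM.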

Actual results:
the VM is never unpaused

Expected results:
the VM should be unpaused when there is free space on the sd

Additional info:
It may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1003588, but I prefer to be 100% sure, especially as this scenario works fine for iSCSI storage domains

Comment 3 Federico Simoncelli 2013-11-15 12:16:10 UTC
This is probably not related to bug 1003588. I haven't looked at the logs well enough but the error here should be ENOSPC (instead of EIO).

I'm not sure if the feature was supposed to also cover ENOSPC errors, as they're not related to domain availability.

The solution for ENOSPC is not related at all to the solution proposed for EIO (domain state change).

Ayal, what do you think?

Comment 4 Ayal Baron 2013-11-17 12:03:08 UTC
(In reply to Federico Simoncelli from comment #3)
> This is probably not related to bug 1003588. I haven't looked at the logs
> well enough but the error here should be ENOSPC (instead of EIO).
> 
> I'm not sure if the feature was supposed to also cover ENOSPC errors, as
> they're not related to domain availability.
> 
> The solution for ENOSPC is not related at all to the solution proposed for
> EIO (domain state change).
> 
> Ayal, what do you think?

Indeed, we have no mechanism for this on NFS atm (on block domains we automatically extend and resume).
This is a gap we'll need to cover in 3.4.

Comment 10 Yaniv Lavi 2016-12-04 15:19:06 UTC
This is not an RFE, it's a bug.
We can discuss whether and when to fix it, but it's not correct to keep it as a future item.

Comment 11 Yaniv Kaul 2017-02-12 09:46:04 UTC
Tal, who's going to work on it for 4.1?

Comment 12 Nir Soffer 2017-02-12 14:53:08 UTC
We don't have a mechanism for unpausing VMs on file storage, and our mechanism
on block storage is broken as well; there is no way to detect storage domain
state changes reliably.

Fixing this requires a major redesign, possibly for 4.2 if we start working on
it now.

Comment 13 Yaniv Kaul 2017-02-12 15:19:48 UTC
(In reply to Nir Soffer from comment #12)
> We don't have a mechanism for unpausing VMs on file storage, and our
> mechanism on block storage is broken as well; there is no way to detect
> storage domain state changes reliably.
> 
> Fixing this requires a major redesign, possibly for 4.2 if we start
> working on it now.

We could just retry periodically, I believe.
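
A minimal sketch of this periodic-retry idea (the `vm` and `storage` objects and their method names are hypothetical stand-ins for illustration, not vdsm's actual API):

```python
import time

def resume_when_writable(vm, storage, interval=10, max_attempts=None):
    """Periodically retry resuming a VM that was paused on ENOSPC.

    `vm` and `storage` are assumed abstractions: `vm` knows whether it is
    paused on ENOSPC and can attempt a resume; `storage` reports free space.
    """
    attempts = 0
    while vm.is_paused_on_enospc():
        # Only attempt a resume when the domain reports free space again.
        if storage.free_bytes() > 0 and vm.try_resume():
            return True  # resume succeeded
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return False  # give up after the configured number of tries
        time.sleep(interval)  # wait before the next probe
    return True  # VM is no longer paused
```

The obvious weakness of a bare retry loop, as the surrounding comments note, is that "free space exists" does not by itself prove the domain is healthy, which is why later comments discuss threshold-based monitoring instead.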

Comment 14 Yaniv Lavi 2017-02-23 11:24:52 UTC
Moving out all non-blockers/exceptions.

Comment 15 Nir Soffer 2017-02-28 23:42:12 UTC
(In reply to Yaniv Dary from comment #14)
> Moving out all non-blockers/exceptions.

Yaniv, I don't think this makes sense for 4.1. Did you read comment 12?

We cannot fix problems like this at the last moment; that will only harm the
stability of the product. These kinds of issues must be scheduled for the start
of development of the next version.

Comment 16 Allon Mureinik 2017-07-16 09:30:33 UTC
(In reply to Nir Soffer from comment #12)
> We don't have a mechanism for unpausing VMs on file storage, and our
> mechanism on block storage is broken as well; there is no way to detect
> storage domain state changes reliably.
> 
> Fixing this requires a major redesign, possibly for 4.2 if we start
> working on it now.

We have seen no such requests from the field, and I don't see us putting any effort into this area. Closing; if PM disagrees, please explain and reopen.

Comment 19 Nir Soffer 2019-03-27 16:42:51 UTC
Based on comment 17, I think we can monitor the available space on file-based
storage domains and remember the last value seen.

When we move from value <= threshold to value > threshold, we can treat this as
a change from invalid to valid state, and use the same mechanism to resume VMs
paused because of ENOSPC.

This cannot be very robust, since we monitor only every 10 seconds; if space
is added or removed quickly, we may miss the low-space state. But if we choose
a large enough threshold, it may be good enough.
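
The threshold-crossing detection described above can be sketched as follows (illustrative only; the class and method names are assumptions, not vdsm code):

```python
class SpaceMonitor:
    """Track free space on a file-based storage domain and detect the
    low-to-ok transition that should trigger resuming ENOSPC-paused VMs."""

    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.last_free = None  # last sampled free-space value, None until first sample

    def check(self, free_bytes):
        """Record a new sample; return True only on a transition from
        free space <= threshold to free space > threshold."""
        crossed = (
            self.last_free is not None
            and self.last_free <= self.threshold
            and free_bytes > self.threshold
        )
        self.last_free = free_bytes
        return crossed
```

As the comment notes, with a 10-second sampling interval a fast dip below the threshold can be missed entirely, so the threshold has to be large relative to the expected write rate for this to be useful.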

Comment 20 Martin Tessun 2019-04-01 08:11:04 UTC
Hi Koutuk,

(In reply to Koutuk Shukla from comment #17)
> Hello,
> 
> Re-opening this bug [...]
> 
> Environment details:
> rhvm-4.2.6.4-0.1.el7ev.noarch
> 

we will look at this in 4.4 as a feature request, but unpausing a VM could have harmful effects as well.
E.g. if you have a timer-sensitive application (even the kernel is one), you might see unforeseen effects after longer pauses (hangs, deadlocks, etc.).
As such, unpausing the VM unconditionally is not sensible.
So we will discuss how we can sensibly achieve this in 4.4 planning.

Cheers,
Martin

Comment 21 Klaas Demter 2019-04-01 17:19:29 UTC
(In reply to Martin Tessun from comment #20)
> Hi Koutuk,
> 
> (In reply to Koutuk Shukla from comment #17)
> > Hello,
> > 
> > Re-opening this bug [...]
> > 
> > Environment details:
> > rhvm-4.2.6.4-0.1.el7ev.noarch
> > 
> 
> we will look at this in 4.4 as a feature request, but unpausing a VM could
> have harmful effects as well.
> E.g. if you have a timer-sensitive application (even the kernel is one),
> you might see unforeseen effects after longer pauses (hangs, deadlocks,
> etc.).
> As such, unpausing the VM unconditionally is not sensible.
> So we will discuss how we can sensibly achieve this in 4.4 planning.

If I understood this correctly the functionality already exists for other storage types. If you decide this is not a good idea for NFS why is it a good idea for other storage types?

> 
> Cheers,
> Martin

Comment 23 Martin Tessun 2019-04-23 10:26:41 UTC
(In reply to Klaas Demter from comment #21)
> (In reply to Martin Tessun from comment #20)
> > Hi Koutuk,
> > 
> > (In reply to Koutuk Shukla from comment #17)
> > > Hello,
> > > 
> > > Re-opening this bug [...]
> > > 
> > > Environment details:
> > > rhvm-4.2.6.4-0.1.el7ev.noarch
> > > 
> > 
> > we will look at this in 4.4 as a feature request, but unpausing a VM could
> > have harmful effects as well.
> > E.g. if you have a timer-sensitive application (even the kernel is one),
> > you might see unforeseen effects after longer pauses (hangs, deadlocks,
> > etc.).
> > As such, unpausing the VM unconditionally is not sensible.
> > So we will discuss how we can sensibly achieve this in 4.4 planning.
> 
> If I understood this correctly the functionality already exists for other
> storage types. If you decide this is not a good idea for NFS why is it a
> good idea for other storage types?

To be fair, it isn't. And AFAIK it should also not happen in case the ENOSPC condition lasts longer than a few moments. The unpausing on LVM (block-based) storage is mainly done when a thin volume is extended while the guest writes data; those are the cases where the guest is automatically unpaused (AFAIK).

@Michal: Anything to add?

> 
> > 
> > Cheers,
> > Martin

Comment 24 Ryan Barry 2019-11-18 20:22:36 UTC
As indicated by numerous comments (comment #20 and comment #23 in particular), as well as the age of this RFE without a reasonable mechanism, this bug will be closed.

Block storage is able to handle this somewhat better than NFS, with direct monitoring of the underlying storage and reliable communication of events. NFS is not at a point where we can rely on the underlying storage being ready, and alternate solutions for shared storage exist.

Comment 25 Benny Zlotnik 2020-01-08 08:28:39 UTC
*** Bug 1693786 has been marked as a duplicate of this bug. ***