Bug 1569926

Summary: Commands accessing an unresponsive NFS storage domain can block for 20-30 minutes
Product: [oVirt] vdsm
Reporter: Michal Skrivanek <michal.skrivanek>
Component: Core
Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Evelina Shames <eshames>
Severity: high
Priority: high
Version: 4.40.13
CC: aefrat, bugs, lleistne, michal.skrivanek, mtessun, nsoffer, rdlugyhe, tnisan
Target Milestone: ovirt-4.4.1
Flags: pm-rhel: ovirt-4.4+
Hardware: Unspecified
OS: Unspecified
oVirt Team: Storage
Doc Type: Bug Fix
Doc Text:
Previously, commands trying to access an unresponsive NFS storage domain remained blocked for 20-30 minutes, with significant impact. This was caused by non-optimal values of the NFS storage timeout and retry parameters. The current release changes these parameter values so that commands to a non-responsive NFS storage domain fail within one minute.
Last Closed: 2020-07-08 08:25:10 UTC
Type: Bug

Description Michal Skrivanek 2018-04-20 10:02:31 UTC
We struggle with our ancient decision to configure NFS storage timeouts too high: a storage error is not reported to the OS for a long time, while the non-responsive treatment and storage domain monitoring generally conclude within 5 minutes that a host is NonResponsive. By then, the real I/O error report that the host may finally see is no longer effective. We need to report the I/O error properly to the engine so the correct resume decision can be taken. This also affects things like migration of VMs when a host becomes NonOperational or is shut down, and the general accuracy of the VM's actual state on the host in libvirt, vdsm, and in the engine.
The situation became more visible once we implemented auto-resume policies (bug 1317450); one example of these issues is bug 1481022.

While we're waiting on bug 665820 to provide a solution, maybe we can tweak the values.

For reference, our current default settings are soft,tcp,timeo=600,retrans=6.

Comment 1 Tal Nisan 2018-04-22 08:22:37 UTC
We are not configuring the timeout ourselves unless it is specified explicitly in the NFS custom options; the timeout of 600 comes from the default mount options:

For NFS over TCP the default timeo value is 600 (60 seconds). The NFS client performs linear backoff: After each retransmission the timeout is increased by timeo up to the maximum of 600 seconds.
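
To make the arithmetic concrete, here is a rough Python sketch of the worst-case blocking time implied by the linear backoff described above. This is an approximation only; the kernel's actual retry behavior, especially over TCP, may differ:

    # Rough estimate of worst-case soft-mount blocking time, assuming
    # one initial attempt plus `retrans` retransmissions with linear
    # backoff capped at 600 seconds, as described above. Approximate.
    def worst_case_block_s(timeo_ds, retrans, cap_s=600):
        timeo_s = timeo_ds / 10.0  # timeo is given in deciseconds
        return sum(min(timeo_s * (i + 1), cap_s) for i in range(retrans + 1))

    print(worst_case_block_s(600, 6))  # current defaults: 1680.0s, ~28 minutes
    print(worst_case_block_s(100, 3))  # a lower setting, for comparison: 100.0s

With the current defaults (timeo=600, retrans=6) this comes out at roughly 28 minutes in the worst case, which matches the 20-30 minute blocking reported in this bug.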


We can set a lower default timeout value if that's the desired behavior.

Comment 2 Michal Skrivanek 2018-04-23 12:36:15 UTC
(In reply to Tal Nisan from comment #1)
> We are not configuring the timeout ourselves unless it is specified
> explicitly in the NFS custom options; the timeout of 600 comes from the
> default mount options:
> 
> For NFS over TCP the default timeo value is 600 (60 seconds). The NFS client
> performs linear backoff: After each retransmission the timeout is increased
> by timeo up to the maximum of 600 seconds.

I'm afraid it's a bit more complicated, as IIUC with TCP there are also TCP-level timeouts in play here.

> We can set a lower default timeout value if that's the desired
> behavior.

Well, it's desirable to get the IOError sooner than fencing. We first need to see when exactly we get the IOError, and then change the default options to get it sooner. Only then will the logic around HA restart (and the CDROM error reported to the guest) start working properly on NFS storage.

Comment 3 Tal Nisan 2018-04-25 11:20:48 UTC
Nir,
I'd like to have your input on this: will changing the default timeout of 60 seconds gain us anything?

Comment 4 Michal Skrivanek 2018-05-02 06:51:36 UTC
Some updates on testing different settings: https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c31, https://bugzilla.redhat.com/show_bug.cgi?id=1230788#c53

Comment 5 Nir Soffer 2018-06-14 15:15:11 UTC
(In reply to Tal Nisan from comment #3)
> I'd like to have your input on this: will changing the default timeout
> of 60 seconds gain us anything?

I don't know; we will need to consult the file system folks about it, and testing
such changes will consume a lot of time, so this does not look like something we
can do for 4.2.z.

In 4.2 there is no need to use NFS or other file-based storage for ISOs, so issues
with reporting errors to the guest are not interesting. Users should move ISO images
to block storage.

In the long term I think we should deprecate file-based storage built on qcow2
chains and focus instead on a LUN-based solution. This gives better performance and
reliability, and much easier management.

Comment 6 Tal Nisan 2018-06-17 09:49:33 UTC
Setting target release to 4.3 based on Nir's comment

Comment 7 Raz Tamir 2018-08-22 14:36:00 UTC
(In reply to Tal Nisan from comment #6)
> Setting target release to 4.3 based on Nir's comment

Yaniv,

Please remove the blocker+ flag if 4.3 is the right target milestone for this bug

Comment 8 Michal Skrivanek 2018-08-29 13:07:47 UTC
Also see https://bugzilla.redhat.com/show_bug.cgi?id=1609701#c9 for more comments/thoughts.

Comment 9 Sandro Bonazzola 2019-01-28 09:34:22 UTC
This bug has not been marked as a blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 12 Nir Soffer 2020-04-23 14:40:51 UTC
Moving to vdsm, since the defaults are set there.

Vdsm now uses:
- timeout: 100 (deciseconds)
- retry: 3

Users can override the default timeout and retry from the engine UI/SDK
when creating or editing a storage domain (see the sketch below).

To override the settings, change the custom connection parameters:
- Retransmissions (#)
- Timeout (deciseconds)
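
For illustration, overriding these values programmatically might look like the following Python SDK sketch. This is a hypothetical example, not a confirmed recipe: the nfs_timeo/nfs_retrans attribute names, URL, and credentials are assumptions to verify against your ovirtsdk4 version.

    # Sketch: create an NFS storage connection with custom timeout/retrans.
    # Assumes ovirtsdk4's StorageConnection exposes nfs_timeo/nfs_retrans;
    # verify against your SDK version. URL and credentials are placeholders.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        insecure=True,  # example only; use CA verification in production
    )
    try:
        connections_service = connection.system_service().storage_connections_service()
        connections_service.add(
            types.StorageConnection(
                type=types.StorageType.NFS,
                address='nfs.example.com',
                path='/exports/data',
                nfs_timeo=100,   # deciseconds, same as the new vdsm default
                nfs_retrans=3,
            ),
        )
    finally:
        connection.close()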

Comment 14 Nir Soffer 2020-06-04 18:04:20 UTC
To verify the bug, test how long an NFS storage domain stays blocked,
from the moment access to the NFS server is blocked until I/O fails.

Previously we were blocking for at least 21 minutes, and practically
up to 30 minutes. With the new NFS configuration the expectation is
blocking of up to 60 seconds, but in practice I see blocking for about
270 seconds.

To test this you can grep the vdsm log for WARN. About 10 seconds after
blocking access to the NFS server, you will see a warning about a
blocked checker for this storage domain.

Then you will see a new warning every 10 seconds, with the time the
checker has been blocked.

At some point you will see an error when the checker completes the read
(after about 270 seconds in my tests).

If storage is still not accessible, this will repeat.

Previously the checker could be blocked for 30 minutes.

I think most issues caused by this are related to virt, so maybe the
virt team should test whether this change improves the issues mentioned
in comment 0.
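
As a convenience, the grep step could be scripted like this minimal Python sketch. It assumes the default vdsm log location and only filters on WARN, since the exact warning text is not reproduced here:

    # Minimal sketch: print WARN lines from the vdsm log, which include
    # the blocked-checker warnings described above. Assumes the default
    # log path; adjust for your installation.
    def grep_warnings(log_path='/var/log/vdsm/vdsm.log'):
        with open(log_path) as log:
            for line in log:
                if 'WARN' in line:
                    print(line, end='')

    grep_warnings()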

Comment 15 Evelina Shames 2020-06-07 11:11:34 UTC
(In reply to Nir Soffer from comment #14)

Verified with the above steps on rhv-4.4.1-3.

Comment 18 Sandro Bonazzola 2020-07-08 08:25:10 UTC
This bug is included in the oVirt 4.4.1 release, published on July 8th, 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.