Bug 1569926
Summary: | Commands accessing an unresponsive NFS storage domain can block for 20-30 minutes | |
---|---|---|---
Product: | [oVirt] vdsm | Reporter: | Michal Skrivanek <michal.skrivanek>
Component: | Core | Assignee: | Nir Soffer <nsoffer>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Evelina Shames <eshames>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.40.13 | CC: | aefrat, bugs, lleistne, michal.skrivanek, mtessun, nsoffer, rdlugyhe, tnisan
Target Milestone: | ovirt-4.4.1 | Flags: | pm-rhel: ovirt-4.4+
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | Previously, commands trying to access an unresponsive NFS storage domain remained blocked for 20-30 minutes, causing significant disruption. This was caused by non-optimal values for the NFS storage timeout and retry parameters. The current release changes these parameter values so that commands against an unresponsive NFS storage domain fail within one minute. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-07-08 08:25:10 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (Michal Skrivanek, 2018-04-20 10:02:31 UTC)
Tal Nisan (comment #1):

We are not configuring the timeout ourselves unless it is specified explicitly in the NFS custom options; the timeout of 600 comes from the default mount options. For NFS over TCP, the default timeo value is 600 (60 seconds). The NFS client performs linear backoff: after each retransmission the timeout is increased by timeo, up to a maximum of 600 seconds. We can set a lower default timeout if that is the desired behavior.

(In reply to Tal Nisan from comment #1)
> We are not configuring the timeout ourselves unless specified explicitly in
> the NFS custom options; the timeout of 600 comes from the default mount
> options. For NFS over TCP, the default timeo value is 600 (60 seconds). The
> NFS client performs linear backoff: after each retransmission the timeout is
> increased by timeo, up to a maximum of 600 seconds.

I'm afraid it's a bit more complicated, as IIUC with TCP there are also TCP-level timeouts in play here.

> We can set a lower default timeout if that is the desired behavior.

Well, it is desirable to get an IOError sooner than fencing. We first need to see exactly when we get the IOError, and only then change the default options to get it sooner. Only then will the logic around HA restart (and the CDROM error reported to the guest) start working properly on NFS storage.

Tal Nisan (comment #3):

Nir, I'd like your input on this: will changing the default timeout of 60 seconds gain us anything?

Some updates on testing different settings: https://bugzilla.redhat.com/show_bug.cgi?id=1540548#c31 , https://bugzilla.redhat.com/show_bug.cgi?id=1230788#c53

(In reply to Tal Nisan from comment #3)
> I'd like your input on this: will changing the default timeout of 60
> seconds gain us anything?

I don't know. We would need to consult the file-system folks about it, and testing such changes will consume a lot of time, so this does not look like something we can do for 4.2.z.
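The linear backoff described above can be sketched numerically. This is a simplified model of the behavior described in the comments, not the kernel's actual retry logic: the retrans value of 2 for the old TCP mount defaults is an assumption, and real blocking times are longer because TCP-level timeouts also apply (consistent with the 20-30 minute stalls reported in this bug).

```python
def backoff_schedule(timeo_ds, retrans, cap_s=600.0):
    """Per-attempt timeouts in seconds under simple linear backoff:
    each retransmission adds another timeo, capped at 600 seconds.
    A sketch of the behavior described above, not the kernel algorithm."""
    base = timeo_ds / 10.0  # timeo is given in deciseconds
    return [min(base * (i + 1), cap_s) for i in range(retrans + 1)]

# Old mount defaults for NFS over TCP: timeo=600; retrans=2 is assumed here.
old = backoff_schedule(600, 2)
# New vdsm defaults from this fix: timeo=100 (10 s), retrans=3.
new = backoff_schedule(100, 3)

print(old, sum(old))  # [60.0, 120.0, 180.0] 360.0
print(new, sum(new))  # [10.0, 20.0, 30.0, 40.0] 100.0
```

Even in this idealized model, one major-timeout cycle shrinks from minutes to under two minutes; the much longer stalls seen in practice come from additional retry cycles and TCP timeouts on top of this.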
In 4.2 there is no need to use NFS or other file-based storage for ISOs, so issues with reporting errors to the guest are not interesting; users should move ISO images to block storage. In the long term I think we should deprecate file-based storage built on qcow2 chains and focus instead on a LUN-based solution. That gives better performance and reliability, and much easier management.

Tal Nisan (comment #6):

Setting target release to 4.3 based on Nir's comment.

(In reply to Tal Nisan from comment #6)
> Setting target release to 4.3 based on Nir's comment.

Yaniv, please remove the blocker+ flag if 4.3 is the right target milestone for this bug.

Also see https://bugzilla.redhat.com/show_bug.cgi?id=1609701#c9 for more comments/thoughts.

This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Moving to vdsm, since the defaults are set there.

Vdsm now uses:
- timeout: 100 (deciseconds)
- retries: 3

Users can override the default timeout and retry count from the engine UI/SDK when creating or editing a storage domain, via the custom connection parameters:
- Retransmissions (#)
- Timeout (deciseconds)

To verify the bug, test how long an NFS storage domain remains blocked, after access to the NFS storage server is blocked, before I/O fails.

Previously we were blocked for at least 21 minutes, and in practice up to 30 minutes. With the new NFS configuration we should block for up to 60 seconds, though in practice I see blocking for about 270 seconds.

To test this, you can grep the vdsm log for WARN. About 10 seconds after blocking access to the NFS server storage domain, you will see a warning about a blocked checker for this storage domain. You will then see a new warning every 10 seconds with the time the checker has been blocked. At some point you will see an error when the checker completes the read (about 270 seconds in my tests). If the storage is still not accessible, this repeats. Previously, a checker could be blocked for 30 minutes.
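The verification steps above can be complemented by timing how long a single read blocks before the I/O error surfaces. The helper below is a hypothetical sketch, not part of vdsm: `path` would point at a file on the NFS storage domain while access to the server is blocked, and the elapsed time should drop from 20-30 minutes to roughly 60-270 seconds with the new settings.

```python
import os
import tempfile
import time

def time_blocking_read(path, size=4096):
    """Measure how long a single read of `path` blocks before it
    completes or fails with an I/O error. Hypothetical helper for the
    verification steps described in this bug."""
    start = time.monotonic()
    try:
        with open(path, "rb") as f:
            f.read(size)
        failed = False
    except OSError:
        failed = True
    return failed, time.monotonic() - start

# Sanity check against a local file (reads should succeed quickly):
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
failed, elapsed = time_blocking_read(tmp.name)
print(failed, elapsed < 60)  # False True
os.unlink(tmp.name)
```

Against a blocked NFS mount, `failed` should become True once the soft-mount retries are exhausted, and `elapsed` gives the actual blocking time to compare against the expected bound.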
I think most issues caused by this are related to virt, so maybe they should test whether this change improves the issues mentioned in comment 0.

(In reply to Nir Soffer from comment #14)
> To verify the bug, test how long an NFS storage domain remains blocked,
> after access to the NFS storage server is blocked, before I/O fails.
>
> [...]
>
> Previously, a checker could be blocked for 30 minutes.

Verified with the above steps on rhv-4.4.1-3.

This bugzilla is included in the oVirt 4.4.1 release, published on July 8th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
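The grep-based check from the verification steps can be automated with a small log scanner. The warning format below is an assumption modeled on the behavior described in this bug (a WARN every 10 seconds naming the blocked checker and its blocked time); it has not been checked against the vdsm source, so the regex would need adjusting to the real message.

```python
import re

# Hypothetical sample lines; the exact vdsm message format is an
# assumption based on the behavior described in the comments.
SAMPLE_LOG = """\
2020-06-01 10:00:00,000 WARN (check/loop) Checker '/rhev/data-center/mnt/srv:_export' is blocked for 10.00 seconds
2020-06-01 10:00:10,000 WARN (check/loop) Checker '/rhev/data-center/mnt/srv:_export' is blocked for 20.00 seconds
"""

PATTERN = re.compile(
    r"WARN .*Checker '(?P<path>[^']+)' is blocked for (?P<sec>[\d.]+) seconds"
)

def max_blocked_seconds(log_text):
    """Return the longest reported blocked time per checker path."""
    worst = {}
    for m in PATTERN.finditer(log_text):
        path, sec = m.group("path"), float(m.group("sec"))
        worst[path] = max(worst.get(path, 0.0), sec)
    return worst

print(max_blocked_seconds(SAMPLE_LOG))
# {'/rhev/data-center/mnt/srv:_export': 20.0}
```

With the fix in place, the maximum reported blocked time per domain should stay in the range of seconds to a few minutes rather than the 20-30 minutes seen before.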