Bug 1530072 - Vdsm can get into D state when checking disk type on non-responsive NFS server
Summary: Vdsm can get into D state when checking disk type on non-responsive NFS server
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Nir Soffer
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-01 18:34 UTC by Nir Soffer
Modified: 2018-02-22 10:01 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Vdsm was accessing NFS storage directly when checking whether a volume is on block storage while starting a virtual machine. Consequence: Vdsm can become unkillable (D state), and restarting Vdsm may fail. Fix: Vdsm does not access storage to detect whether volumes are on block storage. Result: Vdsm cannot become unkillable when starting a virtual machine.
Clone Of:
Environment:
Last Closed: 2018-02-22 10:01:53 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1518676 0 high CLOSED Entire vdsm process hang during when formatting xlease volume on NFS storage domain 2021-03-11 21:11:50 UTC
oVirt gerrit 85307 0 master MERGED storage: Set diskType early and safely 2018-01-03 21:33:15 UTC
oVirt gerrit 85337 0 master MERGED volume: Return volume type in getVmVolumeInfo 2018-01-03 16:56:48 UTC
oVirt gerrit 85343 0 master MERGED localdisk: Update drive diskType 2018-01-03 17:11:27 UTC
oVirt gerrit 85870 0 master MERGED vm: Do not check disk type when starting LSM 2018-01-03 22:41:48 UTC

Internal Links: 1518676

Description Nir Soffer 2018-01-01 18:34:56 UTC
Description of problem:

When starting to use a disk, we check whether each disk is a block device
by invoking os.stat(). If the disk is on an NFS storage domain, and the NFS
server becomes non-responsive before starting the VM, this check may block for
minutes, causing vdsm to be unkillable.

The affected flows are:
- starting vm
- starting destination vm during migration
- hot plug disk
- starting live storage migration
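The problematic pattern described above can be sketched as follows. This is a hypothetical illustration, not vdsm's actual code: os.stat() on a path backed by a hard-mounted, non-responsive NFS server blocks inside the kernel, so the calling process sits in uninterruptible sleep (D state) until the server answers.

```python
import os
import stat


def is_block_device(path):
    # Hypothetical sketch of the risky check: os.stat() must touch the
    # filesystem. On a non-responsive hard-mounted NFS export this call
    # blocks in the kernel and cannot be interrupted by signals, leaving
    # the process in D state.
    st = os.stat(path)
    return stat.S_ISBLK(st.st_mode)
```

On a healthy filesystem the call returns immediately; the failure mode only appears when the NFS server stops responding between preparing the disk and running this check.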

When using cluster version 4.2, some flows do not check the disk type since
Engine specifies the diskType in the XML.

Version-Release number of selected component (if applicable):
Any.

How reproducible:
Hard to reproduce: the server must become non-responsive after preparing the
disk, but before checking the disk type.

Steps to Reproduce:
1. Block access to the NFS storage domain and at the same time try one of the
   flows mentioned above.

Actual results:
Vdsm enters D state and cannot be restarted.

Expected results:
Vdsm never accesses NFS storage directly and does not enter D state.

The best way to fix this issue is to avoid accessing storage when checking the
disk type.

When preparing a volume before using it in a VM, we can return the disk type
and use this value to initialize the Drive object. No disk access is needed.
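A minimal sketch of that fix, with hypothetical names (Drive, prepare_volume, and the field names here are illustrative, not vdsm's real API): the storage layer already knows whether a domain is block- or file-based, so it can report the disk type when preparing the volume, and the caller passes it into the Drive object without any os.stat() on the storage path.

```python
DISK_TYPE_FILE = "file"
DISK_TYPE_BLOCK = "block"


class Drive:
    def __init__(self, path, disk_type):
        # The disk type is supplied by the caller; the Drive never
        # touches storage to discover it.
        self.path = path
        self.disk_type = disk_type


def prepare_volume(domain_is_block, path):
    # Illustrative prepare step: the storage domain type is known from
    # metadata, so the disk type can be reported without accessing the
    # (possibly non-responsive) volume path.
    disk_type = DISK_TYPE_BLOCK if domain_is_block else DISK_TYPE_FILE
    return {"path": path, "diskType": disk_type}


info = prepare_volume(domain_is_block=False,
                      path="/rhev/data-center/mnt/server:_export/vol")
drive = Drive(info["path"], info["diskType"])
```

Because the type is decided from metadata at prepare time, none of the affected flows (VM start, migration destination, hot plug, live storage migration) needs to stat the path again.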

Comment 1 Nir Soffer 2018-01-01 19:09:39 UTC
This is another variant of bug 1518676.

Comment 2 Nir Soffer 2018-01-01 19:19:32 UTC
Proposing for 4.2.1, since we already have patches for this.

Comment 3 Natalie Gavrielov 2018-02-19 09:29:02 UTC
Verified: the tier1 automation run for the RHV storage team passed.
Versions used:
rhvm-4.2.2-0.1.el7.noarch
vdsm-4.20.18-1.el7ev.x86_64

Comment 4 Sandro Bonazzola 2018-02-22 10:01:53 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be resolved in the
oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

