Bug 1530072 - Vdsm can get into D state when checking disk type on non-responsive NFS server
Summary: Vdsm can get into D state when checking disk type on non-responsive NFS server
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Nir Soffer
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-01 18:34 UTC by Nir Soffer
Modified: 2018-02-22 10:01 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Vdsm was accessing NFS storage directly when checking whether a volume is on block storage while starting a virtual machine. Consequence: Vdsm can become unkillable (D state), and restarting Vdsm may fail. Fix: Vdsm does not access storage to detect whether volumes are on block storage. Result: Vdsm cannot become unkillable when starting a virtual machine.
Clone Of:
Environment:
Last Closed: 2018-02-22 10:01:53 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1518676 0 high CLOSED Entire vdsm process hang during when formatting xlease volume on NFS storage domain 2021-03-11 21:11:50 UTC
oVirt gerrit 85307 0 master MERGED storage: Set diskType early and safely 2018-01-03 21:33:15 UTC
oVirt gerrit 85337 0 master MERGED volume: Return volume type in getVmVolumeInfo 2018-01-03 16:56:48 UTC
oVirt gerrit 85343 0 master MERGED localdisk: Update drive diskType 2018-01-03 17:11:27 UTC
oVirt gerrit 85870 0 master MERGED vm: Do not check disk type when starting LSM 2018-01-03 22:41:48 UTC

Internal Links: 1518676

Description Nir Soffer 2018-01-01 18:34:56 UTC
Description of problem:

When starting to use a disk, we check whether each disk is a block device
by invoking os.stat(). If the disk is on an NFS storage domain, and the NFS
server becomes non-responsive before starting the VM, this check may block for
minutes, causing vdsm to be unkillable.

The affected flows are:
- starting vm
- starting destination vm during migration
- hot plug disk
- starting live storage migration
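The problematic pattern described above can be sketched as follows. This is a hypothetical illustration, not vdsm's actual code: os.stat() on a path backed by a hard-mounted, non-responsive NFS server blocks inside the kernel, so the calling process sits in uninterruptible sleep (D state) until the server answers.

```python
import os
import stat


def is_block_device(path):
    # Hypothetical sketch of the risky check: os.stat() must touch the
    # filesystem. On a non-responsive hard-mounted NFS export this call
    # blocks in the kernel and cannot be interrupted by signals, leaving
    # the process in D state.
    st = os.stat(path)
    return stat.S_ISBLK(st.st_mode)
```

On a healthy filesystem the call returns immediately; the failure mode only appears when the NFS server stops responding between preparing the disk and running this check.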

When using cluster version 4.2, some flows do not check the disk type since
Engine specifies the diskType in the XML.

Version-Release number of selected component (if applicable):
Any.

How reproducible:
Hard to reproduce: the server must become non-responsive after preparing the
disk, but before checking the disk type.

Steps to Reproduce:
1. Block access to the NFS storage domain and at the same time try one of the
   flows mentioned above.

Actual results:
Vdsm enters D state and cannot be restarted.

Expected results:
Vdsm never accesses NFS storage directly and does not enter D state.

The best way to fix this issue is to avoid accessing storage when checking the
disk type.

When preparing a volume before using it in a VM, we can return the disk type
and use this value to initialize the Drive object. No disk access is needed.
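A minimal sketch of that fix, with hypothetical names (Drive, prepare_volume, and the field names here are illustrative, not vdsm's real API): the storage layer already knows whether a domain is block- or file-based, so it can report the disk type when preparing the volume, and the caller passes it into the Drive object without any os.stat() on the storage path.

```python
DISK_TYPE_FILE = "file"
DISK_TYPE_BLOCK = "block"


class Drive:
    def __init__(self, path, disk_type):
        # The disk type is supplied by the caller; the Drive never
        # touches storage to discover it.
        self.path = path
        self.disk_type = disk_type


def prepare_volume(domain_is_block, path):
    # Illustrative prepare step: the storage domain type is known from
    # metadata, so the disk type can be reported without accessing the
    # (possibly non-responsive) volume path.
    disk_type = DISK_TYPE_BLOCK if domain_is_block else DISK_TYPE_FILE
    return {"path": path, "diskType": disk_type}


info = prepare_volume(domain_is_block=False,
                      path="/rhev/data-center/mnt/server:_export/vol")
drive = Drive(info["path"], info["diskType"])
```

Because the type is decided from metadata at prepare time, none of the affected flows (VM start, migration destination, hot plug, live storage migration) needs to stat the path again.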

Comment 1 Nir Soffer 2018-01-01 19:09:39 UTC
This is another variant of bug 1518676.

Comment 2 Nir Soffer 2018-01-01 19:19:32 UTC
Proposing for 4.2.1, since we already have patches for this.

Comment 3 Natalie Gavrielov 2018-02-19 09:29:02 UTC
Verified: the tier1 automation run for the RHV storage team passed.
Versions used:
rhvm-4.2.2-0.1.el7.noarch
vdsm-4.20.18-1.el7ev.x86_64

Comment 4 Sandro Bonazzola 2018-02-22 10:01:53 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be resolved in the
oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

