Description of problem:

When starting to use a disk, we check for every disk whether it is a block device by invoking os.stat(). If the disk is on an NFS storage domain and the NFS server becomes non-responsive before the VM starts, this check may block for minutes, leaving vdsm unkillable.

The affected flows are:
- starting a VM
- starting the destination VM during migration
- hot plugging a disk
- starting live storage migration

When using cluster version 4.2, some flows do not check the disk type, since engine specifies the diskType in the XML.

Version-Release number of selected component (if applicable):
Any.

How reproducible:
Hard to reproduce; the server must become non-responsive after the disk is prepared but before the disk type is checked.

Steps to Reproduce:
1. Block access to the NFS storage domain and, at the same time, try one of the flows mentioned above.

Actual results:
Vdsm enters D state and cannot be restarted.

Expected results:
Vdsm never accesses NFS storage directly and does not enter D state.

The best way to fix this issue is to avoid accessing storage when checking the disk type. When preparing a volume before using it in a VM, we can return the disk type and use this value to initialize the Drive object, so no disk access is needed. A minimal sketch of this direction follows.
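The sketch below contrasts the blocking check with the proposed flow. The names Drive, prepare_volume, produce_path, and on_block_domain are hypothetical, chosen for illustration; they are not vdsm's actual API.

```python
import os
import stat
from dataclasses import dataclass


def disk_type_by_stat(path):
    # Problematic approach: os.stat() on an NFS path blocks in
    # uninterruptible (D) state if the NFS server stops responding,
    # so the calling process cannot be killed.
    mode = os.stat(path).st_mode  # may hang for minutes on a dead NFS server
    return "block" if stat.S_ISBLK(mode) else "file"


@dataclass
class Drive:
    # Stand-in for vdsm's Drive object; fields are illustrative.
    path: str
    disk_type: str


def prepare_volume(volume):
    # Proposed direction: the storage layer already knows whether the
    # domain is block-based or file-based, so it can report the disk
    # type when preparing the volume; no stat() on the path is needed.
    # 'volume' and its attributes are illustrative.
    path = volume.produce_path()  # hypothetical activation step
    disk_type = "block" if volume.on_block_domain else "file"
    return {"path": path, "diskType": disk_type}


def drive_from_prepare(result):
    # Initialize the Drive from the prepare result instead of touching
    # storage again.
    return Drive(path=result["path"], disk_type=result["diskType"])
```

The key point is that the disk type is derived from metadata vdsm already holds about the storage domain, so no I/O against a possibly unresponsive NFS server is required.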
This is another variant of bug 1518676.
Proposing for 4.2.1, since we already have patches for this.
Verified; the tier1 automation run for the RHV storage team passed.

Versions used:
rhvm-4.2.2-0.1.el7.noarch
vdsm-4.20.18-1.el7ev.x86_64
This bugzilla is included in the oVirt 4.2.1 release, published on February 12th, 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.