Bug 1339963

Summary: virDomainGetControlInfo hangs after random time with unresponsive storage
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.2
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Marcel Kolaja <mkolaja>
Assignee: Peter Krempa <pkrempa>
QA Contact: Virtualization Bugs <virt-bugs>
CC: dyuan, ebenahar, fromani, jherrman, jsuchane, michal.skrivanek, mzamazal, pkrempa, pzhang, rbalakri, snagar, xuzhang, yisun
Target Milestone: rc
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-1.2.17-13.el7_2.5
Doc Type: Bug Fix
Doc Text:
When the libvirt service attempted to access a file on blocked or unreachable NFS storage used by a guest virtual machine, libvirt API calls against that guest became unresponsive. With this update, if the guest is online, libvirt collects the data from the guest's QEMU monitor and does not access its NFS storage directly. As a result, the described problem occurs significantly less frequently.
Clone Of: 1337073
Bug Depends On: 1337073
Last Closed: 2016-06-23 06:12:34 UTC

Description Marcel Kolaja 2016-05-26 09:17:40 UTC
This bug has been copied from bug #1337073 and has been proposed
to be backported to 7.2 z-stream (EUS).

Comment 6 Pei Zhang 2016-05-27 09:04:29 UTC
Verified version:
libvirt-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.12.x86_64

Verified steps:

1. Prepare an NFS server:
# mount | grep nfs
$IP:/mnt/img on /tmp/zp type nfs4 (rw,relatime,vers=4.0,soft,proto=tcp,......)
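
(For reference, a minimal sketch of how such an export and mount could be set up; $IP, /mnt/img and /tmp/zp are the placeholders used above, and the export options are illustrative, not the exact ones from this test.)

On the NFS server:
# mkdir -p /mnt/img
# echo '/mnt/img *(rw,sync,no_root_squash)' >> /etc/exports
# exportfs -ra

On the virtualization host:
# mkdir -p /tmp/zp
# mount -t nfs4 -o rw,soft,proto=tcp,vers=4.0 $IP:/mnt/img /tmp/zp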

2. Start a guest with its image on the NFS storage.

# virsh list 
 Id    Name                           State
----------------------------------------------------
 3     vm1                            running

# virsh domblklist vm1
Target     Source
------------------------------------------------
hdc        -
vda        /tmp/zp/r72.qcow2
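
(For reference, the NFS-backed disk can be inspected in the domain XML; the source path matches the domblklist output above, the rest of the element is a typical sketch, and the output here is abridged to the vda disk.)

# virsh dumpxml vm1 | sed -n '/<disk/,/<\/disk>/p'
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/zp/r72.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>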

Check that the guest is running well:
# virsh blkdeviotune vm1 vda
total_bytes_sec: 0
read_bytes_sec : 0
write_bytes_sec: 0
total_iops_sec : 0
read_iops_sec  : 0
write_iops_sec : 0
total_bytes_sec_max: 0
read_bytes_sec_max: 0
write_bytes_sec_max: 0
total_iops_sec_max: 0
read_iops_sec_max: 0
write_iops_sec_max: 0
size_iops_sec  : 0

3. Disconnect the NFS server:
# iptables -A OUTPUT -d $IP -p tcp --dport 2049 -j DROP
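
(To restore connectivity after the test, the same rule can be deleted again:)
# iptables -D OUTPUT -d $IP -p tcp --dport 2049 -j DROP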

4. In terminal 1, check IO throttling using blkdeviotune:
# virsh blkdeviotune vm1 vda 
...... It hangs for a few minutes at first.
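
(When scripting this check, one way to avoid a terminal that blocks indefinitely is to bound the call with timeout(1); the 300-second limit is an arbitrary choice for this sketch.)

# timeout 300 virsh blkdeviotune vm1 vda || echo "blkdeviotune did not return within 300s"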

5. In terminal 2, check domstats for active guests.
It takes a while to return here, but it does not hang.

# virsh domstats --block --list-active
Domain: 'vm1'
  block.count=2
  block.0.name=hdc
  block.1.name=vda
  block.1.path=/tmp/zp/r72.qcow2
# virsh domstats --list-active
Domain: 'vm1'
  state.state=3
  state.reason=5
  cpu.time=45910851705
  cpu.user=1270000000
  cpu.system=11380000000
  balloon.current=2097152
  balloon.maximum=2097152
......

Check terminal 1 again; it also gets a return:
# virsh blkdeviotune vm1 vda 
total_bytes_sec: 0
read_bytes_sec : 0
write_bytes_sec: 0
total_iops_sec : 0
read_iops_sec  : 0
write_iops_sec : 0
total_bytes_sec_max: 0
read_bytes_sec_max: 0
write_bytes_sec_max: 0
total_iops_sec_max: 0
read_iops_sec_max: 0
write_iops_sec_max: 0
size_iops_sec  : 0
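
(To confirm that domstats keeps returning while the storage stays blocked, it can be run in a timestamped loop; a purely illustrative sketch:)

# while true; do date; virsh domstats --list-active > /dev/null && echo "domstats returned"; sleep 30; done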


6. Check that virsh list and other operations work well.
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     vm1                            paused
 -     vm2                            shut off

As shown above, domstats returns, does not hang, and does not block other commands. Moving to verified.

Comment 7 Pei Zhang 2016-05-30 07:19:48 UTC
Hi Francesco,
I was wondering if you could help verify this bug on RHEV, so we can make sure that this issue is fixed in both libvirt and RHEV.
Thanks a lot in advance.

Comment 8 Francesco Romani 2016-05-30 07:27:52 UTC
(In reply to Pei Zhang from comment #7)
> Hi Francesco,
> I was wondering if you could help verify this bug on RHEV, so we can make
> sure that this issue is fixed in both libvirt and RHEV.
> Thanks a lot in advance.

Hi Pei,

Sure thing, I'll add my own independent verification in the same environment described in https://bugzilla.redhat.com/show_bug.cgi?id=1337073#c0

Comment 9 Pei Zhang 2016-06-02 08:14:22 UTC
(In reply to Pei Zhang from comment #6)
> Verified version:
> libvirt-1.2.17-13.el7_2.5.x86_64
> qemu-kvm-rhev-2.1.2-23.el7_1.12.x86_64
Updating, since I pasted the wrong qemu-kvm-rhev version in comment 6.

Verified version:
libvirt-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64

> [verification steps unchanged from comment 6]

Comment 10 Elad 2016-06-05 08:42:48 UTC
Hi Francesco,

We would like to re-verify this bug over RHEV. Would you be able to provide us the steps to reproduce using RHEV?

Thanks

Comment 11 Francesco Romani 2016-06-06 07:25:33 UTC
(In reply to Elad from comment #10)
> Hi Francesco,
> 
> We would like to re-verify this bug over RHEV. Would you be able to provide
> us the steps to reproduce using RHEV?
> 
> Thanks

Hi Elad,

Here's the scenario I'm going to run on RHEV as soon as I can carve out some time:

1. prepare a RHEV setup: one Engine host, one virtualization host, one storage host (so three different hosts).
2. make sure the storage is set as shared (the default) over NFS.
3. provision and run one (or more) VM(s); make sure each VM has at least one disk on NFS.
4. kill the storage, either with iptables or physically (shutdown, disconnect).
5. wait a random amount of time; I recommend 2+ hours for a good chance of recreating the conditions.
6. verify that the Vdsm thread count is NOT growing unbounded but stays constant (see the sketch after this list).

Note: in the scenario that highlighted the bug, the Vdsm thread count was growing into the hundreds over time. We are, of course, taking corrective action at the Vdsm level to prevent this growth/leak.
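
(A sketch of how the Vdsm thread count could be watched during step 6; the "vdsmd" process match and the one-minute interval are assumptions, as the exact daemon name may differ per setup.)

# while true; do date; ps -o nlwp= -p "$(pgrep -f vdsmd | head -1)"; sleep 60; done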

Comment 14 Elad 2016-06-09 12:02:21 UTC
Following comment #11 and comment #13, should this be tested again?

Comment 17 errata-xmlrpc 2016-06-23 06:12:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1290