Bug 1339963 - virDomainGetControlInfo hangs after random time with unresponsive storage
Summary: virDomainGetControlInfo hangs after random time with unresponsive storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Peter Krempa
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 1337073
Blocks:
 
Reported: 2016-05-26 09:17 UTC by Marcel Kolaja
Modified: 2016-06-23 06:12 UTC
CC: 13 users

Fixed In Version: libvirt-1.2.17-13.el7_2.5
Doc Type: Bug Fix
Doc Text:
When the libvirt service attempted to access a file on blocked or unreachable NFS storage used by a guest virtual machine, libvirt API calls against that guest became unresponsive. With this update, if the guest is running, libvirt collects the data from the guest's monitor instead of accessing the NFS storage directly. As a result, the described problem occurs significantly less frequently.
Clone Of: 1337073
Environment:
Last Closed: 2016-06-23 06:12:34 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Links:
Red Hat Product Errata RHBA-2016:1290 (public, priority normal, status SHIPPED_LIVE): libvirt bug fix update, last updated 2016-07-01 18:50:44 UTC

Description Marcel Kolaja 2016-05-26 09:17:40 UTC
This bug has been copied from bug #1337073 and has been proposed
to be backported to 7.2 z-stream (EUS).
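
For context, virDomainGetControlInfo (the API named in the summary) is the call behind virsh's domcontrol command, so the reported hang can be probed from the shell. A minimal sketch, with vm1 as a placeholder guest name; on a healthy guest the command returns immediately:

# virsh domcontrol vm1
ok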

Comment 6 Pei Zhang 2016-05-27 09:04:29 UTC
Verified version:
libvirt-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.12.x86_64

Verified steps:

1. Prepare an NFS server:
# mount | grep nfs
$IP:/mnt/img on /tmp/zp type nfs4 (rw,relatime,vers=4.0,soft,proto=tcp,......)
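
For reference, a minimal sketch of how such an export and mount could be prepared; the export path /mnt/img, the mount point /tmp/zp, and the $IP placeholder follow the output above, and the exact options may differ in your setup:

On the NFS server:
# echo '/mnt/img *(rw,no_root_squash)' >> /etc/exports
# exportfs -ra

On the virtualization host (a soft NFS mount, matching the options shown above):
# mount -t nfs4 -o soft,proto=tcp $IP:/mnt/img /tmp/zp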

2. Start a guest with an image on the NFS storage.

# virsh list 
 Id    Name                           State
----------------------------------------------------
 3     vm1                            running

# virsh domblklist vm1
Target     Source
------------------------------------------------
hdc        -
vda        /tmp/zp/r72.qcow2
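
The vda disk can be confirmed to point at the NFS-backed image, for example (illustrative output; the exact XML depends on how the guest was defined):

# virsh dumpxml vm1 | grep -B 2 -A 2 'r72.qcow2'
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/zp/r72.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>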

Check that the guest is running well:
# virsh blkdeviotune vm1 vda
total_bytes_sec: 0
read_bytes_sec : 0
write_bytes_sec: 0
total_iops_sec : 0
read_iops_sec  : 0
write_iops_sec : 0
total_bytes_sec_max: 0
read_bytes_sec_max: 0
write_bytes_sec_max: 0
total_iops_sec_max: 0
read_iops_sec_max: 0
write_iops_sec_max: 0
size_iops_sec  : 0

3. Disconnect the NFS server:
# iptables -A OUTPUT -d $IP -p tcp --dport 2049 -j DROP
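
When the test is finished, connectivity can be restored by deleting the same rule, e.g.:
# iptables -D OUTPUT -d $IP -p tcp --dport 2049 -j DROP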

4. In terminal 1, check I/O throttling using blkdeviotune:
# virsh blkdeviotune vm1 vda 
...... It will hang for a few minutes at first.
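
If you prefer not to leave the terminal blocked while waiting, the call can also be bounded with a timeout, for example (a sketch; the 120-second limit is arbitrary):
# timeout 120 virsh blkdeviotune vm1 vda || echo "blkdeviotune still blocked after 120s"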

5. In terminal 2, check domstats for active guests.
It takes a while to return here, but it does not hang.

# virsh domstats --block --list-active
Domain: 'vm1'
  block.count=2
  block.0.name=hdc
  block.1.name=vda
  block.1.path=/tmp/zp/r72.qcow2
# virsh domstats --list-active
Domain: 'vm1'
  state.state=3
  state.reason=5
  cpu.time=45910851705
  cpu.user=1270000000
  cpu.system=11380000000
  balloon.current=2097152
  balloon.maximum=2097152
......

Check terminal 1 again; it also returns:
# virsh blkdeviotune vm1 vda 
total_bytes_sec: 0
read_bytes_sec : 0
write_bytes_sec: 0
total_iops_sec : 0
read_iops_sec  : 0
write_iops_sec : 0
total_bytes_sec_max: 0
read_bytes_sec_max: 0
write_bytes_sec_max: 0
total_iops_sec_max: 0
read_iops_sec_max: 0
write_iops_sec_max: 0
size_iops_sec  : 0


6. Check that virsh list and other operations work well:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     vm1                            paused
 -     vm2                            shut off
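
Note that vm1 is now shown as paused; in the domstats output of step 5, state.state=3 with state.reason=5 corresponds to a domain paused because of an I/O error, which is consistent with the storage being unreachable. The pause reason can also be queried directly, for example:

# virsh domstate vm1 --reason
paused (I/O error)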

As shown above, domstats returns; it does not hang and does not block other commands. Moving to VERIFIED.

Comment 7 Pei Zhang 2016-05-30 07:19:48 UTC
Hi Francesco,
I was wondering if you could help verify this bug on RHEV. Then we can make sure that this issue was fixed both in libvirt and in RHEV.
Thanks a lot in advance.

Comment 8 Francesco Romani 2016-05-30 07:27:52 UTC
(In reply to Pei Zhang from comment #7)
> Hi Francesco,
> I was wondering if you could help verify this bug on RHEV. Then we can make
> sure that this issue was fixed both in libvirt and in RHEV.
> Thanks a lot in advance.

Hi Pei,

Sure thing, I'll add my own independent verification in the same environment described in https://bugzilla.redhat.com/show_bug.cgi?id=1337073#c0

Comment 9 Pei Zhang 2016-06-02 08:14:22 UTC
(In reply to Pei Zhang from comment #6)
> Verified version:
> libvirt-1.2.17-13.el7_2.5.x86_64
> qemu-kvm-rhev-2.1.2-23.el7_1.12.x86_64
Updating, since I pasted the wrong version earlier.

Verified version:
libvirt-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64

> Verified steps:
> 
> 1. Prepare an NFS server:
> # mount | grep nfs
> $IP:/mnt/img on /tmp/zp type nfs4
> (rw,relatime,vers=4.0,soft,proto=tcp,......)
> 
> 2. Start a guest with an image on the NFS storage.
> 
> # virsh list 
>  Id    Name                           State
> ----------------------------------------------------
>  3     vm1                            running
> 
> # virsh domblklist vm1
> Target     Source
> ------------------------------------------------
> hdc        -
> vda        /tmp/zp/r72.qcow2
> 
> Check that the guest is running well:
> # virsh blkdeviotune vm1 vda
> total_bytes_sec: 0
> read_bytes_sec : 0
> write_bytes_sec: 0
> total_iops_sec : 0
> read_iops_sec  : 0
> write_iops_sec : 0
> total_bytes_sec_max: 0
> read_bytes_sec_max: 0
> write_bytes_sec_max: 0
> total_iops_sec_max: 0
> read_iops_sec_max: 0
> write_iops_sec_max: 0
> size_iops_sec  : 0
> 
> 3. Disconnect the NFS server:
> # iptables -A OUTPUT -d $IP -p tcp --dport 2049 -j DROP
> 
> 4. In terminal 1, check I/O throttling using blkdeviotune:
> # virsh blkdeviotune vm1 vda 
> ...... It will hang for a few minutes at first.
> 
> 5. In terminal 2, check domstats for active guests.
> It takes a while to return here, but it does not hang.
> 
> # virsh domstats --block --list-active
> Domain: 'vm1'
>   block.count=2
>   block.0.name=hdc
>   block.1.name=vda
>   block.1.path=/tmp/zp/r72.qcow2
> # virsh domstats --list-active
> Domain: 'vm1'
>   state.state=3
>   state.reason=5
>   cpu.time=45910851705
>   cpu.user=1270000000
>   cpu.system=11380000000
>   balloon.current=2097152
>   balloon.maximum=2097152
> ......
> 
> Check terminal 1 again; it also returns:
> # virsh blkdeviotune vm1 vda 
> total_bytes_sec: 0
> read_bytes_sec : 0
> write_bytes_sec: 0
> total_iops_sec : 0
> read_iops_sec  : 0
> write_iops_sec : 0
> total_bytes_sec_max: 0
> read_bytes_sec_max: 0
> write_bytes_sec_max: 0
> total_iops_sec_max: 0
> read_iops_sec_max: 0
> write_iops_sec_max: 0
> size_iops_sec  : 0
> 
> 
> 6. Check that virsh list and other operations work well:
> # virsh list --all
>  Id    Name                           State
> ----------------------------------------------------
>  3     vm1                            paused
>  -     vm2                            shut off
> 
> As shown above, domstats returns; it does not hang and does not block other
> commands. Moving to VERIFIED.

Comment 10 Elad 2016-06-05 08:42:48 UTC
Hi Francesco,

We would like to re-verify this bug on RHEV. Would you be able to provide us with the steps to reproduce it using RHEV?

Thanks

Comment 11 Francesco Romani 2016-06-06 07:25:33 UTC
(In reply to Elad from comment #10)
> Hi Francesco,
> 
> We would like to re-verify this bug on RHEV. Would you be able to provide
> us with the steps to reproduce it using RHEV?
> 
> Thanks

Hi Elad,

Here's the scenario I'm going to run on RHEV as soon as I can carve out some time:

1. Prepare a RHEV setup: one Engine host, one virtualization host, and one storage host (three different hosts).
2. Make sure the storage is set up as shared (the default) over NFS.
3. Provision and run one or more VMs; make sure each VM has at least one disk on the NFS storage.
4. Kill the storage, either with iptables or physically (shut down or disconnect it).
5. Wait a random amount of time; I recommend 2+ hours to have a good chance of recreating the conditions.
6. Verify that the Vdsm thread count does NOT grow without bound but stays constant (see the sketch after this list).
7. For reference: in the scenario that highlighted the bug, the Vdsm thread count grew over time into the hundreds. We are of course taking corrective action at the Vdsm level to prevent this growth/leak.
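
A minimal sketch for monitoring the Vdsm thread count in step 6, assuming the Vdsm main process can be located with pgrep -f vdsmd (adjust the pattern if it also matches supervdsmd on your host) and sampling once a minute:

# VDSM_PID=$(pgrep -f vdsmd | head -1)
# while true; do echo "$(date '+%F %T') threads=$(ps -o nlwp= -p "$VDSM_PID")"; sleep 60; done

The nlwp column reported by ps is the number of threads of the process, so the logged value should stay roughly constant rather than climbing into the hundreds.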

Comment 14 Elad 2016-06-09 12:02:21 UTC
Following comment #11 and comment #13, should it be tested again?

Comment 17 errata-xmlrpc 2016-06-23 06:12:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1290

