Bug 1339963
Summary: | virDomainGetControlInfo hangs after random time with unresponsive storage | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Marcel Kolaja <mkolaja>
Component: | libvirt | Assignee: | Peter Krempa <pkrempa>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 7.2 | CC: | dyuan, ebenahar, fromani, jherrman, jsuchane, michal.skrivanek, mzamazal, pkrempa, pzhang, rbalakri, snagar, xuzhang, yisun
Target Milestone: | rc | Keywords: | ZStream
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | libvirt-1.2.17-13.el7_2.5 | Doc Type: | Bug Fix
Doc Text: | When the libvirt service attempted to access a file on a blocked or unreachable NFS storage device used by a guest virtual machine, the libvirt APIs operating on that guest became unresponsive. With this update, if the guest is online, libvirt collects the data from the guest's monitor and does not access its NFS storage. As a result, the described problem occurs significantly less frequently. | |
Story Points: | --- | |
Clone Of: | 1337073 | Environment: |
Last Closed: | 2016-06-23 06:12:34 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1337073 | |
Bug Blocks: | | |
Description
Marcel Kolaja
2016-05-26 09:17:40 UTC
Verified version:
libvirt-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.12.x86_64

Verified steps:

1. Prepare an NFS server:
# mount | grep nfs
$IP:/mnt/img on /tmp/zp type nfs4 (rw,relatime,vers=4.0,soft,proto=tcp,......)

2. Start a guest with its image on the NFS storage:
# virsh list
 Id    Name                           State
----------------------------------------------------
 3     vm1                            running

# virsh domblklist vm1
Target     Source
------------------------------------------------
hdc        -
vda        /tmp/zp/r72.qcow2

Check that the guest is running well:
# virsh blkdeviotune vm1 vda
total_bytes_sec: 0
read_bytes_sec : 0
write_bytes_sec: 0
total_iops_sec : 0
read_iops_sec  : 0
write_iops_sec : 0
total_bytes_sec_max: 0
read_bytes_sec_max: 0
write_bytes_sec_max: 0
total_iops_sec_max: 0
read_iops_sec_max: 0
write_iops_sec_max: 0
size_iops_sec  : 0

3. Disconnect the NFS server:
# iptables -A OUTPUT -d $IP -p tcp --dport 2049 -j DROP

4. In terminal 1, check I/O throttling using blkdeviotune:
# virsh blkdeviotune vm1 vda
...... It will hang for a few minutes at the beginning.

5. In terminal 2, check domstats for active guests. It takes a while to return here, but it does not hang:
# virsh domstats --block --list-active
Domain: 'vm1'
block.count=2
block.0.name=hdc
block.1.name=vda
block.1.path=/tmp/zp/r72.qcow2

# virsh domstats --list-active
Domain: 'vm1'
state.state=3
state.reason=5
cpu.time=45910851705
cpu.user=1270000000
cpu.system=11380000000
balloon.current=2097152
balloon.maximum=2097152
......

Check terminal 1 again; it also returns:
# virsh blkdeviotune vm1 vda
total_bytes_sec: 0
read_bytes_sec : 0
write_bytes_sec: 0
total_iops_sec : 0
read_iops_sec  : 0
write_iops_sec : 0
total_bytes_sec_max: 0
read_bytes_sec_max: 0
write_bytes_sec_max: 0
total_iops_sec_max: 0
read_iops_sec_max: 0
write_iops_sec_max: 0
size_iops_sec  : 0

6. Check that virsh list and other operations work well:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     vm1                            paused
 -     vm2                            shut off

As above, domstats does return; it does not hang and it does not block other commands. Move to verified.

Hi Francesco,
I was wondering if you could help verify this bug on RHEV. Then we can make sure that this issue was fixed both on libvirt and RHEV.
Thanks a lot in advance.

(In reply to Pei Zhang from comment #7)
> Hi Francesco,
> I was wondering if you could help verify this bug on RHEV. Then we can make
> sure that this issue was fixed both on libvirt and RHEV.
> Thanks a lot in advance.

Hi Pei,
Sure thing, I'll add my own independent verification in the same environment described in https://bugzilla.redhat.com/show_bug.cgi?id=1337073#c0

(In reply to Pei Zhang from comment #6)
> Verified version:
> libvirt-1.2.17-13.el7_2.5.x86_64
> qemu-kvm-rhev-2.1.2-23.el7_1.12.x86_64

Update, since the wrong version was pasted. Verified version:
libvirt-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
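For reference, the same checks can also be driven through the libvirt Python bindings rather than virsh. The snippet below is only a minimal sketch of steps 4 and 5 of the verification above, assuming the qemu:///system URI and the 'vm1'/'vda' names used in those steps; it is not part of the recorded QA procedure.

```python
#!/usr/bin/env python
# Minimal sketch of the checks from the verification steps above, using the
# libvirt Python bindings. The connection URI and the 'vm1'/'vda' names are
# assumptions carried over from those steps.
import libvirt

conn = libvirt.open('qemu:///system')

# Step 5 equivalent: bulk stats are collected from the guest's monitor, so
# this call is expected to return even while the NFS backend is blocked.
flags = libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_ACTIVE
wanted = libvirt.VIR_DOMAIN_STATS_STATE | libvirt.VIR_DOMAIN_STATS_BLOCK
for dom, record in conn.getAllDomainStats(wanted, flags):
    print(dom.name(), record.get('state.state'), record.get('block.count'))

# virDomainGetControlInfo (the API named in the summary) reports whether the
# domain's control interface is busy with another job.
dom = conn.lookupByName('vm1')
state, details, state_time = dom.controlInfo()
print('control state:', state, 'details:', details, 'for', state_time, 'ms')

# Step 4 equivalent: on the affected builds this getter could hang for
# minutes while the image on the unreachable NFS mount cannot be accessed.
print(dom.blockIoTune('vda', 0))

conn.close()
```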
Hi Francesco,

We would like to re-verify this bug over RHEV. Would you be able to provide us the steps to reproduce using RHEV?

Thanks

(In reply to Elad from comment #10)
> Hi Francesco,
>
> We would like to re-verify this bug over RHEV. Would you be able to provide
> us the steps to reproduce using RHEV?
>
> Thanks

Hi Elad,
Here's the scenario I'm going to run on RHEV as soon as I can carve out some time:

1. Prepare a RHEV setup: one Engine host, one virtualization host, one storage host (so three different hosts).
2. Make sure the storage is set as shared (the default) over NFS.
3. Provision and run one (or more) VM(s); make sure the VM has 1+ disks on the NFS storage.
4. Kill the storage, either with iptables or physically (shutdown, disconnect).
5. Wait a random amount of time; I recommend 2+ hours to get a good chance of recreating the conditions.
6. Verify that the Vdsm thread count is NOT growing unbounded, but stays constant (a minimal thread-count check is sketched at the end of this report).
7. In the scenario which highlighted the bug, the Vdsm thread count was growing over time into the hundreds. We are of course taking corrective action at the Vdsm level to prevent this growth/leak.

Following comment #11 and comment #13, should it be tested again?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1290
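Regarding step 6 of the RHEV scenario in comment #11 above: a minimal sketch of a thread-count check, assuming the Vdsm PID is found separately (for example with pgrep -f vdsm) and passed on the command line; the 60-second sampling interval is an arbitrary choice, not something specified in the comments.

```python
#!/usr/bin/env python
# Minimal sketch: periodically sample a process's thread count and confirm
# it stays roughly constant instead of growing unbounded. The PID (e.g. of
# the Vdsm main process) is passed as the first argument; the interval is
# an arbitrary assumption.
import sys
import time

def thread_count(pid):
    # /proc/<pid>/status carries a "Threads:" line with the current count.
    with open('/proc/%s/status' % pid) as f:
        for line in f:
            if line.startswith('Threads:'):
                return int(line.split()[1])
    raise RuntimeError('no Threads: line for pid %s' % pid)

if __name__ == '__main__':
    pid = sys.argv[1]
    while True:
        print(time.strftime('%H:%M:%S'), thread_count(pid))
        time.sleep(60)
```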