Bug 1337073
Summary:            virDomainGetControlInfo hangs after random time with unresponsive storage
Product:            Red Hat Enterprise Linux 7
Component:          libvirt
Version:            7.2
Hardware:           Unspecified
OS:                 Unspecified
Status:             CLOSED ERRATA
Severity:           high
Priority:           high
Reporter:           Francesco Romani <fromani>
Assignee:           Peter Krempa <pkrempa>
CC:                 acanan, dyuan, ebenahar, fromani, gklein, jherrman, jsuchane, michal.skrivanek, mzamazal, pkrempa, pzhang, rbalakri, snagar, yisun, ztehypervisor
Target Milestone:   rc
Keywords:           ZStream
Fixed In Version:   libvirt-1.3.5-1.el7
Doc Type:           Bug Fix
Doc Text:
    When the libvirt service attempted to access a file on a blocked or unreachable NFS storage device used by a guest virtual machine, APIs operating on that guest became unresponsive. With this update, if the guest is online, libvirt collects data from the guest's monitor and does not access its NFS storage. As a result, the described problem occurs significantly less frequently.
Clone:              1339963 (view as bug list)
Bug Blocks:         1339963
Type:               Bug
Last Closed:        2016-11-03 18:45:17 UTC
Description
Francesco Romani
2016-05-18 08:46:03 UTC
Created attachment 1158650 [details]
vdsm + libvirtd debug logs
virDomainGetControlInfo stops responding around the 8:01:45 mark.
Please note the libvirtd logs are 2 hours behind (look around 6:01:45).
Created attachment 1158651 [details]
Example domain
Created attachment 1158652 [details]
libvirtd configuration
Created attachment 1158653 [details]
libvirtd qemu configuration
So everything works until virConnectGetAllDomainStats is called on the domain:

    2016-05-18 06:01:15.922+0000: 3070: debug : virThreadJobSet:96 : Thread 3070 (virNetServerHandleJob) is now running job remoteDispatchConnectGetAllDomainStats
    2016-05-18 06:01:15.922+0000: 3070: debug : virConnectGetAllDomainStats:11489 : conn=0x7f5a5c000ec0, stats=0x0, retStats=0x7f5a8124fad0, flags=0x0

The API wants to start a job...

    2016-05-18 06:01:15.922+0000: 3070: debug : qemuDomainObjBeginJobInternal:2097 : Starting job: query (vm=0x7f5a64009bf0 name=a1, current job=query async=none)
    2016-05-18 06:01:15.922+0000: 3070: debug : qemuDomainObjBeginJobInternal:2120 : Waiting for job (vm=0x7f5a64009bf0 name=a1)

... and keeps waiting for it. In the meantime, other APIs (which don't require a job) keep working just fine:

    2016-05-18 06:01:45.920+0000: 3069: debug : virThreadJobSet:96 : Thread 3069 (virNetServerHandleJob) is now running job remoteDispatchDomainGetVcpus
    2016-05-18 06:01:45.920+0000: 3069: debug : virDomainGetVcpus:7733 : dom=0x7f5a70004070, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a70002140, maxinfo=2, cpumaps=0x7f5a70002200, maplen=1
    ...
    2016-05-18 06:01:45.920+0000: 3069: debug : virDomainDispose:313 : release domain 0x7f5a70004070 a1 048f8624-03fc-4729-8f4d-12cb4387f018
    2016-05-18 06:01:45.920+0000: 3069: debug : virThreadJobClear:121 : Thread 3069 (virNetServerHandleJob) finished job remoteDispatchDomainGetVcpus with ret=0

But when virConnectGetAllDomainStats times out on acquiring a job,

    2016-05-18 06:01:45.922+0000: 3070: warning : qemuDomainObjBeginJobInternal:2180 : Cannot start job (query, none) for domain a1; current job is (query, none) owned by (3066 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (30s, 0s)
    2016-05-18 06:01:45.922+0000: 3070: error : qemuDomainObjBeginJobInternal:2192 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
    2016-05-18 06:01:45.922+0000: 3070: debug : virCgroupGetValueStr:814 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d1\x2da1.scope/cpuacct.usage
    2016-05-18 06:01:45.922+0000: 3070: debug : virCgroupGetValueStr:814 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d1\x2da1.scope/cpuacct.stat
    2016-05-18 06:01:45.922+0000: 3070: debug : qemuGetProcessInfo:1486 : Got status for 3500/3505 user=1507 sys=209 cpu=1 rss=531128
    2016-05-18 06:01:45.922+0000: 3070: debug : qemuGetProcessInfo:1486 : Got status for 3500/3506 user=1150 sys=55 cpu=0 rss=531128

the domain object stays locked, and all APIs will be blocked trying to lock the domain...

    2016-05-18 06:01:45.939+0000: 3068: debug : virThreadJobSet:96 : Thread 3068 (virNetServerHandleJob) is now running job remoteDispatchDomainGetControlInfo
    2016-05-18 06:01:45.939+0000: 3068: debug : virDomainGetControlInfo:2526 : dom=0x7f5a6c002b50, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a79a50ad0, flags=0
    ...
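The deadlock pattern visible in these logs — a 30-second job-acquisition timeout, after which the stats code still held the domain object lock while reading files on dead storage — can be sketched with a toy model. All names below are illustrative; this is not libvirt's real code (which is C, in qemuDomainObjBeginJobInternal):

```python
import threading
import time

class ToyDomain:
    """Toy model of libvirt's per-domain 'job' slot. The Condition's
    internal mutex stands in for the domain object lock."""

    def __init__(self, job_timeout=0.2):   # libvirt waits 30s
        self.cond = threading.Condition()
        self.job_owner = None              # which API currently owns the job
        self.job_timeout = job_timeout

    def begin_job(self, owner):
        """Acquire the job slot, waiting up to job_timeout; on failure this
        mirrors the 'cannot acquire state change lock' error above."""
        with self.cond:
            deadline = time.monotonic() + self.job_timeout
            while self.job_owner is not None:
                remaining = deadline - time.monotonic()
                if remaining <= 0 or not self.cond.wait(remaining):
                    raise TimeoutError(
                        "cannot acquire state change lock (held by %s)"
                        % self.job_owner)
            self.job_owner = owner

    def end_job(self):
        with self.cond:
            self.job_owner = None
            self.cond.notify_all()

dom = ToyDomain()
# An API grabs the job, then blocks forever on dead NFS (never calls end_job).
dom.begin_job("remoteDispatchDomainGetBlockIoTune")

errors = []
def stats_worker():
    try:
        dom.begin_job("remoteDispatchConnectGetAllDomainStats")
    except TimeoutError as exc:
        errors.append(str(exc))

t = threading.Thread(target=stats_worker)
t.start()
t.join()
print(errors[0])
# -> cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
```

In the real bug, after this timeout the stats code did not give up: while still holding the domain object lock, it fell back to reading the image files on the blocked share, so every later API that needed the lock queued up behind it.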
Fixed upstream:

commit 71d2c172edb997bae1e883b2e1bafa97d9f953a1
Author: Peter Krempa <pkrempa>
Date:   Wed May 18 14:58:25 2016 +0200

    qemu: bulk stats: Don't access possibly blocked storage

    If the stats for a block device can't be acquired from qemu, we've fallen back to loading them from the file on the disk in libvirt. If qemu is not cooperating due to being stuck on an inaccessible NFS share, we would then attempt to read the files and get stuck too, with the VM object locked. All other APIs would eventually get stuck waiting on the VM lock. Avoid this problem by skipping the block stats if the VM is online but the monitor did not provide any stats.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1337073

commit 3aa5d51a9530a8737ca584b393c29297dd9bbc37
Author: Peter Krempa <pkrempa>
Date:   Wed May 18 14:40:10 2016 +0200

    qemu: driver: Separate bulk stats worker for block devices

    Extract the fallback path that reloads the stats from disk into a separate function.

commit 5d2b0e6f12b4e57d75ed1047ab1c36443b7a54b3
Author: Peter Krempa <pkrempa>
Date:   Wed May 18 14:17:07 2016 +0200

    qemu: driver: Remove unnecessary flag in qemuDomainGetStatsBlock

    'abbreviated' was true if 'stats' were NULL

Try to reproduce it via libvirt.

Version:
libvirt-1.3.4-1.el7.x86_64
qemu-kvm-rhev-2.6.0-2.el7.x86_64

Steps:
1. Start a guest with its image on NFS storage:

# virsh list
 Id    Name                           State
----------------------------------------------------
 2     r72                            running

2. Disconnect from the NFS storage server:

# iptables -A OUTPUT -d $IP -p tcp --dport 2049 -j DROP

I want to know when it will hang on virDomainGetControlInfo(), so I tried this:

#!/bin/bash
while :
do
    virsh domcontrol r72
    sleep 30
done

3. In terminal 1, check I/O throttling of the image:

# virsh domblklist r72
 Target     Source
------------------------------------------------
 hda        -
 vda        /tmp/zp/r7.2.img

# virsh blkdeviotune r72 vda
......hangs here

Log as follows:

    ......
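The logic of the first commit can be condensed into a short sketch. This is a simplification under our own naming — libvirt's real implementation is C code inside qemuDomainGetStatsBlock, not this function:

```python
import os

def gather_block_stats(vm_is_active, monitor_stats, disk_path):
    """Hypothetical condensation of the fix in commit 71d2c17.
    monitor_stats is whatever qemu's monitor reported, or None."""
    if monitor_stats is not None:
        # qemu answered: use its numbers, never touch the file.
        return monitor_stats
    if vm_is_active:
        # VM is running but the monitor gave nothing: qemu may be stuck on
        # blocked storage. Reading the image file here is exactly what used
        # to deadlock the daemon -- the fix is to skip the fallback.
        return None
    # VM is offline: qemu can't answer and nothing else holds the disk,
    # so reading sizes straight from the image file is safe.
    st = os.stat(disk_path)
    return {"capacity": st.st_size, "physical": st.st_blocks * 512}

# Online VM, monitor silent: the file path is never touched (the fix).
print(gather_block_stats(vm_is_active=True, monitor_stats=None,
                         disk_path="/does/not/matter.img"))
# -> None
```

This matches the observed change in behavior after the fix: `virsh domstats` returns less data for an active VM on dead storage instead of hanging.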
    2016-05-24 01:50:16.712+0000: 2512: debug : qemuDomainObjBeginJobInternal:2097 : Starting job: query (vm=0x7fe7501f2d40 name=r72, current job=query async=none)
    2016-05-24 01:50:16.712+0000: 2512: debug : qemuDomainObjBeginJobInternal:2120 : Waiting for job (vm=0x7fe7501f2d40 name=r72)
    2016-05-24 01:50:17.666+0000: 1309: info : virObjectRef:296 : OBJECT_REF: obj=0x7fe7d0a34300
    ......

4. Change to terminal 2 and check other info; it works well:

# virsh list
 Id    Name                           State
----------------------------------------------------
 2     r72                            running

# virsh schedinfo r72
Scheduler      : posix
cpu_shares     : 1024
vcpu_period    : 100000
vcpu_quota     : -1
emulator_period: 100000
emulator_quota : -1
global_period  : 100000
global_quota   : -1

5. Change to terminal 3 and check block stat info; it hangs for a few seconds, then reports an error:

# virsh domblkstat r72
(hangs a few seconds)
error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

Log as follows:

    ......
    2016-05-24 01:50:46.698+0000: 1309: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7fe7d0a47970
    2016-05-24 01:50:46.712+0000: 2512: warning : qemuDomainObjBeginJobInternal:2180 : Cannot start job (query, none) for domain r72; current job is (query, none) owned by (2513 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (41s, 0s)
    2016-05-24 01:50:46.712+0000: 2512: error : qemuDomainObjBeginJobInternal:2192 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
    ......
    2016-05-24 01:45:32.325+0000: 2514: debug : virThreadJobSet:96 : Thread 2514 (virNetServerHandleJob) is now running job remoteDispatchDomainGetInfo
    2016-05-24 01:45:32.325+0000: 2514: info : virObjectNew:202 : OBJECT_NEW: obj=0x7fe7a403d340 classname=virDomain
    ......
    2016-05-24 01:45:32.325+0000: 2514: debug : virThreadJobClear:121 : Thread 2514 (virNetServerHandleJob) finished job remoteDispatchDomainGetInfo with ret=0
    ......

6.
It seems that it hangs in terminal 1:

# virsh blkdeviotune r72 vda
...... hangs here, nothing output. But virsh list and other info checks, except block-related ones, work well.

7. If I Ctrl+C this command and retry it:

# virsh blkdeviotune r72 vda
^C
# virsh blkdeviotune r72 vda
error: Unable to get number of block I/O throttle parameters
error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

As above, can these steps be used to reproduce this issue? It hangs at step 6 and step 7.

(In reply to Pei Zhang from comment #8)
> Try to reproduce it via libvirt.
[...]
> 2. Disconnect from the NFS storage server:
> # iptables -A OUTPUT -d $IP -p tcp --dport 2049 -j DROP
[...]
> 3. In terminal 1, check I/O throttling of the image:
> # virsh domblklist r72
>  Target     Source
> ------------------------------------------------
>  hda        -
>  vda        /tmp/zp/r7.2.img
>
> # virsh blkdeviotune r72 vda
[...]
> 5. Change to terminal 3 and check block stat info; it hangs for a few
> seconds, then reports an error:
> # virsh domblkstat r72
> (hangs a few seconds)
> error: Timed out during operation: cannot acquire state change lock (held by
> remoteDispatchDomainGetBlockIoTune)

You have to call 'virsh domstats' here and request all stats, or at least "--block", at the point where the above API is returning failures. That should block the daemon. Listing guests and/or operations on the single guest should then block too.

Thanks a lot for Peter's info. Reproduced it as follows:

Version:
libvirt-1.3.4-1.el7.x86_64

Steps:
1. Start a guest with its image on NFS storage:

# virsh list
 Id    Name                           State
----------------------------------------------------
 20    r72                            running

2. Disconnect from the NFS server:

# iptables -A OUTPUT -d $IP -p tcp --dport 2049 -j DROP

3. In terminal 1, execute 'blkdeviotune':

# virsh blkdeviotune r72 vda

4. In terminal 2, try 'domstats' to get statistics for all domains:

# virsh domstats

5.
After a few seconds, in terminal 1, we get an error:

# virsh blkdeviotune r72 vda
error: Unable to get number of block I/O throttle parameters
error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

6. Check terminal 2; virsh domstats still hangs:

# virsh domstats
......hangs here

7. In terminal 3, check virsh list; it also hangs:

# virsh list
...... hangs here.

Actual results:
As in steps 6 and 7, virsh domstats hangs and listing guests also hangs.

Expected results:
It should return less data for active VMs and not hang.

Hi, I have got a problem with the same error, reported when starting a VM:

2016-05-30 08:46:30.437+0000: 5404: warning : qemuDomainObjBeginJobInternal:1572 : Cannot start job (query, none) for domain instance-00000039; current job is (modify, none) owned by (5405 remoteDispatchDomainCreateWithFlags, 0 <null>) for (598s, 0s)
2016-05-30 08:46:30.437+0000: 5404: error : qemuDomainObjBeginJobInternal:1584 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainCreateWithFlags)
2016-05-30 08:47:00.461+0000: 5402: warning : qemuDomainObjBeginJobInternal:1572 : Cannot start job (query, none) for domain instance-00000039; current job is (modify, none) owned by (5405 remoteDispatchDomainCreateWithFlags, 0 <null>) for (628s, 0s)
2016-05-30 08:47:00.461+0000: 5402: error : qemuDomainObjBeginJobInternal:1584 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainCreateWithFlags)

I use Ceph as backing storage for the VM, and Ceph had some unexpected problems earlier. I rebooted the host after the Ceph problem was resolved, but then "virsh list" showed the VM in the paused state. In libvirtd.log there are lots of messages like:

Cannot start job (query, none) for domain instance-00000039; current job is (modify, none) owned by (5405 remoteDispatchDomainCreateWithFlags, 0 <null>) for (658s, 0s)

It looks like it is caused by the Ceph storage, but I have no error log to support this. Does anyone have ideas on this problem?

libvirt version:
# virsh version
Compiled against library: libvirt 1.2.21
Using library: libvirt 1.2.21
Using API: QEMU 1.2.21
Running hypervisor: QEMU 2.5.0

Verified versions:
libvirt-2.0.0-6.el7.x86_64
qemu-kvm-rhev-2.6.0-22.el7.x86_64

Steps:
1. Prepare NFS storage, mount it, then start a guest whose image is located on the NFS storage:

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 11    vm1                            running

# virsh domblklist vm1
 Target     Source
------------------------------------------------
 hdc        -
 vda        /tmp/zp/beaker/r73q2-1.img

Check that the guest is working well.

2. Disconnect the NFS server:

# iptables -A OUTPUT -d $IP -p tcp --dport 2049 -j DROP

3. In terminal 1, check I/O throttling using blkdeviotune:

# virsh blkdeviotune vm1 vda
...... It hangs for a few minutes here.

4. In terminal 2, check domstats for the running guest. It also needs a few minutes to return, but it does not hang forever:

# virsh domstats
Domain: 'vm1'
  state.state=1
  state.reason=1
  cpu.time=30014234910
  cpu.user=1760000000
  cpu.system=8750000000
  balloon.current=1048576
  balloon.maximum=1048576
  vcpu.current=1
  vcpu.maximum=8
  vcpu.0.state=1
  vcpu.0.time=26690000000
  vcpu.0.wait=0
  net.count=0
  block.count=2
  block.0.name=hdc
  block.1.name=vda
  block.1.path=/tmp/zp/beaker/r73q2-1.img

Check terminal 1 again; blkdeviotune also returns:

# virsh blkdeviotune vm1 vda
total_bytes_sec: 0
read_bytes_sec : 0
write_bytes_sec: 0
total_iops_sec : 0
read_iops_sec  : 0
write_iops_sec : 0
total_bytes_sec_max: 0
read_bytes_sec_max: 0
write_bytes_sec_max: 0
total_iops_sec_max: 0
read_iops_sec_max: 0
write_iops_sec_max: 0
size_iops_sec  : 0

5. Check virsh list; it also does not hang:

# virsh list
 Id    Name                           State
----------------------------------------------------
 11    vm1                            running

As above, domstats returns; it does not hang and does not block other commands.
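When scripting liveness probes like the 'virsh domcontrol' loop earlier in this bug, it helps to bound every call with a hard timeout, so a libvirtd stuck on storage cannot also wedge the monitoring script. A minimal sketch (a hypothetical helper of ours, not part of libvirt or vdsm):

```python
import subprocess

def probe(cmd, timeout=10.0):
    """Run a probe command (e.g. ['virsh', 'domcontrol', 'r72']) with a
    hard wall-clock timeout. Returns True only on a clean zero exit;
    a hang past the timeout or a missing binary counts as failure."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False

# Example (assumes a domain named 'r72' exists; purely illustrative):
# if not probe(["virsh", "domcontrol", "r72"], timeout=15):
#     print("libvirtd unresponsive or probe failed")
```

With a guard like this, the probe loop keeps producing data points even while the daemon is wedged, instead of silently blocking at the first hung call.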
From libvirt's side, this has been fixed.

Hi Francesco, I verified this bug using the latest libvirt on RHEL 7.3. The result is as expected for libvirt. I was wondering if you could also help verify it on RHV, to make sure that the issue is fixed on RHV. Thanks a lot in advance!

Tested according to https://bugzilla.redhat.com/show_bug.cgi?id=1339963#c11:
1. Prepare a RHEV setup: one Engine host, one virtualization host, one storage host (so three different hosts).
2. Make sure storage is set as shared (default) over NFS.
3. Provision and run one (or more) VM(s); make sure the VM has 1+ disks over NFS.
4. Kill the storage, either with iptables or physically (shutdown, disconnect).
5. Wait a random amount of time; I recommend 2+ hours to get a good chance of recreating the conditions.
6. Verify that the Vdsm thread count is NOT growing unbounded but stays constant.
7. In the scenario which highlighted the bug, the Vdsm thread count grew into the hundreds over time. We are of course taking corrective action at the Vdsm level to prevent this growth/leak.

The vdsm thread pool is not growing unbounded; it stays constant after a few hours of the storage being unreachable while a running VM holds a disk on it:

[root@seal09 ~]# ps aux | grep vdsm
vdsm     19254  1.0  0.3 5564648 110240 ?  S<sl 12:58  1:27 /usr/bin/python /usr/share/vdsm/vdsm

[root@seal09 ~]# grep Threads /proc/19254/status
Threads: 55

===================
Used:
vdsm-4.18.13-1.el7ev.x86_64
libvirt-daemon-driver-nwfilter-2.0.0-8.el7.x86_64
libvirt-daemon-config-network-2.0.0-8.el7.x86_64
libvirt-daemon-driver-secret-2.0.0-8.el7.x86_64
libvirt-lock-sanlock-2.0.0-8.el7.x86_64
libvirt-daemon-2.0.0-8.el7.x86_64
libvirt-daemon-driver-qemu-2.0.0-8.el7.x86_64
libvirt-daemon-config-nwfilter-2.0.0-8.el7.x86_64
libvirt-daemon-kvm-2.0.0-8.el7.x86_64
libvirt-client-2.0.0-8.el7.x86_64
libvirt-daemon-driver-storage-2.0.0-8.el7.x86_64
libvirt-daemon-driver-interface-2.0.0-8.el7.x86_64
libvirt-2.0.0-8.el7.x86_64
libvirt-daemon-driver-nodedev-2.0.0-8.el7.x86_64
libvirt-python-2.0.0-2.el7.x86_64
libvirt-daemon-driver-network-2.0.0-8.el7.x86_64
libvirt-daemon-driver-lxc-2.0.0-8.el7.x86_64
qemu-kvm-rhev-2.6.0-22.el7.x86_64
rhevm-4.0.4-0.1.el7ev.noarch
===================

Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2577.html
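The 'grep Threads /proc/<pid>/status' check used in the verification above can be scripted so it is easy to sample periodically. A minimal Linux-only sketch (hypothetical helper, not part of vdsm):

```python
def thread_count(pid):
    """Return the Threads: field from /proc/<pid>/status -- the same
    figure the manual 'grep Threads' check inspects (Linux-only)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise ValueError("no Threads field for pid %d" % pid)

# Example: sample a daemon's thread count; a leak like the one in this
# bug shows up as monotonic growth across samples rather than a plateau.
# for _ in range(10):
#     print(thread_count(19254)); time.sleep(60)
```

Logging this value once a minute during the "kill the storage, wait 2+ hours" step gives a direct pass/fail signal: a fixed stack plateaus, a leaking one climbs into the hundreds.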