Bug 853839
Summary: | Try to make debugging of "cannot acquire state change lock" easier | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Alex Jia <ajia> | ||||
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | ||||
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 7.0 | CC: | cwei, dyuan, jdenemar, mzhan, rbalakri, snagar, zebing86 | ||||
Target Milestone: | rc | Keywords: | Reopened | ||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | libvirt-1.2.14-1.el7 | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2015-11-19 05:35:44 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1133060, 1205796 | ||||||
Attachments: |
|
Description
Alex Jia
2012-09-03 05:57:28 UTC
Created attachment 609255 [details]
libvirtd.log
Also saw the following debugger information(see 'warning' section): 2012-09-03 06:03:09.217+0000: 23307: debug : virDomainShutdownFlags:3217 : dom=0x7ff59c02f750, (VM: name=myRHEL6, uuid=1f2e5ace-b985-b3cc-1c82-31613b771024) 2012-09-03 06:03:39.000+0000: 23307: warning : qemuDomainObjBeginJobInternal:838 : Cannot start job (modify, none) for domain myRHEL6; current job is (modify, none) owned by (23304, 0) 2012-09-03 06:03:39.000+0000: 23307: error : qemuDomainObjBeginJobInternal:842 : Timed out during operation: cannot acquire state change lock This is caused by "virsh reboot --mode agent" command being hung. It is still there, owns the job and thus mostly nothing can be done with the same domain. I'm closing this as a dup of the reboot bug. *** This bug has been marked as a duplicate of bug 853821 *** (In reply to comment #2) > 2012-09-03 06:03:39.000+0000: 23307: warning : > qemuDomainObjBeginJobInternal:838 : Cannot start job (modify, none) for > domain myRHEL6; current job is (modify, none) owned by (23304, 0) > > 2012-09-03 06:03:39.000+0000: 23307: error : > qemuDomainObjBeginJobInternal:842 : Timed out during operation: cannot > acquire state change lock Hi Jiri, I think it's very hard to find a root reason in libvirtd.log for users, the above warning or error message can't tell users the issue is introduced by failed reboot operation/job, maybe, we should improve debug info. Thanks, Alex (In reply to comment #5) > I think it's very hard to find a root reason in libvirtd.log for users, the > above warning or error message can't tell users the issue is introduced by > failed reboot operation/job, maybe, we should improve debug info. I agree with Alex's point here; the failure to acquire state change lock errors can be very difficult to diagnose and result in admins simply blindly restarting libvirt which is frustrating to them. IMO we should be recording sufficient information to provide a useful message to the user to allow them to troubleshoot, or at least understand, the problem. I'm going to reopen this BZ to see if there is anything that can be done to improve the log output. I'm not sure we can do anything more than we currently do. "Cannot acquire state change lock" errors happen when libvirtd thinks another operation (libvirt calls it "job" internally) is being performed on the same domain and the running job is forbidden to be executed concurrently with the job that was requested to start (and failed with the above error). This may happen when there really is another job active (which happened in this BZ) or when there is a bug in libvirt and the last API finished without clearing the job. And the situation becomes even more complicated when you consider migrations, which are active across multiple APIs, that is libvirtd may not be performing any API on a domain while still having a job active on that domain. Libvirt already tries to make debugging such situations easier by remember the ID of the last thread that successfully acquired a job; see the warning Alex quoted in comment 2: 2012-09-03 06:03:39.000+0000: 23307: warning : qemuDomainObjBeginJobInternal:838 : Cannot start job (modify, none) for domain myRHEL6; current job is (modify, none) owned by (23304, 0) After seeing this, we can look at debug logs (if they are available) and identify where the job was set and why it is still active. The only thing (I can think of) we could do to make this even more easier would be to store a complete backtrace at the time a job is set in addition to the thread ID. However, since we would need to do so for each job, I'm afraid it would be accompanied with a noticeable performance impact. Actually, we could probably store function name in a thread local variable when entering each API and use it when entering a job to remember on behalf of what API each job was started. It would be more complicated because not all jobs are started on behalf of public APIs. Ok, the message: 2012-09-03 06:03:39.000+0000: 23307: warning : qemuDomainObjBeginJobInternal:838 : Cannot start job (modify, none) for domain myRHEL6; current job is (modify, none) owned by (23304, 0) is very much along the lines of what I was envisioning. It still needs some work, though, because there's no way for a user to go from (modify, none) to my reboot operation is not responding. Relatedly, we also need to make sure that job control is implemented for these operations so that a user can get control back. centos6 amd64 and all packages lastest bash> rpm -q libvirt qemu-kvm-rhev seabios kernel libvirt-0.9.10-21.el6_3.6.x86_64 package qemu-kvm-rhev is not installed seabios-0.6.1.2-19.el6.x86_64 kernel-2.6.32-279.11.1.el6.x86_64 kernel-2.6.32-279.14.1.el6.x86_64 bash> tail /var/log/libvirt/libvirtd.log 2012-11-26 02:37:17.043+0000: 2037: info : libvirt version: 0.9.10, package: 21.el6_3.6 (CentOS BuildSystem <http://bugs.centos.org>, 2012-11-22-09:45:30, c6b8.bsys.dev.centos.org) 2012-11-26 02:37:17.043+0000: 2037: error : virCommandWait:2304 : internal error Child process (/sbin/tc qdisc del dev virbr0 root) status unexpected: exit status 2 2012-11-26 02:37:17.106+0000: 2037: error : virCommandWait:2304 : internal error Child process (/sbin/tc qdisc del dev virbr0 ingress) status unexpected: exit status 2 2012-11-26 06:25:18.528+0000: 2036: error : qemuMonitorIO:574 : internal error End of file from monitor 2012-11-26 10:41:05.453+0000: 2036: error : qemuMonitorIO:574 : internal error End of file from monitor 2012-11-27 01:09:32.001+0000: 2059: info : libvirt version: 0.9.10, package: 21.el6_3.6 (CentOS BuildSystem <http://bugs.centos.org>, 2012-11-22-09:45:30, c6b8.bsys.dev.centos.org) 2012-11-27 01:09:32.001+0000: 2059: error : qemuDomainObjBeginJobInternal:777 : Timed out during operation: cannot acquire state change lock 2012-11-27 01:09:45.738+0000: 2057: error : virNetSocketReadWire:999 : End of file while reading data: Input/output error 2012-11-27 01:11:00.000+0000: 2060: error : qemuDomainObjBeginJobInternal:777 : Timed out during operation: cannot acquire state change lock 2012-11-27 01:11:24.000+0000: 2058: error : qemuDomainObjBeginJobInternal:777 : Timed out during operation: cannot acquire state change lock sorry for wrong place if centos6, or show me the rignt one. similar proble while start vm(winxp) and hope help bug-shooting. --- Error starting domain: Timed out during operation: cannot acquire state change lock Traceback (most recent call last): File "/usr/share/virt-manager/virtManager/asyncjob.py", line 44, in cb_wrapper callback(asyncjob, *args, **kwargs) File "/usr/share/virt-manager/virtManager/asyncjob.py", line 65, in tmpcb callback(*args, **kwargs) File "/usr/share/virt-manager/virtManager/domain.py", line 1063, in startup self._backend.create() File "/usr/lib64/python2.6/site-packages/libvirt.py", line 619, in create if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self) libvirtError: Timed out during operation: cannot acquire state change lock (In reply to comment #11) > sorry for wrong place if centos6, or show me the rignt one. > > similar proble while start vm(winxp) and hope help bug-shooting. This BZ is primarily now about making the log messages better in the case in which the state change lock cannot be obtained. Your best bet for CentOS problems is probably the upstream libvirt mailing list. This bug was not selected to be addressed in Red Hat Enterprise Linux 6. We will look at it again within the Red Hat Enterprise Linux 7 product. Patches sent upstream for review: https://www.redhat.com/archives/libvir-list/2015-March/msg01137.html The patches are now pushed upstream as v1.2.13-290-g9569538..v1.2.13-296-gf6fbd36 and v1.2.13-299-g3edbfc8: commit 956953884e1ba475b02e4b7d8311680c0ec4f83b Author: Jiri Denemark <jdenemar> Date: Mon Jul 8 12:27:34 2013 +0200 Add support for tracking thread jobs Each thread can use a thread local variable to keep the name of a job which is currently running in the job. The virThreadJobSetWorker API is supposed to be called once by any thread which is used as a worker, i.e., it is waiting in a pool, woken up to do a job, and returned back to the pool. The virThreadJobSet/virThreadJobClear APIs are to be called at the beginning/end of each job. Signed-off-by: Jiri Denemark <jdenemar> commit 23d0c979f70c9bf9894c1f90626d6616335fe470 Author: Jiri Denemark <jdenemar> Date: Fri Mar 20 16:25:15 2015 +0100 Force usage of virThreadCreate We want all threads to be set as workers or to have a job assigned to them, which can easily be achieved in virThreadCreate wrapper to pthread_create. Let's make sure we always use the wrapper. Signed-off-by: Jiri Denemark <jdenemar> commit 55ebc93a085dcc32c8c10ff2cbd4bb3570fdb1f0 Author: Jiri Denemark <jdenemar> Date: Fri Mar 20 17:43:55 2015 +0100 virThread: Set thread job Automatically assign a job to every thread created by virThreadCreate. The name of the virThreadFunc function passed to virThreadCreate is used as the job or worker name in case no name is explicitly passed. Signed-off-by: Jiri Denemark <jdenemar> commit d20f5dde29ddb534ec528797c9e8f08130cfe8b0 Author: Jiri Denemark <jdenemar> Date: Fri Mar 20 17:44:25 2015 +0100 virThreadPool: Set thread worker name Every thread created as a worker thread within a pool gets a name according to virThreadPoolJobFunc name. Signed-off-by: Jiri Denemark <jdenemar> commit 667cce7b60e9645794179fce7de3b161531e4145 Author: Jiri Denemark <jdenemar> Date: Wed Mar 11 13:58:42 2015 +0100 Set thread job for every RPC call Since all APIs are also RPC calls, we automatically get all APIs covered with thread jobs. Signed-off-by: Jiri Denemark <jdenemar> commit b79f25e881cc2d82ec67398278ee073ebfe897ba Author: Jiri Denemark <jdenemar> Date: Mon Mar 16 18:18:46 2015 +0100 qemu: Track the API which started the current job This is very helpful when we want to log and report why we could not acquire a state change lock. Reporting what job keeps it locked helps with understanding the issue. Moreover, after calling virDomainGetControlInfo, it's possible to tell whether libvirt is just stuck somewhere within the API (or it just forgot to cleanup the job) or whether libvirt is waiting for QEMU to reply. The error message will look like the following: # virsh resume cd error: Failed to resume domain cd error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSuspend) https://bugzilla.redhat.com/show_bug.cgi?id=853839 Signed-off-by: Jiri Denemark <jdenemar> commit f6fbd36fd490dd9d712c69d58f7469d05a5bf5e5 Author: Jiri Denemark <jdenemar> Date: Mon Mar 23 12:46:45 2015 +0100 qemu: Add timing to domain jobs Whenever we fail to acquire a job, we can report how long ago it was locked by another API. https://bugzilla.redhat.com/show_bug.cgi?id=853839 Signed-off-by: Jiri Denemark <jdenemar> commit 3edbfc882a148428f4614682642a964f8168801f Author: Pavel Hrdina <phrdina> Date: Wed Mar 25 11:30:38 2015 +0100 Makefile: fix typo Commit 95695388 introduced new util/virthreadjob.c/h files but the makefile has type that breaks rpm build. Signed-off-by: Pavel Hrdina <phrdina> Verify this bug with following steps with build libvirt-1.2.16-1.el7.x86_64 we need to trigger one scenario which can cause guest cannot acquire state change lock 1. prepare and start a healthy guest's disk mount with nfs server # virsh start rh7 Domain rh7 started # virsh domblklist rh7 Target Source ------------------------------------------------ vda /mnt/wzhang/rhel7.0-3.qcow2 2. after start guest, add iptables to drop all the input packages from nfs server # iptables -t filter -A INPUT -s 10.66.90.121 -i enp0s25 -j DROP # iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination DROP all -- 10.66.90.121 anywhere 3. open two terminal, run virsh command for blk related operation on the two terminals at the same time #virsh blkdeviotune rh7 vda after almost 30 seconds, on the second terminal, we got below response, the error messages has been improved to show "held by xxxx" # virsh blkdeviotune rh7 vda error: Unable to get number of block I/O throttle parameters error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune) on the first terminal, command still hang there waiting for qemu reply # virsh blkdeviotune rh7 vda 4. clear iptables rule, # iptables -F on the first terminal, command return successful # virsh blkdeviotune rh7 vda total_bytes_sec: 0 read_bytes_sec : 0 write_bytes_sec: 0 total_iops_sec : 0 read_iops_sec : 0 write_iops_sec : 0 total_bytes_sec_max: 0 read_bytes_sec_max: 0 write_bytes_sec_max: 0 total_iops_sec_max: 0 read_iops_sec_max: 0 write_iops_sec_max: 0 size_iops_sec : 0 according to comment 22, move to verified Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2202.html |