Bug 853839

Summary:

Try to make debugging of "cannot acquire state change lock" easier

Product:

Red Hat Enterprise Linux 7

Reporter:

Alex Jia <ajia>

Component:

libvirt

Assignee:

Jiri Denemark <jdenemar>

Status:

CLOSED ERRATA

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

high

Docs Contact:

Priority:

high

Version:

7.0

CC:

cwei, dyuan, jdenemar, mzhan, rbalakri, snagar, zebing86

Target Milestone:

Keywords:

Reopened

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

libvirt-1.2.14-1.el7

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-11-19 05:35:44 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1133060, 1205796

Attachments:

Description	Flags
libvirtd.log	none

Description Alex Jia 2012-09-03 05:57:28 UTC

Description of problem:
Failed to shutdown domain with error "Timed out during operation: cannot acquire state change lock" after running 'virsh reboot --mode agent <domain>' is hang. 

Version-Release number of selected component (if applicable):
# rpm -q libvirt qemu-kvm-rhev seabios kernel
libvirt-0.10.1-1.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.307.el6.x86_64
seabios-0.6.1.2-19.el6.x86_64
kernel-2.6.32-288.el6.x86_64

On the guest:
# rpm -q qemu-guest-agent
qemu-guest-agent-0.12.1.2-2.307.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. to configure guest support GA
2. virsh start <domain>
3. login the guest then install qemu-guest-agent and start qemu-ga service
4. virsh reboot --mode agent <domain> (it's a known bug 853821)
5. ctrl + c (exit reboot command)
6. virsh domstate --reason <domain>
7. virsh shutdown --mode agent <domain>
  

Actual results:

# virsh reboot --mode agent myRHEL6
^C

Notes, it's a known bug 853821, so need to 'ctrl +c' exits virsh command.

# echo $?
130

# virsh domstate --reason myRHEL6
running (booted)

# virsh shutdown --mode agent myRHEL6
error: Failed to shutdown domain myRHEL6
error: Timed out during operation: cannot acquire state change lock


Expected results:

can successfully shutdown a guest.

Additional info:

# virsh dumpxml myRHEL6
<domain type='kvm' id='2'>
  ......
  <devices>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/myRHEL6.agent'/>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
  ......
  </devices>
  ......
</domain>

Comment 1 Alex Jia 2012-09-03 05:59:09 UTC

Created attachment 609255 [details]
libvirtd.log

Comment 2 Alex Jia 2012-09-03 06:04:55 UTC

Also saw the following debugger information(see 'warning' section):

2012-09-03 06:03:09.217+0000: 23307: debug : virDomainShutdownFlags:3217 : dom=0x7ff59c02f750, (VM: name=myRHEL6, uuid=1f2e5ace-b985-b3cc-1c82-31613b771024)

2012-09-03 06:03:39.000+0000: 23307: warning : qemuDomainObjBeginJobInternal:838 : Cannot start job (modify, none) for domain myRHEL6; current job is (modify, none) owned by (23304, 0)

2012-09-03 06:03:39.000+0000: 23307: error : qemuDomainObjBeginJobInternal:842 : Timed out during operation: cannot acquire state change lock

Comment 4 Jiri Denemark 2012-09-03 12:02:12 UTC

This is caused by "virsh reboot --mode agent" command being hung. It
is still there, owns the job and thus mostly nothing can be done with
the same domain. I'm closing this as a dup of the reboot bug.

*** This bug has been marked as a duplicate of bug 853821 ***

Comment 5 Alex Jia 2012-09-04 07:39:07 UTC

(In reply to comment #2)
> 2012-09-03 06:03:39.000+0000: 23307: warning :
> qemuDomainObjBeginJobInternal:838 : Cannot start job (modify, none) for
> domain myRHEL6; current job is (modify, none) owned by (23304, 0)
> 
> 2012-09-03 06:03:39.000+0000: 23307: error :
> qemuDomainObjBeginJobInternal:842 : Timed out during operation: cannot
> acquire state change lock

Hi Jiri,
I think it's very hard to find a root reason in libvirtd.log for users, the above warning or error message can't tell users the issue is introduced by failed reboot operation/job, maybe, we should improve debug info.

Thanks,
Alex

Comment 6 Dave Allan 2012-09-04 14:01:25 UTC

(In reply to comment #5)
> I think it's very hard to find a root reason in libvirtd.log for users, the
> above warning or error message can't tell users the issue is introduced by
> failed reboot operation/job, maybe, we should improve debug info.

I agree with Alex's point here; the failure to acquire state change lock errors can be very difficult to diagnose and result in admins simply blindly restarting libvirt which is frustrating to them.  IMO we should be recording sufficient information to provide a useful message to the user to allow them to troubleshoot, or at least understand, the problem.  I'm going to reopen this BZ to see if there is anything that can be done to improve the log output.

Comment 7 Jiri Denemark 2012-09-05 08:15:20 UTC

I'm not sure we can do anything more than we currently do. "Cannot acquire
state change lock" errors happen when libvirtd thinks another operation
(libvirt calls it "job" internally) is being performed on the same domain and
the running job is forbidden to be executed concurrently with the job that was
requested to start (and failed with the above error). This may happen when
there really is another job active (which happened in this BZ) or when there
is a bug in libvirt and the last API finished without clearing the job. And
the situation becomes even more complicated when you consider migrations,
which are active across multiple APIs, that is libvirtd may not be performing
any API on a domain while still having a job active on that domain.

Libvirt already tries to make debugging such situations easier by remember the
ID of the last thread that successfully acquired a job; see the warning Alex
quoted in comment 2:

    2012-09-03 06:03:39.000+0000: 23307: warning :
    qemuDomainObjBeginJobInternal:838 : Cannot start job (modify, none) for
    domain myRHEL6; current job is (modify, none) owned by (23304, 0)

After seeing this, we can look at debug logs (if they are available) and
identify where the job was set and why it is still active.

The only thing (I can think of) we could do to make this even more easier
would be to store a complete backtrace at the time a job is set in addition to
the thread ID. However, since we would need to do so for each job, I'm afraid
it would be accompanied with a noticeable performance impact.

Actually, we could probably store function name in a thread local variable
when entering each API and use it when entering a job to remember on behalf of
what API each job was started. It would be more complicated because not all
jobs are started on behalf of public APIs.

Comment 8 Dave Allan 2012-09-05 17:24:45 UTC

Ok, the message:

2012-09-03 06:03:39.000+0000: 23307: warning : qemuDomainObjBeginJobInternal:838 : Cannot start job (modify, none) for domain myRHEL6; current job is (modify, none) owned by (23304, 0)

is very much along the lines of what I was envisioning.  It still needs some work, though, because there's no way for a user to go from (modify, none) to my reboot operation is not responding.  Relatedly, we also need to make sure that job control is implemented for these operations so that a user can get control back.

Comment 10 moonshine 2012-11-27 01:31:25 UTC

centos6 amd64 and all packages lastest

bash> rpm -q libvirt qemu-kvm-rhev seabios kernel
libvirt-0.9.10-21.el6_3.6.x86_64
package qemu-kvm-rhev is not installed
seabios-0.6.1.2-19.el6.x86_64
kernel-2.6.32-279.11.1.el6.x86_64
kernel-2.6.32-279.14.1.el6.x86_64

bash> tail  /var/log/libvirt/libvirtd.log
2012-11-26 02:37:17.043+0000: 2037: info : libvirt version: 0.9.10, package: 21.el6_3.6 (CentOS BuildSystem <http://bugs.centos.org>, 2012-11-22-09:45:30, c6b8.bsys.dev.centos.org)
2012-11-26 02:37:17.043+0000: 2037: error : virCommandWait:2304 : internal error Child process (/sbin/tc qdisc del dev virbr0 root) status unexpected: exit status 2
2012-11-26 02:37:17.106+0000: 2037: error : virCommandWait:2304 : internal error Child process (/sbin/tc qdisc del dev virbr0 ingress) status unexpected: exit status 2
2012-11-26 06:25:18.528+0000: 2036: error : qemuMonitorIO:574 : internal error End of file from monitor
2012-11-26 10:41:05.453+0000: 2036: error : qemuMonitorIO:574 : internal error End of file from monitor
2012-11-27 01:09:32.001+0000: 2059: info : libvirt version: 0.9.10, package: 21.el6_3.6 (CentOS BuildSystem <http://bugs.centos.org>, 2012-11-22-09:45:30, c6b8.bsys.dev.centos.org)
2012-11-27 01:09:32.001+0000: 2059: error : qemuDomainObjBeginJobInternal:777 : Timed out during operation: cannot acquire state change lock
2012-11-27 01:09:45.738+0000: 2057: error : virNetSocketReadWire:999 : End of file while reading data: Input/output error
2012-11-27 01:11:00.000+0000: 2060: error : qemuDomainObjBeginJobInternal:777 : Timed out during operation: cannot acquire state change lock
2012-11-27 01:11:24.000+0000: 2058: error : qemuDomainObjBeginJobInternal:777 : Timed out during operation: cannot acquire state change lock

Comment 11 moonshine 2012-11-27 01:41:02 UTC

sorry for wrong place if centos6, or show me the rignt one.

similar proble while start vm(winxp) and hope help bug-shooting.


---
Error starting domain: Timed out during operation: cannot acquire state change lock

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 44, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 65, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/domain.py", line 1063, in startup
    self._backend.create()
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 619, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock

Comment 12 Dave Allan 2012-11-27 19:10:36 UTC

(In reply to comment #11)
> sorry for wrong place if centos6, or show me the rignt one.
> 
> similar proble while start vm(winxp) and hope help bug-shooting.

This BZ is primarily now about making the log messages better in the case in which the state change lock cannot be obtained.  Your best bet for CentOS problems is probably the upstream libvirt mailing list.

Comment 16 Jiri Denemark 2014-04-04 21:36:42 UTC

This bug was not selected to be addressed in Red Hat Enterprise Linux 6. We will look at it again within the Red Hat Enterprise Linux 7 product.

Comment 19 Jiri Denemark 2015-03-23 13:33:00 UTC

Patches sent upstream for review: https://www.redhat.com/archives/libvir-list/2015-March/msg01137.html

Comment 20 Jiri Denemark 2015-03-25 13:36:00 UTC

The patches are now pushed upstream as v1.2.13-290-g9569538..v1.2.13-296-gf6fbd36 and v1.2.13-299-g3edbfc8:

commit 956953884e1ba475b02e4b7d8311680c0ec4f83b
Author: Jiri Denemark <jdenemar>
Date:   Mon Jul 8 12:27:34 2013 +0200

    Add support for tracking thread jobs
    
    Each thread can use a thread local variable to keep the name of a job
    which is currently running in the job.
    
    The virThreadJobSetWorker API is supposed to be called once by any
    thread which is used as a worker, i.e., it is waiting in a pool, woken
    up to do a job, and returned back to the pool.
    
    The virThreadJobSet/virThreadJobClear APIs are to be called at the
    beginning/end of each job.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit 23d0c979f70c9bf9894c1f90626d6616335fe470
Author: Jiri Denemark <jdenemar>
Date:   Fri Mar 20 16:25:15 2015 +0100

    Force usage of virThreadCreate
    
    We want all threads to be set as workers or to have a job assigned to
    them, which can easily be achieved in virThreadCreate wrapper to
    pthread_create. Let's make sure we always use the wrapper.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit 55ebc93a085dcc32c8c10ff2cbd4bb3570fdb1f0
Author: Jiri Denemark <jdenemar>
Date:   Fri Mar 20 17:43:55 2015 +0100

    virThread: Set thread job
    
    Automatically assign a job to every thread created by virThreadCreate.
    The name of the virThreadFunc function passed to virThreadCreate is used
    as the job or worker name in case no name is explicitly passed.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit d20f5dde29ddb534ec528797c9e8f08130cfe8b0
Author: Jiri Denemark <jdenemar>
Date:   Fri Mar 20 17:44:25 2015 +0100

    virThreadPool: Set thread worker name
    
    Every thread created as a worker thread within a pool gets a name
    according to virThreadPoolJobFunc name.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit 667cce7b60e9645794179fce7de3b161531e4145
Author: Jiri Denemark <jdenemar>
Date:   Wed Mar 11 13:58:42 2015 +0100

    Set thread job for every RPC call
    
    Since all APIs are also RPC calls, we automatically get all APIs covered
    with thread jobs.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit b79f25e881cc2d82ec67398278ee073ebfe897ba
Author: Jiri Denemark <jdenemar>
Date:   Mon Mar 16 18:18:46 2015 +0100

    qemu: Track the API which started the current job
    
    This is very helpful when we want to log and report why we could not
    acquire a state change lock. Reporting what job keeps it locked helps
    with understanding the issue. Moreover, after calling
    virDomainGetControlInfo, it's possible to tell whether libvirt is just
    stuck somewhere within the API (or it just forgot to cleanup the job) or
    whether libvirt is waiting for QEMU to reply.
    
    The error message will look like the following:
    
        # virsh resume cd
        error: Failed to resume domain cd
        error: Timed out during operation: cannot acquire state change lock
        (held by remoteDispatchDomainSuspend)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=853839
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit f6fbd36fd490dd9d712c69d58f7469d05a5bf5e5
Author: Jiri Denemark <jdenemar>
Date:   Mon Mar 23 12:46:45 2015 +0100

    qemu: Add timing to domain jobs
    
    Whenever we fail to acquire a job, we can report how long ago it was
    locked by another API.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=853839
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit 3edbfc882a148428f4614682642a964f8168801f
Author: Pavel Hrdina <phrdina>
Date:   Wed Mar 25 11:30:38 2015 +0100

    Makefile: fix typo
    
    Commit 95695388 introduced new util/virthreadjob.c/h files but the
    makefile has type that breaks rpm build.
    
    Signed-off-by: Pavel Hrdina <phrdina>

Comment 22 vivian zhang 2015-06-15 08:52:39 UTC

Verify this bug with following steps with build libvirt-1.2.16-1.el7.x86_64

we need to trigger one scenario which can cause guest cannot acquire state change lock


1. prepare and start a healthy guest's disk mount with nfs server

# virsh start rh7
Domain rh7 started

# virsh domblklist rh7
Target     Source
------------------------------------------------
vda        /mnt/wzhang/rhel7.0-3.qcow2



2. after start guest, add iptables to drop all the input packages from nfs server

# iptables -t filter -A INPUT -s 10.66.90.121 -i enp0s25 -j DROP

# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
DROP       all  --  10.66.90.121         anywhere 

3. open two terminal, run virsh command for blk related operation on the two terminals at the same time

#virsh blkdeviotune rh7 vda

after almost 30 seconds, on the second terminal, we got below response, the error messages has been improved to show "held by xxxx"
# virsh blkdeviotune rh7 vda
error: Unable to get number of block I/O throttle parameters
error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

on the first terminal, command still hang there waiting for qemu reply
# virsh blkdeviotune rh7 vda


4. clear iptables rule, 
# iptables -F

on the first terminal, command return successful
# virsh blkdeviotune rh7 vda
total_bytes_sec: 0
read_bytes_sec : 0
write_bytes_sec: 0
total_iops_sec : 0
read_iops_sec  : 0
write_iops_sec : 0
total_bytes_sec_max: 0
read_bytes_sec_max: 0
write_bytes_sec_max: 0
total_iops_sec_max: 0
read_iops_sec_max: 0
write_iops_sec_max: 0
size_iops_sec  : 0

Comment 23 vivian zhang 2015-06-29 08:51:09 UTC

according to comment 22, move to verified

Comment 25 errata-xmlrpc 2015-11-19 05:35:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html