Bug 1598084 - Cannot acquire state change lock when creating a snapshot with --quiesce
Summary: Cannot acquire state change lock when creating a snapshot with --quiesce
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Michal Privoznik
QA Contact: yanqzhan@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-04 10:08 UTC by Han Han
Modified: 2018-10-30 09:58 UTC
CC List: 10 users

Fixed In Version: libvirt-4.5.0-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-30 09:57:31 UTC
Target Upstream Version:


Attachments
Libvirtd log of step2 (1.03 KB, text/plain), 2018-07-04 10:08 UTC, Han Han


Links
System ID: Red Hat Product Errata RHSA-2018:3113
Last Updated: 2018-10-30 09:58:30 UTC

Description Han Han 2018-07-04 10:08:59 UTC
Created attachment 1456464
Libvirtd log of step2

Description of problem:
Creating a disk-only snapshot with --quiesce on a running VM whose qemu-guest-agent is active fails with "cannot acquire state change lock" (see summary).

Version-Release number of selected component (if applicable):
libvirt-4.5.0-1.el7.x86_64
qemu-kvm-rhev-2.12.0-6.el7.x86_64
guest: qemu-guest-agent-2.12.0-11.el8+619+ba8f0855.1.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a running VM with qemu-guest-agent active
2. Create snapshot with --quiesce
# virsh snapshot-create-as usb snap_2 snap2-desc --quiesce --disk-only                                                                                                   
error: Timed out during operation: cannot acquire state change lock
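
For reference, the same request can be issued directly through the libvirt C API; a minimal sketch (domain and snapshot names taken from the reproducer above, error handling kept short) that exercises the same entry point seen in the backtrace in comment 4:

/* Assumed build line: gcc quiesce-snap.c $(pkg-config --cflags --libs libvirt) */
#include <stdio.h>
#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

int main(void)
{
    const char *xml =
        "<domainsnapshot>"
        "  <name>snap_2</name>"
        "  <description>snap2-desc</description>"
        "</domainsnapshot>";

    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    virDomainPtr dom = virDomainLookupByName(conn, "usb"); /* domain from step 1 */
    if (!dom) {
        virConnectClose(conn);
        return 1;
    }

    /* Flags corresponding to --disk-only --quiesce */
    virDomainSnapshotPtr snap =
        virDomainSnapshotCreateXML(dom, xml,
                                   VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
                                   VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE);
    if (!snap) {
        virErrorPtr err = virGetLastError();
        fprintf(stderr, "snapshot failed: %s\n", err ? err->message : "unknown error");
    } else {
        virDomainSnapshotFree(snap);
    }

    virDomainFree(dom);
    virConnectClose(conn);
    return snap ? 0 : 1;
}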


Actual results:
The snapshot is not created; virsh fails with "error: Timed out during operation: cannot acquire state change lock".

Expected results:
The quiesced disk-only snapshot is created successfully.

Additional info:

Comment 3 Jiri Denemark 2018-07-04 10:23:41 UTC
Please, provide a full log from libvirtd covering both steps.
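
For completeness, one common way to capture such a log (example settings only; adjust filters and path as needed) is to enable debug logging in /etc/libvirt/libvirtd.conf and restart the daemon before repeating both steps:

# /etc/libvirt/libvirtd.conf (example values)
log_filters="1:qemu 1:libvirt"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

# apply and re-run steps 1-2
systemctl restart libvirtd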

Comment 4 Michal Privoznik 2018-07-04 11:03:06 UTC
I think I know what the problem is:

Thread 2 (Thread 0x7fee847ba700 (LWP 2206)):
#0  0x00007fee8b309e56 in futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fee847b96e0, expected=0, futex_word=0x7fee4c0da8a0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7fee847b96e0, mutex=0x7fee4c0da740, cond=0x7fee4c0da878) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=0x7fee4c0da878, mutex=0x7fee4c0da740, abstime=0x7fee847b96e0) at pthread_cond_wait.c:667
#3  0x00007fee8e27cd17 in virCondWaitUntil (c=0x7fee4c0da878, m=0x7fee4c0da740, whenms=1530701415591) at util/virthread.c:169
#4  0x00007fee7849cc10 in qemuDomainObjBeginJobInternal (driver=0x7fee4c0718d0, obj=0x7fee4c0da730, job=QEMU_JOB_NONE, agentJob=QEMU_AGENT_JOB_MODIFY, asyncJob=QEMU_ASYNC_JOB_NONE, nowait=false) at qemu/qemu_domain.c:6505
#5  0x00007fee7849d785 in qemuDomainObjBeginAgentJob (driver=0x7fee4c0718d0, obj=0x7fee4c0da730, agentJob=QEMU_AGENT_JOB_MODIFY) at qemu/qemu_domain.c:6698
#6  0x00007fee785478c7 in qemuDomainSnapshotCreateActiveExternal (driver=0x7fee4c0718d0, vm=0x7fee4c0da730, snap=0x7fee7c001f20, flags=84) at qemu/qemu_driver.c:15159
#7  0x00007fee7854885e in qemuDomainSnapshotCreateXML (domain=0x7fee7c02e4d0, xmlDesc=0x7fee7c002d90 "<domainsnapshot>\n  <name>fedora_snap</name>\n  <description>fedora_snap_desc</description>\n</domainsnapshot>\n", flags=84) at qemu/qemu_driver.c:15547
#8  0x00007fee8e4b419e in virDomainSnapshotCreateXML (domain=0x7fee7c02e4d0, xmlDesc=0x7fee7c002d90 "<domainsnapshot>\n  <name>fedora_snap</name>\n  <description>fedora_snap_desc</description>\n</domainsnapshot>\n", flags=84) at libvirt-domain-snapshot.c:231


qemuDomainSnapshotCreateActiveExternal() is trying to start QEMU_AGENT_JOB_MODIFY, therefore qemuDomainObjBeginJobInternal(job=QEMU_JOB_NONE, agentJob=QEMU_AGENT_JOB_MODIFY) is called. Since an async job (the snapshot itself) is already running and @nested is false, qemuDomainNestedJobAllowed(priv, job=QEMU_JOB_NONE) is consulted; it returns 0, which makes the thread wait for the async job to finish. But that async job is owned by this very thread, so the wait can never succeed and the call times out with the state change lock error.
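
To make the failure mode concrete, below is a small self-contained model of the check described above. It is an illustrative sketch only: the enums, the mask handling and the "fixed" variant are simplified stand-ins for the real qemu_domain.c code, and the actual upstream patch (see the next comment) may differ in detail.

/* Model only; NOT the real libvirt code. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { JOB_NONE = 0, JOB_QUERY, JOB_MODIFY } job_t;
typedef enum { ASYNC_JOB_NONE = 0, ASYNC_JOB_SNAPSHOT } async_job_t;

struct job_state {
    async_job_t async_job;   /* async job currently held */
    unsigned int mask;       /* normal jobs the async job allows to nest */
};

/* Buggy variant: the nesting mask is consulted even when job == JOB_NONE,
 * i.e. even when the caller only asked for an agent job. */
static bool nested_job_allowed_buggy(const struct job_state *s, job_t job)
{
    return !s->async_job || (s->mask & (1u << job)) != 0;
}

/* Fixed variant (simplified): asking for no normal job can never conflict
 * with the running async job, so the agent job is allowed to proceed. */
static bool nested_job_allowed_fixed(const struct job_state *s, job_t job)
{
    return !s->async_job || job == JOB_NONE || (s->mask & (1u << job)) != 0;
}

int main(void)
{
    /* State at the point of the backtrace: this thread already holds the
     * snapshot async job (which lets e.g. query jobs nest, but of course
     * not "none") and now requests job=none, agentJob=modify. */
    struct job_state s = { ASYNC_JOB_SNAPSHOT, 1u << JOB_QUERY };

    printf("buggy: allowed=%d -> waits for its own async job, times out\n",
           nested_job_allowed_buggy(&s, JOB_NONE));
    printf("fixed: allowed=%d -> agent job starts, guest gets quiesced\n",
           nested_job_allowed_fixed(&s, JOB_NONE));
    return 0;
}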

Comment 5 Michal Privoznik 2018-07-04 11:09:34 UTC
Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2018-July/msg00258.html

Comment 8 yanqzhan@redhat.com 2018-07-19 05:18:05 UTC
Reproduced on rhel7.6 with:
libvirt-4.5.0-1.el7.x86_64
qemu-kvm-rhev-2.12.0-7.el7.x86_64
qemu-guest-agent-2.12.0-1.el7.x86_64

# virsh snapshot-create-as V-7.6 snap_3 --quiesce --disk-only
error: Timed out during operation: cannot acquire state change lock

# cat libvirtd.log|grep qemuDomainObj
debug : qemuDomainObjBeginJobInternal:6450 : Starting job: job=none agentJob=none asyncJob=snapshot (vm=0x7f56401bc6e0 name=V-7.6, current job=none agentJob=none async=none)
debug : qemuDomainObjBeginJobInternal:6507 : Started async job: snapshot (vm=0x7f56401bc6e0 name=V-7.6)
debug : qemuDomainObjBeginJobInternal:6450 : Starting job: job=none agentJob=modify asyncJob=none (vm=0x7f56401bc6e0 name=V-7.6, current job=none agentJob=none async=snapshot)
debug : qemuDomainObjBeginJobInternal:6471 : Waiting for async job (vm=0x7f56401bc6e0 name=V-7.6)
warning : qemuDomainObjBeginJobInternal:6564 : Cannot start job (none, modify, none) for domain V-7.6; current job is (none, none, snapshot) owned by (0 <null>, 0 <null>, 5140 remoteDispatchDomainSnapshotCreateXML (flags=0x50)) for (0s, 0s, 30s)
error : qemuDomainObjBeginJobInternal:6594 : Timed out during operation: cannot acquire state change lock
debug : qemuDomainObjEndAsyncJob:6838 : Stopping async job: snapshot (vm=0x7f56401bc6e0 name=V-7.6)


Verified on rhel7.6 with: 
libvirt-4.5.0-3.el7.x86_64
qemu-kvm-rhev-2.12.0-7.el7.x86_64
qemu-guest-agent-2.12.0-1.el7.x86_64

# virsh snapshot-create-as V-7.6 snap_1 --quiesce --disk-only
Domain snapshot snap_1 created

And no error appears in libvirtd.log:
# cat libvirtd.log|grep qemuDomainObj
debug : qemuDomainObjBeginJobInternal:6458 : Starting job: job=none agentJob=none asyncJob=snapshot (vm=0x7fb1bc1bf790 name=V-7.6, current job=none agentJob=none async=none)
debug : qemuDomainObjBeginJobInternal:6515 : Started async job: snapshot (vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjBeginJobInternal:6458 : Starting job: job=none agentJob=modify asyncJob=none (vm=0x7fb1bc1bf790 name=V-7.6, current job=none agentJob=none async=snapshot)
debug : qemuDomainObjBeginJobInternal:6535 : Started agent job: modify (vm=0x7fb1bc1bf790 name=V-7.6 job=none async=snapshot)
debug : qemuDomainObjEnterAgent:7002 : Entering agent (agent=0x7fb1bc1d1e40 vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjExitAgent:7024 : Exited agent (agent=0x7fb1bc1d1e40 vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjEndAgentJob:6804 : Stopping agent job: modify (async=snapshot vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjBeginJobInternal:6458 : Starting job: job=async nested agentJob=none asyncJob=none (vm=0x7fb1bc1bf790 name=V-7.6, current job=none agentJob=none async=snapshot)
debug : qemuDomainObjBeginJobInternal:6507 : Started job: async nested (async=snapshot vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjEnterMonitorInternal:6899 : Entering monitor (mon=0x7fb210014b90 vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjExitMonitorInternal:6922 : Exited monitor (mon=0x7fb210014b90 vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjEndJob:6783 : Stopping job: async nested (async=snapshot vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjBeginJobInternal:6458 : Starting job: job=none agentJob=modify asyncJob=none (vm=0x7fb1bc1bf790 name=V-7.6, current job=none agentJob=none async=snapshot)
debug : qemuDomainObjBeginJobInternal:6535 : Started agent job: modify (vm=0x7fb1bc1bf790 name=V-7.6 job=none async=snapshot)
debug : qemuDomainObjEnterAgent:7002 : Entering agent (agent=0x7fb1bc1d1e40 vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjExitAgent:7024 : Exited agent (agent=0x7fb1bc1d1e40 vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjEndAgentJob:6804 : Stopping agent job: modify (async=snapshot vm=0x7fb1bc1bf790 name=V-7.6)
debug : qemuDomainObjEndAsyncJob:6846 : Stopping async job: snapshot (vm=0x7fb1bc1bf790 name=V-7.6)

Comment 10 errata-xmlrpc 2018-10-30 09:57:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113

