Bug 1134154 - snapshot's race condition
Summary: snapshot's race condition
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Peter Krempa
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 1079162
Blocks: 1143900
 
Reported: 2014-08-27 01:52 UTC by Jincheng Miao
Modified: 2015-03-05 07:43 UTC
CC List: 8 users

Fixed In Version: libvirt-1.2.8-2.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1143900
Environment:
Last Closed: 2015-03-05 07:43:10 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2015:0323 (normal, SHIPPED_LIVE): Low: libvirt security, bug fix, and enhancement update (last updated 2015-03-05 12:10:54 UTC)

Description Jincheng Miao 2014-08-27 01:52:02 UTC
Description of problem:
libvirt's snapshot code appears to have a race condition: racing snapshot
operations can leave the domain's job held, so later snapshot commands fail
to acquire the state change lock.


Version-Release number of selected component (if applicable):
libvirt-1.2.7-1.el7.x86_64
qemu-kvm-1.5.3-60.el7.4.x86_64

How reproducible:
nearly always

Steps to Reproduce:
1. start a guest with qcow2 image
# virsh start r7b
Domain r7b started

2. create a snapshot
# virsh snapshot-create r7b
Domain snapshot 1408689049 created

3. run the following two commands *at the same time*, following this sequence strictly.
3.1 run this cmd in a terminal:
# virsh snapshot-delete r7b 1408689049
Domain snapshot 1408689049 deleted

3.2 run this cmd immediately in another terminal:
# virsh snapshot-revert r7b 1408689049
error: internal error: Invalid target domain state 'nostate'. Refusing snapshot reversion

4. snapshot operations then stop working
# virsh snapshot-create r7b
error: Timed out during operation: cannot acquire state change lock

In libvirtd.log:
2014-08-22 06:31:21.998+0000: 7403: error : qemuDomainRevertToSnapshot:14305 : internal error: Invalid target domain state 'nostate'. Refusing snapshot reversion

2014-08-22 05:43:38.000+0000: 6004: error : qemuDomainObjBeginJobInternal:1154 : Timed out during operation: cannot acquire state change lock

The qemu events captured by qemu-monitor.stp:
362185.761 > 0x7fc628005450 {"execute":"stop","id":"libvirt-9"}
362185.764 ! 0x7fc628005450 {"timestamp": {"seconds": 1408689049, "microseconds": 410410}, "event": "STOP"}
362185.821 < 0x7fc628005450 {"return": {}, "id": "libvirt-9"}
362185.821 > 0x7fc628005450 {"execute":"human-monitor-command","arguments":{"command-line":"savevm \"1408689049\""},"id":"libvirt-10"}
362193.575 < 0x7fc628005450 {"return": "", "id": "libvirt-10"}
362193.576 > 0x7fc628005450 {"execute":"cont","id":"libvirt-11"}
362193.588 ! 0x7fc628005450 {"timestamp": {"seconds": 1408689057, "microseconds": 234486}, "event": "RESUME"}
362193.589 < 0x7fc628005450 {"return": {}, "id": "libvirt-11"}
362193.674 ! 0x7fc628005450 {"timestamp": {"seconds": 1408689057, "microseconds": 320177}, "event": "BALLOON_CHANGE", "data": {"actual": 1072693248}}
362194.674 ! 0x7fc628005450 {"timestamp": {"seconds": 1408689057, "microseconds": 570648}, "event": "BALLOON_CHANGE", "data": {"actual": 409600000}}
362203.322 ! 0x7fc628005450 {"timestamp": {"seconds": 1408689066, "microseconds": 967810}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "net0", "path": "/machine/peripheral/net0/virtio-backend"}}
362218.046 > 0x7fc628005450 {"execute":"human-monitor-command","arguments":{"command-line":"delvm \"1408689049\""},"id":"libvirt-12"}
362218.352 < 0x7fc628005450 {"return": "", "id": "libvirt-12"} 

Expected result:
snapshot operations should lock the snapshot object properly.
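
For reference, the same race can be driven through the public libvirt C API. Below is a minimal, hypothetical reproducer (not part of the original report; the domain and snapshot names are taken from the steps above) that fires the delete and the revert from two threads at once:

/* repro.c: hypothetical reproducer for the delete/revert race.
 * Assumes a running domain "r7b" with snapshot "1408689049", as above.
 * Build: gcc repro.c -o repro $(pkg-config --cflags --libs libvirt) -pthread
 */
#include <libvirt/libvirt.h>
#include <pthread.h>
#include <stdio.h>

static const char *SNAP = "1408689049";

static void *do_delete(void *arg)
{
    virDomainSnapshotPtr snap =
        virDomainSnapshotLookupByName((virDomainPtr)arg, SNAP, 0);
    if (snap) {
        virDomainSnapshotDelete(snap, 0);    /* races against the revert */
        virDomainSnapshotFree(snap);
    }
    return NULL;
}

static void *do_revert(void *arg)
{
    virDomainSnapshotPtr snap =
        virDomainSnapshotLookupByName((virDomainPtr)arg, SNAP, 0);
    if (snap) {
        if (virDomainRevertToSnapshot(snap, 0) < 0)
            fprintf(stderr, "revert failed\n");  /* expected on buggy builds */
        virDomainSnapshotFree(snap);
    }
    return NULL;
}

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = conn ? virDomainLookupByName(conn, "r7b") : NULL;
    pthread_t t1, t2;

    if (!dom)
        return 1;
    pthread_create(&t1, NULL, do_delete, dom);
    pthread_create(&t2, NULL, do_revert, dom);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}

On a fixed libvirt, one of the two calls simply fails cleanly; on the affected build the daemon is left unable to acquire the state change lock.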

Comment 1 Eric Blake 2014-08-27 03:03:34 UTC
Looks like this might fix it:
https://www.redhat.com/archives/libvir-list/2014-August/msg01293.html

Comment 2 Peter Krempa 2014-08-27 07:26:33 UTC
Fixed upstream:

commit 03b994fac0434c08cca13f51c3980c1b1d218047
Author: Jincheng Miao <jmiao>
Date:   Wed Aug 27 10:10:33 2014 +0800

    qemu: call endjob in RevertToSnapshot
    
    In qemuDomainRevertToSnapshot(), it will check snap->def->state.
    But when the state is PMSUSPENDED/NOSTATE/BLOCKED, it forgets to
    call qemuDomainObjEndJob.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1134154
    Bug introduced in commit 1e833899.
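
The shape of the bug, as a self-contained sketch (made-up names, not libvirt source): the job acts as a mutex around domain state changes, and the rejected-state paths returned without releasing it:

/* sketch.c: models the leaked-job bug with a plain mutex.
 * The real code waits on a condition variable with a timeout,
 * hence the "Timed out during operation" wording in the logs. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t job = PTHREAD_MUTEX_INITIALIZER;

static bool begin_job(void) { return pthread_mutex_trylock(&job) == 0; }
static void end_job(void)   { pthread_mutex_unlock(&job); }

static int revert(int state)
{
    if (!begin_job()) {
        fprintf(stderr, "cannot acquire state change lock\n");
        return -1;
    }
    if (state < 0) {                     /* PMSUSPENDED/NOSTATE/BLOCKED */
        fprintf(stderr, "Refusing snapshot reversion\n");
        return -1;                       /* BUG: leaks the job; the fix
                                          * calls end_job() here */
    }
    /* ... perform the reversion ... */
    end_job();
    return 0;
}

int main(void)
{
    revert(-1);   /* error path leaks the job ... */
    revert(0);    /* ... so every later job "times out" */
    return 0;
}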

Comment 3 Peter Krempa 2014-08-27 07:27:00 UTC
$ git describe 03b994fac0434c08cca13f51c3980c1b1d218047
v1.2.7-241-g03b994f

Comment 4 Eric Blake 2014-08-27 12:13:14 UTC
That was only a partial fix; according to the mail thread, there is a missing lock where concurrent snapshot operations can cause data corruption, which still needs further patching.

Comment 5 Shanzhi Yu 2014-08-27 13:04:22 UTC
Hi Eric,
I hit a libvirtd crash when testing with the latest libvirt.git.

# git describe
v1.2.7-241-g03b994f

Steps are the same as in comment 0.

 Thread 5:
==9932== Invalid read of size 8
==9932==    at 0x210191A5: qemuDomainRevertToSnapshot (qemu_driver.c:14094)
==9932==    by 0x5195370: virDomainRevertToSnapshot (libvirt.c:19324)
==9932==    by 0x133CEC: remoteDispatchDomainRevertToSnapshotHelper (remote_dispatch.h:7121)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932==    by 0x50DA4AD: virThreadHelper (virthread.c:197)
==9932==    by 0x831DF32: start_thread (pthread_create.c:309)
==9932==    by 0x8A2FDEC: clone (clone.S:111)
==9932==  Address 0x24075130 is 0 bytes inside a block of size 40 free'd
==9932==    at 0x4A07577: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==9932==    by 0x5083B89: virFree (viralloc.c:582)
==9932==    by 0x511A1FD: virDomainSnapshotObjListDataFree (snapshot_conf.c:753)
==9932==    by 0x50A86E7: virHashRemoveEntry (virhash.c:475)
==9932==    by 0x20FB7CD9: qemuDomainSnapshotDiscard (qemu_domain.c:2096)
==9932==    by 0x21018D29: qemuDomainSnapshotDelete (qemu_driver.c:14474)
==9932==    by 0x51955A0: virDomainSnapshotDelete (libvirt.c:19386)
==9932==    by 0x133B3C: remoteDispatchDomainSnapshotDeleteHelper (remote_dispatch.h:8700)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932==    by 0x50DA4AD: virThreadHelper (virthread.c:197)
==9932== 
==9932== Invalid read of size 4
==9932==    at 0x210191A8: qemuDomainRevertToSnapshot (qemu_driver.c:14094)
==9932==    by 0x5195370: virDomainRevertToSnapshot (libvirt.c:19324)
==9932==    by 0x133CEC: remoteDispatchDomainRevertToSnapshotHelper (remote_dispatch.h:7121)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932==    by 0x50DA4AD: virThreadHelper (virthread.c:197)
==9932==    by 0x831DF32: start_thread (pthread_create.c:309)
==9932==    by 0x8A2FDEC: clone (clone.S:111)
==9932==  Address 0x23fe3e60 is 32 bytes inside a block of size 80 free'd
==9932==    at 0x4A07577: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==9932==    by 0x5083B89: virFree (viralloc.c:582)
==9932==    by 0x511A17C: virDomainSnapshotDefFree (snapshot_conf.c:105)
==9932==    by 0x511A1F3: virDomainSnapshotObjListDataFree (snapshot_conf.c:752)
==9932==    by 0x50A86E7: virHashRemoveEntry (virhash.c:475)
==9932==    by 0x20FB7CD9: qemuDomainSnapshotDiscard (qemu_domain.c:2096)
==9932==    by 0x21018D29: qemuDomainSnapshotDelete (qemu_driver.c:14474)
==9932==    by 0x51955A0: virDomainSnapshotDelete (libvirt.c:19386)
==9932==    by 0x133B3C: remoteDispatchDomainSnapshotDeleteHelper (remote_dispatch.h:8700)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932== 
==9932== Invalid read of size 8
==9932==    at 0x2101936F: qemuDomainRevertToSnapshot (qemu_driver.c:14152)
==9932==    by 0x5195370: virDomainRevertToSnapshot (libvirt.c:19324)
==9932==    by 0x133CEC: remoteDispatchDomainRevertToSnapshotHelper (remote_dispatch.h:7121)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932==    by 0x50DA4AD: virThreadHelper (virthread.c:197)
==9932==    by 0x831DF32: start_thread (pthread_create.c:309)
==9932==    by 0x8A2FDEC: clone (clone.S:111)
==9932==  Address 0x24075130 is 0 bytes inside a block of size 40 free'd
==9932==    at 0x4A07577: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==9932==    by 0x5083B89: virFree (viralloc.c:582)
==9932==    by 0x511A1FD: virDomainSnapshotObjListDataFree (snapshot_conf.c:753)
==9932==    by 0x50A86E7: virHashRemoveEntry (virhash.c:475)
==9932==    by 0x20FB7CD9: qemuDomainSnapshotDiscard (qemu_domain.c:2096)
==9932==    by 0x21018D29: qemuDomainSnapshotDelete (qemu_driver.c:14474)
==9932==    by 0x51955A0: virDomainSnapshotDelete (libvirt.c:19386)
==9932==    by 0x133B3C: remoteDispatchDomainSnapshotDeleteHelper (remote_dispatch.h:8700)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932==    by 0x50DA4AD: virThreadHelper (virthread.c:197)
==9932== 
==9932== Invalid read of size 8
==9932==    at 0x21019379: qemuDomainRevertToSnapshot (qemu_driver.c:14152)
==9932==    by 0x5195370: virDomainRevertToSnapshot (libvirt.c:19324)
==9932==    by 0x133CEC: remoteDispatchDomainRevertToSnapshotHelper (remote_dispatch.h:7121)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932==    by 0x50DA4AD: virThreadHelper (virthread.c:197)
==9932==    by 0x831DF32: start_thread (pthread_create.c:309)
==9932==    by 0x8A2FDEC: clone (clone.S:111)
==9932==  Address 0x23fe3e40 is 0 bytes inside a block of size 80 free'd
==9932==    at 0x4A07577: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==9932==    by 0x5083B89: virFree (viralloc.c:582)
==9932==    by 0x511A17C: virDomainSnapshotDefFree (snapshot_conf.c:105)
==9932==    by 0x511A1F3: virDomainSnapshotObjListDataFree (snapshot_conf.c:752)
==9932==    by 0x50A86E7: virHashRemoveEntry (virhash.c:475)
==9932==    by 0x20FB7CD9: qemuDomainSnapshotDiscard (qemu_domain.c:2096)
==9932==    by 0x21018D29: qemuDomainSnapshotDelete (qemu_driver.c:14474)
==9932==    by 0x51955A0: virDomainSnapshotDelete (libvirt.c:19386)
==9932==    by 0x133B3C: remoteDispatchDomainSnapshotDeleteHelper (remote_dispatch.h:8700)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932== 
==9932== Invalid read of size 1
==9932==    at 0x20FDE9D8: qemuMonitorEscapeArg (qemu_monitor.c:156)
==9932==    by 0x20FEDE4E: qemuMonitorTextLoadSnapshot (qemu_monitor_text.c:2761)
==9932==    by 0x21019380: qemuDomainRevertToSnapshot (qemu_driver.c:14152)
==9932==    by 0x5195370: virDomainRevertToSnapshot (libvirt.c:19324)
==9932==    by 0x133CEC: remoteDispatchDomainRevertToSnapshotHelper (remote_dispatch.h:7121)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932==    by 0x50DA4AD: virThreadHelper (virthread.c:197)
==9932==    by 0x831DF32: start_thread (pthread_create.c:309)
==9932==    by 0x8A2FDEC: clone (clone.S:111)
==9932==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==9932== 
==9932== 
==9932== Process terminating with default action of signal 11 (SIGSEGV)
==9932==  Access not within mapped region at address 0x0
==9932==    at 0x20FDE9D8: qemuMonitorEscapeArg (qemu_monitor.c:156)
==9932==    by 0x20FEDE4E: qemuMonitorTextLoadSnapshot (qemu_monitor_text.c:2761)
==9932==    by 0x21019380: qemuDomainRevertToSnapshot (qemu_driver.c:14152)
==9932==    by 0x5195370: virDomainRevertToSnapshot (libvirt.c:19324)
==9932==    by 0x133CEC: remoteDispatchDomainRevertToSnapshotHelper (remote_dispatch.h:7121)
==9932==    by 0x51DE391: virNetServerProgramDispatch (virnetserverprogram.c:437)
==9932==    by 0x15151C: virNetServerHandleJob (virnetserver.c:172)
==9932==    by 0x50DAB14: virThreadPoolWorker (virthreadpool.c:145)
==9932==    by 0x50DA4AD: virThreadHelper (virthread.c:197)
==9932==    by 0x831DF32: start_thread (pthread_create.c:309)
==9932==    by 0x8A2FDEC: clone (clone.S:111)
==9932==  If you believe this happened as a result of a stack
==9932==  overflow in your program's main thread (unlikely but
==9932==  possible), you can try to increase the size of the
==9932==  main thread stack using the --main-stacksize= flag.
==9932==  The main thread stack size used in this run was 8388608.
==9932== 
==9932== HEAP SUMMARY:
==9932==     in use at exit: 3,205,933 bytes in 24,542 blocks
==9932==   total heap usage: 189,650 allocs, 165,108 frees, 691,331,922 bytes allocated
==9932==

Comment 6 Peter Krempa 2014-09-08 11:00:58 UTC
Fixed upstream by commits:

commit a4065dc3e7b17a5fa6916fca2fb28a0d723fe13d
Author: Jincheng Miao <jmiao>
Date:   Thu Aug 28 19:27:04 2014 +0800

    qemu: snapshot: Acquire job earlier on snapshot revert/delete
    
    The code would lookup the snapshot object before acquiring the job. This
    could lead to a crash as one thread could delete the snapshot object,
    while a second thread already had the reference.
    
    Signed-off-by: Jincheng Miao <jmiao>
    Signed-off-by: Peter Krempa <pkrempa>

commit b3d2a42e80aaee1f322fc9beb98b6ed541574ab3
Author: Peter Krempa <pkrempa>
Date:   Wed Sep 3 15:17:50 2014 +0200

    qemu: snapshot: Fix job handling when creating snapshots
    
    Creating snapshots modifies the domain state. Currently we wouldn't
    enter the job for certain operations although they would modify the
    state. Refactor job handling so that everything is covered by an async
    job.


v1.2.8-49-ga4065dc
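
The crash in comment 5 is the second half of the problem: the snapshot object was looked up before the job was acquired, so a concurrent delete could free it while the revert still held the pointer. A self-contained sketch of the reordering described in commit a4065dc (made-up names, not libvirt source):

/* order.c: models the lookup-before-job use-after-free and the fix. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t job = PTHREAD_MUTEX_INITIALIZER;
static char *snap;                     /* shared snapshot "object" */

static void *delete_job(void *arg)
{
    pthread_mutex_lock(&job);          /* delete runs under the job */
    free(snap);
    snap = NULL;
    pthread_mutex_unlock(&job);
    return NULL;
}

static void revert_job(void)
{
    /* Pre-fix order: look up `snap` HERE, then take the job; the delete
     * thread could free it in between, leaving a dangling pointer (the
     * invalid reads valgrind shows in comment 5). Fixed order: */
    pthread_mutex_lock(&job);          /* 1. acquire the job first */
    if (snap)                          /* 2. only then look the object up */
        printf("reverting to %s\n", snap);
    else
        fprintf(stderr, "snapshot is gone, fail cleanly\n");
    pthread_mutex_unlock(&job);
}

int main(void)
{
    pthread_t t;
    snap = strdup("1408689049");
    pthread_create(&t, NULL, delete_job, NULL);
    revert_job();                      /* either ordering of the two jobs is now safe */
    pthread_join(t, NULL);
    free(snap);
    return 0;
}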

Comment 9 Shanzhi Yu 2014-09-18 06:17:16 UTC
Trying to verify this bug with libvirt-1.2.8-2.el7.

Steps:

1) Prepare a running guest with OS installed

2) Create internal snapshot for guest

3) Open three terminals and run the following three commands at almost the same time

terminal 1

# while true; do virsh snapshot-revert rh7-g s; done

terminal 2

# while true; do virsh snapshot-delete rh7-g s; done

terminal 3

# while true; do virsh snapshot-create-as rh7-g s; done

Result:

1) libvirtd does not crash.
2) The "Invalid target domain state 'nostate'" error no longer appears; instead the commands fail with "error: Timed out during operation: cannot acquire state change lock".
3) grep error /tmp/libvirtd.log
2014-09-18 05:47:57.000+0000: 4585: error : qemuDomainObjBeginJobInternal:1327 : Timed out during operation: cannot acquire state change lock
4) After step 3) has run for a long time, stopping the three loops and creating another snapshot succeeds.

One question: running "virsh save rh7-g /tmp/rh7-g.save" while step 3) is in progress fails with the same "cannot acquire state change lock" error.
Moreover, while a snapshot-create job is running, destroying the guest takes a long time:

# time virsh destroy rh7-g
Domain rh7-g destroyed

real	0m10.288s
user	0m0.012s
sys	0m0.007s

Is this expected, or a little over-protected? If it is not a problem, I will change this bug to VERIFIED status.

Thanks

Comment 10 Shanzhi Yu 2014-09-18 06:33:33 UTC
Besides, should I clone this one to RHEL 6.6? Reverting/deleting/creating the same snapshot at the same time easily crashes libvirtd there.

Comment 11 Peter Krempa 2014-09-18 08:20:00 UTC
(In reply to Shanzhi Yu from comment #9)
> I have a problem, try to do "virsh save rh7-g /tmp/rh7-g.save" when step 3)
> is in process will fail with same error "can't acquire state change lock". 
> Moreover, when there is a snapshot-create job, destroy guest will take long
> time
> 
> # time virsh destroy rh7-g
> Domain rh7-g destroyed
> 
> real	0m10.288s
> user	0m0.012s
> sys	0m0.007s
> 
> Is this expected ? or a little over-protected? If it is not a problem, I
> will change this bug to VERIFIED status.

Internal snapshots apparently make qemu get stuck. We retry killing it (sometimes with more force) until it finishes. If the destroy succeeded after a while, it should be okay.

Comment 12 Peter Krempa 2014-09-18 08:20:37 UTC
(In reply to Shanzhi Yu from comment #10)
> Besides, Should I clone this one to RHEL6.6 ? 
> Since revert/delete/create same snapshot at same time will easily crash
> libvirtd.

It's too late for 6.6. 6.7 might be better if you are going to clone it.

Comment 13 Shanzhi Yu 2014-09-18 09:17:07 UTC
Changing to VERIFIED status according to comment 11, and cloning a new bug to RHEL 6.

Comment 15 errata-xmlrpc 2015-03-05 07:43:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html

