Bug 1591628

Summary: Error occurs when reverting a running domain to a running snapshot with "--force"
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.6
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Yanqiu Zhang <yanqzhan>
Assignee: John Ferlan <jferlan>
QA Contact: yisun
CC: chwen, dyuan, fjin, hhan, jferlan, lmen, meili, mzhan, xuzhang, yafu, yanqzhan
Target Milestone: rc
Target Release: ---
Keywords: Automation, Regression
Fixed In Version: libvirt-4.5.0-1.el7
Doc Type: No Doc Update
Type: Bug
Last Closed: 2018-10-30 09:56:58 UTC
Bug Blocks: 1149445, 1598348, 2000506
Attachments: libvirtd_qemu_logs (flags: none)

Description Yanqiu Zhang 2018-06-15 07:09:01 UTC
Description of problem:
An error occurs when reverting a running domain to a running snapshot with "--force".

Version-Release number of selected component (if applicable):
libvirt-4.4.0-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a guest and create a running snapshot
# virsh start avocado-vt-vm4
Domain avocado-vt-vm4 started

# virsh snapshot-create-as avocado-vt-vm4 avocado-vt-vm4.s1
Domain snapshot avocado-vt-vm4.s1 created

2. Add an rng device or attach a second disk
# qemu-img create -f qcow2 /var/lib/libvirt/images/bar.img 10M
Formatting '/var/lib/libvirt/images/bar.img', fmt=qcow2 size=10485760 cluster_size=65536 lazy_refcounts=off refcount_bits=16

# virsh attach-disk avocado-vt-vm4 /var/lib/libvirt/images/bar.img vdb --live
Disk attached successfully

3. Try to revert to the first snapshot (expected failure)
# virsh snapshot-revert avocado-vt-vm4 avocado-vt-vm4.s1
error: revert requires force: Target domain disk count 1 does not match source 2

4. Try to revert with "--force"
# virsh snapshot-revert avocado-vt-vm4 avocado-vt-vm4.s1 --force
error: internal error: unexpected async job 6

5. Check guest status
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 2     avocado-vt-vm4                 paused

# virsh resume avocado-vt-vm4
error: Failed to resume domain avocado-vt-vm4
error: internal error: unable to execute QEMU command 'cont': Expecting capabilities negotiation with 'qmp_capabilities'


Actual results:
In step 4, snapshot-revert with --force fails with an error.
In step 5, the guest is paused and cannot be resumed.

Expected Results:
"snapshot-revert --force" should succeed without error.
The guest should remain running, as it was when the first snapshot was taken.

Additional info:
Not reproducible on libvirt-3.9.0-14.el7_5.6.x86_64.
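
For automation, the state check in step 5 could be scripted roughly as below. This is a minimal sketch, not part of the original report: the helper name parse_virsh_list is hypothetical, it assumes the three-column table layout shown in step 5, and it assumes domain names contain no spaces. The sample text is copied from the step 5 output rather than captured from a live host.

```python
def parse_virsh_list(output):
    """Map domain name -> state from `virsh list --all` table text.

    Assumes the Id / Name / State layout shown in the report and
    domain names without embedded whitespace.
    """
    states = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the dashed separator, and the header row.
        if len(fields) < 3 or fields[0] == "Id":
            continue
        # The state can be more than one word (for example "shut off").
        states[fields[1]] = " ".join(fields[2:])
    return states


# Sample copied from step 5 of the report.
SAMPLE = """\
 Id    Name                           State
----------------------------------------------------
 2     avocado-vt-vm4                 paused
"""

if __name__ == "__main__":
    state = parse_virsh_list(SAMPLE)["avocado-vt-vm4"]
    # The expected result is "running"; the report observed "paused".
    print("state after revert:", state)
```

A test harness would feed the real command output into the same helper and fail when the state is anything other than "running".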

Comment 2 Yanqiu Zhang 2018-06-15 07:13:46 UTC
Created attachment 1451788 [details]
libvirtd_qemu_logs

Comment 4 Han Han 2018-06-19 03:22:31 UTC
Could you please check whether managedsave hangs and libvirtd then deadlocks after snapshot-revert --force?

Comment 5 Yanqiu Zhang 2018-06-19 05:17:05 UTC
(In reply to Han Han from comment #4)
> Could you please check whether managedsave hangs and libvirtd then deadlocks
> after snapshot-revert --force?

Checked, and the answer is yes.

Comment 6 John Ferlan 2018-06-20 17:42:52 UTC
Since this seems to be the root cause of the issue hit during the vmgenid validation effort and of the resulting needinfo on me (https://bugzilla.redhat.com/show_bug.cgi?id=1149445#c10), I investigated whether what I had ready to post would also solve this problem, and it did.

So I posted a patch series upstream to resolve it:

https://www.redhat.com/archives/libvir-list/2018-June/msg01425.html

Patch 5 is the actual fix; patches 1-4 set a few things up and also make a minor adjustment to the way vmgenid (see bz1149445) is handled for the --force option.

Comment 7 John Ferlan 2018-06-20 22:47:51 UTC
This has now been pushed upstream:

commit 0c4408c832368b45c8246175e2a75132d3ff0302
Author: John Ferlan <jferlan>
Date:   Tue Jun 19 18:54:19 2018 -0400

    qemu: Don't use asyncJob after stop during snapshot revert
    
...
    
    Attempting to use the FORCE flag for snapshot-revert was resulting
    in failures because qemuProcessStart and qemuProcessStartCPUs were
    using QEMU_ASYNC_JOB_START after a qemuProcessStop resulting in an
    error when entering the monitor:
    
    error: internal error: unexpected async job 6 type expected 0
    
    So create a local @jobType, initialize to QEMU_ASYNC_JOB_START, and
    change to QEMU_ASYNC_JOB_NONE if we end up in the --force path
    where the qemuProcessStop is run before a Start and StartCPUs.
    
$  git describe 0c4408c832368b45c8246175e2a75132d3ff0302
v4.4.0-296-g0c4408c832
$
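
The logic described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not libvirt code: the Domain class, its methods, and the revert() helper with its "fixed" switch are hypothetical stand-ins, and the numeric values 6 and 0 are taken from the error text quoted above. It assumes, as the commit message implies, that the domain has no async job once the stop has run, so entering the monitor with the start job then fails.

```python
from enum import IntEnum


class AsyncJob(IntEnum):
    NONE = 0   # from "type expected 0" in the quoted error
    START = 6  # from "unexpected async job 6" in the quoted error


class Domain:
    """Hypothetical stand-in for a domain that tracks its async job."""

    def __init__(self):
        self.async_job = AsyncJob.NONE

    def begin_async_job(self, job):
        self.async_job = job

    def process_stop(self):
        # Assumption from the commit message: after the stop, the
        # domain no longer has an async job.
        self.async_job = AsyncJob.NONE

    def enter_monitor(self, job):
        # A caller passing an async job must match the current one.
        if job != AsyncJob.NONE and job != self.async_job:
            raise RuntimeError(
                "unexpected async job %d type expected %d"
                % (int(job), int(self.async_job))
            )


def revert(dom, force, fixed):
    """Model a snapshot revert; 'fixed' toggles the local job type change."""
    dom.begin_async_job(AsyncJob.START)
    job_type = AsyncJob.START
    if force:
        dom.process_stop()
        if fixed:
            # The fix: stop using the async job after the stop.
            job_type = AsyncJob.NONE
    dom.enter_monitor(job_type)  # stands in for Start and StartCPUs
    return job_type


if __name__ == "__main__":
    print(revert(Domain(), force=True, fixed=True).name)
    try:
        revert(Domain(), force=True, fixed=False)
    except RuntimeError as err:
        print("without the fix:", err)
```

In the model, only the forced path without the local job type change raises; the non-forced path keeps using the start job because nothing has reset it.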

Comment 10 yisun 2018-08-20 09:34:45 UTC
Verified on:
libvirt-4.5.0-6.el7.x86_64
qemu-kvm-rhev-2.12.0-10.el7.x86_64

(Tested VMs with and without <genid/>; more detailed genid testing will be carried out in bz1149445.)
# virsh domblklist vm2
Target     Source
------------------------------------------------
vda        /var/lib/libvirt/images/vm2-1.qcow2


# qemu-img info /var/lib/libvirt/images/vm2-1.qcow2 -U --backing-chain
image: /var/lib/libvirt/images/vm2-1.qcow2
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 10G
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: true
    refcount bits: 16
    corrupt: false

# virsh snapshot-create-as vm2 vm2.s1
Domain snapshot vm2.s1 created

# qemu-img info /var/lib/libvirt/images/vm2-1.qcow2 -U --backing-chain
image: /var/lib/libvirt/images/vm2-1.qcow2
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 10G
cluster_size: 65536
Snapshot list:
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         vm2.s1                 487M 2018-08-20 17:26:14   00:02:41.660
Format specific information:
    compat: 1.1
    lazy refcounts: true
    refcount bits: 16
    corrupt: false


# qemu-img create -f qcow2 /var/lib/libvirt/images/bar.img 10M
Formatting '/var/lib/libvirt/images/bar.img', fmt=qcow2 size=10485760 cluster_size=65536 lazy_refcounts=off refcount_bits=16


# virsh attach-disk vm2 /var/lib/libvirt/images/bar.img vdb --live
Disk attached successfully

# virsh domblklist vm2
Target     Source
------------------------------------------------
vda        /var/lib/libvirt/images/vm2-1.qcow2
vdb        /var/lib/libvirt/images/bar.img

# virsh snapshot-revert vm2 vm2.s1
error: revert requires force: Target domain disk count 1 does not match source 2

# virsh snapshot-revert vm2 vm2.s1 --force

# echo $?
0

# virsh domblklist vm2
Target     Source
------------------------------------------------
vda        /var/lib/libvirt/images/vm2-1.qcow2
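
The before/after disk comparison in this verification could be checked automatically along these lines. This is a rough sketch, not part of the original verification: the helper name parse_domblklist is hypothetical, and it assumes the two-column Target / Source layout shown above. The sample text is copied from the transcripts in this comment.

```python
def parse_domblklist(output):
    """Map disk target -> source path from `virsh domblklist` table text.

    Assumes the two-column Target / Source layout shown in this comment.
    """
    disks = {}
    for line in output.splitlines():
        fields = line.split(None, 1)
        # Skip blank lines, the dashed separator, and the header row.
        if len(fields) != 2 or fields[0] == "Target":
            continue
        disks[fields[0]] = fields[1].strip()
    return disks


# Output copied from the transcript: after attach-disk, then after revert.
BEFORE_REVERT = """\
Target     Source
------------------------------------------------
vda        /var/lib/libvirt/images/vm2-1.qcow2
vdb        /var/lib/libvirt/images/bar.img
"""

AFTER_REVERT = """\
Target     Source
------------------------------------------------
vda        /var/lib/libvirt/images/vm2-1.qcow2
"""

if __name__ == "__main__":
    before = parse_domblklist(BEFORE_REVERT)
    after = parse_domblklist(AFTER_REVERT)
    # Disks present before the forced revert but absent afterwards.
    print("dropped by revert:", sorted(set(before) - set(after)))
```

A passing check here means the forced revert returned the domain to the single-disk layout it had when the snapshot was taken.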

Comment 12 errata-xmlrpc 2018-10-30 09:56:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113