Bug 1149445 (vmgenid-libvirt)

Summary: [RFE] Detection of cloned environment using a unique, immutable, intelligent identifier programmatically accessible - libvirt
Product: Red Hat Enterprise Linux 7 Reporter: Ronen Hod <rhod>
Component: libvirt    Assignee: John Ferlan <jferlan>
Status: CLOSED ERRATA QA Contact: yisun
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.1    CC: berrange, cww, dyuan, fjin, ghammer, hhan, hhuang, jferlan, juzhang, juzhou, knoel, lersek, lmen, marcandre.lureau, mbaissac, michen, mtessun, mzhan, pstehlik, rjones, tzheng, virt-bugs, virt-maint, xfu, xuzhang
Target Milestone: rc    Keywords: FutureFeature
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-4.4.0-1.el7 Doc Type: Enhancement
Doc Text:
Feature: Add support for VM Generation ID. Reason: Allows the guest to detect when it is potentially re-executing something that has already been executed before. Result: The VM Generation ID exposes a 128-bit, cryptographically random integer value, referred to as a Globally Unique Identifier (GUID), to the guest in order to notify the guest operating system when the virtual machine is executed with a different configuration. Adds new domain XML processing and a domain capabilities feature.
Story Points: ---
Clone Of: vmgenid
: 1598348 1598350 (view as bug list) Environment:
Last Closed: 2018-10-30 09:49:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1118834, 1159983, 1591628    
Bug Blocks: 1118825, 1159981, 1477664, 1598348, 1598350, 1599139, 2000506    
Attachments:
Description Flags
the threads backtrace of managedsave deadlock none

Description Ronen Hod 2014-10-05 09:17:16 UTC
+++ This bug was initially created as a clone of Bug #1118834 +++

+++ This bug was initially created as a clone of Bug #1118825 +++

Description of problem:
Making copies of VMs is great - until it isn't. The typical Windows example: someone clones a VM that is running Active Directory - and now there are two sources of truth for authentication/authorization - a bad thing.

Microsoft has established a standard identifier - using the ACPI driver - to identify when a clone has been created in a "bad way." This standard is now supported by VMware, XenServer and, of course, Hyper-V. See http://go.microsoft.com/fwlink/?LinkId=260709 for more information.

Flexera Software takes advantage of this standard to detect when a user has incorrectly cloned a license server (where a user then has two license servers running licenses - the same as having two Active Directory instances). We then enable the producers to implement policies such as "render the license server inoperable" or "give the user time to fix this." We have data showing that this particular scenario happens to roughly 20% of all global software licenses (and often by accident).

Our ask is that Red Hat/the Linux community also establish something of the same flavor so that software producers can help customers stay compliant with their software licenses.


Version-Release number of selected component (if applicable): All


How reproducible: N/A


Steps to Reproduce: 
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Ronen Hod on 2014-10-05 04:42:26 EDT ---

Closing bug#1139005 as duplicate.
There it says:
This functionality is new to Hyper-V in Windows Server 2012 or Windows 8 and is designed to differentiate (security-wise) between instances of a VM that were generated using the exact same disk image / snapshot.
The implementation is based on a new hypervisor device that returns a 128-bit, cryptographically random integer value identifier that will be different every time the virtual machine executes from a different configuration file, such as executing from a recovered snapshot, or executing after restoring from backup.

--- Additional comment from Ronen Hod on 2014-10-05 04:44:02 EDT ---

Comment 1 Daniel Berrangé 2014-10-06 09:11:35 UTC
QEMU patches are exposing a new device which must be specified on the command line to set an ACPI field:

  http://lists.nongnu.org/archive/html/qemu-devel/2014-09/msg03447.html
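
For reference, a minimal sketch of the device fragment on the QEMU command line (the GUID value here is illustrative; the full QEMU invocations captured later in comment 26 use the same form):

  -device vmgenid,guid=7f27592e-e320-49b0-8e49-6f4ffd6957de,id=vmgenid0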

We need to comply with this specification:

  http://go.microsoft.com/fwlink/?LinkId=260709

There are some rules on when the VM generation ID *must* change:

 Virtual machine is paused or resumed: No
 Virtual machine reboots: No
 Virtual machine host reboots: No
 Virtual machine starts executing a snapshot (every time): Yes
 Virtual machine is recovered from backup: Yes
 Virtual machine is failed over in a disaster recovery environment: Yes
 Virtual machine is live migrated: No
 Virtual machine is imported, copied, or cloned: Yes
 Virtual machine is failed over in a clustered environment: No
 Virtual machine's configuration changes: Unspecified

Most of those can be achieved by specifying a new value for the device in the CLI args, but if reverting to a snapshot in a running guest, we need QEMU monitor command support to change the VM generation ID. This does not appear to be included in the QEMU work.

Comment 2 Daniel Berrangé 2014-12-10 15:48:10 UTC
Latest upstream QEMU posting of this feature

https://lists.gnu.org/archive/html/qemu-devel/2014-12/msg01083.html

It now includes the ability to set it via the QMP monitor too. So we'll need to do a few changes to wire this up into libvirt so it changes correctly on each snapshot revert.

Comment 4 John Ferlan 2015-05-19 12:25:28 UTC
Move to 7.3

Comment 5 John Ferlan 2016-06-22 15:03:23 UTC
Move to consideration for 7.4

Comment 7 John Ferlan 2018-04-11 20:12:06 UTC
Posted patches upstream:

https://www.redhat.com/archives/libvir-list/2018-April/msg00913.html

Basic testing done:

Using <genid/>:
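
For reference, the two element forms exercised below, as they would appear in the domain XML (a sketch; placement as a direct child of <domain> is assumed here):

  <genid/>                                              <!-- libvirt auto-generates a GUID at start -->
  <genid>7f27592e-e320-49b0-8e49-6f4ffd6957de</genid>   <!-- an explicitly supplied GUID -->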

Not running or provided, shouldn't display a GUID:
# virsh dumpxml dom | grep genid
  <genid/>

Start the guest:
# virsh start dom
Domain dom started

Running, so the running config should show the GUID, while the inactive will not:
# virsh dumpxml dom | grep genid
  <genid>2847a01f-f564-4cf7-8b86-48be06f2124f</genid>
# virsh dumpxml dom --inactive| grep genid
  <genid/>

Destroy the guest, check that the GUID isn't left:
# virsh destroy dom
Domain dom destroyed

# virsh dumpxml dom | grep genid
  <genid/>

Start the guest again:
# virsh start dom
Domain dom started

We should get a different GUID:
# virsh dumpxml dom | grep genid
  <genid>8fda1967-2d0b-4e36-8528-9659856baa78</genid>


Save the domain:
# virsh save dom dom.save

Domain dom saved to dom.save

Restoring the domain should change the GUID:
# virsh restore dom.save
Domain restored from dom.save

# virsh dumpxml dom | grep genid
  <genid>7f27592e-e320-49b0-8e49-6f4ffd6957de</genid>

Save the domain to an XML file, edit it to change the name, remove the UUID, and alter the disk used for booting, but leave the GUID.  This should fail to create/start:

# virsh dumpxml dom > dom2.xml
<edit as described>
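
(A sketch of the kind of edits meant above; the names and paths are illustrative, the point being that <genid> is left untouched:)

  <name>dom</name>                ->  <name>dom2</name>
  <uuid>...</uuid>                ->  (removed)
  <source file='.../dom.img'/>    ->  <source file='.../dom2.img'/>
  <genid>7f27592e-...</genid>     ->  (unchanged)
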
# virsh create dom2.xml
error: Failed to create domain from dom2.xml
error: unsupported configuration: domain 'dom' already running with genid '7f27592e-e320-49b0-8e49-6f4ffd6957de'

#

Now destroy dom and start dom2, which should succeed:

# virsh destroy dom
Domain dom destroyed

# virsh create dom2.xml
Domain dom2 created from dom2.xml

# virsh dumpxml dom2 | grep genid
  <genid>7f27592e-e320-49b0-8e49-6f4ffd6957de</genid>

BTW: This also shows the path of starting when <genid> is set to something.


Starting dom will still work because <genid/> results in auto-generation of a new GUID:

# virsh start dom
Domain dom started

# virsh dumpxml dom | grep genid
  <genid>448f77ab-f0a3-4f51-a34b-c1e04cb3b335</genid>


Using virt-clone to clone the domain:

# virt-clone --original dom --auto-clone 
Allocating 'dom-clone.img'    |  10 GB  00:00:10     

Clone 'dom-clone' created successfully.

# virsh dumpxml dom-clone | grep genid
  <genid/>
#

If the virt-clone occurs on a domain with a defined <genid> then the GUID is copied to the clone, e.g.:

# virsh dumpxml dom | grep genid
  <genid>7f27592e-e320-49b0-8e49-6f4ffd6957de</genid>
# virt-clone --original dom --auto-clone 
...
# virsh dumpxml dom-clone | grep genid
  <genid>7f27592e-e320-49b0-8e49-6f4ffd6957de</genid>
#

This is expected since all virt-clone is doing is dumping the XML of the inactive guest and reformatting it, changing the name, uuid, source disk, and mac-addr; however, both domains would not be able to run at the same time. The virt-clone code still needs an adjustment to handle that, but that would require this code to be present, so it's a chicken/egg problem...

This also could become a non-issue if review decides that we shouldn't support supplying a GUID and only support <genid/> on read, but still display the GUID for the active domain.

Didn't perform snapshot testing

Comment 8 John Ferlan 2018-05-17 12:44:37 UTC
Forgot to update since original posting, most recent set of patches here:

https://www.redhat.com/archives/libvir-list/2018-May/msg01332.html

Comment 9 John Ferlan 2018-05-25 12:37:46 UTC
After a rebase or two and a few adjustments from review, the patches are now pushed upstream:

commit 87973a45f97dcbf0c515d9104e068094a09c74b5
Author: John Ferlan <jferlan>
Date:   Tue Mar 20 18:29:46 2018 -0400

...
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1149445
    
    If the domain requests usage of the genid functionality,
    then add the QEMU '-device vmgenid' to the command line
    providing either the supplied or generated GUID value.
    
    Add tests for both a generated and supplied GUID value.
    
...

$ git describe 87973a45f97dcbf0c515d9104e068094a09c74b5
v4.3.0-311-g87973a45f9
$

Comment 10 Han Han 2018-06-05 09:20:50 UTC
Hi John, I did a simple test of this genid feature and got the following results:
1. VM is paused or resumed, genid not changed, as expected
2. VM reboots, genid not changed, as expected
3. VM starts executing a snapshot and then save/restore:
3.1 I tried an internal snapshot and got an error when reverting to the internal snapshot:
# virsh snapshot-create-as RHEL8  s1    
Domain snapshot s1 created

# virsh snapshot-revert RHEL8 s1 --force
error: internal error: unexpected async job 6

# virsh managedsave RHEL8

Libvirtd gets a deadlock here. There is no response to any other virsh commands, and the managedsave keeps running.

Please check the 3rd scenario. I will attach the backtrace of all threads of the deadlock later.

Comment 11 Han Han 2018-06-05 09:24:19 UTC
Created attachment 1447777 [details]
the threads backtrace of managedsave deadlock

Comment 12 Han Han 2018-06-05 09:43:12 UTC
My libvirt version: v4.4.0-2-g67c56f6

Comment 13 John Ferlan 2018-06-05 22:20:45 UTC
First off, I neglected to mention that there were a couple of changes between the comment 7 posting and test results and the final result from comment 9, mainly around the inactive display value being printed.  Originally the thought was that if someone put <genid/>, then the inactive XML would print that; however, review of the code indicated that if we have <genid/> and we generate a GUID, then we store the GUID in the inactive domain as well. If you stop, then start the domain, the GUID stays the same...  I also had to drop the entire "duplicated" GUID check against other domains as it was deemed unnecessary.

As for the hang - nothing immediately springs to mind and there's *a lot* of data in that attached dump - a *lot* of qemu threads which I have no idea about. One thing that does stick out about the qemu thread traces is the string "/usr/src/debug/qemu-2.10.0/" present in many. There are also quite a few curious stack traces which would make it appear as if something in qemu has gone awry, but I don't know enough about those traces to say definitively one way or the other whether qemu is to blame here. I know what I want to think!

There are a couple of libvirt threads that are interesting towards the bottom of the data. In particular Thread 19 (Thread 0x7efcc0c98700 (LWP 16885)), Thread 17 (Thread 0x7efcbfc96700 (LWP 16887)), Thread 12 (Thread 0x7efcbdc92700 (LWP 16892)), and Thread 2 (Thread 0x7efc7eb28700 (LWP 18286)).

Thread 19 is in the middle of a virDomainGetJobInfo (I assume from a virsh caller).  It's waiting for uuid "2bbf75be-9da7-4d81-9e02-0d4295eef09e" and has obj=0x7efc6c61d8c0 (from frame #4).

Thread 17 is at the end of qemuDomainSaveMemory, doing a virThreadJoin from the qemuDomainManagedSave, and will have the @vm locked during this period. The vm is 0x7efc6c61d8c0 (from frame #4). So we have a match with thread 19.

Thread 12 is in the middle of a virConnectListAllDomains (perhaps a virsh list --all operation).  It's waiting for a specific vm's lock to be free, where vm=0x7efc6c61d8c0 (from frame #4), again a match with thread 19.

Thread 2 looks like it's in the middle of the virCommandRunAsync that would be the virCommandRunAsync "target" (so to speak) of the qemuMigrationSrcToFile (see _tid = 18286 in frame 0 of Thread 17). This is essentially where things are stuck.  I'm assuming it's waiting for qemu, but I'm not all that familiar with that code path and I'm really not sure how to read the qemu information to find related threads.

Could you run the same test without <genid/> in the VM - if you get a hang, then at least vmgenid could be absolved. If things work, put genid in and try again.

Before anyone spends too much time on this, is there enough free disk space for "/var/lib/libvirt/qemu/save/RHEL8.save"?

I'll look again a bit more and update with anything else I find, but it may be best to narrow things down by trying the same steps without genid, with genid, without genid, etc. to really see if it's genid that is the problem.
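
(For anyone reproducing: a thread dump like the attached one can be captured from the running libvirtd with gdb - a sketch, assuming libvirtd is the only libvirtd process on the host:)

  # gdb -p $(pidof libvirtd)
  (gdb) set pagination off
  (gdb) thread apply all backtrace
  (gdb) detach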

Comment 14 Han Han 2018-06-06 02:44:11 UTC
(In reply to John Ferlan from comment #13)
> First off I neglected to mention that there were a couple changes from the
> comment 7 posting and test results and the final result from comment 9. 
> Mainly around inactive display value being printed.  Originally the thought
> was if someone put <genid/>, then the inactive XML would print that;
> however, review of the code indicated that if we have <genid/> and we
> generate a GUID, then we store the GUID in the inactive domain as well. If
> you stop, then start the domain the GUID stays the same...  I also had to
> drop the entire "duplicated" GUID check for another domain as it was deemed
> unnecessary.
Do you mean the active genid not changing after VM destroy && start is the expected result?
I tested it. It is not changed after destroy && start.
> 
> As for the hang - nothing immediately springs to mind and there's *a lot* of
> data in that attached dump - a *lot* of qemu threads which I have no idea
> about. One thing that does stick out about the qemu thread traces is the
> string "/usr/src/debug/qemu-2.10.0/" present in many. There's also quite a
> few which curious stack traces which would make it appear as if something in
> qemu has gone awry, but I don't know enough about those traces to say one
> way or another definitively that qemu would be to blame here. I know what I
> want to think!
> 
> There's a couple libvirt threads that are interesting towards the bottom of
> the data. In particular Thread 19 (Thread 0x7efcc0c98700 (LWP 16885)),
> Thread 17 (Thread 0x7efcbfc96700 (LWP 16887)), Thread 12 (Thread
> 0x7efcbdc92700 (LWP 16892)), and Thread 2 (Thread 0x7efc7eb28700 (LWP
> 18286)).
> 
> Thread 19 is in the middle of a virDomainGetJobInfo (I assume from a virsh
> caller).  It's waiting for uuid "2bbf75be-9da7-4d81-9e02-0d4295eef09e" and
> has obj=0x7efc6c61d8c0 (from frame #4).
> 
> Thread 17 is in the end of qemuDomainSaveMemory doing a virThreadJoin  from
> the qemuDomainManagedSave and will have the @vm locked during this period.
> The vm is 0x7efc6c61d8c0 (from frame #4). So we have a match with thread 19.
> 
> Thread 12 is in the middle of a virConnectListAllDomains (perhaps a virsh
> list --all operation).  It's waiting to a specific vm's lock to be free,
> where vm=0x7efc6c61d8c0 (from frame #4) and again a match with thread 19.
> 
> Thread 2 looks like it's in the middle of the virCommandRunAsync that would
> be the virCommandRunAsync "target" (so to speak) of the
> qemuMigrationSrcToFile (see _tid = 18286 in frame 0 of Thread 17). This is
> essentially where things are stuck.  I'm assuming waiting for qemu, but I'm
> not all that familiar with that code path and I'm really not sure how to
> read the qemu information to find related threads.
> 
> Could you run the same test without <genid/> in the VM - if you get a hang,
> then at least vmgenid could be absolved. If things work, put genid in and
> try again.
For VM without <genid/>, it works well:
#!/bin/bash
virsh start RHEL8
sleep 3
virsh snapshot-create-as RHEL8 s1
virsh snapshot-revert RHEL8 s1 --force
virsh managedsave RHEL8
virsh start RHEL8

Result:
Domain RHEL8 started

Domain snapshot s1 created

Domain RHEL8 state saved by libvirt

Domain RHEL8 started

> 
> Before anyone spends too much time on this, is there enough free disk space
> for "/var/lib/libvirt/qemu/save/RHEL8.save"?
Free disk space is enough here.
> 
> I'll look again a bit more and update with anything else I find, but it may
> be best to narrow things down by trying the same steps without genid, with
> genid, without genid, etc. to really see if it's genid that is the problem.

Comment 15 John Ferlan 2018-06-06 10:32:47 UTC
(In reply to Han Han from comment #14)
> (In reply to John Ferlan from comment #13)
> > First off I neglected to mention that there were a couple changes from the
> > comment 7 posting and test results and the final result from comment 9. 
> > Mainly around inactive display value being printed.  Originally the thought
> > was if someone put <genid/>, then the inactive XML would print that;
> > however, review of the code indicated that if we have <genid/> and we
> > generate a GUID, then we store the GUID in the inactive domain as well. If
> > you stop, then start the domain the GUID stays the same...  I also had to
> > drop the entire "duplicated" GUID check for another domain as it was deemed
> > unnecessary.
> Do you mean active genid not changed after VM destroy&&start is expected
> result?
> I tested it. It is not changed after destroy&start.

No, I was referring to the printing of the GUID.  Assume you modify the domain to have just <genid/>... Then start the domain, and you'll see:

# virsh dumpxml $dom | grep genid
  <genid>d409e383-9afd-426d-b959-9fbf42d7d1fb</genid>
# virsh dumpxml $dom --inactive | grep genid
  <genid>d409e383-9afd-426d-b959-9fbf42d7d1fb</genid>


> > 
> > As for the hang - nothing immediately springs to mind and there's *a lot* of
> > data in that attached dump - a *lot* of qemu threads which I have no idea
> > about. One thing that does stick out about the qemu thread traces is the
> > string "/usr/src/debug/qemu-2.10.0/" present in many. There's also quite a
> > few which curious stack traces which would make it appear as if something in
> > qemu has gone awry, but I don't know enough about those traces to say one
> > way or another definitively that qemu would be to blame here. I know what I
> > want to think!
> > 
> > There's a couple libvirt threads that are interesting towards the bottom of
> > the data. In particular Thread 19 (Thread 0x7efcc0c98700 (LWP 16885)),
> > Thread 17 (Thread 0x7efcbfc96700 (LWP 16887)), Thread 12 (Thread
> > 0x7efcbdc92700 (LWP 16892)), and Thread 2 (Thread 0x7efc7eb28700 (LWP
> > 18286)).
> > 
> > Thread 19 is in the middle of a virDomainGetJobInfo (I assume from a virsh
> > caller).  It's waiting for uuid "2bbf75be-9da7-4d81-9e02-0d4295eef09e" and
> > has obj=0x7efc6c61d8c0 (from frame #4).
> > 
> > Thread 17 is in the end of qemuDomainSaveMemory doing a virThreadJoin  from
> > the qemuDomainManagedSave and will have the @vm locked during this period.
> > The vm is 0x7efc6c61d8c0 (from frame #4). So we have a match with thread 19.
> > 
> > Thread 12 is in the middle of a virConnectListAllDomains (perhaps a virsh
> > list --all operation).  It's waiting to a specific vm's lock to be free,
> > where vm=0x7efc6c61d8c0 (from frame #4) and again a match with thread 19.
> > 
> > Thread 2 looks like it's in the middle of the virCommandRunAsync that would
> > be the virCommandRunAsync "target" (so to speak) of the
> > qemuMigrationSrcToFile (see _tid = 18286 in frame 0 of Thread 17). This is
> > essentially where things are stuck.  I'm assuming waiting for qemu, but I'm
> > not all that familiar with that code path and I'm really not sure how to
> > read the qemu information to find related threads.
> > 
> > Could you run the same test without <genid/> in the VM - if you get a hang,
> > then at least vmgenid could be absolved. If things work, put genid in and
> > try again.
> For VM without <genid/>, it works well:
> #!/bin/bash
> virsh start RHEL8
> sleep 3
> virsh snapshot-create-as RHEL8 s1
> virsh snapshot-revert RHEL8 s1 --force
> virsh managedsave RHEL8
> virsh start RHEL8
> 
> Result:
> Domain RHEL8 started
> 
> Domain snapshot s1 created
> 
> Domain RHEL8 state saved by libvirt
> 
> Domain RHEL8 started
> 

Very strange. At the point you are at, there is nothing that libvirt does with the genid value. All that is happening is QEMU based. So I wonder if there's some gotcha that I don't know about deeper in the QEMU code. I'll try to reproduce, but my domains failed to create an internal snapshot ("error: unsupported configuration: internal snapshot for disk vda unsupported for storage type raw"). I'm not much of a consumer of snapshots.
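
(Side note: internal snapshots need qcow2-backed disks, so one way to get a test guest there - a sketch, with the guest shut off and illustrative paths - is:

  # qemu-img convert -f raw -O qcow2 /var/lib/libvirt/images/dom.img /var/lib/libvirt/images/dom.qcow2

then point the domain's <disk> source at the qcow2 file and set the driver type to qcow2.)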

 
> > 
> > Before anyone spends too much time on this, is there enough free disk space
> > for "/var/lib/libvirt/qemu/save/RHEL8.save"?
> Free disk space is enough here.
> > 
> > I'll look again a bit more and update with anything else I find, but it may
> > be best to narrow things down by trying the same steps without genid, with
> > genid, without genid, etc. to really see if it's genid that is the problem.

Comment 16 John Ferlan 2018-06-07 16:12:23 UTC
I was able to set up a guest that I could use a snapshot on, and I think I was able to reproduce at least part of what was seen using the top of tree:

# virsh snapshot-create-as f23-qcow2 f23-qcow2-snap
Domain snapshot f23-qcow2-snap created
# virsh snapshot-revert f23-qcow2 f23-qcow2-snap --force
error: internal error: unexpected async job 6

#

I stopped here - something is wrong - need to go determine what...

[while running I had had a second window with libvirtd running in gdb]


2018-06-07 15:50:21.612+0000: 15996: warning : qemuDomainObjTaint:7080 : Domain id=7 name='f23-qcow2' uuid=462543e2-a162-4992-96ee-b24b487cea76 is tainted: high-privileges
Detaching after fork from child process 16565.
Detaching after fork from child process 16570.
Detaching after fork from child process 16572.
[New Thread 0x7fffb24e0700 (LWP 16633)]
2018-06-07 15:52:09.779+0000: 15995: error : qemuDomainRevertToSnapshot:16141 : unsupported configuration: domain genid update requires restart
2018-06-07 15:52:09.779+0000: 15995: error : qemuDomainRevertToSnapshot:16152 : revert requires force: domain genid update requires restart
2018-06-07 15:52:09.784+0000: 15996: error : qemuDomainRevertToSnapshot:16141 : unsupported configuration: domain genid update requires restart
2018-06-07 15:52:10.012+0000: 15996: warning : qemuDomainObjTaint:7080 : Domain id=8 name='f23-qcow2' uuid=462543e2-a162-4992-96ee-b24b487cea76 is tainted: high-privileges
Detaching after fork from child process 16765.
Detaching after fork from child process 16770.
Detaching after fork from child process 16771.
2018-06-07 15:52:10.037+0000: 15996: error : qemuDomainObjBeginNestedJob:6522 : internal error: unexpected async job 6
2018-06-07 15:52:10.037+0000: 15996: error : qemuDomainObjBeginNestedJob:6522 : internal error: unexpected async job 6

hmmm..  Looks like the "unexpected async job 6" overwrote the genid error message... The rest just goes downhill. Now to figure out why getting an error doesn't cause a failure. Off hacking into snapshot land I go, hi ho, hi ho.  But before I do, let's try something...

# virsh snapshot-delete f23-qcow2 f23-qcow2-snap
Domain snapshot f23-qcow2-snap deleted

[ resulting in the debug window getting: ]

2018-06-07 16:02:08.543+0000: 15998: error : qemuMonitorJSONHumanCommandWithFd:1321 : Operation not supported: Human monitor command is not available to run delvm "f23-qcow2-snap"


hmmm... really strange, but let's take the next step...

# virsh managedsave f23-qcow2

[ resulting in the debug window getting: ]

Detaching after fork from child process 16994.
[New Thread 0x7fffb12cc700 (LWP 16995)]
2018-06-07 16:04:14.981+0000: 15995: error : qemuMonitorJSONCheckError:394 : internal error: unable to execute QEMU command 'migrate_set_speed': Expecting capabilities negotiation with 'qmp_capabilities'
2018-06-07 16:04:14.982+0000: 15995: error : qemuMonitorJSONCheckError:394 : internal error: unable to execute QEMU command 'getfd': Expecting capabilities negotiation with 'qmp_capabilities'
2018-06-07 16:04:14.983+0000: 15995: error : qemuMonitorJSONCheckError:394 : internal error: unable to execute QEMU command 'migrate_set_speed': Expecting capabilities negotiation with 'qmp_capabilities'


And the virsh command is just hung

[ back to the gdb window: ]

t a a bt (thread apply all backtrace):

...
Thread 4 (Thread 0x7fffe5fd2700 (LWP 15996)):
#0  0x00007ffff3d4bb1d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ffff3d44ea3 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007ffff74b0c75 in virMutexLock (m=<optimized out>)
    at util/virthread.c:89
#3  0x00007ffff748b78b in virObjectLock (anyobj=<optimized out>)
    at util/virobject.c:429
#4  0x00007ffff750e82b in virDomainObjListFindByUUIDLocked (
    uuid=uuid@entry=0x7fffd0027724 "F%C\342\241bI\222\226\356\262KH|\352v", 
    doms=<optimized out>) at conf/virdomainobjlist.c:147
#5  0x00007ffff750eddd in virDomainObjListFindByUUID (doms=0x7fff94110c80, 
    uuid=uuid@entry=0x7fffd0027724 "F%C\342\241bI\222\226\356\262KH|\352v")
    at conf/virdomainobjlist.c:173
#6  0x00007fffb56dece8 in qemuDomObjFromDomain (
    domain=domain@entry=0x7fffd0027700) at qemu/qemu_driver.c:206
#7  0x00007fffb56ebf64 in qemuDomainGetJobInfo (dom=0x7fffd0027700, 
    info=0x7fffe5fd1a90) at qemu/qemu_driver.c:13623
#8  0x00007ffff7658e85 in virDomainGetJobInfo (
    domain=domain@entry=0x7fffd0027700, info=info@entry=0x7fffe5fd1a90)
    at libvirt-domain.c:8671
...
Thread 3 (Thread 0x7fffddfd2700 (LWP 15995)):
#0  0x00007ffff3d438ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007ffff74b0fca in virThreadJoin (thread=<optimized out>)
    at util/virthread.c:299
#2  0x00007ffff7445c10 in virCommandFree (cmd=<optimized out>)
    at util/vircommand.c:2810
#3  0x00007ffff7457d64 in virFileWrapperFdFree (wfd=<optimized out>, 
    wfd@entry=0x7fffd8016740) at util/virfile.c:359
#4  0x00007fffb56f8909 in qemuDomainSaveMemory (
    driver=driver@entry=0x7fff94110b90, vm=vm@entry=0x7fffd80051a0, 
    path=path@entry=0x7fffd8002e80 "/var/lib/libvirt/qemu/save/f23-qcow2.save", data=data@entry=0x7fffd80166c0, compressedpath=compressedpath@entry=0x0, 
    flags=flags@entry=0, asyncJob=QEMU_ASYNC_JOB_SAVE)
    at qemu/qemu_driver.c:3266
#5  0x00007fffb56f8d0d in qemuDomainSaveInternal (
    driver=driver@entry=0x7fff94110b90, vm=0x7fffd80051a0, 
    path=0x7fffd8002e80 "/var/lib/libvirt/qemu/save/f23-qcow2.save", 
    compressed=compressed@entry=0, compressedpath=0x0, xmlin=xmlin@entry=0x0, 
    flags=0) at qemu/qemu_driver.c:3366
...

BTW: I did test that running managedsave with a domain w/ <genid/> without having any sort of snapshot interaction did work...

Let's see what I can figure out w/ the snapshots now. Suffice it to say it's not supported, but I'm not sure it should leave things the way it did.

Comment 17 John Ferlan 2018-06-11 21:25:11 UTC
Hmm... I'm understanding a bit more about this code... Did you try: 

# virsh snapshot-revert f23-qcow2 f23-qcow2-snap
error: revert requires force: domain genid update requires restart

#

first and then decided to follow the error message advice?

So it seems --force doesn't work completely as I thought, but I'm pretty sure that's orthogonal to genid as genid was just "making use of" the force functionality in order to do a stop and start. I'll need to determine if that's an 'expected use case' and whether genid should use it or not.

After digging a bit and adding a bunch of debug code, I think I have a good idea what's happening. When using the --force flag, on error, the code will first qemuProcessStop the domain and then will attempt qemuProcessStart and qemuProcessStartCPUs after the stop.  It's the Start that ends up eliciting the "error: internal error: unexpected async job 6" since (as I understand things in this code) there is no longer an async job that we started with - instead we're starting things up again. I've been able to test that I can successfully restart the domain after; however, it's not from the point of the snapshot - in fact that snapshot revert really doesn't happen in this force path.

In the long run, I may ask for a new bug to be created, but let me first find out what the 'expected processing' is.

Comment 19 Han Han 2018-06-19 03:28:52 UTC
Hi John, BZ1591628 seems to be the same problem as comment 10. It is triggered by a device update. Please check it. It is likely a general problem in snapshot-revert.

Comment 20 John Ferlan 2018-06-19 11:37:29 UTC
Thanks for the heads-up. Probably the same - I haven't heard back from a private email query on this particular problem. I'll look to put something together and post some patches today to at least make some progress on this and perhaps even the other one.

Comment 21 John Ferlan 2018-06-20 17:51:53 UTC
As you probably saw, I took bz1591628 and posted some patches to resolve it. Those patches will fix the hang issue and the fact that the domain is left in a paused state after the async job error message failure.  That series is at:

https://www.redhat.com/archives/libvir-list/2018-June/msg01425.html

I did include a small adjustment for the --force path in the series (patch 4 of the series) to not change the GUID value in this case. Since we wouldn't be "re"starting from a previous point in time for the domain, changing the GUID is not required. The --force option is handled as if someone does a Stop and Start on the domain, in which case, the GUID doesn't need to change.

I also updated the depends on list above. Each of the patches to fix the other bz will be included in that bz for any backports, so no need to reset this one into another state. Just wait for the other to show up and continue testing from there.

Comment 22 Han Han 2018-07-04 01:33:24 UTC
Ting ting, please check if this element will affect virt-tools features like v2v.

Comment 23 tingting zheng 2018-07-04 03:17:12 UTC
(In reply to Han Han from comment #22)
> Ting ting, please check if this element will affect virt-tools feature like
> v2v

Thanks for the reminder. Based on this bug, we will do some testing for virt-clone and virt-v2v; test results will be attached later.

Comment 24 tingting zheng 2018-07-05 06:34:16 UTC
I checked a vmware guest with genid enabled; in guest.vmx it shows something like:
vm.genid = "7344585841658099715"
vm.genidX = "-8483171368186442967"

but from virsh dumpxml, no genid information shows for the vmware guest.
# virsh -c vpx://root.75.182/data/10.73.72.61/?no_verify=1 dumpxml esx6.0-win2012r2-x86_64 
Enter root's password for 10.73.75.182: 
<domain type='vmware' xmlns:vmware='http://libvirt.org/schemas/domain/vmware/1.0'>
  <name>esx6.0-win2012r2-x86_64</name>
  <uuid>564d5c84-172a-0ecd-05c9-14003ce70ae3</uuid>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <os>
    <type arch='x86_64'>hvm</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <disk type='file' device='disk'>
      <source file='[ESX6.0] esx6.0-win2012r2-x86_64/esx6.0-win2012r2-x86_64.vmdk'/>
      <target dev='sda' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='scsi' index='0' model='lsisas1068'/>
    <interface type='bridge'>
      <mac address='00:50:56:bf:f4:79'/>
      <source bridge='VM Network'/>
      <model type='e1000e'/>
    </interface>
    <video>
      <model type='vmvga' vram='4096' primary='yes'/>
    </video>
  </devices>
  <vmware:datacenterpath>data</vmware:datacenterpath>
  <vmware:moref>vm-158</vmware:moref>
</domain>

Also tested with virt-v2v: if I add genid to a rhel guest, virt-v2v just ignores it, and after conversion the genid information is removed from the xml.

Conclusion:
1. virt-v2v doesn't deal with genid information, it just removes it after conversion.
2. Libvirt doesn't show genid information for vmware guests.

If virt-v2v needs to support it, 2 separate bugs should be filed against the above 2 issues.

I think the genid information should be kept for guests after conversion by virt-v2v; needinfo'ing virt-v2v developer Richard for ideas.

Comment 25 Richard W.M. Jones 2018-07-05 07:52:52 UTC
I agree with what Tingting said in comment 24.  I have filed
two bugs about this:

https://bugzilla.redhat.com/show_bug.cgi?id=1598348
"RFE: Support fetching <genid> from VMware guests" [libvirt]

https://bugzilla.redhat.com/show_bug.cgi?id=1598350
"RFE: virt-v2v should preserve <genid>" [libguestfs]

Comment 26 yisun 2018-08-20 12:22:19 UTC
Test scenarios are as follows:
==================
Auto generation:
==================
# virsh domstate vm2
shut off

# virsh edit vm2
...
<genid/>
...

# virsh dumpxml vm2 | grep genid
  <genid>1d0086f5-d8a7-40a5-b508-a69ce50901da</genid>

Suspend and resume:
root@localhost ~  ## virsh dumpxml vm2 | grep genid
  <genid>1d0086f5-d8a7-40a5-b508-a69ce50901da</genid>

root@localhost ~  ## virsh domstate vm2
running

root@localhost ~  ## virsh suspend vm2
Domain vm2 suspended

root@localhost ~  ## virsh domstate vm2
paused

root@localhost ~  ## virsh resume vm2
Domain vm2 resumed

root@localhost ~  ## virsh dumpxml vm2 | grep genid
  <genid>1d0086f5-d8a7-40a5-b508-a69ce50901da</genid>
<=== not changed as expected

==================
Reboot:
==================
# virsh dumpxml vm2 | grep genid
  <genid>1d0086f5-d8a7-40a5-b508-a69ce50901da</genid>

# virsh reboot vm2
Domain vm2 is being rebooted

# virsh dumpxml vm2 | grep genid
  <genid>1d0086f5-d8a7-40a5-b508-a69ce50901da</genid>

# virsh destroy vm2; virsh start vm2
Domain vm2 destroyed

Domain vm2 started

# virsh dumpxml vm2 | grep genid
  <genid>1d0086f5-d8a7-40a5-b508-a69ce50901da</genid>
<=== not changed as expected

==================
Migration:
==================
@source host
# virsh dumpxml vm2 | grep genid
<genid>532a4cf5-3e64-4755-8617-d41ecc653ca1</genid>

# virsh migrate --live vm2 qemu+ssh://10.73.73.57/system --verbose  --timeout 5 --unsafe
Migration: [100 %]

@target host
# virsh dumpxml vm2 | grep genid
  <genid>532a4cf5-3e64-4755-8617-d41ecc653ca1</genid>
<==== not changed as expected


==================
Restart libvirtd:
==================
root@localhost ~  ## virsh domstate vm2
running

root@localhost ~  ## service libvirtd restart
Redirecting to /bin/systemctl restart libvirtd.service
root@localhost ~  ## virsh dumpxml vm2 | grep genid
  <genid>1d0086f5-d8a7-40a5-b508-a69ce50901da</genid>
<== not changed as expected

==================
Save and restore vm:
==================
root@localhost ~  ## virsh dumpxml vm2 | grep genid
  <genid>1d0086f5-d8a7-40a5-b508-a69ce50901da</genid>

root@localhost ~  ## virsh save vm2 vm2.save
Domain vm2 saved to vm2.save

root@localhost ~  ## virsh restore vm2.save
Domain restored from vm2.save

root@localhost ~  ## virsh dumpxml vm2 | grep genid
  <genid>f56c9d09-7416-4a9d-8a0b-ac30a56aee8d</genid>

root@localhost ~  ## ps -ef | grep f56c9d09-7416-4a9d-8a0b-ac30a56aee8d
qemu     24789     1  2 18:45 ?        00:00:00 /usr/libexec/qemu-kvm -name guest=vm2,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-31-vm2/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu IvyBridge-IBRS,hypervisor=on,arat=on,xsaveopt=on -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 9a8d8b53-64b8-474c-b430-37bb3cede25a -device vmgenid,guid=f56c9d09-7416-4a9d-8a0b-ac30a56aee8d,id=vmgenid0 ...
< === changed as expected

==================
Managed Save and start:
==================
# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>

# virsh managedsave vm2
Domain vm2 state saved by libvirt

# virsh start vm2
Domain vm2 started

# virsh dumpxml vm2 | grep genid
  <genid>765c1a4f-8515-46fa-8111-24cae2dae7cc</genid>
<==== changed as expected

==================
clone vm:
==================
# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>

# virt-clone --original vm2 --name vm2_clone --file /var/lib/libvirt/images/clone.qcow2
Allocating 'clone.qcow2'                                                                                                                                               |  10 GB  00:00:23     
Clone 'vm2_clone' created successfully.

# virsh dumpxml vm2_clone | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
<=== this is not expected but reason provided in https://bugzilla.redhat.com/show_bug.cgi?id=1149445#c7

==================
Snapshot:
==================
# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
# virsh snapshot-create-as vm2 vm2.s1
Domain snapshot vm2.s1 created
# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>

# ps -ef | grep 39bb61a2-dff8-4c83-9d65-406f730a80cf
qemu     21872     1 77 08:18 ?        00:00:42 /usr/libexec/qemu-kvm -name guest=vm2,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-28-vm2/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu Haswell-noTSX-IBRS,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 68c5a7cd-bb03-4dda-8cc3-e03a0e3e13cf -device vmgenid,guid=39bb61a2-dff8-4c83-9d65-406f730a80cf,id=vmgenid0
<===== not changed

# virsh snapshot-revert vm2 vm2.s1 --force

# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
<===== not changed

Comment 27 yisun 2018-08-20 12:27:12 UTC
Hi John,
During my test, I met following 2 problems, pls help to confirm.

1. snapshot doesn't change genid, is this expected?
==================
Snapshot:
==================
# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
# virsh snapshot-create-as vm2 vm2.s1
Domain snapshot vm2.s1 created
# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>

# ps -ef | grep 39bb61a2-dff8-4c83-9d65-406f730a80cf
qemu     21872     1 77 08:18 ?        00:00:42 /usr/libexec/qemu-kvm -name guest=vm2,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-28-vm2/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu Haswell-noTSX-IBRS,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 68c5a7cd-bb03-4dda-8cc3-e03a0e3e13cf -device vmgenid,guid=39bb61a2-dff8-4c83-9d65-406f730a80cf,id=vmgenid0
<===== not changed

# virsh snapshot-revert vm2 vm2.s1 --force

# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
<===== not changed


2. When doing save/managedsave and restore, the genid changed as expected, but when I destroy/start the vm after this, the genid changes back to the very original one. Is this expected?
================================================
after managedsave&start, do destroy&start again
================================================
[root@ibm-x3250m5-04 yum.repos.d]# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>

[root@ibm-x3250m5-04 yum.repos.d]# virsh managedsave vm2
Domain vm2 state saved by libvirt

[root@ibm-x3250m5-04 yum.repos.d]# virsh start vm2
Domain vm2 started

[root@ibm-x3250m5-04 yum.repos.d]# virsh dumpxml vm2 | grep genid
  <genid>07f8b9f0-fafe-45c5-85f8-dd4f2b19932d</genid>
<==== changed as expected

[root@ibm-x3250m5-04 yum.repos.d]# virsh destroy vm2
Domain vm2 destroyed

[root@ibm-x3250m5-04 yum.repos.d]# virsh start vm2
Domain vm2 started

[root@ibm-x3250m5-04 yum.repos.d]# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
<===== after destroy/start again, the genid changed back to the original one

Comment 28 John Ferlan 2018-08-21 18:16:31 UTC
The implementation is essentially based off the Microsoft spec:

http://go.microsoft.com/fwlink/?LinkId=260709

There's a table in there detailing when the GenID should change or not:

Scenario                                                 GenID changed
-----------------------------------------------------------------------
Virtual machine is paused or resumed                      No
Virtual machine reboots                                   No
Virtual machine host reboots                              No
Virtual machine starts executing a snapshot               Yes
Virtual machine is recovered from backup                  Yes
Virtual machine is failed over in a disaster recovery env Yes
Virtual machine is live migrated                          No
Virtual machine is imported, copied, or cloned            Yes
Virtual machine is failed over in a clustered environment No
Virtual machine's configuration changes                   Unspecified


w/r/t comment 27...

Snapshot processing:
I assume you found that you cannot resume after snapshot because the genid setting wouldn't let you and it suggested using force (e.g. comment 17). Then you used --force and nothing changed.  That's expected as described in comment 21.

Managedsave processing:
I don't think it matters, it's probably a byproduct of how domain "def" and "newDef" processing works. In the long run (cut from Dan's thoughts on this during review, see: https://www.redhat.com/archives/libvir-list/2018-April/msg02342.html)

"The spec literally only wants it to be changed when there is
the possibility that the VM is potentially re-executing something that
has already been executed before."

In the start, managedsave, start, destroy, start case - we've already ensured that the change occurs between managedsave and start, but the destroy and start then moves, I believe, into the "Unspecified" row of the above table because in that case it's a configuration change.  The destroy or stop and start processing restarts the guest from a new point in time and is not restarting it from some previous point in time where the guest could be re-executing something.

Comment 29 yisun 2018-08-22 07:36:15 UTC
(In reply to John Ferlan from comment #28)
> The implementation is essentially based off the Microsoft spec:
> 
> http://go.microsoft.com/fwlink/?LinkId=260709
> 
> There's a table in there detailing when the GenID should change or not:
> 
> Scenario                                                 GenID changed
> -----------------------------------------------------------------------
> Virtual machine is paused or resumed                      No
> Virtual machine reboots                                   No
> Virtual machine host reboots                              No
> Virtual machine starts executing a snapshot               Yes
> Virtual machine is recovered from backup                  Yes
> Virtual machine is failed over in a disaster recovery env Yes
> Virtual machine is live migrated                          No
> Virtual machine is imported, copied, or cloned            Yes
> Virtual machine is failed over in a clustered environment No
> Virtual machine's configuration changes                   Unspecified
> 
> 
> w/r/t comment 27...
> 
> Snapshot processing:
> I assume you found that you cannot resume after snapshot because the genid
> setting wouldn't let you and it suggested using force (e.g. comment 17).
> Then you used --force and nothing changed.  That's expected as described in
> comment 21.
> 
> Managedsave processing:
> I don't think it matters, it's probably a byproduct of how domain "def" and
> "newDef" processing works. In the long run (cut from Dan's thoughts on this
> during review, see:
> https://www.redhat.com/archives/libvir-list/2018-April/msg02342.html)
> 
> "The spec literally only wants it to be changed when there is
> the possibility that the VM is potentially re-executing something that
> has already been executed before."
> 
> In the start, managedsave, start, destroy, start case - we've already
> ensured that the change occurrs between managedsave and start, but then the
> destroy and start I believe moves into the "Unspecified" row from the above
> table because in that case, it's a configuration change.  The destroy or
> stop and start processing restarts the guest from a new point in time and is
> not restarting it from some previous point in time where the guest could be
> re-executing something.

Thx for the explanation, but for the snapshot part, I am still a little confused about the logic. 2 more questions below.

1. After creating a snapshot, the genid is not changed; is this expected? The list says when executing a snapshot, the genid should be changed.
[root@ibm-x3250m5-04 ~]# virsh start vm2
Domain vm2 started

[root@ibm-x3250m5-04 ~]# virsh snapshot-list vm2
 Name                 Creation Time             State
------------------------------------------------------------

[root@ibm-x3250m5-04 ~]# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
[root@ibm-x3250m5-04 ~]# virsh snapshot-create-as vm2 vm2.s1
Domain snapshot vm2.s1 created


[root@ibm-x3250m5-04 ~]# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
<==== not changed, but the list above shows snapshot should change the genid?


2. When I manually changed the genid after snapshot1 and force-reverted to snapshot1, the vm's genid changed, but to a totally new genid, not the one in snapshot1. Is this expected?

[root@ibm-x3250m5-04 ~]# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>

//manually change the genid to 39bb61a2-dff8-4c83-9d65-111111111111 with "virsh edit"
[root@ibm-x3250m5-04 ~]# virsh edit vm2
Domain vm2 XML configuration edited.

[root@ibm-x3250m5-04 ~]# virsh destroy vm2; virsh start vm2
Domain vm2 destroyed
Domain vm2 started

[root@ibm-x3250m5-04 ~]# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-111111111111</genid>

[root@ibm-x3250m5-04 ~]# virsh snapshot-revert vm2 vm2.s1
error: revert requires force: Target domain genid 39bb61a2-dff8-4c83-9d65-406f730a80cf does not match source 39bb61a2-dff8-4c83-9d65-111111111111

[root@ibm-x3250m5-04 ~]# virsh snapshot-revert vm2 vm2.s1 --force

[root@ibm-x3250m5-04 ~]# virsh dumpxml vm2 | grep genid
  <genid>8e191d6e-6964-4b54-8c91-ce30a3f1faca</genid>
<=== in your comment, it seems the guid should be changed back to snapshot1's status, but here it changed to a totally new one. Expected?

Comment 30 yisun 2018-08-22 07:52:59 UTC
And another question about snapshot-revert without --force: there is always an error, "domain genid update requires restart", but after I reboot or start/stop the domain, the error still exists. Expected?

[root@ibm-x3250m5-04 ~]# virsh dumpxml vm2 | grep genid
  <genid>39bb61a2-dff8-4c83-9d65-406f730a80cf</genid>
[root@ibm-x3250m5-04 ~]# virsh snapshot-create-as vm2 vm2.s1
Domain snapshot vm2.s1 created
[root@ibm-x3250m5-04 ~]# virsh snapshot-revert vm2 vm2.s1
error: revert requires force: domain genid update requires restart

[root@ibm-x3250m5-04 ~]# virsh reboot vm2
Domain vm2 is being rebooted

[root@ibm-x3250m5-04 ~]# virsh snapshot-revert vm2 vm2.s1
error: revert requires force: domain genid update requires restart

[root@ibm-x3250m5-04 ~]# virsh destroy vm2; virsh start vm2
Domain vm2 destroyed

Domain vm2 started

[root@ibm-x3250m5-04 ~]# virsh snapshot-revert vm2 vm2.s1
error: revert requires force: domain genid update requires restart

Comment 31 John Ferlan 2018-08-23 13:30:30 UTC
For comment 29:

>> 1. After create snapshot, genid not changed, is this expected? The list says
>> when executing snapshot, the genid should be changed.

The list says when starting from a snapshot (OK, actually it says "starts executing a snapshot"). It does not say when creating a snapshot. It's a fine line distinction, but there's no need to change the genid *unless* that snapshot started executing.  Unfortunately, that functionality isn't (yet) possible in QEMU. There was an attempt to add the functionality to QEMU:

http://lists.nongnu.org/archive/html/qemu-devel/2018-03/msg00551.html

But it hasn't been accepted and I've seen no other attempt afterwards.

>> 2. When I manually changed the genid after snapshot1 and force reverting to
>> snapshot1, the vm's genid changed. But to a totally new genid, not the one
>> in snapshot1. Is this expected?

And the purpose of this test is what?  Again, once you stop the domain and restart, the genid changing means nothing. I would say what you're doing falls into that unspecified row of the above table.

The purpose of a changed genid is to allow/force the OS that recognizes that change to make certain decisions about what to do if it's possible the OS could be reexecuting something that was executed before, nothing more, nothing less. 

Comment 30:

The genid still exists in the XML of the snapshot (vm2.s1). So when starting from that point in time, the processing requires us to take that genid and change it prior to letting the guest run from that snapshot point in time. We cannot do so because it's not supported by QEMU. Just because you physically stop and start the guest doesn't mean anything. You're now going to attempt to start from some point in time that the guest had already reached (that's what the revert does, right?).  I see no issue with what you've shown.

Comment 32 yisun 2018-08-28 03:46:25 UTC
(In reply to John Ferlan from comment #31)
> For comment 29:
> 
> >> 1. After create snapshot, genid not changed, is this expected? The list says
> >> when executing snapshot, the genid should be changed.
> 
> The list says when starting from a snapshot (OK, actually it says "starts
> executing a snapshot"). It does not say when creating a snapshot. It's a
> fine line distinction, but there's no need to change the genid *unless* that
> snapshot started executing.  Unfortunately, that functionality isn't (yet)
> possible in QEMU. There was an attempt to add the functionality to QEMU:
> 
> http://lists.nongnu.org/archive/html/qemu-devel/2018-03/msg00551.html
> 
> But it hasn't been accepted and I've seen no other attempt afterwards.
> 
> >> 2. When I manually changed the genid after snapshot1 and force reverting to
> >> snapshot1, the vm's genid changed. But to a totally new genid, not the one
> >> in snapshot1. Is this expected?
> 
> And the purpose of this test is what?  Again once you stop the domain and
> restart the genid changing means nothing. I would say what you're doing
> falls into that unspecified row/column of the above table.
> 
> The purpose of a changed genid is to allow/force the OS that recognizes that
> change to make certain decisions about what to do if it's possible the OS
> could be reexecuting something that was executed before, nothing more,
> nothing less. 
> 
> Comment 30:
> 
> The genid still exisst in the XML of the snapshot (vm2.s1). So when starting
> from that point in time, the processing requires us to take that genid and
> change it prior to letting the guest run from that snapshot point in time.
> We cannot do so because it's not supported by QEMU. Just because you
> physically stop and start the guest doesn't mean anything. You're now going
> to attempt to start from some point in time that the guest already had
> reached (that's what the revert does, right?).  I see no issue with what
> you've shown.

Thx for the detailed explanation, set to VERIFIED then.

Comment 34 errata-xmlrpc 2018-10-30 09:49:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113