Bug 1637078 - [downstream clone - 4.2.7] Snapshot deletion fails with "MaxNumOfVmSockets has no value for version"
Summary: [downstream clone - 4.2.7] Snapshot deletion fails with "MaxNumOfVmSockets has no value for version"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.3
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.2.7
Target Release: ---
Assignee: Eyal Shenitzky
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1628150
Blocks:
 
Reported: 2018-10-08 14:49 UTC by RHV bug bot
Modified: 2022-03-13 15:42 UTC
CC List: 18 users

Fixed In Version: vdsm v4.20.43
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1628150
Environment:
Last Closed: 2018-11-05 15:03:18 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
- Red Hat Issue Tracker RHV-44308 (last updated 2021-12-10 18:02:11 UTC)
- Red Hat Knowledge Base (Solution) 3611401 (last updated 2018-10-08 14:51:31 UTC)
- Red Hat Product Errata RHBA-2018:3480 (last updated 2018-11-05 15:03:50 UTC)
- oVirt gerrit 94690 (master, MERGED): core: fix failing 3.5 snapshot deletion (last updated 2021-01-22 22:12:14 UTC)
- oVirt gerrit 94833 (ovirt-engine-4.2, MERGED): core: fix failing 3.5 snapshot deletion (last updated 2021-01-22 22:12:55 UTC)

Description RHV bug bot 2018-10-08 14:49:01 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1628150 +++
======================================================================

Description of problem:

The snapshot deletion fails with "MaxNumOfVmSockets has no value for version"

Version-Release number of selected component (if applicable):
rhvm-4.2.3.8-0.1.el7.noarch
vdsm-4.20.27.1-1.el7ev.x86_64
vdsm-client-4.20.27.1-1.el7ev.noarch

How reproducible:

Cannot reproduce it on my systems; it was seen once at the customer's site.

Steps to Reproduce:
1.
2.
3.

Actual results:

Fails to delete snapshot

Expected results:

Snapshot deletion should complete.

Additional info:

In this case the merge is successful and the LV is removed; we can see that the pivot and the image synchronization completed.

However, the database entry for the deleted volume is not removed afterwards, so the VM then fails to start with a missing-volume error, when in reality the volume was already deleted by the snapshot merge.

Will attach the relevant logs in the subsequent comment.

(Originally by Siddhant Rao)

Comment 11 RHV bug bot 2018-10-08 14:50:08 UTC
What does "engine-config -g MaxNumOfVmSockets" return?

(Originally by michal.skrivanek)

Comment 12 RHV bug bot 2018-10-08 14:50:13 UTC
Hello Michal,

Thanks for your inputs on this,

(In reply to Michal Skrivanek from comment #8)
> What does "engine-config -g MaxNumOfVmSockets" return?


~~~~

MaxNumOfCpuPerSocket: 16 version: 3.6
MaxNumOfCpuPerSocket: 16 version: 4.0
MaxNumOfCpuPerSocket: 254 version: 4.1
MaxNumOfCpuPerSocket: 254 version: 4.2
MaxNumOfThreadsPerCpu: 8 version: 3.6
MaxNumOfThreadsPerCpu: 8 version: 4.0
MaxNumOfThreadsPerCpu: 8 version: 4.1
MaxNumOfThreadsPerCpu: 8 version: 4.2
MaxNumOfVmCpus: 240 version: 3.6
MaxNumOfVmCpus: 240 version: 4.0
MaxNumOfVmCpus: 288 version: 4.1
MaxNumOfVmCpus: 384 version: 4.2
MaxNumOfVmSockets: 16 version: 3.6
MaxNumOfVmSockets: 16 version: 4.0
MaxNumOfVmSockets: 16 version: 4.1
MaxNumOfVmSockets: 16 version: 4.2

~~~~

Regards,
Siddhant Rao

(Originally by Siddhant Rao)

Comment 13 RHV bug bot 2018-10-08 14:50:19 UTC
Thanks, that looks good. And is the snapshot definitely from 3.6 or newer? Is it possible it was created in 3.5 or earlier (even the one you're trying to delete)?

(Originally by michal.skrivanek)

Comment 15 RHV bug bot 2018-10-08 14:50:28 UTC
Hello Michal,

Apparently yes; the VM was cloned from a 3.5 environment.

Let me know your inputs.

Regards,
Siddhant Rao

(Originally by Siddhant Rao)

Comment 16 RHV bug bot 2018-10-08 14:50:33 UTC
VM - OK. Can you please check its current cluster level (both the cluster setting and any potential VM-level override)?

The snapshot which is being deleted - is it from 3.5? Can you get its XML from the DB and check what cluster version it has?

Is there only one snapshot (so the data is being merged into the current running config), or is it being merged with another snapshot? If it's the latter, can you also check its cluster level?

(Originally by michal.skrivanek)

Comment 18 RHV bug bot 2018-10-08 14:50:44 UTC
Hello Michal,

The issue is reproducible in my test environment. It happens for any VM whose snapshot was taken in 3.5 or lower when we try to delete that snapshot in a 4.2 environment.

Steps to reproduce:

1. Create a VM snapshot in 3.5 environment.

2. Export the VM to an export domain and import it into 4.2.

3. Try to do a live merge of the snapshot. It will fail with the error below.

===
2018-09-13 06:18:12,285-04 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [de1edc42-c863-4780-b23e-0003db4a1066] Failed invoking callback end method 'onSucceeded' for command 'd326f470-04c9-4265-9c41-4bd5ad114b94' with exception 'MaxNumOfVmSockets has no value for version: ', the callback is marked for end method retries but max number of retries have been attempted. The command will be marked as Failed.
2018-09-13 06:18:12,285-04 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [de1edc42-c863-4780-b23e-0003db4a1066] Error invoking callback method 'onSucceeded' for 'SUCCEEDED' command 'd326f470-04c9-4265-9c41-4bd5ad114b94'
2018-09-13 06:18:12,286-04 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [de1edc42-c863-4780-b23e-0003db4a1066] Exception: java.lang.IllegalArgumentException: MaxNumOfVmSockets has no value for version: 
===

The 3.5 snapshot does not have "ClusterCompatibilityVersion" in the OVF, so the "version" is empty here. The exception is thrown from DBConfigUtils.getValue():


    public <T> T getValue(ConfigValues name, String version) {
        Map<String, T> values = getValuesForAllVersions(name);
        if (valueExists(name, version)) {
            return values.get(version);
        }
        // The empty version string parsed from a 3.5 OVF matches no
        // configured version, so this exception is thrown.
        throw new IllegalArgumentException(name.toString() + " has no value for version: " + version);
    }


The error also shows an empty string for the version.

===
2018-09-13 06:18:12,286-04 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [de1edc42-c863-4780-b23e-0003db4a1066] Exception: java.lang.IllegalArgumentException: MaxNumOfVmSockets has no value for version: 
	at org.ovirt.engine.core.dal.dbbroker.generic.DBConfigUtils.getValue(DBConfigUtils.java:65) [dal.jar:]
	at org.ovirt.engine.core.common.config.Config.getValue(Config.java:28) [common.jar:]
===
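
For illustration, here is a minimal self-contained sketch of the failing lookup (simplified names, not the actual engine classes); the per-version map mirrors the engine-config output from comment 12, and the empty version string parsed from the 3.5 OVF matches no key:

    import java.util.Map;

    // Simplified sketch of the DBConfigUtils lookup above -- not the actual
    // engine code. An empty version key (what a 3.5 snapshot OVF yields)
    // matches no configured version, reproducing the exception in the logs.
    public class VersionLookupSketch {
        public static void main(String[] args) {
            Map<String, Integer> maxNumOfVmSockets =
                    Map.of("3.6", 16, "4.0", 16, "4.1", 16, "4.2", 16);
            String version = ""; // parsed from a 3.5 snapshot OVF
            if (!maxNumOfVmSockets.containsKey(version)) {
                throw new IllegalArgumentException(
                        "MaxNumOfVmSockets has no value for version: " + version);
            }
            System.out.println("MaxNumOfVmSockets = " + maxNumOfVmSockets.get(version));
        }
    }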

If I manually edit the vm_configuration of the snapshot and add "<ClusterCompatibilityVersion>3.6</ClusterCompatibilityVersion>" to the XML before the live merge, everything works well.

So the issue is caused by the absence of "<ClusterCompatibilityVersion>" in the 3.5 snapshot's VM XML.

I can confirm that the customer's XML also doesn't have "<ClusterCompatibilityVersion>" and that it comes from 3.5 (the XML has ovf:version="3.5.0.0").

(Originally by Nijin Ashok)

Comment 20 RHV bug bot 2018-10-08 14:50:54 UTC
Yes, that is correct and expected (at least from the code's POV). Anything with <= 3.5 will fail on 4.0+ because we stopped supporting those cluster levels in 4.0.
Why the OVF is touched when that snapshot is being deleted I do not know; that needs analysis from Storage as to whether it is really required or can be removed. Generally the code shouldn't attempt to write a 3.5 OVF, because that version is no longer supported.
Tal?

(Originally by michal.skrivanek)

Comment 21 RHV bug bot 2018-10-08 14:50:59 UTC
Ala, any idea why we touch the snapshot while deleting a snapshot after live merge?

(Originally by Tal Nisan)

Comment 22 RHV bug bot 2018-10-08 14:51:04 UTC
I added this request to the upgrade helper:
https://bugzilla.redhat.com/show_bug.cgi?id=1631896

(Originally by Marina Kalinin)

Comment 23 RHV bug bot 2018-10-08 14:51:09 UTC
(In reply to Michal Skrivanek from comment #17)
> Yes, that is correct and expected (at least from the code's POV). Anything
> with <= 3.5 will fail on 4.0+ because we stopped supporting those cluster
> levels in 4.0.
> Why the OVF is touched when that snapshot is being deleted I do not know;
> that needs analysis from Storage as to whether it is really required or can
> be removed.

The OVF update should occur because the VM images were changed.

> Generally the code shouldn't attempt to write a 3.5 OVF, because that
> version is no longer supported.
> Tal?

The problem is why the field is missing or contains an empty string after the environment was upgraded to 4.2 / the 3.5 VM was imported into a 4.2 environment.

It doesn't seem like a storage issue.

(Originally by Eyal Shenitzky)

Comment 24 RHV bug bot 2018-10-08 14:51:14 UTC
(In reply to Eyal Shenitzky from comment #20)
> (In reply to Michal Skrivanek from comment #17)
> > Yes, that is correct and expected (at least from the code's POV). Anything
> > with <= 3.5 will fail on 4.0+ because we stopped supporting those cluster
> > levels in 4.0.
> > Why the OVF is touched when that snapshot is being deleted I do not know;
> > that needs analysis from Storage as to whether it is really required or
> > can be removed.
> 
> The OVF update should occur because the VM images were changed.

We should touch the VM - the current version of it, which is 4.2 - and AFAICT that works fine.
But we should not touch the 3.5 OVF from the snapshot, because it's not supported anymore and any attempt to produce a 3.5 OVF will fail.

> > Generally the code shouldn't attempt to write a 3.5 OVF, because that
> > version is no longer supported.
> > Tal?
> 
> The problem is why the field is missing or contains an empty string after
> the environment was upgraded to 4.2 / the 3.5 VM was imported into a 4.2
> environment.

The VM itself (the current version) should be updated just fine, but no one updates or touches past snapshots.
Pre-3.6 OVFs didn't have the ClusterCompatibilityVersion field at all.

(Originally by michal.skrivanek)

Comment 25 RHV bug bot 2018-10-08 14:51:19 UTC
> We should touch the VM - the current version of it, which is 4.2 - and
> AFAICT that works fine.
> But we should not touch the 3.5 OVF from the snapshot, because it's not
> supported anymore and any attempt to produce a 3.5 OVF will fail.

When performing a change to the VM, like removing or adding an image, we should update the OVF of the VM. This can happen immediately, as in live merge, or automatically after some period of time via the OVF update mechanism.

If we do not, the VM will have no backup, which affects various flows. For example, in a disaster-recovery scenario the VM you try to recreate will be out of date.

> The VM itself (the current version) should be updated just fine, but no one
> updates or touches past snapshots.
> Pre-3.6 OVFs didn't have the ClusterCompatibilityVersion field at all.

Maybe we should consider filling those gaps synthetically.
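
For illustration, a hypothetical sketch (not the actual fix merged via gerrit 94690) of what such synthetic gap-filling could look like; the fallback level "3.6" and all names here are assumptions for the example:

    // Hypothetical sketch only -- not the actual ovirt-engine change. If a
    // pre-3.6 snapshot OVF carries no <ClusterCompatibilityVersion>, fall
    // back to the oldest compatibility level that still has configured
    // values (assumed "3.6") instead of passing an empty string to the
    // config lookup.
    public class ClusterVersionFallback {
        private static final String OLDEST_SUPPORTED = "3.6"; // assumption

        static String effectiveVersion(String versionFromOvf) {
            return (versionFromOvf == null || versionFromOvf.isEmpty())
                    ? OLDEST_SUPPORTED
                    : versionFromOvf;
        }

        public static void main(String[] args) {
            System.out.println(effectiveVersion(""));    // prints 3.6
            System.out.println(effectiveVersion("4.2")); // prints 4.2
        }
    }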

(Originally by Eyal Shenitzky)

Comment 26 Elad 2018-10-15 05:44:04 UTC
Tested the following:

- Created a VM in a 3.5 DC (3.6 env)
- Created a snapshot
- Exported the VM 
- Detached the export domain and attached it to 4.2 DC (4.2 env)
- Imported the VM
- Live merged the snapshot

Live merge succeeded



Used:

3.6 setup (tested on 3.5 DC):
rhevm-3.6.13.4-0.1.el6.noarch
vdsm-4.16.38-1.el6ev.x86_64
libvirt-0.10.2-62.el6.x86_64

4.2 setup:
ovirt-engine-4.2.7.3-0.0.master.20181012152958.gitfc1595b.el7.noarch
vdsm-4.20.42-4.git43e2555.el7.x86_64
libvirt-4.5.0-10.el7.x86_64

Comment 27 RHV bug bot 2018-10-18 11:39:37 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Project 'ovirt-engine'/Component 'vdsm' mismatch]

For more info please contact: rhv-devops

Comment 28 Eyal Edri 2018-10-19 13:28:45 UTC
Tal, which component / Errata should this bug be added to?
The attached fixes are from the engine repo, but 'Fixed In Version' points to VDSM.
This currently blocks adding the bug to the errata.

Comment 29 Eyal Edri 2018-10-22 19:43:17 UTC
Moving to ON_QA in the meantime so as not to block QE.

Comment 30 Raz Tamir 2018-10-22 20:30:09 UTC
QE verification bot: the bug was verified upstream

Comment 33 errata-xmlrpc 2018-11-05 15:03:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3480

