Bug 1628150 - Snapshot deletion fails with "MaxNumOfVmSockets has no value for version"
Summary: Snapshot deletion fails with "MaxNumOfVmSockets has no value for version"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.3
Hardware: Unspecified
OS: Linux
Importance: high urgent
Target Milestone: ovirt-4.3.0
Assignee: Eyal Shenitzky
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 1636846 1637078
 
Reported: 2018-09-12 11:50 UTC by Siddhant Rao
Modified: 2022-03-13 15:33 UTC
CC List: 13 users

Fixed In Version: ovirt-engine-4.3.0_rc
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1637078 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:36:02 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44277 0 None None None 2021-12-10 17:34:43 UTC
Red Hat Knowledge Base (Solution) 3611401 0 None None None 2018-09-12 15:55:47 UTC
Red Hat Product Errata RHBA-2019:1077 0 None None None 2019-05-08 12:36:24 UTC
oVirt gerrit 94690 0 'None' MERGED core: fix failing 3.5 snapshot deletion 2021-01-25 17:25:56 UTC

Description Siddhant Rao 2018-09-12 11:50:21 UTC
Description of problem:

The snapshot deletion fails with "MaxNumOfVmSockets has no value for version"

Version-Release number of selected component (if applicable):
rhvm-4.2.3.8-0.1.el7.noarch
vdsm-4.20.27.1-1.el7ev.x86_64
vdsm-client-4.20.27.1-1.el7ev.noarch

How reproducible:

Cannot reproduce it on my systems; it was seen once at the customer's site.

Steps to Reproduce:
1.
2.
3.

Actual results:

Fails to delete snapshot

Expected results:

Snapshot deletion should complete.

Additional info:

In this case the merge is successful and the LV is removed; we can see that the pivot and the image synchronization completed.

However, after this the entry for the merged volume is not removed from the database, so the VM then fails to start, reporting a missing volume, even though that volume was intentionally removed earlier as part of the snapshot merge.

Will attach the relevant logs in the subsequent comment.

Comment 8 Michal Skrivanek 2018-09-13 05:08:16 UTC
What does “engine-config -g MaxNumOfVmSockets“ return?

Comment 9 Siddhant Rao 2018-09-13 06:30:56 UTC
Hello Michal,

Thanks for your inputs on this,

(In reply to Michal Skrivanek from comment #8)
> What does “engine-config -g MaxNumOfVmSockets“ return?


~~~~

MaxNumOfCpuPerSocket: 16 version: 3.6
MaxNumOfCpuPerSocket: 16 version: 4.0
MaxNumOfCpuPerSocket: 254 version: 4.1
MaxNumOfCpuPerSocket: 254 version: 4.2
MaxNumOfThreadsPerCpu: 8 version: 3.6
MaxNumOfThreadsPerCpu: 8 version: 4.0
MaxNumOfThreadsPerCpu: 8 version: 4.1
MaxNumOfThreadsPerCpu: 8 version: 4.2
MaxNumOfVmCpus: 240 version: 3.6
MaxNumOfVmCpus: 240 version: 4.0
MaxNumOfVmCpus: 288 version: 4.1
MaxNumOfVmCpus: 384 version: 4.2
MaxNumOfVmSockets: 16 version: 3.6
MaxNumOfVmSockets: 16 version: 4.0
MaxNumOfVmSockets: 16 version: 4.1
MaxNumOfVmSockets: 16 version: 4.2

~~~~

Regards,
Siddhant Rao

Comment 10 Michal Skrivanek 2018-09-13 06:57:14 UTC
thanks, that looks good. And the snapshot is surely from 3.6 or newer? Is it possible it was created in 3.5 or earlier? (even the one you're trying to delete)

Comment 12 Siddhant Rao 2018-09-13 09:44:40 UTC
Hello Michal,

Apparently yes, the VM was cloned from a 3.5 environment.

Let me know your inputs.

Regards,
Siddhant Rao

Comment 13 Michal Skrivanek 2018-09-13 10:17:27 UTC
VM - ok. Can you please check its current cluster level? (both the cluster setting and any potential VM-level override)

The snapshot which is being deleted - is it from 3.5? Can you get its XML from the db and check what cluster version it has?

is there only one snapshot (so the data is being merged to the current running config) or is it being merged with another snapshot? If it's the latter, can you also check its cluster level?
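
For reference, the per-snapshot OVF is stored in the vm_configuration column of the snapshots table in the engine database, so it can be dumped for inspection with something along these lines (a rough sketch only; the JDBC URL, database name, user and password are assumptions that vary per installation):

// Illustrative only: dump each snapshot's OVF (vm_configuration) for a given VM
// so its <ClusterCompatibilityVersion> can be inspected. Connection details are
// placeholders; the engine database is normally PostgreSQL.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DumpSnapshotOvf {
    public static void main(String[] args) throws Exception {
        String vmId = args[0]; // VM UUID

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/engine", "engine", "<password>");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT snapshot_id, description, vm_configuration "
                   + "FROM snapshots WHERE vm_id = ?::uuid")) {
            stmt.setString(1, vmId);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println("=== " + rs.getString("snapshot_id")
                            + " (" + rs.getString("description") + ") ===");
                    System.out.println(rs.getString("vm_configuration"));
                }
            }
        }
    }
}

The same data can of course be pulled with a plain psql query; the point is only to show where the snapshot XML lives.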

Comment 15 nijin ashok 2018-09-13 11:08:21 UTC
Hello Michal,

The issue is reproducible in my test environment. It happens for any VM whose snapshot was taken in 3.5 or lower when we try to delete that snapshot in a 4.2 environment.

Steps to reproduce:

1. Create a VM snapshot in 3.5 environment.

2. Export the VM to export domain and import it to 4.2.

3. Try to do a live merge of the snapshot. It will fail with the error below.

===
2018-09-13 06:18:12,285-04 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [de1edc42-c863-4780-b23e-0003db4a1066] Failed invoking callback end method 'onSucceeded' for command 'd326f470-04c9-4265-9c41-4bd5ad114b94' with exception 'MaxNumOfVmSockets has no value for version: ', the callback is marked for end method retries but max number of retries have been attempted. The command will be marked as Failed.
2018-09-13 06:18:12,285-04 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [de1edc42-c863-4780-b23e-0003db4a1066] Error invoking callback method 'onSucceeded' for 'SUCCEEDED' command 'd326f470-04c9-4265-9c41-4bd5ad114b94'
2018-09-13 06:18:12,286-04 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [de1edc42-c863-4780-b23e-0003db4a1066] Exception: java.lang.IllegalArgumentException: MaxNumOfVmSockets has no value for version: 
===

The 3.5 snapshot does not have "ClusterCompatibilityVersion" in the OVF, so the "version" will be empty here.


 60     public <T> T getValue(ConfigValues name, String version) {
 61         Map<String, T> values = getValuesForAllVersions(name);
 62         if (valueExists(name, version)) {
 63             return values.get(version);
 64         }
 65         throw new IllegalArgumentException(name.toString() + " has no value for version: " + version);
 66     }


The error also shows an empty string for the version.

===
2018-09-13 06:18:12,286-04 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-37) [de1edc42-c863-4780-b23e-0003db4a1066] Exception: java.lang.IllegalArgumentException: MaxNumOfVmSockets has no value for version: 
	at org.ovirt.engine.core.dal.dbbroker.generic.DBConfigUtils.getValue(DBConfigUtils.java:65) [dal.jar:]
	at org.ovirt.engine.core.common.config.Config.getValue(Config.java:28) [common.jar:]
===

If I manually edit the vm_configuration of the snapshot and add "<ClusterCompatibilityVersion>3.6</ClusterCompatibilityVersion>" to the XML before the live merge, everything works well.

So the issue is caused by the absence of "<ClusterCompatibilityVersion>" in the 3.5 snapshot's VM XML.

I can confirm that the customer's XML also doesn't have "<ClusterCompatibilityVersion>" and that it is from 3.5 (the XML has ovf:version="3.5.0.0").
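
To illustrate the check described above, here is a minimal, purely illustrative sketch (the class and method names are made up for the example) that flags a snapshot OVF which is missing ClusterCompatibilityVersion - exactly the condition that makes Config.getValue() throw during the merge:

// Illustrative sketch: given the OVF stored in a snapshot's vm_configuration,
// report whether <ClusterCompatibilityVersion> is present. OVFs written by
// 3.5 and older lack it, which is what makes the later live merge fail.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class OvfCompatCheck {

    public static boolean hasClusterCompatVersion(String ovfXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(ovfXml)));
        NodeList nodes = doc.getElementsByTagName("ClusterCompatibilityVersion");
        return nodes.getLength() > 0
                && !nodes.item(0).getTextContent().trim().isEmpty();
    }

    public static void main(String[] args) throws Exception {
        String ovfXml = new String(
                java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0])));
        System.out.println(hasClusterCompatVersion(ovfXml)
                ? "ClusterCompatibilityVersion present"
                : "ClusterCompatibilityVersion missing - 3.5-era OVF, live merge will hit this bug");
    }
}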

Comment 17 Michal Skrivanek 2018-09-13 11:50:11 UTC
yes, that is correct and expected (at least from code POV:). Anything with <=3.5 will fail in 4.0+ because we stopped supporting previous clusters in 4.0.
Now why is it touched when that snapshot is being deleted I do not know, that needs analysis from Storage whether it's really required or it can be removed. Generally the code shouldn't be attempting to write a 3.5 OVF because that version is no longer supported.
Tal?

Comment 18 Tal Nisan 2018-09-13 13:11:51 UTC
Ala, any idea why we touch the snapshot while deleting a snapshot after live merge?

Comment 19 Marina Kalinin 2018-09-21 20:27:40 UTC
I added this request to the upgrade helper.
https://bugzilla.redhat.com/show_bug.cgi?id=1631896

Comment 20 Eyal Shenitzky 2018-10-02 08:52:00 UTC
(In reply to Michal Skrivanek from comment #17)
> yes, that is correct and expected (at least from code POV:). Anything with
> <=3.5 will fail in 4.0+ because we stopped supporting previous clusters in
> 4.0.
> Now why is it touched when that snapshot is being deleted I do not know,
> that needs analysis from Storage whether it's really required or it can be
> removed.

The OVF update should occur because the VM images were changed.

> Generally the code shouldn't be attempting to write a 3.5 OVF
> because that version is no longer supported.
> Tal?

The problem is why the field is missing/contains empty string after the environment was updated to 4.2 / the 3.5 VM was imported to a 4.2 environment.

It doesn't seem like a storage issue.

Comment 21 Michal Skrivanek 2018-10-02 12:02:20 UTC
(In reply to Eyal Shenitzky from comment #20)
> (In reply to Michal Skrivanek from comment #17)
> > yes, that is correct and expected (at least from code POV:). Anything with
> > <=3.5 will fail in 4.0+ because we stopped supporting previous clusters in
> > 4.0.
> > Now why is it touched when that snapshot is being deleted I do not know,
> > that needs analysis from Storage whether it's really required or it can be
> > removed.
> 
> The OVF update should occur because the VM images were changed.

we should touch the VM - the current version of it - which is 4.2. AFAICT that works fine.
But we should not touch the 3.5 OVF from the snapshot because it's not supported anymore and any attempt to produce 3.5 OVF will fail.
 
> > Generally the code shouldn't be attempting to write a 3.5 OVF
> > because that version is no longer supported.
> > Tal?
> 
> The problem is why the field is missing/contains empty string after the
> environment was updated to 4.2 / the 3.5 VM was imported to a 4.2
> environment.

The VM itself (the current version) should be updated just fine, but no one updates or touches past snapshots.
<3.6 didn't have the ClusterCompatibilityVersion field at all.

Comment 22 Eyal Shenitzky 2018-10-03 03:58:39 UTC
> we should touch the VM - the current version of it - which is 4.2. AFAICT
> that works fine.
> But we should not touch the 3.5 OVF from the snapshot because it's not
> supported anymore and any attempt to produce 3.5 OVF will fail.

When performing a change in the VM, such as removing or adding an image, we should update the OVF of the VM. This can be done immediately, as in live merge, or automatically after some period of time by the OVF update mechanism.

If we don't, the VM will not have an up-to-date backup, and that affects different flows.
E.g., in case of disaster recovery, the VM you will try to recreate will be out of date.

> The VM itself (the current version) should be updated just fine, but no one
> updates or touches past snapshots.
> <3.6 didn't have the ClusterCompatibilityVersion field at all.

Maybe we should consider filling those gaps synthetically.
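
As a rough illustration of that idea (a sketch only - the fix that was actually merged upstream as gerrit 94690 may take a different approach), the gap could be filled when an old OVF is read by injecting a fallback value if the element is absent:

// Illustrative sketch of "filling the gap synthetically": if a snapshot OVF has
// no <ClusterCompatibilityVersion>, inject one with a fallback value.
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class OvfCompatFiller {

    /** Returns the OVF unchanged if the element exists, otherwise adds it. */
    public static String fillCompatVersion(String ovfXml, String fallbackVersion)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(ovfXml)));

        if (doc.getElementsByTagName("ClusterCompatibilityVersion").getLength() > 0) {
            return ovfXml; // nothing to do
        }

        // Where exactly the element belongs in the OVF tree is an assumption here;
        // appending it under the document element keeps the sketch simple.
        Element compat = doc.createElement("ClusterCompatibilityVersion");
        compat.setTextContent(fallbackVersion);
        doc.getDocumentElement().appendChild(compat);

        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}

In a real implementation the fallback would presumably be the lowest compatibility version the engine still supports, or the VM's current cluster level, rather than a hard-coded string.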

Comment 24 Elad 2018-10-15 06:08:59 UTC
Tested the following:

- Created a VM in a 3.5 DC (3.6 env)
- Created a snapshot
- Exported the VM 
- Detached the export domain and attached it to 4.2 DC (4.2 env)
- Imported the VM
- Live merged the snapshot

Live merge succeeded



Used:

3.6 setup (tested on 3.5 DC):
rhevm-3.6.13.4-0.1.el6.noarch
vdsm-4.16.38-1.el6ev.x86_64
libvirt-0.10.2-62.el6.x86_64

4.3 setup:
ovirt-engine-4.3.0-0.0.master.20181012165724.gitd25f971.el7.noarch
vdsm-4.30.0-640.git6fd8327.el7.x86_64
libvirt-4.5.0-10.el7.x86_64

Comment 25 RHV bug bot 2018-12-10 15:13:44 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 26 RHV bug bot 2019-01-15 23:36:19 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 28 errata-xmlrpc 2019-05-08 12:36:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

