Bug 1425089 - Cannot change cluster to 4.1 - Cannot edit Cluster. Maximum memory (24576MB) cannot exceed platform limit (20480MB).
Keywords:
Status: CLOSED DUPLICATE of bug 1418641
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Frontend.WebAdmin
Version: 4.1.0.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.1.2
Assignee: Shahar Havivi
QA Contact: Pavel Stehlik
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-02-20 14:32 UTC by Jiri Belka
Modified: 2017-03-23 11:10 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 11:10:42 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.1+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 73588 0 master ABANDONED core: fail upgrade when VMs have more max memory then cluster max memory 2017-03-27 07:18:25 UTC

Description Jiri Belka 2017-02-20 14:32:12 UTC
Description of problem:

We recently upgraded from 4.0 to 4.1-beta, and after trying to switch the cluster to 4.1 I get this warning (FYI, the issue is in the second popup):


~~~
Change Cluster Compatibility Version

All running VMs will be temporarily reconfigured to use the previous cluster compatibility version and marked pending configuration change.
In order to change the cluster compatibility version of the VM to a new version, the VM needs to be manually shut down and restarted.
There are 55 running VM(s) affected by this change.

Are you sure you want to change the Cluster Compatibility Version?
~~~

Clicking [OK] then produces the second popup:

~~~
Operation Canceled

Error while executing action: Cannot edit Cluster. Maximum memory (24576MB) cannot exceed platform limit (20480MB)
~~~

This is an unfortunate surprise :/

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0.4-0.1.el7.noarch

How reproducible:
just happens

Steps to Reproduce:
1. have an old setup (ours was 3.5 -> 3.6 -> 4.0)
2. upgrade to 4.1-beta
3. switch cluster from 4.0 to 4.1

Actual results:
not possible to upgrade cluster compat level to 4.1

Expected results:
should work

Additional info:
If there's an issue, the message should at least be clearer.

FYI, before the error (which also appears in the Events main tab), there is a huge number of VM reconfiguration events, e.g.:

  VM lbednar-rhset1 configuration was updated by system.

So the action, even though it ended in "Operation Canceled", was messing with the current configuration, i.e. it updated the configuration of VMs. Strange; shouldn't some precheck be done before performing the action?
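The precheck the reporter asks for amounts to a validate-before-apply pattern: check every VM against the target cluster's limit first, and only start reconfiguring once the whole list validates. A minimal sketch, with hypothetical names (`Vm`, `findViolations`, `PLATFORM_LIMIT_MB`) that are not the actual oVirt engine code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a validate-before-apply cluster change.
// Names are illustrative only, not taken from the oVirt engine.
public class ClusterUpgradeCheck {
    static final int PLATFORM_LIMIT_MB = 20480;

    record Vm(String name, int maxMemoryMb) {}

    // Collect every VM whose max memory exceeds the target cluster's limit.
    static List<Vm> findViolations(List<Vm> vms, int limitMb) {
        List<Vm> bad = new ArrayList<>();
        for (Vm vm : vms) {
            if (vm.maxMemoryMb() > limitMb) {
                bad.add(vm);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<Vm> vms = List.of(
                new Vm("rhci-cfme-wk1", 24576),
                new Vm("small-vm", 8192));
        // Only start reconfiguring VMs once the whole list validates;
        // otherwise refuse up front, leaving all VM configs untouched.
        List<Vm> bad = findViolations(vms, PLATFORM_LIMIT_MB);
        if (!bad.isEmpty()) {
            System.out.println("Refusing cluster change; offending VMs: " + bad);
        }
    }
}
```

With such a scan up front, no "configuration was updated by system" events would be emitted for a change that is going to fail anyway.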

Comment 2 Jiri Belka 2017-02-20 14:38:43 UTC
engine=# select vm_name,os,max_memory_size_mb from vms where max_memory_size_mb > 20480 order by max_memory_size_mb desc;
     vm_name      | os | max_memory_size_mb 
------------------+----+--------------------
 brq-openldap     |  0 |            4194304
 brq-rhosci       | 19 |            4194304
 om-ovirt         | 24 |            4194304
 om-openstack     |  5 |            4194304
 om-wgt           | 24 |            4194304
 selenium         | 19 |            4194304
 brq-ipa          | 19 |            4194304
 om-ad-child2     | 25 |            4194304
 brq-w2k8r2       | 17 |            4194304
 brq-w2k12r2      | 25 |            4194304
 lbednar-rhset1   | 19 |            4194304
 HostedEngine     |  5 |              65536
 brq-dev          |  5 |              49152
 lleistne-engine1 |  0 |              32768
 jboss-eap-qe01   | 19 |              32768
 ps-ovirt         | 19 |              32768
 lleistne-edb     | 24 |              32768
 cfme-552         | 24 |              32768
 pn-win8.1        | 21 |              32768
 pk-e5            | 24 |              32000
 ps-rh6           | 19 |              28672
 rhci-cfme-wk1    | 18 |              24576
 rhci-cfme-candu  | 18 |              24576
 brq-update       | 19 |              24480
 mo-update        | 19 |              24480
 pbal-engine36    | 19 |              20516
 pbal-engine      | 19 |              20516
 gr-rhev35_1      | 19 |              20516
 gr-rhev35        | 19 |              20516
 selenium-nodes   | 19 |              20516
(30 rows)

os = '18' = RHEL6 32bit.

Comment 3 Jiri Belka 2017-02-20 14:40:42 UTC
(In reply to Jiri Belka from comment #2)
> engine=# select vm_name,os,max_memory_size_mb from vms where
> max_memory_size_mb > 20480 order by max_memory_size_mb desc;
>      vm_name      | os | max_memory_size_mb 
> ------------------+----+--------------------
...
>  rhci-cfme-wk1    | 18 |              24576
>  rhci-cfme-candu  | 18 |              24576
...
> (30 rows)
> 
> os = '18' = RHEL6 32bit.

After switching those VMs to 64bit, the cluster compat level bump was successful.

Comment 4 Michal Skrivanek 2017-02-21 06:14:42 UTC
I suppose those VMs were actually imported. Can you confirm how and when they were created? The limit hasn't changed for quite some time, so it likely bypassed the checks back then.

Comment 5 Jiri Belka 2017-02-21 09:12:10 UTC
(In reply to Michal Skrivanek from comment #4)
> I suppose those VMs were actually imported. Can you confirm how and when
> they were created? The limit didn't change for quite some time so it likely
> bypassed the checks back then.

Our env has a long history; it used to be, IIRC, 3.0, and at 3.5 it was migrated to SHE.

engine=# select vm_name,max_memory_size_mb,vmt_name,creation_date,vmt_creation_date,last_start_time,last_stop_time from vms where vm_name = 'rhci-cfme-wk1';
-[ RECORD 1 ]------+---------------------------
vm_name            | rhci-cfme-wk1
max_memory_size_mb | 24576
vmt_name           | Blank
creation_date      | 2014-11-18 11:10:27.635-05
vmt_creation_date  | 2008-03-31 18:00:00-04
last_start_time    | 2015-08-09 23:39:32.339-04
last_stop_time     | 2015-12-17 10:55:48.59-05

I have no other info about history of those VMs.

Comment 6 Michal Skrivanek 2017-02-21 09:22:15 UTC
The check for max memory was added in ~3.5, so if the VM was imported before that, it is possible the value has been wrong all that time.
The name "rhci-cfme" indicates it might be an external OVF imported into oVirt, and that the settings were wrong in that OVF. That is quite likely, because there were (and I think still are) quite a few differences between the image CFME produces and what we are expecting.
This should not happen for oVirt/RHV-exported VMs.

Improved logging in these cases is being tracked as bug 1418641

*** This bug has been marked as a duplicate of bug 1418641 ***

Comment 7 Gil Klein 2017-02-21 20:34:54 UTC
(In reply to Michal Skrivanek from comment #6)
> This should not happen for oVirt/RHV-exported VMs.
I have to reopen this issue, as I just stumbled upon it on a different QE production system.

Since CFME deployment on a RHEV system is based on an imported VM, I expect we will keep hitting this if a fix is not provided.

engine=# select vm_name,os from vms where os = 18;
       vm_name        | os
----------------------+----
 rhci-cfme-prod-wk1   | 18
 rhci-cfme-prod-candu | 18
 rhci-cfme-prod-db    | 18
(3 rows)

Comment 8 Yaniv Kaul 2017-02-22 09:00:13 UTC
Gil - can you verify if the CFME OVF has a wrong XML?
Michal - I believe we need to address this somehow during upgrade and fix this.

Comment 9 Tomas Jelinek 2017-02-22 10:08:13 UTC
Since 4.1.1, it should complain during import if the value is not correct:
https://gerrit.ovirt.org/#/c/69741/

But if the VM was imported before that, the value may be incorrect. Doing magic fixes of the wrong values does not sound too good to me. We should rather implement bug 1418641, to give the user a better chance to understand what is wrong during the update and a chance to solve it...

Comment 10 Michal Skrivanek 2017-02-22 10:14:40 UTC
We're not fixing old imports; without knowing exactly when it was imported and what kind of OVF was used, we can't really do much. It is an invalid VM, and that happened during the import. I believe currently available CFME templates no longer use this 32bit OS, so it is not a problem anymore.

I'll keep the bug open for a while for further thoughts, then; but this is not Urgent/Urgent, it's not a Regression, and I will close it again if no further comments are received.

Comment 11 Jiri Belka 2017-02-22 10:29:08 UTC
(In reply to Michal Skrivanek from comment #10)
> we're not fixing old imports, without knowing when exactly it was imported
> and what kind of OVF was used we can't really do much. It is an invalid VM
> and that happened already during the import. I believe current available
> CFME templates do not have this 32bit OS anymore, so it is not a problem
> anymore
> 
> I'll keep the bug opened for a while for further thoughts than, but this is
> nothing Urgent/Urgent, it's not a Regression, and I will close this again if
> no further comments are received

engine-setup already does a couple of checks; what about adding this one, so there's a warning before users do the upgrade?

Comment 12 Yaniv Kaul 2017-02-22 10:39:40 UTC
(In reply to Jiri Belka from comment #11)
> engine-setup already does a couple of checks; what about adding this one, so
> there's a warning before users do the upgrade?

Agreed. I feel it's better to fail the upgrade on this than to fail a 'day 2 operation'.

Comment 13 Tomas Jelinek 2017-03-10 13:58:24 UTC
(In reply to Yaniv Kaul from comment #12)
> Agreed. I feel it's better to fail the upgrade on this than to fail a 'day 2
> operation'.

I don't think engine-setup is a good place for this. There are lots of checks that need to be done on a VM, and they live in UpdateVmCommand.validate(); I would not try to re-implement them in SQL scripts on update because:
- we will never be able to implement them all and keep them up to date in the long run
- the flow for the user would be: turn the engine off, run engine-setup, look at the failed VMs, turn the engine on, fix the VMs, turn the engine off, run engine-setup again. Not a great user experience, especially if the second engine-setup run fails on some other check.

I think the biggest issue is the incorrect reporting and the insufficient help given to the user on how to solve the issue. That should be addressed here:
https://bugzilla.redhat.com/show_bug.cgi?id=1418641#c5
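The design argued for here (one shared validator used both when a single VM is edited and by any hypothetical pre-upgrade scan, so the two rule sets can never drift apart) can be sketched as follows; all names are illustrative and not the actual UpdateVmCommand code:

```java
import java.util.function.IntPredicate;

// Illustrative sketch of comment 13's point: keep one source of truth
// for the memory rule, shared by the "edit VM" path and a hypothetical
// pre-upgrade check, so the rules cannot diverge. Not oVirt engine code.
public class SharedValidation {
    // The single memory rule: max memory must not exceed the platform limit.
    static IntPredicate withinPlatformLimit(int limitMb) {
        return maxMemMb -> maxMemMb <= limitMb;
    }

    // Used when a single VM is edited (analogous in spirit to
    // UpdateVmCommand.validate(), which holds many such checks).
    static boolean validateVmUpdate(int maxMemMb, int limitMb) {
        return withinPlatformLimit(limitMb).test(maxMemMb);
    }

    // Hypothetically run before a cluster compatibility upgrade:
    // the same predicate, applied to every VM in the cluster.
    static boolean validateClusterUpgrade(int[] vmMaxMemMb, int limitMb) {
        IntPredicate ok = withinPlatformLimit(limitMb);
        for (int m : vmMaxMemMb) {
            if (!ok.test(m)) {
                return false;
            }
        }
        return true;
    }
}
```

Duplicating the rule in engine-setup SQL scripts would mean maintaining a second copy of every such predicate, which is exactly the maintenance burden the comment warns about.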

Comment 14 Tomas Jelinek 2017-03-23 11:10:42 UTC
OK, marking it as a duplicate of bug 1418641; it is a generic solution for all these kinds of problems.

*** This bug has been marked as a duplicate of bug 1418641 ***

