Bug 1710740

Summary: [downstream clone - 4.3.5] Do not change DC level if there are VMs running/paused with older CL.
Product: Red Hat Enterprise Virtualization Manager Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: ovirt-engine Assignee: shani <sleviim>
Status: CLOSED ERRATA QA Contact: Polina <pagranat>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.2.6 CC: aefrat, emarcus, frolland, gveitmic, klaas, michal.skrivanek, mkalinin, pagranat, rbarry, Rhev-m-bugs, sleviim, tnisan
Target Milestone: ovirt-4.3.4 Keywords: ZStream
Target Release: 4.3.1 Flags: lsvaty: testing_plan_complete-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.3.4.1 Doc Type: Bug Fix
Doc Text:
Updating the Data Center level while a virtual machine was suspended resulted in the virtual machine not resuming activity following the update. In this release, the suspended virtual machine must be resumed before the Data Center level update. Otherwise, the operation fails.
Story Points: ---
Clone Of: 1693813 Environment:
Last Closed: 2019-06-20 14:48:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1693813    
Bug Blocks:    

Description RHV bug bot 2019-05-16 08:18:17 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1693813 +++
======================================================================

Description of problem:

VMs (holding CL compatibility 4.1) that paused due to a no-space error fail to resume after the space issue is resolved in a DC and CL with compatibility 4.2, with the error below.

"Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_COMATIBILITY_VERSION_NOT_SUPPORTED,$VmName Test,$VmVersion 4.1,$DcVersion 4.2"

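For readers triaging similar failures, here is a minimal sketch of how the "Reasons:" string quoted above could be split into message keys and `$` variables. The function name `parse_reasons` is hypothetical, and the format (comma-separated tokens, variables written as `$Name value`) is an assumption inferred from this single sample, not a documented engine format. Values that themselves contain commas would not be handled.

```python
from typing import Dict, List, Tuple


def parse_reasons(raw: str) -> Tuple[List[str], Dict[str, str]]:
    """Split a 'Reasons:' string into message keys and $-variables.

    Assumption: tokens are comma-separated and variables look like
    '$Name value', as in the sample from this bug report.
    """
    body = raw.strip().strip('"')
    if body.startswith("Reasons:"):
        body = body[len("Reasons:"):]

    keys: List[str] = []
    variables: Dict[str, str] = {}
    for token in body.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("$"):
            # '$VmVersion 4.1' -> name 'VmVersion', value '4.1'
            name, _, value = token[1:].partition(" ")
            variables[name] = value
        else:
            keys.append(token)
    return keys, variables


# The sample string from the description (message key spelled as logged).
raw = ('"Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,'
       'ACTION_TYPE_FAILED_VM_COMATIBILITY_VERSION_NOT_SUPPORTED,'
       '$VmName Test,$VmVersion 4.1,$DcVersion 4.2"')
keys, variables = parse_reasons(raw)
```

With this sample, `variables` makes the mismatch explicit: the VM is at version 4.1 while the DC is at 4.2.
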

Version-Release number of selected component (if applicable):
rhvm-4.2.6.4-0.1.el7ev.noarch

How reproducible:
N/A

Steps to Reproduce:
1.
2.
3.

Actual results:
VM fails to resume


Expected results:
VM should resume back normally.


Additional info:

-- The rhvm database shows the entry below for the affected VM. However, the custom compatibility field appears blank for this VM in the rhvm portal.

 vm_name | cluster_compatibility_version | custom_compatibility_version
---------+-------------------------------+------------------------------
 Test    | 4.2                           | 4.1

-- Unsure whether we support resuming a VM with an old cluster compatibility level in a DC and CL of an upgraded version.

-- One more thing: powering the VM off and starting it again works fine.

(Originally by Koutuk Shukla)

Comment 3 RHV bug bot 2019-05-16 08:18:23 UTC
There was an older issue about customcompatibilityversion which is still present in 4.2.6, but I'd like to be clear: the cluster was not updated while these were paused, correct?

If not, please update to the latest 4.2

(Originally by Ryan Barry)

Comment 4 RHV bug bot 2019-05-16 08:18:25 UTC
Hi,
yes; cluster was upgraded a while back (December). A lot of the VMs were not rebooted because usually there is no need to do that asap (and docs don't suggest I would need to do that asap). Also 4.2.7/8 release notes do not show that there is a fix for a problem of this magnitude...

But let's sum this up:

1) the VMs should have resumed; even if they are still running with 4.1 compatibility because they have not been rebooted yet
2) they haven't because of a known bug in 4.2.6 manager that is not mentioned in release notes?
3) Could you point me to the bz that shows this problem?
4) Bonus question: Could I have resumed them with virsh on the hypervisors?


Greetings
Klaas

(Originally by klaas)

Comment 5 RHV bug bot 2019-05-16 08:18:27 UTC
(In reply to Ryan Barry from comment #3)
> There was an older issue about customcompatibilityversion which is still
> present in 4.2.6, but I'd like to be clear: the cluster was not updated while
> these were paused, correct?
> 
> If not, please update to the latest 4.2

It's the DC version, not the Cluster version, which fails the validation. We do not have custom DC version support. Generally, an upgrade of the DC should be prevented if there are VMs running in earlier Cluster levels.
It's likely that they upgraded the DC level while there were VMs running after a Cluster update to 4.2 (i.e. with a temporary 4.1 custom level). IMHO we shouldn't allow a DC upgrade while there are VMs running (including Paused) in CL < DC (including a custom level override).
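The proposed check can be sketched as follows. This is only an illustration of the rule described above, not the actual ovirt-engine validation: the `Vm` class, the status strings, and the function `vms_blocking_dc_upgrade` are all hypothetical names, and treating suspended VMs as blocking (alongside running and paused ones) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Version = Tuple[int, int]


def parse_version(text: str) -> Version:
    """Turn '4.2' into (4, 2) so versions compare numerically."""
    major, minor = text.split(".")
    return int(major), int(minor)


@dataclass
class Vm:
    name: str
    status: str                           # hypothetical: "up", "paused", "suspended", "down"
    cluster_version: str                  # compatibility level of the VM's cluster
    custom_version: Optional[str] = None  # temporary override, if any

    def effective_version(self) -> Version:
        # A custom level, when set, overrides the cluster level.
        return parse_version(self.custom_version or self.cluster_version)


# Assumption: these states count as "running" for the purpose of the check.
ACTIVE_STATES = {"up", "paused", "suspended"}


def vms_blocking_dc_upgrade(vms: List[Vm], target_dc_version: str) -> List[str]:
    """Names of active VMs whose effective level is below the target DC level."""
    target = parse_version(target_dc_version)
    return [
        vm.name
        for vm in vms
        if vm.status in ACTIVE_STATES and vm.effective_version() < target
    ]


vms = [
    Vm("Test", "paused", "4.2", custom_version="4.1"),  # the case from this bug
    Vm("Fresh", "up", "4.2"),
    Vm("Old", "down", "4.2", custom_version="4.1"),     # powered off, not blocking here
]
blockers = vms_blocking_dc_upgrade(vms, "4.2")
```

A DC upgrade to 4.2 would be refused while `blockers` is non-empty. Whether powered-off VMs with a custom level should also block is left open in this sketch.
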

DC upgrade validation is Storage, Tal, can you comment on what's the desired behavior around DC level upgrade?

(Originally by michal.skrivanek)

Comment 6 RHV bug bot 2019-05-16 08:18:28 UTC
(In reply to Klaas Demter from comment #4)
> Hi,
> yes; cluster was upgraded a while back (December). A lot of the VMs were not
> rebooted because usually there is no need to do that asap (and docs don't
> suggest I would need to do that asap). 

The wording changed in 4.2 to make it a bit more clear:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html/upgrade_guide/changing_the_cluster_compatibility_version_3-6_local_db

> Also 4.2.7/8 release notes do not
> show that there is a fix for a problem of this magnitude...
> 
> But let's sum this up:
> 
> 1) the VMs should have resumed; even if they are still running with 4.1
> compatibility because they have not been rebooted yet

no, because apparently you updated DC in the meantime. VMs in CL 4.1 are not supported to run in a DC 4.2

> 2) they haven't because of a known bug in 4.2.6 manager that is not mentioned in release notes?

no, because of a missing validation in DC update it seems. I would swear there was a bug about that but can't find it now. Tal?

> 3) Could you point me to the bz that shows this problem?
> 4) Bonus question: Could I have resumed them with virsh on the hypervisors?

Likely yes. It's already in an unsupported situation because the DC is already at 4.2. There's no difference between running and unpausing (it is still the same qemu process; not to be confused with suspend/resume), so it would very likely be fine to "cont" it via virsh.
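A rough sketch of what resuming via virsh could look like from a script on the hypervisor. `virsh resume <domain>` is the virsh subcommand that continues a paused domain; the helper names here are hypothetical, the domain name is taken from this report, and on RHV hosts virsh may require credentials for read-write access. As noted above, this is an unsupported workaround, so the sketch defaults to a dry run.

```python
import subprocess
from typing import List


def build_resume_command(domain: str) -> List[str]:
    """Command line that asks libvirt to continue a paused domain."""
    return ["virsh", "resume", domain]


def resume_domain(domain: str, dry_run: bool = True) -> List[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = build_resume_command(domain)
    if not dry_run:
        # Raises CalledProcessError if virsh exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: only shows what would be executed for the VM from this report.
planned = resume_domain("Test")
```
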

(Originally by michal.skrivanek)

Comment 7 RHV bug bot 2019-05-16 08:18:30 UTC
Okay, so the problem is that I can upgrade a DC even if there are still hosts running on a lower compatibility version inside the datacenter -- can't you check for that or at least warn about that? I have to admit I have read the docs multiple times and that was not clear to me.

(Originally by klaas)

Comment 8 RHV bug bot 2019-05-16 08:18:32 UTC
After reading the current docs again, I would still argue that they do not explicitly say that I need to reboot all VMs before changing the DC compatibility version:

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html/upgrade_guide/changing_the_cluster_compatibility_version_3-6_local_db
"After you update the cluster’s compatibility version, you must update the cluster compatibility version of all running or suspended virtual machines by restarting them from within the Manager, or using the REST API, instead of within the guest operating system. Virtual machines will continue to run in the previous cluster compatibility level until they are restarted. Those virtual machines that require a restart are marked with the pending changes icon ( pendingchanges ). You cannot change the cluster compatibility version of a virtual machine snapshot that is in preview; you must first commit or undo the preview."
"Once you have updated the compatibility version of all clusters in a data center, you can then change the compatibility version of the data center itself."

This states I must update them; it does not say I need to do that immediately or before upgrading the DC.

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html/upgrade_guide/changing_the_data_center_compatibility_version_3-6_local_db
"To change the data center compatibility version, you must have first updated all the clusters in your data center to a level that supports your desired compatibility level."

Also no word about the need to update the VMs before doing this.

Side note: "you must update the cluster compatibility version of all running or suspended virtual machines by restarting them from within the Manager, or using the REST API, instead of within the guest operating system" this should be obsolete on all systems that have guest agents installed since 4.2; the reboot should be noticed and transformed to a cold reboot (https://bugzilla.redhat.com/show_bug.cgi?id=1512619)

Greetings
Klaas

(Originally by klaas)

Comment 9 RHV bug bot 2019-05-16 08:18:34 UTC
(In reply to Klaas Demter from comment #8)
> after reading the current docs again I would still argue it does not
> explicitly say that I need to reboot all VMs before changing the DC
> compatibility version:
> 
> https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/
> html/upgrade_guide/changing_the_cluster_compatibility_version_3-6_local_db
> "After you update the cluster’s compatibility version, you must update the
> cluster compatibility version of all running or suspended virtual machines
> by restarting them from within the Manager, or using the REST API, instead
> of within the guest operating system. Virtual machines will continue to run
> in the previous cluster compatibility level until they are restarted. Those
> virtual machines that require a restart are marked with the pending changes
> icon ( pendingchanges ). You cannot change the cluster compatibility version
> of a virtual machine snapshot that is in preview; you must first commit or
> undo the preview."

"you must update the cluster compatibility version of all running or suspended virtual machines by restarting them from within the Manager"

What is unclear about this? Maybe we need a docs update.


> "Once you have updated the compatibility version of all clusters in a data
> center, you can then change the compatibility version of the data center
> itself."
> 
> This states I must update them; it does not say I need to do that
> immediately or before upgrading the DC.

From just below in your comment "to change the DC compatibility version, you must have first..."

So, yes, you need to do that before upgrading the DC.

> 
> https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/
> html/upgrade_guide/changing_the_data_center_compatibility_version_3-
> 6_local_db
> "To change the data center compatibility version, you must have first
> updated all the clusters in your data center to a level that supports your
> desired compatibility level."
> 
> also no word about the need to update the VMs before doing this.

It was in the first part of your comment. Specifically, that all running or suspended VMs need to be rebooted, and they may also need configuration updates.

> 
> Side note: "you must update the cluster compatibility version of all running
> or suspended virtual machines by restarting them from within the Manager, or
> using the REST API, instead of within the guest operating system" this
> should be obsolete on all systems that have guest agents installed since
> 4.2; the reboot should be noticed and transformed to a cold reboot
> (https://bugzilla.redhat.com/show_bug.cgi?id=1512619)
> 
> Greetings
> Klaas

Ultimately, the bug here seems to be that it was possible to initiate a DC-level update without following the steps above. That paused VMs fail to come back up (and fail validation) is a side effect of this. That's expected behavior, but it's unexpected that a VM would fall through this gap.

I would have sworn there was another bug around DC upgrades also, but these may also be relevant: 

https://bugzilla.redhat.com/show_bug.cgi?id=1649685 
https://bugzilla.redhat.com/show_bug.cgi?id=1662921

In either case, if configuration updates were performed over the API, it _may_ have kept one of these on an older version. But, in general, the failure to resume here is probably NOTABUG. Instead, it should have failed validation on the DC upgrade. What's the expected behavior here?

(Originally by Ryan Barry)

Comment 10 RHV bug bot 2019-05-16 08:18:35 UTC
(In reply to Ryan Barry from comment #9)
[...]
> 
> "you must update the cluster compatibility version of all running or
> suspended virtual machines by restarting them from within the Manager"
> 
> What is unclear about this? Maybe we need a docs update.

It does not say this is a prerequisite for continuing as it does with "change compatibility of the cluster" so I assumed that is not immediately needed.

[..]
> 
> Ultimately, the bug here seems to be that it was possible to initiate a
> DC-level update without following the steps above. That paused VMs fail to
> come back up (and fail validation) is a side effect of this. That's expected
> behavior, but it's unexpected that a VM would fall through this gap.

I fully agree with this assessment: the DC upgrade should not be possible; the error is just a result of it being possible.
> 
> I would have sworn there was another bug around DC upgrades also, but these
> may also be relevant: 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1649685 
> https://bugzilla.redhat.com/show_bug.cgi?id=1662921
> 
> In either case, if configuration updates were performed over the API, it
> _may_ have kept one of these on an older version. But, in general, the
> failure to resume here is probably NOTABUG. Instead, it should have failed
> validation on the DC upgrade. What's the expected behavior here?


I do not perform changes via the API; everything is done by rhvm itself, and my changes come through the web UI for now.


This bug can either be closed as NOTABUG or transformed into "dc upgrade should not be possible if VMs still have older cluster compatibility version"

(Originally by klaas)

Comment 11 RHV bug bot 2019-05-16 08:18:37 UTC
Tal, thoughts on the final part of this? Neither Michal nor I can find an appropriate bug, but this should definitely be blocked

(Originally by Ryan Barry)

Comment 13 RHV bug bot 2019-05-16 08:18:40 UTC
Tal?

(Originally by Ryan Barry)

Comment 14 RHV bug bot 2019-05-16 08:18:42 UTC
There's also a typo ACTION_TYPE_FAILED_VM_COMATIBILITY_VERSION_NOT_SUPPORTED : COMATIBILITY -> COMPATIBILITY

(Originally by Sandro Bonazzola)

Comment 17 RHV bug bot 2019-05-16 08:18:47 UTC
Shani, please check the discussion on rhev-tech about upgrading CL.
What should we do about paused VMs?

(Originally by Fred Rolland)

Comment 18 RHV bug bot 2019-05-16 08:18:49 UTC
(In reply to Fred Rolland from comment #17)
> Shani, please check the discussion on rhev-tech about upgrading CL.
> What should we do about paused VMs?
We did PowerOff -> PowerOn
but as comment #6 suggests, maybe you can use virsh to resume the VMs.

(Originally by klaas)

Comment 20 RHV bug bot 2019-05-16 15:29:10 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

Comment 21 Polina 2019-05-20 16:38:28 UTC
Verified on ovirt-engine-4.4.0-0.0.master.20190509133331.gitb9d2a1e.el7.noarch.
The scenario is:
1. create a DC with an 'old' version (4.1/4.2).
2. create a cluster with 4.1/4.2 version.
3. create a host on the DC and create a VM on the cluster.
4. run the VM and suspend it. Also tried pausing the VM by blocking storage on the host, which causes an I/O error pause.
5. upgrade the cluster to a newer version (VM is still paused). Tried the following updates: 4.1 -> 4.2 -> 4.3 -> 4.4; 4.1 -> 4.3
6. try to update the DC. Run the suspended VM. For the paused VM, delete the blocking rule and see that the VM is running again after the DC is updated.

Comment 22 Fred Rolland 2019-05-21 07:05:17 UTC
Avihai, can you ack?

Comment 23 Avihai 2019-05-21 08:12:12 UTC
(In reply to Fred Rolland from comment #22)
> Avihai, can you ack?

Looks like Polina already did the QE work, and the scenario looks virt-ish.
Polina, can you ack it (as you already tested it)?

Comment 25 Polina 2019-05-26 06:29:07 UTC
verified according to the https://bugzilla.redhat.com/show_bug.cgi?id=1710740#c21

Comment 29 errata-xmlrpc 2019-06-20 14:48:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1566

Comment 30 Daniel Gur 2019-08-28 13:11:56 UTC
sync2jira

Comment 31 Daniel Gur 2019-08-28 13:16:09 UTC
sync2jira