Description of problem:
VMs (holding CL compatibility 4.1) that were paused due to a no-space error fail to resume after the space issue is resolved, in a DC and CL with compatibility 4.2. The error is:

"Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_COMATIBILITY_VERSION_NOT_SUPPORTED,$VmName Test,$VmVersion 4.1,$DcVersion 4.2"

Version-Release number of selected component (if applicable):
rhvm-4.2.6.4-0.1.el7ev.noarch

How reproducible:
N/A

Steps to Reproduce:
1.
2.
3.

Actual results:
VM fails to resume

Expected results:
VM should resume normally.

Additional info:
-- The rhvm database shows the entry below for the affected VM. However, the custom compatibility tab is blank for the VM on the rhvm portal.

 vm_name | cluster_compatibility_version | custom_compatibility_version
---------+-------------------------------+------------------------------
 Test    | 4.2                           | 4.1

-- Unsure whether we support resuming a VM with an old cluster compatibility in a DC and CL of an upgraded version.
-- One more thing: Power off and Start of the VM works fine.
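For illustration only, here is a minimal Python sketch of the check that the error message implies. It is not the engine's actual code. It assumes that a non-empty custom compatibility version overrides the cluster version and that a VM may only run when the resulting level is not below the DC level; the function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Turn a version string such as '4.2' into a comparable (major, minor) tuple."""
    major, minor = text.split(".")
    return int(major), int(minor)


def effective_version(cluster_version: str, custom_version: Optional[str]) -> Tuple[int, int]:
    """Assumed rule: a non-empty custom version overrides the cluster version."""
    return parse_version(custom_version or cluster_version)


def can_run_in_dc(cluster_version: str, custom_version: Optional[str], dc_version: str) -> bool:
    """Assumed check: the VM's effective level must not be below the DC level."""
    return effective_version(cluster_version, custom_version) >= parse_version(dc_version)


# Row from the report: cluster 4.2, custom 4.1, data center 4.2.
print(can_run_in_dc("4.2", "4.1", "4.2"))  # False
print(can_run_in_dc("4.2", None, "4.2"))   # True
```

Under these assumptions the row shown above fails the check because the custom 4.1 level wins over the cluster's 4.2, which matches the reported $VmVersion 4.1 / $DcVersion 4.2 pair.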
There was an older issue about customcompatibilityversion which is still present in 4.2.6, but I'd like to be clear: the cluster was not updated while these were paused, correct?

If not, please update to the latest 4.2
Hi,
yes; the cluster was upgraded a while back (December). A lot of the VMs were not rebooted because usually there is no need to do that asap (and the docs don't suggest I would need to do that asap). Also, the 4.2.7/8 release notes do not show that there is a fix for a problem of this magnitude...

But let's sum this up:

1) the VMs should have resumed, even if they are still running with 4.1 compatibility because they have not been rebooted yet
2) they haven't because of a known bug in the 4.2.6 manager that is not mentioned in the release notes?
3) Could you point me to the bz that shows this problem?
4) Bonus question: Could I have resumed them with virsh on the hypervisors?

Greetings
Klaas
(In reply to Ryan Barry from comment #3)
> There was an older issue about customcompatibilityversion which is still
> present in 4.2.6, but I'd to be clear, the cluster was no updated while
> these were paused, correct?
>
> If not, please update to the latest 4.2

It's the DC version, not the Cluster version, which fails the validation. We do not support a custom DC version.
Generally, an upgrade of the DC should be prevented if there are VMs running in earlier Cluster levels. It's likely that they upgraded the DC level while there were VMs running after a Cluster update to 4.2 (i.e. with a temporary 4.1 custom level).
IMHO we shouldn't allow a DC upgrade while there are VMs running (including Paused) in CL < DC (including the custom level override).
DC upgrade validation is Storage. Tal, can you comment on what's the desired behavior around DC level upgrade?
(In reply to Klaas Demter from comment #4)
> Hi,
> yes; cluster was upgraded a while back (December). A lot of the VMs were not
> rebooted because usually there is no need to do that asap (and docs don't
> suggest I would need to do that asap).

The wording changed in 4.2 to make it a bit more clear:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html/upgrade_guide/changing_the_cluster_compatibility_version_3-6_local_db

> Also 4.2.7/8 release notes do not
> show that there is a fix for a problem of this magnitude...
>
> But lets sum this up:
>
> 1) the VMs should have resumed; even if they are still running with 4.1
> compatibility because they have not been rebooted yet

No, because apparently you updated the DC in the meantime. VMs in CL 4.1 are not supported to run in a DC 4.2.

> 2) they haven't because of a known bug in 4.2.6 manager that is not mentioned in release notes?

No, because of a missing validation in the DC update, it seems. I would swear there was a bug about that but can't find it now. Tal?

> 3) Could you point me to the bz that shows this problem?
> 4) Bonus question: Could I have resumed them with virsh on the hypervisors?

Likely yes. It's already in an unsupported situation because the DC is already 4.2. There's no difference between running and unpausing (it is still the same qemu process; not to be confused with suspend/resume), so it would very likely be fine to "cont" it via virsh.
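As a rough sketch of what "cont via virsh" could look like, the Python snippet below only builds the command line and does not run it. `virsh resume` is the libvirt command that resumes a paused domain; the connection URI, the helper name, and the assumption that the caller has read-write libvirt access on that host are illustrative and not taken from this report. As noted above, this would be done in an already unsupported situation.

```python
import shlex
from typing import List


def build_resume_command(domain: str, uri: str = "qemu:///system") -> List[str]:
    """Build (but do not run) the virsh command line that resumes a paused domain."""
    if not domain:
        raise ValueError("domain name must not be empty")
    # "--connect" selects the hypervisor URI; "resume" un-pauses the domain.
    return ["virsh", "--connect", uri, "resume", domain]


# "Test" is the VM name from the report; the URI is an assumed default.
command = build_resume_command("Test")
print(shlex.join(command))
```

Returning the argument list instead of executing it keeps the sketch free of side effects; on a real host it could be passed to `subprocess.run` once access and the domain name have been confirmed.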
Okay, so the problem is that I can upgrade a DC even if there are still hosts running on a lower compatibility version inside the datacenter -- can't you check for that or at least warn about that? I have to admit I have read the docs multiple times and that was not clear to me.
After reading the current docs again, I would still argue they do not explicitly say that I need to reboot all VMs before changing the DC compatibility version:

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html/upgrade_guide/changing_the_cluster_compatibility_version_3-6_local_db
"After you update the cluster’s compatibility version, you must update the cluster compatibility version of all running or suspended virtual machines by restarting them from within the Manager, or using the REST API, instead of within the guest operating system. Virtual machines will continue to run in the previous cluster compatibility level until they are restarted. Those virtual machines that require a restart are marked with the pending changes icon ( pendingchanges ). You cannot change the cluster compatibility version of a virtual machine snapshot that is in preview; you must first commit or undo the preview."
"Once you have updated the compatibility version of all clusters in a data center, you can then change the compatibility version of the data center itself."

This states I must update them; it does not say I need to do that immediately or before upgrading the DC.

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html/upgrade_guide/changing_the_data_center_compatibility_version_3-6_local_db
"To change the data center compatibility version, you must have first updated all the clusters in your data center to a level that supports your desired compatibility level."

Also no word about the need to update the VMs before doing this.

Side note: "you must update the cluster compatibility version of all running or suspended virtual machines by restarting them from within the Manager, or using the REST API, instead of within the guest operating system" should be obsolete on all systems that have guest agents installed since 4.2; the reboot should be noticed and transformed to a cold reboot (https://bugzilla.redhat.com/show_bug.cgi?id=1512619)

Greetings
Klaas
(In reply to Klaas Demter from comment #8)
> after reading the current docs again I would still argue it does not
> explicitly say that I need to reboot all VMs before changing the DC
> compatibility version:
>
> https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html/upgrade_guide/changing_the_cluster_compatibility_version_3-6_local_db
> "After you update the cluster’s compatibility version, you must update the
> cluster compatibility version of all running or suspended virtual machines
> by restarting them from within the Manager, or using the REST API, instead
> of within the guest operating system. Virtual machines will continue to run
> in the previous cluster compatibility level until they are restarted. Those
> virtual machines that require a restart are marked with the pending changes
> icon ( pendingchanges ). You cannot change the cluster compatibility version
> of a virtual machine snapshot that is in preview; you must first commit or
> undo the preview."

"you must update the cluster compatibility version of all running or suspended virtual machines by restarting them from within the Manager"

What is unclear about this? Maybe we need a docs update.

> "Once you have updated the compatibility version of all clusters in a data
> center, you can then change the compatibility version of the data center
> itself."
>
> This states I must updated them; it does not say I need to do that
> immediately or before upgrading the DC.

From just below in your comment: "to change the DC compatibility version, you must have first..."

So, yes, you need to do that before upgrading the DC.

> https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html/upgrade_guide/changing_the_data_center_compatibility_version_3-6_local_db
> "To change the data center compatibility version, you must have first
> updated all the clusters in your data center to a level that supports your
> desired compatibility level."
>
> also no word about the need to update the VMs before doing this.

It was in the first part of your comment. Specifically, all running or suspended VMs need to be rebooted, and they may also need configuration updates.

> Side note: "you must update the cluster compatibility version of all running
> or suspended virtual machines by restarting them from within the Manager, or
> using the REST API, instead of within the guest operating system" this
> should be obsolete on all systems that have guest agents installed since
> 4.2; the reboot should be noticed and transformed to a cold reboot
> (https://bugzilla.redhat.com/show_bug.cgi?id=1512619)
>
> Greetings
> Klaas

Ultimately, the bug here seems to be that it was possible to initiate a DC-level update without following the steps above. That paused VMs fail to come back up (and fail validation) is a side effect of this. That's expected behavior, but it's unexpected that a VM would fall through this gap.

I would have sworn there was another bug around DC upgrades also, but these may also be relevant:

https://bugzilla.redhat.com/show_bug.cgi?id=1649685
https://bugzilla.redhat.com/show_bug.cgi?id=1662921

In either case, if configuration updates were performed over the API, it _may_ have kept one of these on an older version. But, in general, the failure to resume here is probably NOTABUG. Instead, it should have failed validation on the DC upgrade. What's the expected behavior here?
(In reply to Ryan Barry from comment #9)
[...]
>
> "you must update the cluster compatibility version of all running or
> suspended virtual machines by restarting them from within the Manager"
>
> What is unclear about this? Myabe we need a docs update.

It does not say this is a prerequisite for continuing, as it does with "change compatibility of the cluster", so I assumed it is not immediately needed.

[..]
>
> Ultimately, the bug here seems to be that it was possible to initiate a
> DC-level update without following the steps above. That paused VMs fail to
> come back up (and fail validation) is a side effect of this. That's expected
> behavior, but it's unexpected that a VM would fall through this gap.

I fully agree with this assessment: the DC upgrade should not be possible; the error is just a result of it being possible.

>
> I would have sworn there was another bug around DC upgrades also, but these
> may also be relevant:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1649685
> https://bugzilla.redhat.com/show_bug.cgi?id=1662921
>
> In either case, if configuration updates were performed over the API, it
> _may_ have kept one of these on an older version. But, in general, the
> failure to resume here is probably NOTABUG. Instead, it should have failed
> validation on the DC upgrade. What's the expected behavior here?

I do not perform changes via the API; everything is done by rhvm itself, and my changes come through the web UI for now.

This bug can either be closed as NOTABUG or transformed into "dc upgrade should not be possible if VMs still have older cluster compatibility version"
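A minimal sketch of the validation proposed here, in Python rather than the engine's own code: before a DC upgrade, collect every VM that is still up, paused or suspended with an effective level below the target DC level, and refuse the upgrade if any are found. The class, the status strings, the set of blocking statuses and the message text are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Version = Tuple[int, int]

# Statuses treated as blocking in this sketch (an assumption, not engine behavior).
BLOCKING_STATUSES = {"up", "paused", "suspended"}


@dataclass
class Vm:
    name: str
    status: str
    cluster_version: Version
    custom_version: Optional[Version] = None  # temporary override, if any


def blocking_vms(vms: List[Vm], target_dc_version: Version) -> List[str]:
    """Names of VMs whose effective level is below the target DC level."""
    blocking = []
    for vm in vms:
        level = vm.custom_version or vm.cluster_version
        if vm.status in BLOCKING_STATUSES and level < target_dc_version:
            blocking.append(vm.name)
    return blocking


def validate_dc_upgrade(vms: List[Vm], target_dc_version: Version) -> Optional[str]:
    """Return None when the upgrade may proceed, else a (hypothetical) error text."""
    names = blocking_vms(vms, target_dc_version)
    if not names:
        return None
    return "Cannot update Data Center compatibility version; restart first: " + ", ".join(names)


vms = [
    Vm("Test", "paused", (4, 2), custom_version=(4, 1)),
    Vm("Other", "up", (4, 2)),
]
print(validate_dc_upgrade(vms, (4, 2)))
```

With the VM from this report in the list, the sketch rejects the upgrade and names "Test"; once that VM has been restarted and no longer carries the 4.1 override, the function returns None.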
Tal, thoughts on the final part of this? Neither Michal nor I can find an appropriate bug, but this should definitely be blocked
Tal?
There's also a typo ACTION_TYPE_FAILED_VM_COMATIBILITY_VERSION_NOT_SUPPORTED : COMATIBILITY -> COMPATIBILITY
Shani, please check the discussion on rhev-tech about upgrading CL. What should we do about paused VMs?
(In reply to Fred Rolland from comment #17)
> Shani, please check the discussion on rhev-tech about upgrading CL.
> What should we do about paused VMs?

We did PowerOff -> PowerOn, but as #6 suggests, maybe you can use virsh to resume the VMs.
Verified on ovirt-engine-4.4.0-0.0.master.20190509133331.gitb9d2a1e.el7.noarch.

The scenario is:
1. Create a DC with an 'old' version (4.1/4.2).
2. Create a cluster with a 4.1/4.2 version.
3. Create a host on the DC and create a VM on the cluster.
4. Run the VM and suspend it. Also tried pausing the VM by blocking storage on the host, which causes an I/O error pause.
5. Upgrade the cluster to a newer version (the VM is still paused). Tried the following updates: 4.1 -> 4.2 -> 4.3 -> 4.4; 4.1 -> 4.3.
6. Try to update the DC. Run the suspended VM. For the paused VM, delete the blocking rule and see that the VM is running again after the DC is updated.
QE verification bot: the bug was verified upstream
WARN: Bug status (VERIFIED) wasn't changed but the folowing should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
from my understanding this case is not about resuming paused VMs before updating the DC level. You need to reboot the VMs to change their compatibility level. https://bugzilla.redhat.com/show_bug.cgi?id=1693813#c5
Basically, there are a few situations:
- Shutting the VMs down before upgrading the DC's level acts as rebooting the VMs.
- In case there are running VMs, they should be shut down before performing the upgrade.
- In case the DC contains some suspended VMs with an older custom compatibility version, those VMs won't be able to resume once the upgrade has finished (due to older compatibility).
Therefore, those VMs should be powered off/resumed before upgrading.
(In reply to shani from comment #33)
> Basically, there are a few situations:
> - Shutting the VMs down before upgrading the DC's level acts as rebooting
> the VMs.
> - In case there are up running VMs, they should be shut down before
> operating the upgrade.
> - In case the DC contains some suspended VMs with an older custom
> compatibility version, those VMs won't be able to resume once the upgrade
> has finished (due to older compatibility).
> Therefore, those VMs should be power off/resume before upgrading.

Hi,
the last sentence is wrong, which is the point I was trying to make in my last comment. You should not resume them before upgrading; you should reboot or power them off in order to upgrade the compatibility level. Resuming them won't change it.

Greetings
Klaas
Hi Klaas,

As I see it, there are 2 levels to the solution:

1. As the bug first mentioned in c1, "Actual results: VM fails to resume".
This concerns VMs whose compatibility version is older than the version you want to update to, so not all paused VMs need to be resumed, only those that the upgrade would prevent from resuming.

The fix identifies those VMs and tells the user to take an action (powering them off/resuming them) before performing the upgrade.
Taking that action will allow you to upgrade the DC, and make sure those VMs will be able to run again over the new compatibility version DC.

2. In order to completely upgrade the VM, it should be rebooted/powered off and on again.
Otherwise, the upgrade won't be completed and the new compatibility level won't be updated for the VM:
After updating the DC level, there's a UI icon next to the VM's name, indicating that a restart is needed so its compatibility level will be upgraded.

Before that fix, in the case of a suspended VM, the icon appears, but you can't resume the VM due to an error of "custom compatibility version is not supported in Data Center".
So that VM won't be able to resume running once the upgrade has been made.

To sum up:
- This fix was meant to avoid the chance of paused VMs that can't resume again due to a compatibility version issue.
- You still need to reboot the VM in order to fully complete the DC upgrade.

Hope it makes more sense.
(In reply to shani from comment #36)
[...]
>
> The fix indicated those VMs and tells the user to take an action (powering
> them off/resume them) before performing the upgrade.
> Taking that action will allow you to upgrade the DC, and make sure those VMs
> will be able to run again over the new compatibility version DC.

No, if the fix does this then the fix is not right. The issue is that a DC compatibility change should not be possible while there are VMs (in any state) that still have a lower compatibility version.

[...]
> After updating the DC level, there's a UI icon next to the VM's name,
> indicating that a restart is needed so its compatibility level would be
> upgraded.

No, you upgrade the cluster compatibility version and the VMs get the next_run indication. Then you need to reboot them all. Then you can upgrade the DC compatibility version.

>
> Before that fix, in case of a suspended VM, the icon appears, but you can't
> resume the VM due to an error of "custom compatibility version is not
> supported in Data Center".
> So that VM won't be able to resume running once the upgrade is being made.

I am not sure this is correct as I haven't tried it myself, but the initial bug is from a support case I opened. This was not part of that case :) I do think you are mixing cluster and data center compatibility changes though.

>
> To sum up:
> - This fix was meant to avoid a chance of paused VMs that won't be able to
> resume again, due to a compatibility version issue.

The problem in my case was not a paused VM during the cluster compatibility upgrade; the problem was that I upgraded the DC version before all VMs were rebooted. In this state (after the cluster compatibility change, after the DC compatibility version change, before the VM reboot) the VMs went into a paused state because of storage problems and were unable to resume.

> - You still need to reboot the VM in order to fully complete the DC upgrade.

You shouldn't be able to upgrade the DC if there are still VMs running in an older compatibility mode. That is this bug (see the title "Do not change DC level if there are VMs running/paused with older CL." or see https://bugzilla.redhat.com/show_bug.cgi?id=1693813#c9 "Ultimately, the bug here seems to be that it was possible to initiate a DC-level update without following the steps above. That paused VMs fail to come back up (and fail validation) is a side effect of this. That's expected behavior, but it's unexpected that a VM would fall through this gap.")
Hi Klaas,

The fix is on the DC's level:
In case there are suspended VMs, the DC update is blocked, and the names of the suspended VMs that would have an unsupported compatibility level are listed.
Blocking the upgrade of the DC level for running VMs should already be the default.

The fix is available for viewing here:
https://gerrit.ovirt.org/#/c/99762/

Sorry for the misunderstanding.
(In reply to shani from comment #38)
> Hi Klass,
>
> The fix is on DC's level:
> In case there are suspended VMs, the DC update is being blocked with
> indicating the suspended VM names which would have an unsupported
> compatibility level.
> Blocking upgrading the DC level for running VMs should already be the
> default.

This was not the case when this bug was opened; someone should verify that in QA testing of this bug. If the change only tests for paused VMs then the fix is incomplete for the issue or the other part was fixed in another commit.

But the thing I was talking about is that the doctexts are not fitting to the issue.

Doctext currently:
'Previously, if a datacenter (DC) had a suspended virtual machine (VM) and you updated the DC level, the VM could not resume due to a "not supported custom compatibility version". The current release fixes this issue: It validates the datacenter before upgrading the DC level and displays a list of VMs that contain an old custom compatibility levels to resume before upgrading the DC: "Cannot update Data Center compatibility version. Please resume the following VMs before updating the Data Center."'

Should be something like:
Previously, if a datacenter (DC) had a VM with a lower compatibility setting and you updated the DC level, the VM could not resume due to a "not supported custom compatibility version". The current release fixes this issue: It validates the datacenter before upgrading the DC level and displays a list of VMs that contain an old custom compatibility levels to restart before upgrading the DC. "Cannot update Data Center compatibility version. Please reboot the following VMs before updating the Data Center."

Also the message from the commit (https://imgur.com/a/cA0XLqA) is misleading. Resuming the VM is not enough. The VMs need to be in the new cluster compatibility version before upgrading the datacenter compatibility version. Resuming the VM will not change the cluster compatibility version.
[...]
> If the change only tests for paused VMs then the
> fix is incomplete for the issue or the other part was fixed in another
> commit.

As mentioned in my comment, blocking upgrading the DC level for running VMs should already be the default.
This means that behavior has been handled already, and not as part of that patch.
Since the missing validation (on the DC's level) was for suspended VMs, the fix focuses on that scenario.

> But the thing I was talking about is that the doctexts are not fitting to
> the issue.

Thanks for the doctext suggestion.
Rolfe, can you please review the new doctext?

[...]
> Also the message from the commit (https://imgur.com/a/cA0XLqA) is
> misleading. Resuming the VM is not enough. The VMs need to be in the new
> cluster compatibility version before upgrading the datacenter compatibility
> version. Resuming the VM will not change the cluster compatibility version.

Indeed, resuming the VMs won't complete the change of the cluster compatibility version, but it will allow those VMs to be temporarily reconfigured to use the previous cluster compatibility.
This allows users to use the VM before shutting it down.
Eventually, powering the VM off is required (there's also a UI icon and tooltip that mention this to the user).
(In reply to shani from comment #40)
> [...]
> > If the change only tests for paused VMs then the
> > fix is incomplete for the issue or the other part was fixed in another
> > commit.
>
> As mentioned in my comment, blocking upgrading the DC level for running VMs
> should already be the default.
> This means that behavior has been handled already, and not as part of that
> patch.
> Since the missing validation (on the DC's level) was for suspended VMs, the
> fix focuses on that scenario.

Then this change should be linked in this issue. When this issue was created there was no such check in RHV.

>
> > But the thing I was talking about is that the doctexts are not fitting to
> > the issue.
>
> Thanks for the doctext suggestion.
> Rolfe, Can you please review the new doctext?
>
> [...]
> > Also the message from the commit (https://imgur.com/a/cA0XLqA) is
> > misleading. Resuming the VM is not enough. The VMs need to be in the new
> > cluster compatibility version before upgrading the datacenter compatibility
> > version. Resuming the VM will not change the cluster compatibility version.
>
> Indeed, resuming the VMs won't complete the change of the cluster
> compatibility version,
> but it will allow those VMs to be temporarily reconfigured to use the
> previous cluster compatibility.
> This one allows users to use the VM before shutting it down.
> Eventually, powering the VM off is required (there's also a UI icon and
> tooltip for mentioning that to the user).

You are mixing datacenter and cluster compatibility version again.

If you update the cluster level you get the next_run indication. You then need to reboot from within the OS (usually enough, see https://bugzilla.redhat.com/show_bug.cgi?id=1512619) or reboot/poweroff/poweron all VMs from the engine, until there is no VM left with an older cluster compatibility level (no more next_run indicators). Then you can upgrade the datacenter compatibility version.

So you don't have to "eventually power off the VM"; you have to do that _before_ you upgrade the datacenter compatibility version. That is what this bug is about; see the title: "Do not change DC level if there are VMs running/paused with older CL." or https://bugzilla.redhat.com/show_bug.cgi?id=1693813#c9
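The order of operations described in this comment can be sketched as a small Python model: a cluster upgrade leaves running VMs on the old level as a temporary override, a restart clears the override, and the DC upgrade is only allowed once no VM is pending a restart. This is a simplified illustration with hypothetical class and function names, not the engine's data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Version = Tuple[int, int]


@dataclass
class Vm:
    name: str
    running: bool                             # True for up or paused VMs
    custom_version: Optional[Version] = None  # temporary override, if any


@dataclass
class Cluster:
    version: Version
    vms: List[Vm] = field(default_factory=list)

    def upgrade(self, new_version: Version) -> None:
        """Running VMs keep the old level as a temporary override (assumed model)."""
        for vm in self.vms:
            if vm.running and vm.custom_version is None:
                vm.custom_version = self.version
        self.version = new_version

    def restart(self, vm: Vm) -> None:
        """A restart drops the temporary override, so the VM follows the cluster."""
        vm.custom_version = None

    def pending_restart(self) -> List[str]:
        """VMs that would carry the next_run indication in this model."""
        return [
            vm.name
            for vm in self.vms
            if vm.running and vm.custom_version is not None and vm.custom_version < self.version
        ]


def may_upgrade_dc(clusters: List[Cluster], target: Version) -> bool:
    """Allow the DC upgrade only if every cluster is at the target and no VM is pending."""
    return all(c.version >= target and not c.pending_restart() for c in clusters)


cluster = Cluster(version=(4, 1), vms=[Vm("Test", running=True)])
cluster.upgrade((4, 2))
print(cluster.pending_restart())          # ['Test']
print(may_upgrade_dc([cluster], (4, 2)))  # False
cluster.restart(cluster.vms[0])
print(may_upgrade_dc([cluster], (4, 2)))  # True
```

In this model the situation from the original report cannot arise, because the DC upgrade is refused for as long as "Test" still carries the 4.1 override.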
(In reply to Klaas Demter from comment #41)
(In reply to Ryan Barry from comment #9)
>> "you must update the cluster compatibility version of all running or
>> suspended virtual machines by restarting them from within the Manager"
>>
> What is unclear about this? Maybe we need a docs update.

Klaas, is there another documentation change you thought about?
Or does the one from comment 39 cover it?

> Then this change should be linked to this issue. When this issue was created
> there was no such check in RHV.
>

IIRC, the relevant patch is: https://gerrit.ovirt.org/#/c/106641/
which is part of the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1691562.
(In reply to shani from comment #43)
> (In reply to Klaas Demter from comment #41)
>
> (In reply to Ryan Barry from comment #9)
> >> "you must update the cluster compatibility version of all running or
> >> suspended virtual machines by restarting them from within the Manager"
> >>
> > What is unclear about this? Maybe we need a docs update.
>
> Klass is there another documentation change you thought about?
> Or the one from comment 39 covers it?

I think 39 should cover it. It needs to be 100% clear from the docs that all VMs first need to change to the current cluster compatibility level before you can update the data center compatibility level.

>
>
> > Then this change should be linked to this issue. When this issue was created
> > there was no such check in RHV.
> >
> IIRC, the relevant patch is: https://gerrit.ovirt.org/#/c/106641/
> Which is a part of the fixing
> https://bugzilla.redhat.com/show_bug.cgi?id=1691562.

I am not reading that from the commit message, but I'll trust there will be proper QA on this bug :)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247