Bug 1672859 - Cannot correctly upgrade a hosted engine env from 4.2 to 4.3 if the specific CPU type disappeared in 4.3
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.3.3-1
Target Release: ---
Assignee: Steven Rosenberg
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1699913
Blocks: 1694787
 
Reported: 2019-02-06 04:10 UTC by Juhani Rautiainen
Modified: 2019-04-29 13:57 UTC (History)

Fixed In Version: ovirt-engine-4.3.3.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1694787 (view as bug list)
Environment:
Last Closed: 2019-04-29 13:57:43 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.3+
mtessun: planning_ack+
rbarry: devel_ack+
mavital: testing_ack+


Attachments
Virsh capabilities (15.57 KB, text/plain)
2019-02-11 16:09 UTC, Juhani Rautiainen
VDSM host capabilites (24.42 KB, text/plain)
2019-02-11 16:09 UTC, Juhani Rautiainen


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github oVirt ovirt-ansible-cluster-upgrade issues 40 | 0 | None | open | handle the hosted-engine cluster as a special case | 2021-01-09 22:01:10 UTC
oVirt gerrit 98137 | 0 | None | MERGED | engine: Update Deprecated CPU Types on Upgrade | 2021-01-09 22:01:08 UTC
oVirt gerrit 99097 | 0 | None | MERGED | engine: Update Deprecated CPU Types on Upgrade | 2021-01-09 22:01:08 UTC

Description Juhani Rautiainen 2019-02-06 04:10:08 UTC
Description of problem:

It's not possible to change the cluster CPU type. I have a 2-node cluster with Epyc processors. It was originally installed with 4.2, so it chose Opteron G3 as the CPU type (there was no Epyc support back then). In Engine 4.3, Epyc is available as a CPU type when I choose Compatibility Version: 4.3. The big problem is that the engine doesn't allow changing the CPU type because not all hosts are in maintenance: "Error while executing action: Cannot change Cluster CPU type unless all Hosts attached to this Cluster are in Maintenance". Putting all hosts into maintenance is impossible because the Engine is hosted in the cluster. I tried with Global HA maintenance, but that didn't help.


Version-Release number of selected component (if applicable):
4.3

How reproducible:


Steps to Reproduce:
1. Install oVirt 4.2 on Epyc hardware with a self-hosted engine
2. Upgrade to 4.3
3. Try to change the CPU type from Opteron G3 to Epyc

Actual results:
Can't change the CPU type because of this: "Error while executing action: Cannot change Cluster CPU type unless all Hosts attached to this Cluster are in Maintenance"


Expected results:
You should be able to change the CPU type. As it stands, even after upgrading the hardware you are still locked to the old CPU type.


Additional info:

Comment 1 Simone Tiraboschi 2019-02-06 14:30:08 UTC
A manual workaround procedure is:

* set HE global maintenance mode
* set one of the hosted-engine hosts into maintenance mode
* move it to a different temporary cluster
* shut down the engine VM
* manually restart the engine VM on the host in the temporary cluster by running directly on that host: 'hosted-engine --vm-start'
* connect again to the engine
* set all the hosts of the initial cluster into maintenance mode
* upgrade the cluster
* shut down the engine VM again
* manually restart the engine VM on one of the hosts of the initial cluster
* move the host that was placed in the temporary cluster back to its initial cluster

but this could be a bit challenging on the user side.
Let's see if we can automate it with ovirt-ansible-cluster-upgrade.
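The manual procedure above can be written down as an ordered checklist, which makes it easier to see which steps happen on a host shell and which in the engine. This is only a sketch, not the ovirt-ansible-cluster-upgrade implementation. The names Step, WORKAROUND and render are hypothetical. Of the commands shown, only 'hosted-engine --vm-start' comes from the comment; '--set-maintenance --mode=global' and '--vm-shutdown' are existing hosted-engine options assumed to be the right ones for those steps. Engine-side steps are left without a command on purpose.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    description: str
    where: str  # "host" = shell on a hosted-engine host, "engine" = web admin or API
    command: Optional[str] = None  # filled in only where a CLI call is assumed


# Same order as the manual workaround in the comment above.
WORKAROUND: List[Step] = [
    # Assumption: this is the CLI equivalent of "set HE global maintenance mode".
    Step("Set hosted-engine global maintenance", "host",
         "hosted-engine --set-maintenance --mode=global"),
    Step("Put one hosted-engine host into maintenance", "engine"),
    Step("Move that host to a temporary cluster", "engine"),
    # Assumption: clean shutdown via the hosted-engine tool.
    Step("Shut down the engine VM", "host", "hosted-engine --vm-shutdown"),
    # This command is the one quoted in the comment.
    Step("Start the engine VM on the host in the temporary cluster", "host",
         "hosted-engine --vm-start"),
    Step("Reconnect to the engine", "engine"),
    Step("Put all hosts of the initial cluster into maintenance", "engine"),
    Step("Upgrade the cluster (change the CPU type)", "engine"),
    Step("Shut down the engine VM again", "host", "hosted-engine --vm-shutdown"),
    Step("Start the engine VM on a host of the initial cluster", "host",
         "hosted-engine --vm-start"),
    Step("Move the temporary-cluster host back to the initial cluster", "engine"),
]


def render(steps: List[Step]) -> List[str]:
    """Return one printable line per step; nothing is executed."""
    lines = []
    for number, step in enumerate(steps, 1):
        suffix = f"  $ {step.command}" if step.command else ""
        lines.append(f"{number:2d}. [{step.where}] {step.description}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(render(WORKAROUND)))
```

The sketch deliberately prints instead of executing anything: several steps stop the engine VM, so each one should be confirmed by hand before moving on.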

Comment 2 Michal Skrivanek 2019-02-08 09:12:22 UTC
Why was it using Opteron G3? Was it auto detected in 4.2 as that? Weird...

Comment 3 Juhani Rautiainen 2019-02-08 10:03:51 UTC
It was autodetected as such. That's why I was surprised when 4.2.7 started to warn that the CPU is going to be deprecated. Then I saw that the 4.3 release notes listed support for Epyc. Checking things now, it seems that QEMU is the reason: KVM users have noticed an Epyc->Opteron_G3 switch when QEMU is too old. Maybe it's a fallback in QEMU?

Comment 4 Michal Skrivanek 2019-02-11 12:18:29 UTC
Did you upgrade the hosts first?
What do "virsh -r capabilities" and "vdsm-client Host getCapabilities" return?
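To compare the output of "virsh -r capabilities" between hosts, or before and after clearing a cache, it can help to reduce the XML to the host CPU model and feature list. The following is a minimal sketch: the function name host_cpu_summary is hypothetical, and SAMPLE is an invented fragment in the shape of libvirt's capabilities XML, not the reporter's attachment.

```python
import xml.etree.ElementTree as ET

# Invented example in the shape of `virsh -r capabilities` output.
# This is NOT the data attached to this bug.
SAMPLE = """<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>Opteron_G3</model>
      <vendor>AMD</vendor>
      <feature name='sse4a'/>
      <feature name='avx2'/>
    </cpu>
  </host>
</capabilities>"""


def host_cpu_summary(capabilities_xml: str) -> dict:
    """Extract the host CPU model, vendor and sorted feature names."""
    root = ET.fromstring(capabilities_xml)
    cpu = root.find("./host/cpu")
    if cpu is None:
        raise ValueError("no <host><cpu> element found")
    return {
        "model": cpu.findtext("model"),
        "vendor": cpu.findtext("vendor"),
        "features": sorted(f.get("name") for f in cpu.findall("feature")),
    }


if __name__ == "__main__":
    print(host_cpu_summary(SAMPLE))
```

On a real host, the XML text could be captured from the command and passed to the function; two summaries can then be compared directly.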

Comment 5 Juhani Rautiainen 2019-02-11 16:09:02 UTC
Created attachment 1529079 [details]
Virsh capabilities

Comment 6 Juhani Rautiainen 2019-02-11 16:09:42 UTC
Created attachment 1529080 [details]
VDSM host capabilites

Comment 7 Juhani Rautiainen 2019-02-11 16:11:58 UTC
Attached the requested capabilities output from virsh and VDSM. I updated the engine first. I tried to update the nodes from there, but I had to do it from the CLI.

Comment 8 Michal Skrivanek 2019-02-11 16:19:37 UTC
Thanks! That looks... weird. Is that before or after "rm /var/cache/libvirt/qemu/capabilities/*.xml" (as per bug 1674265)? If you haven't done that, could you give it a try and re-run both capability queries?
Also, did you check for any microcode updates for your CPU?

Comment 9 Juhani Rautiainen 2019-02-11 16:41:19 UTC
I didn't clear any capabilities; this is all the result of a 4.2 install and an upgrade to 4.3. I'll try clearing them tomorrow. I didn't check for microcode updates, but the BIOS should be the newest available for Proliant Gen10 servers.

Comment 10 Michal Skrivanek 2019-02-11 17:01:08 UTC
Ok. Please do. Also try to remove that cache, reboot, and rerun both queries.

Do you happen to have a non-upgraded server with the same hardware?

Comment 11 Juhani Rautiainen 2019-02-11 17:19:21 UTC
I managed to do the test today. I put the nodes into maintenance one after another, cleared the cache and restarted libvirtd. I can upload the files, but they are identical to the ones I already uploaded from that node. Or not totally identical: the vdsm output differs in the gc_timer lines.

Comment 12 Juhani Rautiainen 2019-02-11 17:34:39 UTC
Just remembered that I did do 'Refresh capabilities' from webadmin after the node updates. Does it do the same operation?

Comment 13 Juhani Rautiainen 2019-02-11 17:45:46 UTC
I forgot to mention that I don't have a spare server to test on.

Comment 14 Michal Skrivanek 2019-02-12 14:54:28 UTC
Hm. It seems the fma4 flag, added in G4 and G5, was removed in EPYC, so EPYC processors on a non-EPYC-enabled oVirt get detected as G3. That's a problem now that we removed G3. I wonder... it could be that adding G3 back is the easiest solution right now.
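The fallback described here can be illustrated with a small sketch. It is not the engine's actual detection code, and the flag sets below are heavily simplified assumptions. The only detail taken from this comment is that G4 and G5 require fma4 while EPYC does not have it. The names AMD_MODELS_WITHOUT_EPYC, AMD_MODELS_WITH_EPYC and best_model are hypothetical.

```python
# Models ordered oldest to newest. Flag sets are illustrative, not the real lists.
AMD_MODELS_WITHOUT_EPYC = [
    ("Opteron_G3", {"sse4a", "popcnt"}),
    ("Opteron_G4", {"sse4a", "popcnt", "avx", "fma4"}),
    ("Opteron_G5", {"sse4a", "popcnt", "avx", "fma4", "f16c"}),
]
AMD_MODELS_WITH_EPYC = AMD_MODELS_WITHOUT_EPYC + [
    ("EPYC", {"sse4a", "popcnt", "avx", "avx2", "f16c"}),
]


def best_model(host_flags, models):
    """Return the newest model whose required flags the host fully provides."""
    best = None
    for name, required in models:
        if required <= host_flags:  # subset test: every required flag is present
            best = name
    return best


# A simplified EPYC host: newer features are present, but fma4 is not.
epyc_host = {"sse4a", "popcnt", "avx", "avx2", "f16c"}

if __name__ == "__main__":
    print(best_model(epyc_host, AMD_MODELS_WITHOUT_EPYC))
    print(best_model(epyc_host, AMD_MODELS_WITH_EPYC))
```

Without an EPYC entry in the list, the host fails the fma4 requirement of G4 and G5 and lands on G3, which matches the behaviour the reporter saw on 4.2.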

Comment 16 Ryan Barry 2019-02-12 16:54:00 UTC
Ugly...

Even with a known workaround, adding G3 back to upstream is probably the nicest suggestion.

It means keeping support for a side-by-side vulnerable CPU type for an entire release, but at least unblocks upgrades

Comment 17 Sandro Bonazzola 2019-02-13 08:22:05 UTC
Moving to the Virt team for re-introducing G3.

Comment 20 Nikolai Sednev 2019-04-21 15:02:20 UTC
Deployed 4.2 HE over NFS on 2 hosts and attached NFS storage domain.
4.2 components on engine and hosts:
ovirt-engine-setup-4.2.8.7-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.34-1.el7ev.noarch

Set host cluster to Conroe and checked that it stays Conroe after editing.
Upgraded engine to latest bits of 4.3:
ovirt-engine-setup-4.3.3.5-0.1.el7.noarch

After engine got upgraded, I also upgraded both hosts to latest 4.3 bits:
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.7-1.el7ev.noarch

Then I raised the host cluster's compatibility level to 4.3, and Conroe was automatically changed to Nehalem. I approved the change, then checked the CPU family: it had changed to the proper Intel SandyBridge IBRS SSBD Family.

Moving to verified.

