Bug 1837266 - Failed to restore from 4.3 backup file on new 4.4 hosted-engine
Summary: Failed to restore from 4.3 backup file on new 4.4 hosted-engine
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: ---
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.4.1
Target Release: 4.40.20
Assignee: Milan Zamazal
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1841030 (view as bug list)
Depends On: 1841030
Blocks: 1843364
 
Reported: 2020-05-19 08:15 UTC by Nikolai Sednev
Modified: 2020-08-18 12:51 UTC
CC List: 11 users

Fixed In Version: vdsm-4.40.20
Clone Of:
Environment:
Last Closed: 2020-07-08 08:25:40 UTC
oVirt Team: Virt
Embargoed:
sbonazzo: ovirt-4.4?
sbonazzo: planning_ack?
sbonazzo: devel_ack?
sbonazzo: testing_ack?


Attachments
logs from engine (5.64 MB, application/x-xz), 2020-05-19 08:15 UTC, Nikolai Sednev
logs from the host (6.19 MB, application/x-xz), 2020-05-19 08:16 UTC, Nikolai Sednev
logs from host alma03 with latest appliance (6.10 MB, application/x-xz), 2020-05-25 11:26 UTC, Nikolai Sednev
sosreport from engine (5.66 MB, application/x-xz), 2020-05-25 11:31 UTC, Nikolai Sednev


Links
oVirt gerrit 109409: machinetype: Add spec_ctrl feature for -IBRS model (MERGED, last updated 2020-12-18 06:01:09 UTC)

Description Nikolai Sednev 2020-05-19 08:15:15 UTC
Created attachment 1689814 [details]
logs from engine

Description of problem:

Reprovisioned the host to RHEL 8.2 and the deployment failed with:

[ INFO  ] TASK [ovirt.hosted_engine_setup : Fail with generic error]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook


In engine log I see:
2020-05-18 20:00:03,367+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-35) [7289cff0] Failed to migrate one or more VMs.
2020-05-18 19:50:23,852+03 ERROR [org.ovirt.engine.core.bll.CpuFlagsManagerHandler] (ServerService Thread Pool -- 57) [] Error getting info for CPU ' ', not in expected format.


It looks like a different issue now, but the end result is the same: I was unable to restore the engine on 4.4 from the 4.3 backup.
Moving back to ASSIGNED and attaching sosreports from host alma03 and from the engine.

Version-Release number of selected component (if applicable):
Tested backup and restore from engine:
rhvm-4.3.10.1-0.1.master.el7.noarch
Linux 3.10.0-1127.8.1.el7.x86_64 #1 SMP Fri Apr 24 14:56:59 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.8 (Maipo)

Host:
rhvm-appliance.x86_64 2:4.3-20200507.0.el7 rhv-4.3.10
ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
Linux 3.10.0-1127.8.2.el7.x86_64 #1 SMP Thu May 7 19:30:37 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.8 (Maipo)

Restored on RHEL8.2 host with these components:
rhvm-appliance.x86_64 2:4.4-20200417.0.el8ev @rhv-4.4.0
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
Linux 4.18.0-193.4.1.el8_2.x86_64 #1 SMP Fri May 15 15:02:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

How reproducible:
100%

Steps to Reproduce:
1. Install a 4.3 HE setup on EL7.
2. Back up the 4.3 engine to shared storage:
   engine-backup --mode=backup --file=engine_backup.tar.gz --log=engine_backup.log
3. Reinstall the host with EL8.
4. Run the HE 4.4 deployment from the 4.3 backup file (a minimal end-to-end sketch of the whole flow follows below):
   hosted-engine --deploy --restore-from-file=<file-he>
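
For reference, a minimal sketch of the whole flow; the hostname, paths and package set are illustrative placeholders, not the exact ones used in this report:

   # On the 4.3 engine VM (EL7): take the backup and copy it somewhere that
   # survives the host reinstall (shared storage or another machine)
   engine-backup --mode=backup --file=engine_backup.tar.gz --log=engine_backup.log
   scp engine_backup.tar.gz root@<new-44-host>:/root/

   # On the host, after reinstalling it with EL8 and enabling the 4.4 repos:
   dnf install -y ovirt-hosted-engine-setup rhvm-appliance
   hosted-engine --deploy --restore-from-file=/root/engine_backup.tar.gz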

Actual results:
Restore of 4.3 backup file on 4.4 fails with:
[ INFO  ] TASK [ovirt.hosted_engine_setup : Fail with generic error]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

Expected results:
Restore should succeed.

Additional info:
Logs are attached.

Comment 1 Nikolai Sednev 2020-05-19 08:16:00 UTC
Created attachment 1689815 [details]
logs from the host

Comment 2 Nikolai Sednev 2020-05-19 08:19:30 UTC
Adding this here: this bug was opened as a follow-up to https://bugzilla.redhat.com/show_bug.cgi?id=1827135#c11.

Comment 3 Yedidyah Bar David 2020-05-24 13:29:03 UTC
(In reply to Nikolai Sednev from comment #0)
> [...]
> Restored on RHEL8.2 host with these components:
> rhvm-appliance.x86_64 2:4.4-20200417.0.el8ev @rhv-4.4.0

This is pretty old. Please try again with a newer appliance/engine and attach logs if it reproduces. Thanks.

Comment 4 Nikolai Sednev 2020-05-25 07:51:30 UTC
It was received from:
rhv-release-4.4.0-36-001.noarch 
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch

Comment 5 Yedidyah Bar David 2020-05-25 07:55:53 UTC
(In reply to Nikolai Sednev from comment #4)
> It was received from:
> rhv-release-4.4.0-36-001.noarch 
> ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
> ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch

I added my comment 3 below "rhvm-appliance.x86_64 2:4.4-20200417.0.el8ev", which was the one I referred to as old. Sorry if this wasn't clear enough.

Comment 6 Nikolai Sednev 2020-05-25 11:22:11 UTC
Tested with rhvm-appliance-4.4-20200521.0.el8ev.x86_64 on a clean host, and the restore still failed.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ ERROR ] Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20200525133313-62uzwb.log

Logs are being attached.

Comment 7 Nikolai Sednev 2020-05-25 11:26:11 UTC
Created attachment 1691917 [details]
logs from host alma03 with latest appliance

Comment 8 Nikolai Sednev 2020-05-25 11:31:52 UTC
Created attachment 1691918 [details]
sosreport from engine

Comment 9 Yedidyah Bar David 2020-05-25 11:53:17 UTC
engine.log has:

2020-05-25 14:17:03,014+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-19) [46a070dc] EVENT_ID: VDS_CPU_LOWER_THAN_CLUSTER(515), Host alma03.qa.lab.tlv.redhat.com moved to Non-Operational state as host does not meet the cluster's minimum CPU level. Missing CPU features : spec_ctrl
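
A quick way to see the mismatch behind that event (a sketch: vdsm-client ships with vdsm 4.4 and engine-psql.sh with the engine; the cluster table/column names follow what comment 15 below quotes, so treat the query as illustrative):

   # On the 4.4 host: flags vdsm reports to the engine (spec_ctrl is missing here)
   vdsm-client Host getCapabilities | tr ',' '\n' | grep -c -i spec_ctrl

   # On the engine VM: flags the restored 4.3 cluster still requires
   /usr/share/ovirt-engine/dbscripts/engine-psql.sh \
       -c "SELECT name, cpu_name, cpu_flags FROM cluster;"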

Comment 10 Yedidyah Bar David 2020-05-25 12:05:20 UTC
Ryan, can you please have a look? Thanks.

Comment 11 Evgeny Slutsky 2020-05-25 12:10:27 UTC
Nikolai, did you use a virtual host when you encountered this?

Comment 12 Nikolai Sednev 2020-05-25 14:22:59 UTC
(In reply to Evgeny Slutsky from comment #11)
> Nikolai, did you use a virtual host when you encountered this?

No. I always use only physical hosts.

Comment 13 Nikolai Sednev 2020-05-26 08:41:32 UTC
(In reply to Yedidyah Bar David from comment #9)
> engine.log has:
> 
> 2020-05-25 14:17:03,014+03 WARN 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-19) [46a070dc] EVENT_ID:
> VDS_CPU_LOWER_THAN_CLUSTER(515), Host alma03.qa.lab.tlv.redhat.com moved to
> Non-Operational state as host does not meet the cluster's minimum CPU level.
> Missing CPU features : spec_ctrl

Can it be related to https://bugzilla.redhat.com/show_bug.cgi?id=1830872? It's possible that in 4.3 the CPU of the HE was recognized differently than in 4.4, and that caused an issue with the upgrade.

Comment 14 Evgeny Slutsky 2020-05-26 08:55:26 UTC
(In reply to Nikolai Sednev from comment #13)
> (In reply to Yedidyah Bar David from comment #9)
> > [...]
> 
> Can it be related to https://bugzilla.redhat.com/show_bug.cgi?id=1830872 ?
> Its possible that in 4.3 the CPU of HE was recognized differently than in
> 4.4 and that caused an issue with the upgrade.
It appears that the newer kernel in el8 reports the CPU flags differently.
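
That matches what the two kernels expose in /proc/cpuinfo. A minimal check (the EL8 side is taken from the lscpu output quoted in comment 16; the EL7 side is what the 4.3 engine recorded as the cluster requirement):

   # On the EL7 (3.10.x) host that originally formed the 4.3 cluster:
   grep -c -w spec_ctrl /proc/cpuinfo   # > 0, which is why the cluster requires spec_ctrl
   # On the EL8 (4.18.x) host used for the 4.4 restore:
   grep -c -w spec_ctrl /proc/cpuinfo   # 0; the capability shows up as ibrs instead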

Comment 15 Lucia Jelinkova 2020-05-26 09:19:24 UTC
The cluster is set to:
compatibility_version: 4.3
cpu_name: Intel SandyBridge IBRS SSBD MDS Family 
cpu_flags: vmx,ssbd,md_clear,model_SandyBridge,spec_ctrl

VDS reports flags:
cpuid,pbe,lahf_lm,pni,rdtscp,rep_good,ds_cpl,pse36,stibp,tsc_adjust,mce,cx8,aes,avx,ssbd,arch-capabilities,pti,md-clear,pae,pts,mca,cx16,dts,xtopology,tm,dtherm,pdcm,fxsr,sse4_2,arat,ept,sse4_1,aperfmperf,tpr_shadow,ibpb,ss,md_clear,pge,pdpe1gb,x2apic,amd-ssbd,popcnt,cmov,nonstop_tsc,tsc_deadline_timer,sse,arch_perfmon,sse2,constant_tsc,mtrr,smep,nopl,umip,erms,dtes64,smx,pse,dca,nx,tm2,syscall,pcid,flexpriority,mmx,acpi,skip-l1dfl-vmentry,xsaveopt,xsave,epb,sep,ht,bts,vpid,rdrand,lm,est,f16c,pebs,cpuid_fault,invtsc,ssse3,monitor,msr,fpu,vmx,clflush,pln,pclmulqdq,hypervisor,de,tsc,vnmi,apic,xtpr,vme,fsgsbase,ibrs,pat,flush_l1d,model_qemu64,model_Nehalem,model_Penryn,model_pentium2,model_Westmere-IBRS,model_SandyBridge,model_486,model_kvm64,model_Westmere,model_pentium3,model_IvyBridge,model_Nehalem-IBRS,model_IvyBridge-IBRS,model_coreduo,model_kvm32,model_Opteron_G1,model_SandyBridge-IBRS,model_qemu32,model_core2duo,model_Opteron_G2,model_Conroe,model_pentium

The VDS is really missing the flag spec_ctrl as the log says. Setting the cluster CPU type to Intel SandyBridge Family should fix the problem.
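
If anyone needs that workaround right away, the cluster CPU type can be changed in the Administration Portal (Compute > Clusters > Edit) or through the REST API; a sketch with placeholder credentials, FQDN and cluster id:

   curl -s -k -u admin@internal:PASSWORD \
        -H 'Content-Type: application/xml' -X PUT \
        -d '<cluster><cpu><type>Intel SandyBridge Family</type></cpu></cluster>' \
        'https://ENGINE_FQDN/ovirt-engine/api/clusters/CLUSTER_ID'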

Comment 16 Nikolai Sednev 2020-05-26 10:59:10 UTC
(In reply to Lucia Jelinkova from comment #15)
> The cluster is set to:
> compatibility_version: 4.3
> cpu_name: Intel SandyBridge IBRS SSBD MDS Family 
> cpu_flags: vmx,ssbd,md_clear,model_SandyBridge,spec_ctrl
> 
> VDS reports flags: [...]
> 
> The VDS is really missing the flag spec_ctrl as the log says. Setting the
> cluster CPU type to Intel SandyBridge Family should fix the problem.

It will fix the upgrade problem, but the recognized CPU type will no longer be the one intended by design and won't match the exact CPU family of the ha-host, which is:
alma03 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E5-2603 v2 @ 1.80GHz
Stepping:            4
CPU MHz:             1800.019
CPU max MHz:         1800.0000
CPU min MHz:         1200.0000
BogoMIPS:            3600.10
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            10240K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear flush_l1d

E5-2603 v2 is an Ivy Bridge CPU (https://ark.intel.com/content/www/us/en/ark/products/76157/intel-xeon-processor-e5-2603-v2-10m-cache-1-80-ghz.html), just as Michal Skrivanek said in https://bugzilla.redhat.com/show_bug.cgi?id=1830872#c34. Changing the cluster CPU family type to Intel SandyBridge therefore goes against the design, although it fixes the upgrade issue; after the upgrade we would have to change it back to IvyBridge.

Comment 17 Lucia Jelinkova 2020-05-26 11:21:30 UTC
This is a slightly different case, because the cluster already has its CPU type set (from 4.3) and no autodetection is performed; the engine only checks whether the host's flags match the cluster's required CPU flags. In this case, the host is missing one.

And yes, if the user wants to "upgrade" the CPU from SandyBridge to IvyBridge, they have to do it manually after the upgrade from 4.3 to 4.4.
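
(For completeness: once the host is back up on 4.4, the same REST call sketched under comment 15, or the Edit Cluster dialog, can be used to switch the cluster CPU type to an IvyBridge entry; the exact type names offered depend on the cluster compatibility level.)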

Comment 18 SATHEESARAN 2020-06-03 07:09:18 UTC
I do not see any suggestions or plans for the fix.
Is there any update on the planned action to fix this bug?

This bug is crucial for the upgrade flow and also affects the RHHI-V upgrade flow.

There is a lot of testing that needs to be done after the RHV upgrade for RHHI-V, and all of it is blocked by this bug.

Is there any workaround for this issue now?

Comment 19 Evgeny Slutsky 2020-06-03 08:25:00 UTC
(In reply to SATHEESARAN from comment #18)
> Is there any workaround for this issue now?

SATHEESARAN, it's being worked on here: https://bugzilla.redhat.com/show_bug.cgi?id=1841030

Comment 20 SATHEESARAN 2020-06-03 11:02:30 UTC
(In reply to Evgeny Slutsky from comment #19)
> SATHEESARAN, it's being worked on here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1841030

Thanks Evgeny.

Comment 21 Evgeny Slutsky 2020-06-04 06:56:39 UTC
It's fixed here: https://gerrit.ovirt.org/#/c/109409/

Comment 22 Sandro Bonazzola 2020-06-04 07:52:40 UTC
https://gerrit.ovirt.org/109409 has been merged

Comment 23 Sandro Bonazzola 2020-06-11 13:58:11 UTC
$ git tag --contains 409b13d61939413bc73fc18e13724f0b18d5e336
v4.40.20
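
Once a build with that tag reaches QA, a minimal verification sketch (assumes vdsm-client is installed on the host; the grep is the same illustrative check as in the note under comment 9):

   rpm -q vdsm      # expect vdsm-4.40.20 or newer
   # With the fix, vdsm should again expose spec_ctrl for the *-IBRS CPU models,
   # so the restored 4.3 cluster's requirement is satisfied:
   vdsm-client Host getCapabilities | tr ',' '\n' | grep -c -i spec_ctrl   # expect > 0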

Comment 24 Nikolai Sednev 2020-06-18 16:24:35 UTC
Currently only vdsm-api-4.40.19-1.el8ev.noarch.rpm is available to QA, from 2020-06-04 12:39 (bob).

Comment 25 Nikolai Sednev 2020-06-21 08:07:41 UTC
We still have vdsm-4.40.19-1.el8ev.x86_64.rpm (2020-06-04) available to QA.

Comment 26 Arik 2020-06-29 08:12:08 UTC
*** Bug 1841030 has been marked as a duplicate of this bug. ***

Comment 27 Nikolai Sednev 2020-07-02 11:35:47 UTC
Moving this bug to verified, following the results in https://bugzilla.redhat.com/show_bug.cgi?id=1853225#c3: the engine backup & restore 4.3->4.4 partially passed, and the initial issue this bug was opened for has been resolved. Please keep tracking bug 1853225 for more details regarding the upgrade status.

Comment 28 Sandro Bonazzola 2020-07-08 08:25:40 UTC
This bugzilla is included in the oVirt 4.4.1 release, published on July 8th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

