Created attachment 1689814 [details]
logs from engine

Description of problem:

Reprovisioned to RHEL8.2 and failed with:

[ INFO ] TASK [ovirt.hosted_engine_setup : Fail with generic error]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

In the engine log I see:
2020-05-18 20:00:03,367+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-35) [7289cff0] Failed to migrate one or more VMs.
2020-05-18 19:50:23,852+03 ERROR [org.ovirt.engine.core.bll.CpuFlagsManagerHandler] (ServerService Thread Pool -- 57) [] Error getting info for CPU ' ', not in expected format.

It looks like a different issue now, but the end result is the same: I was unable to restore the engine on 4.4 from 4.3.
Moving back to assigned and attaching sosreports from host alma03 and the engine.

Version-Release number of selected component (if applicable):
Tested backup and restore from engine:
rhvm-4.3.10.1-0.1.master.el7.noarch
Linux 3.10.0-1127.8.1.el7.x86_64 #1 SMP Fri Apr 24 14:56:59 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.8 (Maipo)

Host:
rhvm-appliance.x86_64 2:4.3-20200507.0.el7 rhv-4.3.10
ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
Linux 3.10.0-1127.8.2.el7.x86_64 #1 SMP Thu May 7 19:30:37 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.8 (Maipo)

Restored on RHEL8.2 host with these components:
rhvm-appliance.x86_64 2:4.4-20200417.0.el8ev @rhv-4.4.0
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
Linux 4.18.0-193.4.1.el8_2.x86_64 #1 SMP Fri May 15 15:02:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

How reproducible:
100%

Steps to Reproduce:
1. Install a 4.3 HE setup on EL7.
2. Back up the 4.3 engine onto shared storage:
   engine-backup --mode=backup --file=engine_backup.tar.gz --log=engine_backup.log
3. Reinstall the host with EL8.
4. Run the HE 4.4 deployment from the 4.3 backup file:
   hosted-engine --deploy --restore-from-file=<file-he>

Actual results:
Restore of the 4.3 backup file on 4.4 fails with:

[ INFO ] TASK [ovirt.hosted_engine_setup : Fail with generic error]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

Expected results:
Restore should succeed.

Additional info:
Logs are attached.
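For reference, the reproduction flow above condensed into one command sequence (a minimal sketch only; the shared-storage host and path are placeholders, and the actual commands are the ones from the steps above):

# On the 4.3 engine: take the backup and copy it to shared storage (placeholder destination).
engine-backup --mode=backup --file=engine_backup.tar.gz --log=engine_backup.log
scp engine_backup.tar.gz storage.example.com:/path/to/shared/storage/

# After reinstalling the host with EL8, restore the 4.4 hosted engine from the 4.3 backup file:
hosted-engine --deploy --restore-from-file=/path/to/shared/storage/engine_backup.tar.gz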
Created attachment 1689815 [details]
logs from the host
Adding this here: this bug was opened as a follow-up to https://bugzilla.redhat.com/show_bug.cgi?id=1827135#c11.
(In reply to Nikolai Sednev from comment #0)
> Created attachment 1689814 [details]
> logs from engine
>
> Description of problem:
>
> Reprovisioned to RHEL8.2 and failed with:
>
> [ INFO ] TASK [ovirt.hosted_engine_setup : Fail with generic error]
> [ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
> [ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
> [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
>
> In the engine log I see:
> 2020-05-18 20:00:03,367+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-35) [7289cff0] Failed to migrate one or more VMs.
> 2020-05-18 19:50:23,852+03 ERROR [org.ovirt.engine.core.bll.CpuFlagsManagerHandler] (ServerService Thread Pool -- 57) [] Error getting info for CPU ' ', not in expected format.
>
> It looks like a different issue now, but the end result is the same: I was unable to restore the engine on 4.4 from 4.3.
> Moving back to assigned and attaching sosreports from host alma03 and the engine.
>
> Version-Release number of selected component (if applicable):
> Tested backup and restore from engine:
> rhvm-4.3.10.1-0.1.master.el7.noarch
> Linux 3.10.0-1127.8.1.el7.x86_64 #1 SMP Fri Apr 24 14:56:59 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
> Red Hat Enterprise Linux Server release 7.8 (Maipo)
>
> Host:
> rhvm-appliance.x86_64 2:4.3-20200507.0.el7 rhv-4.3.10
> ovirt-hosted-engine-setup-2.3.13-1.el7ev.noarch
> ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
> Linux 3.10.0-1127.8.2.el7.x86_64 #1 SMP Thu May 7 19:30:37 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
> Red Hat Enterprise Linux Server release 7.8 (Maipo)
>
> Restored on RHEL8.2 host with these components:
> rhvm-appliance.x86_64 2:4.4-20200417.0.el8ev @rhv-4.4.0

This is pretty old. Please try again with a newer appliance/engine and attach logs if it reproduces. Thanks.
It was received from:
rhv-release-4.4.0-36-001.noarch
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch
(In reply to Nikolai Sednev from comment #4)
> It was received from:
> rhv-release-4.4.0-36-001.noarch
> ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
> ovirt-hosted-engine-ha-2.4.3-1.el8ev.noarch

I added my comment 3 below "rhvm-appliance.x86_64 2:4.4-20200417.0.el8ev", which was the one I referred to as old. Sorry if this wasn't clear enough.
Tested with rhvm-appliance-4.4-20200521.0.el8ev.x86_64 on a clean host and the restore still failed:

[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, more info can be found in the engine logs, fix accordingly and re-deploy."}
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ ERROR ] Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20200525133313-62uzwb.log

Logs are being attached.
Created attachment 1691917 [details]
logs from host alma03 with latest appliance
Created attachment 1691918 [details]
sosreport from engine
engine.log has:

2020-05-25 14:17:03,014+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-19) [46a070dc] EVENT_ID: VDS_CPU_LOWER_THAN_CLUSTER(515), Host alma03.qa.lab.tlv.redhat.com moved to Non-Operational state as host does not meet the cluster's minimum CPU level. Missing CPU features : spec_ctrl
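A quick way to see what the host itself reports (a diagnostic sketch; it only reads standard kernel interfaces, nothing RHV-specific — on the host in this bug, the lscpu output later in the thread shows ibrs/ibpb/stibp but no spec_ctrl flag):

# Does the host kernel list spec_ctrl among its CPU flags?
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -x spec_ctrl || echo "spec_ctrl not reported"

# lscpu shows the same list on its "Flags:" line.
lscpu | grep -i '^flags'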
Ryan, can you please have a look? Thanks.
Nikolai, did you use a Virtual Host when you encountered this?
(In reply to Evgeny Slutsky from comment #11)
> Nikolai, did you use a Virtual Host when you encountered this?

No. I always use only physical hosts.
(In reply to Yedidyah Bar David from comment #9)
> engine.log has:
>
> 2020-05-25 14:17:03,014+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-19) [46a070dc] EVENT_ID: VDS_CPU_LOWER_THAN_CLUSTER(515), Host alma03.qa.lab.tlv.redhat.com moved to Non-Operational state as host does not meet the cluster's minimum CPU level. Missing CPU features : spec_ctrl

Can it be related to https://bugzilla.redhat.com/show_bug.cgi?id=1830872 ?
It's possible that in 4.3 the CPU of the HE host was recognized differently than in 4.4, and that caused an issue with the upgrade.
(In reply to Nikolai Sednev from comment #13)
> (In reply to Yedidyah Bar David from comment #9)
> > engine.log has:
> >
> > 2020-05-25 14:17:03,014+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-19) [46a070dc] EVENT_ID: VDS_CPU_LOWER_THAN_CLUSTER(515), Host alma03.qa.lab.tlv.redhat.com moved to Non-Operational state as host does not meet the cluster's minimum CPU level. Missing CPU features : spec_ctrl
>
> Can it be related to https://bugzilla.redhat.com/show_bug.cgi?id=1830872 ?
> It's possible that in 4.3 the CPU of the HE host was recognized differently than in 4.4, and that caused an issue with the upgrade.

It appears that the newer kernel in EL8 reports the CPU flags differently.
The cluster is set to:

compatibility_version: 4.3
cpu_name: Intel SandyBridge IBRS SSBD MDS Family
cpu_flags: vmx,ssbd,md_clear,model_SandyBridge,spec_ctrl

VDS reports flags:
cpuid,pbe,lahf_lm,pni,rdtscp,rep_good,ds_cpl,pse36,stibp,tsc_adjust,mce,cx8,aes,avx,ssbd,arch-capabilities,pti,md-clear,pae,pts,mca,cx16,dts,xtopology,tm,dtherm,pdcm,fxsr,sse4_2,arat,ept,sse4_1,aperfmperf,tpr_shadow,ibpb,ss,md_clear,pge,pdpe1gb,x2apic,amd-ssbd,popcnt,cmov,nonstop_tsc,tsc_deadline_timer,sse,arch_perfmon,sse2,constant_tsc,mtrr,smep,nopl,umip,erms,dtes64,smx,pse,dca,nx,tm2,syscall,pcid,flexpriority,mmx,acpi,skip-l1dfl-vmentry,xsaveopt,xsave,epb,sep,ht,bts,vpid,rdrand,lm,est,f16c,pebs,cpuid_fault,invtsc,ssse3,monitor,msr,fpu,vmx,clflush,pln,pclmulqdq,hypervisor,de,tsc,vnmi,apic,xtpr,vme,fsgsbase,ibrs,pat,flush_l1d,model_qemu64,model_Nehalem,model_Penryn,model_pentium2,model_Westmere-IBRS,model_SandyBridge,model_486,model_kvm64,model_Westmere,model_pentium3,model_IvyBridge,model_Nehalem-IBRS,model_IvyBridge-IBRS,model_coreduo,model_kvm32,model_Opteron_G1,model_SandyBridge-IBRS,model_qemu32,model_core2duo,model_Opteron_G2,model_Conroe,model_pentium

The VDS is really missing the flag spec_ctrl, as the log says. Setting the cluster CPU type to Intel SandyBridge Family should fix the problem.
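For illustration, that check can be reproduced by hand with a rough shell sketch (the required list is the cpu_flags value above; the host list placeholder stands for the comma-separated VDS flag list just quoted):

# 'required' is the cluster's cpu_flags column; host_flags is the list VDSM reports for the host.
required="vmx ssbd md_clear model_SandyBridge spec_ctrl"
host_flags="$(echo 'PASTE_THE_VDS_FLAG_LIST_HERE' | tr ',' ' ')"

for f in $required; do
    case " $host_flags " in
        *" $f "*) ;;                 # flag present on the host
        *) echo "missing: $f" ;;     # with the list above this prints: missing: spec_ctrl
    esac
done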
(In reply to Lucia Jelinkova from comment #15)
> The cluster is set to:
>
> compatibility_version: 4.3
> cpu_name: Intel SandyBridge IBRS SSBD MDS Family
> cpu_flags: vmx,ssbd,md_clear,model_SandyBridge,spec_ctrl
>
> VDS reports flags:
> cpuid,pbe,lahf_lm,pni,rdtscp,rep_good,ds_cpl,pse36,stibp,tsc_adjust,mce,cx8,aes,avx,ssbd,arch-capabilities,pti,md-clear,pae,pts,mca,cx16,dts,xtopology,tm,dtherm,pdcm,fxsr,sse4_2,arat,ept,sse4_1,aperfmperf,tpr_shadow,ibpb,ss,md_clear,pge,pdpe1gb,x2apic,amd-ssbd,popcnt,cmov,nonstop_tsc,tsc_deadline_timer,sse,arch_perfmon,sse2,constant_tsc,mtrr,smep,nopl,umip,erms,dtes64,smx,pse,dca,nx,tm2,syscall,pcid,flexpriority,mmx,acpi,skip-l1dfl-vmentry,xsaveopt,xsave,epb,sep,ht,bts,vpid,rdrand,lm,est,f16c,pebs,cpuid_fault,invtsc,ssse3,monitor,msr,fpu,vmx,clflush,pln,pclmulqdq,hypervisor,de,tsc,vnmi,apic,xtpr,vme,fsgsbase,ibrs,pat,flush_l1d,model_qemu64,model_Nehalem,model_Penryn,model_pentium2,model_Westmere-IBRS,model_SandyBridge,model_486,model_kvm64,model_Westmere,model_pentium3,model_IvyBridge,model_Nehalem-IBRS,model_IvyBridge-IBRS,model_coreduo,model_kvm32,model_Opteron_G1,model_SandyBridge-IBRS,model_qemu32,model_core2duo,model_Opteron_G2,model_Conroe,model_pentium
>
> The VDS is really missing the flag spec_ctrl, as the log says. Setting the cluster CPU type to Intel SandyBridge Family should fix the problem.

It will fix the upgrade problem, but the recognized CPU type will then not be the one intended by the design, and it won't match the exact CPU family of the HA host, which is:

alma03 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E5-2603 v2 @ 1.80GHz
Stepping:            4
CPU MHz:             1800.019
CPU max MHz:         1800.0000
CPU min MHz:         1200.0000
BogoMIPS:            3600.10
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            10240K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear flush_l1d

The E5-2603 v2 matches IvyBridge (https://ark.intel.com/content/www/us/en/ark/products/76157/intel-xeon-processor-e5-2603-v2-10m-cache-1-80-ghz.html), just as Michal Skrivanek said in https://bugzilla.redhat.com/show_bug.cgi?id=1830872#c34. Changing the cluster CPU family to Intel SandyBridge Family will therefore go against the design, although it will fix the upgrade issue; after the upgrade we would have to change it back to IvyBridge.
This is a slightly different case: the cluster already has its CPU type set (from 4.3), so no autodetection is performed; the engine only checks whether the host's flags contain the cluster's required CPU flags. In this case, the host is missing one. And yes, if the user wants to "upgrade" the cluster CPU from SandyBridge to IvyBridge, it has to be done manually after the upgrade from 4.3 to 4.4.
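For reference, a rough sketch of making that manual change through the REST API instead of the Administration Portal (the engine FQDN, password, cluster UUID and the exact CPU type string are placeholders — the type string has to be one of the names the engine offers for the cluster's compatibility level, and --insecure is used here only for brevity):

# Hypothetical example: switch the cluster CPU type after the 4.3 -> 4.4 upgrade.
curl --insecure \
     --user admin@internal:PASSWORD \
     --request PUT \
     --header 'Content-Type: application/xml' \
     --data '<cluster><cpu><type>Intel IvyBridge Family</type></cpu></cluster>' \
     'https://engine.example.com/ovirt-engine/api/clusters/CLUSTER_UUID'

The same change can be made in the Administration Portal by editing the cluster and picking the new CPU type there.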
I do not see any suggestions or plans for a fix. Is there any update on the planned action to fix this bug?

This bug is crucial for the upgrade flow and also affects the RHHI-V upgrade flow. There is a lot of testing that needs to be done after the RHV upgrade for RHHI-V, and all of that testing is blocked by this bug.

Is there any workaround for this issue now?
(In reply to SATHEESARAN from comment #18)
> I do not see any suggestions or plans for a fix. Is there any update on the planned action to fix this bug?
>
> This bug is crucial for the upgrade flow and also affects the RHHI-V upgrade flow. There is a lot of testing that needs to be done after the RHV upgrade for RHHI-V, and all of that testing is blocked by this bug.
>
> Is there any workaround for this issue now?

SATHEESARAN, it's being worked on here: https://bugzilla.redhat.com/show_bug.cgi?id=1841030
(In reply to Evgeny Slutsky from comment #19)
> (In reply to SATHEESARAN from comment #18)
> > I do not see any suggestions or plans for a fix. Is there any update on the planned action to fix this bug?
> >
> > This bug is crucial for the upgrade flow and also affects the RHHI-V upgrade flow. There is a lot of testing that needs to be done after the RHV upgrade for RHHI-V, and all of that testing is blocked by this bug.
> >
> > Is there any workaround for this issue now?
>
> SATHEESARAN, it's being worked on here: https://bugzilla.redhat.com/show_bug.cgi?id=1841030

Thanks Evgeny.
It's fixed here: https://gerrit.ovirt.org/#/c/109409/
https://gerrit.ovirt.org/109409 has been merged
$ git tag --contains 409b13d61939413bc73fc18e13724f0b18d5e336
v4.40.20
Currently only vdsm-api-4.40.19-1.el8ev.noarch.rpm is available to QA, from 2020-06-04 12:39 (bob).
We still have vdsm-4.40.19-1.el8ev.x86_64.rpm (2020-06-04) available to QA.
*** Bug 1841030 has been marked as a duplicate of this bug. ***
Moving this bug to verified, following the results in https://bugzilla.redhat.com/show_bug.cgi?id=1853225#c3: the engine's backup & restore 4.3->4.4 partially passed, and the initial issue this bug was opened for has been resolved. Please keep tracking bug 1853225 for more details on the upgrade status.
This bugzilla is included in the oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.1, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.