Bug 1670152 - HE deploy can fail with a not so clear error message on certain CPU types
Summary: HE deploy can fail with a not so clear error message on certain CPU types
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-ansible-collection
Classification: oVirt
Component: hosted-engine-setup
Version: 1.0.21
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.3.5
Target Release: 1.0.22
Assignee: Evgeny Slutsky
QA Contact: Nikolai Sednev
Docs Contact: Tahlia Richardson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-28 17:58 UTC by Nikolai Sednev
Modified: 2019-07-30 14:08 UTC
CC List: 7 users

Fixed In Version: ovirt-ansible-hosted-engine-setup-1.0.22
Clone Of:
Environment:
Last Closed: 2019-07-30 14:08:26 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.3+


Attachments (Terms of Use)
sosreport from host (10.45 MB, application/x-xz), 2019-01-28 17:58 UTC, Nikolai Sednev
logs from bonded vlan host (10.53 MB, application/x-xz), 2019-01-29 11:03 UTC, Nikolai Sednev
deployment over bonded vlan out print (44.47 KB, text/plain), 2019-01-29 11:04 UTC, Nikolai Sednev


Links
Github oVirt ovirt-ansible-hosted-engine-setup pull 212 (closed): when deployment fails to add host to the engine, create more detailed log. Last updated: 2020-11-23 15:21:28 UTC

Description Nikolai Sednev 2019-01-28 17:58:20 UTC
Created attachment 1524358 [details]
sosreport from host

Description of problem:
Looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1557624.

[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}

When deploying over a bond, the process fails with this error.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.3.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.2-1.el7ev.noarch
rhvm-appliance-4.3-20190115.0.el7.x86_64
ansible-2.7.6-1.el7ae.noarch
ovirt-ansible-engine-setup-1.1.6-1.el7ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over a bond.

Actual results:
Deployment fails

Expected results:
Should succeed.

Additional info:
Deployment details http://pastebin.test.redhat.com/703241

Comment 1 Sandro Bonazzola 2019-01-29 07:18:44 UTC
The error mentioned in the bug description appears in:
- ./var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-20190128191928-9ex6ul.log
- ./var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20190128191615-01pf8f.log

So this doesn't seem to be ovirt-host-deploy related, moving to ovirt-hosted-engine-setup.

Comment 2 Nikolai Sednev 2019-01-29 11:02:53 UTC
The very same deployment also failed over a bonded VLAN with the same issue:
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}

Logs attached.

Comment 3 Nikolai Sednev 2019-01-29 11:03:33 UTC
Created attachment 1524576 [details]
logs from bonded vlan host

Comment 4 Nikolai Sednev 2019-01-29 11:04:13 UTC
Created attachment 1524577 [details]
deployment over bonded vlan out print

Comment 5 Nikolai Sednev 2019-01-29 11:08:22 UTC
The bridge was properly created on the host, so this doesn't look like a VDSM issue:
orchid-vds2-vlan162 ~]# brctl show 
bridge name     bridge id               STP enabled     interfaces
;vdsmdummy;             8000.000000000000       no
ovirtmgmt               8000.001a647a9462       no              bond1.162
virbr0          8000.525400adb219       yes             virbr0-nic
                                                        vnet0


vds2-vlan162 ~]# ifconfig
bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        ether 00:1a:64:7a:94:62  txqueuelen 1000  (Ethernet)
        RX packets 1111311  bytes 1616828417 (1.5 GiB)
        RX errors 0  dropped 93  overruns 0  frame 0
        TX packets 139162  bytes 22512765 (21.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

bond1.162: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 00:1a:64:7a:94:62  txqueuelen 1000  (Ethernet)
        RX packets 242130  bytes 1546289578 (1.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 132008  bytes 20925759 (19.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


ovirtmgmt: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.35.129.15  netmask 255.255.255.0  broadcast 10.35.129.255
        inet6 2620:52:0:2381:21a:64ff:fe7a:9462  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::21a:64ff:fe7a:9462  prefixlen 64  scopeid 0x20<link>
        ether 00:1a:64:7a:94:62  txqueuelen 1000  (Ethernet)
        RX packets 5057  bytes 440309 (429.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1565  bytes 11287883 (10.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Comment 6 Simone Tiraboschi 2019-01-29 11:20:57 UTC
The issue comes from:

2019-01-28 19:47:00,720+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [35677605] START, SetVdsStatusVDSCommand(HostName = orchid-vds1.qa.lab.tlv.redhat.com, SetVdsStatusVDSCommandParameters:{hostId='679b3eab-f884-4f9e-b677-7db25b91a83c', status='NonOperational', nonOperationalReason='CPU_TYPE_INCOMPATIBLE_WITH_CLUSTER', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2158b1cd
2019-01-28 19:47:00,814+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [35677605] FINISH, SetVdsStatusVDSCommand, return: , log id: 2158b1cd
2019-01-28 19:47:01,694+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [35677605] EVENT_ID: CPU_TYPE_UNSUPPORTED_IN_THIS_CLUSTER_VERSION(156), Host orchid-vds1.qa.lab.tlv.redhat.com moved to Non-Operational state as host CPU type is not supported in this cluster compatibility version or is not supported at all
2019-01-28 19:47:02,280+02 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [4e49d5e9] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 679b3eab-f884-4f9e-b677-7db25b91a83c Type: VDS
2019-01-28 19:47:02,312+02 ERROR [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [4e49d5e9] Could not find server cpu for server 'orchid-vds1.qa.lab.tlv.redhat.com' (679b3eab-f884-4f9e-b677-7db25b91a83c), flags: 'fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,nopl,aperfmperf,eagerfpu,pni,dtes64,monitor,ds_cpl,vmx,est,tm2,ssse3,cx16,xtpr,pdcm,dca,lahf_lm,tpr_shadow,dtherm,model_Opteron_G2,model_kvm32,model_coreduo,model_Conroe,model_Opteron_G1,model_core2duo,model_qemu32,model_pentium2,model_pentium3,model_qemu64,model_kvm64,model_pentium,model_486'
2019-01-28 19:47:02,380+02 INFO  [org.ovirt.engine.core.bll.AddVmFromScratchCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-3) [55e9c1ef] Lock Acquired to object 'EngineLock:{exclusiveLocks='[external-HostedEngineLocal=VM_NAME]', sharedLocks=''}'
2019-01-28 19:47:02,406+02 WARN  [org.ovirt.engine.core.bll.AddVmFromScratchCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-3) [55e9c1ef] Validation of action 'AddVmFromScratch' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__VM,ACTION_TYPE_FAILED_CLUSTER_UNDEFINED_ARCHITECTURE

It's completely unrelated to bonding or networking in general: it depends only on the host CPU type.
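
To make the engine log above easier to read, here is a minimal Python sketch (my own illustration, not part of any fix) that extracts the CPU models vdsm advertised for this host from the flags string in the 19:47:02 log entry. The engine sets the host to NonOperational because none of these models matches a CPU type allowed at the cluster's compatibility level:

# Minimal sketch: list the model_* entries from the vdsm CPU flags string
# quoted in the engine log above (flags string abbreviated with "...").
flags = ("fpu,vme,de,...,tpr_shadow,dtherm,model_Opteron_G2,model_kvm32,"
         "model_coreduo,model_Conroe,model_Opteron_G1,model_core2duo,"
         "model_qemu32,model_pentium2,model_pentium3,model_qemu64,"
         "model_kvm64,model_pentium,model_486")

models = [f[len("model_"):] for f in flags.split(",")
          if f.startswith("model_")]
print(sorted(models))
# The newest real models here (Conroe, Opteron_G2) predate every CPU type
# supported at a 4.3 cluster level.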

Comment 7 Simone Tiraboschi 2019-01-29 11:25:00 UTC
Please see:
https://bugzilla.redhat.com/show_bug.cgi?id=1649817#c8

----------------------------------
With this update, the following CPU types have been deprecated: Intel Nehalem IBRS Family, Intel Westmere IBRS Family, Intel SandyBridge IBRS Family, Intel Haswell-noTSX IBRS Family, Intel Haswell IBRS Family, Intel Broadwell-noTSX IBRS Family, Intel Broadwell IBRS Family, Intel Skylake Client IBRS Family, Intel Skylake Server IBRS Family and AMD EPYC IBPB. Red Hat Virtualization 4.3 will not support these CPU types.
----------------------------------

Comment 8 Nikolai Sednev 2019-02-12 14:21:44 UTC
Moving to myself as it's completely unrelated to the network.

Comment 9 Sandro Bonazzola 2019-02-18 07:54:55 UTC
Moving to 4.3.2 as this has not been identified as a blocker for 4.3.1.

Comment 10 Sandro Bonazzola 2019-02-20 08:56:16 UTC
Let's query libvirt and fail early if we find unsupported CPU types.
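
A hedged sketch of that idea using the libvirt Python binding: read the host CPU model from the capabilities XML and bail out before deployment starts if the model is unsupported. The DEPRECATED_MODELS set is my guess at the libvirt spellings of the cluster CPU types listed in comment 7, not an authoritative list:

import sys
import xml.etree.ElementTree as ET
import libvirt  # python3-libvirt

# Assumed libvirt model names for the CPU types deprecated in 4.3 (comment 7).
DEPRECATED_MODELS = {
    "Nehalem-IBRS", "Westmere-IBRS", "SandyBridge-IBRS",
    "Haswell-noTSX-IBRS", "Haswell-IBRS", "Broadwell-noTSX-IBRS",
    "Broadwell-IBRS", "Skylake-Client-IBRS", "Skylake-Server-IBRS",
    "EPYC-IBPB",
}

conn = libvirt.open("qemu:///system")
caps = ET.fromstring(conn.getCapabilities())
model = caps.findtext("./host/cpu/model")
conn.close()

if model in DEPRECATED_MODELS:
    sys.exit("Host CPU model '%s' is deprecated; aborting deployment" % model)
print("Host CPU model: %s" % model)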

Comment 11 Evgeny Slutsky 2019-06-10 09:34:11 UTC
The issue is not with libvirt: the failure originates from the engine API during the initial attempt to add the host to the engine (HE).
We need to check the root cause of the failure as reported by the engine and print it in our logs.
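
In the spirit of the linked Github pull 212, here is a minimal sketch of that direction with the oVirt Python SDK (ovirtsdk4): after the add-host step fails, fetch recent engine events and echo them into the deployment log, so messages like CPU_TYPE_UNSUPPORTED_IN_THIS_CLUSTER_VERSION become visible. The URL, credentials, and event filter are placeholders:

import ovirtsdk4 as sdk

# Placeholder connection details; the real role would reuse the credentials
# the deployment already holds.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,  # sketch only; pass ca_file=... in real code
)
try:
    events_service = connection.system_service().events_service()
    # Uses the engine's event search syntax; fetch the newest events first.
    for event in events_service.list(search="sortby time desc", max=20):
        print(event.time, event.severity, event.description)
finally:
    connection.close()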

Comment 12 Nikolai Sednev 2019-07-21 03:43:33 UTC
Tested on these components:
Engine Software Version:4.3.5.4-0.1.el7
ovirt-hosted-engine-ha-2.3.3-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7ev.noarch
Linux 3.10.0-1061.el7.x86_64 #1 SMP Thu Jul 11 21:02:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 (Maipo)

Moving to verified; please reopen if it does not work for you.

Comment 13 Sandro Bonazzola 2019-07-30 14:08:26 UTC
This bug is included in the oVirt 4.3.5 release, published on July 30th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

