Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1706275

Summary: HE deployment fails trying to bootstrap it from a slow USB device
Product: [oVirt] ovirt-ansible-collection Reporter: Chris Kuperstein <ckuperst>
Component: hosted-engine-setup Assignee: Ido Rosenzwig <irosenzw>
Status: CLOSED CURRENTRELEASE QA Contact: Wei Wang <weiwang>
Severity: medium Docs Contact: Tahlia Richardson <trichard>
Priority: unspecified    
Version: unspecified CC: bugs, dholler, stirabos
Target Milestone: ovirt-4.3.4 Keywords: ZStream
Target Release: 1.0.18 Flags: sbonazzo: ovirt-4.3?
sbonazzo: planning_ack?
sbonazzo: devel_ack+
sbonazzo: testing_ack?
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-ansible-hosted-engine-setup-1.0.18 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-11 06:24:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Integration RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1709969, 1710352    
Attachments:
Description Flags
tarball of /var/logs/ovirt-hosted-engine-setup none
output from cockpit none
virsh output (post-attempt) none
ip a (output) none
ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-20194318536-6vruip.log none
yum.log none
/var/log/messages none

Description Chris Kuperstein 2019-05-04 01:51:49 UTC
Created attachment 1562864 [details]
tarball of /var/logs/ovirt-hosted-engine-setup

Description of problem: During deployment of the hosted engine, gathering the local VM IP does not succeed and installation fails.


Version-Release number of selected component (if applicable): 4.2


How reproducible: 


Steps to Reproduce:
1. Install RHEL 7.6 Minimal
2. Configure two 10 Gigabit interfaces in a single LACP team (team0)
3. Configure one VLAN interface on the team (team0)
4. Run hosted-engine --deploy, OR:
5. firewall-cmd --permanent --add-port=9090/tcp
6. Install cockpit and run the hosted engine installer

Actual results:
During deployment, the Ansible playbook fails to retrieve the local VM IP from the HostedEngineLocal instance, and the installer fails. The local VM instance is not properly terminated or cleaned up during the installer cleanup, and the network interface on the local VM does not bind an IP.


Expected results:
The installer completes deploying the local engine VM and proceeds to the storage domain configuration phase for VM migration.


Additional info:

Dell R720 Asset Tag# J71P5X1
128GB RAM
Intel X520-DA2 10Gbit NIC
Internal Storage: 32GB Sandisk Cruzer Fit (OS)
4x Samsung EVO 860 512GB SSD (LVM RAID10 + XFS)
2x Seagate Barracuda 5TB HDD (LVM RAID1 + XFS)

Comment 1 Chris Kuperstein 2019-05-04 01:52:36 UTC
Created attachment 1562865 [details]
output from cockpit

Comment 2 Chris Kuperstein 2019-05-04 01:53:48 UTC
Created attachment 1562866 [details]
virsh output (post-attempt)

This is the state of virsh during and after the attempt.

Comment 3 Chris Kuperstein 2019-05-04 01:55:20 UTC
Created attachment 1562867 [details]
ip a (output)

Comment 4 Dominik Holler 2019-05-06 15:34:45 UTC
Chris, can you please attach the file ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-*.log again? The currently attached file seems to be broken.
Can you please add /var/log/messages and yum.log, too?

Probably not related:
Please note that teaming is not supported, use bonding instead.
If the bond is not required during the install, it is recommended to create the bond after the installation.

Comment 5 Chris Kuperstein 2019-05-06 15:47:42 UTC
Created attachment 1564589 [details]
ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-20194318536-6vruip.log

Comment 6 Chris Kuperstein 2019-05-06 15:50:38 UTC
Created attachment 1564590 [details]
yum.log

Comment 7 Chris Kuperstein 2019-05-06 15:51:04 UTC
Created attachment 1564591 [details]
/var/log/messages

Comment 8 Chris Kuperstein 2019-05-06 16:56:59 UTC
Actions:

- # ovirt-hosted-engine-cleanup
- removed team0.2, team0, and p5p1+p5p2 interface configurations
- configured bond0 and vlan interface bond0.2 with "mode=4 miimon=100"
- restarted network.service
- attempted hosted engine deployment again

No success. I will also attempt a deployment without a bond, using just a standard access port and no VLAN interface on the host machine's native 1Gbit eth interfaces (em1).

Comment 9 Chris Kuperstein 2019-05-06 17:10:31 UTC
A standard host network interface without bonds or VLANs is also a no-go.

Comment 10 Simone Tiraboschi 2019-05-06 20:49:07 UTC
Chris,
deploying over a teamed device is not supported and will certainly fail.
Unfortunately, we cannot easily identify team interfaces due to an issue in the Ansible facts module,
tracked here: https://github.com/ansible/ansible/issues/43129

Bonds, VLANs, and VLANs over bonds (bond0.2) are supported instead.

Chris, can you please attach the logs for the attempt mentioned in comments 8 and 9?

Comment 11 Chris Kuperstein 2019-05-06 21:12:03 UTC
Simone,

Unfortunately, I went forward with a full wipe of the host to proceed with a different storage configuration, so I can't provide the logs from the attempts in comments 8 and 9. I suspected that the limited IO on the internal USB device where the root fs was mounted was inhibiting the local deployment of the engine appliance.

I reconfigured the host like so:

(Dell Perc H710p mini mono controller):
4x SSD in hardware RAID 10 (/dev/sda)
2x HDD in hardware RAID 1 (/dev/sdb)

[root@vhost0 ~]# lsblk
NAME                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                      8:0    0 930.5G  0 disk 
├─sda1                   8:1    0   2.8G  0 part /boot/efi
├─sda2                   8:2    0   1.9G  0 part /boot
└─sda3                   8:3    0 925.9G  0 part 
  ├─rhel-root          253:0    0  93.1G  0 lvm  /
  ├─rhel-swap          253:1    0   7.5G  0 lvm  [SWAP]
  └─rhel-usr_share_ssd 253:2    0 825.3G  0 lvm  /usr/share/ssd
sdb                      8:16   0   4.6T  0 disk 
└─sdb1                   8:17   0   4.6T  0 part 
  └─data-hdd           253:3    0   4.6T  0 lvm  /usr/share/hdd

With the root fs residing on a 93.1G logical volume on the SSDs, polling for the VM IP only took about a minute or so. I suspect this was a race condition against the timeout for the appliance deployment on local storage.

In the future, perhaps there is a better way to deploy the local engine appliance straight to a selected storage pool rather than deploying the local VM then choosing a storage pool to migrate to after the fact?

Comment 12 Simone Tiraboschi 2019-05-06 21:39:51 UTC
(In reply to Chris Kuperstein from comment #11)
> Simone,
> 
> unfortunately I went forward with a full wipe of the host to proceed with a
> different storage configuration, so can't provide the logs from attempts in
> comments 8 and 9.

So did it finally work? Can we close this?

> I suspected the limited IO on the internal USB device
> where the root fs was mounted was inhibiting the local deployment of the
> engine appliance.

We are polling 50 times with a 10-second delay.
500 seconds definitely seems a reasonable amount of time to bootstrap a VM and have it get an address from an internal DHCP server.
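
For illustration only, the polling budget described above (50 attempts, 10 seconds apart) can be sketched in Python. This is not the playbook's actual task: `poll_for_value`, `probe`, and `worst_case_wait` are hypothetical names, and the loop only mirrors the retries-times-delay arithmetic.

```python
import time
from typing import Callable, Optional


def poll_for_value(
    probe: Callable[[], Optional[str]],
    retries: int = 50,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call probe() up to `retries` times, sleeping `delay` seconds after each miss.

    `probe` is a hypothetical callable that returns the VM address once it is
    known, or None while the VM has not obtained a lease yet.
    """
    for _ in range(retries):
        value = probe()
        if value:
            return value
        sleep(delay)
    return None  # budget exhausted, the caller treats this as a failure


def worst_case_wait(retries: int = 50, delay: float = 10.0) -> float:
    """Upper bound on the time spent sleeping in poll_for_value, in seconds."""
    return retries * delay
```

With the defaults, `worst_case_wait()` gives the 500 seconds mentioned above; a host that needs longer than that to boot the appliance would exhaust the budget regardless of network setup.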

> with the root fs residing on a 93.1G partition existing on the SSDs, polling
> for the VM IP only took about a minute or so. I suspect this was a race
> condition against the timeout for the appliance deployment on local storage.

For the deployment we need about 3 GB under /var/tmp
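
A quick way to check that requirement before deploying is sketched below. It is an assumption-laden example, not part of the installer: `has_free_space` is a hypothetical helper, and the 3 GB figure is treated as 3 GiB for simplicity.

```python
import shutil

GIB = 1024 ** 3


def has_free_space(path: str = "/var/tmp", required_gib: float = 3.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gib` GiB free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= required_gib * GIB


if __name__ == "__main__":
    # Assumes /var/tmp exists, as it does on a standard RHEL host.
    print("enough space under /var/tmp:", has_free_space())
```

Note that this only checks capacity; it says nothing about how fast the underlying device is.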
 
> In the future, perhaps there is a better way to deploy the local engine
> appliance straight to a selected storage pool rather than deploying the
> local VM then choosing a storage pool to migrate to after the fact?

The whole point of this flow is to bootstrap a VM with a locally running engine as quickly as possible in order to use that engine (via ansible modules) to do everything else (configuring the storage and the network, creating disks, a VM...) using standard and well tested engine code instead of duplicating it.

Comment 13 Chris Kuperstein 2019-05-06 21:54:10 UTC
This is confirmed working. I suspect it would also be okay when using a SATA DOM for internal host storage. However, with an internal USB device as host storage (which is extremely common on commodity hardware ~5+ years old), deployment of the hosted engine simply will not work, even if the device is sufficiently sized. It looks like this may be limited by the read/write speed of the device on which /var/tmp is mounted.
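
To get a rough idea of whether the device behind /var/tmp is the bottleneck, a sequential write test along these lines could be used. This is only a sketch: `write_throughput_mib_s` is a hypothetical helper, the 64 MiB default size is arbitrary, and a single synced write is a crude measure compared to a proper benchmark tool.

```python
import os
import tempfile
import time


def write_throughput_mib_s(directory: str, size_mib: int = 64) -> float:
    """Write `size_mib` MiB of random data into `directory` and return MiB/s.

    The data is flushed and fsync'ed before the clock stops, so the result
    reflects the device rather than the page cache alone.
    """
    chunk = os.urandom(1024 * 1024)  # 1 MiB of incompressible data
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mib):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the temporary file
    return size_mib / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Assumes /var/tmp is writable by the current user.
    print(f"/var/tmp: {write_throughput_mib_s('/var/tmp'):.1f} MiB/s")
```

Comparing the figure on the USB device with the figure on the SSD volume would show how large the gap actually is.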

I understand the reuse of modules present in the engine appliance itself, but this is an (albeit minor) disadvantage compared to ESXi/VMware VCSA, which uses an out-of-band delivery method (OVFTool) for management.

Comment 14 Sandro Bonazzola 2019-05-08 07:07:33 UTC
Let's raise the timeout; I'm not sure this is a really common use case.

Comment 15 Wei Wang 2019-05-22 12:58:42 UTC
Discussed with DEV, then tested this issue with RHVH-4.3-20190516.1-RHVH-x86_64-dvd1.iso

Version:
RHVH-4.3-20190516.1-RHVH-x86_64-dvd1.iso
cockpit-ovirt-dashboard-0.12.9-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch

Steps:
1. Clean install RHVH-4.3-20190516.1-RHVH-x86_64-dvd1.iso
2. Set the network to bond+vlan
3. Deploy the hosted engine (CLI and Cockpit UI)

Result:
Deployment was successful without error under the bond+vlan network.

The bug is fixed; changing status to "VERIFIED".

Comment 16 Sandro Bonazzola 2019-06-11 06:24:12 UTC
This bugzilla is included in the oVirt 4.3.4 release, published on June 11th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.