Bug 1916032 - Engine allows deploying HE hosts with spm_id > 64 but broker won't read past slot 64
Summary: Engine allows deploying HE hosts with spm_id > 64 but broker won't read past slot 64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.4.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.5
Target Release: 4.4.5
Assignee: Yedidyah Bar David
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-14 01:03 UTC by Germano Veit Michel
Modified: 2024-03-25 17:50 UTC
CC List: 7 users

Fixed In Version: ovirt-hosted-engine-ha-2.4.6-1.el8ev
Doc Type: Bug Fix
Doc Text:
Previously, if a host in the Self-hosted Engine had an ID number higher than 64, other hosts did not recognize that host, and the host did not appear in 'hosted-engine --vm-status'. In this release, the Self-hosted Engine allows host ID numbers of up to 2000.
Clone Of:
Environment:
Last Closed: 2021-04-14 11:38:43 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Attachments: None


Links
System / ID                                  Status  Summary                            Last Updated
Red Hat Knowledge Base (Solution) 5702971    None    None                               2021-01-14 01:16:24 UTC
Red Hat Product Errata RHSA-2021:1184        None    None                               2021-04-14 11:39:25 UTC
oVirt gerrit 113278 (master)                 MERGED  Increase MAX_HOST_ID_SCAN to 2000  2021-02-17 12:12:31 UTC

Description Germano Veit Michel 2021-01-14 01:03:25 UTC
Description of problem:

1. Customer re-installed all HE hosts with new hardware. The Data-Center was already quite big, and a portion of the new hosts
got spm_vds_id > 64. This same id is used for the Hosted-Engine cluster (host_id) and pushed from the engine on host deploy.
2. Hosts with id > 64 only see themselves (local md) and the hosts with id below 64 (global md) in --vm-status
3. The HA cluster is not working correctly; hosts do not see the entire cluster

The broker is only reading the first 64 slots from the metadata, as get_raw_stats() uses:

        bs = constants.HOST_SEGMENT_BYTES
        # TODO it would be better if this was configurable
        read_size = bs * (constants.MAX_HOST_ID_SCAN + 1)    <---------

And the constant is:

ovirt_hosted_engine_ha/env/constants.py.in:25:MAX_HOST_ID_SCAN = 64

But the engine will deploy HE hosts with id > 64 anyway. Either the engine should block deploying HE hosts with spm_vds_id > 64, or maybe the broker should read past slot 64?
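
To illustrate why a host with id > 64 disappears from --vm-status, here is a minimal sketch of the offset arithmetic (illustrative only, not the broker code; the 4096-byte per-host block and the "block 0 holds global metadata" layout are assumptions consistent with the snippet above):

HOST_SEGMENT_BYTES = 4096   # assumed per-host metadata block size
MAX_HOST_ID_SCAN = 64       # value shipped in ovirt-hosted-engine-ha 2.4.5

# Bytes read by get_raw_stats(): blocks 0..MAX_HOST_ID_SCAN
read_window = HOST_SEGMENT_BYTES * (MAX_HOST_ID_SCAN + 1)

for host_id in (10, 64, 70):
    offset = host_id * HOST_SEGMENT_BYTES   # where this host's metadata block lives
    print(f"host_id={host_id}: offset={offset}, inside read window: {offset < read_window}")

# host_id=70 writes at offset 286720, past the 266240-byte read window,
# so other hosts never parse its block and it is missing from --vm-status.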

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.5-1.el8ev.noarch
rhvm-4.4.3.12-0.1.el8ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install host
2. Modify vds_spm_id_map in the DB to set its id to > 64
3. Re-install Host (HostedEngine -> DEPLOY)
4. Confirm its not seen by any other HE host

Actual results:
* Hosts cannot see the entire HostedEngine cluster; failures to determine status

Expected results:
* All hosts see each other to make the cluster work properly

Comment 1 Arik 2021-01-18 12:19:07 UTC
Didi, what's your view on increasing the MAX_HOST_ID_SCAN constant to fix that?

Comment 2 Yedidyah Bar David 2021-01-24 13:40:04 UTC
(In reply to Arik from comment #1)
> Didi, what's your view on increasing the MAX_HOST_ID_SCAN constant to fix
> that?

AFAICT we allocate 1GB for the shared conf volume, and each host gets an entry of 4096 bytes, so in principle we have space for tens of thousands of hosts. I guess we limited the scan to 64 only for performance reasons - if it's unlikely that we'll use more than a few, no need to scan much more.

In the sanlock volume we allocate space for 2000 hosts.

I think it should be safe to also increase MAX_HOST_ID_SCAN to 2000, with a rather low performance hit.
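
A quick back-of-the-envelope check of those numbers (illustrative arithmetic only, using the sizes stated above):

conf_volume_bytes = 1 * 1024 ** 3        # ~1GB shared volume
entry_bytes = 4096                       # per-host entry
sanlock_hosts = 2000                     # host slots in the sanlock volume

print(conf_volume_bytes // entry_bytes)       # 262144 entries fit in the volume
print(entry_bytes * (sanlock_hosts + 1))      # 8196096 bytes (~8MB) read when scanning 2000 slots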

I hope I got it right.

Simone, can you please comment? Thanks.

Comment 14 Nikolai Sednev 2021-02-09 18:48:18 UTC
The value has been changed to 2000:
alma04 ~]# cat /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/constants.py |grep MAX_HOST_ID_SCAN
MAX_HOST_ID_SCAN = 2000
Tested on these components:
rhvm-appliance-4.4-20201117.0.el8ev.x86_64
ovirt-hosted-engine-ha-2.4.6-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.9-4.el8ev.noarch
ansible-2.9.17-1.el8ae.noarch
ovirt-ansible-collection-1.3.0-1.el8ev.noarch
vdsm-4.40.50.4-1.el8ev.x86_64
Linux 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.3 (Ootpa)

Comment 15 Yedidyah Bar David 2021-02-10 06:03:32 UTC
The bug wasn't about changing the constant to 2000, that's the fix...

Did you verify with spm IDs > 64? Output of --vm-status? Sanity?

Comment 16 Nikolai Sednev 2021-02-10 14:21:36 UTC
(In reply to Yedidyah Bar David from comment #15)
> The bug wasn't about changing the constant to 2000, that's the fix...
> 
> Did you verify with spm IDs > 64? Output of --vm-status? Sanity?

Not yet, I don't have sufficient data on how to get an environment to such a high host-id number.
Please provide exact steps for such verification and DB manipulation.

Comment 17 Germano Veit Michel 2021-02-10 20:44:23 UTC
(In reply to Nikolai Sednev from comment #16)
> (In reply to Yedidyah Bar David from comment #15)
> > The bug wasn't about changing the constant to 2000, that's the fix...
> > 
> > Did you verify with spm IDs > 64? Output of --vm-status? Sanity?
> 
> Not yet, I don't have sufficient data on how to get an environment to such a
> high host-id number.
> Please provide exact steps for such verification and DB manipulation.

Just add more hosts until you reach about 70 hosts in the DC, and then 
deploy an additional one with HE. For the initial ones maybe you can use
fakevdsm, if that still works.

Another option would be to add and remove hosts in rounds; my impression is that
the next spm_id will be the highest one in the DB + 1, but I might be wrong.

DB manipulation should also work to get spm_vds_id up, but I think the first
option is best, as that is how it is done in the real world.

Comment 18 Nikolai Sednev 2021-02-10 23:33:42 UTC
(In reply to Germano Veit Michel from comment #17)
> (In reply to Nikolai Sednev from comment #16)
> > (In reply to Yedidyah Bar David from comment #15)
> > > The bug wasn't about changing the constant to 2000, that's the fix...
> > > 
> > > Did you verify with spm IDs > 64? Output of --vm-status? Sanity?
> > 
> > Not yet, I don't have sufficient data on how to get an environment to such a
> > high host-id number.
> > Please provide exact steps for such verification and DB manipulation.
> 
> Just add more hosts until you reach about 70 hosts in the DC, and then 
> deploy an additional one with HE. For the initial ones maybe you can use
> fakevdsm, if that still works.
> 
> Another option would be to add and remove hosts in rounds; my impression is that
> the next spm_id will be the highest one in the DB + 1, but I might be wrong.
> 
> DB manipulation should also work to get spm_vds_id up, but I think the first
> option is best, as that is how it is done in the real world.

The only issue here is that I don't have 70 hosts to add...
Adding and removing the same new host does not increment the spm_id.
What is left is DB manipulation.

Comment 19 Germano Veit Michel 2021-02-11 01:46:42 UTC
(In reply to Nikolai Sednev from comment #18)
> The only issue here is that I don't have 70 hosts to be added...
> Adding and removing same new host does not increment spm_id.
> What is left is DB manipulation.

That might be the easiest. My env is down as nested KVM is broken on F33+RHEL8.3,
so I can't test it now. Maybe just fill up the vds_spm_id_map table. You might
need to add entries to vds_static and vds_dynamic too, or maybe just add some hosts in
maintenance mode.
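
For illustration only, a rough sketch of what such DB manipulation could look like. The column names (storage_pool.id; vds_spm_id_map.storage_pool_id, vds_spm_id, vds_id) and the connection details are assumptions, not verified against the engine schema - back up the engine DB first, and a foreign key on vds_id may indeed require real vds_static/vds_dynamic rows (e.g. hosts in maintenance mode) instead of random UUIDs:

# Hypothetical sketch: occupy low spm IDs in vds_spm_id_map so the next HE host
# deployed by the engine gets spm_vds_id > 64. Column names and credentials are
# assumptions -- verify against the actual schema before running anything.
import uuid
import psycopg2

conn = psycopg2.connect(dbname="engine", user="engine",
                        password="CHANGE_ME", host="localhost")  # placeholder credentials
with conn, conn.cursor() as cur:
    cur.execute("SELECT id FROM storage_pool LIMIT 1")   # assumed column name
    (pool_id,) = cur.fetchone()
    for spm_id in range(2, 70):                          # leave id 1 for the existing host
        cur.execute(
            "INSERT INTO vds_spm_id_map (storage_pool_id, vds_spm_id, vds_id)"
            " VALUES (%s, %s, %s) ON CONFLICT DO NOTHING",
            (pool_id, spm_id, str(uuid.uuid4())),        # random vds_id; may hit FK constraints
        )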

Comment 21 Eli Marcus 2021-03-25 17:25:09 UTC
Hi Didi, please review this updated doc text: 

Previously, if a host in the Self-hosted Engine had an ID number higher than 64, other hosts did not recognize that host, and the host did not appear in 'hosted-engine --vm-status'.
In this release, the Self-hosted Engine allows host ID numbers of up to 2000.

Comment 22 Yedidyah Bar David 2021-04-04 07:48:52 UTC
Looks good to me.

Comment 27 errata-xmlrpc 2021-04-14 11:38:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1184

