Description of problem:

1. Customer re-installed all HE hosts with new hardware. The Data-Center was
   already quite big and a portion of the new hosts got spm_vds_id > 64. This
   same id is used for the Hosted-Engine cluster (host_id) and pushed from the
   engine on host deploy.
2. Hosts with id > 64 only see themselves (local md) and the hosts below 64
   (global md) in --vm-status.
3. The HA cluster is not working correctly; hosts are not seeing the entire
   cluster.

The broker only reads the first 64 slots from the metadata, as
get_raw_stats() uses:

    bs = constants.HOST_SEGMENT_BYTES
    # TODO it would be better if this was configurable
    read_size = bs * (constants.MAX_HOST_ID_SCAN + 1)   <---------

And the constant is:

    ovirt_hosted_engine_ha/env/constants.py.in:25:MAX_HOST_ID_SCAN = 64

But the engine will deploy HE hosts with id > 64 anyway. Either the engine
should block deploying HE hosts with spm_vds_id > 64, or the broker should
read past slot 64.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.5-1.el8ev.noarch
rhvm-4.4.3.12-0.1.el8ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install a host
2. Modify vds_spm_id_map in the DB to set its id to > 64
3. Re-install the host (HostedEngine -> DEPLOY)
4. Confirm it's not seen by any other HE host

Actual results:
* Hosts cannot see the entire HostedEngine cluster; failures to determine
  status

Expected results:
* All hosts see each other so the cluster works properly
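The cutoff can be illustrated with a small sketch. It assumes a slot layout where host N's metadata occupies segment N of the volume and HOST_SEGMENT_BYTES = 4096; slot_visible() is a hypothetical helper for the read-window arithmetic, not actual broker code:

```python
# Sketch of how the broker's fixed read window hides hosts with high IDs.
# HOST_SEGMENT_BYTES and the slot layout are assumptions based on the
# snippet above; slot_visible() is a hypothetical helper, not broker code.
HOST_SEGMENT_BYTES = 4096

def slot_visible(host_id, max_host_id_scan):
    """A host's metadata slot is only seen if it lies inside read_size."""
    read_size = HOST_SEGMENT_BYTES * (max_host_id_scan + 1)
    slot_end = HOST_SEGMENT_BYTES * (host_id + 1)  # end offset of this slot
    return slot_end <= read_size

# With the original constant, a host with id 70 falls outside the window:
print(slot_visible(70, 64))    # False
print(slot_visible(70, 2000))  # True
```

This matches the reported symptom: hosts with id <= 64 stay visible to everyone, while slots past the read window are simply never read.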
Didi, what's your view on increasing the MAX_HOST_ID_SCAN constant to fix that?
(In reply to Arik from comment #1)
> Didi, what's your view on increasing the MAX_HOST_ID_SCAN constant to fix
> that?

AFAICT we allocate 1GB for the shared conf volume, and each host gets an
entry of 4096 bytes, so in principle we have space for tens of thousands of
hosts. I guess we limited the scan to 64 only for performance reasons - if
it's unlikely that we'll use more than a few, there is no need to scan much
more.

In the sanlock volume we allocate space for 2000 hosts. I think it should be
safe to also increase MAX_HOST_ID_SCAN to 2000, with a rather low
performance hit.

I hope I got it right. Simone, can you please comment? Thanks.
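As a rough sanity check on these numbers (a sketch under the assumptions stated in the comment above: a 1 GiB volume, 4096-byte entries, and a sanlock limit of 2000 hosts):

```python
# Back-of-the-envelope capacity/cost arithmetic for the metadata volume,
# using the figures from the comment above (assumptions, not measured data).
VOLUME_BYTES = 1024 ** 3     # 1 GiB shared conf volume
ENTRY_BYTES = 4096           # per-host metadata entry
SANLOCK_HOSTS = 2000         # hosts supported by the sanlock volume

max_entries = VOLUME_BYTES // ENTRY_BYTES
scan_bytes = ENTRY_BYTES * (SANLOCK_HOSTS + 1)

print(max_entries)               # 262144 slots: "tens of thousands of hosts"
print(scan_bytes / (1024 ** 2))  # ~7.8 MiB read per scan: a low cost
```

So scanning 2000 slots means reading under 8 MiB per pass, which supports the "rather low performance hit" expectation.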
The value has been changed to 2000:

    alma04 ~]# cat /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/constants.py | grep MAX_HOST_ID_SCAN
    MAX_HOST_ID_SCAN = 2000

Tested on these components:
rhvm-appliance-4.4-20201117.0.el8ev.x86_64
ovirt-hosted-engine-ha-2.4.6-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.9-4.el8ev.noarch
ansible-2.9.17-1.el8ae.noarch
ovirt-ansible-collection-1.3.0-1.el8ev.noarch
vdsm-4.40.50.4-1.el8ev.x86_64
Linux 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.3 (Ootpa)
The bug wasn't about changing the constant to 2000 - that's the fix...

Did you verify with spm IDs > 64? Output of --vm-status? Sanity?
(In reply to Yedidyah Bar David from comment #15)
> The bug wasn't about changing the constant to 2000, that's the fix...
>
> Did you verify with spm IDs > 64? Output of --vm-status? Sanity?

Not yet, I don't have sufficient information on how to get an environment to
such a high host-id.
Please provide exact steps for such verification and DB manipulation.
(In reply to Nikolai Sednev from comment #16)
> Not yet, I have no sufficient data regarding how to get environment to that
> high number of host-id.
> Please provide exact steps for such verification and db manipulation.

Just add more hosts until you reach about 70 hosts in the DC, and then
deploy an additional one with HE. For the initial ones maybe you can use
fakevdsm, if that still works.

Another option would be to add and remove hosts in rounds; my impression is
that the next spm_id will be the highest one in the DB + 1, but I might be
wrong.

DB manipulation should also work to get spm_vds_id up, but I think the first
option is best, as it is what is done in the real world.
(In reply to Germano Veit Michel from comment #17)
> Just add more hosts until you reach about 70 hosts in the DC, and then
> deploy an additional one with HE. For the initial ones maybe you can use
> fakevdsm, if that still works.
>
> Another option would be to add and remove in rounds, my impression is that
> the next spm_id will be the highest one in the DB+1, but I might be wrong.
>
> DB manipulation should also work to get spm_vds_id up but I think the first
> is the best, as it was is done in real world.

The only issue here is that I don't have 70 hosts to add...
Adding and removing the same new host does not increment the spm_id.
What is left is DB manipulation.
(In reply to Nikolai Sednev from comment #18)
> The only issue here is that I don't have 70 hosts to be added...
> Adding and removing same new host does not increment spm_id.
> What is left is DB manipulation.

Might be the easiest. My env is down as nested KVM is broken on F33+RHEL8.3,
so I can't test it now.

Maybe just fill up the vds_spm_id_map table. You might need to add rows to
vds_static and vds_dynamic too, or just add some hosts in maintenance mode.
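One way to fill the table, sketched below, is to generate filler rows so the next real host is allocated a high spm_vds_id. The column names (storage_pool_id, vds_spm_id, vds_id) and the uuid scheme are assumptions; verify them against the actual engine DB schema before running anything against a live database:

```python
# Hypothetical sketch: emit SQL to pad vds_spm_id_map with filler rows.
# Column names and the uuid scheme are assumptions -- check the engine
# schema first; this is illustration, not a supported procedure.
import uuid

def filler_rows_sql(storage_pool_id, first_id, last_id):
    stmts = []
    for spm_id in range(first_id, last_id + 1):
        fake_vds_id = uuid.uuid4()  # placeholder host uuid
        stmts.append(
            "INSERT INTO vds_spm_id_map "
            "(storage_pool_id, vds_spm_id, vds_id) "
            f"VALUES ('{storage_pool_id}', {spm_id}, '{fake_vds_id}');"
        )
    return stmts

sql = filler_rows_sql('00000000-0000-0000-0000-000000000000', 2, 70)
print(len(sql))  # 69 statements, pushing the next allocated id past 70
```

The generated statements could then be fed to psql against the engine database; after that, the next host added through the engine should receive an id above the filler range.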
Hi Didi, please review this updated doc text: Previously, if a host in the Self-hosted Engine had an ID number higher than 64, other hosts did not recognize that host, and the host did not appear in 'hosted-engine --vm-status'. In this release, the Self-hosted Engine allows host ID numbers of up to 2000.
Looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1184