Description of problem:

When RHEV-M tries to import the hosted-engine storage domain, it appears to do so with the wrong host id, which causes a sanlock failure. As a result, hosted_storage is stuck in the locked state in RHEV-M. Even if we destroy hosted_storage, it will try to re-import it with the wrong host id.

Version-Release number of selected component (if applicable):
vdsm-4.17.23-0.el7ev.noarch
ovirt-hosted-engine-ha-1.2.8-1.el7ev.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
rhevm-3.6.3.4-0.1.el6.noarch

How reproducible:
Unknown

Steps to Reproduce:
Unknown

Actual results:
hosted_storage is stuck in the locked state.

Expected results:
Import of hosted_storage should work.

Additional info:
nsoffer I think you saw something similar?
# Info supplied by stirabos

In a few words:
- The hosted-engine host_id and the spm_id in the engine are not in sync (they could match only by chance).
- ovirt-ha-agent is directly calling startMonitoringDomain on the hosted-engine storage domain.
- The engine is calling connectStoragePool, which indirectly calls startMonitoringDomain on each attached SD, including the hosted-engine one.
- Calling startMonitoringDomain a second time on the same host with a different ID seems harmless (no errors, but VDSM continues to use the previous ID); probably it just sees that it is already monitoring and skips the call.
- Calling startMonitoringDomain with an ID already used on another host will result in a sanlock issue.

So the issue happens when we mix hosted-engine and regular hosts in the same datacenter, if a hosted-engine host steals the lock on the hosted-engine storage domain from a regular (non-HE) host. Having non-HE hosts skip the hosted-engine storage domain could be a solution.

> If there is a workaround, please specify it first.

Unfortunately, the only workaround we currently have is manually syncing the hosted-engine host_id and the spm_id in the engine DB. The side where you take the action (the DB, or hosted-engine.conf on each host) defines what you have to do to make it effective.
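To see the mismatch in practice, here is a hedged sketch: on the engine database, list the spm ids the engine has allocated and compare them with the host_id value in each host's /etc/ovirt-hosted-engine/hosted-engine.conf. The vds_spm_id_map table and vds_spm_id column are named later in this bug; the vds_id column name is an assumption.

```sql
-- Hedged sketch: list the spm ids the engine has allocated per host,
-- to compare against host_id in each host's
-- /etc/ovirt-hosted-engine/hosted-engine.conf.
-- vds_spm_id_map and vds_spm_id appear elsewhere in this bug;
-- the vds_id column name is an assumption.
SELECT vds_id, vds_spm_id FROM vds_spm_id_map ORDER BY vds_spm_id;
```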
(In reply to Roy Golan from comment #5)
> So the issue happens when we mix hosted-engine and regular hosts in
> the same datacenter, if a hosted-engine host steals the lock on the
> hosted-engine storage domain from a regular (non-HE) host.

In the customer case attached, we don't have any regular hosts. Both hosts are HE hosts.
Is this a different problem then?
(In reply to nijin ashok from comment #0)
> Seems like data domain was acquired with wrong host id and hence hosted
> engine import process is also using the same however it is already acquired
> with host id 2.

It seems to be a different effect of the same root issue: here it seems that the import process is failing due to the ID issue.
Do we have any workaround which can be provided to the customer?
To solve the collision meanwhile, we can bump the engine spm id to start from, say, 500. We won't have 500 hosted-engine hosts. A sanlock lockspace supports 2000 ids, so that leaves 1500 for regular hosts. Again, sufficient.

This is the most minimal way to overcome this currently without proposing any changes.

For 4.0 we need to make sure we always keep the SPM id range above the number of hosted-engine hosts, and we would be able to do so since deploying from the engine is the supported way of adding hosts.

Nir, Simone, thoughts?
Possibly not a blocker due to a simple workaround. It first needs to be tested to make sure it works as expected.
(In reply to Roy Golan from comment #10)
> To solve the collision meanwhile, we can bump the engine spm id to start
> from, say, 500. We won't have 500 hosted-engine hosts. A sanlock lockspace
> supports 2000 ids, so that leaves 1500 for regular hosts. Again, sufficient.
>
> This is the most minimal way to overcome this currently without proposing
> any changes.
>
> For 4.0 we need to make sure we always keep the SPM id range above the
> number of hosted-engine hosts, and we would be able to do so since
> deploying from the engine is the supported way of adding hosts.
>
> Nir, Simone, thoughts?

We will not change the engine spm id range because hosted engine is using the host id incorrectly.

The rules are:

- The engine and hosted engine must always use the *same* id - otherwise critical flows may break (e.g. fencing).
- Only the engine should control the host id.

Hosted engine must get the host id from the engine. If needed, we can store the host id on the host when we connect to a host, so hosted engine can access it.
(In reply to Nir Soffer from comment #12)
> (In reply to Roy Golan from comment #10)
> > To solve the collision meanwhile, we can bump the engine spm id to start
> > from, say, 500. We won't have 500 hosted-engine hosts. A sanlock lockspace
> > supports 2000 ids, so that leaves 1500 for regular hosts. Again,
> > sufficient.
> >
> > This is the most minimal way to overcome this currently without proposing
> > any changes.
> >
> > For 4.0 we need to make sure we always keep the SPM id range above the
> > number of hosted-engine hosts, and we would be able to do so since
> > deploying from the engine is the supported way of adding hosts.
> >
> > Nir, Simone, thoughts?
>
> We will not change the engine spm id range because hosted engine is using
> the host id incorrectly.
>
> The rules are:
>
> - The engine and hosted engine must always use the *same* id - otherwise
>   critical flows may break (e.g. fencing).

In what way?

> - Only the engine should control the host id.
>
> Hosted engine must get the host id from the engine. If needed, we can store
> the host id on the host when we connect to a host, so hosted engine can
> access it.
(In reply to Roy Golan from comment #13)
> > - The engine and hosted engine must always use the *same* id - otherwise
> >   critical flows may break (e.g. fencing).
>
> In what way?

One example is fencing - the engine tries to get the host lease status by host id. If the host lease is ok, the engine will not fence the host. If the host lease is dead (because the host is not using that id), the engine will fence the host.

There may be other failures; I don't know what will work and what will not. Using another host id instead of the one the engine allocated for a host is not supported.
I failed to reproduce it on my regular HE deployment with two hosts. If I understand correctly, the problem is that one of the hosts has the same sanlock ID that the SPM has. I will be happy to reproduce it if you give me exact reproduction steps.
Created attachment 1157668 [details]
engine vdsm logs

After discussion with Roy, I succeeded in reproducing the bug. Steps:
1) Deploy HE on the first host (storage does not really matter): host_1
2) Add an additional non-HE host to the engine: host_2
3) Deploy an additional HE host (use sanlock id 2: "Please specify the Host ID [Must be integer, default: 2] 2"): host_3

Now we have a situation where host_3 has sanlock id 2, but when we added host_2, the engine gave that host sanlock id 2, so we have two different hosts trying to acquire a sanlock lease on the storage with the same id.
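A hedged way to confirm the collision after step 3, assuming the vds_name column (the vds table and vds_spm_id column appear in the summary query later in this bug): on the engine side host_2 should show vds_spm_id = 2, while host_3's hosted-engine.conf independently carries host_id=2.

```sql
-- Hedged check on the engine DB: host_2 should appear with
-- vds_spm_id = 2 here, colliding with host_id=2 in host_3's
-- hosted-engine.conf. The vds_name column is an assumption.
SELECT vds_name, vds_spm_id FROM vds ORDER BY vds_spm_id;
```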
W/A:
1) Edit /etc/ovirt-hosted-engine/hosted-engine.conf on host_3 and change host_id=2 to host_id=3
2) Restart ovirt-ha-agent and ovirt-ha-broker (systemctl restart ovirt-ha-agent ovirt-ha-broker)

My case was pretty simple because I do not have too many hosts under the engine, but in a more complex case it is also possible that you will need to update the engine database to synchronize the HE and engine sanlock ids (table: vds_spm_id_map).
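If the engine DB is the side you choose to change, a hedged sketch of that sync follows. Only the vds_spm_id_map table name comes from this comment; '<host-uuid>' is a placeholder and the vds_id column name is an assumption.

```sql
-- Hedged sketch: align the engine-side spm id for one host with the
-- host_id in that host's hosted-engine.conf.
-- '<host-uuid>' is a placeholder; vds_id is an assumed column name.
UPDATE vds_spm_id_map
   SET vds_spm_id = 3
 WHERE vds_id = '<host-uuid>';
```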
Important note after applying the workaround: any additional non-HE host that you install after that will hit the same issue. It will collide with the hosted-engine id again.
The engine code today counts the hosts and fills in gaps if there are any, so the workaround suggested in comment 10 isn't feasible.

AddVdsSpmIdCommand:

```java
protected void insertSpmIdToDb(List<VdsSpmIdMap> vdsSpmIdMapList) {
    int selectedId = 1;
    // Collect the spm ids already allocated, in ascending order.
    List<Integer> list = vdsSpmIdMapList.stream()
            .map(VdsSpmIdMap::getVdsSpmId)
            .sorted()
            .collect(Collectors.toList());
    // Walk the sorted ids and stop at the first gap: selectedId ends up
    // as the lowest id not yet in use, so freed ids get reused.
    for (int id : list) {
        if (selectedId == id) {
            selectedId++;
        } else {
            break;
        }
    }
    // ... (remainder of the method elided in the original snippet)
```

The only supported way of deploying today would be to:
1. Deploy all hosted-engine hosts first. Make sure they add themselves to the engine and the deploy process completes cleanly.
2. Deploy all non-hosted-engine hosts after that.
3. Don't deploy any other hosted-engine hosts after that, or it will bring the ids out of sync.
# Summary

- In 3.6 the host id of HE hosts isn't synced with the host id of non-HE hosts.
- First add all HE hosts, then the regular non-HE hosts. That should keep the system from hitting the bug.
- Workaround for the bug: comment 18.
- If another HE host is needed, use this query on the engine to determine the next ID to use:

```sql
select vds_spm_id from vds ORDER BY vds_spm_id;
```

- oVirt 4 will support adding another HE host only from the engine itself, not from the CLI.
The workaround in comment 18 worked for the customer, and hosted_storage was imported successfully.
Fixed thoroughly in 4.0 by the functionality to add HE hosts using the engine REST API/UI. In 4.0 the engine is aware of HE deployment and uses the *same* vds_spm_id table to allocate the id for the next host.
*** Bug 1408602 has been marked as a duplicate of this bug. ***
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1409771 to get this issue covered there.