Bug 1322849 - hosted_storage trying to import the storage domain with incorrect host id
Status: CLOSED NEXTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.3
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ovirt-3.6.8
Assignee: Roman Mohr
QA Contact: Nikolai Sednev
Duplicates: 1408602
 
Reported: 2016-03-31 12:37 UTC by nijin ashok
Modified: 2019-12-16 05:35 UTC
CC List: 22 users

Doc Type: Bug Fix
Last Closed: 2016-06-05 08:52:45 UTC
oVirt Team: SLA


Attachments
engine vdsm logs (2.81 MB, application/zip), 2016-05-15 14:19 UTC, Artyom


Links
Red Hat Knowledge Base (Solution) 2599301, last updated 2017-04-01 09:42:21 UTC
Red Hat Knowledge Base (Solution) 2981731, last updated 2017-03-31 19:13:23 UTC

Description nijin ashok 2016-03-31 12:37:44 UTC
Description of problem:

When RHEV-M tries to import the hosted engine storage domain, it appears to use the wrong host id, which causes a sanlock failure. As a result the hosted_storage domain is stuck in the Locked state in RHEV-M. Even if we destroy the hosted_storage domain, the engine tries to re-import it with the same wrong host id.

Version-Release number of selected component (if applicable):

vdsm-4.17.23-0.el7ev.noarch
ovirt-hosted-engine-ha-1.2.8-1.el7ev.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
rhevm-3.6.3.4-0.1.el6.noarch


How reproducible:

Unknown

Steps to Reproduce:

Unknown

Actual results:

The hosted_storage domain is stuck in the Locked state.

Expected results:

Import of hosted_storage should work.

Additional info:

Comment 4 Roy Golan 2016-04-17 08:12:37 UTC
nsoffer I think you saw something similar?

Comment 5 Roy Golan 2016-04-18 11:34:37 UTC
# Info supplied by stirabos


In a few words:
- The hosted-engine host_id and the spm_id in the engine are not in sync (they may match only by chance).
- ovirt-ha-agent directly calls startMonitoringDomain on the hosted-engine storage domain.
- The engine calls connectStoragePool, which indirectly calls startMonitoringDomain on each attached storage domain, including the hosted-engine one.
- Calling startMonitoringDomain a second time on the same host with a different ID seems harmless (no errors, but VDSM keeps using the previous ID); it probably just sees that the domain is already being monitored and skips the call.
- Calling startMonitoringDomain with an ID that is already in use on another host results in a sanlock issue.

So the issue happens when we mix hosted-engine and regular hosts in the same datacenter and a hosted-engine host steals the lock on the hosted-engine storage domain from a regular (non-HE) host.

Having non-HE hosts skip the hosted-engine storage domain could be a solution.

> If there is a workaround, please specify it first.

Unfortunately, the only workaround we currently have is manually syncing the hosted-engine host_id and the spm_id in the engine DB.
The side where you make the change (the DB, or hosted-engine.conf on each host) determines what you have to do to make it effective.
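
A minimal way to check whether the two sides are in sync is to compare the host_id value in /etc/ovirt-hosted-engine/hosted-engine.conf on each hosted-engine host with the id the engine allocated. The sketch below assumes the engine database schema used by the queries elsewhere in this bug (vds view, vds_spm_id column); the vds_name column is an assumption added for readability.

```sql
-- Sketch: list the id the engine allocated to each host; compare with
-- host_id in /etc/ovirt-hosted-engine/hosted-engine.conf on the HE hosts.
-- vds_name is assumed; vds / vds_spm_id follow the query in comment 28.
select vds_name, vds_spm_id from vds order by vds_spm_id;
```

If the vds_spm_id reported for a hosted-engine host differs from the host_id in its hosted-engine.conf, the two ids are out of sync.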

Comment 6 nijin ashok 2016-04-20 18:30:30 UTC
(In reply to Roy Golan from comment #5)


> So the issue happens when we mix hosted-engine and regular hosts in
> the same datacenter if an hosted-engine host steal the lock on the
> hosted-engine storage domain to a regular (non HE) host.

In the customer case attached, we don't have any regular hosts; both hosts are HE hosts.

Comment 7 Roy Golan 2016-04-21 11:07:04 UTC
Is this a different problem then?

Comment 8 Simone Tiraboschi 2016-04-21 12:32:26 UTC
(In reply to nijin ashok from comment #0)
> Seems like data domain was acquired with wrong host id and hence hosted
> engine import process is also using the same however it is already acquired
> with host id 2.

It seems to be a different effect of the same root issue: here the import process is failing because of the ID conflict.

Comment 9 nijin ashok 2016-04-25 10:33:35 UTC
Do we have any workaround that can be provided to the customer?

Comment 10 Roy Golan 2016-05-05 08:29:23 UTC
To avoid the collision in the meantime, we can bump the engine SPM ids to start from, say, 500. We will never have 500 hosted engine hosts, and a sanlock lockspace supports 2000 hosts, so that leaves 1500 regular hosts, which is again sufficient.

This is the most minimal way to overcome the problem right now without proposing code changes.

For 4.0 we need to make sure the SPM ids the engine allocates always stay above the number of hosted engine hosts, and we will be able to do so since deploying from the engine is the supported way of adding hosts.

Nir, Simone thoughts?

Comment 11 Roy Golan 2016-05-05 08:30:41 UTC
Possibly not a blocker, given the simple workaround. It first needs to be tested to make sure it works as expected.

Comment 12 Nir Soffer 2016-05-05 08:35:21 UTC
(In reply to Roy Golan from comment #10)
> To solve the collision meanwhile we can bump the engine spm id to start from
> say 500. We won't have 500 hosted engine hosts. Sanlock lockspace supports
> 2000 so that leaves 1500 regular hosts. Again sufficient. 
> 
> This is the most minimal way to overcome that currently without proposing
> any changes.
> 
> For 4.0 we need to make sure we always keep the SPM id higher from hosted
> engine hosts amount and we would be able to do so since deploying from
> engine is the supported way of adding hosts.
> 
> Nir, Simone thoughts?

We will not change the engine SPM id range because hosted engine is using the host id incorrectly.

The rules are:

- Engine and hosted engine must always use the *same* id; otherwise critical flows may break (e.g. fencing).
- Only the engine should control the host id.

Hosted engine must get the host id from the engine; if needed, we can store the host id on the host when we connect to it, so that hosted engine can access it.

Comment 13 Roy Golan 2016-05-05 11:43:41 UTC
(In reply to Nir Soffer from comment #12)
> (In reply to Roy Golan from comment #10)
> > To solve the collision meanwhile we can bump the engine spm id to start from
> > say 500. We won't have 500 hosted engine hosts. Sanlock lockspace supports
> > 2000 so that leaves 1500 regular hosts. Again sufficient. 
> > 
> > This is the most minimal way to overcome that currently without proposing
> > any changes.
> > 
> > For 4.0 we need to make sure we always keep the SPM id higher from hosted
> > engine hosts amount and we would be able to do so since deploying from
> > engine is the supported way of adding hosts.
> > 
> > Nir, Simone thoughts?
> 
> We will not change engine spm id range because hosted engine is using the
> host id
> incorrectly.
> 
> The rules are:
> 
> - engine and hosted engine must use the *same* id always - otherwise
> critical flows
>   may break (e.g. fencing).

in what way?

> - only engine should control the host id
> 
> hosted engine must get the host if from engine, if needed we can store the
> host
> id on the host when we connect to a host, so hosted engine can access it.

Comment 15 Nir Soffer 2016-05-06 18:25:29 UTC
(In reply to Roy Golan from comment #13)
> > - engine and hosted engine must use the *same* id always - otherwise
> > critical flows
> >   may break (e.g. fencing).
> 
> in what way?

One example is fencing: the engine tries to get the host lease status by host id. If the host lease is OK, the engine will not fence the host. If the host lease is dead (because the host is not actually using that id), the engine will fence it.

There may be other failures; I don't know what will work and what will not.

Using a host id other than the one the engine allocated for a host is not supported.

Comment 16 Artyom 2016-05-15 08:13:56 UTC
I failed to reproduce it on my regular HE deployment with two hosts.
If I understand correctly, the problem is that one of the hosts has the same sanlock ID as the SPM.
I will be happy to reproduce it if you give me exact reproduction steps.

Comment 17 Artyom 2016-05-15 14:19:59 UTC
Created attachment 1157668 [details]
engine vdsm logs

After a discussion with Roy, I succeeded in reproducing the bug. Steps:
1) Deploy HE on the first host (the storage type does not really matter): host_1
2) Add an additional non-HE host to the engine: host_2
3) Deploy an additional HE host and accept sanlock id 2 at the prompt (Please specify the Host ID [Must be integer, default: 2] 2): host_3

Now host_3 has sanlock id 2, but when we added host_2 the engine had already given that host sanlock id 2, so two different hosts try to acquire a sanlock lease on the storage with the same id.

Comment 18 Artyom 2016-05-15 14:25:30 UTC
Workaround:
1) Edit /etc/ovirt-hosted-engine/hosted-engine.conf on host_3 and change host_id=2 to host_id=3
2) Restart ovirt-ha-agent and ovirt-ha-broker (systemctl restart ovirt-ha-agent ovirt-ha-broker)

My case was pretty simple because I do not have many hosts under the engine, but in a more complex case you may also need to update the engine database to synchronize the HE and engine sanlock ids (table: vds_spm_id_map); see the sketch below.
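
A minimal sketch of the DB-side check mentioned above. The vds_spm_id_map column names used here (vds_id, vds_spm_id) are assumed from memory of the engine schema; verify them before relying on this. List the ids the engine has already handed out and compare them with the host_id in hosted-engine.conf on each HE host before changing either side.

```sql
-- Sketch: show which sanlock/SPM ids the engine has allocated.
-- Column names (vds_id, vds_spm_id) are assumptions, not verified on this release.
select vds_id, vds_spm_id from vds_spm_id_map order by vds_spm_id;
```

Any change to hosted-engine.conf still has to be followed by the ovirt-ha-agent/ovirt-ha-broker restart from step 2.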

Comment 23 Roy Golan 2016-05-17 04:16:13 UTC
Important note: after applying the workaround, any additional non-HE host that you install will hit the same issue; it will collide with the hosted engine id again.

Comment 24 Roy Golan 2016-05-17 04:28:14 UTC
The engine code today counts the number of hosts and fills in gaps if there are any, so the workaround suggested in comment 10 isn't feasible.


AddVdsSpmIdCommand:

// Excerpt: selectedId is the SPM id the engine will assign to the new host.
// The loop walks the already-allocated ids in ascending order and stops at the
// first gap, so with existing ids 1 and 3 the next host gets id 2 - possibly an
// id a hosted-engine host is already using for sanlock.
protected void insertSpmIdToDb(List<VdsSpmIdMap> vdsSpmIdMapList) {
    int selectedId = 1;
    List<Integer> list = vdsSpmIdMapList.stream()
            .map(VdsSpmIdMap::getVdsSpmId)
            .sorted()
            .collect(Collectors.toList());

    for (int id : list) {
        if (selectedId == id) {
            selectedId++;
        } else {
            break;
        }
    }
    // ... rest of the method (which stores selectedId) omitted
}



The only supported way of deploying today would be to:

1. Deploy all hosted engine hosts first. Make sure they add themselves to the engine and that the deploy process completes cleanly.
2. Deploy all non-hosted-engine hosts after that.
3. Don't deploy any other hosted engine hosts after that, or the ids will go out of sync again.

Comment 28 Roy Golan 2016-05-19 07:57:58 UTC
# Summary

- In 3.6 the host id of HE hosts isn't synced with the host id of non-HE hosts.
- First add all HE hosts, then the regular non-HE hosts. That should keep the system from hitting the bug.
- Workaround for the bug: comment 18.
- If another HE host is needed, use this query on the engine to determine the next ID to use (see also the sketch after this list):
```sql
   select vds_spm_id from vds ORDER BY vds_spm_id;
```
- oVirt 4 will support adding another HE host only from the engine itself and not from the CLI.
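
A possible follow-up to the query above (a sketch, not an official procedure): take one more than the highest allocated id for the next HE host, which is the value hosted-engine-setup asks for at its "Please specify the Host ID" prompt (see comment 17).

```sql
-- Sketch: the next host_id to enter at the hosted-engine-setup prompt, assuming
-- ids should simply continue above the highest one the engine has allocated.
select max(vds_spm_id) + 1 as next_host_id from vds;
```

Per comment 24 the engine fills gaps when allocating ids, so this matches what the engine will assign to the new host only if there are no gaps in the allocated ids.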

Comment 29 nijin ashok 2016-05-25 16:37:28 UTC
The workaround in comment 18 worked for the customer and the hosted_storage domain was imported successfully.

Comment 30 Roy Golan 2016-06-05 08:52:45 UTC
Fixed thoroughly in 4.0 by the functionality to add HE hosts through the engine REST API/UI. In 4.0 the HE deployment uses the *same* vds_spm_id table to pick the id for the next host.

Comment 31 Simone Tiraboschi 2017-01-02 11:09:38 UTC
*** Bug 1408602 has been marked as a duplicate of this bug. ***

Comment 32 Nikolai Sednev 2017-01-03 10:21:55 UTC
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1409771 to get this issue covered there.

