Bug 1491462

Summary: In a specific DR scenario, importing a storage domain fails if it was previously a master storage domain
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 4.1.5
Hardware: Unspecified
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Gordon Watson <gwatson>
Assignee: Maor <mlipchuk>
QA Contact: Elad <ebenahar>
Docs Contact:
CC: amureini, bcholler, gveitmic, gwatson, lsurette, mkalinin, mlipchuk, nsoffer, ratamir, rbalakri, Rhev-m-bugs, srevivo, teigland, ykaul, ylavi
Target Milestone: ovirt-4.2.2
Target Release: ---
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-01-18 11:47:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
  attach_warning (flags: none)

Description Gordon Watson 2017-09-13 22:31:19 UTC
Description of problem:

If I do the following, importing a storage domain fails:

- create an SD in one RHV env
- make it the master SD of its DC
- go to a different RHV env and import the SD
- approve the "risky" operation
- forcedDetachStorageDomain fails with "err=(-243, 'Sanlock resource not acquired', 'Sanlock exception')"


I hear you say "don't do that", but if I do the following, it works:

- create an SD in one RHV env
- don't make it the master SD of its DC
- go to a different RHV env and import the SD
- approve the "risky" operation
- the SD enters maintenance mode
- 'Activate' is then successful


So, as long as it's not a master SD, it can be imported successfully. 
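(For reference, one way to check up front whether another environment still holds the SDM lease on such a domain is to dump the sanlock leases area directly. This is only a sketch; the path below is illustrative for a block-based SD, where the leases area is usually the /dev/<sd_uuid>/leases LV, while file-based domains keep it under <mount>/<sd_uuid>/dom_md/leases.)

  # illustrative path; substitute the real storage domain UUID
  sanlock direct dump /dev/<sd_uuid>/leases

If the dump shows the SDM resource with a live owner, some host still holds (or recently held) the paxos lease, and a forced detach/import from a new environment is expected to hit the -243 error above.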

This relates to a customer's problem, for which the above test was the closest reproduction I could achieve.

Regardless, this could feasibly happen, even if it is unlikely: a RHV environment could encounter some catastrophic condition in which only one host survives. The user then builds a new RHV environment, tries to import the old master storage domain, and the import fails.


Version-Release number of selected component (if applicable):

RHV 4.1.5
RHEL 7.3 hosts;
  vdsm-4.19.15-1.el7ev


How reproducible:

100%.

Steps to Reproduce:
1. See above.


Actual results:


Expected results:


Additional info:

Comment 3 Allon Mureinik 2017-09-14 12:32:33 UTC
Maor, didn't you solve this for 4.2 (based on another BZ)?

Comment 4 Maor 2017-09-14 13:29:55 UTC
It looks like a sanlock issue. If the host from the previous setup is still alive, it will keep the lock and we will get an error when trying to attach the domain to a different Data Center.
The user must reboot the hosts from the previous setup before importing the storage domain.
Gordon, is it possible to know whether the user rebooted the hosts from the previous setup?

Comment 5 Yaniv Kaul 2017-09-14 13:53:14 UTC
(In reply to Maor from comment #4)
> It looks like a sanlock issue. If the host from the previous setup is still
> alive it will keep the lock and we will get an error when trying to attach
> it to a different Data Center.
> The user must reboot the hosts from the previous setup before importing the
> storage domain.
> Gordon, is it possible to know if the user did reboot its hosts from the
> previous setup?

But it is reasonable in a DR scenario that the hosts blew up. We need some way to unlock it, I reckon.

Comment 6 Gordon Watson 2017-09-14 14:29:38 UTC
Maor,

There are two scenarios here;

1) My test environment. In this case, the lease was still being updated by the host in the original environment. However, that was true for both the master and non-master SDs, and the latter worked OK. As I said, I realise that this was not a great test, but it was the best I could do. Plus, it could actually happen. What I don't get, though, is why sanlock fails only with a master SD.

In this situation, I didn't try to reboot the hosts in the original environment, as they are in general use by other people. 


2) Customer's environment. In this case, the SD was not in use anywhere else. It was "cloned" on the storage array and presented to the new environment via a separate lun. So, it was unique to the new environment.


I hope that clarifies the situation. If not, please let me know.

Thanks very much, GFW.

Comment 7 Maor 2017-09-14 14:50:55 UTC
(In reply to Yaniv Kaul from comment #5)
> (In reply to Maor from comment #4)
> > It looks like a sanlock issue. If the host from the previous setup is still
> > alive it will keep the lock and we will get an error when trying to attach
> > it to a different Data Center.
> > The user must reboot the hosts from the previous setup before importing the
> > storage domain.
> > Gordon, is it possible to know if the user did reboot its hosts from the
> > previous setup?
> 
> But it is reasonable in a DR scenario that the hosts blew up. We need some
> way to unlock it, I reckon.

If the hosts blew up, they will no longer be connected to the storage domains; after ~120 seconds the sanlock lock will be released and the storage domain can be imported again.
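A rough way to confirm from the new environment that the old hosts really stopped renewing their delta leases (a sketch only; the device path is illustrative for a block-based domain):

  # dump the delta-lease (ids) area twice, about a minute apart,
  # and compare the timestamp recorded for the old host's id
  sanlock direct dump /dev/<sd_uuid>/ids
  sleep 60
  sanlock direct dump /dev/<sd_uuid>/ids

If the timestamp for that host id no longer changes between the two dumps, the delta lease is no longer being renewed and will expire, after which the import should be possible.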

Comment 8 Maor 2017-09-17 15:01:57 UTC
(In reply to Gordon Watson from comment #6)
> Maor,
> 
> There are two scenarios here;
> 
> 1) My test envrionment. In this, the lease was still being updated by the
> host in the original environment. However, that was true for both the master
> and non-master SDs, and the latter worked ok. As I said, I realise that this
> was not a great test, but it was the best i could do. Plus, it could
> actually happen. What I don't get though is why sanlock fails only with a
> master SD.

In the storage domain on the storage server there are two files which back the sanlock leases:
1. ids - the delta leases, which register a host with its id. For example, the first host registered in the storage domain usually gets id 1.
2. leases - the paxos leases, which are responsible for the locking.
The SPM uses the paxos lease only on the master storage domain.

Basically, attaching a regular storage domain should fail if the new host has the same id as the previous host, but if the hosts were given different ids the attach might succeed.

If you try to attach a master storage domain, you should fail to get a lease on it, since the old SPM still holds the paxos lease.
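For illustration only: on a file-based storage domain these two areas are plain files under the domain's dom_md directory (the NFS-style path below is an example; block-based domains expose them as the ids and leases LVs in the domain's VG instead):

  # example path; substitute the real mount point and SD UUID
  ls -l /rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/dom_md/ids
  ls -l /rhev/data-center/mnt/<server>:_<export>/<sd_uuid>/dom_md/leases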

> 
> In this situation, I didn't try to reboot the hosts in the original
> environment, as they are in general use by other people. 
> 
> 
> 2) Customer's environment. In this case, the SD was not in use anywhere
> else. It was "cloned" on the storage array and presented to the new
> environment via a separate lun. So, it was unique to the new environment.

Keep in mind that the lease works with timestamps, so even if the domain was cloned, it could take a few minutes until the lease is released.
So in that case, did the attach succeed or not?

> 
> 
> I hope that clarifies the situation. If not, please let me know.
> 
> Thanks very much, GFW.

Comment 9 Gordon Watson 2017-09-18 12:01:12 UTC
(In reply to Maor from comment #8)

> If you will try to attach a master storage domain then you should fail to
> get a lease on the storage domain since the old SPM will acquire a paxos
> lease on it.
> 

Ah, ok, makes sense.

> > 2) Customer's environment. In this case, the SD was not in use anywhere
> > else. It was "cloned" on the storage array and presented to the new
> > environment via a separate lun. So, it was unique to the new environment.
> 
> Keep in mind that the lease is working with timestamp and even if it was
> cloned, it could take a few minutes until the lease is released.
> So in that case, was the attach succeeded or not?
> 

No, the attach was not successful.

Comment 13 Gordon Watson 2017-09-18 14:30:37 UTC
Created attachment 1327480 [details]
vdsm log from customer

Comment 17 Allon Mureinik 2017-11-23 16:36:11 UTC
Maor - do we have an RCA on this?

Comment 18 Maor 2017-11-25 13:54:44 UTC
Basically, the customer's issue was already resolved, as noted in comment 10.
The only thing I was still trying to analyze is why sanlock failed to acquire a lock for a cloned storage domain.

It looks like there is a recurring pattern of a failing VDS deploy followed by a failure to force-detach the storage domain.
The host id seems to be the same after every deploy (it is 1), but I can't find any reboot of the host. I also could not find any earlier logs in which the force detach was successful.

Based on the logs attached to this bug, it looks like the host properties are:
  Host name: sn1mopx2e0001.1bestarinet.net (see [1])
  Host id: f18b403f-2b45-4f41-91ee-47ac6fa2316c (see [2])

It seems like there was an attempt to deploy the host to the engine, which ended in failure (see [3]).

It seems like part of the engine log between 2017-09-12 11:37:28 and 2017-09-12 23:41:26 is missing (see [4]).

Based on what happened before 2017-09-12 11:37:28, it looks like the behavior was the same: an attempt to deploy the host that ended in failure, and a failure to attach the storage domain.
I assume that a reboot of the host might have solved it.

[1]
Storage Pool Manager runs on Host sn1mopx2e0001.1bestarinet.net (Address: 10.28.59.36).

[2]
2017-09-12 23:41:44,166 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksInfoVDSCommand] (org.ovirt.thread.pool-6-thread-4) [3d4d12de] START, HSMGetAllTasksInfoVDSCommand(HostName = sn1mopx2e0001.1bestarinet.net, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='f18b403f-2b45-4f41-91ee-47ac6fa2316c'}), log id: 781c954

[3]
2017-09-12 23:55:35,402 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (DefaultQuartzScheduler3) [] Error during host 10.28.59.36 install: java.io.IOException: Command returned failure code 1 during SSH session 'root.59.36'
.....
2017-09-12 23:55:35,437 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Failed to check for available updates on host sn1mopx2e0001.1bestarinet.net with message 'Command returned failure code 1 during SSH session 'root.59.36''.

[4] 
2017-09-12 11:33:58,553 ERROR [org.ovirt.engine.core.sso.servlets.InteractiveAuthServlet] (default task-101) [] Cannot authenticate user 'admin@internal': Cannot Login. User Account is Disabled or Locked, Please contact your system administrator.
2017-09-12 11:37:28,381 INFO  [org.ovirt.engine.extension.aaa.jdbc.core.Tasks] (default task-20) [] (house keeping) deleting failed logins prior to 2017-09-05 15:37:28Z.

// We have missing logs here

2017-09-12 23:41:26,270 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 54) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

Comment 20 David Teigland 2017-11-27 15:53:55 UTC
The host in comment 14 (apparently host_id 1?) is failing to acquire the "SDM" paxos lease with error -243, i.e. the SDM lease is held by another host.  The on-disk lease for the SDM resource shows that it is held by host_id 14.  host_id 14 has a host name of 28157078-1139-4619-b55a-7b5cce80e6db, which you can find in /var/log/sanlock.log on one of the hosts.

With /var/log/sanlock.log from these hosts, we could fill in more details about what's happening, and if that leaves any unanswered questions, then 'sanlock client log_dump' could fill in any remaining info.
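If it helps, the following is roughly what would need to be collected on each host (a sketch; the output locations are arbitrary):

  # keep the persistent sanlock log
  cp /var/log/sanlock.log /tmp/sanlock.log.$(hostname)
  # and capture sanlock's in-memory debug buffer, as suggested above
  sanlock client log_dump > /tmp/sanlock.log_dump.$(hostname)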

Comment 28 David Teigland 2018-01-02 20:04:46 UTC
This machine is host_id 1 with host name 4b61bab6-08e7-4ee7-92ff-3e0242780371.localhost.

The machine holding the SDM lease is host_id 3, with host name f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab.rhevh-20.g

This machine adds the lockspace:
s8 lockspace a0630ead-dac7-419f-b5e6-2c4360b4d181:1:/dev/a0630ead-dac7-419f-b5e6-2c4360b4d181/ids:0
s8 delta_acquire done 1 1 1190126
s8 add_lockspace done
s8 host 1 1 1190126 4b61bab6-08e7-4ee7-92ff-3e0242780371.localhost.
s8 host 3 2 56671635 f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab.rhevh-20.g

This machine tries and fails to acquire the SDM lease:
s8:r6 resource a0630ead-dac7-419f-b5e6-2c4360b4d181:SDM:/dev/a0630ead-dac7-419f-b5e6-2c4360b4d181/leases:1048576
r6 paxos_acquire owner 3 delta 3 2 56671635 alive
r6 acquire_disk rv -243 lver 2 at 56006075

To find out which host is holding the lease, you need to check /var/log/sanlock.log on all the other hosts and find the one that says:
  sanlock daemon started ... host f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab.rhevh-20.g
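For example, running something like this on each candidate host (illustrative only) would identify it:

  grep 'daemon started' /var/log/sanlock.log | grep f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab

The host whose sanlock.log matches is the one registered as host_id 3 and holding the SDM lease.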

Comment 29 Maor 2018-01-02 20:48:13 UTC
Thanks David,

Gordon, would it be possible to attach all the hosts' sanlock logs, as David mentioned in the comment above, to find out which host acquired the lock?

Comment 39 Maor 2018-01-16 15:53:35 UTC
After investigating the scenario a bit more, here are the conclusions and an explanation of the scenario described in comment 35 (see [1]).

As described in comment 8, the master storage domain is locked by the SPM using paxos leases.
Any other storage domain only uses the delta leases, which register the host with its id.

The engine might succeed in importing a non-master storage domain while it is active on a different setup, since the SPM does not lock non-master storage domains. So, theoretically, a storage domain could be active in one setup and still be imported successfully into another.
However, take into consideration that the import operation might fail if the hosts in the two setups registered with the same id.
For example, if a host in setup A has id 1 and the host in setup B also has id 1, sanlock should fail the import operation.

[1] "import a non-master SD into the new RHV environment while it was still active and had a valid lease acquired by 'rhevh-20' in the old environment."

Comment 43 Maor 2018-01-17 19:06:32 UTC
Created attachment 1382593 [details]
attach_warning

Comment 46 Marina Kalinin 2019-05-02 23:00:05 UTC
Maybe we can change the title of the bug to: Importing a storage domain fails if it was a master SD and was not detached from the original DC.

Comment 47 Franta Kust 2019-05-16 12:54:59 UTC
BZ<2>Jira re-sync