Bug 1491462
Summary: In a specific DR scenario, importing a storage domain fails if it was previously a master storage domain

Product: Red Hat Enterprise Virtualization Manager
Reporter: Gordon Watson <gwatson>
Component: ovirt-engine
Assignee: Maor <mlipchuk>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: medium
Version: 4.1.5
CC: amureini, bcholler, gveitmic, gwatson, lsurette, mkalinin, mlipchuk, nsoffer, ratamir, rbalakri, Rhev-m-bugs, srevivo, teigland, ykaul, ylavi
Target Milestone: ovirt-4.2.2
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-01-18 11:47:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description
Gordon Watson
2017-09-13 22:31:19 UTC
Maor, didn't you solve this for 4.2 (based on another BZ)?

Maor (comment #4):

It looks like a sanlock issue. If the host from the previous setup is still alive, it will keep the lock and we will get an error when trying to attach the storage domain to a different Data Center. The user must reboot the hosts from the previous setup before importing the storage domain.

Gordon, is it possible to know whether the user rebooted the hosts from the previous setup?

Yaniv Kaul (comment #5):

(In reply to Maor from comment #4)
> The user must reboot the hosts from the previous setup before importing
> the storage domain.

But it is reasonable in a DR scenario that the hosts blew up. We need some way to unlock it, I reckon.

Gordon Watson (comment #6):

Maor,

There are two scenarios here:

1) My test environment. Here, the lease was still being updated by the host in the original environment. However, that was true for both the master and non-master SDs, and the latter worked OK. As I said, I realise that this was not a great test, but it was the best I could do. Plus, it could actually happen. What I don't get, though, is why sanlock fails only with a master SD. In this situation, I didn't try to reboot the hosts in the original environment, as they are in general use by other people.

2) Customer's environment. In this case, the SD was not in use anywhere else. It was "cloned" on the storage array and presented to the new environment via a separate LUN, so it was unique to the new environment.

I hope that clarifies the situation. If not, please let me know.

Thanks very much, GFW.

Maor (reply to comment #5):

If the hosts blew up, then they are no longer connected to the storage domains; after ~120 seconds the sanlock lock will be released and the storage domain can be imported again.

Maor (comment #8, reply to comment #6):

On the storage server, each storage domain has two files which back the sanlock leases:

1. ids - the delta leases, which register a host under its host id. For example, the first host registered in the storage domain usually gets id 1.
2. leases - the paxos leases, which provide the actual locking. The SPM takes a paxos lease only on the master storage domain.

Basically, attaching a regular storage domain should fail if the id of the new host is the same as the id of the previous host; if the hosts got different ids, the attach of the storage domain might succeed. If you try to attach a master storage domain, you should fail to get a lease on it, since the old SPM holds a paxos lease on it.
> 2) Customer's environment. In this case, the SD was not in use anywhere
> else. It was "cloned" on the storage array and presented to the new
> environment via a separate lun. So, it was unique to the new environment.

Keep in mind that the lease works with a timestamp; even if the domain was cloned, it could take a few minutes until the lease is released. So in that case, did the attach succeed or not?

Gordon Watson:

(In reply to Maor from comment #8)
> If you try to attach a master storage domain, you should fail to
> get a lease on it, since the old SPM holds a paxos
> lease on it.

Ah, OK, that makes sense.

> Keep in mind that the lease works with a timestamp; even if the domain was
> cloned, it could take a few minutes until the lease is released.
> So in that case, did the attach succeed or not?

No, the attach was not successful.

Created attachment 1327480 [details]
vdsm log from customer
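The ~120-second release that Maor describes, and the reason a freshly cloned LUN can still look locked for a few minutes, can be sketched as a simple timeout check. This is only an illustration of the delta-lease expiry idea; sanlock's real renewal protocol involves generations and I/O timeouts, and the names below are invented for this sketch.

```python
# Illustrative model of the delta-lease timeout discussed above: a host
# renews its lease by rewriting a timestamp in the domain's "ids" file;
# once renewals stop (host died, or the LUN was cloned at some instant),
# other hosts treat the lease as free only after a grace period.
# Function and constant names are invented for this sketch.

LEASE_FREE_AFTER = 120.0  # seconds, the approximate figure quoted above

def lease_is_free(last_renewal: float, now: float,
                  grace: float = LEASE_FREE_AFTER) -> bool:
    """True once the owner has not renewed for the full grace period."""
    return (now - last_renewal) >= grace

# A cloned LUN carries the old owner's last timestamp, so immediately
# after cloning the lease still looks held:
print(lease_is_free(last_renewal=1000.0, now=1030.0))  # False: looks held
# A couple of minutes later it reads as free:
print(lease_is_free(last_renewal=1000.0, now=1200.0))  # True: released
```

This is consistent with the customer case above: right after the clone, the copied lease timestamp is still recent, so an immediate attach can fail even though no live host is renewing it.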
Maor - do we have an RCA on this?

Maor:

Basically, the customer issue was already resolved based on comment 10. The only thing I was trying to analyze is why sanlock failed to acquire a lock for a cloned storage domain.

It looks like there is a constant flow of a failing VDS deploy and then a failure to force-detach the storage domain. The host id seems to be the same after every deploy (which is 1), but I can't find any reboot which was done on the host. I also could not find any prior logs in which the force detach was successful.

Based on the logs attached to this bug, the host properties appear to be:

Host name: sn1mopx2e0001.1bestarinet.net (see [1])
Host id: f18b403f-2b45-4f41-91ee-47ac6fa2316c (see [2])

It seems there was an attempt to deploy the host to the engine, which ended with failure (see [3]). It also seems that part of the engine log between 2017-09-12 11:37:28 and 2017-09-12 23:41:26 is missing (see [4]). Based on what happened before 2017-09-12 11:37:28, the behavior looks the same: an attempt to deploy the host which ended with failure, and a failure to attach the storage domain. I assume that a reboot of the host might have solved it.

[1] Storage Pool Manager runs on Host sn1mopx2e0001.1bestarinet.net (Address: 10.28.59.36).

[2] 2017-09-12 23:41:44,166 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksInfoVDSCommand] (org.ovirt.thread.pool-6-thread-4) [3d4d12de] START, HSMGetAllTasksInfoVDSCommand(HostName = sn1mopx2e0001.1bestarinet.net, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='f18b403f-2b45-4f41-91ee-47ac6fa2316c'}), log id: 781c954

[3] 2017-09-12 23:55:35,402 ERROR [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (DefaultQuartzScheduler3) [] Error during host 10.28.59.36 install: java.io.IOException: Command returned failure code 1 during SSH session 'root.59.36'
.....
2017-09-12 23:55:35,437 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Failed to check for available updates on host sn1mopx2e0001.1bestarinet.net with message 'Command returned failure code 1 during SSH session 'root.59.36''.

[4] 2017-09-12 11:33:58,553 ERROR [org.ovirt.engine.core.sso.servlets.InteractiveAuthServlet] (default task-101) [] Cannot authenticate user 'admin@internal': Cannot Login. User Account is Disabled or Locked, Please contact your system administrator.
2017-09-12 11:37:28,381 INFO [org.ovirt.engine.extension.aaa.jdbc.core.Tasks] (default task-20) [] (house keeping) deleting failed logins prior to 2017-09-05 15:37:28Z.
// We have missing logs here
2017-09-12 23:41:26,270 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 54) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

David Teigland:

The host in comment 14 (apparently host_id 1?) is failing to acquire the "SDM" paxos lease with error -243, i.e. the SDM lease is held by another host. The on-disk lease for the SDM resource shows that it is held by host_id 14. host_id 14 has a host name of 28157078-1139-4619-b55a-7b5cce80e6db, which you can find in /var/log/sanlock.log on one of the hosts. With /var/log/sanlock.log from these hosts, we could fill in more details about what's happening, and if that leaves any unanswered questions, then 'sanlock client log_dump' could fill in any remaining info.

David Teigland:

This machine is host_id 1 with host name 4b61bab6-08e7-4ee7-92ff-3e0242780371.localhost.
The machine holding the SDM lease is host_id 3 with host name f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab.rhevh-20.g

This machine adds the lockspace:

s8 lockspace a0630ead-dac7-419f-b5e6-2c4360b4d181:1:/dev/a0630ead-dac7-419f-b5e6-2c4360b4d181/ids:0
s8 delta_acquire done 1 1 1190126
s8 add_lockspace done
s8 host 1 1 1190126 4b61bab6-08e7-4ee7-92ff-3e0242780371.localhost.
s8 host 3 2 56671635 f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab.rhevh-20.g

This machine tries and fails to acquire the SDM lease:

s8:r6 resource a0630ead-dac7-419f-b5e6-2c4360b4d181:SDM:/dev/a0630ead-dac7-419f-b5e6-2c4360b4d181/leases:1048576
r6 paxos_acquire owner 3 delta 3 2 56671635 alive
r6 acquire_disk rv -243 lver 2 at 56006075

To find out which host is holding the lease, check /var/log/sanlock.log on all the other hosts and find the one that says:

sanlock daemon started ... host f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab.rhevh-20.g

Maor:

Thanks David. Gordon, would it be possible to add all the hosts' sanlock logs, as David described in the above comment, to find out which host acquired the lock?

Maor:

After I investigated the scenario a bit more, here are the conclusions and an explanation of the scenario described in comment 35 (see [1]).

As described in comment 8, the master storage domain is used by the SPM with paxos leases for locking. Any other storage domain uses the delta leases, which register the host under its id. The engine might succeed in importing a non-master storage domain while it is active on a different setup, since the SPM does not lock non-master storage domains. Theoretically, the engine could encounter a scenario where a storage domain is active in one setup and is imported successfully into another setup. However, take into consideration that the import operation might fail if hosts from both setups registered with the same id.
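The lookup David describes above, matching a host_id seen in a failed paxos_acquire to the host name recorded in the lockspace, can be sketched as a small parser over sanlock.log "host" lines. This sketch is based only on the excerpt quoted in this bug; real sanlock.log lines carry timestamp prefixes and other fields this regex ignores, so it is not a format specification.

```python
import re

# Sketch of extracting "host <id> <generation> <timestamp> <name>"
# entries from sanlock.log text like the excerpt above, so that a
# host_id seen in a failed acquire can be mapped to a host name.

HOST_LINE = re.compile(
    r"^s\d+ host (?P<hid>\d+) (?P<gen>\d+) (?P<ts>\d+) (?P<name>\S+)",
    re.MULTILINE)

def hosts_by_id(log_text: str) -> dict:
    """Map host_id -> (generation, timestamp, host name)."""
    return {
        int(m.group("hid")): (int(m.group("gen")), int(m.group("ts")),
                              m.group("name"))
        for m in HOST_LINE.finditer(log_text)
    }

# The two host lines from the log excerpt above:
sample = """\
s8 host 1 1 1190126 4b61bab6-08e7-4ee7-92ff-3e0242780371.localhost.
s8 host 3 2 56671635 f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab.rhevh-20.g
"""

# host_id 3 is the SDM lease owner in the failed acquire above:
print(hosts_by_id(sample)[3][2])
# prints f0dc7c64-e5d1-43ea-8a8e-2d7a550d30ab.rhevh-20.g
```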
For example, if there is one host in setup A with id 1 and the host in setup B also has id 1, sanlock should fail the import operation.

[1] "import a non-master SD into the new RHV environment while it was still active and had a valid lease acquired by 'rhevh-20' in the old environment."

Created attachment 1382593 [details]
attach_warning
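The host-id collision above can be illustrated with a trivial check: the import is only at risk from delta-lease conflicts when some host id is registered by both setups. The helper below is a hypothetical illustration, not oVirt or sanlock code.

```python
# Toy model of the host-id collision described above: each setup
# registers its hosts under numeric ids in the storage domain's "ids"
# file. If the two setups share an id, the delta-lease registration for
# that id conflicts and the import can fail; disjoint ids may let a
# non-master domain import succeed. The helper name is invented.

def colliding_host_ids(setup_a: set, setup_b: set) -> set:
    """Host ids registered by both setups; non-empty means a conflict."""
    return setup_a & setup_b

# The example from the comment above: one host in each setup, both id 1.
print(colliding_host_ids({1}, {1}))     # {1}: sanlock should fail the import
print(colliding_host_ids({1}, {2, 3}))  # set(): no delta-lease clash
```

Note that this only covers the delta-lease case; as comment 8 explains, a master storage domain additionally fails on the SPM's paxos lease regardless of host ids.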
Maybe we can change the title of the bug to: "Importing a storage domain fails if it is a master SD and was not detached from the original DC".