Bug 841525 - [engine-core] reconstruct leads to master version out of sync as engine not sending refreshStoragePool to second host
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
3.1.0
x86_64 Linux
unspecified Severity high
: ---
: 3.1.0
Assigned To: Liron Aravot
Dafna Ron
storage
: Regression
Depends On:
Blocks:
 
Reported: 2012-07-19 05:51 EDT by Haim
Modified: 2016-02-10 11:56 EST (History)
11 users

See Also:
Fixed In Version: SI21
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-04 15:01:51 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine.log (316.05 KB, application/x-gzip)
2012-07-19 05:52 EDT, Haim
no flags Details
vdsm logs (27.46 MB, application/x-tar)
2012-07-19 05:56 EDT, Haim
no flags Details

Description Haim 2012-07-19 05:51:31 EDT
Description of problem:

The pool fails to recover itself because the engine does not send the commands to the second host in the pool that would force it to re-read the pool metadata; this flow leaves that host with the wrong master domain.

The flow looks as follows:

- pool with 3 domains and 2 hosts
- a failure in one of the storage domains initiates a reconstruct attempt
- reconstructMaster is sent to vdsm and succeeds on the first host
- connectStoragePool is sent to the first host and succeeds
- refreshStoragePool is sent to the first host and succeeds
- connectStoragePool is sent to the second host (which is already connected to the pool with a different master version) and fails with a wrong-master-domain error
- on a different thread, spmStart is sent to the first host and succeeds
- the engine decides to stop the SPM on the first host due to the wrong-master error and restarts the flow on the second host
- the flow starts again and fails again (the hosts now switch positions)

The engine should send either refreshStoragePool or disconnectStoragePool to the second host before connecting it to the pool again. (There was an earlier fix making vdsm return True when connectStoragePool is sent while the pool is already connected; that only means the master domain data is not re-read by the connected host, and it ignores the fact that the first host succeeded in the reconstruct.)
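The failure mode above can be illustrated with a small Python sketch. This is a toy model, not engine or vdsm code; the Host class and method names are hypothetical and simply mimic the vdsm rule that a connect is rejected when the host already holds a different master version, while a refresh adopts the new one:

```python
class WrongMasterError(Exception):
    """Stand-in for vdsm's StoragePoolWrongMaster."""

class Host:
    def __init__(self, name):
        self.name = name
        self.master_version = None  # version of the pool metadata this host holds

    def connect_storage_pool(self, master_version):
        # vdsm rejects a connect if the host is already connected
        # to the pool with a different master version
        if self.master_version is not None and self.master_version != master_version:
            raise WrongMasterError(self.name)
        self.master_version = master_version

    def refresh_storage_pool(self, master_version):
        # a refresh re-reads the pool metadata, adopting the new version
        self.master_version = master_version

def reconstruct_flow(reconstructor, other_hosts, new_version):
    # reconstructMaster resets the reconstructing host's state,
    # then connectStoragePool brings it up on the new version
    reconstructor.master_version = None
    reconstructor.connect_storage_pool(new_version)
    # the missing step this bug describes: the *other* hosts must be
    # refreshed, not re-connected, or their connect will fail
    for host in other_hosts:
        host.refresh_storage_pool(new_version)
```

In this model, calling `connect_storage_pool(6)` on a host still holding version 5 raises, which is exactly the StoragePoolWrongMaster loop seen in the logs; refreshing it first avoids the error.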
Comment 1 Haim 2012-07-19 05:52:28 EDT
Created attachment 599102 [details]
engine.log
Comment 2 Haim 2012-07-19 05:56:42 EDT
Created attachment 599104 [details]
vdsm logs
Comment 5 Eduardo Warszawski 2012-07-25 11:48:44 EDT
refreshStoragePool is sent by the engine to the reconstructing host after raising the version, instead of to the already-connected (other) host.

connectStoragePool will fail if the host is already connected to a different version of the pool; refreshStoragePool should be sent instead. (disconnectStoragePool would also work, but it is not recommended.)

The reconstruct-connect-reconstruct cycle then continues, alternating between the two hosts.

The engine should send refreshStoragePool to the other hosts, not to the reconstructing host, and not after connectStoragePool.


# Tried to connect version 5 in nott-vds2:
nott-vds2.log:Thread-60911::INFO::2012-07-19 13:55:49,199::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=5, options=None)
# This fails:
nott-vds2.log:Thread-60911::ERROR::2012-07-19 13:55:51,969::task::853::TaskManager.Task::(_setError) Task=`015f6279-7be0-43aa-9ce0-9c8c71797ad9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 814, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 837, in _connectStoragePool
    pool.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1463, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=38706621-19c7-498d-a4c9-003b354ba1d4, pool=cacbdf16-d006-11e1-b98a-001a4a16970e'
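The check that trips here can be sketched as follows. This is a simplified illustration of the validation in sp.py's getMasterDomain, not the actual vdsm source; the function signature is hypothetical:

```python
class StoragePoolWrongMaster(Exception):
    def __init__(self, sp_uuid, msd_uuid):
        super().__init__(
            "Wrong Master domain or its version: 'SD=%s, pool=%s'"
            % (msd_uuid, sp_uuid))

def get_master_domain(current_msd, current_version, msd_uuid, master_version, sp_uuid):
    # Simplified: the master domain UUID and master version requested by the
    # engine must match what the host already has on record for the pool;
    # otherwise the connect is refused with StoragePoolWrongMaster.
    if msd_uuid != current_msd or master_version != current_version:
        raise StoragePoolWrongMaster(sp_uuid, msd_uuid)
    return msd_uuid
```

Since the second host was never refreshed after the reconstruct, its recorded version (5) no longer matches the requested one (6), so the connect fails exactly as in the traceback above.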

# Leads to a successful reconstruct:
nott-vds2.log:Thread-61000::INFO::2012-07-19 13:57:08,431::logUtils::37::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', poolName='Default', masterDom='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', domDict={'38706621-19c7-498d-a4c9-003b354ba1d4': 'active', 'f11ecf2d-35ed-4426-9fb7-c831c9eb37e2': 'active', '6444a4b5-dcdc-4341-96a1-4bd96553aaeb': 'attached'}, masterVersion=6, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
nott-vds2.log:Thread-61000::INFO::2012-07-19 13:57:36,380::logUtils::39::dispatcher::(wrapper) Run and protect: reconstructMaster, Return response: None

# Connecting to the new version pool
nott-vds2.log:Thread-61017::INFO::2012-07-19 13:57:36,459::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds2.log:Thread-61017::INFO::2012-07-19 13:57:41,464::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True

# Why refresh with the same version 6?
nott-vds2.log:Thread-61030::INFO::2012-07-19 13:57:41,494::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds2.log:Thread-61030::INFO::2012-07-19 13:57:44,996::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None

# Meanwhile in nott-vds1, a successful connect to version 6:
nott-vds1.log:Thread-53919::INFO::2012-07-19 14:01:11,447::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds1.log:Thread-53919::INFO::2012-07-19 14:01:14,397::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: None

# But now it is disconnected:
nott-vds1.log:Thread-54150::INFO::2012-07-19 14:06:38,321::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', remove=False, options=None)
nott-vds1.log:Thread-54150::INFO::2012-07-19 14:06:41,202::logUtils::39::dispatcher::(wrapper) Run and protect: disconnectStoragePool, Return response: True

# And reconstructed and refreshed
nott-vds1.log:Thread-54153::INFO::2012-07-19 14:06:41,244::logUtils::37::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', poolName='Default', masterDom='38706621-19c7-498d-a4c9-003b354ba1d4', domDict={'38706621-19c7-498d-a4c9-003b354ba1d4': 'active', '6444a4b5-dcdc-4341-96a1-4bd96553aaeb': 'attached', 'f11ecf2d-35ed-4426-9fb7-c831c9eb37e2': 'active'}, masterVersion=7, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
nott-vds1.log:Thread-54153::INFO::2012-07-19 14:07:08,801::logUtils::39::dispatcher::(wrapper) Run and protect: reconstructMaster, Return response: None
nott-vds1.log:Thread-54174::INFO::2012-07-19 14:07:08,873::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)
nott-vds1.log:Thread-54174::INFO::2012-07-19 14:07:13,918::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
nott-vds1.log:Thread-54186::INFO::2012-07-19 14:07:13,950::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)
nott-vds1.log:Thread-54186::INFO::2012-07-19 14:07:17,562::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None

# Connect to the version 7 is sent to nott-vds2
nott-vds2.log:Thread-61491::INFO::2012-07-19 14:06:29,968::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)

# And it fails, since the pool was neither disconnected nor refreshed (after updating to version 7) and we are still connected to version 6.

Thread-61491::ERROR::2012-07-19 14:06:29,973::task::853::TaskManager.Task::(_setError) Task=`0fd76e90-900d-4935-9ce5-9e3a6a670e08`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 814, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 837, in _connectStoragePool
    pool.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1463, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=38706621-19c7-498d-a4c9-003b354ba1d4, pool=cacbdf16-d006-11e1-b98a-001a4a16970e'

# It continues this way, reconstructing alternately on the two hosts and raising the version up to masterVersion=46 in these logs.
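The runaway escalation can be modeled with a short toy loop. This is purely illustrative (hypothetical function and host names, not engine code): each round, one host reconstructs and bumps the version, the other host is connected without a refresh, fails on the stale version, and becomes the next reconstructor:

```python
def alternating_reconstruct(start_version, rounds):
    # Toy model of the cycle: reconstructMaster raises the version each time,
    # and the host that was not refreshed always sees a version mismatch.
    versions = {"vds1": start_version, "vds2": start_version}
    master = start_version
    mismatches = []
    for i in range(rounds):
        reconstructor = "vds1" if i % 2 == 0 else "vds2"
        other = "vds2" if reconstructor == "vds1" else "vds1"
        master += 1                       # reconstructMaster bumps masterVersion
        versions[reconstructor] = master  # the reconstructor connects fine
        # the engine connects the other host WITHOUT a refresh -> wrong master
        mismatches.append((other, versions[other], master))
    return master, mismatches
```

Starting from version 5 and running 41 rounds ends at masterVersion 46, matching the highest version observed in these logs; every round records a stale-version mismatch on the non-refreshed host.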
Comment 6 Liron Aravot 2012-08-13 09:45:57 EDT
Submitted the following patch as a solution:

http://gerrit.ovirt.org/#/c/7137/
Comment 7 mkublin 2012-08-13 17:49:45 EDT
Edu, please explain why a connect fails. I was told more than once: "send connect, it will usually succeed; send a disconnect only when a host is moved to Maintenance." Was what I was told wrong?
From the engine-side logs it is obvious that if a host was disconnected, it succeeds in connecting. So why are you asking us not to send a disconnect, and then, when a connect fails, telling us we should have sent a disconnect or a refresh? What combination of refresh, connect, and disconnect should be sent? And why, on the previous version of vdsm, did I need to send a refresh after connect as well?
Also, why should I send a connect when the host was not disconnected by me?
Comment 8 Eduardo Warszawski 2012-08-29 10:46:27 EDT
(In reply to comment #7)
> Edu, please explain why a connect fails, if these what was said to me more
> than one time: send connect it is usually will successes, please send a
> disconnect only when a host is moved to Maintaince. I think, what was said
> to me is wrong? 
If you are doing a reconstruct on host 1 while host 2 is connected to the pool and you send connect again, then, unless you are very careful, it is not clear which version of the pool metadata is in use by host 2. This is a racy situation, and I can imagine scenarios where the "new" connect will fail.

> Yes, from logs from engine side it is obvious that if the host was
> disconnected it success to connect, so why you are asking not to send a
> disconnect, and after that when connect is failed, we got answer you should
> send a disconnect or refresh. 
We are asking you not to send disconnectStorageServer (which implies disconnectStoragePool, or you will receive stats errors).
You can disconnect host 2 from the pool before reconstructing the master using host 1, but this is not required. Send a refreshStoragePool to host 2 after the reconstruct completes.

> So what combination of refresh, connect ,
> disconnect should be send? 
reconstruct on host 1
connectSP to host 1
refreshSP to all the other hosts already connected to the pool.
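The three-step ordering above can be sketched as a call plan. This is a hypothetical engine-side helper for illustration only; the verb names mirror the vdsm RPCs but the function itself is not real engine code:

```python
def recover_pool_calls(reconstruct_host, connected_hosts, new_version):
    """Return the ordered list of vdsm verbs the engine should issue,
    per the sequence in this comment: reconstruct and connect on the
    reconstructing host, then refresh (never connect) everyone else."""
    calls = [
        ("reconstructMaster", reconstruct_host, new_version),
        ("connectStoragePool", reconstruct_host, new_version),
    ]
    for host in connected_hosts:
        if host != reconstruct_host:
            # already-connected hosts get a refresh so they re-read the
            # pool metadata; a connect here would hit the wrong-master error
            calls.append(("refreshStoragePool", host, new_version))
    return calls
```

Note that the reconstructing host never receives a refresh, and no other host receives a connect, which is exactly the discipline this comment prescribes.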

> Why at previous version of vdsm I need to send
> refresh also after connect?
Only if the master was reconstructed after the connectSP.

> Also why I should send connect when host was not disconnected by me?
The hosts do not disconnect themselves (from the SP or the server); only the engine does that.
Comment 9 Allon Mureinik 2012-10-16 13:11:11 EDT
Merged I9b1b6c32ecb1c0d3c0a9ef14beef333e442c6ccf
Comment 10 Dafna Ron 2012-10-29 11:46:52 EDT
verified si22.1
