Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 841525

Summary: [engine-core] reconstruct leaves master version out of sync because engine does not send refreshStoragePool to second host
Product: Red Hat Enterprise Virtualization Manager
Reporter: Haim <hateya>
Component: ovirt-engine
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Dafna Ron <dron>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.1.0
CC: abaron, amureini, dyasny, ewarszaw, iheim, lpeer, mkublin, Rhev-m-bugs, sgrinber, yeylon, ykaul
Target Milestone: ---
Keywords: Regression
Target Release: 3.1.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: SI21
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-04 20:01:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description    Flags
engine.log     none
vdsm logs      none

Description Haim 2012-07-19 09:51:31 UTC
Description of problem:

My pool fails to recover because the engine does not send the commands that would make the second host in the pool re-read the metadata. As a result, that host keeps using the wrong master domain.

The flow is as follows:

- pool with 3 domains and 2 hosts
- a failure in one of the storage domains initiates a reconstruct attempt
- reconstruct is sent to vdsm and succeeds on the first host
- connectStoragePool is sent to the first host and succeeds
- refreshStoragePool is sent to the first host and succeeds
- connectStoragePool is sent to the second host (which is already connected to the pool with a different master version) and fails on wrong master domain
- in a different thread, spmStart is sent to the first host and succeeds
- the engine decides to stopSpm on the first host because of the wrong master error and restarts the flow on the second host
- the flow starts again and fails again (now with the hosts in swapped roles)

The engine should send either refreshStoragePool or disconnectStoragePool before connecting the second host to the pool again. (There was a fix that makes vdsm return True when connectStoragePool is sent and the pool is already connected. That only means the master domain data is not re-read by the connected host, and it ignores the fact that the first host succeeded in reconstructing.)
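
The failure mode described here can be illustrated with a toy model. This is a minimal sketch, not vdsm code: the `Host` class, its snake_case methods, and the `StoragePoolWrongMaster` exception defined below are hypothetical stand-ins. The only behaviour they mimic is the one reported in this bug, namely that a connected host keeps the master version it last read until it is refreshed.

```python
class StoragePoolWrongMaster(Exception):
    """Stand-in for the error seen in the vdsm logs; not the real class."""


class Host:
    """Toy host that caches the master version it last read (assumption)."""

    def __init__(self, name):
        self.name = name
        self.connected = False
        self.cached_version = None

    def reconstruct_master(self, master_version):
        # The host that performs the reconstruct sees the new version.
        self.cached_version = master_version

    def connect_storage_pool(self, master_version):
        # Assumption: an already connected host does not re-read the
        # master metadata, so a different version is rejected.
        if self.connected and self.cached_version != master_version:
            raise StoragePoolWrongMaster(
                f"{self.name}: cached {self.cached_version}, "
                f"requested {master_version}")
        self.connected = True
        self.cached_version = master_version

    def refresh_storage_pool(self, master_version):
        # Refresh makes the host re-read the metadata.
        self.cached_version = master_version


host1, host2 = Host("host1"), Host("host2")
host1.connect_storage_pool(5)
host2.connect_storage_pool(5)

# Reconstruct on the first host raises the master version to 6.
host1.reconstruct_master(6)
host1.connect_storage_pool(6)

# Flow from this report: connect is sent to the stale second host.
try:
    host2.connect_storage_pool(6)
    outcome = "connected"
except StoragePoolWrongMaster:
    outcome = "wrong master"

# Suggested flow: refresh the second host, after which connect is accepted.
host2.refresh_storage_pool(6)
host2.connect_storage_pool(6)
```

In this model the first `connect_storage_pool(6)` on `host2` is rejected, and the same call is accepted once the host has been refreshed.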

Comment 1 Haim 2012-07-19 09:52:28 UTC
Created attachment 599102 [details]
engine.log

Comment 2 Haim 2012-07-19 09:56:42 UTC
Created attachment 599104 [details]
vdsm logs

Comment 5 Eduardo Warszawski 2012-07-25 15:48:44 UTC
After raising the version, the engine sends refreshStoragePool to the host that performed the reconstruct instead of to the other, already connected host.

connectStoragePool will fail if the host is already connected to a different version of the pool.
refreshStoragePool should be sent instead.
(disconnectStoragePool would also work, but is not recommended.)

The reconstruct-connect-reconstruct cycle then alternates between the two hosts.

The engine should send refreshStoragePool to the other hosts, not to the reconstructing host, and not after connectStoragePool.


# Tried to connect version 5 in nott-vds2:
nott-vds2.log:Thread-60911::INFO::2012-07-19 13:55:49,199::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=5, options=None)
# This fails:
nott-vds2.log:Thread-60911::ERROR::2012-07-19 13:55:51,969::task::853::TaskManager.Task::(_setError) Task=`015f6279-7be0-43aa-9ce0-9c8c71797ad9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 814, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 837, in _connectStoragePool
    pool.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1463, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=38706621-19c7-498d-a4c9-003b354ba1d4, pool=cacbdf16-d006-11e1-b98a-001a4a16970e'

# Leads to a successful reconstruct:
nott-vds2.log:Thread-61000::INFO::2012-07-19 13:57:08,431::logUtils::37::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', poolName='Default', masterDom='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', domDict={'38706621-19c7-498d-a4c9-003b354ba1d4': 'active', 'f11ecf2d-35ed-4426-9fb7-c831c9eb37e2': 'active', '6444a4b5-dcdc-4341-96a1-4bd96553aaeb': 'attached'}, masterVersion=6, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
nott-vds2.log:Thread-61000::INFO::2012-07-19 13:57:36,380::logUtils::39::dispatcher::(wrapper) Run and protect: reconstructMaster, Return response: None

# Connecting to the new version pool
nott-vds2.log:Thread-61017::INFO::2012-07-19 13:57:36,459::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds2.log:Thread-61017::INFO::2012-07-19 13:57:41,464::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True

# Why refresh with the same version 6?
nott-vds2.log:Thread-61030::INFO::2012-07-19 13:57:41,494::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds2.log:Thread-61030::INFO::2012-07-19 13:57:44,996::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None

# Meanwhile in nott-vds1, a successful connect to version 6:
nott-vds1.log:Thread-53919::INFO::2012-07-19 14:01:11,447::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds1.log:Thread-53919::INFO::2012-07-19 14:01:14,397::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: None

# But now it is disconnected:
nott-vds1.log:Thread-54150::INFO::2012-07-19 14:06:38,321::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', remove=False, options=None)
nott-vds1.log:Thread-54150::INFO::2012-07-19 14:06:41,202::logUtils::39::dispatcher::(wrapper) Run and protect: disconnectStoragePool, Return response: True

# And reconstructed and refreshed
nott-vds1.log:Thread-54153::INFO::2012-07-19 14:06:41,244::logUtils::37::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', poolName='Default', masterDom='38706621-19c7-498d-a4c9-003b354ba1d4', domDict={'38706621-19c7-498d-a4c9-003b354ba1d4': 'active', '6444a4b5-dcdc-4341-96a1-4bd96553aaeb': 'attached', 'f11ecf2d-35ed-4426-9fb7-c831c9eb37e2': 'active'}, masterVersion=7, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
nott-vds1.log:Thread-54153::INFO::2012-07-19 14:07:08,801::logUtils::39::dispatcher::(wrapper) Run and protect: reconstructMaster, Return response: None
nott-vds1.log:Thread-54174::INFO::2012-07-19 14:07:08,873::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)
nott-vds1.log:Thread-54174::INFO::2012-07-19 14:07:13,918::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
nott-vds1.log:Thread-54186::INFO::2012-07-19 14:07:13,950::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)
nott-vds1.log:Thread-54186::INFO::2012-07-19 14:07:17,562::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None

# Connect to version 7 is sent to nott-vds2:
nott-vds2.log:Thread-61491::INFO::2012-07-19 14:06:29,968::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)

# And fails, since the pool was neither disconnected nor refreshed (after updating to version 7) and we are already connected to version 6.

Thread-61491::ERROR::2012-07-19 14:06:29,973::task::853::TaskManager.Task::(_setError) Task=`0fd76e90-900d-4935-9ce5-9e3a6a670e08`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 814, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 837, in _connectStoragePool
    pool.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1463, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=38706621-19c7-498d-a4c9-003b354ba1d4, pool=cacbdf16-d006-11e1-b98a-001a4a16970e'

# This continues, with reconstructs alternating between the two hosts and raising versions up to masterVersion=46 in these logs.
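
When triaging logs like the excerpts in this comment, it can help to list which host was asked to use which master version. The following is a minimal sketch, assuming the grep-style `file.log:Thread-N::INFO::timestamp::...` line format shown here. The function name `master_version_calls` is made up for illustration, and the sample lines are shortened copies of the excerpts (arguments trimmed), not additional log data.

```python
import re

# Matches "Run and protect: verb(... masterVersion=N ...)" request lines.
CALL_RE = re.compile(
    r"^(?P<host>[\w-]+)\.log:Thread-\d+::INFO::"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+::.*?"
    r"Run and protect: (?P<verb>\w+)\(.*?masterVersion=(?P<version>\d+)"
)


def master_version_calls(lines):
    """Yield (host, time, verb, version) for calls carrying a masterVersion."""
    for line in lines:
        match = CALL_RE.search(line)
        if match:
            yield (match["host"], match["time"], match["verb"],
                   int(match["version"]))


sample = [
    # Arguments trimmed from the excerpts for brevity.
    "nott-vds2.log:Thread-60911::INFO::2012-07-19 13:55:49,199::logUtils::37::"
    "dispatcher::(wrapper) Run and protect: connectStoragePool("
    "hostID=2, masterVersion=5, options=None)",
    # Response lines carry no masterVersion and are skipped.
    "nott-vds2.log:Thread-61017::INFO::2012-07-19 13:57:41,464::logUtils::39::"
    "dispatcher::(wrapper) Run and protect: connectStoragePool, "
    "Return response: True",
    "nott-vds1.log:Thread-54153::INFO::2012-07-19 14:06:41,244::logUtils::37::"
    "dispatcher::(wrapper) Run and protect: reconstructMaster("
    "poolName='Default', masterVersion=7, lockPolicy=None)",
]

calls = list(master_version_calls(sample))
for host, time, verb, version in calls:
    print(host, time, verb, version)
```

Sorting the resulting tuples by time per host makes the alternating reconstructs and the rising version numbers easier to follow. Note that timestamps from different hosts are only comparable if the host clocks are in sync.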

Comment 6 Liron Aravot 2012-08-13 13:45:57 UTC
Submitted the following patch as a solution:

http://gerrit.ovirt.org/#/c/7137/

Comment 7 mkublin 2012-08-13 21:49:45 UTC
Edu, please explain why a connect fails. I was told more than once: send connect, it will usually succeed, and send a disconnect only when a host is moved to Maintenance. Was what I was told wrong?
Yes, from the engine-side logs it is obvious that a host that was disconnected succeeds in connecting. So why are you asking us not to send a disconnect, and then, when the connect fails, telling us we should send a disconnect or a refresh? So what combination of refresh, connect, and disconnect should be sent? Why, with the previous version of vdsm, did I need to send a refresh after connect as well?
Also, why should I send connect when the host was not disconnected by me?

Comment 8 Eduardo Warszawski 2012-08-29 14:46:27 UTC
(In reply to comment #7)
> Edu, please explain why a connect fails. I was told more than once: send
> connect, it will usually succeed, and send a disconnect only when a host is
> moved to Maintenance. Was what I was told wrong?
If you reconstruct on host 1 while host 2 is connected to the pool and then send connect again, then unless you are very careful it is not clear which version of the pool metadata host 2 is using. This is a racy situation, and I can imagine scenarios in which the "new" connect will fail.

> Yes, from the engine-side logs it is obvious that a host that was
> disconnected succeeds in connecting. So why are you asking us not to send a
> disconnect, and then, when the connect fails, telling us we should send a
> disconnect or a refresh?
We are asking you not to send disconnectStorageServer (which implies disconnectStoragePool, or you will receive Stats errors).
You can disconnect host 2 from the pool before reconstructing the master using host 1, but this is not required. Send a refreshStoragePool to host 2 after the reconstruct is completed.

> So what combination of refresh, connect, and disconnect should be sent?
reconstruct to host 1
connectSP to host 1
refreshSP to all the other hosts already connected to the pool.

> Why, with the previous version of vdsm, did I need to send a refresh after
> connect as well?
Only if the master was reconstructed after the connectSP.

> Also, why should I send connect when the host was not disconnected by me?
Hosts do not disconnect (SP, server) by themselves. Only the engine does that.
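
The order given in this comment (reconstruct on host 1, connectSP to host 1, refreshSP to every other connected host) can be sketched as follows. This is an illustration under stated assumptions, not the engine code and not the merged patch: `RecordingClient`, `reconstruct_and_sync`, and the snake_case method names are hypothetical stand-ins for whatever component issues the vdsm verbs.

```python
class RecordingClient:
    """Hypothetical per-host client; it only records the verbs it is sent."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def _record(self, verb, master_version):
        self.log.append((self.name, verb, master_version))

    def reconstruct_master(self, master_version):
        self._record("reconstructMaster", master_version)

    def connect_storage_pool(self, master_version):
        self._record("connectStoragePool", master_version)

    def refresh_storage_pool(self, master_version):
        self._record("refreshStoragePool", master_version)


def reconstruct_and_sync(reconstructor, other_connected_hosts, new_version):
    """Issue the verbs in the order recommended in this comment."""
    reconstructor.reconstruct_master(new_version)
    reconstructor.connect_storage_pool(new_version)
    # Hosts that are already connected get a refresh, not a second connect.
    for host in other_connected_hosts:
        host.refresh_storage_pool(new_version)


calls = []
host1 = RecordingClient("host1", calls)
host2 = RecordingClient("host2", calls)
reconstruct_and_sync(host1, [host2], new_version=6)
for entry in calls:
    print(entry)
```

The point of the sketch is the ordering and the targets: the reconstructing host is never sent refreshStoragePool, and the already connected hosts are never sent a second connectStoragePool.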

Comment 9 Allon Mureinik 2012-10-16 17:11:11 UTC
Merged I9b1b6c32ecb1c0d3c0a9ef14beef333e442c6ccf

Comment 10 Dafna Ron 2012-10-29 15:46:52 UTC
verified si22.1