Bug 841525
| Summary: | [engine-core] reconstruct leads to master version out of sync as engine not sending refreshStoragePool to second host | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Haim <hateya> |
| Component: | ovirt-engine | Assignee: | Liron Aravot <laravot> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Dafna Ron <dron> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.1.0 | CC: | abaron, amureini, dyasny, ewarszaw, iheim, lpeer, mkublin, Rhev-m-bugs, sgrinber, yeylon, ykaul |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | 3.1.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | SI21 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-12-04 20:01:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Haim
2012-07-19 09:51:31 UTC
Created attachment 599102 [details]
engine.log
Created attachment 599104 [details]
vdsm logs
After raising the master version, the engine sends refreshStoragePool to the host that performed the reconstruct instead of to the other, already connected host.
connectStoragePool fails if the host is already connected to a different version of the pool.
refreshStoragePool should be sent instead.
(disconnectStoragePool would also work, but it is not recommended.)
The reconstruct-connect-reconstruct cycle then repeats, alternating between the two hosts.
The engine should send refreshStoragePool to the other hosts, not to the reconstructing host, and not after connectStoragePool.
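The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not vdsm or engine code: the `Host` class, `WrongMaster`, `reconstruct` and `run` are hypothetical names, and the only behaviour modelled is what this report states (connect fails when the host is already connected to a different version; refresh moves a connected host to the new version).

```python
class WrongMaster(Exception):
    """Stands in for vdsm's StoragePoolWrongMaster in this toy model."""


class Host:
    """Hypothetical model of one host's view of the pool master version."""

    def __init__(self, name, version=None):
        self.name = name
        self.version = version  # None means "not connected to the pool"

    def connect(self, version):
        # Per the report: connect fails if the host is already connected
        # to a different version of the pool.
        if self.version is not None and self.version != version:
            raise WrongMaster(
                f"{self.name}: connected to {self.version}, asked for {version}"
            )
        self.version = version

    def refresh(self, version):
        # A refresh moves an already connected host to the new version.
        self.version = version


def reconstruct(hosts, reconstructor, new_version, refresh_others):
    """Reconstruct on one host, then refresh the others or only the reconstructor."""
    # Simplifying assumption: the reconstruct leaves the reconstructor on the new version.
    reconstructor.version = new_version
    reconstructor.connect(new_version)
    if refresh_others:
        targets = [h for h in hosts if h is not reconstructor]
    else:
        targets = [reconstructor]  # the behaviour reported in this bug
    for host in targets:
        if host.version is not None:
            host.refresh(new_version)


def run(refresh_others):
    """Return the outcome of a later connect sent to the non-reconstructing host."""
    h1, h2 = Host("host1", version=5), Host("host2", version=5)
    reconstruct([h1, h2], h1, 6, refresh_others)
    try:
        h2.connect(6)
        return "in sync"
    except WrongMaster:
        return "wrong master"
```

In this model, `run(False)` (refresh sent to the reconstructing host) ends with the second host rejecting the connect, while `run(True)` (refresh sent to the other hosts) leaves both hosts on the same version.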
# Tried to connect version 5 in nott-vds2:
nott-vds2.log:Thread-60911::INFO::2012-07-19 13:55:49,199::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=5, options=None)
# This fails:
nott-vds2.log:Thread-60911::ERROR::2012-07-19 13:55:51,969::task::853::TaskManager.Task::(_setError) Task=`015f6279-7be0-43aa-9ce0-9c8c71797ad9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 814, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 837, in _connectStoragePool
    pool.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1463, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=38706621-19c7-498d-a4c9-003b354ba1d4, pool=cacbdf16-d006-11e1-b98a-001a4a16970e'
# Leads to a successful reconstruct:
nott-vds2.log:Thread-61000::INFO::2012-07-19 13:57:08,431::logUtils::37::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', poolName='Default', masterDom='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', domDict={'38706621-19c7-498d-a4c9-003b354ba1d4': 'active', 'f11ecf2d-35ed-4426-9fb7-c831c9eb37e2': 'active', '6444a4b5-dcdc-4341-96a1-4bd96553aaeb': 'attached'}, masterVersion=6, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
nott-vds2.log:Thread-61000::INFO::2012-07-19 13:57:36,380::logUtils::39::dispatcher::(wrapper) Run and protect: reconstructMaster, Return response: None
# Connecting to the new version pool
nott-vds2.log:Thread-61017::INFO::2012-07-19 13:57:36,459::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds2.log:Thread-61017::INFO::2012-07-19 13:57:41,464::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
# Why refresh with the same version 6?
nott-vds2.log:Thread-61030::INFO::2012-07-19 13:57:41,494::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds2.log:Thread-61030::INFO::2012-07-19 13:57:44,996::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
# Meanwhile, on nott-vds1, a successful connect to version 6:
nott-vds1.log:Thread-53919::INFO::2012-07-19 14:01:11,447::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='f11ecf2d-35ed-4426-9fb7-c831c9eb37e2', masterVersion=6, options=None)
nott-vds1.log:Thread-53919::INFO::2012-07-19 14:01:14,397::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: None
# But now it is disconnected:
nott-vds1.log:Thread-54150::INFO::2012-07-19 14:06:38,321::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', remove=False, options=None)
nott-vds1.log:Thread-54150::INFO::2012-07-19 14:06:41,202::logUtils::39::dispatcher::(wrapper) Run and protect: disconnectStoragePool, Return response: True
# And reconstructed and refreshed
nott-vds1.log:Thread-54153::INFO::2012-07-19 14:06:41,244::logUtils::37::dispatcher::(wrapper) Run and protect: reconstructMaster(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', poolName='Default', masterDom='38706621-19c7-498d-a4c9-003b354ba1d4', domDict={'38706621-19c7-498d-a4c9-003b354ba1d4': 'active', '6444a4b5-dcdc-4341-96a1-4bd96553aaeb': 'attached', 'f11ecf2d-35ed-4426-9fb7-c831c9eb37e2': 'active'}, masterVersion=7, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
nott-vds1.log:Thread-54153::INFO::2012-07-19 14:07:08,801::logUtils::39::dispatcher::(wrapper) Run and protect: reconstructMaster, Return response: None
nott-vds1.log:Thread-54174::INFO::2012-07-19 14:07:08,873::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=1, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)
nott-vds1.log:Thread-54174::INFO::2012-07-19 14:07:13,918::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
nott-vds1.log:Thread-54186::INFO::2012-07-19 14:07:13,950::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)
nott-vds1.log:Thread-54186::INFO::2012-07-19 14:07:17,562::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
# A connect to version 7 is sent to nott-vds2
nott-vds2.log:Thread-61491::INFO::2012-07-19 14:06:29,968::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='cacbdf16-d006-11e1-b98a-001a4a16970e', hostID=2, scsiKey='cacbdf16-d006-11e1-b98a-001a4a16970e', msdUUID='38706621-19c7-498d-a4c9-003b354ba1d4', masterVersion=7, options=None)
# And it fails because the pool was neither disconnected nor refreshed (after updating to version 7) and we are still connected to version 6.
Thread-61491::ERROR::2012-07-19 14:06:29,973::task::853::TaskManager.Task::(_setError) Task=`0fd76e90-900d-4935-9ce5-9e3a6a670e08`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 814, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 837, in _connectStoragePool
    pool.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1463, in getMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=38706621-19c7-498d-a4c9-003b354ba1d4, pool=cacbdf16-d006-11e1-b98a-001a4a16970e'
# It continues this way, reconstructing alternately on the two hosts and raising the version up to masterVersion=46 in these logs.
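To follow how masterVersion climbs across the two hosts, the call lines can be pulled out of the combined logs with a short script. This is a rough sketch: the regular expression is derived only from the excerpts above (a `<host>.log:` prefix followed by the `Run and protect:` call), and `master_version_calls` is a hypothetical helper name, so other vdsm log layouts may need a different pattern.

```python
import re

# Pattern inferred from the excerpts in this report; it is an assumption,
# not an official description of the vdsm log format.
CALL_RE = re.compile(
    r"^(?P<host>[\w-]+)\.log:Thread-\d+::INFO::"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+::.*?"
    r"Run and protect: (?P<verb>\w+)\(.*?masterVersion=(?P<version>\d+)"
)


def master_version_calls(lines):
    """Yield (host, time, verb, masterVersion) for calls that carry a masterVersion."""
    for line in lines:
        match = CALL_RE.match(line)
        if match:
            yield (
                match["host"],
                match["time"],
                match["verb"],
                int(match["version"]),
            )
```

Sorting the resulting tuples by time and printing them per host makes the alternating reconstructMaster calls and the connectStoragePool requests for a stale or newer version easy to spot. Response lines and calls without a masterVersion argument (such as disconnectStoragePool) are skipped.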
Submitted the following patch as a solution: http://gerrit.ovirt.org/#/c/7137/

Edu, please explain why a connect fails, given what I have been told more than once: send connect, it will usually succeed, and send a disconnect only when a host is moved to Maintenance. Was what I was told wrong?
Yes, from the engine-side logs it is obvious that a host that was disconnected succeeds in connecting. So why do you ask us not to send a disconnect, and then, when the connect fails, answer that we should send a disconnect or a refresh?
So what combination of refresh, connect and disconnect should be sent? Why, with the previous version of vdsm, did I need to send a refresh after connect as well?
Also, why should I send connect when the host was not disconnected by me?

(In reply to comment #7)
> Edu, please explain why a connect fails, given what I have been told more
> than once: send connect, it will usually succeed, and send a disconnect
> only when a host is moved to Maintenance. Was what I was told wrong?

If you reconstruct on host 1 while host 2 is connected to the pool and you send connect again, then unless you are very careful it is not clear which version of the pool metadata host 2 is using. This is a racy situation, and I can imagine scenarios in which the "new" connect will fail.

> Yes, from the engine-side logs it is obvious that a host that was
> disconnected succeeds in connecting. So why do you ask us not to send a
> disconnect, and then, when the connect fails, answer that we should send
> a disconnect or a refresh?

We are asking you not to send disconnectStorageServer (which implies disconnectStoragePool, or you will receive Stats errors). You can disconnect host 2 from the pool before reconstructing the master on host 1, but this is not required. Send a refreshStoragePool to host 2 after the reconstruct is completed.

> So what combination of refresh, connect and disconnect should be sent?

reconstruct to host 1
connectSP to host 1
refreshSP to all the other hosts already connected to the pool.

> Why, with the previous version of vdsm, did I need to send a refresh
> after connect as well?

Only if the master was reconstructed after the connectSP.

> Also, why should I send connect when the host was not disconnected by me?

Hosts do not disconnect (SP, server) by themselves. Only the engine does that.

Merged I9b1b6c32ecb1c0d3c0a9ef14beef333e442c6ccf

verified si22.1