Created attachment 1100227 [details]
logs from engine and host

Description of problem:
The Master domain role is written to the metadata of 2 different storage domains in the same pool after a scenario in which the paths to the old master domain are set back to "running" following a reconstruct.

Version-Release number of selected component (if applicable):
vdsm-4.17.10.1-0.el7ev.noarch
rhevm-3.6.0.3-0.1.el6.noarch

How reproducible:
Always

Steps to Reproduce:
1. On a DC with 2 FC data domains (1 domain is master - 6605112f-7f26-48b9-b64e-f661bfe18799):

[root@green-vdsb ~]# for i in `vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000158`; do vdsClient -s 0 getStorageDomainInfo $i; done
    uuid = 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
    vguuid = rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ
    state = OK
    version = 3
    role = Regular
    type = FCP
    class = Data
    pool = ['00000001-0001-0001-0001-000000000158']
    name = fc2

    uuid = ec5adddd-c1b5-471f-ba86-aa91303b5215
    version = 0
    role = Regular
    remotePath = netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/backupMordechai
    type = NFS
    class = Backup
    pool = ['00000001-0001-0001-0001-000000000158']
    name = backupMordechai

    uuid = 6605112f-7f26-48b9-b64e-f661bfe18799
    vguuid = pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf
    state = OK
    version = 3
    role = Master
    type = FCP
    class = Data
    pool = ['00000001-0001-0001-0001-000000000158']
    name = fc1

I then disabled the paths to the current master domain with:

# echo "offline" > /sys/block/sdf/device/state

and waited for the reconstruct to finish. The second FC domain took the master role.

2. Resumed connectivity to the old master domain:

# echo "running" > /sys/block/sdi/device/state
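For reference, the path flapping in steps 1-2 can be scripted. This is a minimal sketch, not part of the original report; it assumes the standard SCSI sysfs "state" attribute used in the commands above and must run as root:

import os

def set_scsi_path_state(dev, state):
    # Write "offline" or "running" to the path's sysfs state attribute,
    # e.g. /sys/block/sdf/device/state (requires root).
    with open(os.path.join("/sys/block", dev, "device", "state"), "w") as f:
        f.write(state)

# Step 1: fail the path to the current master domain (sdf in this setup)
set_scsi_path_state("sdf", "offline")
# ...wait for reconstruct master to complete, then step 2: restore the path
set_scsi_path_state("sdi", "running")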
Actual results:
Both domains show 'state = OK' and both are reported as active in the engine. 2 storage domains have the master role for the same DC:

[root@green-vdsb ~]# for i in `vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000158`; do vdsClient -s 0 getStorageDomainInfo $i; done
    uuid = 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
    vguuid = rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ
    state = OK
    version = 3
    role = Master
    type = FCP
    class = Data
    pool = ['00000001-0001-0001-0001-000000000158']
    name = fc2

    uuid = ec5adddd-c1b5-471f-ba86-aa91303b5215
    version = 0
    role = Regular
    remotePath = netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/backupMordechai
    type = NFS
    class = Backup
    pool = ['00000001-0001-0001-0001-000000000158']
    name = backupMordechai

    uuid = 6605112f-7f26-48b9-b64e-f661bfe18799
    vguuid = pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf
    state = OK
    version = 3
    role = Master
    type = FCP
    class = Data
    pool = ['00000001-0001-0001-0001-000000000158']
    name = fc1

vg_tags:
###For both domains, MDT_ROLE=Master.
###For domain 6605112f-7f26-48b9-b64e-f661bfe18799 (the old master) there is no MDT_MASTER_VERSION tag, while for 504c7595-7ca5-459c-be6f-44a3d7f1b5d7, MDT_MASTER_VERSION=7:

[root@green-vdsb ~]# pvscan --cache
[root@green-vdsb ~]# vgs -o vg_tags 6605112f-7f26-48b9-b64e-f661bfe18799
  VG Tags
  MDT_CLASS=Data,MDT_DESCRIPTION=fc1,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_PHYBLKSIZE=512,MDT_POOL_UUID=00000001-0001-0001-0001-000000000158,MDT_PV0=pv:3514f0c5a51600871&44&uuid:6Imusq-PeIV-Yc0Q-Cg18-NtH1-m3ep-NRfc0b&44&pestart:0&44&pecount:277&44&mapoffset:0,MDT_ROLE=Master,MDT_SDUUID=6605112f-7f26-48b9-b64e-f661bfe18799,MDT_TYPE=FCP,MDT_VERSION=3,MDT_VGUUID=pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf,MDT__SHA_CKSUM=c14498749b865ecbb5c0d427adbf62a370d215cc,RHAT_storage_domain
[root@green-vdsb ~]# vgs -o vg_tags 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
  VG Tags
  MDT_CLASS=Data,MDT_DESCRIPTION=fc2,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_MASTER_VERSION=7,MDT_PHYBLKSIZE=512,MDT_POOL_DESCRIPTION=Default,MDT_POOL_DOMAINS=504c7595-7ca5-459c-be6f-44a3d7f1b5d7:Active&44&ec5adddd-c1b5-471f-ba86-aa91303b5215:Active&44&ad1ecd8f-8822-4323-9ba4-4864e4d97297:Attached&44&39a95638-1583-461e-9aa1-52e6ea8597ee:Attached&44&6605112f-7f26-48b9-b64e-f661bfe18799:Active&44&6c24e902-052c-43ba-bee5-9e42b0aacc59:Active,MDT_POOL_SPM_ID=-1,MDT_POOL_SPM_LVER=-1,MDT_POOL_UUID=00000001-0001-0001-0001-000000000158,MDT_PV0=pv:3514f0c5a51600858&44&uuid:4N23Pb-5bVj-NyUp-Ss7D-r8pb-OOrF-2uVAeF&44&pestart:0&44&pecount:413&44&mapoffset:0,MDT_ROLE=Master,MDT_SDUUID=504c7595-7ca5-459c-be6f-44a3d7f1b5d7,MDT_TYPE=FCP,MDT_VERSION=3,MDT_VGUUID=rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ,MDT__SHA_CKSUM=dbddc311b8105117ba6778cce8559919e968ce39,RHAT_storage_domain

Expected results:
Only one master domain in the pool.

Additional info:
logs from engine and host

###Before connectivity loss to the master domain:
Thread-676935::INFO::2015-11-29 14:15:31,452::logUtils::48::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'00000001-0001-0001-0001-000000000158', hostID=1, msdUUID=u'ad1ecd8f-8822-4323-9ba4-4864e4d97297', masterVersion=3, domainsMap={u'39a95638-1583-461e-9aa1-52e6ea8597ee': u'attached', u'ec5adddd-c1b5-471f-ba86-aa91303b5215': u'active', u'6c24e902-052c-43ba-bee5-9e42b0aacc59': u'active', u'ad1ecd8f-8822-4323-9ba4-4864e4d97297': u'active'}, options=None)

###Master domain connection failure:
mailbox.SPMMonitor::ERROR::2015-11-29 14:22:28,132::storage_mailbox::793::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 791, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 738, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000158/mastersd/dom_md/inbox

###Reconstruct master completed:
2015-11-29 14:23:26,950 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-13) [68ecc5fe] Correlation ID: 68ecc5fe, Job ID: 69036a12-88d2-426e-a3bf-85470525d879, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center Default completed.
###Domain 504c7595-7ca5-459c-be6f-44a3d7f1b5d7 is operational again:
2015-11-29 14:40:14,152 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (DefaultQuartzScheduler_Worker-57) [554ef729] Lock Acquired to object 'EngineLock:{exclusiveLocks='[504c7595-7ca5-459c-be6f-44a3d7f1b5d7=<STORAGE, ACTION_TYPE_FAILED_DOMAIN_OVF_ON_UPDATE>]', sharedLocks='[00000001-0001-0001-0001-000000000158=<OVF_UPDATE, ACTION_TYPE_FAILED_DOMAIN_OVF_ON_UPDATE>]'}'
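The dual-master state above can be spotted directly from the LVM tags, without the engine. This is a minimal sketch, not part of the original report; it assumes the MDT_ROLE and MDT_POOL_UUID tags shown in the vgs output above and that vgs is available on the host:

import subprocess
from collections import defaultdict

def masters_per_pool():
    """Map pool UUID -> VGs whose tags claim the Master role."""
    out = subprocess.check_output(
        ["vgs", "--noheadings", "-o", "vg_name,vg_tags"]).decode()
    masters = defaultdict(list)
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # VG with no tags
        vg_name, tag_str = parts[0], parts[1]
        # Tags are comma-separated; commas inside values are escaped as &44&
        tags = dict(t.split("=", 1) for t in tag_str.split(",") if "=" in t)
        if tags.get("MDT_ROLE") == "Master":
            masters[tags.get("MDT_POOL_UUID")].append(vg_name)
    return masters

for pool, vg_names in masters_per_pool().items():
    if len(vg_names) > 1:
        print("pool %s has %d masters: %s" % (pool, len(vg_names), vg_names))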
This is a known issue and documented in sp.py in masterMigrate:

    # There's no way to ensure that we only have one domain marked
    # as master in the storage pool (e.g. after a reconstructMaster,
    # or even in this method if we fail to set the old master to
    # regular). That said, for API cleaness switchMasterDomain is
    # the last method to call as "point of no return" after which we
    # only try to cleanup but we cannot rollback.

When the engine calls reconstructMaster it supplies a new masterVersion parameter, which must be greater than the current masterVersion. In validateMasterDomainVersion we check that the domain has the expected masterVersion and raise an error if it does not match, so a stale domain that is still tagged as master can never actually be used as the master again.

Although it's possible to demote such a domain from the master role back to a regular role, we're doing away with the master domain in the future, so it's not worth the effort.

Dropping severity in light of the above analysis. Allon, I don't recommend creating an RFE to "fix" old master domains, and I think we should close this as WONTFIX. Thoughts?
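To make the protection concrete, here is an illustrative sketch of the masterVersion check described above. It is not the actual sp.py implementation; the names and the metadata dicts are hypothetical, with field names following the MDT_* tags shown in the vgs output earlier:

class StoragePoolWrongMaster(Exception):
    pass

def validate_master_domain_version(domain_md, expected_master_version):
    """Raise unless the domain's recorded master version matches the
    version the engine expects (hypothetical helper)."""
    actual = domain_md.get("MDT_MASTER_VERSION")  # None on the stale master
    if actual is None or int(actual) != expected_master_version:
        raise StoragePoolWrongMaster(
            "wrong master version: expected %s, found %s"
            % (expected_master_version, actual))

# The old master (6605112f-...) has no MDT_MASTER_VERSION tag at all,
# so it fails this check despite still carrying MDT_ROLE=Master:
old_master_md = {"MDT_ROLE": "Master"}
new_master_md = {"MDT_ROLE": "Master", "MDT_MASTER_VERSION": "7"}

validate_master_domain_version(new_master_md, 7)  # passes
try:
    validate_master_domain_version(old_master_md, 7)
except StoragePoolWrongMaster as e:
    print(e)  # wrong master version: expected 7, found None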
> Dropping severity in light of the above analysis. Allon, I don't recommend
> creating an RFE to "fix" old master domains, and I think we should close
> this as WONTFIX. Thoughts?

Agreed.