Bug 1286446

Summary: [vdsm] 2 master domains in same dc after connectivity to old master is resumed
Product: [oVirt] vdsm
Component: General
Assignee: Adam Litke <alitke>
Reporter: Elad <ebenahar>
QA Contact: Aharon Canan <acanan>
Status: CLOSED WONTFIX
Severity: low
Priority: low
Version: 4.17.10
CC: amureini, bugs, laravot, tnisan, ylavi
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
Doc Type: Bug Fix
Last Closed: 2015-12-23 10:46:14 UTC
Type: Bug
oVirt Team: Storage

Attachments: logs from engine and host

Description Elad 2015-11-29 15:30:32 UTC
Created attachment 1100227 [details]
logs from engine and host

Description of problem:
Master domain role is written in the metadata of 2 different storage domains in the same pool after a scenario in which the paths to the old master domain are set back to 'running'.

Version-Release number of selected component (if applicable):
vdsm-4.17.10.1-0.el7ev.noarch
rhevm-3.6.0.3-0.1.el6.noarch

How reproducible:
Always

Steps to Reproduce:

1. Start with a DC that has 2 FC data domains (the master domain is 6605112f-7f26-48b9-b64e-f661bfe18799):

[root@green-vdsb ~]# for i in `vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000158 ` ; do vdsClient -s 0 getStorageDomainInfo $i;  done; 
        uuid = 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
        vguuid = rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ
        state = OK
        version = 3
        role = Regular
        type = FCP
        class = Data
        pool = ['00000001-0001-0001-0001-000000000158']
        name = fc2

        uuid = ec5adddd-c1b5-471f-ba86-aa91303b5215
        version = 0
        role = Regular
        remotePath = netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/backupMordechai
        type = NFS
        class = Backup
        pool = ['00000001-0001-0001-0001-000000000158']
        name = backupMordechai

        uuid = 6605112f-7f26-48b9-b64e-f661bfe18799
        vguuid = pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf
        state = OK
        version = 3
        role = Master
        type = FCP
        class = Data
        pool = ['00000001-0001-0001-0001-000000000158']
        name = fc1




With this setup, I disabled the paths to the current master domain:
# echo "offline" > /sys/block/sdf/device/state
Then waited for the reconstruct to finish; the second FC domain took the master role.

2. Resumed connectivity to the old master domain:
# echo "running" > /sys/block/sdi/device/state




Actual results:

Both domains report 'state = OK' and both are reported as active in the engine.

Two storage domains hold the master role for the same DC:


[root@green-vdsb ~]# for i in `vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000158 ` ; do vdsClient -s 0 getStorageDomainInfo $i;  done;
        uuid = 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
        vguuid = rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ
        state = OK
        version = 3
        role = Master
        type = FCP
        class = Data
        pool = ['00000001-0001-0001-0001-000000000158']
        name = fc2

        uuid = ec5adddd-c1b5-471f-ba86-aa91303b5215
        version = 0
        role = Regular
        remotePath = netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/backupMordechai
        type = NFS
        class = Backup
        pool = ['00000001-0001-0001-0001-000000000158']
        name = backupMordechai

        uuid = 6605112f-7f26-48b9-b64e-f661bfe18799
        vguuid = pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf
        state = OK
        version = 3
        role = Master
        type = FCP
        class = Data
        pool = ['00000001-0001-0001-0001-000000000158']
        name = fc1




vg_tags:


###For both domains, MDT_ROLE=Master.

###For domain 6605112f-7f26-48b9-b64e-f661bfe18799, the old master, there is no MDT_MASTER_VERSION tag, while for 504c7595-7ca5-459c-be6f-44a3d7f1b5d7, MDT_MASTER_VERSION=7:


[root@green-vdsb ~]# pvscan --cache
[root@green-vdsb ~]# vgs -o vg_tags 6605112f-7f26-48b9-b64e-f661bfe18799
  VG Tags                                                                                                                                                            MDT_CLASS=Data,MDT_DESCRIPTION=fc1,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_PHYBLKSIZE=512,MDT_POOL_UUID=00000001-0001-0001-0001-000000000158,MDT_PV0=pv:3514f0c5a51600871&44&uuid:6Imusq-PeIV-Yc0Q-Cg18-NtH1-m3ep-NRfc0b&44&pestart:0&44&pecount:277&44&mapoffset:0,MDT_ROLE=Master,MDT_SDUUID=6605112f-7f26-48b9-b64e-f661bfe18799,MDT_TYPE=FCP,MDT_VERSION=3,MDT_VGUUID=pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf,MDT__SHA_CKSUM=c14498749b865ecbb5c0d427adbf62a370d215cc,RHAT_storage_domain

[root@green-vdsb ~]# vgs -o vg_tags 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
  VG Tags                                                                                                                                                            MDT_CLASS=Data,MDT_DESCRIPTION=fc2,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_MASTER_VERSION=7,MDT_PHYBLKSIZE=512,MDT_POOL_DESCRIPTION=Default,MDT_POOL_DOMAINS=504c7595-7ca5-459c-be6f-44a3d7f1b5d7:Active&44&ec5adddd-c1b5-471f-ba86-aa91303b5215:Active&44&ad1ecd8f-8822-4323-9ba4-4864e4d97297:Attached&44&39a95638-1583-461e-9aa1-52e6ea8597ee:Attached&44&6605112f-7f26-48b9-b64e-f661bfe18799:Active&44&6c24e902-052c-43ba-bee5-9e42b0aacc59:Active,MDT_POOL_SPM_ID=-1,MDT_POOL_SPM_LVER=-1,MDT_POOL_UUID=00000001-0001-0001-0001-000000000158,MDT_PV0=pv:3514f0c5a51600858&44&uuid:4N23Pb-5bVj-NyUp-Ss7D-r8pb-OOrF-2uVAeF&44&pestart:0&44&pecount:413&44&mapoffset:0,MDT_ROLE=Master,MDT_SDUUID=504c7595-7ca5-459c-be6f-44a3d7f1b5d7,MDT_TYPE=FCP,MDT_VERSION=3,MDT_VGUUID=rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ,MDT__SHA_CKSUM=dbddc311b8105117ba6778cce8559919e968ce39,RHAT_storage_domain
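
To make the long tag lines above easier to compare, a small helper (a sketch, not part of the original report) that extracts only the role and master-version tags for the two FC domain VGs:

# Show just MDT_ROLE and MDT_MASTER_VERSION for each FC domain VG
for vg in 6605112f-7f26-48b9-b64e-f661bfe18799 504c7595-7ca5-459c-be6f-44a3d7f1b5d7; do
    echo "== $vg"
    vgs --noheadings -o vg_tags "$vg" | tr ',' '\n' | grep -E '^ *MDT_(ROLE|MASTER_VERSION)='
done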




Expected results:
Only one master domain in the pool

Additional info: logs from engine and host

###Before connectivity loss to the master domain:

Thread-676935::INFO::2015-11-29 14:15:31,452::logUtils::48::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'00000001-0001-0001-0001-000000000158', hostID=1, msdUUID=u'ad1ecd8f-8822-4323-9ba4-4864e4d97297', masterVersion=3, domainsMap={u'39a95638-1583-461e-9aa1-52e6ea8597ee': u'attached', u'ec5adddd-c1b5-471f-ba86-aa91303b5215': u'active', u'6c24e902-052c-43ba-bee5-9e42b0aacc59': u'active', u'ad1ecd8f-8822-4323-9ba4-4864e4d97297': u'active'}, options=None)





###Master domain connection failure:


mailbox.SPMMonitor::ERROR::2015-11-29 14:22:28,132::storage_mailbox::793::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 791, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 738, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000158/mastersd/dom_md/inbox


###Reconstruct master completed:

2015-11-29 14:23:26,950 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-13) [68ecc5fe] Correlation ID: 68ecc5fe, Job ID: 69036a12-88d2-426e-a3bf-85470525d879, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center Default completed.



###Domain 504c7595-7ca5-459c-be6f-44a3d7f1b5d7 is operational again:

2015-11-29 14:40:14,152 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (DefaultQuartzScheduler_Worker-57) [554ef729] Lock Acquired to object 'EngineLock:{exclusiveLocks='[504c7595-7ca5-459c-be6f-44a3d7f1b5d7=<STORAGE, ACTION_TYPE_FAILED_DOMAIN_OVF_ON_UPDATE>]', sharedLocks='[00000001-0001-0001-0001-000000000158=<OVF_UPDATE, ACTION_TYPE_FAILED_DOMAIN_OVF_ON_UPDATE>]'}'

Comment 1 Adam Litke 2015-11-30 19:07:42 UTC
This is a known issue, documented in sp.py in masterMigrate:

    # There's no way to ensure that we only have one domain marked
    # as master in the storage pool (e.g. after a reconstructMaster,
    # or even in this method if we fail to set the old master to
    # regular). That said, for API cleaness switchMasterDomain is
    # the last method to call as "point of no return" after which we
    # only try to cleanup but we cannot rollback.

When the engine calls reconstructMaster, it supplies a new masterVersion parameter which must be greater than the current one. In validateMasterDomainVersion we check that the domain has the expected master version and raise an error if it does not match.
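
As a rough external illustration of that check (not vdsm's actual code, which lives in validateMasterDomainVersion; this only mimics the comparison using the vg_tags shown above), one can compare a domain's on-disk MDT_MASTER_VERSION tag with the master version the engine expects:

# Illustration only: the old master from this report has no MDT_MASTER_VERSION
# tag, so it can never match the version the engine supplies (7 in this run),
# and vdsm will refuse to use it as master.
SDUUID=6605112f-7f26-48b9-b64e-f661bfe18799   # stale master domain
EXPECTED_VER=7                                # master version expected by the engine

ACTUAL_VER=$(vgs --noheadings -o vg_tags "$SDUUID" | tr ',' '\n' \
             | awk -F= '/MDT_MASTER_VERSION/{print $2}')

if [ "$ACTUAL_VER" != "$EXPECTED_VER" ]; then
    echo "wrong master: $SDUUID has master version '${ACTUAL_VER:-<none>}', expected $EXPECTED_VER"
fi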

Although it's possible to demote a domain like this from master back to a regular role, we're doing away with the master domain in the future, so it's not worth the effort.

Dropping severity in light of the above analysis.  Allon, I don't recommend creating a RFE to "fix" old master domains and I think we should close this as WONTFIX.  Thoughts?

Comment 2 Allon Mureinik 2015-12-23 10:46:14 UTC
> Dropping severity in light of the above analysis.  Allon, I don't recommend
> creating a RFE to "fix" old master domains and I think we should close this
> as WONTFIX.  Thoughts?
Agreed.