Created attachment 1100227 [details]
logs from engine and host

Description of problem:
The Master domain role is written to the metadata of 2 different storage domains in the same pool after a scenario in which the paths to the old master domain are set back to "running" following a reconstruct.

Version-Release number of selected component (if applicable):
vdsm-4.17.10.1-0.el7ev.noarch
rhevm-3.6.0.3-0.1.el6.noarch

How reproducible:
Always

Steps to Reproduce:
1. On a DC with 2 FC data domains (1 domain is master - 6605112f-7f26-48b9-b64e-f661bfe18799):

[root@green-vdsb ~]# for i in `vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000158`; do vdsClient -s 0 getStorageDomainInfo $i; done
    uuid = 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
    vguuid = rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ
    state = OK
    version = 3
    role = Regular
    type = FCP
    class = Data
    pool = ['00000001-0001-0001-0001-000000000158']
    name = fc2

    uuid = ec5adddd-c1b5-471f-ba86-aa91303b5215
    version = 0
    role = Regular
    remotePath = netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/backupMordechai
    type = NFS
    class = Backup
    pool = ['00000001-0001-0001-0001-000000000158']
    name = backupMordechai

    uuid = 6605112f-7f26-48b9-b64e-f661bfe18799
    vguuid = pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf
    state = OK
    version = 3
    role = Master
    type = FCP
    class = Data
    pool = ['00000001-0001-0001-0001-000000000158']
    name = fc1

I then disabled the paths to the current master domain with:

# echo "offline" > /sys/block/sdf/device/state

and waited for the reconstruct to finish. The second FC domain took the master role.

2. Resumed connectivity to the old master domain:

# echo "running" > /sys/block/sdi/device/state
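For reference, the path flapping in steps 1-2 can be scripted. This is a minimal sketch, not part of the original report; it assumes the standard SCSI sysfs "state" attribute used in the commands above and must run as root:

import os

def set_scsi_path_state(dev, state):
    # Write "offline" or "running" to the path's sysfs state attribute,
    # e.g. /sys/block/sdf/device/state (requires root).
    with open(os.path.join("/sys/block", dev, "device", "state"), "w") as f:
        f.write(state)

# Step 1: fail the path to the current master domain (sdf in this setup)
set_scsi_path_state("sdf", "offline")
# ...wait for reconstruct master to complete, then step 2: restore the path
set_scsi_path_state("sdi", "running")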
Actual results:
Both domains show 'state = OK' and both are reported as active in the engine. 2 storage domains have the master role for the same DC:

[root@green-vdsb ~]# for i in `vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000158`; do vdsClient -s 0 getStorageDomainInfo $i; done
    uuid = 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
    vguuid = rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ
    state = OK
    version = 3
    role = Master
    type = FCP
    class = Data
    pool = ['00000001-0001-0001-0001-000000000158']
    name = fc2

    uuid = ec5adddd-c1b5-471f-ba86-aa91303b5215
    version = 0
    role = Regular
    remotePath = netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/backupMordechai
    type = NFS
    class = Backup
    pool = ['00000001-0001-0001-0001-000000000158']
    name = backupMordechai

    uuid = 6605112f-7f26-48b9-b64e-f661bfe18799
    vguuid = pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf
    state = OK
    version = 3
    role = Master
    type = FCP
    class = Data
    pool = ['00000001-0001-0001-0001-000000000158']
    name = fc1

vg_tags:
###For both domains, MDT_ROLE=Master.
###For domain 6605112f-7f26-48b9-b64e-f661bfe18799 (the old master) there is no MDT_MASTER_VERSION tag, while for 504c7595-7ca5-459c-be6f-44a3d7f1b5d7, MDT_MASTER_VERSION=7:

[root@green-vdsb ~]# pvscan --cache
[root@green-vdsb ~]# vgs -o vg_tags 6605112f-7f26-48b9-b64e-f661bfe18799
  VG Tags
  MDT_CLASS=Data,MDT_DESCRIPTION=fc1,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_PHYBLKSIZE=512,MDT_POOL_UUID=00000001-0001-0001-0001-000000000158,MDT_PV0=pv:3514f0c5a51600871&44&uuid:6Imusq-PeIV-Yc0Q-Cg18-NtH1-m3ep-NRfc0b&44&pestart:0&44&pecount:277&44&mapoffset:0,MDT_ROLE=Master,MDT_SDUUID=6605112f-7f26-48b9-b64e-f661bfe18799,MDT_TYPE=FCP,MDT_VERSION=3,MDT_VGUUID=pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf,MDT__SHA_CKSUM=c14498749b865ecbb5c0d427adbf62a370d215cc,RHAT_storage_domain
[root@green-vdsb ~]# vgs -o vg_tags 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
  VG Tags
  MDT_CLASS=Data,MDT_DESCRIPTION=fc2,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_MASTER_VERSION=7,MDT_PHYBLKSIZE=512,MDT_POOL_DESCRIPTION=Default,MDT_POOL_DOMAINS=504c7595-7ca5-459c-be6f-44a3d7f1b5d7:Active&44&ec5adddd-c1b5-471f-ba86-aa91303b5215:Active&44&ad1ecd8f-8822-4323-9ba4-4864e4d97297:Attached&44&39a95638-1583-461e-9aa1-52e6ea8597ee:Attached&44&6605112f-7f26-48b9-b64e-f661bfe18799:Active&44&6c24e902-052c-43ba-bee5-9e42b0aacc59:Active,MDT_POOL_SPM_ID=-1,MDT_POOL_SPM_LVER=-1,MDT_POOL_UUID=00000001-0001-0001-0001-000000000158,MDT_PV0=pv:3514f0c5a51600858&44&uuid:4N23Pb-5bVj-NyUp-Ss7D-r8pb-OOrF-2uVAeF&44&pestart:0&44&pecount:413&44&mapoffset:0,MDT_ROLE=Master,MDT_SDUUID=504c7595-7ca5-459c-be6f-44a3d7f1b5d7,MDT_TYPE=FCP,MDT_VERSION=3,MDT_VGUUID=rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ,MDT__SHA_CKSUM=dbddc311b8105117ba6778cce8559919e968ce39,RHAT_storage_domain

Expected results:
Only one master domain in the pool.

Additional info:
logs from engine and host

###Before connectivity loss to the master domain:
Thread-676935::INFO::2015-11-29 14:15:31,452::logUtils::48::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'00000001-0001-0001-0001-000000000158', hostID=1, msdUUID=u'ad1ecd8f-8822-4323-9ba4-4864e4d97297', masterVersion=3, domainsMap={u'39a95638-1583-461e-9aa1-52e6ea8597ee': u'attached', u'ec5adddd-c1b5-471f-ba86-aa91303b5215': u'active', u'6c24e902-052c-43ba-bee5-9e42b0aacc59': u'active', u'ad1ecd8f-8822-4323-9ba4-4864e4d97297': u'active'}, options=None)

###Master domain connection failure:
mailbox.SPMMonitor::ERROR::2015-11-29 14:22:28,132::storage_mailbox::793::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 791, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 738, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000158/mastersd/dom_md/inbox

###Reconstruct master completed:
2015-11-29 14:23:26,950 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-13) [68ecc5fe] Correlation ID: 68ecc5fe, Job ID: 69036a12-88d2-426e-a3bf-85470525d879, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center Default completed.
###Domain 504c7595-7ca5-459c-be6f-44a3d7f1b5d7 is operational again:
2015-11-29 14:40:14,152 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (DefaultQuartzScheduler_Worker-57) [554ef729] Lock Acquired to object 'EngineLock:{exclusiveLocks='[504c7595-7ca5-459c-be6f-44a3d7f1b5d7=<STORAGE, ACTION_TYPE_FAILED_DOMAIN_OVF_ON_UPDATE>]', sharedLocks='[00000001-0001-0001-0001-000000000158=<OVF_UPDATE, ACTION_TYPE_FAILED_DOMAIN_OVF_ON_UPDATE>]'}'
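The dual-master state above can be spotted directly from the LVM tags, without the engine. This is a minimal sketch, not part of the original report; it assumes the MDT_ROLE and MDT_POOL_UUID tags shown in the vgs output above and that vgs is available on the host:

import subprocess
from collections import defaultdict

def masters_per_pool():
    """Map pool UUID -> VGs whose tags claim the Master role."""
    out = subprocess.check_output(
        ["vgs", "--noheadings", "-o", "vg_name,vg_tags"]).decode()
    masters = defaultdict(list)
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # VG with no tags
        vg_name, tag_str = parts[0], parts[1]
        # Tags are comma-separated; commas inside values are escaped as &44&
        tags = dict(t.split("=", 1) for t in tag_str.split(",") if "=" in t)
        if tags.get("MDT_ROLE") == "Master":
            masters[tags.get("MDT_POOL_UUID")].append(vg_name)
    return masters

for pool, vg_names in masters_per_pool().items():
    if len(vg_names) > 1:
        print("pool %s has %d masters: %s" % (pool, len(vg_names), vg_names))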
This is a known issue and documented in sp.py in masterMigrate:

    # There's no way to ensure that we only have one domain marked
    # as master in the storage pool (e.g. after a reconstructMaster,
    # or even in this method if we fail to set the old master to
    # regular). That said, for API cleaness switchMasterDomain is
    # the last method to call as "point of no return" after which we
    # only try to cleanup but we cannot rollback.

When the engine calls reconstructMaster it supplies a new masterVersion parameter, which must be greater than the current masterVersion. In validateMasterDomainVersion we check that the domain has the expected masterVersion and raise an error if it does not match, so a stale domain that is still tagged as master can never actually be used as the master again.

Although it's possible to demote such a domain from the master role back to a regular role, we're doing away with the master domain in the future, so it's not worth the effort.

Dropping severity in light of the above analysis. Allon, I don't recommend creating an RFE to "fix" old master domains, and I think we should close this as WONTFIX. Thoughts?
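To make the protection concrete, here is an illustrative sketch of the masterVersion check described above. It is not the actual sp.py implementation; the names and the metadata dicts are hypothetical, with field names following the MDT_* tags shown in the vgs output earlier:

class StoragePoolWrongMaster(Exception):
    pass

def validate_master_domain_version(domain_md, expected_master_version):
    """Raise unless the domain's recorded master version matches the
    version the engine expects (hypothetical helper)."""
    actual = domain_md.get("MDT_MASTER_VERSION")  # None on the stale master
    if actual is None or int(actual) != expected_master_version:
        raise StoragePoolWrongMaster(
            "wrong master version: expected %s, found %s"
            % (expected_master_version, actual))

# The old master (6605112f-...) has no MDT_MASTER_VERSION tag at all,
# so it fails this check despite still carrying MDT_ROLE=Master:
old_master_md = {"MDT_ROLE": "Master"}
new_master_md = {"MDT_ROLE": "Master", "MDT_MASTER_VERSION": "7"}

validate_master_domain_version(new_master_md, 7)  # passes
try:
    validate_master_domain_version(old_master_md, 7)
except StoragePoolWrongMaster as e:
    print(e)  # wrong master version: expected 7, found None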
> Dropping severity in light of the above analysis. Allon, I don't recommend
> creating an RFE to "fix" old master domains, and I think we should close
> this as WONTFIX. Thoughts?

Agreed.