Bug 1286446

Summary: [vdsm] 2 master domains in same dc after connectivity to old master is resumed
Product: [oVirt] vdsm
Component: General
Assignee: Adam Litke <alitke>
Reporter: Elad <ebenahar>
QA Contact: Aharon Canan <acanan>
Status: CLOSED WONTFIX
Severity: low
Priority: low
Version: 4.17.10
CC: amureini, bugs, laravot, tnisan, ylavi
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
Doc Type: Bug Fix
Last Closed: 2015-12-23 10:46:14 UTC
Type: Bug
oVirt Team: Storage

Attachments: logs from engine and host

Description Elad 2015-11-29 15:30:32 UTC
Created attachment 1100227 [details]
logs from engine and host

Description of problem:
Master domain role is written in the metadata of 2 different storage domains in the same pool after a scenario in which the paths to the old master domain are set back to 'running'.

Version-Release number of selected component (if applicable):
vdsm-4.17.10.1-0.el7ev.noarch
rhevm-3.6.0.3-0.1.el6.noarch

How reproducible:
Always

Steps to Reproduce:

1. Start with a DC that has 2 FC data domains (the master domain is 6605112f-7f26-48b9-b64e-f661bfe18799):

[root@green-vdsb ~]# for i in `vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000158 ` ; do vdsClient -s 0 getStorageDomainInfo $i;  done; 
        uuid = 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
        vguuid = rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ
        state = OK
        version = 3
        role = Regular
        type = FCP
        class = Data
        pool = ['00000001-0001-0001-0001-000000000158']
        name = fc2

        uuid = ec5adddd-c1b5-471f-ba86-aa91303b5215
        version = 0
        role = Regular
        remotePath = netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/backupMordechai
        type = NFS
        class = Backup
        pool = ['00000001-0001-0001-0001-000000000158']
        name = backupMordechai

        uuid = 6605112f-7f26-48b9-b64e-f661bfe18799
        vguuid = pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf
        state = OK
        version = 3
        role = Master
        type = FCP
        class = Data
        pool = ['00000001-0001-0001-0001-000000000158']
        name = fc1




With this setup, I disabled the paths to the current master domain:
# echo "offline" > /sys/block/sdf/device/state
Then waited for the reconstruct to finish; the second FC domain took the master role.

2. Resumed connectivity to the old master domain:
# echo "running" > /sys/block/sdi/device/state




Actual results:

Both domains report 'state = OK' and both are reported as active in the engine.

Two storage domains hold the master role for the same DC:


[root@green-vdsb ~]# for i in `vdsClient -s 0 getStorageDomainsList 00000001-0001-0001-0001-000000000158 ` ; do vdsClient -s 0 getStorageDomainInfo $i;  done;
        uuid = 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
        vguuid = rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ
        state = OK
        version = 3
        role = Master
        type = FCP
        class = Data
        pool = ['00000001-0001-0001-0001-000000000158']
        name = fc2

        uuid = ec5adddd-c1b5-471f-ba86-aa91303b5215
        version = 0
        role = Regular
        remotePath = netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/backupMordechai
        type = NFS
        class = Backup
        pool = ['00000001-0001-0001-0001-000000000158']
        name = backupMordechai

        uuid = 6605112f-7f26-48b9-b64e-f661bfe18799
        vguuid = pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf
        state = OK
        version = 3
        role = Master
        type = FCP
        class = Data
        pool = ['00000001-0001-0001-0001-000000000158']
        name = fc1




vg_tags:


###For both domains, MDT_ROLE=Master.

###For domain 6605112f-7f26-48b9-b64e-f661bfe18799, the old master, there is no MDT_MASTER_VERSION tag, while for 504c7595-7ca5-459c-be6f-44a3d7f1b5d7, MDT_MASTER_VERSION=7:


[root@green-vdsb ~]# pvscan --cache
[root@green-vdsb ~]# vgs -o vg_tags 6605112f-7f26-48b9-b64e-f661bfe18799
  VG Tags                                                                                                                                                            MDT_CLASS=Data,MDT_DESCRIPTION=fc1,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_PHYBLKSIZE=512,MDT_POOL_UUID=00000001-0001-0001-0001-000000000158,MDT_PV0=pv:3514f0c5a51600871&44&uuid:6Imusq-PeIV-Yc0Q-Cg18-NtH1-m3ep-NRfc0b&44&pestart:0&44&pecount:277&44&mapoffset:0,MDT_ROLE=Master,MDT_SDUUID=6605112f-7f26-48b9-b64e-f661bfe18799,MDT_TYPE=FCP,MDT_VERSION=3,MDT_VGUUID=pP6EqC-feb8-DdZW-w6dx-zdPH-swYv-tDQRHf,MDT__SHA_CKSUM=c14498749b865ecbb5c0d427adbf62a370d215cc,RHAT_storage_domain

[root@green-vdsb ~]# vgs -o vg_tags 504c7595-7ca5-459c-be6f-44a3d7f1b5d7
  VG Tags                                                                                                                                                            MDT_CLASS=Data,MDT_DESCRIPTION=fc2,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_MASTER_VERSION=7,MDT_PHYBLKSIZE=512,MDT_POOL_DESCRIPTION=Default,MDT_POOL_DOMAINS=504c7595-7ca5-459c-be6f-44a3d7f1b5d7:Active&44&ec5adddd-c1b5-471f-ba86-aa91303b5215:Active&44&ad1ecd8f-8822-4323-9ba4-4864e4d97297:Attached&44&39a95638-1583-461e-9aa1-52e6ea8597ee:Attached&44&6605112f-7f26-48b9-b64e-f661bfe18799:Active&44&6c24e902-052c-43ba-bee5-9e42b0aacc59:Active,MDT_POOL_SPM_ID=-1,MDT_POOL_SPM_LVER=-1,MDT_POOL_UUID=00000001-0001-0001-0001-000000000158,MDT_PV0=pv:3514f0c5a51600858&44&uuid:4N23Pb-5bVj-NyUp-Ss7D-r8pb-OOrF-2uVAeF&44&pestart:0&44&pecount:413&44&mapoffset:0,MDT_ROLE=Master,MDT_SDUUID=504c7595-7ca5-459c-be6f-44a3d7f1b5d7,MDT_TYPE=FCP,MDT_VERSION=3,MDT_VGUUID=rcC8D7-T3yy-LZ9P-5Mx4-sVZe-buqt-7RrXmJ,MDT__SHA_CKSUM=dbddc311b8105117ba6778cce8559919e968ce39,RHAT_storage_domain
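
To make the long tag lines above easier to compare, a small helper (a sketch, not part of the original report) that extracts only the role and master-version tags for the two FC domain VGs:

# Show just MDT_ROLE and MDT_MASTER_VERSION for each FC domain VG
for vg in 6605112f-7f26-48b9-b64e-f661bfe18799 504c7595-7ca5-459c-be6f-44a3d7f1b5d7; do
    echo "== $vg"
    vgs --noheadings -o vg_tags "$vg" | tr ',' '\n' | grep -E '^ *MDT_(ROLE|MASTER_VERSION)='
done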




Expected results:
Only one master domain in the pool

Additional info: logs from engine and host

###Before connectivity loss to the master domain:

Thread-676935::INFO::2015-11-29 14:15:31,452::logUtils::48::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'00000001-0001-0001-0001-000000000158', hostID=1, msdUUID=u'ad1ecd8f-8822-4323-9ba4-4864e4d97297', masterVersion=3, domainsMap={u'39a95638-1583-461e-9aa1-52e6ea8597ee': u'attached', u'ec5adddd-c1b5-471f-ba86-aa91303b5215': u'active', u'6c24e902-052c-43ba-bee5-9e42b0aacc59': u'active', u'ad1ecd8f-8822-4323-9ba4-4864e4d97297': u'active'}, options=None)





###Master domain connection failure:


mailbox.SPMMonitor::ERROR::2015-11-29 14:22:28,132::storage_mailbox::793::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 791, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 738, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000158/mastersd/dom_md/inbox


###Reconstruct master completed:

2015-11-29 14:23:26,950 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-13) [68ecc5fe] Correlation ID: 68ecc5fe, Job ID: 69036a12-88d2-426e-a3bf-85470525d879, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center Default completed.



###Domain 504c7595-7ca5-459c-be6f-44a3d7f1b5d7 is operational again:

2015-11-29 14:40:14,152 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (DefaultQuartzScheduler_Worker-57) [554ef729] Lock Acquired to object 'EngineLock:{exclusiveLocks='[504c7595-7ca5-459c-be6f-44a3d7f1b5d7=<STORAGE, ACTION_TYPE_FAILED_DOMAIN_OVF_ON_UPDATE>]', sharedLocks='[00000001-0001-0001-0001-000000000158=<OVF_UPDATE, ACTION_TYPE_FAILED_DOMAIN_OVF_ON_UPDATE>]'}'

Comment 1 Adam Litke 2015-11-30 19:07:42 UTC
This is a known issue, documented in sp.py in masterMigrate:

    # There's no way to ensure that we only have one domain marked
    # as master in the storage pool (e.g. after a reconstructMaster,
    # or even in this method if we fail to set the old master to
    # regular). That said, for API cleaness switchMasterDomain is
    # the last method to call as "point of no return" after which we
    # only try to cleanup but we cannot rollback.

When the engine calls reconstructMaster, it supplies a new masterVersion parameter which must be greater than the current one. In validateMasterDomainVersion we check that the domain has the expected master version and raise an error if it does not match.
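
As a rough external illustration of that check (not vdsm's actual code, which lives in validateMasterDomainVersion; this only mimics the comparison using the vg_tags shown above), one can compare a domain's on-disk MDT_MASTER_VERSION tag with the master version the engine expects:

# Illustration only: the old master from this report has no MDT_MASTER_VERSION
# tag, so it can never match the version the engine supplies (7 in this run),
# and vdsm will refuse to use it as master.
SDUUID=6605112f-7f26-48b9-b64e-f661bfe18799   # stale master domain
EXPECTED_VER=7                                # master version expected by the engine

ACTUAL_VER=$(vgs --noheadings -o vg_tags "$SDUUID" | tr ',' '\n' \
             | awk -F= '/MDT_MASTER_VERSION/{print $2}')

if [ "$ACTUAL_VER" != "$EXPECTED_VER" ]; then
    echo "wrong master: $SDUUID has master version '${ACTUAL_VER:-<none>}', expected $EXPECTED_VER"
fi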

Although it's possible to demote a domain like this from master back to a regular role, we're doing away with the master domain in the future, so it's not worth the effort.

Dropping severity in light of the above analysis.  Allon, I don't recommend creating a RFE to "fix" old master domains and I think we should close this as WONTFIX.  Thoughts?

Comment 2 Allon Mureinik 2015-12-23 10:46:14 UTC
> Dropping severity in light of the above analysis.  Allon, I don't recommend
> creating a RFE to "fix" old master domains and I think we should close this
> as WONTFIX.  Thoughts?
Agreed.