Bug 1086210
Summary: Guest can't boot up after the master domain change

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Reporter: Shanzhi Yu <shyu>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Elad <ebenahar>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 3.3.0
Target Milestone: ---
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
CC: acanan, ajia, amureini, bazulay, bili, dyuan, fsimonce, gklein, iheim, knesenko, lpeer, mzhan, scohen, shyu, yeylon, zdover
Fixed In Version: vdsm-4.14.7-1.el6ev
Doc Type: Bug Fix
Doc Text: Previously, guests were unable to boot up after the master domain changed. This was because refreshStoragePool ran before the domains were reachable again. A change in the code prevents refreshStoragePool from running until the domains are reachable.
Story Points: ---
Last Closed: 2014-06-09 13:30:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Description
Shanzhi Yu, 2014-04-10 11:22:43 UTC
Hi,

Can you please add the engine and vdsm logs?

Also, did you mean cluster or gluster? It looks like you mean cluster, but I want to be sure.

Thanks,

(In reply to Aharon Canan from comment #1)
> Hi
>
> can you add please engine and vdsm logs?
>
> also, did you mean cluster or gluster?
> looks like you mean cluster but I want to be sure.
>
> thanks,

Sorry for the mistake. I did mean cluster.

Created attachment 884913 [details]
vdsm log

Created attachment 884915 [details]
engine log
(In reply to Shanzhi Yu from comment #0)
> #iptables -A OUTPUT -d Master Domain -p tcp --dport 2049 -j DROP
> 4.wait about ten or more minutes, the Master domain will change to another
> one. DataCenter Gluster,one Domain, Host will recover to active status

Not exactly: the DC is in the 'Non Responsive' state, the SD is inactive, and the host stays in the running/connecting state rather than 'Non Responsive' after the connection is blocked. That is a separate question, though; bug 1086223 tracks it.

Seems as though we aren't creating the links. Fede - haven't you already dealt with a similar issue?

The only refreshStoragePool arrived before the domains were reachable again:

Thread-519::INFO::2014-04-10 16:42:00,085::logUtils::44::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='f721168f-5edd-4f5f-8976-591eb93f960e', msdUUID='55cb7668-7c5e-4553-b263-c6f3f86b358f', masterVersion=6, options=None)

Thread-469::DEBUG::2014-04-10 16:53:09,963::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain 65eafe2b-4fa9-4ea1-a869-7deb5afd4654 changed its status to Valid
Thread-467::DEBUG::2014-04-10 16:53:10,007::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain f9b3c82f-87e3-4cc8-8dda-b2721b4b4d76 changed its status to Valid

I am not sure whether anything changed on the engine side (lack of an additional refreshStoragePool), but I agree that when a domain is reachable again we should make sure its links are available. In particular, some code related to the issue is:

    def __rebuild(self, msdUUID, masterVersion):
        ...
        blockDomUUIDs = [vg.name for vg in blockSD.lvm.getVGs(domUUIDs)]
        domDirs = {}  # {domUUID: domaindir}
        # Add the block domains
        for domUUID in blockDomUUIDs:
            domaindir = os.path.join(block_mountpoint, domUUID)
            domDirs[domUUID] = domaindir
            # create domain special volumes folder
            fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_META_DATA))
            fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_IMAGES))
        # Add the file domains
        for domUUID, domaindir in fileSD.scanDomains():  # [(fileDomUUID, file_domaindir)]
            if domUUID in domUUIDs:
                domDirs[domUUID] = domaindir
        ...

As we can see, we create links only for the domains that we are able to find at that time. Bottom line: the solution is to try _linkStorageDomain as soon as the domains are visible.

Even though we have a vdsm fix for this, I strongly suggest that we solve the issue by addressing bug 1093924, which covers other scenarios as well.

(In reply to Federico Simoncelli from comment #8)
> Even though we have a vdsm fix for this I strongly suggest that we solve the
> issue addressing bug 1093924 that covers other scenarios as well.

Agreed.

Verified with 2 NFS domains in the DC:
1) Created a VM with a disk located on the master domain and started it
2) Blocked connectivity from the SPM to the master domain and waited for reconstruct to take place
3) Once the other domain took master, destroyed the VM
4) Resumed connectivity to the first domain
5) Started the VM

The VM started normally, and the link to the mount of the storage domain re-appeared under /rhev/data-center/SPUUID. Tested 3 times.

Thread-220::INFO::2014-05-18 15:22:27,387::sp::1113::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_6/4b73f56c-a54a-4f81-b9b2-010cc1b5904e to /rhev/data-center/4aa2760a-c779-4b5c-93aa-8aafd334aeb1/4b73f56c-a54a-4f81-b9b2-010cc1b5904e

As this scenario did not reproduce the issue of missing links, I'm moving this bug to VERIFIED.
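To make the direction described in the analysis above more concrete, the following is a minimal, self-contained Python sketch of the idea: re-create the /rhev/data-center/<spUUID>/<sdUUID> link as soon as the domain monitor reports a domain Valid again, instead of only during the one-time pool rebuild. This is not the actual vdsm patch; the class and function names (StoragePool, DomainMonitor, link_storage_domain) and the demo paths are simplified stand-ins for illustration only.

import errno
import os


class StoragePool(object):
    """Simplified stand-in for the pool object that owns the domain symlinks."""

    def __init__(self, sp_uuid, data_center_root="/rhev/data-center"):
        self.sp_uuid = sp_uuid
        self.pool_dir = os.path.join(data_center_root, sp_uuid)

    def link_storage_domain(self, dom_uuid, domain_dir):
        # Create <pool_dir>/<sdUUID> -> <domain mountpoint>, tolerating an
        # already existing link.
        link_path = os.path.join(self.pool_dir, dom_uuid)
        try:
            os.symlink(domain_dir, link_path)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise


class DomainMonitor(object):
    """Tracks domain validity and re-links on an Invalid -> Valid transition."""

    def __init__(self, pool):
        self.pool = pool
        self._last_valid = {}  # dom_uuid -> last reported validity

    def on_status_change(self, dom_uuid, domain_dir, valid):
        was_valid = self._last_valid.get(dom_uuid, False)
        self._last_valid[dom_uuid] = valid
        # The key point: link as soon as the domain is visible again,
        # not only during the initial refreshStoragePool/__rebuild pass.
        if valid and not was_valid:
            self.pool.link_storage_domain(dom_uuid, domain_dir)


if __name__ == "__main__":
    # Demo paths only; the real layout lives under /rhev/data-center.
    pool = StoragePool("4aa2760a-c779-4b5c-93aa-8aafd334aeb1",
                       data_center_root="/tmp/data-center-demo")
    os.makedirs(pool.pool_dir, exist_ok=True)
    monitor = DomainMonitor(pool)
    dom = "4b73f56c-a54a-4f81-b9b2-010cc1b5904e"
    mnt = "/tmp/data-center-demo/mnt/demo-domain"
    monitor.on_status_change(dom, mnt, valid=False)  # domain unreachable
    monitor.on_status_change(dom, mnt, valid=True)   # reachable again -> link created

With this pattern, whether the link was missed by the rebuild (as in this bug) or disappeared later, it comes back as soon as the monitor sees the domain become Valid again, which matches the _linkStorageDomain log line observed during verification.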
Verified using av9.1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html