Description of problem:
Guest can't boot up after the master domain changes.

Version-Release number of selected component (if applicable):
libvirt-0.10.2-31.el6.x86_64
vdsm-4.13.2-0.13.el6ev.x86_64
rhevm-3.3.1-0.48.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Prepare a RHEVM environment with one Data Center, a Gluster, two storage domains (NFS type) and one host. Keep the Data Center, Gluster, both domains and the host in active status.
2. Create a new guest with its disk on the master domain and keep it running.
3. Log in to the host and add an iptables rule to block the connection from the host to the master domain:
   # iptables -A OUTPUT -d <Master Domain IP> -p tcp --dport 2049 -j DROP
4. Wait about ten or more minutes; the master domain will change to the other one. The Data Center, Gluster, the remaining domain and the host will recover to active status.
5. Destroy the guest (you may need to wait several minutes).
6. Remove the blocking rule added in step 3; the other domain will recover after several minutes:
   # iptables -F
7. Start the guest.

Actual results:
Starting the guest in step 7 fails, with an error like:

VM vm2 is down. Exit message: cannot open file '/rhev/data-center/f721168f-5edd-4f5f-8976-591eb93f960e/f9b3c82f-87e3-4cc8-8dda-b2721b4b4d76/images/fe90abc5-8638-40e8-b394-9082b3217731/7879e1b7-c238-4222-95eb-e922c5cf80dc': No such file or directory.

Expected results:
The guest should boot successfully after the master domain change.

Additional info:
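For this kind of "No such file or directory" failure, a minimal diagnostic sketch like the following can show which component of the path is actually missing (the path is copied from the error message above; everything else is illustrative, not part of vdsm):

import os

# Path copied from the "cannot open file" error above; the first two
# components under /rhev/data-center are the pool UUID and the domain UUID.
image_path = ('/rhev/data-center/f721168f-5edd-4f5f-8976-591eb93f960e/'
              'f9b3c82f-87e3-4cc8-8dda-b2721b4b4d76/images/'
              'fe90abc5-8638-40e8-b394-9082b3217731/'
              '7879e1b7-c238-4222-95eb-e922c5cf80dc')

# Walk down the path and report the first component that does not exist.
# If it is the <pool UUID>/<domain UUID> symlink, the domain mount itself
# is present but the pool-level link was never recreated.
path = '/'
for part in image_path.strip('/').split('/'):
    path = os.path.join(path, part)
    if not os.path.lexists(path):
        print('missing: %s' % path)
        break
    if os.path.islink(path):
        print('link:    %s -> %s' % (path, os.readlink(path)))
else:
    print('full path exists')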
Hi,

Can you please add the engine and vdsm logs?

Also, did you mean cluster or gluster? It looks like you mean cluster, but I want to be sure.

Thanks,
(In reply to Aharon Canan from comment #1)
> Can you please add the engine and vdsm logs?
>
> Also, did you mean cluster or gluster? It looks like you mean cluster, but
> I want to be sure.

Sorry for the mistake. I meant cluster.
Created attachment 884913 [details] vdsm log
Created attachment 884915 [details] engine log
(In reply to Shanzhi Yu from comment #0)
> # iptables -A OUTPUT -d <Master Domain IP> -p tcp --dport 2049 -j DROP
> 4. Wait about ten or more minutes; the master domain will change to the
> other one. The Data Center, Gluster, the remaining domain and the host
> will recover to active status.

Not exactly: after blocking the connection the DC is in 'Non Responsive' state, the SD is inactive, and the host stays in the running/connecting state rather than 'Non Responsive'. Of course, that is a separate question; BTW, bug 1086223 is tracking it.
Seems as though we aren't creating the links. Fede - haven't you already dealt with a similar issue?
The only refreshStoragePool arrived before the domains were reachable again:

Thread-519::INFO::2014-04-10 16:42:00,085::logUtils::44::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='f721168f-5edd-4f5f-8976-591eb93f960e', msdUUID='55cb7668-7c5e-4553-b263-c6f3f86b358f', masterVersion=6, options=None)

Thread-469::DEBUG::2014-04-10 16:53:09,963::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain 65eafe2b-4fa9-4ea1-a869-7deb5afd4654 changed its status to Valid

Thread-467::DEBUG::2014-04-10 16:53:10,007::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain f9b3c82f-87e3-4cc8-8dda-b2721b4b4d76 changed its status to Valid

I am not sure if anything changed on the engine side (the lack of an additional refreshStoragePool), but I agree that when a domain is reachable again we should make sure its links are available.

In particular, some code related to the issue is:

def __rebuild(self, msdUUID, masterVersion):
    ...
    blockDomUUIDs = [vg.name for vg in blockSD.lvm.getVGs(domUUIDs)]
    domDirs = {}  # {domUUID: domaindir}

    # Add the block domains
    for domUUID in blockDomUUIDs:
        domaindir = os.path.join(block_mountpoint, domUUID)
        domDirs[domUUID] = domaindir
        # create domain special volumes folder
        fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_META_DATA))
        fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_IMAGES))

    # Add the file domains
    for domUUID, domaindir in fileSD.scanDomains():  # [(fileDomUUID, file_domaindir)]
        if domUUID in domUUIDs:
            domDirs[domUUID] = domaindir
    ...

As we can see, we create links only for the domains that we are able to find at that time.

Bottom line: the solution is to try and _linkStorageDomain as soon as the domains are visible.
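To illustrate the direction (this is a minimal sketch, not the actual vdsm patch), assuming a hypothetical hook invoked when the domain monitor reports a domain as Valid; _linkStorageDomain and the /rhev/data-center layout are as referenced above, the function below is a simplified stand-in:

import os

POOL_ROOT = '/rhev/data-center'  # pool-level links live here

def _linkStorageDomain(src, linkName):
    # Simplified stand-in for StoragePool._linkStorageDomain: (re)create the
    # <pool UUID>/<domain UUID> symlink, replacing a stale one if present.
    if os.path.lexists(linkName):
        if os.path.islink(linkName) and os.readlink(linkName) == src:
            return
        os.unlink(linkName)
    os.symlink(src, linkName)

def onDomainBecameValid(spUUID, sdUUID, domaindir):
    # Hypothetical hook, called when the domain monitor sees sdUUID change
    # its status to Valid, so the pool-level link reappears without waiting
    # for the next refreshStoragePool/__rebuild.
    linkName = os.path.join(POOL_ROOT, spUUID, sdUUID)
    _linkStorageDomain(domaindir, linkName)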
Even though we have a vdsm fix for this, I strongly suggest that we solve the issue by addressing bug 1093924, which covers other scenarios as well.
(In reply to Federico Simoncelli from comment #8)
> Even though we have a vdsm fix for this, I strongly suggest that we solve
> the issue by addressing bug 1093924, which covers other scenarios as well.

Agreed.
2 NFS domains in the DC:

1) Created a VM with a disk located on the master domain and started it.
2) Blocked connectivity from the SPM to the master domain and waited for the reconstruct to take place.
3) Once the other domain took master, destroyed the VM.
4) Resumed connectivity to the first domain.
5) Started the VM.

The VM started normally and the link to the mount of the storage domain re-appeared under /rhev/data-center/<SPUUID>. Tested 3 times.

Thread-220::INFO::2014-05-18 15:22:27,387::sp::1113::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_6/4b73f56c-a54a-4f81-b9b2-010cc1b5904e to /rhev/data-center/4aa2760a-c779-4b5c-93aa-8aafd334aeb1/4b73f56c-a54a-4f81-b9b2-010cc1b5904e

As this scenario no longer reproduces the missing-links issue, I'm moving this bug to VERIFIED.

Verified using av9.1.
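For reference, a small sketch of what that "Linking ... to ..." log line amounts to and how it can be confirmed on the host (both paths are taken from the log line above; the check itself is illustrative):

import os

# Values from the _linkStorageDomain log line above.
mountpoint = ('/rhev/data-center/mnt/'
              'lion.qa.lab.tlv.redhat.com:_export_elad_6/'
              '4b73f56c-a54a-4f81-b9b2-010cc1b5904e')
pool_link = ('/rhev/data-center/'
             '4aa2760a-c779-4b5c-93aa-8aafd334aeb1/'
             '4b73f56c-a54a-4f81-b9b2-010cc1b5904e')

# The pool-level entry is a symlink to the domain directory on the mount;
# this is the link that was missing before the fix.
if os.path.islink(pool_link) and os.readlink(pool_link) == mountpoint:
    print('link is in place: %s -> %s' % (pool_link, mountpoint))
else:
    print('link missing or stale: %s' % pool_link)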
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0504.html