Bug 1086210 - Guest can't boot up after the master domain change
Summary: Guest can't boot up after the master domain change
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Federico Simoncelli
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2014-04-10 11:22 UTC by Shanzhi Yu
Modified: 2016-02-10 20:56 UTC
CC List: 16 users

Fixed In Version: vdsm-4.14.7-1.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, guests were unable to boot up after the master domain changed. This was because refreshStoragePool ran before the domains were reachable again. A change in the code prevents refreshStoragePool from running until the domains are reachable.
Clone Of:
Environment:
Last Closed: 2014-06-09 13:30:20 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
vdsm log (11.57 MB, text/x-log), 2014-04-10 11:51 UTC, Shanzhi Yu
engine log (2.19 MB, text/x-log), 2014-04-10 11:52 UTC, Shanzhi Yu


Links
Red Hat Product Errata RHBA-2014:0504 (SHIPPED_LIVE): vdsm 3.4.0 bug fix and enhancement update, last updated 2014-06-09 17:21:35 UTC
oVirt gerrit 27466

Description Shanzhi Yu 2014-04-10 11:22:43 UTC
Description of problem:

Guest can't boot up after the master domain change

Version-Release number of selected component (if applicable):

libvirt-0.10.2-31.el6.x86_64
vdsm-4.13.2-0.13.el6ev.x86_64
rhevm-3.3.1-0.48.el6ev.noarch

How reproducible:

100%

Steps to Reproduce:
1. Prepare a RHEVM env with one DataCenter, Gluster, two Domains (NFS type), and one Host.
   Keep the DataCenter, Gluster, both Domains, and the Host in active status.
2. Create a new guest with its disk on the Master Domain, and keep it running.
3. Log in to the Host and add an iptables rule to block the connection from the Host to the Master domain:
# iptables -A OUTPUT -d <Master Domain address> -p tcp --dport 2049 -j DROP
4. Wait about ten or more minutes; the Master domain will change to another one. The DataCenter, Gluster, the other Domain, and the Host will recover to active status.
5. Destroy the guest (may need to wait several minutes).
6. Remove the blocking rule added in step 3; the blocked domain will recover after several minutes (a scripted version of steps 3 and 6 is sketched below, after step 7):
# iptables -F
7. Start the guest.
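
The following is only an illustrative helper, not part of the original report: a minimal Python sketch that automates steps 3 and 6 with the subprocess module. The NFS server address below is a hypothetical placeholder for the Master domain's storage server; substitute the real one. It must run as root.

import subprocess

MASTER_DOMAIN_ADDR = "192.0.2.10"  # hypothetical placeholder for the Master domain NFS server

# Rule spec shared by add (-A) and delete (-D): drop outgoing NFS (TCP/2049) traffic
NFS_BLOCK_RULE = ["OUTPUT", "-d", MASTER_DOMAIN_ADDR,
                  "-p", "tcp", "--dport", "2049", "-j", "DROP"]

def block_master_domain():
    # Step 3: block the connection from the Host to the Master domain
    subprocess.check_call(["iptables", "-A"] + NFS_BLOCK_RULE)

def unblock_master_domain():
    # Step 6: remove only the rule added above; -D deletes the matching rule,
    # which is narrower than the "iptables -F" flush used in the manual steps
    subprocess.check_call(["iptables", "-D"] + NFS_BLOCK_RULE)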




Actual results:

Starting the guest in step 7 fails, with an error like:

VM vm2 is down. Exit message: cannot open file '/rhev/data-center/f721168f-5edd-4f5f-8976-591eb93f960e/f9b3c82f-87e3-4cc8-8dda-b2721b4b4d76/images/fe90abc5-8638-40e8-b394-9082b3217731/7879e1b7-c238-4222-95eb-e922c5cf80dc': No such file or directory.

Expected results:

The guest should boot successfully after the Master Domain change.

Additional info:

Comment 1 Aharon Canan 2014-04-10 11:30:14 UTC
Hi

can you please add the engine and vdsm logs?

also, did you mean cluster or gluster?
looks like you mean cluster but I want to be sure.

thanks,

Comment 2 Shanzhi Yu 2014-04-10 11:47:46 UTC
(In reply to Aharon Canan from comment #1)
> Hi
> 
> can you please add the engine and vdsm logs?
> 
> also, did you mean cluster or gluster?
> looks like you mean cluster but I want to be sure.
> 
> thanks,

Sorry for the mistake. I did mean cluster.

Comment 3 Shanzhi Yu 2014-04-10 11:51:02 UTC
Created attachment 884913 [details]
vdsm log

Comment 4 Shanzhi Yu 2014-04-10 11:52:35 UTC
Created attachment 884915 [details]
engine log

Comment 5 Alex Jia 2014-04-11 03:12:45 UTC
(In reply to Shanzhi Yu from comment #0)

> # iptables -A OUTPUT -d <Master Domain address> -p tcp --dport 2049 -j DROP
> 4. Wait about ten or more minutes; the Master domain will change to another
> one. The DataCenter, Gluster, the other Domain, and the Host will recover to
> active status.


Not exactly: after blocking the connection, the DC is in the 'Non Responsive' state, the SD is inactive, and the host stays in a running/connecting state rather than 'Non Responsive'. That is a separate question, of course; bug 1086223 tracks it.

Comment 6 Allon Mureinik 2014-04-13 03:00:23 UTC
Seems as though we aren't creating the links.
Fede - haven't you already dealt with a similar issue?

Comment 7 Federico Simoncelli 2014-04-29 09:38:03 UTC
The only refreshStoragePool call arrived before the domains were reachable again:

Thread-519::INFO::2014-04-10 16:42:00,085::logUtils::44::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='f721168f-5edd-4f5f-8976-591eb93f960e', msdUUID='55cb7668-7c5e-4553-b263-c6f3f86b358f', masterVersion=6, options=None)

Thread-469::DEBUG::2014-04-10 16:53:09,963::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain 65eafe2b-4fa9-4ea1-a869-7deb5afd4654 changed its status to Valid
Thread-467::DEBUG::2014-04-10 16:53:10,007::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain f9b3c82f-87e3-4cc8-8dda-b2721b4b4d76 changed its status to Valid

I am not sure if anything changed on the engine side (the lack of an additional refreshStoragePool), but I agree that when a domain is reachable again we should make sure its links are available.

In particular some code related to the issue is:


    def __rebuild(self, msdUUID, masterVersion):
    ...
        blockDomUUIDs = [vg.name for vg in blockSD.lvm.getVGs(domUUIDs)]
        domDirs = {}  # {domUUID: domaindir}
        # Add the block domains
        for domUUID in blockDomUUIDs:
            domaindir = os.path.join(block_mountpoint, domUUID)
            domDirs[domUUID] = domaindir
            # create domain special volumes folder
            fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_META_DATA))
            fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_IMAGES))
        # Add the file domains
        # scanDomains() returns [(fileDomUUID, file_domaindir)]
        for domUUID, domaindir in fileSD.scanDomains():
            if domUUID in domUUIDs:
                domDirs[domUUID] = domaindir
    ...

As we can see, we create the links only for the domains that are visible at that time.


Bottom line: the solution is to call _linkStorageDomain as soon as the domains become visible.
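
To illustrate that direction (a sketch only, not the actual vdsm patch tracked in the linked gerrit change): when the domain monitor reports a domain as Valid again, the pool could re-create the /rhev/data-center/<spUUID>/<sdUUID> link instead of relying on a refreshStoragePool that may already have run. The callback wiring and the pool/domain attribute names used here are assumptions for illustration; only _linkStorageDomain is taken from the code above.

import os

def on_domain_becomes_valid(pool, dom):
    # Hypothetical hook, fired when the domain monitor reports the domain as Valid.
    # Re-create the symlink for the domain that just became reachable again,
    # mirroring what __rebuild does for the domains visible at refresh time.
    link_name = os.path.join(pool.poolPath, dom.sdUUID)  # /rhev/data-center/<spUUID>/<sdUUID>
    pool._linkStorageDomain(dom.domaindir, link_name)    # assumed (src, linkName) signature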

Comment 8 Federico Simoncelli 2014-05-03 11:38:30 UTC
Even though we have a vdsm fix for this, I strongly suggest that we solve the issue by addressing bug 1093924, which covers other scenarios as well.

Comment 9 Allon Mureinik 2014-05-04 13:59:46 UTC
(In reply to Federico Simoncelli from comment #8)
> Even though we have a vdsm fix for this, I strongly suggest that we solve the
> issue by addressing bug 1093924, which covers other scenarios as well.
Agreed.

Comment 11 Elad 2014-05-18 12:30:28 UTC
2 NFS domains in the DC:
1) Created a VM with a disk located on the master domain, and started it
2) Blocked connectivity from the SPM to the master domain, waited for the reconstruct to take place
3) Once the other domain took master, destroyed the VM
4) Resumed connectivity to the first domain
5) Started the VM


The VM started normally, and the link to the mount of the storage domain re-appeared under /rhev/data-center/<SPUUID>. Tested 3 times.

Thread-220::INFO::2014-05-18 15:22:27,387::sp::1113::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_6/4b73f56c-a54a-4f81-b9b2-010cc1b5904e to /rhev/data-center/4aa2760a-c779-4b5c-93aa-8aafd334aeb1/4b73f56c-a54a-4f81-b9b2-010cc1b5904e

As the issue of missing links did not reproduce with this scenario, I'm moving this bug to VERIFIED.
 
Verified using av9.1

Comment 12 errata-xmlrpc 2014-06-09 13:30:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html

