Bug 1086210
Summary: Guest can't boot up after the master domain change

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Reporter: Shanzhi Yu <shyu>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Elad <ebenahar>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 3.3.0
Target Milestone: ---
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
CC: acanan, ajia, amureini, bazulay, bili, dyuan, fsimonce, gklein, iheim, knesenko, lpeer, mzhan, scohen, shyu, yeylon, zdover
Fixed In Version: vdsm-4.14.7-1.el6ev
Doc Type: Bug Fix
Doc Text: Previously, guests were unable to boot up after the master domain changed. This was because refreshStoragePool ran before the domains were reachable again. A change in the code prevents refreshStoragePool from running until the domains are reachable.
Story Points: ---
Last Closed: 2014-06-09 13:30:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Description
Shanzhi Yu, 2014-04-10 11:22:43 UTC
Hi,

Can you please add the engine and vdsm logs?

Also, did you mean cluster or gluster? It looks like you mean cluster, but I want to be sure.

Thanks,

(In reply to Aharon Canan from comment #1)
> Hi
>
> can you add please engine and vdsm logs?
>
> also, did you mean cluster or gluster?
> looks like you mean cluster but I want to be sure.
>
> thanks,

Sorry for the mistake. I did mean cluster.

Created attachment 884913 [details]
vdsm log

Created attachment 884915 [details]
engine log
(In reply to Shanzhi Yu from comment #0)
> #iptables -A OUTPUT -d Master Domain -p tcp --dport 2049 -j DROP
> 4.wait about ten or more minutes, the Master domain will change to another
> one. DataCenter Gluster,one Domain, Host will recover to active status

Not exactly: the DC is in the 'Non Responsive' state, the SD is inactive, and the host stays in the running/connecting state rather than 'Non Responsive' after the connection is blocked. That is a separate question, though; bug 1086223 tracks it.

Seems as though we aren't creating the links. Fede - haven't you already dealt with a similar issue?

The only refreshStoragePool arrived before the domains were reachable again:

Thread-519::INFO::2014-04-10 16:42:00,085::logUtils::44::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='f721168f-5edd-4f5f-8976-591eb93f960e', msdUUID='55cb7668-7c5e-4553-b263-c6f3f86b358f', masterVersion=6, options=None)

Thread-469::DEBUG::2014-04-10 16:53:09,963::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain 65eafe2b-4fa9-4ea1-a869-7deb5afd4654 changed its status to Valid
Thread-467::DEBUG::2014-04-10 16:53:10,007::domainMonitor::247::Storage.DomainMonitorThread::(_monitorDomain) Domain f9b3c82f-87e3-4cc8-8dda-b2721b4b4d76 changed its status to Valid

I am not sure whether anything changed on the engine side (lack of an additional refreshStoragePool), but I agree that when a domain is reachable again we should make sure its links are available. In particular, some code related to the issue is:

    def __rebuild(self, msdUUID, masterVersion):
        ...
        blockDomUUIDs = [vg.name for vg in blockSD.lvm.getVGs(domUUIDs)]
        domDirs = {}  # {domUUID: domaindir}
        # Add the block domains
        for domUUID in blockDomUUIDs:
            domaindir = os.path.join(block_mountpoint, domUUID)
            domDirs[domUUID] = domaindir
            # create domain special volumes folder
            fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_META_DATA))
            fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_IMAGES))
        # Add the file domains
        for domUUID, domaindir in fileSD.scanDomains():  # [(fileDomUUID, file_domaindir)]
            if domUUID in domUUIDs:
                domDirs[domUUID] = domaindir
        ...

As we can see, we create links only for the domains that we are able to find at that time. Bottom line: the solution is to try _linkStorageDomain as soon as the domains are visible.

Even though we have a vdsm fix for this, I strongly suggest that we solve the issue by addressing bug 1093924, which covers other scenarios as well.

(In reply to Federico Simoncelli from comment #8)
> Even though we have a vdsm fix for this I strongly suggest that we solve the
> issue addressing bug 1093924 that covers other scenarios as well.

Agreed.

Verified with 2 NFS domains in the DC:
1) Created a VM with a disk located on the master domain and started it
2) Blocked connectivity from the SPM to the master domain and waited for reconstruct to take place
3) Once the other domain took master, destroyed the VM
4) Resumed connectivity to the first domain
5) Started the VM

The VM started normally, and the link to the mount of the storage domain re-appeared under /rhev/data-center/SPUUID. Tested 3 times.

Thread-220::INFO::2014-05-18 15:22:27,387::sp::1113::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_6/4b73f56c-a54a-4f81-b9b2-010cc1b5904e to /rhev/data-center/4aa2760a-c779-4b5c-93aa-8aafd334aeb1/4b73f56c-a54a-4f81-b9b2-010cc1b5904e

As this scenario did not reproduce the issue of missing links, I'm moving this bug to VERIFIED.
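To make the direction described in the analysis above more concrete, the following is a minimal, self-contained Python sketch of the idea: re-create the /rhev/data-center/<spUUID>/<sdUUID> link as soon as the domain monitor reports a domain Valid again, instead of only during the one-time pool rebuild. This is not the actual vdsm patch; the class and function names (StoragePool, DomainMonitor, link_storage_domain) and the demo paths are simplified stand-ins for illustration only.

import errno
import os


class StoragePool(object):
    """Simplified stand-in for the pool object that owns the domain symlinks."""

    def __init__(self, sp_uuid, data_center_root="/rhev/data-center"):
        self.sp_uuid = sp_uuid
        self.pool_dir = os.path.join(data_center_root, sp_uuid)

    def link_storage_domain(self, dom_uuid, domain_dir):
        # Create <pool_dir>/<sdUUID> -> <domain mountpoint>, tolerating an
        # already existing link.
        link_path = os.path.join(self.pool_dir, dom_uuid)
        try:
            os.symlink(domain_dir, link_path)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise


class DomainMonitor(object):
    """Tracks domain validity and re-links on an Invalid -> Valid transition."""

    def __init__(self, pool):
        self.pool = pool
        self._last_valid = {}  # dom_uuid -> last reported validity

    def on_status_change(self, dom_uuid, domain_dir, valid):
        was_valid = self._last_valid.get(dom_uuid, False)
        self._last_valid[dom_uuid] = valid
        # The key point: link as soon as the domain is visible again,
        # not only during the initial refreshStoragePool/__rebuild pass.
        if valid and not was_valid:
            self.pool.link_storage_domain(dom_uuid, domain_dir)


if __name__ == "__main__":
    # Demo paths only; the real layout lives under /rhev/data-center.
    pool = StoragePool("4aa2760a-c779-4b5c-93aa-8aafd334aeb1",
                       data_center_root="/tmp/data-center-demo")
    os.makedirs(pool.pool_dir, exist_ok=True)
    monitor = DomainMonitor(pool)
    dom = "4b73f56c-a54a-4f81-b9b2-010cc1b5904e"
    mnt = "/tmp/data-center-demo/mnt/demo-domain"
    monitor.on_status_change(dom, mnt, valid=False)  # domain unreachable
    monitor.on_status_change(dom, mnt, valid=True)   # reachable again -> link created

With this pattern, whether the link was missed by the rebuild (as in this bug) or disappeared later, it comes back as soon as the monitor sees the domain become Valid again, which matches the _linkStorageDomain log line observed during verification.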
Verified using av9.1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html