Bug 1069772

Summary: [vdsm] gluster storage domain is reported as 'active' by host, even though its link under /rhev/data-center/SPUUID/ is missing
Product: [Retired] oVirt
Reporter: Elad <ebenahar>
Component: vdsm
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED CURRENTRELEASE
QA Contact: Gil Klein <gklein>
Severity: high
Priority: unspecified
Version: 3.4
CC: acathrow, amureini, bazulay, fsimonce, gklein, iheim, mgoldboi, nsoffer, tnisan, yeylon
Target Milestone: ---
Target Release: 3.4.1
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
Fixed In Version: v4.14.8.1
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2014-05-17 18:06:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---

Attachments:
sos report from engine and vdsm logs

Description Elad 2014-02-25 15:57:16 UTC
Created attachment 867495 [details]
sos report from engine and vdsm logs

Description of problem:
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1026697, but with a gluster storage domain.
The link for the gluster storage domain has disappeared from /rhev/data-center/SPUUID/, yet vdsm still reports the domain status as 'active', as shown by getStoragePoolInfo:

[root@green-vdsa images]# vdsClient -s 0 getStoragePoolInfo 235c8719-6c66-4535-b13c-95591998d068
        name = shared1
        isoprefix =
        pool_status = connected
        lver = 0
        spm_id = 1
        master_uuid = eb3bf350-d77f-4213-a664-cf0d40a4a173
        version = 3
        domains = d20d8a88-61cf-484b-9ecd-4ebefbc92d7f:Active,7288f26b-36f9-4352-8309-c507adf59f4f:Active,03356609-057f-4d3b-9afb-57a28517b9f4:Active,eb3bf350-d77f-4213-a664-cf0d40a4a173:Active,e7081a6c-1c86-4f6f-85c1-e39dc6c5c198:Active,1d11607f-6e7e-48df-908f-4b28913aad9d:Active,c87164ff-588b-4b71-808f-4a2386e1e8b3:Active
        type = ISCSI
        master_ver = 2
        d20d8a88-61cf-484b-9ecd-4ebefbc92d7f = {'status': 'Active', 'diskfree': '4017566253056', 'isoprefix': '', 'alerts': [], 'disktotal': '4395899027456', 'version': 3}
        
And by repoStats:

[root@green-vdsa images]# vdsClient -s 0 repoStats
Domain d20d8a88-61cf-484b-9ecd-4ebefbc92d7f {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.0012317', 'lastCheck': '7.0', 'valid': True}


Mount point exists:

[root@green-vdsa ~]# ll /rhev/data-center/mnt/glusterSD/10.35.102.17\:_elad-ovirt/
total 0
drwxr-xr-x. 4 vdsm kvm 64 Feb 25 10:05 d20d8a88-61cf-484b-9ecd-4ebefbc92d7f
-rwxr-xr-x. 1 vdsm kvm  0 Feb 25 10:01 __DIRECT_IO_TEST__

getVolumeSize for volumes under that domain succeeds without any issue:

Thread-268076::INFO::2014-02-25 17:47:48,652::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='d20d8a88-61cf-484b-9ecd-4ebefbc92d7f', spUUID='235c8719-6c66-4535-b13c-95591998d068', imgUUID='7c02c1a6-ad75-41b2-9975-a2108b724b5e', volUUID='489f3200-3972-4a49-b33f-8b37d28014ef', options=None)
Thread-268076::INFO::2014-02-25 17:47:48,657::logUtils::47::dispatcher::(wrapper) Run and protect: getVolumeSize, Return response: {'truesize': '7516192768', 'apparentsize': '7516192768'}


But the corresponding link under /rhev/data-center/ does not exist:

[root@green-vdsa images]# ls -l /rhev/data-center/235c8719-6c66-4535-b13c-95591998d068/
total 28
lrwxrwxrwx. 1 vdsm kvm  66 Feb 25 10:32 03356609-057f-4d3b-9afb-57a28517b9f4 -> /rhev/data-center/mnt/blockSD/03356609-057f-4d3b-9afb-57a28517b9f4
lrwxrwxrwx. 1 vdsm kvm  66 Feb 25 15:30 1d11607f-6e7e-48df-908f-4b28913aad9d -> /rhev/data-center/mnt/blockSD/1d11607f-6e7e-48df-908f-4b28913aad9d
lrwxrwxrwx. 1 vdsm kvm  85 Feb 25 10:32 7288f26b-36f9-4352-8309-c507adf59f4f -> /rhev/data-center/mnt/10.35.64.81:_export_elad_1/7288f26b-36f9-4352-8309-c507adf59f4f
lrwxrwxrwx. 1 vdsm kvm 100 Feb 25 10:32 c87164ff-588b-4b71-808f-4a2386e1e8b3 -> /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_export_elad_2/c87164ff-588b-4b71-808f-4a2386e1e8b3
lrwxrwxrwx. 1 vdsm kvm  93 Feb 25 10:32 e7081a6c-1c86-4f6f-85c1-e39dc6c5c198 -> /rhev/data-center/mnt/lion.qa.lab.tlv.redhat.com:_test_1/e7081a6c-1c86-4f6f-85c1-e39dc6c5c198
lrwxrwxrwx. 1 vdsm kvm  66 Feb 25 10:32 eb3bf350-d77f-4213-a664-cf0d40a4a173 -> /rhev/data-center/mnt/blockSD/eb3bf350-d77f-4213-a664-cf0d40a4a173
lrwxrwxrwx. 1 vdsm kvm  66 Feb 25 10:32 mastersd -> /rhev/data-center/mnt/blockSD/eb3bf350-d77f-4213-a664-cf0d40a4a173


^^Link to domain d20d8a88-61cf-484b-9ecd-4ebefbc92d7f does not exist^^

The domain is reported as 'active' by the engine, but creating new images under it is not possible:

Thread-283341::ERROR::2014-02-25 16:35:16,751::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Image does not exist in domain: 'image=19a5e8e9-2811-4b94-8579-bc7ecd92258b, domain=d20d8a88-61cf-484b-9ecd-4ebefbc92d7f'", 'code': 268}}


Version-Release number of selected component (if applicable):
vdsm-4.14.3-0.el6.x86_64
ovirt-engine-3.4.0-0.11.beta3.el6.noarch
libvirt-0.10.2-29.el6_5.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.3.x86_64
glusterfs-server-3.4.2-1.el6.x86_64
glusterfs-vim-3.2.7-1.el6.x86_64
glusterfs-api-devel-3.4.2-1.el6.x86_64
glusterfs-debuginfo-3.4.2-1.el6.x86_64
glusterfs-libs-3.4.2-1.el6.x86_64
glusterfs-cli-3.4.2-1.el6.x86_64
glusterfs-rdma-3.4.2-1.el6.x86_64
glusterfs-3.4.2-1.el6.x86_64
glusterfs-resource-agents-3.4.2-1.el6.noarch
glusterfs-geo-replication-3.4.2-1.el6.x86_64
glusterfs-api-3.4.2-1.el6.x86_64
glusterfs-fuse-3.4.2-1.el6.x86_64
glusterfs-devel-3.4.2-1.el6.x86_64

How reproducible:
Reproducing this requires a situation in which the gluster storage domain link disappears.

Steps to Reproduce:
This happened on a shared DC with several storage domains (mixed types).
I created a new gluster storage domain based on a volume that contains 2 bricks.
After the domain was created, I was able to create images under it without any problem.

Actual results:

Storage domain activation:

Thread-266535::INFO::2014-02-25 10:03:57,068::logUtils::44::dispatcher::(wrapper) Run and protect: activateStorageDomain(sdUUID='d20d8a88-61cf-484b-9ecd-4ebefbc92d7f', spUUID='235c8719-6c66-4535-b13c-95591998d068', options=None)


Thread-266535::INFO::2014-02-25 10:03:57,990::sp::1104::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/glusterSD/10.35.102.17:_elad-ovirt/d20d8a88-61cf-484b-9ecd-4ebefbc92d7f to /rhev/data-center/235c8719-6c66-4535-b13c-95591998d068/d20d8a88-61cf-484b-9ecd-4ebefbc92d7f
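
For illustration only, a minimal Python sketch of what such a linking step roughly amounts to; the function name and the atomic-replace detail are assumptions, not the actual vdsm _linkStorageDomain implementation:

import logging
import os

log = logging.getLogger("Storage.StoragePool")

def link_storage_domain(domain_path, pool_dir, sd_uuid):
    # Hypothetical helper: create (or refresh) the
    # /rhev/data-center/<SPUUID>/<SDUUID> symlink pointing at the
    # domain's mount path. Not the real vdsm code.
    link_path = os.path.join(pool_dir, sd_uuid)
    log.info("Linking %s to %s", domain_path, link_path)
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(domain_path, tmp_link)
    # rename() replaces any stale link atomically, so readers never
    # observe a missing link while it is being refreshed.
    os.rename(tmp_link, link_path)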

The storage domain performed well and I was able to create images under it. After several hours, when I tried to create a new disk under it, I got the "Image does not exist in domain" error.


Expected results:
Domain monitoring should also cover the domain's symbolic link, not only its mount point, so that if the symbolic link is lost the domain becomes inactive.
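
For illustration, a minimal sketch of the kind of link check described above, assuming the link path is built from the pool and domain UUIDs; this is hypothetical and not how the vdsm domain monitor is actually implemented:

import os

RHEV_DATA_CENTER = "/rhev/data-center"

def domain_link_is_healthy(sp_uuid, sd_uuid):
    # Hypothetical check: the pool-level symlink must exist and resolve.
    link_path = os.path.join(RHEV_DATA_CENTER, sp_uuid, sd_uuid)
    # islink() is True even for a dangling symlink, so also verify that
    # the target it points to is reachable.
    return os.path.islink(link_path) and os.path.exists(link_path)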

Additional info: sos report from engine and vdsm logs

Comment 1 Nir Soffer 2014-02-25 17:33:18 UTC
(In reply to Elad from comment #0)
> Created attachment 867495 [details]
> 
> Expected results:
> Domain monitoring should also cover the domain's symbolic link, not only
> its mount point, so that if the symbolic link is lost the domain becomes
> inactive.

No, the domain monitor should not watch links and should not change its state because the link was deleted.

We should find out why the link was not created, or why it was removed, and fix that.

However, with the current state of logging in vdsm, this is not possible. We need a log entry for each path that is created, modified, or removed; once we have that, we can fix this.
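
For illustration, a minimal sketch of the kind of path-operation logging suggested here, using hypothetical wrapper names rather than existing vdsm functions:

import logging
import os

log = logging.getLogger("Storage.fileUtils")

def logged_symlink(target, link_path):
    # Record every link creation so a missing link can be traced later.
    log.info("Creating symlink %s -> %s", link_path, target)
    os.symlink(target, link_path)

def logged_unlink(path):
    # Record every removal for the same reason.
    log.info("Removing %s", path)
    os.unlink(path)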

I recommend closing this as CANTFIX for now.

Comment 2 Sandro Bonazzola 2014-03-04 09:20:55 UTC
This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

Comment 3 Allon Mureinik 2014-05-04 08:05:23 UTC
Fede, IIUC, your fix for bug 1091030 will address this issue too, right?

Comment 4 Federico Simoncelli 2014-05-14 23:03:57 UTC
Yes this looks like a duplicate of bug 1091030.

It's worth trying to see if the gerrit change 27466 fixed this as well.

Comment 5 Allon Mureinik 2014-05-14 23:16:05 UTC
(In reply to Federico Simoncelli from comment #4)
> Yes this looks like a duplicate of bug 1091030.
> 
> It's worth trying to see if the gerrit change 27466 fixed this as well.
Moving to ON_QA based on that statement

Comment 6 Allon Mureinik 2014-05-17 18:06:11 UTC
(In reply to Allon Mureinik from comment #5)
> (In reply to Federico Simoncelli from comment #4)
> > Yes this looks like a duplicate of bug 1091030.
> > 
> > It's worth trying to see if the gerrit change 27466 fixed this as well.
> Moving to ON_QA based on that statement

oVirt 3.4.1 has been released.