Bug 1304268 - [HC] Activate Gluster domain fails with StorageDomainDoesNotExist
Summary: [HC] Activate Gluster domain fails with StorageDomainDoesNotExist
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.18
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-3.6.6
Target Release: ---
Assignee: Fred Rolland
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-1
 
Reported: 2016-02-03 08:40 UTC by Sahina Bose
Modified: 2016-04-20 10:50 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-20 10:50:47 UTC
oVirt Team: Storage
Embargoed:
sabose: ovirt-3.6.z?
ylavi: planning_ack+
sabose: devel_ack?
rule-engine: testing_ack?


Attachments
vdsm.log (5.11 MB, text/plain), 2016-02-03 08:40 UTC, Sahina Bose
vmstore-mount.log (20.04 KB, text/plain), 2016-02-03 08:44 UTC, Sahina Bose
engine.log (99.60 KB, text/plain), 2016-04-06 17:29 UTC, Sahina Bose
vdsm.log.new (1.86 MB, text/plain), 2016-04-06 17:32 UTC, Sahina Bose
rhsdev9-vmstore.tar.gz (1.38 MB, application/x-gzip), 2016-04-06 17:36 UTC, Sahina Bose

Description Sahina Bose 2016-02-03 08:40:56 UTC
Created attachment 1120683 [details]
vdsm.log

Description of problem:

On a replica 3 Gluster volume that was added as the master data domain, a brick was removed and then added back. The Gluster volume heal was initiated and completed.
The storage domain then went to the Inactive state; trying to activate it from the engine fails with:
VDSM command failed: Storage domain does not exist: (u'12a2c867-dbc6-43aa-8b62-95c0f1805ac0',)

On the hypervisor, the gluster volume is mounted and accessible. 

rhsdev-docker2 ~]# cat /rhev/data-center/mnt/glusterSD/rhsdev-docker1.lab.eng.blr.redhat.com:_vmstore/12a2c867-dbc6-43aa-8b62-95c0f1805ac0/dom_md/metadata
CLASS=Data
DESCRIPTION=vmstore
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=Default
POOL_DOMAINS=12a2c867-dbc6-43aa-8b62-95c0f1805ac0:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=00000001-0001-0001-0001-000000000348
REMOTE_PATH=rhsdev-docker1.lab.eng.blr.redhat.com:/vmstore
ROLE=Master
SDUUID=12a2c867-dbc6-43aa-8b62-95c0f1805ac0
TYPE=GLUSTERFS
VERSION=3
_SHA_CKSUM=617714fa4b35c94f9e649706ada981f600d61dd7


gluster volume status vmstore
Status of volume: vmstore
Gluster process                                                    TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------------
Brick rhsdev-docker1.lab.eng.blr.redhat.com:/rhgs/vmstore/brick1   49154     0          Y       45798
Brick rhsdev-docker2.lab.eng.blr.redhat.com:/rhgs/vmstore/brick1   49154     0          Y       4474
Brick rhsdev9.lab.eng.blr.redhat.com:/rhgs/vmstore/brick1          49152     0          Y       45782
NFS Server on localhost                                            2049      0          Y       32936
Self-heal Daemon on localhost                                      N/A       N/A        Y       32991
NFS Server on rhsdev9.lab.eng.blr.redhat.com                       2049      0          Y       4752
Self-heal Daemon on rhsdev9.lab.eng.blr.redhat.com                 N/A       N/A        Y       4760
NFS Server on rhsdev-docker1.lab.eng.blr.redhat.com                2049      0          Y       48905
Self-heal Daemon on rhsdev-docker1.lab.eng.blr.redhat.com          N/A       N/A        Y       48914
 

[root@rhsdev-docker2 ~]# gluster volume heal vmstore info
Brick rhsdev-docker1.lab.eng.blr.redhat.com:/rhgs/vmstore/brick1
Number of entries: 0

Brick rhsdev-docker2.lab.eng.blr.redhat.com:/rhgs/vmstore/brick1
Number of entries: 0

Brick rhsdev9.lab.eng.blr.redhat.com:/rhgs/vmstore/brick1
Number of entries: 0


Gluster mount logs and vdsm logs from the hypervisor are attached.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. On a working RHEV + RHGS setup, remove a brick from the Gluster volume that is used as the master data storage domain, reducing the replica count from 3 to 2 (the brick was removed in order to remove a host that had been re-installed; the engine does not allow removing hosts that still contain bricks)
2. Add a brick back to the Gluster volume, increasing the replica count from 2 back to 3
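For reference, a minimal command sketch of steps 1-2, assuming the volume name and one of the brick paths shown in the volume status output above; the actual hostnames, brick paths and whether "force" is required depend on the setup:

# step 1: reduce the replica count from 3 to 2 by removing one brick (illustrative brick path)
gluster volume remove-brick vmstore replica 2 \
    rhsdev9.lab.eng.blr.redhat.com:/rhgs/vmstore/brick1 force

# step 2: add a brick back, raising the replica count from 2 to 3
# ("force" may be needed if the old brick path is being reused)
gluster volume add-brick vmstore replica 3 \
    rhsdev9.lab.eng.blr.redhat.com:/rhgs/vmstore/brick1 force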


Additional info:
There were problems with heal; it had to be triggered by adding a dummy file to the mount point and running "gluster volume heal vmstore full".
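A sketch of that heal workaround, using the mount path from the description above (the dummy file name is only illustrative):

# create a dummy file on the fuse mount so the heal has something to pick up
touch /rhev/data-center/mnt/glusterSD/rhsdev-docker1.lab.eng.blr.redhat.com:_vmstore/dummy_heal_trigger
# trigger a full heal and then check progress
gluster volume heal vmstore full
gluster volume heal vmstore info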

Comment 1 Sahina Bose 2016-02-03 08:44:47 UTC
Created attachment 1120684 [details]
vmstore-mount.log

Comment 2 Sahina Bose 2016-02-03 10:47:27 UTC
Is there any way to debug this, i.e. to check manually what vdsm is looking for in the gluster volume mount?

I have the setup as is, so I can run the checks.
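As a sketch of what can be checked manually from the hypervisor, based on the paths already quoted in the description (this only covers the domain layout that vdsm reads from the mount, not every validation vdsm performs):

# is the oVirt-managed mount alive at all?
ls /rhev/data-center/mnt/glusterSD/rhsdev-docker1.lab.eng.blr.redhat.com:_vmstore/
# does the domain directory still contain its metadata?
ls /rhev/data-center/mnt/glusterSD/rhsdev-docker1.lab.eng.blr.redhat.com:_vmstore/12a2c867-dbc6-43aa-8b62-95c0f1805ac0/dom_md/
cat /rhev/data-center/mnt/glusterSD/rhsdev-docker1.lab.eng.blr.redhat.com:_vmstore/12a2c867-dbc6-43aa-8b62-95c0f1805ac0/dom_md/metadata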

Comment 3 Sahina Bose 2016-02-04 11:52:32 UTC
I restarted vdsm on all 3 nodes, but still get the same error (engine events):
VDSM rhsdev9 command failed: Storage domain does not exist: (u'12a2c867-dbc6-43aa-8b62-95c0f1805ac0',)
VDSM rhsdev-docker1 command failed: Storage domain does not exist: (u'12a2c867-dbc6-43aa-8b62-95c0f1805ac0',)

Comment 5 Red Hat Bugzilla Rules Engine 2016-02-05 05:15:23 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 6 Ala Hino 2016-02-07 15:56:59 UTC
I will have to reproduce this and look into it to better understand what's going on.

Comment 9 Fred Rolland 2016-04-06 13:00:25 UTC
Sahina,

Is this a hosted engine setup?
Can you add the engine log?

Thanks,

Fred

Comment 10 Sahina Bose 2016-04-06 17:29:43 UTC
Created attachment 1144284 [details]
engine.log

Comment 11 Sahina Bose 2016-04-06 17:31:12 UTC
Yes, it is a hosted engine setup.
I ran into the error again, so I have all the logs.

In my setup the vmstore volume is running and all bricks are up. If I mount it on a tmp directory, it is accessible, but the /rhev/data-center/mnt/.. folder is not.
See below:

[root@rhsdev9 ~]# mount -t glusterfs 10.70.42.203:/vmstore tmp
[root@rhsdev9 ~]# cd tmp
[root@rhsdev9 tmp]# ll
total 0
drwxr-xr-x. 5 vdsm kvm 45 Apr  5 11:17 1499860e-9fcd-44d9-a41e-e54ad0a7ff84
-rwxr-xr-x. 1 vdsm kvm  0 Apr  6 22:10 __DIRECT_IO_TEST__               
[root@rhsdev9 tmp]# ll /rhev/data-center/mnt/glusterSD/10.70.42.203\:_vmstore 
ls: cannot access /rhev/data-center/mnt/glusterSD/10.70.42.203:_vmstore: Transport endpoint is not connected

Comment 12 Sahina Bose 2016-04-06 17:32:08 UTC
Created attachment 1144285 [details]
vdsm.log.new

Comment 13 Sahina Bose 2016-04-06 17:36:50 UTC
Created attachment 1144287 [details]
rhsdev9-vmstore.tar.gz

Comment 14 Sahina Bose 2016-04-06 17:39:05 UTC
Krutika, do you know why the situation described in comment 11 could happen?

Comment 15 Krutika Dhananjay 2016-04-07 01:46:59 UTC
Was there a core dump this time around?

-Krutika

Comment 16 Sahina Bose 2016-04-07 08:47:07 UTC
No, there are no core dumps in /var/log/core.

The volume options changed: the volume was stopped, SSL was enabled, and it was restarted. After that, the storage domain cannot be activated on one of the hosts. The other 2 hosts can connect to the storage domain.

Comment 17 Krutika Dhananjay 2016-04-07 08:53:15 UTC
Hi,

Pranith and I looked into your setup. It looks like you enabled the SSL options long after the volume had been mounted. This is a problem because, from that point onwards, the server refuses to accept requests from the existing clients; this is a known issue.

I will try to dig out the specific BZ where this issue was originally reported and post it here.

Meanwhile, there are two ways to get around the issue (a command sketch for the second one follows this comment):

1) Make sure the SSL settings are already enabled before you mount the volume.
2) If you enable the Gluster SSL settings AFTER the volume is mounted, the volume needs to be unmounted and mounted again on these old(er) clients.

-Krutika
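A minimal sketch of workaround 2 for the setup in comment 11, assuming the stale mount path seen there; normally the engine and vdsm drive the mount, so unmounting by hand is only a recovery step on the affected host:

# drop the stale fuse mount that reports "Transport endpoint is not connected"
# (use "umount -l" if the plain umount hangs)
umount /rhev/data-center/mnt/glusterSD/10.70.42.203:_vmstore
# then re-activate the storage domain from the engine so that vdsm mounts it again
# with the new (SSL-enabled) client configuration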

Comment 18 Fred Rolland 2016-04-07 08:56:33 UTC
It looks like the root cause is not related to oVirt.
Can you move this report to the right product/component?

Comment 19 Sahina Bose 2016-04-07 09:00:47 UTC
I will move the component once I'm sure we're not missing anything here, as the earlier issue (comment 1) had nothing to do with SSL.

When trying to connect a storage domain to a server, does vdsm remount the storage domain?

Comment 20 Fred Rolland 2016-04-07 10:22:39 UTC
In general, vdsm will not remount, it will only mount.
In the scenario described, I think you should de-activate and re-activate the storage domain.

Comment 21 Fred Rolland 2016-04-07 13:07:44 UTC
When you re-add the host, you need to make sure there is no old oVirt mount still configured.

Comment 22 Allon Mureinik 2016-04-10 11:25:48 UTC
(In reply to Krutika Dhananjay from comment #17)

> 2) If you enable gluster ssl settings AFTER the volume is mounted, we would
> need to unmount and mount the volume again on these old(er) clients.

Moving the domain/host to maintenance (which you should do anyway before changing any configurations on the storage side) would unmount the directory.
 Seems like this BZ is either a misuse, or a bug on RHGS' side that it doesn't allow maintenancing a host with a brick.

What am I missing?

Comment 23 Sahina Bose 2016-04-11 05:38:11 UTC
(In reply to Allon Mureinik from comment #22)
> (In reply to Krutika Dhananjay from comment #17)
> 
> > 2) If you enable gluster ssl settings AFTER the volume is mounted, we would
> > need to unmount and mount the volume again on these old(er) clients.
> 
> Moving the domain/host to maintenance (which you should do anyway before
> changing any configurations on the storage side) would unmount the directory.
>  Seems like this BZ is either a misuse, or a bug on RHGS' side that it
> doesn't allow maintenancing a host with a brick.
> 
> What am I missing?

Gluster only prevents removing a host if the host has a brick; there's no such check yet for maintenance.

In this case, we tried to move the storage domain and hosts to maintenance. The domain failed to move to maintenance as it was the master domain, and its status changed back to Active after the failure message.
One of the hosts, which was running the HE VM, failed to move to maintenance, and that caused this error.
I think the wrong step was that we clicked "confirm the host was rebooted" without unmounting the SD.

Is there a way to move the master domain to maintenance?

Comment 24 Allon Mureinik 2016-04-11 12:04:22 UTC
You should be able to simply move a master domain to maintenance. Liron, can you take a look? You had some recent work in that area.

Thanks!

Comment 25 Fred Rolland 2016-04-13 07:16:39 UTC
We are able to move a master domain to maintenance.
If it is the last SD, the Data Center will move to maintenance as well.
If a VM is running on this SD, you will not be able to move the SD to maintenance.

Also, you cannot move the host that is running the Hosted Engine to maintenance.

In any case, VDSM will not unmount an existing SD.

Sahina, is there another flow that we need to investigate?

Comment 26 Fred Rolland 2016-04-19 14:16:47 UTC
Sahina, any insights?

Comment 27 Sahina Bose 2016-04-20 10:50:47 UTC
Sorry, I have not had time to look into this before now. We did have issues with moving the master SD to maintenance: there were no VMs running on it, but there were no other active SDs either.

I tried to reproduce this again, and I was able to move the master SD to maintenance when there were other storage domains.

Closing this as it seems more like a workflow error. Will re-open if we reproduce the problem with the correct workflow.

