Bug 1249851 - Error attaching glusterfs storage domain: "Cannot acquire host id"
Summary: Error attaching glusterfs storage domain: "Cannot acquire host id"
Status: CLOSED INSUFFICIENT_DATA
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 3.5.2.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2
Assignee: Nir Soffer
QA Contact: Elad
Whiteboard: storage
Reported: 2015-08-04 00:56 UTC by punit
Modified: 2016-03-10 15:08 UTC

Doc Type: Bug Fix
Last Closed: 2015-12-22 20:22:54 UTC
oVirt Team: Storage
ylavi: ovirt-3.6.z?
ylavi: planning_ack?
ylavi: devel_ack?
ylavi: testing_ack?



Description punit 2015-08-04 00:56:01 UTC
Description of problem: I have a test oVirt cluster with one glusterfs storage domain (distributed replicated) and 3 HV nodes; it's all working fine. But when I try to add another glusterfs data storage domain (replica 3), I am not able to add the datastore to oVirt, and it fails with the following error :-

2015-07-29 10:05:55,194 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-8-thread-32) [751c5f25] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Cannot acquire host id: (u'd0e76dd4-c34a-456e-b7f6-02dc173a3cc1', SanlockException(90, 'Sanlock lockspace add failure', 'Message too long')), code = 661


Version-Release number of selected component (if applicable):
Ovirt Version :- 3.5.2.1-1.el7.centos
VDSM :- vdsm-4.16.20-0.el7.centos
Glusterfs version :- glusterfs-3.6.3-1.el7


[root@stor1 ~]# gluster volume info 3TB

Volume Name: 3TB
Type: Replicate
Volume ID: 78d1f376-178d-4b01-90c0-5dac90b50a6c
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: stor1:/bricks/b/vol2
Brick2: stor2:/bricks/b/vol2
Brick3: stor3:/bricks/b/vol2
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: enable
nfs.disable: off
[root@stor1 ~]#
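
For reference, the brick and self-heal state of the volume at the time can be checked with the following (run on one of the gluster servers); the commands below are a sketch, output not included:

[root@stor1 ~]# gluster volume status 3TB
[root@stor1 ~]# gluster volume heal 3TB info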
------------------------

Actual results: The storage domain fails to attach with the error "Cannot acquire host id".

Expected results: The storage domain should attach without any error.


Additional info: 

1. Engine logs :- http://paste.ubuntu.com/11971901/
2. Sanlock lock (HV1) :- http://paste.ubuntu.com/11971916/
3. VDSM Logs (HV1) :- http://paste.ubuntu.com/11971926/
4. Sanlock lock (HV2) :- http://paste.ubuntu.com/11971950/
5. VDSM Logs (HV2) :- http://paste.ubuntu.com/11971955/
6. Var Messages (HV1) :- http://paste.ubuntu.com/11971967/
7. Var Messages (HV2) :- http://paste.ubuntu.com/11971977/

Thanks,
Punit

Comment 1 Nir Soffer 2015-08-04 08:55:53 UTC
(In reply to punit from comment #0)
According to sanlock logs, sanlock cannot update the delta lease on the gluster
domain. In this case, failing to acquire a lock is expected.

Please attach the glusterfs logs for this volume to this bug. The logs should be found at /var/log/glusterfs/rhev_data_center*<gluster server>_<volume name>.log.

Sahina, can you get someone to look at this?

Comment 2 Nir Soffer 2015-08-04 08:57:23 UTC
David, can you look in sanlock logs and confirm that this is a gluster issue?

Comment 3 Nir Soffer 2015-08-04 09:46:07 UTC
Correction for glusterfs logs - the logs are found at:

/var/log/glusterfs/rhev-data-center-mnt-glusterSD-<server>:_<volume>.log

Comment 4 David Teigland 2015-08-04 14:14:11 UTC
Yes, sanlock gets i/o errors 103 (ECONNABORTED) and 107 (ENOTCONN) from storage.
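
A rough way to check whether plain direct I/O against the domain metadata hits the same errors, independent of sanlock (the mount path below is an assumption based on the standard oVirt gluster mount location; run on a hypervisor with the volume mounted):

  dd if=/rhev/data-center/mnt/glusterSD/<server>:_<volume>/<dom_uuid>/dom_md/ids of=/dev/null iflag=direct bs=1M count=1

Errors from this read should also show up in the gluster mount log under /var/log/glusterfs/ and in /var/log/messages.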

Comment 5 punit 2015-08-05 04:55:06 UTC
Hi Nir,

Here are the logs from both the hypervisor nodes :-

http://paste.ubuntu.com/12004403/
http://paste.ubuntu.com/12004410/
http://paste.ubuntu.com/12004788/
http://paste.ubuntu.com/12004825/

Comment 6 Nir Soffer 2015-08-11 19:05:22 UTC
As I said in comment 1, someone from gluster should check these logs.

Adding back lost needinfo for Sahina.

Comment 7 Nir Soffer 2015-08-11 19:06:21 UTC
Changing category to "sd-gluster" since this is not a sanlock issue.

Comment 8 Sahina Bose 2015-08-24 04:25:52 UTC
Ravi, can you look at this?

Comment 9 Ravishankar N 2015-08-24 15:22:32 UTC
Could someone from the oVirt team explain the steps that are carried out, from a gluster POV, when another glusterfs data storage domain (replica 3) is added as described in the report?

I'm assuming 'adding a datastorage' involves the following steps.
1. Forming a storage pool of stor{1..3} using `gluster peer probe`
2. Creating a replica 3 volume and starting it
3. FUSE Mounting the volume on *all* the hypervisors.
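
In gluster CLI terms, a rough sketch of those assumed steps (brick paths taken from the volume info in the report; the FUSE mount itself is normally done by oVirt/VDSM):

  gluster peer probe stor2
  gluster peer probe stor3
  gluster volume create 3TB replica 3 stor1:/bricks/b/vol2 stor2:/bricks/b/vol2 stor3:/bricks/b/vol2
  gluster volume start 3TB
  # on each hypervisor:
  mount -t glusterfs stor1:/3TB /rhev/data-center/mnt/glusterSD/stor1:_3TB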

- Is this correct? 
- At what point is the adding deemed successful? Will it fail if the FUSE mount is unmounted for some reason? From the logs given in comment #5, I see that the volume 3TB is being mounted and unmounted multiple times. 
- Does sanlock come into play on a volume just created and having no VM images yet?

Comment 10 Nir Soffer 2015-08-24 18:13:31 UTC
(In reply to Ravishankar N from comment #9)
> Could someone from the oVirt team explain the steps that are carried out,
> from a gluster POV, when another glusterfs data storage domain (replica 3)
> is added as described in the report?
> 
> I'm assuming 'adding a datastorage' involves the following steps.
> 1. Forming a storage pool of stor{1..3} using `gluster peer probe`
> 2. Creating a replica 3 volume and starting it

I don't know about this; we don't have any information about
it in the bug.

punit, please confirm the steps above.

> 3. FUSE Mounting the volume on *all* the hypervisors.

Right - and then:

4. Sanlock tries to acquire a host id on *all* hosts.
   This includes writing to each host's block in the
   "<dom_uuid>/dom_md/ids" file, and reading the other hosts' blocks.
   According to the sanlock log (see comment 4), sanlock gets
   ECONNABORTED and ENOTCONN from storage at this point.
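
   For illustration only, this is roughly what adding the lockspace looks like
   with the sanlock CLI (the lockspace name, host id and path are placeholders):

   sanlock client add_lockspace \
       -s <dom_uuid>:<host_id>:/rhev/data-center/mnt/glusterSD/<server>:_<volume>/<dom_uuid>/dom_md/ids:0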

> - At what point is the adding deemed successful? 

When sanlock can acquire the host id. Before acquiring the host id, a host
is not allowed to touch the shared storage.

> Will it fail if the FUSE
> mount is unmounted for some reason? 

If it was unmounted before sanlock acquired the host id, it will fail.

If it fails after that, the storage domain will become non-operational
later, when storage domain monitoring fails to read from storage.

> From the logs given in comment #5, I see
> that the volume 3TB is being mounted and unmounted multiple times. 
> - Does sanlock come into play on a volume just created and having no VM
> images yet?

Yes, as described in step 4.

Comment 11 Nir Soffer 2015-08-24 18:18:21 UTC
punit, we need full vdsm logs, showing the entire flow starting from the point when you try to create a gluster storage domain, until it fails.

And we need the full logs attached to this bug. I cannot download the
logs via the links you posted; it seems that downloading requires
registration on that site.

Please also check and answer Ravishankar's questions from comment 9.
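
Something along these lines on each hypervisor would give us what we need (default log locations; adjust the paths if yours differ):

  tar czf hv-logs.tar.gz \
      /var/log/vdsm/vdsm.log* \
      /var/log/sanlock.log \
      /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log \
      /var/log/messages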

Comment 12 punit 2015-09-07 12:50:56 UTC
Hi,

The logs are already attached, and it's a free website; no need to register...

Additional info: 

1. Engine logs :- http://paste.ubuntu.com/11971901/
2. Sanlock lock (HV1) :- http://paste.ubuntu.com/11971916/
3. VDSM Logs (HV1) :- http://paste.ubuntu.com/11971926/
4. Sanlock lock (HV2) :- http://paste.ubuntu.com/11971950/
5. VDSM Logs (HV2) :- http://paste.ubuntu.com/11971955/
6. Var Messages (HV1) :- http://paste.ubuntu.com/11971967/
7. Var Messages (HV2) :- http://paste.ubuntu.com/11971977/

Here are the logs from both the hypervisor nodes :-

http://paste.ubuntu.com/12004403/
http://paste.ubuntu.com/12004410/
http://paste.ubuntu.com/12004788/
http://paste.ubuntu.com/12004825/

Comment 13 Nir Soffer 2015-09-07 13:23:13 UTC
(In reply to punit from comment #12)
Punit, I cannot download the logs from that site - you don't have an issue
since you have an account there, but I don't. For example:
http://paste.ubuntu.com/11971926/plain/

Also I need *full* vdsm logs.

Please check comment 11 again. If we do not get the requested info, we will
have to close this bug.

Comment 14 Nir Soffer 2015-09-07 13:24:53 UTC
Adding back needinfo for ravishankar for comment 10.

Comment 15 punit 2015-09-08 01:10:21 UTC
Hi,

Please try with the following url :- 

http://ur1.ca/npig8

Also, if you try to open this url, it will open http://paste.ubuntu.com/11971926/
instead of http://paste.ubuntu.com/11971926/plain/

I don't have an Ubuntu account either, but I can easily see the logs without the
/plain/ postfix in the url...

As I cannot reproduce the logs again, please check them if you can; otherwise you
may consider closing this bug.

Thanks,
punit

Comment 16 Red Hat Bugzilla Rules Engine 2015-10-19 10:51:51 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 17 Nir Soffer 2015-12-22 20:18:09 UTC
Adding back needinfo for ravishankar for comment 10.

Comment 18 Nir Soffer 2015-12-22 20:22:54 UTC
(In reply to punit from comment #15)
> Please try with the following url :- 
> ...

These urls do not work for me. I need full vdsm logs on my machine to investigate
this.

Closing for now; please reopen when you can attach full logs to this bug.

Comment 19 Ravishankar N 2016-01-11 13:02:42 UTC
Removing the need-info in my name as the bug is closed.

