Description of problem:

Gluster volume creation is failing with the error "host is not in 'Peer in Cluster' state". However, the host is in 'Peer in Cluster' state as per the output of "gluster peer status".

The following things were done differently in this scenario, because of which I may have hit this issue:
# peer probes were done using hostnames (not FQDNs, just short hostnames); for details refer to the "Additional info:" section
# in the volume create command, IPs were used
# there was a 7-day gap between the peer probes and the volume create command

Version-Release number of selected component (if applicable):
glusterfs-server-3.4.0.57rhs-1.el6rhs.x86_64

How reproducible:
Hit this error on the first attempt. Will update later on how frequently it can be reproduced.

Steps to Reproduce:
1. Do peer probes using hostnames (not FQDNs, just short hostnames)
2. Create a distribute-replicate volume using the RHS nodes' IPs

Actual results:
Volume create fails with "Host 10.16.157.78 is not in 'Peer in Cluster' state".

Expected results:
Volume creation succeeds, since all peers are in 'Peer in Cluster' state.

Additional info:
###########################################
# gluster peer probe gqac027
# gluster peer probe gqac028
# gluster peer probe gqac029
##########################################
[root@gqac026 ~]# cat /var/log/glusterfs/.cmd_log_history
[2014-01-13 12:17:30.101816]  : peer probe gqac027 : SUCCESS
[2014-01-13 12:17:35.273222]  : peer probe gqac028 : SUCCESS
[2014-01-13 12:17:39.578413]  : peer probe gqac029 : SUCCESS
[2014-01-21 07:12:58.499631]  : v create ctdb_meta replica 2 10.16.157.75:/rhs/brick1/ctdb_meta_b1 10.16.157.78:/rhs/brick1/ctdb_meta_b1 10.16.157.81:/rhs/brick1/ctdb_meta_b2 10.16.157.84:/rhs/brick1/ctdb_meta_b2 : FAILED : Host 10.16.157.78 is not in 'Peer in Cluster' state
###########################################
[root@gqac026 ~]# gluster peer status
Number of Peers: 3

Hostname: gqac029
Uuid: b316bf96-f3f5-4a21-9ad0-8bddfcd94076
State: Peer in Cluster (Connected)

Hostname: gqac028
Uuid: a6c9e7b5-c28f-4049-a1c3-5bdcff061a62
State: Peer in Cluster (Connected)

Hostname: gqac027
Uuid: 317febbb-d02f-481b-b576-c68e3ececb65
State: Peer in Cluster (Connected)
#######################################
[root@gqac026 ~]# ping gqac027
PING gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78) 56(84) bytes of data.
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=1 ttl=64 time=0.222 ms
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=2 ttl=64 time=0.201 ms
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=3 ttl=64 time=0.191 ms
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=4 ttl=64 time=0.228 ms
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=5 ttl=64 time=0.181 ms
#########################################
[root@gqac026 ~]# cat /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
[2014-01-20 12:06:16.296086] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-21 06:57:49.755877] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-21 07:12:58.478502] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.478541] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.478569] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.478958] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.479003] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.479030] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.498138] E [glusterd-utils.c:5600:glusterd_new_brick_validate] 0-management: Host 10.16.157.78 is not in 'Peer in Cluster' state
[2014-01-21 07:12:58.498154] E [glusterd-volume-ops.c:795:glusterd_op_stage_create_volume] 0-management: Host 10.16.157.78 is not in 'Peer in Cluster' state
[2014-01-21 07:12:58.498162] E [glusterd-syncop.c:904:gd_stage_op_phase] 0-management: Staging of operation 'Volume Create' failed on localhost : Host 10.16.157.78 is not in 'Peer in Cluster' state
[2014-01-21 07:13:08.931433] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:13:08.931467] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:13:08.931484] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:17:08.920448] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
Re-ran the volume create command using hostnames instead of IPs, and the volume creation worked fine:

[root@gqac026 ~]# gluster v create ctdb_meta replica 2 gqac026:/rhs/brick1/ctdb_meta_t1 gqac027:/rhs/brick1/ctdb_meta_t1 gqac028:/rhs/brick1/ctdb_meta_t2 gqac029:/rhs/brick1/ctdb_meta_t2
volume create: ctdb_meta: success: please start the volume to access data

Another observation: the short hostname (without the FQDN) is not directly DNS-resolvable, but it still works in the volume create command because the domain is in the "search" line of /etc/resolv.conf.

[root@gqac026 ~]# cat /etc/resolv.conf
; generated by /sbin/dhclient-script
search sbu.lab.eng.bos.redhat.com
nameserver 10.XXX.XXX.XXX
nameserver 10.XXX.XXX.XXX
nameserver 10.XXX.XXX.XXX

[root@gqac026 ~]# dig gqac027

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6_4.6 <<>> gqac027
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 3322
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;gqac027.			IN	A

;; AUTHORITY SECTION:
.			9841	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2014012100 1800 900 604800 86400

;; Query time: 0 msec
;; SERVER: 10.16.36.29#53(10.16.36.29)
;; WHEN: Tue Jan 21 05:00:53 2014
;; MSG SIZE  rcvd: 100
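To illustrate why the short name works everywhere except in dig: the libc stub resolver expands an unqualified name with each domain from the resolv.conf "search" line before querying DNS. A minimal sketch of that expansion step; qualify() is a hypothetical helper written for this comment, and only the hostname and domain below come from this bug report:

```shell
# Hypothetical sketch of search-list expansion (not a real resolver).
qualify() {
    name=$1; shift
    case $name in
        *.*) echo "$name" ;;              # already qualified: use as-is
        *)   for d in "$@"; do            # unqualified: append each
                 echo "$name.$d"          # search domain in turn
             done ;;
    esac
}

qualify gqac027 sbu.lab.eng.bos.redhat.com
# prints: gqac027.sbu.lab.eng.bos.redhat.com
```

This also explains the NXDOMAIN above: dig does not apply the resolv.conf search list by default (it needs the +search option), while ping and glusterd go through the libc resolver, which does.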
Happens often in RHSC testing, but has no relation to IP/hostname/DNS usage. The error is seen when creating a volume through RHSC and is intermittent; it occurs about 1% of the time through RHSC.

Running the following script on a gluster server fails about half the time:

#!/bin/sh -e

HOST=10.14.16.145
SELF=10.14.16.107
gluster peer probe $HOST
gluster --mode=script vol create blah $SELF:/bricks/blah $HOST:/bricks/blah force
gluster --mode=script vol delete blah
gluster peer detach $HOST

Is there a requirement to put a sleep between the probe and the volume creation?
(In reply to Dustin Tsang from comment #3)
> is there a requirement to put a sleep between probe and volume creation?

Yes. As noted in Bug 1020421, comment 4, when "gluster peer probe" is run, the CLI returns success even though the peer probe has not actually completed yet. So a wait before volume create would prevent this error; either that, or check "gluster peer status" to make sure the peer has been added.
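Rather than a fixed sleep, the wait can be made deterministic by polling "gluster peer status" until the peer reports 'Peer in Cluster'. A rough sketch, assuming the same commands as the script in comment #3; wait_for_match and its parameters are an illustrative helper, not part of the gluster CLI:

```shell
#!/bin/sh
# Poll a command until its output matches a pattern, or give up after
# N attempts. A more careful version would grep only the stanza of the
# newly probed host rather than the whole "peer status" output.
wait_for_match() {
    # $1: command to run, $2: pattern to look for, $3: max attempts (default 30)
    cmd=$1; pattern=$2; attempts=${3:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if $cmd | grep -q "$pattern"; then
            return 0          # pattern seen: peer has reached the wanted state
        fi
        i=$((i + 1))
        sleep 1               # retry once a second
    done
    return 1                  # gave up: caller should treat this as a failure
}

# In the reproducer script it would slot in like this (assumed usage):
#   gluster peer probe "$HOST"
#   wait_for_match "gluster peer status" "Peer in Cluster" 30 || exit 1
#   gluster --mode=script vol create blah ...
```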
adding 3.0 flag and removing 2.1.z
We are working on improving peer identification upstream, which, once done, should solve this kind of problem. But that cannot be completed within the Denali code freeze, so it would be better if this bug is not targeted for Denali.
This bug is fixed upstream as part of the improvements to peer identification.

Upstream link: http://review.gluster.org/#/c/8238/
Per discussion with Anand N, this is not a must-fix for 2.1. Hence closing this.

*** This bug has been marked as a duplicate of bug 1213245 ***