Description of problem:

Gluster volume creation is failing with the error "host is not in 'Peer in Cluster' state". However, the host is in 'Peer in Cluster' state as per the output of "gluster peer status".

The following things were done differently in this scenario, because of which I may have hit this issue:
# peer probes were done using hostnames (not FQDNs, just short hostnames); for details refer to the "Additional info:" section
# in the volume create command, IPs were used
# there was a 7-day gap between the peer probes and the volume create command

Version-Release number of selected component (if applicable):
glusterfs-server-3.4.0.57rhs-1.el6rhs.x86_64

How reproducible:
Hit this error on the first attempt. Will update later on how frequently it can be reproduced.

Steps to Reproduce:
1. Do peer probes using hostnames (not FQDNs, just short hostnames)
2. Create a distribute-replicate volume using the RHS nodes' IPs

Actual results:
Volume create fails with "Host 10.16.157.78 is not in 'Peer in Cluster' state".

Expected results:
Volume creation succeeds, since all peers are in 'Peer in Cluster' state.

Additional info:
###########################################
# gluster peer probe gqac027
# gluster peer probe gqac028
# gluster peer probe gqac029
##########################################
[root@gqac026 ~]# cat /var/log/glusterfs/.cmd_log_history
[2014-01-13 12:17:30.101816]  : peer probe gqac027 : SUCCESS
[2014-01-13 12:17:35.273222]  : peer probe gqac028 : SUCCESS
[2014-01-13 12:17:39.578413]  : peer probe gqac029 : SUCCESS
[2014-01-21 07:12:58.499631]  : v create ctdb_meta replica 2 10.16.157.75:/rhs/brick1/ctdb_meta_b1 10.16.157.78:/rhs/brick1/ctdb_meta_b1 10.16.157.81:/rhs/brick1/ctdb_meta_b2 10.16.157.84:/rhs/brick1/ctdb_meta_b2 : FAILED : Host 10.16.157.78 is not in 'Peer in Cluster' state
###########################################
[root@gqac026 ~]# gluster peer status
Number of Peers: 3

Hostname: gqac029
Uuid: b316bf96-f3f5-4a21-9ad0-8bddfcd94076
State: Peer in Cluster (Connected)

Hostname: gqac028
Uuid: a6c9e7b5-c28f-4049-a1c3-5bdcff061a62
State: Peer in Cluster (Connected)

Hostname: gqac027
Uuid: 317febbb-d02f-481b-b576-c68e3ececb65
State: Peer in Cluster (Connected)
#######################################
[root@gqac026 ~]# ping gqac027
PING gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78) 56(84) bytes of data.
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=1 ttl=64 time=0.222 ms
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=2 ttl=64 time=0.201 ms
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=3 ttl=64 time=0.191 ms
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=4 ttl=64 time=0.228 ms
64 bytes from gqac027.sbu.lab.eng.bos.redhat.com (10.16.157.78): icmp_seq=5 ttl=64 time=0.181 ms
#########################################
[root@gqac026 ~]# cat /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
[2014-01-20 12:06:16.296086] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-21 06:57:49.755877] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-21 07:12:58.478502] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.478541] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.478569] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.478958] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.479003] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.479030] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:12:58.498138] E [glusterd-utils.c:5600:glusterd_new_brick_validate] 0-management: Host 10.16.157.78 is not in 'Peer in Cluster' state
[2014-01-21 07:12:58.498154] E [glusterd-volume-ops.c:795:glusterd_op_stage_create_volume] 0-management: Host 10.16.157.78 is not in 'Peer in Cluster' state
[2014-01-21 07:12:58.498162] E [glusterd-syncop.c:904:gd_stage_op_phase] 0-management: Staging of operation 'Volume Create' failed on localhost : Host 10.16.157.78 is not in 'Peer in Cluster' state
[2014-01-21 07:13:08.931433] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:13:08.931467] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:13:08.931484] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-01-21 07:17:08.920448] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
Re-ran the volume create command using hostnames instead of IPs, and the volume creation worked fine:

[root@gqac026 ~]# gluster v create ctdb_meta replica 2 gqac026:/rhs/brick1/ctdb_meta_t1 gqac027:/rhs/brick1/ctdb_meta_t1 gqac028:/rhs/brick1/ctdb_meta_t2 gqac029:/rhs/brick1/ctdb_meta_t2
volume create: ctdb_meta: success: please start the volume to access data

Another observation: the short hostname (without the FQDN) is not directly DNS-resolvable, but it still works in the volume create command because the domain is in the "search" line of /etc/resolv.conf.

[root@gqac026 ~]# cat /etc/resolv.conf
; generated by /sbin/dhclient-script
search sbu.lab.eng.bos.redhat.com
nameserver 10.XXX.XXX.XXX
nameserver 10.XXX.XXX.XXX
nameserver 10.XXX.XXX.XXX

[root@gqac026 ~]# dig gqac027

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6_4.6 <<>> gqac027
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 3322
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;gqac027.			IN	A

;; AUTHORITY SECTION:
.			9841	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2014012100 1800 900 604800 86400

;; Query time: 0 msec
;; SERVER: 10.16.36.29#53(10.16.36.29)
;; WHEN: Tue Jan 21 05:00:53 2014
;; MSG SIZE  rcvd: 100
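To illustrate why the short name works everywhere except in dig: the libc stub resolver expands an unqualified name with each domain from the resolv.conf "search" line before querying DNS. A minimal sketch of that expansion step; qualify() is a hypothetical helper written for this comment, and only the hostname and domain below come from this bug report:

```shell
# Hypothetical sketch of search-list expansion (not a real resolver).
qualify() {
    name=$1; shift
    case $name in
        *.*) echo "$name" ;;              # already qualified: use as-is
        *)   for d in "$@"; do            # unqualified: append each
                 echo "$name.$d"          # search domain in turn
             done ;;
    esac
}

qualify gqac027 sbu.lab.eng.bos.redhat.com
# prints: gqac027.sbu.lab.eng.bos.redhat.com
```

This also explains the NXDOMAIN above: dig does not apply the resolv.conf search list by default (it needs the +search option), while ping and glusterd go through the libc resolver, which does.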
Happens often in RHSC testing, but has no relation to IP/hostname/DNS usage. The error is seen when creating a volume through RHSC and is intermittent; it occurs about 1% of the time through RHSC.

Running the following script on a gluster server fails about half the time:

#!/bin/sh -e

HOST=10.14.16.145
SELF=10.14.16.107
gluster peer probe $HOST
gluster --mode=script vol create blah $SELF:/bricks/blah $HOST:/bricks/blah force
gluster --mode=script vol delete blah
gluster peer detach $HOST

Is there a requirement to put a sleep between the probe and the volume creation?
(In reply to Dustin Tsang from comment #3)
> is there a requirement to put a sleep between probe and volume creation?

Yes. As noted in Bug 1020421, comment 4, when "gluster peer probe" is run, the CLI returns success even though the peer probe has not actually completed yet. So a wait before volume create would prevent this error; either that, or check "gluster peer status" to make sure the peer has been added.
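Rather than a fixed sleep, the wait can be made deterministic by polling "gluster peer status" until the peer reports 'Peer in Cluster'. A rough sketch, assuming the same commands as the script in comment #3; wait_for_match and its parameters are an illustrative helper, not part of the gluster CLI:

```shell
#!/bin/sh
# Poll a command until its output matches a pattern, or give up after
# N attempts. A more careful version would grep only the stanza of the
# newly probed host rather than the whole "peer status" output.
wait_for_match() {
    # $1: command to run, $2: pattern to look for, $3: max attempts (default 30)
    cmd=$1; pattern=$2; attempts=${3:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if $cmd | grep -q "$pattern"; then
            return 0          # pattern seen: peer has reached the wanted state
        fi
        i=$((i + 1))
        sleep 1               # retry once a second
    done
    return 1                  # gave up: caller should treat this as a failure
}

# In the reproducer script it would slot in like this (assumed usage):
#   gluster peer probe "$HOST"
#   wait_for_match "gluster peer status" "Peer in Cluster" 30 || exit 1
#   gluster --mode=script vol create blah ...
```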
adding 3.0 flag and removing 2.1.z
We are working on improving peer identification upstream, which, once done, should solve this kind of problem. But that cannot be completed within the Denali code freeze, so it would be better if this bug is not targeted for Denali.
This bug is fixed upstream as part of the improvements to peer identification.

Upstream link: http://review.gluster.org/#/c/8238/
Per discussion with Anand N, this is not a must-fix for 2.1. Hence closing this.

*** This bug has been marked as a duplicate of bug 1213245 ***