Description of problem:
After updating an existing cluster from 3.10 to 4.1, our setup stopped working. The cluster runs in an IPv6-only environment. As I can see in tcpdump, glusterfs 4.1 requests only the A record (no AAAA) for the other cluster member.

Version-Release number of selected component (if applicable):
glusterfs 4.1.1

How reproducible:
Set up glusterfs 4.1 in an IPv6-only environment.

Actual results:
The cluster doesn't work because it doesn't see the other nodes.

Expected results:
The cluster works.

Additional info:
In /var/log/glusterfs/glusterd.log:

[2018-07-30 13:20:05.088216] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host srv1.prod
[2018-07-30 13:20:05.088216] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host srv1.prod

~ # gluster pool list
UUID                                    Hostname        State
997fa0f6-c8d0-4207-8cef-95f25d1b9634    srv1.prod       Disconnected
c0a17e44-ea23-491f-805e-495cbd09bdf8    localhost       Connected

~ # gluster volume info test-volume
Volume Name: test-volume
Type: Distribute
Volume ID: 0a0be90a-5dd0-4d8d-98bc-0a2d9cfaf9f1
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: srv1.prod:/gl
Brick2: srv2.prod:/gl2
Options Reconfigured:
transport.address-family: inet6
nfs.disable: on
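For checking several nodes at once, the pool list output above can be filtered for peers that are not connected. This is only a rough sketch: disconnected_peers is a hypothetical helper, not part of gluster, and it assumes the three whitespace-separated columns (UUID, Hostname, State) seen in the pasted output.

```python
def disconnected_peers(pool_list_output):
    """Return hostnames whose State column is anything other than 'Connected'."""
    peers = []
    for line in pool_list_output.splitlines():
        fields = line.split()
        # Skip blank lines and the 'UUID Hostname State' header row.
        if len(fields) != 3 or fields[0] == "UUID":
            continue
        _uuid, hostname, state = fields
        if state != "Connected":
            peers.append(hostname)
    return peers


# Sample taken from the `gluster pool list` output in this report.
sample = """\
UUID                                    Hostname        State
997fa0f6-c8d0-4207-8cef-95f25d1b9634    srv1.prod       Disconnected
c0a17e44-ea23-491f-805e-495cbd09bdf8    localhost       Connected
"""

if __name__ == "__main__":
    print(disconnected_peers(sample))
```

With the sample from this report, only srv1.prod is returned.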
Possibly related issues:
https://bugzilla.redhat.com/show_bug.cgi?id=1191072
https://bugzilla.redhat.com/show_bug.cgi?id=1277054
With TRACE log level:

[2018-08-01 08:33:55.990804] T [rpc-clnt.c:404:rpc_clnt_reconnect] 0-management: attempting reconnect
[2018-08-01 08:33:55.991100] T [socket.c:3283:socket_connect] 0-management: connecting 0x563eb6dd3b70, state=0 gen=0 sock=-1
[2018-08-01 08:33:55.991413] D [dict.c:1126:data_to_uint16] (-->/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(+0xafd0) [0x7f108f10efd0] -->/usr/lib64/glusterfs/4.1.2/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x111) [0x7f108f112531] -->/lib64/libglusterfs.so.0(data_to_uint16+0x161) [0x7f109d436051] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
[2018-08-01 08:33:55.991724] D [logging.c:1983:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
[2018-08-01 08:33:52.990678] T [MSGID: 0] [syncop.c:1031:__synclock_unlock] 0-: Unlock success 2373232384, remaining locks=0
[2018-08-01 08:33:55.991723] T [MSGID: 0] [common-utils.c:299:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: srv1.prod
[2018-08-01 08:33:55.992678] E [MSGID: 101075] [common-utils.c:317:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2018-08-01 08:33:55.992817] E [name.c:274:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host srv1.prod

I also tried a build with the --with-ipv6default flag, and setting transport.socket.source-addr to an IPv6 address together with transport.address-family set to inet6. Nothing helped. And gf_resolve_ip6, at least, still uses the AF_INET family.
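For anyone reproducing this outside of glusterd: the failure in the trace is what a resolver lookup restricted to AF_INET looks like for a name that only has an IPv6 address. Below is a minimal Python sketch of the difference. It is not the gluster code path; resolve_with_family is a hypothetical helper, and the ::1 literal with AI_NUMERICHOST is used only so the example runs without DNS.

```python
import socket


def resolve_with_family(host, port, family, numeric_only=False):
    """Return the sorted addresses getaddrinfo() yields for one address family.

    An empty list means the lookup failed for that family.
    """
    # AI_NUMERICHOST disables DNS and accepts address literals only.
    flags = socket.AI_NUMERICHOST if numeric_only else 0
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM, 0, flags)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # 24007 is the port that appears in the glusterd probe log in this report.
    for family in (socket.AF_INET, socket.AF_INET6):
        print(family.name, resolve_with_family("::1", 24007, family, numeric_only=True))
```

On an IPv6-only host, replacing ::1 with the peer hostname (for example srv1.prod) and dropping numeric_only should show the same pattern: an empty result for AF_INET and the peer's address for AF_INET6.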
Actually, IPv6 has been broken since 3.11.
I'm sorry, I was wrong about --with-ipv6default. There is a typo in the rpm spec:
https://src.fedoraproject.org/rpms/glusterfs/blob/master/f/glusterfs.spec#_63
The correct flag is --with-ipv6-default. But it doesn't help on EL7 because of the old libtirpc:
https://src.fedoraproject.org/rpms/glusterfs/blob/f28/f/glusterfs.spec#_71
So I built a newer libtirpc, rebuilt glusterfs with --with-ipv6-default, and glusterfs started working!
I have tested the 4.1.4 release and IPv6 is still not working:

# gluster --version
glusterfs 4.1.4
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>

# gluster peer probe gluster-1
peer probe: failed: Probe returned with Transport endpoint is not connected

# ping6 gluster-1
PING gluster-1(gluster-1 (3010::13:199:0:0:42)) 56 data bytes
64 bytes from gluster-1 (3010::13:199:0:0:42): icmp_seq=1 ttl=64 time=1.54 ms
64 bytes from gluster-1 (3010::13:199:0:0:42): icmp_seq=2 ttl=64 time=0.439 ms

[2018-10-03 19:06:25.009874] I [MSGID: 106487] [glusterd-handler.c:1244:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req gluster-1 24007
[2018-10-03 19:06:25.010729] I [MSGID: 106128] [glusterd-handler.c:3635:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: gluster-1 (24007)
[2018-10-03 19:06:25.028897] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2018-10-03 19:06:25.029031] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2018-10-03 19:06:25.033267] E [MSGID: 101075] [common-utils.c:312:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2018-10-03 19:06:25.033366] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host gluster-1
[2018-10-03 19:06:25.033538] I [MSGID: 106498] [glusterd-handler.c:3561:glusterd_friend_add] 0-management: connect returned 0
[2018-10-03 19:06:25.033657] I [MSGID: 106004] [glusterd-handler.c:6382:__glusterd_peer_rpc_notify] 0-management: Peer <gluster-1> (<00000000-0000-0000-0000-000000000000>), in state <Establishing Connection>, has disconnected from glusterd.
Yan, did you try building glusterfs with the `--with-ipv6-default` flag? For me, it works fine with this flag.
I didn't rebuild the product. But I saw that the change below is included in 4.1 and assumed it does the same thing; otherwise I would expect a new fix.
#1562052: build: revert configure --without-ipv6-default behaviour
We fixed a few things with IPv6 in glusterfs-6.0 (6.1 is now out), please upgrade. (https://bugzilla.redhat.com/1635863)