1417521 – [SNAPSHOT] With all USS plugin enable .snaps directory is not visible in cifs mount as well as windows mount

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	glusterd
Sub Component:
Version:	3.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1416251
Blocks:	1411270

Reported:	2017-01-30 04:09 UTC by Atin Mukherjee
Modified:	2017-03-06 17:44 UTC
CC List:	10 users
Fixed In Version:	glusterfs-3.10.0
Clone Of:	1416251
Environment:
Last Closed:	2017-03-06 17:44:28 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Atin Mukherjee 2017-01-30 04:09:14 UTC

+++ This bug was initially created as a clone of Bug #1416251 +++

+++ This bug was initially created as a clone of Bug #1411270 +++

Description of problem:
.snaps directory is not visible in cifs mount as well as windows smb mount, even after enabling USS & VSS plugins.

Over a fuse mount the .snaps directory is visible and accessible.

Currently this issue is seen on an SSL-enabled cluster and on another cluster, without SSL, that is set up over an EC volume.

The information below was collected from the setup with the EC volume.

Disconnect messages are seen in the client logs:

[2017-01-09 09:32:57.751250] E [socket.c:2309:socket_connect_finish] 0-test-ec-snapd-client: connection to ::1:49158 failed (Connection refused)

[2017-01-09 09:32:57.751291] T [socket.c:721:__socket_disconnect] 0-test-ec-snapd-client: disconnecting 0x7fbf60061810, state=2 gen=0 sock=53


Version-Release number of selected component (if applicable):
samba-client-libs-4.4.6-4.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-11.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Over an EC volume 2(4+2), enable USS & show-snapshot-directory
2. Enable all VSS plugins
3. Take a snapshot
4. Activate the snapshot
5. Do a cifs mount and also mount the volume on a Windows client machine (say Windows 10)
6. Check for the .snaps directory in the cifs mount as well as the Windows mount

Actual results:
.snaps directory is not seen or accessible or present

Expected results:
.snaps directory should be present

Additional info:

[2017-01-09 09:32:57.751250] E [socket.c:2309:socket_connect_finish] 0-test-ec-snapd-client: connection to ::1:49158 failed (Connection refused)
[2017-01-09 09:32:57.751291] T [socket.c:721:__socket_disconnect] 0-test-ec-snapd-client: disconnecting 0x7fbf60061810, state=2 gen=0 sock=53
[2017-01-09 09:32:57.751312] D [socket.c:683:__socket_shutdown] 0-test-ec-snapd-client: shutdown() returned -1. Transport endpoint is not connected
[2017-01-09 09:32:57.751327] D [socket.c:728:__socket_disconnect] 0-test-ec-snapd-client: __socket_teardown_connection () failed: Transport endpoint is not connected

[2017-01-09 09:32:57.751340] D [socket.c:2403:socket_event_handler] 0-transport: disconnecting now

[2017-01-09 09:32:57.752014] D [rpc-clnt-ping.c:93:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbf73b1b602] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8e)[0x7fbf74011b9e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5b)[0x7fbf7400dfbb] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fbf7400e874] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fbf7400a893] ))))) 0-: ::1:49158: ping timer event already removed

[2017-01-09 09:32:57.752064] D [MSGID: 0] [client.c:2264:client_rpc_notify] 0-test-ec-snapd-client: got RPC_CLNT_DISCONNECT

[2017-01-09 09:32:57.752095] D [MSGID: 0] [event-epoll.c:587:event_dispatch_epoll_handler] 0-epoll: generation bumped on idx=13 from gen=2764 to slot->gen=2765, fd=53, slot->fd=53

[2017-01-09 09:33:01.733914] T [rpc-clnt.c:422:rpc_clnt_reconnect] 0-test-ec-snapd-client: attempting reconnect

[2017-01-09 09:33:01.733992] T [socket.c:2991:socket_connect] 0-test-ec-snapd-client: connecting 0x7fbf60061810, state=2 gen=0 sock=-1

[2017-01-09 09:33:01.734016] D [name.c:168:client_fill_address_family] 0-test-ec-snapd-client: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: localhost)

[2017-01-09 09:33:01.734032] T [name.c:238:af_inet_client_get_remote_sockaddr] 0-test-ec-snapd-client: option remote-port missing in volume test-ec-snapd-client. Defaulting to 24007

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-01-09 05:14:58 EST ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs‑3.2.0' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from surabhi on 2017-01-09 06:35:00 EST ---

As discussed in the blocker bug triage meeting, providing qa_ack.

--- Additional comment from Vivek Das on 2017-01-09 07:28:39 EST ---

Sosreports & samba logs : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1411270

--- Additional comment from  on 2017-01-10 09:57:26 EST ---

RCA:

".snaps" is not visible in CIFS mount because it could not make connection to snapd.

During mount the client first connects to glusterd to get the port number of the service (snapd). Once it has the port number, it reconnects to the service on the new port. During this reconnect we resolve the hostname, in this case localhost. Since both IPv6 and IPv4 addresses are configured on this system, the hostname resolves to two addresses. The reconnect logic picks the first IP address and tries to make the connection. In this instance the first IP address happened to be the IPv6 address, and thus the connection fails.

When the connection fails we reset the port number to 0. A port number of 0 means we should connect to glusterd on its default port. So during the next reconnect, with the IPv4 address, we connect to glusterd instead of snapd, leading to failure.

We are not seeing the same issue on bricks because in this test setup bricks use an FQDN instead of localhost, and that hostname resolves only to an IPv4 address.
If a user configures both IPv4 and IPv6 addresses for these hostnames, we might hit the same issue there as well.

This behavior can be controlled by the "transport.address-family" volume option. If we set "transport.address-family" to inet, we only look for IPv4 addresses and thus do not hit this issue.

From gluster-3.8.0 onwards this option is set by default. Therefore any new volume created on RHGS-3.2.0 will not see this issue. But if the user upgrades from an older build, this option may not be set for that volume.

So as a workaround this option can be set to limit network address resolution to IPv4 only.
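
The effect of the address family on resolution can be illustrated with a small sketch. It uses Python's standard socket.getaddrinfo, not Gluster code, and the helper name resolve_addresses is made up for this example. With AF_UNSPEC a name such as localhost may come back as both ::1 and 127.0.0.1, in an order decided by the local resolver; with AF_INET (what "inet" selects) only IPv4 addresses are returned.

```python
import socket


def resolve_addresses(host, port, family=socket.AF_UNSPEC):
    """Return the distinct IP addresses getaddrinfo yields, in resolver order.

    Illustrative helper only; this is not how Gluster resolves names
    internally.
    """
    infos = socket.getaddrinfo(host, port, family=family,
                               type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


if __name__ == "__main__":
    # The output depends on the local resolver setup (/etc/hosts, DNS).
    print("unspec:", resolve_addresses("localhost", 24007))
    print("inet:  ", resolve_addresses("localhost", 24007, socket.AF_INET))
```

On a dual-stack host the first line may list ::1 before 127.0.0.1, which is the ordering that triggers the failure described in this comment.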

--- Additional comment from  on 2017-01-10 09:58:31 EST ---

Based on comment 4 I think we can remove the blocker flag for RHGS-3.2.0.

--- Additional comment from Atin Mukherjee on 2017-01-10 23:31:22 EST ---

Going with the RCA as mentioned in comment 4, I have a question here. If you look at glusterd_update_volumes_dict (), we do set the transport.address-family option to inet if the transport-type is TCP and this happens when the op-version is bumped up.

Is it like the op-version was not bumped up post upgrade? If that's the case then it's not a bug as bumping up the op-version is part of the upgrade steps.

--- Additional comment from rjoseph on 2017-01-11 01:53:47 EST ---

(In reply to Atin Mukherjee from comment #6)
> Going with the RCA as mentioned in comment 4, I have a question here. If you
> look at glusterd_update_volumes_dict (), we do set the
> transport.address-family option to inet if the transport-type is TCP and
> this happens when the op-version is bumped up.

In yesterday's discussion there was confusion about whether we update the volume info after an op-version change. Thanks for looking into this and clarifying it.

> 
> Is it like the op-version was not bumped up post upgrade? If that's the case
> then it's not a bug as bumping up the op-version is part of the upgrade
> steps.

The answer to this question depends on whether we support changing the value of "transport.address-family" in downstream. If we don't, then this may not be a bug now. But if we do support it, then this looks like a bug to me, though it may not be a blocker.

--- Additional comment from Atin Mukherjee on 2017-01-11 02:24:04 EST ---

(In reply to rjoseph from comment #7)
> (In reply to Atin Mukherjee from comment #6)
> > Going with the RCA as mentioned in comment 4, I have a question here. If you
> > look at glusterd_update_volumes_dict (), we do set the
> > transport.address-family option to inet if the transport-type is TCP and
> > this happens when the op-version is bumped up.
> 
> In yesterday's discussion there was confusion about whether we update the
> volume info after an op-version change. Thanks for looking into this
> and clarifying it.
> 
> > 
> > Is it like the op-version was not bumped up post upgrade? If that's the case
> > then it's not a bug as bumping up the op-version is part of the upgrade
> > steps.
> 
> The answer to this question depends on whether we support changing the value
> of "transport.address-family" in downstream. If we don't, then this may not
> be a bug now. But if we do support it, then this looks like a bug to me,
> though it may not be a blocker.

Given that we don't support IPv6 in downstream, having inet6 as a value of the transport.address-family option is also not supported. With that, I still feel this BZ is not valid in RHGS.

--- Additional comment from Atin Mukherjee on 2017-01-11 02:29:01 EST ---

I have not received the data requested in comment 6:

"Is it like the op-version was not bumped up post upgrade? If that's the case then it's not a bug as bumping up the op-version is part of the upgrade steps."

Setting needinfo back on Vivek.

--- Additional comment from Vivek Das on 2017-01-11 04:18:59 EST ---

I upgraded the setup and ran an automation script to create fresh volumes and perform other functional steps. I checked the op-version and it was the latest. But somewhere "transport.address-family" got missed and I was not aware of that. I have another, similar setup where everything is working as expected. I also tried again with a fresh setup, but I am not able to reproduce this.

--- Additional comment from Atin Mukherjee on 2017-01-11 04:44:59 EST ---

Based on comment 10, closing this BZ.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-01-18 11:00:41 EST ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs‑3.2.0' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Vivek Das on 2017-01-18 11:23:08 EST ---

On upgrading from 3.1.3 to 3.2.0, including the op-version update, the .snaps directory is not visible in the mount point.

Steps to reproduce
---------------------
1. Take a 3.1.3 RHGS setup with a distributed-replicate volume
2. Enable USS and features.show-snapshot-directory for the volume
3. Follow the upgrade steps and upgrade the setup from 3.1.3 to 3.2
4. Start glusterd
5. Start smb
6. Check the op-version (older)
7. Update the op-version (bumped up the op-version)
8. Reboot
9. Mount cifs
10. Check for the .snaps directory in the cifs mount (not visible)
11. Mount over fuse: mount -t glusterfs localhost:/volname /mnt
12. Check for the .snaps directory in the fuse mount (not visible)

gluster volume status
Status of volume: dangal
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp42-190.lab.eng.blr.redhat.com:/br
icks/brick0/dangal_brick0                   49152     0          Y       3819 
Brick dhcp43-168.lab.eng.blr.redhat.com:/br
icks/brick0/dangal_brick1                   49152     0          Y       3657 
Brick dhcp42-190.lab.eng.blr.redhat.com:/br
icks/brick1/dangal_brick2                   49153     0          Y       3830 
Brick dhcp43-168.lab.eng.blr.redhat.com:/br
icks/brick1/dangal_brick3                   49153     0          Y       3667 
Snapshot Daemon on localhost                49154     0          Y       3838 
NFS Server on localhost                     2049      0          Y       3803 
Self-heal Daemon on localhost               N/A       N/A        Y       3809 
Snapshot Daemon on dhcp43-168.lab.eng.blr.r
edhat.com                                   49154     0          Y       3672 
NFS Server on dhcp43-168.lab.eng.blr.redhat
.com                                        2049      0          Y       4280 
Self-heal Daemon on dhcp43-168.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       4288 
 
Task Status of Volume dangal
------------------------------------------------------------------------------
There are no active volume tasks



Error Logs
--------------
[2017-01-18 13:26:04.156082] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)
[2017-01-18 13:26:08.150785] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-dangal-snapd-client: changing port to 49154 (from 0)
[2017-01-18 13:26:08.163323] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)
[2017-01-18 13:26:12.158243] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-dangal-snapd-client: changing port to 49154 (from 0)
[2017-01-18 13:26:12.171078] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)
[2017-01-18 13:26:16.166085] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-dangal-snapd-client: changing port to 49154 (from 0)
[2017-01-18 13:26:16.178746] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)
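
A quick way to spot this pattern in a client log is sketched below. It is only a diagnostic helper written for this report, not part of Gluster; the regular expression is derived from the socket_connect_finish lines quoted above and may need adjusting for other releases or log formats.

```python
import re

# Matches lines such as:
# [..] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client:
#     connection to ::1:49154 failed (Connection refused)
CONNECT_FAILED = re.compile(
    r"E \[socket\.c:\d+:socket_connect_finish\] "
    r"\d+-(?P<xlator>\S+): connection to (?P<addr>\S+):(?P<port>\d+) "
    r"failed \((?P<reason>[^)]*)\)"
)


def failed_ipv6_connects(log_lines):
    """Return (xlator, address, port) for each failed connect to an IPv6 address."""
    hits = []
    for line in log_lines:
        match = CONNECT_FAILED.search(line)
        # An IPv6 literal such as ::1 contains ':'; an IPv4 literal does not.
        if match and ":" in match.group("addr"):
            hits.append((match.group("xlator"),
                         match.group("addr"),
                         int(match.group("port"))))
    return hits


SAMPLE = [
    "[2017-01-18 13:26:04.156082] E [socket.c:2309:socket_connect_finish] "
    "0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)",
    "[2017-01-18 13:26:08.150785] I [rpc-clnt.c:1965:rpc_clnt_reconfig] "
    "0-dangal-snapd-client: changing port to 49154 (from 0)",
]

if __name__ == "__main__":
    for hit in failed_ipv6_connects(SAMPLE):
        print(hit)
```

Repeated hits for the same snapd client, interleaved with "changing port to ... (from 0)" messages, suggest the client is retrying the service port on the IPv6 address only.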

--- Additional comment from Raghavendra Talur on 2017-01-18 17:38:59 EST ---

Thanks for the update Vivek.

Root Cause:

The snapview-client xlator talks to glusterd at the same "hostname" that was provided to the mount command to get the port for snapd. If the DNS resolver provides both IPv6 and IPv4 addresses for the hostname (in that order) and glusterd on the host is listening only on IPv4, snapview-client is not able to get the port for snapd. We try to mitigate this by setting the transport.address-family volume option to inet, as pointed out by Atin in the glusterd_update_volumes_dict() function, thereby ensuring that only IPv4 addresses are considered. This volume option is, however, only updated in glusterd's in-memory volume info and is lazily flushed to disk (/var/lib/glusterd/vols/VOLNAME/*.vol). Hence, there is a window where clients try to connect over IPv6 when they should not.


Note, there are two bugs here.
1. RPC-lib/RPC-transport/RPC-clients bug, where protocol/client does not try all the addresses obtained from the resolver for an FQDN. This mainly affects IPv6 setups and, as Atin has pointed out, we don't support IPv6 for this release.

2. Glusterd bug, where vol files are not updated on disk and pushed to clients after an op-version change.
   In Kaushal's opinion, updating vol files after an op-version bump might lead to other problems and is not a trivial problem to solve.
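
For the first bug, the general idea of trying every resolved address before giving up can be sketched as follows. This is a generic illustration in Python, not the Gluster RPC code and not the eventual fix; the function name connect_first_reachable is hypothetical. Python's own socket.create_connection follows the same approach.

```python
import socket


def connect_first_reachable(host, port, timeout=2.0):
    """Try each address returned by getaddrinfo; return the first connected socket.

    Only if every candidate address fails is the last error raised.
    """
    last_error = None
    for family, socktype, proto, _canonname, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            # Remember the failure and move on to the next address,
            # e.g. from ::1 to 127.0.0.1.
            last_error = err
            if sock is not None:
                sock.close()
    if last_error is None:
        raise OSError("no addresses resolved for %r" % (host,))
    raise last_error
```

With this approach a refused connection on ::1 would not, by itself, count as a service failure; the caller only sees an error once 127.0.0.1 has also been tried.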


Possible workarounds that we explored:
a. Perform an innocuous volume set operation like
gluster vol set VOLNAME user.some-dummy-attrib "dummy-value"
to flush the volfile to disk and notify the client of a fetch_spec.
RESULT: .snaps not visible

b. Perform a volume set operation that does not change the graph and just forces a reconfigure, like
gluster vol set VOLNAME performance.cache-size 30
RESULT: .snaps not visible (I suspect that the volfile was updated but protocol/client did not reconnect)

c. Perform a volume set operation that changes the graph/topology, like
gluster vol set VOLNAME open-behind off
RESULT: .snaps is *visible*


Hence, we do need to take care of updating the vol files automatically after op-version bump.

OR

If, in downstream, we don't really support IPv6, I might be able to send a patch which changes default to inet in rpc layer.

OR

Document clearly that after op-version bump, one more command is necessary to regenerate volfiles.
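
To check whether a regenerated volfile actually carries the option, something like the sketch below could be run against the text of a client volfile. It is a minimal parser written for this report; the function name volfile_option is hypothetical, and the SAMPLE fragment is illustrative rather than copied from the affected setup.

```python
def volfile_option(volfile_text, volume_name, option):
    """Return the value of `option` inside `volume <name> ... end-volume`, or None."""
    in_volume = False
    for raw_line in volfile_text.splitlines():
        words = raw_line.split()
        if not words:
            continue
        if words[0] == "volume" and len(words) > 1:
            in_volume = (words[1] == volume_name)
        elif words[0] == "end-volume":
            in_volume = False
        elif (in_volume and words[0] == "option"
              and len(words) >= 3 and words[1] == option):
            return " ".join(words[2:])
    return None


# Illustrative fragment only (assumed layout of a protocol/client section).
SAMPLE = """\
volume dangal-snapd-client
    type protocol/client
    option remote-host localhost
    option transport-type tcp
end-volume
"""

if __name__ == "__main__":
    value = volfile_option(SAMPLE, "dangal-snapd-client",
                           "transport.address-family")
    print("transport.address-family:", value)
```

A result of None for the snapd client section would match the situation described here: the option is missing from the volfile, so the client falls back to its default.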

--- Additional comment from Vivek Das on 2017-01-18 22:39:58 EST ---

Version
--------
glusterfs-3.8.4-11.el7rhgs.x86_64
samba-client-4.4.6-4.el7rhgs.x86_64

--- Additional comment from  on 2017-01-20 03:08:33 EST ---

Even after my RCA and Talur's detailed explanation there seems to be some
confusion about the issue, its manifestation and the impact. So let me
take another shot at explaining things here.


The client gets the volfile from the server (glusterd) and, based on the
options in the volfile, connects to bricks and other services (e.g.
snapd). The volfile says which brick/service to connect to, including
the hostname. As part of the connection, the first thing a client does
is resolve the hostname to get an IP address. Hostname resolution is
done by a DNS server, by a local DNS cache (if your OS is configured
with one), or by something primitive like /etc/hosts. A hostname can
resolve to multiple IP addresses (including IPv4 and IPv6).

Gluster also has some sort of internal DNS cache. All the IP addresses
received during hostname resolution are kept in this cache. Every time
we try to resolve a hostname, the IPs from this list are returned one
after another. So if we get "::1" and "127.0.0.1" as IP addresses, the
first call to resolve the hostname returns "::1" and the second call
returns "127.0.0.1".

Now let's take a look at how a client connects to a brick or a service.
First the connection is made to glusterd to get the port number of the
brick/service. Once we have the port number, we connect to the
brick/service.

So let's say we got the port number from glusterd and the client is now
trying to connect to the brick/service. During hostname resolution we
got the "::1" and "127.0.0.1" IP addresses. So it first tries to reach
the brick/service via "::1". This obviously fails because we are not
listening on that IP. After the connection failure our state machine
tries to reconnect with the next IP address, i.e. "127.0.0.1". But
before the reconnect, our state machine resets the target port to 0,
i.e. connect to glusterd. This is done because the state machine assumes
connection issues with the brick/service and contacts glusterd to get
the correct state. The code was initially written to handle only IPv4
addresses.

Gluster has a volume option, "transport.address-family", which tells us
what kind of addresses we should resolve to. Currently the default is
AF_UNSPEC, i.e. it fetches both IPv4 and IPv6 addresses. As a workaround,
during a cluster op-version change and at new volume creation time we
explicitly set "transport.address-family" to "inet" (i.e. IPv4). But we
have a bug in glusterd: when we change the cluster op-version we only
update the in-memory value of "transport.address-family" and fail to
update the *.vol files. So when a client gets the volfile from glusterd
this option is missing, which makes the client use the default
AF_UNSPEC.


So in short we have multiple issues here:
1) Glusterd should persist this option so that during handshake clients
   get the correct options.
2) During connection failure we should try all the IP addresses before
   changing the state machine.
3) Also, we feel the use of AF_UNSPEC as the default connection family is
   not very useful, as the majority of our setups are IPv4. It would be
   good to make AF_INET the default.
   

Also, this problem is not limited to snapd, as explained above. If a
hostname resolves to more than one IP, we will see the issue in bricks
and other services as well.
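
The loop described above can be reproduced with a toy model. The sketch below is a deliberately simplified simulation written for this report, not the actual rpc-clnt state machine; simulate_connect and its parameters are hypothetical names. It assumes each attempt takes the next address from the resolved list and that a failed service connection resets the port to 0 (meaning "ask glusterd").

```python
from itertools import cycle

GLUSTERD_PORT = 24007  # default glusterd port, as in the "Defaulting to 24007" log line


def simulate_connect(addresses, listeners, service_port, max_attempts=10):
    """Toy model of the reconnect loop.

    addresses: resolved IPs in resolver order; each attempt uses the next one.
    listeners: set of (ip, port) pairs that accept connections.
    Returns (attempts, connected), where attempts is the list of (ip, port)
    pairs that were tried.
    """
    next_address = cycle(addresses)
    port = 0  # 0 means "ask glusterd for the service port first"
    attempts = []
    for _ in range(max_attempts):
        ip = next(next_address)
        target = (ip, port or GLUSTERD_PORT)
        attempts.append(target)
        reachable = target in listeners
        if port == 0:
            if reachable:
                port = service_port   # glusterd answered the port query
        elif reachable:
            return attempts, True     # connected to the service itself
        else:
            port = 0                  # failure: fall back to glusterd
    return attempts, False


if __name__ == "__main__":
    ipv4_only = {("127.0.0.1", GLUSTERD_PORT), ("127.0.0.1", 49154)}
    print(simulate_connect(["::1", "127.0.0.1"], ipv4_only, 49154))
    print(simulate_connect(["127.0.0.1"], ipv4_only, 49154))
```

In this model the dual-stack case alternates between 127.0.0.1:24007 and ::1:49154 and never reaches 127.0.0.1:49154, while the IPv4-only case connects on the second attempt.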

--- Additional comment from Atin Mukherjee on 2017-01-24 23:35:34 EST ---

--- Additional comment from Atin Mukherjee on 2017-01-24 23:38:21 EST ---

--- Additional comment from Worker Ant on 2017-01-24 23:41:54 EST ---

REVIEW: https://review.gluster.org/16455 (glusterd: regenerate volfiles on op-version bump up) posted (#2) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-01-26 09:28:15 EST ---

REVIEW: https://review.gluster.org/16455 (glusterd: regenerate volfiles on op-version bump up) posted (#3) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-01-27 08:52:48 EST ---

COMMIT: https://review.gluster.org/16455 committed in master by Kaushal M (kaushal) 
------
commit 33f8703a12dd97980c43e235546b04dffaf4afa0
Author: Atin Mukherjee <amukherj>
Date:   Mon Jan 23 13:03:06 2017 +0530

    glusterd: regenerate volfiles on op-version bump up
    
    Change-Id: I2fe7a3ebea19492d52253ad5a1fdd67ac95c71c8
    BUG: 1416251
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/16455
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Prashanth Pai <ppai>
    Reviewed-by: Kaushal M <kaushal>

Comment 1 Worker Ant 2017-01-30 04:10:53 UTC

REVIEW: https://review.gluster.org/16475 (glusterd: regenerate volfiles on op-version bump up) posted (#1) for review on release-3.10 by Atin Mukherjee (amukherj)

Comment 2 Worker Ant 2017-02-01 14:59:40 UTC

COMMIT: https://review.gluster.org/16475 committed in release-3.10 by Shyamsundar Ranganathan (srangana) 
------
commit f05c2ff22fe371a7b9a8ab4226f4dd2d17560d8a
Author: Atin Mukherjee <amukherj>
Date:   Mon Jan 23 13:03:06 2017 +0530

    glusterd: regenerate volfiles on op-version bump up
    
    >Reviewed-on: https://review.gluster.org/16455
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >Smoke: Gluster Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Reviewed-by: Prashanth Pai <ppai>
    >Reviewed-by: Kaushal M <kaushal>
    
    Change-Id: I2fe7a3ebea19492d52253ad5a1fdd67ac95c71c8
    BUG: 1417521
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/16475
    Reviewed-by: Prashanth Pai <ppai>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>

Comment 3 Shyamsundar 2017-03-06 17:44:28 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/
