Bug 1416251

Summary:	[SNAPSHOT] With all USS plugin enable .snaps directory is not visible in cifs mount as well as windows mount
Product:	[Community] GlusterFS	Reporter:	Atin Mukherjee <amukherj>
Component:	glusterd	Assignee:	bugs <bugs>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	amukherj, bugs, rcyriac, rhinduja, rhs-bugs, rhs-smb, rjoseph, rtalur, sbhaloth, vdas
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.11.0	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1411270
Clones:	1417521 (view as bug list)		Environment:
Last Closed:	2017-05-30 18:39:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1411270, 1417521

Comment 1 Atin Mukherjee 2017-01-25 04:35:34 UTC

Description of problem:
.snaps directory is not visible in cifs mount as well as windows smb mount, even after enabling USS & VSS plugins.

Over fuse mount the .snaps directory is seen and is accessible also.

Currently this issue is seen in a SSL enabled cluster and another cluster setup over EC volume where there is no SSL setup.

The below mentioned information is grabbed from the setup where there is a EC volume.

Disconnect messages are seen in the client logs

[2017-01-09 09:32:57.751250] E [socket.c:2309:socket_connect_finish] 0-test-ec-snapd-client: connection to ::1:49158 failed (Connection refused)

[2017-01-09 09:32:57.751291] T [socket.c:721:__socket_disconnect] 0-test-ec-snapd-client: disconnecting 0x7fbf60061810, state=2 gen=0 sock=53


Version-Release number of selected component (if applicable):
samba-client-libs-4.4.6-4.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-11.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Over an EC volume 2(4+2) enable USS & show-snapshot-directory
2. Enable all VSS plugin
3. Take a snapshot
4. Activate the snapshot
5. Do a cifs mount and also mount the volume over a windows client machine (say windows10)
6. Check for the .snaps directory in cifs mount as well as windows mount

Actual results:
.snaps directory is not seen or accessible or present

Expected results:
.snaps directory should be present

Additional info:

[2017-01-09 09:32:57.751250] E [socket.c:2309:socket_connect_finish] 0-test-ec-snapd-client: connection to ::1:49158 failed (Connection refused)
[2017-01-09 09:32:57.751291] T [socket.c:721:__socket_disconnect] 0-test-ec-snapd-client: disconnecting 0x7fbf60061810, state=2 gen=0 sock=53
[2017-01-09 09:32:57.751312] D [socket.c:683:__socket_shutdown] 0-test-ec-snapd-client: shutdown() returned -1. Transport endpoint is not connected
[2017-01-09 09:32:57.751327] D [socket.c:728:__socket_disconnect] 0-test-ec-snapd-client: __socket_teardown_connection () failed: Transport endpoint is not connected

[2017-01-09 09:32:57.751340] D [socket.c:2403:socket_event_handler] 0-transport: disconnecting now

[2017-01-09 09:32:57.752014] D [rpc-clnt-ping.c:93:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbf73b1b602] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8e)[0x7fbf74011b9e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5b)[0x7fbf7400dfbb] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fbf7400e874] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fbf7400a893] ))))) 0-: ::1:49158: ping timer event already removed

[2017-01-09 09:32:57.752064] D [MSGID: 0] [client.c:2264:client_rpc_notify] 0-test-ec-snapd-client: got RPC_CLNT_DISCONNECT

[2017-01-09 09:32:57.752095] D [MSGID: 0] [event-epoll.c:587:event_dispatch_epoll_handler] 0-epoll: generation bumped on idx=13 from gen=2764 to slot->gen=2765, fd=53, slot->fd=53

[2017-01-09 09:33:01.733914] T [rpc-clnt.c:422:rpc_clnt_reconnect] 0-test-ec-snapd-client: attempting reconnect

[2017-01-09 09:33:01.733992] T [socket.c:2991:socket_connect] 0-test-ec-snapd-client: connecting 0x7fbf60061810, state=2 gen=0 sock=-1

[2017-01-09 09:33:01.734016] D [name.c:168:client_fill_address_family] 0-test-ec-snapd-client: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: localhost)

[2017-01-09 09:33:01.734032] T [name.c:238:af_inet_client_get_remote_sockaddr] 0-test-ec-snapd-client: option remote-port missing in volume test-ec-snapd-client. Defaulting to 24007

Comment 2 Atin Mukherjee 2017-01-25 04:38:21 UTC

RCA:

Client gets volfile from server (glusterd) and based on the options
provided in the volfile client connects to bricks and other services
(e.g. snapd). The volfile has the information which brick/service to
connect to, which includes the hostname as well. As part of connection
the first thing a client does is to resolve the hostname to get IP address.
The hostname resolution is done by DNS server or by local DNS cache (if 
your OS is configured with one) or something primitive like /etc/hosts.
A hostname can be resolved to multiple IP  addresses (including IPv4 and
IPv6).

Gluster also has some sort of internal DNS cache. All the IP addresses
received during hostname resolution is kept in this cache.  Every time
we try to resolve a hostname the IP from this list is returned one after
another. So if we get "::1" and "127.0.0.1" as IP addresses then first
call to resolve hostname will return "::1" and the second call will
return "127.0.0.1".

Now lets take a look at how a client makes connection to a brick or a
service. First the connection is made to glusterd to get port number of
the brick/service. Once we get the port number we make connection to the
brick/service.

So lets say we got the port number from glusterd, now the client is
trying to connect to the brick/service. During hostname resolution we
got "::1" and "127.0.0.1" IP addresses. So it will first try to reach the
brick/service via "::1". This will obviously fail because we are not
listening on that IP. After the connection failure our state-machine tries
to reconnect with the next IP address, i.e. "127.0.0.1". But before reconnect
our state machine resets the target port to 0, i.e. connect to glusterd.
This is done because the state-machine assumes connection issues with the
brick/service and it will contact glusterd to get the correct state. The
code was initially written to handle only IPv4 addresses.

Gluster has a volume option, "transport.address-family", which tells that
what kind of addresses we should resolve to. Currently the default is
AF_UNSPEC, i.e. it will fetch both ipv4 and ipv6 addresses. As a workaround
during cluster op-version change and new volume creation time we explicitly
set "transport.address-family" to "inet" (i.e. IPv4). But we have a bug
in glusterd where when we change the cluster op-version we only update
the in-memory value of "transport.address-family" and we fail to update
the *.vol files. And when a client gets the volfile from glusterd this
option is missing which make the client to use the default AF_UNSPEC.


So in short we have multiple issues here:
1) Glusterd should persist this option so that during handshake clients
   get the correct options.
2) During connection failure we should try all the IP addresses before
   changing the state-machine.
3) Also we feel the use of AF_UNSPEC as the default value of connection family
   is not very useful as majority of our setup are IPv4. It would be good to
   make default as AF_INET.
   

Also this problem is not limited to just snapd as explained above. If a
hostname is resolved to more than one IP we will see the issue in bricks
and other services as well.

Comment 3 Worker Ant 2017-01-25 04:41:54 UTC

REVIEW: https://review.gluster.org/16455 (glusterd: regenerate volfiles on op-version bump up) posted (#2) for review on master by Atin Mukherjee (amukherj)

Comment 4 Worker Ant 2017-01-26 14:28:15 UTC

REVIEW: https://review.gluster.org/16455 (glusterd: regenerate volfiles on op-version bump up) posted (#3) for review on master by Atin Mukherjee (amukherj)

Comment 5 Worker Ant 2017-01-27 13:52:48 UTC

COMMIT: https://review.gluster.org/16455 committed in master by Kaushal M (kaushal) 
------
commit 33f8703a12dd97980c43e235546b04dffaf4afa0
Author: Atin Mukherjee <amukherj>
Date:   Mon Jan 23 13:03:06 2017 +0530

    glusterd: regenerate volfiles on op-version bump up
    
    Change-Id: I2fe7a3ebea19492d52253ad5a1fdd67ac95c71c8
    BUG: 1416251
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/16455
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Prashanth Pai <ppai>
    Reviewed-by: Kaushal M <kaushal>

Comment 6 Shyamsundar 2017-05-30 18:39:31 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/