Description of problem:
The .snaps directory is not visible in a CIFS mount or in a Windows SMB mount, even after enabling the USS and VSS plugins. Over a FUSE mount the .snaps directory is visible and accessible. The issue is currently seen on an SSL-enabled cluster and on another cluster set up over an EC volume with no SSL. The information below was collected from the EC-volume setup.

Disconnect messages are seen in the client logs:

[2017-01-09 09:32:57.751250] E [socket.c:2309:socket_connect_finish] 0-test-ec-snapd-client: connection to ::1:49158 failed (Connection refused)
[2017-01-09 09:32:57.751291] T [socket.c:721:__socket_disconnect] 0-test-ec-snapd-client: disconnecting 0x7fbf60061810, state=2 gen=0 sock=53

Version-Release number of selected component (if applicable):
samba-client-libs-4.4.6-4.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-11.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. On an EC volume 2(4+2), enable USS and show-snapshot-directory
2. Enable all VSS plugins
3. Take a snapshot
4. Activate the snapshot
5. Do a CIFS mount and also mount the volume on a Windows client machine (say Windows 10)
6. Check for the .snaps directory in the CIFS mount as well as the Windows mount

Actual results:
The .snaps directory is not visible, accessible, or present.

Expected results:
The .snaps directory should be present.

Additional info:
[2017-01-09 09:32:57.751250] E [socket.c:2309:socket_connect_finish] 0-test-ec-snapd-client: connection to ::1:49158 failed (Connection refused)
[2017-01-09 09:32:57.751291] T [socket.c:721:__socket_disconnect] 0-test-ec-snapd-client: disconnecting 0x7fbf60061810, state=2 gen=0 sock=53
[2017-01-09 09:32:57.751312] D [socket.c:683:__socket_shutdown] 0-test-ec-snapd-client: shutdown() returned -1. Transport endpoint is not connected
[2017-01-09 09:32:57.751327] D [socket.c:728:__socket_disconnect] 0-test-ec-snapd-client: __socket_teardown_connection () failed: Transport endpoint is not connected
[2017-01-09 09:32:57.751340] D [socket.c:2403:socket_event_handler] 0-transport: disconnecting now
[2017-01-09 09:32:57.752014] D [rpc-clnt-ping.c:93:rpc_clnt_remove_ping_timer_locked] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbf73b1b602] (--> /lib64/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x8e)[0x7fbf74011b9e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x5b)[0x7fbf7400dfbb] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fbf7400e874] (--> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fbf7400a893] ))))) 0-: ::1:49158: ping timer event already removed
[2017-01-09 09:32:57.752064] D [MSGID: 0] [client.c:2264:client_rpc_notify] 0-test-ec-snapd-client: got RPC_CLNT_DISCONNECT
[2017-01-09 09:32:57.752095] D [MSGID: 0] [event-epoll.c:587:event_dispatch_epoll_handler] 0-epoll: generation bumped on idx=13 from gen=2764 to slot->gen=2765, fd=53, slot->fd=53
[2017-01-09 09:33:01.733914] T [rpc-clnt.c:422:rpc_clnt_reconnect] 0-test-ec-snapd-client: attempting reconnect
[2017-01-09 09:33:01.733992] T [socket.c:2991:socket_connect] 0-test-ec-snapd-client: connecting 0x7fbf60061810, state=2 gen=0 sock=-1
[2017-01-09 09:33:01.734016] D [name.c:168:client_fill_address_family] 0-test-ec-snapd-client: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: localhost)
[2017-01-09 09:33:01.734032] T [name.c:238:af_inet_client_get_remote_sockaddr] 0-test-ec-snapd-client: option remote-port missing in volume test-ec-snapd-client. Defaulting to 24007
As discussed in the blocker bug triage meeting, providing qa_ack.
RCA: ".snaps" is not visible in the CIFS mount because the client could not make a connection to snapd. During mount the client first connects to glusterd to get the port number of the service (snapd). Once it has the port number it reconnects to the service on the new port. During this reconnect we resolve the hostname, in this case localhost. Since both IPv6 and IPv4 addresses are configured on this system, the hostname resolves to two addresses. The reconnect logic picks up the first IP address and tries to make the connection. In this instance the first IP address happened to be the IPv6 address, so the connection fails. When the connection fails we reset the port number to 0; a port number of 0 means we should connect to glusterd on its default port. So during the next reconnect, with the IPv4 address, we connect to glusterd instead of snapd, leading to the failure.

We are not seeing the same issue on bricks because in this test setup the bricks use an FQDN instead of localhost, and that hostname resolves only to an IPv4 address. If a user configures both IPv4 and IPv6 addresses for those hostnames, we might hit the same issue there as well.

This behavior can be controlled by the "transport.address-family" volume option. If "transport.address-family" is set to inet, we only look for IPv4 addresses, so we will not hit this issue. From gluster-3.8.0 onwards this option is set by default, so any new volume created on RHGS-3.2.0 will not see this issue. But if the user upgrades from an older build, this option may not be set for that volume. So as a workaround this option can be set to limit network address resolution to IPv4 only.
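The effect of the address-family option can be illustrated with a small Python sketch around getaddrinfo() (the same resolver interface the client-side logs above mention); "localhost" here is just the example hostname from this setup, and the exact addresses returned depend on the host's configuration:

```python
import socket

def resolve(host, family=socket.AF_UNSPEC):
    """Return (family, address) pairs for a hostname, as a client would see them."""
    return [(info[0], info[4][0])
            for info in socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)]

# AF_UNSPEC (what the client falls back to when the option is missing from
# the volfile): on a dual-stack host this may return both "::1" and "127.0.0.1".
print(resolve("localhost"))

# AF_INET (the "inet" workaround): IPv4 addresses only, so the failing IPv6
# connection attempt to snapd never happens.
print(resolve("localhost", socket.AF_INET))
```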
I upgraded the setup and ran an automation script to create fresh volumes and perform the other functional steps, and I checked that the op-version was the latest. But somewhere "transport.address-family" got missed and I was not aware of that. I have another setup similar to this one where everything works as expected. I also tried everything again on a fresh setup, but I am not able to reproduce this.
On upgrading from 3.1.3 to 3.2.0, including the op-version update, the .snaps directory is not visible in the mount point.

Steps to reproduce
---------------------
1. Take a 3.1.3 RHGS setup with a distributed-replicate volume
2. Enable USS and features.show-snapshot-directory for the volume
3. Follow the upgrade steps and upgrade the setup from 3.1.3 to 3.2
4. Start glusterd
5. Start smb
6. Check the op-version (older)
7. Update the op-version (bumped up the op-version)
8. Reboot
9. Mount over CIFS
10. Check for the .snaps directory in the CIFS mount (not visible)
11. mount -t glusterfs localhost:/volname /mnt (FUSE)
12. Check for the .snaps directory in the FUSE mount (not visible)

gluster volume status

Status of volume: dangal
Gluster process                                                       TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp42-190.lab.eng.blr.redhat.com:/bricks/brick0/dangal_brick0  49152     0          Y       3819
Brick dhcp43-168.lab.eng.blr.redhat.com:/bricks/brick0/dangal_brick1  49152     0          Y       3657
Brick dhcp42-190.lab.eng.blr.redhat.com:/bricks/brick1/dangal_brick2  49153     0          Y       3830
Brick dhcp43-168.lab.eng.blr.redhat.com:/bricks/brick1/dangal_brick3  49153     0          Y       3667
Snapshot Daemon on localhost                                          49154     0          Y       3838
NFS Server on localhost                                               2049      0          Y       3803
Self-heal Daemon on localhost                                         N/A       N/A        Y       3809
Snapshot Daemon on dhcp43-168.lab.eng.blr.redhat.com                  49154     0          Y       3672
NFS Server on dhcp43-168.lab.eng.blr.redhat.com                       2049      0          Y       4280
Self-heal Daemon on dhcp43-168.lab.eng.blr.redhat.com                 N/A       N/A        Y       4288

Task Status of Volume dangal
------------------------------------------------------------------------------
There are no active volume tasks

Error Logs
--------------
[2017-01-18 13:26:04.156082] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)
[2017-01-18 13:26:08.150785] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-dangal-snapd-client: changing port to 49154 (from 0)
[2017-01-18 13:26:08.163323] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)
[2017-01-18 13:26:12.158243] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-dangal-snapd-client: changing port to 49154 (from 0)
[2017-01-18 13:26:12.171078] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)
[2017-01-18 13:26:16.166085] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-dangal-snapd-client: changing port to 49154 (from 0)
[2017-01-18 13:26:16.178746] E [socket.c:2309:socket_connect_finish] 0-dangal-snapd-client: connection to ::1:49154 failed (Connection refused)
Version
--------
glusterfs-3.8.4-11.el7rhgs.x86_64
samba-client-4.4.6-4.el7rhgs.x86_64
Even after my RCA and Talur's detailed explanation there seems to be some confusion about the issue, its manifestation, and its impact. So let me take another shot at explaining things here.

The client gets the volfile from the server (glusterd), and based on the options provided in the volfile the client connects to the bricks and other services (e.g. snapd). The volfile has the information about which brick/service to connect to, which includes the hostname. As part of making a connection, the first thing a client does is resolve the hostname to an IP address. Hostname resolution is done by a DNS server, by a local DNS cache (if your OS is configured with one), or by something primitive like /etc/hosts. A hostname can resolve to multiple IP addresses (including both IPv4 and IPv6).

Gluster also has a sort of internal DNS cache. All the IP addresses received during hostname resolution are kept in this cache, and every time we resolve the hostname an IP from this list is returned, one after another. So if we get "::1" and "127.0.0.1" as IP addresses, the first call to resolve the hostname will return "::1" and the second call will return "127.0.0.1".

Now let's look at how a client makes a connection to a brick or a service. First the connection is made to glusterd to get the port number of the brick/service. Once we have the port number, we connect to the brick/service itself. So say we got the port number from glusterd and the client is now trying to connect to the brick/service. During hostname resolution we got the addresses "::1" and "127.0.0.1", so it will first try to reach the brick/service via "::1". This obviously fails because we are not listening on that IP. After the connection failure our state machine tries to reconnect with the next IP address, i.e. "127.0.0.1". But before reconnecting, the state machine resets the target port to 0, i.e. connect to glusterd.
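The round-robin behavior of that internal cache can be sketched as follows (a simplified model for illustration, not the actual gluster code; the address list matches the "::1"/"127.0.0.1" example above):

```python
from itertools import cycle

class AddrCache:
    """Simplified model of gluster's internal DNS cache: every call to
    resolve() returns the next cached address, round-robin."""
    def __init__(self, addrs):
        self._next = cycle(addrs)

    def resolve(self):
        return next(self._next)

cache = AddrCache(["::1", "127.0.0.1"])
print(cache.resolve())  # "::1"        -- first attempt goes to IPv6
print(cache.resolve())  # "127.0.0.1"  -- the retry gets IPv4
print(cache.resolve())  # "::1"        -- and so on, round-robin
```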
This is done because the state machine assumes there are connection issues with the brick/service, and it contacts glusterd to get the correct state. The code was initially written to handle only IPv4 addresses.

Gluster has a volume option, "transport.address-family", which controls what kind of addresses we resolve to. Currently the default is AF_UNSPEC, i.e. it fetches both IPv4 and IPv6 addresses. As a workaround, during a cluster op-version change and at new-volume creation time we explicitly set "transport.address-family" to "inet" (i.e. IPv4). But we have a bug in glusterd: when we change the cluster op-version we only update the in-memory value of "transport.address-family" and fail to update the *.vol files. So when a client gets the volfile from glusterd this option is missing, which makes the client use the default AF_UNSPEC.

In short, we have multiple issues here:
1) glusterd should persist this option so that during handshake clients get the correct options.
2) During a connection failure we should try all the IP addresses before changing the state machine.
3) We also feel that AF_UNSPEC is not a very useful default for the address family, since the majority of our setups are IPv4-only. It would be good to make the default AF_INET.

Also, as explained above, this problem is not limited to snapd. If a hostname resolves to more than one IP, we will see the issue on bricks and other services as well.
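To make issue 2) concrete, here is a simplified Python model of the reconnect behavior described above (hypothetical helper names, not the actual glusterfs state machine), in a scenario where the service listens only on the IPv4 address:

```python
GLUSTERD_PORT = 24007   # glusterd's default port
SNAPD_PORT = 49154      # the snapd port from the logs above

# Round-robin resolution order in this scenario: IPv6 first, then IPv4.
ADDRESSES = ["::1", "127.0.0.1"]

def reachable(addr, port):
    """In this scenario the service listens only on IPv4."""
    return addr == "127.0.0.1"

def connect_buggy(target_port):
    """Current behavior: after any failure, reset the port to 0/glusterd."""
    port = target_port
    for addr in ADDRESSES:
        if reachable(addr, port):
            return addr, port
        port = GLUSTERD_PORT  # the state-machine reset described above
    return None

def connect_fixed(target_port):
    """Proposed behavior: try every resolved address on the same port first."""
    for addr in ADDRESSES:
        if reachable(addr, target_port):
            return addr, target_port
    return None

print(connect_buggy(SNAPD_PORT))  # ('127.0.0.1', 24007) -- ends up on glusterd
print(connect_fixed(SNAPD_PORT))  # ('127.0.0.1', 49154) -- reaches snapd
```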
upstream patch : https://review.gluster.org/#/c/16455
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/96368
Followed the steps to verify on version glusterfs-server-3.8.4-14.el7rhgs.x86_64. .snaps is visible even after the upgrade.

ll /mnt/cifs/.snaps/
total 0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html