Bug 1332075

Summary: SAMBA-CTDB : Virtual IP mount on windows client fails
Product: Red Hat Gluster Storage
Reporter: Vivek Das <vdas>
Component: samba
Assignee: rhs-smb <rhs-smb>
Status: CLOSED NOTABUG
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: urgent
Docs Contact:
Priority: urgent
Version: rhgs-3.1
CC: madam, nlevinki, rjoseph, rtalur, vdas
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-05 11:06:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Vivek Das 2016-05-02 06:18:40 UTC
Description of problem:

On a 4-node samba-ctdb setup (8 GB RAM each), mounting a share on a Windows client through a VIP succeeds the very first time, but the mount is disconnected automatically soon afterwards.
After that, any further attempt to mount through the VIP fails.


Version-Release number of selected component (if applicable):
glusterfs-3.7.9-3.el7rhgs.x86_64
samba-client-libs-4.4.2-1.el7rhgs.x86_64
Windows Client 8.1

How reproducible:
Always

Steps to Reproduce:
1. Set up a 4-node CTDB cluster.
2. Mount the volume share on a Windows client using a VIP.
3. Run the command net use in Windows PowerShell a few times to keep checking the status, or keep accessing the mount for a while (just open the mount and wait a few seconds).

Actual results:
The mount connects but is disconnected within a few seconds.

Expected results:
The mount should connect and should not be disconnected automatically.

Additional info:

Comment 2 Vivek Das 2016-05-02 06:29:09 UTC
sosreports uploaded at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1332075

Comment 3 Michael Adam 2016-05-02 09:49:05 UTC
All I can tell from the sosreport is that the cluster does not seem to be properly/fully set up: There is no ctdb nodes file, and the recovery lock file is not accessible (is gluster mounted?).

Is the sosreport info complete?

Comment 5 Vivek Das 2016-05-02 11:55:57 UTC
Michael: The sosreport info was complete, and gluster was mounted.
I will provide one more sosreport and update here soon.

Comment 7 Michael Adam 2016-05-02 16:30:04 UTC
From looking at the network traces:

1. vip-1 is the capture of an attempt to connect to 10.70.47.11.
   Client establishes tcp connection and sends an smb negotiate
   request. Server sends tcp RSTs and a spurious replay of the
   initial syn+ack. Afterwards, all attempts by the client to send
   a RST packet itself are answered with an ICMP destination
   unreachable (host administratively prohibited).

   This looks suspiciously like network problems.
   Most probably not on the nodes but somewhere in the
   network (virtualization?) between nodes and client.

2. vip-2 is the capture of a connection that succeeds but
   is disconnected after a while.
   What I can see here is that there is one RST from the
   server right after the initial SYN. But after the
   retransmitted SYN, everything seems to work just fine on
   the TCP level. SMB2 is negotiated, a session is established,
   and a tree-connect to share harharmahdev is done.
   Interestingly, the tree connect response takes 2.5 seconds...
   Afterwards, only a FSCTL_VALIDATE_NEGOTIATE_INFO is exchanged.
   Without any further packets on the tcp connection, the
   client sends a tree disconnect after another 8.2 seconds.

   I.e. nothing strange here, apparently just a decision by
   the client to make no use of the share connection.


ctdb does not seem to have any issues at this time, according to the logs.

The samba logs in the sosreport are not useful for analyzing this further:
they are incomplete. sosreport does not include the necessary
files, and the included files are too short.

It would be good to get hold of untruncated samba logs:

Configure samba like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[global]
...
max log size = 0
debug hires timestamp = yes
log level = 10
log file = /var/log/samba/log.smbd
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(instead of the current logging options)
And restart samba.

This way, all processes should log into log.smbd,
and log.smbd will not get rotated.
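
Before restarting samba, the [global] section could be sanity-checked with a small script. The following is only a minimal sketch, not part of the original request: the function name missing_log_settings is hypothetical, and it assumes the smb.conf text is simple enough for Python's configparser (include directives are not followed, for example).

```python
import configparser

# Logging options requested in this comment.
WANTED = {
    "max log size": "0",
    "debug hires timestamp": "yes",
    "log level": "10",
    "log file": "/var/log/samba/log.smbd",
}


def missing_log_settings(smb_conf_text):
    """Return {option: expected value} for every requested option that is
    absent from [global] or set to a different value."""
    # interpolation=None keeps '%' substitutions such as %m from raising errors;
    # strict=False tolerates repeated sections or options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(smb_conf_text)
    if not parser.has_section("global"):
        return dict(WANTED)
    section = parser["global"]
    return {
        option: expected
        for option, expected in WANTED.items()
        if section.get(option, "").strip() != expected
    }


# Hypothetical smb.conf fragment used only for illustration.
SAMPLE = """
[global]
max log size = 0
log level = 1
log file = /var/log/samba/log.smbd
"""

if __name__ == "__main__":
    for option, expected in missing_log_settings(SAMPLE).items():
        print(f"{option} should be set to {expected}")
```

An empty result would mean all four options are already set as requested.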

It would also be good to get logs (samba+ctdb) from all nodes
and not just from one.

Thanks!

Comment 9 Michael Adam 2016-05-03 11:26:12 UTC
What do 'usual windows client' and 'different windows client' mean?
What kinds of actions were done during the captures?
And what was the expected and the observed result?
In the captures, I only see negotiate, session setup, tree connect, ioctl, and create for Desktop.ini and AutoRun.inf. All appears regular. Also the samba logs don't immediately give clues.

I need to know what to look out for in this data.

Comment 10 Vivek Das 2016-05-03 12:36:21 UTC
The usual windows client is the windows client that has been involved since the very first test for this issue.
By different windows client I meant that I also tested the above issue, i.e. mounting the VIP, from a different windows client (8.1) and was able to reproduce it easily.

 What kinds of actions were done during the captures?

Tried to mount VIPs on windows client using net use in powershell.
eg. net use * \\vip\gluster-volname /user:**** "paswrd"

 What was the expected and the observed result?

expected result = the VIP should get mounted successfully and should not get disconnected automatically

observed result = the VIP mount connects and then disconnects within a few seconds.

Comment 11 Raghavendra Talur 2016-05-05 10:16:44 UTC
Just to have a clean setup, we used a different set of IPs as public addresses. We did not observe any disconnects with those IPs.

When we reused the old set of IPs, we were able to reproduce the issue. The VIP that got disconnected was 10.70.47.55.

Below are the details from the windows client and the 4 servers:

10.70.47.10 (VIP : 10.70.47.164)
------------------
[root@dhcp47-10 samba]# arp -n | grep 47.55
10.70.47.55              ether   52:54:00:26:d5:84   C                     ens3
[root@dhcp47-10 samba]# ifconfig | grep ether
        ether 52:54:00:5f:3e:6d  txqueuelen 1000  (Ethernet)


10.70.47.166 (VIP : 10.70.47.55)
------------------
[root@dhcp47-166 samba]# arp -n | grep 47.55
10.70.47.55              ether   52:54:00:13:7f:26   C                     ens3
[root@dhcp47-166 samba]# ifconfig | grep ether
        ether 52:54:00:26:d5:84  txqueuelen 1000  (Ethernet)


10.70.47.159 (VIP : 10.70.47.160)
------------------
[root@dhcp47-159 samba]# arp -n | grep 47.55
10.70.47.55              ether   52:54:00:26:d5:84   C                     ens3
[root@dhcp47-159 samba]# ifconfig | grep ether
        ether 52:54:00:bf:1a:7a  txqueuelen 1000  (Ethernet)


10.70.47.58 (VIP : 10.70.47.11)
------------------
[root@dhcp47-58 samba]# arp -n | grep 47.55
10.70.47.55              ether   52:54:00:26:d5:84   C                     ens3
[root@dhcp47-58 samba]# ifconfig | grep ether
        ether 52:54:00:13:7f:26  txqueuelen 1000  (Ethernet)


Windows Client
-----------------------------
arp -a
--------------

Interface: 10.70.47.181 --- 0xc
  Internet Address      Physical Address      Type
  10.70.46.29           52-54-00-62-05-da     dynamic
  10.70.46.177          52-54-00-45-8b-4f     dynamic
  10.70.47.5            52-54-00-cd-59-30     dynamic
  10.70.47.10           52-54-00-5f-3e-6d     dynamic
  10.70.47.11           52-54-00-13-7f-26     dynamic
  10.70.47.55           00-1a-4a-f7-23-1a     dynamic
  10.70.47.58           52-54-00-13-7f-26     dynamic
  10.70.47.63           52-54-00-8b-0b-4f     dynamic
  10.70.47.159          52-54-00-bf-1a-7a     dynamic
  10.70.47.160          52-54-00-bf-1a-7a     dynamic
  10.70.47.164          52-54-00-5f-3e-6d     dynamic
  10.70.47.166          52-54-00-26-d5-84     dynamic
  10.70.47.254          64-87-88-b8-75-41     dynamic
  10.70.47.255          ff-ff-ff-ff-ff-ff     static
  224.0.0.22            01-00-5e-00-00-16     static
  224.0.0.252           01-00-5e-00-00-fc     static
  239.255.255.250       01-00-5e-7f-ff-fa     static
  255.255.255.255       ff-ff-ff-ff-ff-ff     static


From the above findings, it looks like a network issue where the address (ARP) table on the windows client is not updated.

This is most likely not a bug.

Comment 12 Michael Adam 2016-05-05 10:20:53 UTC
Seems like another machine is using the same IP as one of the public IPs (10.70.47.55). And the two hosts are fighting for the arp entry for this address. I think this explains the behaviour.
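
The comparison behind this conclusion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the observer label "windows-client" are hypothetical, and the observations are copied by hand from the arp output in comment 11 rather than collected automatically.

```python
from collections import defaultdict


def normalize_mac(mac):
    """Lower-case a MAC address and use ':' separators (Windows prints '-')."""
    return mac.strip().lower().replace("-", ":")


def conflicting_macs(observations):
    """Group (observer, ip, mac) tuples by IP and return only the IPs that
    were seen with more than one MAC, as {ip: {mac: [observers]}}."""
    seen = defaultdict(lambda: defaultdict(list))
    for observer, ip, mac in observations:
        seen[ip][normalize_mac(mac)].append(observer)
    return {ip: dict(macs) for ip, macs in seen.items() if len(macs) > 1}


# ARP entries for 10.70.47.55 as reported in comment 11.
OBSERVED = [
    ("dhcp47-10", "10.70.47.55", "52:54:00:26:d5:84"),
    ("dhcp47-166", "10.70.47.55", "52:54:00:13:7f:26"),
    ("dhcp47-159", "10.70.47.55", "52:54:00:26:d5:84"),
    ("dhcp47-58", "10.70.47.55", "52:54:00:26:d5:84"),
    ("windows-client", "10.70.47.55", "00-1a-4a-f7-23-1a"),
]

if __name__ == "__main__":
    for ip, macs in conflicting_macs(OBSERVED).items():
        print(f"{ip} resolves to {len(macs)} different MAC addresses:")
        for mac, observers in macs.items():
            print(f"  {mac} seen by {', '.join(observers)}")
```

An IP that appears in the result is being answered for by more than one hardware address, which is the symptom of two hosts claiming the same address.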