| Summary: | SAMBA-CTDB : Virtual IP mount on windows client fails | | |
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Vivek Das <vdas> |
| Component: | samba | Assignee: | rhs-smb <rhs-smb> |
| Status: | CLOSED NOTABUG | QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | rhgs-3.1 | CC: | madam, nlevinki, rjoseph, rtalur, vdas |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-05 11:06:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Vivek Das
2016-05-02 06:18:40 UTC
sosreports uploaded @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1332075

All I can tell from the sosreport is that the cluster does not seem to be properly/fully set up: there is no ctdb nodes file, and the recovery lock file is not accessible (is gluster mounted?). Is the sosreport info complete?

Michael: sosreport info was complete, and gluster was mounted. I will provide one more sosreport and update here soon.

From looking at the network traces:

1. vip-1 is the capture of an attempt to connect to 10.70.47.11. The client establishes a TCP connection and sends an SMB negotiate request. The server sends TCP RSTs and a spurious replay of the initial SYN+ACK. Afterwards, all attempts by the client to send a RST packet itself are answered with an ICMP destination unreachable (host administratively prohibited). This looks suspiciously like a network problem, most probably not on the nodes but somewhere in the network (virtualization?) between the nodes and the client.

2. vip-2 is the capture of a connection that succeeds but is disconnected after a while. There is one RST from the server right after the initial SYN, but after the retransmitted SYN everything seems to work just fine at the TCP level: SMB2 is negotiated, a session is established, and a tree connect to the share harharmahdev is done. Interestingly, the tree connect response takes 2.5 seconds. Afterwards, only a FSCTL_VALIDATE_NEGOTIATE_INFO is exchanged. Without any further packets on the TCP connection, the client sends a tree disconnect after another 8.2 seconds. I.e. nothing strange here; apparently the client simply decided to make no use of the share connection.

ctdb does not seem to have any issues at this time, according to the logs. The samba logs in the sosreport are not useful for analyzing this further: they are incomplete, the sosreport does not include the necessary files, and the included files are too short.
It would be good to get hold of untruncated samba logs. Configure samba like this (instead of the current logging options):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[global]
...
max log size = 0
debug hires timestamp = yes
log level = 10
log file = /var/log/samba/log.smbd
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

and restart samba. This way, all processes should log into log.smbd, and log.smbd will not get rotated. It would also be good to get logs (samba + ctdb) from all nodes, not just from one. Thanks!

What do 'usual windows client' and 'different windows client' mean? What kinds of actions were done during the captures? And what were the expected and the observed results? In the captures, I only see negotiate, session setup, tree connect, ioctl, and create for Desktop.ini and AutoRun.inf. All appears regular. The samba logs also don't immediately give any clues. I need to know what to look out for in this data.

The usual windows client is the windows client involved from the very first test for this issue. By 'different windows client' I mean that I also tested the issue, i.e. mounting the VIP, from a different windows client (8.1) and was able to reproduce it easily.

What kinds of actions were done during the captures? Tried to mount the VIPs on the windows client using net use in powershell, e.g.:

net use * \\vip\gluster-volname /user:**** "paswrd"

Expected result: the VIP should get mounted successfully and should not get disconnected automatically.
Observed result: the VIP mount connects and then disconnects within a few seconds.

Just to have a clean setup, we used a different set of IPs as public addresses. We did not observe any disconnects with those IPs in that case.
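The logging recommendation above can be sanity-checked mechanically. The following is only an illustrative sketch (not a Samba tool): it parses an smb.conf-style `[global]` section and reports which of the suggested options are missing or set to a different value. The option names and values come from the comment above; the helper name and input handling are assumptions.

```python
# Sketch: verify that an smb.conf [global] section contains the logging
# options recommended above. Hypothetical helper, not part of Samba.
RECOMMENDED = {
    "max log size": "0",
    "debug hires timestamp": "yes",
    "log level": "10",
    "log file": "/var/log/samba/log.smbd",
}

def missing_logging_options(smb_conf_text):
    """Return the recommended options that are absent (or wrong) in [global]."""
    in_global = False
    seen = {}
    for raw in smb_conf_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            in_global = (line.lower() == "[global]")
            continue
        if in_global and "=" in line:
            key, _, value = line.partition("=")
            seen[key.strip().lower()] = value.strip()
    return {k: v for k, v in RECOMMENDED.items() if seen.get(k) != v}

conf = """
[global]
    log level = 10
    max log size = 0
"""
print(sorted(missing_logging_options(conf)))
# → ['debug hires timestamp', 'log file']
```

Feeding it each node's smb.conf before re-running the reproducer would confirm that all nodes actually log at level 10 into an unrotated log.smbd.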
When we again reused the old set of IPs, we were able to reproduce the issue. The VIP that got disconnected was 10.70.47.55.
Below are the details from the windows client and 4 of the servers:
10.70.47.10 (VIP : 10.70.47.164)
------------------
[root@dhcp47-10 samba]# arp -n | grep 47.55
10.70.47.55 ether 52:54:00:26:d5:84 C ens3
[root@dhcp47-10 samba]# ifconfig | grep ether
ether 52:54:00:5f:3e:6d txqueuelen 1000 (Ethernet)
10.70.47.166 (VIP : 10.70.47.55)
------------------
[root@dhcp47-166 samba]# arp -n | grep 47.55
10.70.47.55 ether 52:54:00:13:7f:26 C ens3
[root@dhcp47-166 samba]# ifconfig | grep ether
ether 52:54:00:26:d5:84 txqueuelen 1000 (Ethernet)
10.70.47.159 (VIP : 10.70.47.160)
------------------
[root@dhcp47-159 samba]# arp -n | grep 47.55
10.70.47.55 ether 52:54:00:26:d5:84 C ens3
[root@dhcp47-159 samba]# ifconfig | grep ether
ether 52:54:00:bf:1a:7a txqueuelen 1000 (Ethernet)
10.70.47.58 (VIP : 10.70.47.11)
------------------
[root@dhcp47-58 samba]# arp -n | grep 47.55
10.70.47.55 ether 52:54:00:26:d5:84 C ens3
[root@dhcp47-58 samba]# ifconfig | grep ether
ether 52:54:00:13:7f:26 txqueuelen 1000 (Ethernet)
Windows Client
-----------------------------
arp -a
--------------
Interface: 10.70.47.181 --- 0xc
Internet Address Physical Address Type
10.70.46.29 52-54-00-62-05-da dynamic
10.70.46.177 52-54-00-45-8b-4f dynamic
10.70.47.5 52-54-00-cd-59-30 dynamic
10.70.47.10 52-54-00-5f-3e-6d dynamic
10.70.47.11 52-54-00-13-7f-26 dynamic
10.70.47.55 00-1a-4a-f7-23-1a dynamic
10.70.47.58 52-54-00-13-7f-26 dynamic
10.70.47.63 52-54-00-8b-0b-4f dynamic
10.70.47.159 52-54-00-bf-1a-7a dynamic
10.70.47.160 52-54-00-bf-1a-7a dynamic
10.70.47.164 52-54-00-5f-3e-6d dynamic
10.70.47.166 52-54-00-26-d5-84 dynamic
10.70.47.254 64-87-88-b8-75-41 dynamic
10.70.47.255 ff-ff-ff-ff-ff-ff static
224.0.0.22 01-00-5e-00-00-16 static
224.0.0.252 01-00-5e-00-00-fc static
239.255.255.250 01-00-5e-7f-ff-fa static
255.255.255.255 ff-ff-ff-ff-ff-ff static
From the above findings, it looks like a network issue where the ARP table is not updated on the windows client.
This is most likely not a bug.
It seems that another machine is using the same IP as one of the public IPs (10.70.47.55), and the two hosts are fighting over the ARP entry for this address. I think this explains the behaviour.
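The duplicate-address diagnosis above can be checked mechanically: collect the ARP entries observed by each host, normalize the two MAC notations (`:` on Linux, `-` on Windows), and flag any IP seen with more than one distinct hardware address. A minimal sketch, assuming the `arp -n` / `arp -a` outputs have already been reduced to (observer, ip, mac) tuples; the function names are hypothetical:

```python
# Sketch: flag IPs whose ARP entries disagree across observers, which is
# the duplicate-IP symptom diagnosed above.
from collections import defaultdict

def normalize_mac(mac):
    """Accept both 52:54:00:26:d5:84 and 52-54-00-26-D5-84 styles."""
    return mac.lower().replace("-", ":")

def arp_conflicts(observations):
    """observations: iterable of (observer, ip, mac) tuples.
    Returns {ip: set of MACs} for every IP seen with >1 distinct MAC."""
    macs_by_ip = defaultdict(set)
    for _observer, ip, mac in observations:
        macs_by_ip[ip].add(normalize_mac(mac))
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}

# Data taken from the outputs above: the servers and the windows client
# disagree about who owns 10.70.47.55.
obs = [
    ("dhcp47-10",  "10.70.47.55", "52:54:00:26:d5:84"),
    ("dhcp47-159", "10.70.47.55", "52:54:00:26:d5:84"),
    ("dhcp47-58",  "10.70.47.55", "52:54:00:26:d5:84"),
    ("win-client", "10.70.47.55", "00-1a-4a-f7-23-1a"),
    ("win-client", "10.70.47.10", "52-54-00-5f-3e-6d"),
]
print(arp_conflicts(obs))  # flags 10.70.47.55 with two distinct MACs
```

Run against the tables above, only 10.70.47.55 is reported: the cluster nodes see the VIP at 52:54:00:26:d5:84 (dhcp47-166's NIC), while the windows client has cached 00:1a:4a:f7:23:1a, the rogue machine answering for the same address.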