Edit:
Storage 1 is running glusterfs 3.0.3 built on Mar 14 2010 04:05:39
Storage 2 is running glusterfs 3.0.3 built on Mar 14 2010 04:05:39
Clients srv132, srv133, srv134 are using glusterfs 3.0.5 built on Oct 10 2010 00:00:29
Hi there,

I have a very strange issue and I cannot determine the cause. I have two storages running, both on version 3.0.5. When I use a client on storage 1, I see some traffic on a node from storage 2, plus some traffic on a completely uninvolved node.

My network setup is shown in these pictures:
http://yfrog.com/f/eastoragexj/
http://img401.imageshack.us/gal.php?g=bse232eth0.png

Explanation:
==============
Storage 1 consists of:
Chassis 1: node1, node2, node3, node4 -> unify 13TB

Storage 2 consists of:
srv231, srv232, srv233, srv234 -> unify 73TB

Clients on Chassis 2 (light orange blades):
srv132, srv133, srv134 with mountpoints to storage1 and storage2
-----
srv132, srv133, srv134 are reading data from storage 1 with 64 threads.
storage2 is not used currently -> no jobs are accessing storage2.

Independent client (no glusterfs mount points or server):
node6 on chassis 1 sees traffic on eth1.

Network interface distribution:

chassis 1
node1,2,3,4 eth0: 30.11, 30.12, 30.13, 30.14
node1,2,3,4 eth1: 60.11, 60.12, 60.13, 60.14 (intended as the interaction interface for the bricks; clients should connect to this interface as well)

chassis 2
srv132,133,134 eth0: 30.132, 30.133, 30.134
srv132,133,134 eth1: 60.132, 60.133, 60.134 (client interface for glfs)

chassis 1
node6 eth0: 30.16
node6 eth1: 60.16

Orange interfaces always represent eth0; red interfaces always represent eth1.

configs storage 1: http://pastebin.com/m2DDEKPG
configs storage 2: http://pastebin.com/RyXgJegT
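To pin down what node6 (the uninvolved client) actually receives, a capture like the following could help (a minimal sketch; port 6996 is the glusterfsd port used in this setup, the output file name is arbitrary):

# Capture glusterfs traffic on the interface that should see none of it.
# -p disables promiscuous mode, so we only record frames the switch really
# delivers to node6, not frames we merely overhear.
tcpdump -p -n -i eth1 -w /tmp/node6-glfs.pcap 'tcp port 6996'

# Summarize which src/dst endpoints appear in the capture:
tcpdump -n -r /tmp/node6-glfs.pcap | awk '{print $3, $5}' | sort | uniq -c | sort -rn

The -p flag is the interesting part: if packets still show up with promiscuous mode off, the switch is genuinely forwarding those frames to node6's port.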
Hi, it occurs again. I am not sure if it's a glusterfsd bug or something with the network. Let's look at the details from tcpdump:

===========================
[12:31:12-root@srv14 ~]# ifconfig eth0 && ifconfig eth1
eth0      Link encap:Ethernet  HWaddr 00:15:C5:FD:42:60
          inet addr:192.168.30.14  Bcast:192.168.30.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:10648496853 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1109021748 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12024592926561 (10.9 TiB)  TX bytes:4033570961141 (3.6 TiB)

eth1      Link encap:Ethernet  HWaddr 00:15:C5:FD:42:62
          inet addr:192.168.60.14  Bcast:192.168.60.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:37729206203 errors:0 dropped:0 overruns:0 frame:0
          TX packets:29656361309 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:27337727209003 (24.8 TiB)  TX bytes:120227047524740 (109.3 TiB)
============================
[12:32:01-root@srv14 ~]# head -n 100 /tmp/glfd.dump | tail -n 50
17:51:34.830844 IP 192.168.60.89.1005 > 192.168.60.235.6996: P 7148:8504(1356) ack 281 win 1525 <nop,nop,timestamp 2366338968 3744359983>
17:51:34.831131 IP cluster1node5.intern.gatc.de.1014 > cluster1node4.intern.gatc.de.6996: P 144:288(144) ack 121 win 1525 <nop,nop,timestamp 1410381646 681749388>
17:51:34.831181 IP cluster1node5.intern.gatc.de.1015 > cluster1node4.intern.gatc.de.6996: P 144:288(144) ack 121 win 1525 <nop,nop,timestamp 1410381646 681749388>
17:51:34.831191 IP cluster1node4.intern.gatc.de.6996 > cluster1node5.intern.gatc.de.1014: P 121:241(120) ack 288 win 1525 <nop,nop,timestamp 681749389 1410381646>
17:51:34.831229 IP cluster1node4.intern.gatc.de.6996 > cluster1node5.intern.gatc.de.1015: P 121:241(120) ack 288 win 1525 <nop,nop,timestamp 681749389 1410381646>
17:51:34.831685 IP cluster1node5.intern.gatc.de.1013 > cluster1node4.intern.gatc.de.6996: P 971:1323(352) ack 8557 win 1525 <nop,nop,timestamp 1410381646 681749389>
17:51:34.831750 IP cluster1node5.intern.gatc.de.1012 > cluster1node4.intern.gatc.de.6996: P 971:1323(352) ack 10310 win 1525 <nop,nop,timestamp 1410381646 681749389>
17:51:34.831824 IP cluster1node4.intern.gatc.de.6996 > cluster1node5.intern.gatc.de.1013: P 8557:8841(284) ack 1323 win 1525 <nop,nop,timestamp 681749389 1410381646>
17:51:34.831864 IP cluster1node4.intern.gatc.de.6996 > cluster1node5.intern.gatc.de.1012: P 10310:10594(284) ack 1323 win 1525 <nop,nop,timestamp 681749389 1410381646>
17:51:34.832048 IP 192.168.60.83.1007 > 192.168.60.235.6996: . 4252:5700(1448) ack 281 win 1525 <nop,nop,timestamp 2444060750 3744359984>
17:51:34.832052 IP 192.168.60.83.1007 > 192.168.60.235.6996: . 5700:7148(1448) ack 281 win 1525 <nop,nop,timestamp 2444060750 3744359984>
17:51:34.832056 IP 192.168.60.83.1007 > 192.168.60.235.6996: P 7148:8504(1356) ack 281 win 1525 <nop,nop,timestamp 2444060750 3744359984>
17:51:34.832221 IP cluster1node5.intern.gatc.de.1013 > cluster1node4.intern.gatc.de.6996: P 1323:1546(223) ack 8841 win 1525 <nop,nop,timestamp 1410381646 681749389>
17:51:34.832252 IP cluster1node5.intern.gatc.de.1012 > cluster1node4.intern.gatc.de.6996: P 1323:1546(223) ack 10594 win 1525 <nop,nop,timestamp 1410381646 681749389>
17:51:34.832362 IP cluster1node4.intern.gatc.de.6996 > cluster1node5.intern.gatc.de.1013: P 8841:9175(334) ack 1546 win 1525 <nop,nop,timestamp 681749389 1410381646>
17:51:34.832365 IP cluster1node4.intern.gatc.de.6996 > cluster1node5.intern.gatc.de.1012: P 10594:10928(334) ack 1546 win 1525 <nop,nop,timestamp 681749389 1410381646>
17:51:34.832833 IP 192.168.60.85.1005 > 192.168.60.235.6996: . ack 1653876659 win 1402 <nop,nop,timestamp 2448184066 3744359995>
17:51:34.832870 IP 192.168.60.85.1005 > 192.168.60.235.6996: . ack 2897 win 1514 <nop,nop,timestamp 2448184066 3744359995>
17:51:34.832903 IP 192.168.60.85.1005 > 192.168.60.235.6996: . ack 17377 win 1525 <nop,nop,timestamp 2448184066 3744359995>
17:51:34.833132 IP 192.168.60.85.1005 > 192.168.60.235.6996: . ack 34753 win 1525 <nop,nop,timestamp 2448184066 3744359995>
17:51:34.833184 IP 192.168.60.85.1005 > 192.168.60.235.6996: . ack 52129 win 1525 <nop,nop,timestamp 2448184066 3744359995>
17:51:34.833342 IP 192.168.60.85.1005 > 192.168.60.235.6996: . ack 69505 win 1525 <nop,nop,timestamp 2448184066 3744359995>
17:51:34.833583 IP 192.168.60.85.1005 > 192.168.60.235.6996: . ack 98465 win 1525 <nop,nop,timestamp 2448184066 3744359995>
17:51:34.833622 IP 192.168.60.87.1005 > 192.168.60.235.6996: . ack 2399993835 win 1525 <nop,nop,timestamp 2443603783 3744359995>
17:51:34.833849 IP 192.168.60.87.1005 > 192.168.60.235.6996: . ack 1449 win 1525 <nop,nop,timestamp 2443603784 3744359995>
==================
So we can see "normal" traffic, such as
IP cluster1node5.intern.gatc.de.1014 > cluster1node4.intern.gatc.de.6996
(srv14 & srv15 are the same servers; the DNS name would be cluster1node5...).

And we can see traffic coming from srv85 to srv235, which is at least 1 hop away! Why is this happening?! Does anyone have a suggestion on how to investigate this more deeply?

Thx
Matthias
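One way to dig a layer deeper (a sketch; the host addresses are taken from the ifconfig output above): look at the Ethernet headers of the stray packets. If their destination MAC is not srv14's own, the switch is flooding unicast frames (e.g. because its MAC table overflowed); if it is srv14's MAC, something is genuinely addressing or routing the frames to this host.

# Print link-level headers (-e) for glusterfs packets that involve neither
# of this host's own addresses; -c 20 stops after 20 matching packets.
tcpdump -e -n -i eth1 -c 20 'tcp port 6996 and not host 192.168.60.14 and not host 192.168.30.14'

# Compare each destination MAC against eth1's own (00:15:C5:FD:42:62 above):
# - foreign dst MAC -> the switch is flooding the frames to this port
# - own dst MAC     -> the frames are really being routed/addressed here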
This looks like something specific to your routing setup. Maybe this server acts as an L3 gateway for those two networks, or your switch does not do L2/MAC filtering. I don't see in what possible way this is an issue with Gluster (except that the traffic is Gluster traffic). Gluster uses only standard sockets for all its communication.

Avati
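Both hypotheses can be tested quickly on the node that sees the stray traffic (a sketch using standard Linux tools; the subnets are the 30.x/60.x networks described above):

# Is this host forwarding packets between the two networks, i.e. acting
# as an L3 gateway? A value of 1 means the kernel forwards packets.
sysctl net.ipv4.ip_forward

# Do any routes pull cross-network traffic through this box?
ip route show

# Stale or duplicate ARP entries can steer frames to the wrong port:
arp -an | grep 192.168.60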
(In reply to comment #3)
> This looks like something specific to your routing setup. Maybe this server
> acts as an L3 gateway for those two networks, or your switch does not do
> L2/MAC filtering. I don't see in what possible way this is an issue with
> Gluster (except that the traffic is Gluster traffic). Gluster uses only
> standard sockets for all its communication.
>
> Avati

The weird thing is, I only see glusterfs traffic! Nothing else. One more point: it's only ack traffic that we can see:

6.134719 192.168.60.13 -> 192.168.30.143 TCP 6996 > 1014 [PSH, ACK] Seq=1 Ack=1 Win=1525 Len=120 TSV=2243628531 TSER=711865572
6.134720 192.168.60.14 -> 192.168.30.143 TCP 6996 > 1011 [PSH, ACK] Seq=1 Ack=1 Win=1525 Len=120 TSV=934533108 TSER=711865572
6.134808 192.168.60.12 -> 192.168.30.143 TCP 6996 > 1017 [ACK] Seq=1 Ack=1 Win=1525 Len=0 TSV=934436222 TSER=711865572
6.134810 192.168.60.13 -> 192.168.30.143 TCP 6996 > 1015 [ACK] Seq=1 Ack=1 Win=1525 Len=0 TSV=2243628531 TSER=711865572
6.134811 192.168.60.12 -> 192.168.30.143 TCP 6996 > 1017 [PSH, ACK] Seq=1 Ack=1 Win=1525 Len=120 TSV=934436222 TSER=711865572
6.134812 192.168.60.11 -> 192.168.30.143 TCP 6996 > 1023 [PSH, ACK] Seq=1 Ack=1 Win=1525 Len=208 TSV=931764301 TSER=711865572

60.11, 60.12, 60.13, 60.14 are the brick servers. These ack packets run periodically through the network; there is no data traffic on the bricks, as far as I can tell. The ack packets go to all connected clients: 60.141, 142, 143, 145 ...

So something is very strange here. I wouldn't deny that a switch could be misbehaving, but if so, why can I only see glusterfs push/ack packets? No data packets, no other mis-routed packets?

Some ideas?
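To back up the claim that only small PSH/ACK frames and no data payloads cross over, the existing capture can be filtered by frame size (a sketch; 'greater N' is a standard pcap filter primitive matching frames of at least N bytes, and /tmp/glfd.dump is the dump file from comment #2):

# Any glusterfs frame carrying real data will be far larger than the
# 120-208 byte PSH/ACK packets listed above; expect no output if the
# stray traffic really consists of acknowledgements only.
tcpdump -n -r /tmp/glfd.dump 'tcp port 6996 and greater 400'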
(In reply to comment #4)
> So something is very strange here. I wouldn't deny that a switch could be
> misbehaving, but if so, why can I only see glusterfs push/ack packets? No
> data packets, no other mis-routed packets?

Update: the traffic was also seen on 60.131, an additional, different server, which is located in another bladecenter. It's getting even stranger.
> Update: the traffic was also seen on 60.131, an additional, different
> server, which is located in another bladecenter. It's getting even
> stranger.

Maybe your configuration is such that it results in traffic between the two sites. As I already mentioned, glusterfs contains its entire communication within the AF_INET socket API. Even if it wanted to, glusterfs could not send mis-routed traffic on the wire.

Avati
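This can also be checked from the outside (a sketch using net-tools netstat; port 6996 as configured): a process that only holds ordinary TCP sockets has no say in where the kernel and the switches deliver its packets.

# All sockets held by gluster processes should be plain TCP connections:
netstat -tnp | grep -i gluster

# Raw sockets would be required to craft misdirected frames on the wire;
# gluster should not show up here at all (-w lists raw sockets):
netstat -wnp | grep -i gluster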
Please update the status of this bug, as it has been more than 6 months since it was filed (bug id < 2000). Please resolve it with the proper resolution if it is no longer valid. If it is still valid and not critical, move it to 'enhancement' severity.
Seems to be a local network problem on Dell 5316M switches. We still have no idea why this is happening.
-----
The same issues occur with the 3.0.7 or 3.2.x versions of glusterfs. I am a bit clueless. If I run iperf on the storages I get good performance, about 1 Gbit:

[09:31:08 root@srv11 glusterfs]# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.30.11 port 5001 connected with 192.168.30.20 port 48902
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.10 GBytes  941 Mbits/sec
------
But if I transfer data only over this switch, i.e. the internal bladechassis switch, the throughput is worse than bad and I see lost connections in the logfile:
------
[2011-06-22 08:46:41] N [client-protocol.c:6329:client_setvolume_cbk] 192.168.60.14-1: Connected to 192.168.60.14:6996, attached to remote volume 'brick1'.
[2011-06-22 08:46:41] N [client-protocol.c:6329:client_setvolume_cbk] 192.168.60.14-1: Connected to 192.168.60.14:6996, attached to remote volume 'brick1'.
[2011-06-22 08:46:41] W [fuse-bridge.c:1179:fuse_err_cbk] glusterfs-fuse: 5798: FLUSH() ERR => -1 (Transport endpoint is not connected)
[2011-06-22 08:47:23] E [client-protocol.c:415:client_ping_timer_expired] 192.168.60.11-1: Server 192.168.60.11:6996 has not responded in the last 42 seconds, disconnecting.
[2011-06-22 08:47:23] E [saved-frames.c:165:saved_frames_unwind] 192.168.60.11-1: forced unwinding frame type(1) op(WRITE)
[2011-06-22 08:47:23] E [saved-frames.c:165:saved_frames_unwind] 192.168.60.11-1: forced unwinding frame type(1) op(WRITE)
[2011-06-22 08:47:23] W [fuse-bridge.c:936:fuse_setattr_cbk] glusterfs-fuse: 5806: SETATTR() /pacbio/common/test/jobs/lambda/results/accuracyHistogram.png => -1 (Transport endpoint is not connected)
[2011-06-22 08:47:23] E [saved-frames.c:165:saved_frames_unwind] 192.168.60.11-1: forced unwinding frame type(2) op(PING)
[2011-06-22 08:47:23] N [client-protocol.c:7077:notify] 192.168.60.11-1: disconnected
[2011-06-22 08:47:23] N [client-protocol.c:6329:client_setvolume_cbk] 192.168.60.11-1: Connected to 192.168.60.11:6996, attached to remote volume 'brick1'.
-----
Does anyone have a clue? Otherwise this is closed for me.

M
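Since a single iperf stream passes cleanly, it might be worth approximating gluster's actual traffic pattern: many parallel, bidirectional streams over the internal bladechassis switch (a sketch; standard iperf 2 options, addresses from the setup above):

# Server on a brick node:
iperf -s

# From a client blade: 16 parallel streams (-P), bidirectional (-d), 60s.
# Many concurrent flows can overrun the switch's buffers where a single
# stream (941 Mbits/sec above) does not, which would match the ping
# timeouts and dropped connections in the log.
iperf -c 192.168.60.11 -P 16 -d -t 60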