Bug 244708 - flock/dlm conflict causing slow file locking on gfs nodes
flock/dlm conflict causing slow file locking on gfs nodes
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm-kernel (Show other bugs)
4
i386 Linux
urgent Severity high
: rc
: ---
Assigned To: Christine Caulfield
GFS Bugs
: ZStream
Depends On:
Blocks: 442573
  Show dependency treegraph
 
Reported: 2007-06-18 14:02 EDT by Jeremy West
Modified: 2010-10-22 11:46 EDT (History)
5 users (show)

See Also:
Fixed In Version: RHBA-2008-0796
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-07-25 15:17:46 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Validation program for gfs writes (5.37 KB, application/octet-stream)
2007-06-18 14:27 EDT, Jeremy West
no flags Details
Validation program for gfs non-writes (5.07 KB, application/octet-stream)
2007-06-18 14:30 EDT, Jeremy West
no flags Details
De-nagle it (1.12 KB, patch)
2007-06-19 12:23 EDT, Christine Caulfield
no flags Details | Diff
Send a reply from resdir remove commands (784 bytes, patch)
2007-06-21 08:41 EDT, Christine Caulfield
no flags Details | Diff
ethereal captures (380.20 KB, application/x-gzip)
2007-06-22 12:00 EDT, Jeremy West
no flags Details
Userland test program (2.07 KB, text/x-csrc)
2007-06-29 05:19 EDT, Christine Caulfield
no flags Details
Working nagle patch (3.69 KB, patch)
2007-07-02 06:47 EDT, Christine Caulfield
no flags Details | Diff

  None (edit)
Description Jeremy West 2007-06-18 14:02:54 EDT
Customer is running RHEL 4.3 with Cluster Suite 6.1 and experience poor
performance with gfs locking.  Currently this issue is being looked at by
engineering who have narrowed the problem down to this:

When exclusive locking is used the nodes don't free up the locks fast enough. 
This may be caused by a conflict between dlm and flock.  A validation program
has been sent to the customer and the return results confirm this.  The next
step is for engineering to examine the new code in RHEL 5, which resolves the
problem, and determine how to patch the existing code in RHEL 4.

Thanks
Jeremy West
Comment 1 Jeremy West 2007-06-18 14:27:35 EDT
Created attachment 157305 [details]
Validation program for gfs writes

This is a reproduction/validation program which should be run with the
following command.

time ./gfs_flock_test "testfile_on_GFS_SAN"

where "testfile_on_GFS_SAN" is an actual file on the gfs filesystem.
Comment 2 Jeremy West 2007-06-18 14:30:36 EDT
Created attachment 157306 [details]
Validation program for gfs non-writes

This is a the validation/reproduction program for testing non-writes to the gfs
filesystem.  The syntax for using this is as follows:

time ./gfs_flock_test "testfile_on_GFS_SAN"

where "testfile_on_GFS_SAN" is the actual existing file that should be locked
Comment 19 David Teigland 2007-06-19 11:31:16 EDT
I doubt the same problem exists on RHEL5, I've not tested it properly.

update to comment 15:
When node1 or node2 need to talk with node3, the program runs slow -- that's
for files test1 and test2 where node3 is the resdir node.
When node3 needs to talk with node1 or node2, the program runs slow -- that's
for file test3 where node1 is the resdir node.

The flock test is quite unique in that the only dlm traffic when it runs is
resdir lookups/removals.
Comment 20 Christine Caulfield 2007-06-19 12:23:54 EDT
Created attachment 157391 [details]
De-nagle it

Disabling Nagle fixes this for me. Though it almost certainly needs to be an
option rather than the default.
Comment 22 David Teigland 2007-06-19 13:23:17 EDT
I"m guessing what's happening here is that the customer is mounting a third
node that's not doing anything.  So, the third node only gets lots of small
resdir lookups/removals which triggers the Nagle effects (batching together
lots of small messages to make more effecient use of the network).

An interesting experiment would be to have the third node actively doing
stuff on the file system to cause more dlm traffic.  The increased traffic
may avert the Nagling.

Meanwhile, I'll work on a patch that makes the NODELAY flag configurable.
Then we can get a test package created.
Comment 23 David Teigland 2007-06-19 13:40:28 EDT
Trying the NODELAY patch myself, I'm not seeing any difference, so it may not
be the answer.
Comment 26 David Teigland 2007-06-19 14:54:15 EDT
The NODELAY patch isn't working for me.
Some more interesting observations, though.

When two nodes are communicating fine, a resdir lookup from one to the
other takes in the range of 2-4 jiffies.  When two nodes are in this state
of slow comms, the resdir lookups take in the range of 37-40 jiffies.
This is when I run the gfs_flock_test where resdir lookups/removals are the
only dlm traffic.  However, other resdir lookups between these nodes for gfs
locks are normal, not slowed down.  So, it's only lookups for the flock test
that are slow, other lookups sent to the same node are fast.

Also, when I run the flock test, the _first_ lookup is fast, but all the rest
are slow.  If I put a delay of 40 ms between each iteration of the flock test,
then there's no delay in the resdir lookups.

Comment 27 David Teigland 2007-06-19 15:26:25 EDT
varying the delay between iterations in gfs_flock_test when there's
a remote resdir node with slow comms:

0 ms delay -> 40 ms per lookup
10 ms delay -> 30 ms per lookup
20 ms delay -> 20 ms per lookup
30 ms delay -> 10 ms per lookup
40 ms delay -> 3-5 ms per lookup

of course, if there's no slow comms with the resdir node, a lookup
always takes 3-5 ms regardless of the delay between iterations.
Comment 29 David Teigland 2007-06-19 17:23:02 EDT
I'm using the NODELAY patch, but I don't think it's having any effect.
The tcpdumps from the two nodes (fast and slow) do seem to show some clear
differences, but I can't tell much from them.

resdir for "test2" is dct-xen-06 (node2)
resdir for "test3" is dct-xen-07 (node3)

and I run the test on dct-xen-05 (node1):

[root@dct-xen-05 gfs]# time /root/gfs_flock_test test2 40

real    0m0.598s
user    0m0.000s
sys     0m0.264s
[root@dct-xen-05 gfs]# time /root/gfs_flock_test test3 40

real    0m1.871s
user    0m0.001s
sys     0m0.236s

This test does 40 iterations of flock/funlock.  That's 40 resdir lookups,
40 lookup replies, and 40 resdir removals (which have no reply).  That's
120 total dlm messages, which I suppose we should be able to correlate
with what tcpdump shows.  Including the tcp acks for each message would
get us to about 240 packets, which is just under the total packets counted
between the nodes in the quick test (248), and is above the 200 packets
counted between the nodes that are slow (05 and 07).


from 06:

09:39:39.512295 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 749162895:749163009(
114) ack 750896109 win 183 <nop,nop,timestamp 8398529 8221406>
09:39:39.516927 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 114 win 883 <nop
,nop,timestamp 8403076 8398529>
09:39:39.516938 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 753958116:753958188(
72) ack 747348844 win 183 <nop,nop,timestamp 8403079 8189088>
09:39:39.513351 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 72 win 1096 <nop
,nop,timestamp 8398535 8403079>
09:39:39.521318 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 114:228(114) ack 1 w
in 183 <nop,nop,timestamp 8398543 8403076>
09:39:39.521359 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 228 win 883 <nop
,nop,timestamp 8403089 8398543>
09:39:39.524281 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 72:144(72) ack 1 win
 183 <nop,nop,timestamp 8403092 8398535>
09:39:39.526253 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 144 win 1096 <no
p,nop,timestamp 8398547 8403092>
09:39:39.532359 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 228:342(114) ack 1 w
in 183 <nop,nop,timestamp 8398554 8403089>
09:39:39.532372 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 342 win 883 <nop
,nop,timestamp 8403100 8398554>
09:39:39.534268 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 144:216(72) ack 1 wi
n 183 <nop,nop,timestamp 8403102 8398547>
09:39:39.537287 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 216 win 1096 <no
p,nop,timestamp 8398558 8403102>
09:39:39.539438 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 342:456(114) ack 1 w
in 183 <nop,nop,timestamp 8398560 8403100>
09:39:39.539448 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 456 win 883 <nop
,nop,timestamp 8403107 8398560>
09:39:39.545362 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 456:570(114) ack 1 w
in 183 <nop,nop,timestamp 8398566 8403107>
09:39:39.545378 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 570 win 883 <nop
,nop,timestamp 8403113 8398566>
09:39:39.548245 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 216:288(72) ack 1 wi
n 183 <nop,nop,timestamp 8403116 8398558>
09:39:39.548805 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 288 win 1096 <no
p,nop,timestamp 8398570 8403116>
09:39:39.552415 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 570:684(114) ack 1 w
in 183 <nop,nop,timestamp 8398573 8403113>
09:39:39.552433 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 684 win 883 <nop
,nop,timestamp 8403120 8398573>
09:39:39.559541 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 684:798(114) ack 1 w
in 183 <nop,nop,timestamp 8398581 8403120>
09:39:39.559584 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 798 win 883 <nop
,nop,timestamp 8403127 8398581>
09:39:39.563596 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 288:360(72) ack 1 wi
n 183 <nop,nop,timestamp 8403131 8398570>
09:39:39.564449 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 360 win 1096 <no
p,nop,timestamp 8398586 8403131>
09:39:39.565601 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 798:912(114) ack 1 w
in 183 <nop,nop,timestamp 8398587 8403127>
09:39:39.565622 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 912 win 883 <nop
,nop,timestamp 8403133 8398587>
09:39:39.570582 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 912:1026(114) ack 1 
win 183 <nop,nop,timestamp 8398592 8403133>
09:39:39.570598 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1026 win 883 <no
p,nop,timestamp 8403138 8398592>
09:39:39.576240 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 360:432(72) ack 1 wi
n 183 <nop,nop,timestamp 8403144 8398586>
09:39:39.577523 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 432 win 1096 <no
p,nop,timestamp 8398599 8403144>
09:39:39.580416 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1026:1140(114) ack 1
 win 183 <nop,nop,timestamp 8398601 8403138>
09:39:39.580546 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1140 win 883 <no
p,nop,timestamp 8403148 8398601>
09:39:39.586533 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1140:1254(114) ack 1
 win 183 <nop,nop,timestamp 8398608 8403148>
09:39:39.586635 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1254 win 883 <no
p,nop,timestamp 8403154 8398608>
09:39:39.588237 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 432:504(72) ack 1 wi
n 183 <nop,nop,timestamp 8403156 8398599>
09:39:39.589402 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 504 win 1096 <no
p,nop,timestamp 8398611 8403156>
09:39:39.591499 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1254:1368(114) ack 1
 win 183 <nop,nop,timestamp 8398613 8403154>
09:39:39.592067 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1368 win 883 <no
p,nop,timestamp 8403159 8398613>
09:39:39.597416 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1368:1482(114) ack 1
 win 183 <nop,nop,timestamp 8398619 8403159>
09:39:39.598238 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1482 win 883 <no
p,nop,timestamp 8403166 8398619>
09:39:39.601494 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 504:576(72) ack 1 wi
n 183 <nop,nop,timestamp 8403169 8398611>
09:39:39.601995 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 576 win 1096 <no
p,nop,timestamp 8398623 8403169>
09:39:39.605351 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1482:1596(114) ack 1
 win 183 <nop,nop,timestamp 8398626 8403166>
09:39:39.605501 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1596 win 883 <no
p,nop,timestamp 8403173 8398626>
09:39:39.612487 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1596:1710(114) ack 1
 win 183 <nop,nop,timestamp 8398633 8403173>
09:39:39.612621 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1710 win 883 <no
p,nop,timestamp 8403180 8398633>
09:39:39.614244 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 576:648(72) ack 1 wi
n 183 <nop,nop,timestamp 8403182 8398623>
09:39:39.616396 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 648 win 1096 <no
p,nop,timestamp 8398638 8403182>
09:39:39.620372 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1710:1824(114) ack 1
 win 183 <nop,nop,timestamp 8398641 8403180>
09:39:39.620497 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1824 win 883 <no
p,nop,timestamp 8403188 8398641>
09:39:39.628488 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1824:1938(114) ack 1
 win 183 <nop,nop,timestamp 8398651 8403188>
09:39:39.628590 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1938 win 883 <no
p,nop,timestamp 8403196 8398651>

...

248 packets captured
252 packets received by filter
0 packets dropped by kernel


from 07:

09:39:42.323185 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 760452160:760452274(
114) ack 766374568 win 1064 <nop,nop,timestamp 8401322 7918000>
09:39:42.330915 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 114 win 1019 <no
p,nop,timestamp 8402933 8401322>
09:39:42.330928 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 1:73(72) ack 114 win
 1019 <nop,nop,timestamp 8402937 8401322>
09:39:42.325988 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 114:228(114) ack 73 
win 1064 <nop,nop,timestamp 8401330 8402937>
09:39:42.365814 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 228 win 1019 <no
p,nop,timestamp 8402980 8401330>
09:39:42.366977 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 228:342(114) ack 73 
win 1064 <nop,nop,timestamp 8401371 8402980>
09:39:42.366988 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 342 win 1019 <no
p,nop,timestamp 8402981 8401371>
09:39:42.368867 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 73:145(72) ack 342 w
in 1019 <nop,nop,timestamp 8402983 8401371>
09:39:42.373062 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 342:456(114) ack 145
 win 1064 <nop,nop,timestamp 8401377 8402983>
09:39:42.412809 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 456 win 1019 <no
p,nop,timestamp 8403027 8401377>
09:39:42.413939 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 456:570(114) ack 145
 win 1064 <nop,nop,timestamp 8401418 8403027>
09:39:42.413949 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 570 win 1019 <no
p,nop,timestamp 8403028 8401418>
09:39:42.415843 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 145:217(72) ack 570 
win 1019 <nop,nop,timestamp 8403030 8401418>
09:39:42.420382 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 570:684(114) ack 217
 win 1064 <nop,nop,timestamp 8401424 8403030>
09:39:42.459813 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 684 win 1019 <no
p,nop,timestamp 8403074 8401424>
09:39:42.460947 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 684:798(114) ack 217
 win 1064 <nop,nop,timestamp 8401464 8403074>
09:39:42.460959 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 798 win 1019 <no
p,nop,timestamp 8403075 8401464>
09:39:42.462809 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 217:289(72) ack 798 
win 1019 <nop,nop,timestamp 8403077 8401464>
09:39:42.467878 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 798:912(114) ack 289
 win 1064 <nop,nop,timestamp 8401470 8403077>
09:39:42.507801 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 912 win 1019 <no
p,nop,timestamp 8403122 8401470>
09:39:42.508926 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 912:1026(114) ack 28
9 win 1064 <nop,nop,timestamp 8401512 8403122>
09:39:42.509013 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1026 win 1019 <n
op,nop,timestamp 8403123 8401512>
09:39:42.510837 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 289:361(72) ack 1026
 win 1019 <nop,nop,timestamp 8403125 8401512>
09:39:42.515217 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1026:1140(114) ack 3
61 win 1064 <nop,nop,timestamp 8401518 8403125>
09:39:42.554791 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1140 win 1019 <n
op,nop,timestamp 8403169 8401518>
09:39:42.555923 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1140:1254(114) ack 3
61 win 1064 <nop,nop,timestamp 8401559 8403169>
09:39:42.556007 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1254 win 1019 <n
op,nop,timestamp 8403170 8401559>
09:39:42.557820 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 361:433(72) ack 1254
 win 1019 <nop,nop,timestamp 8403172 8401559>
09:39:42.561807 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1254:1368(114) ack 4
33 win 1064 <nop,nop,timestamp 8401565 8403172>
09:39:42.601779 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1368 win 1019 <n
op,nop,timestamp 8403216 8401565>
09:39:42.602908 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1368:1482(114) ack 4
33 win 1064 <nop,nop,timestamp 8401606 8403216>
09:39:42.602988 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1482 win 1019 <n
op,nop,timestamp 8403217 8401606>
09:39:42.604785 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 433:505(72) ack 1482
 win 1019 <nop,nop,timestamp 8403219 8401606>
09:39:42.609921 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1482:1596(114) ack 5
05 win 1064 <nop,nop,timestamp 8401613 8403219>
09:39:42.649774 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1596 win 1019 <n
op,nop,timestamp 8403264 8401613>
09:39:42.650908 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1596:1710(114) ack 5
05 win 1064 <nop,nop,timestamp 8401654 8403264>
09:39:42.650993 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1710 win 1019 <n
op,nop,timestamp 8403265 8401654>
09:39:42.652782 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 505:577(72) ack 1710
 win 1019 <nop,nop,timestamp 8403267 8401654>
09:39:42.656888 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1710:1824(114) ack 5
77 win 1064 <nop,nop,timestamp 8401660 8403267>
09:39:42.696773 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1824 win 1019 <n
op,nop,timestamp 8403311 8401660>
09:39:42.697891 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1824:1938(114) ack 5
77 win 1064 <nop,nop,timestamp 8401701 8403311>
09:39:42.697975 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1938 win 1019 <n
op,nop,timestamp 8403312 8401701>
09:39:42.699807 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 577:649(72) ack 1938
 win 1019 <nop,nop,timestamp 8403314 8401701>
09:39:42.705159 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1938:2052(114) ack 6
49 win 1064 <nop,nop,timestamp 8401708 8403314>
09:39:42.744762 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2052 win 1019 <n
op,nop,timestamp 8403359 8401708>
09:39:42.745900 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2052:2166(114) ack 6
49 win 1064 <nop,nop,timestamp 8401749 8403359>
09:39:42.745993 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2166 win 1019 <n
op,nop,timestamp 8403360 8401749>
09:39:42.747775 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 649:721(72) ack 2166
 win 1019 <nop,nop,timestamp 8403362 8401749>
09:39:42.752909 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2166:2280(114) ack 7
21 win 1064 <nop,nop,timestamp 8401756 8403362>
09:39:42.792751 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2280 win 1019 <n
op,nop,timestamp 8403407 8401756>
09:39:42.793879 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2280:2394(114) ack 7
21 win 1064 <nop,nop,timestamp 8401797 8403407>
09:39:42.793965 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2394 w1019 <nop,
nop,timestamp 8403408 8401797>
09:39:42.795766 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 721:793(72) ack 2394
 win 1019 <nop,nop,timestamp 8403410 8401797>
09:39:42.800149 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2394:2508(114) ack 7
93 win 1064 <nop,nop,timestamp 8401803 8403410>
09:39:42.839749 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2508 win 1019 <n
op,nop,timestamp 8403454 8401803>

...

200 packets captured
206 packets received by filter
0 packets dropped by kernel


This is consistent between runs.  If I add a 40 ms delay between each
iteratation in gfs_flock_test, then the resdir lookup delays disappear (go
back to 3-5 ms instead of about 40 ms), but the tcpdump is still the same:

[root@dct-xen-05 gfs]# time /root/gfs_flock_test test3 40 40

real    0m2.152s
user    0m0.001s
sys     0m0.169s

10:04:27.404170 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 760479520:760479634(
114) ack 766383208 win 1064 <nop,nop,timestamp 9886814 9448184>
10:04:27.404431 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 114 win 1019 <no
p,nop,timestamp 9888242 9886814>
10:04:27.404442 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 1:73(72) ack 114 win
 1019 <nop,nop,timestamp 9888243 9886814>
10:04:27.407413 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 114:228(114) ack 73 
win 1064 <nop,nop,timestamp 9886821 9888243>
10:04:27.447054 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 228 win 1019 <no
p,nop,timestamp 9888287 9886821>
10:04:27.455182 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 228:342(114) ack 73 
win 1064 <nop,nop,timestamp 9886868 9888287>
10:04:27.455230 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 342 win 1019 <no
p,nop,timestamp 9888295 9886868>
10:04:27.457061 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 73:145(72) ack 342 w
in 1019 <nop,nop,timestamp 9888297 9886868>
10:04:27.461451 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 342:456(114) ack 145
 win 1064 <nop,nop,timestamp 9886875 9888297>
10:04:27.501046 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 456 win 1019 <no
p,nop,timestamp 9888341 9886875>
10:04:27.508170 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 456:570(114) ack 145
 win 1064 <nop,nop,timestamp 9886922 9888341>
10:04:27.508207 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 570 win 1019 <no
p,nop,timestamp 9888348 9886922>
10:04:27.510052 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 145:217(72) ack 570 
win 1019 <nop,nop,timestamp 9888350 9886922>
10:04:27.515291 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 570:684(114) ack 217
 win 1064 <nop,nop,timestamp 9886929 9888350>
10:04:27.555034 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 684 win 1019 <no
p,nop,timestamp 9888395 9886929>
10:04:27.562156 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 684:798(114) ack 217
 win 1064 <nop,nop,timestamp 9886975 9888395>
10:04:27.562185 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 798 win 1019 <no
p,nop,timestamp 9888402 9886975>
10:04:27.564061 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 217:289(72) ack 798 
win 1019 <nop,nop,timestamp 9888404 9886975>
10:04:27.570161 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 798:912(114) ack 289
 win 1064 <nop,nop,timestamp 9886983 9888404>
10:04:27.610026 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 912 win 1019 <no
p,nop,timestamp 9888450 9886983>
10:04:27.615157 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 912:1026(114) ack 28
9 win 1064 <nop,nop,timestamp 9887029 9888450>
10:04:27.615296 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1026 win 1019 <n
op,nop,timestamp 9888455 9887029>
10:04:27.617040 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 289:361(72) ack 1026
 win 1019 <nop,nop,timestamp 9888457 9887029>
10:04:27.621303 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1026:1140(114) ack 3
61 win 1064 <nop,nop,timestamp 9887035 9888457>
10:04:27.661018 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1140 win 1019 <n
op,nop,timestamp 9888501 9887035>
10:04:27.666145 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1140:1254(114) ack 3
61 win 1064 <nop,nop,timestamp 9887080 9888501>
10:04:27.666250 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1254 win 1019 <n
op,nop,timestamp 9888506 9887080>
10:04:27.667042 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 361:433(72) ack 1254
 win 1019 <nop,nop,timestamp 9888507 9887080>
10:04:27.673150 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1254:1368(114) ack 4
33 win 1064 <nop,nop,timestamp 9887086 9888507>
10:04:27.713127 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1368 win 1019 <n
op,nop,timestamp 9888553 9887086>
10:04:27.718131 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1368:1482(114) ack 4
33 win 1064 <nop,nop,timestamp 9887131 9888553>
10:04:27.718233 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1482 win 1019 <n
op,nop,timestamp 9888558 9887131>
10:04:27.720032 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 433:505(72) ack 1482
 win 1019 <nop,nop,timestamp 9888560 9887131>
10:04:27.725236 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1482:1596(114) ack 5
05 win 1064 <nop,nop,timestamp 9887139 9888560>
10:04:27.764999 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1596 win 1019 <n
op,nop,timestamp 9888605 9887139>
10:04:27.771125 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1596:1710(114) ack 5
05 win 1064 <nop,nop,timestamp 9887185 9888605>
10:04:27.771240 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1710 win 1019 <n
op,nop,timestamp 9888611 9887185>
10:04:27.772019 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 505:577(72) ack 1710
 win 1019 <nop,nop,timestamp 9888612 9887185>
10:04:27.777124 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1710:1824(114) ack 5
77 win 1064 <nop,nop,timestamp 9887191 9888612>
10:04:27.816993 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1824 win 1019 <n
op,nop,timestamp 9888657 9887191>
10:04:27.822121 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1824:1938(114) ack 5
77 win 1064 <nop,nop,timestamp 9887238 9888657>
10:04:27.822254 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1938 win 1019 <n
op,nop,timestamp 9888662 9887238>
10:04:27.824027 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 577:649(72) ack 1938
 win 1019 <nop,nop,timestamp 9888664 9887238>
10:04:27.828255 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1938:2052(114) ack 6
49 win 1064 <nop,nop,timestamp 9887244 9888664>
10:04:27.867991 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2052 win 1019 <n
op,nop,timestamp 9888708 9887244>
10:04:27.873113 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2052:2166(114) ack 6
49 win 1064 <nop,nop,timestamp 9887290 9888708>
10:04:27.873219 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2166 win 1019 <n
op,nop,timestamp 9888713 9887290>
10:04:27.874005 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 649:721(72) ack 2166
 win 1019 <nop,nop,timestamp 9888714 9887290>
10:04:27.879114 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2166:2280(114) ack 7
21 win 1064 <nop,nop,timestamp 9887296 9888714>
10:04:27.918976 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2280 win 1019 <n
op,nop,timestamp 9888759 9887296>
10:04:27.925108 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2280:2394(114) ack 7
21 win 1064 <nop,nop,timestamp 9887340 9888759>

...
200 packets captured
206 packets received by filter
0 packets dropped by kernel
Comment 30 David Teigland 2007-06-19 17:26:50 EDT
Current thinking is that this still looks like odd networking-related delays.
Patrick, do the tcpdumps give you any new ideas?
Comment 31 Christine Caulfield 2007-06-20 03:26:10 EDT
Just wondering how much difference Xen makes in all this. I'm using real nodes
for my tests. Xen networking add all sorts of unknowns to the situation.

I'll have a scour of those tcpumps though.
Comment 33 David Teigland 2007-06-20 17:29:28 EDT
Here's an observation based on looking at netstat output.  When there are
two connections between the two nodes involved in the test, the test runs
fast.  When there's one connection between the two nodes, it goes slow.

That doesn't seem to tell us what's wrong, but it may give us a way to
work around it, assuming we could arrange to always create two connections
between nodes.

I now have a four node cluster of dct-xen-05,06,07,08.
Comms between 05 and 06 are fast, and any other pairing is slow (05/07,
05/08, 06/07, 06/08, 07/08).

05 and 06 is the only pair where two connections exist.  (IIRC it's normal
for the first two nodes to have two connections between them since they
both try to connect to each other at once.)

[root@dct-xen-05]# netstat -n | grep 21064
tcp        0      0 10.15.84.95:21064           10.15.84.98:43883          
ESTABLISHED 
tcp        0      0 10.15.84.95:21064           10.15.84.97:32781          
ESTABLISHED 
tcp        0      0 10.15.84.95:21064           10.15.84.96:32775          
ESTABLISHED 
tcp        0      0 10.15.84.95:32775           10.15.84.96:21064          
ESTABLISHED 

[root@dct-xen-06]# netstat -n | grep 21064
tcp        0      0 10.15.84.96:32776           10.15.84.97:21064          
ESTABLISHED
tcp        0      0 10.15.84.96:21064           10.15.84.95:32775          
ESTABLISHED
tcp        0      0 10.15.84.96:32775           10.15.84.95:21064          
ESTABLISHED
tcp        0      0 10.15.84.96:21064           10.15.84.98:43884          
ESTABLISHED

[root@dct-xen-07]# netstat -n | grep 21064
tcp        0      0 10.15.84.97:21064           10.15.84.96:32776          
ESTABLISHED
tcp        0      0 10.15.84.97:32781           10.15.84.95:21064          
ESTABLISHED
tcp        0      0 10.15.84.97:21064           10.15.84.98:43885          
ESTABLISHED

[root@dct-xen-08]# netstat -n | grep 21064
tcp        0      0 10.15.84.98:43883           10.15.84.95:21064          
ESTABLISHED
tcp        0      0 10.15.84.98:43885           10.15.84.97:21064          
ESTABLISHED
tcp        0      0 10.15.84.98:43884           10.15.84.96:21064          
ESTABLISHED


Same holds on the roth nodes where only comms between 05 and 06 are fast.

[root@roth-05 gfs]# netstat -n | grep 21064
tcp        0      0 10.15.84.105:21064          10.15.84.108:32779         
ESTABLISHED 
tcp        0      0 10.15.84.105:21064          10.15.84.106:32782         
ESTABLISHED 
tcp        0      0 10.15.84.105:32782          10.15.84.106:21064         
ESTABLISHED 

[root@roth-06 gfs]# netstat -n | grep 21064
tcp        0      0 10.15.84.106:32783          10.15.84.108:21064         
ESTABLISHED
tcp        0      0 10.15.84.106:21064          10.15.84.105:32782         
ESTABLISHED
tcp        0      0 10.15.84.106:32782          10.15.84.105:21064         
ESTABLISHED

[root@roth-08 gfs]# netstat -n | grep 21064
tcp        0      0 10.15.84.108:21064          10.15.84.106:32783         
ESTABLISHED
tcp        0      0 10.15.84.108:32779          10.15.84.105:21064         
ESTABLISHED
Comment 34 Christine Caulfield 2007-06-21 08:41:42 EDT
Created attachment 157531 [details]
Send a reply from resdir remove commands

That's quite wierd actually. If anything I would expect the single-connections
to be more efficient than the dual-connections because they can piggyback ACKs
on the data packets - and looking at the tcpdump that's exactly what is
happening. Of course, if the ACKs are 40ms late then you sort of lose that
advantage!

What seems to be the indirect cause of this is the resdir remove command. This
does not need a reply so the acks get delayed. Now, for some reason I don't
understand this stalls the whole communication for about 40ms, until the ACK
timer fires.

My way around this is to force the remove resdir command to send a dummy reply.
this gets everything back on an even footing again and the whole communications
is at full speed regardless of how many connections you have between the nodes.
Comment 35 Jeremy West 2007-06-21 09:33:14 EDT
Patrick,

Is this patch ready for testing in the customer's environment?

Thanks
Jeremy West
Comment 36 Christine Caulfield 2007-06-21 10:50:07 EDT
Yes, it's fine for it to go to the customer
Comment 40 Jeremy West 2007-06-22 12:00:40 EDT
Created attachment 157628 [details]
ethereal captures

Ethereal captures taken while running the gfs_flock_test program. One capture
per server included.
Comment 45 David Teigland 2007-06-22 15:30:00 EDT
So they run the following on three nodes at once:
  while true; do time ./gfs_flock_test test_file; sleep 1 ; done
where gfs_flock_test does 130 iterations of flock/funlock and no reads/writes?
What do they see specifically that they don't like?

When I run this, I periodically see a time of 3 seconds, mostly it's around half
a second.  There's nothing that strikes me as out of place in what I'm seeing.
Comment 50 David Teigland 2007-06-25 16:08:23 EDT
test 1
while true; do time /root/gfs_write_test test_file 100; sleep 0 ; done
program does write(); 100 times
run on three nodes in parallel
each time output was abaout 5.8 sec on each node

test 2
while true; do time /root/gfs_flock_test test_file 100; sleep 0 ; done
program does flock(EX); flock(UN); 100 times
run on three nodes in parallel
each time output varied between 0.5 sec and 4 sec on the nodes, average ~1.7

test 3
while true; do time /root/gfs_flockwrite_test test_file 100; sleep 0 ; done
program does flock(EX); write(); flock(UN); 100 times
run on three nodes in parallel
each time output was about 10 sec

test 4
same as test 1, but the write was followed by an fsync, and the three processes
were all running on the same node
each time output was 1.3 sec

test 5
same as test 2, but the three processes were all running on the same node
each time output was 1.7 sec

test 6
same as test 3, but the write was followed by an fsync, and the three processes
were all running on the same node
each time output was 2.5 sec
Comment 53 David Teigland 2007-06-26 13:42:37 EDT
As usual, this comes down to carefully crafting the way an app uses gfs.
If you do things just right, gfs can help, but if you don't it can be very
painful.  For example, I took the test that does this:

  { flock(EX); write(testfile); flock(UN): } x100

-> takes about 10 sec when running on three nodes in parallel.

and changed it to this:

  { flock(EX); [write(testfile) x100]; flock(UN); }

-> takes about 0.15 sec when running on three nodes in parallel.

If an app can be made to operate less like the former and more like the later,
then contention is minimized instead of maximized.  With some apps it's
impossible to rework them in such a way that gfs can help; with other apps it
is.
Comment 57 David Teigland 2007-06-26 16:56:52 EDT
The two different example tests I showed in comment 53 involve exactly the
same locks being used.  The only difference between the two examples is the
frequency at which the locks are changing hands between the nodes.

In this case there are two locks involved.  The first is the inode lock that
gfs uses to protect the metadata.  The second is the flock that the app uses
to protect the file contents.  Each of these locks gets translated into a
separate dlm lock.  Each file on the file system will have separate inode
locks and separate flock locks.

In the first example:
node1 gets flock lock
node1 gets write lock
node1 does 1 write
node1 releases write lock
node1 releases flock lock
node2 gets flock lock
node2 gets write lock
node2 does 1 write
node2 releases write lock
node2 releases flock lock
node3 gets flock lock
node3 gets write lock
node3 does 1 write
node3 releases write lock
node3 releases flock lock
... x100

In the second example:
node1 gets flock lock
node1 gets inode lock
node1 does 100 writes
node1 releases inode lock
node1 releases flock lock
node2 gets flock lock
node2 gets inode lock
node2 does 100 writes
node2 releases inode lock
node2 releases flock lock
node3 gets flock lock
node3 gets inode lock
node3 does 100 writes
node3 releases inode lock
node3 releases flock lock
The end, that's the only locking that takes place, and the locks are
identical to the ones used above.

So, the app can control how much back and forth happens by using flocks to
synchronize its activity, not just to protect the file content.  Of course,
the app doesn't have to use flocks to synchronize its activity, apps often
use other mechanisms for this too, like their own network messages.  This
gets into how to design and build a distributed application.
Comment 58 David Teigland 2007-06-26 17:03:23 EDT
In comment 57 I inadvertently changed wording.
"gets write lock" is the same as "gets inode lock"
Comment 59 Christine Caulfield 2007-06-29 05:19:26 EDT
Created attachment 158189 [details]
Userland test program

I've reproduced the symptoms of this problem with a simple userland program
that does the same sort of traffic as the DLM does: two request/response
messages followed by a message that needs no reply.

To run:

node1$ strace -ttT ./test

node2$ strace -ttT ./test <ip-addr of node1>

You can see that after the message that does not need a reply, there is always
a 40ms delay before the next HELLO message can be sent.

tcpdumps show the same thing as the DLM problem. for some reason the next
packet doesn't get sent until the ACK for the previous packet is received.
Comment 62 Christine Caulfield 2007-07-02 05:59:32 EDT
hmm, disabling nagle fixes the userland test prog. So now to find out why it
doesn't seem to affect the DLM.
Comment 63 Christine Caulfield 2007-07-02 06:47:07 EDT
Created attachment 158327 [details]
Working nagle patch

Ah, it seems that the previous nagle patch didn't actually set the nonagle flag
at all - which is why it had no effect!

This patch DOES set the nagle flag (and adds the /proc config option too) and
works for me in all scenarios I could find.

It supercedes the "send reply" patch above and rememeber to 

  # echo 1 > /proc/cluster/config/dlm/tcp_nodelay

immediately after loading the dlm module.
Comment 64 Christine Caulfield 2007-07-02 09:30:28 EDT
I'd just like to add that this should NOT be recommended as a default setting
for any customers. If someone is seeing bad lock latencies with certain lock
traffic then it's worth a try. But for a heavily used GFS cluster this would
dramatically decrease performance.
Comment 71 David Teigland 2008-02-26 12:27:36 EST
I think this problem just went away when they started using gfs normally
instead of just running this specific, unrealistic test program.
Comment 72 Jeremy West 2008-04-14 17:13:26 EDT
In reply to comment #71 the customer's problem did *not* go away when they
started using GFS normally.  For the last year they've been running with the
tcp_no_delay patch we provided them in this BZ.  However they ran into an issue
at a customer site which we showed was resolved in RHEL 4u6, so they upgraded
the cluster appliance to RHEL 4u6 to verify this was the case.  In doing that,
they've now hit this issue again in the same way as before.  This customer is
asking us to provide them with a patched "dlm-kernel" which has this
tcp_no_delay patch in it.

Thanks
Jeremy West
Comment 73 RHEL Product and Program Management 2008-04-14 17:19:05 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 79 Rob Kenna 2008-04-15 10:26:18 EDT
re: comment #76

See comment #72

"In reply to comment #71 the customer's problem did *not* go away..."

---------------------
re: comment #78

We absolutely don't want to maintain a separate build.  Then this discussion
will be like "Groundhog Day.
Comment 81 Christine Caulfield 2008-04-15 11:37:46 EDT
Checked in for 4.7:

commit 8dc51a64cf9a0bc1b467353a4161220a7cec0de0
Author: Christine Caulfield <ccaulfie@redhat.com>
Date:   Tue Apr 15 16:34:48 2008 +0100

    [DLM] Add tcp_nodelay option to DLM comms
Comment 90 errata-xmlrpc 2008-07-25 15:17:46 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0796.html

Note You need to log in before you can comment on or make changes to this bug.