The customer is running RHEL 4.3 with Cluster Suite 6.1 and is experiencing poor performance with GFS locking. Engineering is currently investigating and has narrowed the problem down to this: when exclusive locking is used, the nodes don't free the locks quickly enough. This may be caused by a conflict between dlm and flock. A validation program was sent to the customer, and the returned results confirm this. The next step is for engineering to examine the new code in RHEL 5, which resolves the problem, and determine how to patch the existing code in RHEL 4. Thanks, Jeremy West
Created attachment 157305 [details] Validation program for gfs writes

This is a reproduction/validation program which should be run with the following command:

time ./gfs_flock_test "testfile_on_GFS_SAN"

where "testfile_on_GFS_SAN" is an actual file on the GFS filesystem.
Created attachment 157306 [details] Validation program for gfs non-writes

This is the validation/reproduction program for testing non-writes to the GFS filesystem. The syntax for using this is as follows:

time ./gfs_flock_test "testfile_on_GFS_SAN"

where "testfile_on_GFS_SAN" is the actual existing file that should be locked.
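The attachments themselves aren't reproduced here, but based on the usage above and the later comments ("40 iterations of flock/funlock"), the core of the flock test is presumably just a tight flock(EX)/flock(UN) loop on the target file. A minimal sketch (the real attachment is a C program; the function name and default iteration count here are assumptions):

```python
# Minimal sketch of the flock exerciser described above: repeatedly take and
# drop an exclusive flock on a target file, and time the loop. The real
# attachment is a C program; names and defaults here are assumptions.
import fcntl
import sys
import time

def flock_test(path, iterations=100):
    """Time a tight flock(EX)/flock(UN) loop on `path`."""
    start = time.monotonic()
    with open(path, "a+") as f:
        for _ in range(iterations):
            fcntl.flock(f, fcntl.LOCK_EX)  # each EX request may go to the resdir node
            fcntl.flock(f, fcntl.LOCK_UN)  # each unlock triggers a resdir removal
    return time.monotonic() - start

if __name__ == "__main__":
    iters = int(sys.argv[2]) if len(sys.argv) > 2 else 100
    print(f"{flock_test(sys.argv[1], iters):.3f} s")
```

On a local filesystem this completes almost instantly; the interesting timings only appear when the file lives on GFS and the lock traffic crosses nodes.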
I doubt the same problem exists on RHEL 5, but I've not tested it properly.

Update to comment 15: When node1 or node2 needs to talk with node3, the program runs slow -- that's for files test1 and test2, where node3 is the resdir node. When node3 needs to talk with node1 or node2, the program runs slow -- that's for file test3, where node1 is the resdir node. The flock test is quite unusual in that the only dlm traffic when it runs is resdir lookups/removals.
Created attachment 157391 [details] De-nagle it

Disabling Nagle fixes this for me, though it almost certainly needs to be an option rather than the default.
I'm guessing what's happening here is that the customer is mounting a third node that's not doing anything. So, the third node only gets lots of small resdir lookups/removals, which triggers the Nagle effect (batching together lots of small messages to make more efficient use of the network). An interesting experiment would be to have the third node actively doing stuff on the file system to cause more dlm traffic; the increased traffic may avert the Nagling. Meanwhile, I'll work on a patch that makes the NODELAY flag configurable. Then we can get a test package created.
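For reference, the userspace equivalent of the NODELAY flag being discussed is the TCP_NODELAY socket option, which disables Nagle batching on a TCP socket. The dlm patch does the in-kernel equivalent on its inter-node sockets; this sketch only illustrates the option itself:

```python
# Illustration of disabling Nagle on a TCP socket via TCP_NODELAY. The dlm
# patch sets the in-kernel equivalent on its inter-node sockets; this is
# just the userspace form of the same knob.
import socket

def make_nodelay_socket():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With Nagle enabled (the default), a small write may be held back until
    # the ACK for the previous segment arrives; TCP_NODELAY sends each small
    # message immediately instead.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s

if __name__ == "__main__":
    s = make_nodelay_socket()
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    s.close()
```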
Trying the NODELAY patch myself, I'm not seeing any difference, so it may not be the answer.
The NODELAY patch isn't working for me. Some more interesting observations, though:

When two nodes are communicating fine, a resdir lookup from one to the other takes in the range of 2-4 jiffies. When two nodes are in this state of slow comms, the resdir lookups take in the range of 37-40 jiffies. This is when I run gfs_flock_test, where resdir lookups/removals are the only dlm traffic. However, other resdir lookups between these nodes for gfs locks are normal, not slowed down. So it's only lookups for the flock test that are slow; other lookups sent to the same node are fast.

Also, when I run the flock test, the _first_ lookup is fast, but all the rest are slow. If I put a delay of 40 ms between each iteration of the flock test, then there's no delay in the resdir lookups.
Varying the delay between iterations in gfs_flock_test when there's a remote resdir node with slow comms:

 0 ms delay -> 40 ms per lookup
10 ms delay -> 30 ms per lookup
20 ms delay -> 20 ms per lookup
30 ms delay -> 10 ms per lookup
40 ms delay -> 3-5 ms per lookup

(In each case, delay plus lookup time comes to roughly 40 ms.) Of course, if there's no slow comms with the resdir node, a lookup always takes 3-5 ms regardless of the delay between iterations.
I'm using the NODELAY patch, but I don't think it's having any effect. The tcpdumps from the two nodes (fast and slow) do seem to show some clear differences, but I can't tell much from them.

resdir for "test2" is dct-xen-06 (node2)
resdir for "test3" is dct-xen-07 (node3)

and I run the test on dct-xen-05 (node1):

[root@dct-xen-05 gfs]# time /root/gfs_flock_test test2 40
real 0m0.598s
user 0m0.000s
sys 0m0.264s

[root@dct-xen-05 gfs]# time /root/gfs_flock_test test3 40
real 0m1.871s
user 0m0.001s
sys 0m0.236s

This test does 40 iterations of flock/funlock. That's 40 resdir lookups, 40 lookup replies, and 40 resdir removals (which have no reply) -- 120 dlm messages in total, which I suppose we should be able to correlate with what tcpdump shows. Including the tcp acks for each message would get us to about 240 packets, which is just under the total packets counted between the nodes in the quick test (248), and is above the 200 packets counted between the nodes that are slow (05 and 07).

from 06:

09:39:39.512295 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 749162895:749163009(114) ack 750896109 win 183 <nop,nop,timestamp 8398529 8221406>
09:39:39.516927 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 114 win 883 <nop,nop,timestamp 8403076 8398529>
09:39:39.516938 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 753958116:753958188(72) ack 747348844 win 183 <nop,nop,timestamp 8403079 8189088>
09:39:39.513351 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 72 win 1096 <nop,nop,timestamp 8398535 8403079>
09:39:39.521318 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 114:228(114) ack 1 win 183 <nop,nop,timestamp 8398543 8403076>
09:39:39.521359 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 228 win 883 <nop,nop,timestamp 8403089 8398543>
09:39:39.524281 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 72:144(72) ack 1 win 183 <nop,nop,timestamp 8403092 8398535>
09:39:39.526253 IP 10.15.84.95.21064 > 10.15.84.96.32777: . 
ack 144 win 1096 <no p,nop,timestamp 8398547 8403092> 09:39:39.532359 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 228:342(114) ack 1 w in 183 <nop,nop,timestamp 8398554 8403089> 09:39:39.532372 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 342 win 883 <nop ,nop,timestamp 8403100 8398554> 09:39:39.534268 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 144:216(72) ack 1 wi n 183 <nop,nop,timestamp 8403102 8398547> 09:39:39.537287 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 216 win 1096 <no p,nop,timestamp 8398558 8403102> 09:39:39.539438 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 342:456(114) ack 1 w in 183 <nop,nop,timestamp 8398560 8403100> 09:39:39.539448 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 456 win 883 <nop ,nop,timestamp 8403107 8398560> 09:39:39.545362 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 456:570(114) ack 1 w in 183 <nop,nop,timestamp 8398566 8403107> 09:39:39.545378 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 570 win 883 <nop ,nop,timestamp 8403113 8398566> 09:39:39.548245 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 216:288(72) ack 1 wi n 183 <nop,nop,timestamp 8403116 8398558> 09:39:39.548805 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 288 win 1096 <no p,nop,timestamp 8398570 8403116> 09:39:39.552415 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 570:684(114) ack 1 w in 183 <nop,nop,timestamp 8398573 8403113> 09:39:39.552433 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 684 win 883 <nop ,nop,timestamp 8403120 8398573> 09:39:39.559541 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 684:798(114) ack 1 w in 183 <nop,nop,timestamp 8398581 8403120> 09:39:39.559584 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 798 win 883 <nop ,nop,timestamp 8403127 8398581> 09:39:39.563596 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 288:360(72) ack 1 wi n 183 <nop,nop,timestamp 8403131 8398570> 09:39:39.564449 IP 10.15.84.95.21064 > 10.15.84.96.32777: . 
ack 360 win 1096 <no p,nop,timestamp 8398586 8403131> 09:39:39.565601 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 798:912(114) ack 1 w in 183 <nop,nop,timestamp 8398587 8403127> 09:39:39.565622 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 912 win 883 <nop ,nop,timestamp 8403133 8398587> 09:39:39.570582 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 912:1026(114) ack 1 win 183 <nop,nop,timestamp 8398592 8403133> 09:39:39.570598 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1026 win 883 <no p,nop,timestamp 8403138 8398592> 09:39:39.576240 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 360:432(72) ack 1 wi n 183 <nop,nop,timestamp 8403144 8398586> 09:39:39.577523 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 432 win 1096 <no p,nop,timestamp 8398599 8403144> 09:39:39.580416 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1026:1140(114) ack 1 win 183 <nop,nop,timestamp 8398601 8403138> 09:39:39.580546 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1140 win 883 <no p,nop,timestamp 8403148 8398601> 09:39:39.586533 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1140:1254(114) ack 1 win 183 <nop,nop,timestamp 8398608 8403148> 09:39:39.586635 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1254 win 883 <no p,nop,timestamp 8403154 8398608> 09:39:39.588237 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 432:504(72) ack 1 wi n 183 <nop,nop,timestamp 8403156 8398599> 09:39:39.589402 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 504 win 1096 <no p,nop,timestamp 8398611 8403156> 09:39:39.591499 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1254:1368(114) ack 1 win 183 <nop,nop,timestamp 8398613 8403154> 09:39:39.592067 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1368 win 883 <no p,nop,timestamp 8403159 8398613> 09:39:39.597416 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1368:1482(114) ack 1 win 183 <nop,nop,timestamp 8398619 8403159> 09:39:39.598238 IP 10.15.84.96.21064 > 10.15.84.95.32777: . 
ack 1482 win 883 <no p,nop,timestamp 8403166 8398619> 09:39:39.601494 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 504:576(72) ack 1 wi n 183 <nop,nop,timestamp 8403169 8398611> 09:39:39.601995 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 576 win 1096 <no p,nop,timestamp 8398623 8403169> 09:39:39.605351 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1482:1596(114) ack 1 win 183 <nop,nop,timestamp 8398626 8403166> 09:39:39.605501 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1596 win 883 <no p,nop,timestamp 8403173 8398626> 09:39:39.612487 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1596:1710(114) ack 1 win 183 <nop,nop,timestamp 8398633 8403173> 09:39:39.612621 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1710 win 883 <no p,nop,timestamp 8403180 8398633> 09:39:39.614244 IP 10.15.84.96.32777 > 10.15.84.95.21064: P 576:648(72) ack 1 wi n 183 <nop,nop,timestamp 8403182 8398623> 09:39:39.616396 IP 10.15.84.95.21064 > 10.15.84.96.32777: . ack 648 win 1096 <no p,nop,timestamp 8398638 8403182> 09:39:39.620372 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1710:1824(114) ack 1 win 183 <nop,nop,timestamp 8398641 8403180> 09:39:39.620497 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1824 win 883 <no p,nop,timestamp 8403188 8398641> 09:39:39.628488 IP 10.15.84.95.32777 > 10.15.84.96.21064: P 1824:1938(114) ack 1 win 183 <nop,nop,timestamp 8398651 8403188> 09:39:39.628590 IP 10.15.84.96.21064 > 10.15.84.95.32777: . ack 1938 win 883 <no p,nop,timestamp 8403196 8398651> ... 248 packets captured 252 packets received by filter 0 packets dropped by kernel from 07: 09:39:42.323185 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 760452160:760452274( 114) ack 766374568 win 1064 <nop,nop,timestamp 8401322 7918000> 09:39:42.330915 IP 10.15.84.97.21064 > 10.15.84.95.32778: . 
ack 114 win 1019 <no p,nop,timestamp 8402933 8401322> 09:39:42.330928 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 1:73(72) ack 114 win 1019 <nop,nop,timestamp 8402937 8401322> 09:39:42.325988 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 114:228(114) ack 73 win 1064 <nop,nop,timestamp 8401330 8402937> 09:39:42.365814 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 228 win 1019 <no p,nop,timestamp 8402980 8401330> 09:39:42.366977 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 228:342(114) ack 73 win 1064 <nop,nop,timestamp 8401371 8402980> 09:39:42.366988 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 342 win 1019 <no p,nop,timestamp 8402981 8401371> 09:39:42.368867 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 73:145(72) ack 342 w in 1019 <nop,nop,timestamp 8402983 8401371> 09:39:42.373062 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 342:456(114) ack 145 win 1064 <nop,nop,timestamp 8401377 8402983> 09:39:42.412809 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 456 win 1019 <no p,nop,timestamp 8403027 8401377> 09:39:42.413939 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 456:570(114) ack 145 win 1064 <nop,nop,timestamp 8401418 8403027> 09:39:42.413949 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 570 win 1019 <no p,nop,timestamp 8403028 8401418> 09:39:42.415843 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 145:217(72) ack 570 win 1019 <nop,nop,timestamp 8403030 8401418> 09:39:42.420382 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 570:684(114) ack 217 win 1064 <nop,nop,timestamp 8401424 8403030> 09:39:42.459813 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 684 win 1019 <no p,nop,timestamp 8403074 8401424> 09:39:42.460947 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 684:798(114) ack 217 win 1064 <nop,nop,timestamp 8401464 8403074> 09:39:42.460959 IP 10.15.84.97.21064 > 10.15.84.95.32778: . 
ack 798 win 1019 <no p,nop,timestamp 8403075 8401464> 09:39:42.462809 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 217:289(72) ack 798 win 1019 <nop,nop,timestamp 8403077 8401464> 09:39:42.467878 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 798:912(114) ack 289 win 1064 <nop,nop,timestamp 8401470 8403077> 09:39:42.507801 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 912 win 1019 <no p,nop,timestamp 8403122 8401470> 09:39:42.508926 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 912:1026(114) ack 28 9 win 1064 <nop,nop,timestamp 8401512 8403122> 09:39:42.509013 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1026 win 1019 <n op,nop,timestamp 8403123 8401512> 09:39:42.510837 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 289:361(72) ack 1026 win 1019 <nop,nop,timestamp 8403125 8401512> 09:39:42.515217 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1026:1140(114) ack 3 61 win 1064 <nop,nop,timestamp 8401518 8403125> 09:39:42.554791 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1140 win 1019 <n op,nop,timestamp 8403169 8401518> 09:39:42.555923 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1140:1254(114) ack 3 61 win 1064 <nop,nop,timestamp 8401559 8403169> 09:39:42.556007 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1254 win 1019 <n op,nop,timestamp 8403170 8401559> 09:39:42.557820 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 361:433(72) ack 1254 win 1019 <nop,nop,timestamp 8403172 8401559> 09:39:42.561807 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1254:1368(114) ack 4 33 win 1064 <nop,nop,timestamp 8401565 8403172> 09:39:42.601779 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1368 win 1019 <n op,nop,timestamp 8403216 8401565> 09:39:42.602908 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1368:1482(114) ack 4 33 win 1064 <nop,nop,timestamp 8401606 8403216> 09:39:42.602988 IP 10.15.84.97.21064 > 10.15.84.95.32778: . 
ack 1482 win 1019 <n op,nop,timestamp 8403217 8401606> 09:39:42.604785 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 433:505(72) ack 1482 win 1019 <nop,nop,timestamp 8403219 8401606> 09:39:42.609921 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1482:1596(114) ack 5 05 win 1064 <nop,nop,timestamp 8401613 8403219> 09:39:42.649774 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1596 win 1019 <n op,nop,timestamp 8403264 8401613> 09:39:42.650908 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1596:1710(114) ack 5 05 win 1064 <nop,nop,timestamp 8401654 8403264> 09:39:42.650993 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1710 win 1019 <n op,nop,timestamp 8403265 8401654> 09:39:42.652782 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 505:577(72) ack 1710 win 1019 <nop,nop,timestamp 8403267 8401654> 09:39:42.656888 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1710:1824(114) ack 5 77 win 1064 <nop,nop,timestamp 8401660 8403267> 09:39:42.696773 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1824 win 1019 <n op,nop,timestamp 8403311 8401660> 09:39:42.697891 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1824:1938(114) ack 5 77 win 1064 <nop,nop,timestamp 8401701 8403311> 09:39:42.697975 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1938 win 1019 <n op,nop,timestamp 8403312 8401701> 09:39:42.699807 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 577:649(72) ack 1938 win 1019 <nop,nop,timestamp 8403314 8401701> 09:39:42.705159 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1938:2052(114) ack 6 49 win 1064 <nop,nop,timestamp 8401708 8403314> 09:39:42.744762 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2052 win 1019 <n op,nop,timestamp 8403359 8401708> 09:39:42.745900 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2052:2166(114) ack 6 49 win 1064 <nop,nop,timestamp 8401749 8403359> 09:39:42.745993 IP 10.15.84.97.21064 > 10.15.84.95.32778: . 
ack 2166 win 1019 <nop,nop,timestamp 8403360 8401749>
09:39:42.747775 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 649:721(72) ack 2166 win 1019 <nop,nop,timestamp 8403362 8401749>
09:39:42.752909 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2166:2280(114) ack 721 win 1064 <nop,nop,timestamp 8401756 8403362>
09:39:42.792751 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2280 win 1019 <nop,nop,timestamp 8403407 8401756>
09:39:42.793879 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2280:2394(114) ack 721 win 1064 <nop,nop,timestamp 8401797 8403407>
09:39:42.793965 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2394 win 1019 <nop,nop,timestamp 8403408 8401797>
09:39:42.795766 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 721:793(72) ack 2394 win 1019 <nop,nop,timestamp 8403410 8401797>
09:39:42.800149 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2394:2508(114) ack 793 win 1064 <nop,nop,timestamp 8401803 8403410>
09:39:42.839749 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2508 win 1019 <nop,nop,timestamp 8403454 8401803>
...
200 packets captured
206 packets received by filter
0 packets dropped by kernel

This is consistent between runs. If I add a 40 ms delay between each iteration in gfs_flock_test, then the resdir lookup delays disappear (go back to 3-5 ms instead of about 40 ms), but the tcpdump is still the same:

[root@dct-xen-05 gfs]# time /root/gfs_flock_test test3 40 40
real 0m2.152s
user 0m0.001s
sys 0m0.169s

10:04:27.404170 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 760479520:760479634(114) ack 766383208 win 1064 <nop,nop,timestamp 9886814 9448184>
10:04:27.404431 IP 10.15.84.97.21064 > 10.15.84.95.32778: . 
ack 114 win 1019 <no p,nop,timestamp 9888242 9886814> 10:04:27.404442 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 1:73(72) ack 114 win 1019 <nop,nop,timestamp 9888243 9886814> 10:04:27.407413 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 114:228(114) ack 73 win 1064 <nop,nop,timestamp 9886821 9888243> 10:04:27.447054 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 228 win 1019 <no p,nop,timestamp 9888287 9886821> 10:04:27.455182 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 228:342(114) ack 73 win 1064 <nop,nop,timestamp 9886868 9888287> 10:04:27.455230 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 342 win 1019 <no p,nop,timestamp 9888295 9886868> 10:04:27.457061 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 73:145(72) ack 342 w in 1019 <nop,nop,timestamp 9888297 9886868> 10:04:27.461451 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 342:456(114) ack 145 win 1064 <nop,nop,timestamp 9886875 9888297> 10:04:27.501046 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 456 win 1019 <no p,nop,timestamp 9888341 9886875> 10:04:27.508170 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 456:570(114) ack 145 win 1064 <nop,nop,timestamp 9886922 9888341> 10:04:27.508207 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 570 win 1019 <no p,nop,timestamp 9888348 9886922> 10:04:27.510052 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 145:217(72) ack 570 win 1019 <nop,nop,timestamp 9888350 9886922> 10:04:27.515291 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 570:684(114) ack 217 win 1064 <nop,nop,timestamp 9886929 9888350> 10:04:27.555034 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 684 win 1019 <no p,nop,timestamp 9888395 9886929> 10:04:27.562156 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 684:798(114) ack 217 win 1064 <nop,nop,timestamp 9886975 9888395> 10:04:27.562185 IP 10.15.84.97.21064 > 10.15.84.95.32778: . 
ack 798 win 1019 <no p,nop,timestamp 9888402 9886975> 10:04:27.564061 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 217:289(72) ack 798 win 1019 <nop,nop,timestamp 9888404 9886975> 10:04:27.570161 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 798:912(114) ack 289 win 1064 <nop,nop,timestamp 9886983 9888404> 10:04:27.610026 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 912 win 1019 <no p,nop,timestamp 9888450 9886983> 10:04:27.615157 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 912:1026(114) ack 28 9 win 1064 <nop,nop,timestamp 9887029 9888450> 10:04:27.615296 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1026 win 1019 <n op,nop,timestamp 9888455 9887029> 10:04:27.617040 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 289:361(72) ack 1026 win 1019 <nop,nop,timestamp 9888457 9887029> 10:04:27.621303 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1026:1140(114) ack 3 61 win 1064 <nop,nop,timestamp 9887035 9888457> 10:04:27.661018 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1140 win 1019 <n op,nop,timestamp 9888501 9887035> 10:04:27.666145 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1140:1254(114) ack 3 61 win 1064 <nop,nop,timestamp 9887080 9888501> 10:04:27.666250 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1254 win 1019 <n op,nop,timestamp 9888506 9887080> 10:04:27.667042 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 361:433(72) ack 1254 win 1019 <nop,nop,timestamp 9888507 9887080> 10:04:27.673150 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1254:1368(114) ack 4 33 win 1064 <nop,nop,timestamp 9887086 9888507> 10:04:27.713127 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1368 win 1019 <n op,nop,timestamp 9888553 9887086> 10:04:27.718131 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1368:1482(114) ack 4 33 win 1064 <nop,nop,timestamp 9887131 9888553> 10:04:27.718233 IP 10.15.84.97.21064 > 10.15.84.95.32778: . 
ack 1482 win 1019 <n op,nop,timestamp 9888558 9887131> 10:04:27.720032 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 433:505(72) ack 1482 win 1019 <nop,nop,timestamp 9888560 9887131> 10:04:27.725236 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1482:1596(114) ack 5 05 win 1064 <nop,nop,timestamp 9887139 9888560> 10:04:27.764999 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1596 win 1019 <n op,nop,timestamp 9888605 9887139> 10:04:27.771125 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1596:1710(114) ack 5 05 win 1064 <nop,nop,timestamp 9887185 9888605> 10:04:27.771240 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1710 win 1019 <n op,nop,timestamp 9888611 9887185> 10:04:27.772019 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 505:577(72) ack 1710 win 1019 <nop,nop,timestamp 9888612 9887185> 10:04:27.777124 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1710:1824(114) ack 5 77 win 1064 <nop,nop,timestamp 9887191 9888612> 10:04:27.816993 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1824 win 1019 <n op,nop,timestamp 9888657 9887191> 10:04:27.822121 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1824:1938(114) ack 5 77 win 1064 <nop,nop,timestamp 9887238 9888657> 10:04:27.822254 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 1938 win 1019 <n op,nop,timestamp 9888662 9887238> 10:04:27.824027 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 577:649(72) ack 1938 win 1019 <nop,nop,timestamp 9888664 9887238> 10:04:27.828255 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 1938:2052(114) ack 6 49 win 1064 <nop,nop,timestamp 9887244 9888664> 10:04:27.867991 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2052 win 1019 <n op,nop,timestamp 9888708 9887244> 10:04:27.873113 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2052:2166(114) ack 6 49 win 1064 <nop,nop,timestamp 9887290 9888708> 10:04:27.873219 IP 10.15.84.97.21064 > 10.15.84.95.32778: . 
ack 2166 win 1019 <nop,nop,timestamp 9888713 9887290>
10:04:27.874005 IP 10.15.84.97.21064 > 10.15.84.95.32778: P 649:721(72) ack 2166 win 1019 <nop,nop,timestamp 9888714 9887290>
10:04:27.879114 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2166:2280(114) ack 721 win 1064 <nop,nop,timestamp 9887296 9888714>
10:04:27.918976 IP 10.15.84.97.21064 > 10.15.84.95.32778: . ack 2280 win 1019 <nop,nop,timestamp 9888759 9887296>
10:04:27.925108 IP 10.15.84.95.32778 > 10.15.84.97.21064: P 2280:2394(114) ack 721 win 1064 <nop,nop,timestamp 9887340 9888759>
...
200 packets captured
206 packets received by filter
0 packets dropped by kernel
Current thinking is that this still looks like odd networking-related delays. Patrick, do the tcpdumps give you any new ideas?
Just wondering how much difference Xen makes in all this. I'm using real nodes for my tests, and Xen networking adds all sorts of unknowns to the situation. I'll have a scour of those tcpdumps, though.
Here's an observation based on looking at netstat output. When there are two connections between the two nodes involved in the test, the test runs fast. When there's only one connection between the two nodes, it goes slow. That doesn't seem to tell us what's wrong, but it may give us a way to work around it, assuming we could arrange to always create two connections between nodes.

I now have a four-node cluster of dct-xen-05,06,07,08. Comms between 05 and 06 are fast, and any other pairing is slow (05/07, 05/08, 06/07, 06/08, 07/08). 05 and 06 is the only pair where two connections exist. (IIRC it's normal for the first two nodes to have two connections between them, since they both try to connect to each other at once.)

[root@dct-xen-05]# netstat -n | grep 21064
tcp 0 0 10.15.84.95:21064 10.15.84.98:43883 ESTABLISHED
tcp 0 0 10.15.84.95:21064 10.15.84.97:32781 ESTABLISHED
tcp 0 0 10.15.84.95:21064 10.15.84.96:32775 ESTABLISHED
tcp 0 0 10.15.84.95:32775 10.15.84.96:21064 ESTABLISHED

[root@dct-xen-06]# netstat -n | grep 21064
tcp 0 0 10.15.84.96:32776 10.15.84.97:21064 ESTABLISHED
tcp 0 0 10.15.84.96:21064 10.15.84.95:32775 ESTABLISHED
tcp 0 0 10.15.84.96:32775 10.15.84.95:21064 ESTABLISHED
tcp 0 0 10.15.84.96:21064 10.15.84.98:43884 ESTABLISHED

[root@dct-xen-07]# netstat -n | grep 21064
tcp 0 0 10.15.84.97:21064 10.15.84.96:32776 ESTABLISHED
tcp 0 0 10.15.84.97:32781 10.15.84.95:21064 ESTABLISHED
tcp 0 0 10.15.84.97:21064 10.15.84.98:43885 ESTABLISHED

[root@dct-xen-08]# netstat -n | grep 21064
tcp 0 0 10.15.84.98:43883 10.15.84.95:21064 ESTABLISHED
tcp 0 0 10.15.84.98:43885 10.15.84.97:21064 ESTABLISHED
tcp 0 0 10.15.84.98:43884 10.15.84.96:21064 ESTABLISHED

Same holds on the roth nodes, where only comms between 05 and 06 are fast.
[root@roth-05 gfs]# netstat -n | grep 21064
tcp 0 0 10.15.84.105:21064 10.15.84.108:32779 ESTABLISHED
tcp 0 0 10.15.84.105:21064 10.15.84.106:32782 ESTABLISHED
tcp 0 0 10.15.84.105:32782 10.15.84.106:21064 ESTABLISHED

[root@roth-06 gfs]# netstat -n | grep 21064
tcp 0 0 10.15.84.106:32783 10.15.84.108:21064 ESTABLISHED
tcp 0 0 10.15.84.106:21064 10.15.84.105:32782 ESTABLISHED
tcp 0 0 10.15.84.106:32782 10.15.84.105:21064 ESTABLISHED

[root@roth-08 gfs]# netstat -n | grep 21064
tcp 0 0 10.15.84.108:21064 10.15.84.106:32783 ESTABLISHED
tcp 0 0 10.15.84.108:32779 10.15.84.105:21064 ESTABLISHED
Created attachment 157531 [details] Send a reply from resdir remove commands

That's quite weird, actually. If anything I would expect the single connections to be more efficient than the dual connections, because they can piggyback ACKs on the data packets -- and looking at the tcpdump, that's exactly what is happening. Of course, if the ACKs are 40 ms late, then you rather lose that advantage!

What seems to be the indirect cause of this is the resdir remove command. This does not need a reply, so the ACKs get delayed. Now, for some reason I don't understand, this stalls the whole communication for about 40 ms, until the ACK timer fires. My way around this is to force the resdir remove command to send a dummy reply. This gets everything back on an even footing again, and communication runs at full speed regardless of how many connections you have between the nodes.
Patrick, is this patch ready for testing in the customer's environment? Thanks, Jeremy West
Yes, it's fine for it to go to the customer
Created attachment 157628 [details] ethereal captures Ethereal captures taken while running the gfs_flock_test program. One capture per server included.
So they run the following on three nodes at once:

while true; do time ./gfs_flock_test test_file; sleep 1; done

where gfs_flock_test does 130 iterations of flock/funlock and no reads/writes? What do they see specifically that they don't like? When I run this, I periodically see a time of 3 seconds; mostly it's around half a second. There's nothing that strikes me as out of place in what I'm seeing.
test 1: while true; do time /root/gfs_write_test test_file 100; sleep 0; done
The program does write(); 100 times, run on three nodes in parallel. Each time, the output was about 5.8 sec on each node.

test 2: while true; do time /root/gfs_flock_test test_file 100; sleep 0; done
The program does flock(EX); flock(UN); 100 times, run on three nodes in parallel. Each time, the output varied between 0.5 sec and 4 sec on the nodes, average ~1.7 sec.

test 3: while true; do time /root/gfs_flockwrite_test test_file 100; sleep 0; done
The program does flock(EX); write(); flock(UN); 100 times, run on three nodes in parallel. Each time, the output was about 10 sec.

test 4: same as test 1, but the write was followed by an fsync, and the three processes were all running on the same node. Each time, the output was 1.3 sec.

test 5: same as test 2, but the three processes were all running on the same node. Each time, the output was 1.7 sec.

test 6: same as test 3, but the write was followed by an fsync, and the three processes were all running on the same node. Each time, the output was 2.5 sec.
As usual, this comes down to carefully crafting the way an app uses gfs. If you do things just right, gfs can help, but if you don't, it can be very painful. For example, I took the test that does this:

{ flock(EX); write(testfile); flock(UN); } x100

-> takes about 10 sec when running on three nodes in parallel, and changed it to this:

{ flock(EX); [write(testfile) x100]; flock(UN); }

-> takes about 0.15 sec when running on three nodes in parallel.

If an app can be made to operate less like the former and more like the latter, then contention is minimized instead of maximized. With some apps it's impossible to rework them in such a way that gfs can help; with other apps it is possible.
The two different example tests I showed in comment 53 involve exactly the same locks being used. The only difference between the two examples is the frequency at which the locks change hands between the nodes.

In this case there are two locks involved. The first is the inode lock that gfs uses to protect the metadata. The second is the flock that the app uses to protect the file contents. Each of these locks gets translated into a separate dlm lock, and each file on the file system has separate inode locks and separate flock locks.

In the first example:

node1 gets flock lock
node1 gets write lock
node1 does 1 write
node1 releases write lock
node1 releases flock lock
node2 gets flock lock
node2 gets write lock
node2 does 1 write
node2 releases write lock
node2 releases flock lock
node3 gets flock lock
node3 gets write lock
node3 does 1 write
node3 releases write lock
node3 releases flock lock
... x100

In the second example:

node1 gets flock lock
node1 gets inode lock
node1 does 100 writes
node1 releases inode lock
node1 releases flock lock
node2 gets flock lock
node2 gets inode lock
node2 does 100 writes
node2 releases inode lock
node2 releases flock lock
node3 gets flock lock
node3 gets inode lock
node3 does 100 writes
node3 releases inode lock
node3 releases flock lock

The end -- that's the only locking that takes place, and the locks are identical to the ones used above. So the app can control how much back-and-forth happens by using flocks to synchronize its activity, not just to protect the file content. Of course, the app doesn't have to use flocks to synchronize its activity; apps often use other mechanisms for this too, like their own network messages. This gets into how to design and build a distributed application.
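The two access patterns above can be sketched as follows. A local file stands in for GFS here, so this only mirrors the call pattern, not the cross-node lock hand-offs; function names are made up for illustration:

```python
# Sketch of the two access patterns described above, using a local file in
# place of GFS. Per-write locking maximizes lock hand-offs between nodes;
# batched locking takes the flock once for the whole burst of writes. Only
# the call pattern is mirrored here; the contention needs a real cluster.
import fcntl

def per_write_locking(path, writes=100):
    """First example: lock changes hands on every single write."""
    with open(path, "ab") as f:
        for _ in range(writes):
            fcntl.flock(f, fcntl.LOCK_EX)   # flock + inode lock bounce per write
            f.write(b"x")
            f.flush()
            fcntl.flock(f, fcntl.LOCK_UN)

def batched_locking(path, writes=100):
    """Second example: one lock hand-off covers all 100 writes."""
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)       # single hand-off for the batch
        for _ in range(writes):
            f.write(b"x")
        f.flush()
        fcntl.flock(f, fcntl.LOCK_UN)
```

Both functions produce the same file contents; on GFS, the second pattern is the one that took ~0.15 sec instead of ~10 sec across three nodes.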
In comment 57 I inadvertently changed wording: "gets write lock" is the same as "gets inode lock".
Created attachment 158189 [details] Userland test program

I've reproduced the symptoms of this problem with a simple userland program that does the same sort of traffic as the DLM does: two request/response messages followed by a message that needs no reply. To run:

node1$ strace -ttT ./test
node2$ strace -ttT ./test <ip-addr of node1>

You can see that after the message that does not need a reply, there is always a 40 ms delay before the next HELLO message can be sent. tcpdumps show the same thing as the DLM problem: for some reason the next packet doesn't get sent until the ACK for the previous packet is received.
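The attachment isn't reproduced here, but the traffic shape it generates can be sketched in a single process over loopback: two request/response exchanges, then a fire-and-forget message (like a resdir removal), then another request. Message names are made up, and the ~40 ms delayed-ACK stall will not reproduce over loopback; this only shows the message pattern:

```python
# Loopback sketch of the traffic pattern the attached test program sends:
# two request/response exchanges, one message that gets no reply, then
# another request. On the real two-node setup, the request after the
# no-reply message is where the ~40 ms stall appears. Message names are
# invented; all messages are fixed at 5 bytes to keep framing trivial.
import socket
import threading

def recv_exact(sock, n=5):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf

def server(listener):
    conn, _ = listener.accept()
    with conn:
        for _ in range(2):
            recv_exact(conn)            # request -> reply sent
            conn.sendall(b"REPLY")
        recv_exact(conn)                # removal-style message: no reply
        recv_exact(conn)                # next request -> reply sent
        conn.sendall(b"REPLY")

def run_pattern():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    t = threading.Thread(target=server, args=(listener,))
    t.start()
    got = []
    with socket.create_connection(listener.getsockname()) as c:
        for _ in range(2):
            c.sendall(b"HELLO")
            got.append(recv_exact(c))   # normal request/response round trip
        c.sendall(b"RMOVE")             # fire-and-forget, like a resdir remove
        c.sendall(b"HELLO")             # on the slow node pairs, this stalls
        got.append(recv_exact(c))
    t.join()
    listener.close()
    return got
```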
Hmm, disabling Nagle fixes the userland test program. Now to find out why it doesn't seem to affect the DLM.
Created attachment 158327 [details] Working nagle patch

Ah, it seems that the previous nagle patch didn't actually set the nonagle flag at all, which is why it had no effect! This patch DOES set the nagle flag (and adds the /proc config option too) and works for me in all scenarios I could find. It supersedes the "send reply" patch above. Remember to

# echo 1 > /proc/cluster/config/dlm/tcp_nodelay

immediately after loading the dlm module.
I'd just like to add that this should NOT be recommended as a default setting for any customers. If someone is seeing bad lock latencies with certain lock traffic then it's worth a try. But for a heavily used GFS cluster this would dramatically decrease performance.
I think this problem just went away when they started using gfs normally instead of just running this specific, unrealistic test program.
In reply to comment #71 the customer's problem did *not* go away when they started using GFS normally. For the last year they've been running with the tcp_no_delay patch we provided them in this BZ. However they ran into an issue at a customer site which we showed was resolved in RHEL 4u6, so they upgraded the cluster appliance to RHEL 4u6 to verify this was the case. In doing that, they've now hit this issue again in the same way as before. This customer is asking us to provide them with a patched "dlm-kernel" which has this tcp_no_delay patch in it. Thanks Jeremy West
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
re: comment #76

See comment #72: "In reply to comment #71 the customer's problem did *not* go away..."

---------------------

re: comment #78

We absolutely don't want to maintain a separate build. Then this discussion will be like "Groundhog Day."
Checked in for 4.7: commit 8dc51a64cf9a0bc1b467353a4161220a7cec0de0 Author: Christine Caulfield <ccaulfie> Date: Tue Apr 15 16:34:48 2008 +0100 [DLM] Add tcp_nodelay option to DLM comms
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0796.html