Running a performance measurement of GFS2 vs. GFS1 shows GFS2 has problems with multiple nodes writing to the same file. The test writes a 123 MB file sequentially, non-overlapping, from 1, 2 and 3 nodes. The IO requests are distributed around the cluster in a quasi-round-robin fashion; a node making better progress may get more requests than one which is stuck waiting to do IO, for example. The times below are aggregate write syscall times across all nodes in the test. (GFS2: -102 kernel, GFS: Stock 5.2GA)

1 Node  1M write:  GFS2 -    1.4 sec   GFS -   .4 sec
1 Node  4K write:  GFS2 -    2.3 sec   GFS -  2.7 sec
2 Nodes 1M write:  GFS2 -    304 sec   GFS -  3.6 sec
2 Nodes 4K write:  GFS2 -   7578 sec   GFS -  299 sec
3 Nodes 1M write:  GFS2 -   1687 sec   GFS -   12 sec
3 Nodes 4K write:  GFS2 -  19912 sec   GFS -  443 sec

The herd file used, as well as 'mygen', are below. These will have to be edited to match your cluster. Also, be sure to get a version of d_doio with the new write pattern cache (-C option).

----------- multiwriter.h2 ---------------

<herd name="multiwriters" scheduler="pan2">
  <pan2_opts sequential="true"/>

  <herd name="writer1" scheduler="pan2">
    <pan2_opts numactive="all"/>
    <test name="iogen">
      <cmd><![CDATA[ ./mygen | ./d_iosend -I 12345 -P > perf.writer1 ]]></cmd>
    </test>
    <test name="d_doio1">
      <cmd><![CDATA[ qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
  </herd>

  <herd name="writer2" scheduler="pan2">
    <pan2_opts numactive="all"/>
    <test name="iogen">
      <cmd><![CDATA[ ./mygen | ./d_iosend -I 12345 -P > perf.writer2 ]]></cmd>
    </test>
    <test name="d_doio1">
      <cmd><![CDATA[ qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
    <test name="d_doio2">
      <cmd><![CDATA[ qarsh root@marathon-02 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
  </herd>

  <herd name="writer3" scheduler="pan2">
    <pan2_opts numactive="all"/>
    <test name="iogen">
      <cmd><![CDATA[ ./mygen | ./d_iosend -I 12345 -P > perf.writer3 ]]></cmd>
    </test>
    <test name="d_doio1">
      <cmd><![CDATA[ qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
    <test name="d_doio2">
      <cmd><![CDATA[ qarsh root@marathon-02 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
    <test name="d_doio3">
      <cmd><![CDATA[ qarsh root@marathon-03 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
  </herd>
</herd>

-------------- mygen -----------------

#!/bin/bash

WORKINGFILE=/mnt/marathon0/TESTFILE
FILESIZE=123456789
CHUNKSIZE=1048576
#CHUNKSIZE=4096

cat << EOXIOR
<xior magic="0xfeed10"><creat><path>$WORKINGFILE</path><mode>666</mode><nbytes>$FILESIZE</nbytes></creat></xior>
EOXIOR

for offset in $(seq --format %f 0 $CHUNKSIZE $FILESIZE)
do
	offset=${offset/.*/}
	cat << EOXIOR
<xior magic="0xfeed10"><write syscall="write"><path>/$WORKINGFILE</path><oflags>O_RDWR</oflags><offset>$offset</offset><count>$CHUNKSIZE</count><pattern>*PERF*</pattern><chksum>0x0</chksum></write></xior>
EOXIOR
done
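(For context on the 1M vs. 4K gap, it may help to count how many xior write requests mygen emits in each case. The snippet below is my own back-of-the-envelope addition, not part of the test harness; it just mirrors the seq loop above.)

#!/bin/bash
# Count the write requests mygen generates per chunk size: one request
# per seq step from offset 0 up to FILESIZE, i.e. FILESIZE/CHUNKSIZE + 1.
FILESIZE=123456789
for CHUNKSIZE in 1048576 4096
do
	echo "${CHUNKSIZE}-byte writes: $(( FILESIZE / CHUNKSIZE + 1 )) requests"
done
# Output: 118 requests for 1M writes vs. 30141 for 4K writes, i.e. the
# 4K runs push roughly 255 times as many individual writes through the
# cluster, so any per-write locking overhead is multiplied accordingly.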
FWIW -- running with a single writer and multiple readers doesn't seem to show this wild performance drop when adding readers (using 1M read/write sizes):

1 reader,  1 writer:  GFS2 -  .2 sec (read)  1.2 sec (write)    GFS -  .3 sec (read)  3.6 sec (write)
2 readers, 1 writer:  GFS2 - 1.8 sec (read)  1.8 sec (write)    GFS -  .9 sec (read)   .5 sec (write)
3 readers, 1 writer:  GFS2 - 3.0 sec (read)  2.5 sec (write)    GFS - 3.6 sec (read)  2.0 sec (write)

The GFS1 runs all seem to show inconsistent results, as in the 2 readers, 1 writer case -- probably an artifact of the test case and the luck of the draw during the runs. I hoped the data would be of some use anyway, so I've included it.
I think I can start to explain some of this now. Looking at the GFS figures too, it starts to make a bit more sense. I think what we are seeing is, in part, a result of the different locking in GFS2 vs. GFS. Bearing in mind that GFS locks complete syscalls while GFS2 locks on a per-page basis, it's not too surprising that there are more opportunities for GFS2 to drop the lock, and hence for performance to degrade. There is obviously more to it than that, but I do wonder if that is not part of the problem.

Looking at the two-node results (opening comment), the GFS2 results for 1M are very similar to the GFS results for 4k. The real question is why the 4k results for GFS2 are so much worse. The min-hold-time code should be enforcing the same minimum hold time whatever the I/O size. We could certainly try some changes to the min-hold-time code to see what difference it makes, if any. We could increase the min hold time itself, or, another idea, change the point at which we set gl_tchange to after the glock has read in any info it needs from disk.

It also occurs to me that there may be a race: before we process a reply from the DLM, it's possible that the demote request arrives first (due to scheduling of the threads), so gl_tchange may be checked before it has been updated. That's my list of things to check for now, anyway.

In GFS it doesn't surprise me that the performance in this test changes as the I/O size changes. I'd expect to see less of that effect with GFS2, so I'm pretty sure that the min-hold-time code still has something not quite right about it.
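(To make the min-hold-time idea concrete, here is a toy sketch -- my own illustration, not the kernel code; MIN_HOLD, gl_tchange and the demote handling are just stand-ins for the real glock fields and logic. The idea is that a demote request arriving within the minimum hold time of the last state change should be deferred, and the suspected race is that the check may run against a gl_tchange value the DLM reply hasn't stamped yet.)

#!/bin/bash
# Toy model of the min-hold-time decision (illustrative only).
MIN_HOLD=2                 # assumed minimum hold time, in seconds

gl_tchange=$(date +%s)     # stamped when the glock last changed state

handle_demote_request()
{
	local now=$(date +%s)
	if (( now - gl_tchange < MIN_HOLD )); then
		echo "demote deferred (held $(( now - gl_tchange ))s, want ${MIN_HOLD}s)"
	else
		echo "demote honoured"
	fi
}

handle_demote_request      # right after the state change: should be deferred
sleep $MIN_HOLD
handle_demote_request      # once the hold time has elapsed: honoured

# The suspected race: if the demote request is processed before the DLM
# reply that would update gl_tchange, the check runs on a stale timestamp
# and the lock can be dropped sooner than the minimum hold time intended.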
Created attachment 315086 [details]
Test patch

This is a test patch to see if I'm right about the race condition. I think it would also be worth altering the min hold time, to see if that makes a difference above and beyond this patch.
Results with the test patch build (/kmod-gfs2-1.104-1.1.el5.abhi.4.x86_64.rpm):

1 Node  1M write:  GFS2 -  1.4 sec
1 Node  4K write:  GFS2 -  2.1 sec
2 Nodes 1M write:  GFS2 -  5.6 sec
2 Nodes 4K write:  GFS2 -  7.4 sec
3 Nodes 1M write:  GFS2 -  6.8 sec
3 Nodes 4K write:  GFS2 -  143 sec
This is in kernel-2.6.18-108.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html