Bug 459738

Summary: GFS2: Multiple writer performance issue.
Product: Red Hat Enterprise Linux 5
Reporter: Dean Jansa <djansa>
Component: kernel
Assignee: Abhijith Das <adas>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: medium
Version: 5.3
CC: bstevens, edamato, nstraz, rpeterso, syeghiay
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 20:05:59 UTC

Attachments: Test patch

Description Dean Jansa 2008-08-21 19:13:08 UTC
Running a performance comparison of GFS2 vs. GFS1 shows that GFS2 has
problems when multiple nodes write to the same file.

The test writes a 123 MB file sequentially, in non-overlapping chunks, from 1,
2 and 3 nodes.  The I/O requests are distributed around the cluster in a
quasi-round-robin fashion; a node making better progress may get more
requests than one which is stuck waiting on I/O, for example.

The times below are aggregate write syscall times across all nodes in the test.

(GFS2: -102 kernel, GFS: Stock 5.2GA)

1 Node 1M write:    GFS2 - 1.4 sec    GFS -  .4 sec
1 Node 4K write:    GFS2 - 2.3 sec    GFS - 2.7 sec

2 Nodes 1M write:   GFS2 -  304 sec   GFS - 3.6 sec
2 Nodes 4K write:   GFS2 - 7578 sec   GFS - 299 sec

3 Nodes 1M write:   GFS2 -  1687 sec  GFS -  12 sec
3 Nodes 4K write:   GFS2 - 19912 sec  GFS - 443 sec
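
For reference, here's a minimal standalone sketch of the per-chunk timing
that the "aggregate write syscall time" numbers come from on each node.
This is illustration only, not the actual d_doio code; the chunk size, file
size and path are just the values used above, and a real run spreads the
offsets across the nodes rather than writing them all from one place:

/* sketch_write_timer.c -- time only the write syscalls for
 * non-overlapping, sequential chunks of a shared file. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK    (1024 * 1024)          /* 1M runs; use 4096 for the 4K runs */
#define FILESIZE 123456789LL

int main(void)
{
    char *buf = malloc(CHUNK);
    if (!buf)
        return 1;
    memset(buf, '*', CHUNK);            /* stand-in for the *PERF* pattern */

    int fd = open("/mnt/marathon0/TESTFILE", O_RDWR | O_CREAT, 0666);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    double total = 0.0;
    for (long long off = 0; off < FILESIZE; off += CHUNK) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pwrite(fd, buf, CHUNK, off) < 0) {   /* the timed syscall */
            perror("pwrite");
            return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        total += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    printf("aggregate write time: %.1f sec\n", total);
    close(fd);
    free(buf);
    return 0;
}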


The herd file used, as well as 'mygen', is included below.  These will have
to be edited to match your cluster.  Also, be sure to get a version of d_doio
with the new write-pattern cache (-C option).


----------- multiwriter.h2 ---------------

<herd name="multiwriters" scheduler="pan2">
        <pan2_opts sequential="true"/>

        <herd name="writer1" scheduler="pan2">
                <pan2_opts numactive="all"/>
                <test name="iogen">
                        <cmd> <![CDATA[
                                ./mygen | ./d_iosend -I 12345 -P > perf.writer1
                        ]]> </cmd>
                </test>

                <test name="d_doio1">
                        <cmd> <![CDATA[
                                qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C"
                        ]]> </cmd>
                </test>
        </herd>

        <herd name="writer2" scheduler="pan2">
                <pan2_opts numactive="all"/>
                <test name="iogen">
                        <cmd> <![CDATA[
                                ./mygen | ./d_iosend -I 12345 -P > perf.writer2
                        ]]> </cmd>
                </test>

                <test name="d_doio1">
                        <cmd> <![CDATA[
                                qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C"
                        ]]> </cmd>
                </test>

                <test name="d_doio2">
                        <cmd> <![CDATA[
                                qarsh root@marathon-02 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C"
                        ]]> </cmd>
                </test>
        </herd>

        <herd name="writer3" scheduler="pan2">
                <pan2_opts numactive="all"/>
                <test name="iogen">
                        <cmd> <![CDATA[
                                ./mygen | ./d_iosend -I 12345 -P > perf.writer3
                        ]]> </cmd>
                </test>

                <test name="d_doio1">
                        <cmd> <![CDATA[
                                qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C"
                        ]]> </cmd>
                </test>

                <test name="d_doio2">
                        <cmd> <![CDATA[
                                qarsh root@marathon-02 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C"
                        ]]> </cmd>
                </test>

                <test name="d_doio3">
                        <cmd> <![CDATA[
                                qarsh root@marathon-03 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C"
                        ]]> </cmd>
                </test>
        </herd>
</herd>



--------------  mygen  -----------------

#!/bin/bash

WORKINGFILE=/mnt/marathon0/TESTFILE
FILESIZE=123456789
CHUNKSIZE=1048576
#CHUNKSIZE=4096

cat << EOXIOR
<xior magic="0xfeed10"><creat><path>$WORKINGFILE</path><mode>666</mode><nbytes>$FILESIZE</nbytes></creat></xior>
EOXIOR

# seq is forced to fixed-point output; the fractional part is stripped
# below to leave an integer byte offset.
for offset in $(seq --format %f 0 $CHUNKSIZE $FILESIZE)
do
        offset=${offset/.*/}    # drop ".000000" -> integer offset

cat << EOXIOR
<xior magic="0xfeed10"><write syscall="write"><path>$WORKINGFILE</path><oflags>O_RDWR</oflags><offset>$offset</offset><count>$CHUNKSIZE</count><pattern>*PERF*</pattern><chksum>0x0</chksum></write></xior>
EOXIOR
done

Comment 1 Dean Jansa 2008-08-22 15:59:15 UTC
FWIW -- 

Running with a single writer and multiple readers doesn't seem to show this wild performance drop when adding readers (using 1M read/write sizes):

1 reader, 1 writer: GFS2  .2 sec (read)  1.2 sec (write)
                    GFS   .3 sec (read)  3.6 sec (write)

2 readers, 1 writer: GFS2  1.8 sec (read)  1.8 sec (write)
                     GFS    .9 sec (read)   .5 sec (write)

3 readers, 1 writer: GFS2  3.0 sec (read)  2.5 sec (write)
                     GFS   3.6 sec (read)  2.0 sec (write)
  


The GFS1 runs all seem to show inconsistent results, as seen in the 2-reader, 1-writer case; that is probably down to the test case and the luck of the draw during the runs.
I hoped the data would be of some use anyway, so I've included it.

Comment 2 Steve Whitehouse 2008-08-22 16:44:12 UTC
I think I can start to explain some of this now; looking at the GFS figures too, it starts to make a bit more sense.

I think what we are seeing is, in part, a result of the different locking in GFS2 vs. GFS. Bearing in mind that GFS locks around complete syscalls while GFS2 locks on a per-page basis, it's not too surprising that there are more opportunities for GFS2 to drop the lock, and hence for performance to degrade. There is obviously more to it than that, but I do wonder if that is part of the problem.
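
As a rough illustration of that difference (a simplified, compilable sketch
only; the pthread mutex stands in for the glock/DLM lock and copy_one_page()
is a placeholder, so this is not the real GFS/GFS2 code path):

#include <pthread.h>
#include <stddef.h>

#define PAGE_SIZE 4096

static pthread_mutex_t glock = PTHREAD_MUTEX_INITIALIZER;

static void copy_one_page(const char *src) { (void)src; /* placeholder */ }

/* GFS-style: the lock is held across the whole write syscall, so a
 * competing node can only take it away between syscalls. */
static void write_per_syscall(const char *buf, size_t len)
{
    pthread_mutex_lock(&glock);
    for (size_t done = 0; done < len; done += PAGE_SIZE)
        copy_one_page(buf + done);
    pthread_mutex_unlock(&glock);
}

/* GFS2-style: the lock is taken and dropped around each page, so a pending
 * demote request has many more chances to bounce the lock to another node
 * in the middle of one syscall's worth of data. */
static void write_per_page(const char *buf, size_t len)
{
    for (size_t done = 0; done < len; done += PAGE_SIZE) {
        pthread_mutex_lock(&glock);
        copy_one_page(buf + done);
        pthread_mutex_unlock(&glock);
    }
}

int main(void)
{
    static char buf[8 * PAGE_SIZE];

    write_per_syscall(buf, sizeof(buf));
    write_per_page(buf, sizeof(buf));
    return 0;
}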

Looking at the two-node results (opening comment), the GFS2 results for 1M are very similar to the GFS results for 4K. The real question is why the 4K results for GFS2 are so much worse. The min-hold-time code should be enforcing the same minimum hold time whatever the I/O size.

We could certainly try some changes to the min-hold-time code to see what difference it makes, if any. We could increase the minimum hold time itself, or, as another idea, move the point at which we set gl_tchange to after the glock has read in any info it needs from disk. It also occurs to me that there may be a race: before we process a reply from the DLM, it's possible that the demote request arrives first (due to scheduling of the threads), and thus gl_tchange may be checked before it has been updated.
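
To make the suspected race concrete, here's a simplified sketch; apart from
the gl_tchange name, the types, the one-second hold time and the function
names are illustrative and not the real GFS2 code:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define MIN_HOLD_SECS 1                   /* illustrative minimum hold time */

struct glock {
    time_t gl_tchange;                    /* last state-change timestamp */
};

/* Checked when a demote request arrives from another node: only give the
 * lock up if we've held it for at least the minimum hold time. */
static bool should_demote_now(const struct glock *gl)
{
    return time(NULL) - gl->gl_tchange >= MIN_HOLD_SECS;
}

/* Runs when the DLM reply granting the lock is processed; the other idea
 * above is to move this update to after the glock has read its data in. */
static void glock_granted(struct glock *gl)
{
    gl->gl_tchange = time(NULL);
}

int main(void)
{
    struct glock gl = { .gl_tchange = 0 };   /* stale value from long ago */

    /* Suspected ordering: the demote request is looked at before
     * glock_granted() has updated gl_tchange, so the stale timestamp
     * makes the check pass and the lock is dropped immediately. */
    if (should_demote_now(&gl))
        printf("demote honoured at once; min hold time not enforced\n");

    glock_granted(&gl);                      /* the update arrives too late */
    return 0;
}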

That's my list of things to check for now, anyway.

With GFS it doesn't surprise me that the performance in this test changes as the I/O size changes. I'd expect to see less of that effect with GFS2, so I'm pretty sure that the min-hold-time code still has something not quite right about it.

Comment 4 Steve Whitehouse 2008-08-27 11:48:16 UTC
Created attachment 315086 [details]
Test patch

This is a test patch to see if I'm right about the race condition. It would also be worth altering the minimum hold time, I think, to see whether that makes a difference above and beyond this patch.

Comment 5 Dean Jansa 2008-08-27 20:22:14 UTC
Results with the test patch build (/kmod-gfs2-1.104-1.1.el5.abhi.4.x86_64.rpm)

1 Node 1M write:    GFS2 - 1.4 sec 
1 Node 4K write:    GFS2 - 2.1 sec 

2 Nodes 1M write:   GFS2 -  5.6 sec
2 Nodes 4K write:   GFS2 -  7.4 sec

3 Nodes 1M write:   GFS2 -  6.8 sec
3 Nodes 4K write:   GFS2 -  143 sec

Comment 8 Don Zickus 2008-09-05 20:06:58 UTC
in kernel-2.6.18-108.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 errata-xmlrpc 2009-01-20 20:05:59 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html