Running a performance measurement of GFS2 vs. GFS1 shows GFS2 has problems with multiple nodes writing to the same file. The test writes a 123 MB file sequentially, non-overlapping, from 1, 2 and 3 nodes. The IO requests are distributed around the cluster in a quasi-round-robin fashion; a node making better progress may get more requests than one which is stuck waiting to do IO, for example. The times below are aggregate write syscall times across all nodes in the test. (GFS2: -102 kernel, GFS: Stock 5.2GA)

1 Node  1M write:  GFS2 -    1.4 sec   GFS -   .4 sec
1 Node  4K write:  GFS2 -    2.3 sec   GFS -  2.7 sec
2 Nodes 1M write:  GFS2 -    304 sec   GFS -  3.6 sec
2 Nodes 4K write:  GFS2 -   7578 sec   GFS -  299 sec
3 Nodes 1M write:  GFS2 -   1687 sec   GFS -   12 sec
3 Nodes 4K write:  GFS2 -  19912 sec   GFS -  443 sec

The herd file used, as well as 'mygen', are below. These will have to be edited to match your cluster. Also, be sure to get a version of d_doio with the new write pattern cache (-C option).

----------- multiwriter.h2 ---------------

<herd name="multiwriters" scheduler="pan2">
  <pan2_opts sequential="true"/>

  <herd name="writer1" scheduler="pan2">
    <pan2_opts numactive="all"/>
    <test name="iogen">
      <cmd><![CDATA[ ./mygen | ./d_iosend -I 12345 -P > perf.writer1 ]]></cmd>
    </test>
    <test name="d_doio1">
      <cmd><![CDATA[ qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
  </herd>

  <herd name="writer2" scheduler="pan2">
    <pan2_opts numactive="all"/>
    <test name="iogen">
      <cmd><![CDATA[ ./mygen | ./d_iosend -I 12345 -P > perf.writer2 ]]></cmd>
    </test>
    <test name="d_doio1">
      <cmd><![CDATA[ qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
    <test name="d_doio2">
      <cmd><![CDATA[ qarsh root@marathon-02 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
  </herd>

  <herd name="writer3" scheduler="pan2">
    <pan2_opts numactive="all"/>
    <test name="iogen">
      <cmd><![CDATA[ ./mygen | ./d_iosend -I 12345 -P > perf.writer3 ]]></cmd>
    </test>
    <test name="d_doio1">
      <cmd><![CDATA[ qarsh root@marathon-01 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
    <test name="d_doio2">
      <cmd><![CDATA[ qarsh root@marathon-02 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
    <test name="d_doio3">
      <cmd><![CDATA[ qarsh root@marathon-03 "/tmp/d_doio -I 12345 -P xenon.msp.redhat.com -C" ]]></cmd>
    </test>
  </herd>
</herd>

-------------- mygen -----------------

#!/bin/bash

WORKINGFILE=/mnt/marathon0/TESTFILE
FILESIZE=123456789
CHUNKSIZE=1048576
#CHUNKSIZE=4096

cat << EOXIOR
<xior magic="0xfeed10"><creat><path>$WORKINGFILE</path><mode>666</mode><nbytes>$FILESIZE</nbytes></creat></xior>
EOXIOR

for offset in $(seq --format %f 0 $CHUNKSIZE $FILESIZE)
do
	offset=${offset/.*/}
	cat << EOXIOR
<xior magic="0xfeed10"><write syscall="write"><path>/$WORKINGFILE</path><oflags>O_RDWR</oflags><offset>$offset</offset><count>$CHUNKSIZE</count><pattern>*PERF*</pattern><chksum>0x0</chksum></write></xior>
EOXIOR
done
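(For context on the 1M vs. 4K gap, it may help to count how many xior write requests mygen emits in each case. The snippet below is my own back-of-the-envelope addition, not part of the test harness; it just mirrors the seq loop above.)

#!/bin/bash
# Count the write requests mygen generates per chunk size: one request
# per seq step from offset 0 up to FILESIZE, i.e. FILESIZE/CHUNKSIZE + 1.
FILESIZE=123456789
for CHUNKSIZE in 1048576 4096
do
	echo "${CHUNKSIZE}-byte writes: $(( FILESIZE / CHUNKSIZE + 1 )) requests"
done
# Output: 118 requests for 1M writes vs. 30141 for 4K writes, i.e. the
# 4K runs push roughly 255 times as many individual writes through the
# cluster, so any per-write locking overhead is multiplied accordingly.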
FWIW -- running with a single writer and multiple readers doesn't seem to show this wild performance drop when adding readers (using 1M read/write sizes):

1 reader,  1 writer:  GFS2 -  .2 sec (read)  1.2 sec (write)    GFS -  .3 sec (read)  3.6 sec (write)
2 readers, 1 writer:  GFS2 - 1.8 sec (read)  1.8 sec (write)    GFS -  .9 sec (read)   .5 sec (write)
3 readers, 1 writer:  GFS2 - 3.0 sec (read)  2.5 sec (write)    GFS - 3.6 sec (read)  2.0 sec (write)

The GFS1 runs all seem to show inconsistent results, as in the 2 readers, 1 writer case -- probably an artifact of the test case and the luck of the draw during the runs. I hoped the data would be of some use anyway, so I've included it.
I think I can start to explain some of this now. Looking at the GFS figures too, it starts to make a bit more sense. I think what we are seeing is, in part, a result of the different locking in GFS2 vs. GFS. Bearing in mind that GFS locks complete syscalls while GFS2 locks on a per-page basis, it's not too surprising that there are more opportunities for GFS2 to drop the lock, and hence for performance to degrade. There is obviously more to it than that, but I do wonder if that is not part of the problem.

Looking at the two-node results (opening comment), the GFS2 results for 1M are very similar to the GFS results for 4k. The real question is why the 4k results for GFS2 are so much worse. The min-hold-time code should be enforcing the same minimum hold time whatever the I/O size. We could certainly try some changes to the min-hold-time code to see what difference it makes, if any. We could increase the min hold time itself, or, another idea, change the point at which we set gl_tchange to after the glock has read in any info it needs from disk.

It also occurs to me that there may be a race: before we process a reply from the DLM, it's possible that the demote request arrives first (due to scheduling of the threads), so gl_tchange may be checked before it has been updated. That's my list of things to check for now, anyway.

In GFS it doesn't surprise me that the performance in this test changes as the I/O size changes. I'd expect to see less of that effect with GFS2, so I'm pretty sure that the min-hold-time code still has something not quite right about it.
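(To make the min-hold-time idea concrete, here is a toy sketch -- my own illustration, not the kernel code; MIN_HOLD, gl_tchange and the demote handling are just stand-ins for the real glock fields and logic. The idea is that a demote request arriving within the minimum hold time of the last state change should be deferred, and the suspected race is that the check may run against a gl_tchange value the DLM reply hasn't stamped yet.)

#!/bin/bash
# Toy model of the min-hold-time decision (illustrative only).
MIN_HOLD=2                 # assumed minimum hold time, in seconds

gl_tchange=$(date +%s)     # stamped when the glock last changed state

handle_demote_request()
{
	local now=$(date +%s)
	if (( now - gl_tchange < MIN_HOLD )); then
		echo "demote deferred (held $(( now - gl_tchange ))s, want ${MIN_HOLD}s)"
	else
		echo "demote honoured"
	fi
}

handle_demote_request      # right after the state change: should be deferred
sleep $MIN_HOLD
handle_demote_request      # once the hold time has elapsed: honoured

# The suspected race: if the demote request is processed before the DLM
# reply that would update gl_tchange, the check runs on a stale timestamp
# and the lock can be dropped sooner than the minimum hold time intended.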
Created attachment 315086 [details]
Test patch

This is a test patch to see if I'm right about the race condition. I think it would also be worth altering the min hold time, to see if that makes a difference above and beyond this patch.
Results with the test patch build (/kmod-gfs2-1.104-1.1.el5.abhi.4.x86_64.rpm):

1 Node  1M write:  GFS2 -  1.4 sec
1 Node  4K write:  GFS2 -  2.1 sec
2 Nodes 1M write:  GFS2 -  5.6 sec
2 Nodes 4K write:  GFS2 -  7.4 sec
3 Nodes 1M write:  GFS2 -  6.8 sec
3 Nodes 4K write:  GFS2 -  143 sec
This is in kernel-2.6.18-108.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html