Bug 202449

Summary: poor performance compared with NFS
Product: [Retired] Red Hat Cluster Suite
Reporter: Michael Waite <mwaite>
Component: gfs
Assignee: Wendy Cheng <nobody+wcheng>
Status: CLOSED WONTFIX
QA Contact: GFS Bugs <gfs-bugs>
Severity: high
Priority: high
Version: 4
CC: eric.ayers, jesse.marlin, nobody+wcheng, rkenna, teigland
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-09-18 15:58:04 UTC
Attachments:
   C source code
   Python script

Description Michael Waite 2006-08-14 15:13:44 UTC
I originally was just verifying that, when multiple processes moved files in 
and out of a directory, both filesystems got consistent results.  But I 
noticed that GFS performance across multiple nodes became really bad, so I 
added some timing code to see how bad.

I have included the C source code and a Python script which handles 
launching the processes that do the actual moving.  The program locks a 
common file using a blocking lock, moves a single file to another 
directory, and unlocks the common file.  Another process would then move a 
file to a different directory.  This emulates the activity of one of our 
applications.
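
For reference, here is a minimal sketch of that move loop (the attached C
source is authoritative; the lock-file path, file names, and error handling
below are simplified assumptions):

/*
 * Sketch of the mover's inner step: take a blocking fcntl lock on a
 * common lock file, rename one file into the destination directory,
 * then release the lock.  Paths are hypothetical examples.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int lock_file(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type   = type;      /* F_WRLCK to acquire, F_UNLCK to release */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         /* lock the whole file */
    return fcntl(fd, F_SETLKW, &fl);   /* F_SETLKW blocks until granted */
}

int main(void)
{
    /* common lock file shared by all mover processes (example path) */
    int lockfd = open("/data01/dir_test/lockfile", O_RDWR | O_CREAT, 0644);
    if (lockfd < 0) { perror("open"); return 1; }

    if (lock_file(lockfd, F_WRLCK) < 0) { perror("lock"); return 1; }

    /* move one file from src to dest while holding the lock */
    if (rename("/data01/dir_test/src/file00001",
               "/data01/dir_test/dest/file00001") < 0)
        perror("rename");

    lock_file(lockfd, F_UNLCK);   /* release the common lock */
    close(lockfd);
    return 0;
}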

I tested first using fcntl against both GFS and NFS.  I then added some 
logic to use the DLM API to lock files, just for the GFS test, to see if 
it improved performance at all.  The DLM and fcntl numbers were 
identical.  Below are the results of the tests.  The times increase greatly 
on GFS when multiple nodes are involved; for NFS they actually decrease.

GFS tests:

---------------------------------------------------------------

   2 move processes on the same node:

   Creating 10000 files in /data01/dir_test/src
   Elapsed time to move files:  29s
   Total files found:  10000

   2 move processes on 2 different nodes:

   Creating 10000 files in /data01/dir_test/src
   Elapsed time to move files:  5m 3s
   Total files found:  10000

   4 move processes on 4 different nodes:

   Creating 10000 files in /data01/dir_test/src
   Elapsed time to move files:  9m 51s
   Total files found:  10000

---------------------------------------------------------------

NFS tests:

   2 move processes on the same node:

   Creating 10000 files in /redhat/dir_test/src
   Elapsed time to move files:  3m 2s
   Total files found:  10000

   2 move processes on 2 different nodes:

   Creating 10000 files in /redhat/dir_test/src
   Elapsed time to move files:  1m 55s
   Total files found:  10000

   4 move processes on 4 different nodes:

   Creating 10000 files in /redhat/dir_test/src
   Elapsed time to move files:  1m 21s
   Total files found:  10000

Comment 1 Michael Waite 2006-08-14 15:13:44 UTC
Created attachment 134145 [details]
C source code

Comment 2 Michael Waite 2006-08-14 15:16:33 UTC
Created attachment 134146 [details]
Python script

here is the python script as well

Comment 3 Michael Waite 2006-08-14 16:27:08 UTC
64 bit machines and OS installed.

We are running 32 bit binaries, however.

Comment 4 Robert Peterson 2006-08-22 22:19:53 UTC
I've been doing a lot of timing experiments with this one.  I've been
testing gfs, but not nfs, so the following comments apply only to gfs:

The timings on my 3-node cluster, x86_64, are:

1 node ext3: 15 seconds
1 node gfs:  6 seconds
2 nodes gfs: 5min 3s
3 nodes GFS: 10 Min!

I determined that we were spending a lot of time doing the
gfs file locking, so I made it work without the locks.
Occasionally I'd get a rename error, but the timing for 2 nodes
went from 5min 3sec to 3min 17sec.  Of course that had issues.

Next I changed it so that a rename collision would cause the node
to stop processing, and much to my surprise, the remaining
node moved all 10000 files, but it STILL took 5min 3sec, even though
it was apparently acting alone.

However, the same code, same node only took 7sec to move all 
10000 files when the other node was not involved at all.

The time it took to actually get into the rename kernel code was
2000 times longer than the kernel rename code itself.  

The kernel code spent almost all of its time in step 0, which
was time spent between the last rename call and the next one.

This led me to believe the time is spent in vfs.

Next I added some timing information to vfs's rename function and 
what I found was that vfs was calling the gfs lookup function to
look up the old file before it is renamed, and that step is
typically taking 500 times longer than other steps that vfs takes,
including the actual rename.

I'll continue my research.  Next I have to collect timing information
from the gfs kernel lookup functions.
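
As a rough userspace counterpart to the kernel timing above, here is a
hedged sketch (paths are hypothetical, not the in-kernel instrumentation
described in this comment) that measures the wall-clock latency of each
rename() as the application sees it:

/* Hypothetical userspace timing of rename() latency. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    char src[64], dst[64];
    long worst_ns = 0;
    int i;

    for (i = 0; i < 10000; i++) {
        snprintf(src, sizeof(src), "/data01/dir_test/src/file%05d", i);
        snprintf(dst, sizeof(dst), "/data01/dir_test/dest/file%05d", i);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (rename(src, dst) == 0) {
            clock_gettime(CLOCK_MONOTONIC, &t1);
            long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
                      (t1.tv_nsec - t0.tv_nsec);
            if (ns > worst_ns)
                worst_ns = ns;       /* track the slowest single rename */
        }
    }
    printf("worst rename latency: %ld ns\n", worst_ns);
    return 0;
}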


Comment 6 Wendy Cheng 2006-09-01 00:26:35 UTC
One thing not clear in this report: was the NFS export running on top of GFS
or ext3?

Note that a GFS performance hit should be expected due to:

1. Bouncing exclusive locks between different nodes requires the node that
   relinquishes the lock to flush its changes to disk, and the node that 
   freshly obtains the lock to read them back from disk. This won't happen if 
   these operations are done within the same node. We're talking about disk 
   IO vs. memory cache access.
2. When the number of files within one directory grows large, GFS readdir
   performance is a known issue. In this particular case, the disk flushes and
   reads are *all* directory operations.

So this issue hits three known GFS performance issues:
* fsync issue
* directory read issue
* bouncing exclusive locks between nodes (this is an issue for *all* cluster
  filesystems)
 

Comment 7 Wendy Cheng 2006-09-01 03:44:35 UTC
Sorry! I was interrupted while writing the above comment - it was too rushed.

The easiest and most painless workaround for this customer is to modify the
application to use separate directories. This avoids bouncing locks
between different nodes, which generates repeated directory reads and
writes. We can certainly squeeze some performance out of rename if required,
but it would be good to know whether the different-directories workaround is
acceptable to the customer.
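
As an illustration of that workaround (purely a sketch; the directory naming
scheme and paths are assumptions, not part of the attached test program),
each node could derive its own src/dest pair from its hostname so the
directory locks never move between nodes:

/* Hypothetical per-node directory setup for the suggested workaround. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char host[256];
    char src[512], dest[512];

    if (gethostname(host, sizeof(host)) < 0) {
        perror("gethostname");
        return 1;
    }

    /* e.g. /data01/dir_test/src.node1 -> /data01/dir_test/dest.node1 */
    snprintf(src,  sizeof(src),  "/data01/dir_test/src.%s",  host);
    snprintf(dest, sizeof(dest), "/data01/dir_test/dest.%s", host);

    mkdir(src, 0755);    /* ignore EEXIST for brevity */
    mkdir(dest, 0755);

    printf("this node moves files from %s to %s\n", src, dest);
    return 0;
}

The mover on each node would then only ever rename files within its own
pair of directories.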

It would also be good to know which filesystem was used in the NFS case (so
we'll know what kind of alternatives the customer has in mind).

On the other hand, the NFS vs. GFS numbers are also expected, since we
are comparing network latency (NFS) with disk latency (GFS). When
using multiple NFS clients to do the job, the (single) file server does
not need to go to disk to retrieve the directory contents repeatedly; 
it can obtain the updated contents from its memory cache. So the time 
spent is mostly on network packets ping-ponging between the NFS clients 
and the server. The last time I did this measurement (on a 2.4 kernel), disk 
latency vs. network latency was about 2:1 (though the exact ratio depends 
on the hardware).

So in short, we would like to know the customer's thoughts and comments
on the workaround. Then we can continue...


Comment 8 Michael Waite 2006-09-05 15:56:10 UTC
Can I add the engineers from Intec to this bug so they can respond?
I have tried to do it myself and I may not have the necessary privs.

Comment 10 Michael Waite 2006-09-05 16:06:10 UTC
Can you please add these three people to this bug?

Rick.Woods
eric.ayers
jesse.marlin

Comment 11 Eric Z. Ayers 2006-09-05 16:28:37 UTC
Thanks for your suggestion of a workaround.  We aren't really tied to this
implementation, but the whole point of this application is to be a load
balancing application to divide up a directory of files between multiple worker
tasks on different nodes.  We need to split out the work to do on demand because
we cannot determine a priori how long it will take each worker task to process a
file.  Having each worker task lock and scan the directory seemed to be the
easiest way to do this, but it is just one of many ways we could do it, I
suppose.  From what you described, we could designate one task running on one
node to do the scan and copy to another directory and that would get around the
bouncing lock issue.

As new users of GFS, we were surprised at the difference in performance
between NFS and GFS: with NFS, adding nodes improves performance for this
task, but with GFS, performance seems to degrade dramatically instead of
improving.

Comment 12 Jesse Marlin 2006-09-05 17:03:54 UTC
(In reply to comment #6)
> One thing not clear in this report is that "was the NFS run on top of GFS or
> EXT3" ? 

NFS in this case is on top of ext3.  Both the GFS and NFS were going to the 
same SAN.

Comment 14 Wendy Cheng 2006-09-05 20:02:30 UTC
Bugzilla didn't recognize "rick.woods" - I will work with our
bugzilla maintainer on this when he is back in the office.

For comment #12 - note that "name lookup" is a very expensive operation in 
Linux. To overcome this significant performance hit, the Linux kernel caches 
the lookup results (implemented as "dentry" and "dcache" objects in the kernel 
VFS layer). When you bounce the lookup operations between GFS nodes, you can 
no longer take advantage of the dcache performance gain. In the NFS-over-ext3 
case, by contrast, the single (ext3) server can read the lookup result from 
the dcache instead of constantly invalidating the cache on one node and then
activating disk IO on another node to get the updated data (as in the
multi-node GFS case). So we are actually comparing network latency (NFS case)
vs. disk IO latency (GFS case) using the test program attached in comment #1.

Will update this bugzilla with an action plan first thing tomorrow morning.


Comment 16 Wendy Cheng 2006-09-06 15:39:41 UTC
Just did a few quick-and-dirty test runs based on stock RHEL 4.4 GFS on my cluster: 

* 2-nodes-gfs vs. 1-node-gfs = 3m 5s vs. 17s (moved 10000 files total)
* 2-nodes-gfs-with-2-directories = max 27s (moved 20000 files total)

This should give you an idea of how much the different-directories approach can help.

Comment 17 Wendy Cheng 2006-09-06 15:42:21 UTC
The 2-nodes-gfs-with-2-directories run is set up so that each node does its
10000-file moves in its own directories (both src and dest). 

Comment 21 Wendy Cheng 2007-09-18 15:58:04 UTC
This is actually a cluster filesystem rename performance issue that would be 
hard to fix under the current GFS1 lock ordering rules. With current development 
resources, we are not able to make changes here (to avoid disturbing the
stability of GFS1 on RHEL4). Adding it to the RHEL 5.2 TO-DO list.