I originally was just verifying that, when multiple processes moved files in and out of a directory, both filesystems got consistent results. But I noticed that GFS performance across multiple nodes became really bad, so I added some timing code to see how bad. I have included the source code and a Python script which handles launching the process that does the actual moving.

The program locks a common file using a blocking lock, moves a single file to another directory, and unlocks the common file. Each process moves files to a different destination directory. This emulates the activity of one of our applications.

I tested first using fcntl against both GFS and NFS. I then added some logic to use the DLM API to lock files, just for the GFS test, to see if it improved performance any. The DLM and fcntl numbers were identical.

Below are the results of the tests. The times greatly increase on GFS when multiple nodes are involved; for NFS they actually decrease.

GFS tests:
---------------------------------------------------------------
2 move processes on the same node:
Creating 10000 files in /data01/dir_test/src
Elapsed time to move files: 29s
Total files found: 10000

2 move processes on 2 different nodes:
Creating 10000 files in /data01/dir_test/src
Elapsed time to move files: 5m 3s
Total files found: 10000

4 move processes on 4 different nodes:
Creating 10000 files in /data01/dir_test/src
Elapsed time to move files: 9m 51s
Total files found: 10000
---------------------------------------------------------------
NFS tests:

2 move processes on the same node:
Creating 10000 files in /redhat/dir_test/src
Elapsed time to move files: 3m 2s
Total files found: 10000

2 move processes on 2 different nodes:
Creating 10000 files in /redhat/dir_test/src
Elapsed time to move files: 1m 55s
Total files found: 10000

4 move processes on 4 different nodes:
Creating 10000 files in /redhat/dir_test/src
Elapsed time to move files: 1m 21s
Total files found: 10000
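For reference, here is a minimal sketch of the per-file loop described above. This is not the attached source; the paths, names, and error handling here are simplified placeholders:

/* Minimal sketch of the move loop: take a blocking fcntl lock on a
 * shared lock file, move one file from src to dest, release the lock,
 * and repeat until src is empty.  Paths are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <dirent.h>

static void set_lock(int fd, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;                  /* F_WRLCK to lock, F_UNLCK to release */
    fl.l_whence = SEEK_SET;            /* l_start = l_len = 0: whole file */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {    /* F_SETLKW blocks until granted */
        perror("fcntl");
        exit(1);
    }
}

int main(void)
{
    const char *src  = "/data01/dir_test/src";
    const char *dest = "/data01/dir_test/dest1";   /* differs per process */
    int lockfd = open("/data01/dir_test/lockfile", O_RDWR | O_CREAT, 0644);
    if (lockfd < 0) { perror("open"); return 1; }

    for (;;) {
        set_lock(lockfd, F_WRLCK);                 /* blocking exclusive lock */

        DIR *d = opendir(src);
        struct dirent *de;
        char from[4096], to[4096];
        int moved = 0;

        while (d && (de = readdir(d)) != NULL) {
            if (de->d_name[0] == '.')
                continue;                          /* skip ".", ".." and dotfiles */
            snprintf(from, sizeof(from), "%s/%s", src, de->d_name);
            snprintf(to, sizeof(to), "%s/%s", dest, de->d_name);
            if (rename(from, to) == 0) {           /* move exactly one file */
                moved = 1;
                break;
            }
        }
        if (d)
            closedir(d);

        set_lock(lockfd, F_UNLCK);                 /* release the lock */

        if (!moved)
            break;                                 /* src directory is empty */
    }
    close(lockfd);
    return 0;
}

The DLM variant of the test simply swaps these fcntl calls for cluster-wide DLM lock calls via the DLM userspace API; as noted above, the two sets of numbers were identical.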
Created attachment 134145: C source code
Created attachment 134146: Python script

Here is the Python script as well.
64-bit machines and OS are installed. We are running 32-bit binaries, however.
I've been doing a lot of timing experiments with this one. I've been testing gfs, but not nfs, so the following comments apply only to gfs.

The timings on my 3-node cluster, x86_64, are:

1 node, ext3: 15s
1 node, gfs:   6s
2 nodes, gfs:  5m 3s
3 nodes, gfs:  10m

I determined that we were spending a lot of time doing the gfs file locking, so I made it work without the locks. Occasionally I'd get a rename error, but the timing for 2 nodes went from 5m 3s to 3m 17s. Of course that had issues. Next I changed it so that a rename collision would cause the node to stop processing, and much to my surprise, the remaining node moved all 10000 files, but it STILL took 5m 3s, even though it was apparently acting alone. However, the same code on the same node took only 7s to move all 10000 files when the other node was not involved at all.

The time it took to actually get into the kernel rename code was 2000 times longer than the kernel rename code itself; almost all of the time was spent in "step 0", i.e. the time between the end of one rename call and the start of the next. This led me to believe the time is being spent in VFS.

Next I added some timing information to VFS's rename function, and what I found was that VFS calls the gfs lookup function to look up the old file before it is renamed, and that step typically takes 500 times longer than the other steps VFS performs, including the actual rename. I'll continue my research. Next I have to collect timing information from the gfs kernel lookup functions.
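For anyone who wants to reproduce the userspace side of this measurement, here is a sketch of the per-call timing described above. The names are illustrative; the decisive numbers in this bug came from separate instrumentation inside the kernel rename/lookup path:

/* Sketch: measure (a) the time spent inside each rename() call and
 * (b) the gap between one rename() returning and the next one starting
 * ("step 0" above).  Link with -lrt on older glibc. */
#include <stdio.h>
#include <time.h>

static long long ns_between(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000000LL + (b->tv_nsec - a->tv_nsec);
}

/* Call this instead of rename() in the move loop. */
int timed_rename(const char *from, const char *to)
{
    static struct timespec last_end;           /* end of the previous call */
    static int have_last;
    struct timespec start, end;
    int rc;

    clock_gettime(CLOCK_MONOTONIC, &start);
    rc = rename(from, to);
    clock_gettime(CLOCK_MONOTONIC, &end);

    fprintf(stderr, "rename: %lld ns in call, %lld ns since previous call\n",
            ns_between(&start, &end),
            have_last ? ns_between(&last_end, &start) : 0);

    last_end = end;
    have_last = 1;
    return rc;
}

int main(void)
{
    /* Tiny self-test: time two renames of a scratch file. */
    FILE *f = fopen("/tmp/ren_a", "w");
    if (!f) { perror("fopen"); return 1; }
    fclose(f);
    timed_rename("/tmp/ren_a", "/tmp/ren_b");
    timed_rename("/tmp/ren_b", "/tmp/ren_a");
    return 0;
}

Note that from userspace the "since previous call" number also includes the application's own locking and readdir work, which is why the in-kernel instrumentation was needed to pin the time on the VFS lookup step.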
One thing that is not clear in this report: was the NFS export on top of GFS or ext3?

Note that a GFS performance hit should be expected here:

1. Bouncing exclusive locks between different nodes requires the node that relinquishes the lock to flush its changes to disk, and the node that freshly obtains the lock to read them back from disk. This doesn't happen when the operations stay within one node. We're talking about disk I/O vs. memory-cache access.

2. When the number of files within one directory grows large, GFS readdir performance is a known issue.

In this particular case, the disk flushes and reads are *all* directory operations, so this test hits three known GFS performance issues:

* the fsync issue
* the directory-read issue
* bouncing exclusive locks between nodes (an issue for *all* cluster filesystems)
Sorry! I was interrupted while writing the above comment - it was rushed.

The easiest and most painless workaround for this customer is to modify the application to use separate directories. This avoids bouncing locks between different nodes, which generates repeated directory reads and writes. We can certainly squeeze some performance out of rename if required, but it would be good to know whether the different-directories workaround is acceptable to the customer. It would also be good to know which filesystem was used in the NFS case (so we know what kind of alternatives the customer has in mind).

On the other hand, the NFS vs. GFS numbers are also expected, since we are comparing network latency (NFS) with disk latency (GFS). When multiple NFS clients do the job, the (single) file server does not need to go to disk to retrieve the directory contents repeatedly; it can serve the updated contents from its memory cache, so the time is spent mostly on network packets ping-ponging between the NFS clients and the server. The last time I measured this (on a 2.4 kernel), disk latency vs. network latency was about 2:1 (the exact ratio depends on the hardware).

So in short, we would like the customer's thoughts and comments on the workaround. Then we can continue.
Can I add the engineers from Intec to this bug so they can respond? I have tried to do it myself and I may not have the necessary privs.
Can you please add these three to this bug?

Rick.Woods
eric.ayers
jesse.marlin
Thanks for your suggestion of a workaround. We aren't really tied to this implementation, but the whole point of this application is to be a load-balancing application that divides up a directory of files between multiple worker tasks on different nodes. We need to split out the work on demand, because we cannot determine a priori how long it will take each worker task to process a file. Having each worker task lock and scan the directory seemed to be the easiest way to do this, but it is just one of many ways we could do it, I suppose. From what you described, we could designate one task running on one node to do the scan and copy to another directory, and that would get around the bouncing-lock issue.

As new users of GFS, we were surprised at the difference in performance between NFS and GFS: with NFS, adding nodes improves performance for this task, but with GFS performance seems to degrade dramatically instead.
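To make that alternative concrete, here is a rough sketch of the "single dispatcher" variant mentioned above. All paths and the worker count are made up for illustration, and the worker directories are assumed to already exist:

/* Sketch of a single-dispatcher layout: one task on one node scans the
 * shared src directory and deals files out to per-worker directories.
 * Workers never touch src, so the directory lock never has to bounce
 * between nodes. */
#include <stdio.h>
#include <dirent.h>

#define NWORKERS 4

int main(void)
{
    const char *src = "/data01/dir_test/src";
    char from[4096], to[4096];
    struct dirent *de;
    DIR *d = opendir(src);
    int next = 0;

    if (!d) { perror("opendir"); return 1; }

    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;                               /* skip dotfiles */
        snprintf(from, sizeof(from), "%s/%s", src, de->d_name);
        snprintf(to, sizeof(to), "/data01/dir_test/worker%d/%s",
                 next, de->d_name);                 /* round-robin deal */
        if (rename(from, to) != 0)
            perror("rename");
        next = (next + 1) % NWORKERS;
    }
    closedir(d);
    return 0;
}

A real version would rescan or keep dealing as new files arrive; the point is just that only one node ever operates on the shared directory.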
(In reply to comment #6)
> One thing that is not clear in this report: was the NFS export on top of GFS
> or ext3?

NFS in this case is on top of ext3. Both the GFS and the NFS filesystems were on the same SAN.
Bugzilla didn't recognize "rick.woods" - I will work with our bugzilla maintainer on this when he is back in the office.

For comment #12 - note that a name lookup is a very expensive operation in Linux. To overcome this significant performance hit, the Linux kernel caches lookup results (implemented as "dentry" and "dcache" objects in the kernel VFS layer). When you bounce lookup operations between GFS nodes, you can no longer take advantage of the dcache performance gain. In the NFS-over-ext3 case, the single (ext3) server can read the lookup result from its dcache, instead of constantly invalidating the cache on one node and then issuing disk I/O on another node to get the updated data (as in the multi-node GFS case). So we are actually comparing network latency (NFS case) vs. disk I/O latency (GFS case) using the test program attached in comment #1.

Will update this bugzilla with an action plan first thing tomorrow morning.
Just did a few quick-and-dirty test runs based on stock RHEL 4.4 GFS on my cluster:

* 2-node gfs vs. 1-node gfs: 3m 5s vs. 17s (10000 files moved total)
* 2-node gfs with 2 directories: max 27s (20000 files moved total)

This should give you an idea of how much the different-directories approach can help.
The 2-nodes-gfs-with-2-directories run is set up so that each node does its 10000 file moves in its own directories (both src and dest).
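For clarity, the per-node layout in that run looks roughly like the sketch below (path names are made up; the point is only that each node's loop touches directories no other node touches):

/* Sketch of the different-directories workaround: derive this node's
 * src/dest pair from its hostname so that no other node ever locks or
 * modifies the same directories.  The path layout is illustrative. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char host[256], src[512], dest[512];

    if (gethostname(host, sizeof(host)) != 0) {
        perror("gethostname");
        return 1;
    }
    snprintf(src,  sizeof(src),  "/data01/dir_test/%s/src",  host);
    snprintf(dest, sizeof(dest), "/data01/dir_test/%s/dest", host);

    printf("this node moves files from %s to %s\n", src, dest);
    /* ... run the same lock/readdir/rename loop as before, but against
     * these per-node directories only ... */
    return 0;
}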
This is actually a cluster-filesystem rename performance issue that would be hard to fix under the current GFS1 lock-ordering rules. With current development resources we are not able to make changes for this (to avoid disturbing the stability of GFS1 on RHEL4). Adding it to the RHEL 5.2 TO-DO list.