Created attachment 475319 [details]
Program to demonstrate the problem.

Description of problem:
When an application uses mmap to map in a file in a gfs2 filesystem in a read-only mode, it acquires an exclusive glock, even with noatime set on the filesystem. This has a significant impact on the performance of subsequent invocations of the application if the same file is accessed on multiple nodes.

Version-Release number of selected component (if applicable):
2.6.18-238.el5

How reproducible:
Always

Steps to Reproduce:
1. Start with a file on a gfs2 filesystem that has no cached glocks
2. Run the attached application to map that file in read only

Actual results:
An exclusive glock will be created

Expected results:
Only shared locks should be created
I've tracked down what is going on here. It all comes down to the test used in the ->mmap() function, which is supposed to skip the EX lock if there are no atime updates to be performed.

The EX lock is being taken because there are a number of different ways in which the noatime state can be set: via the mount flags, via the O_NOATIME file flag, and via the S_NOATIME flag (set on a per-file basis via setattr). The code checks only for O_NOATIME (which, if set, does prevent grabbing the EX lock), but the check is repeated later on in the VFS atime code, so the actual atime updates are done correctly. It's only the locking that isn't quite correct.

So if you have access to the source code, there is a temporary workaround of opening the files to be mmapped with O_NOATIME.

Note that this only happens on mmap() and not on page faults, so if the files are mmap()ed just once and then used many times, only the initial mmap call will require an EX lock. After that point all the locks will be PR (for read-only access, even if the file is mapped read/write).

That should allow you to get on with your BLAST runs. I'll try to get a patch sorted out for this as soon as I can.
Steve, Excellent news!! We'll change BLAST right away and let you know the impact. Since the loader uses mmap() quite heavily, we are still interested in a patched kernel. This explains some symptoms that we had early on that we weren't able to explain (so we worked around them).
Steve, it turns out that O_NOATIME can only be used if you are the file owner or root, which is not a good solution for shared databases :-( We'll go ahead and get the timings to make sure that this works as expected, though.
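For illustration, the O_NOATIME workaround and the ownership restriction described above could be combined roughly as follows. This is a minimal userspace sketch, not the attached test program and not BLAST's actual code; the helper names open_ro_prefer_noatime() and map_file_ro() are hypothetical. It assumes Linux, where open() with O_NOATIME fails with EPERM when the caller is neither the file owner nor privileged. When the fallback path is taken, an unpatched kernel would still be expected to take the EX lock in mmap().

```c
#define _GNU_SOURCE /* O_NOATIME is a Linux extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open read-only, preferring O_NOATIME.
 * Falls back to a plain read-only open if the caller is not
 * allowed to use O_NOATIME on this file (EPERM). */
static int open_ro_prefer_noatime(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM)
        fd = open(path, O_RDONLY);
    return fd;
}

/* Hypothetical helper: map the whole file read-only.
 * Returns NULL on any failure (including an empty file);
 * on success stores the mapping length in *len_out. */
static void *map_file_ro(const char *path, size_t *len_out)
{
    assert(len_out != NULL);
    int fd = open_ro_prefer_noatime(path);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = len;
    return p;
}
```

A caller would use map_file_ro(path, &len), read from the returned pointer, and release it with munmap(ptr, len).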
Yes. The O_NOATIME suggestion was just meant to be a temporary workaround. I have a patch now which should solve the issue.
Created attachment 476357 [details] RHEL5 port of the patch
QE Testing Notes:

One method to test this patch is to simply run a loop which mmaps and unmaps a file on multiple nodes. This should run a lot faster after the fix has been applied. The mmap may be read only or read/write provided the actual pages are only ever read from.

On RHEL6 however, the tracepoints can be used to verify the fix in the following manner:

1. Create file to map and stat it to find out the inode number
2. umount the fs and remount it to clear out any glock state
3. echo "glnum == <inode number> && gltype == 2" > /sys/kernel/debug/tracing/events/gfs2/gfs2_glock_state_change
4. echo 1 > /sys/kernel/debug/tracing/events/gfs2/enable
5. echo "" >/sys/kernel/debug/tracing/trace (to clear the event buffer)
6. Mmap the file (read only, or read write so long as no writes are made to the mapped pages of the file)
7. cat /sys/kernel/debug/tracing/trace

Note that the glock was not obtained in the EX state after the patch has been applied.

Also note that after the patch, the glock will not be taken in the exclusive mode, even in the atime case.
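The mmap/unmap loop mentioned in the testing notes could look something like the sketch below. It is only an illustration under stated assumptions: the helper name time_mmap_loop() is hypothetical, the iteration count is up to the tester, and the function would need to be run concurrently against the same file on several nodes to exercise the glock contention. It maps the file read-only, reads a single byte so that the mapped pages are only ever read from, unmaps, and reports the elapsed wall-clock time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: mmap and munmap `path` read-only `iterations`
 * times. Returns the elapsed time in seconds, or -1.0 on error
 * (open/fstat/mmap failure, or an empty file). */
static double time_mmap_loop(const char *path, int iterations)
{
    assert(path != NULL && iterations > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1.0;
    }
    size_t len = (size_t)st.st_size;

    volatile unsigned char sink = 0; /* keeps the read from being optimized away */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1.0;
        }
        sink = *(const volatile unsigned char *)p; /* read only, never write */
        munmap(p, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    close(fd);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Comparing the value returned on a kernel without the patch against one with it, with the loop running on two or more nodes at once, should show the difference described above.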
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Just a note -- we have installed the test build provided to us by RedHat support with this patch and verified that it has fixed the performance issues we were seeing with BLAST. Our perception is that it has also generally improved the cluster performance overall, but that's harder to quantify. Thanks very much for this fix!!!
Scooter, thanks for testing that for us. We'll roll it into the next release now that we know it is working for you.
in kernel-2.6.18-244.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
I ran a multi-node test with d_iogen/d_doio using only mmapped reads. The command line was:

bin/d_iogen -i 5000 -m random -s mmread -v mmread -t 10k -T 10m -F 500m:/mnt/west4/largemmap -I 1234

Iterations were fixed at 5000 total. I created /mnt/west4/largemmap by hand from /dev/urandom.

Running it only on the nodes with kernel-2.6.18-262.el5 took 24 seconds.
Running it only on the nodes with kernel-2.6.18-238.el5 took 184 seconds.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Performance issues occurred in some situations when using mmap() on a file on a GFS2 file system, as it was using an exclusive lock via glock. With this update, a shared lock is used instead.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Performance issues occurred in some situations when using mmap() on a file on a GFS2 file system, as it was using an exclusive lock via glock. With this update, a shared lock is used instead.
+Performance issues occurred when multiple nodes attempted to mmap() the same inode at the same time on a GFS2 file system, as it was using an exclusive glock. With this update, a shared lock is used when noatime is set on the mount, allowing mmap operations to occur in parallel. Note that this issue only refers to the mmap() syscall, and not to subsequent page faults.