Bug 672724

Summary: mmapping a read only file on a gfs2 filesystem incorrectly acquires an exclusive glock
Product: Red Hat Enterprise Linux 5 Reporter: Scooter Morris <scooter>
Component: kernel Assignee: Steve Whitehouse <swhiteho>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: urgent    
Version: 5.6 CC: adas, bmarzins, cww, djansa, eguan, jwest, kzhang, qcai, rpeterso, swhiteho
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Performance issues occurred when multiple nodes attempted to mmap() the same inode at the same time on a GFS2 file system, as it was using an exclusive glock. With this update, a shared lock is used when noatime is set on the mount, allowing mmap operations to occur in parallel. Note that this issue only refers to the mmap() syscall, and not to subsequent page faults.
Story Points: ---
Clone Of:
: 674286 (view as bug list) Environment:
Last Closed: 2011-07-21 10:22:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 674286, 729088, 729090    
Attachments:
Description                            Flags
Program to demonstrate the problem.    none
RHEL5 port of the patch                none

Description Scooter Morris 2011-01-26 02:11:01 UTC
Created attachment 475319 [details]
Program to demonstrate the problem.

Description of problem: When an application uses mmap to map in a file in a gfs2 filesystem in a read-only mode, it acquires an exclusive glock, even with noatime set on the filesystem.  This has a significant impact on the performance of subsequent invocations of the application if the same file is accessed on multiple nodes.  


Version-Release number of selected component (if applicable): 2.6.18-238.el5


How reproducible: Always


Steps to Reproduce:
1. Start with a file on a gfs2 filesystem that has no cached glocks
2. Run the attached application to map that file in read only
  
Actual results: An exclusive glock will be created


Expected results: Only shared locks should be created

Comment 1 Steve Whitehouse 2011-01-31 17:45:19 UTC
I've tracked down what is going on here....

It is all down to the test used in the ->mmap() function which is supposed to skip the EX lock if there are no atime updates to be performed. The reason that the EX lock is being taken is that there are a number of different ways in which the noatime state can be set: via the mount flags, via the O_NOATIME file flag, and via the S_NOATIME flag (set on a per-file basis via setattr).

The code checks only for O_NOATIME (which, if set, does prevent grabbing the EX lock), but the check is repeated later on in the VFS atime code, so the actual atime updates are done correctly. It's only the locking that isn't quite correct.
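
For illustration only, the three noatime sources named above can each be inspected from user space. The sketch below is not the kernel check itself and is not taken from the patch; the helper name noatime_sources() and the NOATIME_* constants are made up for this example. It assumes Linux with glibc, and it assumes the per-file flag is visible as FS_NOATIME_FL (what chattr +A sets) through FS_IOC_GETFLAGS, which not every filesystem implements.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_NOATIME and ST_NOATIME on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>           /* FS_IOC_GETFLAGS, FS_NOATIME_FL */
#include <sys/ioctl.h>
#include <sys/statvfs.h>        /* fstatvfs(), ST_NOATIME */
#include <unistd.h>

/* Hypothetical bit values, one per way of requesting noatime. */
enum { NOATIME_MOUNT = 1, NOATIME_OPEN = 2, NOATIME_INODE = 4 };

/*
 * Report which noatime sources apply to an open file descriptor.
 * A source that cannot be queried (e.g. the ioctl is unsupported)
 * is simply reported as not set.
 */
static int noatime_sources(int fd)
{
    int found = 0;
    struct statvfs vfs;
    int open_flags;
    int attr = 0;

    assert(fd >= 0);

    /* 1. Mount flag: the filesystem was mounted with noatime. */
    if (fstatvfs(fd, &vfs) == 0 && (vfs.f_flag & ST_NOATIME))
        found |= NOATIME_MOUNT;

    /* 2. File flag: the descriptor was opened with O_NOATIME. */
    open_flags = fcntl(fd, F_GETFL);
    if (open_flags != -1 && (open_flags & O_NOATIME))
        found |= NOATIME_OPEN;

    /* 3. Per-file attribute: the inode carries the no-atime flag. */
    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0 && (attr & FS_NOATIME_FL))
        found |= NOATIME_INODE;

    return found;
}
```

A result with only NOATIME_MOUNT set corresponds to the case in this report: atime updates are disabled, but not through the one flag the ->mmap() test looked at.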

So if you have access to the source code, there is a temporary workaround of opening the files to be mmaped with O_NOATIME. Note that this only happens on mmap() and not on page faults, so if the files are mmap()ed just once and then used many times, only the initial mmap call will require an EX lock. After that point all the locks will be PR (for read-only access, even if the file is mapped read/write).
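
A minimal sketch of that workaround, assuming Linux with glibc: open the file with O_NOATIME and then map it read-only. The helper name map_readonly_noatime() is hypothetical and is not taken from the attached test program. open(2) fails with EPERM when the caller neither owns the file nor is privileged, so the sketch falls back to a plain open, in which case the workaround does not apply.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_NOATIME in <fcntl.h> on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a whole file read-only, requesting O_NOATIME
 * where permitted. On success returns the mapping and stores its length
 * in *len; on failure (including an empty file) returns MAP_FAILED.
 */
static void *map_readonly_noatime(const char *path, size_t *len)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && len != NULL);

    fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM)
        fd = open(path, O_RDONLY);  /* not owner or root: plain open */
    if (fd < 0)
        return MAP_FAILED;

    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return MAP_FAILED;
    }

    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p != MAP_FAILED)
        *len = (size_t)st.st_size;
    return p;
}
```

The caller releases the mapping with munmap(p, len) when done.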

That should allow you to get on with your BLAST runs. I'll try and get a patch sorted out for this as soon as I can.

Comment 2 Scooter Morris 2011-01-31 18:48:28 UTC
Steve,
   Excellent news!!  We'll change BLAST right away and let you know the impact.  Since the loader uses mmap() quite heavily, we are still interested in a patched kernel.  This explains some symptoms that we had early on that we weren't able to explain (so we worked around them).

Comment 3 Scooter Morris 2011-01-31 22:05:17 UTC
Steve, it turns out that O_NOATIME can only be used if you are the file owner or root, which is not a good solution for shared databases :-(  We'll go ahead and get the timings to make sure that this works as expected, though.

Comment 4 Steve Whitehouse 2011-02-01 10:48:18 UTC
Yes. The O_NOATIME suggestion was just meant to be a temporary workaround. I have a patch now which should solve the issue.

Comment 5 Steve Whitehouse 2011-02-01 10:49:24 UTC
Created attachment 476357 [details]
RHEL5 port of the patch

Comment 6 Steve Whitehouse 2011-02-01 11:03:45 UTC
QE Testing Notes:

One method to test this patch is to simply run a loop which mmaps and unmaps a file on multiple nodes. This should run a lot faster after the fix has been applied. The mmap may be read only or read/write, provided the actual pages are only ever read from.
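
One possible shape for such a loop, as a sketch rather than the actual QE test: the helper name time_mmap_loop() is hypothetical. It maps and unmaps the file read-only a fixed number of times and returns the elapsed wall-clock time. The pages are deliberately never touched, since the lock in question is taken by the mmap() call itself.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* clock_gettime(), CLOCK_MONOTONIC */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical benchmark helper: mmap() and munmap() `path` read-only
 * `iterations` times. Returns elapsed seconds, or -1.0 on any error.
 */
static double time_mmap_loop(const char *path, long iterations)
{
    struct timespec t0, t1;
    struct stat st;
    size_t len;
    long i;
    int fd;

    assert(path != NULL && iterations > 0);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1.0;
    }
    len = (size_t)st.st_size;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++) {
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1.0;
        }
        munmap(p, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

To exercise the contention described in this bug, the same loop would need to run against the same file from several cluster nodes at once; a single-node run only gives a baseline.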

On RHEL6 however, the tracepoints can be used to verify the fix in the following manner:

1. Create file to map and stat it to find out the inode number
2. umount the fs and remount it to clear out any glock state
3. echo "glnum == <inode number> && gltype == 2" > /sys/kernel/debug/tracing/events/gfs2/gfs2_glock_state_change/filter
4. echo 1 > /sys/kernel/debug/tracing/events/gfs2/enable
5. echo "" >/sys/kernel/debug/tracing/trace (to clear the event buffer)
6. Mmap the file (read only, or read write so long as no writes are made to the mapped pages of the file)
7. cat /sys/kernel/debug/tracing/trace

Check that the glock is not obtained in the EX state once the patch has been applied.

Also note that after the patch, the glock will not be taken in the exclusive mode, even in the atime case.
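
Steps 1 and 3 above can be combined in a small helper that stats the file and formats the filter expression. This is only a sketch; the function name build_glock_filter() is hypothetical. It assumes the inode number reported by stat() is the value to match against glnum, as the steps above imply, and it prints the number in decimal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: write the tracepoint filter expression for the
 * inode glock (gltype == 2) of `path` into `buf`.
 * Returns 0 on success, -1 if stat() fails or the buffer is too small.
 */
static int build_glock_filter(const char *path, char *buf, size_t buflen)
{
    struct stat st;
    int n;

    assert(path != NULL && buf != NULL);

    if (stat(path, &st) < 0)
        return -1;

    n = snprintf(buf, buflen, "glnum == %llu && gltype == 2",
                 (unsigned long long)st.st_ino);
    if (n < 0 || (size_t)n >= buflen)
        return -1;                  /* output error or truncation */
    return 0;
}
```

The resulting string is what gets written to the tracepoint filter in step 3.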

Comment 7 RHEL Program Management 2011-02-01 17:05:24 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Scooter Morris 2011-02-08 20:31:44 UTC
Just a note -- we have installed the test build provided to us by Red Hat support with this patch and verified that it has fixed the performance issues we were seeing with BLAST.  Our perception is that it has also generally improved the cluster performance overall, but that's harder to quantify.  Thanks very much for this fix!!!

Comment 9 Steve Whitehouse 2011-02-08 23:07:04 UTC
Scooter, thanks for testing that for us. We'll roll it into the next release now that we know it is working for you.

Comment 13 Jarod Wilson 2011-02-18 22:41:04 UTC
in kernel-2.6.18-244.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 15 Nate Straz 2011-05-31 19:28:41 UTC
I ran a multi-node test with d_iogen/d_doio using only mmapped reads.  The command line was:

 bin/d_iogen -i 5000 -m random -s mmread -v mmread -t 10k -T 10m -F 500m:/mnt/west4/largemmap -I 1234

Iterations were fixed at 5000 total.

I created /mnt/west4/largemmap by hand from /dev/urandom.

Running it only on the nodes with kernel-2.6.18-262.el5 took 24 seconds.

Running it only on the nodes with kernel-2.6.18-238.el5 took 184 seconds.

Comment 16 errata-xmlrpc 2011-07-21 10:22:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

Comment 19 Tomas Capek 2011-09-13 13:01:47 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Performance issues occurred in some situations when using mmap() on a file on a GFS2 file system, as it was using an exclusive lock via glock. With this update, a shared lock is used instead.

Comment 20 Steve Whitehouse 2011-09-15 15:12:54 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Performance issues occurred in some situations when using mmap() on a file on a GFS2 file system, as it was using an exclusive lock via glock. With this update, a shared lock is used instead.
+Performance issues occurred when multiple nodes attempted to mmap() the same inode at the same time on a GFS2 file system, as it was using an exclusive glock. With this update, a shared lock is used when noatime is set on the mount, allowing mmap operations to occur in parallel. Note that this issue only refers to the mmap() syscall, and not to subsequent page faults.