Bug 519126 - BKL implementation related performance problems
Summary: BKL implementation related performance problems
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Duplicates: 427357 578552 626983
Depends On:
Blocks: 533192
 
Reported: 2009-08-25 09:58 UTC by Sachin Prabhu
Modified: 2018-12-03 17:06 UTC (History)
17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-12-03 16:18:56 UTC


Attachments
Test program to demonstrate the performance problem. (3.49 KB, text/x-csrc)
2009-08-25 10:02 UTC, Sachin Prabhu
SystemTap script to list which PIDs are waiting for the BKL (1.61 KB, application/octet-stream)
2010-03-25 18:06 UTC, William Cohen
Give backtraces for users of the BKL that hold the lock for longer than a certain threshold (1.14 KB, application/octet-stream)
2010-03-25 18:16 UTC, William Cohen


Links
Red Hat Knowledge Base (Solution) 427623 (last updated 2016-03-30 03:42:36 UTC)

Description Sachin Prabhu 2009-08-25 09:58:01 UTC
We have received reports that NFS file operation performance on RHEL 5 gets worse as the number of CPUs goes up. This problem was not seen on RHEL 4 and RHEL 5.

The performance problem was traced to the open syscalls. A simple test to demonstrate the problem is provided below.

The test setup consists of 
1) An NFS share mounted with the nocto option, so that the client uses its attribute cache and avoids sending GETATTR requests over the network when opening files.
2) A test program which starts multiple threads, each repeatedly opening and closing a random file chosen from a file list.

Two test runs were performed on the same machine and share.
1) The first test run was performed using all 8 CPUs on the system.
2) The second run was performed with the machine restricted to a single CPU by booting with maxcpus=1.

kernel: 2.6.18-128.el5
Test program opens 4 threads working on 10 files and runs for 120 seconds.
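
The attached program is the authoritative reproducer; the sketch below is only a rough illustration of that kind of harness, assuming a pthreads open()/close() loop. The structure, names and argument handling here are my own guesses and may differ from the attachment.

/* openspeed-style benchmark sketch (illustration only; see the attachment
 * for the real test program).
 * Usage: ./openspeed <threads> <filelist> <nfiles> <seconds>
 * Each thread repeatedly open()s and close()s a random file from the list
 * for <seconds> seconds, then reports its opens-per-second rate. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>

#define MAX_FILES 1024

static char *files[MAX_FILES];
static int nfiles;
static int duration;

static void *worker(void *arg)
{
    unsigned int seed = (unsigned int)(uintptr_t)arg;
    long opens = 0;
    time_t end = time(NULL) + duration;

    while (time(NULL) < end) {
        int fd = open(files[rand_r(&seed) % nfiles], O_RDONLY);
        if (fd >= 0) {
            close(fd);
            opens++;
        }
    }
    printf("openspeed: %ld opens in %d seconds (%.1f opens per second).\n",
           opens, duration, (double)opens / duration);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 5) {
        fprintf(stderr, "usage: %s <threads> <filelist> <nfiles> <seconds>\n", argv[0]);
        return 1;
    }

    int nthreads = atoi(argv[1]);
    int wanted = atoi(argv[3]);
    duration = atoi(argv[4]);

    /* Read up to <nfiles> file names, one per line. */
    FILE *list = fopen(argv[2], "r");
    if (!list) {
        perror("fopen");
        return 1;
    }
    char line[4096];
    while (nfiles < wanted && nfiles < MAX_FILES && fgets(line, sizeof(line), list)) {
        line[strcspn(line, "\n")] = '\0';
        files[nfiles++] = strdup(line);
    }
    fclose(list);

    pthread_t tid[nthreads];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(uintptr_t)(i + 1));
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

Each thread reports its own opens-per-second figure, which is the number compared in the runs below.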

Test on SMP machine with 8 CPUS
# ./openspeed 4 files.txt 10 120
openspeed: 875428 opens in 120.000 seconds (7295.2 opens per second).
openspeed: 912913 opens in 120.000 seconds (7607.6 opens per second).
openspeed: 920757 opens in 120.000 seconds (7673.0 opens per second).
openspeed: 893784 opens in 120.000 seconds (7448.2 opens per second).

Test on same machine with maxcpus=1 set to enable only 1 CPU.
# ./openspeed 4 files.txt 10 120
openspeed: 7908973 opens in 120.000 seconds (65908.1 opens per second).
openspeed: 7951445 opens in 120.000 seconds (66262.0 opens per second).
openspeed: 8574444 opens in 120.000 seconds (71453.7 opens per second).
openspeed: 7820680 opens in 120.000 seconds (65172.3 opens per second).

The difference can be seen in the number of files opened by each thread. The test program is able to open many more files with a single CPU than with multiple CPUs.

Using git bisect, we were able to find the following patch which fixes this upstream.
--
commit 8e3e076c5a78519a9f64cd384e8f18bc21882ce0
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sat May 10 20:58:02 2008 -0700

    BKL: revert back to the old spinlock implementation
--

The issue has been described in detail at http://lwn.net/Articles/281938/

Using kernel 2.6.26-rc1.BKL_SEMAPHORE
openspeed: 948472 opens in 120.000 seconds (7903.9 opens per second).
openspeed: 948367 opens in 120.000 seconds (7903.1 opens per second).
openspeed: 948480 opens in 120.000 seconds (7904.0 opens per second).
openspeed: 948493 opens in 120.000 seconds (7904.1 opens per second).

Using kernel 2.6.26-rc1.BKL_SPINLOCK
openspeed: 5311913 opens in 120.000 seconds (44265.9 opens per second).
openspeed: 5301736 opens in 120.000 seconds (44181.1 opens per second).
openspeed: 5317271 opens in 120.000 seconds (44310.6 opens per second).
openspeed: 5295553 opens in 120.000 seconds (44129.6 opens per second).

Apart from the one patch mentioned above, kernel 2.6.26-rc1.BKL_SPINLOCK is identical to kernel 2.6.26-rc1.BKL_SEMAPHORE.

By simply unsetting CONFIG_PREEMPT_BKL on 2.6.19, a user was able to match the results seen with this patch.

Comment 1 Sachin Prabhu 2009-08-25 10:02:04 UTC
Created attachment 358546 [details]
Test program to demonstrate the performance problem.

Comment 2 Sachin Prabhu 2009-08-25 10:03:37 UTC
There is a mistake in the summary: the problem is seen on RHEL 5. This problem was _not_ seen on _RHEL_4_ and _RHEL_3_.

Comment 3 Sachin Prabhu 2009-08-25 10:06:05 UTC
A later patch in 2.6.26 provides an additional performance boost for NFS file operations by removing calls to the BKL from within the NFS module.

commit: 8df1b049bc86495a40e421abc8b9cf1dda32f0d9

Comment 4 Sachin Prabhu 2009-08-25 10:07:26 UTC
*** Bug 427357 has been marked as a duplicate of this bug. ***

Comment 5 Sachin Prabhu 2009-08-27 10:26:37 UTC
A different workload tested by another user showed the same performance benefits with CONFIG_PREEMPT_BKL unset on the RHEL kernel.

Comment 11 Reagen Ward 2009-12-17 18:44:59 UTC
What are the odds of this fix being made during the Enterprise 5.x lifetime? We are heavy NFS users and very concerned about the poor NFS performance we see in Red Hat kernels compared to modern kernels.

Comment 12 Ric Wheeler 2009-12-17 19:16:14 UTC
Developers are not very comfortable with backporting the changes to RHEL5 since tampering with the BKL tends to be very tricky and can lead to serious issues with stability/data corruption.

This seems like a good reason to leave this BKL work for post-RHEL5.

Comment 18 Sachin Prabhu 2010-03-24 12:10:45 UTC
We have another customer report which is related to this issue. This time, the problem was seen with a customer application that makes use of fcntl locking, which takes the BKL in its code path. Performance degradation is seen when a large number of these applications are run together.

A number of processes show the following stack trace which indicates that they are waiting for the BKL.

Kernel: 2.6.18-164.11.1.el5

app           D ffff81081ea22820     0 22611      1         22612 22610 (NOTLB)
 ffff81081ea41d78 0000000000000086 0000000000000000 0000000000000000
 0000000000000000 0000000000000009 ffff81081ea22820 ffff810be78b9820
 000001cc41a7d6e5 000000000001a0ee ffff81081ea22a08 0000000400000092
Call Trace:
 [<ffffffff8008ac96>] __wake_up_common+0x3e/0x68
 [<ffffffff80064a3b>] __down+0xc3/0xd8 
 [<ffffffff8008c86c>] default_wake_function+0x0/0xe
 [<ffffffff8002e2ce>] __wake_up+0x38/0x4f
 [<ffffffff800646f9>] __down_failed+0x35/0x3a
 [<ffffffff80064cf9>] .text.lock.kernel_lock+0xf/0x1f
 [<ffffffff800e8f35>] __posix_lock_file_conf+0x67/0x3e0
 [<ffffffff80039e33>] fcntl_setlk+0x11e/0x273
 [<ffffffff800b6c6d>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8002e597>] sys_fcntl+0x269/0x2dc
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

A quick test with the test kernel used earlier in this bz shows that the problem is not seen with the alternative BKL implementation.
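
For reference, here is a minimal sketch of the kind of fcntl() byte-range locking loop that exercises this code path. It is an illustration under my own assumptions, not the customer's application (a reproducer from the customer is attached later, in comment 28).

/* Repeatedly sets and releases a one-byte write lock.  Each request goes
 * through fcntl_setlk() -> __posix_lock_file_conf(), as in the stack trace
 * above; running many copies concurrently creates the BKL contention
 * described in this comment. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/tmp/lockfile";
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 1,          /* lock a single byte */
    };

    for (long i = 0; i < 1000000; i++) {
        fl.l_type = F_WRLCK;
        if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* acquire (blocking) */
            perror("fcntl(F_SETLKW)");
            break;
        }
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);              /* release */
    }

    close(fd);
    return 0;
}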

Comment 19 William Cohen 2010-03-25 18:06:32 UTC
Created attachment 402634 [details]
SystemTap script to list which PIDs are waiting for the BKL

The script takes one argument, a number. When the number of processes waiting for the BKL exceeds that number, the script lists the PIDs waiting for the BKL. The PIDs are printed in the order in which they queued up for the lock. This script could give some clues as to what is using the BKL.

Comment 20 William Cohen 2010-03-25 18:16:50 UTC
Created attachment 402636 [details]
Give backtraces for users of the BKL that hold the lock for longer than a certain threshold

The bkl_stack.stp SystemTap script provides a backtrace each time the threshold for holding the BKL is exceeded. The script takes one argument, the threshold in microseconds. A backtrace is printed each time the BKL is held for longer than the threshold. Something like the following:

$ stap /tmp/bkl_stack.stp 100
stap ready
pid 5407 waited for 124 us.
Returning from: 0xffffffff81455df1 : lock_kernel+0x0/0x46 [kernel]
Returning to  : 0xffffffff812ac415 : tty_open+0x21/0x460 [kernel]
 0xffffffff81227bcd : kobject_get+0x1a/0x22 [kernel] (inexact)
 0xffffffff8111e33a : chrdev_open+0x122/0x143 [kernel] (inexact)
 0xffffffff811eeeb5 : selinux_dentry_open+0xee/0xf7 [kernel] (inexact)
 0xffffffff8111e218 : chrdev_open+0x0/0x143 [kernel] (inexact)
 0xffffffff81118f0d : __dentry_open+0x199/0x2d2 [kernel] (inexact)
 0xffffffff81119116 : nameidata_to_filp+0x42/0x53 [kernel] (inexact)
 0xffffffff81127caa : do_filp_open+0x4fc/0x975 [kernel] (inexact)
 0xffffffff811f08fb : task_has_capability+0xde/0x10a [kernel] (inexact)
 0xffffffff811eedb0 : selinux_cred_prepare+0x1f/0x36 [kernel] (inexact)
 0xffffffff8122e856 : might_fault+0x21/0x23 [kernel] (inexact)
 0xffffffff8122e955 : __strncpy_from_user+0x1e/0x49 [kernel] (inexact)
 0xffffffff8113230f : alloc_fd+0x7b/0x124 [kernel] (inexact)
 0xffffffff81118c7c : do_sys_open+0x63/0x10f [kernel] (inexact)
 0xffffffff81118d5b : sys_open+0x20/0x22 [kernel] (inexact)
 0xffffffff81011d32 : system_call_fastpath+0x16/0x1b [kernel] (inexact)
pid 2123 waited for 1364 us.
Returning from: 0xffffffff81455df1 : lock_kernel+0x0/0x46 [kernel]
Returning to  : 0xffffffffa002b205
 0xffffffff811ec914 : inode_has_perm+0x7a/0x90 [kernel] (inexact)
 0xffffffff8112a148 : vfs_ioctl+0x22/0x87 [kernel] (inexact)
 0xffffffff8112a6a4 : do_vfs_ioctl+0x47b/0x4c1 [kernel] (inexact)
 0xffffffff8112a740 : sys_ioctl+0x56/0x79 [kernel] (inexact)
 0xffffffff81011d32 : system_call_fastpath+0x16/0x1b [kernel] (inexact)

Comment 22 IBM Bug Proxy 2010-04-06 20:25:17 UTC
Would it be possible to attach the patch to this Bugzilla or make it available by another method?

Thanks

Comment 23 Chuck Lever 2010-04-06 20:53:25 UTC
(In reply to comment #3)
> There is an additional performance boost later for NFS file operations in
> 2.6.26 by a patch which removes calls to BKL from within the NFS module.
> 
> commit: 8df1b049bc86495a40e421abc8b9cf1dda32f0d9    

There is no single patch which removes the BKL from the NFS client.  However, I've backported the two separate NFS client BKL removal series from upstream to EL5.  We're testing the backport now, and plan to publish it here when we are confident it is working well.

Comment 24 Sachin Prabhu 2010-05-06 11:00:28 UTC
Summary of the events leading to the BKL implementation changing from spinlocks to semaphores and back to spinlocks.

The BKL was initially implemented using spinlocks. The problem with this implementation was that code holding the BKL could not be pre-empted unless the thread voluntarily scheduled itself. This meant no other process could be scheduled to run on that processor while the thread holding the BKL was running. If such a thread had to execute a long code path while holding the BKL, it would block other high-priority threads and thus increase latency. For this reason, the implementation was changed to use semaphores instead. With the semaphore implementation, as long as the thread holding the BKL does not take any other spinlocks, it can be pre-empted by other high-priority threads.
http://lwn.net/Articles/102253/

However, as we have seen in the case of RHEL 5, the semaphore-based BKL implementation has performance problems. This is because the semaphore implementation granted the BKL to contending threads in roughly the order in which they requested it. This fairness was expensive, especially if the thread that was granted the BKL had a lower priority and its scheduling was delayed. The delay in scheduling results in dead time during which none of the running threads holds the BKL. As a result, the upstream kernel reverted back to the spinlock implementation of the BKL.
http://lwn.net/Articles/281938/
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e3e076c5a78519a9f64cd384e8f18bc21882ce0
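
As a userspace analogy (my own sketch, not kernel code and not the actual BKL), the difference can be illustrated by comparing a FIFO "ticket" lock, which hands the lock to waiters strictly in arrival order much like the semaphore-based BKL, with an ordinary pthread mutex, which lets whichever contender is actually running take the lock, much like the spinlock-based BKL. One thread models a contender that is slow to be scheduled; with the FIFO lock its place in line stalls everyone queued behind it, producing the dead time described above. The thread counts and delays below are arbitrary.

/* Userspace illustration of FIFO hand-off vs. greedy lock acquisition. */
#include <stdio.h>
#include <stdatomic.h>
#include <pthread.h>
#include <unistd.h>
#include <sched.h>
#include <time.h>

#define FAST_THREADS 3
#define SECONDS      2

static atomic_uint next_ticket, now_serving;    /* FIFO ticket lock */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int use_ticket;

/* delay_us models scheduling latency: the time between this thread joining
 * the queue and actually running again. */
static void lock_it(int delay_us)
{
    if (use_ticket) {
        unsigned mine = atomic_fetch_add(&next_ticket, 1); /* join the queue */
        if (delay_us)
            usleep(delay_us);   /* queued but not running: every later
                                   arrival must wait behind this ticket */
        while (atomic_load(&now_serving) != mine)
            sched_yield();
    } else {
        if (delay_us)
            usleep(delay_us);   /* greedy lock: a slow contender only
                                   delays itself */
        pthread_mutex_lock(&mutex);
    }
}

static void unlock_it(void)
{
    if (use_ticket)
        atomic_fetch_add(&now_serving, 1);      /* hand off to next in line */
    else
        pthread_mutex_unlock(&mutex);
}

static void *worker(void *arg)
{
    int slow = (arg != NULL);
    long count = 0;
    time_t end = time(NULL) + SECONDS;

    while (time(NULL) < end) {
        lock_it(slow ? 1000 : 0);
        unlock_it();
        count++;
    }
    return (void *)count;       /* return acquisition count */
}

static void run(int ticket, const char *name)
{
    pthread_t fast[FAST_THREADS], slow;
    long total = 0;
    void *ret;

    use_ticket = ticket;
    atomic_store(&next_ticket, 0);
    atomic_store(&now_serving, 0);

    pthread_create(&slow, NULL, worker, (void *)1);
    for (int i = 0; i < FAST_THREADS; i++)
        pthread_create(&fast[i], NULL, worker, NULL);
    for (int i = 0; i < FAST_THREADS; i++) {
        pthread_join(fast[i], &ret);
        total += (long)ret;
    }
    pthread_join(slow, NULL);
    printf("%s: %ld acquisitions by the fast threads in %d seconds\n",
           name, total, SECONDS);
}

int main(void)
{
    run(0, "greedy (mutex)");
    run(1, "FIFO (ticket)");
    return 0;
}

The fast threads should complete far fewer acquisitions with the FIFO lock, which is the same pattern as the openspeed numbers in the description.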

On RHEL 6, pre-emption is disabled when acquiring the kernel lock. This should avoid the performance problem you see with RHEL 5. Even though BKLs implemented with spinlocks are prone to latency issues, we shouldn't see much of this since most of the code which required the BKL to be held for a long time has been fixed.

Comment 25 Chuck Lever 2010-05-06 15:31:03 UTC
(In reply to comment #24)
> On RHEL 6, pre-emption is disabled when acquiring the kernel lock. This should
> avoid the performance problem you see with RHEL 5. Even though BKLs implemented
> with spinlocks are prone to latency issues, we shouldn't see much of this since
> most of the code which required the BKL to be held for a long time has been
> fixed.    

While that's true, it's also the case that EL6 is based on an upstream kernel that has eliminated the BKL from most NFS client code paths.

Comment 26 Sachin Prabhu 2010-05-06 16:08:43 UTC
(In reply to comment #25)

> > On RHEL 6, pre-emption is disabled when acquiring the kernel lock. This should
> > avoid the performance problem you see with RHEL 5. Even though BKLs implemented
> > with spinlocks are prone to latency issues, we shouldn't see much of this since
> > most of the code which required the BKL to be held for a long time has been
> > fixed.    
> 
> While that's true, it's also the case that EL6 is based on an upstream kernel
> that has eliminated the BKL from most NFS client code paths.    

We have had reports of a similar performance issue with a large number of processes repeatedly making POSIX lock requests (reported in c#18). posix_lock_file() unfortunately still uses the BKL. However, the impact of the BKL in that particular code path should be mitigated by the non-preemptible BKL in RHEL 6.

Comment 28 Adrien Kunysz 2010-07-17 13:34:37 UTC
Created attachment 432588 [details]
reproducer using byte range locking (doesn't require NFS) developed by the customer

Comment 34 Jeff Layton 2010-10-08 12:55:08 UTC
I've forward-ported the patches to remove the BKL from much of the NFS/RPC code in RHEL5 for bug 626983. Test kernels with those patches are on my people.redhat.com page:

    http://people.redhat.com/jlayton/

Positive test results (with actual numbers showing a performance benefit) would be helpful in allowing us to decide whether they should be included in an upcoming release.

Comment 38 Ric Wheeler 2010-12-02 21:27:13 UTC
In https://bugzilla.redhat.com/show_bug.cgi?id=519126#c34 above, we posted test kernels.

Any feedback on the results with those kernels?

Comment 44 Jeff Layton 2010-12-03 16:17:16 UTC
Most of the BKL was removed from the NFS client upstream, and the BKL has also been switched back to a more traditional spinlock-based mechanism. We've had anecdotal reports that this problem is now mitigated in RHEL 6.

Removing the BKL from the NFS client in RHEL5, or changing the BKL implementation in RHEL5, would be quite a high-risk move. The risk of subtle, difficult-to-detect regressions would be quite high.

Based on that, I'm going to devel-NAK this request. Anyone suffering from it should strongly consider a move to RHEL6 at this point.

Comment 45 Ric Wheeler 2010-12-03 16:18:56 UTC
The BKL rework is present in RHEL6.0. Please verify that this works properly for your workload there.

Regards,

Ric

Comment 47 Jeff Layton 2011-08-09 11:34:15 UTC
*** Bug 578552 has been marked as a duplicate of this bug. ***

Comment 48 Harshula Jayasuriya 2014-12-18 09:25:50 UTC
*** Bug 626983 has been marked as a duplicate of this bug. ***

