Bug 729261

Summary: ext3/ext4 mbcache causes high CPU load
Product: Red Hat Enterprise Linux 5
Reporter: Bernd Schubert <bernd.schubert>
Component: kernel
Assignee: Eric Sandeen <esandeen>
Status: CLOSED ERRATA
QA Contact: Eryu Guan <eguan>
Severity: medium
Priority: unspecified
Version: 5.7
CC: ccui, eguan, esandeen, perfbz, rwheeler
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: kernel-2.6.18-283.el5
Doc Type: Bug Fix
Clones: 731585
Last Closed: 2012-02-21 03:51:22 UTC
Bug Blocks: 731585

Description Bernd Schubert 2011-08-09 09:38:59 UTC
Description of problem:

We copied small files on FhGFS file systems and noticed an unusually high CPU load on the fhgfs-meta server (6 x E5520 @ 2.27GHz cores saturated). Application and kernel profiling showed that this is due to ext3/ext4 mbcache usage.

CPU: Intel Core/i7, speed 2266.81 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
836662   48.7378  vmlinux                  mb_cache_entry_insert
368013   21.4377  vmlinux                  mb_cache_entry_free
103116    6.0068  vmlinux                  mb_cache_entry_release
93262     5.4328  vmlinux                  mb_cache_entry_get
72118     4.2011  vmlinux                  mb_cache_entry_find_first
45358     2.6422  vmlinux                  mb_cache_shrink_fn
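
Summing the six mb_cache_* rows gives a sense of how much of the sampled CPU time lands in mbcache. A minimal sketch, using only the percentages from the oprofile listing above (the variable names are illustrative):

```python
# Percentages copied from the oprofile listing above (vmlinux symbols).
mbcache_share = {
    "mb_cache_entry_insert": 48.7378,
    "mb_cache_entry_free": 21.4377,
    "mb_cache_entry_release": 6.0068,
    "mb_cache_entry_get": 5.4328,
    "mb_cache_entry_find_first": 4.2011,
    "mb_cache_shrink_fn": 2.6422,
}

# Total share of samples attributed to the listed mbcache symbols.
total = sum(mbcache_share.values())
print(f"mbcache symbols account for {total:.1f}% of samples")
# prints: mbcache symbols account for 88.5% of samples
```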

The issue was already exposed in the past by the Lustre file system, which simply added a patch to disable the mbcache:

https://bugzilla.lustre.org/show_bug.cgi?id=22771


Unlike Lustre, the FhGFS file system does not need kernel patches at all, so our users depend on upstream kernels. We would therefore like to see either the patch attached to the Lustre bugzilla included in the RHEL kernel, or at least a backport of commit 3a48ee8a4ad26c3a538b6fc11a86a8f80c3dce18 (mbcache: Limit the maximum number of cache entries), which has landed in upstream Linux and should at least partially solve the problem ("At least partially solves https://bugzilla.lustre.org/show_bug.cgi?id=22771").
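
To illustrate why limiting the number of cache entries can help, here is a toy model. It is not the kernel code. It assumes a fixed number of hash buckets, an insert that first walks the bucket's chain to look for an existing entry, and, for the capped variant, eviction of the oldest entry once the limit is reached. The bucket count, the cap of 16 entries per bucket, and all names are hypothetical choices for illustration.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Toy chained hash table; NOT the kernel's mbcache implementation."""

    def __init__(self, buckets, max_entries=None):
        self.buckets = [[] for _ in range(buckets)]
        self.max_entries = max_entries  # None means unbounded
        self.lru = OrderedDict()        # insertion order, oldest first
        self.steps = 0                  # chain entries visited so far

    def insert(self, block):
        chain = self.buckets[block % len(self.buckets)]
        # Walk the chain to check for a duplicate before inserting.
        for existing in chain:
            self.steps += 1
            if existing == block:
                return False
        # Capped variant: evict the oldest entry instead of growing further.
        if self.max_entries is not None and len(self.lru) >= self.max_entries:
            victim, _ = self.lru.popitem(last=False)
            self.buckets[victim % len(self.buckets)].remove(victim)
        chain.append(block)
        self.lru[block] = True
        return True


def chain_steps(n_blocks, buckets, max_entries=None):
    """Insert n_blocks distinct block numbers and return chain visits."""
    cache = ToyBlockCache(buckets, max_entries)
    for block in range(n_blocks):
        cache.insert(block)
    return cache.steps


unbounded = chain_steps(20_000, buckets=64)
capped = chain_steps(20_000, buckets=64, max_entries=64 * 16)
print(f"chain visits: unbounded={unbounded}, capped={capped}")
```

In this model the unbounded cache pays for ever longer chains as more distinct blocks are inserted, while the capped cache never walks more than 16 entries per insert. Whether the real mbcache behaves the same way in detail is not established by this sketch.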


Version-Release number of selected component (if applicable):


How reproducible:

Easily, with FhGFS or Lustre, when the meta server uses an inode size of 128 bytes.

Install FhGFS with the meta server on ext3 or ext4 and with XATTR usage enabled. To reproduce the problem, the inode size must be limited to 128 bytes.


Steps to Reproduce:
1. Install FhGFS
2. meta server on ext3 or ext4 with 128 Bytes inode size, enable XATTR usage
3. Fill the file system with small files and create high memory usage (the server has at least 12 GB of memory).
  
Actual results:

top will show high CPU usage by fhgfs-meta, and OProfile profiling will show that this is due to mbcache.

Expected results:

Low CPU usage of fhgfs-meta.

Additional info:

Comment 1 Ric Wheeler 2011-08-09 09:46:48 UTC
Do you see the same issue in RHEL6?

Thanks!

Comment 2 Bernd Schubert 2011-08-09 11:35:29 UTC
I have not tested it yet, but I guess so, as commit 3a48ee8a4ad26c3a538b6fc11a86a8f80c3dce18 is not included in 2.6.32-131.6.1.el6 yet. I will test with that kernel version as soon as possible and then will report results here.

Cheers,
Bernd

Comment 3 Bernd Schubert 2011-08-09 14:56:18 UTC
So I can reproduce it with the 2.6.32-131.6.1.el6 kernel. Right now I'm simply using tar 

(cd /mnt/tmp/fhgfs_meta && /root/tar -cf - . --xattrs --sparse) | (cd /mnt/tmp2/fhgfs_meta/ && /root/tar -xpf -)

to copy files from /mnt/tmp, which has 512-byte ext4 inodes, to /mnt/tmp2, which has only 128-byte ext4 inodes.

After about 300,000 inodes, "perf top" shows 30% in mb_cache_entry_insert() and 24% in __mb_cache_entry_find(). While I'm writing this up, the numbers are increasing and are now already at:

------------------------------------------------------------------------------
   PerfTop:   23470 irqs/sec  kernel:92.3% [100000 cycles],  (all, 2 CPUs)
------------------------------------------------------------------------------

             samples    pcnt   kernel function
             _______   _____   _______________

           119181.00 - 37.2% : mb_cache_entry_insert    [mbcache]
            81295.00 - 25.4% : __mb_cache_entry_find    [mbcache]
             3456.00 -  1.1% : __d_lookup
             3364.00 -  1.1% : avc_has_perm_noaudit
             1967.00 -  0.6% : __link_path_walk
             1880.00 -  0.6% : inode_has_perm
             1619.00 -  0.5% : _spin_lock


(with about 470,000 copied files). I guess mbcache will be at 80-90% once tar is almost done (we have about 16,000,000 files on that test file system).

Comment 4 Bernd Schubert 2011-08-09 15:45:48 UTC
I now tested with a recent 3.1-git kernel and with that version "perf top" does not show anything related to the mbcache.

Comment 5 Eric Sandeen 2011-08-17 21:42:33 UTC
Ok, patch backports without trouble.  I thought we'd have kabi issues but I guess mbcache isn't on the whitelist after all.

Comment 6 RHEL Product and Program Management 2011-08-17 22:09:37 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Jarod Wilson 2011-08-30 19:24:38 UTC
Patch(es) available in kernel-2.6.18-283.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 10 Bernd Schubert 2011-10-05 15:53:16 UTC
Thanks for updating the kernel. We will test as soon as possible (I just need to finish some other work first).

Comment 12 Eryu Guan 2011-12-12 12:13:15 UTC
I used a modified fs_mark that sets an xattr (key=user.testname, value=$filename) on each file it creates:

./fs_mark -s 0 -w 0 -p 64 -r 64 -d /mnt/ext4/test -n NUM
NUM will be 1000 2000 5000 10000 20000 50000
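
The modified fs_mark is not attached to this bug. As a rough stand-in, the per-file xattr step described above could look like the following Python sketch. It assumes Linux, a file system mounted with user xattr support, and that "-s 0" means zero-length files; the helper name create_files_with_xattr and the file naming scheme are hypothetical. It does not reproduce fs_mark's timing or its other options.

```python
import os


def create_files_with_xattr(directory, count, set_xattr=None):
    """Create `count` empty files and tag each with user.testname=<filename>.

    `set_xattr` defaults to os.setxattr (Linux only). It is a parameter so
    the helper can be exercised on systems without xattr support.
    """
    if set_xattr is None:
        set_xattr = os.setxattr
    os.makedirs(directory, exist_ok=True)
    created = []
    for i in range(count):
        name = f"file{i:08d}"  # hypothetical naming scheme
        path = os.path.join(directory, name)
        # Zero-length file, assumed to match fs_mark's "-s 0".
        with open(path, "wb"):
            pass
        # xattr values must be bytes; the key lives in the "user." namespace.
        set_xattr(path, "user.testname", name.encode())
        created.append(path)
    return created
```

A call such as create_files_with_xattr("/mnt/ext4/test", 1000) would then approximate one run with NUM=1000.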

On 2.6.18-298.el5 kernel

Count    Files/sec     App Overhead
 1000         58.9            26808
 2000         58.8            57948
 5000         55.2           146221
10000         53.7           294792
20000         47.0           596290
50000         57.9          1581170

On 2.6.18-274.el5 kernel

Count    Files/sec     App Overhead
 1000         57.1            47332
 2000         57.2            94190
 5000         55.2           256476
10000         48.3           572964
20000         42.6          1448644
50000         44.7          5406174

So the -298 kernel shows a bit of improvement.
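
To put a number on the difference, the two fs_mark tables above can be compared row by row. A minimal sketch using only the figures reported in this comment (the dictionary names are illustrative):

```python
# count -> (files/sec, app overhead), copied from the two tables above.
kernel_298 = {
    1000: (58.9, 26808),
    2000: (58.8, 57948),
    5000: (55.2, 146221),
    10000: (53.7, 294792),
    20000: (47.0, 596290),
    50000: (57.9, 1581170),
}
kernel_274 = {
    1000: (57.1, 47332),
    2000: (57.2, 94190),
    5000: (55.2, 256476),
    10000: (48.3, 572964),
    20000: (42.6, 1448644),
    50000: (44.7, 5406174),
}

for count in sorted(kernel_298):
    rate_new, overhead_new = kernel_298[count]
    rate_old, overhead_old = kernel_274[count]
    # Ratios > 1 mean the -298 kernel is faster / has less overhead.
    print(f"{count:>6}  files/sec x{rate_new / rate_old:.2f}  "
          f"app overhead /{overhead_old / overhead_new:.2f}")
```

At 50000 files the reported App Overhead is about 3.4 times lower on the -298 kernel, and the gap widens with the file count.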



oprofile shows a clearer result

On -274 kernel, 50000 file fs_mark run

samples  %        symbol name
25141     3.5907  mb_cache_entry_get
18872     2.6953  .text.__mb_cache_entry_find
10497     1.4992  mb_cache_entry_insert

On the -300 kernel, the mb_* related functions consumed far less CPU

978       0.1575  mb_cache_entry_get
852       0.1372  .text.__mb_cache_entry_find
515       0.0829  mb_cache_entry_insert
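
The combined share of the three mb_* symbols can be compared directly. A minimal sketch using only the percentages from the two oprofile listings above (the variable names are illustrative, and the kernel labels are taken as written in this comment):

```python
# Percent of samples per symbol, copied from the oprofile listings above.
share_274 = {
    "mb_cache_entry_get": 3.5907,
    ".text.__mb_cache_entry_find": 2.6953,
    "mb_cache_entry_insert": 1.4992,
}
share_300 = {
    "mb_cache_entry_get": 0.1575,
    ".text.__mb_cache_entry_find": 0.1372,
    "mb_cache_entry_insert": 0.0829,
}

total_274 = sum(share_274.values())
total_300 = sum(share_300.values())
reduction = total_274 / total_300

print(f"-274: {total_274:.2f}%  -300: {total_300:.2f}%  "
      f"reduction: about {reduction:.0f}x")
```

The three symbols together drop from roughly 7.8% of samples to roughly 0.4%.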

Set to VERIFIED

Comment 13 errata-xmlrpc 2012-02-21 03:51:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html