Description of problem:
We copied small files on FhGFS file systems and noticed unusually high CPU load on the fhgfs-meta server (6 x E5520 @ 2.27GHz cores saturated). Application and kernel profiling showed that this is due to ext3/ext4 mbcache usage.

CPU: Intel Core/i7, speed 2266.81 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000

samples  %        app name  symbol name
836662   48.7378  vmlinux   mb_cache_entry_insert
368013   21.4377  vmlinux   mb_cache_entry_free
103116    6.0068  vmlinux   mb_cache_entry_release
93262     5.4328  vmlinux   mb_cache_entry_get
72118     4.2011  vmlinux   mb_cache_entry_find_first
45358     2.6422  vmlinux   mb_cache_shrink_fn

The issue was already exposed in the past by the Lustre file system, which simply added a patch to disable the mbcache:
https://bugzilla.lustre.org/show_bug.cgi?id=22771

Unlike Lustre, the FhGFS file system does not need kernel patches at all, so our users depend on upstream kernels. We would therefore like to see either the patch attached to the Lustre bugzilla included in the RHEL kernel, or at least commit 3a48ee8a4ad26c3a538b6fc11a86a8f80c3dce18 ("mbcache: Limit the maximum number of cache entries"), which has landed in upstream Linux and should "At least partially solve https://bugzilla.lustre.org/show_bug.cgi?id=22771".

Version-Release number of selected component (if applicable):

How reproducible:
Easily, with FhGFS or Lustre and with an inode size of 128 bytes on the meta server. Install FhGFS; the meta server should be on ext3 or ext4 with XATTR usage enabled. Furthermore, in order to reproduce, the inode size should be limited to 128 bytes.

Steps to Reproduce:
1. Install FhGFS
2. Put the meta server on ext3 or ext4 with a 128-byte inode size and enable XATTR usage (a minimal filesystem setup sketch follows at the end of this description)
3. Fill the filesystem with small files and introduce high memory usage (the server has at least 12 GB of memory)

Actual results:
top shows high CPU usage of fhgfs-meta, and OProfile shows that it is due to mbcache.

Expected results:
Low CPU usage of fhgfs-meta.

Additional info:
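As a minimal sketch of the filesystem setup in step 2 (the device /dev/sdX1 and mount point /mnt/meta are placeholders, not taken from this report):

# Create ext4 with 128-byte inodes, so every xattr goes to an external EA block,
# then mount with user xattrs enabled.
mkfs.ext4 -I 128 /dev/sdX1
mount -o user_xattr /dev/sdX1 /mnt/meta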
Do you see the same issue in RHEL6? Thanks!
I have not tested it yet, but I guess so, as commit 3a48ee8a4ad26c3a538b6fc11a86a8f80c3dce18 is not included in 2.6.32-131.6.1.el6 yet. I will test with that kernel version as soon as possible and will then report the results here. Cheers, Bernd
So I can reproduce it with the 2.6.32-131.6.1.el6 kernel. Right now I'm simply using tar

(cd /mnt/tmp/fhgfs_meta && /root/tar -cf - . --xattrs --sparse) | (cd /mnt/tmp2/fhgfs_meta/ && /root/tar -xpf -)

to copy files from /mnt/tmp, which has 512-byte ext4 inodes, to /mnt/tmp2, which only has 128-byte ext4 inodes. After about 300,000 inodes, "perf top" shows 30% mb_cache_entry_insert() and 24% __mb_cache_entry_find(). While I'm writing it up here, the numbers are increasing and are now already at:

------------------------------------------------------------------------------
   PerfTop:   23470 irqs/sec  kernel:92.3%  [100000 cycles],  (all, 2 CPUs)
------------------------------------------------------------------------------

    samples    pcnt   kernel function
    _______    _____  _______________

  119181.00 - 37.2% : mb_cache_entry_insert   [mbcache]
   81295.00 - 25.4% : __mb_cache_entry_find   [mbcache]
    3456.00 -  1.1% : __d_lookup
    3364.00 -  1.1% : avc_has_perm_noaudit
    1967.00 -  0.6% : __link_path_walk
    1880.00 -  0.6% : inode_has_perm
    1619.00 -  0.5% : _spin_lock

(with about 470,000 copied files). I guess it will be at 80-90% mbcache once tar is almost done (we have about 16,000,000 files on that test file system).
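The same mbcache code paths can also be driven without a full fhgfs_meta tree by creating many small files that each carry a user xattr; with 128-byte inodes there is no in-inode xattr space, so every attribute lands in an external EA block, which is what mbcache tracks. A rough sketch (paths, file count, and xattr name are placeholders, not taken from this report):

mkdir -p /mnt/tmp2/xattr-test
for i in $(seq 1 500000); do
    f=/mnt/tmp2/xattr-test/file.$i
    : > "$f"                                      # create an empty small file
    setfattr -n user.testname -v "file.$i" "$f"   # attach a user xattr (forces an external EA block)
done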
I now tested with a recent 3.1-git kernel and with that version "perf top" does not show anything related to the mbcache.
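For a non-interactive check of whether mbcache symbols show up in a system-wide profile on a given kernel, something along these lines works (standard perf usage, not the exact commands used in this report):

perf record -a -g -- sleep 30
perf report --stdio --sort symbol | grep -i mb_cache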
OK, the patch backports without trouble. I thought we'd have kABI issues, but I guess mbcache isn't on the whitelist after all.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patch(es) available in kernel-2.6.18-283.el5.

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Thanks for updating the kernel. We will test as soon as possible (I just need to finish some other work first).
I used a modified fs_mark to create an xattr (key=user.testname, value=$filename) on each file created:

./fs_mark -s 0 -w 0 -p 64 -r 64 -d /mnt/ext4/test -n NUM

NUM will be 1000, 2000, 5000, 10000, 20000, 50000.

On the 2.6.18-298.el5 kernel:

Count    Files/sec    App Overhead
1000     58.9         26808
2000     58.8         57948
5000     55.2         146221
10000    53.7         294792
20000    47.0         596290
50000    57.9         1581170

On the 2.6.18-274.el5 kernel:

Count    Files/sec    App Overhead
1000     57.1         47332
2000     57.2         94190
5000     55.2         256476
10000    48.3         572964
20000    42.6         1448644
50000    44.7         5406174

So the -298 kernel shows a bit of improvement. oprofile shows a clearer result.

On the -274 kernel, 50000-file fs_mark run:

samples  %       symbol name
25141    3.5907  mb_cache_entry_get
18872    2.6953  .text.__mb_cache_entry_find
10497    1.4992  mb_cache_entry_insert

On the -300 kernel the mb_* related functions took much less resource:

978      0.1575  mb_cache_entry_get
852      0.1372  .text.__mb_cache_entry_find
515      0.0829  mb_cache_entry_insert

Set to VERIFIED.
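For reference, a sketch of how such an OProfile symbol breakdown can be collected (the exact invocation used above was not given; the vmlinux path assumes the kernel-debuginfo package is installed):

opcontrol --init
opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
opcontrol --start
# ... run the fs_mark workload ...
opcontrol --stop
opcontrol --dump
opreport -l | head -20     # symbol-level report, top entries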
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html