From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050427 Red Hat/1.7.7-1.1.3.4

Description of problem:
The 64-bit (IA64 and x86_64) page cache performance has been an issue since AS2.1 and it still exists in RHEL 4. Since most of the customer reports are filed against RHEL 3, this ticket is opened against RHEL 3 due to its relevance.

The issue is difficult to quantify since it is normally reported with a vague statement such as "the system is sluggish", "the cluster keeps failing over for no reason", or "I've recreated our production system's performance problem using 9 dd processes to pump 900GB of data onto disk". The following is an attempt to characterize the issue so we can investigate the problem in a manageable manner:

1. Command latency is high (interactive commands temporarily suspended): Ext3 seems to have difficulty flushing its data/journal to disk, which temporarily suspends access to that filesystem. If the blocked (ext3) filesystem happens to be the root partition, users see sluggish system response times. Ext3 tuning normally provides some relief (say, separating a gigantic filesystem into smaller ones and/or using an external journal device).

2. Memory gets fragmented easily (with higher-than-normal swapping activity): Sysrq-m shows the larger cache buckets are completely depleted. We normally suggest the customer fine-tune /proc/sys/vm/bdflush (sync more often) and /proc/sys/vm/pagecache (restrict pagecache usage).

3. The IO path runs out of io request descriptors: Sysrq-t thread traces show a large number of io paths waiting for free io request descriptors. We sometimes ship customers a test kernel that increases the descriptor count and reduces its batch sizes.

4. Sync helps: forcing a "sync" command, mounting the filesystem sync, and/or increasing the bdflush wakeup interval helps and sometimes even makes the problem go away.

Note that all of the above tuning tips help, but most of the time we still can't bring the system to a state acceptable to the customer and/or comparable with 32-bit systems. With several critical customer issues on hand, this bugzilla is opened to request the engineering team's further investigation and a search for solutions. Individual IT tickets will follow after this bugzilla is opened.

BTW, I suspect all of the above are caused by the page size (4 times larger than on 32-bit systems), which exaggerates the linux vm slab cache memory problem, depletes the page cache at an amazing speed, and means our internal logic (such as the page reclaiming logic) does not play well on 64-bit systems compared with 32-bit boxes.

Version-Release number of selected component (if applicable):
2.4.21-32.0.1.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Please see each individual ticket.

Additional info:
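For reference, the sysrq-m and sysrq-t output mentioned in items 2 and 3 can be collected without console access via /proc/sysrq-trigger, if it is available on the customer's kernel (a minimal sketch; assumes magic sysrq support is built in and the output lands in dmesg/syslog):

  echo 1 > /proc/sys/kernel/sysrq      # make sure sysrq is enabled
  echo m > /proc/sysrq-trigger         # memory info (shows depleted cache buckets)
  echo t > /proc/sysrq-trigger         # thread traces (shows io paths waiting on request descriptors)
  dmesg | tail -200                    # capture the dumped output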
Since bugzilla doesn't allow editing (of the comments), adding two side notes for clarification:
1. IA64 exhibits the same problems.
2. The very same application and/or system configuration normally runs fine on an IA32 system (such as i686).
Action plan from our end:
1. Done with VM tuning (via pagecache and bdflush).
2. In the middle of an ext3 tuning proposal to the customer (external journal device, writeback mode, etc.); see the sketch after this list.
3. After 2), we'll do io tuning (elvtune, io request descriptors, etc.).
If anyone has any other ideas, please do chip in.
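To make steps 2) and 3) concrete, roughly what the proposal will look like (device names /dev/sda1 and /dev/sdb1 and the elvtune numbers are placeholders, not the customer's actual values):

  # external journal: create a journal device and attach it to the data filesystem
  mke2fs -O journal_dev /dev/sdb1
  tune2fs -O ^has_journal /dev/sda1       # filesystem must be unmounted and clean
  tune2fs -J device=/dev/sdb1 /dev/sda1
  # metadata-only journaling
  mount -o data=writeback /dev/sda1 /data
  # io tuning (step 3): adjust elevator read/write latency bounds
  elvtune -r 1024 -w 2048 /dev/sda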
A Red Hat internal benchmark (we use MySQL) shows a 25-30% performance improvement with ext3 journal tuning.
Working with Red Hat kernel engineers on this issue.

1) One customer is currently experimenting with setting up a 10000MB hugetlb_pool on a 16GB machine, with its application (DB2 UDB) tuned to use huge pages:

  echo <MB> > /proc/sys/vm/hugetlb_pool

This creates the pool of hugetlb pages of the size specified by <MB>. It must be done at boot time or shortly after, since these pages are subject to fragmentation. It is best to add it to /etc/sysctl.conf:

  vm.hugetlb_pool = <Size In MB>

To verify the setting, say if set to 32 MB, after reboot:
* Check /proc/meminfo (before running the application). It should show:
    HugePages_Total:    16
    HugePages_Free:     16
    Hugepagesize:     2048 kB
  This means there are 16 2MB TLB pages available.
* The following entry should also show up in /proc/filesystems:
    nodev hugetlbfs

2) There is a kscand fix in the 2.4.21-32.11.EL kernel (preview ISOs are available on the partners ftp site) that is suggested to the customer. Along with the kernel, the customer is directed to force the daemon to scan 10% of its normal number of pages each time it runs:

  echo 10 > /proc/sys/vm/kscand_work_percent

The tuned system is currently being closely monitored.
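For convenience, the two settings above can be made persistent and checked with something like the following (a sketch only; the vm.kscand_work_percent sysctl name assumes the -32.11.EL test kernel, and the 32 MB pool matches the example above rather than the customer's 10000MB setting):

  echo "vm.hugetlb_pool = 32" >> /etc/sysctl.conf
  echo "vm.kscand_work_percent = 10" >> /etc/sysctl.conf
  sysctl -p
  grep Huge /proc/meminfo              # expect HugePages_Total/Free: 16, Hugepagesize: 2048 kB
  grep hugetlbfs /proc/filesystems     # expect "nodev hugetlbfs"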
Adding one of the tuning tips embedded in a previous private comment for public reference:
* /proc/sys/vm/pagecache and /proc/sys/vm/bdflush tuning relieves the hang from > 10 sec down to 1-4 secs on one customer's 16GB system.
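For anyone who wants to try the same tuning, the knobs look like this (the values are illustrative only, not the exact numbers used on that 16GB system):

  # pagecache: min / borrow / max percent of memory the pagecache may use
  echo "1 10 30" > /proc/sys/vm/pagecache
  # bdflush: lower the dirty thresholds and wake bdflush more often
  # (nfract ndirty dummy dummy interval age_buffer nfract_sync nfract_stop_bdflush dummy)
  echo "10 500 0 0 200 1000 20 30 0" > /proc/sys/vm/bdflush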
Closing per last comment.