Bug 127896 - Using hugetlb causes massive slowdowns with ramfs
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium  Severity: medium
Assigned To: Larry Woodman
QA Contact: Brian Brock
Depends On:
Blocks:
Reported: 2004-07-14 20:49 EDT by John Caruso
Modified: 2007-11-30 17:07 EST
CC: 2 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2004-12-20 15:55:42 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---


Description John Caruso 2004-07-14 20:49:49 EDT
Description of problem:
On a system with 8GB of RAM I'm allocating 2.5GB for hugepages.  When 
I subsequently try to create a large file in /dev/shm (mounted as 
ramfs), the file will grow to 1.6-2GB relatively quickly (within 10-
20 seconds), but after the file reaches that size the growth rate 
slows to 20MB/min (!?!).

Version-Release number of selected component (if applicable):
kernel-hugemem-2.4.21-15.0.3.EL

How reproducible:
echo 2560 > /proc/sys/vm/hugetlb_pool
umount /dev/shm
mount -t ramfs none /dev/shm
cat /dev/zero > /dev/shm/grassgrowsfasterthanthis

Steps to Reproduce:
1. See above.
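As a quick sanity check before starting the copy, it helps to confirm the hugepage pool actually grew; on Linux kernels with hugetlb enabled the HugePages_* fields show up in /proc/meminfo (exact field names vary a little across kernel generations, so MemTotal is included just as an anchor):

```shell
# Show total memory plus any hugepage accounting fields; on a kernel
# with a hugetlb pool configured, HugePages_* / Hugepagesize lines
# appear alongside MemTotal.
grep -E 'MemTotal|Huge' /proc/meminfo
```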
  
Actual results:
Sure is slow.

Expected results:
Wish it were faster.

Additional info:
Note that this problem does *not* occur if /dev/shm is mounted as 
tmpfs rather than ramfs (with everything else the same as I've 
described it above).  It's just a problem with ramfs+hugetlb.

Also, this is not just a theoretical situation; it's part of an 
Oracle configuration in which the shared portions of the SGA will be 
allocated out of hugepages, and the buffer cache will be allocated 
out of /dev/shm (which Oracle recommends mounting as ramfs, so that 
it's locked into memory).  The main workaround I've found so far is 
just to avoid hugetlb--which is also good because hugetlb can cause 
extreme system instability with large Oracle SGA configurations (I'm 
about to file a bug for that as well).
Comment 1 Rik van Riel 2004-07-14 22:43:47 EDT
Looking at page_alloc.c, it appears there could be a problem when the
sum of hugetlbfs + ramfs pages is larger than what fits in the first
zone memory is allocated from, which in your case is the highmem zone.

The RHEL3 kernel tries to fit just over 4GB of data into a 4GB-sized
highmem zone and has trouble fitting things.

I'll try to come up with an experimental patch to alleviate this
problem...
Comment 3 Larry Woodman 2004-08-06 13:22:25 EDT
John, I think the slowdown you are seeing is caused by a combination
of the hugemem kernel, allocating 2.5GB for hugetlbfs, and the fact
that ramfs sets the GFP_WIRED flag in the inode->i_mapping->gfp_mask.
This causes the system to attempt to reclaim highmem pages for ramfs
because you have overcommitted highmem between the hugetlb pages and
ramfs pages.

The first thing to do is get me an "AltSysrq M" output when you notice
the ramfs allocation slowdown.  Next, please try running the smp
kernel instead of the hugemem kernel.  Why are you running the hugemem
kernel on an 8GB system in the first place?  Have you seen other
lowmem issues with the standard smp kernel?  Finally, I am
re-evaluating whether GFP_WIRED should only be set in the ramfs inode
for the smp kernel, since lowmem exhaustion is not nearly as much of
an issue for the hugemem kernel as it is for the smp kernel.

Larry Woodman
Comment 4 John Caruso 2004-08-10 14:12:15 EDT
Have you tried reproducing this yourself?  The method I mentioned is 
pretty straightforward (though the values might require tweaking 
depending on your memory configuration), and so I was intending that 
y'all at RedHat could test this yourselves.  I don't have a system 
that's readily available for such testing anymore.

You may be right about highmem being overcommitted: the system 
reports a HighTotal of 4.5GB (LowTotal=3.4GB), so 2.5GB for hugetlb 
plus the 1.6GB file in ramfs is close to that.  That does suggest how 
you could test it on a machine with a different memory size, I 
suppose.  I'm deeply dismayed to learn that the old lowmem/highmem 
distinction is still around and that we can't just treat 8GB of 
memory as 8GB of memory.  Is there any documentation that indicates 
the actual limitations on the use of hugetlb, ramfs/shmfs, etc in 
terms of highmem/lowmem and all other relevant factors?  I get the 
feeling that we're one of the first sites even trying this kind of 
configuration, and we're having to make our way through the dark to 
do it.
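The overcommit arithmetic sketched above can be checked with plain shell arithmetic; the MB figures below are approximations taken from this report (HighTotal ~4.5GB, hugetlb pool 2.5GB, ramfs file ~1.6GB):

```shell
# Back-of-the-envelope check of the highmem overcommit described in
# this report.  Values are approximate report figures in MB.
high_total_mb=4608   # HighTotal ~4.5GB
hugetlb_mb=2560      # echo 2560 > /proc/sys/vm/hugetlb_pool
ramfs_mb=1638        # ~1.6GB ramfs file before the slowdown starts
wired_mb=$((hugetlb_mb + ramfs_mb))
echo "wired: ${wired_mb} MB of ${high_total_mb} MB highmem"
echo "headroom: $((high_total_mb - wired_mb)) MB"
```

The tiny headroom left in the highmem zone is consistent with the copy stalling once the file approaches 1.6-2GB.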

We're using the hugemem kernel mainly to get the 4/4 memory split, to 
allow as much memory as possible for the portions of the Oracle SGA 
that have to reside within process memory (i.e., those portions that 
can't go into shmfs/ramfs and be accessed indirectly).  I've yet to 
find a thorough, detailed explanation of the differences between the 
various kernel choices, though.
Comment 5 Larry Woodman 2004-08-10 15:05:11 EDT
John, yes I did test it myself and saw quite a bit of variation in the
slowdown and system responsiveness.  Anyway, as far as the "old
lowmem/highmem distinction" is concerned: yes, we still have it, and
no, you can't treat 8GB as 8GB until you start using a 64-bit
computer.  The original design of the Linux kernel was to map the
shared kernel address space, including physical memory, into the upper
1GB of the 4GB user address space.  Once we started supporting more
than 1GB of physical memory it couldn't all be mapped into that 1GB
shared kernel address space, hence the lowmem/highmem distinction.
With the advent of more than 4GB of physical memory in a 32-bit system
(PAE) we created a separate kernel address space (hugemem) kernel.
However, since we support more than 4GB of physical memory it still
cannot all be mapped into the kernel address space at the same time,
so we still have a lowmem/highmem distinction even in the hugemem
kernel when you have more than 4GB of memory.  You will continue to
have this distinction until you switch to 64-bit hardware (EM64T,
AMD64, IPF, etc).
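The 1GB kernel window described above is also why ~896MB is the usual lowmem ceiling on the standard 3G/1G split (the hugemem 4G/4G split avoids this particular limit): the kernel conventionally reserves about 128MB of its window for vmalloc/ioremap mappings, leaving the rest for the direct map.

```shell
# Address-space arithmetic behind the classic 32-bit lowmem limit:
# a 1GB kernel window minus the conventional ~128MB vmalloc reserve
# leaves ~896MB of direct-mapped lowmem.
kernel_window_mb=1024
vmalloc_reserve_mb=128
echo "direct-mapped lowmem: $((kernel_window_mb - vmalloc_reserve_mb)) MB"
```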

In the meantime, I think this problem has been fixed by falling back
to the lowmem zone when allocating wired memory (ramfs and hugepages)
and highmem is more than 90% wired.  The test kernel with this fix
is located here:

http://people.redhat.com/~lwoodman/.test/
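The fallback condition described in the fix can be sketched in shell; the 90% threshold is from the comment above, and the MB values are just the approximate figures from this report:

```shell
# Sketch of the fix's fallback rule: once wired allocations exceed 90%
# of highmem, allocate from the lowmem zone instead.
high_total_mb=4608   # HighTotal from the report
wired_mb=4198        # hugetlb pool + ramfs file, approx.
threshold_mb=$((high_total_mb * 90 / 100))
if [ "$wired_mb" -gt "$threshold_mb" ]; then
  echo "highmem more than 90% wired: allocate from lowmem"
else
  echo "allocate from highmem"
fi
```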


Thanks, Larry Woodman
Comment 6 John Caruso 2004-08-16 16:57:24 EDT
The URL you specified is giving a 404 error.

Thanks for the (apparent) fix.  I'd ask again, though: is there some 
document--anything at all--that details all of the constraints on or 
considerations involved in actually using the various large memory 
features of RHEL3?  We've run into nothing but problems in attempting 
to do so (e.g. bug 127897, among other issues we haven't reported), 
even though the configuration we're using is well within the apparent 
limits.
Comment 7 Larry Woodman 2004-08-16 17:37:34 EDT
John, I just re-copied the kernel, so it should work now (sorry about
the 404 error; it was a disk quota limits problem).

Larry
Comment 8 Ernie Petrides 2004-09-09 21:02:35 EDT
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.4.EL).
Comment 9 John Flanagan 2004-12-20 15:55:42 EST
An erratum has been issued which should help with the problem
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html
