Bug 445858
| Summary: | GFS: gfs_fsck cannot allocate enough memory to run on large file systems | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Ben Yarwood <ben.yarwood> |
| Component: | gfs-utils | Assignee: | Robert Peterson <rpeterso> |
| Status: | CLOSED WONTFIX | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.0 | CC: | edamato, rwheeler, slevine, swhiteho |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-03-15 17:14:54 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ben Yarwood 2008-05-09 13:53:51 UTC
Reassigning to myself: I've been working with Ben on this. According to http://www.redhat.com/rhel/compare/ the RHEL5 release does not ship with a HUGEMEM kernel, so on x86 (32-bit) platforms the system is limited to 3GB of address space. So regardless of how much swap and/or RAM you have, gfs_fsck cannot run properly today on a 16TB file system as documented, even if gfs itself can run.

It turns out that for a 16TB file system, gfs_fsck allocates a 2GB chunk of RAM, then three more 1GB chunks of RAM, for its internal bitmaps. These bitmaps are needed to keep track of every block in the file system and to determine what kind of block it is (inode, directory, data, duplicate, etc.).

One way I've thought of to fix it is to process the bitmaps one resource group at a time, rather than keeping all bitmaps in memory at once. The problem with that approach is that a block in one resource group can reference blocks in another resource group, so it could get complex trying to keep the cross-RG references straight. Plus, each pass the code makes references the bits left behind by the previous passes. Another thought is to pare down the memory usage by combining certain bits in the bitmaps and adding some code to resolve the type. Perhaps we can get it down to one 2GB bitmap that way. Right now, the only circumvention is to do the gfs_fsck on a 64-bit arch.

Correction to comment #1: It allocates 2GB for bitmaps, then three smaller bitmaps of 512MB each (not 1GB as previously stated).

I've been working on the issue of gfs_fsck memory usage indirectly because gfs2_fsck has the same problem. For bug #404611, I wanted to test gfs2_fsck against a 2TB file system that has had millions of files and directories created through a benchmark program called benchp. The test system, kool, has 2GB of memory. The test didn't go well because of gfs2_fsck's memory problems, which are directly inherited from gfs_fsck. I created a patch that saves a lot of memory.
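The bitmap sizes quoted above (2GB plus three 512MB arrays, per the correction) follow directly from the block count. A quick sketch of the arithmetic (the sizes come from the comments here, not from the gfs_fsck source):

```python
# Reproduce the gfs_fsck bitmap sizing for a 16TB / 4K-block file system.
FS_SIZE = 16 * 2**40      # 16TB file system
BLOCK_SIZE = 4096         # 4K blocks

blocks = FS_SIZE // BLOCK_SIZE        # 4G blocks to track
nibble_map = blocks // 2              # 4 bits (a nibble) per block -> 2GB
one_bit_map = blocks // 8             # 1 bit per block -> 512MB each
total = nibble_map + 3 * one_bit_map  # 3.5GB of bitmaps alone

# A 32-bit process limited to 3GB of address space cannot hold this,
# which is why a 64-bit machine is currently the only workaround.
```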
Basically, I eliminated three of the four bitmaps associated with the file system and made some data structures smaller. See the attachment associated with this comment: https://bugzilla.redhat.com/show_bug.cgi?id=404611#c8

Despite the patch, the program ran all weekend and only made it 3% of the way through pass1. In doing all this, I've researched where gfs_fsck and gfs2_fsck are using their memory. A big part of the problem is that during pass1, it creates elements in memory for every inode and every directory. Their purpose is to keep counts, primarily link counts, to be used later in pass4. So the patch is not enough memory savings. In fact, the CPU is badly underutilized and the system spends most of its time swapping to disk. So I'm looking for more good ways to improve the memory usage, and anything I find in gfs2_fsck should directly apply to gfs_fsck.

The fix isn't ready to ship yet; retargeting to 5.6.

I don't think we should fix this bug, for several reasons: (1) It would be a lot of work and redesign for gfs's fsck. (2) gfs is being phased out in favor of gfs2, where we should consider fixing it. (3) To the best of my knowledge, no customers have expressed an interest in the fix. (4) There are no customer issues attached to the bug. I think the best course of action is to close this as WONTFIX unless customers start complaining and demanding a fix. Until that time, I'll open a DOC bug to document the approximate memory requirements for gfs_fsck.

In gfs_fsck, the majority of the memory is consumed by the block maps, which are all kept in memory. There is one big array that needs a nibble (half-byte) for each block, and three smaller arrays that need one bit per block. Add to that the additional memory needed for the buffers, the dinode hash table, the directory hash table, and the duplicates linked list. So every block needs at least 7 bits plus slop; call it 8 bits, or one byte, per block.
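The "nibble per block" array described above can be sketched as a packed byte array with two blocks per byte. This is an illustration of the idea only, not gfs_fsck's actual data structure, and the type codes are hypothetical:

```python
class BlockMap:
    """Packed block-type map: one nibble (4 bits) per block, so two
    blocks share each byte. A sketch of the concept; gfs_fsck's real
    block map and its type encoding may differ."""

    def __init__(self, nblocks):
        # Round up so an odd block count still gets storage.
        self.bits = bytearray((nblocks + 1) // 2)

    def set_type(self, block, btype):
        assert 0 <= btype < 16          # a nibble holds 16 type codes
        byte, half = divmod(block, 2)
        shift = half * 4
        # Clear the old nibble, then OR in the new type code.
        self.bits[byte] = (self.bits[byte] & ~(0xF << shift)) | (btype << shift)

    def get_type(self, block):
        byte, half = divmod(block, 2)
        return (self.bits[byte] >> (half * 4)) & 0xF
```

Eight blocks fit in four bytes, so the 4G blocks of a 16TB/4K file system need 2GB for this map alone, matching the figure in the comments.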
So a good estimate is the file system size (in bytes) divided by the block size; that is approximately how much memory you will need to run gfs_fsck. This particular file system is 16TB with a 4K block size, so 16TB / 4K blocks: 17592186044416 / 4096 = 4294967296. This file system therefore requires approximately 4GB of free memory to run gfs_fsck, above and beyond all the memory used by the operating system and kernel. Note that if the block size were 1K, it would require four times the memory.
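The rule of thumb above (about one byte of fsck memory per file-system block) can be wrapped in a small helper. A sketch only; the function name is mine and not part of any shipped tool:

```python
def gfs_fsck_mem_estimate(fs_bytes, block_size):
    """Approximate memory gfs_fsck needs: roughly one byte per
    file-system block (7 bits of block maps plus slop)."""
    return fs_bytes // block_size

TB = 2**40
# 16TB file system with 4K blocks -> ~4GB of memory
print(gfs_fsck_mem_estimate(16 * TB, 4096))   # 4294967296
# The same file system with 1K blocks needs four times as much
print(gfs_fsck_mem_estimate(16 * TB, 1024))   # 17179869184
```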