Description of problem:
gfs_fsck uses all available RAM and swap before seg faulting.

Version-Release number of selected component (if applicable):

How reproducible:
Every time the filesystem is checked.

Steps to Reproduce:
1. gfs_fsck -y /dev/blah
2.
3.

Actual results:
Started checking because the following errors were appearing:

GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block
GFS: fsid=nearlineA:gfs1.0:   bh = 2644310219 (type: exp=4, found=5)
GFS: fsid=nearlineA:gfs1.0:   function = gfs_get_meta_buffer
GFS: fsid=nearlineA:gfs1.0:   file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dio.c, line = 1223
GFS: fsid=nearlineA:gfs1.0:   time = 1154425344
GFS: fsid=nearlineA:gfs1.0: about to withdraw from the cluster
GFS: fsid=nearlineA:gfs1.0: waiting for outstanding I/O
GFS: fsid=nearlineA:gfs1.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=nearlineA:gfs1.0: withdrawn

And another instance:

GFS: fsid=nearlineA:gfs1.1: fatal: filesystem consistency error
GFS: fsid=nearlineA:gfs1.1:   inode = 2384574146/2384574146
GFS: fsid=nearlineA:gfs1.1:   function = dir_e_del
GFS: fsid=nearlineA:gfs1.1:   file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dir.c, line = 1495
GFS: fsid=nearlineA:gfs1.1:   time = 1154393717
GFS: fsid=nearlineA:gfs1.1: about to withdraw from the cluster
GFS: fsid=nearlineA:gfs1.1: waiting for outstanding I/O
GFS: fsid=nearlineA:gfs1.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=nearlineA:gfs1.1: withdrawn

And running gfs_fsck -vvv -y /dev/blah would return:

Initializing fsck
Initializing lists...
Initializing special inodes...
Setting block ranges...
Creating a block list of size 11105160192...
Unable to allocate bitmap of size 1388145025
Segmentation fault

[root@ns1a ~]# gfs_fsck -vvv -y /dev/gfs1_vg/gfs1_lv
Initializing fsck
Initializing lists...
(bio.c:140) Writing to 65536 - 16 4096
Initializing special inodes...
(file.c:45) readi: Offset (640) is >= the file size (640).
(super.c:208) 8 journals found.
(file.c:45) readi: Offset (7116576) is >= the file size (7116576).
(super.c:265) 74131 resource groups found.
Setting block ranges...
Creating a block list of size 11105160192...
(bitmap.c:68) Allocated bitmap of size 5552580097 with 2 chunks per byte
Unable to allocate bitmap of size 1388145025
(block_list.c:72) <backtrace> - block_list_create()
Segmentation fault

Expected results:

Additional info:
The filesystem is roughly 45TB and gfs_fsck was compiled on x86_64, so we're going to try adding a 137GB swap disk to see if it gets any further.
The fsck is now running after we added the 137GB swap drive. It appears to consistently chew through about 4GB of RAM (sometimes more), but it is working (for now).
Without a major design change, gfs_fsck will always need memory for its in-core bitmaps, sized according to the size of the file system. I looked into the possibility of using the journals as a scratch pad for keeping bitmap information, but they're just not big enough to do the job. The memory requirement is approximately 230MB per terabyte of storage, though that varies with the number of journals and other factors. Therefore, the only way to get it to run is to add memory or increase swap space as needed, as the customer did in comment #1. However, gfs_fsck should not segfault when it runs out of memory. I am therefore fixing gfs_fsck so that it doesn't segfault, but instead reports the problem and how much additional memory is required (rounding up to be on the safe side), and exits gracefully.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0139.html