Bug 221743 - gfs2_fsck errors still
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: gfs2-utils
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Robert Peterson
QA Contact: Cluster QE
Depends On: 222308
Blocks:
Reported: 2007-01-06 23:23 EST by Gary Lindstrom
Modified: 2010-01-11 22:37 EST

Fixed In Version: RHBA-2007-0579
Doc Type: Bug Fix
Last Closed: 2007-11-07 13:04:24 EST

Attachments
Std out and err of gfs_fsck errors (21.05 KB, application/octet-stream)
2007-01-06 23:23 EST, Gary Lindstrom
Script to generate error (524 bytes, application/octet-stream)
2007-02-09 02:14 EST, Gary Lindstrom
RPMS on system (22.87 KB, application/octet-stream)
2007-02-09 02:16 EST, Gary Lindstrom
FSCK STDOUT before copy, no errors to stderr (1.07 KB, application/octet-stream)
2007-02-09 02:17 EST, Gary Lindstrom
FSCK stderr output after the copy (2.36 KB, application/octet-stream)
2007-02-09 02:18 EST, Gary Lindstrom
FSCK stdout output after the copy (5.41 KB, application/octet-stream)
2007-02-09 02:19 EST, Gary Lindstrom
df before copy (514 bytes, application/octet-stream)
2007-02-09 02:20 EST, Gary Lindstrom
df after copy (628 bytes, application/octet-stream)
2007-02-09 02:22 EST, Gary Lindstrom
running kernel version output from uname (110 bytes, application/octet-stream)
2007-02-09 02:23 EST, Gary Lindstrom
Ohh, and the mkfs output before doing the copy (636 bytes, application/octet-stream)
2007-02-09 02:24 EST, Gary Lindstrom
patch to fix the problem (2.39 KB, patch)
2007-02-12 14:24 EST, Robert Peterson

Description Gary Lindstrom 2007-01-06 23:23:31 EST
Description of problem:

I am opening a new bug report to replace bz 211465, which I originally
opened.  It covered two different issues.  One was a deadlock, which may
have been fixed (I have submitted a separate BZ for another deadlock I am now
experiencing); the other was gfs2 fsck errors, which this BZ is for.

Version-Release number of selected component (if applicable):

kernel-2.6.18-1.2869.fc6 and all latest fc6 updates

I have also been having problems with the fc6 versions of the following
packages, so I grabbed these RPMs from EL5 beta 2:

cman-2.0.35-2.el5.i386.rpm
device-mapper-1.02.12-2.el5.i386.rpm
gfs2-utils-0.1.14-1.el5.i386.rpm
gfs-utils-0.1.7-1.el5.i386.rpm
gnbd-1.1.4-2.el5.i386.rpm
lvm2-2.02.12-7.el5.i386.rpm
lvm2-cluster-2.02.12-7.el5.i386.rpm
openais-0.80.1-15.el5.i386.rpm
rgmanager-2.0.16-1.i386.rpm

How reproducible:

Every Time

Steps to Reproduce:
1.  Make a clean GFS2 file system
2.  Copy a bunch of data to it
3.  Unmount the file system
4.  Run gfs2_fsck
  
Actual results:

File system errors

Expected results:

Clean file system

Additional info:

I simply format the file system, mount it, copy some data to it (a couple hundred
MB or more), unmount it, and run fsck, and I get file system errors.  Note that this
machine was the only machine that had the gfs2 volume mounted when the copy was
being done, and none had it mounted (obviously) for the fsck.  Attached to this
BZ is the stdout and stderr output of the fsck.  Maybe I just don't have the corrected
version of fsck, but I would have thought the version from RHEL5 beta 2 would
have had it...
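
The four reproduction steps above can be sketched as a small Python helper that only builds and prints the command sequence; it does not execute anything. The function name `repro_commands`, the `lock_nolock` protocol, the 64MB journal size, and the source path are illustrative assumptions and are not taken from the attached script. The device and mount point are the ones that appear later in this report.

```python
def repro_commands(device, mountpoint, source_dir,
                   lock_proto="lock_nolock", journal_mb=64):
    """Return the reproduction steps as argv lists; nothing is executed.

    lock_proto and journal_mb are assumed defaults for illustration only.
    """
    return [
        # 1. Make a clean GFS2 file system
        #    (-p: lock protocol, -J: journal size in MB).
        ["mkfs.gfs2", "-p", lock_proto, "-J", str(journal_mb), device],
        # 2. Mount it and copy a few hundred MB of data onto it.
        ["mount", "-t", "gfs2", device, mountpoint],
        ["cp", "-a", source_dir, mountpoint],
        # 3. Unmount cleanly.
        ["umount", mountpoint],
        # 4. Check the unmounted file system.
        ["gfs2_fsck", device],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of running them.
    for argv in repro_commands("/dev/mapper/fpcl01vg02-fpcl01vg02lv00",
                               "/mnt/fpcl01vg02lv00",
                               "/path/to/data"):  # hypothetical source path
        print(" ".join(argv))
```

Printing rather than executing keeps the sketch safe to run on any machine, since mkfs.gfs2 destroys whatever is on the target device.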
Comment 1 Gary Lindstrom 2007-01-06 23:23:31 EST
Created attachment 144994
Std out and err of gfs_fsck errors
Comment 2 Steve Whitehouse 2007-01-08 11:08:35 EST
Again the fixes for this are upstream and in RHEL5 (beta) but haven't made it to
FC6 yet. The fixes to the fsck program may also not have made it into the FC
packages, so I'll copy in Bob who can elaborate on that.

I'll try to get this sped up a bit; sorry for the delay.
Comment 4 Steve Whitehouse 2007-01-15 04:38:35 EST
The kernel part of this should be fixed in FC6 kernel 2.6.19-1.2895
Comment 5 Chris Feist 2007-01-17 15:51:53 EST
This update is in gfs2-utils-0.1.25-1.fc6, can you please test this rpm from the
testing branch of fc6 and let me know how it works?

You should be able to get that package from here:
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/6/
Comment 6 Gary Lindstrom 2007-01-20 20:27:53 EST
Installed:

kernel-2.6.19-1.2895.fc6
gfs2-utils-0.1.25-1.fc6

Sorry, still errors :(  Did a fresh format, mounted the volume, copied data, did
a clean unmount, ran fsck and got the following:

[root@spool7 /]# gfs2_fsck /dev/mapper/fpcl01vg02-fpcl01vg02lv00 
Initializing fsck
Clearing journals (this may take a while)........
Journals cleared.
Starting pass1
Inode 1020306 (0xf9192): Ondisk block count (862340) does not match what fsck
found (1696)
Fix ondisk block count? (y/n) y
Inode 1882749 (0x1cba7d): Ondisk block count (566115) does not match what fsck
found (1115)
Fix ondisk block count? (y/n) y
Inode 2448916 (0x255e14): Ondisk block count (670137) does not match what fsck
found (1318)
Fix ondisk block count? (y/n) y
Inode 3119104 (0x2f9800): Ondisk block count (609936) does not match what fsck
found (1200)
Fix ondisk block count? (y/n) y
Inode 3729142 (0x38e6f6): Ondisk block count (871089) does not match what fsck
found (1714)
Fix ondisk block count? (y/n) y
Inode 4085111 (0x3e5577): Ondisk block count (688447) does not match what fsck
found (1354)
Fix ondisk block count? (y/n) y
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete      
Starting pass3
Pass3 complete      
Starting pass4
Pass4 complete      
Starting pass5
Ondisk and fsck bitmaps differ at block 1020307 (0xf9193) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Fix bitmap for block 1020307 (0xf9193) ? (y/n) y
Succeeded.
Ondisk and fsck bitmaps differ at block 1020308 (0xf9194) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Fix bitmap for block 1020308 (0xf9194) ? (y/n) y
Succeeded.
Ondisk and fsck bitmaps differ at block 1020309 (0xf9195) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Fix bitmap for block 1020309 (0xf9195) ? (y/n) y
(...These errors go on and on and on and on...)
Comment 10 Chris Feist 2007-01-23 15:16:29 EST
Re-assigning to Bob; it appears that another bug may be causing this problem.
Comment 11 Robert Peterson 2007-01-24 17:04:34 EST
I've done lots of investigating and testing on this problem, and 
here's what I found:

The gfs2_fsck messages seen in attachment #1 that look like this:

        Clearing .
        Block # referenced by directory entry . is out of range

were caused by the "reverse sentinel" bug described in bz #211465.
That bug is now fixed in gfs2_fsck, and I verified that fix is in the
gfs2-utils-0.1.25-1.fc6 package.  I tried to recreate the problem by
following the reliable method described in bz #211465, and I verified
that the problem was indeed fixed.

The errors described in comment #6 are a separate issue.  Next, I 
used a RHEL5 beta cluster to copy a bunch of data to a brand new 
GFS2 file system.  Then I unmounted and ran the gfs2_fsck found in the
gfs2-utils-0.1.25-1.fc6 package, and it did not report any errors.

I did this several times using different data, one of which consisted of
400MB worth of mp3 files, and another that was 1.3GB of mp3 files of varying
sizes.  The largest of these files was 26MB.  Then I unmounted the
file system and ran gfs2_fsck.  In all cases, there were no errors
reported by gfs2_fsck.

Either I'm doing something wrong, or else this is a problem with the
GFS2 kernel code.  I compared the GFS2 source code in my RHEL5 kernel
to the GFS2 source code in kernel-2.6.19-1.2895.fc6 mentioned in
comment #6 and found them to be very different.

I also compared the gfs2_fsck code and found it to be pretty recent
compared to what's in the CVS repository.

So my theory is that this might be a GFS2 kernel bug that has been 
fixed in the RHEL5 kernel but not yet ported to the FC6 kernel.  
If my theory is correct, gfs2_fsck is innocent of any wrongdoing.

We need to get a newer FC6 kernel built and pushed to the FC6 community 
that incorporates the newest GFS2 kernel code and see if the problem
goes away.  Either that, or I need to be given more clues on how to 
recreate the problem.
Comment 12 Robert Peterson 2007-02-06 20:01:02 EST
I still haven't been able to reproduce this using the upstream version
of GFS2 from the git tree.  I guess I'll have to put fc6 on a test
machine and try that.
Comment 13 Steve Whitehouse 2007-02-07 06:21:54 EST
I've sent in an update to FC-6 for GFS2, and I'll shortly be sending another few
patches too, so that the very latest FC kernel will have a number of recent
fixes in it.
Comment 14 Robert Peterson 2007-02-08 14:49:15 EST
I even tried to recreate this problem on FC6 and I don't see the error.
This is an FC6 machine running the 2.6.19-1.2895.fc6 #1 SMP kernel
and the latest build of gfs2_fsck.  I need more details on how
to reproduce the problem.  Setting status to NEEDINFO.
Comment 15 Gary Lindstrom 2007-02-09 02:14:33 EST
Created attachment 147731
Script to generate error

So, what do you need me to send you?  I am going to upload some attachments:
one is a script of the commands I run to create the error, and the rest are the
output.  I am also including a list of RPMs installed on the machine and a uname
showing I am running the same kernel.  Whatever else you need, I'll send too.
Comment 16 Gary Lindstrom 2007-02-09 02:16:16 EST
Created attachment 147732
RPMS on system
Comment 17 Gary Lindstrom 2007-02-09 02:17:24 EST
Created attachment 147733
FSCK STDOUT before copy, no errors to stderr
Comment 18 Gary Lindstrom 2007-02-09 02:18:22 EST
Created attachment 147734
FSCK stderr output after the copy
Comment 19 Gary Lindstrom 2007-02-09 02:19:17 EST
Created attachment 147735
FSCK stdout output after the copy
Comment 20 Gary Lindstrom 2007-02-09 02:20:36 EST
Created attachment 147736
df before copy

not much (nothing) on /mnt/fpcl01vg02lv00
Comment 21 Gary Lindstrom 2007-02-09 02:22:01 EST
Created attachment 147737
df after copy

Copied a fair number of ISOs, some VMware images, etc. to /mnt/fpcl01vg02lv00
Comment 22 Gary Lindstrom 2007-02-09 02:23:11 EST
Created attachment 147738
running kernel version output from uname
Comment 23 Gary Lindstrom 2007-02-09 02:24:26 EST
Created attachment 147739
Ohh, and the mkfs output before doing the copy

So you can really see I did the mkfs before copying things over...
Comment 24 Robert Peterson 2007-02-12 14:18:55 EST
I believe I figured out the problem, so I'm changing the status
from NEEDINFO back to ASSIGNED.  The problem was that gfs2_fsck was
specifying the wrong journal size when clearing the journals.
It was using di_blocks, which is the total number of blocks recorded in
the on-disk inode.

The problem is that di_blocks includes metadata blocks, not just data
blocks.  For example, if you specify -J 64 in mkfs.gfs2, the journals
will be 64MB in size, which is 16384 data blocks of 4K.  However,
di_blocks would be 16384 data blocks + 1 inode block + 33 indirect
blocks for a grand total of 16418.  When gfs2_fsck wrote data to those
extra 34 blocks, it caused the journals to grow past their specified 64MB.
Later, during pass1, gfs2_fsck would discover the discrepancy in the number
of blocks and offer to fix it.  In fact, the extra blocks should not have
been there.  I've got a patch that fixes the problem, but it assumes you
have the prerequisite fix for bz 222308, which affects libgfs2.

I don't believe this affects GFS1 because journals are done differently.
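
The block arithmetic in this comment can be checked with a short Python sketch. It assumes 8-byte block pointers and a 24-byte metadata header at the start of each indirect block, which gives 509 pointers per 4K block; those sizes are assumptions not stated in this report, and the function name `journal_block_counts` is hypothetical. The sketch only handles a single level of indirect blocks, which is enough for a 64MB journal.

```python
import math

BLOCK_SIZE = 4096        # 4K blocks, as in the comment above
META_HEADER_SIZE = 24    # assumed size of the header on each indirect block
POINTER_SIZE = 8         # assumed size of one block pointer


def journal_block_counts(journal_mb, block_size=BLOCK_SIZE):
    """Return (data_blocks, indirect_blocks, total_blocks) for one journal.

    Assumes one level of indirect blocks between the inode and the data.
    """
    data_blocks = journal_mb * 1024 * 1024 // block_size
    ptrs_per_indirect = (block_size - META_HEADER_SIZE) // POINTER_SIZE
    indirect_blocks = math.ceil(data_blocks / ptrs_per_indirect)
    inode_blocks = 1
    total_blocks = data_blocks + indirect_blocks + inode_blocks
    return data_blocks, indirect_blocks, total_blocks


data, indirect, total = journal_block_counts(64)
print(data, indirect, total)
```

Under these assumptions the result agrees with the figures quoted above: 16384 data blocks, 33 indirect blocks, and 16418 blocks in total, so using the total as the journal length overshoots by 34 blocks.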
Comment 25 Robert Peterson 2007-02-12 14:24:30 EST
Created attachment 147928
patch to fix the problem

This patch fixes the problem as described in the previous comment.
It now specifies the file size / block size, which gives us the correct
number of data blocks for the journal files.

It also consolidates code a little bit: gfs2_fsck was using its own 
function to clear the journals.  This version now uses the same
standard function in libgfs2 that mkfs.gfs2 uses.

This version requires the fix for bz 222308 which affects libgfs2.
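
A minimal sketch of the change described in this comment, written in Python rather than the C of the actual patch. The `JournalDinode` class and both function names are hypothetical, and `di_size` is an assumed name for the file size field; only `di_blocks` is named in this report.

```python
from dataclasses import dataclass


@dataclass
class JournalDinode:
    di_size: int     # journal file size in bytes (field name assumed)
    di_blocks: int   # total blocks, data plus metadata


def blocks_to_clear_old(dinode):
    # Old behaviour as described: use the total block count,
    # which also counts the inode block and the indirect blocks.
    return dinode.di_blocks


def blocks_to_clear_fixed(dinode, block_size=4096):
    # Fixed behaviour as described: file size / block size,
    # which counts data blocks only.
    return dinode.di_size // block_size


# A 64MB journal with the block counts from the previous comment.
journal = JournalDinode(di_size=64 * 1024 * 1024, di_blocks=16418)
overshoot = blocks_to_clear_old(journal) - blocks_to_clear_fixed(journal)
print(overshoot)  # 34 extra blocks with the old calculation
```

Deriving the length from the file size keeps the result independent of how many metadata blocks the journal happens to need, which is why the fix holds for any journal size.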
Comment 26 Robert Peterson 2007-02-12 14:32:34 EST
Fix committed to CVS at HEAD and RHEL5.  Tested on trin-10 with a variety
of journal sizes.  Changing status to MODIFIED.
Comment 29 RHEL Product and Program Management 2007-06-27 11:34:37 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 32 errata-xmlrpc 2007-11-07 13:04:24 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0579.html
