Bug 620384
Summary: | fsck.gfs2 segfaults if journals are missing
---|---
Product: | Red Hat Enterprise Linux 5
Component: | gfs2-utils
Version: | 5.5
Hardware: | x86_64
OS: | Linux
Status: | CLOSED ERRATA
Severity: | low
Priority: | low
Reporter: | Theophanis Kontogiannis <theophanis_kontogiannis>
Assignee: | Robert Peterson <rpeterso>
QA Contact: | Cluster QE <mspqa-list>
CC: | adas, bmarzins, djansa, edamato, swhiteho, theophanis_kontogiannis
Target Milestone: | rc
Target Release: | 5.6
Fixed In Version: | gfs2-utils-0.1.62-26.el5
Doc Type: | Bug Fix
: | 622576
Last Closed: | 2011-01-13 23:21:07 UTC
Bug Depends On: | 575968
Bug Blocks: | 622576, 624689, 624691
Description
Theophanis Kontogiannis
2010-08-02 11:52:50 UTC
This may be a bug that I've previously found and fixed. Can you try (at your own risk) the fsck.gfs2 on my people page? http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2 If that doesn't work, please save off your file system metadata with gfs2_edit savemeta and post it somewhere (private) where I can download it and recreate the problem.

Hello Bob,

No change with the mentioned fsck.

[root@tweety-2 ~]# ./fsck.gfs2 -v /dev/mapper/vg1-data1
Initializing fsck
Initializing lists...
jid=0: Looking at journal...
jid=0: Journal is clean.
jid=1: Looking at journal...
jid=1: Journal is clean.
jid=2: Looking at journal...
jid=2: Journal is clean.
jid=3: Looking at journal...
jid=3: Journal is clean.
jid=4: Looking at journal...
jid=4: Journal is clean.
jid=5: Looking at journal...
jid=5: Journal is clean.
Segmentation fault

fsck.gfs2[7436]: segfault at 00000000000000f8 rip 0000000000410ea3 rsp 00007fff0cd9de80 error 4

and again changed the lock proto.

GFS2: fsid=: Trying to join cluster "fsck_dlm", "tweety:gfs2-11"
GFS2: can't find protocol fsck_dlm
GFS2: fsid=: can't mount proto=fsck_dlm, table=tweety:gfs2-11, hostdata=

I am sending you an e-mail with the link for the metadata. No worries about my data. I have backed up all of it, so we can do whatever we like on this file system.

BR
TK

I received a copy of Theophanis's metadata, restored it and recreated the problem using my latest and greatest code. The problem seems to be that two journals have been mysteriously deleted from the jindex. The code that checks the integrity of the journals is apparently unable to cope with missing journals. In this case there are supposed to be ten journals (journal0 through journal9), but journal6 and journal7 are gone for some reason. I changed the code so that it just skips over the missing journals, but it encounters another problem in pass1 when it tries to recreate them. I'm investigating that now and hopefully it will be easy to figure out.
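The "skip over the missing journals" idea from the comment above can be illustrated with a small sketch. This is not the fsck.gfs2 code (which is written in C); the `Journal` class and `check_journals` function are hypothetical names invented for illustration, and a `None` entry stands in for a journal that is absent from the jindex.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Journal:
    """Hypothetical stand-in for a journal found in the jindex."""
    jid: int
    clean: bool = True


def check_journals(journals: List[Optional[Journal]]) -> Tuple[List[int], List[int]]:
    """Return (checked jids, missing jids), skipping absent journals.

    Touching a missing entry without the None test would fail, which is
    the Python analogue of dereferencing a null pointer in C.
    """
    checked: List[int] = []
    missing: List[int] = []
    for jid, journal in enumerate(journals):
        if journal is None:
            # Journal is gone from the jindex: note it and move on.
            missing.append(jid)
            continue
        if journal.clean:
            print(f"jid={jid}: Journal is clean.")
        checked.append(jid)
    return checked, missing


# Ten journals expected; journal6 and journal7 are missing, as in this report.
journals = [None if jid in (6, 7) else Journal(jid) for jid in range(10)]
checked, missing = check_journals(journals)
print("missing journals:", missing)
```

Collecting the missing journal ids instead of aborting leaves a later pass free to decide whether to recreate them.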
It should be noted that this file system has a block size of 512 bytes (1/2K), which I believe is an unsupported configuration. Normal journals are driven to 4 levels of metadata indirection! So far I haven't run into any code that can't deal with this block size, and the missing journal problem would still be there even if the block size was bigger, so that's not impacting me at the moment. It might, however, have contributed to the fact that those two journals are missing (under investigation as well).

It would be helpful to know if Theophanis knows how the journals went missing. Was the file system created with ten journals, or with fewer and then gfs2_jadd run?

Hi Bob,

In fact, until now I did not even know the journals were missing. After moving out all my files I ran gfs2_fsck for no particular reason, and this is how I ended up filing this bug. The file system was created with ten journals from the beginning, and no alterations of any kind were made throughout its lifecycle.

BR
TK

I've got a prototype that seems to be working properly. I'm testing it now. Hopefully we can get this into 5.6. I'm going to try to figure out how the journals went missing.

Created attachment 437297 [details]
Preliminary patch
With this patch I was able to fix the broken file system.
This is still a preliminary patch and has not been tested properly.
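The earlier remark that 512-byte blocks drive journals to four levels of metadata indirection can be checked with a little arithmetic. The sketch below is not taken from gfs2-utils. It assumes a 232-byte dinode header, a 24-byte metadata header on each indirect block, 8-byte block pointers, and a 128 MB journal; none of these figures is stated in this report, so treat them as assumptions.

```python
# Assumed on-disk sizes (not stated in this report).
DINODE_HEADER = 232   # bytes used by the dinode before its pointer area
META_HEADER = 24      # bytes used by the header of an indirect block
POINTER_SIZE = 8      # bytes per block pointer


def metadata_height(file_size: int, block_size: int) -> int:
    """Return the number of pointer levels needed to address file_size bytes."""
    if file_size <= block_size - DINODE_HEADER:
        return 0  # small enough to live inside the dinode block itself
    data_blocks = -(-file_size // block_size)  # ceiling division
    # Blocks addressable directly from the dinode's pointer area.
    capacity = (block_size - DINODE_HEADER) // POINTER_SIZE
    pointers_per_indirect = (block_size - META_HEADER) // POINTER_SIZE
    height = 1
    while capacity < data_blocks:
        # Each extra level multiplies the reach by the indirect fan-out.
        capacity *= pointers_per_indirect
        height += 1
    return height


JOURNAL_SIZE = 128 * 1024 * 1024  # assumed journal size in bytes
for block_size in (512, 4096):
    print(block_size, metadata_height(JOURNAL_SIZE, block_size))
```

Under these assumptions a 512-byte block size gives a height of 4, while a 4096-byte block size needs only 2, which is consistent with the comment that small blocks push journals to unusually deep metadata trees.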
Created attachment 438553 [details]
Try 3 patch
I found some problems with the previous patch under more
rigorous testing. The previous patch was also for upstream
code. This version is more comprehensive in its cleaning up
of deleted, missing and destroyed journal dinodes.
Even so, it needs more testing. This one is at least close.
Yesterday I did more rigorous testing and discovered two more bugs. The first one affects mkfs.gfs2 and I opened it as bug #624535. The second bug I'm trying to decide what to do about. Basically, the latest and greatest fsck.gfs2 doesn't like it when directories get really big (i.e. lots of entries). This happens more easily with a small block size like the 512B blocks from the user's metadata for this bug.

For almost all normal directories, the metadata structure looks like this:

height structure
------ -------------------------------------------------
0.     dinode
1.     journaled data block (hash table block pointers)
2.     directory leaf blocks

When directories get really big, their metadata structure gets more complex and ends up looking like this:

height structure
------ -------------------------------------------------
0.     dinode
1.     indirect block (block pointers to block pointers)
2.     journaled data block (hash table block pointers)
3.     directory leaf blocks

If there are enough directory entries, the structure can reach more heights, with level 2 being another level of indirect blocks:

height structure
------ -------------------------------------------------
0.     dinode
1.     indirect block (block pointers to block pointers)
2.     indirect block (block pointers to block pointers)
3.     journaled data block (hash table block pointers)
4.     directory leaf blocks

Right now, fsck.gfs2 can only handle directories of the first form. Large directories with four different metadata types are flagged as errors and data is destroyed. This is very serious and needs to get fixed ASAP.

I've written a patch for this issue and I'm testing it now. So far the patch has passed a simple unit test using a four-level directory. Now I'm running it against the metadata for this bug to see if it has any issues. Unfortunately, that takes a long time to complete. I'm likely to open a new bugzilla record for this new problem.

I opened up bugzilla records for the second problem listed in comment #8.
The RHEL5 bug is bug #624689. The RHEL6 bug is bug #624691.

My combined patch that includes both fixes ran successfully on the user's metadata. That means my patch works perfectly. But since I separated out the fix for that second problem, I need to rebase this patch.

Created attachment 439455 [details]
Final patch for 5.6
This is an updated version of the patch that fixes some issues
I caught in testing. Hopefully this is the final version I
will push to the git repo. It was tested on system kool.
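The three directory metadata layouts listed earlier in this report differ only in how many indirect levels sit between the dinode and the hash table block. A minimal sketch of that pattern follows; `directory_layout` is a hypothetical helper written for illustration and is not part of fsck.gfs2.

```python
from typing import List


def directory_layout(indirect_levels: int) -> List[str]:
    """List the block type found at each height of a directory's metadata tree."""
    if indirect_levels < 0:
        raise ValueError("indirect_levels must be non-negative")
    return (
        ["dinode"]
        + ["indirect block"] * indirect_levels
        + ["journaled data block", "directory leaf blocks"]
    )


# Print the three forms described in the report: 0, 1 and 2 indirect levels.
for levels in range(3):
    for height, structure in enumerate(directory_layout(levels)):
        print(f"{height}. {structure}")
    print()
```

Seen this way, a checker that hard-codes only the zero-indirect form will reject the larger layouts, which matches the failure described for big directories.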
The patch was pushed to the RHEL56 branch of the cluster git tree for inclusion into 5.6. Changing status to POST until this gets built into a gfs2-utils package.

Build 2770902 successful. Changing status to Modified. This fix is in gfs2-utils-0.1.62-26.el5.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0135.html