Bug 471618
Summary: | GFS2: gfs2_convert is broken | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Robert Peterson <rpeterso> | ||||||||||||||||
Component: | gfs2-utils | Assignee: | Robert Peterson <rpeterso> | ||||||||||||||||
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||||||||||||
Severity: | high | Docs Contact: | |||||||||||||||||
Priority: | high | ||||||||||||||||||
Version: | 5.3 | CC: | adas, bstevens, cfeist, edamato, nstraz, swhiteho, syeghiay | ||||||||||||||||
Target Milestone: | rc | ||||||||||||||||||
Target Release: | --- | ||||||||||||||||||
Hardware: | All | ||||||||||||||||||
OS: | Linux | ||||||||||||||||||
Whiteboard: | |||||||||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||||||
Clone Of: | Environment: | ||||||||||||||||||
Last Closed: | 2009-01-20 20:51:52 UTC | Type: | --- | ||||||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||||||
Documentation: | --- | CRM: | |||||||||||||||||
Verified Versions: | Category: | --- | |||||||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||
Embargoed: | |||||||||||||||||||
Bug Depends On: | 475488 | ||||||||||||||||||
Bug Blocks: | |||||||||||||||||||
Attachments: |
|
Description
Robert Peterson
2008-11-14 17:43:28 UTC
Created attachment 323683 [details]
Preliminary patch
This patch restores basic functionality to gfs2_convert, but more work
needs to be done. This is not the final patch. This patch does not
address the problem regarding gfs_indirect blocks versus indirect blocks
in GFS2 which do not have the extra 64-bytes of reserved space.
This patch may look big, but it's not as bad as it looks.
Much of the gfs1-specific code was local to gfs2_edit parts. But now
gfs2_convert needs to use these functions as well. Therefore I created
a new part, gfs1.c, and moved all these gfs1-specific functions to that
new part.
I confirmed that the gfs_indirect structure, which doesn't exist in gfs2, does cause a problem for file systems that have been converted by the gfs2_convert tool. Here's what I did: [root@roth-01 ../bob/cluster/gfs2/convert]# mkfs.gfs -O -b 1024 -t bobs_roth:test_gfs -p lock_dlm -j 3 /dev/roth_vg/roth_lv Device: /dev/roth_vg/roth_lv Blocksize: 1024 Filesystem Size: 1073072060 Journals: 3 Resource Groups: 4096 Locking Protocol: lock_dlm Lock Table: bobs_roth:test_gfs Syncing... All Done [root@roth-01 ../bob/cluster/gfs2/convert]# mount -tgfs /dev/roth_vg/roth_lv /mnt/gfs [root@roth-01 ../bob/cluster/gfs2/convert]# dd if=/dev/urandom of=/tmp/somejunk bs=1024 count=3 3+0 records in 3+0 records out 3072 bytes (3.1 kB) copied, 0.000999 seconds, 3.1 MB/s [root@roth-01 ../bob/cluster/gfs2/convert]# dd if=/tmp/somejunk of=/mnt/gfs/bigmama bs=1024 seek=496572346368 3+0 records in 3+0 records out 3072 bytes (3.1 kB) copied, 0.000235 seconds, 13.1 MB/s [root@roth-01 ../bob/cluster/gfs2/convert]# sync [root@roth-01 ../bob/cluster/gfs2/convert]# dd if=/mnt/gfs/bigmama of=/tmp/tocompare skip=496572346368 bs=1024 count=3 3+0 records in 3+0 records out 3072 bytes (3.1 kB) copied, 8.6e-05 seconds, 35.7 MB/s [root@roth-01 ../bob/cluster/gfs2/convert]# diff /tmp/somejunk /tmp/tocompare [root@roth-01 ../bob/cluster/gfs2/convert]# umount /mnt/gfs [root@roth-01 ../bob/cluster/gfs2/convert]# ./gfs2_convert /dev/roth_vg/roth_lv gfs2_convert version 2 (built Nov 14 2008 17:51:38) Copyright (C) Red Hat, Inc. 2004-2006 All rights reserved. Examining file system........................................ This program will convert a gfs1 filesystem to a gfs2 filesystem. WARNING: This can't be undone. It is strongly advised that you: 1. Back up your entire filesystem first. 2. Run gfs_fsck first to ensure filesystem integrity. 3. Make sure the filesystem is NOT mounted from any node. 4. Make sure you have the latest software versions. Convert /dev/roth_vg/roth_lv from GFS1 to GFS2? (y/n)y Converting resource groups......................................... Converting inodes. 7 inodes from 4096 rgs converted. Fixing file and directory information. 1 directories, 3 dirents fixed. Converting journals. Converting journal space to rg space. Building system structures. Removing obsolete gfs1 structures. Committing changes to disk. /dev/roth_vg/roth_lv: filesystem converted successfully to gfs2. [root@roth-01 ../bob/cluster/gfs2/convert]# mount -tgfs2 /dev/roth_vg/roth_lv /mnt/gfs2 [root@roth-01 ../bob/cluster/gfs2/convert]# dd if=/mnt/gfs2/bigmama of=/tmp/tocompare skip=496572346368 bs=1024 count=3 3+0 records in 3+0 records out 3072 bytes (3.1 kB) copied, 8.2e-05 seconds, 37.5 MB/s [root@roth-01 ../bob/cluster/gfs2/convert]# diff /tmp/somejunk /tmp/tocompare Binary files /tmp/somejunk and /tmp/tocompare differ [root@roth-01 ../bob/cluster/gfs2/convert]# /home/bob/hdump /tmp/tocompare 0000 00000000 00000000 00000000 00000000 [................] 0010 00000000 00000000 00000000 00000000 [................] 0020 00000000 00000000 00000000 00000000 [................] 0030 00000000 00000000 00000000 00000000 [................] 0040 00000000 00000000 00000000 00000000 [................] [root@roth-01 ../bob/cluster/gfs2/convert]# /home/bob/hdump /tmp/somejunk 0000 5E22B876 28111176 A184E819 7F2418A8 [^".v(..v.....$..] 0010 7829A91C CC77D720 364B90B2 2E4D123B [x)...w. 6K...M.;] 0020 9673BE18 A826C0B8 5617C5A1 1CAC13E6 [.s...&..V.......] 0030 29F65171 A69787EF 62DA2825 1AF7DDC8 [).Qq....b.(%....] 0040 B13F5161 AED75A07 D9EC6E4D D707E6D3 [.?Qa..Z...nM....] So now I've got to work on a reasonable way to consolidate all gfs_indirect blocks into blocks with only a gfs2_metaheader on the front, and that means shifting all the indirect pointers down and moving block pointers from page to page. Not an easy thing. I was hoping for an easy way to fix this by copying block pointers from block buffer to block buffer, but there's more to it than that. At heights 0 and 1, things are relatively easy. At height 2 things get complex. GFS1 reserves an area of 64 (0x40) bytes at the start of the indirect block, so for 4K blocks, you can fit 501 pointers. GFS2 doesn't reserve that space, so you can fit 509 pointers. For 1K blocks, it's 117 pointers in GFS1 and 125 in GFS2. That means, for example, that if you have 4K blocks, a 946MB file will require a height of 3 for GFS, but require a height of 2 for GFS2. There isn't a good way to shift the pointers around from one height to another; you would overwrite good pointers and lose information. So the only way to do it is to rebuild the entire metadata structure for each file, and that requires allocating blocks. We can't just convert them in place. It will also be very slow running. I'm having trouble figuring out the best way to do this. I suspect that the way to do this, is to start at the bottom level of the metadata tree. You can consider that like an array of pointers. You can fix that up first, moving the pointers and packing them as you go. Having done that, then you can use the existing blocks higher in the metadata tree into which can be rebuilt a new metadata tree. Any unused blocks can then be freed, and the inode's block count updated. The only questions then are how to keep track of where you've got to, in case a conversion needs to be restarted, and how much memory is required in the worst case. I could be wrong but I don't think that method would work for all cases. I'm not sure I can explain this well enough. On very large files, pointers would have to be shifted around. The dinode has the same structure size between GFS and GFS2, so it doesn't need to be packed. But in order to ensure that no valid pointers were overwritten, we would have to pack all height==1 blocks first, then height==2 blocks, then all height==3 blocks, and so forth. But due to the different number of pointers that can be kept between GFS and GFS2, each pointer at height==1 represents a different block offset for GFS and GFS2. Take my example from comment #3: a file of 946MB and block size 4K. The height for the file is 3 for GFS, but 2 for GFS2. For GFS, the block pointer for the last block would be near the start of level 3. For GFS2 the same block needs to appear near end of the last level 2 block. I think I figured out a way to reformat the blocks, but it requires block allocations be done. *** Bug 472003 has been marked as a duplicate of this bug. *** After getting a patch from Bob to fix the busy loop from 472003 I hit a segfault when running gfs2_convert: Program terminated with signal 11, Segmentation fault. [New process 20507] #0 0x000000000040e6f5 in gfs2_rgrp_free (rglist=0x6198d8, updated=updated) at rgrp.c:164 (gdb) list 159 { 160 struct rgrp_list *rgd; 161 162 while(!osi_list_empty(rglist->next)){ 163 rgd = osi_list_entry(rglist->next, struct rgrp_list, list); 164 if (rgd->bh && rgd->bh[0] && /* if a buffer exists and */ 165 rgd->bh[0]->b_count) /* the 1st buffer is allocated */ 166 gfs2_rgrp_relse(rgd, updated); /* free them all. */ 167 if(rgd->bits) 168 free(rgd->bits); (gdb) print rgd->bh[0]->b_count Cannot access memory at address 0xe4d4190 I limited the file system size to get around #c7 and I was able to successfully convert an empty GFS to GFS2, but gfs2_fsck didn't like the file system. fsck 1.39 (29-May-2006) Initializing fsck Validating Resource Group index. Level 1 RG check. (level 1 passed) Starting pass1 Pass1 complete Invalid or missing root system inode. Cannot continue without valid root inode Starting pass1b Pass1b complete Starting pass1c Pass1c complete Starting pass2 Directory entry '.' at block 25 (0x19) in dir inode 25 (0x19) has an invalid block type: 15. Directory entry to non-inode block remains Directory entry '..' at block 25 (0x19) in dir inode 25 (0x19) has an invalid block type: 15. Directory entry to non-inode block remains No '.' entry found for root directory. Adding '.' entry Pass2 complete Starting pass3 Pass3 complete Starting pass4 Pass4 complete Link count inconsistent for inode 25 (0x19) has 0 but fsck found 1. Starting pass5 Link count for inode 25 (0x19) still incorrect RG #17 (0x11) free count inconsistent: is 31626 should be 31627 Inode count inconsistent: is 12 should be 11 Resource group counts left inconsistent Pass5 complete gfs2_fsck: bad write: Bad file descriptor on line 42 of file buf.c Created attachment 324730 [details] A working patch Like its predecessor, this patch is a lot simpler than it looks. There are two reasons for its huge size: (1) A bunch of functions that are gfs1-specific were moved from gfs2_edit's savemeta code to a gfs1-specific "gfs1.c" in libgfs2 so that they could be shared with gfs2_convert.c. Similarly, a bunch of gfs1-specific structures were moved into libgfs2.h. (2) I changed all the functions in buf.c so that the caller can keep a separate linked list of buffers rather than always working on one list. That has two advantages: First and most most importantly, the user space tools can keep a set of volatile and a set of non-volatile buffers. The non-volatile buffers are for things like resource group bitmaps needed in memory for gfs2_fsck and/or gfs2_convert. The volatile buffers are for ordinary gfs2 structures that we fix and forget. Since these functions are used extensively throughout the gfs2 user space tools, the changes affected all the calls. The change was necessary because the fact that RG buffers were volatile was causing them to be freed at inopportune times, and the pointers to them accessed afterward, causing occasional segfaults when the file system gets very big. I would like to have rewritten how the whole interface works, but this way minimizes the impact to all the tools. Functional changes to the code are as follows: 1. Original problem is fixed The original problem reported in the bug was caused because there is a fundamental shift of indirect block pointers between GFS1 and GFS2. For indirect blocks, GFS1 added 64 bytes of reserved space after the gfs meta header. GFS2 did not. That meant GFS1 could hold fewer pointers than GFS2 on the same block. The meant code had to be added to gfs2_convert to shuffle all the pointers around to their corresponding gfs2 locations. This was not a simple calculation because of sparse files. For example, with a 1K block size, if you do: dd if=/mnt/gfs/big of=/tmp/tocompare skip=496572346368 bs=1024 count=1 the resulting metadata paths will look vastly different for the single block of data: height 0 1 2 3 4 5 GFS1: 0x16 0x46 0x70 0x11 0x5e 0x4a GFS2: 0x10 0x21 0x78 0x05 0x14 0x78 It is relatively easy to calculate the new metapath, but you can't just shuffle a pointer from its old location to its new location because the destination slot might be used for a different source slot. In the example above, we couldn't just move the pointer at height 3 from offset 0x11 to another block at offset 0x05 because there might already be a pointer at 0x05 being used for something else. To complicate matters, we could not just assign new blocks and copy the data because we should be able to run on a "full" file system. So I had to write a complex new function called adjust_indirect_blocks along with several support functions. Someone else might be able to write a function to get the job done simpler, but at least this one works for both sparse and fully-packed metadata trees of several heights. It basically formulates in memory a metadata tree for all the data blocks, clears out all the metadata buffers, and lays them all back down again according to the new layout. 2. Improved error checking I added some badly needed error checking. That's because I was running into problems and wanted to eliminate the possibility that errors were occurring, but not being reported. 3. Improved progress reporting With a large number of resource groups, I saw long periods of time where the program would just sit for five minutes. I was afraid that the tool had hung, so I kept breaking in with gdb to check on the tool's progress. But if I'm that concerned about what it is doing when it's quiet, so too will be the customers who are anxiously trying to watch its progress. So I added periodic reports of progress through the RGs when converting inodes, and also an occasional "." when the RGs themselves are being analyzed and/or manipulated. Same goes for writing out the new gfs2 journals. The journals in gfs2 are quite different and we need to write them out entirely, all 128MB. That takes a fair amount of time going through the functions in buf.c. So I added a new message when each journal is written, so that the code doesn't appear stuck for a long period of time. This could still use some improvement, but it's better. I also modified the progress messages to make them more clear. For example, the message "Removing obsolete gfs1 structures" sounded to me like it could cause alarm or panic to a customer. So I rephased it to "Removing obsolete GFS1 file system structures". 4. Journal size was not being preserved In testing, I discovered that if the source file system had 32MB journals, the reformatted GFS2 file system had 128MB. I decided that's not good and decided to fix it. That is especially important if the file system is "full" to begin with because we can't add more blocks. 5. Fixed Nate's bug I found and fixed the bug reported in comment #7. Created attachment 324790 [details]
Revised patch
This version fixes some ugly problems I found during testing.
I'm not promising this one is the final version, but it's close.
Needs more testing.
Here are the basic functional tests I've run that have uncovered so many bugs. At this point I am hand-verifying the results of each step to make sure there were no errors, and I'm just copy-and-pasting the commands into a terminal session: convert_test1: sparse file with height of 6: mkfs.gfs -J32 -O -b 1024 -t bobs_roth:test_gfs -p lock_dlm -j 3 /dev/roth_vg/bob mount -tgfs /dev/roth_vg/bob /mnt/gfs dd if=/tmp/somejunk of=/mnt/gfs/bigmama bs=1024 seek=496572346368 sync dd if=/mnt/gfs/bigmama of=/tmp/tocompare skip=496572346368 bs=1024 count=3 diff /tmp/somejunk /tmp/tocompare umount /mnt/gfs /home/bob/cluster/gfs2/convert/gfs2_convert -y /dev/roth_vg/bob mount -tgfs2 /dev/roth_vg/bob /mnt/gfs dd if=/mnt/gfs/bigmama of=/tmp/tocompare skip=496572346368 bs=1024 count=3 umount /mnt/gfs diff /tmp/somejunk /tmp/tocompare /home/bob/cluster/gfs2/fsck/fsck.gfs2 /dev/roth_vg/bob convert_test2: contiguous 5.7MB file, height 2 mkfs.gfs -O -t bobs_roth:test_gfs -p lock_dlm -j 1 /dev/roth_vg/roth_lv mount -tgfs /dev/roth_vg/roth_lv /mnt/gfs cp -a /bob_music/Metal/Adagio/Sanctus\ Ignis/01\ -\ Second\ Sight.mp3 /mnt/gfs/ sync umount /mnt/gfs /home/bob/cluster/gfs2/convert/gfs2_convert -y /dev/roth_vg/roth_lv mount -tgfs2 /dev/roth_vg/roth_lv /mnt/gfs diff /bob_music/Metal/Adagio/Sanctus\ Ignis/01\ -\ Second\ Sight.mp3 /mnt/gfs/01\ -\ Second\ Sight.mp3 umount /mnt/gfs /home/bob/cluster/gfs2/fsck/fsck.gfs2 /dev/roth_vg/roth_lv convert_test3: metadata reduction test: contiguous 4MB file. mkfs.gfs -O -t bobs_roth:test_gfs -p lock_dlm -j 1 /dev/roth_vg/bob mount -tgfs /dev/roth_vg/bob /mnt/gfs cp -a /bob_music/Metal/Aerosmith/Pump/01\ -\ Young\ Lust.mp3 /mnt/gfs/ sync umount /mnt/gfs /home/bob/cluster/gfs2/convert/gfs2_convert -y /dev/roth_vg/bob mount -tgfs2 /dev/roth_vg/bob /mnt/gfs diff /bob_music/Metal/Aerosmith/Pump/01\ -\ Young\ Lust.mp3 /mnt/gfs/01\ -\ Young\ Lust.mp3 umount /mnt/gfs /home/bob/cluster/gfs2/fsck/fsck.gfs2 /dev/roth_vg/bob convert_test4: 15GB of misc mp3 data I've used for other tests: mkfs.gfs -O -t bobs_roth:test_gfs -p lock_dlm -j 1 /dev/roth_vg/roth_lv mount -tgfs /dev/roth_vg/roth_lv /mnt/gfs cd /bob_music/Metal/ time cp -a * /mnt/gfs/ sync umount /mnt/gfs /home/bob/cluster/gfs2/convert/gfs2_convert -y /dev/roth_vg/roth_lv mount -tgfs2 /dev/roth_vg/roth_lv /mnt/gfs diff -r /mnt/gfs /bob_music/Metal umount /mnt/gfs diff /tmp/somejunk /tmp/tocompare /home/bob/cluster/gfs2/fsck/fsck.gfs2 /dev/roth_vg/roth_lv Created attachment 324803 [details]
Final patch
This patch fixes one last bug I found in testing:
This is best explained with an example:
In a normal 4K block file system a contiguous file that is 4152273
bytes long needs 1014 data blocks. GFS1 needs 3 gfs_indirect metadata
blocks to keep track of them: (#1) data blocks 0-500, (#2) data blocks
501-1001, and (#3) data blocks 1002=1014. However, GFS2 only needs 2
metadata blocks to keep track of them: (1) 0-508, (2) 509-1014. So the
dinode's count of blocks used has to be one less, and that final block
needs to be freed.
So this patch adds the ability to free up metadata blocks that are no
longer needed, and adjust the dinode blocks counter accordingly.
This version has passed all four of the tests given above and will be
committed to git.
Patch tested on system roth-01 with the tests given in comment #12. Patch was pushed to the master and STABLE2 branches of the cluster git tree. Now I'm waiting for approval before I push it to the RHEL5 and RHEL53 branches. I also gave a pre-release copy to the QE people for testing. I received the necessary ack flags, pushed the patch to the RHEL5 and RHEL53 branches of the cluster git tree. Changing status to Modified for inclusion into 5.3. I think we should add a couple more tests: 1. Upgrade of a 100% full gfs1 filesystem (just to be sure that it works ok) 2. Kill an upgrade while its halfway though, and then restart it to see if it completes correctly. Might also be a useful future addition to fsck if it can detect a partial upgrade in case someone tries to fix an aborted upgrade with fsck. At least we should document somewhere what someone should do if the conversion fails part way though. I'm running into complaints from gfs2_fsck after I convert a 1k block size GFS to GFS2. Starting pass5 Ondisk and fsck bitmaps differ at block 137 (0x89) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Bitmap at block 137 (0x89) left inconsistent Ondisk and fsck bitmaps differ at block 138 (0x8a) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Bitmap at block 138 (0x8a) left inconsistent [ 4000+ lines of output ] fsck -t gfs2 returns 0 after all of these warnings and the subsequent verify operation passes. So there isn't any data corruption going on, but gfs2_fsck isn't happy with the converted file system. There are two problems cited in comment #20: (1) The blocks that were supposed to be freed but weren't, and (2) The return code from gfs2_fsck does not follow the standards listed in the fsck(8) man page. The first problem was caused because the rindex file was so big it went to height 2, but libgfs2 function gfs2_freedi cannot currently free blocks above height 1. I consider this a separate bug. It may be annoying that gfs2_convert should have freed those blocks and did not, but IMHO, it's not worth re-spinning the package. I recommend opening a new bug and deferring this to 5.4. The issue of gfs2_fsck not providing a standard return code as listed in the fsck(8) man page is also a separate problem, also an annoyance, and should also be fixed in 5.4. The code was following the lead of gfs1's gfs_fsck that also provides the wrong return code. We need to determine if that's worth fixing in 5.4. I suspect not because customers may already expect gfs_fsck to return the "wrong" return code, and may have coded scripts that rely upon that fact. Bug #474707 was opened to address the first problem in comment #21. Bug #474705 was opened to address the second problem in comment #21. Unfortunately the fix for this bug introduced a regression in mkfs.gfs2, bug 475488. Can we get this in for snap6? Created attachment 326421 [details]
addendum patch
This addendum patch fixes the bug Nate found in mkfs.gfs2, and it's
being tested now.
*** Bug 475488 has been marked as a duplicate of this bug. *** Created attachment 326541 [details] addendum patch 2 This patch is an add-on to the previous addendum patch to fix the problem. This second addendum consists of two parts: (1) to fix the problem for mkfs, and (2) to prevent other pieces of user space code from falling on their face if they make the same assumption. To in summary: It turns out that there were three problems with the original patch: (1) The first problem is where the offsets were getting skewed by bad calculations. This allowed mkfs to start working for some cases, but only if the journal size was small. This is fixed by the first addendum patch. (2) The second problem is that the original patch inadvertently reversed the order in which buffers were freed off the block list in the event of an overflow. That meant that the most recently used block was first to be freed, which makes it horribly inefficient, but likely to uncover bugs such as bug #3. This is fixed in addendum patch 2. (3) The third problem is due to the fact that the get_first_leaf and get_next_leaf code were making the faulty assumption that the leaf blocks they're searching will be in memory. That's not guaranteed, but we've been getting away with it for all these years because of bug #2. This is fixed in addendum patch 2. For 5.4 we need to carefully scrutinize every place that function bget is called in the gfs2 user space code. We need to make sure bget is only called in places where the buffer is newly allocated or unallocated. Otherwise we can get into problems like #3 again. Created attachment 326662 [details]
Addendum 3 patch
Nate found one more bug here. This one has been in the code forever,
but Nate found a test case that uncovers it. This one-line patch
fixes the problem. I suggest another respin for the next snap.
Changing status back to ASSIGNED so we can get the addendum 3 patch included in the next build. Addendum tested by Nate and pushed to these branches of the cluster git tree: RHEL5, RHEL53, master, STABLE2. Changing status back to MODIFIED. Verified with gfs2-utils-0.1.53-1.el5. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-0087.html gfs2-utils-2.03.11-1.fc9, cman-2.03.11-1.fc9, rgmanager-2.03.11-1.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report. |