Description of problem:
While running gfs2_grow regression tests I started hitting this problem:

GFS2: fsid=morph-cluster:grow1.0: fatal: filesystem consistency error
GFS2: fsid=morph-cluster:grow1.0: inode = 19 99327
GFS2: fsid=morph-cluster:grow1.0: function = gfs2_ri_update, file = fs/gfs2/rgrp.c, line = 590
GFS2: fsid=morph-cluster:grow1.0: about to withdraw this file system
GFS2: fsid=morph-cluster:grow1.0: telling LM to withdraw

Version-Release number of selected component (if applicable):
kernel-2.6.18-183.el5

How reproducible:
Easily

Steps to Reproduce:
1. run growfs test

Actual results:
gfs2_grow hangs and the file system withdraws.

Expected results:
gfs2_grow should complete without errors or withdrawing the file system.

Additional info:
This seems to be a regression caused by my fix for bz 482756.
That consistency error comes from:

if (do_div(rgrp_count, sizeof(struct gfs2_rindex))) {

Whenever it fails, rgrp_count is never a multiple of sizeof(struct gfs2_rindex); instead, it is always a multiple of 4K. Looking at gfs2_write_begin() and gfs2_write_end() when this error occurs, gfs2_write_begin() is only being told to write up to a page boundary, which it correctly does. That's because gfs2_perform_write() only writes a page at a time, and it drops the exclusive glock on sd_rindex between pages. This means that if another process calls gfs2_rindex_hold() in this window, it sees that the index is not up to date, calls gfs2_ri_update(), reads a partially updated rindex, and hits the consistency error. The problem could be lessened by not clearing gl->gl_sbd->sd_rindex_uptodate until after the entire rindex file has been written out, but that would likely still leave some nasty corner cases, since the other process could still grab the rindex before it is up to date.
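For reference, the failing check is just a divisibility test on the rindex file size. Here is a userspace sketch of the same arithmetic; the struct layout is taken from the upstream gfs2_ondisk.h header (96 bytes per entry), and the helper name is mine, not kernel code:

```c
#include <stdint.h>

/* Mirrors the on-disk struct gfs2_rindex from gfs2_ondisk.h
 * (layout assumed from the upstream header; 96 bytes per entry). */
struct gfs2_rindex {
    uint64_t ri_addr;        /* first block of the rgrp */
    uint32_t ri_length;      /* blocks of rgrp header + bitmap */
    uint32_t __pad;
    uint64_t ri_data0;       /* first data block */
    uint32_t ri_data;        /* number of data blocks */
    uint32_t ri_bitbytes;    /* bytes of bitmap */
    uint8_t  ri_reserved[64];
};

/* The check in gfs2_ri_update() boils down to: does the rindex
 * size divide evenly into whole 96-byte entries? */
int rindex_has_partial_entry(uint64_t rindex_size)
{
    return rindex_size % sizeof(struct gfs2_rindex) != 0;
}
```

One consequence of this arithmetic: since lcm(96, 4096) = 12288, a writer caught exactly at a page boundary still passes the divisibility check whenever the size happens to be a multiple of three pages, so a racing reader does not necessarily trip the error on every read of a half-written rindex.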
Can we just change the kernel code to ignore partial entries rather than withdraw or whatever?
Created attachment 385713 [details]
Do not withdraw on partial rindex entries

This patch fixes the problem as long as you do not have two nodes trying to grow the fs at the same time. That can be fixed by grabbing a flock on the rindex file in userspace before writing to it.
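A minimal sketch of that userspace serialization, assuming gfs2 keeps flock() coherent across cluster nodes; the helper name, path handling, and error handling are simplified illustrations, not the actual gfs2_grow change:

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: take an exclusive flock on the rindex file
 * before appending new entries, so two nodes running gfs2_grow at
 * the same time cannot interleave their writes. */
int lock_rindex(const char *rindex_path)
{
    int fd = open(rindex_path, O_RDWR);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) < 0) {  /* blocks until the other grower is done */
        close(fd);
        return -1;
    }
    return fd;  /* caller writes the entries, then flock(fd, LOCK_UN) and close */
}
```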
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Posted
Reposted
Nate seems to have recreated this problem by doing gfs2_grow while a gfs2 load was running. He got:

GFS2: fsid=morph-cluster:grow1.2: fatal: invalid metadata block
GFS2: fsid=morph-cluster:grow1.2: bh = 16744776 (magic number)
GFS2: fsid=morph-cluster:grow1.2: function = gfs2_rgrp_bh_get, file = fs/gfs2/rgrp.c, line = 754
GFS2: fsid=morph-cluster:grow1.2: about to withdraw this file system
GFS2: fsid=morph-cluster:grow1.2: telling LM to withdraw

16744776 = 0xff8148

Excerpt from the rindex indirect pointers:

1865B220 00000000 00061994 00000000 00061995 [................]
1865B230 00000000 00061996 00000000 00061997 [................]
1865B240 00000000 004B60A0 00000000 004DDA4B [.....K`......M.K]
1865B250 00000000 004DDA4C 00000000 004DDA4D [.....M.L.....M.M]
1865B260 00000000 004DDA4E 00000000 0054D6CD [.....M.N.....T..]
1865B270 00000000 0054D6CE 00000000 0054D6CF [.....T.......T..]
1865B280 00000000 0054D6D0 00000000 005BD34D [.....T.......[.M]
1865B290 00000000 005BD34E 00000000 005BD34F [.....[.N.....[.O]
1865B2A0 00000000 005BD350 00000000 00634F8D [.....[.P.....cO.]

Excerpt from indirect block 0x4dda4e:

0000000137693BA0 00000000 00FF0180 00000009 00000000 [................]
0000000137693BB0 00000000 00FF0189 00007FB4 00001FED [................]
0000000137693BC0 00000000 00000000 00000000 00000000 [................]
0000000137693BD0 00000000 00000000 00000000 00000000 [................]

All the new rgrps contain "complete trash" rather than the rgrp headers and bitmaps they should have. But gfs2_grow writes the new rgrp blocks before it writes changes to the rindex file. I examined all three journals, and none of the three loads was accessing blocks anywhere near the new section of rgrps. My theory is that gfs2_grow needs to open /proc/sys/vm/drop_caches and write "2" to it to get the kernel to re-read the modified blocks. If we can recreate this reliably, I should be able to test this very easily.
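The drop_caches idea above, as a small sketch. Writing "2" to /proc/sys/vm/drop_caches frees reclaimable slab objects (dentries and inodes); note it only discards *clean* objects, so the new rgrps would have to be synced first. The helper name is mine, and the path is a parameter purely so the function can be exercised against an ordinary file; the real target would be the /proc file (root only):

```c
#include <stdio.h>

/* Sketch: after gfs2_grow writes and syncs the new rgrps, ask the
 * kernel to drop cached objects so they are re-read from disk.
 * "2" frees reclaimable dentries and inodes; "3" would also drop
 * the page cache. In real use, path = "/proc/sys/vm/drop_caches". */
int write_drop_caches(const char *path)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int rc = (fputs("2\n", f) >= 0) ? 0 : -1;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```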
Correction: block 0x4dda4e was the wrong one. Here's the right one:

Block #5560013 (0x54d6cd) of 38522880 (0x24BD000) (p.1 of 1--Data)

00000001535B3400 00000000 00FF8140 00000009 00000000 [.......@........]
00000001535B3410 00000000 00FF8149 00007FB4 00001FED [.......I........]
00000001535B3420 00000000 00000000 00000000 00000000 [................]
00000001535B3430 00000000 00000000 00000000 00000000 [................]

Since length == 9, this rindex entry implies that the block in question, 0xff8148, is supposed to be a rgrp bitmap block, and that it was calculated to be in the correct location. The data is just trash, though; it doesn't even have a gfs2 metadata header. None of the new rgrps do. So it's as if gfs2_grow did not write the new rgrps or their bitmaps at all.
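To make the "length == 9" reasoning concrete, here is a sketch that decodes the first two fields of the dumped entry (fields are big-endian on disk; layout per struct gfs2_rindex) and checks that the faulting block 0xff8148 falls inside the rgrp's header+bitmap region [ri_addr, ri_addr + ri_length); the helper names are mine:

```c
#include <stdint.h>

/* Big-endian field readers for the on-disk dump quoted above. */
static uint64_t be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

static uint32_t be32(const uint8_t *p)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v = (v << 8) | p[i];
    return v;
}

/* First 12 bytes of the entry at 0x1535B3400 in the dump:
 * ri_addr = 0x00ff8140, ri_length = 9. */
static const uint8_t entry[12] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x81, 0x40,  /* ri_addr   */
    0x00, 0x00, 0x00, 0x09                           /* ri_length */
};

/* Does a block number land in this rgrp's header+bitmap region? */
int block_in_bitmap_region(uint64_t block)
{
    uint64_t ri_addr   = be64(entry);
    uint32_t ri_length = be32(entry + 8);
    return block >= ri_addr && block < ri_addr + ri_length;
}
```

In other words, 0xff8148 = 0xff8140 + 9 - 1 is exactly the last header/bitmap block of the new rgrp, so the rindex entry itself is self-consistent; only the rgrp data on disk is garbage.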
I don't see any reason to think that Nate's new issue is related to this bug. The original issue is that sometimes another node reads the rindex file before the node doing the grow has written it out completely. This is a transient issue. Moments after the node hits the consistency error, the rest of the rindex file is written out. In this new issue, the rgrps are never written out, and they should have been already written out and synced to disk before the rindex file was modified in the first place. Since a fix for the original issue has already gone into the kernel, and this new issue isn't suggesting that there is anything wrong with that fix, we should probably open a new bug instead.
Made it through 100 iterations of our growfs test without hitting this.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html