Bug 1251036
| Summary: | fsck.gfs2: Segfault with corrupt rindex | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Robert Peterson <rpeterso> |
| Component: | gfs2-utils | Assignee: | Robert Peterson <rpeterso> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.2 | CC: | dwysocha, gfs2-maint, jpayne, sbradley, swhiteho, wili |
| Target Milestone: | rc | Keywords: | Patch |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | gfs2-utils-3.1.9-1.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-04 06:29:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1271674 | | |
| Bug Blocks: | 1203710, 1295577, 1313485 | | |
| Attachments: | | | |
Created attachment 1063170 [details]
Tool to rebuild rindex from printsavedmeta plus file system
After much struggle and strife, I wrote and debugged this program.
It rebuilds a corrupt rindex from scratch, based on the contents
of a printsavedmeta dump. It also uses the existing file system to
tell where the journal blocks are, so it can avoid rgrps that lie
within them. It might still have bugs, but it successfully rebuilt
this particular file system's rindex, and it's a very complex case.
Setting this to assigned and requesting some flags. I've got a couple patches for this already, which I'll attach shortly.

Created attachment 1071807 [details]
Patch #1 - fsck.gfs2: Read jindex before making rindex repairs
In most cases, the rindex needs to be read into memory in case the
journals or jindex are corrupt and need repairs. However, in some
rare cases the rindex itself needs repairs, and the rindex repair
code needs to read in the jindex and journals in order to filter out
rgrp records that appear in the journals. This prevents rgrp records
inside the journals from being treated as real rgrps when they are
actually false positives.
This patch also fixes a segfault in the rgrp code for cases of
extremely corrupt rindex files where the rgrp has no buffers.
Created attachment 1071808 [details]
Patch #2 - fsck.gfs2: Detect multiple rgrp grow segments
This patch gives fsck.gfs2's rgrepair code the capability to detect
multiple gfs2_grow segments and repair them accordingly.
Today I posted 11 patches to upstream cluster-devel related to this.

Yesterday I pushed the upstream patches to the gfs2-utils git tree. It should be an easy port to rhel7.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Verified in gfs2-utils-3.1.9-3.el7:

```
[root@dash-02 ~]# rpm -q gfs2-utils
gfs2-utils-3.1.9-3.el7.x86_64
[root@dash-02 ~]# gfs2_edit restoremeta savemeta.mda.china.01445381 /dev/mapper/mpatha1
[root@dash-02 ~]# ./rindex_set_di_size /dev/mapper/mpatha1
[root@dash-02 ~]# gfs2_edit printsavedmeta savemeta.mda.china.01445381 > printsavedmeta.mda.china.01445381.out
[root@dash-02 ~]# ./rindex_from_printsavedmeta printsavedmeta.mda.china.01445381.out /dev/mapper/mpatha1
Checking for rgrps in journal0.
Checking for rgrps in journal1.
Checking for rgrps in journal2.
Checking for rgrps in journal3.
First rg length: 0x5
42735 resource groups written.
[root@dash-02 ~]# fsck.gfs2 -y /dev/mapper/mpatha1 &> fsck.out
[root@dash-02 ~]# tail fsck.out
dinodes: 78652666 (0x4b024fa)
Calculated statfs values:
blocks: 2594517996 (0x9aa533ec)
free: 166405146 (0x9eb241a)
dinodes: 78639364 (0x4aff104)
The statfs file was fixed.
check_statfs completed in 0.002s
Writing changes to disk
gfs2_fsck complete
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2438.html
Description of problem:
I recently received a set of customer metadata which had a corrupt rindex file. When I ran fsck.gfs2 on it, it segfaulted. I ran my little rindex size fix-up program, rindex_set_di_size.c, which fixed di_size properly. When I ran fsck.gfs2 again, it still segfaulted. Using gdb, I got the following call trace:

```
(gdb) run -y /dev/clariion_lun10/scratch &> /tmp/fsck.out
Starting program: /home/bob/gfs2-utils/gfs2/fsck/./fsck.gfs2 -y /dev/clariion_lun10/scratch &> /tmp/fsck.out

Program received signal SIGSEGV, Segmentation fault.
0x0000000000422a0b in gfs2_rgrp_free (rgrp_tree=rgrp_tree@entry=0x7fffffffdbf0) at rgrp.c:247
247             if (rgd->bits[0].bi_bh) { /* if a buffer exists */
Missing separate debuginfos, use: debuginfo-install glibc-2.17-79.el7.x86_64
(gdb) bt
#0  0x0000000000422a0b in gfs2_rgrp_free (rgrp_tree=rgrp_tree@entry=0x7fffffffdbf0) at rgrp.c:247
#1  0x000000000041da87 in rg_repair (sdp=sdp@entry=0x7fffffffd860, trust_lvl=trust_lvl@entry=2, rg_count=rg_count@entry=0x7fffffffd508, sane=sane@entry=0x7fffffffd50c) at rgrepair.c:874
#2  0x00000000004046aa in fetch_rgrps (sdp=sdp@entry=0x7fffffffd860) at initialize.c:638
#3  0x000000000040618a in initialize (sdp=sdp@entry=0x7fffffffd860, force_check=0, preen=0, all_clean=all_clean@entry=0x7fffffffd77c) at initialize.c:1742
#4  0x0000000000401fe8 in main (argc=3, argv=0x7fffffffdda8) at main.c:355
(gdb)
```

My philosophy is: the fsck.gfs2 program should NEVER segfault, no matter what kind of hideous rubbish you throw at it.

Version-Release number of selected component (if applicable):
RHEL7.2

How reproducible:
Always

Steps to Reproduce:
1. gfs2_edit restoremeta savemeta.mda.china.01445381 <dev>
2. fsck.gfs2 <dev>

Actual results:
Segmentation fault

Expected results:
fsck.gfs2 should never segfault.

Additional info:
WARNING: This is a huge 15GB set of metadata. It requires a huge device (I used 25TB because 10TB was not enough). It takes many hours to restore, probably more than 8, depending on the hardware.