Bug 1251036

Summary: fsck.gfs2: Segfault with corrupt rindex
Product: Red Hat Enterprise Linux 7
Reporter: Robert Peterson <rpeterso>
Component: gfs2-utils
Assignee: Robert Peterson <rpeterso>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: high
Docs Contact:
Priority: high
Version: 7.2
CC: dwysocha, gfs2-maint, jpayne, sbradley, swhiteho, wili
Target Milestone: rc
Keywords: Patch
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: gfs2-utils-3.1.9-1.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-04 06:29:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1271674    
Bug Blocks: 1203710, 1295577, 1313485    
Attachments:
Tool to rebuild rindex from printsavedmeta plus file system (flags: none)
Patch #1 - fsck.gfs2: Read jindex before making rindex repairs (flags: none)
Patch #2 - fsck.gfs2: Detect multiple rgrp grow segments (flags: none)

Description Robert Peterson 2015-08-06 12:53:52 UTC
Description of problem:
I recently received a set of customer metadata which had a
corrupt rindex file. When I ran fsck.gfs2 on it, it segfaulted.
I ran my little rindex size fix-up program, rindex_set_di_size.c,
which fixed di_size properly. When I ran fsck.gfs2 again, it still
segfaulted. Using gdb, I got the following call trace:

(gdb) run -y /dev/clariion_lun10/scratch &> /tmp/fsck.out
Starting program: /home/bob/gfs2-utils/gfs2/fsck/./fsck.gfs2 -y /dev/clariion_lun10/scratch &> /tmp/fsck.out

Program received signal SIGSEGV, Segmentation fault.
0x0000000000422a0b in gfs2_rgrp_free (rgrp_tree=rgrp_tree@entry=0x7fffffffdbf0) at rgrp.c:247
247                     if (rgd->bits[0].bi_bh) { /* if a buffer exists */
Missing separate debuginfos, use: debuginfo-install glibc-2.17-79.el7.x86_64
(gdb) bt
#0  0x0000000000422a0b in gfs2_rgrp_free (rgrp_tree=rgrp_tree@entry=0x7fffffffdbf0) at rgrp.c:247
#1  0x000000000041da87 in rg_repair (sdp=sdp@entry=0x7fffffffd860, trust_lvl=trust_lvl@entry=2, rg_count=rg_count@entry=0x7fffffffd508, sane=sane@entry=0x7fffffffd50c) at rgrepair.c:874
#2  0x00000000004046aa in fetch_rgrps (sdp=sdp@entry=0x7fffffffd860) at initialize.c:638
#3  0x000000000040618a in initialize (sdp=sdp@entry=0x7fffffffd860, force_check=0, preen=0, all_clean=all_clean@entry=0x7fffffffd77c) at initialize.c:1742
#4  0x0000000000401fe8 in main (argc=3, argv=0x7fffffffdda8) at main.c:355
(gdb) 

My philosophy is: The fsck.gfs2 program should NEVER segfault
no matter what kind of hideous rubbish you throw at it.

Version-Release number of selected component (if applicable):
RHEL7.2

How reproducible:
Always

Steps to Reproduce:
1. gfs2_edit restoremeta savemeta.mda.china.01445381 <dev>
2. fsck.gfs2 <dev>

Actual results:
Segmentation fault

Expected results:
fsck.gfs2 should never segfault.

Additional info:
WARNING: This is a huge 15GB set of metadata. It requires a huge
device (I used 25TB because 10TB was not enough). It takes many
hours to restore, probably 8 hours or more, depending on the
hardware.

Comment 1 Robert Peterson 2015-08-14 20:37:19 UTC
Created attachment 1063170 [details]
Tool to rebuild rindex from printsavedmeta plus file system

After much struggle and strife, I wrote and debugged this program.
It rebuilds a corrupt rindex from scratch, based on the contents
of a printsavedmeta. It also uses the existing file system to tell
where the journal blocks are, so it can avoid rgrps that lie
within the journals. It might still have bugs, but it successfully
rebuilt this particular file system's rindex, and it's a very
complex case.

Comment 2 Robert Peterson 2015-09-09 14:33:27 UTC
Setting this to assigned and requesting some flags. I've got
a couple patches for this already, which I'll attach shortly.

Comment 3 Robert Peterson 2015-09-09 14:43:16 UTC
Created attachment 1071807 [details]
Patch #1 - fsck.gfs2: Read jindex before making rindex repairs

In most cases, the rindex needs to be read into memory in case the
journals or jindex are corrupt and need repairs. However, in some
rare cases, the rindex needs repairs, and in the rindex repair code
it needs to read in the jindex and journals in order to filter out
rgrp records that appear in the journals. This prevents the rgrp
records inside journals, which are false positives, from being
treated as real rgrps.
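
The filtering idea can be sketched in a few lines of C. This is only an
illustration under stated assumptions, not the code from the patch: the
struct extent type and both function names are hypothetical, and journals
are assumed to be describable as simple (start, length) block ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not a gfs2-utils structure. */
struct extent {
	uint64_t start; /* first block of a journal extent */
	uint64_t len;   /* number of blocks in the extent */
};

/* Return 1 if block lies inside any of the nj journal extents. */
static int block_in_journal(const struct extent *j, size_t nj, uint64_t block)
{
	for (size_t i = 0; i < nj; i++)
		if (block >= j[i].start && block - j[i].start < j[i].len)
			return 1;
	return 0;
}

/* Compact cand[] in place, keeping only candidate rgrp addresses that
 * fall outside every journal. Returns the number of candidates kept. */
static size_t filter_rgrp_candidates(uint64_t *cand, size_t ncand,
				     const struct extent *j, size_t nj)
{
	size_t kept = 0;

	for (size_t i = 0; i < ncand; i++)
		if (!block_in_journal(j, nj, cand[i]))
			cand[kept++] = cand[i];
	return kept;
}
```

The point of the ordering change is that the journal extents have to be
known before a filter like this can run, which is why the jindex must be
read before the rindex repair starts.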

This patch also fixes a segfault in the rgrp code for cases of
extremely corrupt rindex files where the rgrp has no buffers.
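
The backtrace in the description stops at a dereference of
rgd->bits[0].bi_bh, so one plausible shape for the guard is to check
the bits array before touching it. The sketch below uses simplified,
hypothetical stand-ins for the real structures (struct buf, struct
bitmap, struct rgrp and rgrp_release are illustrative names); the
actual patch may be structured differently.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the gfs2-utils structures. */
struct buf { char *data; };
struct bitmap { struct buf *bi_bh; };
struct rgrp {
	struct bitmap *bits; /* NULL if the rgrp's buffers were never set up */
	unsigned int length; /* number of bitmap blocks */
};

/* Release an rgrp's buffers and return how many were released.
 * Safe on a NULL rgrp or on one whose bits array was never allocated. */
static unsigned int rgrp_release(struct rgrp *rgd)
{
	unsigned int released = 0;

	if (rgd == NULL || rgd->bits == NULL)
		return 0; /* nothing to release; never dereference bits[0] */

	for (unsigned int i = 0; i < rgd->length; i++) {
		if (rgd->bits[i].bi_bh != NULL) { /* if a buffer exists */
			free(rgd->bits[i].bi_bh->data);
			free(rgd->bits[i].bi_bh);
			rgd->bits[i].bi_bh = NULL;
			released++;
		}
	}
	free(rgd->bits);
	rgd->bits = NULL;
	rgd->length = 0;
	return released;
}
```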

Comment 4 Robert Peterson 2015-09-09 14:44:19 UTC
Created attachment 1071808 [details]
Patch #2 - fsck.gfs2: Detect multiple rgrp grow segments

This patch gives fsck.gfs2's rgrepair code the capability to detect
multiple gfs2_grow segments and repair them accordingly.
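
As a rough illustration of what detecting segments could look like,
the sketch below splits a sorted list of rgrp addresses into runs of
constant spacing. It assumes that each mkfs or gfs2_grow run lays out
its rgrps at a fixed distance, which is a simplification; struct
rg_segment and find_segments are hypothetical names, and this is not
the algorithm taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one run of evenly spaced rgrps. */
struct rg_segment {
	uint64_t base; /* block address of the first rgrp in the run */
	uint64_t dist; /* distance in blocks between rgrps (0 if only one) */
	size_t count;  /* number of rgrps in the run */
};

/* Split sorted rgrp addresses into runs of constant spacing.
 * Writes at most max segments into segs and returns the number found. */
static size_t find_segments(const uint64_t *addr, size_t n,
			    struct rg_segment *segs, size_t max)
{
	size_t nseg = 0;

	if (n == 0 || max == 0)
		return 0;
	segs[0] = (struct rg_segment){ addr[0], 0, 1 };
	for (size_t i = 1; i < n; i++) {
		struct rg_segment *cur = &segs[nseg];
		uint64_t d = addr[i] - addr[i - 1];

		if (cur->count == 1) {
			cur->dist = d; /* second rgrp fixes the spacing */
			cur->count = 2;
		} else if (d == cur->dist) {
			cur->count++;  /* same spacing: same segment */
		} else {
			if (nseg + 1 == max)
				break; /* no room for another segment */
			segs[++nseg] = (struct rg_segment){ addr[i], 0, 1 };
		}
	}
	return nseg + 1;
}
```

With addresses 17, 117, 217, 317, 1000, 1200, 1400, this yields two
segments: four rgrps spaced 100 blocks apart starting at 17, and three
rgrps spaced 200 blocks apart starting at 1000.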

Comment 6 Robert Peterson 2016-02-25 18:41:14 UTC
Today I posted 11 patches to upstream cluster-devel related to this.

Comment 7 Robert Peterson 2016-03-24 13:19:04 UTC
Yesterday I pushed the upstream patches to the gfs2-utils
git tree. It should be an easy port to rhel7.

Comment 8 Mike McCune 2016-03-28 23:30:47 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 12 Justin Payne 2016-09-09 01:11:36 UTC
Verified in gfs2-utils-3.1.9-3.el7:

[root@dash-02 ~]# rpm -q gfs2-utils
gfs2-utils-3.1.9-3.el7.x86_64
[root@dash-02 ~]# gfs2_edit restoremeta savemeta.mda.china.01445381 /dev/mapper/mpatha1
[root@dash-02 ~]# ./rindex_set_di_size /dev/mapper/mpatha1
[root@dash-02 ~]# gfs2_edit printsavedmeta savemeta.mda.china.01445381 > printsavedmeta.mda.china.01445381.out
[root@dash-02 ~]# ./rindex_from_printsavedmeta  printsavedmeta.mda.china.01445381.out /dev/mapper/mpatha1
Checking for rgrps in journal0.
Checking for rgrps in journal1.
Checking for rgrps in journal2.
Checking for rgrps in journal3.
First rg length: 0x5
 42735 resource groups written.here.
[root@dash-02 ~]# fsck.gfs2 -y /dev/mapper/mpatha1 &> fsck.out
[root@dash-02 ~]# tail fsck.out 
dinodes: 78652666 (0x4b024fa)

Calculated statfs values:
blocks:  2594517996 (0x9aa533ec)
free:    166405146 (0x9eb241a)
dinodes: 78639364 (0x4aff104)
The statfs file was fixed.
check_statfs completed in 0.002s
Writing changes to disk
gfs2_fsck complete

Comment 14 errata-xmlrpc 2016-11-04 06:29:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2438.html