Bug 711519

Summary: GFS2: resource group bitmap corruption resulting in panics and withdraws
Product: Red Hat Enterprise Linux 5 Reporter: Benjamin Kahn <bkahn>
Component: kernel Assignee: Robert Peterson <rpeterso>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: urgent    
Version: 5.8 CC: adas, adrew, ajb2, bmarzins, cww, John.Hadad, jwest, liko, mjuricek, pm-eus, rpeterso, rryder, rwheeler, swhiteho, syeghiay, teigland
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.18-238.15.1.el5 Doc Type: Bug Fix
Doc Text:
Multiple GFS2 nodes attempted to unlink, rename, or manipulate files at the same time, causing various forms of file system corruption, panics, and withdraws. This update adds multiple checks of the dinode's i_nlink value to ensure that inode operations such as link, unlink, or rename no longer cause the aforementioned problems.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-07-15 06:10:44 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 690555    
Bug Blocks:    
Attachments:
Description Flags
Patch I posted none

Description Benjamin Kahn 2011-06-07 17:51:53 UTC
This bug has been copied from bug #690555 and has been proposed
to be backported to 5.6 z-stream (EUS).

Comment 3 Robert Peterson 2011-06-14 16:20:15 UTC
Reassigning to myself; hope to post the 5.6.z patch shortly.

Comment 4 Robert Peterson 2011-06-14 19:02:02 UTC
The patch was posted to rhkernel-list for inclusion into 5.6.z.
Changing status to POST.

Comment 5 Robert Peterson 2011-06-14 19:03:00 UTC
Created attachment 504744
Patch I posted

Comments aside, this is the patch I posted.

Comment 6 Phillip Lougher 2011-06-17 09:14:25 UTC
in kernel-2.6.18-238.15.1.el5

linux-2.6-fs-gfs2-fix-resource-group-bitmap-corruption.patch

Comment 7 Alan Brown 2011-06-17 13:01:23 UTC
Which test stream kernel is this in? (We're running -262 and I don't want to step back out of the fixes already in that)

Comment 9 Alan Brown 2011-06-24 09:47:29 UTC
Answering my own question.... It's in -261.

We're running -262 and still seeing this occasionally under heavy load.

Comment 10 Adam Drew 2011-06-24 21:37:08 UTC
(In reply to comment #9)
> Answering my own question.... It's in -261.
> 
> We're running -262 and still seeing this occasionally under heavy load.

The code to resolve 690555 was tested quite heavily by Red Hat and partners. One of the partners who deployed this code has in their lab one of the most aggressive GFS2 workloads that we're aware of and has not documented a single occurrence of this issue on this code (when previously they could reproduce it reliably in 2.5 hours). I believe we're fairly confident that BZ 690555 has been successfully resolved. Could you be hitting a new or different issue?

It would be great if you could open a case with support so that my team and I can help you out with this issue. If you could open a case with sosreports from your cluster and a description of what you suspect, and let us know the approximate time and date of the last withdraw on -262 that you suspect to be 690555, then we can help. You could do this either here https://access.redhat.com/support/cases/new or by calling in. Also, please feel free to point to this note in this BZ and I'm sure my colleagues will alert me to the case so that I can personally assist.

Thanks in advance.

Comment 11 Alan Brown 2011-06-27 14:27:39 UTC
As luck would have it, I just had another instance and am generating a sosreport.

It will be attached to support ticket #00353457

Thanks
Alan

Comment 13 Martin Prpič 2011-07-12 11:51:44 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Multiple GFS2 nodes attempted to unlink, rename, or manipulate files at the same time, causing various forms of file system corruption, panics, and withdraws. This update adds multiple checks of the dinode's i_nlink value to ensure that inode operations such as link, unlink, or rename no longer cause the aforementioned problems.

Comment 14 errata-xmlrpc 2011-07-15 06:10:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0927.html