Bug 711519 - GFS2: resource group bitmap corruption resulting in panics and withdraws
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.8
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Robert Peterson
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords: ZStream
Depends On: 690555
Blocks:
Reported: 2011-06-07 17:51 UTC by Benjamin Kahn
Modified: 2018-11-14 11:40 UTC (History)
CC List: 16 users

When multiple GFS2 nodes attempted to unlink, rename, or otherwise manipulate the same files at the same time, the result could be various forms of file system corruption, panics, and withdraws. This update adds multiple checks of the dinode's i_nlink value to ensure that inode operations such as link, unlink, or rename no longer cause these problems.
Clone Of:
Last Closed: 2011-07-15 06:10:44 UTC


Attachments
Patch I posted (5.55 KB, patch)
2011-06-14 19:03 UTC, Robert Peterson


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2011:0927 | normal | SHIPPED_LIVE | Important: kernel security and bug fix update | 2011-07-15 06:07:56 UTC

Description Benjamin Kahn 2011-06-07 17:51:53 UTC
This bug has been copied from bug #690555 and has been proposed
to be backported to 5.6 z-stream (EUS).

Comment 3 Robert Peterson 2011-06-14 16:20:15 UTC
Reassigning to myself; hope to post the 5.6.z patch shortly.

Comment 4 Robert Peterson 2011-06-14 19:02:02 UTC
The patch was posted to rhkernel-list for inclusion into 5.6.z.
Changing status to POST.

Comment 5 Robert Peterson 2011-06-14 19:03:00 UTC
Created attachment 504744
Patch I posted

Comments aside, this is the patch I posted.

Comment 6 Phillip Lougher 2011-06-17 09:14:25 UTC
Included in kernel-2.6.18-238.15.1.el5 as:

linux-2.6-fs-gfs2-fix-resource-group-bitmap-corruption.patch

Comment 7 Alan Brown 2011-06-17 13:01:23 UTC
Which test stream kernel is this in? (We're running -262 and I don't want to step back out of the fixes already in that)

Comment 9 Alan Brown 2011-06-24 09:47:29 UTC
Answering my own question.... It's in -261.

We're running -262 and still seeing this occasionally under heavy load.

Comment 10 Adam Drew 2011-06-24 21:37:08 UTC
(In reply to comment #9)
> Answering my own question.... It's in -261.
> 
> We're running -262 and still seeing this occasionally under heavy load.

The code to resolve 690555 was tested quite heavily by Red Hat and partners. One of the partners who deployed this code has in their lab one of the most aggressive GFS2 workloads that we're aware of, and has not documented a single occurrence of this issue on this code (whereas previously they could reproduce it reliably in 2.5 hours). I believe we're fairly confident that BZ 690555 has been successfully resolved. Could you be hitting a new or different issue?

It would be great if you could open a case with support so that my team and I can help you out with this issue. If you open a case with sosreports from your cluster and a description of what you suspect, and let us know the approximate time and date of the last withdraw that you suspect to be 690555 on the -262 kernel, then we can help. You can do this either at https://access.redhat.com/support/cases/new or by calling in. Also, please feel free to point to this note in this BZ, and I'm sure my colleagues will alert me to the case so that I can personally assist.

Thanks in advance.

Comment 11 Alan Brown 2011-06-27 14:27:39 UTC
As luck would have it, I just had another instance and am generating a sosreport.

It will be attached to support ticket #00353457

Thanks
Alan

Comment 13 Martin Prpič 2011-07-12 11:51:44 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When multiple GFS2 nodes attempted to unlink, rename, or otherwise manipulate the same files at the same time, the result could be various forms of file system corruption, panics, and withdraws. This update adds multiple checks of the dinode's i_nlink value to ensure that inode operations such as link, unlink, or rename no longer cause these problems.
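
For readers unfamiliar with this kind of check, here is a minimal userspace sketch of the general idea: refuse a link or unlink when the link count shows that the inode (or its parent directory) has already been removed, for example by another node. This is not the patch from attachment 504744. The names demo_inode, demo_link, demo_unlink, and demo_self_check are hypothetical, and the sketch assumes the link count has already been refreshed under the appropriate cluster lock before it is examined.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for an inode; only the link count matters here. */
struct demo_inode {
    unsigned int i_nlink;   /* number of directory entries referring to this inode */
};

/* Add a link, but only if neither the directory nor the target is already dead. */
static int demo_link(struct demo_inode *dir, struct demo_inode *target)
{
    if (dir->i_nlink == 0)      /* parent directory was already removed */
        return -ENOENT;
    if (target->i_nlink == 0)   /* target was already unlinked elsewhere */
        return -ENOENT;
    target->i_nlink++;
    return 0;
}

/* Drop a link, refusing to decrement a count that is already zero. */
static int demo_unlink(struct demo_inode *dir, struct demo_inode *target)
{
    if (dir->i_nlink == 0 || target->i_nlink == 0)
        return -ENOENT;
    target->i_nlink--;
    return 0;
}

/* Simulates two racing unlinks followed by a late link attempt. */
static void demo_self_check(void)
{
    struct demo_inode dir = { .i_nlink = 2 };
    struct demo_inode file = { .i_nlink = 1 };

    assert(demo_unlink(&dir, &file) == 0);        /* first unlink succeeds */
    assert(demo_unlink(&dir, &file) == -ENOENT);  /* second unlink is refused */
    assert(demo_link(&dir, &file) == -ENOENT);    /* a dead inode cannot be re-linked */
    assert(file.i_nlink == 0);                    /* the unsigned count never wrapped */
}
```

Without the zero checks, the second demo_unlink call would wrap the unsigned count around, a userspace analogue of the inconsistent state that the technical note attributes to concurrent operations.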

Comment 14 errata-xmlrpc 2011-07-15 06:10:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0927.html

