Bug 203705

Summary: GFS2 OOPS removing files
Product: Red Hat Enterprise Linux 5
Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: kernel
Assignee: Steve Whitehouse <swhiteho>
Status: CLOSED CURRENTRELEASE
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Priority: medium
Version: 5.0
CC: dzickus, lwang
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: beta2
Doc Type: Bug Fix
Last Closed: 2006-12-23 00:03:30 UTC
Bug Blocks: 204760    
Attachments:
  Description                                         Flags
  kern.log                                            none
  Debugging patch                                     none
  OOPS                                                none
  First of three patches for RHEL5 to fix this bug    none
  The second of the three patches                     none
  The third of the three patches                      none

Description Fabio Massimo Di Nitto 2006-08-23 10:20:43 UTC
Version-Release number of selected component (if applicable):

latest git code from gfs2-2.6.git and latest CVS HEAD checkout for userland

How reproducible:

Set up a 2-node cluster sharing a gfs2 block device.
Run "dbench -D /mnt/point/node1 5" and "dbench -D /mnt/point/node2 5"
on node1 and node2 of the cluster, respectively.

One of the nodes will OOPS (see attachment with kern.log) and the other
will stall (no oops or log).
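
The reproduction steps above can be sketched as a small shell helper. This is only an illustration: the function name run_dbench and the variables MOUNT_POINT, CLIENTS and DRY_RUN are assumptions introduced here, not part of the original report. Only the dbench command line itself is taken from the report. By default the helper just prints the command, so it is safe to run on a machine without dbench installed.

```sh
#!/bin/sh
# Hypothetical helper around the dbench invocation quoted in the report.
# MOUNT_POINT and CLIENTS default to the values used by the reporter.
MOUNT_POINT=${MOUNT_POINT:-/mnt/point}
CLIENTS=${CLIENTS:-5}

# run_dbench NODE_DIR
# With DRY_RUN=1 (the default) only print the command; with DRY_RUN=0
# actually start dbench in the per-node directory on the shared gfs2 mount.
run_dbench() {
    dir="$MOUNT_POINT/$1"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "dbench -D $dir $CLIENTS"
    else
        dbench -D "$dir" "$CLIENTS"
    fi
}

run_dbench node1    # prints: dbench -D /mnt/point/node1 5
```

On a real cluster, one would set DRY_RUN=0 and call run_dbench node1 on the first node and run_dbench node2 on the second, at roughly the same time.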

Actual results:

See attachment.

Comment 1 Fabio Massimo Di Nitto 2006-08-23 10:20:45 UTC
Created attachment 134699 [details]
kern.log

Comment 2 Steve Whitehouse 2006-08-23 13:39:41 UTC
One odd thing here is that the assertion that was triggered is on line 287
rather than 286 (as the oops claims), so perhaps gcc was having an off day. It
rather looks like two sets of error messages on top of each other, but I think
I've got a fair idea of where the error is occurring. For the avoidance of doubt,
can you tell me the last commit in the git tree that you are using?

We did fix something very similar to this in:
5dd9feafb351a8bf304292623cbc63335c34d279 but I suspect that this is long enough
ago that it's in your kernel, since you say that you are using the latest git tree.

Also one more question; does the oops occur straight away or after the dbench
processes have been running for a few seconds/minutes ?
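
For reference, whether a given commit (such as the earlier fix mentioned above) is already contained in a local tree can be checked with git itself. The following is a minimal sketch, assuming a git version that supports "git merge-base --is-ancestor"; the helper name has_commit is hypothetical and introduced only for illustration.

```sh
#!/bin/sh
# Hypothetical helper: report whether commit $1 is an ancestor of HEAD
# in the git repository in the current directory.
# "not contained" is also printed if the commit id is unknown locally
# or the current directory is not a git repository.
has_commit() {
    if git merge-base --is-ancestor "$1" HEAD 2>/dev/null; then
        echo "contained"
    else
        echo "not contained"
    fi
}

# Check for the earlier fix named in this comment.
has_commit 5dd9feafb351a8bf304292623cbc63335c34d279
```

The last commit of the tree in use can be shown with "git log -1".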


Comment 3 Fabio Massimo Di Nitto 2006-08-23 15:05:20 UTC
Hi Steve,

The off-by-one line number might have been introduced by the patch I posted as an RFC here:
https://www.redhat.com/archives/cluster-devel/2006-July/msg00144.html
and since then I have been merging regularly from your tree.

The last commit from your tree that I have is:
commit b8e1aabf218a2037d9d6a3256c33fc6ef96ac44c
Author: Steven Whitehouse <swhiteho>
Date:   Tue Aug 22 16:25:50 2006 -0400

as it was available approx 60 minutes before the bug report.

I ran dbench with 1 thread on each node for 600 seconds and it worked fine. Next
I fired it up with 5 and got the OOPS. Tested and reproduced 3 times in a
row to make sure it was not lunar rays hitting the fiber cable to the SAN.

Fabio

PS Just let me know if you want any patches tested. The setup is clearly not
in production and we can play with it as much as we like.

Comment 4 Steve Whitehouse 2006-08-30 11:21:57 UTC
Created attachment 135198 [details]
Debugging patch

Here is a patch which should allow grabbing some extra information about the
bug that you've uncovered. Sorry for the delay in getting back to you; I've had
a bit of a backlog of things recently.

I've looked through all the code in that area (freeing and deallocation of
inodes) and I can't see the problem, so hopefully this patch will gather enough
information to point us in the right direction. So if you have time to run this,
then it would be very useful to know the results. I have also recently
committed a number of patches and updated the origin branch to Linus'
latest kernel. I don't think any of them would have a bearing on this
particular bug, but it is probably as well to update anyway, so that we are
both working from the same baseline.

Also, regarding your patch of 19th July, moving lm_interface.h is probably a
good plan long term. I've left it in its current position for now just because
RHEL5 test1 was not frozen. Now that it is, the file can be moved easily enough
without causing any undue problems. What we don't want to do though is to add the
exports for GFS1 to the git tree. The git tree is intended to go upstream and
GFS1 will remain a patch to the kernel, which is why the exports are a separate
patch in FC6/RHEL5t1. If you want to send me a cleaned up version without the
GFS1 changes, then I'll commit it to my tree. It's best to send it direct to me
rather than via the list (but feel free to cc the list as well).

Comment 5 Fabio Massimo Di Nitto 2006-09-01 07:11:06 UTC
Hi Steve,

I will run the tests within a week or two at most (enjoying some vacation). Of
course I will sync with your latest git tree, no worries about that.

Regarding the patch, I will clean it up in my git tree and push it somewhere
for you to pull (or send another email, whatever works best for you).

Thanks
Fabio

Comment 6 Fabio Massimo Di Nitto 2006-09-12 07:50:14 UTC
Created attachment 136058 [details]
OOPS

Hi Steve,

Attached is the OOPS from running the very latest git (last commit
24264434603cc102d71fb2a1b3b7e282a781f449 Rewrite of examine_bucket()), 
without the suggested debugging patch.

Applying the patch and running the test makes the machine hang really hard
before any OOPS appears on the console or in the logs.

Thanks
Fabio

Comment 7 Steve Whitehouse 2006-09-14 13:09:57 UTC
Hi,

I'm still investigating this. I think I can see what's going on and I'll try to
come up with another patch within the next few days.

Steve

Comment 8 Fabio Massimo Di Nitto 2006-09-14 13:11:59 UTC
That's OK. I am back from holidays and I can test patches easily now.

Fabio

Comment 11 Steve Whitehouse 2006-11-24 11:58:25 UTC
I have pushed a patch: fae24ae10e0256e187431f5852eb31605415cef9 entitled
[GFS2] Fix journal flush problem to my -nmw git tree. I hope that this will fix
the problem that's been reported here as a side effect. It removes part of the
code in which this problem was reported, so if it's still not working correctly,
then it would be very helpful if you could provide a further stack trace.

The patch will be going to Linus next time he opens his merge window provided no
problems are found in it, and once that's happened, the way is open for us to
merge it into RHEL/FC.


Comment 12 Fabio Massimo Di Nitto 2006-11-24 14:55:17 UTC
I will check within the next 48 hours. I saw the patches pushed on LKML and
cluster-devel.

Thanks a lot
Fabio

Comment 13 Fabio Massimo Di Nitto 2006-11-27 11:36:42 UTC
Well, good news. I pulled from your -nmw branch, reran the tests and it seems
that the problem is gone.

I will leave some tests running for a few hours just to be safe.

Fabio

Comment 14 Fabio Massimo Di Nitto 2006-11-27 13:24:44 UTC
OK, this problem seems to be 100% gone. Found another one though; bug report on the way.

Thanks a lot!
Fabio

Comment 15 Steve Whitehouse 2006-11-29 11:25:39 UTC
Since there is a patch for this upstream, changing status to modified.

Comment 16 Steve Whitehouse 2006-12-08 16:46:19 UTC
Created attachment 143163 [details]
First of three patches for RHEL5 to fix this bug

This is the first of three patches required for RHEL5 to fix this problem. The
patches are already upstream so these are the back ported fixes.

Comment 17 Steve Whitehouse 2006-12-08 16:47:17 UTC
Created attachment 143164 [details]
The second of the three patches

Comment 18 Steve Whitehouse 2006-12-08 16:48:04 UTC
Created attachment 143165 [details]
The third of the three patches

Comment 19 Don Zickus 2006-12-14 01:05:49 UTC
in 2.6.18-1.2876.el5

Comment 20 RHEL Program Management 2006-12-23 00:03:30 UTC
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.