Red Hat Bugzilla – Bug 203705
GFS2 OOPS removing files
Last modified: 2007-11-30 17:07:33 EST
Version-Release number of selected component (if applicable):
latest git code from gfs2-2.6.git and latest CVS HEAD checkout for userland
Set up a 2-node cluster sharing a gfs2 block device.
Run "dbench -D /mnt/point/node1 5" and "dbench -D /mnt/point/node2 5"
on node1 and node2 of the cluster.
One of the nodes will OOPS (see the attachment with kern.log) and the other
will stall (no oops or log).
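The reproduction steps above can be sketched as a small shell helper to be run once on each node. This is only a sketch: the function name dbench_cmd is hypothetical, /mnt/point is the mount point quoted in the report, and it assumes dbench is installed and the gfs2 filesystem is already mounted on both nodes.

```shell
#!/bin/sh
# Hypothetical helper: print the dbench command line for one node.
# $1 = node directory name (node1 or node2 in the report)
# $2 = number of dbench clients (defaults to 5, the count that triggered the OOPS)
dbench_cmd() {
    node="$1"
    clients="${2:-5}"
    # /mnt/point is the shared gfs2 mount point used in the report.
    echo "dbench -D /mnt/point/${node} ${clients}"
}

# Usage (start both at roughly the same time):
#   on node1:  eval "$(dbench_cmd node1 5)"
#   on node2:  eval "$(dbench_cmd node2 5)"
```

Printing the command instead of running it directly makes it easy to check what will be started on each node before loading the shared device.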
Created attachment 134699 [details]
One odd thing here is that the assertion that was triggered is on line 287
rather than 286 (as the oops claims), so perhaps gcc was having an off day. It
rather looks like two sets of error messages on top of each other, but I think
I've got a fair idea of where the error is occurring. For the avoidance of doubt,
can you tell me the last commit in the git tree that you are using?
We did fix something very similar to this in:
5dd9feafb351a8bf304292623cbc63335c34d279 but I suspect that this is long enough
ago that it's in your kernel, since you say that you are using the latest git tree.
Also, one more question: does the oops occur straight away or after the dbench
processes have been running for a few seconds/minutes?
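Both git questions above (the last commit in the tree, and whether the earlier fix is already included) can be answered from the checkout itself. The following is a minimal sketch, assuming a git version that supports `git merge-base --is-ancestor`; the function names last_commit and has_commit are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the hash and subject of the most recent commit
# in the current checkout.
last_commit() {
    git log -1 --format='%H %s'
}

# Hypothetical helper: exit status 0 when commit $1 is an ancestor of
# (or equal to) HEAD, non-zero otherwise (including when $1 is unknown).
has_commit() {
    git merge-base --is-ancestor "$1" HEAD 2>/dev/null
}

# Usage, from inside the kernel tree:
#   last_commit
#   has_commit 5dd9feafb351a8bf304292623cbc63335c34d279 && echo "fix is present"
```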
The off-by-one line might have been introduced by the patch I posted as an RFC here:
and since then I have merged constantly from your tree.
The last commit in your tree that I have is:
Author: Steven Whitehouse <email@example.com>
Date: Tue Aug 22 16:25:50 2006 -0400
as it was available approx 60 minutes before the bug report.
I ran dbench with 1 thread on each node for 600 secs and it worked fine. Next
I fired it up with 5 and got the OOPS. Tested and reproduced 3 times in a
row to make sure it was not lunar rays hitting the fiber cable to the SAN.
PS Just let me know if you want any patch tested or so. The setup is clearly not
in production and we can play with it as much as we like.
Created attachment 135198 [details]
Here is a patch which should allow grabbing some extra information about the
bug that you've uncovered. Sorry for the delay in getting back to you; I've had
a bit of a backlog of things recently.
I've looked through all the code in that area (freeing and deallocation of
inodes) and I can't see the problem, so hopefully this patch will gather enough
information to point us in the right direction. So if you have time to run this
then it would be very useful to know the results. I have also committed
recently a number of patches and also updated the origin branch to Linus'
latest kernel. I don't think any of them would have a bearing on this
particular bug, but it is probably as well to update anyway, so then we are
both working from the same baseline.
Also, regarding your patch of 19th July, moving lm_interface.h is probably a
good plan long term. I've left it in its current position for now just because
RHEL5 test1 was not frozen. Now it is, that can be moved easily enough without
causing any undue problems. What we don't want to do though is to add the
exports for GFS1 to the git tree. The git tree is intended to go upstream and
GFS1 will remain a patch to the kernel, which is why the exports are a separate
patch in FC6/RHEL5t1. If you want to send me a cleaned up version without the
GFS1 changes, then I'll commit it to my tree. It's best to send it direct to me
rather than via the list (but feel free to cc the list as well).
I will run the tests within a week or two at most (enjoying some vacation). Of
course I will sync with your latest git tree, no worries about that.
Regarding the patch, I will clean it up in my git tree and push it somewhere
for you to pull (or send another email, whatever works best for you).
Created attachment 136058 [details]
Attached is the OOPS from running the very latest git (last commit
24264434603cc102d71fb2a1b3b7e282a781f449 Rewrite of examine_bucket()); it is
without the suggested debugging patch.
Applying the patch and running the test makes the machine hang really hard
before any OOPS is printed on the console or in the logs.
I'm still investigating this. I think I can see what's going on and I'll try to
come up with another patch within the next few days.
That's ok. I am back from holidays and I can test patches easily now.
I have pushed a patch: fae24ae10e0256e187431f5852eb31605415cef9 entitled
[GFS2] Fix journal flush problem to my -nmw git tree. I hope that this will fix
the problem that's been reported here as a side effect. It removes part of the
code in which this problem was reported, so if it's still not working correctly,
then it would be very helpful if you could provide a further stack trace.
The patch will be going to Linus next time he opens his merge window provided no
problems are found in it, and once that's happened, the way is open for us to
merge it into RHEL/FC.
I will check within the next 48 hours. I saw the patches pushed on LKML and
Thanks a lot
Well, good news. I pulled from your -nmw branch, reran the tests and it seems
that the problem is gone.
I will leave some tests running for a few hours just to be safe.
Ok, this problem seems 100% gone. Found another one though; bug report on the way.
Thanks a lot!
Since there is a patch for this upstream, changing status to modified.
Created attachment 143163 [details]
First of three patches for RHEL5 to fix this bug
This is the first of three patches required for RHEL5 to fix this problem. The
patches are already upstream, so these are the backported fixes.
Created attachment 143164 [details]
The second of the three patches
Created attachment 143165 [details]
The third of the three patches
A package has been built which should help the problem described in
this bug report. This report is therefore being closed with a resolution
of CURRENTRELEASE. You may reopen this bug report if the solution does
not work for you.