Version-Release number of selected component (if applicable): latest git code from gfs2-2.6.git and latest CVS HEAD checkout for userland
How reproducible: Set up a 2-node cluster sharing a gfs2 block device. Run "dbench -D /mnt/point/node1 5" and "dbench -D /mnt/point/node2 5" on node1 and node2 of the cluster, respectively. One of the nodes will OOPS (see the attached kern.log) and the other will stall (no oops or log).
Actual results: see attachment.
Created attachment 134699 [details] kern.log
One odd thing here is that the assertion that was triggered is on line 287 rather than 286 (as the oops claims), so perhaps gcc was having an off day. It rather looks like two sets of error messages on top of each other, but I think I've got a fair idea of where the error is occurring. For the avoidance of doubt, can you tell me the last commit in the git tree that you are using? We did fix something very similar to this in 5dd9feafb351a8bf304292623cbc63335c34d279, but I suspect that this is long enough ago that it's in your kernel, since you say that you are using the latest git tree. Also, one more question: does the oops occur straight away, or after the dbench processes have been running for a few seconds/minutes?
Hi Steve, the off-by-one line might have been introduced by the patch I posted as an RFC here: https://www.redhat.com/archives/cluster-devel/2006-July/msg00144.html and since then I have been merging constantly from your tree. The last commit in your tree I have is: commit b8e1aabf218a2037d9d6a3256c33fc6ef96ac44c Author: Steven Whitehouse <swhiteho> Date: Tue Aug 22 16:25:50 2006 -0400, as it was available approx 60 minutes before the bug report. I ran dbench with 1 thread on each node for 600 secs and it worked fine. Next I fired it up with 5 and got the OOPS. Tested and reproduced 3 times in a row to make sure it was not lunar rays hitting the fiber cable to the SAN. Fabio PS Just let me know if you want any patch tested or so. The setup is clearly not in production and we can play with it as much as we like.
Created attachment 135198 [details] Debugging patch Here is a patch which should allow grabbing some extra information about the bug that you've uncovered. Sorry for the delay in getting back to you; I've had a bit of a backlog of things recently. I've looked through all the code in that area (freeing and deallocation of inodes) and I can't see the problem, so hopefully this patch will gather enough information to point us in the right direction. So if you have time to run this, it would be very useful to know the results. I have also recently committed a number of patches and updated the origin branch to Linus' latest kernel. I don't think any of them would have a bearing on this particular bug, but it is probably as well to update anyway, so that we are both working from the same baseline. Also, regarding your patch of 19th July, moving lm_interface.h is probably a good plan long term. I've left it in its current position for now just because RHEL5 test1 was not frozen. Now that it is, the file can be moved easily enough without causing any undue problems. What we don't want to do, though, is add the exports for GFS1 to the git tree. The git tree is intended to go upstream and GFS1 will remain a patch to the kernel, which is why the exports are a separate patch in FC6/RHEL5t1. If you want to send me a cleaned-up version without the GFS1 changes, then I'll commit it to my tree. It's best to send it direct to me rather than via the list (but feel free to cc the list as well).
Hi Steve, I will run the tests within a week or 2 at most (enjoying some vacation). Of course I will sync with your latest git tree, no worries about that. Regarding the patch, I will clean it up in my git tree and push it somewhere for you to pull (or send another email, whatever works best for you). Thanks Fabio
Created attachment 136058 [details] OOPS Hi Steve, attached is the OOPS from running the very latest git (last commit 24264434603cc102d71fb2a1b3b7e282a781f449 Rewrite of examine_bucket()), without the suggested debugging patch. Applying the patch and running the test makes the machine hang really hard before any OOPS appears on the console or in the logs. Thanks Fabio
Hi, I'm still investigating this. I think I can see what's going on and I'll try to come up with another patch within the next few days. Steve
That's ok. I am back from holidays and I can test patches easily now. Fabio
I have pushed a patch: fae24ae10e0256e187431f5852eb31605415cef9 entitled [GFS2] Fix journal flush problem to my -nmw git tree. I hope that this will fix the problem that's been reported here as a side effect. It removes part of the code in which this problem was reported, so if it's still not working correctly, it would be very helpful if you could provide a further stack trace. The patch will be going to Linus next time he opens his merge window, provided no problems are found in it, and once that's happened, the way is open for us to merge it into RHEL/FC.
I will check within the next 48 hours. I saw the patches pushed on LKML and cluster-devel. Thanks a lot Fabio
Well, good news. I pulled from your -nmw branch, reran the tests, and it seems that the problem is gone. I will leave some tests running for a few hours just to be safe. Fabio
Ok, this problem seems 100% gone. Found another one though; bug report on the way. Thanks a lot! Fabio
Since there is a patch for this upstream, changing status to modified.
Created attachment 143163 [details] First of three patches for RHEL5 to fix this bug This is the first of three patches required for RHEL5 to fix this problem. The patches are already upstream, so these are the backported fixes.
Created attachment 143164 [details] The second of the three patches
Created attachment 143165 [details] The third of the three patches
in 2.6.18-1.2876.el5
A package has been built which should help the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.