Description of problem:

This is the same bug as in bz 201217 and 215793, which are private and therefore useless to many important people.

https://www.redhat.com/archives/linux-cluster/2007-February/msg00057.html reports:

For some days I have been getting a withdraw on 1 node of my 6-node gfs1 cluster. Yesterday I rebooted all nodes. Now the problem has moved to another node. The kernel messages are the same every time:

GFS: fsid=epsilon:amal.1: fatal: assertion "x <= length" failed
GFS: fsid=epsilon:amal.1: function = blkalloc_internal
GFS: fsid=epsilon:amal.1: file = /build/buildd/linux-modules-extra-2.6-2.6.17/debian/build/build_amd64_none_amd64_redhat-cluster/gfs/gfs/rgrp.c, line = 1458
GFS: fsid=epsilon:amal.1: time = 1170922910
GFS: fsid=epsilon:amal.1: about to withdraw from the cluster
GFS: fsid=epsilon:amal.1: waiting for outstanding I/O
GFS: fsid=epsilon:amal.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=epsilon:amal.1: withdrawn

https://www.redhat.com/archives/linux-cluster/2007-February/msg00063.html reports:

Interesting. While testing GFS with a low journal size and resource group size, I hit the same issue:

Feb 7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: fatal: assertion "x <= length" failed
Feb 7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: function = blkalloc_internal
Feb 7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: file = /download/gfs/cluster.cvs-rhel4/gfs-kernel/src/gfs/rgrp.c, line = 1458
Feb 7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: time = 1170896502
Feb 7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: about to withdraw from the cluster
Feb 7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: waiting for outstanding I/O
Feb 7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: telling LM to withdraw

This happened on a 3-node GFS over a 512M device.

$ gfs_mkfs -t cisco:gfs2 -p lock_dlm -j 3 -J 8 -r 16 -X /dev/hda12

I was using bonnie++ to create about 10K files of 1K each from each of the 3 nodes simultaneously. Looking at the code in rgrp.c, it seems related to a failure to find a particular resource group block. Could this be due to the very low RG size I'm using (16M)?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
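For context on what the assert above is guarding, here is a simplified, hypothetical C sketch of a resource-group block allocator of this general shape. It is not the actual rgrp.c code; the names (rg_alloc_sketch, find_free_in_buffer, bitmap_buf) are invented for illustration. The idea: the caller has already consulted the resource-group header and believes a free block exists, so the allocator walks every bitmap buffer in the group; if it searches all of them ("length" buffers) without finding a free block, the header counters and the bitmaps disagree, which is what an assertion like "x <= length" catches.

/*
 * Illustrative sketch only (hypothetical names, not the real rgrp.c).
 */
#include <assert.h>
#include <stdint.h>

#define NO_BLOCK UINT32_MAX

struct bitmap_buf {
    const uint8_t *data;   /* on-memory copy of one bitmap block */
    uint32_t       nbits;  /* number of blocks this buffer covers */
};

/* Assumed helper: return offset of a free block in one buffer, or NO_BLOCK. */
extern uint32_t find_free_in_buffer(const struct bitmap_buf *buf, uint32_t goal);

uint32_t rg_alloc_sketch(struct bitmap_buf *bufs, uint32_t length, uint32_t goal)
{
    uint32_t blk = NO_BLOCK;
    uint32_t buf = 0, x;

    /* Try each bitmap buffer in the resource group, wrapping once. */
    for (x = 0; x <= length; x++) {
        blk = find_free_in_buffer(&bufs[buf], goal);
        if (blk != NO_BLOCK)
            break;
        buf = (buf + 1) % length;
        goal = 0;
    }

    /* Fires when the header promised free space but no bitmap had any. */
    assert(x <= length);
    return blk;
}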
After a file system check, all problems seem to have gone.
When you ran gfs_fsck, did it give you any messages about repairing Resource Groups (RGs)? If you have the output, please post it as an attachment to the bugzilla.
I have the same error; details below.

Mar 9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: fatal: assertion "x <= length" failed
Mar 9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: function = blkalloc_internal
Mar 9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: file = /builddir/build/BUILD/gfs-kernel-2.6.9-60/smp/src/gfs/rgrp.c, line = 1458
Mar 9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: time = 1173461908
Mar 9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: about to withdraw from the cluster
Mar 9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: waiting for outstanding I/O
Mar 9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: telling LM to withdraw
Mar 9 17:38:31 jrmedia-c kernel: lock_dlm: withdraw abandoned memory
Mar 9 17:38:31 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: withdrawn

I ran gfs_fsck; its output is attached.
Created attachment 149717 [details] Output of gfs_fsck -v -y
The attachment shows no overt resource group corruption or repair. The highest block I saw in the output indicates the file system is about 167GB in size, so not exceedingly big. The gfs_fsck output is pretty run-of-the-mill except for one thing: a hundred or so messages that look like:

Converting free metadata block at 43291043 to a free data block
Succeeded.
Converting 1 unused metadata blocks to free data blocks...
Updating Resource Group 289

These messages are predominantly at the bottom of the output, and the affected block numbers descend through the output. That tells me that the file system was most likely extended with gfs_grow. I wonder if this problem is related to the fact that gfs_grow adds the new rgindex entries in reverse order (see bz 222743). I'll investigate that possibility and see if I can recreate any kind of corruption by extending with gfs_grow. I know Abhi Das did some experiments in this regard and found no way to cause corruption using a multi-terabyte file system. Still, it's a lead.

Ben Yarwood: Can you verify that the file system has been extended with gfs_grow, and if so, was it shortly before the assert happened?
The file system has been extended more than once. The last time was over a month ago, and the first assert was three days ago. This file system is used in quite a strange manner: in most cases a file is written to the file system once, read once, and then never accessed again. It is basically backup storage for another device. Occasionally (1 in 100) a file will be rewritten and reread at a later date.
Hi,

This is a known issue, and Wendy Cheng is already aware that this customer is seeing it; she is working on a debug rpm for them to use.

Version info: x86_64 RHEL4 running 2.6.9-55.ELsmp

Therefore I am elevating it to GFS engineering. The latest messages show the problem:

Oct 23 10:56:55 nike sshd(pam_unix)[9248]: session closed for user omc
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2: fatal: assertion "x <= length" failed
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2: function = blkalloc_internal
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2: file = /builddir/build/BUILD/gfs-kernel-2.6.9-72/smp/src/gfs/rgrp.c, line = 1459
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2: time = 1193126395
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2: about to withdraw from the cluster
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2: waiting for outstanding I/O
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2: telling LM to withdraw
Oct 23 10:59:57 nike kernel: lock_dlm: withdraw abandoned memory
Oct 23 10:59:57 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2: withdrawn
Oct 23 10:59:58 nike kernel: zrnevhmx[10472]: segfault at 0000000000000018 rip 000000000022a8f6 rsp 00000000ffffc1a0 error 6
Oct 23 11:00:01 nike crond[14530]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Oct 23 11:00:01 nike crond[14531]: (omc) CMD (/bin/bash -lc "invupl_xmlcleaner.sh" > /dev/null 2>&1 # HOST=slab21core ID=INV_CRON1)
Oct 23 11:00:02 nike kernel: mecmanmx[11981]: segfault at 0000000000000000 rip 0000000000204b25 rsp 00000000ffffbdc4 error 6
Oct 23 11:01:01 nike crond[16211]: (root) CMD (run-parts /etc/cron.hourly)
Oct 23 11:02:30 nike kernel: zrnevhmx[18249]: segfault at 0000000000000018 rip 000000000022a8f6 rsp 00000000ffffc070 error 6
Oct 23 11:05:08 nike kernel: zrnevhmx[21772]: segfault at 0000000000000018 rip 000000000022a8f6 rsp 00000000ffffc070 error 6
Oct 23 11:07:43 nike kernel: zrnevhmx[25388]: segfault at 0000000000000018 rip 000000000022a8f6 rsp 00000000ffffc070 error 6

If you need more data, let me know.

Thanks,
Debbie Johnson
SEG

Issue escalated to RHEL 4 Base OS by: dejohnso.
dejohnso assigned to issue for SEG - Storage.
Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by dejohnso
issue 135820
Looking back through the issues, we never seem to have captured the path(s) (stack traces) associated with this assert. So the first thing we will do is pass out a test RPM to capture the path when this assert occurs. This should reduce the guesswork, since we don't have an in-house reproducer to work with for this issue.
The test RPM to capture the failing path has been uploaded to: http://people.redhat.com/wcheng/Patches/GFS/R4/bz227892 along with a readme file. Let me know how it goes.
For djuran and the support folks: if possible, please run the test a few times to see whether each failure has the exact same stack trace. Send back the /var/log/messages output and/or attach it to this bugzilla.
Hello Wendy. I ran the test three times and got an identical stack trace every time:

Oct 25 12:05:00 nike kernel: Call Trace:<ffffffffa023a17e>{:gfs:blkalloc_internal+231} <ffffffffa023a7f4>{:gfs:gfs_dialloc+113}
Oct 25 12:05:00 nike kernel: <ffffffffa021f196>{:gfs:dinode_alloc+137} <ffffffffa021f4ff>{:gfs:gfs_createi+156}
Oct 25 12:05:00 nike kernel: <ffffffff8030bf21>{wait_for_completion+231} <ffffffff801d0b9d>{avc_alloc_node+55}
Oct 25 12:05:00 nike kernel: <ffffffff801d0b5d>{avc_node_replace+52} <ffffffff801d1f33>{avc_has_perm_noaudit+635}
Oct 25 12:05:00 nike kernel: <ffffffffa0232567>{:gfs:gfs_symlink+153} <ffffffff801d2eac>{may_create+220}
Oct 25 12:05:00 nike kernel: <ffffffff80189742>{vfs_symlink+193} <ffffffff80189812>{sys_symlink+133}
Oct 25 12:05:00 nike kernel: <ffffffff80110d91>{error_exit+0} <ffffffff8011026a>{system_call+126}

I am also attaching the full /var/log/messages in a (private) attachment. Do note that this trace was obtained on a tainted kernel (Xelas OSI stack), but I have been able to produce the assertion on an untainted kernel as well.
Thanks for the data .. looking into it now.
Judging by the stack trace and the info from comment #6, GFS1 probably has a bug somewhere around the handling of the rgd's rd_last_alloc_meta when its associated local resource group is close to full. I will give this more thought today to see how we can capture the culprit, and will try to get the 2nd test RPM done by tomorrow morning. For this particular issue, it is OK to run the test on tainted kernels, as long as the test gfs.ko can load.
Went through the captured data and the relevant code in detail today. I suspect we can recreate this issue in-house by artificially switching the allocation resource groups on a three-node cluster. So instead of using the customer's system, I would like to try this out in our lab first. Since the GFS1 code in RHEL5 and RHEL4 is mostly identical, I'll start with my own RDU RHEL5 cluster. So djuran, I don't need to use your customer's cluster at this moment. Will keep everyone updated on Monday (10/28) on this issue.
I ran the experiments twice over the weekend and was not able to recreate the issue. We probably need the customer's cluster again, but I have to think about the issue some more (in terms of what exactly we should capture). Starting to draft a second debug patch now.
The reason for this assert panic is that the resource group accounting info says there is space in the resource group for at least one inode, but the bitmap shows no space for it. There are many possible causes (disk corruption of even one bit, memory corruption, a logic error, a locking issue, or cache problems), and the culprit event could be far removed from the time when the issue actually shows up. That's why this issue is so hard to troubleshoot without recreating it in our lab.

So ... the action plan is:

1. Keep trying to recreate this issue on my RHEL5 RDU cluster every night.
2. In the new debug patch, when this error occurs, we'll:
   a) save the faulty rgrp structure somewhere else
   b) flush the journal (this flushes whatever occurred before this error event onto the disk)
   c) re-read the rgrp from disk
   d) compare the faulty rgrp structure with the current one and print out the differences
   e) see if we can keep going without hitting the issue

If (e) keeps going without issue, then we know it would be a locking and/or cache issue. A sketch of the compare step in (d) follows below.

In the middle of testing the debug patch now. When it is done, I need to discuss with the support folks whether we can run this test on the customer's cluster.
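As a rough illustration of step (d), here is a minimal, user-space-style sketch of comparing a saved on-memory block with a freshly re-read on-disk block and printing the bytes that differ. The function name and output format are invented for illustration; the real debug patch lives inside the kernel module.

#include <stdint.h>
#include <stdio.h>

/* Compare a saved on-memory block with the freshly re-read on-disk block. */
void dump_block_diff(const uint8_t *in_memory, const uint8_t *on_disk,
                     size_t block_size, uint64_t block_addr)
{
    int identical = 1;

    for (size_t off = 0; off < block_size; off++) {
        if (in_memory[off] != on_disk[off]) {
            printf("block %llu offset %zu: memory 0x%02x disk 0x%02x\n",
                   (unsigned long long)block_addr, off,
                   in_memory[off], on_disk[off]);
            identical = 0;
        }
    }
    if (identical)
        printf("block %llu: on-memory and on-disk copies identical\n",
               (unsigned long long)block_addr);
}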
The 2nd set of test RPMs has been uploaded to: http://people.redhat.com/wcheng/Patches/GFS/R4/bz227892 with a readme file. When the error condition occurs, this modified GFS will:

* print out debugging info
* give up the block in question, find another block, and then keep going (sketched below)

So there *is* a possibility of corrupting the filesystem, depending on how bad the original error condition is. If the customer is OK with this, please try it out and pass back the /var/log/messages file when the error occurs. Remember to either fsck the partition and/or re-make the fs when the experiment is done. If the customer is not OK with this experiment, let me know too (so I can think of other possible ways to capture the culprit).
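A rough sketch of the "give up the block in question and keep going" behavior, assuming hypothetical helpers (try_alloc_in_rg, next_rg, rg_addr); this is not the actual debug patch, just the shape of the fallback:

#include <stdint.h>
#include <stdio.h>

#define NO_BLOCK UINT32_MAX

struct rg_sketch;  /* opaque resource-group handle for this illustration */

extern uint32_t try_alloc_in_rg(struct rg_sketch *rg);        /* assumed */
extern struct rg_sketch *next_rg(struct rg_sketch *rg);       /* assumed */
extern uint64_t rg_addr(const struct rg_sketch *rg);          /* assumed */

uint32_t alloc_with_fallback(struct rg_sketch *rg, unsigned max_rgs)
{
    for (unsigned tried = 0; rg && tried < max_rgs; tried++) {
        uint32_t blk = try_alloc_in_rg(rg);
        if (blk != NO_BLOCK)
            return blk;

        /* A normal build would assert/withdraw here; the debug build logs
         * the bad resource group and moves on, accepting the risk of
         * further corruption. */
        fprintf(stderr, "giving up on rg at %llu, trying next rg\n",
                (unsigned long long)rg_addr(rg));
        rg = next_rg(rg);
    }
    return NO_BLOCK;
}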
New debug RPMs and a readme are in: http://people.redhat.com/wcheng/Patches/GFS/R4/bz227892/ Upon problem recreation, please send back the /var/log/messages file from the problem node. Also try to run it at least twice, so we have at least two /var/log/messages files to compare. Thanks.
Hello Wendy. Just talked to the lab tech and it seems like the assertion has stopped happening lately )-: Anyway, I've installed your latest patch and I'll let you know if we catch anything.

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by djuran
issue 135820
Hello Wendy. To quote a memorable phrase from "Twin Peaks": It is happening again! (-: Anyway, the assertion has started to show up again, this time on all four nodes and (at least) on two different file systems. The nodes "nike" and "reebok" have tainted kernels, while the other two nodes should be clean. I'm attaching the relevant log file excerpts, which contain the output from your latest debug rpm, in a private attachment. I've also indicated in there when the nodes were rebooted.

Internal Status set to 'Waiting on Engineering'
Status set to: Waiting on Tech

This event sent from IssueTracker by djuran
issue 135820
Customer has increased severity of the Issue Tracker, commenting that this issue is preventing application testing. Setting the BZ severity to match.
OK, looking into this now.
OK, to be on the safe side, please remove the test RPM from the systems. From messages.adidas, the on-disk bitmap is the same as the on-memory version, so it looks to me like there is disk corruption. Please fsck the filesystem and send back the fsck output. Check out "man gfs_fsck" to get the correct syntax.
Didn't have time to explain the problem during office hours today, so here are the details ...

GFS keeps groups of blocks on disk, called resource groups (RGs), to store its metadata info (similar to ext3's inode tables). Each group starts with a header block that holds the relevant statistics (how many free meta blocks, how many used meta blocks, etc.). Each header is followed by a pre-defined number of bitmap blocks that record which disk blocks are actually used and which are free. These bitmap blocks are read into memory as part of the glock (cluster lock) operation. When this issue occurs, the counters in the on-memory header say the group has 146 free blocks available, but the bitmap contents say otherwise. As a result, GFS panics itself to avoid any possible further corruption.

What the debug RPM does upon the event is:

1. save the on-memory blocks
2. re-read the on-disk blocks
3. compare (1) and (2)
4. print out the differences

From Dave's messages file, it is found that this event *always* occurs on the very same resource group (its header block address is 1834264). The header says it has 146 free meta blocks, but the bitmap says otherwise. The on-memory contents and the on-disk blocks are identical, except for the header's generation number. This implies the on-disk header and the on-disk bitmap are inconsistent with each other, so an fsck to fix it is *required*. This also explains why this problem is so easy to "recreate": this particular RG on the disk was corrupted some time ago, and as soon as GFS tries to use it, this panic (assert) occurs. Note that GFS more or less "rotates" among different RGs, so the problem may not show up immediately, only once allocation comes back around to this particular resource group.

So our first step is to fix this disk inconsistency (as a workaround) while we are trying to figure out the root cause of the original corruption. Note that the gfs_fsck utility in RHEL4 is known to have its own issues too. Please pass back the fsck output for a sanity check.
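To make the header-versus-bitmap mismatch concrete, here is a hypothetical sketch that recounts free blocks by walking a 2-bit-per-block bitmap and compares the result with the header's summary counter. The structure names and the assumption of a 2-bit-per-block layout are illustrative only; the real on-disk format is defined in the GFS headers.

#include <stdint.h>
#include <stdio.h>

#define STATE_FREE 0u                  /* assumed 2-bit "free" state */

struct rg_header_sketch {
    uint32_t free_blocks;              /* summary counter in the header      */
    uint32_t total_blocks;             /* blocks covered by the bitmap blocks */
};

/* Count free blocks by walking the 2-bit-per-block bitmap. */
static uint32_t count_free(const uint8_t *bitmap, uint32_t total_blocks)
{
    uint32_t free_count = 0;

    for (uint32_t blk = 0; blk < total_blocks; blk++) {
        uint8_t state = (bitmap[blk / 4] >> ((blk % 4) * 2)) & 0x3;
        if (state == STATE_FREE)
            free_count++;
    }
    return free_count;
}

int rg_consistent(const struct rg_header_sketch *hdr, const uint8_t *bitmap)
{
    uint32_t counted = count_free(bitmap, hdr->total_blocks);

    if (counted != hdr->free_blocks) {
        fprintf(stderr, "rg mismatch: header says %u free, bitmap says %u\n",
                hdr->free_blocks, counted);
        return 0;                      /* inconsistent: fsck needed */
    }
    return 1;
}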
Update on this problem:

* The fsck output *does* show on-disk bitmap corruption.
* Based on conversation with our field engineers, the problem disappears after fsck.
* Now the issue is how and why it got corrupted in the first place.

I was hoping the customer had a way to recreate the corruption, but it turns out the "recreation" was due to the already-corrupted on-disk bitmap, so it is not very useful. I was also hoping to see minor corruption (say, a bitmap caught in the middle of being transformed from free to un-used - then it could be a flaw in our glock flush), but that also turned out not to be true (the fsck shows data corruption as well).

The rgrp handling code is one of GFS's core functions; it is complicated and dangerous to mess with. I'm reluctant to make any "guessing" fix unless the evidence and rationale are strong. I will keep watching and monitoring this issue. However, until a recreation method is found that can be used to verify the fix and/or any code changes, I don't plan to rush into any solution. BTW, the corruption can come from anywhere (an unclean shutdown, an abrupt interruption during journal replay, etc.), so this is really a hard problem to troubleshoot.
I accidentally recreated a similar issue last month by mounting GFS with the "nolock" protocol on two nodes. It may not be the cause of the corruption that these customers encounter, but it shows that at least one source of the problem is locking errors. It is hard to prevent this type of error, though.

Two side effects of this problem are:

* the filesystem keeps being used without anyone knowing that one particular resource group has been corrupted, until that group is used again. Note that GFS can rotate the RGs during block allocation.
* the persistence of the corruption gives people the wrong impression that this problem can easily re-occur.

Without a recreation method that can be used to trace the real culprit, and given the above two side effects, we'll tentatively settle this issue with code that, upon error detection, will:

1. warn people about the RG corruption and request a gfs_fsck
2. perform a minimum of error data collection (and hopefully we'll eventually figure this out by observing that data)

A rough sketch of that direction follows below. If anyone knows a recreation method that can be used in our lab environment, please do let us know.
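A minimal sketch of the warn-and-collect direction, using hypothetical names (rg_state_sketch, report_rg_corruption) rather than the code actually checked in:

#include <stdint.h>
#include <stdio.h>

struct rg_state_sketch {
    uint64_t header_addr;      /* disk address of the RG header block          */
    uint32_t header_free;      /* free-block count from the header             */
    uint32_t bitmap_free;      /* free-block count recounted from the bitmaps  */
};

void report_rg_corruption(const struct rg_state_sketch *rg)
{
    /* Warn about the corrupted resource group and ask for a gfs_fsck,
     * while recording the minimal data needed for later analysis. */
    fprintf(stderr,
            "GFS: resource group at %llu is inconsistent "
            "(header free=%u, bitmap free=%u)\n",
            (unsigned long long)rg->header_addr,
            rg->header_free, rg->bitmap_free);
    fprintf(stderr,
            "GFS: unmount and run gfs_fsck on this filesystem\n");
}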
Code checked into CVS today. Should make RHEL 4.7.
I hit this during recovery testing on ppc for 4.6.z. We probably should make sure this is in the errata for 4.7.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0802.html