Bug 227892 - assertion 'x <= length' failed
Summary: assertion 'x <= length' failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Kiersten (Kerri) Anderson
QA Contact: GFS Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-02-08 19:54 UTC by David Teigland
Modified: 2018-10-20 00:27 UTC
CC: 6 users

Fixed In Version: RHBA-2008-0802
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-25 19:27:17 UTC
Embargoed:


Attachments
Output of gfs_fsck -v -y (12.07 MB, text/plain)
2007-03-09 18:33 UTC, Ben Yarwood


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0802 0 normal SHIPPED_LIVE GFS-kernel bug fix update 2008-07-25 19:27:12 UTC

Description David Teigland 2007-02-08 19:54:07 UTC
Description of problem:

This is the same bug as in bz 201217 and 215793, which are private,
which means they are useless to many important people.


https://www.redhat.com/archives/linux-cluster/2007-February/msg00057.html
reports:

For some days now I have been getting a withdraw on 1 node of my 6-node
gfs1 cluster. Yesterday I rebooted all nodes. Now the problem has moved
to another node.

The kernel messages are the same every time:

GFS: fsid=epsilon:amal.1: fatal: assertion "x <= length" failed
GFS: fsid=epsilon:amal.1:   function = blkalloc_internal
GFS: fsid=epsilon:amal.1:   file =
/build/buildd/linux-modules-extra-2.6-2.6.17/debian/build/build_amd64_none_amd64_redhat-cluster/gfs/gfs/rgrp.c,
line = 1458
GFS: fsid=epsilon:amal.1:   time = 1170922910
GFS: fsid=epsilon:amal.1: about to withdraw from the cluster
GFS: fsid=epsilon:amal.1: waiting for outstanding I/O
GFS: fsid=epsilon:amal.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=epsilon:amal.1: withdrawn


https://www.redhat.com/archives/linux-cluster/2007-February/msg00063.html
reports:

Interesting. While testing GFS with a low journal size and ResourceGroup
size, I hit the same issue:


Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: fatal: assertion "x
<= length" failed
Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2:   function =
blkalloc_internal 
Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2:   file =
/download/gfs/cluster.cvs-rhel4/gfs-kernel/src/gfs/rgrp.c, line = 1458 
Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2:   time = 1170896502
Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: about to withdraw
from the cluster
Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: waiting for
outstanding I/O
Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: telling LM to
withdraw

This happened on a 3 node GFS over 512M device.

$ gfs_mkfs -t cisco:gfs2 -p lock_dlm -j 3 -J 8 -r 16 -X /dev/hda12

I was using bonnie++ to create about 10K files of 1K each from each of 3
nodes simultaneously.

Looking at the code in rgrp.c, it seems related to a failure to find a
particular resource group block. Could this be due to the very low RG size
I'm using (16M)?
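
For context, the "x <= length" check is a bounds guard on the allocator's
bitmap scan: blkalloc_internal() walks the resource group's bitmap blocks
looking for free space, and the scan index must not run past the number of
bitmap blocks. The following is a simplified, user-space C sketch of that
kind of guard; the structures and names are illustrative assumptions, not
the actual GFS code.

/* Simplified, user-space sketch of the bounds guard behind the
 * "x <= length" assertion in blkalloc_internal().  All names and
 * layouts are illustrative, not the real GFS structures. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NO_FREE ((uint32_t)-1)

struct rg_sketch {
    uint32_t length;            /* number of bitmap blocks in this RG        */
    uint32_t bytes_per_bmap;    /* usable bytes per bitmap block             */
    unsigned char (*bmap)[8];   /* tiny bitmap blocks; 0xff byte == all used */
};

/* Return the index of a bitmap byte with a free entry, or NO_FREE. */
static uint32_t scan_bmap(const unsigned char *b, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        if (b[i] != 0xff)
            return i;
    return NO_FREE;
}

/* Pick a bitmap block with free space, starting at 'goal' and wrapping.
 * If the RG header claims free blocks but no bitmap block has any, the
 * scan index walks past 'length' and the assertion fires -- which is
 * the situation reported in this bug. */
static uint32_t pick_bitmap(const struct rg_sketch *rg, uint32_t goal)
{
    uint32_t x, i = goal;

    for (x = 0; x <= rg->length; x++) {
        if (scan_bmap(rg->bmap[i], rg->bytes_per_bmap) != NO_FREE)
            break;
        i = (i + 1) % rg->length;
    }
    assert(x <= rg->length);    /* header/bitmap disagreement trips this */
    return i;
}

int main(void)
{
    unsigned char bmaps[2][8] = {
        { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
        { 0xff, 0xff, 0xff, 0xf3, 0xff, 0xff, 0xff, 0xff },
    };
    struct rg_sketch rg = { .length = 2, .bytes_per_bmap = 8, .bmap = bmaps };

    printf("free space found in bitmap block %u\n", pick_bitmap(&rg, 0));
    return 0;
}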



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 menole 2007-02-12 14:51:22 UTC
After a file system check, all problems seem to have gone.

Comment 2 Robert Peterson 2007-02-13 17:31:27 UTC
When you ran gfs_fsck, did it give you any messages about repairing
Resource Groups (RGs)?  If you have the output, please post it as an
attachment to the bugzilla.


Comment 3 Ben Yarwood 2007-03-09 18:27:19 UTC
I have the same error, details below.
Mar  9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: fatal: 
assertion "x <= length" failed
Mar  9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2:   function = 
blkalloc_internal
Mar  9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2:   file 
= /builddir/build/BUILD/gfs-kernel-2.6.9-60/smp/src/gfs/rgrp.c, line = 1458
Mar  9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2:   time = 
1173461908
Mar  9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: about to 
withdraw from the cluster
Mar  9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: waiting for 
outstanding I/O
Mar  9 17:38:28 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: telling LM 
to withdraw
Mar  9 17:38:31 jrmedia-c kernel: lock_dlm: withdraw abandoned memory
Mar  9 17:38:31 jrmedia-c kernel: GFS: fsid=alpha_cluster:64mp3.2: withdrawn


I ran a gfs_fsck, the output of which is attached.

Comment 4 Ben Yarwood 2007-03-09 18:33:39 UTC
Created attachment 149717 [details]
Output of gfs_fsck -v -y

Comment 5 Robert Peterson 2007-03-09 22:58:09 UTC
The attachment shows no overt resource group corruption or repair.
The highest block I saw in the output indicated the file system was
about 167GB in size, so not exceedingly big.
The gfs_fsck output is pretty run-of-the-mill except for one thing:
a hundred or so messages that look like:

Converting free metadata block at 43291043 to a free data block
Succeeded.
Converting 1 unused metadata blocks to free data blocks...
Updating Resource Group 289

These messages are predominantly at the bottom of the output, and the
affected block numbers are descending.  That tells me that the file
system was most likely extended with gfs_grow.

I wonder if this problem is related to the fact that gfs_grow
adds the new rgindex entries in reverse order (see bz 222743).
I'll investigate that possibility and see if I can recreate any
kind of corruption by extending with gfs_grow.  I know Abhi Das did 
some experiments in this regard and found no way to cause corruption
using a multi-terabyte file system.  Still, it's a lead.

Ben Yarwood: Can you verify that the file system has been extended
with gfs_grow, and if so, was it shortly before the assert happened?


Comment 6 Ben Yarwood 2007-03-10 10:22:44 UTC
The file system has been extended more than once.  The last time was over a 
month ago and the first assert was three days ago.

This file system is used in quite a strange manner.  In most cases a file will 
be written to the file system once, then read once and then there will be no 
further access to the file.  It's basically backup storage for another 
device.  Occasionally (1 in 100) a file will be rewritten and reread at a later 
date.

Comment 8 Issue Tracker 2007-10-23 18:54:31 UTC
Hi,

This is a known issue, and Wendy Cheng is already aware that this customer
is seeing it and is working on a debug rpm for them to use. Version info:
x86_64 RHEL4 running 2.6.9-55.ELsmp

Therefore I am elevating it to GFS engineering.

The latest messages show the problem:

Oct 23 10:56:55 nike sshd(pam_unix)[9248]: session closed for user omc
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2:
fatal: assertion "x <= length" failed
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2:  
function = blkalloc_internal
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2:  
file = /builddir/build/BUILD/gfs-kernel-2.6.9-72/smp/src/gfs/rgrp.c, line
= 1459
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2:  
time = 1193126395
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2:
about to withdraw from the cluster
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2:
waiting for outstanding I/O
Oct 23 10:59:55 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2:
telling LM to withdraw
Oct 23 10:59:57 nike kernel: lock_dlm: withdraw abandoned memory
Oct 23 10:59:57 nike kernel: GFS: fsid=9628936469471:vgglobal-lvol3.2:
withdrawn
Oct 23 10:59:58 nike kernel: zrnevhmx[10472]: segfault at 0000000000000018
rip 000000000022a8f6 rsp 00000000ffffc1a0 error 6
Oct 23 11:00:01 nike crond[14530]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Oct 23 11:00:01 nike crond[14531]: (omc) CMD (/bin/bash -lc
"invupl_xmlcleaner.sh" > /dev/null 2>&1 # HOST=slab21core ID=INV_CRON1)
Oct 23 11:00:02 nike kernel: mecmanmx[11981]: segfault at 0000000000000000
rip 0000000000204b25 rsp 00000000ffffbdc4 error 6
Oct 23 11:01:01 nike crond[16211]: (root) CMD (run-parts
/etc/cron.hourly)
Oct 23 11:02:30 nike kernel: zrnevhmx[18249]: segfault at 0000000000000018
rip 000000000022a8f6 rsp 00000000ffffc070 error 6
Oct 23 11:05:08 nike kernel: zrnevhmx[21772]: segfault at 0000000000000018
rip 000000000022a8f6 rsp 00000000ffffc070 error 6
Oct 23 11:07:43 nike kernel: zrnevhmx[25388]: segfault at 0000000000000018
rip 000000000022a8f6 rsp 00000000ffffc070 error 6

If you need more data, let me know.

Thanks,

Debbie Johnson
SEG


Issue escalated to RHEL 4 Base OS by: dejohnso.
dejohnso assigned to issue for SEG - Storage.
Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by dejohnso 
 issue 135820

Comment 9 Wendy Cheng 2007-10-24 15:26:37 UTC
Checking back through the issues, we don't seem to have ever captured the
path(s) (stack trace) associated with this assert. So the first thing
we will do is pass out a test RPM to capture the path when this assert
occurs. This should reduce the guesswork, since we don't have an in-house
reproducer to work with for this issue. 
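
A minimal user-space sketch of what "capture the path" means in practice:
record the call chain at the moment the check fails so repeated hits can be
compared (the actual test RPM does the equivalent inside the kernel, which
is where the traces later shown in comment 12 come from). The function
names below are hypothetical; backtrace() and backtrace_symbols_fd() are
standard glibc calls.

/* User-space sketch of "capture the failing path": dump the call chain
 * at the point the check fails so repeated hits can be compared. */
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void report_assert_path(const char *expr)
{
    void *frames[32];
    int n = backtrace(frames, 32);

    fprintf(stderr, "assertion \"%s\" failed, call path:\n", expr);
    backtrace_symbols_fd(frames, n, fileno(stderr));
}

/* Hypothetical stand-in for the failing check in blkalloc_internal(). */
static void blkalloc_check_sketch(unsigned int x, unsigned int length)
{
    if (!(x <= length)) {
        report_assert_path("x <= length");
        abort();
    }
}

int main(void)
{
    blkalloc_check_sketch(3, 2);    /* deliberately trips the check */
    return 0;
}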

Comment 10 Wendy Cheng 2007-10-24 19:03:08 UTC
The test RPM to capture the failing path is uploaded into:
http://people.redhat.com/wcheng/Patches/GFS/R4/bz227892 with a readme file.

Let me know how it goes.


Comment 11 Wendy Cheng 2007-10-24 20:59:59 UTC
For djuran and support folks, if possible, please run the test a few times 
to see whether each failure has the exact same stack trace. Send back the 
/var/log/messages outputs and/or attach them to this bugzilla. 


Comment 12 David Juran 2007-10-25 09:58:09 UTC
Hello Wendy.

I ran the test three times and got an identical stack trace every time:

Oct 25 12:05:00 nike kernel: Call
Trace:<ffffffffa023a17e>{:gfs:blkalloc_internal+231}
<ffffffffa023a7f4>{:gfs:gfs_dialloc+113} 
Oct 25 12:05:00 nike kernel:        <ffffffffa021f196>{:gfs:dinode_alloc+137}
<ffffffffa021f4ff>{:gfs:gfs_createi+156} 
Oct 25 12:05:00 nike kernel:        <ffffffff8030bf21>{wait_for_completion+231}
<ffffffff801d0b9d>{avc_alloc_node+55} 
Oct 25 12:05:00 nike kernel:        <ffffffff801d0b5d>{avc_node_replace+52}
<ffffffff801d1f33>{avc_has_perm_noaudit+635} 
Oct 25 12:05:00 nike kernel:        <ffffffffa0232567>{:gfs:gfs_symlink+153}
<ffffffff801d2eac>{may_create+220} 
Oct 25 12:05:00 nike kernel:        <ffffffff80189742>{vfs_symlink+193}
<ffffffff80189812>{sys_symlink+133} 
Oct 25 12:05:00 nike kernel:        <ffffffff80110d91>{error_exit+0}
<ffffffff8011026a>{system_call+126} 

Also attaching the full /var/log/messages in a (private) attachment. Do note
that this trace was obtained on a tainted kernel (Xelas OSI stack), but I have
been able to produce the assertion on an untainted kernel as well.

Comment 15 Wendy Cheng 2007-10-25 12:52:51 UTC
Thanks for the data .. looking into it now. 

Comment 16 Wendy Cheng 2007-10-25 14:56:09 UTC
Judging by the stack trace and info from comment #6, GFS1 probably has 
a bug somewhere around the handling of rgd's rd_last_alloc_meta when 
its associated local resource group is close to full. Will give this 
more thoughts today to see how we can capture the culprit. Will try to
get the 2nd test RPM done by tomorrow morning.

For this particular issue, it is ok to run the test on tainted kernels,
as long as the test gfs.ko can load.  

Comment 17 Wendy Cheng 2007-10-26 01:10:27 UTC
Went through the captured data and relevant code in detail today. I suspect 
we can recreate this issue in-house by artificially switching the allocation
rgrp groups on a three-node cluster. So instead of using the customer's system, 
I would like to try this out in our lab first. Since the gfs1 code in RHEL5 and 
RHEL4 is mostly identical, I'll start with my own RDU RHEL5 cluster.

So djuran, I don't need to use your customer's cluster at this moment. Will 
keep everyone updated on Monday (10/28) on this issue. 


Comment 18 Wendy Cheng 2007-10-29 14:39:48 UTC
I ran the experiments twice over the weekend and was not able to recreate the
issue. I probably need the customer's cluster again - but I have to think about
the issue more (in terms of what exactly we should capture). Starting to draft
a second debug patch now.

Comment 19 Wendy Cheng 2007-10-29 21:20:27 UTC
The reason for this assert panic is that the resource group accounting info
says there is space in the resource group for at least one inode, but the 
bitmap shows no space for it. There are many possible causes (disk corruption
of even one bit, memory corruption, a logic error, a locking issue, or cache 
problems), and the culprit event could be far removed from the time when the
issue actually shows up. That's why this issue is so hard to shoot without a
recreation in our lab. So the action plan is:

1. Keep trying to recreate this issue on my RHEL5 rdu cluster every night.
2. In the new debug patch, when this error occurs, we'll
   a) save the faulty rgrp structure somewhere else
   b) flush the journal (this will flush whatever occurs before this error 
      event into the disk)
   c) re-read the rgrp from disk
   d) compare the faulty rgrp structure and current one. Print out the 
      difference.
   e) see if we can keep going without issue
   
   If (e) keeps going without issue, then we know it is likely a locking and/or
   cache issue.

In the middle of testing the debug patch now. When it is done, I need to 
discuss with the support folks whether we can run this test in the customer's
cluster. 
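
A minimal sketch of the compare-and-report step in this plan (items a
through d), written as plain user-space C with placeholder names; the real
debug patch works on GFS's in-kernel rgrp structures and the journal.

/* Illustrative sketch of the "save, re-read, compare, report" step.
 * The buffers stand in for the rgrp header and bitmap blocks. */
#include <stdio.h>
#include <stdint.h>

/* Print every byte offset where the saved in-memory copy and the copy
 * re-read from disk disagree; return the number of differing bytes. */
static size_t compare_rgrp_copies(const uint8_t *in_memory,
                                  const uint8_t *on_disk, size_t len)
{
    size_t diffs = 0;

    for (size_t off = 0; off < len; off++) {
        if (in_memory[off] != on_disk[off]) {
            printf("offset %zu: memory=0x%02x disk=0x%02x\n",
                   off, in_memory[off], on_disk[off]);
            diffs++;
        }
    }
    if (diffs == 0)
        puts("in-memory rgrp matches on-disk copy: points at on-disk "
             "inconsistency rather than a locking/cache problem");
    return diffs;
}

int main(void)
{
    uint8_t mem[4]  = { 0x01, 0x02, 0x03, 0x04 };
    uint8_t disk[4] = { 0x01, 0x02, 0x7f, 0x04 };

    compare_rgrp_copies(mem, disk, sizeof(mem));
    return 0;
}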

Comment 20 Wendy Cheng 2007-11-01 05:58:47 UTC
2nd set of test RPMS are uploaded into:
http://people.redhat.com/wcheng/Patches/GFS/R4/bz227892 with a readme file.

When the error condition occurs, this modified GFS will 
* print out debugging info
* give up the block in question; find another block; then keep going.

So there *is* a possibility of corrupting the filesystem, depending on how 
bad the original error condition is. If the customer is ok with this, 
please try it out and pass back the /var/log/messages file when the error 
occurs. Remember to either fsck the partition and/or re-make the fs when
the experiment is done. 

If the customer is not ok with this experiment, let me know too (so I can 
think of another possible way to capture the culprit). 

Comment 21 Wendy Cheng 2007-11-05 03:32:17 UTC
New debug RPMs and readme are in:
http://people.redhat.com/wcheng/Patches/GFS/R4/bz227892/

Upon problem recreation, please send back the /var/log/messages file from the 
problem node. Also try to run it at least twice - so we have at least two
/var/log/messages files to compare. Thanks. 

Comment 22 Issue Tracker 2007-11-05 08:59:57 UTC
Hello Wendy.

Just talked to the lab tech and it seems like the assertion has stopped
happening lately )-:
 Anyway, I've installed your latest patch and I'll let you know if we
catch anything.


Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by djuran 
 issue 135820

Comment 23 Issue Tracker 2007-11-07 11:27:15 UTC
Hello Wendy.

To quote a memorable phrase from "Twin Peaks": It is happening again!
(-:

Anyway, the assertion has started to show up again, this time on all four
nodes and (at least) on two different file systems.  The nodes "nike"
and "reebok" have tainted kernels while the other two nodes should be
clean. I'm attaching the relevant log file excerpts, which contain the
output from your latest debug rpm, in a private attachment. I've also
indicated in there when the nodes have been rebooted.


Internal Status set to 'Waiting on Engineering'
Status set to: Waiting on Tech

This event sent from IssueTracker by djuran 
 issue 135820

Comment 25 Sirius Rayner-Karlsson 2007-11-07 12:38:28 UTC
Customer has increased the severity of the Issue Tracker ticket, commenting that
this issue is preventing application testing. Setting the BZ severity to match.

Comment 26 Wendy Cheng 2007-11-07 16:05:27 UTC
ok.. look into this now. 

Comment 27 Wendy Cheng 2007-11-07 16:37:09 UTC
ok, to be on the safe side, please remove the test RPM from the systems.

From messages.adidas, the on-disk bitmap is the same as the on-memory version,
so it looks to me like there is disk corruption. Please fsck the filesystem and
send back the fsck output. Check "man gfs_fsck" for the correct syntax. 

Comment 28 Wendy Cheng 2007-11-08 03:53:13 UTC
Didn't have time to explain the problem during office hours today, so here 
are the details ...

GFS keeps groups of blocks on disk, called resource groups (RGs), to store 
its metadata (similar to ext3's inode table). Each group starts 
with a header block that holds the relevant statistics (how many free 
meta blocks, how many used meta blocks, etc.). The header is followed
by a pre-defined number of bitmap blocks that record which
disk blocks are actually used and which are free. These bitmap
blocks are read into memory as part of the glock (cluster lock) operation. 
When this issue occurs, the counters in the on-memory header say there are 
146 free blocks available, but the bitmap contents say otherwise. As a 
result, GFS panics itself to avoid any possible further corruption.
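
To make the header-versus-bitmap mismatch concrete, here is a small sketch
that recounts free entries directly from a bitmap buffer and compares the
result with the header's counter. The two-bits-per-block encoding and the
field names are assumptions for illustration, not the exact GFS on-disk
format.

/* Illustrative consistency check between an RG header counter and its
 * bitmap.  Assumes a 2-bits-per-block encoding where value 0 means
 * "free"; the real GFS layout and state values may differ. */
#include <stdint.h>
#include <stdio.h>

struct rg_header_sketch {
    uint32_t free_blocks;       /* what the header claims is free */
};

static uint32_t count_free_from_bitmap(const uint8_t *bmap, size_t bytes)
{
    uint32_t nfree = 0;

    for (size_t i = 0; i < bytes; i++)
        for (int shift = 0; shift < 8; shift += 2)
            if (((bmap[i] >> shift) & 0x3) == 0)    /* state 0 == free */
                nfree++;
    return nfree;
}

/* Returns 0 when header and bitmap agree; otherwise reports the kind of
 * mismatch that, in GFS, ends in the "x <= length" assert and a withdraw. */
static int check_rg_consistency(const struct rg_header_sketch *hdr,
                                const uint8_t *bmap, size_t bytes)
{
    uint32_t from_bitmap = count_free_from_bitmap(bmap, bytes);

    if (from_bitmap != hdr->free_blocks) {
        fprintf(stderr, "RG corrupt: header says %u free, bitmap says %u\n",
                hdr->free_blocks, from_bitmap);
        return -1;
    }
    return 0;
}

int main(void)
{
    uint8_t bitmap[2] = { 0x55, 0xff };     /* no 2-bit field is 0: 0 free */
    struct rg_header_sketch hdr = { .free_blocks = 2 };    /* header lies */

    return check_rg_consistency(&hdr, bitmap, sizeof(bitmap)) ? 1 : 0;
}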

What I did in the debug RPM is, upon the event:
1. save the on-memory blocks 
2. re-read the on-disk blocks
3. compare (1) and (2)
4. print out the differences.

From Dave's messages file, we found that this event *always* occurs 
on the very same resource group (its header block address is 1834264). 
The header says it has 146 free meta blocks, but the bitmap says otherwise.
The on-memory contents and on-disk blocks are identical except for the
header's generation number. This implies the on-disk header and the
on-disk bitmap are inconsistent with each other - so an fsck to fix it is
*required*. This also explains why this problem is so easy to "recreate":
this particular rg on disk was corrupted some time ago, and as soon as
gfs tries to use it, this panic (assert) occurs. Note that GFS more or
less "rotates" among different RGs, so the problem may not show up
immediately, only when allocation comes back around to this particular
resource group. So our first step is to fix this disk inconsistency
(as a workaround) while we are trying to figure out the root cause of
the original corruption. 

Note that the gfs fsck utility in RHEL4 is known to have its own issues 
too. Please pass back the fsck output for sanity check.

Comment 29 Wendy Cheng 2007-11-27 17:39:12 UTC
Update on this problem:

* The fsck output *does* show on-disk bitmap corruptions. 
* Based on conversation with our field engineers, after fsck, the problem
  disappears.
* Now the issue is how and why it got corrupted in the first place. 

I was hoping the customer had a way to recreate the corruption, but 
it turns out the "recreation" was due to the already-corrupted on-disk 
bitmap, so it is not very useful. I was also hoping to see minor corruption
(say, a bitmap in the middle of being transformed from free to un-used -
then it could be our glock flush flaw), but that also turned out not to be
true (the fsck shows data corruption as well). The rgrp handling code is one
of the GFS2 core functions that are complicated and dangerous to mess with.
I'm reluctant to make any "guessing" fix unless the evidence and rationale
are strong. 

Will keep watching and monitoring this issue. However, until a recreation 
method is found that can be used to verify the fix and/or any code changes,
I don't plan to rush into any solution.

BTW, the corruption can come from anywhere (an unclean shutdown, an abrupt 
interruption during journal replay, etc.), so this is really a hard problem
to shoot.  



Comment 31 Wendy Cheng 2008-01-02 15:57:58 UTC
I accidentally recreated a similar issue last month by mounting gfs with the 
"nolock" protocol on two nodes. It may not be the cause of the corruption
that the customers encounter, but it shows that at least one source of the
problem is locking errors. It is hard to prevent this type of error, though.

Two side effects of this problem are:
* the filesystem keeps being used without knowing that one particular Resource 
  Group has been corrupted, until that RG is used again. Note that GFS can
  rotate the RGs during block allocation. 
* the persistence of the corruption gives people the wrong impression 
  that this problem can easily re-occur.

Without a problem recreation method that can be used to trace the real 
culprit, plus the above two side effects, we'll tentatively settle this 
issue with a change that will:

1. Warn people about the RG corruption and request a gfs_fsck
2. Upon error detection, perform a minimal error data collection (and
   hopefully we'll eventually figure this out by observing this data;
   a rough sketch follows below).
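
A minimal, hypothetical sketch of the warn-and-collect idea: instead of a
bare assert, report the inconsistency, ask for a gfs_fsck run, and log the
numbers needed for later analysis. Every name here is illustrative; this is
not the code that shipped in the fix.

/* Hypothetical sketch of the warn-and-collect approach described above.
 * Names and structures are illustrative only. */
#include <stdio.h>
#include <stdint.h>

struct rg_report_sketch {
    uint64_t header_addr;       /* disk address of the RG header block   */
    uint32_t header_free;       /* free blocks according to the header   */
    uint32_t bitmap_free;       /* free blocks recounted from the bitmap */
};

static void warn_and_collect(const char *fsid,
                             const struct rg_report_sketch *rg)
{
    /* 1. warn and request a gfs_fsck ... */
    fprintf(stderr,
            "GFS: fsid=%s: resource group at %llu looks corrupt; "
            "please run gfs_fsck on this filesystem\n",
            fsid, (unsigned long long)rg->header_addr);
    /* 2. ... and collect the minimal data needed for later analysis */
    fprintf(stderr,
            "GFS: fsid=%s: header says %u free block(s), bitmap says %u\n",
            fsid, rg->header_free, rg->bitmap_free);
}

int main(void)
{
    /* header address and free count taken from comment 28 */
    struct rg_report_sketch rg = { 1834264, 146, 0 };

    warn_and_collect("example:fs.0", &rg);
    return 0;
}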

If anyone knows a recreation method that can be used in our lab environment,
please do let us know. 

Comment 32 Wendy Cheng 2008-01-02 19:17:41 UTC
Code checked into CVS today. Should make RHEL 4.7. 

Comment 34 Nate Straz 2008-03-04 22:33:05 UTC
I hit this during recovery testing on ppc for 4.6.z.  We probably should make
sure this is in the errata for 4.7.

Comment 35 RHEL Program Management 2008-03-04 22:48:32 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 38 errata-xmlrpc 2008-07-25 19:27:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0802.html


