Bug 499333 - Kernel BUG at fs/gfs2/rgrp.c:1458
Status: CLOSED DUPLICATE of bug 500483
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: Steve Whitehouse
QA Contact: Cluster QE
Reported: 2009-05-06 04:20 EDT by Carlo de Wolf
Modified: 2009-12-09 06:34 EST (History)
8 users

Doc Type: Bug Fix
Last Closed: 2009-12-09 06:34:33 EST


Attachments
x86_64 binary for bz 500483 (515.73 KB, application/octet-stream)
2009-07-31 10:45 EDT, Robert Peterson
Description Carlo de Wolf 2009-05-06 04:20:22 EDT
Description of problem:

# ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/gfs2/rgrp.c:1458
invalid opcode: 0000 [1] SMP
last sysfs file: /kernel/dlm/shared/id
CPU 0
Modules linked in: deflate zlib_deflate ccm serpent blowfish twofish ecb xcbc crypto_hash cbc md5 sha256 sha512 des aes_generic testmgr_cipher testmgr crypto_blkcipher aes_x86_64 ipcomp6 ipcomp ah6 ah4 esp6 xfrm6_esp esp4 xfrm4_esp aead crypto_algapi xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_tunnel xfrm6_tunnel tunnel6 af_key autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh scsi_mod parport_pc lp parport xennet pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4272, comm: wget Not tainted 2.6.18-136.el5xen #1
RIP: e030:[<ffffffff88354e37>]  [<ffffffff88354e37>] :gfs2:gfs2_alloc_data+0x62/0x127
RSP: e02b:ffff88004fd6b9e8  EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffff88006f011788 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88003ce9bff8 RDI: ffff88003ce9b018
RBP: ffff880027e5b990 R08: 0000000000000006 R09: 0000000000000000
R10: ffff88003ce9c000 R11: 5555555555555555 R12: ffff880020022a90
R13: ffff880074eeae00 R14: ffff8800708e0000 R15: 0000000000000001
FS:  00002b5571f7cb70(0000) GS:ffffffff805bb000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process wget (pid: 4272, threadinfo ffff88004fd6a000, task ffff880004edc0c0)
Stack:  ffff880027e5b990  ffff8800200369b8  ffff880020022a90  ffff88004fd6baa0
ffff88004fd6bab4  ffffffff88338d79  ffff88001eee7f10  ffff880027e5b990
ffff88004e6d0670  0000000000000001
Call Trace:
[<ffffffff88338d79>] :gfs2:lookup_block+0x9b/0x109
[<ffffffff88338ff7>] :gfs2:gfs2_block_map+0x210/0x33e
[<ffffffff80223970>] alloc_buffer_head+0x31/0x36
[<ffffffff8022fe6c>] alloc_page_buffers+0x81/0xd3
[<ffffffff8020eee8>] __block_prepare_write+0x1b6/0x43e
[<ffffffff88338de7>] :gfs2:gfs2_block_map+0x0/0x33e
[<ffffffff8023e9fa>] block_prepare_write+0x1a/0x25
[<ffffffff883491ee>] :gfs2:gfs2_write_begin+0x2cf/0x36a
[<ffffffff8834aa57>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
[<ffffffff8834ae8d>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
[<ffffffff80409a8d>] sock_aio_read+0x4f/0x5e
[<ffffffff8834b030>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
[<ffffffff8029b0ff>] autoremove_wake_function+0x0/0x2e
[<ffffffff8029b0ff>] autoremove_wake_function+0x0/0x2e
[<ffffffff8834b180>] :gfs2:gfs2_file_write+0x49/0xa7
[<ffffffff802175e9>] vfs_write+0xce/0x174
[<fffff 

Version-Release number of selected component (if applicable):

kernel 2.6.18-136.el5xen
gfs2-utils 0.1.53-1.el5

How reproducible:

sporadic

Steps to Reproduce:
1. unknown
2.
3.
  
Actual results:

Kernel BUG

Expected results:

No kernel BUG

Additional info:
Comment 1 Carlo de Wolf 2009-05-06 05:12:32 EDT
Unmounted on all nodes and ran a fsck. It shouldn't have any trouble finding a free block after that.
Note that it crashed right away again on another node once the filesystem was back in use.

# fsck.gfs2 -y /dev/mapper/FAST-SHARED 
Initializing fsck
Recovering journals (this may take a while)....
Journal recovery complete.
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Starting pass1
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Directory block 1073035(0x105f8b), entry 1 of directory 1073035(0x105f8b) is corrupt.
No '.' entry found
Entries is 0 - should be 1 for inode block 1073035 (0x105f8b)
Pass2 complete      
Starting pass3
Found unlinked directory at block 1073035 (0x105f8b)
Unlinked directory has zero size.
Pass3 complete      
Starting pass4
Link count inconsistent for inode 1073035 (0x105f8b) has 0 but fsck found 1.
Link count updated for inode 1073035 (0x105f8b) 
Pass4 complete      
Starting pass5
Ondisk and fsck bitmaps differ at block 1073035 (0x105f8b) 
Ondisk status is 3 (inode) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #1049072 (0x1001f0) free count inconsistent: is 61284 should be 61285
Inode count inconsistent: is 2251 should be 2250
Resource group counts updated
RG #5505452 (0x5401ac) free count inconsistent: is 19 should be 16
Inode count inconsistent: is 15285 should be 15288
Resource group counts updated
Unlinked block found at block 5580916 (0x552874), left unchanged.
Unlinked block found at block 5580920 (0x552878), left unchanged.
Unlinked block found at block 5580936 (0x552888), left unchanged.
Unlinked block found at block 5580937 (0x552889), left unchanged.
Unlinked block found at block 5580938 (0x55288a), left unchanged.
Unlinked block found at block 5580939 (0x55288b), left unchanged.
Unlinked block found at block 5580940 (0x55288c), left unchanged.
Unlinked block found at block 5580941 (0x55288d), left unchanged.
Unlinked block found at block 5580950 (0x552896), left unchanged.
Unlinked block found at block 5580953 (0x552899), left unchanged.
Unlinked block found at block 5580954 (0x55289a), left unchanged.
Unlinked block found at block 5580955 (0x55289b), left unchanged.
Unlinked block found at block 5580956 (0x55289c), left unchanged.
Unlinked block found at block 5580961 (0x5528a1), left unchanged.
Unlinked block found at block 5581293 (0x5529ed), left unchanged.
Unlinked block found at block 5581641 (0x552b49), left unchanged.
Unlinked block found at block 5581642 (0x552b4a), left unchanged.
Unlinked block found at block 5581702 (0x552b86), left unchanged.
Unlinked block found at block 5581808 (0x552bf0), left unchanged.
Unlinked block found at block 5581842 (0x552c12), left unchanged.
Unlinked block found at block 5581843 (0x552c13), left unchanged.
Unlinked block found at block 5581844 (0x552c14), left unchanged.
Unlinked block found at block 5581955 (0x552c83), left unchanged.
Unlinked block found at block 5582268 (0x552dbc), left unchanged.
Unlinked block found at block 5582269 (0x552dbd), left unchanged.
Unlinked block found at block 5582368 (0x552e20), left unchanged.
Unlinked block found at block 5582592 (0x552f00), left unchanged.
Unlinked block found at block 5582593 (0x552f01), left unchanged.
Unlinked block found at block 5582821 (0x552fe5), left unchanged.
Unlinked block found at block 5582822 (0x552fe6), left unchanged.
Unlinked block found at block 5582922 (0x55304a), left unchanged.
Unlinked block found at block 5582923 (0x55304b), left unchanged.
Unlinked block found at block 5582924 (0x55304c), left unchanged.
Unlinked block found at block 5582937 (0x553059), left unchanged.
Unlinked block found at block 5582938 (0x55305a), left unchanged.
Unlinked block found at block 5583067 (0x5530db), left unchanged.
Unlinked block found at block 5583393 (0x553221), left unchanged.
Unlinked block found at block 5583591 (0x5532e7), left unchanged.
Unlinked block found at block 5583777 (0x5533a1), left unchanged.
Unlinked block found at block 5583778 (0x5533a2), left unchanged.
Unlinked block found at block 5583780 (0x5533a4), left unchanged.
Unlinked block found at block 5583919 (0x55342f), left unchanged.
Unlinked block found at block 5583921 (0x553431), left unchanged.
Unlinked block found at block 5583923 (0x553433), left unchanged.
Unlinked block found at block 5583977 (0x553469), left unchanged.
Unlinked block found at block 5584206 (0x55354e), left unchanged.
Unlinked block found at block 5584208 (0x553550), left unchanged.
Unlinked block found at block 5584297 (0x5535a9), left unchanged.
Unlinked block found at block 5584298 (0x5535aa), left unchanged.
Unlinked block found at block 5584501 (0x553675), left unchanged.
Unlinked block found at block 5585022 (0x55387e), left unchanged.
Unlinked block found at block 5585106 (0x5538d2), left unchanged.
Unlinked block found at block 5585509 (0x553a65), left unchanged.
Unlinked block found at block 5585510 (0x553a66), left unchanged.
Unlinked block found at block 5585756 (0x553b5c), left unchanged.
Unlinked block found at block 5585838 (0x553bae), left unchanged.
Unlinked block found at block 6661004 (0x65a38c), left unchanged.
Unlinked block found at block 6661005 (0x65a38d), left unchanged.
Unlinked block found at block 6661056 (0x65a3c0), left unchanged.
Unlinked block found at block 6671651 (0x65cd23), left unchanged.
Unlinked block found at block 6675991 (0x65de17), left unchanged.
Unlinked block found at block 6675993 (0x65de19), left unchanged.
Unlinked block found at block 6676001 (0x65de21), left unchanged.
Unlinked block found at block 6676002 (0x65de22), left unchanged.
Unlinked block found at block 6676006 (0x65de26), left unchanged.
Unlinked block found at block 6676009 (0x65de29), left unchanged.
Unlinked block found at block 6676074 (0x65de6a), left unchanged.
Unlinked block found at block 6676075 (0x65de6b), left unchanged.
Unlinked block found at block 6676076 (0x65de6c), left unchanged.
Unlinked block found at block 6676077 (0x65de6d), left unchanged.
Unlinked block found at block 6676078 (0x65de6e), left unchanged.
Unlinked block found at block 6676079 (0x65de6f), left unchanged.
Unlinked block found at block 6676080 (0x65de70), left unchanged.
Unlinked block found at block 6676081 (0x65de71), left unchanged.
Unlinked block found at block 6676082 (0x65de72), left unchanged.
Unlinked block found at block 6676149 (0x65deb5), left unchanged.
Unlinked block found at block 6676150 (0x65deb6), left unchanged.
Unlinked block found at block 6676287 (0x65df3f), left unchanged.
Unlinked block found at block 6676288 (0x65df40), left unchanged.
Unlinked block found at block 6676289 (0x65df41), left unchanged.
Unlinked block found at block 6676290 (0x65df42), left unchanged.
Unlinked block found at block 6676572 (0x65e05c), left unchanged.
Unlinked block found at block 6676573 (0x65e05d), left unchanged.
Unlinked block found at block 6676574 (0x65e05e), left unchanged.
Unlinked block found at block 6676712 (0x65e0e8), left unchanged.
RG #6619547 (0x65019b) free count inconsistent: is 59773 should be 59802
Inode count inconsistent: is 32 should be 3
Resource group counts updated
Pass5 complete      
Writing changes to disk
gfs2_fsck complete
Comment 2 Steve Whitehouse 2009-05-06 07:04:46 EDT
What is the application that you are running actually doing? Are you running the disk close to capacity?
Comment 3 Carlo de Wolf 2009-05-06 07:31:11 EDT
We're running a Hudson instance on another machine, which has two slaves (Xen VMs) running on this machine.

dom0 and the two slaves form a 'cluster' around /shared, so that we have a filesystem that's shared between dom0 and the VMs.

From time to time (almost always) the Hudson machine will start a job (via ssh) on one of the slaves. As soon as both slaves are engaged, one of them will kernel panic.

# df -h /shared
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/FAST-SHARED
                      128G   17G  112G  13% /shared
Comment 4 Steve Whitehouse 2009-05-06 09:23:25 EDT
I know nothing about Hudson. What is it actually doing at a file system level? Just normal file reads/writes, or does it create directories, remove files or directories, mmap things? I'm just trying to get a rough idea of the normal I/O going through the system.

The bug that is being tripped occurs when, after having reserved one or more blocks in a particular resource group, the allocation fails due to not being able to find a suitable free block in that rgrp. So in other words it looks like the summary information for the rgrp doesn't match the bitmap for some reason.

I'm not sure why you are getting messages that it's leaving unlinked blocks unchanged; it ought to be deallocating them if they are unlinked. Perhaps Bob can shed some light on that?
Comment 5 Carlo de Wolf 2009-05-06 10:18:55 EDT
As Hudson starts up it'll do:
$ ssh 192.168.102.11 ~/common/slave.sh

/home/hudson/common/slave.sh:
#!/bin/sh
wget -q -N http://<hudson>:8380/hudson/jnlpJars/slave.jar -O slave.jar
exec /usr/java/jdk1.6.0_11/bin/java -jar slave.jar

Then, when the time comes, it'll create a workspace directory (named differently for each job) and run ~/common/run_tck.sh (see below).
It effectively crashes in the wget.

So basically it boils down to a wget doing a bit of I/O within a Xen slave on a dom0 device.

/home/hudson/common/run_tck.sh:
if [ $# != 1 ]; then
   echo 1>&2 "Usage: $0 <tests>"
   exit 1
fi
TESTS=$1

if [ -z "$JAVA_HOME" ]; then
   echo "JAVA_HOME is not set (no JDK selected)"
   exit 1
fi

set -x

#wget -N http://<hudson>:8380/hudson/job/JBoss-AS-5.x-plugged/lastSuccessfulBuild/artifact/jboss/jboss-5.x-plugged.zip
wget -nv -N http://<hudson>:8380/hudson/job/JBoss-AS-5.x-latest/lastSuccessfulBuild/artifact/Branch_5_x/build/output/jboss-5.x-latest.zip
#if [ jboss-5.x-latest.zip -nt jboss ]; then
   rm -rf jboss
   unzip -q -d jboss jboss-5.x-latest.zip
   touch jboss
#fi

# Nuke any previous results so they won't interfere for sure
rm -rf javaeetck/bin/JTreport
rm -rf javaeetck/bin/JTwork

wget -nv -N http://<hudson>:8380/hudson/job/tck51_package/lastSuccessfulBuild/artifact/javaeetck.zip
if [ javaeetck.zip -nt javaeetck ]; then
   rm -rf javaeetck
   unzip javaeetck.zip
   touch javaeetck
fi

wget -nv -N http://<hudson>:8380/hudson/job/glassfish-package/lastSuccessfulBuild/artifact/glassfish.zip
if [ glassfish.zip -nt glassfish ]; then
   rm -rf glassfish
   unzip glassfish.zip
   touch glassfish
fi

export JAVAEE_HOME=${WORKSPACE}/glassfish
export JBOSS_HOME=`echo ${WORKSPACE}/jboss/*`
export TS_HOME=`echo ${WORKSPACE}/javaeetck`

cd javaeetck/j2eetck-mods
/opt/apache/ant/apache-ant-1.7.1/bin/ant

cd ${WORKSPACE}/javaeetck/bin
./tsant config.vi



cd $TS_HOME/bin
./tsant -f xml/s1as.xml start.javadb
./tsant init.javadb

(cd $JBOSS_HOME/bin; ./run.sh -c cts -b localhost) &
PID=$!

trap "${JBOSS_HOME}/bin/shutdown.sh -S; ./stop-javadb; sleep 15; /sbin/fuser -k $JBOSS_HOME/bin/run.jar" EXIT

~/common/waitfor $JBOSS_HOME/server/cts/log/server.log "Started in" 180

set -x
./tsant "-Dmultiple.tests=$TESTS" runclient

/usr/java/jdk1.5.0_17/bin/java -cp ../lib/javatest.jar:../lib/tsharness.jar:../lib/cts.jar com.sun.javatest.cof.Main -o JTreport/report.xml JTwork
#/usr/java/jdk1.5.0_17/bin/java -cp /home/carlo/tools/jtharness-4.1.4-MR1-b17/lib/javatest.jar:../lib/cts.jar:../lib/tsharness.jar com.sun.javatest.tool.Main -testsuite ${TS_HOME}/src/ -workDir JTwork  -writeReport -type xml JTreport

#kill $PID
#./stop-javadb
Comment 6 Carlo de Wolf 2009-05-06 10:25:42 EDT
Note that the system has already had some crashes/hangs and needed to be rebooted with -f -n.

The wicked thing is that, after the fsck, it should have come up with a stable fs.
Comment 7 Carlo de Wolf 2009-05-06 10:33:14 EDT
/etc/cluster/cluster.conf:
<?xml version="1.0"?>
<!--
	vim:ts=3:sw=3:
-->
<cluster name="boo" config_version="12" alias="jboss-ejb3">
	<!-- 
		post_fail_delay is 0, so there should be no blocking when a node needs
		a reboot.
		post_join_delay is 3, so after 3 seconds all other nodes are fenced.
	-->
	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="192.168.102.1" nodeid="1">
			<fence>
				<!-- hur, hur, hur :-) -->
				<method name="first">
					<device name="xvm" domain="Domain-0"/>
				</method>
				<!-- pull the plug pretty please -->
				<method name="second">
					<device name="manual"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="slave1b.localdomain" nodeid="2" votes="0">
			<fence>
				<method name="single">
					<device name="xvm" domain="rhel-5.3-guest1"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="slave2b.localdomain" nodeid="3" votes="0">
			<fence>
				<method name="single">
					<device name="xvm" domain="rhel-5.3-guest2"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<!-- we only need dom0 to run -->
	<cman expected_votes="1"/>
	<fence_xvmd/>
	<gfs_controld/>
	<fencedevices>
		<fencedevice name="manual" agent="fence_manual"/>
		<fencedevice name="xvm" agent="fence_xvm"/>
	</fencedevices>
	<rm>
		<failoverdomains/>
		<resources/>
		<service name="test" autorestart="0"/>
	</rm>
</cluster>
Comment 8 Steve Whitehouse 2009-05-21 06:50:52 EDT
Although it doesn't fix this issue, I have already pushed an upstream patch to improve error handling in gfs2 in relation to resource groups. It should mean that the fs will be more resilient to these kinds of failure in the future.
Comment 10 Peter Schobel 2009-07-22 16:58:55 EDT
I am having difficulty with this bug. It occurs frequently (several times daily), and the process that causes the exception varies. Following are a couple of stack traces. I have just upgraded the kernel to 2.6.18-128.2.1.el5PAE, which didn't resolve the issue.

------------[ cut here ]------------
kernel BUG at fs/gfs2/rgrp.c:1458!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec
i2c_cord
CPU:    0
EIP:    0060:[<f96e04da>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-128.1.16.el5PAE #1)
EIP is at gfs2_alloc_data+0x75/0x155 [gfs2]
eax: ffffffff   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: 05ec1513   edi: 00000000   ebp: f51aa114   esp: edf96c74
ds: 007b   es: 007b   ss: 0068
Process p4v.bin (pid: 8895, ti=edf96000 task=ccfe0550 task.ti=edf96000)
Stack: d3b34548 f7242000 d9663380 00000000 d3b34548 00000000 d4575000 f96c4db2
       cf449378 d4575140 00001000 00000000 cf449378 d3b34548 edf96cf4 f96c50a0
       edf96cf4 00000001 edf96d18 edf96d10 00000000 0000000c 00000000 0000c000
Call Trace:
 [<f96c4db2>] lookup_block+0xb4/0x153 [gfs2]
 [<f96c50a0>] gfs2_block_map+0x24f/0x392 [gfs2]
 [<c047364e>] set_bh_page+0x43/0x4c
 [<c047371f>] alloc_page_buffers+0x74/0xba
 [<c0474455>] __block_prepare_write+0x1a2/0x439
 [<f96cc405>] do_promote+0xe8/0x10b [gfs2]
 [<c0474702>] block_prepare_write+0x16/0x23
 [<f96c4e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d5005>] gfs2_write_begin+0x2af/0x359 [gfs2]
 [<f96c4e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d6823>] gfs2_file_buffered_write+0x10d/0x287 [gfs2]
 [<c0428969>] current_fs_time+0x4a/0x55
 [<f96d6c71>] __gfs2_file_aio_write_nolock+0x2d4/0x32d [gfs2]
 [<c05ab197>] sock_aio_read+0x53/0x61
 [<f96d6e24>] gfs2_file_write_nolock+0xb0/0x111 [gfs2]
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<f96d6f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<f96d6f55>] gfs2_file_write+0x3a/0x94 [gfs2]
 [<f96d6f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<c04720ef>] vfs_write+0xa1/0x143
 [<c04726e1>] sys_write+0x3c/0x63
 [<c0404ead>] sysenter_past_esp+0x56/0x79
 =======================
Code: 16 31 d2 01 f0 11 fa 39 d3 77 0c 72 04 39 c1 73 06 89 ca 29 f2 eb 03 8b 55 70 31 c9 89 e8 6a 01 e8 39 e8 ff ff 5a 83 f8 ff 75 08 <0f> 0b b2 05
7c 5d 6
EIP: [<f96e04da>] gfs2_alloc_data+0x75/0x155 [gfs2] SS:ESP 0068:edf96c74
 <0>Kernel panic - not syncing: Fatal exception


kernel BUG at fs/gfs2/rgrp.c:1458!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:04:00.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/irq
Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec
i2c_cord
CPU:    1
EIP:    0060:[<f96df4da>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-128.1.16.el5PAE #1)
EIP is at gfs2_alloc_data+0x75/0x155 [gfs2]
eax: ffffffff   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: 05ec1513   edi: 00000000   ebp: f0e11858   esp: f01f1c74
ds: 007b   es: 007b   ss: 0068
Process cp (pid: 31700, ti=f01f1000 task=f2e83550 task.ti=f01f1000)
Stack: f4361d68 f6cb1000 e45b7dc0 00000000 f4361d68 00000001 d25b8000 f96c3db2
       d12d53ac d25b8e10 00001000 00000001 e40686ec f4361d68 f01f1cf4 f96c40a0
       f01f1cf4 00000001 f01f1d18 f01f1d10 00000000 000013a5 00000000 013a5000
Call Trace:
 [<f96c3db2>] lookup_block+0xb4/0x153 [gfs2]
 [<f96c40a0>] gfs2_block_map+0x24f/0x392 [gfs2]
 [<c047364e>] set_bh_page+0x43/0x4c
 [<c047371f>] alloc_page_buffers+0x74/0xba
 [<c0474455>] __block_prepare_write+0x1a2/0x439
 [<f96cb405>] do_promote+0xe8/0x10b [gfs2]
 [<c0474702>] block_prepare_write+0x16/0x23
 [<f96c3e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d4005>] gfs2_write_begin+0x2af/0x359 [gfs2]
 [<f96c3e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d5823>] gfs2_file_buffered_write+0x10d/0x287 [gfs2]
 [<c0428969>] current_fs_time+0x4a/0x55
 [<f96d5c71>] __gfs2_file_aio_write_nolock+0x2d4/0x32d [gfs2]
 [<f96d5e24>] gfs2_file_write_nolock+0xb0/0x111 [gfs2]
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<f96d5f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<f96d5f55>] gfs2_file_write+0x3a/0x94 [gfs2]
 [<f96d5f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<c04720ef>] vfs_write+0xa1/0x143
 [<c04726e1>] sys_write+0x3c/0x63
 [<c0404ead>] sysenter_past_esp+0x56/0x79
 =======================
Code: 16 31 d2 01 f0 11 fa 39 d3 77 0c 72 04 39 c1 73 06 89 ca 29 f2 eb 03 8b 55 70 31 c9 89 e8 6a 01 e8 39 e8 ff ff 5a 83 f8 ff 75 08 <0f> 0b b2 05
7c 4d 6
EIP: [<f96df4da>] gfs2_alloc_data+0x75/0x155 [gfs2] SS:ESP 0068:f01f1c74
 <0>Kernel panic - not syncing: Fatal exception
Comment 11 Robert Peterson 2009-07-31 10:45:42 EDT
Created attachment 355822 [details]
x86_64 binary for bz 500483

Perhaps someone can run this x86_64 binary of fsck.gfs2 against
one of the file systems that is damaged, to see if it fixes the
problem.  It would be nice if it were run twice, just to verify
that it fixes all the problems the first time it is run.
Comment 12 Peter Schobel 2009-08-19 11:45:14 EDT
Sorry, in order to get this cluster production-ready I had to revert the GFS filesystems to GFS v1. I cannot perform this test at this time.
Comment 13 Paulo Castro 2009-10-26 16:47:27 EDT
Hi.

I'm actually having the same problem.
Robert, can you generate a 32-bit version of that fsck?
I'll gladly run it against my FS.
Comment 14 Steve Whitehouse 2009-10-27 05:36:07 EDT
Paulo, we can, but it will probably be next week before it's ready, as Bob is away at the moment.

I'm very interested in collecting as much information as possible about the workload just before the failure occurred for the first time (basically, how to reproduce this issue from a newly created filesystem), if it is possible to do so.

Any clues you can provide us with would be very helpful.
Comment 15 Paulo Castro 2009-10-27 15:20:58 EDT
Steve,

In my case it's quite simple to make it happen again.
I currently have the filesystem mounted and working "fine", but as soon as I start an rsync of any group of files into it, it explodes while trying to copy the first file, rendering the mount point totally unusable, with only a reboot capable of saving the box.

I'm currently running:

kernel-2.6.27.25-78.2.56.fc9.i686
gfs2-utils-2.03.11-1.fc9.i386

This is the error I get:

Oct 26 18:51:30 dellix kernel: ------------[ cut here ]------------
Oct 26 18:51:30 dellix kernel: kernel BUG at fs/gfs2/rgrp.c:1442!
Oct 26 18:51:30 dellix kernel: invalid opcode: 0000 [#1] SMP
Oct 26 18:51:30 dellix kernel: Modules linked in: lock_dlm dlm configfs usb_storage gfs2 drbd vmnet ppdev parport_pc parport vsock vmci vmmon iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 iptable_mangle ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables cpufreq_ondemand acpi_cpufreq ipv6 dm_multipath scsi_dh snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq nvidia(P) serio_raw dcdbas snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd_page_alloc e1000e i2c_i801 snd_hwdep iTCO_vendor_support snd sr_mod rt2870sta pcspkr i2c_core cdrom soundcore floppy sg dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_generic pata_acpi ata_piix libata sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Oct 26 18:51:30 dellix kernel:
Oct 26 18:51:30 dellix kernel: Pid: 9239, comm: rsync Tainted: P          (2.6.27.25-78.2.56.fc9.i686 #1) Inspiron 530
Oct 26 18:51:30 dellix kernel: EIP: 0060:[<f90007ab>] EFLAGS: 00210246 CPU: 1
Oct 26 18:51:30 dellix kernel: EIP is at gfs2_alloc_di+0x44/0x113 [gfs2]
Oct 26 18:51:30 dellix kernel: EAX: ffffffff EBX: f350c000 ECX: f272ad70 EDX: 00000003
Oct 26 18:51:30 dellix kernel: ESI: c699a168 EDI: f379d400 EBP: f272ad80 ESP: f272ad64
Oct 26 18:51:30 dellix kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Oct 26 18:51:30 dellix kernel: Process rsync (pid: 9239, ti=f272a000 task=f267b2c0 task.ti=f272a000)
Oct 26 18:51:30 dellix kernel: Stack: f272ae34 f90042b1 00000000 00000000 f350c000 c6e953b0 c6eb01f8 f272ae54
Oct 26 18:51:30 dellix kernel:       f8ff2a0b 00000000 f350c000 f2520d80 f272adc4 f2520d80 00008180 c6e0201c
Oct 26 18:51:30 dellix kernel:       f272ae6c f272adb0 c6eb01f8 00000000 f272adc8 f350c000 f272ae7c f272ae98
Oct 26 18:51:30 dellix kernel: Call Trace:
Oct 26 18:51:30 dellix kernel: [<f90042b1>] ? gfs2_trans_begin+0xdc/0x10f [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8ff2a0b>] ? gfs2_createi+0x574/0xe39 [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8feee58>] ? do_promote+0x3e/0x15b [gfs2]
Oct 26 18:51:30 dellix kernel: [<c041f707>] ? need_resched+0x18/0x22
Oct 26 18:51:30 dellix kernel: [<f8feedec>] ? gfs2_glock_wait+0x2a/0x4c [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8ff059a>] ? gfs2_glock_nq+0x2ac/0x2b8 [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8ffb8ee>] ? gfs2_create+0x51/0x100 [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8ff24ff>] ? gfs2_createi+0x68/0xe39 [gfs2]
Oct 26 18:51:30 dellix kernel: [<c04de0ed>] ? security_inode_permission+0x1e/0x20
Oct 26 18:51:30 dellix kernel: [<c0497aef>] ? inode_permission+0xa0/0xb2
Oct 26 18:51:30 dellix kernel: [<c0497f29>] ? vfs_create+0x61/0x83
Oct 26 18:51:30 dellix kernel: [<c049960e>] ? do_filp_open+0x1a7/0x611
Oct 26 18:51:30 dellix kernel: [<c049960e>] ? do_filp_open+0x1a7/0x611
Oct 26 18:51:30 dellix kernel: [<c041f707>] ? need_resched+0x18/0x22
Oct 26 18:51:30 dellix kernel: [<c048f8f4>] ? do_sys_open+0x42/0xb7
Oct 26 18:51:30 dellix kernel: [<c048f9ab>] ? sys_open+0x1e/0x26
Oct 26 18:51:30 dellix kernel: [<c0404c8a>] ? syscall_call+0x7/0xb
Oct 26 18:51:30 dellix kernel: =======================
Oct 26 18:51:30 dellix kernel: Code: 00 8d 45 f0 8b b7 fc 00 00 00 8b 9a 9c 01 00 00 c7 45 f0 01 00 00 00 8b 56 68 50 89 f0 6a 03 e8 73 fc ff ff 5a 59 83 f8 ff 75 04 <0f> 0b eb fe 89 46 68 89 45 e8 c7 45 ec 00 00 00 00 8b 46 1c 8b
Oct 26 18:51:30 dellix kernel: EIP: [<f90007ab>] gfs2_alloc_di+0x44/0x113 [gfs2] SS:ESP 0068:f272ad64
Oct 26 18:51:30 dellix kernel: ---[ end trace e79d8292c3d80c27 ]---


Let me know if there's more info you might need...
Comment 16 Steve Whitehouse 2009-10-28 13:17:12 EDT
Paulo, the kernel from fc9 is ancient and I'm not surprised that it doesn't work very well. I'd highly recommend upgrading it if you are continuing to run gfs2.

Are you running fsck.gfs2 between those rsync runs? If not, then get a later copy of gfs2-utils and try that. The output from fsck might be useful, but I think your kernel & utils are so old that it is probably not very helpful to the investigation.

The other reports have all been on RHEL 5 (I'd accept CentOS too, as that's pretty close), but Fedora 9 is too different to be much use, I'm afraid.

What I'm looking for, ideally, is a series of steps that says "if I start with a newly mkfs'ed gfs2 filesystem and do X, it always triggers". I know we might not get quite that far, though, as it's not easy to make it happen.
Comment 17 Steve Whitehouse 2009-12-09 06:34:33 EST

*** This bug has been marked as a duplicate of bug 500483 ***
