Description of problem:

# ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/gfs2/rgrp.c:1458
invalid opcode: 0000 [1] SMP
last sysfs file: /kernel/dlm/shared/id
CPU 0
Modules linked in: deflate zlib_deflate ccm serpent blowfish twofish ecb xcbc crypto_hash cbc md5 sha256 sha512 des aes_generic testmgr_cipher testmgr crypto_blkcipher aes_x86_64 ipcomp6 ipcomp ah6 ah4 esp6 xfrm6_esp esp4 xfrm4_esp aead crypto_algapi xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_tunnel xfrm6_tunnel tunnel6 af_key autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh scsi_mod parport_pc lp parport xennet pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache xenblk ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4272, comm: wget Not tainted 2.6.18-136.el5xen #1
RIP: e030:[<ffffffff88354e37>]  [<ffffffff88354e37>] :gfs2:gfs2_alloc_data+0x62/0x127
RSP: e02b:ffff88004fd6b9e8  EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffff88006f011788 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88003ce9bff8 RDI: ffff88003ce9b018
RBP: ffff880027e5b990 R08: 0000000000000006 R09: 0000000000000000
R10: ffff88003ce9c000 R11: 5555555555555555 R12: ffff880020022a90
R13: ffff880074eeae00 R14: ffff8800708e0000 R15: 0000000000000001
FS: 00002b5571f7cb70(0000) GS:ffffffff805bb000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process wget (pid: 4272, threadinfo ffff88004fd6a000, task ffff880004edc0c0)
Stack: ffff880027e5b990 ffff8800200369b8 ffff880020022a90 ffff88004fd6baa0
       ffff88004fd6bab4 ffffffff88338d79 ffff88001eee7f10 ffff880027e5b990
       ffff88004e6d0670 0000000000000001
Call Trace:
 [<ffffffff88338d79>] :gfs2:lookup_block+0x9b/0x109
 [<ffffffff88338ff7>] :gfs2:gfs2_block_map+0x210/0x33e
 [<ffffffff80223970>] alloc_buffer_head+0x31/0x36
 [<ffffffff8022fe6c>] alloc_page_buffers+0x81/0xd3
 [<ffffffff8020eee8>] __block_prepare_write+0x1b6/0x43e
 [<ffffffff88338de7>] :gfs2:gfs2_block_map+0x0/0x33e
 [<ffffffff8023e9fa>] block_prepare_write+0x1a/0x25
 [<ffffffff883491ee>] :gfs2:gfs2_write_begin+0x2cf/0x36a
 [<ffffffff8834aa57>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
 [<ffffffff8834ae8d>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
 [<ffffffff80409a8d>] sock_aio_read+0x4f/0x5e
 [<ffffffff8834b030>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
 [<ffffffff8029b0ff>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8029b0ff>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8834b180>] :gfs2:gfs2_file_write+0x49/0xa7
 [<ffffffff802175e9>] vfs_write+0xce/0x174
 [<fffff

Version-Release number of selected component (if applicable):
kernel 2.6.18-136.el5xen
gfs2-utils 0.1.53-1.el5

How reproducible:
sporadic

Steps to Reproduce:
1. unknown
2.
3.

Actual results:
Kernel BUG

Expected results:
!Kernel BUG

Additional info:
Unmounted on all nodes and did an fsck. It shouldn't have any trouble finding a free block after that. Note that it crashed right away again on another node once the filesystem was back in use.

# fsck.gfs2 -y /dev/mapper/FAST-SHARED
Initializing fsck
Recovering journals (this may take a while)....
Journal recovery complete.
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Directory block 1073035(0x105f8b), entry 1 of directory 1073035(0x105f8b) is corrupt.
No '.' entry found
Entries is 0 - should be 1 for inode block 1073035 (0x105f8b)
Pass2 complete
Starting pass3
Found unlinked directory at block 1073035 (0x105f8b)
Unlinked directory has zero size.
Pass3 complete
Starting pass4
Link count inconsistent for inode 1073035 (0x105f8b) has 0 but fsck found 1.
Link count updated for inode 1073035 (0x105f8b)
Pass4 complete
Starting pass5
Ondisk and fsck bitmaps differ at block 1073035 (0x105f8b)
Ondisk status is 3 (inode) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #1049072 (0x1001f0) free count inconsistent: is 61284 should be 61285
Inode count inconsistent: is 2251 should be 2250
Resource group counts updated
RG #5505452 (0x5401ac) free count inconsistent: is 19 should be 16
Inode count inconsistent: is 15285 should be 15288
Resource group counts updated
Unlinked block found at block 5580916 (0x552874), left unchanged.
Unlinked block found at block 5580920 (0x552878), left unchanged.
Unlinked block found at block 5580936 (0x552888), left unchanged.
Unlinked block found at block 5580937 (0x552889), left unchanged.
Unlinked block found at block 5580938 (0x55288a), left unchanged.
Unlinked block found at block 5580939 (0x55288b), left unchanged.
Unlinked block found at block 5580940 (0x55288c), left unchanged.
Unlinked block found at block 5580941 (0x55288d), left unchanged.
Unlinked block found at block 5580950 (0x552896), left unchanged.
Unlinked block found at block 5580953 (0x552899), left unchanged.
Unlinked block found at block 5580954 (0x55289a), left unchanged.
Unlinked block found at block 5580955 (0x55289b), left unchanged.
Unlinked block found at block 5580956 (0x55289c), left unchanged.
Unlinked block found at block 5580961 (0x5528a1), left unchanged.
Unlinked block found at block 5581293 (0x5529ed), left unchanged.
Unlinked block found at block 5581641 (0x552b49), left unchanged.
Unlinked block found at block 5581642 (0x552b4a), left unchanged.
Unlinked block found at block 5581702 (0x552b86), left unchanged.
Unlinked block found at block 5581808 (0x552bf0), left unchanged.
Unlinked block found at block 5581842 (0x552c12), left unchanged.
Unlinked block found at block 5581843 (0x552c13), left unchanged.
Unlinked block found at block 5581844 (0x552c14), left unchanged.
Unlinked block found at block 5581955 (0x552c83), left unchanged.
Unlinked block found at block 5582268 (0x552dbc), left unchanged.
Unlinked block found at block 5582269 (0x552dbd), left unchanged.
Unlinked block found at block 5582368 (0x552e20), left unchanged.
Unlinked block found at block 5582592 (0x552f00), left unchanged.
Unlinked block found at block 5582593 (0x552f01), left unchanged.
Unlinked block found at block 5582821 (0x552fe5), left unchanged.
Unlinked block found at block 5582822 (0x552fe6), left unchanged.
Unlinked block found at block 5582922 (0x55304a), left unchanged.
Unlinked block found at block 5582923 (0x55304b), left unchanged.
Unlinked block found at block 5582924 (0x55304c), left unchanged.
Unlinked block found at block 5582937 (0x553059), left unchanged.
Unlinked block found at block 5582938 (0x55305a), left unchanged.
Unlinked block found at block 5583067 (0x5530db), left unchanged.
Unlinked block found at block 5583393 (0x553221), left unchanged.
Unlinked block found at block 5583591 (0x5532e7), left unchanged.
Unlinked block found at block 5583777 (0x5533a1), left unchanged.
Unlinked block found at block 5583778 (0x5533a2), left unchanged.
Unlinked block found at block 5583780 (0x5533a4), left unchanged.
Unlinked block found at block 5583919 (0x55342f), left unchanged.
Unlinked block found at block 5583921 (0x553431), left unchanged.
Unlinked block found at block 5583923 (0x553433), left unchanged.
Unlinked block found at block 5583977 (0x553469), left unchanged.
Unlinked block found at block 5584206 (0x55354e), left unchanged.
Unlinked block found at block 5584208 (0x553550), left unchanged.
Unlinked block found at block 5584297 (0x5535a9), left unchanged.
Unlinked block found at block 5584298 (0x5535aa), left unchanged.
Unlinked block found at block 5584501 (0x553675), left unchanged.
Unlinked block found at block 5585022 (0x55387e), left unchanged.
Unlinked block found at block 5585106 (0x5538d2), left unchanged.
Unlinked block found at block 5585509 (0x553a65), left unchanged.
Unlinked block found at block 5585510 (0x553a66), left unchanged.
Unlinked block found at block 5585756 (0x553b5c), left unchanged.
Unlinked block found at block 5585838 (0x553bae), left unchanged.
Unlinked block found at block 6661004 (0x65a38c), left unchanged.
Unlinked block found at block 6661005 (0x65a38d), left unchanged.
Unlinked block found at block 6661056 (0x65a3c0), left unchanged.
Unlinked block found at block 6671651 (0x65cd23), left unchanged.
Unlinked block found at block 6675991 (0x65de17), left unchanged.
Unlinked block found at block 6675993 (0x65de19), left unchanged.
Unlinked block found at block 6676001 (0x65de21), left unchanged.
Unlinked block found at block 6676002 (0x65de22), left unchanged.
Unlinked block found at block 6676006 (0x65de26), left unchanged.
Unlinked block found at block 6676009 (0x65de29), left unchanged.
Unlinked block found at block 6676074 (0x65de6a), left unchanged.
Unlinked block found at block 6676075 (0x65de6b), left unchanged.
Unlinked block found at block 6676076 (0x65de6c), left unchanged.
Unlinked block found at block 6676077 (0x65de6d), left unchanged.
Unlinked block found at block 6676078 (0x65de6e), left unchanged.
Unlinked block found at block 6676079 (0x65de6f), left unchanged.
Unlinked block found at block 6676080 (0x65de70), left unchanged.
Unlinked block found at block 6676081 (0x65de71), left unchanged.
Unlinked block found at block 6676082 (0x65de72), left unchanged.
Unlinked block found at block 6676149 (0x65deb5), left unchanged.
Unlinked block found at block 6676150 (0x65deb6), left unchanged.
Unlinked block found at block 6676287 (0x65df3f), left unchanged.
Unlinked block found at block 6676288 (0x65df40), left unchanged.
Unlinked block found at block 6676289 (0x65df41), left unchanged.
Unlinked block found at block 6676290 (0x65df42), left unchanged.
Unlinked block found at block 6676572 (0x65e05c), left unchanged.
Unlinked block found at block 6676573 (0x65e05d), left unchanged.
Unlinked block found at block 6676574 (0x65e05e), left unchanged.
Unlinked block found at block 6676712 (0x65e0e8), left unchanged.
RG #6619547 (0x65019b) free count inconsistent: is 59773 should be 59802
Inode count inconsistent: is 32 should be 3
Resource group counts updated
Pass5 complete
Writing changes to disk
gfs2_fsck complete
What is the application that you are running actually doing? Are you running the disk close to capacity?
We're running a Hudson instance on another machine, which has two slaves (Xen VMs) running on this machine. dom0 and the two slaves form a 'cluster' around /shared, so that we have a filesystem that's shared between dom0 and the VMs. From time to time (almost always) the Hudson machine will instantiate a job (via ssh) towards one of the slaves. As soon as both slaves are engaged, one of them kernel panics.

# df -h /shared
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/FAST-SHARED  128G   17G  112G  13% /shared
I know nothing about Hudson. What is it actually doing at a file system level? Just normal file reads/writes, or does it create directories, remove files or directories, mmap things? I'm just trying to get a rough idea of the normal I/O going through the system.

The bug that is being tripped occurs when, after having reserved one or more blocks in a particular resource group, the allocation fails because a suitable free block cannot be found in that rgrp. In other words, it looks like the summary information for the rgrp doesn't match the bitmap for some reason.

I'm not sure why you are getting messages that it's leaving unlinked blocks unchanged; it ought to be deallocating them if they are unlinked. Perhaps Bob can shed some light on that?
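For anyone not familiar with the allocator, the failing pattern is roughly the following. This is only an illustrative sketch in plain C, not the actual fs/gfs2/rgrp.c code: the struct layout and the names alloc_data_block() and bitmap_find_free() are hypothetical stand-ins for the real rgrp summary counters and bitmap search.

/*
 * Illustrative sketch only -- not the RHEL5 GFS2 source.
 * Field and function names here are hypothetical.
 */
#include <assert.h>
#include <stdint.h>

struct rgrp {
	uint32_t rg_free;        /* free-block count from the rgrp summary */
	uint32_t first_block;    /* first block covered by this rgrp */
	uint32_t length;         /* number of blocks covered */
	unsigned char *bitmap;   /* 2 bits per block; state 0 means "free" */
};

/* Scan the bitmap for a free block; return (uint32_t)-1 if none exists. */
static uint32_t bitmap_find_free(const struct rgrp *rgd)
{
	for (uint32_t i = 0; i < rgd->length; i++) {
		unsigned int state = (rgd->bitmap[i / 4] >> ((i % 4) * 2)) & 3;
		if (state == 0)
			return i;
	}
	return (uint32_t)-1;
}

/*
 * The caller has already reserved blocks because rg_free claimed the rgrp
 * had room.  If the bitmap then fails to yield a free block, the summary
 * and the bitmap disagree, and the code path has nothing better to do
 * than assert -- which is what the rgrp.c:1458 BUG corresponds to in spirit.
 */
uint32_t alloc_data_block(struct rgrp *rgd)
{
	uint32_t blk = bitmap_find_free(rgd);

	assert(blk != (uint32_t)-1);  /* summary said rg_free > 0; bitmap disagrees */
	rgd->rg_free--;
	return rgd->first_block + blk;
}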
As Hudson starts up it'll do:

$ ssh 192.168.102.11 ~/common/slave.sh

/home/hudson/common/slave.sh:
#!/bin/sh
wget -q -N http://<hudson>:8380/hudson/jnlpJars/slave.jar -O slave.jar
exec /usr/java/jdk1.6.0_11/bin/java -jar slave.jar

Then when the time comes it'll create a workspace directory (differently named for each job) and run ~/common/run_tck.sh (see below). It effectively crashes in the wget. So basically it boils down to a wget doing a bit of I/O within a Xen slave on a dom0 device.

/home/hudson/common/run_tck.sh:
if [ $# != 1 ]; then
   echo 1>&2 "Usage: $0 <tests>"
   exit 1
fi
TESTS=$1
if [ -z "$JAVA_HOME" ]; then
   echo "JAVA_HOME is not set (no JDK selected)"
   exit 1
fi

set -x

#wget -N http://<hudson>:8380/hudson/job/JBoss-AS-5.x-plugged/lastSuccessfulBuild/artifact/jboss/jboss-5.x-plugged.zip
wget -nv -N http://<hudson>:8380/hudson/job/JBoss-AS-5.x-latest/lastSuccessfulBuild/artifact/Branch_5_x/build/output/jboss-5.x-latest.zip
#if [ jboss-5.x-latest.zip -nt jboss ]; then
   rm -rf jboss
   unzip -q -d jboss jboss-5.x-latest.zip
   touch jboss
#fi

# Nuke any previous results so they won't interfere for sure
rm -rf javaeetck/bin/JTreport
rm -rf javaeetck/bin/JTwork

wget -nv -N http://<hudson>:8380/hudson/job/tck51_package/lastSuccessfulBuild/artifact/javaeetck.zip
if [ javaeetck.zip -nt javaeetck ]; then
   rm -rf javaeetck
   unzip javaeetck.zip
   touch javaeetck
fi

wget -nv -N http://<hudson>:8380/hudson/job/glassfish-package/lastSuccessfulBuild/artifact/glassfish.zip
if [ glassfish.zip -nt glassfish ]; then
   rm -rf glassfish
   unzip glassfish.zip
   touch glassfish
fi

export JAVAEE_HOME=${WORKSPACE}/glassfish
export JBOSS_HOME=`echo ${WORKSPACE}/jboss/*`
export TS_HOME=`echo ${WORKSPACE}/javaeetck`

cd javaeetck/j2eetck-mods
/opt/apache/ant/apache-ant-1.7.1/bin/ant
cd ${WORKSPACE}/javaeetck/bin
./tsant config.vi

cd $TS_HOME/bin
./tsant -f xml/s1as.xml start.javadb
./tsant init.javadb

(cd $JBOSS_HOME/bin; ./run.sh -c cts -b localhost) &
PID=$!
trap "${JBOSS_HOME}/bin/shutdown.sh -S; ./stop-javadb; sleep 15; /sbin/fuser -k $JBOSS_HOME/bin/run.jar" EXIT

~/common/waitfor $JBOSS_HOME/server/cts/log/server.log "Started in" 180

set -x
./tsant "-Dmultiple.tests=$TESTS" runclient

/usr/java/jdk1.5.0_17/bin/java -cp ../lib/javatest.jar:../lib/tsharness.jar:../lib/cts.jar com.sun.javatest.cof.Main -o JTreport/report.xml JTwork
#/usr/java/jdk1.5.0_17/bin/java -cp /home/carlo/tools/jtharness-4.1.4-MR1-b17/lib/javatest.jar:../lib/cts.jar:../lib/tsharness.jar com.sun.javatest.tool.Main -testsuite ${TS_HOME}/src/ -workDir JTwork -writeReport -type xml JTreport

#kill $PID
#./stop-javadb
Note that the system has already had some crashes/hangs and needed to be rebooted with "reboot -f -n". The wicked thing is that after an fsck it should come back up with a stable fs.
/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<!-- vim:ts=3:sw=3: -->
<cluster name="boo" config_version="12" alias="jboss-ejb3">
   <!--
      post_fail_delay is 0, so there should be no blocking when a node needs a reboot.
      post_join_delay is 3, so after 3 seconds all other nodes are fenced.
   -->
   <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
   <clusternodes>
      <clusternode name="192.168.102.1" nodeid="1">
         <fence>
            <!-- hur, hur, hur :-) -->
            <method name="first">
               <device name="xvm" domain="Domain-0"/>
            </method>
            <!-- pull the plug pretty please -->
            <method name="second">
               <device name="manual"/>
            </method>
         </fence>
      </clusternode>
      <clusternode name="slave1b.localdomain" nodeid="2" votes="0">
         <fence>
            <method name="single">
               <device name="xvm" domain="rhel-5.3-guest1"/>
            </method>
         </fence>
      </clusternode>
      <clusternode name="slave2b.localdomain" nodeid="3" votes="0">
         <fence>
            <method name="single">
               <device name="xvm" domain="rhel-5.3-guest2"/>
            </method>
         </fence>
      </clusternode>
   </clusternodes>
   <!-- we only need dom0 to run -->
   <cman expected_votes="1"/>
   <fence_xvmd/>
   <gfs_controld/>
   <fencedevices>
      <fencedevice name="manual" agent="fence_manual"/>
      <fencedevice name="xvm" agent="fence_xvm"/>
   </fencedevices>
   <rm>
      <failoverdomains/>
      <resources/>
      <service name="test" autorestart="0"/>
   </rm>
</cluster>
Although it doesn't fix this issue, I have already pushed an upstream patch to improve error handling in gfs2 in relation to resource groups. It should mean that the fs will be more resilient to these kinds of failures in future.
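By way of illustration only (a hypothetical sketch in the same style as the one above, not the actual upstream patch), "more resilient" means turning the hard assertion into an error that is reported to the caller, so a single allocation fails rather than the whole node panicking:

/*
 * Hypothetical sketch of more forgiving error handling -- not the real
 * patch.  On a summary/bitmap mismatch, flag the inconsistency and return
 * -EIO instead of hitting BUG().
 */
#include <errno.h>
#include <stdint.h>

/* Same illustrative rgrp layout as the sketch above. */
struct rgrp {
	uint32_t rg_free;        /* summary free-block count (illustrative) */
	uint32_t length;         /* number of blocks covered */
	unsigned char *bitmap;   /* 2 bits per block; state 0 means "free" */
	int inconsistent;        /* sticky "needs fsck" flag (illustrative) */
};

int alloc_data_block_safe(struct rgrp *rgd, uint32_t *blk_out)
{
	for (uint32_t i = 0; i < rgd->length; i++) {
		if (((rgd->bitmap[i / 4] >> ((i % 4) * 2)) & 3) == 0) {
			rgd->rg_free--;
			*blk_out = i;
			return 0;
		}
	}
	/* Summary said rg_free > 0, but the bitmap has no free block:
	 * record the inconsistency and fail only this operation. */
	rgd->inconsistent = 1;
	return -EIO;
}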
I am having difficulty with this bug. It occurs frequently (several times daily) and the process that triggers the exception varies. Below are a couple of stack traces. I have just upgraded the kernel to 2.6.18-128.2.1.el5PAE, which didn't resolve the issue.

------------[ cut here ]------------
kernel BUG at fs/gfs2/rgrp.c:1458!
invalid opcode: 0000 [#1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_cord
CPU: 0
EIP: 0060:[<f96e04da>] Not tainted VLI
EFLAGS: 00010246 (2.6.18-128.1.16.el5PAE #1)
EIP is at gfs2_alloc_data+0x75/0x155 [gfs2]
eax: ffffffff ebx: 00000000 ecx: 00000000 edx: 00000001
esi: 05ec1513 edi: 00000000 ebp: f51aa114 esp: edf96c74
ds: 007b es: 007b ss: 0068
Process p4v.bin (pid: 8895, ti=edf96000 task=ccfe0550 task.ti=edf96000)
Stack: d3b34548 f7242000 d9663380 00000000 d3b34548 00000000 d4575000 f96c4db2
       cf449378 d4575140 00001000 00000000 cf449378 d3b34548 edf96cf4 f96c50a0
       edf96cf4 00000001 edf96d18 edf96d10 00000000 0000000c 00000000 0000c000
Call Trace:
 [<f96c4db2>] lookup_block+0xb4/0x153 [gfs2]
 [<f96c50a0>] gfs2_block_map+0x24f/0x392 [gfs2]
 [<c047364e>] set_bh_page+0x43/0x4c
 [<c047371f>] alloc_page_buffers+0x74/0xba
 [<c0474455>] __block_prepare_write+0x1a2/0x439
 [<f96cc405>] do_promote+0xe8/0x10b [gfs2]
 [<c0474702>] block_prepare_write+0x16/0x23
 [<f96c4e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d5005>] gfs2_write_begin+0x2af/0x359 [gfs2]
 [<f96c4e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d6823>] gfs2_file_buffered_write+0x10d/0x287 [gfs2]
 [<c0428969>] current_fs_time+0x4a/0x55
 [<f96d6c71>] __gfs2_file_aio_write_nolock+0x2d4/0x32d [gfs2]
 [<c05ab197>] sock_aio_read+0x53/0x61
 [<f96d6e24>] gfs2_file_write_nolock+0xb0/0x111 [gfs2]
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<f96d6f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<f96d6f55>] gfs2_file_write+0x3a/0x94 [gfs2]
 [<f96d6f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<c04720ef>] vfs_write+0xa1/0x143
 [<c04726e1>] sys_write+0x3c/0x63
 [<c0404ead>] sysenter_past_esp+0x56/0x79
=======================
Code: 16 31 d2 01 f0 11 fa 39 d3 77 0c 72 04 39 c1 73 06 89 ca 29 f2 eb 03 8b 55 70 31 c9 89 e8 6a 01 e8 39 e8 ff ff 5a 83 f8 ff 75 08 <0f> 0b b2 05 7c 5d 6
EIP: [<f96e04da>] gfs2_alloc_data+0x75/0x155 [gfs2] SS:ESP 0068:edf96c74
<0>Kernel panic - not syncing: Fatal exception

kernel BUG at fs/gfs2/rgrp.c:1458!
invalid opcode: 0000 [#1] SMP
last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:04:00.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/irq
Modules linked in: ipv6 xfrm_nalgo crypto_api lock_dlm gfs2 dlm configfs sunrpc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_cord
CPU: 1
EIP: 0060:[<f96df4da>] Not tainted VLI
EFLAGS: 00010246 (2.6.18-128.1.16.el5PAE #1)
EIP is at gfs2_alloc_data+0x75/0x155 [gfs2]
eax: ffffffff ebx: 00000000 ecx: 00000000 edx: 00000001
esi: 05ec1513 edi: 00000000 ebp: f0e11858 esp: f01f1c74
ds: 007b es: 007b ss: 0068
Process cp (pid: 31700, ti=f01f1000 task=f2e83550 task.ti=f01f1000)
Stack: f4361d68 f6cb1000 e45b7dc0 00000000 f4361d68 00000001 d25b8000 f96c3db2
       d12d53ac d25b8e10 00001000 00000001 e40686ec f4361d68 f01f1cf4 f96c40a0
       f01f1cf4 00000001 f01f1d18 f01f1d10 00000000 000013a5 00000000 013a5000
Call Trace:
 [<f96c3db2>] lookup_block+0xb4/0x153 [gfs2]
 [<f96c40a0>] gfs2_block_map+0x24f/0x392 [gfs2]
 [<c047364e>] set_bh_page+0x43/0x4c
 [<c047371f>] alloc_page_buffers+0x74/0xba
 [<c0474455>] __block_prepare_write+0x1a2/0x439
 [<f96cb405>] do_promote+0xe8/0x10b [gfs2]
 [<c0474702>] block_prepare_write+0x16/0x23
 [<f96c3e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d4005>] gfs2_write_begin+0x2af/0x359 [gfs2]
 [<f96c3e51>] gfs2_block_map+0x0/0x392 [gfs2]
 [<f96d5823>] gfs2_file_buffered_write+0x10d/0x287 [gfs2]
 [<c0428969>] current_fs_time+0x4a/0x55
 [<f96d5c71>] __gfs2_file_aio_write_nolock+0x2d4/0x32d [gfs2]
 [<f96d5e24>] gfs2_file_write_nolock+0xb0/0x111 [gfs2]
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<c0434a97>] autoremove_wake_function+0x0/0x2d
 [<f96d5f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<f96d5f55>] gfs2_file_write+0x3a/0x94 [gfs2]
 [<f96d5f1b>] gfs2_file_write+0x0/0x94 [gfs2]
 [<c04720ef>] vfs_write+0xa1/0x143
 [<c04726e1>] sys_write+0x3c/0x63
 [<c0404ead>] sysenter_past_esp+0x56/0x79
=======================
Code: 16 31 d2 01 f0 11 fa 39 d3 77 0c 72 04 39 c1 73 06 89 ca 29 f2 eb 03 8b 55 70 31 c9 89 e8 6a 01 e8 39 e8 ff ff 5a 83 f8 ff 75 08 <0f> 0b b2 05 7c 4d 6
EIP: [<f96df4da>] gfs2_alloc_data+0x75/0x155 [gfs2] SS:ESP 0068:f01f1c74
<0>Kernel panic - not syncing: Fatal exception
Created attachment 355822 [details]
x86_64 binary for bz 500483

Perhaps someone can run this x86_64 binary of fsck.gfs2 against one of the damaged file systems to see if it fixes the problem. It would be nice if it were run twice, just to verify that the first run fixes all of the problems.
Sorry, in order to get this cluster production-ready I had to revert the GFS filesystems to GFS (v1). I cannot perform this test at this time.
Hi. I'm actually having the same problem. Robert, can you generate a 32-bit version of that fsck? I'll gladly run it against my FS.
Paulo, we can, but it will probably be next week before it's ready as Bob is away at the moment. I'm very interested in collecting as much information as possible about the workload just before the failure first occurred (or, basically, how to reproduce this issue from a newly created filesystem), if it is possible to do so. Any clues you can provide us with would be very helpful.
Steve, in my case it is quite simple to make it happen again. I currently have the filesystem mounted and working "fine", but as soon as I start an rsync of any group of files into it, it explodes while trying to copy the first file, rendering the mount point totally unusable; only a reboot brings the box back.

I'm currently running:
kernel-2.6.27.25-78.2.56.fc9.i686
gfs2-utils-2.03.11-1.fc9.i386

This is the error I get:

Oct 26 18:51:30 dellix kernel: ------------[ cut here ]------------
Oct 26 18:51:30 dellix kernel: kernel BUG at fs/gfs2/rgrp.c:1442!
Oct 26 18:51:30 dellix kernel: invalid opcode: 0000 [#1] SMP
Oct 26 18:51:30 dellix kernel: Modules linked in: lock_dlm dlm configfs usb_storage gfs2 drbd vmnet ppdev parport_pc parport vsock vmci vmmon iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 iptable_mangle ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables cpufreq_ondemand acpi_cpufreq ipv6 dm_multipath scsi_dh snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq nvidia(P) serio_raw dcdbas snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd_page_alloc e1000e i2c_i801 snd_hwdep iTCO_vendor_support snd sr_mod rt2870sta pcspkr i2c_core cdrom soundcore floppy sg dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_generic pata_acpi ata_piix libata sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Oct 26 18:51:30 dellix kernel:
Oct 26 18:51:30 dellix kernel: Pid: 9239, comm: rsync Tainted: P (2.6.27.25-78.2.56.fc9.i686 #1) Inspiron 530
Oct 26 18:51:30 dellix kernel: EIP: 0060:[<f90007ab>] EFLAGS: 00210246 CPU: 1
Oct 26 18:51:30 dellix kernel: EIP is at gfs2_alloc_di+0x44/0x113 [gfs2]
Oct 26 18:51:30 dellix kernel: EAX: ffffffff EBX: f350c000 ECX: f272ad70 EDX: 00000003
Oct 26 18:51:30 dellix kernel: ESI: c699a168 EDI: f379d400 EBP: f272ad80 ESP: f272ad64
Oct 26 18:51:30 dellix kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Oct 26 18:51:30 dellix kernel: Process rsync (pid: 9239, ti=f272a000 task=f267b2c0 task.ti=f272a000)
Oct 26 18:51:30 dellix kernel: Stack: f272ae34 f90042b1 00000000 00000000 f350c000 c6e953b0 c6eb01f8 f272ae54
Oct 26 18:51:30 dellix kernel:        f8ff2a0b 00000000 f350c000 f2520d80 f272adc4 f2520d80 00008180 c6e0201c
Oct 26 18:51:30 dellix kernel:        f272ae6c f272adb0 c6eb01f8 00000000 f272adc8 f350c000 f272ae7c f272ae98
Oct 26 18:51:30 dellix kernel: Call Trace:
Oct 26 18:51:30 dellix kernel: [<f90042b1>] ? gfs2_trans_begin+0xdc/0x10f [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8ff2a0b>] ? gfs2_createi+0x574/0xe39 [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8feee58>] ? do_promote+0x3e/0x15b [gfs2]
Oct 26 18:51:30 dellix kernel: [<c041f707>] ? need_resched+0x18/0x22
Oct 26 18:51:30 dellix kernel: [<f8feedec>] ? gfs2_glock_wait+0x2a/0x4c [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8ff059a>] ? gfs2_glock_nq+0x2ac/0x2b8 [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8ffb8ee>] ? gfs2_create+0x51/0x100 [gfs2]
Oct 26 18:51:30 dellix kernel: [<f8ff24ff>] ? gfs2_createi+0x68/0xe39 [gfs2]
Oct 26 18:51:30 dellix kernel: [<c04de0ed>] ? security_inode_permission+0x1e/0x20
Oct 26 18:51:30 dellix kernel: [<c0497aef>] ? inode_permission+0xa0/0xb2
Oct 26 18:51:30 dellix kernel: [<c0497f29>] ? vfs_create+0x61/0x83
Oct 26 18:51:30 dellix kernel: [<c049960e>] ? do_filp_open+0x1a7/0x611
Oct 26 18:51:30 dellix kernel: [<c049960e>] ? do_filp_open+0x1a7/0x611
Oct 26 18:51:30 dellix kernel: [<c041f707>] ? need_resched+0x18/0x22
Oct 26 18:51:30 dellix kernel: [<c048f8f4>] ? do_sys_open+0x42/0xb7
Oct 26 18:51:30 dellix kernel: [<c048f9ab>] ? sys_open+0x1e/0x26
Oct 26 18:51:30 dellix kernel: [<c0404c8a>] ? syscall_call+0x7/0xb
Oct 26 18:51:30 dellix kernel: =======================
Oct 26 18:51:30 dellix kernel: Code: 00 8d 45 f0 8b b7 fc 00 00 00 8b 9a 9c 01 00 00 c7 45 f0 01 00 00 00 8b 56 68 50 89 f0 6a 03 e8 73 fc ff ff 5a 59 83 f8 ff 75 04 <0f> 0b eb fe 89 46 68 89 45 e8 c7 45 ec 00 00 00 00 8b 46 1c 8b
Oct 26 18:51:30 dellix kernel: EIP: [<f90007ab>] gfs2_alloc_di+0x44/0x113 [gfs2] SS:ESP 0068:f272ad64
Oct 26 18:51:30 dellix kernel: ---[ end trace e79d8292c3d80c27 ]---

Let me know if there's more info you might need...
Paulo, the kernel from fc9 is ancient and I'm not surprised that it doesn't work very well. I'd highly recommend upgrading it if you are continuing to run gfs2.

Are you running fsck.gfs2 between those rsync runs? If not, then get a later copy of gfs2-utils and try that. The output from fsck might be useful, but I think your kernel and utils are so old that it is probably not very helpful to the investigation. The other reports have all been RHEL5 (I'd accept CentOS too, as that's pretty close), but Fedora 9 is too different to be of much use, I'm afraid.

What I'm looking for, ideally, is a series of steps that says "if I start with a newly mkfs'ed gfs2 filesystem and do X, it always triggers". I know that we might not get to quite that stage though, as it's not easy to make it happen.
*** This bug has been marked as a duplicate of bug 500483 ***