Bug 76275

Summary: Assertion failure in unmap_underlying_metadata() at buffer.c:1542: "!buffer_jlist_eq(old_bh, 3)"
Product: Red Hat Enterprise Linux 2.1 Reporter: Philip Pokorny <ppokorny>
Component: kernelAssignee: Larry Woodman <lwoodman>
Status: CLOSED NOTABUG QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 2.1CC: epierce, jike.zhou
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-09-28 11:13:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Philip Pokorny 2002-10-18 23:41:10 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
While copying data, the kernel logged this assertion failure to
/var/log/messages and the system became unresponsive (ping, snmp, etc) soon after.

The system is located at a co-lo facility and was reset.  Upon reboot, the
following was found in /var/log/messages:

Oct 18 01:06:20 golem kernel: EXT3-fs error (device sd(8,241)): ext3_readdir:
bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
Oct 18 01:13:30 golem kernel: EXT3-fs error (device sd(8,193)): ext3_new_block:
Allocating block in system zone - block = 2883584
Oct 18 01:13:30 golem kernel: Assertion failure in unmap_underlying_metadata()
at buffer.c:1542: "!buffer_jlist_eq(old_bh, 3)"
Oct 18 01:13:30 golem kernel: ------------[ cut here ]------------
Oct 18 01:13:30 golem kernel: kernel BUG at buffer.c:1542!
Oct 18 01:13:30 golem kernel: invalid operand: 0000
Oct 18 01:13:30 golem kernel: Kernel 2.4.9-e.8enterprise
Oct 18 01:13:30 golem kernel: CPU:    0
Oct 18 01:13:30 golem kernel: EIP: 0010:[unmap_underlying_metadata+219/288]   
Tainted: P
Oct 18 01:13:30 golem kernel: EIP:    0010:[<c01478ab>]    Tainted: P
Oct 18 01:13:30 golem kernel: EFLAGS: 00010292
Oct 18 01:13:30 golem kernel: EIP is at unmap_underlying_metadata [kernel] 0xdb
Oct 18 01:13:30 golem kernel: eax: 00000020   ebx: ea83fc00   ecx: c02f23a4 edx:
000147a1
Oct 18 01:13:30 golem kernel: esi: c8d72f00   edi: 00001000   ebp: 00000000 esp:
f35ade2c
Oct 18 01:13:30 golem kernel: ds: 0018   es: 0018   ss: 0018
Oct 18 01:13:31 golem kernel: Process mv (pid: 25636, stackpage=f35ad000)
Oct 18 01:13:31 golem kernel: Stack: c0263b77 00000606 00000001 00001000
00001000 c0147c71 00000000 00001000
Oct 18 01:13:31 golem kernel:        c0147c92 ea83fc00 ea83fc00 fe1f0000
f35ade6c 00001000 000000d7 ea83fc00
Oct 18 01:13:31 golem kernel:        00000070 00000014 f8863e50 f8860bf7
00000014 00000070 c434d558 f50a6740


Version-Release number of selected component (if applicable):


How reproducible:
Didn't try


Additional info:



Several days prior to this, a disk (/dev/sdp) had failed and taken a different
filesystem (sdp1 aka. 8,241) with it.  The node had not been rebooted, and the
following messages had been logged nightly:

Oct 13 04:04:30 golem kernel: SCSI disk error : host 2 channel 4 id 12 lun 0
return code = 25040001
Oct 13 04:04:30 golem kernel:  I/O error: dev 08:f1, sector 4136
Oct 13 04:04:30 golem kernel: EXT3-fs error (device sd(8,241)): ext3_readdir:
directory #2 contains a hole at offset 0
Oct 14 04:03:14 golem kernel: SCSI disk error : host 2 channel 4 id 12 lun 0
return code = 25040001
Oct 14 04:03:14 golem kernel:  I/O error: dev 08:f1, sector 4136
Oct 14 04:03:14 golem kernel: EXT3-fs error (device sd(8,241)): ext3_readdir:
directory #2 contains a hole at offset 0
Oct 14 22:49:42 golem kernel: SCSI disk error : host 2 channel 4 id 12 lun 0
return code = 25040001
Oct 14 22:49:42 golem kernel:  I/O error: dev 08:f1, sector 4136
Oct 14 22:49:42 golem kernel: EXT3-fs error (device sd(8,241)): ext3_readdir:
directory #2 contains a hole at offset 0
Oct 15 04:03:25 golem kernel: SCSI disk error : host 2 channel 4 id 12 lun 0
return code = 25040001
Oct 15 04:03:25 golem kernel:  I/O error: dev 08:f1, sector 4136
Oct 15 04:03:25 golem kernel: EXT3-fs error (device sd(8,241)): ext3_readdir:
directory #2 contains a hole at offset 0
Oct 15 05:13:17 golem kernel: ENOMEM in journal_alloc_journal_head, retrying.
Oct 15 12:58:17 golem kernel: SCSI disk error : host 2 channel 4 id 12 lun 0
return code = 25040001
Oct 15 12:58:17 golem kernel:  I/O error: dev 08:f1, sector 4136
Oct 15 12:58:17 golem kernel: EXT3-fs error (device sd(8,241)): ext3_readdir:
directory #2 contains a hole at offset 0
Oct 16 04:04:24 golem kernel: SCSI disk error : host 2 channel 4 id 12 lun 0
return code = 25040001
Oct 16 04:04:24 golem kernel:  I/O error: dev 08:f1, sector 4136
Oct 16 04:04:24 golem kernel: EXT3-fs error (device sd(8,241)): ext3_readdir:
directory #2 contains a hole at offset 0
Oct 17 04:03:32 golem kernel: SCSI disk error : host 2 channel 4 id 12 lun 0
return code = 25040001
Oct 17 04:03:32 golem kernel:  I/O error: dev 08:f1, sector 4136
Oct 17 04:03:32 golem kernel: EXT3-fs error (device sd(8,241)): ext3_readdir:
directory #2 contains a hole at offset 0

Comment 1 Arjan van de Ven 2002-12-05 17:25:53 UTC
which kernel modules are you using ?

Comment 2 Philip Pokorny 2002-12-05 22:08:11 UTC
[root@golem root]# uname -a
Linux golem.mydomain.com 2.4.9-e.8enterprise #1 SMP Fri Jul 19 15:27:35
EDT 2002 i686 unknown

[root@golem root]# lsmod
Module                  Size  Used by    Tainted: P  
e1000                  52740   1 
eepro100               21968   1 
usb-uhci               26948   0  (unused)
usbcore                68896   1  [usb-uhci]
ext3                   73536  16 
jbd                    55048  16  [ext3]
raid1                  16548  13 
megaraid               28160  19 
aic7xxx               127200  26 
sd_mod                 13468  45 
scsi_mod              125420   3  [megaraid aic7xxx sd_mod]



Comment 3 jike.zhou 2003-06-03 05:38:48 UTC
Has this bug fixed?

Comment 4 Philip Pokorny 2003-06-03 19:21:56 UTC
Not to my knowledge.

It has not happened again, but the conditions that triggered the problem are not
the kind of things I'm interested in inducing on a 24x7 production server.


Comment 5 Larry Woodman 2003-06-23 17:56:27 UTC
I cant reproduce this problem internally.  There were several changes in
this area over the past several AS2.1 kernel errata.  If you can reproduce
this problem with the latest kernel errata(e.24) I'd like to close this 
bug.

Larry Woodman


Comment 6 Philip Pokorny 2003-06-24 03:45:55 UTC
Kernel has been upgraded and SCSI cables were shortened.

Problem has not happened again, so close the ticket.


Comment 7 Eric Pierce 2004-08-25 21:15:05 UTC
Sorry to add to this, I know this bug is over a year old, but I just had the same error with a 
new (2.4.9-e.48smp) kernel.

 EXT3-fs error (device sd(8,18)): ext3_free_blocks: bit already cleared for block 53500615
EXT3-fs error (device sd(8,18)): ext3_free_blocks: bit already cleared for block 253599764
EXT3-fs error (device sd(8,18)): ext3_free_blocks: bit already cleared for block 357537704
Assertion failure in unmap_underlying_metadata() at buffer.c:1542: "!
buffer_jlist_eq(old_bh, 3)"
------------[ cut here ]------------
kernel BUG at buffer.c:1542!
invalid operand: 0000
Kernel 2.4.9-e.48smp
CPU:    2
EIP:    0010:[<c0148c1b>]    Not tainted
EFLAGS: 00010286
EIP is at unmap_underlying_metadata [kernel] 0xdb 
eax: 00000020   ebx: efb97500   ecx: c02f6ec4   edx: 00005a3a
esi: f501e740   edi: d5f0aee0   ebp: 00000800   esp: f55fdd24
ds: 0018   es: 0018   ss: 0018
Process nfsd (pid: 965, stackpage=f55fd000)
Stack: c02670c2 00000606 00000001 00000800 00001000 
c0148ff1 00000000 00001000 
       c0149012 efb97500 efb97920 fe15c000 f55fdd64 00000800 00000001 efb97500 
       00000070 00000014 f887bfd0 f8878cc7 00000014 00000070 c2419934 dc216740 
Call Trace: [<c02670c2>] .rodata.str1.1 [kernel] 0x301d (0xf55fdd24)
[<c0148ff1>] __block_prepare_write [kernel] 0x121 (0xf55fdd38)
[<c0149012>] __block_prepare_write [kernel] 0x142 (0xf55fdd44)
[<f887bfd0>] .LC13 [jbd] 0x0 (0xf55fdd6c)
[<f8878cc7>] __jbd_kmalloc [jbd] 0x27 (0xf55fdd70)
[<c0149932>] block_prepare_write [kernel] 0x22 (0xf55fdd8c)
[<f88849e0>] ext3_get_block [ext3] 0x0 (0xf55fdda0)
[<f8884ed1>] ext3_prepare_write [ext3] 0xb1 (0xf55fddac)
[<f88849e0>] ext3_get_block [ext3] 0x0 (0xf55fddbc)
[<c0130f47>] __find_lock_page [kernel] 0x97 (0xf55fddd0)
[<c0133d44>] generic_file_write [kernel] 0x434 (0xf55fdde4)
[<f8a10c87>] nfsd_open [nfsd] 0x27 (0xf55fde38)
[<f8882a82>] ext3_file_write [ext3] 0x22 (0xf55fde54)
[<f8a1131c>] nfsd_write [nfsd] 0x14c (0xf55fde74)
[<c01da03c>] kfree_skbmem [kernel] 0xc (0xf55fdeb8)
[<f8892060>] ext3_file_operations [ext3] 0x0 (0xf55fdedc)
[<f8a167b0>] nfsd3_proc_write [nfsd] 0xf0 (0xf55fdf3c)
[<f8a20240>] nfsd_procedures3 [nfsd] 0xe0 (0xf55fdf60)
[<f8a20240>] nfsd_procedures3 [nfsd] 0xe0 (0xf55fdf68)
[<f8a0d5b1>] nfsd_dispatch [nfsd] 0xc1 (0xf55fdf6c)
[<f89fd999>] svc_process_Rsmp_d8cce609 [sunrpc] 0x349 (0xf55fdf88)
[<f8a1fbd8>] nfsd_version3 [nfsd] 0x0 (0xf55fdf9c)
[<f8a1fbf8>] nfsd_program [nfsd] 0x0 (0xf55fdfa0)
[<f8a0d39b>] nfsd [nfsd] 0x20b (0xf55fdfbc)
[<c0105856>] arch_kernel_thread [kernel] 0x26 (0xf55fdff0)
[<f8a0d190>] nfsd [nfsd] 0x0 (0xf55fdff8)


Code: 0f 0b 59 5b b8 01 00 00 00 f0 0f b3 46 18 19 c0 85 c0 74 07 
 <0>Kernel panic: not continuing


This machine is part of a NFS server cluster.  The cluster ran as it should and transferred 
the service to the second machine, however, it also kernel panicked with the same error 
message.  When the machines were rebooted, they would come up fine until they started 
the cluster service and mounted this filesystem, when they would kernel panic.

I booted with the rescue disc and disabled the cluster service, then ran fsck on the 
partition.  72 duplicate blocks were found and repaired and the filesystem has been 
remounted.