82238 – [ext3/jbd] kernel oops in do_get_write_access() at transaction.c:715

Bug 82238 - [ext3/jbd] kernel oops in do_get_write_access() at transaction.c:715

Summary: [ext3/jbd] kernel oops in do_get_write_access() at transaction.c:715

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	52378 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2003-01-20 10:14 UTC by Fabrice Bellet
Modified:	2007-04-18 16:50 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-02-05 13:58:17 UTC
Embargoed:

Attachments	(Terms of Use)
FIx signed use of unsigned i_blocks (3.66 KB, patch) 2003-02-03 18:23 UTC, Stephen Tweedie	no flags	Details \| Diff
View All

Description Fabrice Bellet 2003-01-20 10:14:23 UTC

Description of problem:
Kernel oops in assertion checking during the removal of a huge file on an EXT3 fs.

Version-Release number of selected component (if applicable):
Latest 7.3 kernel : 2.4.18-19.7.x

How reproducible:
always reproducible. 

Steps to Reproduce:
I have a RAID5 IDE-SCSI system, providing 1.2To free space. I partition it
with a single partition, format it as EXT3 fs with standard options (mke2fs -j
/dev/sda1), I mount it, and create a file that uses all the available space :

mount /dev/sda1 /mnt/raid -text3
cd /mnt/raid
dd if=/dev/zero of=foo bs=4096
--> the commands stops when no space left is available. Then I remove the file.

root@dhcp-14 /mnt/raid > ls -l
total 1211403296
-rw-r--r--    1 root     root     1239265550336 Jan 18 21:58 foo
drwx------    2 root     root        16384 Jan 18 15:21 lost+found
root@dhcp-14 /mnt/raid > df -k
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda1              6190664   1943744   3932452  34% /
/dev/hda3             51882048     70480  49176108   1% /home
none                    451460         0    451460   0% /dev/shm
/dev/sda1            1211436108 1211436108         0 100% /mnt/raid
root@dhcp-14 /mnt/raid > /bin/rm -f /mnt/raid/foo 
Segmentation fault
root@dhcp-14 /mnt/raid > 
Message from syslogd@dhcp-14 at Mon Jan 20 10:53:10 2003 ...
dhcp-14 kernel: Assertion failure in do_get_write_access() at transaction.c:715:
"handle->h_buffer_credits > 0"

The oops is :
kernel BUG at transaction.c:715!
invalid operand: 0000
agpgart nfsd lockd sunrpc autofs 8139too mii ide-cd cdrom ext3 jbd aic7xxx sd_mo
d scsi_mod
CPU:    0
EIP:    0010:[<f884daee>]    Not tainted
EFLAGS: 00010286

EIP is at do_get_write_access [jbd] 0x336 (2.4.18-19.7.x)
eax: 00000063   ebx: 00000000   ecx: 00000000   edx: f6e68000
esi: e0803b10   edi: f4311e00   ebp: c334e9f0   esp: ee78fcc4
ds: 0018   es: 0018   ss: 0018
Process rm (pid: 2020, stackpage=ee78f000)
Stack: f8853ce0 f8853622 f8853570 000002cb f8853636 00000000 00000000 ef99dee0
       d467d000 f6d28800 cdc9c8a0 f884de39 ef99dee0 cdc9c8a0 00000001 f4311e00
       00000000 d467d000 f6d28800 ef99dee0 f885a468 ef99dee0 e0803b10 00000001
Call Trace: [<f8853ce0>] .rodata.str1.32 [jbd] 0x20 (0xee78fcc4))
[<f8853622>] .rodata.str1.1 [jbd] 0xc2 (0xee78fcc8))
[<f8853570>] .rodata.str1.1 [jbd] 0x10 (0xee78fccc))
[<f8853636>] .rodata.str1.1 [jbd] 0xd6 (0xee78fcd4))
[<f884de39>] journal_get_undo_access_Rc7bbb223 [jbd] 0x3d (0xee78fcf0))
[<f885a468>] ext3_free_blocks [ext3] 0x174 (0xee78fd14))
[<f8851f6f>] log_start_commit_R335a5fa9 [jbd] 0x3b (0xee78fd50))
[<f884d6a5>] journal_restart_R3772af02 [jbd] 0xc1 (0xee78fd58))
[<f884d272>] start_this_handle [jbd] 0x122 (0xee78fd5c))
[<c0114bfd>] schedule [kernel] 0x20d (0xee78fd78))
[<f885e7ef>] ext3_clear_blocks [ext3] 0x137 (0xee78fd90))
[<c011eb7f>] update_process_times [kernel] 0x23 (0xee78fda8))
[<c011ea56>] update_wall_time [kernel] 0xe (0xee78fdbc))
[<c011ebc7>] timer_bh [kernel] 0x27 (0xee78fdc4))
[<c011ef41>] do_timer [kernel] 0x45 (0xee78fdd0))
[<f885e903>] ext3_free_data [ext3] 0x10b (0xee78fdf4))
[<c0139f54>] try_to_free_buffers [kernel] 0x8c (0xee78fe40))
[<f885eccc>] ext3_truncate [ext3] 0x128 (0xee78fe6c))
[<f884d272>] start_this_handle [jbd] 0x122 (0xee78fea8))
[<f884d379>] journal_start_R2d67426b [jbd] 0x7d (0xee78fec0))
[<f885cc23>] start_transaction [ext3] 0x57 (0xee78fed8))
[<f885cd64>] ext3_delete_inode [ext3] 0xa0 (0xee78fefc))
[<f8868de0>] ext3_sops [ext3] 0x0 (0xee78ff10))
[<f885ccc4>] ext3_delete_inode [ext3] 0x0 (0xee78ff18))
[<f8868de0>] ext3_sops [ext3] 0x0 (0xee78ff1c))
[<c0148ccf>] iput [kernel] 0xf7 (0xee78ff20))
[<f88618ad>] ext3_unlink [ext3] 0x1a5 (0xee78ff30))
[<c014735a>] d_delete [kernel] 0x4e (0xee78ff44))
[<c0140950>] vfs_unlink [kernel] 0x154 (0xee78ff64))
[<c0140a09>] sys_unlink [kernel] 0x85 (0xee78ff88))
[<c01138dc>] do_page_fault [kernel] 0x0 (0xee78ffb0))
[<c0108623>] system_call [kernel] 0x33 (0xee78ffc0))


Code: 0f 0b cb 02 70 35 85 f8 83 c4 14 8b 4c 24 1c 8b 41 04 48 8b

Running the oops through ksymoops adds this information :

>>EIP; f884daee <[jbd]do_get_write_access+336/464>   <=====
Trace; f8853ce0 <[jbd].rodata.end+781/4e2d>
Trace; f8853622 <[jbd].rodata.end+c3/4e2d>
Trace; f8853570 <[jbd].rodata.end+11/4e2d>
Trace; f8853636 <[jbd].rodata.end+d7/4e2d>
Trace; f884de39 <[jbd]journal_get_undo_access+3d/fc>
Trace; f885a468 <[ext3].text.start+408/a397>
Trace; f8851f6f <[jbd]log_start_commit+3b/40>
Trace; f884d6a5 <[jbd]journal_restart+c1/d8>
Trace; f884d272 <[jbd]__kstrtab_journal_init_inode+18/26>
Trace; c0114bfd <schedule+20d/230>
Trace; f885e7ef <[ext3].text.start+478f/a397>
Trace; c0138000 <getblk+18/3c>
Trace; f885d85f <[ext3].text.start+37ff/a397>
Trace; f885e903 <[ext3].text.start+48a3/a397>
Trace; f884dbfc <[jbd]do_get_write_access+444/464>
Trace; f885eccc <[ext3].text.start+4c6c/a397>
Trace; f884d272 <[jbd]__kstrtab_journal_init_inode+18/26>
Trace; f884d379 <[jbd]__kstrtab_journal_destroy+a/1a>
Trace; f885cc23 <[ext3].text.start+2bc3/a397>
Trace; f885cd64 <[ext3].text.start+2d04/a397>
Trace; f884e54d <[jbd]journal_stop+1b1/1c0>
Trace; f885ccc4 <[ext3].text.start+2c64/a397>
Trace; f8868de0 <[ext3].data.start+200/2bf>
Trace; c0148ccf <iput+f7/1d8>
Trace; f88618ad <[ext3].text.start+784d/a397>
Trace; c014735a <d_delete+4e/70>
Trace; c0140950 <vfs_unlink+154/188>
Trace; c0140a09 <sys_unlink+85/f0>
Trace; c01138dc <do_page_fault+0/428>
Trace; c0108623 <system_call+33/38>
Code;  f884daee <[jbd]do_get_write_access+336/464>
00000000 <_EIP>:
Code;  f884daee <[jbd]do_get_write_access+336/464>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  f884daf0 <[jbd]do_get_write_access+338/464>
   2:   cb                        lret   
Code;  f884daf1 <[jbd]do_get_write_access+339/464>
   3:   02 70 35                  add    0x35(%eax),%dh
Code;  f884daf4 <[jbd]do_get_write_access+33c/464>
   6:   85 f8                     test   %edi,%eax
Code;  f884daf6 <[jbd]do_get_write_access+33e/464>
   8:   83 c4 14                  add    $0x14,%esp
Code;  f884daf9 <[jbd]do_get_write_access+341/464>
   b:   8b 4c 24 1c               mov    0x1c(%esp,1),%ecx
Code;  f884dafd <[jbd]do_get_write_access+345/464>
   f:   8b 41 04                  mov    0x4(%ecx),%eax
Code;  f884db00 <[jbd]do_get_write_access+348/464>
  12:   48                        dec    %eax
Code;  f884db01 <[jbd]do_get_write_access+349/464>
  13:   8b 00                     mov    (%eax),%eax


2 warnings and 5 errors issued.  Results may not be reliable.

Comment 1 Fabrice Bellet 2003-01-24 16:40:35 UTC

I obtain the same behaviour on phoebe too, kernel 2.4.20-2.21.

Comment 2 Stephen Tweedie 2003-02-03 18:20:44 UTC

Thanks very much for this report --- it has allowed me to find a problem that
I've been chasing for some time.

The bug occurs when the "i_blocks" count in the file's inode overflows past
2^31.  That works fine most of the time, because i_blocks is an unsigned long,
and should go up to 2^32; but there's a place in truncate where ext3 calculates
the size of the next transaction chunk for the delete, and that mistakenly uses
a signed long instead.  Because the huge i_blocks gets cast to a negative value,
ext3 does not reserve enough credits for the transaction and the above error
results. 

The attached patch fixes things for me in testing, and it will be in a
subsequent errata.

Comment 3 Stephen Tweedie 2003-02-03 18:23:04 UTC

Created attachment 89813 [details]
FIx signed use of unsigned i_blocks

Comment 4 Stephen Tweedie 2003-02-03 18:33:27 UTC

*** Bug 52378 has been marked as a duplicate of this bug. ***

Comment 5 Fabrice Bellet 2003-02-05 11:56:13 UTC

The patch you submitted at
https://listman.redhat.com/pipermail/ext3-users/2003-February/004726.html,
applied on the kernel from phoebe, solved this problem for me. Thank you Stephen.

Comment 6 Stephen Tweedie 2003-02-05 13:58:17 UTC

Thanks for testing it!

Comment 7 Chris Pepper 2003-08-05 13:59:50 UTC

Which RH 7.3 kernel incorporates this change? Is it in 8.0+? We've just seen it
on an IBM x345 w/ ServeRAID 5i controller, and were sticking with the stock 7.3
kernel for IBM's supported SR driver.

I need to know what RH kernel includes this fix, so I can find out if IBM
supports it yet.

Thanks,


Chris Pepper

Comment 8 Stephen Tweedie 2003-08-05 14:22:51 UTC

It's definitely in 9, and is probably in recent errata for older kernels as a
result, but I don't have information handy about exactly which errata it might
have first been included in.

Note You need to log in before you can comment on or make changes to this bug.