Bug 82238

Summary: [ext3/jbd] kernel oops in do_get_write_access() at transaction.c:715
Product: [Retired] Red Hat Linux Reporter: Fabrice Bellet <fabrice>
Component: kernelAssignee: Stephen Tweedie <sct>
Status: CLOSED RAWHIDE QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.3CC: nphilipp, pepper
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2003-02-05 13:58:17 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
FIx signed use of unsigned i_blocks none

Description Fabrice Bellet 2003-01-20 10:14:23 UTC
Description of problem:
Kernel oops in assertion checking during the removal of a huge file on an EXT3 fs.

Version-Release number of selected component (if applicable):
Latest 7.3 kernel : 2.4.18-19.7.x

How reproducible:
always reproducible. 

Steps to Reproduce:
I have a RAID5 IDE-SCSI system, providing 1.2To free space. I partition it
with a single partition, format it as EXT3 fs with standard options (mke2fs -j
/dev/sda1), I mount it, and create a file that uses all the available space :

mount /dev/sda1 /mnt/raid -text3
cd /mnt/raid
dd if=/dev/zero of=foo bs=4096
--> the commands stops when no space left is available. Then I remove the file.

root@dhcp-14 /mnt/raid > ls -l
total 1211403296
-rw-r--r--    1 root     root     1239265550336 Jan 18 21:58 foo
drwx------    2 root     root        16384 Jan 18 15:21 lost+found
root@dhcp-14 /mnt/raid > df -k
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda1              6190664   1943744   3932452  34% /
/dev/hda3             51882048     70480  49176108   1% /home
none                    451460         0    451460   0% /dev/shm
/dev/sda1            1211436108 1211436108         0 100% /mnt/raid
root@dhcp-14 /mnt/raid > /bin/rm -f /mnt/raid/foo 
Segmentation fault
root@dhcp-14 /mnt/raid > 
Message from syslogd@dhcp-14 at Mon Jan 20 10:53:10 2003 ...
dhcp-14 kernel: Assertion failure in do_get_write_access() at transaction.c:715:
"handle->h_buffer_credits > 0"

The oops is :
kernel BUG at transaction.c:715!
invalid operand: 0000
agpgart nfsd lockd sunrpc autofs 8139too mii ide-cd cdrom ext3 jbd aic7xxx sd_mo
d scsi_mod
CPU:    0
EIP:    0010:[<f884daee>]    Not tainted
EFLAGS: 00010286

EIP is at do_get_write_access [jbd] 0x336 (2.4.18-19.7.x)
eax: 00000063   ebx: 00000000   ecx: 00000000   edx: f6e68000
esi: e0803b10   edi: f4311e00   ebp: c334e9f0   esp: ee78fcc4
ds: 0018   es: 0018   ss: 0018
Process rm (pid: 2020, stackpage=ee78f000)
Stack: f8853ce0 f8853622 f8853570 000002cb f8853636 00000000 00000000 ef99dee0
       d467d000 f6d28800 cdc9c8a0 f884de39 ef99dee0 cdc9c8a0 00000001 f4311e00
       00000000 d467d000 f6d28800 ef99dee0 f885a468 ef99dee0 e0803b10 00000001
Call Trace: [<f8853ce0>] .rodata.str1.32 [jbd] 0x20 (0xee78fcc4))
[<f8853622>] .rodata.str1.1 [jbd] 0xc2 (0xee78fcc8))
[<f8853570>] .rodata.str1.1 [jbd] 0x10 (0xee78fccc))
[<f8853636>] .rodata.str1.1 [jbd] 0xd6 (0xee78fcd4))
[<f884de39>] journal_get_undo_access_Rc7bbb223 [jbd] 0x3d (0xee78fcf0))
[<f885a468>] ext3_free_blocks [ext3] 0x174 (0xee78fd14))
[<f8851f6f>] log_start_commit_R335a5fa9 [jbd] 0x3b (0xee78fd50))
[<f884d6a5>] journal_restart_R3772af02 [jbd] 0xc1 (0xee78fd58))
[<f884d272>] start_this_handle [jbd] 0x122 (0xee78fd5c))
[<c0114bfd>] schedule [kernel] 0x20d (0xee78fd78))
[<f885e7ef>] ext3_clear_blocks [ext3] 0x137 (0xee78fd90))
[<c011eb7f>] update_process_times [kernel] 0x23 (0xee78fda8))
[<c011ea56>] update_wall_time [kernel] 0xe (0xee78fdbc))
[<c011ebc7>] timer_bh [kernel] 0x27 (0xee78fdc4))
[<c011ef41>] do_timer [kernel] 0x45 (0xee78fdd0))
[<f885e903>] ext3_free_data [ext3] 0x10b (0xee78fdf4))
[<c0139f54>] try_to_free_buffers [kernel] 0x8c (0xee78fe40))
[<f885eccc>] ext3_truncate [ext3] 0x128 (0xee78fe6c))
[<f884d272>] start_this_handle [jbd] 0x122 (0xee78fea8))
[<f884d379>] journal_start_R2d67426b [jbd] 0x7d (0xee78fec0))
[<f885cc23>] start_transaction [ext3] 0x57 (0xee78fed8))
[<f885cd64>] ext3_delete_inode [ext3] 0xa0 (0xee78fefc))
[<f8868de0>] ext3_sops [ext3] 0x0 (0xee78ff10))
[<f885ccc4>] ext3_delete_inode [ext3] 0x0 (0xee78ff18))
[<f8868de0>] ext3_sops [ext3] 0x0 (0xee78ff1c))
[<c0148ccf>] iput [kernel] 0xf7 (0xee78ff20))
[<f88618ad>] ext3_unlink [ext3] 0x1a5 (0xee78ff30))
[<c014735a>] d_delete [kernel] 0x4e (0xee78ff44))
[<c0140950>] vfs_unlink [kernel] 0x154 (0xee78ff64))
[<c0140a09>] sys_unlink [kernel] 0x85 (0xee78ff88))
[<c01138dc>] do_page_fault [kernel] 0x0 (0xee78ffb0))
[<c0108623>] system_call [kernel] 0x33 (0xee78ffc0))


Code: 0f 0b cb 02 70 35 85 f8 83 c4 14 8b 4c 24 1c 8b 41 04 48 8b

Running the oops through ksymoops adds this information :

>>EIP; f884daee <[jbd]do_get_write_access+336/464>   <=====
Trace; f8853ce0 <[jbd].rodata.end+781/4e2d>
Trace; f8853622 <[jbd].rodata.end+c3/4e2d>
Trace; f8853570 <[jbd].rodata.end+11/4e2d>
Trace; f8853636 <[jbd].rodata.end+d7/4e2d>
Trace; f884de39 <[jbd]journal_get_undo_access+3d/fc>
Trace; f885a468 <[ext3].text.start+408/a397>
Trace; f8851f6f <[jbd]log_start_commit+3b/40>
Trace; f884d6a5 <[jbd]journal_restart+c1/d8>
Trace; f884d272 <[jbd]__kstrtab_journal_init_inode+18/26>
Trace; c0114bfd <schedule+20d/230>
Trace; f885e7ef <[ext3].text.start+478f/a397>
Trace; c0138000 <getblk+18/3c>
Trace; f885d85f <[ext3].text.start+37ff/a397>
Trace; f885e903 <[ext3].text.start+48a3/a397>
Trace; f884dbfc <[jbd]do_get_write_access+444/464>
Trace; f885eccc <[ext3].text.start+4c6c/a397>
Trace; f884d272 <[jbd]__kstrtab_journal_init_inode+18/26>
Trace; f884d379 <[jbd]__kstrtab_journal_destroy+a/1a>
Trace; f885cc23 <[ext3].text.start+2bc3/a397>
Trace; f885cd64 <[ext3].text.start+2d04/a397>
Trace; f884e54d <[jbd]journal_stop+1b1/1c0>
Trace; f885ccc4 <[ext3].text.start+2c64/a397>
Trace; f8868de0 <[ext3].data.start+200/2bf>
Trace; c0148ccf <iput+f7/1d8>
Trace; f88618ad <[ext3].text.start+784d/a397>
Trace; c014735a <d_delete+4e/70>
Trace; c0140950 <vfs_unlink+154/188>
Trace; c0140a09 <sys_unlink+85/f0>
Trace; c01138dc <do_page_fault+0/428>
Trace; c0108623 <system_call+33/38>
Code;  f884daee <[jbd]do_get_write_access+336/464>
00000000 <_EIP>:
Code;  f884daee <[jbd]do_get_write_access+336/464>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  f884daf0 <[jbd]do_get_write_access+338/464>
   2:   cb                        lret   
Code;  f884daf1 <[jbd]do_get_write_access+339/464>
   3:   02 70 35                  add    0x35(%eax),%dh
Code;  f884daf4 <[jbd]do_get_write_access+33c/464>
   6:   85 f8                     test   %edi,%eax
Code;  f884daf6 <[jbd]do_get_write_access+33e/464>
   8:   83 c4 14                  add    $0x14,%esp
Code;  f884daf9 <[jbd]do_get_write_access+341/464>
   b:   8b 4c 24 1c               mov    0x1c(%esp,1),%ecx
Code;  f884dafd <[jbd]do_get_write_access+345/464>
   f:   8b 41 04                  mov    0x4(%ecx),%eax
Code;  f884db00 <[jbd]do_get_write_access+348/464>
  12:   48                        dec    %eax
Code;  f884db01 <[jbd]do_get_write_access+349/464>
  13:   8b 00                     mov    (%eax),%eax


2 warnings and 5 errors issued.  Results may not be reliable.

Comment 1 Fabrice Bellet 2003-01-24 16:40:35 UTC
I obtain the same behaviour on phoebe too, kernel 2.4.20-2.21.

Comment 2 Stephen Tweedie 2003-02-03 18:20:44 UTC
Thanks very much for this report --- it has allowed me to find a problem that
I've been chasing for some time.

The bug occurs when the "i_blocks" count in the file's inode overflows past
2^31.  That works fine most of the time, because i_blocks is an unsigned long,
and should go up to 2^32; but there's a place in truncate where ext3 calculates
the size of the next transaction chunk for the delete, and that mistakenly uses
a signed long instead.  Because the huge i_blocks gets cast to a negative value,
ext3 does not reserve enough credits for the transaction and the above error
results. 

The attached patch fixes things for me in testing, and it will be in a
subsequent errata.

Comment 3 Stephen Tweedie 2003-02-03 18:23:04 UTC
Created attachment 89813 [details]
FIx signed use of unsigned i_blocks

Comment 4 Stephen Tweedie 2003-02-03 18:33:27 UTC
*** Bug 52378 has been marked as a duplicate of this bug. ***

Comment 5 Fabrice Bellet 2003-02-05 11:56:13 UTC
The patch you submitted at
https://listman.redhat.com/pipermail/ext3-users/2003-February/004726.html,
applied on the kernel from phoebe, solved this problem for me. Thank you Stephen.

Comment 6 Stephen Tweedie 2003-02-05 13:58:17 UTC
Thanks for testing it!


Comment 7 Chris Pepper 2003-08-05 13:59:50 UTC
Which RH 7.3 kernel incorporates this change? Is it in 8.0+? We've just seen it
on an IBM x345 w/ ServeRAID 5i controller, and were sticking with the stock 7.3
kernel for IBM's supported SR driver.

I need to know what RH kernel includes this fix, so I can find out if IBM
supports it yet.

Thanks,


Chris Pepper

Comment 8 Stephen Tweedie 2003-08-05 14:22:51 UTC
It's definitely in 9, and is probably in recent errata for older kernels as a
result, but I don't have information handy about exactly which errata it might
have first been included in.