Bug 54873 - fatal mount / ext2 filesystems error

Summary: fatal mount / ext2 filesystems error

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.1
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brian Brock
Docs Contact:
URL:	system has fatal error with corrupt p...
Whiteboard:
Depends On:
Blocks:

Reported:	2001-10-22 12:39 UTC by Matt Clark
Modified:	2007-04-18 16:37 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-09-10 10:10:50 UTC
Embargoed:


Description Matt Clark 2001-10-22 12:39:30 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2 i686)

Description of problem:

Possibly mount, who knows?
I tried to "upgrade" from rh6.2 to rh7.1. After I chose "upgrade", but before
any RPM action, the machine tried to reboot and failed.

After power cycling, the original installation failed to boot due to
a divide by zero CPU error in the swapper. I have no real details: the
machine locks up and the error scrolls off screen.

Using a boot floppy or the 7.1 CD now results in the same problem as soon
as an attempt is made to mount any of the partitions on the hard disk.

I have put the hard disk into a working
machine with kernel 2.4.2 and a rh7.1
installation.  If I try to mount any
of the partitions I get a system error
divide by zero etc etc (I can get this
out of the logs so see below) and mount
locks and cannot be killed.  The machine
continues to work but fails to unmount the disks on reboot.

I have managed to run fsck on the first partition (boot) but it fails on the
remaining partitions.  I have managed to fix the remaining partitions using
an alternative superblock.

So the bugs are:

1) The rh7.1 install trashed the filesystem. I don't know why and
have no access to any record of what happened.

2) Under both rh6.2 and rh7.1 the system was brought down by trying to
mount the corrupt partition.

Under 6.2 the error on boot was "divide by zero :0000" followed by a
set of processor flags and the stack frame (it scrolled away before I could
get any details). This does not surprise me, as the faulty filesystem was
mounted at boot.

Under 7.1 the error was similar and was caught in the log files, as shown
here. I could reproduce it by running mount, which gave this error and left
a dodgy system.

Oct 22 12:14:00 substitute kernel: divide error: 0000
Oct 22 12:14:00 substitute kernel: CPU:    0
Oct 22 12:14:00 substitute kernel: EIP:    0010:[ext2_read_super+1236/1776]
Oct 22 12:14:00 substitute kernel: EIP:    0010:[<c015a034>]
Oct 22 12:14:00 substitute kernel: EFLAGS: 00010246
Oct 22 12:14:00 substitute kernel: eax: 000f88fe   ebx: 00001000   ecx:
00000000   edx: 00000000
Oct 22 12:14:00 substitute kernel: esi: c2452600   edi: 00000000   ebp:
c20c3400   esp: c2195e84
Oct 22 12:14:00 substitute kernel: ds: 0018   es: 0018   ss: 0018
Oct 22 12:14:00 substitute kernel: Process mount (pid: 891,
stackpage=c2195000)
Oct 22 12:14:00 substitute kernel: Stack: 00000007 c2195ee8 00000000
00000346 00000000 c20c3400 c20ddc60 00000001
Oct 22 12:14:00 substitute kernel:        00000000 00000000 00000003
00000000 00000000 c2452600 00000000 c38ea3a0
Oct 22 12:14:00 substitute kernel:        c025b6f8 c0138bcb c2452600
00000000 00000000 00000000 00000000 00000000
Oct 22 12:14:00 substitute kernel: Call Trace: [read_super+251/368]
[get_sb_bdev+320/416] [do_mount+378/704] [copy_mount_options+78/160]
[sys_mount+124/192] [system_call+51/56]
Oct 22 12:14:00 substitute kernel: Call Trace: [<c0138bcb>] [<c0138df0>]
[<c013998a>] [<c01397be>] [<c0139b4c>] [<c010901b>]
Oct 22 12:14:00 substitute kernel:
Oct 22 12:14:00 substitute kernel: Code: f7 f1 8b 96 e0 00 00 00 89 d1 89
86 e4 00 00 00 8d 44 02 ff

and while the system sometimes kept going it failed to unmount the disks
cleanly.

Here are a few more copies of the same
error

Oct 22 12:21:36 substitute kernel: divide error: 0000
Oct 22 12:21:36 substitute kernel: CPU:    0
Oct 22 12:21:36 substitute kernel: EIP:    0010:[ext2_read_super+1236/1776]
Oct 22 12:21:36 substitute kernel: EIP:    0010:[<c015a034>]
Oct 22 12:21:36 substitute kernel: EFLAGS: 00010246
Oct 22 12:21:36 substitute kernel: eax: 000f88fe   ebx: 00001000   ecx:
00000000   edx: 00000000
Oct 22 12:21:36 substitute kernel: esi: c2657600   edi: 00000000   ebp:
c24fa400   esp: c223fe84
Oct 22 12:21:36 substitute kernel: ds: 0018   es: 0018   ss: 0018
Oct 22 12:21:36 substitute kernel: Process mount (pid: 868,
stackpage=c223f000)
Oct 22 12:21:36 substitute kernel: Stack: 00000007 c223fee8 00000000
00000346 00000000 c24fa400 c1f61860 00000001
Oct 22 12:21:36 substitute kernel:        00000246 00000000 00000003
00000000 00000000 c2657600 00000000 c3e9caa0
Oct 22 12:21:36 substitute kernel:        c025b6f8 c0138bcb c2657600
00000000 00000000 00000000 00000000 00000000
Oct 22 12:21:36 substitute kernel: Call Trace: [read_super+251/368]
[get_sb_bdev+320/416] [do_mount+378/704] [copy_mount_options+78/160]
[sys_mount+124/192] [system_call+51/56]
Oct 22 12:21:36 substitute kernel: Call Trace: [<c0138bcb>] [<c0138df0>]
[<c013998a>] [<c01397be>] [<c0139b4c>] [<c010901b>]
Oct 22 12:21:36 substitute kernel:
Oct 22 12:21:36 substitute kernel: Code: f7 f1 8b 96 e0 00 00 00 89 d1 89
86 e4 00 00 00 8d 44 02 ff

Oct 22 12:38:58 substitute kernel: divide error: 0000
Oct 22 12:38:58 substitute kernel: CPU:    0
Oct 22 12:38:58 substitute kernel: EIP:    0010:[ext2_read_super+1236/1776]
Oct 22 12:38:58 substitute kernel: EIP:    0010:[<c015a034>]
Oct 22 12:38:58 substitute kernel: EFLAGS: 00010246
Oct 22 12:38:58 substitute kernel: eax: 000f88fe   ebx: 00001000   ecx:
00000000   edx: 00000000
Oct 22 12:38:58 substitute kernel: esi: c2848a00   edi: 00000000   ebp:
c2da4400   esp: c1f49e84
Oct 22 12:38:58 substitute kernel: ds: 0018   es: 0018   ss: 0018
Oct 22 12:38:58 substitute kernel: Process mount (pid: 1104,
stackpage=c1f49000)
Oct 22 12:38:58 substitute kernel: Stack: c1113b30 c2803005 00000000
00000346 00000000 c2da4400 c32dff00 00000001
Oct 22 12:38:58 substitute kernel:        00000246 00000000 00000003
00000000 00000000 c2848a00 00000000 c3e9c320
Oct 22 12:38:58 substitute kernel:        c025b6f8 c0138bcb c2848a00
00000000 00000000 00000000 00000000 00000000
Oct 22 12:38:58 substitute kernel: Call Trace: [read_super+251/368]
[get_sb_bdev+320/416] [do_mount+378/704] [copy_mount_options+78/160]
[sys_mount+124/192] [system_call+51/56]
Oct 22 12:38:58 substitute kernel: Call Trace: [<c0138bcb>] [<c0138df0>]
[<c013998a>] [<c01397be>] [<c0139b4c>] [<c010901b>]
Oct 22 12:38:58 substitute kernel:
Oct 22 12:38:58 substitute kernel: Code: f7 f1 8b 96 e0 00 00 00 89 d1 89
86 e4 00 00 00 8d 44 02 ff

and here is a ksymoops output for the last error
ksymoops
ksymoops 2.4.0 on i686 2.4.2-2.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.2-2/ (default)
     -m /boot/System.map-2.4.2-2 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Warning (compare_maps): ksyms_base symbol
__VERSIONED_SYMBOL(shmem_file_setup) not found in System.map.  Ignoring
ksyms_base entry
Warning (compare_maps): mismatch on symbol partition_name  , ksyms_base
says c01af860, System.map says c0153510.  Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol usb_devfs_handle  , usbcore says
c48271a0, /lib/modules/2.4.2-2/kernel/drivers/usb/usbcore.o says c4826cc0.
Ignoring /lib/modules/2.4.2-2/kernel/drivers/usb/usbcore.o entry
Reading Oops report from the terminal
Oct 22 12:38:58 substitute kernel: divide error: 0000
Oct 22 12:38:58 substitute kernel: CPU:    0
Oct 22 12:38:58 substitute kernel: EIP:    0010:[ext2_read_super+1236/1776]
Oct 22 12:38:58 substitute kernel: EIP:    0010:[<c015a034>]
Using defaults from ksymoops -t elf32-i386 -a i386
Oct 22 12:38:58 substitute kernel: EFLAGS: 00010246
Oct 22 12:38:58 substitute kernel: eax: 000f88fe   ebx: 00001000   ecx:
00000000   edx: 00000000
Oct 22 12:38:58 substitute kernel: esi: c2848a00   edi: 00000000   ebp:
c2da4400   esp: c1f49e84
Oct 22 12:38:58 substitute kernel: ds: 0018   es: 0018   ss: 0018
Oct 22 12:38:58 substitute kernel: Process mount (pid: 1104,
stackpage=c1f49000)
Oct 22 12:38:58 substitute kernel: Stack: c1113b30 c2803005 00000000
00000346 00000000 c2da4400 c32dff00 00000001
Oct 22 12:38:58 substitute kernel:        00000246 00000000 00000003
00000000 00000000 c2848a00 00000000 c3e9c320
Oct 22 12:38:58 substitute kernel:        c025b6f8 c0138bcb c2848a00
00000000 00000000 00000000 00000000 00000000
Oct 22 12:38:58 substitute kernel: Call Trace: [read_super+251/368]
[get_sb_bdev+320/416] [do_mount+378/704] [copy_mount_options+78/160]
[sys_mount+124/192] [system_call+51/56]
Oct 22 12:38:58 substitute kernel: Call Trace: [<c0138bcb>] [<c0138df0>]
[<c013998a>] [<c01397be>] [<c0139b4c>] [<c010901b>]
Oct 22 12:38:58 substitute kernel:
Oct 22 12:38:58 substitute kernel: Code: f7 f1 8b 96 e0 00 00 00 89 d1 89
86 e4 00 00 00 8d 44 02 ff

>>EIP; c015a034 <ext2_read_super+4d4/6f0>   <=====
Trace; c0138bcb <read_super+fb/170>
Trace; c0138df0 <get_sb_bdev+140/1a0>
Trace; c013998a <do_mount+17a/2c0>
Trace; c01397be <copy_mount_options+4e/a0>
Trace; c0139b4c <sys_mount+7c/c0>
Trace; c010901b <system_call+33/38>
Code;  c015a034 <ext2_read_super+4d4/6f0>
00000000 <_EIP>:
Code;  c015a034 <ext2_read_super+4d4/6f0>   <=====
   0:   f7 f1                     div    %ecx,%eax   <=====
Code;  c015a036 <ext2_read_super+4d6/6f0>
   2:   8b 96 e0 00 00 00         mov    0xe0(%esi),%edx
Code;  c015a03c <ext2_read_super+4dc/6f0>
   8:   89 d1                     mov    %edx,%ecx
Code;  c015a03e <ext2_read_super+4de/6f0>
   a:   89 86 e4 00 00 00         mov    %eax,0xe4(%esi)
Code;  c015a044 <ext2_read_super+4e4/6f0>
  10:   8d 44 02 ff               lea    0xffffffff(%edx,%eax,1),%eax
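
For illustration, the register dump can be checked against what the faulting instruction does. The sketch below is a minimal Python model of the 32-bit x86 `div` instruction; the function name `div32` is made up for this example and is not part of any tool mentioned in this report. It assumes the usual semantics: EDX:EAX is divided by the operand, and a zero divisor or a quotient that does not fit in 32 bits raises a divide error.

```python
def div32(edx: int, eax: int, divisor: int) -> tuple:
    """Model x86 `div r32`: divide the 64-bit value EDX:EAX by a 32-bit divisor.

    Returns (quotient, remainder), which the CPU would place in EAX and EDX.
    """
    if divisor == 0:
        # The CPU raises #DE here; the kernel log reports it as "divide error".
        raise ZeroDivisionError("divide error: divisor is zero")
    dividend = (edx << 32) | eax
    quotient, remainder = divmod(dividend, divisor)
    if quotient > 0xFFFFFFFF:
        # A quotient that does not fit in 32 bits also raises #DE.
        raise OverflowError("divide error: quotient does not fit in 32 bits")
    return quotient, remainder


# Register values copied from the oops: eax=000f88fe, ecx=00000000, edx=00000000.
try:
    div32(edx=0x00000000, eax=0x000F88FE, divisor=0x00000000)
except ZeroDivisionError as exc:
    print(exc)
```

With ecx at zero, any value of eax faults, which would be consistent with the corrupt superblock supplying a zero where the code expects a non-zero divisor.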

It seems to me that, regardless of the state of the partition, the
filesystem utilities should behave more gracefully.



I apologise if this has been dealt with elsewhere- bugzilla seems to have
crashed and I can't see the previous bug reports






Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
Can't tell you. It will depend on the corrupt partition, which I have managed
to fix. The rh7.1 install caused the problem while trying to update a rh6.2
install. The ksymoops output should let you know where it is. My money is on
an untrapped divide by zero in ext2_read_super.

I got the error by getting a corrupted filesystem (caused by the
rh7.1 upgrade) and then trying to mount the disk.





Actual Results:  
On attempting to mount the disk I get the divide by zero
error, then the system goes bad: I can't mount any more, can't
umount any more, and in the case of the system the disk came from,
the fatal error prevented booting by any method.



Expected Results:  
1) The upgrade should have occurred without error (this install version
has trashed quite a few of my computers for a variety of reasons).

2) mount should have handled this gracefully (ext2 drivers at fault?)

3) The boot process should have handled it gracefully. If it were not
for the fact that I have many spare Linux machines around, I would have
lost all data.


Additional info:

This is a severe bug in the ext2 filesystem and appears to
be caused by an easily trapped divide by zero error in
ext2_read_super.  It can be fatal to the system and appears
to have other side effects as far as stability goes.

Comment 1 Bernhard Rosenkraenzer 2001-10-23 12:04:03 UTC

Looks like a kernel thing to me.

Comment 2 Matt Clark 2001-10-29 12:46:03 UTC

What happens now? -matt

Comment 3 Stephen Tweedie 2001-11-02 19:03:11 UTC

There are two issues here.  First, what caused the original problem?  Unless
it's reproducible, there's not enough information here to diagnose it --- it
could be just about any combination of hardware or software problems.

The second problem is, why is the kernel panicking on mounting of the new
filesystems? I can see one or two possible reasons for that, all of which
involve massively corrupt filesystems which ext2 isn't *quite* smart enough
about rejecting.

Can you possibly send me a copy of one of the corrupt superblocks (if you still
have one) so that I can verify (a) exactly what is causing this, and (b) that it
is fixed once the kernel is patched?  The command

   dd if=/dev/whatever of=superblock.dat bs=4k count=1

will do it.
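
A dump made with the dd command above can also be inspected offline before sending it. The following is a minimal sketch in Python, not an official tool: the function name and the returned dictionary are made up for illustration, and the 64 KiB block-size ceiling is an assumed sanity limit. It relies on the standard ext2 on-disk layout, in which the primary superblock starts 1024 bytes into the partition and `s_magic` is 0xEF53, and it flags fields such as `s_blocks_per_group` that the mount path uses as divisors.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # the primary ext2 superblock starts 1024 bytes into the partition
EXT2_MAGIC = 0xEF53


def inspect_superblock(image: bytes) -> dict:
    """Decode a few ext2 superblock fields and name the ones that look corrupt."""
    sb = image[SUPERBLOCK_OFFSET:]
    if len(sb) < 58:
        raise ValueError("dump is too short to contain a superblock")

    # The first eleven fields are little-endian 32-bit integers (offsets 0 to 43).
    fields = struct.unpack_from("<11I", sb, 0)
    blocks_count = fields[1]
    first_data_block = fields[5]
    log_block_size = fields[6]
    blocks_per_group = fields[8]
    inodes_per_group = fields[10]
    (magic,) = struct.unpack_from("<H", sb, 56)  # s_magic is a 16-bit field at offset 56

    problems = []
    if magic != EXT2_MAGIC:
        problems.append("s_magic")
    if log_block_size > 6:  # assumed sanity limit: block size of at most 64 KiB
        problems.append("s_log_block_size")
    if blocks_per_group == 0:  # used as a divisor when counting block groups
        problems.append("s_blocks_per_group")
    if inodes_per_group == 0:
        problems.append("s_inodes_per_group")

    return {
        "magic": magic,
        "block_size": (1024 << log_block_size) if log_block_size <= 6 else None,
        "blocks_count": blocks_count,
        "first_data_block": first_data_block,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
        "problems": problems,
    }
```

To run it on the file written by dd, read the file in binary mode and pass the bytes in, for example `inspect_superblock(open("superblock.dat", "rb").read())`. An empty `problems` list only means these few fields look plausible, not that the filesystem is healthy.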

Comment 4 Matt Clark 2001-11-02 19:59:51 UTC

It is reproducible for a particular machine, but a cycle of building, corrupting
and recovering is lengthy and requires an additional free machine.  I am not
likely to do this in the near future.

Some thoughts about it none the less-
If I try to install kernel 2.4 on this machine the superblock is corrupted.
Once the disk is corrupted then it can only be fixed on another machine-
no version of the kernel can fix it on the original machine.
I originally thought the cause of the problem was the format utilities in 
the install but I now think it is mount and / or ext2.

I tried building the system on the hard disk in another machine, with all the
same hardware but a different motherboard and hence a different hard disk
controller. The install went well and the machine was stable.

I then rebuilt the original machine and booted it.  It booted 100% OK and ran
a stable system.

I then rebooted the machine and the superblocks were corrupted.

I repeated this using ext3 fs with the same results except the error on 
reboot was different- something like iblock (1024) does not equal bblock 
size (4096)- can't remember exactly.  In this case the error was trapped 
by ext3 fs - it was fatal.

I agree there are two issues:
(a) the process of corruption (bug 54884)
(b) the process of failing to handle the error gracefully (bug 54873)

There are fewer than 10 divides in the whole of the ext2_fs tree that could
be suspect, so it should be fairly easy to trap all of these.

(a) is a tougher problem. As soon as the machine is available for long enough
(when I am free) I will try to get you a corrupt superblock, but this may be
some time (months).  IDing the disk controller might be a start though (although
I don't know how to do this apart from reading the chips on the board).

Comment 5 Stephen Tweedie 2004-09-10 10:10:50 UTC

In current kernels, the only divide in ext2_read_super is
divide-by-blocksize, and we now validate that first via a prior
set_blocksize() call.
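
The general pattern described here, validating a value before it is used as a divisor, can be modeled outside the kernel. The Python sketch below is an illustration only: it does not reproduce what set_blocksize() actually checks, the function names are hypothetical, and the 512-byte and 4096-byte limits are assumptions chosen for the example.

```python
def validate_block_size(size: int, min_size: int = 512, max_size: int = 4096) -> int:
    """Return size unchanged if it is a usable block size, else raise ValueError.

    The limits are illustrative assumptions, not the kernel's actual rules.
    """
    if size < min_size or size > max_size:
        raise ValueError(f"block size {size} is outside {min_size}..{max_size}")
    if size & (size - 1):
        # A power of two has exactly one bit set, so size & (size - 1) is zero.
        raise ValueError(f"block size {size} is not a power of two")
    return size


def blocks_on_device(device_bytes: int, block_size: int) -> int:
    """Count whole blocks on a device, validating the divisor first."""
    return device_bytes // validate_block_size(block_size)
```

A corrupt value such as 0 is rejected with an error the caller can handle, instead of reaching the division and faulting.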
