From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2 i686)

Description of problem:
Possibly mount, who knows? I tried to upgrade from rh6.2 to rh7.1. After I chose "upgrade", but before any RPM action, the machine tried to reboot and failed. After power cycling, the original installation failed to boot due to a divide-by-zero CPU error in the swapper. I have no real details: the machine is locked and the error scrolls off the screen. Using a boot floppy or the 7.1 CD now results in the same problem as soon as an attempt is made to mount any of the partitions on the hard disk.

I have put the hard disk into a working machine with kernel 2.4.2 and an rh7.1 installation. If I try to mount any of the partitions I get a system error (divide by zero etc.; I can get this out of the logs, so see below), and mount locks up and cannot be killed. The machine continues to work but fails to unmount the disks on reboot. I managed to run fsck on the first partition (boot), but it failed on the remaining partitions. I managed to fix the remaining partitions using an alternative superblock.

So the bugs are:
1) The rh7.1 install trashed the filesystem; I don't know why and have no access to any record of what happened.
2) Under both rh6.2 and rh7.1 the system was brought down by trying to mount the corrupt partition.

Under 6.2 the error on boot was "divide by zero :0000" followed by a set of processor flags and the stack frame (scrolled away before I could get any details). This does not surprise me, as the faulty filesystem was loaded at boot.
Under 7.1 the error was similar and caught in the log files like so. I could reproduce it using mount to get this error and a dodgy system:

Oct 22 12:14:00 substitute kernel: divide error: 0000
Oct 22 12:14:00 substitute kernel: CPU: 0
Oct 22 12:14:00 substitute kernel: EIP: 0010:[ext2_read_super+1236/1776]
Oct 22 12:14:00 substitute kernel: EIP: 0010:[<c015a034>]
Oct 22 12:14:00 substitute kernel: EFLAGS: 00010246
Oct 22 12:14:00 substitute kernel: eax: 000f88fe ebx: 00001000 ecx: 00000000 edx: 00000000
Oct 22 12:14:00 substitute kernel: esi: c2452600 edi: 00000000 ebp: c20c3400 esp: c2195e84
Oct 22 12:14:00 substitute kernel: ds: 0018 es: 0018 ss: 0018
Oct 22 12:14:00 substitute kernel: Process mount (pid: 891, stackpage=c2195000)
Oct 22 12:14:00 substitute kernel: Stack: 00000007 c2195ee8 00000000 00000346 00000000 c20c3400 c20ddc60 00000001
Oct 22 12:14:00 substitute kernel: 00000000 00000000 00000003 00000000 00000000 c2452600 00000000 c38ea3a0
Oct 22 12:14:00 substitute kernel: c025b6f8 c0138bcb c2452600 00000000 00000000 00000000 00000000 00000000
Oct 22 12:14:00 substitute kernel: Call Trace: [read_super+251/368] [get_sb_bdev+320/416] [do_mount+378/704] [copy_mount_options+78/160] [sys_mount+124/192] [system_call+51/56]
Oct 22 12:14:00 substitute kernel: Call Trace: [<c0138bcb>] [<c0138df0>] [<c013998a>] [<c01397be>] [<c0139b4c>] [<c010901b>]
Oct 22 12:14:00 substitute kernel:
Oct 22 12:14:00 substitute kernel: Code: f7 f1 8b 96 e0 00 00 00 89 d1 89 86 e4 00 00 00 8d 44 02 ff

and while the system sometimes kept going, it failed to unmount the disks cleanly.
Here are a few more copies of the same error:

Oct 22 12:21:36 substitute kernel: divide error: 0000
Oct 22 12:21:36 substitute kernel: CPU: 0
Oct 22 12:21:36 substitute kernel: EIP: 0010:[ext2_read_super+1236/1776]
Oct 22 12:21:36 substitute kernel: EIP: 0010:[<c015a034>]
Oct 22 12:21:36 substitute kernel: EFLAGS: 00010246
Oct 22 12:21:36 substitute kernel: eax: 000f88fe ebx: 00001000 ecx: 00000000 edx: 00000000
Oct 22 12:21:36 substitute kernel: esi: c2657600 edi: 00000000 ebp: c24fa400 esp: c223fe84
Oct 22 12:21:36 substitute kernel: ds: 0018 es: 0018 ss: 0018
Oct 22 12:21:36 substitute kernel: Process mount (pid: 868, stackpage=c223f000)
Oct 22 12:21:36 substitute kernel: Stack: 00000007 c223fee8 00000000 00000346 00000000 c24fa400 c1f61860 00000001
Oct 22 12:21:36 substitute kernel: 00000246 00000000 00000003 00000000 00000000 c2657600 00000000 c3e9caa0
Oct 22 12:21:36 substitute kernel: c025b6f8 c0138bcb c2657600 00000000 00000000 00000000 00000000 00000000
Oct 22 12:21:36 substitute kernel: Call Trace: [read_super+251/368] [get_sb_bdev+320/416] [do_mount+378/704] [copy_mount_options+78/160] [sys_mount+124/192] [system_call+51/56]
Oct 22 12:21:36 substitute kernel: Call Trace: [<c0138bcb>] [<c0138df0>] [<c013998a>] [<c01397be>] [<c0139b4c>] [<c010901b>]
Oct 22 12:21:36 substitute kernel:
Oct 22 12:21:36 substitute kernel: Code: f7 f1 8b 96 e0 00 00 00 89 d1 89 86 e4 00 00 00 8d 44 02 ff

Oct 22 12:38:58 substitute kernel: divide error: 0000
Oct 22 12:38:58 substitute kernel: CPU: 0
Oct 22 12:38:58 substitute kernel: EIP: 0010:[ext2_read_super+1236/1776]
Oct 22 12:38:58 substitute kernel: EIP: 0010:[<c015a034>]
Oct 22 12:38:58 substitute kernel: EFLAGS: 00010246
Oct 22 12:38:58 substitute kernel: eax: 000f88fe ebx: 00001000 ecx: 00000000 edx: 00000000
Oct 22 12:38:58 substitute kernel: esi: c2848a00 edi: 00000000 ebp: c2da4400 esp: c1f49e84
Oct 22 12:38:58 substitute kernel: ds: 0018 es: 0018 ss: 0018
Oct 22 12:38:58 substitute kernel: Process mount (pid: 1104, stackpage=c1f49000)
Oct 22 12:38:58 substitute kernel: Stack: c1113b30 c2803005 00000000 00000346 00000000 c2da4400 c32dff00 00000001
Oct 22 12:38:58 substitute kernel: 00000246 00000000 00000003 00000000 00000000 c2848a00 00000000 c3e9c320
Oct 22 12:38:58 substitute kernel: c025b6f8 c0138bcb c2848a00 00000000 00000000 00000000 00000000 00000000
Oct 22 12:38:58 substitute kernel: Call Trace: [read_super+251/368] [get_sb_bdev+320/416] [do_mount+378/704] [copy_mount_options+78/160] [sys_mount+124/192] [system_call+51/56]
Oct 22 12:38:58 substitute kernel: Call Trace: [<c0138bcb>] [<c0138df0>] [<c013998a>] [<c01397be>] [<c0139b4c>] [<c010901b>]
Oct 22 12:38:58 substitute kernel:
Oct 22 12:38:58 substitute kernel: Code: f7 f1 8b 96 e0 00 00 00 89 d1 89 86 e4 00 00 00 8d 44 02 ff

and here is the ksymoops output for the last error:

ksymoops
ksymoops 2.4.0 on i686 2.4.2-2. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.2-2/ (default)
-m /boot/System.map-2.4.2-2 (default)

Warning: You did not tell me where to find symbol information. I will assume that the log matches the kernel and modules that are running right now and I'll use the default options above for symbol resolution. If the current kernel and/or modules do not match the log, you can get more accurate output by telling me the kernel version and where to find map, modules, ksyms etc. ksymoops -h explains the options.

Warning (compare_maps): ksyms_base symbol __VERSIONED_SYMBOL(shmem_file_setup) not found in System.map. Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says c01af860, System.map says c0153510. Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol usb_devfs_handle , usbcore says c48271a0, /lib/modules/2.4.2-2/kernel/drivers/usb/usbcore.o says c4826cc0. Ignoring /lib/modules/2.4.2-2/kernel/drivers/usb/usbcore.o entry
Reading Oops report from the terminal
Using defaults from ksymoops -t elf32-i386 -a i386

>>EIP; c015a034 <ext2_read_super+4d4/6f0> <=====
Trace; c0138bcb <read_super+fb/170>
Trace; c0138df0 <get_sb_bdev+140/1a0>
Trace; c013998a <do_mount+17a/2c0>
Trace; c01397be <copy_mount_options+4e/a0>
Trace; c0139b4c <sys_mount+7c/c0>
Trace; c010901b <system_call+33/38>
Code; c015a034 <ext2_read_super+4d4/6f0>
00000000 <_EIP>:
Code; c015a034 <ext2_read_super+4d4/6f0> <=====
 0: f7 f1 div %ecx,%eax <=====
Code; c015a036 <ext2_read_super+4d6/6f0>
 2: 8b 96 e0 00 00 00 mov 0xe0(%esi),%edx
Code; c015a03c <ext2_read_super+4dc/6f0>
 8: 89 d1 mov %edx,%ecx
Code; c015a03e <ext2_read_super+4de/6f0>
 a: 89 86 e4 00 00 00 mov %eax,0xe4(%esi)
Code; c015a044 <ext2_read_super+4e4/6f0>
 10: 8d 44 02 ff lea 0xffffffff(%edx,%eax,1),%eax

It seems to me that, regardless of the state of the partition, the filesystem utilities should behave more gracefully.
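For readers who want to check the trace by hand, here is a minimal sketch that decodes the first two bytes of the Code: line and looks up the operand register in the register dump. The helper names (decode_f7, parse_registers) are made up for this illustration and it only handles the register form of the f7 opcode group; it is not a general disassembler.

```python
import re

# Values copied from the oops above.
REGS_LINE = "eax: 000f88fe ebx: 00001000 ecx: 00000000 edx: 00000000"
CODE_LINE = "f7 f1 8b 96 e0 00 00 00 89 d1 89 86 e4 00 00 00 8d 44 02 ff"

# Register numbering used by the rm field of a 32-bit x86 ModRM byte.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# Opcode 0xF7 is a group; the ModRM reg field selects the operation.
F7_GROUP = {0: "test", 2: "not", 3: "neg", 4: "mul", 5: "imul", 6: "div", 7: "idiv"}


def decode_f7(code):
    """Decode a register-form f7-group instruction (hypothetical helper)."""
    opcode, modrm = code[0], code[1]
    if opcode != 0xF7:
        raise ValueError("not an f7-group instruction")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 3:
        raise ValueError("operand is in memory, not in a register")
    if reg not in F7_GROUP:
        raise ValueError("undefined f7 sub-opcode")
    return F7_GROUP[reg], REG32[rm]


def parse_registers(text):
    """Collect 'name: value' pairs such as 'ecx: 00000000' from oops text."""
    pairs = re.findall(r"\b(e[a-z]{2}): ([0-9a-f]{8})\b", text)
    return {name: int(value, 16) for name, value in pairs}


if __name__ == "__main__":
    op, reg = decode_f7(bytes.fromhex(CODE_LINE))
    regs = parse_registers(REGS_LINE)
    print(f"{op} %{reg}, {reg} = {regs[reg]:#x}")  # div %ecx, ecx = 0x0
```

The faulting instruction is an unsigned divide by ecx, and ecx is zero in every copy of the oops, which is consistent with the "divide error" message.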
I apologise if this has been dealt with elsewhere. Bugzilla seems to have crashed and I can't see the previous bug reports.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
Can't tell you. It will depend on the corrupt partition, which I have managed to fix. The rh7.1 install caused the problem while trying to update an rh6.2 install. The ksymoops output should let you know where it is. My money is on an untrapped divide by zero in ext2_read_super. I got the error by getting a corrupted filesystem (caused by the rh7.1 upgrade) and then trying to mount the disk.

Actual Results:
On attempting to mount the disk I get the divide by zero error, then the system goes bad: I can't mount any more, can't umount any more, and in the case of the system the disk came from, the fatal error prevented booting by any method.

Expected Results:
1) The upgrade should have occurred without error (this install version has trashed quite a few of my computers for a variety of reasons).
2) mount should have handled this gracefully (ext2 drivers at fault?).
3) The boot process should have handled it gracefully. If it were not for the fact that I have many spare Linux machines around, I would have lost all data.

Additional info:
This is a severe bug in the ext filesystem and appears to be caused by an easily trapped divide by zero error in ext2_read_super. It can be fatal to the system and appears to have other side effects as far as stability goes.
Looks like a kernel thing to me.
What happens now? -matt
There are two issues here. First, what caused the original problem? Unless it's reproducible, there's not enough information here to diagnose it; it could be just about any combination of hardware or software problems.

The second problem is, why is the kernel panicking on mounting of the new filesystems? I can see one or two possible reasons for that, all of which involve massively corrupt filesystems which ext2 isn't *quite* smart enough about rejecting. Can you possibly send me a copy of one of the corrupt superblocks (if you still have one) so that I can verify (a) exactly what is causing this, and (b) that it is fixed once the kernel is patched? The command

dd if=/dev/whatever of=superblock.dat bs=4k count=1

will do it.
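A dump made with that dd command can be inspected from user space before anyone tries to mount the disk again. The following is a rough sketch, not the kernel's own validation: it assumes the dump starts at the beginning of the partition (so the primary ext2 superblock sits at byte 1024), and the choice of fields to flag is an assumption about what a mount routine might divide by. Which field actually triggered the fault in this report is not established. The function names are made up for the example.

```python
import struct
import sys

SUPERBLOCK_OFFSET = 1024  # primary ext2 superblock starts 1024 bytes into the partition
EXT2_MAGIC = 0xEF53

# Leading part of the on-disk superblock: thirteen little-endian 32-bit
# fields, then s_mnt_count, s_max_mnt_count and s_magic as 16-bit fields.
HEAD = struct.Struct("<13IHhH")
NAMES = (
    "s_inodes_count", "s_blocks_count", "s_r_blocks_count",
    "s_free_blocks_count", "s_free_inodes_count", "s_first_data_block",
    "s_log_block_size", "s_log_frag_size", "s_blocks_per_group",
    "s_frags_per_group", "s_inodes_per_group", "s_mtime", "s_wtime",
    "s_mnt_count", "s_max_mnt_count", "s_magic",
)


def parse_superblock(dump):
    """Return the leading superblock fields from a raw partition dump."""
    if len(dump) < SUPERBLOCK_OFFSET + HEAD.size:
        raise ValueError("dump is too short to contain a superblock")
    return dict(zip(NAMES, HEAD.unpack_from(dump, SUPERBLOCK_OFFSET)))


def suspicious_fields(sb):
    """List problems that would make a later division unsafe (assumed checks)."""
    problems = []
    if sb["s_magic"] != EXT2_MAGIC:
        problems.append("bad magic")
    # Assumption: per-group counts are the kind of value used as a divisor.
    for name in ("s_blocks_per_group", "s_inodes_per_group"):
        if sb[name] == 0:
            problems.append(f"{name} is zero")
    return problems


if __name__ == "__main__":
    with open(sys.argv[1], "rb") as f:  # e.g. superblock.dat
        fields = parse_superblock(f.read())
    for problem in suspicious_fields(fields) or ["no obvious problems"]:
        print(problem)
```

Run it as `python check_sb.py superblock.dat` (the script name is arbitrary); an empty problem list only means these particular fields look sane, not that the filesystem is intact.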
It is reproducible for a particular machine, but a cycle of building, corrupting and recovering is lengthy and requires an additional free machine. I am not likely to do this in the near future. Some thoughts about it nonetheless:

If I try to install kernel 2.4 on this machine the superblock is corrupted. Once the disk is corrupted it can only be fixed on another machine; no version of the kernel can fix it on the original machine. I originally thought the cause of the problem was the format utilities in the install, but I now think it is mount and/or ext2.

I tried building the system on the hard disk in another machine: all the same hardware but a different motherboard, hence a different hard disk controller. The install went well and the machine was stable. I then rebuilt the original machine and booted it. It booted 100% OK and ran a stable system. I then rebooted the machine and the superblocks were corrupted. I repeated this using ext3 fs with the same results, except that the error on reboot was different: something like "iblock (1024) does not equal bblock size (4096)" (I can't remember exactly). In this case the error was trapped by ext3 fs; it was fatal.

I agree there are two issues: (a) the process of corruption (bug 54884) and (b) the process of failing to handle the error gracefully (bug 54873). There are fewer than 10 divides in the whole of the ext2_fs tree that could be suspect; it should be fairly easy to trap all of these. (a) is a tougher problem. As soon as the machine is available for long enough (when I am free) I will try to get you a corrupt superblock, but this may be some time (months). IDing the disk controller might be a start though (although I don't know how to do this apart from reading the chips on the board).
In current kernels, the only divide in ext2_read_super is divide-by-blocksize, and we now validate that first via a prior set_blocksize() call.
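As a user-space illustration of that validate-then-divide ordering (not the kernel code itself), the sketch below rejects a bad block size before it is ever used as a divisor. The limits used here, a power of two between 512 bytes and a 4096-byte page, are assumptions chosen for the example, and both function names are hypothetical.

```python
def validate_blocksize(size, min_size=512, page_size=4096):
    """Accept only power-of-two sizes in [min_size, page_size] (assumed limits)."""
    if size < min_size or size > page_size or size & (size - 1):
        raise ValueError(f"unsupported block size: {size}")
    return size


def blocks_needed(nbytes, blocksize):
    """Round nbytes up to whole blocks; the size check runs before the divide."""
    blocksize = validate_blocksize(blocksize)
    return (nbytes + blocksize - 1) // blocksize
```

With this ordering a zero or garbage block size from a corrupt superblock produces an ordinary error that the caller can report, instead of a divide fault.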