This oops from a Wolverine default kernel happened on an attempt to copy files from vfat to vfat directories. This oops actually looks pretty similar to problems I got when trying to install (#29427 and #29472 in bugzilla). Here is a decoded oops; the whole log from reboot to oops is attached.

Michal michal

ksymoops 2.4.0 on i686 2.4.1-0.1.9. Options used
     -v /boot/vmlinux-2.4.1-0.1.9 (specified)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.1-0.1.9/ (default)
     -m /boot/System.map-2.4.1-0.1.9 (default)

Error (expand_objects): cannot stat(/lib/ncr53c8xx.o) for ncr53c8xx
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
Error (pclose_local): find_objects pclose failed 0x100
Warning (compare_maps): ksyms_base symbol __VERSIONED_SYMBOL(shmem_file_setup) not found in vmlinux. Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says c01b0600, vmlinux says c01524a0. Ignoring ksyms_base entry
Unable to handle kernel paging request at virtual address 08000004
c0134336
*pde = 1d3a7067
Oops: 0000
CPU: 0
EIP: 0010:[<c0134336>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: c18a0000 ebx: 00000009 ecx: 000014e0 edx: 08000000
esi: 00040008 edi: 00002202 ebp: 0000000f esp: dd311e14
ds: 0018 es: 0018 ss: 0018
Process cp (pid: 791, stackpage=dd311000)
Stack: 000014e0 00000000 00000e00 cb2b0da0 00000c00 c013517b 00002202 00040008
       00000200 cb2b0da0 00000c00 c0135517 cbd54420 00000000 c0135538 cb2b0d40
       cb2b0f80 cb2ac000 dd311e6c 00000200 000016ae cb2b0d40 06b60600 00001000
Call Trace: [<c013517b>] [<c0135517>] [<c0135538>] [<c0134b0c>] [<c0134b5c>] [<e086d5f0>] [<c0135cfa>] [<e086d5f0>]
       [<e086efd5>] [<e086d5f0>] [<c0127ad9>] [<e086d731>] [<e086d709>] [<c0132fe6>] [<c0109007>]
Code: 39 72 04 75 f5 0f b7 42 08 3b 44 24 20 75 eb 66 39 7a 0c 75

>>EIP; c0134336 <get_hash_table+66/90> <=====
Trace; c013517b <unmap_underlying_metadata+1b/60>
Trace; c0135517 <__block_prepare_write+117/300>
Trace; c0135538 <__block_prepare_write+138/300>
Trace; c0134b0c <balance_dirty_state+c/50>
Trace; c0134b5c <balance_dirty+c/40>
Trace; e086d5f0 <[cdrom]cdrom_ioctl+ab0/e20>
Trace; c0135cfa <cont_prepare_write+22a/370>
Trace; e086d5f0 <[cdrom]cdrom_ioctl+ab0/e20>
Trace; e086efd5 <[cdrom]cdrom_sysctl_info+5a5/5d0>
Trace; e086d5f0 <[cdrom]cdrom_ioctl+ab0/e20>
Trace; c0127ad9 <generic_file_write+3a9/5f0>
Trace; e086d731 <[cdrom]cdrom_ioctl+bf1/e20>
Trace; e086d709 <[cdrom]cdrom_ioctl+bc9/e20>
Trace; c0132fe6 <sys_write+96/d0>
Trace; c0109007 <system_call+33/38>
Code; c0134336 <get_hash_table+66/90>
00000000 <_EIP>:
Code; c0134336 <get_hash_table+66/90> <=====
   0: 39 72 04 cmp %esi,0x4(%edx) <=====
Code; c0134339 <get_hash_table+69/90>
   3: 75 f5 jne fffffffa <_EIP+0xfffffffa> c0134330 <get_hash_table+60/90>
Code; c013433b <get_hash_table+6b/90>
   5: 0f b7 42 08 movzwl 0x8(%edx),%eax
Code; c013433f <get_hash_table+6f/90>
   9: 3b 44 24 20 cmp 0x20(%esp,1),%eax
Code; c0134343 <get_hash_table+73/90>
   d: 75 eb jne fffffffa <_EIP+0xfffffffa> c0134330 <get_hash_table+60/90>
Code; c0134345 <get_hash_table+75/90>
   f: 66 39 7a 0c cmp %di,0xc(%edx)
Code; c0134349 <get_hash_table+79/90>
  13: 75 00 jne 15 <_EIP+0x15> c013434b <get_hash_table+7b/90>

2 warnings and 4 errors issued. Results may not be reliable.
Created attachment 11081: log file leading to an oops
I am not sure that tying this bug to vfat is the correct thing to do. It was observed while attempting to copy between vfat file systems, but possibly only because I cannot do much more with this minimal installation yet. The decoded oops quoted above seems to implicate the CD-ROM driver and an IDE channel which was not even used during the operation in question. The copy was from /dev/hdg to /dev/hde, and the CD drive is /dev/hda.
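To make the observation about the CD-ROM entries easier to check, here is a minimal sketch that separates the "Trace;" lines of a decoded ksymoops report into core-kernel and module symbols. The function name split_trace and the regular expression are illustrative only and not part of ksymoops. Note that ksymoops itself warned that it could not stat several module objects and that results may not be reliable, so entries attributed to a module deserve particular suspicion.

```python
import re

# Matches decoded ksymoops trace entries such as:
#   Trace; e086d5f0 <[cdrom]cdrom_ioctl+ab0/e20>
#   Trace; c0132fe6 <sys_write+96/d0>
TRACE_RE = re.compile(
    r"Trace;\s+([0-9a-f]{8})\s+<(?:\[(\w+)\])?([\w.]+)\+([0-9a-f]+)/([0-9a-f]+)>"
)


def split_trace(text):
    """Return (kernel_entries, module_entries) found in ksymoops output.

    Each entry is (address, owner, symbol, offset, size); owner is the
    module name, or "vmlinux" when the symbol has no [module] prefix.
    """
    kernel, modules = [], []
    for addr, module, symbol, offset, size in TRACE_RE.findall(text):
        entry = (addr, module or "vmlinux", symbol, int(offset, 16), int(size, 16))
        (modules if module else kernel).append(entry)
    return kernel, modules
```

Run against the report above, this would list the cdrom_ioctl and cdrom_sysctl_info entries separately from the write path (sys_write, generic_file_write, cont_prepare_write and so on), which is the part of the trace that matches what cp was actually doing.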
This defect is considered MUST-FIX for Florence Gold release
It may be vfat, vm, or interaction between them. Could you see if this happens with the latest kernel from rawhide? I think that's currently 2.4.1-0.1.14
I now know that this is NOT associated with vfat, as I suggested from the very beginning (again, see #29427 and #29472). I can reproduce similar trouble with the 2.2.19pre14 and 2.4.2-ac5 kernels, and also when copying from one ext2 file system to another. I simply had the biggest block of files on a vfat partition. I cannot exclude broken hardware at this point.
I think that I have found what triggers (as opposed to causes) the described behaviour. In the 1005C Award BIOS there are two "advanced" options:

     System Performance Setting [Optimal, Normal]
     USB Legacy Support [Auto, Enabled, Disabled]

If the first one is set to "Normal" and the second one to "Disabled", the whole system becomes stable. I copied around 1.2 GB of files from various file systems to a directory on ext2 without any ill effects and successfully ran 'diff -r' between two directories of 475 MB each. If the BIOS options are set any other way, one should expect spectacular blowups with corrupted file systems and other nasty effects after the first oops.

It is difficult to know what "System Performance Setting" is set to, as it always shows "Optimal" regardless of its state at the last save. But the system's behaviour depends on how it was set. How "USB Legacy Support" comes into the picture I cannot even imagine. I tried with both 2.2.19pre and 2.4 kernels and the picture does not change. I still have to run more extensive tests, including a full installation (cf. the other reports referenced above), but this looks like it. I have no idea how to even start explaining all of that in installation instructions.
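For anyone who wants to repeat this kind of copy-then-compare test, the following is a rough sketch of the same idea as 'cp' followed by 'diff -r', using checksums instead. The function names are made up for illustration, and it assumes the source tree holds only regular files and directories (no device nodes). Be aware that a freshly written copy may be read back from the page cache rather than from the disk, so a clean result is weaker evidence than a failure.

```python
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_tree(src, dst):
    """Return sorted relative paths of regular files in src that are
    missing from dst or whose contents differ there."""
    bad = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            src_path = os.path.join(root, name)
            if os.path.islink(src_path) or not os.path.isfile(src_path):
                continue  # only regular files are compared
            rel = os.path.relpath(src_path, src)
            dst_path = os.path.join(dst, rel)
            if (not os.path.isfile(dst_path)
                    or file_digest(src_path) != file_digest(dst_path)):
                bad.append(rel)
    return sorted(bad)


def copy_and_verify(src, dst):
    """Copy src to dst (dst must not exist yet), then verify the copy."""
    shutil.copytree(src, dst, symlinks=True)
    return verify_tree(src, dst)
```

An empty list from copy_and_verify means every regular file matched; any path in the list points at a file worth inspecting by hand.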
Thanks for that information, that's useful. This is with the Promise controller discussed in #29427, right?
Yes. The same "box from hell" all over. K7 Athlon on an Asus A7V, Award 1005C BIOS, PDC20265 Promise IDE controller, NCR 53c810 SCSI controller (but at this moment I doubt that either of the latter two has anything to do with it).
In the currently available kernels from rawhide (2.4.2-0.1.25 or later), lots of corruption issues with VIA chipsets in combination with Promise controllers are fixed. Can you please try one of these kernels and verify that it actually fixes the problem? (It basically fixes up the BIOS settings you mentioned.) I will close this bug; if you can reproduce the problem, please reopen it.
I am afraid that I have bad news. I tried the latest kernel from rawhide, i.e. "2.4.2-0.1.29 #1 Thu Mar 15 20:34:20 EST 2001 i686". After switching the BIOS to default factory settings (the board had been updated to the latest version, 1007, by then), removing close to 1 GB of data with 'rm -rf' went without trouble. But an attempt to copy some files to a target went awry after 192 megabytes had been copied from an ext2 file system. I can see that this happened while copying /dev/ttyU* nodes, as only 46 out of 288 were found later in the copy and /dev/ttyU136 ended up as lost+found/#6171. :-)

In case you wonder, previous blowups happened when copying regular files (there are no special nodes if data are coming from a vfat system) or directories. Only the amount of data seems to matter, and 192 megabytes is the record so far. With previous kernels this was happening regularly in the 130 - 140 megabyte range, so some improvement can be claimed. :-)

The failed copy was followed by an oops during an attempted shutdown. Luckily the sysrq key (I do have that turned on) still worked and it was possible to remount all file systems read-only. After that failure I switched the BIOS back to the "safe" settings and in that form, with the same kernel, I was able to copy around 1 GB of stuff from one disk to another without any incidents. I attach my full log of errors from the last attempt.

I started to wonder whether this has anything to do with the "256 -> 255" error/not-error which was discussed on the linux-kernel list very recently.

Note: I am afraid that this particular test machine is going away any hour now. I have already kept it much longer than I really should have. Sigh!
Created attachment 13024: fragment of log files with errors triggered by 'cp'
We used 128 for the maxsector, not 256, so we're safe against that. If you still have some time, I would appreciate the output of "lspci -vxxx" for both the "safe" and the "failure" BIOS settings. Thanks!
It is not that bad. :-) This box is still here but it will likely go pretty soon. The SCSI controller definitely dislikes 'lspci -vxxx' and reacts with

     ncr53c810-0: SCSI parity error detected: SCR1=65 DBC=50000000 SSTAT1=f

and is unhappy on reboot. Also, with "factory" settings in the BIOS I start collecting messages like these:

     usb-uhci.c: interrupt, status 31, frame# 506
     usb-uhci.c: interrupt, status 31, frame# 1357
     usb-uhci.c: interrupt, status 31, frame# 1534

This does not happen with the "safe" settings, when "Legacy USB Support" is turned off. With "factory" BIOS settings I also had problems on shutdown: the claim was that network file systems were busy, and the whole process got stuck (the same kernel from rawhide, and there are no troubles of that sort with the BIOS in the "safe" position). If you want to tell me that this hardware/firmware is junk, I heartily agree. The attached 'lspci.default' is the output with the BIOS at defaults and 'lspci.safe' is my "normal" setup ("System Performance" set to "Normal" and "Legacy USB Support" off).
Created attachment 13043: an output from 'lspci -vxxx' with different BIOS settings
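Comparing two 'lspci -vxxx' dumps by eye is tedious, so here is a small sketch that reports which configuration-space bytes differ per device. It is only an illustration: the function names are invented, and it assumes the usual layout of a device line such as "00:00.0 Host bridge: ..." (no PCI domain prefix) followed by hex rows such as "00: 06 11 05 03 ...".

```python
import re

# Device header, e.g. "00:04.1 IDE interface: ..."
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s")
# Hex dump row, e.g. "40: 08 01 00 00 ..."
HEX_RE = re.compile(r"^([0-9a-f]{2,3}):((?: [0-9a-f]{2})+)\s*$")


def parse_lspci_hex(text):
    """Map each device address to {config offset: byte value}."""
    devices, current = {}, None
    for line in text.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            current = devices.setdefault(m.group(1), {})
            continue
        m = HEX_RE.match(line)
        if m and current is not None:
            base = int(m.group(1), 16)
            for i, byte in enumerate(m.group(2).split()):
                current[base + i] = int(byte, 16)
    return devices


def diff_config(a_text, b_text):
    """List (device, offset, value_in_a, value_in_b) for bytes that differ
    between two dumps, for devices and offsets present in both."""
    a, b = parse_lspci_hex(a_text), parse_lspci_hex(b_text)
    changes = []
    for dev in sorted(set(a) & set(b)):
        for off in sorted(set(a[dev]) & set(b[dev])):
            if a[dev][off] != b[dev][off]:
                changes.append((dev, off, a[dev][off], b[dev][off]))
    return changes
```

Feeding it the contents of 'lspci.safe' and 'lspci.default' would give a short list of registers that the two BIOS configurations program differently, which is a reasonable starting point for chipset-specific investigation.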
We have found SO many problems with VIA chipsets that we decided to turn off IDE DMA for those machines. It is nearly impossible to fix the corruption, as it is a chipset/motherboard bug and the workarounds are board and BIOS-version specific.