Bug 29508

Summary:

[via] Oops on copy from Wolverine kernel

Product:

[Retired] Red Hat Linux

Reporter:

Michal Jaegermann <michal>

Component:

kernel

Assignee:

Michael K. Johnson <johnsonm>

Status:

CLOSED RAWHIDE

QA Contact:

Brock Organ <borgan>

Severity:

medium

Priority:

medium

Version:

7.1

Hardware:

i386

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2001-03-19 19:37:15 UTC

Attachments:

Description	Flags
log file leading to an oops	none
fragment of log files with errors triggered by 'cp'	none
an output from 'lspci -vxxx' with different BIOS settings	none

Description Michal Jaegermann 2001-02-26 04:06:41 UTC

This oops from a Wolverine default kernel happened on an attempt to
copy files from vfat to vfat directories.   This oops actually looks
pretty similar to problems I got when trying to install (#29427 and
#29472 in bugzilla).

Here is a decoded oops and the whole log from reboot to oops is
attached.

   Michal

ksymoops 2.4.0 on i686 2.4.1-0.1.9.  Options used
     -v /boot/vmlinux-2.4.1-0.1.9 (specified)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.1-0.1.9/ (default)
     -m /boot/System.map-2.4.1-0.1.9 (default)

Error (expand_objects): cannot stat(/lib/ncr53c8xx.o) for ncr53c8xx
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
Error (pclose_local): find_objects pclose failed 0x100
Warning (compare_maps): ksyms_base symbol __VERSIONED_SYMBOL(shmem_file_setup) not found in vmlinux.  Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says c01b0600, vmlinux says c01524a0.  Ignoring ksyms_base entry

Unable to handle kernel paging request at virtual address 08000004
c0134336
*pde = 1d3a7067
Oops: 0000
CPU:    0
EIP:    0010:[<c0134336>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: c18a0000   ebx: 00000009   ecx: 000014e0   edx: 08000000
esi: 00040008   edi: 00002202   ebp: 0000000f   esp: dd311e14
ds: 0018   es: 0018   ss: 0018
Process cp (pid: 791, stackpage=dd311000)
Stack: 000014e0 00000000 00000e00 cb2b0da0 00000c00 c013517b 00002202 00040008
       00000200 cb2b0da0 00000c00 c0135517 cbd54420 00000000 c0135538 cb2b0d40
       cb2b0f80 cb2ac000 dd311e6c 00000200 000016ae cb2b0d40 06b60600 00001000
Call Trace: [<c013517b>] [<c0135517>] [<c0135538>] [<c0134b0c>] [<c0134b5c>] [<e086d5f0>] [<c0135cfa>]
       [<e086d5f0>] [<e086efd5>] [<e086d5f0>] [<c0127ad9>] [<e086d731>] [<e086d709>] [<c0132fe6>] [<c0109007>]
Code: 39 72 04 75 f5 0f b7 42 08 3b 44 24 20 75 eb 66 39 7a 0c 75

>>EIP; c0134336 <get_hash_table+66/90>   <=====
Trace; c013517b <unmap_underlying_metadata+1b/60>
Trace; c0135517 <__block_prepare_write+117/300>
Trace; c0135538 <__block_prepare_write+138/300>
Trace; c0134b0c <balance_dirty_state+c/50>
Trace; c0134b5c <balance_dirty+c/40>
Trace; e086d5f0 <[cdrom]cdrom_ioctl+ab0/e20>
Trace; c0135cfa <cont_prepare_write+22a/370>
Trace; e086d5f0 <[cdrom]cdrom_ioctl+ab0/e20>
Trace; e086efd5 <[cdrom]cdrom_sysctl_info+5a5/5d0>
Trace; e086d5f0 <[cdrom]cdrom_ioctl+ab0/e20>
Trace; c0127ad9 <generic_file_write+3a9/5f0>
Trace; e086d731 <[cdrom]cdrom_ioctl+bf1/e20>
Trace; e086d709 <[cdrom]cdrom_ioctl+bc9/e20>
Trace; c0132fe6 <sys_write+96/d0>
Trace; c0109007 <system_call+33/38>
Code;  c0134336 <get_hash_table+66/90>
00000000 <_EIP>:
Code;  c0134336 <get_hash_table+66/90>   <=====
   0:   39 72 04                  cmp    %esi,0x4(%edx)   <=====
Code;  c0134339 <get_hash_table+69/90>
   3:   75 f5                     jne    fffffffa <_EIP+0xfffffffa> c0134330 <get_hash_table+60/90>
Code;  c013433b <get_hash_table+6b/90>
   5:   0f b7 42 08               movzwl 0x8(%edx),%eax
Code;  c013433f <get_hash_table+6f/90>
   9:   3b 44 24 20               cmp    0x20(%esp,1),%eax
Code;  c0134343 <get_hash_table+73/90>
   d:   75 eb                     jne    fffffffa <_EIP+0xfffffffa> c0134330 <get_hash_table+60/90>
Code;  c0134345 <get_hash_table+75/90>
   f:   66 39 7a 0c               cmp    %di,0xc(%edx)
Code;  c0134349 <get_hash_table+79/90>
  13:   75 00                     jne    15 <_EIP+0x15> c013434b <get_hash_table+7b/90>


2 warnings and 4 errors issued.  Results may not be reliable.
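
The register dump and the disassembly above agree on where the fault came from. A minimal sketch of that arithmetic in Python, using only values quoted in the oops; the helper names are made up for illustration, and the single-bit observation is one possible reading, not a diagnosis:

```python
MASK32 = 0xFFFFFFFF

def effective_address(base, displacement):
    """Address touched by an i386 operand of the form disp(%reg)."""
    return (base + displacement) & MASK32

def set_bits(value):
    """Positions of the bits that are set in a 32-bit value."""
    return [bit for bit in range(32) if (value >> bit) & 1]

# Values copied from the oops: edx from the register dump, and the
# displacement from the faulting instruction "cmp %esi,0x4(%edx)".
edx = 0x08000000
fault_address = effective_address(edx, 0x4)

print(f"{fault_address:08x}")  # prints 08000004
print(set_bits(edx))           # prints [27]
```

The computed address matches the "virtual address 08000004" line. edx has exactly one bit set, which would be consistent with a zero (NULL) pointer that picked up one stray bit, but the oops alone cannot prove that.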

Comment 1 Michal Jaegermann 2001-02-26 04:07:44 UTC

Created attachment 11081 [details]
log file leading to an oops

Comment 2 Michal Jaegermann 2001-02-26 15:35:12 UTC

I am not sure that tying this bug to vfat is correct.  It was observed
while attempting to copy between vfat file systems, but possibly only
because I cannot do much more with this minimal installation yet.  The
decoded oops quoted above seems to implicate the CD drive and an IDE channel
that was not even used during the operation in question.  The copy was from
/dev/hdg to /dev/hde, and the CD is /dev/hda.

Comment 3 Glen Foster 2001-02-26 23:54:45 UTC

This defect is considered MUST-FIX for Florence Gold release

Comment 4 Michael K. Johnson 2001-02-28 00:20:04 UTC

It may be vfat, vm, or interaction between them.
Could you see if this happens with the latest kernel from rawhide?
I think that's currently 2.4.1-0.1.14

Comment 5 Michal Jaegermann 2001-02-28 00:40:02 UTC

I know now that this is NOT associated with vfat, as I suggested from the
very beginning (again, see #29427 and #29472). I can reproduce similar trouble
with the 2.2.19pre14 and 2.4.2-ac5 kernels, and also when copying from one
ext2 file system to another.  I simply had the biggest block of files on a
vfat partition.

I cannot exclude broken hardware at this point.

Comment 6 Michal Jaegermann 2001-02-28 19:49:01 UTC

I think that I found what triggers (as opposed to the reason for) the described
behaviour.  In the 1005C Award BIOS there are two "advanced" options:
System Performance Setting [Optimal, Normal]
USB Legacy Support [Auto, Enabled, Disabled]
If the first one is set to "Normal" and the second one to "Disabled" then the
whole system becomes stable.  I copied around 1.2 GB of files from various file
systems to a directory on ext2 without any ill effects and successfully ran
'diff -r' between two directories of 475 MB each.  With any other combination
of these BIOS options one should expect spectacular blowups with corrupted
file systems and other nasty effects after the first oops.

It is difficult to know what "System Performance Setting" is currently set to,
as the BIOS always shows "Optimal" regardless of what was last saved.  But
system behaviour does depend on how it was set.  How "USB Legacy Support" comes
into the picture I cannot even imagine.

I did try with 2.2.19pre and 2.4 kernels and the picture does not change.
I still have to try more extensive tests, including a full installation,
(cf. other reports referenced above) but this looks like it.

No idea how to even start explaining all of that in installation instructions.
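
The copy-then-'diff -r' check described above can be approximated with a short script. This is only a sketch in Python, not what was actually run on this machine; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of one file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def tree_digests(root):
    """Map each regular file under root (by relative path) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks, device nodes, FIFOs and other special files.
            if os.path.isfile(full) and not os.path.islink(full):
                digests[os.path.relpath(full, root)] = file_digest(full)
    return digests

def compare_trees(source, copy):
    """Return (missing, extra, corrupted) relative paths, each sorted."""
    src, dst = tree_digests(source), tree_digests(copy)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    corrupted = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    return missing, extra, corrupted
```

Calling compare_trees() on the source and target directories after a large copy returns three empty lists when the copy is intact. Unlike 'diff -r' it ignores special files, so it would not notice lost device nodes.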

Comment 7 Michael K. Johnson 2001-02-28 20:46:10 UTC

Thanks for that information, that's useful.
This is with the promise controller discussed in #29427, right?

Comment 8 Michal Jaegermann 2001-02-28 21:02:23 UTC

Yes.  The same "box from hell" all over.  K7 Athlon on A7V Asus,  Award 1005C
BIOS, PDC20265 Promise IDE controller, NCR 53c810 SCSI controller (but at
this moment I doubt if the latter has anything to do with it).

Comment 9 Arjan van de Ven 2001-03-19 15:52:57 UTC

In the currently available kernels from rawhide, lots of corruption issues
with VIA chipsets in combination with Promise controllers are fixed.
Can you please try one of these kernels and verify that it actually fixes the
problem? (It basically fixes the BIOS settings you mentioned.)
(kernels 2.4.2-0.1.25 or later)

I will close this bug; if you can reproduce the problem please reopen it.

Comment 10 Michal Jaegermann 2001-03-19 18:42:06 UTC

I am afraid that I have bad news.  I tried the latest kernel from
rawhide, i.e. "2.4.2-0.1.29 #1 Thu Mar 15 20:34:20 EST 2001 i686".
After switching the BIOS to default factory settings (the board was updated
to the latest version, 1007, by now) a removal of close to 1 GB of data
with 'rm -rf' went without trouble.  But an attempt to copy some files
to a target went awry after 192 megabytes from an ext2 file system had
been copied.  I can see that this happened while copying /dev/ttyU* nodes,
as only 46 out of 288 were found later in the copy and /dev/ttyU136 ended up
as lost+found/#6171. :-) In case you wonder, previous blowups happened
when copying regular files (there are no special nodes if data are
coming from a vfat system) or directories.  Only the amount of data seems
to matter, and 192 megabytes is the record so far.  With previous kernels
this was happening regularly in the 130 - 140 megabyte range, so some
improvement can be claimed. :-)

The failed copy was followed by an oops during the attempted shutdown.
Luckily the sysrq key (I do have that turned on) still worked and it
was possible to remount all file systems read-only.

After that failure I switched BIOS back to "safe" settings and in
that form, and with the same kernel, I was able to copy around
1 GB of stuff from one disk to another without any incidents.

I attach my full log of errors from the last attempt.  I started to
wonder whether this has something to do with the "256 -> 255"
error/not-error which was discussed on the linux-kernel list very recently.

Note: I am afraid that this particular test machine is going away any
hour now.  I already kept it much longer than I really should
have.  Sigh!

Comment 11 Michal Jaegermann 2001-03-19 18:43:47 UTC

Created attachment 13024 [details]
fragment of log files with errors triggered by 'cp'

Comment 12 Arjan van de Ven 2001-03-19 18:47:01 UTC

We used 128 for the maxsector, not 256, so we're safe against that.
If you still have some time, I would appreciate the output of
"lspci -vxxx" for both the "safe" and the "failure" BIOS settings.

Thanks!
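
For background on the "256 -> 255" question: to my understanding, classic (28-bit) ATA read/write commands carry the sector count in an 8-bit register in which 0 stands for 256 sectors, so a 256-sector request is the one case that relies on that special encoding. The Python sketch below only illustrates the encoding; the function name is made up, and nothing here is taken from the kernel's IDE driver.

```python
def encode_sector_count(sectors):
    """Value placed in the 8-bit sector count register (28-bit ATA assumed)."""
    if not 1 <= sectors <= 256:
        raise ValueError("one 28-bit ATA command transfers 1..256 sectors")
    # 256 does not fit in 8 bits and wraps to 0; the drive must read 0 as 256.
    return sectors & 0xFF

for sectors in (128, 255, 256):
    print(sectors, "->", encode_sector_count(sectors))
```

With a limit of 128 sectors per request, as stated above, the wrapped value never appears on the wire.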

Comment 13 Michal Jaegermann 2001-03-19 19:36:04 UTC

It is not that bad. :-)  This box is still here but it will likely
go pretty soon.

The SCSI controller definitely dislikes 'lspci -vxxx' and reacts with

ncr53c810-0: SCSI parity error detected: SCR1=65 DBC=50000000 SSTAT1=f

and is unhappy on reboot.

Also, with "factory" settings in the BIOS I start collecting messages like
these:

usb-uhci.c: interrupt, status 31, frame# 506
usb-uhci.c: interrupt, status 31, frame# 1357
usb-uhci.c: interrupt, status 31, frame# 1534

This does not happen with "safe" settings, when "Legacy USB Support"
is turned off.

With "factory" BIOS settings I also had problems on shutdown.  The complaint
was that network file systems were busy, and the whole process got stuck
(this is the same kernel from rawhide, and there is no trouble of that sort
with the BIOS in the "safe" position).  If you want to tell me that this
hardware/firmware is junk I heartily agree.

Attached 'lspci.default' is for an output with BIOS in default and
'lspci.safe' is my "normal" stuff ("System Performance" is "Normal"
and "Legacy USB Support" is off).
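
Two 'lspci -vxxx' dumps such as these can be compared byte by byte rather than by eye. The following is a rough Python sketch that assumes the usual lspci hex-dump layout (a device line such as "00:00.0 Host bridge: ..." followed by lines such as "00: 06 11 ..."); the function names are hypothetical and the sample text in the demo is made up, not taken from the attachment.

```python
import re

# "00:00.0 Host bridge: ..." (optionally prefixed with a PCI domain)
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# "10: 08 00 00 d0 ..." (offset, then up to 16 config-space bytes)
HEX_RE = re.compile(r"^([0-9a-f]{2,3}):((?: [0-9a-f]{2}){1,16})\s*$")

def parse_config_space(text):
    """Return {device address: {config offset: byte value}}."""
    devices, current = {}, None
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = devices.setdefault(match.group(1), {})
            continue
        match = HEX_RE.match(line)
        if match and current is not None:
            base = int(match.group(1), 16)
            for index, byte in enumerate(match.group(2).split()):
                current[base + index] = int(byte, 16)
    return devices

def diff_config_space(first, second):
    """List (device, offset, first value, second value) for differing bytes."""
    changes = []
    for device in sorted(set(first) & set(second)):
        offsets = sorted(set(first[device]) | set(second[device]))
        for offset in offsets:
            a, b = first[device].get(offset), second[device].get(offset)
            if a != b:
                changes.append((device, offset, a, b))
    return changes

# Made-up sample data, for demonstration only.
sample_safe = "00:00.0 Host bridge: Example\n00: 06 11 05 03\n10: 08 00 00 d0\n"
sample_default = "00:00.0 Host bridge: Example\n00: 06 11 05 03\n10: 08 00 00 d8\n"
for device, offset, a, b in diff_config_space(
        parse_config_space(sample_safe), parse_config_space(sample_default)):
    print(f"{device} offset {offset:#04x}: {a:#04x} -> {b:#04x}")
```

Feeding it the contents of 'lspci.safe' and 'lspci.default' would list every configuration register that the BIOS programs differently between the two setups.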

Comment 14 Michal Jaegermann 2001-03-19 19:37:12 UTC

Created attachment 13043 [details]
an output from 'lspci -vxxx' with different BIOS settings

Comment 15 Arjan van de Ven 2001-03-29 22:14:57 UTC

We have found SO many problems with VIA chipsets that we decided to turn off
IDE DMA for those machines. It is nearly impossible to fix the corruption, as it
is a chipset/motherboard bug, and the workarounds are board- and BIOS-version
specific.
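
A rough sketch of how such a blanket policy could be expressed, in Python. This is not the actual kernel change, which is not shown in this report; the function names are hypothetical, and the sysfs path exists on current Linux systems but not on the 2.4 kernels discussed here. The one hard fact used is that 0x1106 is the PCI vendor ID of VIA Technologies.

```python
import glob

VIA_VENDOR_ID = 0x1106  # PCI vendor ID of VIA Technologies

def read_pci_vendor_ids(pattern="/sys/bus/pci/devices/*/vendor"):
    """Vendor IDs of all PCI devices visible through sysfs (modern kernels)."""
    vendor_ids = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            # Each file holds a value such as "0x1106".
            vendor_ids.append(int(handle.read().strip(), 16))
    return vendor_ids

def should_disable_ide_dma(vendor_ids):
    """Simplified policy: any VIA device on the bus means no IDE DMA."""
    return VIA_VENDOR_ID in vendor_ids

if __name__ == "__main__":
    print(should_disable_ide_dma(read_pci_vendor_ids()))
```

Matching on any VIA device is deliberately crude; a real policy would look at the specific bridge or IDE function rather than the vendor alone.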