Bug 113133

Summary:	Kernel interrupt problems removing ethernet cable
Product:	[Retired] Red Hat Linux
Component:	kernel
Version:	9
Hardware:	athlon
OS:	Linux
Status:	CLOSED WONTFIX
Severity:	high
Priority:	medium
Reporter:	David Yerger <davidy>
Assignee:	Arjan van de Ven <arjanv>
CC:	riel
Doc Type:	Bug Fix
Last Closed:	2004-09-30 15:41:46 UTC

Description David Yerger 2004-01-08 19:16:36 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5)
Gecko/20031007

Description of problem:

The system is fairly heavily loaded with expansion cards: a NetGear
tg3-based NIC, an Adaptec 29160 Ultra160 SCSI adapter, and a 3ware
Escalade 6xxx series IDE RAID card.  The motherboard is
an Asus A7V8x-x, with an Athlon XP 3000+.

Problems using new kernel (2.4.20-28.9):

The first time I pulled and reinserted the ethernet cable, I got a
kernel panic. Sorry, I didn't capture the screen. I think it blamed
scsi0 (my Adaptec card), and the message included:

   Aieee! killing interrupt handler

The second time, I got the following (written to /var/log/messages):


  Jan  7 23:00:41 cache kernel: tg3: eth0: Link is down.


but no "link up" message when reinserting, instead:

*************************************************
  Jan  7 23:03:24 cache kernel: scsi0: PCI error Interrupt at seqaddr = 0x9
  Jan  7 23:03:24 cache kernel: scsi0: Data Parity Error Detected during address or write data phase
  Jan  7 23:03:25 cache kernel: scsi0: PCI error Interrupt at seqaddr = 0x1a3
  Jan  7 23:03:25 cache kernel: scsi0: Data Parity Error Detected during address or write data phase
  Jan  7 23:03:28 cache kernel: scsi0: PCI error Interrupt at seqaddr = 0x9
  Jan  7 23:03:28 cache kernel: scsi0: Data Parity Error Detected during address or write data phase
********************************************************

when I removed/inserted the ethernet cable.
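
To see how the parity errors line up with the link-down event, the log excerpt above can be scanned with a short script. This is only a minimal sketch: the function name `summarize` is hypothetical, the year is assumed to be 2004 (syslog timestamps carry no year), and the sample text is the excerpt from this report with wrapped lines rejoined.

```python
import re
from datetime import datetime

# Sample copied from this report, with wrapped log lines rejoined.
SAMPLE = """\
Jan  7 23:00:41 cache kernel: tg3: eth0: Link is down.
Jan  7 23:03:24 cache kernel: scsi0: PCI error Interrupt at seqaddr = 0x9
Jan  7 23:03:24 cache kernel: scsi0: Data Parity Error Detected during address or write data phase
Jan  7 23:03:25 cache kernel: scsi0: PCI error Interrupt at seqaddr = 0x1a3
Jan  7 23:03:25 cache kernel: scsi0: Data Parity Error Detected during address or write data phase
Jan  7 23:03:28 cache kernel: scsi0: PCI error Interrupt at seqaddr = 0x9
Jan  7 23:03:28 cache kernel: scsi0: Data Parity Error Detected during address or write data phase
"""

# Matches "<Mon> <day> <HH:MM:SS> <host> kernel: <message>".
KERNEL_LINE = re.compile(r"^\s*(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (.*)$")


def summarize(log_text, year=2004):
    """Count parity-error messages and measure the delay after link-down.

    `year` is an assumption, since syslog timestamps do not include one.
    """
    link_down = None
    parity_times = []
    for line in log_text.splitlines():
        m = KERNEL_LINE.match(line)
        if not m:
            continue  # not a kernel message in the expected format
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        msg = m.group(2)
        if "Link is down" in msg:
            link_down = stamp
        elif "Data Parity Error Detected" in msg:
            parity_times.append(stamp)
    delay = None
    if link_down is not None and parity_times:
        delay = int((parity_times[0] - link_down).total_seconds())
    return {"parity_errors": len(parity_times), "seconds_after_link_down": delay}


print(summarize(SAMPLE))
```

The same function could be pointed at the contents of /var/log/messages, provided the lines there are not wrapped.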

After reverting to 2.4.20-20.9-XFS (patched by SGI), the problem did not occur.



Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Boot with 2.4.20-28.9
2. Pull out ethernet cable
3. Put back in
    

Actual Results:  Kernel panic or PCI IRQ errors logged to console

Expected Results:  Link down, then link up.

Additional info:

Oddly, I didn't see any IRQ conflicts when the system was misbehaving.
I believe the IRQ assignment was the same then as it is now:

           CPU0
  0:    5446802          XT-PIC  timer
  1:        358          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  4:    4347084          XT-PIC  eth0
  8:          1          XT-PIC  rtc
 10:     217430          XT-PIC  3ware Storage Controller
 11:     821293          XT-PIC  aic7xxx
 14:          2          XT-PIC  ide0
 15:      30685          XT-PIC  ide1
NMI:          0
ERR:          0
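
The "no IRQ conflicts" reading of this table can be checked mechanically. The following is a minimal sketch, assuming the usual /proc/interrupts layout (one count column per CPU, then the controller name, then a comma-separated device list); `parse_interrupts` and `shared_irqs` are hypothetical helper names, and the sample is the table from this report.

```python
# Table copied from this report (single-CPU /proc/interrupts output).
SAMPLE = """\
           CPU0
  0:    5446802          XT-PIC  timer
  1:        358          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  4:    4347084          XT-PIC  eth0
  8:          1          XT-PIC  rtc
 10:     217430          XT-PIC  3ware Storage Controller
 11:     821293          XT-PIC  aic7xxx
 14:          2          XT-PIC  ide0
 15:      30685          XT-PIC  ide1
NMI:          0
ERR:          0
"""


def parse_interrupts(text):
    """Return {irq_number: [device names]} for the numeric IRQ rows."""
    lines = text.splitlines()
    header = lines[0].split() if lines else []
    # One count column per CPU named in the header; assume 1 if none found.
    ncpu = sum(tok.startswith("CPU") for tok in header) or 1
    table = {}
    for line in lines:
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # skips the CPU header and the NMI/ERR rows
        # Split off the per-CPU counts and the controller name; the
        # remainder is the device list, which may contain spaces.
        fields = rest.split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue  # row without a device name
        devices = [d.strip() for d in fields[ncpu + 1].split(",") if d.strip()]
        table[int(label)] = devices
    return table


def shared_irqs(table):
    """Return only the IRQ lines that list more than one device."""
    return {irq: devs for irq, devs in table.items() if len(devs) > 1}


print(shared_irqs(parse_interrupts(SAMPLE)))
```

For the table above no IRQ line lists more than one device, which agrees with the observation that eth0 (IRQ 4), the 3ware controller (IRQ 10) and aic7xxx (IRQ 11) each have their own line.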

Comment 1 David Yerger 2004-02-19 01:31:17 UTC

Same kind of problem with the kernel released today (2.4.20-30.9).

I'm pasting in sections from /var/log/messages, since the timestamps
might be useful.

Pulling and reinserting the ethernet cable in runlevel 1 was OK.
In runlevel 3, I got:

Feb 18 19:34:56 cache kernel: scsi0: PCI error Interrupt at seqaddr = 0x8
Feb 18 19:34:56 cache kernel: scsi0: Data Parity Error Detected during address or write data phase
Feb 18 19:34:56 cache kernel: scsi0: PCI error Interrupt at seqaddr = 0x9
Feb 18 19:34:56 cache kernel: scsi0: Data Parity Error Detected during address or write data phase
Feb 18 19:37:26 cache ntpd[1715]: kernel time discipline status change 41
Feb 18 19:38:30 cache ntpd[1715]: kernel time discipline status change 1
Feb 18 19:40:00 cache CROND[2114]: (root) CMD (/usr/lib/sa/sa1 1 1)


Then I shut down my database, and as it was releasing locks I got:


Feb 18 19:40:18 cache kernel: swap_free: Bad swap file entry 00010052
Feb 18 19:40:18 cache kernel: swap_free: Unused swap offset entry 00c10000
Feb 18 19:40:18 cache kernel: swap_free: Unused swap offset entry 00010000
Feb 18 19:40:18 cache last message repeated 3 times


Then, after the database (InterSystems Caché 5.0.4) was all the way down,
I got an Oops:

Feb 18 19:40:38 cache kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000074
Feb 18 19:40:38 cache kernel:  printing eip:
Feb 18 19:40:38 cache kernel: c0142000
Feb 18 19:40:38 cache kernel: *pde = 00000000
Feb 18 19:40:38 cache kernel: Oops: 0000
Feb 18 19:40:38 cache kernel: i2c-isa it87 i2c-proc i2c-core tg3 reiserfs ext3 jbd raid1 3w-xxxx aic7xxx sd_mod scsi_mod
Feb 18 19:40:38 cache kernel: CPU:    0
Feb 18 19:40:38 cache kernel: EIP:    0060:[<c0142000>]    Not tainted
Feb 18 19:40:38 cache kernel: EFLAGS: 00010202
Feb 18 19:40:38 cache kernel:
Feb 18 19:40:38 cache kernel: EIP is at page_referenced [kernel] 0x210 (2.4.20-30.9)
Feb 18 19:40:38 cache kernel: eax: c1825ea8   ebx: 0000001c   ecx: 00000000   edx: 00000001
Feb 18 19:40:38 cache kernel: esi: 0000000e   edi: ee5d7840   ebp: 00000000   esp: c46cdf84
Feb 18 19:40:38 cache kernel: ds: 0068   es: 0068   ss: 0068
Feb 18 19:40:38 cache kernel: Process kscand/HighMem (pid: 8, stackpage=c46cd000)
Feb 18 19:40:38 cache kernel: Stack: c46cdfa0 00000000 00000000 c46cdfb4 c1f01ac0 c1f01ac0 c030a0f4 c1f01b14
Feb 18 19:40:38 cache kernel:        00000000 c013a984 c46cc000 c01254e0 00000001 00000000 c46cc000 c0309f80
Feb 18 19:40:38 cache kernel:        c46cc000 c013bb34 c0309f80 00000000 00000000 c025b760 000009c4 c013ba40
Feb 18 19:40:38 cache kernel: Call Trace:   [<c013a984>] scan_active_list [kernel] 0x34 (0xc46cdfa8))
Feb 18 19:40:38 cache kernel: [<c01254e0>] process_timeout [kernel] 0x0 (0xc46cdfb0))
Feb 18 19:40:38 cache kernel: [<c013bb34>] kscand [kernel] 0xf4 (0xc46cdfc8))
Feb 18 19:40:39 cache kernel: [<c013ba40>] kscand [kernel] 0x0 (0xc46cdfe0))
Feb 18 19:40:39 cache kernel: [<c010727d>] kernel_thread_helper [kernel] 0x5 (0xc46cdff0))
Feb 18 19:40:39 cache kernel:
Feb 18 19:40:39 cache kernel:
Feb 18 19:40:39 cache kernel: Code: 8b 41 74 39 41 60 0f 43 54 24 04 45 4e 89 54 24 04 0f 89 3e
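
As a consistency check on the Oops, the first three bytes of the Code: line (8b 41 74) can be decoded by hand. On i386, 8b is MOV r32, r/m32, and the ModRM byte 41 selects eax as destination and [ecx + disp8] as source, so the instruction reads from ecx + 0x74. With ecx = 00000000 that gives the reported fault address 00000074. The sketch below does this decoding for that one instruction form only; `decode_mov_load` is a hypothetical helper, not a general disassembler, and it assumes the Code: bytes start at EIP.

```python
# 32-bit general-purpose registers in x86 encoding order.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_load(code):
    """Decode `8b /r` (MOV r32, r/m32) in its [reg + disp8] form.

    Returns (destination register, base register, displacement).
    Only this single addressing form is handled.
    """
    if len(code) < 3 or code[0] != 0x8B:
        raise ValueError("not a MOV r32, r/m32 instruction")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only the [reg + disp8] form without SIB is handled")
    # disp8 is a signed 8-bit displacement.
    disp = code[2] - 256 if code[2] >= 0x80 else code[2]
    return REG32[reg], REG32[rm], disp


# Register values taken from the Oops in this report.
regs = {"eax": 0xC1825EA8, "ebx": 0x0000001C, "ecx": 0x00000000, "edx": 0x00000001}

dest, base, disp = decode_mov_load(bytes.fromhex("8b4174"))
fault_address = (regs[base] + disp) & 0xFFFFFFFF
print(f"mov {disp:#x}(%{base}),%{dest} reads from {fault_address:08x}")
```

This only shows that the fault was a read through a NULL pointer held in ecx inside page_referenced; it says nothing about why that pointer was NULL.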


Then, while I was writing the above (not realizing it was being
written to disk), I got:


Feb 18 20:03:53 cache kernel:  <6>NETDEV WATCHDOG: eth0: transmit timed out
Feb 18 20:03:53 cache kernel: tg3: eth0: transmit timed out, resetting
Feb 18 20:03:53 cache kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Feb 18 20:03:53 cache kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Feb 18 20:03:53 cache kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Feb 18 20:03:53 cache kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2

I tried the same thing with 2.4.20-20.9-XFS and had no problem.

Comment 2 David Yerger 2004-02-19 01:36:39 UTC

I should probably mention that, while running 2.4.20-20.9-XFS, I ran

# up2date --force kernel kernel-source
# reboot

The system hung on "Turning Off Swap", so after rebooting
(under 2.4.20-30.9) I ran

# swapoff -a
# mkswap /dev/md3
# swapon -a

Comment 3 David Yerger 2004-02-22 23:39:07 UTC

I tried vanilla 2.4.25, which also seems OK.

Question:  I noticed the following in
/usr/src/linux-2.4.25/drivers/scsi/aic7xxx/README-aic7xxx:

              Option: pci_parity
          Definition: Toggles the detection of PCI parity errors.
                      On many motherboards with VIA chipsets,
                      PCI parity is not generated correctly on the
                      PCI bus.  It is impossible for the hardware to
                      differentiate between these "spurious" parity
                      errors and real parity errors.  The symptom of
                      this problem is a stream of the message:
    "scsi0: Data Parity Error Detected during address or write data phase"
                      output by the driver.
     Possible Values: This option is a toggle
       Default Value: PCI Parity Error reporting is disabled

Is this default (PCI parity error reporting disabled) in effect in
recent Red Hat kernels?  I *do* have a VIA chipset (KT400).

Could this lead to the oops?

Comment 4 Bugzilla owner 2004-09-30 15:41:46 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/