Bug 33590 - kernel panic (read/write lock error) occurred under high network workload

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	ia64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Ingo Molnar
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-03-28 11:15 UTC by Bill Huang
Modified:	2005-10-31 22:00 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2001-04-16 09:13:36 UTC
Embargoed:

Attachments
patch_ipi_tlb_resend_do_nothing_k240.diff (1.17 KB, patch) 2001-03-28 11:17 UTC, Bill Huang
patch_ide_scan-k241.diff (310 bytes, patch) 2001-03-28 11:18 UTC, Bill Huang
br_writelock.diff (731 bytes, patch) 2001-03-28 11:19 UTC, Bill Huang
more info:panic_msg (995 bytes, patch) 2001-03-29 03:59 UTC, Bill Huang
more info:machine information (4.41 KB, patch) 2001-03-29 04:00 UTC, Bill Huang

Description Bill Huang 2001-03-28 11:15:53 UTC

(originally reported by Hitachi)
Test environment: ia64 8-way server CF-1
When they tested a high network workload on the ia64 server (I have asked them
for details of the test procedure), they found a read/write lock error.

1. Using ITP to investigate the error, they found errors in the function
net_rx_action():
 (1) The second (next) address in the socket list is wrong.
     They think the line where the fault occurred is
      ...
       if (!ptype->dev || ptype->dev == skb->dev)
      ...

 (2) A kernel panic occurred while net_rx_action() was running:
     1) On CPU0, the elements of the socket list in net_rx_action()
are checked one by one.
     2) On CPU1, a socket release is started.
     3) The addresses released on CPU1 are reused to allocate a buffer for a
socket.
     4) On CPU0, as the socket list in net_rx_action() is still being checked
one by one, the wrong address is used and the panic occurs.

   In socket handling, the following code paths are presumed to use
read/write lock control:
     - walking the elements of the socket list in net_rx_action()
     - releasing the socket in dev_remove_pack()
2. Based on the investigation, the following patches are considered useful:

(1) The interval for resending the IPI (IPI_FLUSH_TLB) is set to 10 times
the original,
    and handling of the resend message on the receiving side is invalidated.
(2) Changed the number of IDE scans from 10 to 2
    patch: patch_ide_scan-k241.diff
(3) Fixed the brlock bug in kernel-2.4-x
    patch: br_writelock.diff
(4) Updated the following driver versions:
    aic7xxx:  6.0.8 BETA  SCSI    (Adaptec)
    DAC960:   2.4.10      RAID    (Mylex)
    e100:     1.5.0       NIC     (Intel)
    e1000:    3.0.1       GigaNIC (Intel)
    qla2x00:  4.15 beta   FC      (QLogic)

Useful URLs are listed below:
http://people.freebsd.org/~gibbs/linux
http://support.intel.com/support/network/adaptec/pro100/100Linux.htm

Comment 1 Bill Huang 2001-03-28 11:17:11 UTC

Created attachment 13924
patch_ipi_tlb_resend_do_nothing_k240.diff

Comment 2 Bill Huang 2001-03-28 11:18:10 UTC

Created attachment 13925
patch_ide_scan-k241.diff

Comment 3 Bill Huang 2001-03-28 11:19:53 UTC

Created attachment 13926
br_writelock.diff

Comment 4 Arjan van de Ven 2001-03-28 16:18:08 UTC

Dave: is this something we should worry about?

Comment 5 Michael K. Johnson 2001-03-28 23:10:06 UTC

David thinks this is probably legit, but wants Ingo's feedback.

Comment 6 Bill Huang 2001-03-29 03:59:46 UTC

Created attachment 14064
more info:panic_msg

Comment 7 Bill Huang 2001-03-29 04:00:53 UTC

Created attachment 14065
more info:machine information

Comment 8 Ingo Molnar 2001-03-29 07:10:40 UTC

The patch is 100% legit. (It's perhaps only because the write path
is so rarely used that we didn't see any problems with this code
earlier.) The bug does not corrupt memory, it's only the write-locking
semantics that were violated.
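
For context, a big-reader lock trades a cheap read path for an expensive write path: each CPU takes only its own slot to read, while a writer must take every slot before it may modify the protected data. A minimal user-space sketch of that idea follows, assuming at most one reader per slot at a time. The names struct brlock_sketch, the br_* functions and NCPUS are illustrative, pthread mutexes stand in for the kernel primitives, and this is not the content of br_writelock.diff.

```c
#include <assert.h>
#include <pthread.h>

#define NCPUS 8 /* assumption: one slot per CPU of the 8-way test machine */

struct brlock_sketch {
    pthread_mutex_t slot[NCPUS];
};

void br_init(struct brlock_sketch *l)
{
    for (int i = 0; i < NCPUS; i++)
        pthread_mutex_init(&l->slot[i], NULL);
}

/* Read path: touch only the caller's own slot. */
void br_read_lock(struct brlock_sketch *l, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    pthread_mutex_lock(&l->slot[cpu]);
}

void br_read_unlock(struct brlock_sketch *l, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    pthread_mutex_unlock(&l->slot[cpu]);
}

/* Write path: take every slot, always in ascending order. */
void br_write_lock(struct brlock_sketch *l)
{
    for (int i = 0; i < NCPUS; i++)
        pthread_mutex_lock(&l->slot[i]);
}

/* Non-blocking variant: returns 1 on success, or 0 (after releasing the
 * slots already taken) if any reader or writer holds a slot. */
int br_write_trylock(struct brlock_sketch *l)
{
    for (int i = 0; i < NCPUS; i++) {
        if (pthread_mutex_trylock(&l->slot[i]) != 0) {
            while (--i >= 0)
                pthread_mutex_unlock(&l->slot[i]);
            return 0;
        }
    }
    return 1;
}

void br_write_unlock(struct brlock_sketch *l)
{
    for (int i = NCPUS - 1; i >= 0; i--)
        pthread_mutex_unlock(&l->slot[i]);
}
```

The write-locking guarantee is simply that br_write_lock() does not return while any reader still holds its slot. A write path that can return early breaks that guarantee without touching memory itself.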

Comment 9 Bill Huang 2001-03-29 07:46:14 UTC

I have asked Hitachi to send us the test tools they used to generate the high
network workload...

Comment 10 Ingo Molnar 2001-04-16 09:13:31 UTC

this patch should be in the current CVS tree, please test.

Comment 11 Bill Nottingham 2001-05-29 18:00:52 UTC

closing as resolved.
