Bug 33590 - kernel panic (read/write lock error) occurred under high network workload
Status: CLOSED RAWHIDE
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: ia64 Linux
Priority: high  Severity: high
Assigned To: Ingo Molnar
QA Contact: Brock Organ
Reported: 2001-03-28 06:15 EST by Bill Huang
Modified: 2005-10-31 17:00 EST
Doc Type: Bug Fix
Last Closed: 2001-04-16 05:13:36 EDT

Attachments
patch_ipi_tlb_resend_do_nothing_k240.diff (1.17 KB, patch)
2001-03-28 06:17 EST, Bill Huang
patch_ide_scan-k241.diff (310 bytes, patch)
2001-03-28 06:18 EST, Bill Huang
br_writelock.diff (731 bytes, patch)
2001-03-28 06:19 EST, Bill Huang
more info:panic_msg (995 bytes, patch)
2001-03-28 22:59 EST, Bill Huang
more info:machine information (4.41 KB, patch)
2001-03-28 23:00 EST, Bill Huang

Description Bill Huang 2001-03-28 06:15:53 EST
(originally reported by Hitachi)
Test environment: ia64 8-way server CF-1
While testing a high network workload on the ia64 server, they hit a
read/write lock error. (I have asked them for details of the test procedure.)

1. Using ITP to investigate the error, they found problems in the function
net_rx_action():
 (1) The second (next) address in the socket list is wrong.
    They think the fault occurred at this line:
     ...
      if (!ptype->dev || ptype->dev == skb->dev)
     ...
 
 (2) The kernel panic occurred while net_rx_action() was running:
     1) On CPU0, the elements of the socket list in net_rx_action() are
checked one by one.
     2) On CPU1, a socket release is started.
     3) The addresses released on CPU1 are reused to allocate buffers for
sockets.
     4) On CPU0, the one-by-one check of the socket list in net_rx_action()
continues; because it now uses a wrong address, the kernel panics.
  
   During socket operations, the following code paths are presumed to use
read/write lock control:
     - walking the elements of the socket list in net_rx_action()
     - releasing the socket in dev_remove_pack()
2. Based on the investigation, the following patches are considered useful:

(1) The interval for resending the IPI (IPI_FLUSH_TLB) is set to 10 times
the original value, and a received resend message is ignored (invalidated).
(2) Changed the number of IDE scans from 10 to 2
    patch:patch_ide_scan-k241.diff
(3) Fixed the brlock bug in kernel 2.4.x
    patch:br_writelock.diff
(4) Updated the driver versions:
    aic7xxx:  6.0.8 BETA  SCSI (adaptec)
    DAC960:   2.4.10      RAID (Mylex)
    e100:     1.5.0       NIC  (Intel)
    e1000:    3.0.1       GigaNIC(Intel)
    qla2x00:  4.15 beta   FC    (QLogic)

Useful URLs:
http://people.freebsd.org/~gibbs/linux
http://support.intel.com/support/network/adaptec/pro100/100Linux.htm
Comment 1 Bill Huang 2001-03-28 06:17:11 EST
Created attachment 13924 [details]
patch_ipi_tlb_resend_do_nothing_k240.diff
Comment 2 Bill Huang 2001-03-28 06:18:10 EST
Created attachment 13925 [details]
patch_ide_scan-k241.diff
Comment 3 Bill Huang 2001-03-28 06:19:53 EST
Created attachment 13926 [details]
br_writelock.diff
Comment 4 Arjan van de Ven 2001-03-28 11:18:08 EST
Dave: is this something we should worry about?
Comment 5 Michael K. Johnson 2001-03-28 18:10:06 EST
David thinks this is probably legit, but wants Ingo's feedback.
Comment 6 Bill Huang 2001-03-28 22:59:46 EST
Created attachment 14064 [details]
more info:panic_msg
Comment 7 Bill Huang 2001-03-28 23:00:53 EST
Created attachment 14065 [details]
more info:machine information
Comment 8 Ingo Molnar 2001-03-29 02:10:40 EST
The patch is 100% legit. (It's perhaps only because the write path
is so rarely used that we didn't see any problems with this code
earlier.) The bug does not corrupt memory, it's only the write-locking
semantics that were violated.
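
As background to the write-locking semantics mentioned here: a big-reader
lock (brlock) lets each reader take only its own CPU's lock, while a writer
must hold every CPU's lock before it modifies shared data. The sketch below
illustrates that idea in user-space C with POSIX mutexes. It is not the
kernel's brlock code and does not reproduce br_writelock.diff; NR_SLOTS and
all sketch_* names are assumptions made for this example.

```c
#include <assert.h>
#include <pthread.h>

#define NR_SLOTS 8   /* assumed: one slot per CPU of an 8-way machine */

struct sketch_brlock {
    pthread_mutex_t slot[NR_SLOTS];
};

void sketch_br_init(struct sketch_brlock *l)
{
    for (int i = 0; i < NR_SLOTS; i++)
        pthread_mutex_init(&l->slot[i], NULL);
}

/* Reader: touches only the slot that belongs to the calling CPU. */
void sketch_br_read_lock(struct sketch_brlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    pthread_mutex_lock(&l->slot[cpu]);
}

void sketch_br_read_unlock(struct sketch_brlock *l, int cpu)
{
    pthread_mutex_unlock(&l->slot[cpu]);
}

/*
 * Writer: exclusive only once every slot is held. Slots are taken in
 * ascending order so that two writers cannot deadlock each other.
 */
void sketch_br_write_lock(struct sketch_brlock *l)
{
    for (int i = 0; i < NR_SLOTS; i++)
        pthread_mutex_lock(&l->slot[i]);
}

void sketch_br_write_unlock(struct sketch_brlock *l)
{
    for (int i = NR_SLOTS - 1; i >= 0; i--)
        pthread_mutex_unlock(&l->slot[i]);
}

/*
 * Non-blocking writer: returns 1 if all slots were taken, or 0 (after
 * releasing any slots already taken) if a reader or writer is active.
 */
int sketch_br_write_trylock(struct sketch_brlock *l)
{
    for (int i = 0; i < NR_SLOTS; i++) {
        if (pthread_mutex_trylock(&l->slot[i]) != 0) {
            while (--i >= 0)
                pthread_mutex_unlock(&l->slot[i]);
            return 0;
        }
    }
    return 1;
}
```

The design trades a cheap, uncontended read path for an expensive write
path, which fits a lock whose write side is rarely taken.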
Comment 9 Bill Huang 2001-03-29 02:46:14 EST
I have asked Hitachi to send us the test tools they used to generate the
high network workload.
Comment 10 Ingo Molnar 2001-04-16 05:13:31 EDT
this patch should be in the current CVS tree, please test.
Comment 11 Bill Nottingham 2001-05-29 14:00:52 EDT
closing as resolved.