Bug 33590 - kernel panic (read/write lock error) occurred under high network workload
Status: CLOSED RAWHIDE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: ia64 Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ingo Molnar
QA Contact: Brock Organ
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2001-03-28 11:15 UTC by Bill Huang
Modified: 2005-10-31 22:00 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2001-04-16 09:13:36 UTC


Attachments
patch_ipi_tlb_resend_do_nothing_k240.diff (1.17 KB, patch)
2001-03-28 11:17 UTC, Bill Huang
patch_ide_scan-k241.diff (310 bytes, patch)
2001-03-28 11:18 UTC, Bill Huang
br_writelock.diff (731 bytes, patch)
2001-03-28 11:19 UTC, Bill Huang
more info:panic_msg (995 bytes, patch)
2001-03-29 03:59 UTC, Bill Huang
more info:machine information (4.41 KB, patch)
2001-03-29 04:00 UTC, Bill Huang

Description Bill Huang 2001-03-28 11:15:53 UTC
(originally reported by Hitachi)
Test environment: ia64 8-way server CF-1
When they tested a high network workload on the ia64 server (I have asked them
for details of the test procedure), they found a read/write lock error.

1. Using ITP to investigate the error, they found errors in the function
net_rx_action():
 (1) The second (next) address in the socket list is wrong.
     They think the line where the fault occurred is
      ...
       if (!ptype->dev || ptype->dev == skb->dev)
      ...

 (2) The kernel panic occurred while net_rx_action() was running:
     1) On CPU0, the elements of the socket list in net_rx_action() are
        checked one by one.
     2) On CPU1, a socket release is started.
     3) The addresses released on CPU1 are reused to allocate a buffer for a
        socket.
     4) On CPU0, the walk of the socket list in net_rx_action() continues;
        because it now uses the wrong address, the kernel panics.

   During socket operations, the following code paths are presumed to be
   controlled by the read/write lock:
     - checking the elements of the socket list in net_rx_action()
     - releasing the socket in dev_remove_pack()
2. Based on the investigation, the following patches are considered useful:

(1) The interval for re-sending the IPI (IPI_FLUSH_TLB) is set to 10 times
    the original, and the received re-send message is invalidated.
(2) Changed the number of IDE scans from 10 to 2
    patch: patch_ide_scan-k241.diff
(3) Fixed the brlock bug in kernel-2.4-x
    patch: br_writelock.diff
(4) Updated the driver versions:
    aic7xxx:  6.0.8 BETA  SCSI    (Adaptec)
    DAC960:   2.4.10      RAID    (Mylex)
    e100:     1.5.0       NIC     (Intel)
    e1000:    3.0.1       GigaNIC (Intel)
    qla2x00:  4.15 beta   FC      (QLogic)

Useful URLs:
http://people.freebsd.org/~gibbs/linux
http://support.intel.com/support/network/adaptec/pro100/100Linux.htm
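
Editor's illustration (not part of the original report, and not the contents of br_writelock.diff): the race in item 1.(2) can be replayed with a small single-threaded toy model of a big-reader lock, in which each CPU has its own reader count and a writer must see zero readers on every CPU before it may unlink and free a list entry. All names here (toy_brlock, handler, handler_remove, replay_interleaving) are hypothetical, and the try-lock functions return failure where a real kernel lock would spin; this is a sketch under those assumptions only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NCPU 2  /* illustrative only; the reported machine was 8-way */

/* Toy big-reader lock: one reader count per CPU plus a writer flag. */
struct toy_brlock {
    int  readers[NCPU];
    bool writer;
};

/* Hypothetical list entry standing in for the list the report describes. */
struct handler {
    int dev_id;
    struct handler *next;
};

bool toy_read_trylock(struct toy_brlock *l, int cpu)
{
    if (l->writer)
        return false;              /* a writer currently owns the list */
    l->readers[cpu]++;
    return true;
}

void toy_read_unlock(struct toy_brlock *l, int cpu)
{
    assert(l->readers[cpu] > 0);
    l->readers[cpu]--;
}

/* The writer must find zero readers on every CPU, not only on its own. */
bool toy_write_trylock(struct toy_brlock *l)
{
    if (l->writer)
        return false;
    for (int cpu = 0; cpu < NCPU; cpu++)
        if (l->readers[cpu] > 0)
            return false;
    l->writer = true;
    return true;
}

void toy_write_unlock(struct toy_brlock *l)
{
    assert(l->writer);
    l->writer = false;
}

/* Push a new entry at the head of the list (setup only, no locking). */
bool handler_add(struct handler **head, int dev_id)
{
    struct handler *h = malloc(sizeof(*h));
    if (!h)
        return false;
    h->dev_id = dev_id;
    h->next = *head;
    *head = h;
    return true;
}

/* Returns -1 if a reader blocks the write lock, 1 if removed, 0 if not found. */
int handler_remove(struct toy_brlock *l, struct handler **head, int dev_id)
{
    int removed = 0;
    if (!toy_write_trylock(l))
        return -1;
    for (struct handler **pp = head; *pp; pp = &(*pp)->next) {
        if ((*pp)->dev_id == dev_id) {
            struct handler *victim = *pp;
            *pp = victim->next;    /* unlink before freeing */
            free(victim);
            removed = 1;
            break;
        }
    }
    toy_write_unlock(l);
    return removed;
}

/* Replays steps 1)-4); returns true if the release was refused while CPU0
 * held the read lock and succeeded after CPU0 dropped it. */
bool replay_interleaving(void)
{
    struct toy_brlock lock = { {0}, false };
    struct handler *head = NULL;
    bool ok = handler_add(&head, 1) && handler_add(&head, 2);

    ok = ok && toy_read_trylock(&lock, 0);                    /* 1) CPU0 starts walking */
    bool refused = ok && handler_remove(&lock, &head, 1) == -1; /* 2)-3) CPU1 must wait */
    if (ok)
        toy_read_unlock(&lock, 0);                            /* 4) CPU0 finishes safely */
    bool removed = ok && handler_remove(&lock, &head, 1) == 1;

    while (head) {                                            /* free what is left */
        struct handler *next = head->next;
        free(head);
        head = next;
    }
    return refused && removed;
}
```

In this model an entry can never be freed while any CPU is still walking the list, which is the guarantee that steps 2) to 4) of the report show being broken.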

Comment 1 Bill Huang 2001-03-28 11:17:11 UTC
Created attachment 13924
patch_ipi_tlb_resend_do_nothing_k240.diff

Comment 2 Bill Huang 2001-03-28 11:18:10 UTC
Created attachment 13925
patch_ide_scan-k241.diff

Comment 3 Bill Huang 2001-03-28 11:19:53 UTC
Created attachment 13926
br_writelock.diff

Comment 4 Arjan van de Ven 2001-03-28 16:18:08 UTC
Dave: is this something we should worry about?

Comment 5 Michael K. Johnson 2001-03-28 23:10:06 UTC
David thinks this is probably legit, but wants Ingo's feedback.

Comment 6 Bill Huang 2001-03-29 03:59:46 UTC
Created attachment 14064
more info:panic_msg

Comment 7 Bill Huang 2001-03-29 04:00:53 UTC
Created attachment 14065
more info:machine information

Comment 8 Ingo Molnar 2001-03-29 07:10:40 UTC
The patch is 100% legit. (It's perhaps only because the write path
is so rarely used that we didn't see any problems with this code
earlier.) The bug does not corrupt memory, it's only the write-locking
semantics that were violated.

Comment 9 Bill Huang 2001-03-29 07:46:14 UTC
I have asked Hitachi to send us the test tools they used to generate the high
network workload...

Comment 10 Ingo Molnar 2001-04-16 09:13:31 UTC
This patch should be in the current CVS tree; please test.

Comment 11 Bill Nottingham 2001-05-29 18:00:52 UTC
Closing as resolved.

