(originally reported by Hitachi)

Test environment: ia64 8-way server, CF-1

While testing a high network workload on the ia64 server (I have asked them for details of the test procedure), they hit a read/write lock error.

1. Using ITP to investigate the error, they found problems in the function net_rx_action():

(1) The second (next) address in the socket list is wrong. They think the line where the fault occurred is:
    ...
    if (!ptype->dev || ptype->dev == skb->dev)
    ...

(2) A kernel panic occurred while net_rx_action() was running:
    1) On CPU0, the elements of the socket list in net_rx_action() are checked one by one.
    2) On CPU1, a socket release is started.
    3) The addresses released on CPU1 are reused to allocate buffers for a socket.
    4) On CPU0, while the socket list in net_rx_action() is still being checked one by one, the wrong address is used and the kernel panics.

In socket handling, the following code paths are presumed to use the read/write lock:
- checking the elements of the socket list in net_rx_action()
- releasing the socket in dev_remove_pack()

2. Based on the investigation, the following patches are considered useful:

(1) The IPI (IPI_FLUSH_TLB) resend interval is set to 10 times the original, and receipt of the resend message is invalidated.
(2) Changed the number of IDE scans from 10 to 2.
    patch: patch_ide_scan-k241.diff
(3) Fixed the brlock bug in kernel-2.4-x.
    patch: br_writelock.diff
(4) Updated the driver versions:
    aic7xxx: 6.0.8 BETA  SCSI (Adaptec)
    DAC960:  2.4.10      RAID (Mylex)
    e100:    1.5.0       NIC (Intel)
    e1000:   3.0.1       GigaNIC (Intel)
    qla2x00: 4.15 beta   FC (QLogic)

Useful URLs:
http://people.freebsd.org/~gibbs/linux
http://support.intel.com/support/network/adaptec/pro100/100Linux.htm
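For readers following the report, here is a rough user-space sketch of the kind of list walk the quoted condition belongs to. The struct definitions are simplified stand-ins, not the kernel's, and count_matching_handlers() is a hypothetical helper that exists only for this illustration. The point is that every step of the loop dereferences ptype->next, so if another CPU unlinks and frees an entry while the walker is not excluded, the walker can follow a stale pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the fields used here. */
struct net_device { const char *name; };
struct sk_buff { struct net_device *dev; };
struct packet_type {
    struct net_device *dev;      /* NULL means "match any device" */
    struct packet_type *next;    /* next handler in the list */
};

/* Hypothetical helper: walk the handler list and count the entries that
 * would be handed this skb, using the condition quoted in the report. */
int count_matching_handlers(const struct packet_type *head,
                            const struct sk_buff *skb)
{
    const struct packet_type *ptype;
    int matches = 0;

    assert(skb != NULL);
    for (ptype = head; ptype != NULL; ptype = ptype->next) {
        /* Each iteration reads ptype->dev and ptype->next; both reads are
         * only safe while nobody frees the entry underneath us. */
        if (!ptype->dev || ptype->dev == skb->dev)
            matches++;
    }
    return matches;
}
```

In the kernel the walker and the remover must agree on one lock for this to be safe; the sketch does no locking at all and only shows which reads are exposed.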
Created attachment 13924 [details] patch_ipi_tlb_resend_do_nothing_k240.diff
Created attachment 13925 [details] patch_ide_scan-k241.diff
Created attachment 13926 [details] br_writelock.diff
Dave: is this something we should worry about?
David thinks this is probably legit, but wants Ingo's feedback.
Created attachment 14064 [details] more info:panic_msg
Created attachment 14065 [details] more info:machine information
The patch is 100% legit. (It's perhaps only because the write path is so rarely used that we didn't see any problems with this code earlier.) The bug does not corrupt memory; it's only the write-locking semantics that were violated.
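To make "write-locking semantics" concrete, here is a small single-threaded model of the invariant a big-reader lock is expected to keep, assuming brlock here means the 2.4 per-CPU reader lock. This is not the kernel implementation and not the content of br_writelock.diff; all names (brlock_model, br_model_*, NR_CPUS_MODEL) are made up for the illustration. A reader touches only its own CPU's slot, while a writer may proceed only when no CPU holds the read side.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 8   /* hypothetical; chosen to match the 8-way server */

/* Model of the lock state only; there is no real synchronization here. */
struct brlock_model {
    int  readers[NR_CPUS_MODEL];  /* read-side hold count per CPU */
    bool writer;                  /* true while the write side is held */
};

/* A reader only looks at the writer flag and its own CPU's slot. */
bool br_model_read_trylock(struct brlock_model *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    if (l->writer)
        return false;
    l->readers[cpu]++;
    return true;
}

void br_model_read_unlock(struct brlock_model *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    assert(l->readers[cpu] > 0);
    l->readers[cpu]--;
}

/* A writer must see zero readers on every CPU, not just its own. */
bool br_model_write_trylock(struct brlock_model *l)
{
    int cpu;

    if (l->writer)
        return false;
    for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        if (l->readers[cpu] != 0)
            return false;
    l->writer = true;
    return true;
}

void br_model_write_unlock(struct brlock_model *l)
{
    assert(l->writer);
    l->writer = false;
}
```

If the write side can ever be granted while some CPU still counts as a reader, a list walker on that CPU can run concurrently with a removal, which is the interleaving described in the original report.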
I have asked Hitachi to send us the test tools they used to generate the high network workload...
This patch should be in the current CVS tree; please test.
closing as resolved.