Bug 620485

Summary: system crashes due to corrupt net_device_wrapper structure
Product: Red Hat Enterprise Linux 4 Reporter: Dwight (Bud) Brown <bubrown>
Component: kernelAssignee: Jerome Marchand <jmarchan>
Status: CLOSED ERRATA QA Contact: Barry Donahue <bdonahue>
Severity: high Docs Contact:
Priority: urgent    
Version: 4.8CC: andriusb, bturner, davem, dhoward, fge, jwest, mtian, nhorman, plyons, vgoyal, yugzhang
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 620508 (view as bug list) Environment:
Requires: Qlogic iscsi hba: ISP4032-based Ethernet IPv6 NIC
Last Closed: 2011-02-16 15:18:29 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 624364    
Attachments:
Description Flags
Fixed patch none

Description Dwight (Bud) Brown 2010-08-02 16:36:45 UTC
Description of problem:
Customer's system crashes anytime a large dd command (1-5GB) is done to to iscsi device with qla3xxx driver.  The net_device_wrapper is corrupt within the core and system dies in net_rx_action() when attempting to deref .npinfo struct pointer within the net_device_wrapper.

root cause: private data structure of qla3xxx is 0x15c00 in length, this length is stored in a 32 bit short int field within net_device resulting in a recorded length of 5c00.  See additional information below.



Version-Release number of selected component (if applicable):
Vanilla 4.8, but checked up thru 5.5 and it has the same issue.

How reproducible:
100%


Steps to Reproduce:
1. dd if=/dev/zero of=/iscsimnt/bigfile bs=1024 count=10000
2.
3.
  
Actual results:
system crashes

Expected results:
system doesn't crash


Additional info:
Created test kernel using char reserved field just before priv_len for customer to try/test.  Verified that storing the whole 24 bit length stops the crashes.  

RCA:
qla3xxx driver allocates private state >0xFFFF in length (0x15c00) which is stored in short int priv_len field within net_device structure (0x5c00).

The net_device, private struct, and net_device_wrapper are allocated as a single unit/alloc via alloc_netdev.  The net_device_wrapper pointer is created off the truncated priv_len pointer and initialized.  The qla3xxx driver eventually uses the part of its private state that overlays this area and writes over the net_device_wrapper.  Next net_rx_action() will crash the system upon bad pointer deref.

+-------------------------------------------------------------
|  net_device
|
:              .priv -----------------+
.              .priv_len              |
+-------------------------------      |
: <alignment padding, if needed>      |
+-------------------------------  <---+
|
:  <private data area assoc. w/net_device
.
+-------------------------------
: <alignment padding, if needed>
+-------------------------------
|
|  net_device_wrapper structure
|
.     .netpoll_setup = (function)
.     .netpoll_start_xmit = (function)
.     .npinfo = (structure ptr)
+--------------------------------------------------------------

crash> px sizeof(struct ql3_adapter)
$3 = 0x15c00   << ok, bigger than short int.

struct net_device:
    short unsigned int priv_len;

stored value from crash:
  priv_len = 0x5c00,

Comment 6 Neil Horman 2010-08-05 19:20:44 UTC
Dwight, we can't really resize the priv_len field.  If any third party modules call the dev_extended function the offset of priv_len will be wrong and they'll get garbage data, causing errors/corruption/etc.  I've got an alternate patch in bz620528

Comment 7 Dwight (Bud) Brown 2010-08-05 19:32:07 UTC
Yeah, that's what I thought.  Changing the private struct to be under the size limit is cleaner.  As an aside, having the kernel routines that pass this size around having the variable declared as an int and then storing it into a short int may have contributed to the issue... I'd expect type checking to have helped prevent the issue if the interface had sizeof_priv as short int within call args to begin with?  Not advocating changing it at this point, so more just a question...

thanks,
bud brown
seg/storage

Comment 8 Neil Horman 2010-08-06 00:08:29 UTC
It would have been better that way, but alloc_netdev and friends are on the abi whitelist, so thats not an option.  Ideally, the best solution would be to embed all the net_devices_extended data where it belongs in the net_device struct, like upstream did, but then we really break abi, which we promise not to do.  We should have made priv_len a u32, but I was trying to save space, and didn't think any priv struct would be 65k.  Oh well....

Comment 10 Neil Horman 2010-08-09 15:32:57 UTC
Jerome Can you build a RHEL4 kernel?  The RHEL5 patch should apply pretty cleanly I would imagine

Comment 11 Jerome Marchand 2010-08-09 16:29:35 UTC
Yes it did. The build is in progress:
https://brewweb.devel.redhat.com/taskinfo?taskID=2666912

Comment 12 Jerome Marchand 2010-08-09 16:36:48 UTC
Created attachment 437647 [details]
Fixed patch

That patch also check for sizeof_priv in alloc_netdev() to prevent future crash with other drivers.

Comment 16 RHEL Program Management 2010-08-13 20:39:09 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Vivek Goyal 2010-08-18 13:48:11 UTC
Committed in 89.31.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 19 Vivek Goyal 2010-08-18 18:24:57 UTC
Putting the bz back to POST state. There is one more incremental patch on top of existing patch to be applied. Converting kmalloc() to kzalloc().

Comment 20 Vivek Goyal 2010-08-20 19:48:16 UTC
Committed in 89.32.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 21 Andrius Benokraitis 2010-09-03 18:22:41 UTC
(In reply to comment #20)
> Committed in 89.32.EL . RPMS are available at
> http://people.redhat.com/vgoyal/rhel4/

NOTE: these are only *test kernels* and cannot be used in production environments. This kernel is unsupported unless otherwise included in an official release or hot fix.

Comment 22 Igor Zhang 2010-10-13 03:51:06 UTC
Hi all,
Must it be ISP4032-based Ethernet NIC? I cannot reproduce this bug with ISP4022-based Ethernet NIC, which can only be found in Beaker.

Thanks
Igor

Comment 27 Brenda Tian 2011-01-26 07:44:53 UTC
Through code review, confirmed the patch included into
 kernel-2.6.9-97.EL, marked SanityOnly.

Comment 28 errata-xmlrpc 2011-02-16 15:18:29 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html