Description of problem: Customer's system crashes anytime a large dd command (1-5GB) is done to to iscsi device with qla3xxx driver. The net_device_wrapper is corrupt within the core and system dies in net_rx_action() when attempting to deref .npinfo struct pointer within the net_device_wrapper. root cause: private data structure of qla3xxx is 0x15c00 in length, this length is stored in a 32 bit short int field within net_device resulting in a recorded length of 5c00. See additional information below. Version-Release number of selected component (if applicable): Vanilla 4.8, but checked up thru 5.5 and it has the same issue. How reproducible: 100% Steps to Reproduce: 1. dd if=/dev/zero of=/iscsimnt/bigfile bs=1024 count=10000 2. 3. Actual results: system crashes Expected results: system doesn't crash Additional info: Created test kernel using char reserved field just before priv_len for customer to try/test. Verified that storing the whole 24 bit length stops the crashes. RCA: qla3xxx driver allocates private state >0xFFFF in length (0x15c00) which is stored in short int priv_len field within net_device structure (0x5c00). The net_device, private struct, and net_device_wrapper are allocated as a single unit/alloc via alloc_netdev. The net_device_wrapper pointer is created off the truncated priv_len pointer and initialized. The qla3xxx driver eventually uses the part of its private state that overlays this area and writes over the net_device_wrapper. Next net_rx_action() will crash the system upon bad pointer deref. +------------------------------------------------------------- | net_device | : .priv -----------------+ . .priv_len | +------------------------------- | : <alignment padding, if needed> | +------------------------------- <---+ | : <private data area assoc. w/net_device . +------------------------------- : <alignment padding, if needed> +------------------------------- | | net_device_wrapper structure | . .netpoll_setup = (function) . .netpoll_start_xmit = (function) . .npinfo = (structure ptr) +-------------------------------------------------------------- crash> px sizeof(struct ql3_adapter) $3 = 0x15c00 << ok, bigger than short int. struct net_device: short unsigned int priv_len; stored value from crash: priv_len = 0x5c00,
Dwight, we can't really resize the priv_len field. If any third party modules call the dev_extended function the offset of priv_len will be wrong and they'll get garbage data, causing errors/corruption/etc. I've got an alternate patch in bz620528
Yeah, that's what I thought. Changing the private struct to be under the size limit is cleaner. As an aside, having the kernel routines that pass this size around having the variable declared as an int and then storing it into a short int may have contributed to the issue... I'd expect type checking to have helped prevent the issue if the interface had sizeof_priv as short int within call args to begin with? Not advocating changing it at this point, so more just a question... thanks, bud brown seg/storage
It would have been better that way, but alloc_netdev and friends are on the abi whitelist, so thats not an option. Ideally, the best solution would be to embed all the net_devices_extended data where it belongs in the net_device struct, like upstream did, but then we really break abi, which we promise not to do. We should have made priv_len a u32, but I was trying to save space, and didn't think any priv struct would be 65k. Oh well....
Jerome Can you build a RHEL4 kernel? The RHEL5 patch should apply pretty cleanly I would imagine
Yes it did. The build is in progress: https://brewweb.devel.redhat.com/taskinfo?taskID=2666912
Created attachment 437647 [details] Fixed patch That patch also check for sizeof_priv in alloc_netdev() to prevent future crash with other drivers.
Posted: http://post-office.corp.redhat.com/archives/rhkernel-list/2010-August/msg00649.html
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.31.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Putting the bz back to POST state. There is one more incremental patch on top of existing patch to be applied. Converting kmalloc() to kzalloc().
Committed in 89.32.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
(In reply to comment #20) > Committed in 89.32.EL . RPMS are available at > http://people.redhat.com/vgoyal/rhel4/ NOTE: these are only *test kernels* and cannot be used in production environments. This kernel is unsupported unless otherwise included in an official release or hot fix.
Hi all, Must it be ISP4032-based Ethernet NIC? I cannot reproduce this bug with ISP4022-based Ethernet NIC, which can only be found in Beaker. Thanks Igor
Through code review, confirmed the patch included into kernel-2.6.9-97.EL, marked SanityOnly.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0263.html