Bug 517329

Summary: [RHEL4.8] igb driver doesn't allocate enough buffer for ethtool_get_strings()
Product: Red Hat Enterprise Linux 4 Reporter: Flavio Leitner <fleitner>
Component: kernelAssignee: Stefan Assmann <sassmann>
Status: CLOSED ERRATA QA Contact: Evan McNabb <emcnabb>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.8CC: bobby.suber, daniel_frazier, dannf, david.graham, dhoward, emcnabb, peterm, sandy.garza, tao
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
The data buffer ethtool_get_strings() allocated, for the igb driver, was smaller than the amount of data that was copied in igb_get_strings(), because of a miscalculation in IGB_QUEUE_STATS_LEN, resulting in memory corruption. This bug could have led to a kernel panic.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-16 15:46:23 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 522738    
Attachments:
Description Flags
bug517329-igb-buffer-fix.patch none

Description Flavio Leitner 2009-08-13 14:36:51 UTC
Description of problem:

The size returned by igb driver to allocate data buffer in ethtool_get_strings()
is smaller than it actually copies on igb_get_strings()

578 static int ethtool_get_strings(struct net_device *dev, void __user *useraddr)
579 {
580         struct ethtool_gstrings gstrings;
581         struct ethtool_ops *ops = dev->ethtool_ops;
582         u8 *data;
583         int ret; 
...
597         case ETH_SS_STATS:
598                 if (!ops->get_stats_count)
599                         return -EOPNOTSUPP;

here it gets the total number of entries which is wrong, see below:

600                 gstrings.len = ops->get_stats_count(dev);
601                 break;
...

now allocates a buffer of entries x ETH_GSTRING_LEN
606         data = kmalloc(gstrings.len * ETH_GSTRING_LEN, GFP_USER);
607         if (!data)
608                 return -ENOMEM;
609 
610         ops->get_strings(dev, gstrings.string_set, data);

ops->get_strings is igb_get_strings() which does:
1953 static void igb_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
1954 {
1955         struct igb_adapter *adapter = netdev_priv(netdev);
1956         u8 *p = data;
1957         int i;
1958 
1959         switch (stringset) {
...
1964         case ETH_SS_STATS:

This part below is good:
1965                 for (i = 0; i < IGB_GLOBAL_STATS_LEN; i++) {
1966                         memcpy(p, igb_gstrings_stats[i].stat_string,
1967                                ETH_GSTRING_LEN);
1968                         p += ETH_GSTRING_LEN;
1969                 }


But this strings below are not counted in buffer size.
1970                 for (i = 0; i < adapter->num_tx_queues; i++) {
1971                         sprintf(p, "tx_queue_%u_packets", i);
1972                         p += ETH_GSTRING_LEN;
1973                         sprintf(p, "tx_queue_%u_bytes", i);
1974                         p += ETH_GSTRING_LEN;
1975                 }
1976                 for (i = 0; i < adapter->num_rx_queues; i++) {
1977                         sprintf(p, "rx_queue_%u_packets", i);
1978                         p += ETH_GSTRING_LEN;
1979                         sprintf(p, "rx_queue_%u_bytes", i);
1980                         p += ETH_GSTRING_LEN;
1981                 }
1982 /*              BUG_ON(p - data != IGB_STATS_LEN * ETH_GSTRING_LEN); */
1983                 break;
1984         }
1985 }

It seems to have another problem on igb_get_ethtool_stats() because
it just provides rx_queue stats. Upstream driver has this second issue fixed.


Version-Release number of selected component (if applicable):
RHEL4u8 - kernel-2.6.9-89.EL

How reproducible:
Always.


Steps to Reproduce:
1.  Download PSP from HP's website - http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?swItem=MTX-dd9f895848b24831a1c8b9f060〈=en&cc=us&idx=2&mode=4&
Select "Download" to download psp-8.26.rhel4.i686.en.tar.gz
2.  Untar the psp
3.  cd to compaq/csp/linux
4.  ./install826.sh
5.  Select "Force installation of all modules" and select "Install"
6.  Reboot
7. then run in a loop:   '/etc/init.d/network restart'

Actual results:
The NIC Agent Daemon (cmanicd) will use ethtool while starting causing
a memory corruption followed by a kernel oops.

Additional info:
Vmcore is available.

Logs with a systemtap script probing on igb_get_strings():
Starting NIC Agent Daemon (cmanicd): ethtool [23131]: calling igb_get_strings
ethtool [23310]: calling igb_get_strings
[  OK  ]
 
Shutting down interface eth0:  Unable to handle kernel paging request at virtual address 715f787e  <-- this is actually an ASCII
printing eip:
c01282f2
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: stap_e7e6e0c721692655ee06de6774bd6790_485(U) mptctl mptbase sg ipmi_devintf ipmi_si ipmi_msghandler md5 ipv6 autofs4 i2c_dev i2c_core sunrpc cpufreq_powersave ide_dump scsi_dump diskdump zlib_deflate dm_mirror dm_mod joydev button battery ac ehci_hcd uhci_hcd igb inet_lro ext3 jbd ata_piix libata sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c01282f2>]    Not tainted VLI
EFLAGS: 00010282   (2.6.9-89.ELsmp)
EIP is at start_unregistering+0xc/0x6a
eax: 715f7872   ebx: 715f7872   ecx: 00000000   edx: dead4ead
esi: f78ff800   edi: f8cacbdc   ebp: 00000040   esp: f4eebcc4
ds: 007b   es: 007b   ss: 0068
Process ip (pid: 23504, threadinfo=f4eeb000 task=f42e8270)
Stack: 00000000 0001e5d5 000065d7 00000246 00000000 715f7872 f78ff800 f4eebce8
      715f7872 f78ff800 c0128912 f4f4d600 f8c7a902 00000000 f8c78c67 ffffffff
      00000001 f66ed000 f5851e00 f78ff800 00000040 f7906a60 f8c78223 f5457900
Call Trace:
[<c0128912>] unregister_sysctl_table+0x2a/0x49
[<f8c7a902>] addrconf_sysctl_unregister+0x16/0x2b [ipv6]
[<f8c78c67>] addrconf_ifdown+0x28e/0x2d3 [ipv6]
[<f8c78223>] inet6_addr_del+0x6d/0x8a [ipv6]
[<f8c79417>] inet6_rtm_deladdr+0x0/0x5b [ipv6]
[<c028ffef>] rtnetlink_rcv+0x226/0x327
[<c029b01b>] netlink_data_ready+0x14/0x44
[<c029a688>] netlink_sendskb+0x52/0x6c
[<c029ae36>] netlink_sendmsg+0x271/0x280
[<c027fc6c>] sock_sendmsg+0xdb/0xf7
[<c02815ae>] sys_recvmsg+0x172/0x1db
[<c0281608>] sys_recvmsg+0x1cc/0x1db
[<c0120ef9>] autoremove_wake_function+0x0/0x2d
[<c0280f6a>] sys_sendto+0xc7/0xe2
[<c011b897>] do_page_fault+0x1ae/0x5c6
[<c01516fe>] vma_link+0x44/0xbc
[<c0281781>] sys_socketcall+0x16a/0x1fb
[<c02dd8e3>] syscall_call+0x7/0xb
Code: b2 01 89 d0 c3 89 c2 8b 40 0c 48 89 42 0c 85 c0 75 0c 8b 42 10 85 c0 74 05 e9 b1 6f ff ff c3 56 ba ad 4e ad de 53 83 ec 20 89 c3 <83> 7b 0c 00 b8 01 00 00 00 89 44 24 04 8d 44 24 0c c7 04 24 00

Comment 2 dave graham 2009-08-13 17:46:29 UTC
I've been looking into this with the Intel igb author, who points out a clear bug in the define of IGB_QUEUE_STATS_LEN, and that becuase TX-stats are not recorded in this driver, the TX queue -relevant stuff can be removed.
The changes are required in drivers/net/igb/igb_ethtool.c

The bug is in the check against the number of queues, which increments by  0! if there is exactly one RX or TX quueue.

Because TXStats are not recoded, we should change this:.

098 #define IGB_QUEUE_STATS_LEN \
099         ((((((struct igb_adapter *)netdev->priv)->num_rx_queues > 1) ? \
100           ((struct igb_adapter *)netdev->priv)->num_rx_queues : 0) + \
101          (((((struct igb_adapter *)netdev->priv)->num_tx_queues > 1) ? \
102           ((struct igb_adapter *)netdev->priv)->num_tx_queues : 0))) * \
103         (sizeof(struct igb_queue_stats) / sizeof(u64)))

to this:

098 #define IGB_QUEUE_STATS_LEN
099         ((((struct igb_adapter *)netdev->priv)->num_rx_queues) *
	                 (sizeof(struct igb_queue_stats) / sizeof(u64)))
                 
And , again, because there are no TXQueue stats recorded, this should be removed:

1864                 for (i = 0; i < adapter->num_tx_queues; i++) {
1865                         sprintf(p, "tx_queue_%u_packets", i);
1866                         p += ETH_GSTRING_LEN;
1867                         sprintf(p, "tx_queue_%u_bytes", i);
1868                         p += ETH_GSTRING_LEN;
1869                 }

I have not yet has time to verify that this resolves the issue, but hope it helps.

Comment 3 dann frazier 2009-08-13 18:40:53 UTC
Could we also add an assertion, e.g.:

BUG_ON(p - data >= igb_get_stats_count(dev) * ETH_GSTRING_LEN)

Comment 4 dave graham 2009-08-14 19:21:22 UTC
The changes I proposed above are based on the original driver in the RHEL 2.6.9-78 source, and the issue addressed is actually __fixed__ in the igb-1.3.8.6 module that is installed as part of the PSP that exposed the issue. 

It is unfortunate that the installation of the new fixed driver exposes an issue in the older one. I am not sure how (or if) it should be addressed. As long as the newer (igb-1.3.8.6) driver is installed 'enough' to be loadable on the next system boot, at least this one panic will be the last one cause by the old driver. Possibly there are actions that the PSP install could take to ensure that the old driver does not receive an etthool -S query before the reboot has completed ? 

The assertion is a good idea, but should actually use ">" rather than ">="

   BUG_ON(p - data > igb_get_stats_count(dev) * ETH_GSTRING_LEN)

Comment 7 Stefan Assmann 2009-08-25 14:54:41 UTC
I tried to reproduce the issue with an Intel 82576 NIC on RHEL4.

I followed the description in the initial comment as close as possible, however in Step 4 I had to execute install825.sh instead of install826.sh and didn't get asked for Step 5. The other difference is that I tested on x86_64.

I ran "for i in `seq 100`; do /etc/init.d/network restart; done" but the machine didn't crash it only displayed
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Setting network parameters:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth1:  [  OK  ]
SIOCGIFXQLEN: No such device
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Setting network parameters:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth1:  [  OK  ]
SIOCGIFXQLEN: No such device
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Setting network parameters:  [  OK  ]

Will I need any special hardware?

Additional info:
dmesg |grep eth
divert: not allocating divert_blk for non-ethernet device lo
divert: allocating divert_blk for eth0
igb 0000:03:00.0: eth0: (PCIe:2.5Gb/s:Width x4) 00:1b:21:38:2f:50
igb 0000:03:00.0: eth0: PBA No: e19418-006
divert: allocating divert_blk for eth1
igb 0000:03:00.1: eth1: (PCIe:2.5Gb/s:Width x4) 00:1b:21:38:2f:51
igb 0000:03:00.1: eth1: PBA No: e19418-006
divert: allocating divert_blk for eth2
eth2: Tigon3 [partno(BCM95754) rev 5784100 PHY(5784)] (PCI Express) 10/100/1000Base-T Ethernet 00:1a:4b:0c:d8:7a
eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth2: dma_rwctrl[76180000] dma_mask[64-bit]
divert: not allocating divert_blk for non-ethernet device sit0
eth1: no IPv6 routers present

Comment 8 dann frazier 2009-08-25 15:49:42 UTC
restarting the network is not sufficient. the corruption comes as a result of an ioctl that only gets called when hp-snmp-agents starts. To reproduce, you need to also restart that service.

while :; do
  /etc/init.d/hp-snmp-agents restart # causes corruption
  /etc/init.d/network start # trips over corruption and crashes
done

Comment 9 Stefan Assmann 2009-08-26 08:18:10 UTC
Created attachment 358683 [details]
bug517329-igb-buffer-fix.patch

I was able to reproduce the problem and can confirm that with this patch no oops occurs. If you want to have anything else added to this patch let me know.

Comment 10 Stefan Assmann 2009-08-26 10:53:54 UTC
when will this be fixed upstream?

Comment 11 dann frazier 2009-08-26 16:44:11 UTC
I believe it is already fixed upstream in:

commit e21ed3538f1946ea623caf28f1c44ede50224275
Author: Alexander Duyck <alexander.h.duyck>

    igb: update ethtool stats to support multiqueue


Which was in 2.6.27-rc1.

Comment 12 Stefan Assmann 2009-08-28 12:07:02 UTC
Thanks Dann! All my questions have been addressed.

Comment 13 RHEL Program Management 2009-08-28 12:32:27 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 Vivek Goyal 2009-10-01 22:08:55 UTC
Committed in 89.12.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 23 Douglas Silas 2011-01-31 00:03:16 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The data buffer ethtool_get_strings() allocated, for the igb driver, was smaller than the amount of data that was copied in igb_get_strings(), because of a miscalculation in IGB_QUEUE_STATS_LEN, resulting in memory corruption. This bug could have led to a kernel panic.

Comment 24 errata-xmlrpc 2011-02-16 15:46:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html