Bug 471657 - Kudzu corrupts configuration data of dual-port Chelsio 10G-NIC
Summary: Kudzu corrupts configuration data of dual-port Chelsio 10G-NIC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kudzu
Version: 5.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 5.5
Assignee: Bill Nottingham
QA Contact: BaseOS QE
URL:
Whiteboard: ptam
Duplicates: 498089
Depends On:
Blocks: 499522
 
Reported: 2008-11-14 21:14 UTC by Issue Tracker
Modified: 2018-10-27 15:50 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 07:59:22 UTC
Target Upstream Version:
Embargoed:


Attachments
patch for this issue (2.00 KB, text/plain)
2008-11-25 17:05 UTC, Bill Nottingham


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2010:0191 0 normal SHIPPED_LIVE kudzu bug fix and enhancement update 2010-03-29 12:22:37 UTC

Description Issue Tracker 2008-11-14 21:14:11 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2008-11-14 21:14:12 UTC
Problem:
 Kudzu corrupts configuration data of dual-port Chelsio 10G-NIC

Description of problem:
 If kudzu runs after installing a server with a dual-port Chelsio 10G NIC,
 it corrupts the network configuration files. We describe the cases where
 the problem occurred.

 - Case1: Add an e1000e NIC to a system installed with a dual-port Chelsio 10G NIC
   The system could not recognize the Chelsio NIC's second port,
   and kudzu overwrote that port's config with a different NIC's config.

 - Case2: Add a dual-port Chelsio 10G NIC to a system installed with an e1000e NIC
   Kudzu did not update the second port's configuration.

 We have also observed cases where the NIC's configuration is corrupted just
 by rebooting the system, without adding or removing devices.
 However, this symptom does not always occur; it happens about 20% of the time.


Version-Release number of selected component:
 kernel version: 2.6.18-92.el5 (RHEL5.2GA)

How reproducible:
 Case1/Case2: always reproduced (100%)

Steps to Reproduce:
 [Case1]
 1) Install Chelsio 10G NIC. (Server has onboard e1000e NIC)
 2) Install and start up RHEL5.2
 3) Shutdown
 4) Add another e1000e NIC (2-port type)
 5) Reboot

 [Case2]
 1) Install and start up RHEL5.2 (Server has onboard e1000e NIC)
 2) Shutdown 
 3) Add Chelsio 10G NIC
 4) Reboot


Actual results:

 We checked 2 cases.

 Case1: Onboard NIC (e1000e) + Chelsio 10G NIC; after the OS is installed, add another e1000e NIC.

 	[Previous] /etc/modprobe.conf  <-- Pre-installed 10G NIC
 	alias eth0 cxgb3(MAC:00:07:43:05:47:C2)
 	alias eth1 cxgb3(MAC:00:07:43:05:47:C3)
 	alias eth2 e1000e(MAC:00:16:97:62:36:A2)
 	alias eth3 e1000e(MAC:00:16:97:62:36:A3)
 	alias eth4 e1000e(MAC:00:16:97:62:36:A4)
 	alias eth5 e1000e(MAC:00:16:97:62:36:A5)
 	alias scsi_hostadapter megaraid_sas

	>This configuration was kept [ifcfg-eth1.bak]

 	[After adding e1000e] /etc/modprobe.conf
 	alias eth0 cxgb3(MAC:00:07:43:05:47:C2)
 	alias eth1 e1000e(MAC:00:16:97:44:77:36)  <--- Added e1000e
 	alias eth2 e1000e(MAC:00:16:97:62:36:A2)
 	alias eth3 e1000e(MAC:00:16:97:62:36:A3)
 	alias eth4 e1000e(MAC:00:16:97:62:36:A4)
 	alias eth5 e1000e(MAC:00:16:97:62:36:A5)
 	alias scsi_hostadapter megaraid_sas
 	alias eth6 e1000e(MAC:00:16:97:44:77:37) <--- Added e1000e

        >eth1 was replaced (cxgb3 changed to the added e1000e).
        >The previous cxgb3 configuration was deleted.
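
 A change like the one in Case1 can be spotted by comparing the eth aliases
 before and after the reboot. The following is only a minimal sketch, not part
 of kudzu: the helper names parse_eth_aliases and diff_aliases are hypothetical,
 and the sample text is the Case1 data above with the MAC annotations removed.

```python
import re

# Matches "alias ethN driver"; anything after the driver name is ignored.
ALIAS_RE = re.compile(r"^\s*alias\s+(eth\d+)\s+(\w+)")


def parse_eth_aliases(text):
    """Return {interface: driver} for every 'alias ethN driver' line."""
    aliases = {}
    for line in text.splitlines():
        match = ALIAS_RE.match(line)
        if match:
            aliases[match.group(1)] = match.group(2)
    return aliases


def diff_aliases(before, after):
    """Return (changed, added): aliases whose driver changed, and new aliases."""
    changed = {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
    added = sorted(name for name in after if name not in before)
    return changed, added


# Case1 data from this report, MAC annotations stripped.
BEFORE = """\
alias eth0 cxgb3
alias eth1 cxgb3
alias eth2 e1000e
alias eth3 e1000e
alias eth4 e1000e
alias eth5 e1000e
alias scsi_hostadapter megaraid_sas
"""

AFTER = """\
alias eth0 cxgb3
alias eth1 e1000e
alias eth2 e1000e
alias eth3 e1000e
alias eth4 e1000e
alias eth5 e1000e
alias scsi_hostadapter megaraid_sas
alias eth6 e1000e
"""

changed, added = diff_aliases(parse_eth_aliases(BEFORE), parse_eth_aliases(AFTER))
print("changed:", changed)
print("added:", added)
```

 For the Case1 data, the only alias whose driver changes is eth1, which is the
 second Chelsio port.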
 
 Case2: Onboard NIC only; after the OS is installed, add a Chelsio 10G NIC.
	[Previous] /etc/modprobe.conf
	alias eth0 e1000e(MAC:00:16:97:62:36:A2)
	alias eth1 e1000e(MAC:00:16:97:62:36:A3)
	alias eth2 e1000e(MAC:00:16:97:62:36:A4)
	alias eth3 e1000e(MAC:00:16:97:62:36:A5)
	alias scsi_hostadapter megaraid_sas
	
	[After adding 10G NIC] /etc/modprobe.conf
	alias eth0 e1000e(MAC:00:16:97:62:36:A2)
	alias eth1 e1000e(MAC:00:16:97:62:36:A3)
	alias eth2 e1000e(MAC:00:16:97:62:36:A4)
	alias eth3 e1000e(MAC:00:16:97:62:36:A5)
	alias scsi_hostadapter megaraid_sas
	alias eth4 cxgb3(MAC:00:07:43:05:47:c2) <--- Added Device

	>Port #2 of the 10G NIC (MAC:00:07:43:05:47:C3) was not reflected.

Hardware info:
 Express5800/B140a-T and Express5800/140Ba-10
 dual-port 10GNIC    : N8403-024(Chelsio Communications Inc T320 10GbE Dual Port Protocol Engine Ethernet Adapter)
 dual-port e1000eNIC : N8403-017(Intel Corporation 82571EB Gigabit Ethernet Controller)

Business impact:
  Customer cannot use dual-port Chelsio 10G NIC card.

Additional info:

During installation, anaconda uses the network interface information
in /proc/net/dev to create the ifcfg-ethX files.
However, the probeDevices() function in kudzu gets the same information from the PCI bus.

A dual-port Chelsio 10G NIC has only one function ID as a PCI device,
so kudzu incorrectly concludes that it has only one port.
The Intel e1000e has two function IDs, so it does not have this problem.
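
The mismatch between the two ways of counting can be illustrated with a small
sketch. This is not kudzu or anaconda code: the helper ports_per_pci_function,
the interface names, and the PCI addresses are all made up for illustration.
On a real system the mapping could be read from the
/sys/class/net/<iface>/device symlink.

```python
from collections import defaultdict

# Hypothetical interface -> PCI function mapping (addresses are invented).
SAMPLE_TOPOLOGY = {
    "eth0": "0000:0a:00.0",  # cxgb3 port 1
    "eth1": "0000:0a:00.0",  # cxgb3 port 2, same PCI function as port 1
    "eth2": "0000:04:00.0",  # e1000e port 1
    "eth3": "0000:04:00.1",  # e1000e port 2, its own PCI function
}


def ports_per_pci_function(topology):
    """Group interface names by the PCI function they belong to."""
    grouped = defaultdict(list)
    for iface, pci_addr in topology.items():
        grouped[pci_addr].append(iface)
    return {addr: sorted(ifaces) for addr, ifaces in grouped.items()}


grouped = ports_per_pci_function(SAMPLE_TOPOLOGY)
pci_function_count = len(grouped)        # what a PCI-only probe would count
interface_count = len(SAMPLE_TOPOLOGY)   # what /proc/net/dev would list
shared = {addr: ifaces for addr, ifaces in grouped.items() if len(ifaces) > 1}

print("PCI functions:", pci_function_count)
print("interfaces:", interface_count)
print("functions with more than one port:", shared)
```

With this sample data, a probe that counts one port per PCI function finds
fewer ports than there are interfaces, and the shortfall is on the cxgb3 card.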

 [Attached documents]
 sysreport,/etc/sysconfig/network-scripts/ifcfg-ethx,/etc/sysconfig/hwconf
 /etc/modprobe.conf

This event sent from IssueTracker by jbastian  [Support Engineering Group]
 issue 238559

Comment 3 Issue Tracker 2008-11-14 21:14:15 UTC
We confirmed that the same problem also occurs on RHEL5.3 Snapshot1.


This event sent from IssueTracker by jbastian  [Support Engineering Group]
 issue 238559

Comment 4 Issue Tracker 2008-11-14 21:14:16 UTC
Hi SEG, this is an escalation. The network information gets corrupted when a NIC is
added to the box. Apparently kudzu is doing something wrong:

   1. Provide time and date of the problem
N/A; this occurred at the vendor's test site.


   2. Indicate the platform(s) (architectures) the problem is being
reported against.
x86_64 RHEL5, but it may occur on other architectures.


   3. Provide clear and concise problem description as it is understood at
the time of escalation
          * Observed behavior
When a new NIC is added to the box, kudzu does not detect it properly; as a
result, the previous network information is destroyed.

<snip>
This occurs under the following two circumstances: 
[Case1]
1) Install Chelsio 10G NIC. (Server has onboard e1000e NIC)
2) Install and start up RHEL5.2
3) Shutdown
4) Add another e1000e NIC (2-port type)
5) Reboot

[Case2]
1) Install and start up RHEL5.2 (Server has onboard e1000e NIC)
2) Shutdown
3) Add Chelsio 10G NIC
4) Reboot
</snip>


          * Desired behavior 
kudzu handles the added NIC properly.


   4. State specific action requested of SEG
Looking at the code, there is a difference in how NICs are detected between
anaconda and kudzu. Anaconda uses /proc, whereas kudzu uses PCI device
information.

anaconda: network.py

            f = open("/proc/net/dev")
            lines = f.readlines()
            f.close()
            # skip first two lines, they are header
            lines = lines[2:]
            for line in lines:
                dev = string.strip(line[0:6])
                if dev != "lo" and dev[0:3] != "sit" and not self.netdevices.has_key(dev):
                    if self.firstnetdevice is None:
                        self.firstnetdevice = dev

                    self.netdevices[dev] = NetworkDevice(dev)

                    try:
                        hwaddr = isys.getMacAddress(dev)
                        if rhpl.getArch() != "s390" and hwaddr and hwaddr != "00:00:00:00:00:00" and hwaddr != "ff:ff:ff:ff:ff:ff":
                            self.netdevices[dev].set(("hwaddr", hwaddr))
                    except Exception, e:
                        log.error("exception getting mac addr: %s" %(e,))

            if ksdevice and self.netdevices.has_key(ksdevice):
                self.firstnetdevice = ksdevice

            return self.netdevices
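
For readers who want to try the /proc-based approach outside the installer,
here is a self-contained Python 3 sketch of the same idea. It is not the
anaconda code: list_net_devices is a hypothetical helper, the sample text and
its counters are invented, and the sketch splits on the colon instead of taking
the first six characters of each line.

```python
# Invented sample in the layout of /proc/net/dev (two header lines, then
# one "name: counters..." line per interface).
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:    2000      20    0    0    0     0          0         0     3000      30    0    0    0     0       0          0
  eth1:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  sit0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
"""


def list_net_devices(proc_net_dev_text):
    """Return interface names in file order, skipping 'lo' and 'sit*'."""
    devices = []
    # The first two lines are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        name, sep, _counters = line.partition(":")
        name = name.strip()
        if not sep or not name:
            continue  # not an interface line
        if name == "lo" or name.startswith("sit"):
            continue
        if name not in devices:
            devices.append(name)
    return devices


devices = list_net_devices(SAMPLE_PROC_NET_DEV)
print(devices)
```

Because this lists one entry per kernel network interface, both ports of a
dual-port card appear even when they share a PCI function.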


kudzu probeDevices: 
    struct device ** probeDevices ( enum deviceClass probeClass,
                                  enum deviceBus probeBus,
                                  int probeFlags
                                  ) {
            struct device *devices=NULL,**devlist=NULL;
            int numDevs=0, bus, x, index=0;
            enum deviceClass cl=CLASS_UNSPEC;
            int logLevel = -1;

    #ifndef __LOADER__
            logLevel = getLogLevel();
            setLogLevel(1);
    #endif
            setupKernelVersion();

            for (bus=1;buses[bus].string;bus++) {
                if ( (probeBus & buses[bus].busType) &&
                     !(probeBus == BUS_UNSPEC &&
                      buses[bus].busType & BUS_DDC))
                    if (buses[bus].probeFunc) {
                        DEBUG("Probing %s\n",buses[bus].string);
                        devices = buses[bus].probeFunc(probeClass,
                                                       probeFlags, devices);
                    }
                if ((probeFlags & PROBE_ONE) && (devices))
                    break;
            }
            if (devices == NULL) {
    #ifndef __LOADER__
                    setLogLevel(logLevel);
    #endif
                    return NULL;
            }
    #ifndef __LOADER__
            if (probeClass & CLASS_VIDEO)
                fbProbe(devices);
    #endif
    #ifndef __LOADER__
            setLogLevel(logLevel);
    #endif
            if (probeClass & CLASS_NETWORK) {
                    if (probeFlags & PROBE_LOADED) {
                            devices = filterNetDevices(devices);
                            if (!devices)
                                    return NULL;
                    }
            }

            while (devices) {
                    devlist=realloc(devlist, (numDevs+2) * sizeof(struct device *));
                    devlist[numDevs]=devices;
                    devlist[numDevs+1]=NULL;
                    numDevs++;
                    devices=devices->next;
            }
            qsort(devlist, numDevs, sizeof(struct device *), devCmp);
            /* We need to sort the network devices by module name. Fun. */
            for (x=0; devlist[x]; x++) {
                    devlist[x]->next = devlist[x+1];
            }
            if (probeClass & CLASS_NETWORK) {
                    sortNetDevices(devlist[0]);
                    if (!(probeFlags & PROBE_NOLOAD))
                            matchNetDevices(devlist[0]);
            }
            devices = devlist[0];
            for (x = 0; x < numDevs ; x++) {
                    devlist[x] = devices;
                    devices = devices->next;
            }

            for (x=0;devlist[x];x++) {
                    if (devlist[x]->type!=cl) {
                            index = 0;
                    }
                    devlist[x]->index = index;
                    cl = devlist[x]->type;
                    index++;
            }
            return devlist;
    }


I am not sure which one should be fixed, but IMO this would not occur if both
used the same detection routine.
 

   5. State whether or not a defect in the product is suspected
          * Provide Bugzilla if one already exists 
anaconda, or kudzu. 


   6. If there is a proposed patch, make sure it is in unified diff format
(diff -pruN)
N/A


Issue escalated to Support Engineering Group by: tumeya.
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jbastian  [Support Engineering Group]
 issue 238559

Comment 10 Bill Nottingham 2008-11-25 17:05:07 UTC
Created attachment 324630
patch for this issue

Here's a patch that specifically addresses the case where a PCI device has multiple ethernet devices. It at least appears to enumerate correctly on a dual-port card; it has not been tested on quad or greater.
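
The general idea of handling several ethernet devices behind one PCI device can
be sketched as follows. This is NOT the attached patch, whose contents are not
shown in this report; it is only an illustration under assumed data. NetEntry,
expand_ports, and the PCI addresses are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class NetEntry:
    pci_addr: str
    driver: str
    iface: Optional[str]  # None when no interface is known for the device


def expand_ports(
    pci_devices: List[Tuple[str, str]],
    ifaces_by_pci: Dict[str, List[str]],
) -> List[NetEntry]:
    """Emit one entry per network interface instead of one per PCI device."""
    entries = []
    for pci_addr, driver in pci_devices:
        ifaces = sorted(ifaces_by_pci.get(pci_addr, []))
        if not ifaces:
            # Keep the device visible even if no interface was found for it.
            entries.append(NetEntry(pci_addr, driver, None))
        for iface in ifaces:
            entries.append(NetEntry(pci_addr, driver, iface))
    return entries


# Invented example: one cxgb3 PCI function with two ports, one e1000e function.
entries = expand_ports(
    [("0000:0a:00.0", "cxgb3"), ("0000:04:00.0", "e1000e")],
    {"0000:0a:00.0": ["eth5", "eth4"], "0000:04:00.0": ["eth0"]},
)
for entry in entries:
    print(entry)
```

In this sketch the dual-port card yields two entries that share a PCI address,
so each port can get its own alias and ifcfg file.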

Comment 14 Issue Tracker 2008-12-16 01:52:29 UTC
Umeya-san,
Sorry for the delay.
We confirmed that this problem was resolved with the uploaded patch.



This event sent from IssueTracker by ishimoto.sunao 
 issue 238559

Comment 15 RHEL Program Management 2009-03-26 17:25:38 UTC
This request was evaluated by Red Hat Product Management for
inclusion, but this component is not scheduled to be updated in
the current Red Hat Enterprise Linux release. If you would like
this request to be reviewed for the next minor release, ask your
support representative to set the next rhel-x.y flag to "?".

Comment 16 Bill Nottingham 2009-05-13 22:30:13 UTC
*** Bug 498089 has been marked as a duplicate of this bug. ***

Comment 32 Casey Dahlin 2009-09-16 20:56:08 UTC
I'm seeing similar symptoms on tg3 from one customer (see attached IT)

Comment 45 Bill Nottingham 2009-11-06 21:44:14 UTC
Building as 1.2.57.1.22-1.

Comment 48 Chris Ward 2009-11-11 13:34:03 UTC
Grab the bits here:

http://people.redhat.com/~cward/5.5/kudzu/

Comment 53 Miroslav Vadkerti 2010-02-09 07:28:20 UTC
@IBM, the available build RHEL5.5-Server-20100201.0-ppc-DVD.iso was tested by our
RTT and seems to have no issues. Can you please try an installation from this DVD and check whether it
fixes your issues? This build was previously used by IBM for verification of kudzu bugs - see https://bugzilla.redhat.com/show_bug.cgi?id=555188#c61 for example.

Comment 54 Miroslav Vadkerti 2010-02-23 08:12:55 UTC
Any update from IBM?

Comment 56 errata-xmlrpc 2010-03-30 07:59:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0191.html

