Escalated to Bugzilla from IssueTracker
Problem: Kudzu corrupts configuration data of dual-port Chelsio 10G NIC

Description of problem:
If kudzu runs after installing a server with a dual-port Chelsio 10G NIC, it corrupts the configuration file. We describe three cases where the problem occurred.

- Case1: Add an e1000e NIC card to a system installed with a dual-port Chelsio 10G NIC
  The system could not recognize the 2nd port, and kudzu overwrote this port's config with a different NIC's config.
- Case2: Add a dual-port Chelsio 10G NIC to a system installed with an e1000e NIC
  The second port's configuration was not created.

We have also observed cases where the NIC's configuration is corrupted just by rebooting the system, without adding or removing devices. This symptom does not always occur; it happens about 20% of the time.

Version-Release number of selected component:
kernel version: 2.6.18-92.el5 (RHEL5.2GA)

How reproducible:
Case1/Case2: always reproduced (100%)

Steps to Reproduce:
[Case1]
1) Install Chelsio 10G NIC. (Server has onboard e1000e NIC)
2) Install and start up RHEL5.2
3) Shutdown
4) Add more e1000e NIC (2-port type)
5) Reboot

[Case2]
1) Install and start up RHEL5.2 (Server has onboard e1000e NIC)
2) Shutdown
3) Add Chelsio 10G NIC
4) Reboot

Actual results:
We checked 2 cases.

Case1: Onboard NIC (e1000e) + Chelsio 10G NIC
> after OS installed, add more e1000e.
[Previous] /etc/modprobe.conf   <-- Pre-installed 10GB NIC
alias eth0 cxgb3(MAC:00:07:43:05:47:C2)
alias eth1 cxgb3(MAC:00:07:43:05:47:C3)
alias eth2 e1000e(MAC:00:16:97:62:36:A2)
alias eth3 e1000e(MAC:00:16:97:62:36:A3)
alias eth4 e1000e(MAC:00:16:97:62:36:A4)
alias eth5 e1000e(MAC:00:16:97:62:36:A5)
alias scsi_hostadapter megaraid_sas
>This configuration was kept [ifcifg,eth1.bak]

[After add e1000e] /etc/modprobe.conf
alias eth0 cxgb3(MAC:00:07:43:05:47:C2)
alias eth1 e1000e(MAC:00:16:97:44:77:36)   <--- Added e1000e
alias eth2 e1000e(MAC:00:16:97:62:36:A2)
alias eth3 e1000e(MAC:00:16:97:62:36:A3)
alias eth4 e1000e(MAC:00:16:97:62:36:A4)
alias eth5 e1000e(MAC:00:16:97:62:36:A5)
alias scsi_hostadapter megaraid_sas
alias eth6 e1000e(MAC:00:16:97:44:77:37)   <--- Added e1000e
>eth1 was replaced (cxgb3 to added e1000e).
>The previous cxgb3 configuration was deleted.

Case2: Onboard NIC only
> after OS installed, add Chelsio 10G NIC

[Previous] /etc/modprobe.conf
alias eth0 e1000e(MAC:00:16:97:62:36:A2)
alias eth1 e1000e(MAC:00:16:97:62:36:A3)
alias eth2 e1000e(MAC:00:16:97:62:36:A4)
alias eth3 e1000e(MAC:00:16:97:62:36:A5)
alias scsi_hostadapter megaraid_sas

[After add 10GB NIC] /etc/modprobe.conf
alias eth0 e1000e(MAC:00:16:97:62:36:A2)
alias eth1 e1000e(MAC:00:16:97:62:36:A3)
alias eth2 e1000e(MAC:00:16:97:62:36:A4)
alias eth3 e1000e(MAC:00:16:97:62:36:A5)
alias scsi_hostadapter megaraid_sas
alias eth4 cxgb3(MAC:00:07:43:05:47:c2)   <--- Added device
>Port #2 of the 10GB NIC (MAC:00:07:43:05:47:C3) was not reflected.

Hardware info:
Express5800/B140a-T and Express5800/140Ba-10
dual-port 10G NIC: N8403-024 (Chelsio Communications Inc T320 10GbE Dual Port Protocol Engine Ethernet Adapter)
dual-port e1000e NIC: N8403-017 (Intel Corporation 82571EB Gigabit Ethernet Controller)

Business impact:
Customer cannot use the dual-port Chelsio 10G NIC card.

Additional info:
During installation, anaconda uses information about the network interface in /proc/net/dev to create ifcfg-eth files.
However, the probeDevices() function in kudzu gets the same information from PCI. A dual-port Chelsio 10G NIC has only one function ID as a PCI device, so kudzu incorrectly thinks it has only one port. The Intel e1000e has two function IDs, so it does not have this problem.

[Attached documents]
sysreport, /etc/sysconfig/network-scripts/ifcfg-ethx, /etc/sysconfig/hwconf, /etc/modprobe.conf

This event sent from IssueTracker by jbastian [Support Engineering Group] issue 238559
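
To illustrate the mismatch described above, here is a minimal Python sketch (not kudzu code) that groups network interfaces by the PCI function they sit behind. A probe that counts PCI functions sees one entry for the dual-port cxgb3 card, while the kernel exposes two interfaces for it. The PCI addresses and the interface-to-address mapping are hypothetical, made up for illustration; on a real system such a mapping could presumably be built from the /sys/class/net/<iface>/device symlinks, which is an assumption not taken from this report.

```python
from collections import defaultdict


def group_interfaces_by_pci_function(iface_to_pci):
    """Return {pci_address: sorted list of interface names behind it}."""
    grouped = defaultdict(list)
    for iface, pci_addr in iface_to_pci.items():
        grouped[pci_addr].append(iface)
    return {addr: sorted(names) for addr, names in grouped.items()}


def multi_port_functions(iface_to_pci):
    """Return only the PCI functions that carry more than one interface.

    These are the devices a PCI-function-based probe would undercount.
    """
    grouped = group_interfaces_by_pci_function(iface_to_pci)
    return {addr: names for addr, names in grouped.items() if len(names) > 1}


# Hypothetical layout: the PCI addresses below are invented for illustration.
SAMPLE_PCI_LAYOUT = {
    "eth0": "0000:0a:00.0",  # cxgb3 port 1
    "eth1": "0000:0a:00.0",  # cxgb3 port 2, same PCI function as port 1
    "eth2": "0000:0b:00.0",  # e1000e port 1
    "eth3": "0000:0b:00.1",  # e1000e port 2, its own PCI function
}

if __name__ == "__main__":
    print(len(group_interfaces_by_pci_function(SAMPLE_PCI_LAYOUT)), "PCI functions")
    print(len(SAMPLE_PCI_LAYOUT), "network interfaces")
    print("undercounted:", multi_port_functions(SAMPLE_PCI_LAYOUT))
```

In this sample, a per-function count yields three devices for four interfaces, which is the kind of gap reported for the Chelsio card.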
We confirmed that the same problem also occurs on RHEL5.3 Snapshot1.

This event sent from IssueTracker by jbastian [Support Engineering Group] issue 238559
Hi SEG,

escalation. The network information gets corrupted when a NIC is added to the box. Apparently kudzu is doing something wrong:

1. Provide time and date of the problem
N/A. This has occurred at the vendor's test site.

2. Indicate the platform(s) (architectures) the problem is being reported against.
x86_64 RHEL5, but it may occur on other architectures.

3. Provide clear and concise problem description as it is understood at the time of escalation
* Observed behavior
When a new NIC is added to the box, kudzu does not detect it properly; as a result, the previous network information is destroyed.

<snip>
This occurs under the following two circumstances:

[Case1]
1) Install Chelsio 10G NIC. (Server has onboard e1000e NIC)
2) Install and start up RHEL5.2
3) Shutdown
4) Add more e1000e NIC (2-port type)
5) Reboot

[Case2]
1) Install and start up RHEL5.2 (Server has onboard e1000e NIC)
2) Shutdown
3) Add Chelsio 10G NIC
4) Reboot
</snip>

* Desired behavior
kudzu would handle the added NIC properly.

4. State specific action requested of SEG
Looking at the code, there's a difference in how a NIC is detected between anaconda and kudzu. Anaconda uses /proc, whereas kudzu uses PCI device information.
anaconda: network.py

        f = open("/proc/net/dev")
        lines = f.readlines()
        f.close()
        # skip first two lines, they are header
        lines = lines[2:]
        for line in lines:
            dev = string.strip(line[0:6])
            if dev != "lo" and dev[0:3] != "sit" and not self.netdevices.has_key(dev):
                if self.firstnetdevice is None:
                    self.firstnetdevice = dev

                self.netdevices[dev] = NetworkDevice(dev)

                try:
                    hwaddr = isys.getMacAddress(dev)
                    if rhpl.getArch() != "s390" and hwaddr and hwaddr != "00:00:00:00:00:00" and hwaddr != "ff:ff:ff:ff:ff:ff":
                        self.netdevices[dev].set(("hwaddr", hwaddr))
                except Exception, e:
                    log.error("exception getting mac addr: %s" %(e,))

        if ksdevice and self.netdevices.has_key(ksdevice):
            self.firstnetdevice = ksdevice

        return self.netdevices

kudzu probeDevices:

    struct device ** probeDevices ( enum deviceClass probeClass,
                                    enum deviceBus probeBus,
                                    int probeFlags
                                  ) {
        struct device *devices=NULL,**devlist=NULL;
        int numDevs=0, bus, x, index=0;
        enum deviceClass cl=CLASS_UNSPEC;
        int logLevel = -1;

    #ifndef __LOADER__
        logLevel = getLogLevel();
        setLogLevel(1);
    #endif
        setupKernelVersion();

        for (bus=1;buses[bus].string;bus++) {
            if ( (probeBus & buses[bus].busType) &&
                 !(probeBus == BUS_UNSPEC &&
                   buses[bus].busType & BUS_DDC))
                if (buses[bus].probeFunc) {
                    DEBUG("Probing %s\n",buses[bus].string);
                    devices = buses[bus].probeFunc(probeClass,
                                                   probeFlags, devices);
                }
            if ((probeFlags & PROBE_ONE) && (devices))
                break;
        }
        if (devices == NULL) {
    #ifndef __LOADER__
            setLogLevel(logLevel);
    #endif
            return NULL;
        }
    #ifndef __LOADER__
        if (probeClass & CLASS_VIDEO)
            fbProbe(devices);
    #endif
    #ifndef __LOADER__
        setLogLevel(logLevel);
    #endif
        if (probeClass & CLASS_NETWORK) {
            if (probeFlags & PROBE_LOADED) {
                devices = filterNetDevices(devices);
                if (!devices)
                    return NULL;
            }
        }

        while (devices) {
            devlist=realloc(devlist, (numDevs+2) * sizeof(struct device *));
            devlist[numDevs]=devices;
            devlist[numDevs+1]=NULL;
            numDevs++;
            devices=devices->next;
        }
        qsort(devlist, numDevs, sizeof(struct device *), devCmp);
        /* We need to sort the network devices by module name. Fun. */
        for (x=0; devlist[x]; x++) {
            devlist[x]->next = devlist[x+1];
        }
        if (probeClass & CLASS_NETWORK) {
            sortNetDevices(devlist[0]);
            if (!(probeFlags & PROBE_NOLOAD))
                matchNetDevices(devlist[0]);
        }
        devices = devlist[0];
        for (x = 0; x < numDevs ; x++) {
            devlist[x] = devices;
            devices = devices->next;
        }

        for (x=0;devlist[x];x++) {
            if (devlist[x]->type!=cl) {
                index = 0;
            }
            devlist[x]->index = index;
            cl = devlist[x]->type;
            index++;
        }
        return devlist;
    }

I wasn't sure which one should be fixed, but if the detection were done in only one of these routines, this wouldn't occur IMO.

5. State whether or not a defect in the product is suspected
* Provide Bugzilla if one already exists
anaconda, or kudzu.

6. If there is a proposed patch, make sure it is in unified diff format (diff -pruN)
N/A

Issue escalated to Support Engineering Group by: tumeya.
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by jbastian [Support Engineering Group] issue 238559
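
For readers comparing the two detection paths, the /proc/net/dev approach quoted from anaconda can be sketched in current Python as follows. This is an illustrative rewrite, not the anaconda code: it takes the file contents as a string so it can be run anywhere, it splits on the colon rather than slicing a fixed six-character column, and it omits the MAC address lookup. The function name and the sample text (including its all-zero counters) are hypothetical.

```python
def parse_proc_net_dev(text):
    """Return interface names listed in /proc/net/dev-style text.

    Mirrors the filtering in the quoted loop: the two header lines are
    skipped, and "lo" and "sit*" interfaces are ignored.
    """
    names = []
    for line in text.splitlines()[2:]:  # first two lines are headers
        if ":" not in line:
            continue  # ignore blank or malformed lines
        name = line.split(":", 1)[0].strip()
        if name == "lo" or name.startswith("sit"):
            continue
        if name not in names:
            names.append(name)
    return names


# Hypothetical sample in the /proc/net/dev layout; counters are all zero.
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth1:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  sit0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
"""

if __name__ == "__main__":
    print(parse_proc_net_dev(SAMPLE_PROC_NET_DEV))
```

Because this path lists one line per kernel network interface, both ports of a dual-port card appear regardless of how many PCI functions the card exposes.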
Created attachment 324630
patch for this issue

Here's a patch that specifically addresses the case where a PCI device has multiple ethernet devices. It at least appears to enumerate correctly on a dual-port card; it has not been tested on quad or greater.
Umeya-san,

Sorry for the delay. We confirmed that this problem was resolved with the uploaded patch.

This event sent from IssueTracker by ishimoto.sunao issue 238559
This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. If you would like this request to be reviewed for the next minor release, ask your support representative to set the next rhel-x.y flag to "?".
*** Bug 498089 has been marked as a duplicate of this bug. ***
I'm seeing similar symptoms on tg3 from one customer (see attached IT)
Building as 1.2.57.1.22-1.
Grab the bits here: http://people.redhat.com/~cward/5.5/kudzu/
@IBM, the available build RHEL5.5-Server-20100201.0-ppc-DVD.iso was tested by our RTT and seems to have no issues. Can you please try installing from this DVD and check whether it fixes your issues? This build was previously used by IBM to verify kudzu bugs; see https://bugzilla.redhat.com/show_bug.cgi?id=555188#c61 for example.
Any update from IBM?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0191.html