Bug 200810 - RHEL4 UP4 Beta2 Intel ethernet problem (Intel Corporation PRO/1000 EB Network Connection with I/O Acceleration)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: John W. Linville
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-07-31 19:15 UTC by Dave Olson
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-07 14:42:50 UTC
Target Upstream Version:
Embargoed:



Description Dave Olson 2006-07-31 19:15:48 UTC
Description of problem:
ssh connections fail intermittently, getting stuck in NIS lookups.
Fails even with NIS disabled, just not as often.

This is on unreleased Intel hardware, with the 2.6.9-40.ELsmp kernel

The symptom is that if you do:
   while :; do date;ssh idev-17 hostname; date; done

About one or two times out of a hundred you'll see a 45-60 second delta
between the two date commands. Running strace on sshd shows that it's
trying to get NIS info. ifconfig eth0 shows no errors, but netstat -s shows
tcp retransmits when this is happening.
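
For anyone trying to reproduce this, a variant of the loop above that reports
only the slow iterations, along with how much the kernel's TCP retransmit
counter moved during them, might look like the sketch below. The names
time_runs, retrans_segs and SNMP_FILE and the 30-second threshold are
assumptions for illustration, not part of the original report. The counter
is read from /proc/net/snmp and is system-wide, so unrelated traffic can
inflate it.

```shell
#!/bin/sh
# Sketch only: time_runs, retrans_segs and SNMP_FILE are hypothetical names,
# and the 30-second threshold used in the example is an arbitrary choice.
SNMP_FILE=${SNMP_FILE:-/proc/net/snmp}

# Print the cumulative TCP RetransSegs counter, or 0 if it cannot be read.
retrans_segs() {
    if [ ! -r "$SNMP_FILE" ]; then
        echo 0
        return
    fi
    awk '
        $1 == "Tcp:" && !col {
            # The first Tcp: line is the header; find the RetransSegs column.
            for (i = 2; i <= NF; i++) if ($i == "RetransSegs") col = i
            next
        }
        $1 == "Tcp:" && col { print $col; found = 1; exit }
        END { if (!found) print 0 }
    ' "$SNMP_FILE"
}

# Run a command COUNT times and report iterations that take THRESHOLD
# seconds or more. Usage: time_runs COUNT THRESHOLD command [args...]
# The body runs in a subshell so its variables do not leak to the caller.
time_runs() (
    count=$1
    threshold=$2
    shift 2
    i=1
    slow=0
    while [ "$i" -le "$count" ]; do
        before=$(retrans_segs)
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        end=$(date +%s)
        delta=$((end - start))
        if [ "$delta" -ge "$threshold" ]; then
            after=$(retrans_segs)
            echo "iteration $i: ${delta}s, retransmitted segments: $((after - before))"
            slow=$((slow + 1))
        fi
        i=$((i + 1))
    done
    echo "slow runs: $slow of $count"
)

# Example matching the loop in this report (not run here):
#   time_runs 100 30 ssh idev-17 hostname
```

Timing is at one-second granularity, which is coarse but enough to separate
normal logins from the 45-60 second stalls.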

The problem was not seen on RHEL4 UP3, when using the Intel-provided
updated ethernet driver (UP3's e1000 driver would not work on this
builtin ethernet).


Other types of systems on the same ethernet switch, with the same
configuration don't have problems.

The uname, cpuinfo and lspci output are as follows:

Linux idev-17 2.6.9-40.ELsmp #1 SMP Mon Jun 26 17:40:45 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux

idev-17 12:21_~.1005 cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Genuine Intel(R) CPU                  @ 2.66GHz
stepping        : 4
cpu MHz         : 2666.719
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm pni monitor
ds_cpl est tm2 cx16 xtpr
bogomips        : 5339.09
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

(repeats 3 more times, for the remainder of the dual core dual socket cpus)

idev-17 12:22_~.1006 lspci
00:00.0 Host bridge: Intel Corporation Server Memory Controller Hub (rev 92)
00:02.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 2-3 (rev 92)
00:04.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 4-5 (rev 92)
00:06.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 6-7 (rev 92)
00:08.0 System peripheral: Intel Corporation Server DMA Engine (rev 92)
00:10.0 Host bridge: Intel Corporation Server Error Reporting Registers (rev 92)
00:10.1 Host bridge: Intel Corporation Server Error Reporting Registers (rev 92)
00:10.2 Host bridge: Intel Corporation Server Error Reporting Registers (rev 92)
00:11.0 Host bridge: Intel Corporation Reserved Registers (rev 92)
00:13.0 Host bridge: Intel Corporation Reserved Registers (rev 92)
00:15.0 Host bridge: Intel Corporation Server FBD Registers (rev 92)
00:16.0 Host bridge: Intel Corporation Server FBD Registers (rev 92)
00:1c.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Root
Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #1
(rev 09)
00:1d.1 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #2
(rev 09)
00:1d.2 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #3
(rev 09)
00:1d.7 USB Controller: Intel Corporation Enterprise Southbridge EHCI USB (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation Enterprise Southbridge LPC (rev 09)
00:1f.2 IDE interface: Intel Corporation Enterprise Southbridge SATA IDE (rev 09)
00:1f.3 SMBus: Intel Corporation Enterprise Southbridge SMBus (rev 09)
01:00.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express
Upstream Port (rev 01)
01:00.3 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express to
PCI-X Bridge (rev 01)
02:00.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express
Downstream Port E1 (rev 01)
02:02.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express
Downstream Port E3 (rev 01)
03:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
03:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
04:02.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
04:02.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
06:00.0 Ethernet controller: Intel Corporation PRO/1000 EB Network Connection
with I/O Acceleration (rev 01)
06:00.1 Ethernet controller: Intel Corporation PRO/1000 EB Network Connection
with I/O Acceleration (rev 01)
09:00.0 InfiniBand: PathScale, Inc: Unknown device 0010 (rev 01)
0b:01.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)



Additional info:

Comment 1 John W. Linville 2006-08-01 18:16:10 UTC
Is the e1000 device on the NIS client?  or server?  It isn't clear to me from 
the above message.

Presuming the e1000 is on the NIS client, can you use tcpdump or 
ethereal/wireshark to ascertain whether or not the NIS requests are getting to 
the server?  You will likely need to run them on the server itself or on 
another box on the same LAN segment (i.e. on a hub on the same switch port as 
the server).
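
As a rough sketch of the kind of capture meant here: the helper below only
prints a tcpdump command line for review, which would then be run as root on
the server or on another box that can see the server's traffic. The function
name capture_cmd, the server name nis-server and the output path are
placeholders, not details from this report. It filters on the two hosts
rather than on a port, since only the portmapper sits on a fixed port (111).

```shell
#!/bin/sh
# Sketch only: capture_cmd, nis-server and the output path are placeholders.
# Prints a tcpdump command that records all traffic between two hosts.
# Usage: capture_cmd INTERFACE CLIENT SERVER OUTFILE
capture_cmd() {
    # -n: no name resolution (avoids extra lookups while capturing)
    # -s 0: capture whole packets; -w: write raw packets to a file
    echo "tcpdump -n -i $1 -s 0 -w $4 host $2 and host $3"
}

capture_cmd eth0 idev-17 nis-server /tmp/idev-17-nis.pcap
```

The resulting file can be opened later in ethereal/wireshark to check whether
the requests and the replies both appear on the wire.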

Comment 2 Dave Olson 2006-08-01 19:13:26 UTC
The system with the problem is an NIS client.   The problem shows up
even with all NIS services disabled, and all the nsswitch.conf lines
using just "files" (or for hosts, "files dns").

The server was seeing the requests in the failing case, but the client
was not seeing the server response, so far as I could tell.   I've since 
switched those clients to SLES10, so I can't easily get a tcpdump trace.
SLES10 shows a similar problem, but less frequently.

Comment 3 John W. Linville 2006-08-07 14:42:50 UTC
If you happen to either switch the boxes back to RHEL or find a new RHEL box 
that exhibits the problem, then please collect and post the info requested in 
comment 1.  Also, the output of sysreport would be most welcome. :-)

Without that info, I don't think I have enough here to pursue a solution. :-(

