Description of problem:
ssh connections fail intermittently, getting stuck in NIS lookups. Fails even with NIS disabled, just not as often. This is on unreleased Intel hardware, with the 2.6.9-40.ELsmp kernel.

The symptom is that if you do:

while :; do date; ssh idev-17 hostname; date; done

About one or two times out of a hundred you'll see 45-60 seconds delta between the two date commands. Running strace on sshd, you'll see it's trying to get NIS info. ifconfig eth0 shows no errors, netstat -s shows tcp retransmits when this is happening.

The problem was not seen on RHEL4 UP3, when using the Intel-provided updated ethernet driver (UP3's e1000 driver would not work on this builtin ethernet). Other types of systems on the same ethernet switch, with the same configuration don't have problems.

The uname, cpuinfo and lspci output are as follows:

Linux idev-17 2.6.9-40.ELsmp #1 SMP Mon Jun 26 17:40:45 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

idev-17 12:21_~.1005 cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel(R) CPU @ 2.66GHz
stepping : 4
cpu MHz : 2666.719
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm pni monitor ds_cpl est tm2 cx16 xtpr
bogomips : 5339.09
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

(repeats 3 more times, for the remainder of the dual core dual socket cpus)

idev-17 12:22_~.1006 lspci
00:00.0 Host bridge: Intel Corporation Server Memory Controller Hub (rev 92)
00:02.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 2-3 (rev 92)
00:04.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 4-5 (rev 92)
00:06.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 6-7 (rev 92)
00:08.0 System peripheral: Intel Corporation Server DMA Engine (rev 92)
00:10.0 Host bridge: Intel Corporation Server Error Reporting Registers (rev 92)
00:10.1 Host bridge: Intel Corporation Server Error Reporting Registers (rev 92)
00:10.2 Host bridge: Intel Corporation Server Error Reporting Registers (rev 92)
00:11.0 Host bridge: Intel Corporation Reserved Registers (rev 92)
00:13.0 Host bridge: Intel Corporation Reserved Registers (rev 92)
00:15.0 Host bridge: Intel Corporation Server FBD Registers (rev 92)
00:16.0 Host bridge: Intel Corporation Server FBD Registers (rev 92)
00:1c.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation Enterprise Southbridge EHCI USB (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation Enterprise Southbridge LPC (rev 09)
00:1f.2 IDE interface: Intel Corporation Enterprise Southbridge SATA IDE (rev 09)
00:1f.3 SMBus: Intel Corporation Enterprise Southbridge SMBus (rev 09)
01:00.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Upstream Port (rev 01)
01:00.3 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express to PCI-X Bridge (rev 01)
02:00.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Downstream Port E1 (rev 01)
02:02.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Downstream Port E3 (rev 01)
03:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
03:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
04:02.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
04:02.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
06:00.0 Ethernet controller: Intel Corporation PRO/1000 EB Network Connection with I/O Acceleration (rev 01)
06:00.1 Ethernet controller: Intel Corporation PRO/1000 EB Network Connection with I/O Acceleration (rev 01)
09:00.0 InfiniBand: PathScale, Inc: Unknown device 0010 (rev 01)
0b:01.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)

Additional info:
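One way to quantify the failure rate from the loop in the description is to time each iteration and report only the slow ones. The following is a minimal sketch, not something the reporter ran; the function name slow_runs and the 30-second threshold in the usage line are assumptions chosen for illustration.

```shell
# Hypothetical helper (not from the original report): run a command
# repeatedly and print only the iterations that take at least
# THRESHOLD seconds, followed by a one-line summary.
# Usage: slow_runs COUNT THRESHOLD_SECONDS COMMAND [ARGS...]
slow_runs() {
    sr_count=$1
    sr_threshold=$2
    shift 2
    sr_i=1
    sr_slow=0
    while [ "$sr_i" -le "$sr_count" ]; do
        sr_start=$(date +%s)
        # Discard the command's output and keep it from reading our stdin.
        "$@" >/dev/null 2>&1 </dev/null
        sr_end=$(date +%s)
        sr_elapsed=$((sr_end - sr_start))
        if [ "$sr_elapsed" -ge "$sr_threshold" ]; then
            sr_slow=$((sr_slow + 1))
            echo "iteration $sr_i took ${sr_elapsed}s"
        fi
        sr_i=$((sr_i + 1))
    done
    echo "slow=$sr_slow total=$sr_count"
}

# Example (assumed threshold): slow_runs 100 30 ssh idev-17 hostname
```

Timing uses whole seconds from date +%s, which is coarse but sufficient for delays in the 45-60 second range described above.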
Is the e1000 device on the NIS client or on the server? It isn't clear to me from the above message. Presuming the e1000 is on the NIS client, can you use tcpdump or ethereal/wireshark to ascertain whether or not the NIS requests are getting to the server? You will likely need to run them on the server itself or on another box on the same LAN segment (i.e. on a hub on the same switch port as the server).
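As a rough illustration of the capture requested above, the sketch below builds a tcpdump command line that records all traffic to and from one peer into a file for later inspection in ethereal/wireshark. The helper name capture_cmd, the host name nis-server and the output path are hypothetical; the command is only printed, so it can be reviewed before being run as root.

```shell
# Hypothetical helper: print a tcpdump command that captures full
# packets exchanged with one host and writes them to a pcap file.
# Usage: capture_cmd INTERFACE PEER_HOST OUTPUT_FILE
capture_cmd() {
    # -n   : do not resolve addresses to names
    # -s 0 : capture whole packets rather than a truncated snapshot
    # -w   : write raw packets to a file instead of printing them
    printf 'tcpdump -i %s -n -s 0 -w %s host %s\n' "$1" "$3" "$2"
}

# Example (assumed names): capture_cmd eth0 nis-server /tmp/nis.pcap
```

Filtering on the peer host rather than on a port keeps the portmapper and ypserv traffic in the same trace, since ypserv does not use a fixed port by default.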
The system with the problem is an NIS client. The problem shows up even with all NIS services disabled, and all the nsswitch.conf lines using just "files" (or for hosts, "files dns"). The server was seeing the requests in the failing case, but the client was not seeing the server response, so far as I could tell. I've since switched those clients to SLES10, so I can't easily get a tcpdump trace. SLES10 shows a similar problem, but less frequently.
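Where a packet trace is hard to obtain, the TCP retransmits mentioned in the description can still be tracked by sampling the kernel's counters around a slow run. This is a minimal sketch, assuming the client exposes /proc/net/snmp; the helper name retrans_segs is hypothetical.

```shell
# Hypothetical helper: print the cumulative TCP RetransSegs counter.
# Reads /proc/net/snmp by default; another file in the same format
# can be passed as the first argument.
retrans_segs() {
    awk '
        # First "Tcp:" line is the header: find the RetransSegs column.
        $1 == "Tcp:" && !col {
            for (i = 2; i <= NF; i++) if ($i == "RetransSegs") col = i
            next
        }
        # Second "Tcp:" line holds the values: print that column.
        $1 == "Tcp:" && col { print $col; exit }
    ' "${1:-/proc/net/snmp}"
}

# Example: count retransmitted segments during one connection attempt.
#   before=$(retrans_segs); ssh idev-17 hostname; after=$(retrans_segs)
#   echo "retransmits: $((after - before))"
```

The counter is system-wide, so the delta is only meaningful on an otherwise quiet client.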
If you happen to either switch the boxes back to RHEL or find a new RHEL box that exhibits the problem, then please collect and post the info requested in comment 1. Also, the output of sysreport would be most welcome. :-) Without that info, I don't think I have enough here to pursue a solution. :-(