From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1

Description of problem:
When running the rhr NETWORK2 suite of tests with RHEL3u4 and RHEL3u6 on an Intel IA32E (EM64T) system, in both 32-bit and 64-bit modes, the lat_udp test in the lmbench package fails to generate any packets to send to the server. We used strace to determine that the test was not making any calls. We used tcpdump on both the client and the server to determine that no packets were being sent or received when the tool was run manually. Other systems do not appear to exhibit this problem. Obtaining the latest version of lmbench from SourceForge and running the test on the failing platform allowed the test to pass.

Version-Release number of selected component (if applicable):
2.0.4-2

How reproducible:
Always

Steps to Reproduce:
1. Set up 2 servers running RHEL3u4 or RHEL3u6 on an IA32E blade in either 32-bit or 64-bit mode.
2. Download and install the rhr2 lmbench suite.
3. Run /usr/lib/lmbench/bin/<platform>/lat_udp -s on the server.
4. Run /usr/lib/lmbench/bin/<platform>/lat_udp <server_ip> on the client.
5. Wait for the "recv timeout" failure message on the server.

Actual Results:
The lat_udp program failed to generate, send or receive any UDP packets when run with version 2.0.4-2.

Expected Results:
The latency of UDP traffic should have been calculated after the client and server exchanged network traffic.

Additional info:
lmbench version 3.05a has been proven to work as expected. I had a conversation with Mike Gahagan at Red Hat, who suggested I file this bug and ask for the rhr2 lmbench RPM to be updated so we can continue our certification testing. This platform's certification testing is blocked until this is resolved.
Updating to lmbench 3 is a good idea but not prudent for a single bugfix. I will investigate the source of this particular problem.
I can't reproduce this on our test machines. Can you confirm that lat_udp wasn't sending packets? What does the output of strace -o lat_udp.log /usr/lib/lmbench/bin/<platform>/lat_udp <server_ip> show? Are you sure the server is running? Does it work properly if you try using another (non-x86_64) machine as the client/server?
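For anyone checking a log produced by the strace command above, a short script can tally which system calls were recorded and whether any of them would have sent data. This is only a minimal Python sketch and is not part of lmbench or rhr2. It assumes strace's usual one-call-per-line text output, and the names `count_syscalls` and `sent_any_data` are hypothetical helpers introduced here for illustration.

```python
import re
from collections import Counter

# Matches the syscall name at the start of a line of strace text output.
# An optional "[pid N]" or bare pid prefix (strace -f) and an optional
# HH:MM:SS timestamp (strace -t) are skipped. Signal lines ("--- ... ---")
# and exit lines ("+++ ... +++") do not match and are ignored.
_SYSCALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+)?"
    r"([a-z_][a-z0-9_]*)\("
)

# Calls that would put data on a socket (assumed list, not exhaustive).
SEND_CALLS = ("send", "sendto", "sendmsg")


def count_syscalls(lines):
    """Return a Counter mapping syscall name -> number of occurrences."""
    counts = Counter()
    for line in lines:
        match = _SYSCALL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def sent_any_data(counts):
    """True if the log contains at least one send-type call."""
    return any(counts[name] > 0 for name in SEND_CALLS)
```

With the log written by `strace -o lat_udp.log`, something like `counts = count_syscalls(open("lat_udp.log"))` followed by `sent_any_data(counts)` would indicate whether the client ever attempted to send. A log with no send-type calls points at the program never reaching its send path, while a log with sends but no replies points further down the stack or at the network.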
That server is not up and running at the moment, so I don't have the strace output. I'll work on getting it. The server side was running: I didn't touch it when I moved to lmbench v3, and everything worked fine. When we ran 'strace lat_udp <ip_addr>' on the failing client, we simply saw the timer tick. tcpdump on both the client and the server showed no packets from the straced lat_udp run, but we did see the typical ARP messages from _other_ systems on the net. The included packages worked without any problems on all our other platforms:
RHEL4 32 & 64-bit, AMD & EM64T
RHEL3 32 & 64-bit, AMD
Created attachment 125131
strace log as requested.
I can verify this bug on exactly one machine with an nVIDIA network chip. Other machines with a Broadcom NIC don't show this problem. Even the same hardware with 2 network interfaces (1 Broadcom, 1 nVIDIA) shows it only with the nVIDIA NIC. In my case lat_udp does send out packets; at least, an ethereal session on the server always catches some packets, sometimes more, sometimes less. On average, every second run of lat_udp fails with "Recv timed out". I also built lmbench-3.05-a5 from source, but even there lat_udp fails on every second attempt. My test network is quite simple: there is the client, a Gigabit switch and the server, and no other machines that produce any traffic. It's practically an isolated network, but even there the test fails. On the other hand, I'm a bit puzzled why the failure of this lat_udp test causes the whole NETWORK2 test to fail. UDP is known as a protocol where data can get lost, so UDP errors are a part of life.
Re: Comment #4: That strace log is successful, so it doesn't help me much. I'd really like to see a log where the packets aren't being sent, so I can have some clue where to start looking to find out why they aren't being sent.

Re: Comment #5: I agree that the udp test should allow a certain amount of packet loss, but in our tests we experienced no packet loss at all. It seemed reasonable that (on an otherwise quiet network) two machines should be able to exchange ~7MB of UDP data without dropping packets, and our tests confirmed it, so we let it go. In addition, I find it strange that it only fails on certain hardware for you. Doesn't that indicate a hardware problem rather than a test problem?

If you are still seeing this behavior, and you are sure that this is a problem with the lat_udp test, please file another bug. Since your symptoms are different, we should track that problem separately.
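To make the packet-loss discussion concrete, here is a rough sketch of a UDP ping-pong measurement that tolerates a bounded amount of loss instead of failing on the first timeout. It is written in Python for brevity and is not how lat_udp or the NETWORK2 test is implemented. The function names (`echo_server`, `measure`, `passes`), the `drop_every` loss simulation and the 5% default threshold are all assumptions made for illustration.

```python
import socket
import time


def echo_server(sock, drop_every=0):
    """Echo datagrams back to the sender until b"quit" arrives.

    If drop_every is non-zero, every Nth datagram is silently
    discarded to simulate packet loss.
    """
    count = 0
    while True:
        data, addr = sock.recvfrom(2048)
        if data == b"quit":
            return
        count += 1
        if drop_every and count % drop_every == 0:
            continue  # simulated loss: no reply is sent
        sock.sendto(data, addr)


def measure(server_addr, attempts=20, timeout=0.5):
    """Send sequential pings; return (received, lost, mean_rtt_seconds)."""
    rtts = []
    lost = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        for seq in range(attempts):
            payload = seq.to_bytes(4, "big")
            start = time.perf_counter()
            client.sendto(payload, server_addr)
            try:
                # Skip any late reply that belongs to an earlier ping.
                while True:
                    data, _ = client.recvfrom(2048)
                    if data == payload:
                        break
            except socket.timeout:
                lost += 1  # count the loss and carry on
                continue
            rtts.append(time.perf_counter() - start)
    mean = sum(rtts) / len(rtts) if rtts else None
    return len(rtts), lost, mean


def passes(received, lost, max_loss=0.05):
    """True if the observed loss fraction is within max_loss."""
    total = received + lost
    return total > 0 and lost / total <= max_loss
```

In this sketch a timed-out exchange is recorded as a loss and the run continues, so the pass/fail decision is made on the overall loss fraction rather than on a single "recv timeout". Whether any loss at all is acceptable on a quiet, isolated network is exactly the judgment call debated above; setting `max_loss=0` reproduces the strict behavior.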
Egenera has provided an strace of a failed NETWORK2 lat_udp test. I'm attaching it to the ticket.
Created attachment 135799
new strace of lat_udp failure
rhr2 has been deprecated, so these remaining bugs are being closed as WONTFIX. Future bugs against the "hts" test suite should be opened against the "Red Hat Hardware Certification Program" product, selecting either the "Test Suite (harness)" or the "Test Suite (tests)" component.