Description of problem:
During the network test, the console displays a broken pipe error message along with a long string of numbers.

Version-Release number of selected component (if applicable):
rhr2-rhel4-1.0.8

How reproducible:

Steps to Reproduce:
1. Run the network test portion of the redhat-ready suite.
2. During the test, the console displays a broken pipe message along with a long string of numbers.
3. When all tests have completed, the end results show that the network test failed.

Actual results:
The errors are basically of this form (from /var/log/rhr/tests/NETWORK/[0|1|2]/output.log):

+ ssh -l root -x 129.153.2.53 'mkdir ~/mnt; mount 10.1.162.24:/tmp/rhr/NETWORK/export ~/mnt; cp ~/mnt/httptest.file ~/httptest.file; umount ~/mnt;'
Connection closed by 129.153.2.53
17046953642904635492138164570399815860

or

+ ssh -l root -x 10.6.72.167 'mkdir ~/mnt; mount 10.6.73.73:/tmp/rhr/NETWORK/export ~/mnt; cp ~/mnt/httptest.file ~/httptest.file; umount ~/mnt;'
Write failed: Connection timed out
75918616847078452106380745169566081445

or

+ scp /var/www/html/httptest.file 'root.2.53:~/httptest.file'
166880170717494048586701277746167657225

Expected results:
- network test passes.

Additional info:
- Repeatedly tried on various platforms with the same test suite (rhr2-rhel4-1.0.8, with the 2.6.9-5.EL and 2.6.9-5.ELsmp kernels) => same type of failures.
- Network access to the various 'remote servers' configured for the network test in /etc/rhr/test.conf has been checked, and they are accessible.
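To collect the failures from all of the per-run logs in one pass, something like the following could be used. This is only a sketch: the function name scan_network_logs is made up for illustration, and the pattern list is taken from the messages quoted in this report, so it may not be exhaustive.

```shell
# Scan the rhr NETWORK output.log files for dropped-connection messages.
# The patterns come from the errors quoted in this report; the list is an
# assumption and may miss other failure messages.
scan_network_logs() {
    # $1: directory holding the per-run subdirectories (0, 1, 2, ...)
    log_root=${1:-/var/log/rhr/tests/NETWORK}
    # -H prints the file name and -n the line number of every match
    grep -H -n -E \
        -e 'Connection closed by ' \
        -e 'Write failed: ' \
        -e 'Read from socket failed: ' \
        -e 'lost connection' \
        -e 'apr_recv: ' \
        "$log_root"/*/output.log 2>/dev/null
}
```

Running scan_network_logs with no argument reads the default log location; the function returns non-zero when none of the patterns are found.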
The numbers are expected; they're generated checksums by the test suite. - Can you post the complete output.log file? - What type of network / network cards are being used? - Did you try different servers / rebooting the server(s)?
Created attachment 110372 [details]
output.log file

Attaching one of the output.log files from one of the tested servers.

Network cards used:
03:07.0 Ethernet controller: Intel Corp. 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
03:07.1 Ethernet controller: Intel Corp. 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
Additional test run, using a remote server on the same subnet with all static IP addresses.

Unexpected console messages seen during the network test:

audit(1106931638.166:0): avc: denied { write } for pid=3977 exe=/usr/sbin/httpd name=mibs dev=sda2 ino=1033881 scontext=root:system_r:httpd_t tcontext=system_u:object_r:usr_t tclass=dir

output.log shows:

+ service httpd start
Starting httpd: [ OK ]
+ scp /var/www/html/httptest.file 'root.162.24:~/httptest.file'
Connection closed by 10.1.162.24
lost connection
118320326186668918855431567005731752547
It looks like the two problems above may not be related. The first (audit) appears to be that httpd doesn't have permission to write to wherever "mibs" is, so setting that directory to owner=nobody would probably resolve it. (From what I know, mibs are usually printer related, so I would assume the web-based CUPS interface was in use?) For the second one, the first thing that comes to mind is to verify that the ssh keys are properly in place (see the process PDF for instructions); hopefully it is simply scp timing out while waiting for a password.
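One quick way to check the key setup, independent of the test suite, might look like this. It is a sketch, not part of rhr: the function name check_key_login is invented here, and the 10-second timeout is an arbitrary choice. BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Return 0 if root can log in to the remote test server without a password.
# BatchMode=yes makes ssh fail instead of prompting, so a missing key shows
# up as an error rather than as a hang. The timeout value is an assumption.
check_key_login() {
    # $1: remote server address, as configured for the network test
    ssh -o BatchMode=yes -o ConnectTimeout=10 -l root -x "$1" true
}
```

For example, check_key_login 10.1.162.24 || echo "key-based login is not working" would report a problem before the test even starts.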
Created attachment 110633 [details] output.log file from most recent test run
- In regard to the 'audit' console messages, I believe this is due to SELinux enforcing. Can you show me how to disable this feature while the test is running?
- Can you tell me where I can find the process PDF that you referred to? The 2 test rpms that I downloaded and used didn't include any process doc.
- Repeated test runs end up as failed regardless of whether static or dynamic IP addresses are used.
- On some failed tests, output.log shows:

Starting httpd: [ OK ]
+ scp /var/www/html/httptest.file 'root.73.73:~/httptest.file'
+ ssh -l root -x 10.6.73.73 'ab -c 128 -k -n 256 10.6.73.167/httptest.file'
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.6.73.167 (be patient)
Completed 100 requests
Total of 205 requests completed
Completed 200 requests
apr_recv: Connection reset by peer (104)
180442663894960314621829737647163614200

- A check of Red Hat Bugzilla shows that a similar apr_recv error was reported against apachebench: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=119890
- That bug is now closed, with the fix in apachebench version 2.0.48-16 and later.
- For the most recent test run, output.log shows:

+ ssh -l root -x lnx24.west.sun.com 'mkdir ~/mnt; mount 10.1.162.28:/tmp/rhr/NETWORK/export ~/mnt; cp ~/mnt/httptest.file ~/httptest.file; umount ~/mnt;'
Read from socket failed: Connection reset by peer
273621255585469362885536325619456197651

- I'm attaching the tcpdump.out and the complete output.log files here for your reference.
Created attachment 110634 [details] tcpdump.out
The ab problem is fixed in the latest errata available over Red Hat Network.

On lnx24.west.sun.com, can you umount ~/mnt directly (not over ssh)?

The process PDF is available from the hwcert web site after you log in: http://bugzilla.redhat.com/hwcert/ (documentation link in the navbar on the left).

SELinux can be switched out of enforcing mode by typing setenforce 0 as root. This will, however, leave your contexts in an inconsistent state. We have not had reports of problems with SELinux here, so unless it poses a problem in getting the tests to pass, we do not recommend this action.
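If the workaround does turn out to be needed, it can at least be limited to the duration of a single command. The wrapper below is a sketch under that assumption; run_permissive is a made-up name, it must be run as root, and putting enforcing mode back afterwards does not repair any contexts left inconsistent in the meantime.

```shell
# Run one command with SELinux in permissive mode, then restore enforcing.
# Only switches modes if the system was enforcing to begin with.
run_permissive() {
    previous=$(getenforce)          # prints Enforcing, Permissive or Disabled
    if [ "$previous" = "Enforcing" ]; then
        setenforce 0 || return 1    # needs root
    fi
    "$@"                            # the command to run, e.g. the test itself
    status=$?
    if [ "$previous" = "Enforcing" ]; then
        setenforce 1
    fi
    return $status
}
```

The wrapper returns the exit status of the wrapped command, so a failing test still shows up as a failure.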
There seem to be at least two problems here:

1) Why are we hitting SELinux issues? We performed a standard kickstart install of Everything, with serial console and no customization of anything. (We have been doing 'echo 0 > /selinux/enforce' as a workaround.)

2) The umount works locally. There seem to be a variety of error messages when ssh drops the connection. Here's another one I just ran on different hardware:

+ net_cleanup
+ udp_cleanup
+ ssh -l root -x 10.10.0.10 '[ -d ~/mnt ] && umount ~/mnt'
Connection closed by 10.10.0.10
318471953684363617670584311423616810998

Perhaps we should review exactly where the rhel4 cd images and the certification rpm are copied from? It seems like a basic mismatch of some sort.
I've downloaded and used the latest rhr2 1.0-14. The result is still the same: it fails on ssh with the connection closed, as in previous runs.
Just to clarify: is it failing on the NFS part of the network test? If so, have you tried rebooting the NFS server (perhaps it's a stale NFS handle?) Otherwise, if it's failing on ab, can you try adding "-v 4" to the ab test options in tests/network/tcp ?
I switched test machines and routers to ensure that the Foundry 100mbit switches weren't at fault. It's failing on ab every time, after only 200 requests. The load appears very light on both machines. I'm now using two v20zs with a Cisco WS-C3750G-24T. The "-v 4" option is creating a ton of debug output, which I will attach shortly. Concerning the SELinux failure, should we open a separate bug to get the test to work when it is enabled (everything installed)? It looks like a minor modification to the http setup is needed.
The -v4 option is intended to make it so that the ssh connection doesn't time out. Yes, please open a separate bug on the selinux. What happens when you run the ab test manually, without the ssh setup in the tests?
Running the test manually worked, e.g. running this command on the SUT to the remote client:

ssh -l root -x 192.168.13.30 'ab -c 30 -k -n 2000 192.168.13.21/httptest.file'

Note however that a kernel panic was triggered during an nfs unmount after trying to manually clean up a previous test run (see 149557).
The '-v 4' option has slowed the ab test down so that it has now run for about 4-5 hours. It isn't failing (yet), but a gigantic amount of debug output is being generated, e.g.

368 -rw-r--r-- 1 root root 365565 Feb 23 10:55 output.log

tail output.log
...
LOG: Response code = 200
LOG: header received:
HTTP/1.1 200 OK
Date: Wed, 23 Feb 2005 18:53:56 GMT
Server: Apache/2.0.52 (Red Hat)
Last-Modified: Wed, 23 Feb 2005 15:37:18 GMT
ETag: "1fc4b4-7d00000-3f0c978eca780"
Accept-Ranges: bytes
Content-Length: 131072000
Connection: close
Content-Type: text/plain; charset=UTF-8
Here's a diff of netstat -st from a minute or so apart. It looks like a large number (700+) of TCPTimeouts.

2c2
< 1941 active connections openings
---
> 1946 active connections openings
6,9c6,9
< 30 connections established
< 169668309 segments received
< 136726755 segments send out
< 742 segments retransmited
---
> 31 connections established
> 170453487 segments received
> 137360963 segments send out
> 743 segments retransmited
17c17
< 994042 delayed acks sent
---
> 998539 delayed acks sent
22,24c22,24
< 90932179 packets header predicted
< TCPPureAcks: 3604
< TCPHPAcks: 54024
---
> 91350570 packets header predicted
> TCPPureAcks: 3614
> TCPHPAcks: 54026
35c35
< TCPLossUndo: 635
---
> TCPLossUndo: 636
44c44
< TCPTimeouts: 700
---
> TCPTimeouts: 701

No physical errors are being reported by the driver or the Cisco switch.

netstat -i
Kernel Interface table
Iface   MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1   1500   0 174110685      0      0      0 139973757      0      0      0 BMRU
lo    16436   0        50      0      0      0        50      0      0      0 LRU
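To watch a single counter rather than reading the whole diff, two saved snapshots can be compared directly. This is a sketch; counter_delta and the snapshot file names are made up for illustration, and it only handles the "Name: value" style counters such as TCPTimeouts.

```shell
# Print how much a named counter grew between two saved `netstat -st`
# snapshots. Only handles lines of the form "Name: value".
counter_delta() {
    # $1: counter name (e.g. TCPTimeouts), $2: older snapshot, $3: newer one
    before=$(awk -v name="$1:" '$1 == name { print $2 }' "$2")
    after=$(awk -v name="$1:" '$1 == name { print $2 }' "$3")
    # Give up if the counter is missing from either snapshot
    [ -n "$before" ] && [ -n "$after" ] || return 1
    echo $((after - before))
}
```

Usage would be along the lines of: netstat -st > before.txt; sleep 60; netstat -st > after.txt; counter_delta TCPTimeouts before.txt after.txt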
There are a couple of issues mentioned for the failing NETWORK test in this ticket. Can you confirm the following:

1. NFS failures. These don't seem to be a problem any more (the kernel panic on umount excepted), given your comment #13 that it is failing on ab every time.

2. ab failures. Manually running it via ssh passes, and there are TCP timeouts. The -v4 option slows down the test, but does it allow the test to pass?
I'm not seeing NFS failures any more. I will try the -v 4 option again today. The earlier run had not completed after 6 hours, and the system was inadvertently shut down. Note that the rhr NETWORK test cannot pass, as I understand it, because the ab debug output in output.log will be flagged as an error; I assume that by "pass" you mean no connection reset messages in output.log.
The test passed twice with the -v 4 option.
We'll accept NETWORK results with the -v4 option.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-419.html