146510 – broken pipe error during network test

Summary: broken pipe error during network test

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ready Certification Tests
Classification:	Retired
Component:	net
Sub Component:
Version:	2
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Will Woods
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-01-28 22:39 UTC by erik nguyen
Modified:	2007-04-18 17:18 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-05-11 16:39:34 UTC
Embargoed:

Attachments
output.log file (2.73 KB, text/plain) 2005-01-28 22:54 UTC, erik nguyen
output.log file from most recent test run (2.79 KB, text/plain) 2005-02-04 00:48 UTC, erik nguyen
tcpdump.out (100 bytes, text/plain) 2005-02-04 00:50 UTC, erik nguyen

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2005:419	0	normal	SHIPPED_LIVE	Hardware Certification Suite bug fix update	2005-05-11 04:00:00 UTC

Description erik nguyen 2005-01-28 22:39:39 UTC

Description of problem:
during the network test, the console displays a broken pipe error message
followed by a string of numbers.

Version-Release number of selected component (if applicable):
rhr2-rhel4-1.0.8

How reproducible:


Steps to Reproduce:
1. run the network test portion of the redhat-ready suite
2. during the test, the console displays a broken pipe message along with a
long string of numbers.
3. when all tests have completed, the end results show that the network test failed.

  
Actual results:
the errors are basically in this form (from the
/var/log/rhr/tests/NETWORK/[0|1|2]/output.log )

+ ssh -l root -x 129.153.2.53 'mkdir ~/mnt; mount
10.1.162.24:/tmp/rhr/NETWORK/export ~/mnt; cp ~/mnt/httptest.file
~/httptest.file; umount ~/mnt;'
Connection closed by 129.153.2.53
17046953642904635492138164570399815860

or

+ ssh -l root -x 10.6.72.167 'mkdir ~/mnt; mount
10.6.73.73:/tmp/rhr/NETWORK/export ~/mnt; cp ~/mnt/httptest.file
~/httptest.file; umount ~/mnt;'
Write failed: Connection timed out
75918616847078452106380745169566081445

or

+ scp /var/www/html/httptest.file 'root@129.153.2.53:~/httptest.file'
166880170717494048586701277746167657225

Expected results:

- network test passes.

Additional info:

- repeatedly tried on various platforms with the same test suite
(rhr2-rhel4-1.0.8 and 2.6.9-5.EL | 2.6.9-5.ELsmp kernels) => same type
of failures.
- network access to the various 'remote servers' configured for the network
test in /etc/rhr/test.conf has been checked; they are all accessible.

Comment 1 Richard Li 2005-01-28 22:44:06 UTC

The numbers are expected; they are checksums generated by the test suite.

- Can you post the complete output.log file?
- What type of network / network cards are being used?
- Did you try different servers / rebooting the server(s)?

Comment 2 erik nguyen 2005-01-28 22:54:07 UTC

Created attachment 110372 [details]
output.log file

attaching 1 of the output.log files from 1 of the tested servers

network cards used: 

03:07.0 Ethernet controller: Intel Corp. 82546EB Gigabit Ethernet Controller
(Copper) (rev 01)
03:07.1 Ethernet controller: Intel Corp. 82546EB Gigabit Ethernet Controller
(Copper) (rev 01)

Comment 3 erik nguyen 2005-01-29 00:15:59 UTC

additional test run - using a remote server on the same subnet with
all static ip addr

seeing unexpected console messages during network test: 

audit(1106931638.166:0): avc:  denied  { write } for  pid=3977
exe=/usr/sbin/httpd name=mibs dev=sda2 ino=1033881
scontext=root:system_r:httpd_t tcontext=system_u:object_r:usr_t tclass=dir

output.log shows:

+ service httpd start
Starting httpd:                                            [  OK  ]
+ scp /var/www/html/httptest.file 'root@10.1.162.24:~/httptest.file'
Connection closed by 10.1.162.24
lost connection
118320326186668918855431567005731752547

Comment 4 Rob Landry 2005-01-31 13:37:36 UTC

It looks like the two problems above may not be related. The first
(audit) is probably httpd lacking permission to write to wherever
"mibs" is, so setting that directory to owner=nobody would probably
resolve it. (From what I know, mibs are usually printer related, so
I assume the web-based CUPS interface was in use?)

For the second one, the first thing that comes to mind is to verify
that the keys are properly in place (see process pdf for
instructions), and hopefully that is simply the scp timing out while
waiting for a password.
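
One way to tell a missing key apart from a network problem is to attempt a
login that is not allowed to prompt. The following is only a minimal sketch,
assuming the OpenSSH client; the check_key_login name and the SSH_BIN override
are made up for illustration and are not part of the rhr suite.

```shell
# Hypothetical helper, not part of the rhr test suite.
# Returns 0 if root can log in to host "$1" with a key and no prompt.
check_key_login() {
    # BatchMode=yes: never prompt for a password or passphrase; fail instead.
    # ConnectTimeout=10: give up after 10 seconds if the host is unreachable.
    # SSH_BIN is an illustrative override so the client can be swapped out.
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=10 \
        -l root -x "$1" true
}

# Example: check_key_login 10.1.162.24 || echo "key login failed"
```

If this fails while an interactive ssh to the same host works, the keys are
the likely culprit rather than the network.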

Comment 5 erik nguyen 2005-02-04 00:48:45 UTC

Created attachment 110633 [details]
output.log file from most recent test run

Comment 6 erik nguyen 2005-02-04 00:49:43 UTC

- in regard to the 'audit' console messages, i believe this is due to
selinux being in enforcing mode. can you show me how to disable this feature
while the test is running?

- can you tell me where i can find the process pdf that you referred
to? the 2 test rpms that i downloaded and used didn't include any process
doc.

- repeated test runs end up failing regardless of whether static or
dynamic ip addresses are used.

- on some failed test, i saw the output.log shows 

Starting httpd: [  OK  ]
+ scp /var/www/html/httptest.file 'root@10.6.73.73:~/httptest.file'
+ ssh -l root -x 10.6.73.73 'ab -c 128 -k -n 256
10.6.73.167/httptest.file'
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd,
http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation,
http://www.apache.org/

Benchmarking 10.6.73.167 (be patient)
Completed 100 requests
Total of 205 requests completed
Completed 200 requests
apr_recv: Connection reset by peer (104)
180442663894960314621829737647163614200

- a bugzilla search on redhat shows that a similar apr_recv error was
reported against apachebench:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=119890 - this bug
is now closed because it was fixed in apachebench version 2.0.48-16 and later

- the most recent test run, its output.log shows:

+ ssh -l root -x lnx24.west.sun.com 'mkdir ~/mnt; mount
10.1.162.28:/tmp/rhr/NETWORK/export ~/mnt; cp ~/mnt/httptest.file
~/httptest.file; umount ~/mnt;'
Read from socket failed: Connection reset by peer
273621255585469362885536325619456197651

- i'm attaching the tcpdump.out and the complete output.log files in
here for your reference

Comment 7 erik nguyen 2005-02-04 00:50:37 UTC

Created attachment 110634 [details]
tcpdump.out

Comment 8 Richard Li 2005-02-04 14:20:56 UTC

The ab problem is fixed in the latest errata available over Red Hat Network.

On lnx24.west.sun.com, can you umount ~/mnt directly (not over ssh)?

The process PDF is available from the hwcert web site after you log in:
http://bugzilla.redhat.com/hwcert/ (documentation link in the navbar on the left).

SELinux can be disabled by typing setenforce 0 as root. This will, however,
leave your contexts in an inconsistent state. We have not had reports of
problems with SELinux here, so unless it poses a problem in getting the tests to
pass, we do not recommend this action.
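
If permissive mode is only wanted for the duration of a run, the previous mode
can be put back afterwards. This is a minimal sketch, assuming the getenforce
and setenforce utilities are installed and the shell is running as root; the
with_permissive name is hypothetical and not part of the test suite, and the
sketch does nothing about the inconsistent contexts mentioned above.

```shell
# Hypothetical wrapper, not part of the rhr test suite. Run as root.
# Runs the given command with SELinux in permissive mode, then restores
# enforcing mode if that is what was active beforehand.
with_permissive() {
    prev=$(getenforce)            # prints Enforcing, Permissive or Disabled
    if [ "$prev" = "Enforcing" ]; then
        setenforce 0              # switch to permissive
    fi
    "$@"                          # run the wrapped command
    rc=$?
    if [ "$prev" = "Enforcing" ]; then
        setenforce 1              # back to enforcing
    fi
    return "$rc"                  # pass the command's exit status through
}

# Example: with_permissive service httpd start
```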

Comment 9 Christopher P Johnson 2005-02-07 18:47:02 UTC

There seem to be two problems at least here:

1) Why are we hitting SELinux issues? We performed a standard kickstart install
of Everything, with serial console. No customization of anything.

(We have been doing 'echo 0 > /selinux/enforce' as a workaround.)

2) The umount works locally - there seem to be a variety of error messages when
ssh drops the connection. Here's another one I just ran on different hw:

+ net_cleanup
 + udp_cleanup
 + ssh -l root -x 10.10.0.10 '[ -d ~/mnt ] && umount ~/mnt'
Connection closed by 10.10.0.10
318471953684363617670584311423616810998

Perhaps we should review exactly where the rhel4 cd images, and certification
rpm, are copied from? It seems like a basic mismatch of some sort.

Comment 11 erik nguyen 2005-02-15 21:11:37 UTC

i've downloaded and used the latest rhr2 1.0-14 - the result is still
the same - it fails on ssh with the connection closed, as in previous runs

Comment 12 Richard Li 2005-02-18 18:51:54 UTC

Just to clarify: is it failing on the NFS part of the network test? If so, have
you tried rebooting the NFS server (perhaps it's a stale NFS handle?)


Otherwise, if it's failing on ab, can you try adding "-v 4" to the ab test
options  in tests/network/tcp ?

Comment 13 Christopher P Johnson 2005-02-23 22:09:15 UTC

I switched test machines and routers to ensure that the Foundry
100mbit switches weren't at fault. It's failing on ab every time,
after only 200 requests. The load appears very light on both
machines. I'm now using two v20zs with a Cisco WS-C3750G-24T. The "-v 4"
is creating a ton of debug output, which I will attach shortly.

Concerning the selinux failure, should we open a separate bug to get
the test to work when it is enabled (everything installed)? Looks like
a minor modification to the http setup is needed.

Comment 14 Richard Li 2005-02-23 22:18:44 UTC

The -v4 option is intended to make it so that the ssh connection doesn't time out. 

Yes, please open a separate bug on the selinux.

What happens when you run the ab test manually, without the ssh setup in the tests?

Comment 15 Christopher P Johnson 2005-02-23 23:07:10 UTC

Running the test manually worked, e.g. running this command on the SUT
to the remote client:

ssh -l root -x 192.168.13.30 'ab -c 30 -k -n 2000
192.168.13.21/httptest.file'

Note however that a kernel panic was triggered during an nfs unmount
after trying to manually clean up a previous test run (see 149557).

Comment 16 Christopher P Johnson 2005-02-24 02:57:27 UTC

The '-v 4' option has slowed the ab test down so that it has now run
for about 4-5 hours. It isn't failing (yet), but a gigantic amount
of debug output is being generated, e.g. 

368 -rw-r--r--  1 root root 365565 Feb 23 10:55 output.log

tail output.log
...
LOG: Response code = 200
LOG: header received:
HTTP/1.1 200 OK
Date: Wed, 23 Feb 2005 18:53:56 GMT
Server: Apache/2.0.52 (Red Hat)
Last-Modified: Wed, 23 Feb 2005 15:37:18 GMT
ETag: "1fc4b4-7d00000-3f0c978eca780"
Accept-Ranges: bytes
Content-Length: 131072000
Connection: close
Content-Type: text/plain; charset=UTF-8

Comment 17 Christopher P Johnson 2005-02-24 05:53:11 UTC

Here's a diff of netstat -st from a minute or so apart - looks
like a large number (700+) of TCPTimeouts.
2c2
<     1941 active connections openings
---
>     1946 active connections openings
6,9c6,9
<     30 connections established
<     169668309 segments received
<     136726755 segments send out
<     742 segments retransmited
---
>     31 connections established
>     170453487 segments received
>     137360963 segments send out
>     743 segments retransmited
17c17
<     994042 delayed acks sent
---
>     998539 delayed acks sent
22,24c22,24
<     90932179 packets header predicted
<     TCPPureAcks: 3604
<     TCPHPAcks: 54024
---
>     91350570 packets header predicted
>     TCPPureAcks: 3614
>     TCPHPAcks: 54026
35c35
<     TCPLossUndo: 635
---
>     TCPLossUndo: 636
44c44
<     TCPTimeouts: 700
---
>     TCPTimeouts: 701




No physical errors being reported by the driver or the cisco switch.

netstat -i
Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       1500   0 174110685      0      0      0 139973757      0      0      0 BMRU
lo        16436   0       50      0      0      0       50      0      0      0 LRU
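
The netstat -st comparison earlier in this comment can also be reduced to one
line per changed counter. A minimal sketch, assuming each snapshot was saved
to a file (for example with netstat -st > before.txt); the counter_delta name
and the file names are illustrative only. It treats the first all-digit field
on a line as the value and the rest of the line as the counter name, which
fits the lines shown above but is not guaranteed for every netstat line.

```shell
# Hypothetical helper: counter_delta BEFORE_FILE AFTER_FILE
# Prints "name before -> after (delta)" for each counter that changed.
counter_delta() {
    awk '
        # Split a line into its first all-digit field (num) and the rest (key).
        function parse(    i, key) {
            num = ""
            key = ""
            for (i = 1; i <= NF; i++) {
                if (num == "" && $i ~ /^[0-9]+$/)
                    num = $i
                else
                    key = key (key == "" ? "" : " ") $i
            }
            return key
        }
        # First file: remember every counter.
        FNR == NR { k = parse(); if (num != "") before[k] = num; next }
        # Second file: report counters whose value changed.
        {
            k = parse()
            if (num != "" && (k in before) && before[k] != num)
                printf "%s %d -> %d (%+d)\n", k, before[k], num, num - before[k]
        }
    ' "$1" "$2"
}

# Example: counter_delta before.txt after.txt
```

Run against the two snapshots above, this would show TCPTimeouts and the
retransmit counter each moving by one over the interval.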

Comment 18 Richard Li 2005-02-24 16:48:17 UTC

There are a couple issues mentioned in the failing NETWORK test in this ticket.
Can you confirm the following:

1. NFS failures. These don't seem to be a problem any more (the kernel panic on
umount excepted), given your comment #13 that it is "failing on ab every time".
2. ab failures. Manually running it via ssh passes, and there are TCP timeouts.
The -v4 option slows down the test, but does it allow the test to pass?

Comment 19 Christopher P Johnson 2005-02-24 17:19:52 UTC

I'm not seeing nfs failures any more. I will try the -v 4 option again today.
It had not completed after 6 hours, and the system was inadvertently shut down.

Note that the rhr NETWORK test cannot pass, as I understand it, because the ab
debug output in the output.log will be flagged as an error; I assume you mean
no connection reset messages in output.log.
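
Checking output.log by hand for dropped connections could look like the
following. This is only a sketch: the count_conn_errors name is made up, and
the pattern covers just the messages quoted earlier in this report
(Connection reset by peer, Connection closed by, Connection timed out, lost
connection), not everything the suite itself may flag.

```shell
# Hypothetical helper: count_conn_errors LOGFILE
# Prints the number of lines that look like a dropped connection.
count_conn_errors() {
    # -c prints the number of matching lines; "|| true" keeps the exit
    # status at 0 when grep finds nothing (it still prints 0).
    grep -cE 'Connection (reset by peer|closed by|timed out)|lost connection' "$1" || true
}

# Example: count_conn_errors /var/log/rhr/tests/NETWORK/0/output.log
```

A count of 0 on a run with -v 4 would mean the debug output is the only thing
standing between the log and a pass.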

Comment 20 Christopher P Johnson 2005-02-25 02:52:52 UTC

The test passed twice with the -v 4 option.

Comment 21 Richard Li 2005-02-25 04:42:07 UTC

We'll accept NETWORK results with the -v4 option.

Comment 22 Richard Li 2005-05-11 16:39:34 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-419.html
