322661 – NETWORK issue from Egenera pBlade

Bug 322661 - NETWORK issue from Egenera pBlade

Summary: NETWORK issue from Egenera pBlade

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Hardware Certification Program
Classification:	Retired
Component:	Test Suite (tests)
Sub Component:
Version:	5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Greg Nichols
QA Contact:
Docs Contact:
URL:	https://hardware.redhat.com/show.cgi?...
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-10-08 02:55 UTC by Xu Bo
Modified:	2008-07-16 21:58 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-12-17 15:49:03 UTC
Embargoed:

Attachments	(Terms of Use)

Description Xu Bo 2007-10-08 02:55:52 UTC

Description of problem:
When the network speed value reported by ethtool is something unexpected, rather
than using the default value of 100Mb/s the python script can bomb out with a
divide by zero error. 

Version-Release number of selected component (if applicable):

hts-5.0-48

How reproducible:

Every time

Steps to Reproduce:
1. Install hts 
2. Run network portion of hts on a system that reports bad ethtool data
3. Network.py errors out.
  
Actual results:

Python error:
copy from  /tmp/tmpjNWxlYnfsdir/172.30.192.159/6 to /var/www/html/httptest.file
copy from  /tmp/tmpjNWxlYnfsdir/172.30.192.159/7 to /var/www/html/httptest.file
copy from  /tmp/tmpjNWxlYnfsdir/172.30.192.159/8 to /var/www/html/httptest.file
copy from  /tmp/tmpjNWxlYnfsdir/172.30.192.159/9 to /var/www/html/httptest.file
Traceback (most recent call last):
  File "./network.py", line 432, in ?
    returnValue = networkTest.do(sys.argv)
  File "/usr/lib/python2.4/site-packages/hts/test.py", line 225, in do
    return self.run()
  File "./network.py", line 412, in run
    returnValue = self.nfsTest()
  File "./network.py", line 372, in nfsTest
    print "%u mbit received in %u sec ( %e mbit/s)" % (mbit, rxtime, mbit/rxtime
)
ZeroDivisionError: float division
...finished running ./network.py, exit code=1

Expected results:

Completed test with either a PASS or FAIL result.

Additional info:
Bad ethtool output that's generated by this system:
[root@REDHAT-HTS-RHEL5-1 network]# ethtool eth0
Settings for eth0:
        Supported ports: [ ]
        Supported link modes:   
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: Unknown! (0)
        Duplex: Half
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x08061924 (134617380)
        Link detected: yes

Comment 1 Xu Bo 2007-10-08 02:58:32 UTC

Forcibly setting the interface speed to 100 will work around the problem, so it
looks like all we need is better error handling (and then working with the
vendor to correct the real problem).

    def getInterfaceSpeed(self):       
        self.interfaceSpeed = 100
        return
        #skip the rest of this, trying to work around bad
        #data returning from ethtool
        for interfaceString in (self.interface, "p%s" % self.interface):
            ethtoolCommand = "ethtool %s | fgrep \"Speed\"" % interfaceString
            pipe = os.popen(ethtoolCommand)            line = pipe.readline()
            pipe.close()
            if line:
                pattern = re.compile("\d+")
                match = pattern.search(line)
                if match:
                    self.interfaceSpeed = string.atoi(match.group())
                    return
        # otherwise
        self.interfaceSpeed = 100
        print "interface speed is %u" % self.interfaceSpeed

Comment 2 Greg Nichols 2007-10-08 13:42:41 UTC

Fixed in R4

Comment 3 Gary Case 2007-10-09 17:57:30 UTC

This is the patch I used to get around the speed detection:

 diff -Naurp network.py.orig network.py
--- network.py.orig     2007-10-05 18:21:52.000000000 -0400
+++ network.py  2007-10-05 18:23:58.000000000 -0400
@@ -193,6 +193,10 @@ class NetworkTest(Test):        
         return returnValue == 0
     
     def getInterfaceSpeed(self):       
+        self.interfaceSpeed = 100
+        return
+        #skip the rest of this, trying to work around bad 
+        #data returning from ethtool
         for interfaceString in (self.interface, "p%s" % self.interface):
             ethtoolCommand = "ethtool %s | fgrep \"Speed\"" % interfaceString
             pipe = os.popen(ethtoolCommand)

Comment 4 Gary Case 2007-10-09 21:43:43 UTC

Hm...

There's something else going on here, but I don't know what it is yet. After
going through the ethtool output, I found that all Xen dom0s I tested (i686,
x86_64, ia64 and this Egenera i686 blade) output link status and nothing else.
Link speed is never displayed. This means my original hypothesis was wrong.

That being the case, you'd think that the forced self.interfaceSpeed = 100 line
wouldn't be necessary, as no system reports speed correctly in the scripts. But
on these blades, if you use the unmodified script, the dd command that creates
/var/www/html/httptest.file never stops, eventually filling up the entire hard
drive. So, I put the patch back in and that portion of the test runs.

Shortly after that portion of the test finishes, the system begins the NFS test
and immediately displays errors that are not present when run with the PAE kernel:

nfs: server 172.30.192.193 not responding, still trying

This repeats over and over until the test is killed. After some trial and error,
I determined that the problem is caused by the mount protocol. If you use UDP,
mounts hang after issuing a simple 'ls' command. If you switch to TCP, commands
work as expected. After modifying the network.py script further, changing the
protocol on the nfsopts line to tcp:

       nfsopts="rw,intr,rsize=12288,wsize=12288,tcp"

I can obtain a successful run of the NETWORK test.

Now we need to determine what is causing dd to run out of control and why UDP
mounts are unsuccessful.

Comment 5 Greg Nichols 2007-10-09 23:38:07 UTC

A little background: The reason the network test uses NFS mounted via UDP is to
test UDP.

Comment 6 Gary Case 2007-10-10 02:08:11 UTC

Well then, I would say that it's definitely doing its job. I wonder what's so
different about Egenera's architecture that would cause UDP traffic to fail? I'm
waiting to hear more from Egenera.

Note You need to log in before you can comment on or make changes to this bug.