Bug 322661

Summary: NETWORK issue from Egenera pBlade
Product: [Retired] Red Hat Hardware Certification Program
Reporter: Xu Bo <bxu>
Component: Test Suite (tests)
Assignee: Greg Nichols <gnichols>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 5
CC: gcase, ykun
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
URL: https://hardware.redhat.com/show.cgi?id=317391
Doc Type: Bug Fix
Last Closed: 2007-12-17 15:49:03 UTC

Description Xu Bo 2007-10-08 02:55:52 UTC
Description of problem:
When ethtool reports an unexpected network speed value, the Python script does
not fall back to the default of 100 Mb/s and can instead fail with a
divide-by-zero error.

Version-Release number of selected component (if applicable):

hts-5.0-48

How reproducible:

Every time

Steps to Reproduce:
1. Install hts 
2. Run network portion of hts on a system that reports bad ethtool data
3. Network.py errors out.
  
Actual results:

Python error:
copy from  /tmp/tmpjNWxlYnfsdir/172.30.192.159/6 to /var/www/html/httptest.file
copy from  /tmp/tmpjNWxlYnfsdir/172.30.192.159/7 to /var/www/html/httptest.file
copy from  /tmp/tmpjNWxlYnfsdir/172.30.192.159/8 to /var/www/html/httptest.file
copy from  /tmp/tmpjNWxlYnfsdir/172.30.192.159/9 to /var/www/html/httptest.file
Traceback (most recent call last):
  File "./network.py", line 432, in ?
    returnValue = networkTest.do(sys.argv)
  File "/usr/lib/python2.4/site-packages/hts/test.py", line 225, in do
    return self.run()
  File "./network.py", line 412, in run
    returnValue = self.nfsTest()
  File "./network.py", line 372, in nfsTest
    print "%u mbit received in %u sec ( %e mbit/s)" % (mbit, rxtime, mbit/rxtime)
ZeroDivisionError: float division
...finished running ./network.py, exit code=1
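
The traceback fails at the mbit/rxtime expression in nfsTest, which raises when rxtime is zero. A minimal sketch of a guarded version of that report follows. It is written for Python 3 rather than the Python 2.4 shown in the traceback, and the helper name format_throughput is hypothetical, not part of hts:

```python
def format_throughput(mbit, rxtime):
    """Return a throughput report line, tolerating a zero elapsed time.

    Hypothetical helper, not part of hts: it mirrors the message printed
    in nfsTest but skips the division when rxtime is zero or negative.
    """
    if rxtime <= 0:
        return "%u mbit received in %u sec (rate unavailable)" % (mbit, rxtime)
    return "%u mbit received in %u sec ( %e mbit/s)" % (mbit, rxtime, mbit / rxtime)


print(format_throughput(800.0, 0.0))
print(format_throughput(800.0, 8.0))
```

This only keeps the report from crashing; whether a zero elapsed time should count as a PASS or a FAIL is a separate decision for the test.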

Expected results:

Completed test with either a PASS or FAIL result.

Additional info:
Bad ethtool output that's generated by this system:
[root@REDHAT-HTS-RHEL5-1 network]# ethtool eth0
Settings for eth0:
        Supported ports: [ ]
        Supported link modes:   
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: Unknown! (0)
        Duplex: Half
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x08061924 (134617380)
        Link detected: yes

Comment 1 Xu Bo 2007-10-08 02:58:32 UTC
Forcibly setting the interface speed to 100 will work around the problem, so it
looks like all we need is better error handling (and then working with the
vendor to correct the real problem).

    def getInterfaceSpeed(self):       
        self.interfaceSpeed = 100
        return
        #skip the rest of this, trying to work around bad
        #data returning from ethtool
        for interfaceString in (self.interface, "p%s" % self.interface):
            ethtoolCommand = "ethtool %s | fgrep \"Speed\"" % interfaceString
            pipe = os.popen(ethtoolCommand)
            line = pipe.readline()
            pipe.close()
            if line:
                pattern = re.compile("\d+")
                match = pattern.search(line)
                if match:
                    self.interfaceSpeed = string.atoi(match.group())
                    return
        # otherwise
        self.interfaceSpeed = 100
        print "interface speed is %u" % self.interfaceSpeed
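
For the ethtool output quoted in the description ("Speed: Unknown! (0)"), the \d+ pattern matches the 0 in parentheses, so the unmodified loop would set the speed to 0 instead of reaching the default. A sketch of a parser that treats a non-positive value as missing is below. It is Python 3, and the names parse_speed and DEFAULT_SPEED are hypothetical, not taken from network.py:

```python
import re

DEFAULT_SPEED = 100  # Mb/s, the default the original script falls back to


def parse_speed(line, default=DEFAULT_SPEED):
    """Extract the link speed in Mb/s from an ethtool "Speed:" line.

    Hypothetical helper, not part of hts. Returns the default when the
    line holds no number or when the reported value is not positive.
    """
    match = re.search(r"\d+", line)
    if match:
        speed = int(match.group())
        if speed > 0:
            return speed
    return default


print(parse_speed("        Speed: Unknown! (0)"))
print(parse_speed("        Speed: 1000Mb/s"))
print(parse_speed(""))
```

With this check in place the unconditional "self.interfaceSpeed = 100" at the top of the method would not be needed for the zero-speed case.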

Comment 2 Greg Nichols 2007-10-08 13:42:41 UTC
Fixed in R4

Comment 3 Gary Case 2007-10-09 17:57:30 UTC
This is the patch I used to get around the speed detection:

 diff -Naurp network.py.orig network.py
--- network.py.orig     2007-10-05 18:21:52.000000000 -0400
+++ network.py  2007-10-05 18:23:58.000000000 -0400
@@ -193,6 +193,10 @@ class NetworkTest(Test):        
         return returnValue == 0
     
     def getInterfaceSpeed(self):       
+        self.interfaceSpeed = 100
+        return
+        #skip the rest of this, trying to work around bad 
+        #data returning from ethtool
         for interfaceString in (self.interface, "p%s" % self.interface):
             ethtoolCommand = "ethtool %s | fgrep \"Speed\"" % interfaceString
             pipe = os.popen(ethtoolCommand)


Comment 4 Gary Case 2007-10-09 21:43:43 UTC
Hm...

There's something else going on here, but I don't know what it is yet. After
going through the ethtool output, I found that all Xen dom0s I tested (i686,
x86_64, ia64 and this Egenera i686 blade) output link status and nothing else.
Link speed is never displayed. This means my original hypothesis was wrong.

That being the case, you'd think that the forced self.interfaceSpeed = 100 line
wouldn't be necessary, as no system reports speed correctly in the scripts. But
on these blades, if you use the unmodified script, the dd command that creates
/var/www/html/httptest.file never stops, eventually filling up the entire hard
drive. So, I put the patch back in and that portion of the test runs.

Shortly after that portion of the test finishes, the system begins the NFS test
and immediately displays errors that are not present when run with the PAE kernel:

nfs: server 172.30.192.193 not responding, still trying

This repeats over and over until the test is killed. After some trial and error,
I determined that the problem is caused by the mount protocol. If you use UDP,
mounts hang after issuing a simple 'ls' command. If you switch to TCP, commands
work as expected. After modifying the network.py script further, changing the
protocol on the nfsopts line to tcp:

       nfsopts="rw,intr,rsize=12288,wsize=12288,tcp"

I can obtain a successful run of the NETWORK test.
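
To compare UDP and TCP runs without hand-editing the nfsopts line each time, the option string could be built from a protocol parameter. This is a sketch only, in Python 3. The helper build_nfs_opts is hypothetical, and it assumes the unmodified nfsopts line differs from the one above only in its protocol option, which this report does not show:

```python
def build_nfs_opts(proto="udp", block_size=12288):
    """Build an NFS mount option string shaped like the nfsopts line above.

    Hypothetical helper, not part of hts. proto must be "udp" or "tcp";
    block_size is used for both rsize and wsize.
    """
    if proto not in ("udp", "tcp"):
        raise ValueError("unsupported protocol: %r" % proto)
    return "rw,intr,rsize=%d,wsize=%d,%s" % (block_size, block_size, proto)


print(build_nfs_opts("tcp"))
```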

Now we need to determine what is causing dd to run out of control and why UDP
mounts are unsuccessful.

Comment 5 Greg Nichols 2007-10-09 23:38:07 UTC
A little background: The reason the network test uses NFS mounted via UDP is to
test UDP.

Comment 6 Gary Case 2007-10-10 02:08:11 UTC
Well then, I would say that it's definitely doing its job. I wonder what's so
different about Egenera's architecture that would cause UDP traffic to fail? I'm
waiting to hear more from Egenera.