Description of problem: When the network speed value reported by ethtool is something unexpected, rather than using the default value of 100Mb/s the python script can bomb out with a divide by zero error. Version-Release number of selected component (if applicable): hts-5.0-48 How reproducible: Every time Steps to Reproduce: 1. Install hts 2. Run network portion of hts on a system that reports bad ethtool data 3. Network.py errors out. Actual results: Python error: copy from /tmp/tmpjNWxlYnfsdir/172.30.192.159/6 to /var/www/html/httptest.file copy from /tmp/tmpjNWxlYnfsdir/172.30.192.159/7 to /var/www/html/httptest.file copy from /tmp/tmpjNWxlYnfsdir/172.30.192.159/8 to /var/www/html/httptest.file copy from /tmp/tmpjNWxlYnfsdir/172.30.192.159/9 to /var/www/html/httptest.file Traceback (most recent call last): File "./network.py", line 432, in ? returnValue = networkTest.do(sys.argv) File "/usr/lib/python2.4/site-packages/hts/test.py", line 225, in do return self.run() File "./network.py", line 412, in run returnValue = self.nfsTest() File "./network.py", line 372, in nfsTest print "%u mbit received in %u sec ( %e mbit/s)" % (mbit, rxtime, mbit/rxtime ) ZeroDivisionError: float division ...finished running ./network.py, exit code=1 Expected results: Completed test with either a PASS or FAIL result. Additional info: Bad ethtool output that's generated by this system: [root@REDHAT-HTS-RHEL5-1 network]# ethtool eth0 Settings for eth0: Supported ports: [ ] Supported link modes: Supports auto-negotiation: No Advertised link modes: Not reported Advertised auto-negotiation: No Speed: Unknown! (0) Duplex: Half Port: Twisted Pair PHYAD: 0 Transceiver: internal Auto-negotiation: off Supports Wake-on: d Wake-on: d Current message level: 0x08061924 (134617380) Link detected: yes
Forcibly setting the interface speed to 100 will work around the problem, so it looks like all we need is better error handling (and then working with the vendor to correct the real problem). def getInterfaceSpeed(self): self.interfaceSpeed = 100 return #skip the rest of this, trying to work around bad #data returning from ethtool for interfaceString in (self.interface, "p%s" % self.interface): ethtoolCommand = "ethtool %s | fgrep \"Speed\"" % interfaceString pipe = os.popen(ethtoolCommand) line = pipe.readline() pipe.close() if line: pattern = re.compile("\d+") match = pattern.search(line) if match: self.interfaceSpeed = string.atoi(match.group()) return # otherwise self.interfaceSpeed = 100 print "interface speed is %u" % self.interfaceSpeed
Fixed in R4
This is the patch I used to get around the speed detection: diff -Naurp network.py.orig network.py --- network.py.orig 2007-10-05 18:21:52.000000000 -0400 +++ network.py 2007-10-05 18:23:58.000000000 -0400 @@ -193,6 +193,10 @@ class NetworkTest(Test): return returnValue == 0 def getInterfaceSpeed(self): + self.interfaceSpeed = 100 + return + #skip the rest of this, trying to work around bad + #data returning from ethtool for interfaceString in (self.interface, "p%s" % self.interface): ethtoolCommand = "ethtool %s | fgrep \"Speed\"" % interfaceString pipe = os.popen(ethtoolCommand)
Hm... There's something else going on here, but I don't know what it is yet. After going through the ethtool output, I found that all Xen dom0s I tested (i686, x86_64, ia64 and this Egenera i686 blade) output link status and nothing else. Link speed is never displayed. This means my original hypothesis was wrong. That being the case, you'd think that the forced self.interfaceSpeed = 100 line wouldn't be necessary, as no system reports speed correctly in the scripts. But on these blades, if you use the unmodified script, the dd command that creates /var/www/html/httptest.file never stops, eventually filling up the entire hard drive. So, I put the patch back in and that portion of the test runs. Shortly after that portion of the test finishes, the system begins the NFS test and immediately displays errors that are not present when run with the PAE kernel: nfs: server 172.30.192.193 not responding, still trying This repeats over and over until the test is killed. After some trial and error, I determined that the problem is caused by the mount protocol. If you use UDP, mounts hang after issuing a simple 'ls' command. If you switch to TCP, commands work as expected. After modifying the network.py script further, changing the protocol on the nfsopts line to tcp: nfsopts="rw,intr,rsize=12288,wsize=12288,tcp" I can obtain a successful run of the NETWORK test. Now we need to determine what is causing dd to run out of control and why UDP mounts are unsuccessful.
A little background: The reason the network test uses NFS mounted via UDP is to test UDP.
Well then, I would say that it's definitely doing its job. I wonder what's so different about Egenera's architecture that would cause UDP traffic to fail? I'm waiting to hear more from Egenera.