162793 – NETWORK test fails when running more than one test machine

Summary: NETWORK test fails when running more than one test machine

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Ready Certification Tests
Classification:	Retired
Component:	net
Sub Component:
Version:	2
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Will Woods
QA Contact:	Rob Landry
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-07-08 18:07 UTC by Jeff Lane
Modified:	2007-04-18 17:29 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-07-27 18:55:28 UTC
Embargoed:

Attachments
patch to fix the lockfile issue in functions/wait_for_lock() (803 bytes, patch) 2005-07-08 18:09 UTC, Jeff Lane	no flags
patch to allow multiple network tests on one ispec server (1.22 KB, patch) 2005-07-08 18:11 UTC, Jeff Lane	no flags
sourcecode file including the two patches. (66.77 KB, application/x-rpm) 2005-07-08 18:14 UTC, Jeff Lane	no flags

Description Jeff Lane 2005-07-08 18:07:55 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.2) Gecko/20040308

Description of problem:
This is actually two problems in one. First, in the functions script, the wait_for_lock() function uses a hardcoded lock file name, rhr.NETWORK.

If you start two test machines at roughly the same time and they both perform the network prep at the same time, the prep stage errors out for NETWORK because two different machines are using the exact same lock file on the host (ispec) server.

I corrected this by adding a timestamp to the lock file name. Please see the patch file attached to this bug, called: rhr2-1.1-multi_locking.patch
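
To illustrate the idea only (this is a minimal sketch, not the contents of the attached patch): the lock file name can carry a timestamp so that two clients no longer share a single file. The helper name make_lock_name, the directory argument, and the timestamp format below are all assumptions made for illustration.

```shell
# Hypothetical helper, NOT taken from rhr2-1.1-multi_locking.patch.
# Builds a lock file path of the form <dir>/rhr.NETWORK.<timestamp>.
#   $1 = lock directory (assumed default: /tmp)
#   $2 = timestamp override (default: current time, to the second)
make_lock_name() {
    lock_dir="${1:-/tmp}"
    stamp="${2:-$(date +%Y%m%d%H%M%S)}"
    printf '%s/rhr.NETWORK.%s\n' "$lock_dir" "$stamp"
}

# Example: make_lock_name /tmp 20050708180755
# prints /tmp/rhr.NETWORK.20050708180755
```

Note that a timestamp with one-second resolution would still collide if two clients started within the same second, so a per-host component in the name may also be worth considering.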

Problem 2 comes after this, when the NETWORK test is run during the automated stage of the test. From what I could uncover, when the tcp test is running, apachebench is launched via ssh from the ispec server against the test machine. When running this on multiple test machines, the problem arises that apachebench is killed off generically with

#killall -9 ab

as soon as one of the servers finishes the tcp test. Obviously, this also kills off all the ab instances that are running for any other test machines that are being certified.

This was fixed (I believe) by changing the way tcp_cleanup() kills off a machine's apachebench.

Since each SSH connection from a test machine gets its own shell with its own PID, that PID becomes the parent PID (PPID) of any processes run from that login/bash instance. So, if server1 runs tcp, it logs in to the ispec server and gets a bash instance with a PID of 5401 (arbitrary PID). Then all of server1's ab tests have 5401 set as their PPID.

Long story short (the patch for this makes it more obvious), tcp_cleanup() was changed so that it kills only the processes whose PPID belongs to a particular server's session, instead of issuing a blanket "killall -9 ab". That way, when server1 with a PPID of 5401 ends, the only ab instances killed are the ones that have 5401 as the PPID, leaving server2's instances of ab (PPID of 5500) alone.
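
As a rough sketch of that approach (not the contents of the attached patch): select only the ab processes whose parent is a given session shell, then kill just those. The function names children_named and kill_ab_for_session are hypothetical, and the sketch assumes a ps that accepts -eo pid=,ppid=,comm= as procps on Linux does.

```shell
# Hypothetical helpers, NOT taken from rhr2-1.1-multi-net-tests.patch.

# Print the PIDs of processes named $2 whose parent PID is $1.
# Reads "PID PPID COMMAND" lines on stdin.
children_named() {
    awk -v ppid="$1" -v name="$2" '$2 == ppid && $3 == name { print $1 }'
}

# Kill only the ab instances started from the session shell with PID $1,
# instead of a blanket "killall -9 ab".
kill_ab_for_session() {
    pids=$(ps -eo pid=,ppid=,comm= | children_named "$1" ab)
    if [ -n "$pids" ]; then
        # Word splitting is intended here: $pids holds one PID per line.
        kill -9 $pids
    fi
    return 0
}
```

With the PIDs from the description, kill_ab_for_session 5401 would leave any ab process whose PPID is 5500 untouched. If the cleanup runs inside the same ssh login shell that started ab, that shell's own PID ($$) could presumably be passed in; whether that holds for tcp_cleanup() is an assumption.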

I tested this a couple times running network tests on two different test machines simultaneously, and both completed with passing grades.

Please see patch called:
rhr2-1.1-multi-net-tests.patch

I am also attaching a source code file for rhr2-1.1-3 with the patches applied. I arbitrarily renumbered it to rhr2-1.1-4 just to keep it from being mistaken for the correct current cert suite.

Version-Release number of selected component (if applicable):
rhr2-1.1-3

How reproducible:
Always

Steps to Reproduce:
1. Set up two or three test machines.
2. Run network tests simultaneously on all test machines using only one ispec server.
3. Profit!

Actual Results: see description

Expected Results: network tests should have all completed without issue

Additional info:

Comment 1 Jeff Lane 2005-07-08 18:09:40 UTC

Created attachment 116528
patch to fix the lockfile issue in functions/wait_for_lock()

Comment 2 Jeff Lane 2005-07-08 18:11:43 UTC

Created attachment 116529
patch to allow multiple network tests on one ispec server

Comment 3 Jeff Lane 2005-07-08 18:14:00 UTC

Created attachment 116530
sourcecode file including the two patches.

This is the source code.  The version 1.1-4 was just an arbitrary number that I
used to keep from confusing this patched version with the original correct
current version of rhr2.

Comment 4 Richard Li 2005-07-27 18:55:28 UTC

Hi, Jeff, thanks for the patches and bug report. This is by design since the
NETWORK test is saturating the link. If we allow the server to handle multiple
concurrent NETWORK tests, then the NIC on the server would become overwhelmed
and we wouldn't be able to do a full NETWORK test for the clients. This is why
the locking mechanism is there.
