From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.2) Gecko/20040308
Description of problem:
This is actually two problems in one. First, in the functions script, the wait_for_lock() function uses a hardcoded lock file name, rhr.NETWORK.
If you start two test machines at roughly the same time and they both perform the network prep at the same time, the prep stage errors out for NETWORK because two different machines are using the exact same lock file on the host (ispec) server.
I corrected this by adding a date timestamp to the lock file name. Please see the patch file attached to this bug, called rhr2-1.1-multi_locking.patch
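As a rough illustration of the idea (this is not the content of the attached patch), the lock file name could be built from the test name plus a timestamp. The function name lock_name and the timestamp format below are assumptions made for this sketch. Two machines starting within the same second would still get the same name, so a per-client value such as the hostname may be a safer suffix.

```shell
# Hypothetical sketch: build a lock file name that is not shared by every run.
# lock_name and the %Y%m%d%H%M%S format are illustrative, not taken from rhr2.
lock_name() {
    test_name="$1"
    # Use the caller's timestamp if given, otherwise the current time.
    stamp="${2:-$(date +%Y%m%d%H%M%S)}"
    printf 'rhr.%s.%s\n' "$test_name" "$stamp"
}

# Example use: lockfile=$(lock_name NETWORK)
```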
Problem 2 comes after this, when the NETWORK test is run during the automated stage of the test. From what I could uncover, when the tcp test is running, apachebench is launched via ssh from the ispec server against the test machine. When running this on multiple test machines, the problem arises that apachebench is killed off generically with
#killall -9 ab
as soon as one of the servers finishes the tcp test. Obviously, this also kills off all the ab instances that are running for any other test machines that are being certified.
This was fixed (I believe) by changing the way tcp_cleanup() kills off a machine's apachebench instances.
Since each SSH connection from a test machine gets its own PID, that PID becomes the parent PID of any processes run from that login/bash instance. So, if server1 runs tcp, it logs in to the ispec server and gets a bash instance with a PID of 5401 (arbitrary PID). All of server1's ab tests then have 5401 set as their PPID.
Long story short (the patch makes this more obvious), tcp_cleanup() was changed so that it kills all processes under a particular server's PPID instead of issuing a blanket "killall -9 ab". That way, when server1 with a PPID of 5401 finishes, the only ab instances killed are the ones that have 5401 as their PPID, leaving server2's (PPID of 5500) instances of ab alone.
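The PPID-based cleanup can be sketched roughly as follows. This is not the code from the patch: the helper names children_named and kill_session_ab are hypothetical, and the sketch assumes a ps that accepts the POSIX "-e -o pid= -o ppid= -o comm=" options. It selects only the ab processes whose parent is one given login shell and kills those.

```shell
# Hypothetical sketch of a per-session cleanup; names are illustrative only.

# Read "PID PPID COMMAND" lines on stdin and print the PIDs of processes
# named $2 whose parent PID is $1.
children_named() {
    awk -v ppid="$1" -v name="$2" '$2 == ppid && $3 == name { print $1 }'
}

# Kill only the ab processes that were started from one login shell.
kill_session_ab() {
    session_pid="$1"
    pids=$(ps -e -o pid= -o ppid= -o comm= | children_named "$session_pid" ab)
    # Do nothing when this session has no ab children left.
    [ -z "$pids" ] || kill -9 $pids
}
```

With the example PIDs from the description, a listing that contains ab processes under both 5401 and 5500 would yield only the children of 5401 when called as "children_named 5401 ab", so server2's instances are left running.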
I tested this a couple times running network tests on two different test machines simultaneously, and both completed with passing grades.
Please see patch called:
I am also attaching a source code file for rhr2-1.1-3 with both patches applied. I arbitrarily renumbered it to rhr2-1.1-4 just to keep it from being mistaken for the correct current cert suite.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up two or three test machines
2. Run network tests simultaneously on all test machines using only one ispec server
Actual Results: see description
Expected Results: network tests should have all completed without issue
Created attachment 116528
patch to fix the lockfile issue in functions/wait_for_lock()
Created attachment 116529
patch to allow multiple network tests on one ispec server
Created attachment 116530
sourcecode file including the two patches.
This is the source code. The version 1.1-4 was just an arbitrary number that I
used to keep from confusing this patched version with the original correct
current version of rhr2.
Hi, Jeff, thanks for the patches and bug report. This is by design since the
NETWORK test is saturating the link. If we allow the server to handle multiple
concurrent NETWORK tests, then the NIC on the server would become overwhelmed
and we wouldn't be able to do a full NETWORK test for the clients. This is why
the locking mechanism is there.