Sometimes Beaker users submit tests that run wild and cause significant impact to the rest of our infrastructure. Rather than trying to identify every test a user has queued so that we can pre-emptively kill each one, it would be helpful to be able to temporarily disable the user's account. From an admin perspective, disabling the account should take effect immediately: tests that are running or queued at the time the account is disabled should be cancelled right away. It'd be nice if the account disable feature allowed us to pass along a brief message to the user explaining why their tests were cancelled, but that's a "nice to have" thing.
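To make the requested behaviour concrete, here is a minimal sketch in Python. It is not Beaker's actual code or data model: the `User` and `Job` classes, the status strings, and the `disable_user` and `submit_job` functions are all hypothetical names chosen for illustration. It only shows the three things asked for above: the disable takes effect at once, every running or queued job owned by the user is cancelled, and an optional message is attached.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical status names; a real scheduler may use different ones.
ACTIVE_STATES = {"queued", "running"}


@dataclass
class User:
    name: str
    disabled: bool = False
    disabled_reason: Optional[str] = None


@dataclass
class Job:
    job_id: int
    owner: str
    status: str = "queued"
    cancel_reason: Optional[str] = None


def disable_user(user: User, jobs: List[Job], reason: Optional[str] = None) -> List[Job]:
    """Disable the account and cancel all of the user's queued or running jobs.

    Returns the jobs that were cancelled so an admin can review them.
    """
    user.disabled = True
    user.disabled_reason = reason
    cancelled = []
    for job in jobs:
        if job.owner == user.name and job.status in ACTIVE_STATES:
            job.status = "cancelled"
            # The message is optional ("nice to have"), so fall back to a default.
            job.cancel_reason = reason or "Account disabled by an administrator"
            cancelled.append(job)
    return cancelled


def submit_job(user: User, jobs: List[Job], job_id: int) -> Job:
    """Refuse new submissions while the account is disabled."""
    if user.disabled:
        raise PermissionError(f"Account {user.name} is disabled: {user.disabled_reason}")
    job = Job(job_id=job_id, owner=user.name)
    jobs.append(job)
    return job
```

Jobs that have already finished and jobs owned by other users are left untouched, and re-enabling the account would simply mean clearing the `disabled` flag.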
This is 90% done, but it needs lots of testing, so I don't think it's going to make 0.6.12.
Please define "run wild and cause significant impact".
Jan,

By "run wild and cause significant impact", we're talking about instances where specific Beaker jobs cause a degradation of service to the Engineering labs or, in some cases, to an entire office.

The specific instance that prompted this RFE was a RHN/Satellite test that was hammering the Engineering CVS server. I wasn't involved in the discussions that took place after the offending Beaker jobs were cancelled, but if I recall correctly, it was determined that one of the tasks in the job contained a bug that caused CVS checkouts to be performed in an infinite loop, and that this was the clear root cause of the load spike on the CVS server.

Just from memory, I can also recall at least two specific incidents where network stress testing jobs were submitted and the submitter did not take proper care to ensure that the systems being tested were housed in the same lab. This resulted in network stress testing being attempted across the WAN links, which sent the CPU usage on the core routers to 99%, which in turn knocked out the MPLS link between RDU and BOS, which in turn knocked the *entire* BOS office offline for a few hours.

In both kinds of incident, Eng-Ops had to take considerable measures to identify the offending client systems, trace the systems back to an active Beaker job, cancel those jobs, and look through the Scheduler's job queue to try to identify other suspect tests submitted by the same user. In the case of the MPLS circuit being knocked offline, this required intervention from both Eng-Ops and corporate IT to identify and resolve the problem.

In all three of these instances, having the software equivalent of a 'big red button' to immediately cancel all running jobs by the offending user would have allowed us to bring the labs back to order more quickly and efficiently.
Hello. Thanks. A "red button" like this seems OK to me, since it would normally be used only in extreme and urgent situations. Thanks, Jan