Bug 615785 - [RFE] if virt install failed, guest installation should not wait 24h to fail.
Summary: [RFE] if virt install failed, guest installation should not wait 24h to fail.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: beah
Version: 0.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Nick Coghlan
QA Contact:
URL:
Whiteboard: Misc
Duplicates: 772907
Depends On: 655009
Blocks: 545868
 
Reported: 2010-07-18 19:22 UTC by Šimon Lukašík
Modified: 2012-11-22 09:16 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-22 09:16:27 UTC
Embargoed:



Description Šimon Lukašík 2010-07-18 19:22:20 UTC
Description of problem:
If /distribution/virt/install fails for some reason, the guest system installation waits for 24 hours and then fails. Could the guest systems, or the whole job, be made to fail immediately instead? This causes problems when other automation is waiting for the job to complete.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes (when virt install fails)

Steps to Reproduce:
1.
2.
3.
  
Actual results:
Time Remaining -1 day, 

Expected results:
Aborted immediately, if possible.

Additional info:

Comment 1 Marian Csontos 2010-07-27 21:11:35 UTC
Instead of implementing an AI solution to kill only the right jobs (as that would sometimes be wrong), this should be made part of the task's logic. Adding the following line to the task should do the job:

    rhts-abort -t RECIPESET
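
A minimal sketch of how a task's runtest.sh could wire this in (the run-virt-install.sh stand-in and the exit-status handling here are illustrative assumptions; only the rhts-abort call itself is the actual suggestion):

    # Illustrative only: run the real task body and, if it fails, abort the
    # whole recipe set instead of letting the guests' watchdogs run for 24h.
    ./run-virt-install.sh          # hypothetical stand-in for the task body
    RESULT=$?
    if [ "$RESULT" -ne 0 ]; then
        rhts-abort -t RECIPESET    # the suggested line
    fi
    exit "$RESULT"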

Anyway, I am not sure it is the right thing to do. What about multihost jobs, e.g. virtual-machine migration, ...?

Comment 2 Petr Sklenar 2010-07-28 09:22:29 UTC
Hello Marian,
is some magic like this possible:

We have a test:

0. beaker machine reservation
1. my first test/task
2. my second test/task
-
If my first test fails, then Beaker automagically closes the whole task, reserves and installs another machine, and tries 'my first test' again, then 'my second test'.
-
Is there some way to do this?

Comment 3 Marian Csontos 2010-07-28 11:06:53 UTC
Bug 618960: rhts-abort -t recipeset not working

Comment 4 Marian Csontos 2010-07-28 13:01:08 UTC
Re: Comment 2: No.

However, I would like to add some job-control... See Bug 619018

Comment 5 Marian Csontos 2012-01-10 10:55:54 UTC
*** Bug 772907 has been marked as a duplicate of this bug. ***

Comment 6 Marian Csontos 2012-01-10 10:59:40 UTC
One possible solution would be to add a parameter to the virt-(install|start) tasks so that, on failure, they would abort the recipe set if the parameter is (not) set. Gurhan, does this make sense?

Comment 7 Alexander Todorov 2012-01-10 11:22:08 UTC
From the duplicate bug:
> That's a feature not a bug.

I don't agree. Having to wait another day for a system is a big bug IMHO. I'm bumping the priority/severity because this issue impacts testing on ia64, where we don't have that many systems.

Comment 8 Gurhan Ozen 2012-01-11 22:42:44 UTC
(In reply to comment #6)
> One possible solution would be to add a parameter to the virt-(install|start)
> tasks so that, on failure, they would abort the recipe set if the parameter is
> (not) set. Gurhan, does this make sense?

It kind of makes sense, but it's a workaround. Also, I don't know how to do this; you did say in a previous comment that "rhts-abort -t RECIPESET" didn't work?

Just to make sure I understand this correctly: what you are asking for here is to have the virt/install and virt/start programs abort the recipeset on failure, right?

BTW, that still won't work properly if there are multiple guests in the recipe set; it could be that one guest doesn't install or start while the others still do.

I don't know. If there is absolutely no other alternative, I can put a workaround in, but it won't be the best solution.

Beaker does understand which recipe is the machine and which recipe is a guest, right? Is it possible to make it smart enough to finish the job if the host/dom0 is at 100% and all the guests are at 0%? Note that, even with this solution, you'll have to make sure that ALL guests are at 0%.

Comment 9 Alexander Todorov 2012-01-12 09:26:03 UTC
(In reply to comment #8)

> Beaker does understand which recipe is the machine and which recipe is a
> guest, right? Is it possible to make it smart enough to finish the job if the
> host/dom0 is at 100% and all the guests are at 0%? Note that, even with this
> solution, you'll have to make sure that ALL guests are at 0%.

This will likely not be true. In the job where I saw the issue, the host was at 100%, one of the guests was at 94%, and the other at 0%. Despite the fact that the host FAILED, the 94% guest managed to complete successfully after a while.

Comment 10 Gurhan Ozen 2012-01-16 16:29:25 UTC
(In reply to comment #9)
> (In reply to comment #8)
> 
> > Beaker does understand which recipe is the machine and which recipe is a
> > guest, right? Is it possible to make it smart enough to finish the job if
> > the host/dom0 is at 100% and all the guests are at 0%? Note that, even with
> > this solution, you'll have to make sure that ALL guests are at 0%.
> 
> This will likely not be true. In the job where I saw the issue, the host was
> at 100%, one of the guests was at 94%, and the other at 0%. Despite the fact
> that the host FAILED, the 94% guest managed to complete successfully after a
> while.

See, that's a valid case. What happens is this: 
 -- dom0/host installs the guests.
 -- if you just use /distribution/virt/start, it starts the guests and the test is done.
 -- However, after the guests are started, the tests inside the guests start running. So while all the tests on the dom0/host might have finished (because as far as they are concerned, installing and starting up the guests is all there is to do), the guests might have a bunch of tests that will take a while to complete.

This is not an easy thing to solve. I think the best way to solve it would be to somehow trigger the /distribution/install test inside the guest from dom0 after the guest is installed, so that if the guest installation went awry, or the guest installed but just doesn't boot for whatever reason, the guest's /distribution/install test would time out and get aborted.

Marian, is it possible to somehow tell Beaker that the /distribution/install test inside the guest has started without it actually having been executed inside the guest? What I want to do is: when the guest is started, I want to tell Beaker that the /distribution/install test inside the guest has started. So if the guest doesn't boot for whatever reason, the install test inside the guest times out and the whole guest recipe gets aborted. Is this possible?

Comment 11 Marian Csontos 2012-01-30 11:33:02 UTC
It does tell Beaker when the guest's task gets started. The problem is that the watchdog for the guests is set to a high value:

When a job is scheduled, the first task in each recipe is considered to be in the "Running" state, with a watchdog assigned.

What could help is to reset the guests' watchdogs to a reasonable value after the virt-install task, or better, in virt-start right before the VM is started.

Comment 12 Jan Stancek 2012-01-31 13:38:00 UTC
(In reply to comment #11)
> What could help is to reset guests' watchdog to reasonable value after
> virt-install task or better in virt-start right before the VM is started.

I'm assuming we don't want tests talking to the lab controller directly, so some new rhts- command would be needed in the harness, for example:
rhts-guest-started <guest hostname>
or
rhts-recipe-tasks <hostname> to list the tasks with their IDs, then get the task ID from there and call rhts-extend.
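
Roughly, the second variant could look something like this on the host side (both commands are only proposals at this point; the task-list output format and the rhts-extend arguments below are assumptions, not an existing interface):

    # Proposed usage only -- neither command exists in the harness yet, and
    # the output format and rhts-extend arguments are assumed for illustration.
    GUEST="$1"                     # guest hostname passed in by the caller
    # List the guest recipe's tasks; assume the task ID is the first column
    # of the first line of output.
    TASK_ID=$(rhts-recipe-tasks "$GUEST" | awk 'NR==1 {print $1}')
    # Reset that task's watchdog to a reasonable value (in seconds) right
    # before the VM is started, instead of the default 24h external watchdog.
    rhts-extend "$TASK_ID" 3600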

Comment 13 Marian Csontos 2012-01-31 14:06:39 UTC
But still, if there are two guest recipes and one fails while the second gets to a long-running task (like reservesys), it will block for another 24 hours until all the EWDs expire.

The simplest solution to this problem is to make an extension to virt-start which would wait until all[1] guests are up and abort the recipe set if they are not.

This could be desirable for other multihost tests as well, perhaps as a separate task to include right after /distribution/install.

[1]: All is a good default, but we could use a quantity smaller than all. At the moment, though, there is no point in running with only part of the VMs provisioned in the case of a multihost test, as we would first need a way to reconfigure roles in the harness according to which machines are available. This would be a useful extension, especially if Beaker allowed returning single machines.
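
A rough sketch of what such a virt-start extension might look like; the guest list, the ping-based check, and the timeout value are assumptions for illustration, only the recipe-set abort comes from the earlier comments:

    # Illustrative sketch: wait until every guest answers, and abort the
    # recipe set if any guest fails to come up within the (assumed) timeout.
    GUESTS="guest1 guest2"              # hypothetical guest hostnames
    TIMEOUT=1800                        # seconds to wait, assumed value
    DEADLINE=$(( $(date +%s) + TIMEOUT ))
    for guest in $GUESTS; do
        until ping -c 1 -W 2 "$guest" >/dev/null 2>&1; do
            if [ "$(date +%s)" -gt "$DEADLINE" ]; then
                echo "Guest $guest did not come up, aborting recipe set"
                rhts-abort -t recipeset
                exit 1
            fi
            sleep 30
        done
    done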

Comment 14 Jan Stancek 2012-01-31 14:40:43 UTC
(In reply to comment #13)
> But still, if there are two guest recipes and one fails while the second gets
> to a long-running task (like reservesys), it will block for another 24 hours
> until all the EWDs expire.
> 
> The simplest solution to this problem is to make an extension to virt-start
> which would wait until all[1] guests are up and abort the recipe set if they
> are not.

This looks better. It would also be nice to have a parameter to choose whether the whole recipeset should be aborted, or just the guest recipes which failed to check in.

> 
> This could be desirable for other multihost tests as well, perhaps as a
> separate task to include right after /distribution/install.
> 
> [1]: All is a good default, but we could use a quantity smaller than all. At
> the moment, though, there is no point in running with only part of the VMs
> provisioned in the case of a multihost test, as we would first need a way to
> reconfigure roles in the harness according to which machines are available.
> This would be a useful extension, especially if Beaker allowed returning
> single machines.

I think we don't have multihost tests in guests right now, because the guest hostname Beaker gives you is different from the one you get at runtime from DNS/DHCP.

So I think this also means we can't use rhts-sync-* between the host and a guest, because you can't reach the guest using the Beaker guest hostname.

Comment 15 Dan Callaghan 2012-10-17 03:57:12 UTC
I think this can be solved pretty easily once bug 655009 is in place. /distribution/virt/install can just tell Beaker to start the recipe for each guest right before it starts the installation of the guest, like Marian suggested in comment 11.

Comment 16 Nick Coghlan 2012-10-17 04:39:25 UTC
Bulk reassignment of issues as Bill has moved to another team.

Comment 17 Nick Coghlan 2012-11-22 09:16:27 UTC
As noted in the comments on bug 655009, this issue should have been resolved in 0.10.

Feel free to reopen if the problem still occurs.

