Bug 637260 - [RFE] automatic removal of systems that do not Install
Summary: [RFE] automatic removal of systems that do not Install
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: inventory
Version: 0.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dan Callaghan
QA Contact:
URL:
Whiteboard:
Depends On: 591652
Blocks: 632609
 
Reported: 2010-09-24 18:14 UTC by Bill Peck
Modified: 2011-09-28 15:34 UTC
8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 591652
Environment:
Last Closed: 2010-10-14 03:13:15 UTC
Embargoed:



Comment 1 Jeff Burke 2010-09-24 19:00:51 UTC
We should be taking advantage of koan.

If a system fails to PXE boot but then continues to boot, the installation should be done with koan.

Comment 2 Bill Peck 2010-09-24 19:15:28 UTC
(In reply to comment #1)
> We should be taking advantage of koan.
> 
> If a system fails to PXE boot but then continues to boot, the installation
> should be done with koan.

True, but I'm thinking of the case where something is wrong with the system itself. I'll rename the bug. ;-)

We want to try to catch systems that have a broken hard disk, or some other fault that prevents installs that should otherwise succeed.

Comment 3 Dan Callaghan 2010-09-27 01:56:42 UTC
I'm not clear on how we can detect this. Can we check for a failing result for the /distribution/install task?

And do we want to apply some kind of heuristics (as discussed in the other bug) like "mark as broken if more than X different distros consecutively fail to install"?

Comment 4 Bill Peck 2010-09-27 12:04:43 UTC
Hi Dan,

Yes, currently we can go to a system's page and click on the "Tasks" tab. There you can filter on the "/distribution/install" task.

Jeff Burke does this to see if a system has been working or not.

Comment 5 Dan Callaghan 2010-09-28 06:40:58 UTC
I'm not sure that we can rely on a failing result for /distribution/install to indicate that the install did not work. I have had a look through production data at a bunch of recipes where /distribution/install failed, and a very large proportion of the failures (even with "STABLE" distros) are due to errors like this one [1]:

******** Potential Issues dmesg ********
processor: probe of LNXCPU:00 failed with error -22
processor: probe of LNXCPU:01 failed with error -22
agpgart-serverworks: probe of 0000:00:00.0 failed with error -22
agpgart-serverworks: probe of 0000:00:00.1 failed with error -22
...

These kernel messages seem to be non-fatal (the rest of the recipe runs fine) so I don't think we should be marking systems as broken because of these. So I think we will have to try another tack for this.

[1] https://beaker.engineering.redhat.com/logs/2010/08/123/12343/23063/288802/810671///test_log--distribution-install-Sysinfo.log

Comment 6 Jeff Burke 2010-09-28 12:24:15 UTC
Dan,
  That is because of the way subtests propagate their results to the top level. In reality the install actually worked (passed); it was the Sysinfo subtest that reported the failure.

  For starters we should be able to catch issues like this one:
https://beaker.engineering.redhat.com/recipes/5174
The recipe shows Status "Aborted" with Result "Warn".

  The other thing I just realized (you may have already taken this into account) is that this should only be done for systems whose type is Machine.

Comment 7 Bill Peck 2010-09-28 13:10:45 UTC
Jeff points out the important part here. Status "Aborted" and Result "Warn" are the important pieces to query on.

Comment 8 Dan Callaghan 2010-09-28 23:14:21 UTC
Okay, Aborted status for /distribution/install seems to be more like what we want. There seem to be three common cases: cobbler fails to provision the system (I think this will already be handled by bug 591652); there's a kernel panic, in which case the result is Panic; or the machine just goes away (bad network config etc.), in which case the external watchdog kills the job with result Warn.

So our rule should be: If the /distribution/install task is Aborted for a STABLE distro, then inspect that system's history to check if any other STABLE distro recipe has been Aborted without an intervening successful recipe. If so, mark the system as broken and notify its owner. Does that sound right? Am I making this too complicated?
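
A minimal sketch of that rule in plain Python, using simplified stand-in recipe objects rather than Beaker's actual models (all names and shapes below are invented for illustration only):

# Illustration only: the rule from comment 8 over simplified stand-in objects,
# not Beaker's real Recipe/System models.
from dataclasses import dataclass, field

@dataclass
class FakeRecipe:
    status: str                          # e.g. 'Aborted', 'Completed'
    distro_tags: set = field(default_factory=set)

def should_mark_broken(new_abort, previous_recipes):
    """previous_recipes is the system's recipe history, newest first."""
    if 'STABLE' not in new_abort.distro_tags:
        return False
    for recipe in previous_recipes:
        if recipe.status == 'Completed':
            return False                 # an intervening success clears the slate
        if recipe.status == 'Aborted' and 'STABLE' in recipe.distro_tags:
            return True                  # consecutive STABLE aborts: looks broken
    return False

# Two STABLE aborts in a row with no success in between trips the rule.
history = [FakeRecipe('Aborted', {'STABLE'})]
print(should_mark_broken(FakeRecipe('Aborted', {'STABLE'}), history))  # True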

Comment 9 Bill Peck 2010-09-29 13:00:15 UTC
I think you have it.  It is a bit complicated, but unavoidable.

The only thing I'm worried about is reporting the same failure and moving the system to Broken when it's already been investigated and found to be working.

For example:

1. The system fails multiple installs.
2. A cron job detects the failures, moves the machine to Broken, and emails the owner.
3. The owner investigates and fixes the problem.
4. The owner moves the system back to Working.
5. The cron job runs again; no new jobs have run yet, so it falsely puts the system back in Broken and emails the owner. :-/

Currently history and tasks are separate tables. Can we craft a query that will join these chronologically? Or do we need to think about recording these watchdog aborts in the history table?
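
As a toy illustration of the false-positive concern (the data shapes below are invented for the example, not Beaker's tables): an abort should only count against a system if it happened after the most recent change back to Working.

from datetime import datetime

# Invented example data, not Beaker's schema.
status_changes = [                         # (when, new_status) from system history
    (datetime(2010, 9, 20, 10, 0), 'Working'),
]
install_aborts = [                         # times of aborted /distribution/install runs
    datetime(2010, 9, 19, 8, 0),
    datetime(2010, 9, 19, 12, 0),
]

last_working = max(when for when, status in status_changes if status == 'Working')
recent_aborts = [t for t in install_aborts if t > last_working]
print(len(recent_aborts))   # 0: nothing has aborted since the owner fixed it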

Comment 10 Dan Callaghan 2010-09-29 22:06:10 UTC
I don't think we necessarily have to put it in a cron job. Here's a patch I knocked up yesterday (completely untested as yet, but it gives you an idea of what I was thinking):

--- a/Server/bkr/server/model.py
+++ b/Server/bkr/server/model.py
@@ -4199,6 +4199,28 @@ class Recipe(TaskBase):
         """
         self._abort(msg)
         self.update_status()
+        if self.system is not None and \
+                self.system.type == SystemType.by_name(u'Machine') and \
+                u'STABLE' in self.distro.tags:
+            # bz637260: does this system have a history of bad behaviour?
+            aborted = TaskStatus.by_name(u'Aborted')
+            for recipe in sorted(self.system.recipes,
+                    key=lambda r: r.finish_time, reverse=True):
+                if recipe is self: continue
+                if recipe.status != aborted: break
+                if recipe.distro != self.distro and \
+                        u'STABLE' in recipe.distro.tags:
+                    reason = _(u'Recipe %s with distro %s aborted, and '
+                            'previous recipe %s also aborted') % (self.id,
+                            self.distro, recipe.id)
+                    log.error('Marking system as broken: %s' % reason)
+                    old_status = self.system.status
+                    self.system.mark_broken(reason=reason, recipe=self)
+                    self.system.activity.append(
+                            SystemActivity(service='Scheduler',
+                            action='Changed', field_name='Status',
+                            old_value=old_status,
+                            new_value=recipe.system.status))
 
     def _abort(self, msg=None):
         """

You're right that this doesn't handle the situation where someone has already investigated it, marked it as working, and then another STABLE distro happens to fail.

We could record these "suspicious" aborts in the system history, and use that to decide whether to mark it as broken. Or we could use a counter stored in the database (new column on system table, I guess) which we then reset once the system is changed back to working.

Comment 11 Dan Callaghan 2010-09-30 01:59:09 UTC
Notes from IRC discussion:

* system.recipes is an associated collection, and might be large, so we should avoid fully populating it
  - does slicing it make it lazy?
  - could we make a "dynamic loader" thingy like recipe.dyn_systems?
  - otherwise just use a query (a rough sketch follows after these notes)
* can we re-queue the recipe if we decide the system is broken?
* as already mentioned, we have to account for an intervening status change
  - could store suspicious aborts in system history, and examine that instead of system.recipes (but putting structured data into the mostly unstructured system history leads to pain)
  - could examine a union of status changes from system history and system.recipes?
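
A rough sketch of the "just use a query" option, assuming the attributes visible in the patch in comment 10 (system, status, finish_time) are mapped columns and that a SQLAlchemy session is available; it pulls only the handful of most recent recipes instead of populating system.recipes:

# Sketch only; 'session' and the column mappings are assumptions, not verified
# against Beaker's actual model code.
recent = (session.query(Recipe)
          .filter(Recipe.system == system)
          .order_by(Recipe.finish_time.desc())
          .limit(5)
          .all())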

Comment 13 Dan Callaghan 2010-10-05 23:46:27 UTC
I'm glad Ray pointed out the dependent subquery issue; I revisited the query and it was both more complex and less correct than it should have been. It now has no dependent subqueries and the EXPLAIN looks fine. It runs in 0 seconds on my machine now. And it works properly ;-)

I also added a check to skip all this broken-detection logic if the system is already broken. That should handle the case where something else has marked the system broken for us already (e.g. cobbler failure handling from bug 591652).
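
That guard is essentially an early return at the top of the check; a minimal sketch, with the status name written as a plain string for illustration:

def maybe_mark_broken(system, run_abort_history_check):
    # Sketch of the guard in comment 13: if something else (e.g. the cobbler
    # failure handling from bug 591652) already marked the system broken,
    # skip the abort-history check entirely.
    if system.status == 'Broken':          # status name assumed for illustration
        return False
    return run_abort_history_check(system)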

