We should be taking advantage of koan. If a system fails to PXE boot but then continues to boot, the installation should be done with koan.
(In reply to comment #1)
> We should be taking advantage of koan.
>
> If a system fails to PXE boot but then continues to boot, the installation
> should be done with koan.

True, but I'm thinking of the case where something is wrong with the system itself. I'll rename the bug. ;-) We want to try to catch systems that have a broken hard disk, or something else that keeps installs that should work from succeeding.
I'm not clear on how we can detect this. Can we check for a failing result for the /distribution/install task? And do we want to apply some kind of heuristics (as discussed in the other bug) like "mark as broken if more than X different distros consecutively fail to install"?
Hi Dan,

Yes, currently we can go to a system's page and click on the "Tasks" tab. There you can filter on the task "/distribution/install". Jeff Burke does this to see whether a system has been working or not.
I'm not sure that we can rely on a failing result for /distribution/install to indicate that the install did not work. I have had a look through production data at a bunch of recipes where /distribution/install failed, and a very large proportion of the failures (even with "STABLE" distros) are due to errors like this one [1]:

******** Potential Issues dmesg ********
processor: probe of LNXCPU:00 failed with error -22
processor: probe of LNXCPU:01 failed with error -22
agpgart-serverworks: probe of 0000:00:00.0 failed with error -22
agpgart-serverworks: probe of 0000:00:00.1 failed with error -22
...

These kernel messages seem to be non-fatal (the rest of the recipe runs fine), so I don't think we should be marking systems as broken because of these. So I think we will have to try another tack for this.

[1] https://beaker.engineering.redhat.com/logs/2010/08/123/12343/23063/288802/810671///test_log--distribution-install-Sysinfo.log
Dan,

That is because of the way subtests propagate their results to the top level. In reality the install actually worked (passed); it was the Sysinfo subtest that reported the failure.

For starters we should be able to catch issues like this one: https://beaker.engineering.redhat.com/recipes/5174 where the recipe's status is "Aborted" and its result is "Warn".

The other thing I just realized (you may have already taken this into account) is that this should only be done for systems of type Machine.
Jeff points out the important part here: status Aborted with result Warn are the important pieces to query on.
Okay, Aborted status for /distribution/install seems to be more like what we want. There seem to be three common cases: cobbler fails to provision the system (I think this will already be handled by bug 591652); or there's a kernel panic, in which case the result is Panic; or the machine just goes away (bad network config etc), in which case the external watchdog kills the job with result Warn. So our rule should be: if the /distribution/install task is Aborted for a STABLE distro, then inspect that system's history to check whether any other STABLE distro recipe has been Aborted without an intervening successful recipe. If so, mark the system as broken and notify its owner. Does that sound right? Am I making this too complicated?
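A rough sketch of that rule as a standalone function (the record shape and field names here are simplified stand-ins, not Beaker's real model, and this version ignores the "different distro" refinement from the patch discussed later):

```python
def should_mark_broken(recipes):
    """Decide whether a system looks broken, given its recipes in
    reverse-chronological order (most recent first).

    Each recipe is a dict with 'status' ('Aborted', 'Completed', ...)
    and 'distro_tags' (a set of tag names). The rule: the latest recipe
    is an aborted STABLE install, and some earlier STABLE recipe also
    aborted, with no non-aborted recipe in between.
    """
    if not recipes:
        return False
    latest = recipes[0]
    if latest['status'] != 'Aborted' or 'STABLE' not in latest['distro_tags']:
        return False
    for recipe in recipes[1:]:
        if recipe['status'] != 'Aborted':
            # An intervening non-aborted recipe: the system was working.
            return False
        if 'STABLE' in recipe['distro_tags']:
            # A second consecutive STABLE abort: looks broken.
            return True
    return False

history = [
    {'status': 'Aborted', 'distro_tags': {'STABLE'}},
    {'status': 'Aborted', 'distro_tags': {'STABLE'}},
    {'status': 'Completed', 'distro_tags': {'STABLE'}},
]
```

With that history, `should_mark_broken(history)` returns True: two consecutive STABLE aborts since the last success.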
I think you have it. It is a bit complicated, but unavoidable. The only thing I'm worried about is reporting the same failure and moving the system to broken when it's already been investigated and found to be working. For example:

1. System fails multiple installs.
2. Cron job detects the failures, moves the machine to Broken, and emails the owner.
3. Owner investigates, fixes the problem, and moves the system back to Working.
4. Cron job runs again before any new jobs have been run, so it falsely puts the system back in Broken and emails the owner. :-/

Currently history and tasks are separate tables. Can we craft a query that will join these chronologically? Or do we need to think about recording these watchdog aborts in the history table?
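If a single SQL join turns out to be awkward, one alternative is to fetch each table ordered by time and merge the two streams in application code. A toy sketch (the timestamps and record shapes are invented for illustration, not Beaker's schema):

```python
from heapq import merge

def chronological(history_rows, task_rows):
    """Merge system-history entries and task results into one timeline,
    oldest first. Both inputs must already be sorted by their timestamp
    (the first tuple element), as they would be with an ORDER BY."""
    return list(merge(history_rows, task_rows, key=lambda row: row[0]))

# Hypothetical rows: (timestamp, source, description)
history = [(15, 'history', 'Status: Broken -> Working')]
tasks = [(10, 'task', '/distribution/install Aborted'),
         (20, 'task', '/distribution/install Aborted')]

timeline = chronological(history, tasks)
# With the streams interleaved, an abort that happened before the owner
# moved the system back to Working can be recognised and skipped.
```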
I don't think we necessarily have to put it in a cron job. Here's a patch I knocked up yesterday (completely untested as yet, but it gives you an idea of what I was thinking):

--- a/Server/bkr/server/model.py
+++ b/Server/bkr/server/model.py
@@ -4199,6 +4199,28 @@ class Recipe(TaskBase):
         """
         self._abort(msg)
         self.update_status()
+        if self.system is not None and \
+                self.system.type == SystemType.by_name(u'Machine') and \
+                u'STABLE' in self.distro.tags:
+            # bz637260: does this system have a history of bad behaviour?
+            aborted = TaskStatus.by_name(u'Aborted')
+            for recipe in sorted(self.system.recipes,
+                    key=lambda r: r.finish_time, reverse=True):
+                if recipe is self: continue
+                if recipe.status != aborted: break
+                if recipe.distro != self.distro and \
+                        u'STABLE' in recipe.distro.tags:
+                    reason = _(u'Recipe %s with distro %s aborted, and '
+                            'previous recipe %s also aborted') % (self.id,
+                            self.distro, recipe.id)
+                    log.error('Marking system as broken: %s' % reason)
+                    old_status = self.system.status
+                    self.system.mark_broken(reason=reason, recipe=self)
+                    self.system.activity.append(
+                            SystemActivity(service='Scheduler',
+                                action='Changed', field_name='Status',
+                                old_value=old_status,
+                                new_value=recipe.system.status))

     def _abort(self, msg=None):
         """

You're right that this doesn't handle the situation where someone has already investigated it, marked it as working, and then another STABLE distro happens to fail. We could record these "suspicious" aborts in the system history, and use that to decide whether to mark it as broken. Or we could use a counter stored in the database (a new column on the system table, I guess) which we then reset once the system is changed back to working.
Notes from IRC discussion:

* system.recipes is an associated collection, and might be large, so we should avoid fully populating it
  - does slicing it make it lazy?
  - could we make a "dynamic loader" thingy like recipe.dyn_systems?
  - otherwise just use a query
* can we re-queue the recipe if we decide the system is broken?
* as already mentioned, we have to account for an intervening status change
  - could store suspicious aborts in system history, and examine that instead of system.recipes (but putting structured data into the mostly unstructured system history leads to pain)
  - could examine a union of status changes from system history and system.recipes?
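On the "does slicing it make it lazy?" question: slicing helps only if the thing being sliced is itself lazy. Slicing an already-loaded list collection loads everything first; a SQLAlchemy "dynamic" relationship stays a query, so a slice on it becomes LIMIT/OFFSET. A small stand-alone illustration of the difference, with a counter standing in for row loads (plain Python, not the ORM):

```python
from itertools import islice

loads = {'count': 0}

def recipe_rows(n):
    """Generator standing in for a lazily-executed query."""
    for i in range(n):
        loads['count'] += 1  # count each "row" actually produced
        yield i

# Lazy source + islice: only the rows we ask for are produced.
first_three = list(islice(recipe_rows(10000), 3))
assert loads['count'] == 3

# Eager list + slice: everything is produced before we slice.
loads['count'] = 0
first_three = list(recipe_rows(10000))[:3]
assert loads['count'] == 10000
```

So for a large system.recipes collection, something query-shaped (a dynamic loader or an explicit query) is what keeps the slice cheap.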
Pushed branch bz637260 for review: http://git.fedorahosted.org/git/?p=beaker.git;a=commitdiff;h=1c0f9aab67afd9cec6f6d25d108cb4e438de100a
I'm glad Ray pointed out the dependent subquery issue; when I revisited the query it was both more complex and less correct than it should have been. It now has no dependent subqueries and the EXPLAIN looks fine. It runs in 0 seconds on my machine now. And it works properly ;-) I also added a check to skip all this broken-detection logic if the system is already broken. That should handle the case where something else has marked the system broken for us already (e.g. cobbler failure handling from bug 591652).
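For the record, the general shape of that rewrite: a dependent (correlated) subquery is re-evaluated for every outer row, while an uncorrelated aggregate joined back in is evaluated once. A minimal stand-alone illustration using sqlite3 (the schema here is invented for the example, not Beaker's actual query):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE recipe (id INTEGER PRIMARY KEY, system_id INTEGER,
                         status TEXT, finish_time INTEGER);
    INSERT INTO recipe VALUES
        (1, 1, 'Completed', 10), (2, 1, 'Aborted', 20),
        (3, 2, 'Aborted', 15), (4, 2, 'Aborted', 25);
""")

# Correlated subquery: the inner SELECT depends on the outer row.
correlated = conn.execute("""
    SELECT r.system_id, r.status FROM recipe r
    WHERE r.finish_time = (SELECT MAX(r2.finish_time) FROM recipe r2
                           WHERE r2.system_id = r.system_id)
""").fetchall()

# Same result: uncorrelated aggregate computed once, joined back.
joined = conn.execute("""
    SELECT r.system_id, r.status
    FROM recipe r
    JOIN (SELECT system_id, MAX(finish_time) AS ft
          FROM recipe GROUP BY system_id) latest
      ON latest.system_id = r.system_id AND latest.ft = r.finish_time
""").fetchall()

assert sorted(correlated) == sorted(joined)
```

Both queries find each system's most recent recipe; the join form is the one that avoids per-row re-evaluation.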