Bug 1568224 - beakerd fails to update dirty job when sending mail notification: AttributeError: 'NoneType' object has no attribute 'distro_name'
Summary: beakerd fails to update dirty job when sending mail notification: AttributeE...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Community
Component: general
Version: 25
Hardware: Unspecified
OS: Unspecified
urgent
unspecified vote
Target Milestone: 25.3
Assignee: Roman Joost
QA Contact: tools-bugs
URL:
Whiteboard:
: 1571857 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-17 02:04 UTC by Dan Callaghan
Modified: 2018-05-18 00:51 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-18 00:51:09 UTC


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Beaker Project Gerrit 6071 None None None 2018-04-17 06:20:37 UTC
Beaker Project Gerrit 6073 None None None 2018-04-17 07:08:05 UTC
Beaker Project Gerrit 6080 None None None 2018-04-23 03:20:19 UTC
Beaker Project Gerrit 6081 None None None 2018-04-24 00:11:40 UTC

Description Dan Callaghan 2018-04-17 02:04:44 UTC
Description of problem:


Version-Release number of selected component (if applicable):
25.1

How reproducible:
unsure

Steps to Reproduce:
unsure

Actual results:
bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732)
 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 98, in update_dirty_jobs
     update_dirty_job(job_id)
   File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 113, in update_dirty_job
     job.update_status()
   File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 1259, in update_status
     self._update_status()
   File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 1298, in _update_status
     mail.job_notify(self)
   File "/usr/lib/python2.6/site-packages/bkr/server/mail.py", line 68, in job_notify
     body=failed_recipes(job),
   File "/usr/lib/python2.6/site-packages/bkr/server/mail.py", line 42, in failed_recipes
     distro_name = recipe.installation.distro_name if recipe.installation.distro_name else \
 AttributeError: 'NoneType' object has no attribute 'distro_name'

Expected results:
no crash

Additional info:
Very similar to other regressions found on bug 911515 relating to recipes which were started before the upgrade.

Comment 1 Dan Callaghan 2018-04-17 02:08:52 UTC
Other affected jobs:
Apr 17 01:59:53 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732)
Apr 17 02:00:46 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728)
Apr 17 02:02:29 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732)
Apr 17 02:03:00 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728)
Apr 17 02:05:01 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429344)
Apr 17 02:06:27 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732)
Apr 17 02:06:35 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429882)
Apr 17 02:06:57 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429878)
Apr 17 02:07:10 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728)
Apr 17 02:08:03 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429344)
Apr 17 02:08:06 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429878)
Apr 17 02:08:24 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732)
Apr 17 02:08:26 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429882)

and possibly more.

Comment 2 Dan Callaghan 2018-04-17 05:43:29 UTC
In the case of job 2429732 the problematic recipe set appears to be RS:4058942 with R:5047182 Waiting and R:5047183 Scheduled. R:5047182has an installation row but R:5047183 does not.

I think the problem is in this commit, which was for a similar problem:

https://gerrit.beaker-project.org/c/5970/

we should have included the status of 'Scheduled' in the INSERT query.

In Beaker 24.x, the installation row was inserted in the recipe.provision() method which is called by beakerd in the provision_scheduled_recipeset() routine, which only happens when all recipes in the set have been Scheduled.

Comment 3 Dan Callaghan 2018-04-17 05:48:36 UTC
Incidentally, R:5047182 itself was aborted by beakerd with this error, presumably the same cause?

 Failed to provision recipeid 5047182, 'NoneType' object has no attribute 'osmajor'

Comment 4 Dan Callaghan 2018-04-17 05:49:53 UTC
(In reply to Dan Callaghan from comment #2)
> R:5047182has an installation row but R:5047183 does not.

Sorry I gave these ids the wrong way around. R:5047182 has *no* installation row, which is why beakerd failed to provision it *and* is now failing to send the mail for it.

Comment 5 Roman Joost 2018-04-17 06:20:38 UTC
Fixes migration to Beaker 25 if recipes are in a scheduled state.

Comment 6 Roman Joost 2018-04-17 07:08:06 UTC
Adds migration for 25.1 -> 25.2

Comment 7 Roman Joost 2018-04-19 03:51:01 UTC
Released with 25.2: https://beaker-project.org/docs/whats-new/release-25.html#beaker-25-2

Due to their importance and impact on systems in use, we've decided to skip an expensive self test, rely on our automated tests and get them deployed ASAP.

Comment 9 Dan Callaghan 2018-04-19 06:54:43 UTC
I believed this is the set of all job ids still affected, after this fix was applied:

Apr 19 06:43:25 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429344)
Apr 19 06:44:30 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2428440)
Apr 19 06:45:51 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732)
Apr 19 06:46:18 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728)
Apr 19 06:46:25 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429261)
Apr 19 06:47:41 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429873)
Apr 19 06:48:09 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429563)
Apr 19 06:48:32 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429018)

For comparison, these are the ones which were failing before the online migration was run (a slightly larger set):

Apr 19 06:15:02 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429344)
Apr 19 06:15:41 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2428440)
Apr 19 06:17:15 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732)
Apr 19 06:17:17 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2411598)
Apr 19 06:17:29 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429821)
Apr 19 06:17:43 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728)
Apr 19 06:17:52 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429261)
Apr 19 06:18:42 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429873)
Apr 19 06:18:44 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429882)
Apr 19 06:19:07 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429563)
Apr 19 06:19:12 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429822)
Apr 19 06:19:21 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429878)
Apr 19 06:19:27 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429018)
Apr 19 06:19:29 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2414552)

Comment 10 Roman Joost 2018-04-20 00:08:08 UTC
Reopening this bug to look into it further.

Comment 11 Roman Joost 2018-04-20 00:20:23 UTC
The attached fix was a stab based on the hypothesis that the migration simply missed recipes in a 'Scheduled' state, which for some was true, but not for all.

Taking J:2429018 for example, the recipe 5045756 was scheduled by beakerd:

Apr 16 13:45:13 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Counting candidate systems for recipe 5045756
Apr 16 13:45:13 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Computing all candidate systems for recipe 5045756
Apr 16 13:45:13 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5045756 matches one system, bumping priority
Apr 16 13:45:13 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5045756 moved from New to Processed
Apr 16 13:51:12 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe: 5045756 moved from Processed to Queued
Apr 16 21:19:11 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Selecting a system for recipe 5045756
Apr 16 21:19:11 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG System : rdma-dev-09.lab.bos.redhat.com is available for Recipe 5045756
Apr 16 21:19:11 beaker-server beakerd[45907]: bkr.server.model.scheduler DEBUG Reserving system rdma-dev-09.lab.bos.redhat.com for recipe 5045756
Apr 16 21:19:14 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Created watchdog for recipe id: 5045756 and system: rdma-dev-09.lab.bos.redhat.com
Apr 16 21:19:14 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5045756 moved from Queued to Scheduled

It seems that it never was provisioned. Also note the long delays between moving into the different states.

Looking at the current data, there is no installation row created for it, even though the online migration step should have created one.

Comment 12 Dan Callaghan 2018-04-20 01:20:45 UTC
The actual problem with 5045756 is that it hit this exception during provision_scheduled_recipeset():

Apr 17 17:53:01 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5045755 in RS:4057849
Apr 17 17:53:07 beaker-server beakerd[65774]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=3365L, service=u'Scheduler', field=u'Distro Tree', action=u'Provision', old=u'', new=u'RHEL-7.5-20180406.n.0 Server x86_64', user=jenkins/infiniband-qe
Apr 17 17:53:07 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5045756 in RS:4057849
Apr 17 17:53:07 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Failed to provision recipeid 5045756
Apr 17 17:53:07 beaker-server beakerd[65774]:  Traceback (most recent call last):
Apr 17 17:53:07 beaker-server beakerd[65774]:    File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 561, in provision_scheduled_recipeset
Apr 17 17:53:07 beaker-server beakerd[65774]:      recipe.provision()
Apr 17 17:53:07 beaker-server beakerd[65774]:    File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 2551, in provision
Apr 17 17:53:07 beaker-server beakerd[65774]:      install_options = self.reduced_install_options()
Apr 17 17:53:07 beaker-server beakerd[65774]:    File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 2534, in reduced_install_options
Apr 17 17:53:07 beaker-server beakerd[65774]:      self.installation.osmajor,
Apr 17 17:53:07 beaker-server beakerd[65774]:  AttributeError: 'NoneType' object has no attribute 'osmajor'

which is the same exception noted above in comment 3. However because of a defect in the exception handling for provision_scheduled_recipeset() we *eat* that exception *and then commit the transaction anyway*. Thus the recipe is left in "halfway" provisioned state, which is now what beakerd is choking on in update_dirty_jobs().

It would be worth verifying that each other affected job was in the same situation, but I bet they were.

The solution here is thus twofold: figure out exactly what his "halfway" provisioned state looks like, devise a query which matches such recipes, and write a data migration to fix them up in some way.

More importantly, we also need to fix the exception handling codepath in schedule_provisioned_recipe():

        except Exception, e:
            log.exception("Failed to provision recipeid %s", recipe.id)
            recipe.recipeset.abort(u"Failed to provision recipeid %s, %s"
                    % (recipe.id, e))
            return
            ^^^ BADNESS IS HERE

We need the more complicated structure where we *first* roll the whole loop iteration back, and then in a *new* transaction we abort all recipes in the set, and then commit that. There is a similar thing in provision_virt_recipe() already.

Comment 13 Roman Joost 2018-04-20 02:56:55 UTC
Yeah so looking in the logs. Recipes which exert the same problem are:

J:2429261
5046220: (no installation row)
beakerd.log.3.gz:Apr 17 02:42:37 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5046220 in RS:4058213                                                                                                       beakerd.log.3.gz:Apr 17 02:42:37 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Failed to provision recipeid 5046220

while 5046221 seems to be fine (has installation row).


J:2429728
5047175: (no installation row)
beakerd.log.3.gz:Apr 17 01:49:49 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5047175 in RS:4058936                                                                                                       beakerd.log.3.gz:Apr 17 01:49:49 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Failed to provision recipeid 5047175

while R:5047174 is fine.


J:2429732
5047182: (no installation row)
beakerd.log.3.gz:Apr 17 01:49:50 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5047182 in RS:4058942
beakerd.log.3.gz:Apr 17 01:49:50 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Failed to provision recipeid 5047182

while R:5047183 is fine.


A different edge case here which has one recipe without an installation row, while other recipes in the job installed fine and even completed:

J:2428440
5044481: (no installation row)
beakerd.log.3.gz:Apr 16 08:53:48 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Counting candidate systems for recipe 5044481                                                                                                   beakerd.log.3.gz:Apr 16 08:53:48 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5044481 moved from New to Aborted 

The recipe went to Aborted because it didn't match any systems. But our migration would never migrate aborted recipes and thus the problem.

J:2429344
5046377 same problem. Went from New to Aborted and no installation row.

J:2429873
5047488 same problem. Went from New to Aborted and no installation row.

and similar:

J:2429563
R:5046806 no installation row
beakerd.log.3.gz:Apr 16 19:48:05 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5046806 moved from Queued to Scheduled
but then was Cancelled and no migration script ever migrated Cancelled recipes.

Comment 14 Roman Joost 2018-04-23 03:20:19 UTC
Additional online migration to fix both cases as documented in comment 13. Still need to write a reproducer and fix for beakerd from comment 12.

Comment 15 Roman Joost 2018-04-24 00:11:41 UTC
.. and the additional fix for beakerd for comment 12.

Comment 16 Roman Joost 2018-04-25 22:39:06 UTC
The migration in Comment 14 should now target the remaining incomplete jobs. Overall the result are 10 rows (7 jobs):

  jobid  | recipeid 
---------+----------
 2333112 |  4854521
 2333112 |  4854522
 2333112 |  4854523
 2429848 |  5047439
 2428440 |  5044481
 2429344 |  5046377
 2429563 |  5046806
 2429865 |  5047463
 2429865 |  5047464
 2429873 |  5047488

Comment 17 Roman Joost 2018-04-25 22:55:16 UTC
*** Bug 1571857 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.