Description of problem: Version-Release number of selected component (if applicable): 25.1 How reproducible: unsure Steps to Reproduce: unsure Actual results: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732) Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 98, in update_dirty_jobs update_dirty_job(job_id) File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 113, in update_dirty_job job.update_status() File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 1259, in update_status self._update_status() File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 1298, in _update_status mail.job_notify(self) File "/usr/lib/python2.6/site-packages/bkr/server/mail.py", line 68, in job_notify body=failed_recipes(job), File "/usr/lib/python2.6/site-packages/bkr/server/mail.py", line 42, in failed_recipes distro_name = recipe.installation.distro_name if recipe.installation.distro_name else \ AttributeError: 'NoneType' object has no attribute 'distro_name' Expected results: no crash Additional info: Very similar to other regressions found on bug 911515 relating to recipes which were started before the upgrade.
Other affected jobs: Apr 17 01:59:53 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732) Apr 17 02:00:46 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728) Apr 17 02:02:29 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732) Apr 17 02:03:00 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728) Apr 17 02:05:01 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429344) Apr 17 02:06:27 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732) Apr 17 02:06:35 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429882) Apr 17 02:06:57 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429878) Apr 17 02:07:10 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728) Apr 17 02:08:03 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429344) Apr 17 02:08:06 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429878) Apr 17 02:08:24 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732) Apr 17 02:08:26 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429882) and possibly more.
In the case of job 2429732 the problematic recipe set appears to be RS:4058942 with R:5047182 Waiting and R:5047183 Scheduled. R:5047182has an installation row but R:5047183 does not. I think the problem is in this commit, which was for a similar problem: https://gerrit.beaker-project.org/c/5970/ we should have included the status of 'Scheduled' in the INSERT query. In Beaker 24.x, the installation row was inserted in the recipe.provision() method which is called by beakerd in the provision_scheduled_recipeset() routine, which only happens when all recipes in the set have been Scheduled.
Incidentally, R:5047182 itself was aborted by beakerd with this error, presumably the same cause? Failed to provision recipeid 5047182, 'NoneType' object has no attribute 'osmajor'
(In reply to Dan Callaghan from comment #2) > R:5047182has an installation row but R:5047183 does not. Sorry I gave these ids the wrong way around. R:5047182 has *no* installation row, which is why beakerd failed to provision it *and* is now failing to send the mail for it.
Fixes migration to Beaker 25 if recipes are in a scheduled state.
Adds migration for 25.1 -> 25.2
Released with 25.2: https://beaker-project.org/docs/whats-new/release-25.html#beaker-25-2 Due to their importance and impact on systems in use, we've decided to skip an expensive self test, rely on our automated tests and get them deployed ASAP.
I believed this is the set of all job ids still affected, after this fix was applied: Apr 19 06:43:25 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429344) Apr 19 06:44:30 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2428440) Apr 19 06:45:51 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732) Apr 19 06:46:18 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728) Apr 19 06:46:25 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429261) Apr 19 06:47:41 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429873) Apr 19 06:48:09 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429563) Apr 19 06:48:32 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429018) For comparison, these are the ones which were failing before the online migration was run (a slightly larger set): Apr 19 06:15:02 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429344) Apr 19 06:15:41 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2428440) Apr 19 06:17:15 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429732) Apr 19 06:17:17 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2411598) Apr 19 06:17:29 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429821) Apr 19 06:17:43 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429728) Apr 19 06:17:52 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429261) Apr 19 06:18:42 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429873) Apr 19 06:18:44 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429882) Apr 19 06:19:07 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429563) Apr 19 06:19:12 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429822) Apr 19 06:19:21 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429878) Apr 19 06:19:27 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2429018) Apr 19 06:19:29 beaker-server beakerd[120450]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(2414552)
Reopening this bug to look into it further.
The attached fix was a stab based on the hypothesis that the migration simply missed recipes in a 'Scheduled' state, which for some was true, but not for all. Taking J:2429018 for example, the recipe 5045756 was scheduled by beakerd: Apr 16 13:45:13 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Counting candidate systems for recipe 5045756 Apr 16 13:45:13 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Computing all candidate systems for recipe 5045756 Apr 16 13:45:13 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5045756 matches one system, bumping priority Apr 16 13:45:13 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5045756 moved from New to Processed Apr 16 13:51:12 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe: 5045756 moved from Processed to Queued Apr 16 21:19:11 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Selecting a system for recipe 5045756 Apr 16 21:19:11 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG System : rdma-dev-09.lab.bos.redhat.com is available for Recipe 5045756 Apr 16 21:19:11 beaker-server beakerd[45907]: bkr.server.model.scheduler DEBUG Reserving system rdma-dev-09.lab.bos.redhat.com for recipe 5045756 Apr 16 21:19:14 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Created watchdog for recipe id: 5045756 and system: rdma-dev-09.lab.bos.redhat.com Apr 16 21:19:14 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5045756 moved from Queued to Scheduled It seems that it never was provisioned. Also note the long delays between moving into the different states. Looking at the current data, there is no installation row created for it, even though the online migration step should have created one.
The actual problem with 5045756 is that it hit this exception during provision_scheduled_recipeset(): Apr 17 17:53:01 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5045755 in RS:4057849 Apr 17 17:53:07 beaker-server beakerd[65774]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=3365L, service=u'Scheduler', field=u'Distro Tree', action=u'Provision', old=u'', new=u'RHEL-7.5-20180406.n.0 Server x86_64', user=jenkins/infiniband-qe Apr 17 17:53:07 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5045756 in RS:4057849 Apr 17 17:53:07 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Failed to provision recipeid 5045756 Apr 17 17:53:07 beaker-server beakerd[65774]: Traceback (most recent call last): Apr 17 17:53:07 beaker-server beakerd[65774]: File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 561, in provision_scheduled_recipeset Apr 17 17:53:07 beaker-server beakerd[65774]: recipe.provision() Apr 17 17:53:07 beaker-server beakerd[65774]: File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 2551, in provision Apr 17 17:53:07 beaker-server beakerd[65774]: install_options = self.reduced_install_options() Apr 17 17:53:07 beaker-server beakerd[65774]: File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 2534, in reduced_install_options Apr 17 17:53:07 beaker-server beakerd[65774]: self.installation.osmajor, Apr 17 17:53:07 beaker-server beakerd[65774]: AttributeError: 'NoneType' object has no attribute 'osmajor' which is the same exception noted above in comment 3. However because of a defect in the exception handling for provision_scheduled_recipeset() we *eat* that exception *and then commit the transaction anyway*. Thus the recipe is left in "halfway" provisioned state, which is now what beakerd is choking on in update_dirty_jobs(). It would be worth verifying that each other affected job was in the same situation, but I bet they were. The solution here is thus twofold: figure out exactly what his "halfway" provisioned state looks like, devise a query which matches such recipes, and write a data migration to fix them up in some way. More importantly, we also need to fix the exception handling codepath in schedule_provisioned_recipe(): except Exception, e: log.exception("Failed to provision recipeid %s", recipe.id) recipe.recipeset.abort(u"Failed to provision recipeid %s, %s" % (recipe.id, e)) return ^^^ BADNESS IS HERE We need the more complicated structure where we *first* roll the whole loop iteration back, and then in a *new* transaction we abort all recipes in the set, and then commit that. There is a similar thing in provision_virt_recipe() already.
Yeah so looking in the logs. Recipes which exert the same problem are: J:2429261 5046220: (no installation row) beakerd.log.3.gz:Apr 17 02:42:37 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5046220 in RS:4058213 beakerd.log.3.gz:Apr 17 02:42:37 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Failed to provision recipeid 5046220 while 5046221 seems to be fine (has installation row). J:2429728 5047175: (no installation row) beakerd.log.3.gz:Apr 17 01:49:49 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5047175 in RS:4058936 beakerd.log.3.gz:Apr 17 01:49:49 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Failed to provision recipeid 5047175 while R:5047174 is fine. J:2429732 5047182: (no installation row) beakerd.log.3.gz:Apr 17 01:49:50 beaker-server beakerd[65774]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5047182 in RS:4058942 beakerd.log.3.gz:Apr 17 01:49:50 beaker-server beakerd[65774]: bkr.server.tools.beakerd ERROR Failed to provision recipeid 5047182 while R:5047183 is fine. A different edge case here which has one recipe without an installation row, while other recipes in the job installed fine and even completed: J:2428440 5044481: (no installation row) beakerd.log.3.gz:Apr 16 08:53:48 beaker-server beakerd[45907]: bkr.server.tools.beakerd DEBUG Counting candidate systems for recipe 5044481 beakerd.log.3.gz:Apr 16 08:53:48 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5044481 moved from New to Aborted The recipe went to Aborted because it didn't match any systems. But our migration would never migrate aborted recipes and thus the problem. J:2429344 5046377 same problem. Went from New to Aborted and no installation row. J:2429873 5047488 same problem. Went from New to Aborted and no installation row. and similar: J:2429563 R:5046806 no installation row beakerd.log.3.gz:Apr 16 19:48:05 beaker-server beakerd[45907]: bkr.server.tools.beakerd INFO recipe ID 5046806 moved from Queued to Scheduled but then was Cancelled and no migration script ever migrated Cancelled recipes.
Additional online migration to fix both cases as documented in comment 13. Still need to write a reproducer and fix for beakerd from comment 12.
.. and the additional fix for beakerd for comment 12.
The migration in Comment 14 should now target the remaining incomplete jobs. Overall the result are 10 rows (7 jobs): jobid | recipeid ---------+---------- 2333112 | 4854521 2333112 | 4854522 2333112 | 4854523 2429848 | 5047439 2428440 | 5044481 2429344 | 5046377 2429563 | 5046806 2429865 | 5047463 2429865 | 5047464 2429873 | 5047488
*** Bug 1571857 has been marked as a duplicate of this bug. ***