1142532 – clear_running_commands XML-RPC call fails with MemoryError

Bug 1142532 - clear_running_commands XML-RPC call fails with MemoryError

Summary: clear_running_commands XML-RPC call fails with MemoryError

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	lab controller
Sub Component:
Version:	0.18
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	19.0
Assignee:	Dan Callaghan
QA Contact:	tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-09-17 01:17 UTC by Dan Callaghan
Modified:	2018-02-06 00:41 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-11-25 07:18:41 UTC
Embargoed:

Attachments	(Terms of Use)

Description Dan Callaghan 2014-09-17 01:17:14 UTC

clear_running_commands (which is executed on beaker-provision startup, to clear any commands that were left running from a previous unclean termination) can fail with a MemoryError like this:

cherrypy.msg INFO HTTP: Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/CherryPy-2.3.0-py2.6.egg/cherrypy/_cphttptools.py", line 121, in _run
     self.main()
   File "/usr/lib/python2.6/site-packages/CherryPy-2.3.0-py2.6.egg/cherrypy/_cphttptools.py", line 264, in main
     body = page_handler(*virtual_path, **self.params)
   File "<string>", line 3, in RPC2
   File "/usr/lib/python2.6/site-packages/turbogears/controllers.py", line 361, in expose
     *args, **kw)
   File "/usr/lib/python2.6/site-packages/bkr/server/wsgi.py", line 60, in run_with_transaction_noop
     return func(*args, **kwargs)
   File "<generated code>", line 0, in _expose
   File "/usr/lib/python2.6/site-packages/peak/rules/core.py", line 153, in __call__
     return self.body(*args, **kw)
   File "/usr/lib/python2.6/site-packages/turbogears/controllers.py", line 390, in <lambda>
     fragment, options, args, kw)))
   File "/usr/lib/python2.6/site-packages/turbogears/controllers.py", line 425, in _execute_func
     output = errorhandling.try_call(func, *args, **kw)
   File "/usr/lib/python2.6/site-packages/turbogears/errorhandling.py", line 77, in try_call
     return func(self, *args, **kw)
   File "/usr/lib/python2.6/site-packages/bkr/server/xmlrpccontroller.py", line 67, in RPC2
     session.rollback()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/scoping.py", line 139, in do
     return getattr(self.registry(), name)(*args, **kwargs)
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 583, in rollback
     self.transaction.rollback()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 411, in rollback
     transaction._rollback_impl()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 427, in _rollback_impl
     self._restore_snapshot()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 295, in _restore_snapshot
     for s in set(self._new).union(self.session._new):
 MemoryError: <bound method Root.RPC2 of <bkr.server.controllers.Root object at 0x7f1ab77a0b50>>

In this case it was attempting to clear 40 running commands. They were left running because the hypervisor for the system where beaker-provision was running crashed.

This is a recurrence of bug 974319. In that bug it was mentioned that aborting commands would be done with a separate transaction for each command (I guess to work around large memory consumption due to the SQLAlchemy object graph) but it looks like that was never actually done.

We really need to adjust the clear_running_commands code to avoid pulling in a large object graph for each command. And we really need a test proving this is actually fixed.

Comment 1 Dan Callaghan 2014-10-15 05:44:29 UTC

When I was working on this patch <https://git.beaker-project.org/cgit/beaker/commit/?id=cbb8ecd2c24086bde7ede40e2a7e613a1c4b93e6> I realised that the large object graph causing this MemoryError was probably due to system activity records. Mainly because that is the only thing (aside from the commands themselves) that that code path even touches from the db.

Some of the older regularly-used systems in our production db have 10-20,000 activity records, and the current version of the code would fetch them all for every system which had a stale command.

So I patched the tests to simulate that situation:

--- a/IntegrationTests/src/bkr/inttest/server/selenium/test_labcontrollers.py
+++ b/IntegrationTests/src/bkr/inttest/server/selenium/test_labcontrollers.py
@@ -455,11 +455,16 @@ def test_obeys_max_running_commands_limit(self):
 
     def test_clear_running_commands(self):
         with session.begin():
-            system = data_setup.create_system(lab_controller=self.lc)
-            command = CommandActivity(
-                    user=None, service=u'testdata', action=u'on',
-                    status=CommandStatus.running)
-            system.command_queue.append(command)
+            for _ in range(20):
+                system = data_setup.create_system(lab_controller=self.lc)
+                for i in range(10000):
+                    system.record_activity(user=system.owner, service=u'testdata',
+                            action=u'Changed', field=u'Loaned To',
+                            old=u'someone%s' % i, new=u'someone else%s' % i)
+                command = CommandActivity(
+                        user=None, service=u'testdata', action=u'on',
+                        status=CommandStatus.running)
+                system.command_queue.append(command)
             other_system = data_setup.create_system()
             other_command = CommandActivity(
                     user=None, service=u'testdata', action=u'on',

and it triggers the MemoryError on 0.18, when rlimit_as=1000000000 is set. On develop (with the above patch) the problem is gone, the request succeeds without MemoryError.

I won't commit the test itself because it chews a lot of memory and takes a long time to set up. But we can consider this fixed in 19.

Comment 4 Dan Callaghan 2014-11-20 00:52:16 UTC

Marking this as VERIFIED now that 19.0rc2 acceptance testing is complete.

Comment 5 Dan Callaghan 2014-11-25 07:18:41 UTC

Beaker 19.0 has been released.

Note You need to log in before you can comment on or make changes to this bug.