Bug 1120439 - MemoryError during /csv/action_export causes all subsequent HTTP requests to fail
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: general
Version: 0.17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: 0.17.3
Assignee: Dan Callaghan
QA Contact: tools-bugs
 
Reported: 2014-07-16 23:49 UTC by Dan Callaghan
Modified: 2018-02-06 00:41 UTC

Last Closed: 2014-08-14 04:50:32 UTC



Description Dan Callaghan 2014-07-16 23:49:31 UTC
Beaker hit a MemoryError while handling /csv/action_export, most likely due to the size of the result set. (We may need to file a separate bug about making that code more efficient; it should be able to run without exhausting our heap limit.)

bkr.server ERROR Exception on /csv/action_export [GET]
 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/flask/app.py", line 1817, in wsgi_app
     response = self.full_dispatch_request()
   File "/usr/lib/python2.6/site-packages/flask/app.py", line 1479, in full_dispatch_request
     response = self.process_response(response)
   File "/usr/lib/python2.6/site-packages/flask/app.py", line 1691, in process_response
     response = handler(response)
   File "/usr/lib/python2.6/site-packages/bkr/server/wsgi.py", line 113, in commit_or_rollback_session
     session.rollback()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/scoping.py", line 139, in do
     return getattr(self.registry(), name)(*args, **kwargs)
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 583, in rollback
     self.transaction.rollback()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 411, in rollback
     transaction._rollback_impl()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 427, in _rollback_impl
     self._restore_snapshot()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 306, in _restore_snapshot
     for s in self.session.identity_map.all_states():
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/identity.py", line 197, in all_states
     return dict.values(self)
 MemoryError: <bound method CSV.action_export of <bkr.server.CSV_import_export.CSV object at 0x7f38b3778c50>>

The more serious problem is that, after this exception, every HTTP request to that worker process failed because it never successfully closed the SQLAlchemy session:

bkr.server.wsgi WARNING Session active when tearing down app context, rolling back
bkr.server.wsgi ERROR Error closing session when tearing down app context
 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/bkr/server/wsgi.py", line 121, in close_session
     session.rollback()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/scoping.py", line 139, in do
     return getattr(self.registry(), name)(*args, **kwargs)
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 583, in rollback
     self.transaction.rollback()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 411, in rollback
     transaction._rollback_impl()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 427, in _rollback_impl
     self._restore_snapshot()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 306, in _restore_snapshot
     for s in self.session.identity_map.all_states():
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/identity.py", line 197, in all_states
     return dict.values(self)
 MemoryError: <bound method CSV.action_export of <bkr.server.CSV_import_export.CSV object at 0x7f38b3778c50>>

The Flask handler for closing the session needs to be more robust -- it needs to close the session and transaction under all circumstances, or crash the entire worker process (at least then it would be restarted by mod_wsgi instead of continuing to fail subsequent requests).

It may be enough to do finally: session.close(), or even to replace session.rollback() with session.close() entirely. But MemoryError is a tricky case: we need to recover or die *without* allocating anything new...

Comment 3 Nick Coghlan 2014-07-21 05:36:07 UTC
If the MemoryError was due to some large allocation failing, or enough of the stack has been unwound, then allocations may actually work.

Comment 4 Dan Callaghan 2014-07-22 23:20:06 UTC
This is quite a tricky one to reproduce in the gunicorn dev server. I was trying to fetch the systems CSV while gradually reducing rlimit_as until it no longer succeeded. The problem was that below 660000000, rather than hitting a MemoryError in Python land, the worker would abort with this bizarre message:

libgcc_s.so.1 must be installed for pthread_cancel to work

That turned out to be because the MySQL client libraries do some pthread hackery which involves spawning a new thread that calls pthread_exit() on itself:

http://osxr.org/mysql/source/mysys/my_thr_init.c#0054

But in order to implement stack unwinding in pthread_exit() glibc also has some hackery which dlopen's libgcc_s.so in order to use GCC's stack unwinding machinery:

https://sourceware.org/ml/libc-help/2009-10/msg00023.html

But the dlopen() was failing with ENOMEM because rlimit_as was already exceeded.

Anyway, it turns out I could reproduce the MemoryError by exporting the system key-values CSV with rlimit_as 700000000, I guess because the system key-values CSV is much larger and involves loading more stuff into the SQLAlchemy session. (This is using a production db dump.)
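The rlimit_as trick above can be demonstrated in a few lines of Python on Linux, assuming the interpreter's current footprint is below the cap; the 700000000-byte value is taken from the comment above:

```python
import resource

# Cap the process address space so a large allocation raises
# MemoryError instead of the OS paging or the allocation succeeding.
LIMIT = 700000000  # bytes; value from the reproduction above
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (LIMIT, hard))

reproduced = False
try:
    big = bytearray(2 * LIMIT)  # deliberately exceeds the cap
except MemoryError:
    reproduced = True
print("MemoryError reproduced:", reproduced)
```

Note that RLIMIT_AS limits total virtual address space, not resident memory, which is why native code such as dlopen() can also fail with ENOMEM once the cap is exceeded.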

Comment 5 Dan Callaghan 2014-07-22 23:58:12 UTC
On Gerrit: http://gerrit.beaker-project.org/3216

Comment 7 Dan Callaghan 2014-08-07 01:55:15 UTC
This bug will stay at ON_QA until 0.17.3 passes smoke testing. We decided that independently verifying the fix was not feasible given how difficult it is to reproduce the exact failure scenario.

While writing the patch I did verify on my development VM with a production DB dump that the worker process now aborts if session.close() fails due to MemoryError.

Comment 8 Amit Saha 2014-08-14 04:50:32 UTC
Beaker 0.17.3 has been released (https://beaker-project.org/docs/whats-new/release-0.17.html#beaker-0-17-3)
