Description of problem:
During automated kernel testing, a task within a recipe has successful results yet it hits EWD (External Watchdog) due to:
"<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>"

Version-Release number of selected component (if applicable):
Beaker: Version 0.16.2

How reproducible:
unknown

Actual results:
https://beaker.engineering.redhat.com/recipes/1350778
https://beaker.engineering.redhat.com/recipes/1350780
http://lab-02.rhts.eng.bos.redhat.com/beaker/logs/recipes/1350+/1350780/console.log

-----<-snip->-----
2014-05-13 11:15:13,063 backend simple_recipe: ERROR Encoutnered problem while running task '21170382'.
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/beah/backends/beakerlc.py", line 518, in simple_recipe
    result = task.getResult()
  File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 584, in getResult
    self.result.raiseException()
  File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 605, in _deferGenerator
    result = g.next()
  File "/usr/lib/python2.6/site-packages/beah/backends/beakerlc.py", line 599, in _run_task
    thingy.getResult()
  File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 584, in getResult
    self.result.raiseException()
  File "/usr/lib64/python2.6/site-packages/twisted/web/xmlrpc.py", line 307, in parseResponse
    response = xmlrpclib.loads(contents)[0][0]
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1184, in loads
    return u.close(), u.getmethodname()
  File "/usr/lib64/python2.6/xmlrpclib.py", line 838, in close
    raise Fault(**self._stack[0])
Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>">
2014-05-13 11:15:13,063 backend simple_recipe: ERROR Encoutnered problem while running task '21170382'.
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/beah/backends/beakerlc.py", line 518, in simple_recipe
    result = task.getResult()
  File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 584, in getResult
    self.result.raiseException()
  File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 605, in _deferGenerator
    result = g.next()
  File "/usr/lib/python2.6/site-packages/beah/backends/beakerlc.py", line 599, in _run_task
    thingy.getResult()
  File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 584, in getResult
    self.result.raiseException()
  File "/usr/lib64/python2.6/site-packages/twisted/web/xmlrpc.py", line 307, in parseResponse
    response = xmlrpclib.loads(contents)[0][0]
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1184, in loads
    return u.close(), u.getmethodname()
  File "/usr/lib64/python2.6/xmlrpclib.py", line 838, in close
    raise Fault(**self._stack[0])
Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>">
-----<-snip->-----

Expected results:
Task results are reported and testing continues.

Additional info:
This issue was seen while testing on an RHEL6 x86_64 KVM guest.
This looks similar to BZ#969578, except I can't find a similarly obvious exception in the server-errors.log file.
The exception was:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python2.6/site-packages/flask/app.py", line 1479, in full_dispatch_request
    response = self.process_response(response)
  File "/usr/lib/python2.6/site-packages/flask/app.py", line 1691, in process_response
    response = handler(response)
  File "/usr/lib/python2.6/site-packages/bkr/server/wsgi.py", line 106, in commit_or_rollback_session
    session.commit()
[...]
OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s'
I'm inclined to chalk this up to the very poor performance of our overloaded database server, but I will investigate some more in case there is any possible way we could be holding a lock on job rows for too long.
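For anyone investigating this on the database side, here is a rough, hypothetical diagnostic sketch (not part of Beaker) for seeing which transaction is blocking which while a lock wait is in progress. It assumes MySQL 5.5+ or the InnoDB plugin (stock MySQL 5.1 only exposes this via SHOW ENGINE INNODB STATUS), and the connection URL is only a placeholder.

from sqlalchemy import create_engine, text

engine = create_engine('mysql://beaker:PASSWORD@localhost/beaker')  # placeholder credentials

LOCK_WAITS = text("""
    SELECT r.trx_mysql_thread_id AS waiting_thread,
           r.trx_query           AS waiting_query,
           b.trx_mysql_thread_id AS blocking_thread,
           b.trx_query           AS blocking_query,
           TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds
    FROM information_schema.innodb_lock_waits w
    JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
    JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
""")

def dump_lock_waits():
    # Run this in a loop while the "Lock wait timeout" errors are happening,
    # to see which statement is sitting on the job row.
    with engine.connect() as conn:
        for row in conn.execute(LOCK_WAITS):
            print('thread %s waited %ss behind thread %s: blocked %r, blocking %r' % (
                row.waiting_thread, row.wait_seconds, row.blocking_thread,
                row.waiting_query, row.blocking_query))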
*** Bug 1110460 has been marked as a duplicate of this bug. ***
*** Bug 1164291 has been marked as a duplicate of this bug. ***
*** Bug 1205867 has been marked as a duplicate of this bug. ***
*** Bug 1208820 has been marked as a duplicate of this bug. ***
*** Bug 1217114 has been marked as a duplicate of this bug. ***
All, Here is another instance with Beaker 20.1: https://beaker.engineering.redhat.com/recipes/1993450 http://lab-02.rhts.eng.bos.redhat.com/beaker/logs/recipes/1993+/1993450/console.log ---<-snip->--- 2015-05-31 19:08:09,572 backend _run_task: ERROR Exception raised when running task. Setting to finished Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/beah/backends/beakerlc.py", line 598, in _run_task thingy.getResult() File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 584, in getResult self.result.raiseException() File "/usr/lib64/python2.6/site-packages/twisted/web/xmlrpc.py", line 307, in parseResponse response = xmlrpclib.loads(contents)[0][0] File "/usr/lib64/python2.6/xmlrpclib.py", line 1184, in loads return u.close(), u.getmethodname() File "/usr/lib64/python2.6/xmlrpclib.py", line 838, in close raise Fault(**self._stack[0]) Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>"> 2015-05-31 19:08:09,666 backend simple_recipe: ERROR Encountered problem while running task '30960348'. Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/beah/backends/beakerlc.py", line 517, in simple_recipe result = task.getResult() File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 584, in getResult self.result.raiseException() File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 605, in _deferGenerator result = g.next() File "/usr/lib/python2.6/site-packages/beah/backends/beakerlc.py", line 598, in _run_task thingy.getResult() File "/usr/lib64/python2.6/site-packages/twisted/internet/defer.py", line 584, in getResult self.result.raiseException() File "/usr/lib64/python2.6/site-packages/twisted/web/xmlrpc.py", line 307, in parseResponse response = xmlrpclib.loads(contents)[0][0] File "/usr/lib64/python2.6/xmlrpclib.py", line 1184, in loads return u.close(), u.getmethodname() File "/usr/lib64/python2.6/xmlrpclib.py", line 838, in close raise Fault(**self._stack[0]) Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>"> ---<snip>--- Best, -pbunyan
All, Doing some Beaker BZ review for some issues I opened. I am closing this issue, as I have not seen it while testing with recent versions of Beaker. Looks like closing as "worksforme" is the best status choice. I will reopen if issue is reproduced. Best, -pbunyan
All, Here is a reproducer: Beaker 24.3 https://beaker.engineering.redhat.com/recipes/4174491#task60050305 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2017/08/20077/2007733/4174491/console.log ---<-snip->--- [ 6999.333020] beah-beaker-backend[3984]: 2017-08-16 23:50:43,195 backend __init__: INFO RPMTest kernel-kernel-vdso-sanity - /mnt/testarea/tmp4RNOQD.sh ['kernel-kernel-vdso-sanity'] [ 7002.193650] beah-beaker-backend[3984]: 2017-08-16 23:50:46,055 backend _send_cmd: INFO Command ['Command', 'run', '29ee0e8b-f42b-4456-bc41-bea820666c1e', {'args': ['kernel-kernel-vdso-sanity'], 'task_info': {'name': '/kernel/vdso/sanity', 'file': '/mnt/testarea/tmp4RNOQD.sh'}, 'env': {'RECIPETYPE': 'machine', 'BEAKER_REPOS': '', 'TESTID': '60050305', 'FAMILY': 'RedHatEnterpriseLinuxAlternateArchitectures7', 'RECIPEID': '4174491', 'RECIPE_MEMBERS': 'ibm-p8-kvm-01-guest-05.rhts.eng.brq.redhat.com', 'JOBID': '2007733', 'RECIPESETID': '3371742', 'GUESTS': '', 'SUBMITTER': 'hkrzesin', 'TESTRPMNAME': 'kernel-kernel-vdso-sanity', 'ARCH': 'ppc64le', 'TASKNAME': '/kernel/vdso/sanity', 'RECIPE_ROLE': 'None', 'STANDALONE': 'ibm-p8-kvm-01-guest-05.rhts.eng.brq.redhat.com', 'KILLTIME': '600', 'VARIANT': 'Server', 'LAB_CONTROLLER': '127.0.0.1', 'TESTORDER': '152', 'ROLE': 'STANDALONE', 'RECIPETESTID': '60050305', 'TASKID': '60050305', 'TEST': '/kernel/vdso/sanity', 'TESTPATH': '/mnt/tests/kernel/vdso/sanity', 'DISTRO': 'RHEL-ALT-7.4-20170808.0'}}] sent. [ 7002.226769] beah-srv[3983]: 2017-08-16 23:50:46,087 beah connectionMade: INFO TaskStdoutProtocol:connectionMade [ 7002.227263] beah-srv[3983]: 2017-08-16 23:50:46,088 beah task_started: INFO Task 29ee0e8b-f42b-4456-bc41-bea820666c1e has started. [ 7005.575913] beah-srv[3983]: 2017-08-16 23:50:49,437 beah buildProtocol: INFO TaskListener: Connected. Address: UNIXAddress('') [ 7005.667050] beah-srv[3983]: 2017-08-16 23:50:49,528 beah buildProtocol: INFO TaskListener: Connected. Address: UNIXAddress('') [ 7005.667548] beah-beaker-backend[3984]: 2017-08-16 23:50:49,529 backend async_proc: INFO Extending Watchdog for task 60050305 by 1200.. [ 7006.577881] beah-beaker-backend[3984]: 2017-08-16 23:50:50,440 backend _send_cmd: INFO Command ['Command', 'forward', '3c33a3ec-f5a3-47d1-92a1-89f8e313d021', {'event': ['Event', 'extend_watchdog', 'a04dc134-a19b-4649-a248-f07c1c6c35d4', {'source': 'beah_beaker_backend', 'id': '29ee0e8b-f42b-4456-bc41-bea820666c1e'}, None, {'timeout': 1194}]}] sent. [ 7032.959766] beah-srv[3983]: 2017-08-16 23:51:16,821 beah buildProtocol: INFO TaskListener: Connected. Address: UNIXAddress('') [ 7042.594848] beah-beaker-backend[3984]: 2017-08-16 23:51:26,457 backend async_proc: INFO Extending Watchdog for task 60050305 by 2400.. [ 7043.556190] beah-beaker-backend[3984]: 2017-08-16 23:51:27,418 backend _send_cmd: INFO Command ['Command', 'forward', 'fe913af2-6a7c-4303-a0ab-910c979f9f0c', {'event': ['Event', 'extend_watchdog', '7f8a6952-fd54-43a2-b99c-7f97f728a73d', {'source': 'beah_beaker_backend', 'id': '29ee0e8b-f42b-4456-bc41-bea820666c1e'}, None, {'timeout': 2394}]}] sent. [ 7078.179613] beah-srv[3983]: 2017-08-16 23:52:02,041 beah processExited: INFO TaskStdoutProtocol:processExited([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0. 
[ 7078.183942] beah-srv[3983]: ]) [ 7078.184151] beah-srv[3983]: 2017-08-16 23:52:02,041 beah processEnded: INFO TaskStdoutProtocol:processEnded([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0. [ 7078.184632] beah-srv[3983]: ]) [ 7078.435147] beah-srv[3983]: 2017-08-16 23:52:02,297 beah task_finished: INFO Task 29ee0e8b-f42b-4456-bc41-bea820666c1e has finished. Marking it as done. [ 7078.436437] beah-beaker-backend[3984]: 2017-08-16 23:52:02,298 backend async_proc: INFO Task 60050305 done. Submitting logs... [ 7080.030152] beah-beaker-backend[3984]: 2017-08-16 23:52:03,891 backend _send_cmd: INFO Command ['Command', 'forward', '473d739c-a058-4677-ad5c-a45b204418e4', {'event': ['Event', 'extend_watchdog', 'a373671e-543d-48c1-80ca-e8827f5a4d34', {'source': 'beah_beaker_backend', 'id': '29ee0e8b-f42b-4456-bc41-bea820666c1e'}, None, {'timeout': 14394}]}] sent. [ 7138.710276] beah-beaker-backend[3984]: 2017-08-16 23:53:02,389 backend _run_task: ERROR Exception raised when running task. Setting to finished [ 7138.712245] beah-beaker-backend[3984]: Traceback (most recent call last): [ 7138.712505] beah-beaker-backend[3984]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 598, in _run_task [ 7138.712822] beah-beaker-backend[3984]: thingy.getResult() [ 7138.713070] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 7138.713384] beah-beaker-backend[3984]: self.result.raiseException() [ 7138.713651] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/site-packages/twisted/web/xmlrpc.py", line 444, in parseResponse [ 7138.714037] beah-beaker-backend[3984]: response = xmlrpclib.loads(contents)[0][0] [ 7138.714310] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/xmlrpclib.py", line 1138, in loads [ 7138.714633] beah-beaker-backend[3984]: return u.close(), u.getmethodname() [ 7138.714920] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close [ 7138.715239] beah-beaker-backend[3984]: raise Fault(**self._stack[0]) [ 7138.715497] beah-beaker-backend[3984]: Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>"> [ 7138.715906] beah-beaker-backend[3984]: 2017-08-16 23:53:02,573 backend simple_recipe: ERROR Encountered problem while running task '60050305'. 
[ 7138.716267] beah-beaker-backend[3984]: Traceback (most recent call last): [ 7138.716557] beah-beaker-backend[3984]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 517, in simple_recipe [ 7138.716913] beah-beaker-backend[3984]: result = task.getResult() [ 7138.717186] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 7138.717536] beah-beaker-backend[3984]: self.result.raiseException() [ 7138.717830] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 891, in _deferGenerator [ 7138.718307] beah-beaker-backend[3984]: result = g.next() [ 7138.718577] beah-beaker-backend[3984]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 598, in _run_task [ 7138.718926] beah-beaker-backend[3984]: thingy.getResult() [ 7138.719189] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 7138.719547] beah-beaker-backend[3984]: self.result.raiseException() [ 7138.719879] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/site-packages/twisted/web/xmlrpc.py", line 444, in parseResponse [ 7138.720241] beah-beaker-backend[3984]: response = xmlrpclib.loads(contents)[0][0] [ 7138.720551] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/xmlrpclib.py", line 1138, in loads [ 7138.720904] beah-beaker-backend[3984]: return u.close(), u.getmethodname() [ 7138.721194] beah-beaker-backend[3984]: File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close [ 7138.721547] beah-beaker-backend[3984]: raise Fault(**self._stack[0]) [ 7138.721865] beah-beaker-backend[3984]: Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>"> ---<-snip->--- Best, -pbunyan
Thanks for the report Paul. I jumped onto it straight away, before the logs were rotated out, and I found that it is indeed the same "Lock wait timeout" error from MySQL:

Aug 16 21:53:02 beaker-server beaker-server[79336]: OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('\x8f"P5\xd18D\xea\xaf\xab\xb7\xd2L\xde\xde3', 2007733L)

I was confused for a little while because the log messages around XMLRPC request handling are a bit deceptive. Specifically, this failure happens in the SQLAlchemy session flush, which actually runs *after* all the XMLRPC request handling, including the log message which claims the request was handled, the exception handling, etc.

So we can fix the error handling path at least. This patch does nothing to help the lock wait timeouts, but it should ensure that when they occur the logs are clearer about the fact that the request was not handled successfully. It also fixes the timing info for successful requests to include all the database operations, which it seems are currently being excluded. And it should give us a more specific error back to the client as well, which will help.

https://gerrit.beaker-project.org/5794
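For readers not familiar with the wsgi.py hook, the following is only an illustrative sketch of the general idea (it is not the actual Gerrit change, and the engine/session objects are stand-ins): commit the session inside the response hook and, if the flush fails, report an explicit error instead of letting the exception escape after the "request handled" log message has already been written.

import logging

from flask import Flask, jsonify
from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError
from sqlalchemy.orm import scoped_session, sessionmaker

log = logging.getLogger(__name__)
app = Flask(__name__)
engine = create_engine('sqlite://')                   # stand-in for the real MySQL engine
session = scoped_session(sessionmaker(bind=engine))   # stand-in for Beaker's scoped session

@app.after_request
def commit_or_rollback_session(response):
    try:
        session.commit()
    except OperationalError as exc:
        session.rollback()
        # Log the failure *as a failure*, rather than after a misleading
        # success message, and tell the client what actually went wrong.
        log.exception('Error committing session after request')
        error = jsonify(message='Database error while committing request: %s' % exc)
        error.status_code = 503
        return error
    return response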
As far as what to do about the lock wait timeouts themselves... I am pretty much out of ideas at this point. :-/

The MySQL database does run on the same VM as Beaker itself, and is heavily loaded, but not *overloaded*. I had a look at all the metrics available in Zabbix around that timeframe (16 Aug 21:00-22:00 UTC) and nothing looks bad. And the VM is now running on pretty much the fastest hardware we've got.

It's possible that we are just starting to hit the limits of what MySQL can handle when everything is serialized through a single table, as the job.dirty_version handling is.
All, Here is a reproducer: Beaker 24.3 https://beaker.engineering.redhat.com/recipes/4218513#task60658775http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2017/08/20282/2028282/4218513/console.log ---<-snip->--- 'beah_beaker_backend', 'id': '2aab1183-320b-478b-b628-4b1a42443120'}, None, {'timeout': 14394}]}] sent. [-- MARK -- Wed Aug 30 20:25:00 2017] [ 310.510200] beah-beaker-backend[4981]: 2017-08-30 16:25:00,763 backend _run_task: ERROR Exception raised when running task. Setting to finished [ 310.510665] beah-beaker-backend[4981]: Traceback (most recent call last): [ 310.511013] beah-beaker-backend[4981]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 598, in _run_task [ 310.511435] beah-beaker-backend[4981]: thingy.getResult() [ 310.511762] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 310.512185] beah-beaker-backend[4981]: self.result.raiseException() [ 310.512514] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/site-packages/twisted/web/xmlrpc.py", line 444, in parseResponse [ 310.512909] beah-beaker-backend[4981]: response = xmlrpclib.loads(contents)[0][0] [ 310.513255] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/xmlrpclib.py", line 1138, in loads [ 310.513570] beah-beaker-backend[4981]: return u.close(), u.getmethodname() [ 310.513991] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close [ 310.514453] beah-beaker-backend[4981]: raise Fault(**self._stack[0]) [ 310.514828] beah-beaker-backend[4981]: Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>"> [ 310.515522] beah-beaker-backend[4981]: 2017-08-30 16:25:00,774 backend simple_recipe: ERROR Encountered problem while running task '60658775'. 
[ 310.516119] beah-beaker-backend[4981]: Traceback (most recent call last): [ 310.516514] beah-beaker-backend[4981]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 517, in simple_recipe [ 310.517067] beah-beaker-backend[4981]: result = task.getResult() [ 310.517444] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 310.517998] beah-beaker-backend[4981]: self.result.raiseException() [ 310.518262] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 891, in _deferGenerator [ 310.518617] beah-beaker-backend[4981]: result = g.next() [ 310.518872] beah-beaker-backend[4981]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 598, in _run_task [ 310.519227] beah-beaker-backend[4981]: thingy.getResult() [ 310.519491] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 310.519849] beah-beaker-backend[4981]: self.result.raiseException() [ 310.520145] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/site-packages/twisted/web/xmlrpc.py", line 444, in parseResponse [ 310.520501] beah-beaker-backend[4981]: response = xmlrpclib.loads(contents)[0][0] [ 310.520928] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/xmlrpclib.py", line 1138, in loads [ 310.521391] beah-beaker-backend[4981]: return u.close(), u.getmethodname() [ 310.521810] beah-beaker-backend[4981]: File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close [ 310.522273] beah-beaker-backend[4981]: raise Fault(**self._stack[0]) [ 310.522648] beah-beaker-backend[4981]: Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>"> ---<-snip->--- Best, -pbunyan
All,
Here is a reproducer:
Beaker 24.3
2017-09-28 13:31:13 -04:00
https://beaker.engineering.redhat.com/recipes/4310997#task61853560
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2017/09/20731/2073170/4310997/console.log

---<-snip->---
[ 1163.023553] beah-beaker-backend[10292]: Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>">
[ 1163.023668] beah-beaker
---<-snip->---

Best,
-pbunyan
All,
Here is a reproducer:
Beaker 24.3
2017-10-03 12:04:44 -04:00
https://beaker.engineering.redhat.com/recipes/4323278#task62016366
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2017/10/20785/2078527/4323278/console.log

---<-snip->---
[ 98.223997] beah-beaker-backend[5309]: Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>">
---<-snip->---

Best,
-pbunyan
All,
*** Just adding the references, as there seems to be an increase in frequency of this issue. ***
Here is a reproducer:
Beaker 24.3
2017-10-03 19:40:13 -04:00
https://beaker.engineering.redhat.com/recipes/4324227#task62032799
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2017/10/20788/2078820/4324227/console.log

---<-snip->---
[ 123.690943] beah-beaker-backend[4649]: Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>">
---<-snip->---

Best,
-pbunyan
All,
*** Just adding the references, as there seems to be an increase in frequency of this issue. ***
Here is a reproducer:
Beaker 24.3
2017-10-06 05:04:45 -04:00
https://beaker.engineering.redhat.com/recipes/4332140
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2017/10/20824/2082439/4332140/console.log

---<-snip->---
[ 365.365897] beah-beaker-backend[8222]: Fault: <Fault 1: "<class 'xmlrpclib.ProtocolError'>:<ProtocolError for beaker.engineering.redhat.com/client/: 500 INTERNAL SERVER ERROR>">
---<-snip->---

Best,
-pbunyan
All, Here is another instance: Beaker 24.5 2017-11-14 17:59:40 -05:00 https://beaker.engineering.redhat.com/recipes/4466286#task63953173 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2017/11/21452/2145258/4466286/console.log ---<-snip->--- [ 4839.835444] beah-beaker-backend[3578]: 2017-11-14 12:07:37,724 backend _run_task: ERROR Exception raised when running task. Setting to finished [ 4839.836416] beah-beaker-backend[3578]: Traceback (most recent call last): [ 4839.837274] beah-beaker-backend[3578]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 598, in _run_task [ 4839.838080] beah-beaker-backend[3578]: thingy.getResult() [ 4839.838853] beah-beaker-backend[3578]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 4839.839732] beah-beaker-backend[3578]: self.result.raiseException() [ 4839.840660] beah-beaker-backend[3578]: File "/usr/lib64/python2.7/site-packages/twisted/web/xmlrpc.py", line 444, in parseResponse [ 4839.841494] beah-beaker-backend[3578]: response = xmlrpclib.loads(contents)[0][0] [ 4839.842305] beah-beaker-backend[3578]: File "/usr/lib64/python2.7/xmlrpclib.py", line 1138, in loads [ 4839.843079] beah-beaker-backend[3578]: return u.close(), u.getmethodname() [ 4839.843841] beah-beaker-backend[3578]: File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close [ 4839.844600] beah-beaker-backend[3578]: raise Fault(**self._stack[0]) [ 4839.845346] beah-beaker-backend[3578]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('\\xfb\\xf2f\\x91\\xae[@d\\x8ct\\x13\\xe6\\x82\\xd4B\\xdb', 2145258L)"> [ 4839.846140] beah-beaker-backend[3578]: 2017-11-14 12:07:37,738 backend ---<-snip->--- Best, -pbunyan
All, Here is another instance: Beaker 24.5 2017-12-05 19:33:31 -05:00 https://beaker.engineering.redhat.com/recipes/4549351#task64960415 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2017/12/21838/2183894/4549351/console.log ---<-snip->--- [ 644.354987] beah-beaker-backend[3762]: 2017-12-05 19:34:47,915 backend _send_cmd: INFO Command ['Command', 'forward', '91ce5b86-8412-49bb-9770-439e35399b9b', {'event': ['Event', 'extend_watchdog', 'bb610ed8-288b-4cf8-a6f7-6420fdd2e8df', {'source': 'beah_beaker_backend', 'id': '3934e982-2b1f-480e-be94-fc851e1fdba1'}, None, {'timeout': 14394}]}] sent. [ 696.245062] beah-beaker-backend[3762]: 2017-12-05 19:35:39,763 backend _run_task: ERROR Exception raised when running task. Setting to finished [ 696.249092] beah-beaker-backend[3762]: Traceback (most recent call last): [ 696.251545] beah-beaker-backend[3762]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 598, in _run_task [ 696.254159] beah-beaker-backend[3762]: thingy.getResult() [ 696.255696] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 696.258227] beah-beaker-backend[3762]: self.result.raiseException() [ 696.260054] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/site-packages/twisted/web/xmlrpc.py", line 444, in parseResponse [ 696.262599] beah-beaker-backend[3762]: response = xmlrpclib.loads(contents)[0][0] [ 696.264555] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/xmlrpclib.py", line 1138, in loads [ 696.266687] beah-beaker-backend[3762]: return u.close(), u.getmethodname() [ 696.268437] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close [ 696.270582] beah-beaker-backend[3762]: raise Fault(**self._stack[0]) [ 696.272256] beah-beaker-backend[3762]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('J\\xa5\\nM\\xb8\\xc2B-\\x83\\x1e\\x8f\\xc6}A\\xbb\\xa2', 2183894L)"> [ 696.277328] beah-beaker-backend[3762]: 2017-12-05 19:35:39,806 backend simple_recipe: ERROR Encountered problem while running task '64960415'. 
[ 696.280056] beah-beaker-backend[3762]: Traceback (most recent call last): [ 696.281803] beah-beaker-backend[3762]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 517, in simple_recipe [ 696.284506] beah-beaker-backend[3762]: result = task.getResult() [ 696.286132] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 696.288652] beah-beaker-backend[3762]: self.result.raiseException() [ 696.290395] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 891, in _deferGenerator [ 696.293007] beah-beaker-backend[3762]: result = g.next() [ 696.294547] beah-beaker-backend[3762]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 598, in _run_task [ 696.297062] beah-beaker-backend[3762]: thingy.getResult() [ 696.298576] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 696.301246] beah-beaker-backend[3762]: self.result.raiseException() [ 696.302937] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/site-packages/twisted/web/xmlrpc.py", line 444, in parseResponse [ 696.305615] beah-beaker-backend[3762]: response = xmlrpclib.loads(contents)[0][0] [ 696.307466] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/xmlrpclib.py", line 1138, in loads [ 696.309581] beah-beaker-backend[3762]: return u.close(), u.getmethodname() [ 696.311400] beah-beaker-backend[3762]: File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close [ 696.313506] beah-beaker-backend[3762]: raise Fault(**self._stack[0]) [ 696.315218] beah-beaker-backend[3762]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('J\\xa5\\nM\\xb8\\xc2B-\\x83\\x1e\\x8f\\xc6}A\\xbb\\xa2', 2183894L)"> ---<-snip->--- Best, -pbunyan
All, Here is another instance: Beaker 24.5 https://beaker.engineering.redhat.com/recipes/4686990#task66543381 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2018/01/22512/2251225/4686990/console.log ---<-snip->--- [-- MARK -- Mon Jan 15 15:15:00 2018] [ 98.725597] beah-beaker-backend[2501]: 2018-01-15 10:15:08,205 backend _run_task: ERROR Exception raised when running task. Setting to finished [ 98.726099] beah-beaker-backend[2501]: Traceback (most recent call last): [ 98.726271] beah-beaker-backend[2501]: File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 598, in _run_task [ 98.726446] beah-beaker-backend[2501]: thingy.getResult() [ 98.726615] beah-beaker-backend[2501]: File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 870, in getResult [ 98.726781] beah-beaker-backend[2501]: self.result.raiseException() [ 98.726947] beah-beaker-backend[2501]: File "/usr/lib64/python2.7/site-packages/twisted/web/xmlrpc.py", line 444, in parseResponse [ 98.727146] beah-beaker-backend[2501]: response = xmlrpclib.loads(contents)[0][0] [ 98.727310] beah-beaker-backend[2501]: File "/usr/lib64/python2.7/xmlrpclib.py", line 1138, in loads [ 98.727485] beah-beaker-backend[2501]: return u.close(), u.getmethodname() [ 98.727693] beah-beaker-backend[2501]: File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close [ 98.727922] beah-beaker-backend[2501]: raise Fault(**self._stack[0]) [ 98.728111] beah-beaker-backend[2501]: Fault: <Fault 1: '<class \'sqlalchemy.exc.OperationalError\'>:(OperationalError) (1205, \'Lock wait timeout exceeded; try restarting transaction\') \'UPDATE job SET dirty_version=%s WHERE job.id = %s\' ("\\xa4\\xd1\\x8c0#\\x15G6\\xb9\\x05\\x1c3\\xd5\\xdc\'\\xf1", 2251225L)'> ---<-snip->--- Best, -pbunyan
All,
=============================
Noting the following comment:
=============================
https://bugzilla.redhat.com/show_bug.cgi?id=1097861#c18
=============================
referencing latest incidents:
=============================

--------------------------
2018-01-18 04:45:57 -05:00
--------------------------
https://beaker.engineering.redhat.com/recipes/4700295#task66712742
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2018/01/22575/2257559/4700295/console.log

---<-snip->---
exc.OperationalError\'>:(OperationalError) (1205, \'Lock wait timeout exceeded; try restarting transaction\') \'UPDATE job SET dirty_version=%s WHERE job.id = %s\' (\'.Hv\\xb5A\\x9fJ\\xa9\\xaaXx[\\xfb\\xd7\\xa0"\', 2257559L)'>
---<-snip->---

--------------------------
2018-01-18 04:46:39 -05:00
--------------------------
https://beaker.engineering.redhat.com/recipes/4700296#task66712763
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2018/01/22575/2257559/4700296/console.log

---<-snip->---
[ 2868.971433] beah-beaker-backend[3250]: raise Fault(**self._stack[0])
[ 2868.971853] beah-beaker-backend[3250]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('\\x9f\\x84\\xbd\\xaeO\\xc2O9\\xacC\\x1a?\\xe3\\x89\\xdc\\xbe', 2257559L)">
---<-snip->---

Best,
-pbunyan
All, Here is another instance: -------------------------- 2018-01-23 11:21:52 -05:00 -------------------------- https://beaker.engineering.redhat.com/recipes/4717564#task66907720 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2018/01/22656/2265610/4717564/console.log ---<-snip->--- [ 87.699955] beah-beaker-backend[1299]: raise Fault(**self._stack[0]) [ 87.701139] beah-beaker-backend[1299]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('\\xca$u\\xee\\x1f|IA\\x87\\xdc\\x03t\\x0ep\\xfd\\x11', 2265610L)"> [ 87.704228] beah-beaker-backend[1299]: 2018-01-23 11:24:51,348 backend simple_recipe: ERROR Encountered problem while running task '66907720'. ---<-snip->--- -------------------------- 2018-01-23 16:25:26 -05:00 -------------------------- https://beaker.engineering.redhat.com/recipes/4717567#task66907791 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2018/01/22656/2265610/4717567/console.log ---<-snip->--- [ 8303.014322] beah-beaker-backend[1244]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('\\x01\\x0c\\xfd>\\xee\\xecF\\xab\\xb8\\xa8\\xef-&\\xb5\\xf7\\x1f', 2265610L)"> [ 8303.018301] beah-beaker-backend[1244]: 2018-01-23 16:28:05,006 backend simple_recipe: ERROR Encountered problem while running task '66907791'. ---<-snip->--- Best, -pbunyan
Dan,
So I have updated this BZ numerous times. We are seeing that this is a recurring, persistent issue :/
Is there something that can be done? Pretty please :)
Best,
-pbunyan
The most recent spate of these errors is related to some as-yet-unidentified database issues on beaker.engineering.redhat.com:
https://redhat.service-now.com/surl.do?n=PNT0145431

Our next best hope for improving the database I/O performance is moving it onto a separate VM and running MariaDB 10 on RHEL7 (rather than the current setup, which has MySQL 5.1 on RHEL6 running on the same VM as the Beaker server application itself). That one is:
https://redhat.service-now.com/surl.do?n=PNT0119348
Dan/Roman,
There were numerous instances of this issue during this job:
https://beaker.engineering.redhat.com/jobs/2371281
There were more than six instances of this issue during this job :(

=================
Here is a sample:
=================
---------------------------
2018-03-13 23:04:12 -04:00
---------------------------
https://beaker.engineering.redhat.com/recipes/4927054#task69294089
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2018/03/23712/2371281/4927054/console.log

---<-snip->---
[ 3044.370807] beah-beaker-backend[2500]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('I6\\xae\\xe6\\xc0\\xf0Hd\\xba\\x96\\x03`\\xcb\\x06\\x12\\xb0', 2371281L)">
---<-snip->---

Best,
-pbunyan
(In reply to PaulB from comment #31)
> Dan/Roman,
> There were numerous instances of this issue during this job:
> https://beaker.engineering.redhat.com/jobs/2371281
>
We hear you. I followed up on the ticket today. It's going slowly, but hopefully steadily. I wish I had something better to offer at this point ...
*** Bug 1568737 has been marked as a duplicate of this bug. ***
The current set of logs we have in production go back to Apr 25 03:32:08, which is slightly more than 5 days' worth of data (although it covers a weekend, which is generally quiet for Beaker, so it may be worth repeating this analysis on a Friday for better data).

During that time, the application hit these "Lock wait timeout" errors 149 times, for the following statements:

  143  UPDATE job SET dirty_version=%s WHERE job.id = %s
    3  UPDATE job SET deleted=%s WHERE job.id = %s
    3  INSERT INTO log_recipe (...

I didn't examine all 143 occurrences of the "UPDATE job SET dirty_version" exception, but I looked at several and they were all in the recipe.tasks.start or recipe.tasks.stop XMLRPC calls. The 3 occurrences of the "UPDATE job SET deleted" exception were for POST /jobs/delete_job_row, which is the Delete button on the jobs grid of the web UI. The 3 occurrences of "INSERT INTO log_recipe" were in recipes.register_file XMLRPC calls.

Beakerd only hit the error once in the same period, for this statement:

  UPDATE recipe SET status=%s WHERE recipe.status = %s AND recipe.id = %s

which was in update_dirty_job().

We also observe from the beakerd logs that, although it may not be reaching the lock timeout (120 seconds), in many cases it seems to be waiting a long time for MySQL to honour INSERT and UPDATE queries. For example on 27 April, each iteration of process_new_recipe() was taking nearly 120 seconds:

Apr 27 02:35:15 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Counting candidate systems for recipe 5085673
Apr 27 02:35:43 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 32 seconds
Apr 27 02:36:09 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Computing all candidate systems for recipe 5085673
Apr 27 02:36:20 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 32 seconds
Apr 27 02:36:58 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 33 seconds
Apr 27 02:37:04 beaker-server beakerd[4063]: bkr.server.tools.beakerd INFO recipe ID 5085673 moved from New to Processed

Here on 26 April, each iteration of provision_scheduled_recipeset() was taking almost 10 seconds for each machine recipe (although not for guest recipes):

Apr 26 03:46:07 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5081867 in RS:4086142
Apr 26 03:46:18 beaker-server beakerd[4063]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=926L, service=u'Scheduler', field=u'Distro Tree', action=u'Provision', old=u'', new=u'RHEL-6.9 Server x86_64', user=redacted
Apr 26 03:46:18 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5081868 in RS:4086142
Apr 26 03:46:19 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Provisioning recipe 5081871 in RS:4086145
Apr 26 03:46:27 beaker-server beakerd[4063]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=5098L, service=u'Scheduler', field=u'Distro Tree', action=u'Provision', old=u'', new=u'RHEL-6.10-20180426.n.0 Workstation i386', user=redacted

The update_dirty_job() function also appears to take a long time in some cases, specifically when it is releasing a system for a finished recipe.
For example here it took 15 seconds to do so:

Apr 27 03:31:53 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Updating dirty job 2449211
Apr 27 03:31:54 beaker-server beakerd[4063]: bkr.server.model.scheduler DEBUG Releasing system hp-dl580g5-01.rhts.eng.bos.redhat.com for recipe 5085597
Apr 27 03:32:09 beaker-server beakerd[4063]: bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=1279L, service=u'Scheduler', field=u'User', action=u'Returned', old=u'yumliu', new=u'', user=yumliu
Apr 27 03:32:09 beaker-server beakerd[4063]: turbomail.manager INFO Attempting delivery of message <20180427033209.4063.12736.prod.eng.bos.redhat.com>.
Apr 27 03:32:09 beaker-server beakerd[4063]: turbomail.transport INFO Attempting delivery of message <20180427033209.4063.12736.prod.eng.bos.redhat.com>.
Apr 27 03:32:09 beaker-server beakerd[4063]: turbomail.delivery INFO <20180427033209.4063.12736.prod.eng.bos.redhat.com> DELIVER
Apr 27 03:32:09 beaker-server beakerd[4063]: turbomail.transport INFO Connecting to SMTP server 127.0.0.1.
Apr 27 03:32:09 beaker-server beakerd[4063]: turbomail.delivery INFO <20180427033209.4063.12736.prod.eng.bos.redhat.com> SENT
Apr 27 03:32:09 beaker-server beakerd[4063]: turbomail.manager INFO Delivery of message <20180427033209.4063.12736.prod.eng.bos.redhat.com> successful.
Apr 27 03:32:09 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Updating dirty job 2449189

I'm also about 75% certain that in the past we have seen beakerd hit lock wait timeouts specifically on inserting into system_recipe_map and command_queue (although I can't find any written record of those exceptions right now). Inserting into system_recipe_map is done by process_new_recipe(), and both provisioning and releasing systems will insert into command_queue. All of the above will also interact with the job table in the same transaction. Moreover, those are some of the largest tables we have. The command_queue table includes all historical command records and is currently around 30 million rows. The system_recipe_map table does not hold historical rows, but we tend to insert and delete several thousand rows at a time.

The log_recipe table is not involved in any of the above code paths; its issues likely stem from the expensive hackery done by lazy_create() to avoid race conditions on insert. Or else it is just falling victim to the same issues with the job table, which is ultimately updated in the same database transaction.
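For reference, the tallies quoted at the top of the previous comment can be reproduced with a small script along these lines (a rough sketch only, assuming syslog-style server log lines like the OperationalError examples shown earlier in this bug):

import re
import sys
from collections import Counter

# Matches the SQL statement inside messages like:
#   OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try
#   restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' (...)
STATEMENT = re.compile(r"Lock wait timeout exceeded.*?'((?:UPDATE|INSERT|DELETE)[^']*)'")

counts = Counter()
with open(sys.argv[1]) as logfile:
    for line in logfile:
        match = STATEMENT.search(line)
        if match:
            counts[match.group(1)] += 1

for statement, count in counts.most_common():
    print('%6d  %s' % (count, statement))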
I set up beakerd to run in my local dev environment against a somewhat recent production snapshot. For the most part, all the queries beakerd sends to MySQL return straight away. The only observably slow queries are:

* the two I just identified in bug 1573081, which take ~30 seconds each; and
* the query with dirty_version != clean_version to find all dirty jobs in update_dirty_jobs(), which takes ~5 seconds.

We have also seen the dirty_version != clean_version query show up in the MySQL slow query logs, so we know it is sometimes a problem. But those are not involved at all in the code paths mentioned in the previous comment, and beakerd runs them serially. So while those problematic queries are certainly not *helping*, they are not directly related to this problem.

The fact that all of the UPDATE and INSERT statements complete instantly in isolation means there is nothing inherently wrong with those operations themselves (and we can see that already, they are all straightforward single-row insertions). It means the problem really is only in the interaction between all the different transactions happening at once in our heavily-loaded production environment.

Since MySQL does not give us any useful tools to narrow down the problem, the best I can do is to guess at ways of refactoring the db schema to reduce the amount of work MySQL has to do in the above situations, and then we try them and see what helps. Based on the above, I think the best approach is:

* Move finished power commands into a separate table, leaving command_queue for ones that are still in flight. This will bring the size of the command_queue table down drastically. We will still have beakerd contending for INSERTs at the same time as the LC XMLRPC calls are UPDATE'ing the same table, but there are typically only ever 0-40 power commands in flight at any time. (A rough sketch of the table split follows this comment.)

* Find a cheaper alternative to dirty_version != clean_version. We arrived at that after many iterations as a way of avoiding races between transactions, but each row gets its own distinct uuid value which will be blowing out the index and eventually will become untenable.

* Fix whatever mess this is (should be a single WHERE condition by recipe_id):

2018-04-30 17:19:15,456 sqlalchemy.engine.base.Engine INFO DELETE FROM system_recipe_map WHERE system_recipe_map.system_id = %s AND system_recipe_map.recipe_id = %s
2018-04-30 17:19:15,456 sqlalchemy.engine.base.Engine INFO ((182L, 4896360L), (200L, 4896360L), (225L, 4896360L), (229L, 4896360L), (240L, 4896360L), (839L, 4896360L), (888L, 4896360L), (1208L, 4896360L) ... displaying 10 of 172 total bound parameter sets ... (10505L, 4896360L), (10540L, 4896360L))
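A very rough sketch of the first idea above, moving finished power commands out of the hot table. The command_archive table name and the status values are assumptions for illustration, not the actual schema change:

from sqlalchemy import create_engine, text

engine = create_engine('mysql://beaker:PASSWORD@localhost/beaker')  # placeholder credentials

def archive_finished_commands():
    # Copy finished rows into the (hypothetical) command_archive table, which is
    # assumed to have the same columns as command_queue, then remove them from
    # command_queue, all inside one transaction.
    with engine.begin() as conn:
        conn.execute(text(
            "INSERT INTO command_archive "
            "SELECT * FROM command_queue "
            "WHERE status IN ('Completed', 'Failed', 'Aborted')"))
        conn.execute(text(
            "DELETE FROM command_queue "
            "WHERE status IN ('Completed', 'Failed', 'Aborted')"))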
One other small thing I noticed in the system releasing code path, this SELECT FROM command_queue costs 4 seconds and is probably unnecessary (it would also be avoided if completed power commands were moved to a separate table): 2018-04-30 17:57:08,903 bkr.server.model.scheduler DEBUG Releasing system hp-sl390s-01.rhts.eng.bos.redhat.com for recipe 4896352 2018-04-30 17:57:08,906 sqlalchemy.engine.base.Engine INFO SELECT tg_user.password AS tg_user_password, tg_user.root_password AS tg_user_root_password, tg_user.user_id AS tg_user_user_id, tg_user.user_name AS tg_user_user_name, tg_user.email_address AS tg_user_email_address, tg_user.display_name AS tg_user_display_name, tg_user.rootpw_changed AS tg_user_rootpw_changed, tg_user.openstack_trust_id AS tg_user_openstack_trust_id, tg_user.use_old_job_page AS tg_user_use_old_job_page, tg_user.notify_job_completion AS tg_user_notify_job_completion, tg_user.notify_broken_system AS tg_user_notify_broken_system, tg_user.notify_system_loan AS tg_user_notify_system_loan, tg_user.notify_group_membership AS tg_user_notify_group_membership, tg_user.notify_reservesys AS tg_user_notify_reservesys, tg_user.created AS tg_user_created, tg_user.disabled AS tg_user_disabled, tg_user.removed AS tg_user_removed FROM tg_user WHERE tg_user.user_id = %s 2018-04-30 17:57:08,906 sqlalchemy.engine.base.Engine INFO (1893L,) 2018-04-30 17:57:08,915 sqlalchemy.engine.base.Engine INFO UPDATE job SET wtasks=%s WHERE job.id = %s 2018-04-30 17:57:08,915 sqlalchemy.engine.base.Engine INFO (0, 2355200L) 2018-04-30 17:57:08,916 sqlalchemy.engine.base.Engine INFO UPDATE recipe_set SET result=%s, status=%s, wtasks=%s WHERE recipe_set.id = %s 2018-04-30 17:57:08,916 sqlalchemy.engine.base.Engine INFO ('Warn', 'Aborted', 5, 3942419L) 2018-04-30 17:57:08,920 sqlalchemy.engine.base.Engine INFO UPDATE recipe SET result=%s, status=%s, wtasks=%s WHERE recipe.id = %s 2018-04-30 17:57:08,920 sqlalchemy.engine.base.Engine INFO ('Warn', 'Aborted', 5, 4896352L) 2018-04-30 17:57:08,923 sqlalchemy.engine.base.Engine INFO UPDATE recipe_task SET result=%s WHERE recipe_task.id = %s 2018-04-30 17:57:08,923 sqlalchemy.engine.base.Engine INFO ('Warn', 68914259L) 2018-04-30 17:57:08,924 sqlalchemy.engine.base.Engine INFO UPDATE recipe_task SET result=%s WHERE recipe_task.id = %s 2018-04-30 17:57:08,924 sqlalchemy.engine.base.Engine INFO ('Warn', 68914260L) 2018-04-30 17:57:08,925 sqlalchemy.engine.base.Engine INFO UPDATE recipe_task SET result=%s WHERE recipe_task.id = %s 2018-04-30 17:57:08,925 sqlalchemy.engine.base.Engine INFO ('Warn', 68914261L) 2018-04-30 17:57:08,926 sqlalchemy.engine.base.Engine INFO UPDATE recipe_task SET result=%s WHERE recipe_task.id = %s 2018-04-30 17:57:08,926 sqlalchemy.engine.base.Engine INFO ('Warn', 68914262L) 2018-04-30 17:57:08,927 sqlalchemy.engine.base.Engine INFO UPDATE recipe_task SET result=%s WHERE recipe_task.id = %s 2018-04-30 17:57:08,927 sqlalchemy.engine.base.Engine INFO ('Warn', 68914263L) 2018-04-30 17:57:08,929 sqlalchemy.engine.base.Engine INFO UPDATE reservation SET finish_time=%s WHERE reservation.id = %s AND reservation.finish_time IS NULL 2018-04-30 17:57:08,929 sqlalchemy.engine.base.Engine INFO (datetime.datetime(2018, 4, 30, 7, 57, 8, 928758), 4537447L) 2018-04-30 17:57:08,932 sqlalchemy.engine.base.Engine INFO SELECT lab_controller.id AS lab_controller_id, lab_controller.fqdn AS lab_controller_fqdn, lab_controller.disabled AS lab_controller_disabled, lab_controller.removed AS lab_controller_removed, lab_controller.user_id AS 
lab_controller_user_id FROM lab_controller WHERE lab_controller.id = %s 2018-04-30 17:57:08,932 sqlalchemy.engine.base.Engine INFO (8L,) 2018-04-30 17:57:08,935 sqlalchemy.engine.base.Engine INFO SELECT command_queue.id AS command_queue_id, command_queue.user_id AS command_queue_user_id, command_queue.service AS command_queue_service, command_queue.queue_time AS command_queue_queue_time, command_queue.start_time AS command_queue_start_time, command_queue.finish_time AS command_queue_finish_time, command_queue.system_id AS command_queue_system_id, command_queue.action AS command_queue_action, command_queue.status AS command_queue_status, command_queue.delay_until AS command_queue_delay_until, command_queue.quiescent_period AS command_queue_quiescent_period, command_queue.error_message AS command_queue_error_message, command_queue.installation_id AS command_queue_installation_id FROM command_queue WHERE %s = command_queue.system_id ORDER BY command_queue.queue_time DESC, command_queue.id DESC 2018-04-30 17:57:08,936 sqlalchemy.engine.base.Engine INFO (3597L,) 2018-04-30 17:57:12,847 sqlalchemy.engine.base.Engine INFO SELECT power.id AS power_id_1, power.power_type_id AS power_power_type_id, power.system_id AS power_system_id, power.power_address AS power_power_address, power.power_user AS power_power_user, power.power_passwd AS power_power_passwd, power.power_id AS power_power_id, power.power_quiescent_period AS power_power_quiescent_period FROM power WHERE %s = power.system_id 2018-04-30 17:57:12,848 sqlalchemy.engine.base.Engine INFO (3597L,) 2018-04-30 17:57:12,852 bkr.server.model.activity DEBUG Tentative SystemActivity: object_id=3597L, service=u'Scheduler', field=u'User', action=u'Returned', old=u'bxue', new=u'', user=bxue
(In reply to Dan Callaghan from comment #37)
> One other small thing I noticed in the system releasing code path, this
> SELECT FROM command_queue costs 4 seconds and is probably unnecessary

Filed bug 1574311 for this, since we can fix it independently of any database schema changes. It's *probably* not contributing to the lock contention in this bug, but it's certainly making the provision_scheduled_recipes() step very slow.
(In reply to Dan Callaghan from comment #36)
> * Fix whatever mess this is (should be a single WHERE condition by
> recipe_id):
>
> 2018-04-30 17:19:15,456 sqlalchemy.engine.base.Engine INFO DELETE FROM
> system_recipe_map WHERE system_recipe_map.system_id = %s AND
> system_recipe_map.recipe_id = %s
> 2018-04-30 17:19:15,456 sqlalchemy.engine.base.Engine INFO ((182L,
> 4896360L), (200L, 4896360L), (225L, 4896360L), (229L, 4896360L), (240L,
> 4896360L), (839L, 4896360L), (888L, 4896360L), (1208L, 4896360L) ...
> displaying 10 of 172 total bound parameter sets ... (10505L, 4896360L),
> (10540L, 4896360L))

https://gerrit.beaker-project.org/6093
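For illustration, a single-condition delete in SQLAlchemy could look roughly like the sketch below. This is not taken from the Gerrit change above, so treat the table reflection and the helper name as placeholders:

from sqlalchemy import MetaData, Table, create_engine

engine = create_engine('mysql://beaker:PASSWORD@localhost/beaker')  # placeholder credentials
metadata = MetaData()
# Reflect the existing table; in the real code this would be the mapped Table object.
system_recipe_map = Table('system_recipe_map', metadata, autoload=True, autoload_with=engine)

def clear_candidate_systems(connection, recipe_id):
    # One DELETE keyed on recipe_id, instead of one (system_id, recipe_id)
    # bound parameter set per candidate system.
    connection.execute(
        system_recipe_map.delete().where(
            system_recipe_map.c.recipe_id == recipe_id))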
(In reply to Dan Callaghan from comment #36)
> * Move finished power commands into a separate table, leaving command_queue
> for ones that are still in flight.

I spent some time working on a patch for this last week. Splitting the tables themselves is fairly straightforward, and I wrote some code to manage moving the rows from one to the other when a power command finishes.

But it turns out the problematic part is actually the implementation of the GET /systems/<fqdn>/commands/ API. Right now it essentially just takes system.dyn_command_queue and passes that to json_collection() and the API is implemented "for free". But with this change, there would no longer be a single table to query but rather two. json_collection() does not handle that kind of situation and it is a lot of extra code to make it do that. Or else we could use some kind of hackery like a UNION query across both tables and pass that to json_collection() (a sketch of that follows below). Either way there is a substantial amount of work required to preserve the existing behaviour of that API if the commands are split across two tables.

But I am tentatively optimistic that with the fix for bug 1574311 we will not have any more lock contention on the command_queue table, so I am hoping we will not need to tackle this.
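A minimal sketch of the UNION hackery mentioned above, assuming a hypothetical command_archive table with the same columns as command_queue. Whether json_collection() could consume such a selectable is exactly the open question:

from sqlalchemy import MetaData, Table, create_engine, select, text, union_all

engine = create_engine('mysql://beaker:PASSWORD@localhost/beaker')  # placeholder credentials
metadata = MetaData()
command_queue = Table('command_queue', metadata, autoload=True, autoload_with=engine)
# Hypothetical table holding finished commands, assumed to mirror command_queue.
command_archive = Table('command_archive', metadata, autoload=True, autoload_with=engine)

def commands_for_system(system_id):
    active = select([command_queue]).where(command_queue.c.system_id == system_id)
    finished = select([command_archive]).where(command_archive.c.system_id == system_id)
    # A single selectable over both tables, newest first, which is what
    # json_collection() would have to be taught to paginate and sort.
    return union_all(active, finished).order_by(text('queue_time DESC'))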
(In reply to Dan Callaghan from comment #36)
> * Find a cheaper alternative to dirty_version != clean_version. We arrived
> at that after many iterations as a way of avoiding races between
> transactions, but each row gets its own distinct uuid value which will be
> blowing out the index and eventually will become untenable.

The indexes are actually completely unused for a condition like dirty_version != clean_version, so we should really just drop them; they are wasting CPU time and disk space.

As for improving the dirty_version != clean_version condition itself... I am starting to wonder whether that is more complicated than it needs to be. The naive approach to this problem would be just a single boolean flag column "is_dirty" at the job level, which is set to TRUE when a new result is added and then back to FALSE when the new overall job status is computed.

Reading back through my old comments on the original bug 807237, I talked about needing to SELECT...FOR UPDATE the job row *and* all the relevant recipe_set, recipe, and recipe_task rows in every transaction which touches a job... which was true when we had the old _bubble_up/_bubble_down structure. But now that we have the separate update_dirty_job() routine (which happens independently of the transaction where results are recorded), there is really only one race condition we need to avoid:

tx A: load dirty job
      update status
      set is_dirty = FALSE
tx B: insert new result for job
      set is_dirty = TRUE

In the above case, the is_dirty flag could end up (incorrectly) set to FALSE if tx A commits after tx B, even though tx A did not see the new result inserted by tx B.

The dirty_version/clean_version approach avoids that race today, because tx B in the above example is always inserting a new randomly generated UUID that is different to the one seen by tx A. But we *could* go with the simple boolean flag column, and instead avoid the race by having update_dirty_job() SELECT...FOR UPDATE the job row before it starts doing any work. In the above example tx B would be blocked on setting is_dirty = TRUE until tx A has committed, thus guaranteeing we don't lose the dirty state. (A rough sketch of that approach follows below.)

The downside of that approach is that the HTTP requests to insert new results could be blocked waiting for the update_dirty_job() routine, which can sometimes be quite slow on large jobs (or because of things like bug 1574311). So I think we would be trading less throughput and more lock contention on an *individual* job row, in exchange for better overall performance of update_dirty_jobs() due to not constantly doing a table scan for the dirty_version != clean_version condition. So I'm not sure whether it would be an overall win for this problem, but it might be worth trying.
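To make the alternative concrete, here is a hedged sketch of the boolean-flag approach with the SELECT...FOR UPDATE in update_dirty_job(). The Job model, the is_dirty column, and the helper names are simplified stand-ins for illustration, not the real Beaker model code:

from sqlalchemy import Boolean, Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()
engine = create_engine('mysql://beaker:PASSWORD@localhost/beaker')  # placeholder credentials
Session = sessionmaker(bind=engine)

class Job(Base):
    # Simplified stand-in for the real Job model, with the hypothetical flag column.
    __tablename__ = 'job'
    id = Column(Integer, primary_key=True)
    is_dirty = Column(Boolean, nullable=False, default=False)

def record_result(job_id):
    # "tx B": recording a result marks the job dirty. The UPDATE issued at commit
    # time blocks while update_dirty_job() holds the row lock, so the dirty state
    # can never be silently lost.
    session = Session()
    job = session.query(Job).filter_by(id=job_id).one()
    # ... insert the new recipe_task_result rows here ...
    job.is_dirty = True
    session.commit()

def update_dirty_job(job_id):
    # "tx A": take the row lock first, then recompute the overall status and
    # clear the flag.
    session = Session()
    job = session.query(Job).filter_by(id=job_id).with_for_update().one()
    # ... recompute job/recipe_set/recipe status from the child rows here ...
    job.is_dirty = False
    session.commit()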
(In reply to Dan Callaghan from comment #41) I just remembered we already have an open bug about (more or less) the same issues around dirty_version != clean_version. That's bug 1297603. So I will post the patch against that one instead.
(In reply to Dan Callaghan from comment #42)
> (In reply to Dan Callaghan from comment #41)
>
> I just remembered we already have an open bug about (more or less) the same
> issues around dirty_version != clean_version. That's bug 1297603. So I will
> post the patch against that one instead.

Just an fyi... this BZ was opened first, on 2014-05-14 13:34:16 EDT.
Best,
-pbunyan
With the jump to MariaDB now, this should be a thing of the past. Shall we close it as NOTABUG (since MySQL contention related) and perhaps reopen it if it really does still happen?
(In reply to Roman Joost from comment #44)
> With the jump to MariaDB now, this should be a thing of the past. Shall we
> close it as NOTABUG (since MySQL contention related) and perhaps reopen it
> if it really does still happen?

Roman,
I think this is a BZ that was (hopefully) fixed by the database update, so whatever version introduced MariaDB would be the fixed-in version:
Fixed in current release: MariaDB*
I will leave the decision to you. Me, I am grateful for Beaker's resurrection! Thank you, Beaker Gods!
Best,
-pbunyan
Yeah I guess I was a bit too quick with this one. Browsing through the server logs I found two entries already:

Jun 24 03:45:10 beaker-server beaker-server[31081]: OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('\xf1\xa8\x87\x82|\xedC\xb5\x94\x9d\x88\x99\x80\xe6gn', 2563049L)
Jun 21 12:45:34 beaker-server beaker-server[110965]: OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE job SET dirty_version=%s WHERE job.id = %s' ('\xdaP\xddm\x93QD\t\xac\x96\x8d\x99\x10L\xc9@', 2558589L)

Suppose we're not in the clear yet :(
If the only remaining lock wait timeout issues are around the "UPDATE job SET dirty_version.." queries, then we could consider taking my patch in bug 1297603 which is trying to address that. I am still just nervous about whether that change will actually make the lock contention issues better or worse though...
Would it help to ship the patch as a patch in our RPM spec file? We could build one RPM with it and one without, so that if it makes the lock contention issues worse we can downgrade or upgrade the RPM and move on.
(In reply to Roman Joost from comment #48)
No, because switching between one version and the other requires a large, expensive database schema migration.
I just saw this issue in one of my jobs (UPDATE job SET dirty_version) : https://beaker.engineering.redhat.com/recipes/5899108#task81593035 http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2018/10/29356/2935699/5899108/console.log
(In reply to Jan Tluka from comment #50) Thanks for the extra data point Jan. That particular query is the one which I am hoping will be fixed up by the changes in bug 1297603 in Beaker 26.0. That landed in our production environment on 2018-10-15. So I will be keen to see if it reproduces again since then.
Hit the "lock wait timeout exceeded" issue again. https://beaker.engineering.redhat.com/recipes/6716981#task90826347 http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/04/34617/3461775/6716981/console.log
This issue happened again. Link to job:
https://beaker.engineering.redhat.com/recipes/7281023#task98285181
I had to cancel the job: it had already gone through all the tasks in the recipe and was blocking our job queue.
We hit this issue again. As this adds additional work for my team, who have to investigate and rerun tests, I'd like to ask that the priority of this bug be raised.
https://beaker.engineering.redhat.com/recipes/7410285#task99995397,task99995398
https://beaker.engineering.redhat.com/recipes/7410281#task99995348
https://beaker.engineering.redhat.com/recipes/7410345#task99996034
Hit the "lock wait timeout exceeded" issue again. https://beaker.engineering.redhat.com/recipes/7463852#task100725953
I haven't seen this issue in the past few months, on RHEL7 + fixed DB + replica + Beaker 27.3(+). If someone has a different experience, please raise concerns now. Otherwise, I will close this ticket.