Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1727850

Summary: Tests killed because uploading intermediate results takes a lot of time
Product: [Retired] Beaker Reporter: Alicja Kario <hkario>
Component: lab controllerAssignee: Martin Styk <mastyk>
Status: CLOSED CURRENTRELEASE QA Contact: tools-bugs <tools-bugs>
Severity: urgent Docs Contact:
Priority: high    
Version: 26CC: amahdal, asosedki, azelinka, bpeck, cbouchar, coli, dapospis, fdvorak, fsumsal, jbastian, kzhang, mastyk, mdujava, mkyral, olichtne, pkis, rasibley, rebarbos, szidek, todoleza, vkadlcik, zhguo
Target Milestone: ---Keywords: Patch, TestBlocker, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1744536 (view as bug list) Environment:
Last Closed: 2020-03-09 12:31:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1744536    

Description Alicja Kario 2019-07-08 11:09:19 UTC
Description of problem:
The time between phases, when resultoutputfile.log is uploaded with restraint and beakerlib, sometimes takes a lot of time. Normally it's immediate (<1s), but in some cases it takes minutes, at one time I've seen it take an hour.

Version-Release number of selected component (if applicable):
26.5

How reproducible:
Random, around 1% of runs

Steps to Reproduce:
1. Schedule test with few phases that take few seconds few hundred times, look for executions that take more than 30s

Actual results:
some tests get killed by watchdog

Expected results:
no tests killed by watchdog

Additional info:
on beaker proxy server, there sometimes are traces, but I don't know how connected they are:

 proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.main DEBUG Time: 0:00:00.738917 recipe_upload_file ('7032702', '/', 'boot.log', 9636, '', 0, 'WxtbMzJ proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.main DEBUG Time: 0:00:00.787292 recipe_upload_file ('7032702', '/', 'messages', 74173, '', 0, 'SnVuID proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.proxy DEBUG recipe_upload_file recipe_id:7032702 name:dmesg offset:0 size:38000 proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.proxy DEBUG recipe_upload_file recipe_id:7032702 name:messages offset:0 size:74173 proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.proxy DEBUG recipe_upload_file recipe_id:7032702 name:messages offset:74173 size:77 proxy.log.4.gz:Jun 21 13:36:20 lab-02 beaker-proxy[28871]: bkr.labcontroller.main DEBUG Time: 0:00:01.177475 recipe_upload_file ('7032702', '/', 'dmesg', 38000, '', 0, 'WyAgICAwL watchdog.log.1.gz:Jun 21 13:07:12 lab-02 beaker-watchdog[28928]: bkr.labcontroller.proxy INFO Watching console log file /var/consoles/sweetpig-11.lab.eng.bos.redhat.com for recipe 7032702 Jun 21 16:29:16 lab-02 beaker-proxy[28871]: bkr.labcontroller.main ERROR Error handling request PUT //recipes/7032702/tasks/95147602/results/435474998/logs/resultoutputfile.log Jun 21 16:29:16 lab-02 beaker-proxy[28871]: Traceback (most recent call last): Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/main.py", line 147, in _log_failed_requests Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return func(environ, start_response) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/werkzeug/wrappers.py", line 250, in <lambda> Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return _patch_wrapper(f, lambda *a: f(*a[:-2]+(cls(a[-2]),))(*a[-2:])) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/main.py", line 132, in __call__ Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return getattr(obj, attr)(req, **args) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 910, in do_result_log Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return self._put_log(log_file, req) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 859, in _put_log Jun 21 16:29:16 lab-02 beaker-proxy[28871]: with log_file: Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/log_storage.py", line 52, in __enter__ Jun 21 16:29:16 lab-02 beaker-proxy[28871]: self.register_func() Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/log_storage.py", line 131, in <lambda> Jun 21 16:29:16 lab-02 beaker-proxy[28871]: result_base_dir), Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__ Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return self.__send(self.__name, args) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request Jun 21 16:29:16 lab-02 beaker-proxy[28871]: verbose=self.__verbose Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 571, in request Jun 21 16:29:16 lab-02 beaker-proxy[28871]: result = transport_class.request(self, *args, **kwargs) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 427, in _request Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return self.parse_response(h.getfile()) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1361, in parse_response Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return self._parse_response(file, None) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1392, in _parse_response Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return u.close() Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 838, in close Jun 21 16:29:16 lab-02 beaker-proxy[28871]: raise Fault(**self._stack[0]) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO log_recipe_task_result (path, filename, start_time, recipe_task_result_id)\\nSELECT %s, %s, %s, %s\\nFROM DUAL\\nWHERE NOT (EXISTS (SELECT 1 \\nFROM log_recipe_task_result \\nWHERE log_recipe_task_result.path = %s AND log_recipe_task_result.filename = %s AND log_recipe_task_result.recipe_task_result_id = %s FOR UPDATE))' ('/', 'resultoutputfile.log', datetime.datetime(2019, 6, 21, 16, 29, 16, 370798), 435474998L, '/', 'resultoutputfile.log', 435474998L)"> transfer.log.1.gz:Jun 21 16:52:33 lab-02 beaker-transfer[28956]: bkr.labcontroller.proxy DEBUG Fetching files list for recipe 7032702 transfer.log.1.gz:Jun 21 16:54:03 lab-02 beaker-transfer[28956]: bkr.labcontroller.proxy DEBUG Updating recipe 7032702 file locations on the server watchdog.log.1.gz:Jun 21 16:50:46 lab-02 beaker-watchdog[28928]: bkr.labcontroller.watchdog INFO Stopping monitor for recipe 7032702

Comment 2 Tomas Klohna 🔧 2019-07-08 12:04:32 UTC
Formatted traceback: http://pastebin.test.redhat.com/777923

Comment 3 Tomas Klohna 🔧 2019-07-08 12:06:00 UTC
Why not post it directly right 🤦‍♂️

Traceback (most recent call last):
	File "/usr/lib/python2.6/site-packages/bkr/labcontroller/main.py", line 147, in _log_failed_requests
	return func(environ, start_response)
	File "/usr/lib/python2.6/site-packages/werkzeug/wrappers.py", line 250, in <lambda>
	return _patch_wrapper(f, lambda *a: f(*a[:-2]+(cls(a[-2]),))(*a[-2:]))
	File "/usr/lib/python2.6/site-packages/bkr/labcontroller/main.py", line 132, in __call__
	return getattr(obj, attr)(req, **args)
	File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 910, in do_result_log
	return self._put_log(log_file, req)
	File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 859, in _put_log
	with log_file:
	File "/usr/lib/python2.6/site-packages/bkr/labcontroller/log_storage.py", line 52, in __enter__
	self.register_func()
	File "/usr/lib/python2.6/site-packages/bkr/labcontroller/log_storage.py", line 131, in <lambda>
	result_base_dir),
	File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
	return self.__send(self.__name, args)
	File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
	verbose=self.__verbose
	File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 571, in request
	result = transport_class.request(self, *args, **kwargs)
	File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 427, in _request
	return self.parse_response(h.getfile())
	File "/usr/lib64/python2.6/xmlrpclib.py", line 1361, in parse_response
	return self._parse_response(file, None)
	File "/usr/lib64/python2.6/xmlrpclib.py", line 1392, in _parse_response
	return u.close()
	File "/usr/lib64/python2.6/xmlrpclib.py", line 838, in close
	raise Fault(**self._stack[0])
	Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO log_recipe_task_result (path, filename, start_time, recipe_task_result_id)\\nSELECT %s, %s, %s, %s\\nFROM DUAL\\nWHERE NOT (EXISTS (SELECT 1 \\nFROM log_recipe_task_result \\nWHERE log_recipe_task_result.path = %s AND log_recipe_task_result.filename = %s AND log_recipe_task_result.recipe_task_result_id = %s FOR UPDATE))' ('/', 'resultoutputfile.log', datetime.datetime(2019, 6, 21, 16, 29, 16, 370798), 435474998L, '/', 'resultoutputfile.log', 435474998L)"> transfer.log.1.gz

Comment 4 Tomas Klohna 🔧 2019-07-08 12:26:06 UTC
Seems like there's an underlying issue because this was already "fixed" long time ago - https://gerrit.beaker-project.org/c/beaker/+/2325

Comment 5 Tomas Klohna 🔧 2019-07-08 12:31:41 UTC
Just for today, there is 26 OperationalErrors... 

Server-side traceback:

Jul  8 12:25:03 beaker-server beaker-server[125433]: bkr.server.xmlrpccontroller ERROR Error handling XML-RPC method
Jul  8 12:25:03 beaker-server beaker-server[125433]:  Traceback (most recent call last):
Jul  8 12:25:03 beaker-server beaker-server[125433]:    File "/usr/lib/python2.6/site-packages/bkr/server/xmlrpccontroller.py", line 59, in RPC2
Jul  8 12:25:03 beaker-server beaker-server[125433]:      response = self.process_rpc(method,params)
Jul  8 12:25:03 beaker-server beaker-server[125433]:    File "/usr/lib/python2.6/site-packages/bkr/server/xmlrpccontroller.py", line 44, in process_rpc
Jul  8 12:25:03 beaker-server beaker-server[125433]:      response = obj(*params)
Jul  8 12:25:03 beaker-server beaker-server[125433]:    File "<string>", line 2, in register_result_file
Jul  8 12:25:03 beaker-server beaker-server[125433]:    File "/usr/lib/python2.6/site-packages/bkr/server/identity.py", line 288, in require
Jul  8 12:25:03 beaker-server beaker-server[125433]:      return func(*args, **kwargs)
Jul  8 12:25:03 beaker-server beaker-server[125433]:    File "/usr/lib/python2.6/site-packages/bkr/server/recipetasks.py", line 121, in register_result_file
Jul  8 12:25:03 beaker-server beaker-server[125433]:      filename=filename,
Jul  8 12:25:03 beaker-server beaker-server[125433]:    File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 299, in lazy_create
Jul  8 12:25:03 beaker-server beaker-server[125433]:      return super(Log, cls).lazy_create(path=cls._normalized_path(path), **kwargs)
Jul  8 12:25:03 beaker-server beaker-server[125433]:    File "/usr/lib/python2.6/site-packages/bkr/server/model/base.py", line 78, in lazy_create
Jul  8 12:25:03 beaker-server beaker-server[125433]:      raise e
Jul  8 12:25:03 beaker-server beaker-server[125433]:  OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO log_recipe_task_result (path, filename, start_time, recipe_task_result_id)\nSELECT %s, %s, %s, %s\nFROM DUAL\nWHERE NOT (EXISTS (SELECT 1 \nFROM log_recipe_task_result \nWHERE log_recipe_task_result.path = %s AND log_recipe_task_result.filename = %s AND log_recipe_task_result.recipe_task_result_id = %s FOR UPDATE))' ('/', 'resultoutputfile.log', datetime.datetime(2019, 7, 8, 12, 25, 3, 696933), 439826120L, '/', 'resultoutputfile.log', 439826120L)

Comment 6 Martin Styk 2019-07-09 07:18:38 UTC
Tomas, fix you posted it is not even close to the proper fix. This is just a recommended pattern, however, we can see 7 retries are not enough for those transactions anymore. Due to massive concurrency
If we want to scale Beaker further this needs to be fixed properly.
This means queries for uploading results needs to be written in the way that we will take locks in the same order every time.

Comment 7 Tomas Klohna 🔧 2019-07-10 14:07:22 UTC
No, I understand that it's not what I'm saying.
My proposal is to raise the workaround limit, to maybe 10 retries and a little bit longer delay. We definitely need to address this (and in this ticket), but since we have more pressing issues and the number of Deadlocks went down after we killed all the reservesys abusers, we don't have to do it now.

Comment 8 Alicja Kario 2019-08-12 11:18:54 UTC
the timeouts are happening more often now, if I had to estimate, I'd say 2 to 5% of recipe sets are affected

Comment 28 Martin Styk 2020-01-15 11:10:45 UTC
One of the potential problems between Restraint/Beaker communication fixed in bug 1791205

Comment 29 Alois Mahdal 2020-01-15 16:22:40 UTC
4. Make it possible to disable intermediate log uploading
   and plumb that option all the way to the QE job scheduling tools and scripts.
   Please, we've had enough of this, just let us opt out already.

Comment 30 Alois Mahdal 2020-01-15 16:23:55 UTC
(sorry for the last comment; that was a c/p error).

> 4. Make it possible to disable intermediate log uploading
>   and plumb that option all the way to the QE job scheduling tools and scripts.
>   Please, we've had enough of this, just let us opt out already.

+1, I was just about to ask for something like this.

If we could have just the <result/> without the resultoutputfile.log, we would be happy.   Actually, taskout.log/TESTOUT.log are only files that people should read, resultoutputfile.log is actually a rather nuissance.  So if we could opt out of this we would be happy to.

Comment 31 Martin Styk 2020-01-15 16:26:50 UTC
(In reply to Alois Mahdal from comment #30)
> (sorry for the last comment; that was a c/p error).
> 
> > 4. Make it possible to disable intermediate log uploading
> >   and plumb that option all the way to the QE job scheduling tools and scripts.
> >   Please, we've had enough of this, just let us opt out already.
> 
> +1, I was just about to ask for something like this.
> 
> If we could have just the <result/> without the resultoutputfile.log, we
> would be happy.   Actually, taskout.log/TESTOUT.log are only files that
> people should read, resultoutputfile.log is actually a rather nuissance.  So
> if we could opt out of this we would be happy to.

What exactly this have to do with Beaker?
What is reported or is not reported is up to test harness - Most of the cases Restraint

Comment 32 Alois Mahdal 2020-01-15 17:48:42 UTC
(In reply to Martin Styk 🤦‍♂️ from comment #31)
> [...]
>
> > If we could have just the <result/> without the resultoutputfile.log, we
> > would be happy.   Actually, taskout.log/TESTOUT.log are only files that
> > people should read, resultoutputfile.log is actually a rather nuissance.  So
> > if we could opt out of this we would be happy to.
> 
> What exactly this have to do with Beaker?
> What is reported or is not reported is up to test harness - Most of the
> cases Restraint

You're right, sorry for spam, reported RFE: https://github.com/p3ck/restraint/issues/39

Comment 33 Alicja Kario 2020-01-15 18:13:00 UTC
(In reply to Alois Mahdal from comment #29)
> 4. Make it possible to disable intermediate log uploading
>    and plumb that option all the way to the QE job scheduling tools and
> scripts.
>    Please, we've had enough of this, just let us opt out already.

actually there is one log that needs to be uploaded once per phase, one that we need to inspect (when it fails): the AVC log

:/

Comment 34 Martin Styk 2020-01-15 18:20:12 UTC
(In reply to Hubert Kario from comment #33)
> (In reply to Alois Mahdal from comment #29)
> > 4. Make it possible to disable intermediate log uploading
> >    and plumb that option all the way to the QE job scheduling tools and
> > scripts.
> >    Please, we've had enough of this, just let us opt out already.
> 
> actually there is one log that needs to be uploaded once per phase, one that
> we need to inspect (when it fails): the AVC log
> 
> :/

Hi folks.

Can someone please create an issue against Restraint in Bugzilla? We will investigate it there.


We have to patch restraint a bit more to make sure we can mitigate this.
Restraint is sending really small chunks back to Beaker all the time. So in case something happening with network, this may take forever due to retries (and let's say you had to do retry for few bytes).

Comment 35 Rachel Sibley 2020-01-15 18:45:40 UTC
Hi Martin I created https://bugzilla.redhat.com/show_bug.cgi?id=1791407, feel free to reword the summary.

Comment 36 Carol Bouchard 2020-01-15 19:12:09 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1782422  may help resolve this issue since I see harness.log has a lot of unnecessary output.

Comment 37 Martin Styk 2020-01-16 08:39:08 UTC
(In reply to Carol Bouchard from comment #36)
> https://bugzilla.redhat.com/show_bug.cgi?id=1782422  may help resolve this
> issue since I see harness.log has a lot of unnecessary output.

It will not.
This is not tight only to <fetch>.

Chunking + response time is problem here. If we have to do <Restraint> <---> <LC> <---> <Master> round trip and it will start timeouting then we can kill the job.

Comment 44 Martin Styk 2020-02-14 09:54:09 UTC
Resolving bug 1792421 in restraint should resolve the situation

Comment 45 Dalibor Pospíšil 2020-03-02 14:37:26 UTC
Raising severity as it effectively blocks us in delivering compose qualification on time. Also setting TestBlocker.

Comment 48 Martin Styk 2020-03-09 12:31:40 UTC
I'm closing this one. The fix should be included as part of Beaker 27.2 + internal deployment parts.

I spoke with several teams and they confirmed that performance is good. We are still going to address bug 1792421 (Currently in-review).
To further improve the situation in Beaker and providing you stable results.

If you have any questions feel free to reach out to me.