Bug 1727850
| Summary: | Tests killed because uploading intermediate results takes a lot of time | |||
|---|---|---|---|---|
| Product: | [Retired] Beaker | Reporter: | Alicja Kario <hkario> | |
| Component: | lab controller | Assignee: | Martin Styk <mastyk> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | tools-bugs <tools-bugs> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | high | |||
| Version: | 26 | CC: | amahdal, asosedki, azelinka, bpeck, cbouchar, coli, dapospis, fdvorak, fsumsal, jbastian, kzhang, mastyk, mdujava, mkyral, olichtne, pkis, rasibley, rebarbos, szidek, todoleza, vkadlcik, zhguo | |
| Target Milestone: | --- | Keywords: | Patch, TestBlocker, Triaged | |
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1744536 (view as bug list) | Environment: | ||
| Last Closed: | 2020-03-09 12:31:40 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1744536 | |||
Formatted traceback: http://pastebin.test.redhat.com/777923 Why not post it directly right 🤦♂️
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/bkr/labcontroller/main.py", line 147, in _log_failed_requests
return func(environ, start_response)
File "/usr/lib/python2.6/site-packages/werkzeug/wrappers.py", line 250, in <lambda>
return _patch_wrapper(f, lambda *a: f(*a[:-2]+(cls(a[-2]),))(*a[-2:]))
File "/usr/lib/python2.6/site-packages/bkr/labcontroller/main.py", line 132, in __call__
return getattr(obj, attr)(req, **args)
File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 910, in do_result_log
return self._put_log(log_file, req)
File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 859, in _put_log
with log_file:
File "/usr/lib/python2.6/site-packages/bkr/labcontroller/log_storage.py", line 52, in __enter__
self.register_func()
File "/usr/lib/python2.6/site-packages/bkr/labcontroller/log_storage.py", line 131, in <lambda>
result_base_dir),
File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
return self.__send(self.__name, args)
File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
verbose=self.__verbose
File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 571, in request
result = transport_class.request(self, *args, **kwargs)
File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 427, in _request
return self.parse_response(h.getfile())
File "/usr/lib64/python2.6/xmlrpclib.py", line 1361, in parse_response
return self._parse_response(file, None)
File "/usr/lib64/python2.6/xmlrpclib.py", line 1392, in _parse_response
return u.close()
File "/usr/lib64/python2.6/xmlrpclib.py", line 838, in close
raise Fault(**self._stack[0])
Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO log_recipe_task_result (path, filename, start_time, recipe_task_result_id)\\nSELECT %s, %s, %s, %s\\nFROM DUAL\\nWHERE NOT (EXISTS (SELECT 1 \\nFROM log_recipe_task_result \\nWHERE log_recipe_task_result.path = %s AND log_recipe_task_result.filename = %s AND log_recipe_task_result.recipe_task_result_id = %s FOR UPDATE))' ('/', 'resultoutputfile.log', datetime.datetime(2019, 6, 21, 16, 29, 16, 370798), 435474998L, '/', 'resultoutputfile.log', 435474998L)"> transfer.log.1.gz
Seems like there's an underlying issue because this was already "fixed" long time ago - https://gerrit.beaker-project.org/c/beaker/+/2325 Just for today, there is 26 OperationalErrors...
Server-side traceback:
Jul 8 12:25:03 beaker-server beaker-server[125433]: bkr.server.xmlrpccontroller ERROR Error handling XML-RPC method
Jul 8 12:25:03 beaker-server beaker-server[125433]: Traceback (most recent call last):
Jul 8 12:25:03 beaker-server beaker-server[125433]: File "/usr/lib/python2.6/site-packages/bkr/server/xmlrpccontroller.py", line 59, in RPC2
Jul 8 12:25:03 beaker-server beaker-server[125433]: response = self.process_rpc(method,params)
Jul 8 12:25:03 beaker-server beaker-server[125433]: File "/usr/lib/python2.6/site-packages/bkr/server/xmlrpccontroller.py", line 44, in process_rpc
Jul 8 12:25:03 beaker-server beaker-server[125433]: response = obj(*params)
Jul 8 12:25:03 beaker-server beaker-server[125433]: File "<string>", line 2, in register_result_file
Jul 8 12:25:03 beaker-server beaker-server[125433]: File "/usr/lib/python2.6/site-packages/bkr/server/identity.py", line 288, in require
Jul 8 12:25:03 beaker-server beaker-server[125433]: return func(*args, **kwargs)
Jul 8 12:25:03 beaker-server beaker-server[125433]: File "/usr/lib/python2.6/site-packages/bkr/server/recipetasks.py", line 121, in register_result_file
Jul 8 12:25:03 beaker-server beaker-server[125433]: filename=filename,
Jul 8 12:25:03 beaker-server beaker-server[125433]: File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 299, in lazy_create
Jul 8 12:25:03 beaker-server beaker-server[125433]: return super(Log, cls).lazy_create(path=cls._normalized_path(path), **kwargs)
Jul 8 12:25:03 beaker-server beaker-server[125433]: File "/usr/lib/python2.6/site-packages/bkr/server/model/base.py", line 78, in lazy_create
Jul 8 12:25:03 beaker-server beaker-server[125433]: raise e
Jul 8 12:25:03 beaker-server beaker-server[125433]: OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO log_recipe_task_result (path, filename, start_time, recipe_task_result_id)\nSELECT %s, %s, %s, %s\nFROM DUAL\nWHERE NOT (EXISTS (SELECT 1 \nFROM log_recipe_task_result \nWHERE log_recipe_task_result.path = %s AND log_recipe_task_result.filename = %s AND log_recipe_task_result.recipe_task_result_id = %s FOR UPDATE))' ('/', 'resultoutputfile.log', datetime.datetime(2019, 7, 8, 12, 25, 3, 696933), 439826120L, '/', 'resultoutputfile.log', 439826120L)
Tomas, fix you posted it is not even close to the proper fix. This is just a recommended pattern, however, we can see 7 retries are not enough for those transactions anymore. Due to massive concurrency If we want to scale Beaker further this needs to be fixed properly. This means queries for uploading results needs to be written in the way that we will take locks in the same order every time. No, I understand that it's not what I'm saying. My proposal is to raise the workaround limit, to maybe 10 retries and a little bit longer delay. We definitely need to address this (and in this ticket), but since we have more pressing issues and the number of Deadlocks went down after we killed all the reservesys abusers, we don't have to do it now. the timeouts are happening more often now, if I had to estimate, I'd say 2 to 5% of recipe sets are affected One of the potential problems between Restraint/Beaker communication fixed in bug 1791205 4. Make it possible to disable intermediate log uploading and plumb that option all the way to the QE job scheduling tools and scripts. Please, we've had enough of this, just let us opt out already. (sorry for the last comment; that was a c/p error).
> 4. Make it possible to disable intermediate log uploading
> and plumb that option all the way to the QE job scheduling tools and scripts.
> Please, we've had enough of this, just let us opt out already.
+1, I was just about to ask for something like this.
If we could have just the <result/> without the resultoutputfile.log, we would be happy. Actually, taskout.log/TESTOUT.log are only files that people should read, resultoutputfile.log is actually a rather nuissance. So if we could opt out of this we would be happy to.
(In reply to Alois Mahdal from comment #30) > (sorry for the last comment; that was a c/p error). > > > 4. Make it possible to disable intermediate log uploading > > and plumb that option all the way to the QE job scheduling tools and scripts. > > Please, we've had enough of this, just let us opt out already. > > +1, I was just about to ask for something like this. > > If we could have just the <result/> without the resultoutputfile.log, we > would be happy. Actually, taskout.log/TESTOUT.log are only files that > people should read, resultoutputfile.log is actually a rather nuissance. So > if we could opt out of this we would be happy to. What exactly this have to do with Beaker? What is reported or is not reported is up to test harness - Most of the cases Restraint (In reply to Martin Styk 🤦♂️ from comment #31) > [...] > > > If we could have just the <result/> without the resultoutputfile.log, we > > would be happy. Actually, taskout.log/TESTOUT.log are only files that > > people should read, resultoutputfile.log is actually a rather nuissance. So > > if we could opt out of this we would be happy to. > > What exactly this have to do with Beaker? > What is reported or is not reported is up to test harness - Most of the > cases Restraint You're right, sorry for spam, reported RFE: https://github.com/p3ck/restraint/issues/39 (In reply to Alois Mahdal from comment #29) > 4. Make it possible to disable intermediate log uploading > and plumb that option all the way to the QE job scheduling tools and > scripts. > Please, we've had enough of this, just let us opt out already. actually there is one log that needs to be uploaded once per phase, one that we need to inspect (when it fails): the AVC log :/ (In reply to Hubert Kario from comment #33) > (In reply to Alois Mahdal from comment #29) > > 4. Make it possible to disable intermediate log uploading > > and plumb that option all the way to the QE job scheduling tools and > > scripts. > > Please, we've had enough of this, just let us opt out already. > > actually there is one log that needs to be uploaded once per phase, one that > we need to inspect (when it fails): the AVC log > > :/ Hi folks. Can someone please create an issue against Restraint in Bugzilla? We will investigate it there. We have to patch restraint a bit more to make sure we can mitigate this. Restraint is sending really small chunks back to Beaker all the time. So in case something happening with network, this may take forever due to retries (and let's say you had to do retry for few bytes). Hi Martin I created https://bugzilla.redhat.com/show_bug.cgi?id=1791407, feel free to reword the summary. https://bugzilla.redhat.com/show_bug.cgi?id=1782422 may help resolve this issue since I see harness.log has a lot of unnecessary output. (In reply to Carol Bouchard from comment #36) > https://bugzilla.redhat.com/show_bug.cgi?id=1782422 may help resolve this > issue since I see harness.log has a lot of unnecessary output. It will not. This is not tight only to <fetch>. Chunking + response time is problem here. If we have to do <Restraint> <---> <LC> <---> <Master> round trip and it will start timeouting then we can kill the job. Resolving bug 1792421 in restraint should resolve the situation Raising severity as it effectively blocks us in delivering compose qualification on time. Also setting TestBlocker. I'm closing this one. The fix should be included as part of Beaker 27.2 + internal deployment parts. I spoke with several teams and they confirmed that performance is good. We are still going to address bug 1792421 (Currently in-review). To further improve the situation in Beaker and providing you stable results. If you have any questions feel free to reach out to me. |
Description of problem: The time between phases, when resultoutputfile.log is uploaded with restraint and beakerlib, sometimes takes a lot of time. Normally it's immediate (<1s), but in some cases it takes minutes, at one time I've seen it take an hour. Version-Release number of selected component (if applicable): 26.5 How reproducible: Random, around 1% of runs Steps to Reproduce: 1. Schedule test with few phases that take few seconds few hundred times, look for executions that take more than 30s Actual results: some tests get killed by watchdog Expected results: no tests killed by watchdog Additional info: on beaker proxy server, there sometimes are traces, but I don't know how connected they are: proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.main DEBUG Time: 0:00:00.738917 recipe_upload_file ('7032702', '/', 'boot.log', 9636, '', 0, 'WxtbMzJ proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.main DEBUG Time: 0:00:00.787292 recipe_upload_file ('7032702', '/', 'messages', 74173, '', 0, 'SnVuID proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.proxy DEBUG recipe_upload_file recipe_id:7032702 name:dmesg offset:0 size:38000 proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.proxy DEBUG recipe_upload_file recipe_id:7032702 name:messages offset:0 size:74173 proxy.log.4.gz:Jun 21 13:36:18 lab-02 beaker-proxy[28871]: bkr.labcontroller.proxy DEBUG recipe_upload_file recipe_id:7032702 name:messages offset:74173 size:77 proxy.log.4.gz:Jun 21 13:36:20 lab-02 beaker-proxy[28871]: bkr.labcontroller.main DEBUG Time: 0:00:01.177475 recipe_upload_file ('7032702', '/', 'dmesg', 38000, '', 0, 'WyAgICAwL watchdog.log.1.gz:Jun 21 13:07:12 lab-02 beaker-watchdog[28928]: bkr.labcontroller.proxy INFO Watching console log file /var/consoles/sweetpig-11.lab.eng.bos.redhat.com for recipe 7032702 Jun 21 16:29:16 lab-02 beaker-proxy[28871]: bkr.labcontroller.main ERROR Error handling request PUT //recipes/7032702/tasks/95147602/results/435474998/logs/resultoutputfile.log Jun 21 16:29:16 lab-02 beaker-proxy[28871]: Traceback (most recent call last): Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/main.py", line 147, in _log_failed_requests Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return func(environ, start_response) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/werkzeug/wrappers.py", line 250, in <lambda> Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return _patch_wrapper(f, lambda *a: f(*a[:-2]+(cls(a[-2]),))(*a[-2:])) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/main.py", line 132, in __call__ Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return getattr(obj, attr)(req, **args) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 910, in do_result_log Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return self._put_log(log_file, req) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/proxy.py", line 859, in _put_log Jun 21 16:29:16 lab-02 beaker-proxy[28871]: with log_file: Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/log_storage.py", line 52, in __enter__ Jun 21 16:29:16 lab-02 beaker-proxy[28871]: self.register_func() Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/labcontroller/log_storage.py", line 131, in <lambda> Jun 21 16:29:16 lab-02 beaker-proxy[28871]: result_base_dir), Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__ Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return self.__send(self.__name, args) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request Jun 21 16:29:16 lab-02 beaker-proxy[28871]: verbose=self.__verbose Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 571, in request Jun 21 16:29:16 lab-02 beaker-proxy[28871]: result = transport_class.request(self, *args, **kwargs) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib/python2.6/site-packages/bkr/common/xmlrpc.py", line 427, in _request Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return self.parse_response(h.getfile()) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1361, in parse_response Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return self._parse_response(file, None) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 1392, in _parse_response Jun 21 16:29:16 lab-02 beaker-proxy[28871]: return u.close() Jun 21 16:29:16 lab-02 beaker-proxy[28871]: File "/usr/lib64/python2.6/xmlrpclib.py", line 838, in close Jun 21 16:29:16 lab-02 beaker-proxy[28871]: raise Fault(**self._stack[0]) Jun 21 16:29:16 lab-02 beaker-proxy[28871]: Fault: <Fault 1: "<class 'sqlalchemy.exc.OperationalError'>:(OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO log_recipe_task_result (path, filename, start_time, recipe_task_result_id)\\nSELECT %s, %s, %s, %s\\nFROM DUAL\\nWHERE NOT (EXISTS (SELECT 1 \\nFROM log_recipe_task_result \\nWHERE log_recipe_task_result.path = %s AND log_recipe_task_result.filename = %s AND log_recipe_task_result.recipe_task_result_id = %s FOR UPDATE))' ('/', 'resultoutputfile.log', datetime.datetime(2019, 6, 21, 16, 29, 16, 370798), 435474998L, '/', 'resultoutputfile.log', 435474998L)"> transfer.log.1.gz:Jun 21 16:52:33 lab-02 beaker-transfer[28956]: bkr.labcontroller.proxy DEBUG Fetching files list for recipe 7032702 transfer.log.1.gz:Jun 21 16:54:03 lab-02 beaker-transfer[28956]: bkr.labcontroller.proxy DEBUG Updating recipe 7032702 file locations on the server watchdog.log.1.gz:Jun 21 16:50:46 lab-02 beaker-watchdog[28928]: bkr.labcontroller.watchdog INFO Stopping monitor for recipe 7032702